Add instructlab-knowledge pipeline #7
base: main
Signed-off-by: Anastas Stoyanovsky <[email protected]>
Signed-off-by: Alina Ryan <[email protected]>
Signed-off-by: Ali Maredia <[email protected]>
create_seed_dataset.py used in this notebook is heavily inspired by docprocessor.py from the sdg-hub repo. Co-authored-by: Abhishek B <[email protected]> Co-authored-by: shiv <[email protected]> Signed-off-by: Ali Maredia <[email protected]>
Signed-off-by: Ali Maredia <[email protected]>
Signed-off-by: Anastas Stoyanovsky <[email protected]>
Signed-off-by: Anastas Stoyanovsky <[email protected]>
Signed-off-by: Anastas Stoyanovsky <[email protected]>
Signed-off-by: Anastas Stoyanovsky <[email protected]>
Signed-off-by: Anastas Stoyanovsky <[email protected]>
0a68bf8 to db981ac
Signed-off-by: Anastas Stoyanovsky <[email protected]>
{
"cell_type": "code",
"execution_count": 1,
"id": "5da70a08-1895-4d1f-8f50-93e2134b2e23",
"metadata": {},
"outputs": [],
"source": [
"# design goals:\n",
"#\n",
"# - understandability\n",
"# - modularity\n",
"# - configurability"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "5da70a08-1895-4d1f-8f50-93e2134b2e23",
"metadata": {},
"outputs": [],
"source": [
"# design goals:\n",
"#\n",
"# - understandability\n",
"# - modularity\n",
"# - configurability"
]
},
"# - configurability"
]
},
{
This cell feels like it should be a util function we import.
@anastasds Great work. There are just a couple of things I'd like to see fixed, plus several small work items I want documented so we can open issues for them once this PR is merged.
Changes for the PR:
Since I didn't already have a qna.yaml file here, the cell tried to unlink a file that didn't exist, raised an error, and couldn't execute. I wonder if @fabianofranz @iamemilio @alinaryan hit the same issue.
Follow up work once this PR is merged:
I have a PR with minor adjustments to the formatting and usability, in case you'd like to make it part of this: #10.
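On the missing qna.yaml error reported above: one possible fix, sketched here under assumptions, is to make the cleanup step tolerate an absent file. `Path.unlink(missing_ok=True)` is standard library (Python 3.8+); the helper name `remove_if_present` and the temporary directory are hypothetical and only stand in for wherever the notebook keeps its qna.yaml.

```python
from pathlib import Path
import tempfile


def remove_if_present(path: Path) -> bool:
    """Delete `path` if it exists; return True when a file was actually removed.

    Hypothetical helper, not taken from the notebook in this PR.
    """
    existed = path.exists()
    # missing_ok=True (Python 3.8+) suppresses FileNotFoundError for absent files.
    path.unlink(missing_ok=True)
    return existed


# Demonstration in a throwaway directory so no real workspace is touched.
with tempfile.TemporaryDirectory() as tmp:
    qna_path = Path(tmp) / "qna.yaml"

    # First run: the file was never created, and the call must not raise.
    removed_on_first_run = remove_if_present(qna_path)

    # Later run: a stale file exists and is removed.
    qna_path.write_text("placeholder\n")
    removed_on_later_run = remove_if_present(qna_path)
    still_exists = qna_path.exists()
```

With this shape the cell behaves the same on a fresh workspace as on a rerun, which is the case the reviewer hit.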
@anastasds Thank you for putting the steps together in this PR. I got up to the chunking step and then hit this error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[9], line 9
7 chunk_iter = chunker.chunk(dl_doc=doc.document)
8 chunk_objs = list(chunk_iter)
----> 9 chunks = [chunker.contextualize(chunk=chunk) for chunk in chunk_objs]
11 print(f"Extracted {len(chunks)} chunks from {doc.document.name}")
13 for chunk in chunks:
File ~/.local/lib/python3.13/site-packages/pydantic/main.py:891, in BaseModel.__getattr__(self, item)
888 return super().__getattribute__(item) # Raises AttributeError if appropriate
889 else:
890 # this is the current error
--> 891 raise AttributeError(f'{type(self).__name__!r} object has no attribute {item!r}')
AttributeError: 'HybridChunker' object has no attribute 'contextualize'
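A traceback like the one above usually points to a mismatch between the notebook and the installed docling version. The sketch below is a defensive workaround, not a confirmed fix: it assumes the installed release predates `contextualize` and exposes the same behavior under an older method name, `serialize`. That assumption should be checked against the installed package; pinning a docling version in the notebook's requirements may be the cleaner solution. The helper `chunk_text` and the two stub classes are hypothetical and exist only to make the example runnable without docling.

```python
from typing import Any


def chunk_text(chunker: Any, chunk: Any) -> str:
    """Return the text for `chunk` using whichever method the chunker offers.

    Hypothetical helper. It tries `contextualize` first and falls back to
    `serialize`, which is ASSUMED to be the older name for the same operation.
    """
    for method_name in ("contextualize", "serialize"):
        # getattr with a default swallows the AttributeError seen in the traceback.
        method = getattr(chunker, method_name, None)
        if callable(method):
            return method(chunk=chunk)
    raise AttributeError(
        f"{type(chunker).__name__!r} has neither 'contextualize' nor 'serialize'; "
        "check the installed docling version"
    )


# Stand-ins for an older and a newer chunker, used only for this demonstration.
class _OlderChunker:
    def serialize(self, chunk: Any) -> str:
        return f"serialized:{chunk}"


class _NewerChunker:
    def contextualize(self, chunk: Any) -> str:
        return f"contextualized:{chunk}"


older_result = chunk_text(_OlderChunker(), "a")
newer_result = chunk_text(_NewerChunker(), "a")
```

In the notebook cell, the list comprehension on line 9 of the traceback would then call `chunk_text(chunker, chunk)` instead of `chunker.contextualize(chunk=chunk)`.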
"WORKSPACE_DIR = WORKSPACE_ROOT / WORKSPACE_NAME\n",
"WORKSPACE_DIR.mkdir(exist_ok=True)\n",
"\n",
"SOURCE_DOCUMENT = None # to process a specific document, set its path here; otherwise, the entire source documents repository will be used\n",
Is there a "call to action" we can provide, or a way to make it extremely obvious to the user that they need to set their source doc here?
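One way to act on this question, offered only as a sketch: have the notebook announce which mode it is in and fail early with a clear message when the configured path is wrong. The function `select_source_documents` and its `source_dir` parameter are hypothetical; only `SOURCE_DOCUMENT` and its "None means use the whole repository" behavior come from the cell under review.

```python
from pathlib import Path
from typing import List, Optional


def select_source_documents(
    source_document: Optional[Path], source_dir: Path
) -> List[Path]:
    """Return the documents to process and print an explicit notice about the choice.

    Hypothetical helper: `source_dir` stands in for the source documents repository.
    """
    if source_document is None:
        # Fallback mode: every file under the source directory is used.
        docs = sorted(p for p in source_dir.rglob("*") if p.is_file())
        print(
            "ACTION NEEDED? SOURCE_DOCUMENT is not set, so all "
            f"{len(docs)} file(s) under {source_dir} will be processed. "
            "Set SOURCE_DOCUMENT to a file path to process a single document."
        )
        return docs

    path = Path(source_document)
    if not path.is_file():
        # Fail before any expensive conversion step runs.
        raise FileNotFoundError(
            f"SOURCE_DOCUMENT points to {path}, which is not a file. "
            "Update SOURCE_DOCUMENT in the configuration cell."
        )
    print(f"Processing a single document: {path}")
    return [path]
```

Printing in both branches keeps the chosen mode visible in the cell output, so a user who skipped the configuration cell still sees what the pipeline is about to do.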
No description provided.