
Add instructlab-knowledge pipeline #7

Open · wants to merge 11 commits into main from level1

Conversation

anastasds

No description provided.

Signed-off-by: Anastas Stoyanovsky <[email protected]>
alinaryan and others added 7 commits May 7, 2025 11:29
create_seed_dataset.py used in this notebook is heavily inspired by docprocessor.py from the sdg-hub repo.

Co-authored-by: Abhishek B <[email protected]>
Co-authored-by: shiv <[email protected]>
Signed-off-by: Ali Maredia <[email protected]>
Signed-off-by: Anastas Stoyanovsky <[email protected]>
@anastasds anastasds requested review from alimaredia and alinaryan May 7, 2025 19:28
Signed-off-by: Anastas Stoyanovsky <[email protected]>
@anastasds anastasds force-pushed the level1 branch 2 times, most recently from 0a68bf8 to db981ac Compare May 7, 2025 20:29
Signed-off-by: Anastas Stoyanovsky <[email protected]>
Comment on lines +3 to +16
{
"cell_type": "code",
"execution_count": 1,
"id": "5da70a08-1895-4d1f-8f50-93e2134b2e23",
"metadata": {},
"outputs": [],
"source": [
"# design goals:\n",
"#\n",
"# - understandability\n",
"# - modularity\n",
"# - configurability"
]
},

This cell feels like it should be a util function we import

@alimaredia
Contributor

@anastasds Great work! There were just a couple of things I'd like to see fixed, and several small work items I want to make sure are documented so we can turn them into issues once this PR is merged.

Changes for the PR:

  1. Could we rename the notebook from level1.ipynb to instructlab_knowledge.ipynb, or something else more reflective of the workflow the notebook runs?
  2. Could this PR remove the entire doc-preprocessing-to-sdg directory, since it's redundant with the instructlab-knowledge one?
  3. Could you add a description at the very top of the notebook stating its input and output? Right now there's just a list of design goals, which should still be mentioned.
  4. The only issue I hit was in the authoring section, in the Path.unlink() call:
# Path.unlink(qna_output_path) # shouldn't be necessary but was. jupyter caching thing?

Since I didn't already have a qna.yaml file there, I got an error that the file didn't exist to unlink, and the cell couldn't execute. I wonder if @fabianofranz @iamemilio @alinaryan hit the same issue.
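For reference, one minimal fix (a sketch; qna_output_path here is an illustrative path, and this assumes it is a pathlib.Path) is to call unlink as an instance method with missing_ok=True, which makes it a no-op when the file doesn't exist instead of raising FileNotFoundError:

```python
from pathlib import Path

qna_output_path = Path("output/qna.yaml")  # hypothetical path for illustration

# unlink() is an instance method; missing_ok=True (Python 3.8+) suppresses
# FileNotFoundError when the file has not been created yet.
qna_output_path.unlink(missing_ok=True)
```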

Follow up work once this PR is merged:

  • Make the notebook process more than one document, and then, as a follow-up to that, create more than one qna.yaml.
  • The list called docs is defined in the chunking section but isn't used until the authoring section. This seems to violate the design goal of modularity. I understand the need to avoid re-running conversions, but I wonder how time- and resource-intensive conversion from docling JSON is; if it's fast enough, maybe this concern isn't warranted.
  • We might want to consider defaulting to reading keys from environment variables, just to protect users from putting keys in a notebook, forgetting about them, and then sharing or committing them. You didn't do anything wrong here; I just wonder what the best practice is.
  • This must have gotten lost in the combining work, but in my level1 branch I had changed the output of the chunking section to be a single chunks.jsonl file instead of a collection of .txt files, and refactored the seed dataset creation code accordingly.
  • In the spirit of having artifacts be JSON/JSONL, I wonder if we can evolve qna.yaml into a .json file instead. I know Laura had this idea.
  • A UX review from @JustinXHale.
  • Minor documentation improvements I'll submit in a follow-up PR.
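On the environment-variable point, a common pattern looks like the following (a sketch only; OPENAI_API_KEY is an illustrative variable name, not necessarily the one the notebook uses):

```python
import os

# Read the key from the environment so it never lives in the notebook source,
# and warn early instead of failing deep inside an API call.
api_key = os.environ.get("OPENAI_API_KEY", "")
if not api_key:
    print("Warning: OPENAI_API_KEY is not set; API calls below will fail.")
```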

@fabianofranz

I have a PR with minor adjustments to the formatting and usability, in case you'd like to make it part of this: #10.

@alinaryan (Member) left a comment

@anastasds Thank you for putting the steps together in this PR. I got up to the chunking step and then hit this error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[9], line 9
      7 chunk_iter = chunker.chunk(dl_doc=doc.document)
      8 chunk_objs = list(chunk_iter)
----> 9 chunks = [chunker.contextualize(chunk=chunk) for chunk in chunk_objs]
     11 print(f"Extracted {len(chunks)} chunks from {doc.document.name}")
     13 for chunk in chunks:

File ~/.local/lib/python3.13/site-packages/pydantic/main.py:891, in BaseModel.__getattr__(self, item)
    888     return super().__getattribute__(item)  # Raises AttributeError if appropriate
    889 else:
    890     # this is the current error
--> 891     raise AttributeError(f'{type(self).__name__!r} object has no attribute {item!r}')

AttributeError: 'HybridChunker' object has no attribute 'contextualize'
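This looks like it could be a docling-core version mismatch: contextualize() is the newer name for what some earlier releases exposed as serialize(). If that is the cause (an assumption, not confirmed; pinning docling-core may be the simpler fix), a version-tolerant sketch could be:

```python
def contextualize_chunk(chunker, chunk):
    """Return a chunk's contextualized text across docling-core versions.

    Newer docling-core exposes HybridChunker.contextualize(); some earlier
    releases used serialize() for the same purpose (assumption based on the
    AttributeError above).
    """
    if hasattr(chunker, "contextualize"):
        return chunker.contextualize(chunk=chunk)
    return chunker.serialize(chunk=chunk)
```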

WORKSPACE_DIR = WORKSPACE_ROOT / WORKSPACE_NAME
WORKSPACE_DIR.mkdir(exist_ok=True)

SOURCE_DOCUMENT = None  # to process a specific document, set its path here; otherwise, the entire source documents repository will be used

Is there a "call to action" we can provide, or a way to make it extremely obvious to the user that they need to set their source doc here?
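One option (a sketch that keeps the existing None-means-process-everything behavior) is to validate the setting up front and print an explicit notice, so the user sees immediately what will be processed:

```python
from pathlib import Path

SOURCE_DOCUMENT = None  # set to a document path to process just that file

if SOURCE_DOCUMENT is None:
    print("SOURCE_DOCUMENT is not set; processing the entire source documents repository.")
elif not Path(SOURCE_DOCUMENT).exists():
    raise FileNotFoundError(f"SOURCE_DOCUMENT does not exist: {SOURCE_DOCUMENT}")
else:
    print(f"Processing single document: {SOURCE_DOCUMENT}")
```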
