
Add instructlab-knowledge pipeline #7

Open · wants to merge 11 commits into main from level1

Conversation

anastasds

No description provided.

Signed-off-by: Anastas Stoyanovsky <[email protected]>
alinaryan and others added 7 commits May 7, 2025 11:29
create_seed_dataset.py used in this notebook is heavily inspired by docprocessor.py from the sdg-hub repo.

Co-authored-by: Abhishek B <[email protected]>
Co-authored-by: shiv <[email protected]>
Signed-off-by: Ali Maredia <[email protected]>
Signed-off-by: Anastas Stoyanovsky <[email protected]>
@anastasds anastasds requested review from alimaredia and alinaryan May 7, 2025 19:28
Signed-off-by: Anastas Stoyanovsky <[email protected]>
@anastasds anastasds force-pushed the level1 branch 2 times, most recently from 0a68bf8 to db981ac Compare May 7, 2025 20:29
Signed-off-by: Anastas Stoyanovsky <[email protected]>
Comment on lines +3 to +16
{
"cell_type": "code",
"execution_count": 1,
"id": "5da70a08-1895-4d1f-8f50-93e2134b2e23",
"metadata": {},
"outputs": [],
"source": [
"# design goals:\n",
"#\n",
"# - understandability\n",
"# - modularity\n",
"# - configurability"
]
},

This cell feels like it should be a util function we import

@alimaredia
Contributor

@anastasds Great work! There were just a couple of things I'd like to see fixed, and several small work items I want to make sure are documented so we can turn them into issues once this PR is merged.

Changes for the PR:

  1. Could we rename the notebook from level1.ipynb to instructlab_knowledge.ipynb, or something else more reflective of the workflow the notebook runs?
  2. Could this PR remove the entire doc-preprocessing-to-sdg directory, since it's redundant with the instructlab-knowledge one?
  3. Could you add a description at the very top of the notebook stating its input and output? Right now there's just a list of design goals, which should still be mentioned.
  4. The only issue I hit was in the authoring section, in the Path.unlink() call:
# Path.unlink(qna_output_path) # shouldn't be necessary but was. jupyter caching thing?

Since I didn't already have a qna.yaml file there, I got an error that the file didn't exist to unlink, and the cell couldn't execute. I wonder if @fabianofranz @iamemilio @alinaryan hit the same issue.
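For reference, one minimal fix (a sketch; qna_output_path here is an illustrative path, and this assumes it is a pathlib.Path) is to call unlink as an instance method with missing_ok=True, which makes it a no-op when the file doesn't exist instead of raising FileNotFoundError:

```python
from pathlib import Path

qna_output_path = Path("output/qna.yaml")  # hypothetical path for illustration

# unlink() is an instance method; missing_ok=True (Python 3.8+) suppresses
# FileNotFoundError when the file has not been created yet.
qna_output_path.unlink(missing_ok=True)
```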

Follow up work once this PR is merged:

  • Make the notebook process more than one document, and then, as a follow-up to that, create more than one qna.yaml.
  • The list called docs is defined in the chunking section but isn't used until the authoring section. This seems to violate the design goal of modularity. I understand the need to avoid re-running conversions, but I wonder how time- and resource-intensive conversion from docling JSON is; if it's fast enough, maybe this concern isn't warranted.
  • We might want to consider defaulting to reading keys from environment variables, just to protect users from putting keys in a notebook, forgetting about them, and then sharing or committing them. You didn't do anything wrong here; I just wonder what the best practice is.
  • This must have gotten lost in the combining work, but in my level1 branch I had changed the output of the chunking section to be a single chunks.jsonl file instead of a collection of .txt files, and refactored the seed dataset creation code accordingly.
  • In the spirit of having artifacts be JSON/JSONL, I wonder if we can evolve qna.yaml into a .json file instead. I know Laura had this idea.
  • A UX review from @JustinXHale.
  • Minor documentation improvements I'll submit in a follow-up PR.
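On the environment-variable point, a common pattern looks like the following (a sketch only; OPENAI_API_KEY is an illustrative variable name, not necessarily the one the notebook uses):

```python
import os

# Read the key from the environment so it never lives in the notebook source,
# and warn early instead of failing deep inside an API call.
api_key = os.environ.get("OPENAI_API_KEY", "")
if not api_key:
    print("Warning: OPENAI_API_KEY is not set; API calls below will fail.")
```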

@fabianofranz

I have a PR with minor adjustments to the formatting and usability, in case you'd like to make it part of this: #10.

@alinaryan (Member) left a comment

@anastasds Thank you for putting the steps together in this PR. I got up to the chunking step and then hit this error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[9], line 9
      7 chunk_iter = chunker.chunk(dl_doc=doc.document)
      8 chunk_objs = list(chunk_iter)
----> 9 chunks = [chunker.contextualize(chunk=chunk) for chunk in chunk_objs]
     11 print(f"Extracted {len(chunks)} chunks from {doc.document.name}")
     13 for chunk in chunks:

File ~/.local/lib/python3.13/site-packages/pydantic/main.py:891, in BaseModel.__getattr__(self, item)
    888     return super().__getattribute__(item)  # Raises AttributeError if appropriate
    889 else:
    890     # this is the current error
--> 891     raise AttributeError(f'{type(self).__name__!r} object has no attribute {item!r}')

AttributeError: 'HybridChunker' object has no attribute 'contextualize'
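This looks like it could be a docling-core version mismatch: contextualize() is the newer name for what some earlier releases exposed as serialize(). If that is the cause (an assumption, not confirmed; pinning docling-core may be the simpler fix), a version-tolerant sketch could be:

```python
def contextualize_chunk(chunker, chunk):
    """Return a chunk's contextualized text across docling-core versions.

    Newer docling-core exposes HybridChunker.contextualize(); some earlier
    releases used serialize() for the same purpose (assumption based on the
    AttributeError above).
    """
    if hasattr(chunker, "contextualize"):
        return chunker.contextualize(chunk=chunk)
    return chunker.serialize(chunk=chunk)
```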

WORKSPACE_DIR = WORKSPACE_ROOT / WORKSPACE_NAME
WORKSPACE_DIR.mkdir(exist_ok=True)

SOURCE_DOCUMENT = None  # to process a specific document, set its path here; otherwise, the entire source documents repository will be used

Is there a "call to action" we can provide, or a way to make it extremely obvious to the user that they need to set their source doc here?
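One option (a sketch that keeps the existing None-means-process-everything behavior) is to validate the setting up front and print an explicit notice, so the user sees immediately what will be processed:

```python
from pathlib import Path

SOURCE_DOCUMENT = None  # set to a document path to process just that file

if SOURCE_DOCUMENT is None:
    print("SOURCE_DOCUMENT is not set; processing the entire source documents repository.")
elif not Path(SOURCE_DOCUMENT).exists():
    raise FileNotFoundError(f"SOURCE_DOCUMENT does not exist: {SOURCE_DOCUMENT}")
else:
    print(f"Processing single document: {SOURCE_DOCUMENT}")
```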
