Minor adjustments for better usability #10

Merged
merged 1 commit into from
May 12, 2025
45 changes: 31 additions & 14 deletions notebooks/instructlab-knowledge/level1.ipynb
@@ -1,5 +1,22 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "af99f876-0ffd-4079-aeb7-4cead05daaf4",
"metadata": {},
"source": [
"# 🐶 Data Pre-Processing\n",
"\n",
"This notebook goes through each of the stages of data pre-processing. Once a SDG seed dataset is created, a user can run through an SDG notebook and generate samples.\n",
"\n",
"1. [Document Conversion](#Document-Conversion)\n",
"1. [Chunking](#Chunking)\n",
"1. [Authoring](#Authoring)\n",
"1. [Create Seed Dataset](#Create-Seed-Dataset-for-SDG)\n",
"\n",
"***"
]
},
{
"cell_type": "code",
"execution_count": 1,
@@ -58,7 +75,7 @@
"id": "344b7ac5-fc2a-40a8-8e1f-e8dd8b1153e7",
"metadata": {},
"source": [
"# Document Conversion\n",
"## Document Conversion\n",
"\n",
"This notebook uses [Docling](https://github.com/docling-project/docling) to convert any type of document into a Docling Document. A Docling Document is the representation of the document after conversion that can be exported as JSON. The JSON output of this notebook can then be used in others such as one that uses Docling's chunking methods."
]
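The conversion step described in this cell can be sketched roughly as follows. This is not the notebook's actual code: the function names `json_output_path` and `convert_to_json` and the output layout (one `<stem>.json` file per input) are assumptions made for illustration. The Docling import is done lazily so that the path helper can be used without Docling installed.

```python
import json
from pathlib import Path


def json_output_path(source: Path, output_dir: Path) -> Path:
    """Map an input document to the JSON file that will hold its conversion."""
    # Assumption: one JSON file per input, named after the input's stem.
    return output_dir / f"{source.stem}.json"


def convert_to_json(source: Path, output_dir: Path) -> Path:
    """Convert one document with Docling and write the result as JSON."""
    # Imported here so the helper above works without Docling installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(source)
    output_dir.mkdir(parents=True, exist_ok=True)
    target = json_output_path(source, output_dir)
    # export_to_dict() yields a JSON-serializable view of the Docling Document.
    target.write_text(
        json.dumps(result.document.export_to_dict()), encoding="utf-8"
    )
    return target
```

A call such as `convert_to_json(Path("docs/report.pdf"), Path("out"))` would then write `out/report.json`, which a later chunking step could load.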
@@ -186,7 +203,7 @@
"id": "2482060c-a49f-4345-aa47-d54301939387",
"metadata": {},
"source": [
"## Initialize the Chunker\n",
"### Initialize the Chunker\n",
"\n",
"Docling provides two chunkers, the `HierarchicalChunker` and the `HybridChunker`.\n",
"The `HierarchicalChunker` creates chunks based on the hierarchy in the Docling document\n",
@@ -213,7 +230,7 @@
"id": "54ce1d6f-b8d3-470c-b3c9-675911f0ee92",
"metadata": {},
"source": [
"## Load and chunk the converted docling document\n",
"### Load and chunk the converted docling document\n",
"\n",
"Next lets convert the document we want to chunk up into a Docling Document."
]
@@ -264,7 +281,7 @@
"id": "0fb38545-eb84-4923-8fc4-d10ed08eab26",
"metadata": {},
"source": [
"## View the Chunks\n",
"### View the Chunks\n",
"\n",
"To view the chunks, run through the following cell. As you can see the document is broken into small pieces with metadata about the chunk based on the document's format"
]
@@ -295,7 +312,7 @@
"id": "42c4160f-7508-4c72-b28d-b56aa4975b26",
"metadata": {},
"source": [
"## Save the chunks to a text file for each chunk\n",
"### Save the chunks to a text file for each chunk\n",
"\n",
"Each chunk is saved to an individual text file in the format: `{docling-json-file-name}-{chunk #}.txt`. Having chunking in this format is important as an input to create-sdg-seed-data notebook."
]
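The naming scheme `{docling-json-file-name}-{chunk #}.txt` could be implemented along these lines. This is a sketch under stated assumptions: chunks are plain strings, numbering starts at 0, and the file name is taken from the JSON file's stem. The function name `save_chunks` is hypothetical; the notebook's own code may differ.

```python
import tempfile
from pathlib import Path


def save_chunks(chunks: list[str], json_file: Path, output_dir: Path) -> list[Path]:
    """Write each chunk to '<json stem>-<index>.txt' and return the paths."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Assumption: chunk numbering starts at 0.
    for index, text in enumerate(chunks):
        path = output_dir / f"{json_file.stem}-{index}.txt"
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    paths = save_chunks(
        ["first chunk", "second chunk"], Path("report.json"), Path(tmp) / "chunks"
    )
    names = [p.name for p in paths]
    contents = [p.read_text(encoding="utf-8") for p in paths]
print(names)
```

Encoding the source file name and chunk index in each file name lets a downstream step group chunks by document without a separate manifest.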
@@ -318,7 +335,7 @@
"id": "a510f8c7-8cd3-4867-8742-9f4f9cda9e9f",
"metadata": {},
"source": [
"# Authoring"
"## Authoring"
]
},
{
@@ -383,7 +400,7 @@
"id": "d65ec755-e3de-40ab-bf3a-23ebb29a705d",
"metadata": {},
"source": [
"## Initialize QA generator, supplying details for which model to use\n",
"### Initialize QA generator, supplying details for which model to use\n",
"\n",
"GenerateOptions controls which model is used for QA generation by setting generate_options.provider below. Three options are available:\n",
"\n",
@@ -418,7 +435,7 @@
"id": "919199c0-3747-409a-85ab-0155ef3ebe9d",
"metadata": {},
"source": [
"## Configure subset selection"
"### Configure subset selection"
]
},
{
@@ -436,7 +453,7 @@
"id": "d2421d07-3e6c-4355-95f4-da8e157557c7",
"metadata": {},
"source": [
"## Run QA generation on selected chunks"
"### Run QA generation on selected chunks"
]
},
{
@@ -499,7 +516,7 @@
"id": "ea64b8f0-dd6c-4776-8646-9731433f909b",
"metadata": {},
"source": [
"## Read generated QAs and restructure"
"### Read generated QAs and restructure"
]
},
{
@@ -543,7 +560,7 @@
"id": "9b6d6c26-f4d5-420d-ae78-ac28cf39efd3",
"metadata": {},
"source": [
"## Define metadata for qna.yaml"
"### Define metadata for qna.yaml"
]
},
{
@@ -562,7 +579,7 @@
"id": "dafa8927-e56c-448b-b88b-f8d854c25d4d",
"metadata": {},
"source": [
"## Output qna.yaml"
"### Output qna.yaml"
]
},
{
@@ -626,7 +643,7 @@
"id": "ed9ea149-844b-4330-90ec-d0ca7ab12b90",
"metadata": {},
"source": [
"## View generated qna.yaml"
"### View generated qna.yaml"
]
},
{
@@ -813,7 +830,7 @@
"id": "1f101076-a50f-49ea-a83b-46eaa8b39cc4",
"metadata": {},
"source": [
"# Create Seed Dataset for SDG\n",
"## Create Seed Dataset for SDG\n",
"\n",
"This notebook combines the contents from the qna.yaml and the chunks from the source document to create a seed dataset for the synthetic data generation process.\n",
"\n",