instructlab · anastasds · May 12, 2025 · May 9, 2025
diff --git a/notebooks/instructlab-knowledge/level1.ipynb b/notebooks/instructlab-knowledge/level1.ipynb
@@ -1,5 +1,22 @@
 {
  "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "af99f876-0ffd-4079-aeb7-4cead05daaf4",
+   "metadata": {},
+   "source": [
+    "# 🐶 Data Pre-Processing\n",
+    "\n",
+    "This notebook walks through each stage of data pre-processing. Once an SDG seed dataset is created, a user can run through an SDG notebook and generate samples.\n",
+    "\n",
+    "1. [Document Conversion](#Document-Conversion)\n",
+    "1. [Chunking](#Chunking)\n",
+    "1. [Authoring](#Authoring)\n",
+    "1. [Create Seed Dataset](#Create-Seed-Dataset-for-SDG)\n",
+    "\n",
+    "***"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 1,
@@ -58,7 +75,7 @@
    "id": "344b7ac5-fc2a-40a8-8e1f-e8dd8b1153e7",
    "metadata": {},
    "source": [
-    "# Document Conversion\n",
+    "## Document Conversion\n",
     "\n",
     "This notebook uses [Docling](https://github.com/docling-project/docling) to convert any type of document into a Docling Document. A Docling Document is the representation of the document after conversion that can be exported as JSON. The JSON output of this notebook can then be used in others such as one that uses Docling's chunking methods."
    ]
@@ -186,7 +203,7 @@
    "id": "2482060c-a49f-4345-aa47-d54301939387",
    "metadata": {},
    "source": [
-    "## Initialize the Chunker\n",
+    "### Initialize the Chunker\n",
     "\n",
     "Docling provides two chunkers, the `HierarchicalChunker` and the `HybridChunker`.\n",
     "The `HierarchicalChunker` creates chunks based on the hierarchy in the Docling document\n",
@@ -213,7 +230,7 @@
    "id": "54ce1d6f-b8d3-470c-b3c9-675911f0ee92",
    "metadata": {},
    "source": [
-    "## Load and chunk the converted docling document\n",
+    "### Load and chunk the converted docling document\n",
     "\n",
     "Next lets convert the document we want to chunk up into a Docling Document."
    ]
@@ -264,7 +281,7 @@
    "id": "0fb38545-eb84-4923-8fc4-d10ed08eab26",
    "metadata": {},
    "source": [
-    "## View the Chunks\n",
+    "### View the Chunks\n",
     "\n",
     "To view the chunks, run through the following cell. As you can see the document is broken into small pieces with metadata about the chunk based on the document's format"
    ]
@@ -295,7 +312,7 @@
    "id": "42c4160f-7508-4c72-b28d-b56aa4975b26",
    "metadata": {},
    "source": [
-    "## Save the chunks to a text file for each chunk\n",
+    "### Save the chunks to a text file for each chunk\n",
     "\n",
     "Each chunk is saved to an individual text file in the format: `{docling-json-file-name}-{chunk #}.txt`. Having chunking in this format is important as an input to create-sdg-seed-data notebook."
    ]
@@ -318,7 +335,7 @@
    "id": "a510f8c7-8cd3-4867-8742-9f4f9cda9e9f",
    "metadata": {},
    "source": [
-    "# Authoring"
+    "## Authoring"
    ]
   },
   {
@@ -383,7 +400,7 @@
    "id": "d65ec755-e3de-40ab-bf3a-23ebb29a705d",
    "metadata": {},
    "source": [
-    "## Initialize QA generator, supplying details for which model to use\n",
+    "### Initialize QA generator, supplying details for which model to use\n",
     "\n",
     "GenerateOptions controls which model is used for QA generation by setting generate_options.provider below. Three options are available:\n",
     "\n",
@@ -418,7 +435,7 @@
    "id": "919199c0-3747-409a-85ab-0155ef3ebe9d",
    "metadata": {},
    "source": [
-    "## Configure subset selection"
+    "### Configure subset selection"
    ]
   },
   {
@@ -436,7 +453,7 @@
    "id": "d2421d07-3e6c-4355-95f4-da8e157557c7",
    "metadata": {},
    "source": [
-    "## Run QA generation on selected chunks"
+    "### Run QA generation on selected chunks"
    ]
   },
   {
@@ -499,7 +516,7 @@
    "id": "ea64b8f0-dd6c-4776-8646-9731433f909b",
    "metadata": {},
    "source": [
-    "## Read generated QAs and restructure"
+    "### Read generated QAs and restructure"
    ]
   },
   {
@@ -543,7 +560,7 @@
    "id": "9b6d6c26-f4d5-420d-ae78-ac28cf39efd3",
    "metadata": {},
    "source": [
-    "## Define metadata for qna.yaml"
+    "### Define metadata for qna.yaml"
    ]
   },
   {
@@ -562,7 +579,7 @@
    "id": "dafa8927-e56c-448b-b88b-f8d854c25d4d",
    "metadata": {},
    "source": [
-    "## Output qna.yaml"
+    "### Output qna.yaml"
    ]
   },
   {
@@ -626,7 +643,7 @@
    "id": "ed9ea149-844b-4330-90ec-d0ca7ab12b90",
    "metadata": {},
    "source": [
-    "## View generated qna.yaml"
+    "### View generated qna.yaml"
    ]
   },
   {
@@ -813,7 +830,7 @@
    "id": "1f101076-a50f-49ea-a83b-46eaa8b39cc4",
    "metadata": {},
    "source": [
-    "# Create Seed Dataset for SDG\n",
+    "## Create Seed Dataset for SDG\n",
     "\n",
     "This notebook combines the contents from the qna.yaml and the chunks from the source document to create a seed dataset for the synthetic data generation process.\n",
     "\n",