
Semantic Chunking Implementation Guide

Overview

This guide covers upgrading from simple character-based chunking (1,000 characters with 200-character overlap) to semantic chunking with LlamaIndex, for better retrieval quality.


Why Semantic Chunking?

Simple Chunking (Current):

"The tariff code for chocolate is 1806.32. This applies to products
containing cocoa. The rate is 5.6%. <SPLIT> Special provisions apply
for organic chocolate under code 1806.32.10..."
  • ❌ Splits mid-concept
  • ❌ May break tables
  • ❌ Loses semantic coherence

Semantic Chunking (New):

Chunk 1: "The tariff code for chocolate is 1806.32. This applies to
         products containing cocoa. The rate is 5.6%."
         
Chunk 2: "Special provisions apply for organic chocolate under code
         1806.32.10. These require certification..."
  • ✅ Keeps related concepts together
  • ✅ Tables extracted and handled specially
  • ✅ Better retrieval quality

Implementation

Pipeline:

1. Docling Extraction
   ↓ (Markdown with structure preserved)
2. LlamaIndex MarkdownElementNodeParser
   ↓ (Extract tables)
3. LlamaIndex SemanticSplitterNodeParser  
   ↓ (Semantic text chunking)
4. NVIDIA NIM Embeddings
   ↓ (1024-dim vectors)
5. Milvus Indexing
   ✅ (HNSW index)

Key Features

1. Semantic Splitting

from llama_index.core.node_parser import SemanticSplitterNodeParser

SemanticSplitterNodeParser(
    buffer_size=1,  # Number of sentences grouped per embedding comparison
    breakpoint_percentile_threshold=95,  # Split where similarity drops below the 95th percentile
    embed_model=nvidia_nim_wrapper  # NVIDIA NIM embedding wrapper
)
  • Uses embeddings to detect topic changes
  • Splits at semantic boundaries
  • Preserves conceptual coherence

2. Table Extraction

from llama_index.core.node_parser import MarkdownElementNodeParser

MarkdownElementNodeParser(llm=None, num_workers=1)
  • Identifies tables in markdown
  • Extracts them whole (no mid-table splits)
  • Creates dedicated table nodes

3. Failure Persistence (NEW!)

When chunks fail after all retries:

persist_failed_chunks(chunks, filename, error, collection)

Saves to: /data/ingestion_failures/{collection}/{timestamp}_{filename}.json

Format:

{
  "timestamp": "2025-11-27T10:30:00",
  "collection": "us_tariffs",
  "source_file": "Chapter_17.pdf",
  "error": "HTTP 400 after 3 retries",
  "chunk_count": 3,
  "chunks": [
    {
      "text": "First 500 chars...",
      "full_text": "Complete chunk text",
      "type": "text",
      "metadata": {...},
      "error": "HTTP 400",
      "batch_index": 42
    }
  ]
}
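A minimal sketch of what `persist_failed_chunks` could look like; the function name and the JSON fields come from this guide, while the exact chunk-dict keys and the `base_dir` parameter are assumptions:

```python
import json
import os
from datetime import datetime

def persist_failed_chunks(chunks, filename, error, collection,
                          base_dir="/data/ingestion_failures"):
    """Write failed chunks to a timestamped JSON file for later replay."""
    out_dir = os.path.join(base_dir, collection)
    os.makedirs(out_dir, exist_ok=True)
    now = datetime.now()
    record = {
        "timestamp": now.isoformat(timespec="seconds"),
        "collection": collection,
        "source_file": filename,
        "error": error,
        "chunk_count": len(chunks),
        "chunks": [
            {
                "text": c["full_text"][:500],  # preview for quick inspection
                "full_text": c["full_text"],   # complete text for replay
                "type": c.get("type", "text"),
                "metadata": c.get("metadata", {}),
                "error": c.get("error", error),
                "batch_index": c.get("batch_index"),
            }
            for c in chunks
        ],
    }
    path = os.path.join(out_dir, f"{now.strftime('%Y%m%d_%H%M%S')}_{filename}.json")
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
    return path
```

Because the directory sits on the `/data` PVC, the file survives pod restarts and can be picked up by the replay tool later.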

4. Replay Mechanism

python scripts/replay_failed_chunks.py us_tariffs

What it does:

  1. Scans /data/ingestion_failures/us_tariffs/
  2. Loads all failure logs
  3. Retries each failed chunk individually
  4. Inserts successful chunks to Milvus
  5. Archives processed logs to processed/ subdirectory
  6. Reports recovery statistics
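The six steps above can be sketched as follows. The `insert_chunk` callable stands in for whatever function embeds a chunk and inserts it into Milvus; its name, and the `base_dir` parameter, are assumptions for illustration:

```python
import json
import shutil
from pathlib import Path

def replay_failures(collection, insert_chunk,
                    base_dir="/data/ingestion_failures"):
    """Retry persisted failures one chunk at a time, archiving processed logs.

    `insert_chunk` embeds and indexes a single chunk dict into Milvus;
    it should raise on failure.
    """
    failure_dir = Path(base_dir) / collection
    archive_dir = failure_dir / "processed"
    archive_dir.mkdir(parents=True, exist_ok=True)

    recovered, still_failing = 0, 0
    for log_file in sorted(failure_dir.glob("*.json")):
        record = json.loads(log_file.read_text())
        for chunk in record["chunks"]:
            try:
                insert_chunk(chunk)  # retry each chunk individually
                recovered += 1
            except Exception:
                still_failing += 1
        # archive the log so the next run does not reprocess it
        shutil.move(str(log_file), archive_dir / log_file.name)

    return {"recovered": recovered, "still_failing": still_failing}
```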

Recovery Strategy (4-Tier)

Tier 1: Batch with Retry

Send 10 chunks as batch
→ Fail → Wait 1s, retry
→ Fail → Wait 2s, retry  
→ Fail → Wait 4s, retry

Tier 2: Individual Chunk Processing

Split batch into 10 individual chunks
→ Try each separately
→ Success: Index to Milvus
→ Fail: Move to Tier 3
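Tiers 1-3 together can be sketched like this. The `send`, `persist`, and `delays` names are hypothetical; `send` stands in for the embed-and-insert call and `persist` for the failure-persistence step described in Tier 3:

```python
import time

def send_with_backoff(batch, send, delays=(1, 2, 4)):
    """Tier 1: send the batch, retrying after 1s, 2s, and 4s on failure."""
    for delay in delays + (None,):
        try:
            return send(batch)
        except Exception:
            if delay is None:
                raise  # out of retries; caller falls back to Tier 2
            time.sleep(delay)

def index_with_fallback(batch, send, persist, delays=(1, 2, 4)):
    """Tiers 1-3: batch with retry, then per-chunk, then persist failures."""
    try:
        send_with_backoff(batch, send, delays)
        return
    except Exception:
        pass
    for chunk in batch:  # Tier 2: isolate the failing chunks
        try:
            send_with_backoff([chunk], send, delays)
        except Exception as exc:
            persist(chunk, exc)  # Tier 3: save to disk and continue
```

Because only the chunks that still fail individually reach `persist`, one bad chunk no longer takes down a whole batch, and the job always runs to completion.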

Tier 3: Persist to Disk

Save failed chunk to JSON:
  /data/ingestion_failures/{collection}/{timestamp}_{file}.json
→ Continue with other chunks
→ Job completes successfully

Tier 4: Manual Replay (Later)

python replay_failed_chunks.py {collection}
→ Load persisted failures
→ Retry with fresh API state
→ Report recovery rate

Comparison: Simple vs Semantic

| Aspect | Simple (Current) | Semantic (New) |
| --- | --- | --- |
| Method | Character-based (1000 chars) | Embedding similarity-based |
| Boundaries | Arbitrary | Semantic topic changes |
| Tables | May split mid-table | Extracted whole |
| Coherence | ⚠️ Can break concepts | ✅ Keeps ideas together |
| Retrieval | Good | ⭐ Excellent |
| Speed | Fast (~2-5s per file) | Slower (~10-15s per file) |
| Quality | Adequate | Superior |

Current Collections (Simple Chunking)

| Collection | Chunks | Files | Status |
| --- | --- | --- | --- |
| us_tariffs | 24,452 | 132 PDFs | ✅ Complete |
| congress | 1,710,693 | 4,747 txt | ✅ Complete |
| sustainability | 19,626 | 80 files | ✅ Complete |

Total: 1,754,771 chunks with simple chunking


Re-ingestion Plan with Semantic Chunking

Option A: Fresh Start (Recommended)

# Drop existing collections and reingest with semantic chunking
DROP_EXISTING=true
  • ✅ Clean slate
  • ✅ Consistent chunking strategy
  • ✅ Best quality
  • ⏱️ Time: 6-8 hours total

Option B: Incremental Addition

# Keep existing, add semantic versions as new collections
# us_tariffs_semantic, congress_semantic, sustainability_semantic
  • ✅ Keep current collections working
  • ✅ Compare quality side-by-side
  • ✅ Can A/B test
  • ⏱️ Time: 6-8 hours (parallel to existing)

Option C: Test First

# Reingest just sustainability with semantic (smallest collection)
# Compare quality in UI
# Then decide on tariffs/congress
  • ✅ Low risk
  • ✅ Quick validation
  • ⏱️ Time: 1-2 hours for test

Files Created

Scripts:

  • scripts/ingest_with_semantic_chunking.py - Main semantic ingestion
  • scripts/replay_failed_chunks.py - Failure replay tool

Docker:

  • docker/ingestion-docling.Dockerfile - Updated with LlamaIndex

Docs:

  • SEMANTIC_CHUNKING_GUIDE.md - This file

Failure Persistence Details

Storage Location:

/data/ingestion_failures/
├── us_tariffs/
│   ├── 20251127_103045_Chapter_17.pdf.json
│   ├── 20251127_114523_Chapter_85.pdf.json
│   └── processed/  (archived after replay)
├── congress/
│   └── ...
└── sustainability/
    └── ...

PVC Mount:

The ingestion jobs already mount /data PVC, so failures persist across:

  • Pod restarts
  • Job restarts
  • Node replacements

Replay Usage:

# From local machine
kubectl run replay-tool --rm -i --restart=Never \
  --image=962716963657.dkr.ecr.us-west-2.amazonaws.com/docling-ingestion:v3 \
  -n rag-blueprint \
  --overrides='...(mount /data PVC)...' \
  -- python /scripts/replay_failed_chunks.py us_tariffs

# Or from within cluster as a Job
kubectl apply -f k8s/replay-failures-job.yaml

Next Steps

  1. Build new Docker image with LlamaIndex:

    docker build -f docker/ingestion-docling.Dockerfile -t docling-ingestion:v3-semantic .
    docker push {ECR}/docling-ingestion:v3-semantic
  2. Test on sustainability first:

    # Small collection, quick to test
    kubectl apply -f k8s/sustainability-semantic-job.yaml
  3. Compare quality in UI

  4. Decide on full re-ingestion


Benefits of This Approach

  • Retry Logic: 3 attempts with exponential backoff
  • Batch-Splitting: chunks retried individually if the batch fails
  • Failure Persistence: failures saved to disk
  • Replay Tool: failed chunks recovered later
  • Incremental: survives restarts
  • Size-Aware: batch sizes capped to prevent gRPC crashes
  • Semantic Quality: better retrieval
  • Table Handling: tables stay intact

Result: Bulletproof ingestion with superior chunk quality! 🎯


Questions to Consider

  1. Re-ingest all 3 collections or test semantic on sustainability first?
  2. Drop existing collections or create parallel _semantic versions?
  3. Build Docker image now or test locally first?

What would you like to do?