Author: John Mitchell (@whmatrix)
Status: ACTIVE
Audience: ML Engineers / Data Architects / Recruiters
Environment: GPU recommended (CPU-only mode available)

Fast Path (under 60 seconds):

```bash
cd mini-index && python demo_query.py
```
Portfolio repository demonstrating large-scale semantic indexing pipelines.
8,355,163 vectors indexed across Wikipedia Featured Articles, ArXiv ML abstracts, and StackExchange Python using e5-large-v2 embeddings and FAISS IndexFlatIP.
A professional-grade semantic search indexing system demonstrating production-ready RAG capabilities.
This repo implements the indexing pipeline. For the full operational context, see:
- Universal Protocol v4.23 — Deliverable contracts, audit contracts, and quality gates
- Example deliverable structure: Semantic_Indexing_Output__Example_Structure.pdf
- RAG Readiness Audit examples:
- Earlier foundational work: semantic-indexing-batch-01 (superseded)
This portfolio showcases three diverse, real-world datasets indexed with a Universal Protocol-compliant pipeline:
- Wikipedia Featured Articles - High-quality encyclopedia content (352,606 vectors)
- StackExchange Python Q&A - Real developer community discussions (7,513,263 vectors)
- ArXiv ML Abstracts - Scientific research from ML/CS domains (489,294 vectors)
Total: 8,355,163 vectors across all datasets ✅ Production Complete
```
PORTFOLIO_INDEXING_PROJECTS/
├── README.md                         # This file
├── datasets/                         # Raw source datasets (26.5 GB total)
│   ├── wiki_featured/                # 491 MB, 5 JSONL files
│   ├── arxiv_ml_abstracts/           # 7.0 GB, 9 JSONL files
│   └── stackexchange_python/         # 19 GB, 75 Parquet files
│
├── scripts/                          # All indexing & processing scripts
│   ├── indexers/                     # Core indexing engines (UAIO-compliant)
│   │   ├── index_wiki_featured.py
│   │   ├── index_arxiv_ml_abstracts.py
│   │   ├── index_stackexchange_python.py
│   │   └── index_stackexchange_split{1-5}.py
│   │
│   ├── preparation/                  # Dataset preparation scripts
│   │   ├── prepare_wiki_featured.sh
│   │   ├── prepare_arxiv_ml.sh
│   │   ├── prepare_stackexchange_python.sh
│   │   └── PREPARE_ALL_PORTFOLIO_DATASETS.sh
│   │
│   ├── wrappers/                     # DRY-RUN and PRODUCTION wrappers
│   │   ├── RUN_PORTFOLIO_WIKI_*.sh
│   │   ├── RUN_PORTFOLIO_ARXIV_*.sh
│   │   ├── RUN_PORTFOLIO_STACKEXCHANGE_*.sh
│   │   └── RUN_PORTFOLIO_ALL_*.sh
│   │
│   └── merge/                        # Multi-process merge utilities
│       ├── merge_stackexchange_splits.py
│       └── merge_stackexchange_splits_efficient.py
│
├── results/                          # Final indexed results
│   ├── indexes/                      # FAISS indexes (ready for deployment)
│   │   ├── wiki_featured/            # 352,606 vectors
│   │   ├── arxiv_ml_abstracts/       # 489,294 vectors
│   │   ├── stackexchange_python/     # 7,513,263 vectors (merged)
│   │   └── stackexchange_split{1-5}/ # Individual splits (archived)
│   │
│   └── work_dirs/                    # Temporary work directories
│
└── documentation/                    # Complete usage guides
    ├── PORTFOLIO_DATASETS_README.md
    └── ORGANIZATION.md
```
All indexers implement the Universal Batch Indexing & Verification Engine (UAIO):
- ✅ Producer/Consumer Architecture - Streaming pipeline with queue coordination
- ✅ RAM Balancer - Auto pause at 90%, resume at 70%
- ✅ GPU Balancer - OOM handling with batch size reduction
- ✅ Locked Batch Size - 1300 (proven stable)
- ✅ Memory Footprint - 1-3GB VRAM per indexer
- ✅ Signal Handling - Graceful shutdown (SIGINT/SIGTERM)
- ✅ Checkpointing - Every 1M vectors, fully resumable
- ✅ Atomic Writes - .tmp file swaps prevent corruption
- ✅ Integrity Checks - len(vectors) == len(chunks) == len(metadata)
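The atomic-write step can be sketched in a few lines; `atomic_write_json` is an illustrative name, not the repo's actual API:

```python
import json
import os

def atomic_write_json(path, obj):
    """Write JSON to a sibling .tmp file, then swap it into place.

    os.replace() is atomic, so a crash mid-write can never leave a
    truncated file at the final path: readers see either the old file
    or the complete new one.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(obj, f)
        f.flush()
        os.fsync(f.fileno())  # force bytes to disk before the swap
    os.replace(tmp_path, path)
```

The same pattern applies to shard files and checkpoints: write to `.tmp`, sync, then rename.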
Parallel Processing: Successfully demonstrated 7 concurrent indexers (1 Wikipedia + 1 ArXiv + 5 StackExchange splits) running in parallel with proper VRAM management.
Split-Merge Pattern: StackExchange dataset split into 5 parallel operations (15 files each) for 5× speedup, then merged into single unified index.
Resource Management:
- VRAM: 18GB / 49GB (7 indexers × ~2.5GB each)
- RAM: 87GB+ free with automatic pressure balancing
- GPU: 100% utilization, optimal throughput
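The RAM balancer's pause-at-90%, resume-at-70% behavior can be sketched with psutil. This is a standalone illustration with assumed polling logic; the repo's balancer is integrated into the producer/consumer loop:

```python
import time

import psutil

def ram_balancer_gate(pause_at=90.0, resume_at=70.0, poll_s=5.0):
    """Block the caller (producer) while system RAM usage is too high.

    Hysteresis: production pauses once usage crosses pause_at and only
    resumes after it falls back below resume_at, which prevents rapid
    pause/resume flapping around a single threshold.
    """
    if psutil.virtual_memory().percent < pause_at:
        return  # headroom available, keep producing
    while psutil.virtual_memory().percent >= resume_at:
        time.sleep(poll_s)
```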
- Source: English Wikipedia featured/quality articles
- Format: JSONL (5 files, 491 MB)
- Content: Full encyclopedia articles with title, categories, and content
- Rows: ~39,716 articles
- Vectors: 352,606 (avg 8.9 chunks per article)
- Use Case: General knowledge retrieval, content understanding
- Status: ✅ Complete
- Source: Stack Exchange Python-related Q&A dataset
- Format: Parquet (75 files, 19 GB)
- Content: Questions, accepted answers, tags, scores
- Rows: 6,378,706 Q&A pairs
- Vectors: 7,513,263 (avg 1.178 chunks per Q&A)
- Use Case: Technical support, developer tools, code Q&A
- Innovation: Split into 5 parallel indexers for 5× speedup, then merged
- Status: ✅ Complete (merged from 5 splits)
- Source: RedPajama ArXiv ML/CS papers
- Format: JSONL (9 files, 7.0 GB, RedPajama format)
- Content: Scientific papers with arxiv_id, date, and content
- Rows: ~123K papers
- Vectors: 489,294 (avg 4.0 chunks per paper)
- Use Case: Research retrieval, academic applications
- Status: ✅ Complete
- Model: `intfloat/e5-large-v2`
- Dimensions: 1024
- Precision: FP16
- Prefix: `"passage: "` for all chunks
- Chunking: 1500 characters, word-aligned
- Shard Size: 200K vectors per .npy file
- Checkpoint Interval: 1M vectors
- Batch Size: 1300 (locked, auto-reduces on OOM)
- Queue Soft Max: 50K items
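One plausible reading of the word-aligned chunking rule, sketched for illustration (the repo's chunker may handle whitespace, overlap, and overlong words differently):

```python
def chunk_text(text, max_chars=1500):
    """Split text into word-aligned chunks of at most max_chars.

    Words are never split mid-token: a word that would overflow the
    current chunk starts the next chunk instead.
    """
    chunks, current, length = [], [], 0
    for word in text.split():
        # +1 accounts for the joining space when the chunk is non-empty
        extra = len(word) + (1 if current else 0)
        if current and length + extra > max_chars:
            chunks.append(" ".join(current))
            current, length = [], 0
            extra = len(word)
        current.append(word)
        length += extra
    if current:
        chunks.append(" ".join(current))
    return chunks
```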
- Type: FAISS IndexFlatIP (inner product similarity)
- Outputs:
  - `vectors.index` - FAISS index file
  - `chunks.json` - Text chunks
  - `metadata.jsonl` - Chunk metadata
  - `summary.json` - Index statistics
Normalization: All embeddings are L2-normalized (unit norm) at encode time.
- Formula: `v_norm = v / ||v||_2`, where `||v_norm||_2 = 1.0`
- Applied to both passage embeddings (at index time) and query embeddings (at search time)
Why IndexFlatIP equals cosine similarity:
- Cosine similarity: `cos(a, b) = dot(a, b) / (||a|| * ||b||)`
- When `||a|| = ||b|| = 1.0`: `cos(a, b) = dot(a, b)`, the plain inner product
- FAISS IndexFlatIP computes the exact inner product: no approximation, no quantization
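The identity is easy to verify numerically with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=1024)
b = rng.normal(size=1024)

# Cosine similarity on the raw vectors
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Inner product on L2-normalized copies (what IndexFlatIP computes)
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
inner = a_n @ b_n

assert np.isclose(cosine, inner)
```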
Asymmetric encoding (per E5 model specification):
- Queries: prefixed with `"query: "` before encoding
- Chunks: prefixed with `"passage: "` before encoding
- This asymmetry is trained into the model and is required for optimal retrieval
Score interpretation:
| Score Range | Meaning |
|---|---|
| 0.90 - 1.00 | Near-identical semantic content |
| 0.80 - 0.90 | Strongly related |
| 0.70 - 0.80 | Topically related |
| < 0.70 | Weak or incidental overlap |
Scores are meaningful for ranking within a single index but not for cross-index or cross-model comparison. A score of 0.85 in a Wikipedia index is not comparable to 0.85 in a StackExchange index.
Query-time flow:
1. Take the input query string
2. Prefix it with `"query: "`
3. Encode with e5-large-v2 and L2-normalize
4. Compute the inner product against all indexed vectors (FAISS IndexFlatIP)
5. Return the top-k results ranked by descending score
Reproducibility: Identical input text + model version + normalization = deterministic embeddings. Rebuilding from the same dataset produces byte-identical FAISS indices.
This repo indexes 8.35M vectors from 26.5 GB of source data — it requires an NVIDIA GPU with 48 GB VRAM and 128 GB RAM.
Want to try the pipeline without the hardware requirements? Use research-corpus-discovery, which runs the same embedding model and FAISS index type on a small PDF corpus:
```bash
git clone https://github.com/whmatrix/research-corpus-discovery
cd research-corpus-discovery
pip install -r scripts/requirements.txt
python scripts/build_index.py --pdf_dir ./sample_docs/ --output_dir ./demo_index
python scripts/query.py --index ./demo_index/faiss.index --chunks ./demo_index/chunks.jsonl
```

See QUICK_START.md for a full walkthrough. The pipeline, embedding model (e5-large-v2), and index type (FAISS IndexFlatIP) are the same; only the scale differs.
If you have the datasets locally, you can also smoke-test with dry-run mode, which validates the pipeline without running GPU embeddings:
```bash
./scripts/wrappers/RUN_PORTFOLIO_ALL_DRYRUN.sh
```

Repository contents:

- `scripts/` - All indexing code (fully runnable)
- `mini-index/` - Tiny 20-doc demo (proves pipeline, <5 seconds)
- `ARCHITECTURE.md` - System design + data flow diagrams
- `documentation/` - Setup guides + specifications
- `.github/workflows/` - Smoke-test CI/CD (on every push)
- Datasets (26.5 GB total)
  - Wikipedia Featured: Public, fetch with `scripts/preparation/prepare_wiki_featured.sh`
  - ArXiv ML: Public, fetch with `scripts/preparation/prepare_arxiv_ml.sh`
  - StackExchange: Public, fetch with `scripts/preparation/prepare_stackexchange_python.sh`
- Full Production Indices (70 GB total)
  - Or rebuild locally (see "Reproduce the Full Index" below)
```bash
# Step 1: Get datasets (automatic download)
./scripts/preparation/PREPARE_ALL_PORTFOLIO_DATASETS.sh

# Step 2: Dry-run validation (no GPU, fast)
./scripts/wrappers/RUN_PORTFOLIO_ALL_DRYRUN.sh

# Step 3: Build production indexes (GPU recommended)
./scripts/wrappers/RUN_PORTFOLIO_ARXIV_PRODUCTION.sh
./scripts/wrappers/RUN_PORTFOLIO_WIKI_PRODUCTION.sh
./scripts/wrappers/RUN_PORTFOLIO_STACKEXCHANGE_PRODUCTION.sh
```

Don't want to download 26.5 GB? Verify the pipeline works instantly:

```bash
cd mini-index
pip install sentence-transformers faiss-cpu
python demo_query.py
```

This loads a real FAISS index, runs 3 semantic queries, returns ranked results, and proves the pipeline is end-to-end functional. See mini-index/summary.json for quality metrics.
```bash
cd <repo-root>/

# Prepare all datasets
./scripts/preparation/PREPARE_ALL_PORTFOLIO_DATASETS.sh

# Test with dry-run
./scripts/wrappers/RUN_PORTFOLIO_ALL_DRYRUN.sh

# Run production (individual)
./scripts/wrappers/RUN_PORTFOLIO_WIKI_PRODUCTION.sh
./scripts/wrappers/RUN_PORTFOLIO_STACKEXCHANGE_PRODUCTION.sh
./scripts/wrappers/RUN_PORTFOLIO_ARXIV_PRODUCTION.sh
```

For StackExchange, use the split indexers for a 5× speedup:

```bash
# Start all 5 splits (run manually with staggered starts)
python3 scripts/indexers/index_stackexchange_split1.py > /tmp/se_split1.log 2>&1 &
python3 scripts/indexers/index_stackexchange_split2.py > /tmp/se_split2.log 2>&1 &
# ... (wait for each to start embedding before launching the next)

# After all complete, merge
python3 scripts/merge/merge_stackexchange_splits.py
```

```bash
# Watch GPU usage
watch -n 1 nvidia-smi

# Monitor individual indexer progress
tail -f /tmp/wiki_production.log
tail -f /tmp/stackexchange_production.log
tail -f /tmp/arxiv_production.log

# Check split progress
tail -f /tmp/stackexchange_split{1..5}.log
```

Repository Structure (Production-Ready):
```
./results/indexes/
├── wiki_featured/              # 352,606 vectors
├── arxiv_ml_abstracts/         # 489,294 vectors
├── stackexchange_python/       # 7,513,263 vectors (merged)
└── stackexchange_split{1-5}/   # Individual splits (archived)
```
Each index directory contains:
```
<dataset_name>/
├── vectors.index     # FAISS IndexFlatIP (1024-dim, ~29GB for StackExchange)
├── chunks.json       # Text chunks corresponding to vectors (~5.8GB for StackExchange)
├── metadata.jsonl    # Per-vector metadata (~1.3GB for StackExchange)
└── summary.json      # Dataset summary & integrity verification
```
- Single Indexer: ~1300 vectors per batch at ~13 seconds per batch (~100 vectors/sec)
- 7 Parallel Indexers: ~9,100 vectors per 13 seconds = ~700 vectors/sec aggregate
- VRAM Efficiency: ~2.5GB per indexer (well below 3GB target)
- Successfully demonstrated 7 concurrent indexers
- Linear scaling with available VRAM (can run 2-15 indexers on RTX A6000)
- RAM balancer prevents memory exhaustion
- GPU stays at 100% utilization
- Multi-Domain Expertise: Wikipedia (general), StackExchange (technical), ArXiv (academic)
- Production-Ready: Signal handling, checkpointing, atomic writes, integrity checks
- Scalable Architecture: Parallel processing, split-merge patterns
- Resource Efficient: Optimal VRAM/RAM usage, automatic balancing
- Real-World Data: 8.35M+ vectors from authentic sources
- Enterprise Patterns: Logging, monitoring, graceful degradation
- ✅ Parallel multi-dataset indexing
- ✅ Split-merge workflow for large datasets
- ✅ Automatic resource management
- ✅ OOM recovery and batch adaptation
- ✅ Resumable long-running operations
- ✅ Integrity verification and atomic commits
See /documentation/PORTFOLIO_DATASETS_README.md for:
- Detailed dataset preparation instructions
- Where to download source data
- Expected formats and structures
- Troubleshooting guides
- Advanced configuration
- GPU: NVIDIA RTX A6000 (48GB) or similar
- RAM: 128GB+ recommended for large datasets
- Storage:
- NVMe: ~500GB for work directories
- HDD: ~100GB for final indexes
- Software:
- Python 3.8+
- PyTorch with CUDA
- FAISS-GPU
- sentence-transformers
This portfolio demonstrates production-ready indexing pipelines built with:
- Embedding Model: intfloat/e5-large-v2
- Index Engine: FAISS (Facebook AI)
- Architecture: Universal Batch Indexing & Verification Engine (UAIO)
All code follows Universal Protocol standards for reproducibility and compliance.
Created: December 2025
Status: Production-Ready
Contact: Professional portfolio demonstration
| Deliverable | Format | Guarantee |
|---|---|---|
| Vector index | FAISS IndexFlatIP (exact cosine via L2-normalized inner product) | Deterministic, byte-reproducible |
| Chunk corpus | JSONL with metadata | len(vectors) == len(chunks) == len(metadata) |
| Audit summary | JSON manifest | Pass/fail quality gates per Universal Protocol v4.23 |
What this is not: No human-judged relevance labels. No MRR/MAP/NDCG claims. Scores are cosine similarity (vector alignment), not precision or recall. Domain suitability requires independent evaluation.
Reproduce it:

```bash
git clone https://github.com/whmatrix/semantic-indexing-batch-02
cd semantic-indexing-batch-02/mini-index
pip install sentence-transformers faiss-cpu
python demo_query.py
```
This index demonstrates large-scale semantic indexing capability (8.35M+ vectors) but makes no claims about retrieval quality, relevance, or suitability for specific applications. Use case specificity and evaluation require domain-specific testing.
This indexing run conforms to the Universal Protocol v4.23.
All dataset ingestion, chunking, embedding, FAISS construction, and validation artifacts follow the schemas and constraints defined there.