Target: All core loops < 5ms (excluding LLM latency)
Version: 1.0.0
Last Updated: 2026-01-01
This guide provides comprehensive performance optimization recommendations for ReasonKit deployments, from development laptops to production clusters.
- Baseline Performance Expectations
- Configuration Options
- Memory Optimization
- Embedding Model Selection
- Batch Processing
- Profiling and Benchmarking
- Production Deployment
- Troubleshooting
All non-LLM operations must complete within the specified targets:
| Operation | Target | Expected Baseline | Notes |
|---|---|---|---|
| Retrieval Operations | |||
| Sparse search (BM25) @ 100 docs | < 5ms | ~2.5ms | Pure text matching |
| Hybrid search @ 100 docs | < 10ms | ~8.0ms | Sparse + dense fusion |
| Concurrent 8 queries | < 15ms | ~12ms | Parallel execution |
| Fusion Operations | |||
| RRF @ 100 results (2 methods) | < 5ms | ~1.2ms | Reciprocal Rank Fusion |
| Weighted sum @ 100 results | < 5ms | ~1.5ms | Linear combination |
| Multi-method (3 methods) | < 10ms | ~3.5ms | Complex fusion |
| Embedding Operations | |||
| Mock embedding (cache hit) | < 0.1ms | ~0.02ms | In-memory lookup |
| Batch embedding @ 100 | < 5ms | ~4.5ms | Parallel processing |
| Cosine similarity @ 384-dim | < 1ms | ~0.08ms | Vector math |
| RAPTOR Operations | |||
| Tree creation @ 100 leaves | < 50ms | ~8ms | Hierarchical clustering |
| Tree traversal @ 1000 nodes | < 5ms | ~2.5ms | DFS/BFS |
| Node lookup | < 0.1ms | ~0.03ms | HashMap access |
| Ingestion Operations | |||
| Fixed chunking @ 10KB | < 5ms | ~1.8ms | Text splitting |
| Sentence chunking @ 10KB | < 10ms | ~3.2ms | NLP-based splitting |
| Parallel @ 100 docs | < 50ms | ~35ms | Rayon parallelism |
| ThinkTools Operations | |||
| Protocol orchestration | < 5ms | ~1-2ms | Excluding LLM time |
| Template rendering | < 1ms | ~100-500us | String substitution |
| Confidence extraction | < 0.5ms | ~50-200us | Regex parsing |
| Profile lookup | < 0.1ms | ~0.03ms | Registry access |
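For context on the fusion rows above, here is a minimal sketch of Reciprocal Rank Fusion. It is illustrative only and not ReasonKit's internal API; the constant `k = 60` is the commonly used default, not necessarily the value used here.

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: score(d) = sum over methods of 1 / (k + rank_d).
/// `ranked_lists` holds document ids in rank order, best first.
fn rrf(ranked_lists: &[Vec<u64>], k: f64) -> Vec<(u64, f64)> {
    let mut scores: HashMap<u64, f64> = HashMap::new();
    for list in ranked_lists {
        for (rank, doc_id) in list.iter().enumerate() {
            *scores.entry(*doc_id).or_insert(0.0) += 1.0 / (k + (rank as f64 + 1.0));
        }
    }
    let mut fused: Vec<(u64, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    let bm25 = vec![1, 2, 3];
    let dense = vec![3, 1, 4];
    println!("{:?}", rrf(&[bm25, dense], 60.0));
}
```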
LLM calls are the dominant factor in end-to-end latency:
| Provider | Model | Typical Latency | Notes |
|---|---|---|---|
| OpenAI | gpt-4o | 1-5s | Varies by prompt length |
| Anthropic | claude-sonnet-4 | 1-4s | Consistent performance |
| Anthropic | claude-opus-4 | 2-8s | Higher quality, slower |
| Local | DeepSeek-V3 | 0.5-2s | Self-hosted, GPU required |
| Local | Llama 3.3 70B | 1-3s | Self-hosted, GPU required |
Key Insight: ReasonKit's overhead is typically <1% of total execution time. Focus optimization efforts on reducing LLM calls, not on ReasonKit internals.
Structured prompting via ThinkTools reduces output variance:
| Metric | Raw Prompts | Structured Prompts | Improvement |
|---|---|---|---|
| Mean Agreement Rate (TARa) | 96.0% | 98.0% | +2.0 pp |
| Inconsistency Rate | 4.0% | 2.0% | -2.0 pp |
| Variance Reduction | - | - | 5.0% |
Results from Claude Sonnet 4 at temperature=0, N=5 runs per question.
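As a rough illustration of how an agreement rate can be computed from repeated runs, here is a sketch that scores each question by the share of runs matching its modal answer and averages across questions. This is an assumed definition for illustration only; the benchmark's actual metric is defined in `benchmarks/variance_benchmark.py`.

```rust
use std::collections::HashMap;

/// Mean agreement rate: per question, the fraction of runs matching the most
/// frequent (modal) answer, averaged over all questions.
/// Assumed definition for illustration; see variance_benchmark.py for the real one.
fn mean_agreement_rate(runs_per_question: &[Vec<String>]) -> f64 {
    let per_question: Vec<f64> = runs_per_question
        .iter()
        .map(|answers| {
            let mut counts: HashMap<&str, usize> = HashMap::new();
            for a in answers {
                *counts.entry(a.as_str()).or_insert(0) += 1;
            }
            let modal = counts.values().copied().max().unwrap_or(0);
            modal as f64 / answers.len() as f64
        })
        .collect();
    per_question.iter().sum::<f64>() / per_question.len() as f64
}
```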
ReasonKit's main configuration file (`config/default.toml`) groups the performance-relevant settings:

```toml
[general]
data_dir = "./data"
log_level = "info" # trace, debug, info, warn, error
[storage]
backend = "qdrant" # "qdrant" or "local"
[storage.qdrant]
host = "localhost"
port = 6334
grpc_port = 6333
embedded = true # Use embedded mode (no external server)
collection = "reasonkit_docs"
vector_size = 1536 # Must match embedding model dimensions (1536 for text-embedding-3-small below; 1024 for BGE-M3)
distance = "Cosine" # Cosine, Euclid, or Dot
quantization = true # 4x memory reduction, ~2% accuracy loss
[storage.local]
documents_path = "./data/documents"
index_path = "./data/indexes"
[embedding]
backend = "api" # "api" or "local"
[embedding.api]
provider = "openai"
model = "text-embedding-3-small"
dimensions = 1536
batch_size = 100
[embedding.local]
model_path = "./models/bge-m3-onnx"
device = "cpu" # or "cuda"
[indexing]
bm25_enabled = true
hnsw_m = 16 # Higher = more accuracy, more memory
hnsw_ef_construction = 200 # Higher = slower indexing, better quality
[processing]
chunk_size = 512 # tokens
chunk_overlap = 50 # tokens
min_chunk_size = 100 # Avoid tiny fragments
[retrieval]
top_k = 10
alpha = 0.7 # 0 = BM25 only, 1 = vector only
min_score = 0.0
rerank = false # Enable for better quality, higher latency
[raptor]
enabled = false
max_depth = 3
cluster_size = 10
summarizer = "claude-sonnet-4-5"
[server]
host = "127.0.0.1"
port = 8080
cors = true
timeout = 30
```

| Setting | Performance Impact | When to Use |
|---|---|---|
| `embedded = true` | Fastest startup, no network overhead | Development, single-node |
| `embedded = false` | Slightly higher latency, scalable | Production, multi-node |
| `quantization = true` | 4x memory reduction, ~2% accuracy loss | Memory-constrained |
| `quantization = false` | Full precision | Maximum accuracy required |
```toml
[indexing]
hnsw_m = 16 # Connections per node (8-64)
hnsw_ef_construction = 200 # Build-time beam width (100-500)
```

| `hnsw_m` | Memory/Vector | Search Speed | Accuracy |
|---|---|---|---|
| 8 | ~256 bytes | Very fast | 95% |
| 16 | ~512 bytes | Fast | 98% |
| 32 | ~1KB | Medium | 99% |
| 64 | ~2KB | Slow | 99.5% |
Recommendation: Start with hnsw_m = 16 for most use cases.
The `alpha` parameter controls the balance between sparse (BM25) and dense (vector) search:

```text
alpha = 0.0 -> 100% BM25 (keyword matching)
alpha = 0.5 -> 50/50 hybrid
alpha = 0.7 -> 70% vector, 30% BM25 (recommended default)
alpha = 1.0 -> 100% vector (semantic only)
```
| Use Case | Recommended Alpha | Rationale |
|---|---|---|
| Technical documentation | 0.5 | Keywords matter |
| General knowledge | 0.7 | Semantic similarity important |
| Code search | 0.3 | Exact matches critical |
| Conversational search | 0.8 | Semantic understanding key |
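The blend itself is a weighted sum over normalized per-method scores. A minimal sketch, assuming scores have already been normalized to a common range (ReasonKit's actual fusion code may normalize differently):

```rust
/// Blend normalized sparse (BM25) and dense (vector) scores for one document.
/// alpha = 0.0 -> pure BM25, alpha = 1.0 -> pure vector search.
fn hybrid_score(bm25_norm: f64, dense_norm: f64, alpha: f64) -> f64 {
    (1.0 - alpha) * bm25_norm + alpha * dense_norm
}

fn main() {
    // With the recommended default alpha = 0.7, vector similarity dominates.
    let score = hybrid_score(0.4, 0.9, 0.7);
    println!("fused score = {score:.3}"); // 0.3 * 0.4 + 0.7 * 0.9 = 0.75
}
```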
Environment variables commonly used for deployment and performance tuning:

```bash
# Core settings
export REASONKIT_DATA_DIR="/path/to/data"
export REASONKIT_LOG_LEVEL="info"
# API keys (never commit these!)
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export COHERE_API_KEY="..."
export VOYAGE_API_KEY="..."
# Qdrant connection (if not embedded)
export QDRANT_HOST="localhost"
export QDRANT_PORT="6334"
export QDRANT_API_KEY="..."
# Performance tuning
export RAYON_NUM_THREADS="8" # Parallel processing threads
export TOKIO_WORKER_THREADS="4" # Async runtime threads
```

| Component | Memory Per Unit | Scaling Factor |
|---|---|---|
| Raw document | ~1x document size | Linear |
| Chunks | ~1.1x raw (overlap) | Linear |
| BM25 index | ~0.2x corpus size | Linear |
| Dense embeddings (f32) | 4 bytes × dim × chunks | Linear |
| Dense embeddings (quantized) | 1 byte × dim × chunks | Linear |
| HNSW graph | ~512 bytes * chunks (m=16) | Linear |
| RAPTOR tree | ~2x leaf nodes | Logarithmic |
Enable scalar quantization to cut vector memory:

```toml
[storage.qdrant]
quantization = true # Scalar quantization to uint8
```

Impact: ~2% accuracy reduction for 4x memory savings.
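For intuition, a sketch of what scalar quantization does to each stored vector: map f32 components linearly onto the uint8 range. Qdrant performs this internally; the helper below is illustrative only.

```rust
/// Scalar-quantize an f32 vector to u8: map [min, max] linearly onto [0, 255].
/// Storage drops from 4 bytes to 1 byte per dimension (4x reduction).
fn quantize(v: &[f32]) -> (Vec<u8>, f32, f32) {
    let min = v.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = v.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let scale = if max > min { 255.0 / (max - min) } else { 0.0 };
    let q: Vec<u8> = v.iter().map(|x| ((x - min) * scale).round() as u8).collect();
    (q, min, max) // min/max are kept so scores can be de-quantized at query time
}
```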
Reduce embedding dimensions where the accuracy trade-off is acceptable:

```toml
[embedding.api]
model = "text-embedding-3-small"
dimensions = 512 # Down from 1536
```

| Dimensions | Memory/Chunk | Accuracy (relative) |
|---|---|---|
| 1536 | 6KB | 100% |
| 1024 | 4KB | 98% |
| 512 | 2KB | 95% |
| 256 | 1KB | 90% |
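For text-embedding-3 models, lower dimensions amount to truncating the full vector and re-normalizing, which is effectively what the `dimensions` request parameter does. A sketch of the equivalent client-side operation, for illustration:

```rust
/// Truncate an embedding to `dims` components and L2-renormalize.
/// Mirrors what setting `dimensions` does for text-embedding-3 models.
fn truncate_embedding(full: &[f32], dims: usize) -> Vec<f32> {
    let mut v: Vec<f32> = full[..dims.min(full.len())].to_vec();
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in &mut v {
            *x /= norm;
        }
    }
    v
}
```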
Increase chunk size to reduce the number of stored vectors:

```toml
[processing]
chunk_size = 1024 # Up from 512
chunk_overlap = 100
```

Doubles content per chunk, halves total chunks.
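A back-of-the-envelope check on why this roughly halves the chunk count: each chunk advances by `chunk_size - chunk_overlap` tokens. The helper below is a hypothetical estimator, not a ReasonKit function.

```rust
/// Approximate number of chunks for fixed-size chunking with overlap.
/// Each chunk advances by (chunk_size - overlap) tokens.
fn estimated_chunks(total_tokens: usize, chunk_size: usize, overlap: usize) -> usize {
    let stride = chunk_size - overlap;
    (total_tokens.saturating_sub(overlap) + stride - 1) / stride // ceiling division
}

fn main() {
    // 512-token chunks with 50 overlap vs 1024-token chunks with 100 overlap.
    println!("{}", estimated_chunks(100_000, 512, 50));   // ~217 chunks
    println!("{}", estimated_chunks(100_000, 1024, 100)); // ~109 chunks (about half)
}
```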
Disable RAPTOR if hierarchical summaries are not needed:

```toml
[raptor]
enabled = false
```

RAPTOR trees add ~2x overhead for hierarchical summaries.
```bash
# Check process memory
ps aux | grep rk
# Detailed memory breakdown
cat /proc/$(pgrep rk)/status | grep -E "VmRSS|VmSize"
# Qdrant collection stats
curl localhost:6333/collections/reasonkit_docs | jq '.result.points_count, .result.vectors_count'
```

| Corpus Size | Chunks (512 tok) | Memory (quantized) | Memory (full) |
|---|---|---|---|
| 1,000 docs | ~10,000 | ~100MB | ~400MB |
| 10,000 docs | ~100,000 | ~1GB | ~4GB |
| 100,000 docs | ~1,000,000 | ~10GB | ~40GB |
| 1,000,000 docs | ~10,000,000 | ~100GB | ~400GB |
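These figures follow from the per-component formulas above. Below is a rough estimator for the vector side alone (dense embeddings plus the HNSW graph); raw text, chunk overlap, and the BM25 index add more on top. The helper and the ~32-bytes-per-link HNSW figure are assumptions for illustration.

```rust
/// Rough memory estimate for dense embeddings plus the HNSW graph, in bytes.
/// `bytes_per_dim` is 4.0 for f32 or 1.0 with scalar quantization.
fn vector_memory_bytes(chunks: u64, dims: u64, bytes_per_dim: f64, hnsw_m: u64) -> f64 {
    let embeddings = chunks as f64 * dims as f64 * bytes_per_dim;
    // ~32 bytes per HNSW link, i.e. ~512 bytes/vector at m = 16 (see table above).
    let hnsw = chunks as f64 * hnsw_m as f64 * 32.0;
    embeddings + hnsw
}

fn main() {
    // 10,000 docs at ~10 chunks each = 100,000 chunks, 1024 dims, quantized, m = 16.
    let mb = vector_memory_bytes(100_000, 1024, 1.0, 16) / 1e6;
    println!("~{mb:.0} MB for embeddings + HNSW"); // text, BM25, and overhead add more
}
```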
| Provider | Model | Dimensions | Cost/1M tokens | Quality |
|---|---|---|---|---|
| OpenAI | text-embedding-3-large | 3072 | $0.13 | Excellent |
| OpenAI | text-embedding-3-small | 1536 | $0.02 | Good |
| Cohere | embed-english-v3.0 | 1024 | $0.10 | Excellent |
| Voyage | voyage-2 | 1024 | $0.10 | Excellent |
| Anthropic | (via Voyage) | 1024 | $0.10 | Excellent |
| Model | Dimensions | Memory | Speed (CPU) | Speed (GPU) |
|---|---|---|---|---|
| BGE-M3 | 1024 | ~2GB | 50ms/text | 5ms/text |
| BGE-Large-EN | 1024 | ~1.5GB | 40ms/text | 4ms/text |
| all-MiniLM-L6 | 384 | ~100MB | 10ms/text | 1ms/text |
| E5-Large-v2 | 1024 | ~1.5GB | 40ms/text | 4ms/text |
```text
High accuracy, cost acceptable:
  -> OpenAI text-embedding-3-large (3072-dim)
  -> Cohere embed-english-v3.0 (1024-dim)

Good accuracy, cost-sensitive:
  -> OpenAI text-embedding-3-small (1536-dim)
  -> Local BGE-M3 with GPU

Low latency, self-hosted:
  -> Local all-MiniLM-L6 (384-dim, CPU-friendly)
  -> Local E5-Large-v2 with GPU

Multilingual:
  -> BGE-M3 (best multilingual support)
  -> Cohere embed-multilingual-v3.0
```
```bash
# Install with local-embeddings feature
cargo build --release --features local-embeddings
# Download BGE-M3 ONNX model
mkdir -p models/bge-m3-onnx
curl -L https://huggingface.co/BAAI/bge-m3/resolve/main/onnx/model.onnx \
-o models/bge-m3-onnx/model.onnx
curl -L https://huggingface.co/BAAI/bge-m3/resolve/main/tokenizer.json \
  -o models/bge-m3-onnx/tokenizer.json
```

Configure in `config/default.toml`:
```toml
[embedding]
backend = "local"
[embedding.local]
model_path = "./models/bge-m3-onnx"
device = "cuda" # or "cpu"ReasonKit automatically caches embeddings to avoid recomputation:
- Cache location: `$REASONKIT_DATA_DIR/embeddings_cache/`
- Cache key: `SHA-256(model_id + text)`
- Cache format: Binary (4 bytes per dimension)
Cache performance:
| Operation | Latency |
|---|---|
| Cache hit | < 0.1ms |
| Cache miss (API) | 100-500ms |
| Cache miss (local) | 5-50ms |
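A sketch of the cache-key scheme described above, assuming the `sha2` crate; the exact concatenation, encoding, and file layout ReasonKit uses may differ.

```rust
use sha2::{Digest, Sha256}; // sha2 = "0.10"

/// Cache key: SHA-256 over model id and text, hex-encoded for use as a file name.
fn embedding_cache_key(model_id: &str, text: &str) -> String {
    let mut hasher = Sha256::new();
    hasher.update(model_id.as_bytes());
    hasher.update(text.as_bytes());
    hasher
        .finalize()
        .iter()
        .map(|b| format!("{b:02x}"))
        .collect()
}

fn main() {
    let key = embedding_cache_key("text-embedding-3-small", "hello world");
    println!("embeddings_cache/{key}.bin"); // stored as 4 bytes per dimension
}
```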
```bash
rk ingest /path/to/document.md

# Process directory recursively
rk ingest /path/to/documents/ --recursive
# With parallel processing
rk ingest /path/to/documents/ --parallel 8
# Stream from stdin (for pipelines)
find /path -name "*.md" | rk ingest --stdin
```

```bash
# Set thread count via environment
export RAYON_NUM_THREADS=8
# Or via CLI
rk ingest /path --threads 8
```

Optimal thread count:
| CPU Cores | Recommended Threads | Notes |
|---|---|---|
| 2 | 2 | Limited parallelism |
| 4 | 4 | Good for development |
| 8 | 6-8 | Leave 2 for system |
| 16+ | 12-14 | Diminishing returns |
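One way to apply this table programmatically; `recommended_threads` is a hypothetical helper, not part of the ReasonKit CLI.

```rust
use std::thread;

/// Pick a worker-thread count per the table above: use all cores on small
/// machines, leave headroom for the system on larger ones.
fn recommended_threads() -> usize {
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(4);
    match cores {
        0..=4 => cores,
        5..=8 => cores - 2,
        _ => (cores * 7) / 8, // diminishing returns beyond ~14 threads
    }
}

fn main() {
    // e.g. export RAYON_NUM_THREADS accordingly before launching `rk ingest`.
    println!("RAYON_NUM_THREADS={}", recommended_threads());
}
```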
Avoid sending one embedding request per chunk:

```toml
[embedding.api]
batch_size = 1 # One at a time
```

Batch requests instead:

```toml
[embedding.api]
batch_size = 100 # Batch requests
```

For finer control, stream chunks through a batching embedder:

```rust
// In custom code
let embedder = BatchEmbedder::new(config)
.with_batch_size(100)
.with_concurrent_batches(4);
embedder.embed_streaming(chunks_iter).await?;
```

| Corpus Size | Documents | Chunks | Ingestion Time |
|---|---|---|---|
| Small | 100 | 1,000 | ~30s |
| Medium | 1,000 | 10,000 | ~5min |
| Large | 10,000 | 100,000 | ~1hr |
| Enterprise | 100,000 | 1,000,000 | ~10hr |
Times assume API embeddings with batch_size=100. Local embeddings with GPU are ~5x faster.
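Those ballpark times are essentially the number of embedding batches times the wall-clock cost per batch. A rough estimator, assuming ~3s per batch of 100 (API round trip plus chunking and indexing):

```rust
/// Rough ingestion-time estimate: chunks / batch_size round trips to the
/// embedding API, at an assumed wall-clock cost per batch.
fn ingestion_time_secs(chunks: u64, batch_size: u64, secs_per_batch: f64) -> f64 {
    let batches = (chunks + batch_size - 1) / batch_size; // ceiling division
    batches as f64 * secs_per_batch
}

fn main() {
    // ~3s per batch of 100 reproduces the ballpark figures in the table above.
    println!("{:.0} min", ingestion_time_secs(100_000, 100, 3.0) / 60.0); // ~50 min
}
```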
```bash
# Run all benchmarks
cargo bench
# Run specific benchmark suite
cargo bench --bench retrieval_bench
cargo bench --bench fusion_bench
cargo bench --bench embedding_bench
cargo bench --bench thinktool_bench
```

```bash
# Save current performance as baseline
cargo bench -- --save-baseline main
# Compare against baseline after changes
cargo bench -- --baseline main
```

```bash
./scripts/run_benchmarks.sh # Run all
./scripts/run_benchmarks.sh fusion # Run specific
./scripts/run_benchmarks.sh --baseline # Save baseline
./scripts/run_benchmarks.sh --compare # Check regressions
```

| Suite | File | What It Tests |
|---|---|---|
| Retrieval | `benches/retrieval_bench.rs` | BM25, hybrid search, scaling |
| Fusion | `benches/fusion_bench.rs` | RRF, weighted sum, multi-method |
| Embedding | `benches/embedding_bench.rs` | Dense/sparse, batch, cache |
| RAPTOR | `benches/raptor_bench.rs` | Tree operations, traversal |
| Ingestion | `benches/ingestion_bench.rs` | Chunking, cleaning, parallel |
| ThinkTools | `benches/thinktool_bench.rs` | Protocol execution, profiles |
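Each suite uses the standard Criterion layout. A minimal skeleton for adding a new benchmark; the group, function, and benchmark names below are placeholders, not the actual bench code.

```rust
// criterion = "0.5" as a dev-dependency; register the file under [[bench]]
// in Cargo.toml with harness = false.
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Placeholder for the operation under test.
fn fuse_results(n: usize) -> usize {
    (0..n).sum()
}

fn fusion_benchmark(c: &mut Criterion) {
    c.bench_function("fusion/weighted_sum_100", |b| {
        // black_box prevents the compiler from optimizing the call away.
        b.iter(|| fuse_results(black_box(100)))
    });
}

criterion_group!(benches, fusion_benchmark);
criterion_main!(benches);
```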
```text
retrieval_bench/sparse_search/query_0
    time:   [2.3821 ms 2.4234 ms 2.4697 ms]
    change: [-1.8234% +0.2543% +2.3109%] (p = 0.78 > 0.05)
    No change in performance detected.
```
- time: [lower bound, estimate, upper bound] at 95% confidence
- change: Comparison to previous run
- p-value: < 0.05 indicates statistically significant change
After running benchmarks, view detailed reports:
```bash
open target/criterion/report/index.html
```

Reports include:
- Performance trends over time
- Violin plots showing distribution
- Regression analysis
- Historical comparisons
```bash
# Install flamegraph
cargo install flamegraph
# Generate flamegraph
cargo flamegraph --bin rk -- query "test query"
```

```bash
# Record performance data
perf record -g cargo run --release -- query "test"
# Generate report
perf report
```

```bash
# Time Profiler
xcrun xctrace record --template 'Time Profiler' --launch rk query "test"
```

Test LLM output consistency with structured prompting:
```bash
cd benchmarks
python variance_benchmark.py --runs 20 --model claude-sonnet-4-20250514
# Quick test
python variance_benchmark.py --quick
# Output to specific directory
python variance_benchmark.py -o results/
```

Hardware requirements scale with corpus size:

| Corpus | CPU | RAM | Storage | GPU |
|---|---|---|---|---|
| Up to 10,000 documents | 4 cores | 8GB | 50GB SSD | - |
| Up to 100,000 documents | 8-16 cores | 32GB | 200GB NVMe | - |
| 1M+ documents | 32+ cores | 128GB+ | 1TB+ NVMe | NVIDIA A100 (for local embeddings) |
For high availability and scale:
```yaml
# docker-compose.yml
version: "3.8"
services:
  qdrant-1:
    image: qdrant/qdrant:latest
    volumes:
      - qdrant_data_1:/qdrant/storage
    environment:
      - QDRANT__CLUSTER__ENABLED=true
    ports:
      - "6333:6333"
      - "6334:6334"
  qdrant-2:
    image: qdrant/qdrant:latest
    volumes:
      - qdrant_data_2:/qdrant/storage
    environment:
      - QDRANT__CLUSTER__ENABLED=true
      - QDRANT__CLUSTER__P2P__PORT=6335
volumes:
  qdrant_data_1:
  qdrant_data_2:
```

ReasonKit uses HTTP connection pooling for optimal performance:
```rust
// Already configured in llm.rs
reqwest::Client::builder()
.timeout(Duration::from_secs(120))
.pool_max_idle_per_host(10)
.pool_idle_timeout(Duration::from_secs(90))
.tcp_keepalive(Duration::from_secs(60))
    .build()
```

Impact: Eliminates TLS handshake overhead (100-500ms per call).
| Metric | Target | Alert Threshold |
|---|---|---|
| Query latency p50 | < 100ms | > 500ms |
| Query latency p99 | < 500ms | > 2s |
| Ingestion rate | > 10 docs/sec | < 1 doc/sec |
| Memory usage | < 80% | > 90% |
| Error rate | < 0.1% | > 1% |
| Cache hit rate | > 80% | < 50% |
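If you compute p50/p99 yourself rather than in Prometheus/Grafana, a nearest-rank percentile over raw latency samples is enough. A minimal sketch (hypothetical helper):

```rust
/// Nearest-rank percentile of latency samples, in milliseconds.
fn percentile(samples: &mut [f64], p: f64) -> f64 {
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let rank = ((p / 100.0) * samples.len() as f64).ceil() as usize;
    samples[rank.saturating_sub(1).min(samples.len() - 1)]
}

fn main() {
    let mut latencies = vec![42.0, 55.0, 61.0, 48.0, 1200.0, 75.0, 90.0, 52.0];
    println!("p50 = {} ms", percentile(&mut latencies, 50.0)); // 55 ms
    println!("p99 = {} ms", percentile(&mut latencies, 99.0)); // 1200 ms
}
```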
```bash
# Enable metrics endpoint
rk serve --metrics-port 9090
```

Import the provided dashboard:
`grafana/dashboards/reasonkit-overview.json`
Symptoms: Queries take > 1s (excluding LLM time)
Diagnosis:
```bash
# Check collection stats
curl localhost:6333/collections/reasonkit_docs | jq

# Check index status
rk stats --verbose
```

Solutions:
- Ensure HNSW index is built: `rk index rebuild`
- Enable quantization for memory pressure
- Reduce `top_k` if returning too many results
- Check BM25 index: `rk index check-bm25`
Symptoms: OOM errors or swap thrashing
Diagnosis:
```bash
ps aux | grep rk
cat /proc/$(pgrep rk)/status | grep VmRSS
```

Solutions:
- Enable quantization (`quantization = true`)
- Reduce embedding dimensions
- Increase chunk size
- Use external Qdrant instead of embedded
Symptoms: Documents take > 10s each to ingest
Diagnosis:
```bash
# Check embedding latency
time rk embed "test text"

# Check chunking speed
time rk chunk /path/to/doc.md
```

Solutions:
- Increase `batch_size` for API embeddings
- Use local embeddings with GPU
- Increase parallel threads
- Check network latency to embedding API
Symptoms: Same query produces different results
Diagnosis:
```bash
# Run variance benchmark
python benchmarks/variance_benchmark.py --quick
```

Solutions:
- Use structured prompting (ThinkTools profiles)
- Set `temperature = 0` for deterministic output
- Enable self-consistency voting
- Use the `--paranoid` profile for critical queries
Before every release:
- All benchmarks compile without errors
- No regressions > 5% vs baseline
- Core loops remain < 5ms
- Scalability confirmed (linear growth)
- No memory leaks in batch operations
- Cache hit rates stable
```bash
# Full quality check
just qa

# Benchmark comparison
cargo bench -- --baseline main
./scripts/run_benchmarks.sh --compare
```

Enable detailed performance logging:
```bash
RUST_LOG=reasonkit=debug rk query "test"

# Even more detail
RUST_LOG=reasonkit=trace rk query "test"
```

- Check the GitHub Issues
- Review TROUBLESHOOTING.md
- Ask in GitHub Discussions
- Email: support@reasonkit.sh
| File | Purpose | Key Tests |
|---|---|---|
| `benches/retrieval_bench.rs` | Search performance | BM25, hybrid, scaling, concurrent |
| `benches/fusion_bench.rs` | Result fusion | RRF, weighted sum, multi-method |
| `benches/embedding_bench.rs` | Embedding ops | Dense, sparse, batch, cache |
| `benches/raptor_bench.rs` | RAPTOR trees | Create, traverse, lookup |
| `benches/ingestion_bench.rs` | Document processing | Chunk, clean, parallel |
| `benches/thinktool_bench.rs` | ThinkTools protocols | Execute, profile, concurrent |
| `benchmarks/variance_benchmark.py` | LLM consistency | TARa, structured vs raw |
- Configured appropriate embedding model
- Set optimal chunk size for content type
- Enabled quantization if memory-constrained
- Configured connection pooling
- Set appropriate HNSW parameters
- Tested with representative workload
- Query latency within targets
- Memory usage stable
- Cache hit rates acceptable
- Error rates minimal
- Ingestion throughput sufficient
- Run full benchmark suite weekly
- Compare against baseline monthly
- Review and compact indexes quarterly
- Update embedding models as needed
- Criterion.rs Documentation
- Rust Performance Book
- Qdrant Performance Tuning
- ReasonKit Architecture
- ThinkTools Guide
- Benchmark Summary
"All core loops < 5ms" - This is our contract with users.
ReasonKit Performance Engineering Version 1.0.0 | https://reasonkit.sh