Dev #20 (Merged)

14 changes: 12 additions & 2 deletions .env.example
@@ -50,13 +50,17 @@ LLM_DEFAULT_MODEL=azure-gpt4mini
LLM_TEMPERATURE=0.6

# LLM cache settings
LLM_CACHE_DIR=workspace/cache
LLM_CACHE_DIR=workspace/cache/llm
LLM_CACHE_TTL=0

# Graph cache settings
GRAPH_CACHE_DIR=workspace/cache/graph

LLM_MAX_RETRIES=3
LLM_TIMEOUT=60
LLM_MAX_TOKENS=

# Disable caching (true/false)
# Disable caching (true/false) - when true, skips reading cache but still writes
LLM_NOCACHE=false

# Rate limiting settings
@@ -211,6 +215,12 @@ LLM_TOKEN_LIMIT_ANTHROPIC_HAIKU=32000
# Workspace directory for cloned repositories and runtime data
REPOSITORY_WORKSPACE_DIR=workspace/repositories

# ==============================================================================
# OUTPUT CONFIGURATION
# ==============================================================================
# Default output path for the exported ArchiMate XML file
OUTPUT_DIR=workspace/output/model.xml

# ==============================================================================
# DATABASE CONFIGURATION (DuckDB)
# ==============================================================================
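
The updated `LLM_NOCACHE` comment above describes a read-skip, write-through cache. A minimal sketch of that behaviour, assuming a file-per-key JSON cache and a hypothetical `llm_call` callable (not the actual Deriva adapter API):

```python
import hashlib
import json
import os
from pathlib import Path

# Illustrative assumptions: cache layout and llm_call are hypothetical,
# not the actual Deriva adapter internals.
CACHE_DIR = Path(os.getenv("LLM_CACHE_DIR", "workspace/cache/llm"))
NOCACHE = os.getenv("LLM_NOCACHE", "false").lower() == "true"

def cached_completion(prompt: str, llm_call) -> str:
    """Skip *reading* the cache when LLM_NOCACHE=true, but still write the result."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"

    if not NOCACHE and path.exists():
        return json.loads(path.read_text())["response"]

    response = llm_call(prompt)  # always call the provider when nocache is set
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"response": response}))  # write-through either way
    return response
```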
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -76,7 +76,7 @@ jobs:
run: uv sync --extra dev

- name: Run tests
run: uv run pytest --cov --cov-report=xml --cov-fail-under=75 -v -m "not integration"
run: uv run pytest --cov --cov-report=xml --cov-fail-under=79 -v -m "not integration"

- name: Upload coverage
uses: codecov/codecov-action@v4
6 changes: 3 additions & 3 deletions ARCHITECTURE.MD
@@ -44,7 +44,7 @@ Repository --> Extraction --> Graph --> Derivation --> ArchiMate Model --> Export
| | +---------+ +-------------+ | | +-------------+ +-------------+ | |
| | | Neo4j | | Database | | | | Extraction | | Derivation | | |
| | | (neo4j)| | (database) | | | | | | | | |
| | +---------+ +-------------+ | | | - Business | | - Enrich | | |
| | +---------+ +-------------+ | | | - Business | | - Prep | | |
| | +---------+ +-------------+ | | | - TypeDef | | - Generate | | |
| | | Graph | | ArchiMate | | | | - Method | | - Refine | | |
| | | (graph)| | (archimate)| | | | - Tech | | | | |
@@ -212,7 +212,7 @@ Each layer has a `ruff.toml` file enforcing boundaries:
+------------+------------+
|
3. DERIVE +------------+------------+
(Enrich) | PageRank, Louvain, |
(Prep) | PageRank, Louvain, |
| K-core enrichment |
+------------+------------+
|
@@ -227,7 +227,7 @@ Each layer has a `ruff.toml` file enforcing boundaries:
| Quality Assurance |
+------------+------------+
|
4. EXPORT +--------> .archimate XML file
4. EXPORT +--------> .xml file (Open Exchange ArchiMate format)
```

### Data Storage
25 changes: 23 additions & 2 deletions BENCHMARKS.md
@@ -121,6 +121,26 @@ deriva benchmark run \

> **Important:** Use `--no-cache` for initial benchmarks to measure actual LLM variance. Cached runs always produce identical outputs.

#### Per-Repository Mode

By default, multiple repos are combined into one model. Use `--per-repo` to benchmark each repository individually:

```bash
# Per-repo mode: each repo gets its own benchmark runs
deriva benchmark run \
--repos repo1,repo2 \
--models gpt4 \
-n 3 \
--per-repo
# Creates 6 runs: 2 repos × 1 model × 3 iterations
```

**When to use per-repo mode:**

- Comparing model performance across different codebases
- Testing prompts that may work better for certain repo structures
- Getting independent consistency scores per repository
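
To make the run counts concrete, here is a rough sketch of how a run plan could be enumerated in each mode; the `plan_runs` helper and tuple layout are illustrative, not the benchmark internals:

```python
from itertools import product

def plan_runs(repos, models, n, per_repo=False):
    """Enumerate benchmark runs: per-repo mode multiplies runs by len(repos),
    combined mode treats all repos as a single workload."""
    if per_repo:
        return [(repo, model, i) for repo, model, i in product(repos, models, range(n))]
    return [(tuple(repos), model, i) for model, i in product(models, range(n))]

# 2 repos x 1 model x 3 iterations
print(len(plan_runs(["repo1", "repo2"], ["gpt4"], 3, per_repo=True)))   # 6 runs
print(len(plan_runs(["repo1", "repo2"], ["gpt4"], 3, per_repo=False)))  # 3 runs (combined)
```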

### 2. Analyze Results

```bash
@@ -190,6 +210,7 @@ deriva benchmark run --repos <repos> --models <models> [options]
--no-cache Disable all LLM caching
--nocache-configs Configs to skip cache for (comma-separated)
--no-export-models Disable exporting ArchiMate model files
--per-repo Run each repo as separate benchmark (default: combine all)
-v, --verbose Show detailed text progress
-q, --quiet Disable progress bar display

@@ -322,7 +343,7 @@ Deriva supports a two-phase derivation architecture via the `defer_relationships`
- Reduced ordering effects improve consistency
- Graph-aware filtering more effective with complete element set

See [optimization_guide.md](optimization_guide.md#separated-derivation-phases-phase-46) for implementation details.
See [OPTIMIZATION.md](OPTIMIZATION.md#separated-derivation-phases-phase-46) for implementation details.

---

@@ -338,5 +359,5 @@ See [optimization_guide.md](optimization_guide.md#separated-derivation-phases-ph

## Further Reading

- [optimization_guide.md](optimization_guide.md) - Detailed case studies, prompt engineering findings, and optimization log
- [OPTIMIZATION.md](OPTIMIZATION.md) - Detailed case studies, prompt engineering findings, and optimization log
- [CONTRIBUTING.md](CONTRIBUTING.md) - Architecture and development patterns
28 changes: 25 additions & 3 deletions CHANGELOG.md
@@ -6,15 +6,37 @@ Deriving ArchiMate models from code using knowledge graphs, heuristics and LLM's

# v0.6.x - Deriva (December 2025 - January 2026)

## v0.6.7 - (Unreleased)
## v0.6.7 - (January 15 2026)

### Caching & Performance
- **Graph Cache**: New `cache.py` in graph adapter with hash-based cache for expensive graph queries
- **Common Cache Utils**: Shared `cache_utils.py` module unifying cache patterns across graph and LLM adapters

### Pipeline Phases
- **Derivation Prep Phase**: Renamed `enrich` phase to `prep` throughout codebase (modules, services, configs, CLI, tests)
- **Extraction Phases**: Added `--phase classify` and `--phase parse` options to extraction CLI for granular control

### Configuration Rationalization
- **Settings Principle**: New "Who Changes It" architecture - `.env` for ops/deployment (secrets, connections, provider settings), database for user tuning (algorithms, thresholds)
- **Algorithm Settings in DB**: PageRank damping/iterations/tolerance, Louvain resolution, confidence thresholds, batch sizes now in `system_settings` table
- **LLM Settings in .env**: Rate limits, timeouts, backoff config remain in environment (provider-specific operational settings)

### Benchmarking
- **Rich Progress Bars**: Fixed phase tracking in CLI benchmark runs with proper Rich progress display
- **Per-Repo Flag**: New `--per-repo` flag for running multiple repositories without combining results
- **XML Export**: Changed default export format from `.archimate` to `.xml` for broader compatibility

### Documentation
- **MD Files Review**: Comprehensive pass on all markdown files for accuracy and consistent style
- **Config Pattern Docs**: Updated CONTRIBUTING.md with configuration ownership table and rationale

### Fixed
- **Graph bugs**: Fixed Neo4j relationship syntax in structural_consistency.py and a bug in duplicate_elements.py
- **bench-hash Cache Fix**: Fixed cache hit detection in manager.py

### Updated
- **Smarter retries**:Added retry-after header parsing to rate_limiter.py and updated providers.py to pass headers to rate limiter
- **Muted Neo4j**: Supressed Neo4j notifications during benchmark runs, with toggle in .env
- **Smarter retries**: Added retry-after header parsing to rate_limiter.py and updated providers.py to pass headers to rate limiter
- **Muted Neo4j**: Suppressed Neo4j notifications during benchmark runs, with toggle in .env

---

39 changes: 28 additions & 11 deletions CONTRIBUTING.md
@@ -168,7 +168,7 @@ User clicks "Run Pipeline" in app/app.py OR runs `deriva run` in CLI
│ with PipelineSession() as session: │
│ session.run_extraction(repo_name="my-repo") │
│ session.run_derivation() │
│ session.export_model("output.archimate")
│ session.export_model("output.xml")
│ │
│ # For reactive UI (Marimo): │
│ stats = session.get_graph_stats() │
@@ -178,19 +178,20 @@ User clicks "Run Pipeline" in app/app.py OR runs `deriva run` in CLI
┌─────────────────────────────────────────────────────────────┐
│ EXTRACTION (inside services.extraction) │
├─────────────────────────────────────────────────────────────┤
│ Phases: classify → parse │
│ 1. Load config from DuckDB via services.config │
│ 2. Get repos from RepositoryManager │
│ 3. Call modules.extraction.classification [PURE]
│ 4. Call modules.extraction.structural/* [PURE]
│ 5. Call modules.extraction.llm/* [PURE + LLM]
│ 3. Classify: modules.extraction.classification [PURE] │
│ 4. Parse: modules.extraction.structural/* [PURE] │
│ 5. Parse: modules.extraction.llm/* [PURE + LLM] │
│ 6. Persist via GraphManager.add_node() [I/O] │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ DERIVATION (inside services.derivation) │
├─────────────────────────────────────────────────────────────┤
│ Phases: enrich → generate → refine │
│ 1. Enrich: Graph enrichment (PageRank, communities, k-core) │
│ Phases: prep → generate → refine
│ 1. Prep: Graph enrichment (PageRank, communities, k-core)
│ 2. Generate: Query candidates with enrichment data [I/O] │
│ 3. Generate: Call modules.derivation.{element}.generate() │
│ 4. Generate: Persist via ArchimateManager.add_element() │
@@ -1153,7 +1154,7 @@ def has_node_sources(config: Dict) -> bool

Derivation uses a hybrid approach combining graph algorithms with LLM:

- **enrich phase** - Graph enrichment (PageRank, Louvain communities, k-core analysis)
- **prep phase** - Graph enrichment (PageRank, Louvain communities, k-core analysis)
- **generate phase** - LLM-based element derivation using graph metrics for filtering
- **refine phase** - Cross-graph validation (duplicates, orphans, structural consistency)

@@ -1230,12 +1231,28 @@ def generate(
<details>
<summary><strong>Configuration Pattern</strong></summary>

### Two Types of Configuration
### Configuration Principle: "Who Changes It"

Deriva has two configuration systems:
Deriva splits configuration by **ownership** - who needs to change it and why:

1. **Environment variables (`.env`)** - Runtime settings for adapters (connections, API keys, paths)
2. **Database configs (DuckDB)** - Pipeline behavior (extraction steps, derivation prompts, patterns)
| Category | Location | Owner | Examples |
| --------------------- | ---------- | ---------- | -------------------------------------------- |
| **Secrets & Keys** | `.env` | Ops/Deploy | API keys, passwords |
| **Infrastructure** | `.env` | Ops/Deploy | Connection URIs, paths, provider URLs |
| **Provider Settings** | `.env` | Ops/Deploy | LLM rate limits, timeouts, model definitions |
| **Algorithm Tuning** | Database | Users | PageRank damping, Louvain resolution |
| **Quality Thresholds**| Database | Users | Confidence thresholds, batch sizes |
| **Pipeline Configs** | Database | Users | Extraction/derivation prompts, patterns |

**Rationale:**

- **`.env`** = deployment-specific, rarely changes, requires restart
- **Database** = tunable during optimization, versioned for rollback, UI-editable

### Two Configuration Systems

1. **Environment variables (`.env`)** - Infrastructure and provider settings
2. **Database configs (DuckDB)** - Pipeline behavior and tuning parameters
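
A rough sketch of how this split can look from calling code; the `get_setting` helper, the key/value `system_settings` schema, and the database path are illustrative assumptions, not the actual `services.config` API:

```python
import os
import duckdb

# Infrastructure / provider settings: read once from the environment (.env)
NEO4J_URI = os.getenv("NEO4J_URI", "bolt://localhost:7687")
LLM_TIMEOUT = int(os.getenv("LLM_TIMEOUT", "60"))

def get_setting(con: duckdb.DuckDBPyConnection, key: str, default: float) -> float:
    """User-tunable algorithm settings: read from the system_settings table
    (hypothetical key/value schema), so they can change without a restart."""
    row = con.execute(
        "SELECT value FROM system_settings WHERE key = ?", [key]
    ).fetchone()
    return float(row[0]) if row else default

con = duckdb.connect("workspace/deriva.duckdb")  # illustrative path
pagerank_damping = get_setting(con, "pagerank.damping", 0.85)
louvain_resolution = get_setting(con, "louvain.resolution", 1.0)
```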

### .env File (Adapter Configuration)

33 changes: 17 additions & 16 deletions OPTIMIZATION.md
@@ -75,21 +75,6 @@ uv run python -m deriva.cli.cli benchmark run \
# Step 4: Update config and repeat until 100%
```

### A/B Testing Script

For rapid iteration, use `scripts/ab_test.py`:

```bash
# Test a single config
python scripts/ab_test.py DataObject --runs 5

# Compare against baseline
python scripts/ab_test.py ApplicationService --runs 5 --baseline bench_20260110_074211

# Analyze existing session
python scripts/ab_test.py DataObject -a bench_20260110_074602
```

---

## Prompt Engineering Principles
@@ -141,6 +126,8 @@ If the answer is "no" or "it depends on the domain", the prompt is overfitting.

Guide the LLM to use GENERIC category names (data, entity, document) rather than domain-specific names.

> **Empirical support:** Liang 2025 achieved 100% accuracy on domain-specific tasks by providing carefully engineered in-context learning prompts with explicit domain constraints. Their finding that domain-specific instructions improved performance by 30% on complex cases validates the importance of abstraction-level guidance in prompts.
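
For illustration, a sketch of how an abstraction-level constraint might be phrased in a prompt template; the wording and names below are assumptions, not the shipped Deriva prompts:

```python
# Illustrative prompt fragment: steer the model toward generic categories
# rather than domain vocabulary from the repository under analysis.
CATEGORY_GUIDANCE = """
When assigning a category to each element, use GENERIC names such as
"data", "entity", "document", "configuration", or "service".
Do NOT invent domain-specific categories (e.g. "invoice", "patient",
"shipment"), even if the code uses those terms.
Output stable, deterministic results.
"""

def build_prompt(element_description: str) -> str:
    return f"{CATEGORY_GUIDANCE}\n\nElement to classify:\n{element_description}"
```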

### Key Techniques

<details>
@@ -263,10 +250,14 @@ Key findings from academic research on LLM-based ArchiMate derivation:
| Finding | Source | Implication for Deriva |
|---------|--------|------------------------|
| Few-shot prompting works without fine-tuning | Chaaben 2022 | Use in-context examples, not trained models |
| Domain-specific ICL prompts can achieve 100% accuracy | Liang 2025 | Invest in tailored prompt engineering per element type |
| Guidance texts significantly improve output | Coutinho 2025 | Include domain-specific instruction documents |
| Chain-of-thought may decrease performance | Chen 2023 | Prefer direct instructions over reasoning chains |
| High precision, low recall is the norm | Chen 2023 | Expect correct but incomplete outputs |
| Code-to-ArchiMate: 68% precision, 80% recall | Castillo 2019 | Industrial benchmark baseline for extraction |
| NLP model extraction: 83-96% correctness | Arora 2016 | Achievable with explicit naming rules |
| LLMs show higher consistency than humans | Reitemeyer 2025 | Multiple runs can improve reliability |
| **Consistency ≠ accuracy (independent properties)** | Raj 2025 | Validate correctness separately from consistency |
| Human-in-the-loop is essential | All sources | Design for validation, not full automation |

### Naming Conventions
@@ -376,6 +367,8 @@ When deriving ArchiMate elements, use these definitions and code signals:

### Validation Strategies

> **Critical caveat:** Consistency and accuracy are independent properties (Raj 2025). High consistency does NOT guarantee correctness. A process could consistently produce incorrect results. Always validate accuracy separately through manual review or ground truth comparison.

<details>
<summary><strong>Multi-Run Aggregation</strong></summary>

@@ -630,6 +623,8 @@ RETURN n.id, n.name, n.pagerank, n.kcore_level

Semantic nodes extracted by LLM have no structural relationships in the code graph. When derivation uses these as sources, the LLM has less context, leading to inconsistent outputs.

This observation aligns with broader challenges in neural-symbolic integration: Cai 2025 identifies "representation gaps between neural network outputs and structured symbolic representations" as a fundamental challenge, particularly for complex relational reasoning. The graph-based filtering approach helps bridge this gap by grounding LLM interpretation in structural context.

**Recommendation:** For element types that can use either structural or semantic sources, prefer structural sources or require minimum graph connectivity.
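
A minimal sketch of such a connectivity gate, assuming candidate nodes arrive as dicts carrying enrichment metrics; the thresholds and field names are illustrative, not the configured defaults:

```python
def filter_candidates(nodes, min_in_degree=2, min_pagerank=0.0005):
    """Keep only candidates with enough structural grounding in the code graph,
    so derivation works from well-connected nodes rather than isolated ones."""
    return [
        n for n in nodes
        if n.get("in_degree", 0) >= min_in_degree
        and n.get("pagerank", 0.0) >= min_pagerank
    ]

candidates = [
    {"id": "svc.Billing", "in_degree": 7, "pagerank": 0.012},    # structural, well connected
    {"id": "concept.Invoice", "in_degree": 0, "pagerank": 0.0},  # semantic, no code edges
]
print([n["id"] for n in filter_candidates(candidates)])  # ['svc.Billing']
```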

---
@@ -901,10 +896,11 @@ See [Graph-Based Optimization](#graph-based-optimization) for the full methodolo
4. **Add determinism instruction** - "Output stable, deterministic results" in every LLM prompt
5. **Test one config at a time** - Use `--nocache-configs` for targeted testing
6. **Examples drive consistency** - A good example JSON is more effective than verbose rules
7. **Abstraction level is key** - Use generic category names, not domain-specific names
7. **Abstraction level is key** - Use generic category names, not domain-specific names (Liang 2025: +30% improvement)
8. **Graph-based selection over name-based** - Filter by structural properties (in_degree, pagerank)
9. **Never use repository-specific rules** - All optimizations must be generic
10. **Prefer structural sources over semantic** - TypeDefinition/Method sources are more stable than BusinessConcept
11. **Consistency ≠ accuracy** - High consistency doesn't guarantee correctness; validate both independently (Raj 2025)

---

@@ -1034,10 +1030,15 @@ After Phase 4 optimizations (5 runs, mistral-devstral2, flask_invoice_generator)

| Citation | Reference | Key Contribution |
|----------|-----------|------------------|
| Arora 2016 | Arora et al., "Extracting domain models from natural-language requirements" | Industrial NLP extraction: 83-96% correctness, explicit naming rules |
| Cai 2025 | Cai et al., "Practices, opportunities and challenges in the fusion of knowledge graphs and large language models" | KG-LLM integration taxonomy (KEL/LEK/LKC), neural-symbolic representation gaps |
| Castillo 2019 | Castillo et al., "ArchiRev - Reverse engineering toward ArchiMate models" | Code-to-ArchiMate benchmark: 68% precision, 80% recall |
| Chaaben 2022 | Chaaben et al., "Towards using Few-Shot Prompt Learning for Automating Model Completion" | Few-shot prompting without fine-tuning, frequency-based ranking |
| Chaaben 2024 | Chaaben et al., "On the Utility of Domain Modeling Assistance with LLMs" | 20% time reduction, 33-56% suggestion contribution rates |
| Chen 2023 | Chen et al., "Automated Domain Modeling with LLMs: A Comparative Study" | F1 scores (0.76 classes, 0.34 relationships), chain-of-thought caution |
| Coutinho 2025 | Coutinho et al., "LLM-Based Modeling Assistance for Textual Ontology-Driven Conceptual Modeling" | Guidance texts significantly improve output quality |
| Liang 2025 | Liang et al., "Integrating Large Language Models for Automated Structural Analysis" | Domain-specific ICL achieves 100% accuracy; benchmarking methodology |
| Raj 2025 | Raj et al., "Semantic Consistency for Assuring Reliability of Large Language Models" | **Critical:** Consistency and accuracy are independent properties |
| Reitemeyer 2025 | Reitemeyer & Fill, "Applying LLMs in Knowledge Graph-based Enterprise Modeling" | LLMs show higher consistency than humans, human-in-the-loop essential |
| Wang 2025 | Wang & Wang, "Assessing Consistency and Reproducibility in LLM Outputs" | 3-5 runs optimal for consistency |
