Dev #20 (Merged)

14 changes: 12 additions & 2 deletions .env.example
@@ -50,13 +50,17 @@ LLM_DEFAULT_MODEL=azure-gpt4mini
LLM_TEMPERATURE=0.6

# LLM cache settings
LLM_CACHE_DIR=workspace/cache
LLM_CACHE_DIR=workspace/cache/llm
LLM_CACHE_TTL=0

# Graph cache settings
GRAPH_CACHE_DIR=workspace/cache/graph

LLM_MAX_RETRIES=3
LLM_TIMEOUT=60
LLM_MAX_TOKENS=

# Disable caching (true/false)
# Disable caching (true/false) - when true, skips reading cache but still writes
LLM_NOCACHE=false

# Rate limiting settings
@@ -211,6 +215,12 @@ LLM_TOKEN_LIMIT_ANTHROPIC_HAIKU=32000
# Workspace directory for cloned repositories and runtime data
REPOSITORY_WORKSPACE_DIR=workspace/repositories

# ==============================================================================
# OUTPUT CONFIGURATION
# ==============================================================================
# Default output path for the exported ArchiMate XML file
OUTPUT_DIR=workspace/output/model.xml

# ==============================================================================
# DATABASE CONFIGURATION (DuckDB)
# ==============================================================================
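
The updated `LLM_NOCACHE` comment above describes a read-skip, write-through cache. A minimal sketch of that behaviour, assuming a file-per-key JSON cache and a hypothetical `llm_call` callable (not the actual Deriva adapter API):

```python
import hashlib
import json
import os
from pathlib import Path

# Illustrative assumptions: cache layout and llm_call are hypothetical,
# not the actual Deriva adapter internals.
CACHE_DIR = Path(os.getenv("LLM_CACHE_DIR", "workspace/cache/llm"))
NOCACHE = os.getenv("LLM_NOCACHE", "false").lower() == "true"

def cached_completion(prompt: str, llm_call) -> str:
    """Skip *reading* the cache when LLM_NOCACHE=true, but still write the result."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"

    if not NOCACHE and path.exists():
        return json.loads(path.read_text())["response"]

    response = llm_call(prompt)  # always call the provider when nocache is set
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"response": response}))  # write-through either way
    return response
```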
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -76,7 +76,7 @@ jobs:
run: uv sync --extra dev

- name: Run tests
run: uv run pytest --cov --cov-report=xml --cov-fail-under=75 -v -m "not integration"
run: uv run pytest --cov --cov-report=xml --cov-fail-under=79 -v -m "not integration"

- name: Upload coverage
uses: codecov/codecov-action@v4
6 changes: 3 additions & 3 deletions ARCHITECTURE.MD
@@ -44,7 +44,7 @@ Repository --> Extraction --> Graph --> Derivation --> ArchiMate Model --> Export
| | +---------+ +-------------+ | | +-------------+ +-------------+ | |
| | | Neo4j | | Database | | | | Extraction | | Derivation | | |
| | | (neo4j)| | (database) | | | | | | | | |
| | +---------+ +-------------+ | | | - Business | | - Enrich | | |
| | +---------+ +-------------+ | | | - Business | | - Prep | | |
| | +---------+ +-------------+ | | | - TypeDef | | - Generate | | |
| | | Graph | | ArchiMate | | | | - Method | | - Refine | | |
| | | (graph)| | (archimate)| | | | - Tech | | | | |
@@ -212,7 +212,7 @@ Each layer has a `ruff.toml` file enforcing boundaries:
+------------+------------+
|
3. DERIVE +------------+------------+
(Enrich) | PageRank, Louvain, |
(Prep) | PageRank, Louvain, |
| K-core enrichment |
+------------+------------+
|
@@ -227,7 +227,7 @@ Each layer has a `ruff.toml` file enforcing boundaries:
| Quality Assurance |
+------------+------------+
|
4. EXPORT +--------> .archimate XML file
4. EXPORT +--------> .xml file (Open Exchange ArchiMate format)
```

### Data Storage
25 changes: 23 additions & 2 deletions BENCHMARKS.md
@@ -121,6 +121,26 @@ deriva benchmark run \

> **Important:** Use `--no-cache` for initial benchmarks to measure actual LLM variance. Cached runs always produce identical outputs.

#### Per-Repository Mode

By default, multiple repos are combined into one model. Use `--per-repo` to benchmark each repository individually:

```bash
# Per-repo mode: each repo gets its own benchmark runs
deriva benchmark run \
--repos repo1,repo2 \
--models gpt4 \
-n 3 \
--per-repo
# Creates 6 runs: 2 repos × 1 model × 3 iterations
```

**When to use per-repo mode:**

- Comparing model performance across different codebases
- Testing prompts that may work better for certain repo structures
- Getting independent consistency scores per repository
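
To make the run counts concrete, here is a rough sketch of how a run plan could be enumerated in each mode; the `plan_runs` helper and tuple layout are illustrative, not the benchmark internals:

```python
from itertools import product

def plan_runs(repos, models, n, per_repo=False):
    """Enumerate benchmark runs: per-repo mode multiplies runs by len(repos),
    combined mode treats all repos as a single workload."""
    if per_repo:
        return [(repo, model, i) for repo, model, i in product(repos, models, range(n))]
    return [(tuple(repos), model, i) for model, i in product(models, range(n))]

# 2 repos x 1 model x 3 iterations
print(len(plan_runs(["repo1", "repo2"], ["gpt4"], 3, per_repo=True)))   # 6 runs
print(len(plan_runs(["repo1", "repo2"], ["gpt4"], 3, per_repo=False)))  # 3 runs (combined)
```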

### 2. Analyze Results

```bash
@@ -190,6 +210,7 @@ deriva benchmark run --repos <repos> --models <models> [options]
--no-cache Disable all LLM caching
--nocache-configs Configs to skip cache for (comma-separated)
--no-export-models Disable exporting ArchiMate model files
--per-repo Run each repo as separate benchmark (default: combine all)
-v, --verbose Show detailed text progress
-q, --quiet Disable progress bar display

@@ -322,7 +343,7 @@ Deriva supports a two-phase derivation architecture via the `defer_relationships`
- Reduced ordering effects improve consistency
- Graph-aware filtering more effective with complete element set

See [optimization_guide.md](optimization_guide.md#separated-derivation-phases-phase-46) for implementation details.
See [OPTIMIZATION.md](OPTIMIZATION.md#separated-derivation-phases-phase-46) for implementation details.

---

@@ -338,5 +359,5 @@ See [optimization_guide.md](optimization_guide.md#separated-derivation-phases-ph

## Further Reading

- [optimization_guide.md](optimization_guide.md) - Detailed case studies, prompt engineering findings, and optimization log
- [OPTIMIZATION.md](OPTIMIZATION.md) - Detailed case studies, prompt engineering findings, and optimization log
- [CONTRIBUTING.md](CONTRIBUTING.md) - Architecture and development patterns
28 changes: 25 additions & 3 deletions CHANGELOG.md
@@ -6,15 +6,37 @@ Deriving ArchiMate models from code using knowledge graphs, heuristics and LLM's

# v0.6.x - Deriva (December 2025 - January 2026)

## v0.6.7 - (Unreleased)
## v0.6.7 - (January 15 2026)

### Caching & Performance
- **Graph Cache**: New `cache.py` in graph adapter with hash-based cache for expensive graph queries
- **Common Cache Utils**: Shared `cache_utils.py` module unifying cache patterns across graph and LLM adapters

### Pipeline Phases
- **Derivation Prep Phase**: Renamed `enrich` phase to `prep` throughout codebase (modules, services, configs, CLI, tests)
- **Extraction Phases**: Added `--phase classify` and `--phase parse` options to extraction CLI for granular control

### Configuration Rationalization
- **Settings Principle**: New "Who Changes It" architecture - `.env` for ops/deployment (secrets, connections, provider settings), database for user tuning (algorithms, thresholds)
- **Algorithm Settings in DB**: PageRank damping/iterations/tolerance, Louvain resolution, confidence thresholds, batch sizes now in `system_settings` table
- **LLM Settings in .env**: Rate limits, timeouts, backoff config remain in environment (provider-specific operational settings)

### Benchmarking
- **Rich Progress Bars**: Fixed phase tracking in CLI benchmark runs with proper Rich progress display
- **Per-Repo Flag**: New `--per-repo` flag for running multiple repositories without combining results
- **XML Export**: Changed default export format from `.archimate` to `.xml` for broader compatibility

### Documentation
- **MD Files Review**: Comprehensive pass on all markdown files for accuracy and consistent style
- **Config Pattern Docs**: Updated CONTRIBUTING.md with configuration ownership table and rationale

### Fixed
- **Graph bugs**: Fixed Neo4j relationship syntax in structural_consistency.py and a bug in duplicate_elements.py
- **bench-hash Cache Fix**: Fixed cache hit detection in manager.py

### Updated
- **Smarter retries**:Added retry-after header parsing to rate_limiter.py and updated providers.py to pass headers to rate limiter
- **Muted Neo4j**: Supressed Neo4j notifications during benchmark runs, with toggle in .env
- **Smarter retries**: Added retry-after header parsing to rate_limiter.py and updated providers.py to pass headers to rate limiter
- **Muted Neo4j**: Suppressed Neo4j notifications during benchmark runs, with toggle in .env

---

39 changes: 28 additions & 11 deletions CONTRIBUTING.md
@@ -168,7 +168,7 @@ User clicks "Run Pipeline" in app/app.py OR runs `deriva run` in CLI
│ with PipelineSession() as session: │
│ session.run_extraction(repo_name="my-repo") │
│ session.run_derivation() │
│ session.export_model("output.archimate")
│ session.export_model("output.xml")
│ │
│ # For reactive UI (Marimo): │
│ stats = session.get_graph_stats() │
@@ -178,19 +178,20 @@ User clicks "Run Pipeline" in app/app.py OR runs `deriva run` in CLI
┌─────────────────────────────────────────────────────────────┐
│ EXTRACTION (inside services.extraction) │
├─────────────────────────────────────────────────────────────┤
│ Phases: classify → parse │
│ 1. Load config from DuckDB via services.config │
│ 2. Get repos from RepositoryManager │
│ 3. Call modules.extraction.classification [PURE]
│ 4. Call modules.extraction.structural/* [PURE]
│ 5. Call modules.extraction.llm/* [PURE + LLM]
│ 3. Classify: modules.extraction.classification [PURE] │
│ 4. Parse: modules.extraction.structural/* [PURE] │
│ 5. Parse: modules.extraction.llm/* [PURE + LLM] │
│ 6. Persist via GraphManager.add_node() [I/O] │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ DERIVATION (inside services.derivation) │
├─────────────────────────────────────────────────────────────┤
│ Phases: enrich → generate → refine │
│ 1. Enrich: Graph enrichment (PageRank, communities, k-core) │
│ Phases: prep → generate → refine
│ 1. Prep: Graph enrichment (PageRank, communities, k-core)
│ 2. Generate: Query candidates with enrichment data [I/O] │
│ 3. Generate: Call modules.derivation.{element}.generate() │
│ 4. Generate: Persist via ArchimateManager.add_element() │
@@ -1153,7 +1154,7 @@ def has_node_sources(config: Dict) -> bool

Derivation uses a hybrid approach combining graph algorithms with LLM:

- **enrich phase** - Graph enrichment (PageRank, Louvain communities, k-core analysis)
- **prep phase** - Graph enrichment (PageRank, Louvain communities, k-core analysis)
- **generate phase** - LLM-based element derivation using graph metrics for filtering
- **refine phase** - Cross-graph validation (duplicates, orphans, structural consistency)

@@ -1230,12 +1231,28 @@ def generate(
<details>
<summary><strong>Configuration Pattern</strong></summary>

### Two Types of Configuration
### Configuration Principle: "Who Changes It"

Deriva has two configuration systems:
Deriva splits configuration by **ownership** - who needs to change it and why:

1. **Environment variables (`.env`)** - Runtime settings for adapters (connections, API keys, paths)
2. **Database configs (DuckDB)** - Pipeline behavior (extraction steps, derivation prompts, patterns)
| Category | Location | Owner | Examples |
| --------------------- | ---------- | ---------- | -------------------------------------------- |
| **Secrets & Keys** | `.env` | Ops/Deploy | API keys, passwords |
| **Infrastructure** | `.env` | Ops/Deploy | Connection URIs, paths, provider URLs |
| **Provider Settings** | `.env` | Ops/Deploy | LLM rate limits, timeouts, model definitions |
| **Algorithm Tuning** | Database | Users | PageRank damping, Louvain resolution |
| **Quality Thresholds**| Database | Users | Confidence thresholds, batch sizes |
| **Pipeline Configs** | Database | Users | Extraction/derivation prompts, patterns |

**Rationale:**

- **`.env`** = deployment-specific, rarely changes, requires restart
- **Database** = tunable during optimization, versioned for rollback, UI-editable

### Two Configuration Systems

1. **Environment variables (`.env`)** - Infrastructure and provider settings
2. **Database configs (DuckDB)** - Pipeline behavior and tuning parameters
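
A rough sketch of how this split can look from calling code; the `get_setting` helper, the key/value `system_settings` schema, and the database path are illustrative assumptions, not the actual `services.config` API:

```python
import os
import duckdb

# Infrastructure / provider settings: read once from the environment (.env)
NEO4J_URI = os.getenv("NEO4J_URI", "bolt://localhost:7687")
LLM_TIMEOUT = int(os.getenv("LLM_TIMEOUT", "60"))

def get_setting(con: duckdb.DuckDBPyConnection, key: str, default: float) -> float:
    """User-tunable algorithm settings: read from the system_settings table
    (hypothetical key/value schema), so they can change without a restart."""
    row = con.execute(
        "SELECT value FROM system_settings WHERE key = ?", [key]
    ).fetchone()
    return float(row[0]) if row else default

con = duckdb.connect("workspace/deriva.duckdb")  # illustrative path
pagerank_damping = get_setting(con, "pagerank.damping", 0.85)
louvain_resolution = get_setting(con, "louvain.resolution", 1.0)
```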

### .env File (Adapter Configuration)

33 changes: 17 additions & 16 deletions OPTIMIZATION.md
@@ -75,21 +75,6 @@ uv run python -m deriva.cli.cli benchmark run \
# Step 4: Update config and repeat until 100%
```

### A/B Testing Script

For rapid iteration, use `scripts/ab_test.py`:

```bash
# Test a single config
python scripts/ab_test.py DataObject --runs 5

# Compare against baseline
python scripts/ab_test.py ApplicationService --runs 5 --baseline bench_20260110_074211

# Analyze existing session
python scripts/ab_test.py DataObject -a bench_20260110_074602
```

---

## Prompt Engineering Principles
@@ -141,6 +126,8 @@ If the answer is "no" or "it depends on the domain", the prompt is overfitting.

Guide the LLM to use GENERIC category names (data, entity, document) rather than domain-specific names.

> **Empirical support:** Liang 2025 achieved 100% accuracy on domain-specific tasks by providing carefully engineered in-context learning prompts with explicit domain constraints. Their finding that domain-specific instructions improved performance by 30% on complex cases validates the importance of abstraction-level guidance in prompts.
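
For illustration, a sketch of how an abstraction-level constraint might be phrased in a prompt template; the wording and names below are assumptions, not the shipped Deriva prompts:

```python
# Illustrative prompt fragment: steer the model toward generic categories
# rather than domain vocabulary from the repository under analysis.
CATEGORY_GUIDANCE = """
When assigning a category to each element, use GENERIC names such as
"data", "entity", "document", "configuration", or "service".
Do NOT invent domain-specific categories (e.g. "invoice", "patient",
"shipment"), even if the code uses those terms.
Output stable, deterministic results.
"""

def build_prompt(element_description: str) -> str:
    return f"{CATEGORY_GUIDANCE}\n\nElement to classify:\n{element_description}"
```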

### Key Techniques

<details>
@@ -263,10 +250,14 @@ Key findings from academic research on LLM-based ArchiMate derivation:
| Finding | Source | Implication for Deriva |
|---------|--------|------------------------|
| Few-shot prompting works without fine-tuning | Chaaben 2022 | Use in-context examples, not trained models |
| Domain-specific ICL prompts can achieve 100% accuracy | Liang 2025 | Invest in tailored prompt engineering per element type |
| Guidance texts significantly improve output | Coutinho 2025 | Include domain-specific instruction documents |
| Chain-of-thought may decrease performance | Chen 2023 | Prefer direct instructions over reasoning chains |
| High precision, low recall is the norm | Chen 2023 | Expect correct but incomplete outputs |
| Code-to-ArchiMate: 68% precision, 80% recall | Castillo 2019 | Industrial benchmark baseline for extraction |
| NLP model extraction: 83-96% correctness | Arora 2016 | Achievable with explicit naming rules |
| LLMs show higher consistency than humans | Reitemeyer 2025 | Multiple runs can improve reliability |
| **Consistency ≠ accuracy (independent properties)** | Raj 2025 | Validate correctness separately from consistency |
| Human-in-the-loop is essential | All sources | Design for validation, not full automation |

### Naming Conventions
@@ -376,6 +367,8 @@ When deriving ArchiMate elements, use these definitions and code signals:

### Validation Strategies

> **Critical caveat:** Consistency and accuracy are independent properties (Raj 2025). High consistency does NOT guarantee correctness. A process could consistently produce incorrect results. Always validate accuracy separately through manual review or ground truth comparison.

<details>
<summary><strong>Multi-Run Aggregation</strong></summary>

@@ -630,6 +623,8 @@ RETURN n.id, n.name, n.pagerank, n.kcore_level

Semantic nodes extracted by LLM have no structural relationships in the code graph. When derivation uses these as sources, the LLM has less context, leading to inconsistent outputs.

This observation aligns with broader challenges in neural-symbolic integration: Cai 2025 identifies "representation gaps between neural network outputs and structured symbolic representations" as a fundamental challenge, particularly for complex relational reasoning. The graph-based filtering approach helps bridge this gap by grounding LLM interpretation in structural context.

**Recommendation:** For element types that can use either structural or semantic sources, prefer structural sources or require minimum graph connectivity.
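
A minimal sketch of such a connectivity gate, assuming candidate nodes arrive as dicts carrying enrichment metrics; the thresholds and field names are illustrative, not the configured defaults:

```python
def filter_candidates(nodes, min_in_degree=2, min_pagerank=0.0005):
    """Keep only candidates with enough structural grounding in the code graph,
    so derivation works from well-connected nodes rather than isolated ones."""
    return [
        n for n in nodes
        if n.get("in_degree", 0) >= min_in_degree
        and n.get("pagerank", 0.0) >= min_pagerank
    ]

candidates = [
    {"id": "svc.Billing", "in_degree": 7, "pagerank": 0.012},    # structural, well connected
    {"id": "concept.Invoice", "in_degree": 0, "pagerank": 0.0},  # semantic, no code edges
]
print([n["id"] for n in filter_candidates(candidates)])  # ['svc.Billing']
```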

---
@@ -901,10 +896,11 @@ See [Graph-Based Optimization](#graph-based-optimization) for the full methodolo
4. **Add determinism instruction** - "Output stable, deterministic results" in every LLM prompt
5. **Test one config at a time** - Use `--nocache-configs` for targeted testing
6. **Examples drive consistency** - A good example JSON is more effective than verbose rules
7. **Abstraction level is key** - Use generic category names, not domain-specific names
7. **Abstraction level is key** - Use generic category names, not domain-specific names (Liang 2025: +30% improvement)
8. **Graph-based selection over name-based** - Filter by structural properties (in_degree, pagerank)
9. **Never use repository-specific rules** - All optimizations must be generic
10. **Prefer structural sources over semantic** - TypeDefinition/Method sources are more stable than BusinessConcept
11. **Consistency ≠ accuracy** - High consistency doesn't guarantee correctness; validate both independently (Raj 2025)

---

@@ -1034,10 +1030,15 @@ After Phase 4 optimizations (5 runs, mistral-devstral2, flask_invoice_generator)

| Citation | Reference | Key Contribution |
|----------|-----------|------------------|
| Arora 2016 | Arora et al., "Extracting domain models from natural-language requirements" | Industrial NLP extraction: 83-96% correctness, explicit naming rules |
| Cai 2025 | Cai et al., "Practices, opportunities and challenges in the fusion of knowledge graphs and large language models" | KG-LLM integration taxonomy (KEL/LEK/LKC), neural-symbolic representation gaps |
| Castillo 2019 | Castillo et al., "ArchiRev - Reverse engineering toward ArchiMate models" | Code-to-ArchiMate benchmark: 68% precision, 80% recall |
| Chaaben 2022 | Chaaben et al., "Towards using Few-Shot Prompt Learning for Automating Model Completion" | Few-shot prompting without fine-tuning, frequency-based ranking |
| Chaaben 2024 | Chaaben et al., "On the Utility of Domain Modeling Assistance with LLMs" | 20% time reduction, 33-56% suggestion contribution rates |
| Chen 2023 | Chen et al., "Automated Domain Modeling with LLMs: A Comparative Study" | F1 scores (0.76 classes, 0.34 relationships), chain-of-thought caution |
| Coutinho 2025 | Coutinho et al., "LLM-Based Modeling Assistance for Textual Ontology-Driven Conceptual Modeling" | Guidance texts significantly improve output quality |
| Liang 2025 | Liang et al., "Integrating Large Language Models for Automated Structural Analysis" | Domain-specific ICL achieves 100% accuracy; benchmarking methodology |
| Raj 2025 | Raj et al., "Semantic Consistency for Assuring Reliability of Large Language Models" | **Critical:** Consistency and accuracy are independent properties |
| Reitemeyer 2025 | Reitemeyer & Fill, "Applying LLMs in Knowledge Graph-based Enterprise Modeling" | LLMs show higher consistency than humans, human-in-the-loop essential |
| Wang 2025 | Wang & Wang, "Assessing Consistency and Reproducibility in LLM Outputs" | 3-5 runs optimal for consistency |
