refactor: improve assessment pipeline correctness (P0/P1/P2) by arthurpanhku · Pull Request #8 · arthurpanhku/DocSentinel

arthurpanhku · 2026-05-28T09:55:12Z

Summary

Five targeted fixes to the assessment pipeline, identified via first-principles review.

P0 — Citation traceability (`orchestrator.py`)

_reviewer_agent prompt now instructs the LLM to declare exactly which chunk_ids it used (POL-N / HIS-N) and provide verbatim quotes.
New _build_chunk_lookup + _resolve_citations_from_llm build SourceCitation objects from those declarations, replacing the prior behaviour that silently attached all passed-in KB chunks as citations regardless of whether the LLM actually consulted them.
Falls back to the legacy all-chunks path only when the reviewer returns an empty sources array.
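
A minimal sketch of the declared-citation resolution described above. Only the names `_build_chunk_lookup`, `_resolve_citations_from_llm`, and `SourceCitation` come from this PR. The `Chunk` type, the field names, the shape of the reviewer's `sources` array, and the choice to skip unknown IDs are assumptions for illustration, and the real signatures may differ.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    # Hypothetical stand-in for a KB chunk passed to the reviewer.
    chunk_id: str  # e.g. "POL-1" or "HIS-2"
    text: str


@dataclass
class SourceCitation:
    # Assumed fields; the real SourceCitation model may differ.
    chunk_id: str
    quote: str


def build_chunk_lookup(chunks):
    """Index chunks by ID so declared IDs can be validated."""
    return {c.chunk_id: c for c in chunks}


def resolve_citations(declared_sources, chunks):
    """Build citations only from chunks the reviewer declared.

    Falls back to citing every chunk when nothing was declared.
    """
    lookup = build_chunk_lookup(chunks)
    if not declared_sources:
        # Legacy path: attach all passed-in chunks.
        return [SourceCitation(c.chunk_id, c.text) for c in chunks]

    citations = []
    for src in declared_sources:
        chunk = lookup.get(src.get("chunk_id"))
        if chunk is None:
            continue  # assumption: ignore IDs that were never passed in
        citations.append(SourceCitation(chunk.chunk_id, src.get("quote", "")))
    return citations
```

In this sketch a declared ID that does not match any passed-in chunk is dropped rather than cited, so a hallucinated `POL-9` cannot surface as a citation.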

P0 — Semantic evidence extraction (`orchestrator.py`)

_evidence_agent is now async and issues an LLM call to extract verbatim evidence lines scoped to skill.risk_focus, replacing a hardcoded keyword list that missed domain-specific terms ("data residency", "non-repudiation", "privacy by design", etc.).
Old keyword scan extracted to _evidence_agent_keyword_fallback and retained as a fallback when the LLM call fails.
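
The LLM-first, keyword-fallback flow might look roughly like the sketch below. The function names, the injected `llm_call` coroutine, and the verbatim check on returned lines are assumptions for illustration, not the code in `orchestrator.py`.

```python
import asyncio


def keyword_fallback(text, keywords):
    """Return non-empty lines containing any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [
        line.strip()
        for line in text.splitlines()
        if line.strip() and any(k in line.lower() for k in lowered)
    ]


async def extract_evidence(text, risk_focus, llm_call, keywords):
    """Try the LLM extractor first; use the keyword scan if it raises."""
    try:
        lines = await llm_call(text, risk_focus)
    except Exception:
        return keyword_fallback(text, keywords)
    # Assumption: keep only lines that appear verbatim in the document.
    return [ln for ln in lines if ln in text]


if __name__ == "__main__":
    async def failing_llm(text, risk_focus):
        # Hypothetical stub that simulates an LLM outage.
        raise RuntimeError("LLM unavailable")

    doc = "Data is stored in the EU.\nPasswords are hashed with bcrypt."
    print(asyncio.run(extract_evidence(doc, ["auth"], failing_llm, ["password"])))
```

Passing the LLM call in as a parameter keeps the fallback path easy to exercise in unit tests without a live model.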

P1 — Semantic KB query seed (`orchestrator.py`)

New _extract_query_seed scores document paragraphs by skill.risk_focus term frequency and greedily packs the highest-scoring ones into the KB query, replacing combined_input[:2000], which returned context from the document header regardless of relevance.
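
One way to read "score by term frequency, then greedily pack" is sketched below. The paragraph delimiter (blank lines), the substring-count scoring, the 2000-character budget, and the head-truncation fallback when nothing matches are all assumptions; the PR does not spell them out.

```python
def extract_query_seed(text, risk_focus, max_chars=2000):
    """Pick the paragraphs richest in risk-focus terms, within a char budget."""
    terms = [t.lower() for t in risk_focus]
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    scored = []
    for index, para in enumerate(paragraphs):
        lowered = para.lower()
        score = sum(lowered.count(term) for term in terms)
        scored.append((score, index, para))

    # Highest score first; earlier paragraphs win ties.
    scored.sort(key=lambda item: (-item[0], item[1]))

    picked, used = [], 0
    for score, index, para in scored:
        if score == 0:
            break  # remaining paragraphs mention no focus term
        cost = len(para) + (2 if picked else 0)  # account for the "\n\n" joiner
        if used + cost > max_chars:
            continue  # does not fit; a shorter paragraph still might
        picked.append((index, para))
        used += cost

    if not picked:
        # Assumed fallback: behave like the old head-truncation.
        return text[:max_chars]

    # Restore document order so the seed reads naturally.
    return "\n\n".join(para for _, para in sorted(picked))
```

Unlike `combined_input[:2000]`, a document that opens with boilerplate contributes nothing to the seed unless that boilerplate actually mentions a focus term.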

P1 — Large-document map-reduce (`orchestrator.py`)

Documents longer than 12 000 chars are split into overlapping 10 000-char sections via _split_text_with_overlap.
_draft_large_document runs _drafter_agent in parallel across sections and consolidates with a new _merge_drafts (MergeAgent) call, preventing silent truncation of appendices, exception clauses, and definitions that often appear at the end of security documents.
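
The split-then-merge flow can be sketched as follows. The 12 000 and 10 000 character figures come from this PR; the 500-character overlap, the function signatures, and the injected `draft_section` and `merge_drafts` coroutines are assumptions standing in for `_drafter_agent` and `_merge_drafts`.

```python
import asyncio


def split_text_with_overlap(text, size=10_000, overlap=500):
    """Split text into windows of `size` chars that share `overlap` chars."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    sections = []
    start = 0
    while start < len(text):
        sections.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return sections


async def draft_large_document(text, draft_section, merge_drafts, threshold=12_000):
    """Draft short documents directly; map-reduce longer ones."""
    if len(text) <= threshold:
        return await draft_section(text)
    sections = split_text_with_overlap(text)
    # Map: draft every section concurrently.
    drafts = await asyncio.gather(*(draft_section(s) for s in sections))
    # Reduce: consolidate the per-section drafts into one result.
    return await merge_drafts(list(drafts))
```

The overlap exists so that a clause straddling a section boundary appears whole in at least one section; the merge step is then responsible for deduplicating findings that show up in two adjacent drafts.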

P2 — History storage granularity (`kb/service.py`)

add_history_response now stores each RiskItem and ComplianceGap as a separate Document in the history vector store, replacing character-chunked JSON blobs that split objects mid-structure and made similarity retrieval semantically meaningless.
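
A rough illustration of per-item storage, assuming risk items and compliance gaps arrive as plain dicts. The `Document` dataclass below is a hypothetical stand-in for the vector store's document type, and the metadata keys are invented for the example.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Document:
    # Hypothetical stand-in for the vector store's document type.
    page_content: str
    metadata: dict = field(default_factory=dict)


def history_documents(report_id, risk_items, compliance_gaps):
    """Create one Document per RiskItem / ComplianceGap (given as dicts)."""
    docs = []
    groups = (("risk_item", risk_items), ("compliance_gap", compliance_gaps))
    for kind, items in groups:
        for index, item in enumerate(items):
            docs.append(
                Document(
                    # Each object is serialized whole, never split mid-structure.
                    page_content=json.dumps(item, sort_keys=True),
                    metadata={"report_id": report_id, "kind": kind, "index": index},
                )
            )
    return docs
```

Because every stored document is one complete object, a similarity hit returns a whole risk or gap rather than a fragment of JSON that happens to share characters with the query.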

Test plan

pytest tests/test_orchestrator.py tests/test_kb_history.py — 27 new/updated unit tests, all green locally (Python 3.11)
Existing test test_orchestrator_uses_skill_prompt still passes (no breaking changes to public API)
Manual: run a STRIDE assessment on a multi-page PDF and verify citations map to real KB chunks
Manual: run on a document > 12 000 chars and confirm MergeAgent fires in logs

🤖 Generated with Claude Code

P0 – Citation traceability: the reviewer now declares which chunk IDs it actually used; _resolve_citations_from_llm builds SourceCitation objects from those declarations, eliminating phantom citations that referenced chunks the LLM never consulted. Falls back to the old all-chunks path only when the reviewer returns no sources.

P0 – Semantic evidence extraction: _evidence_agent is now async and issues an LLM call to extract verbatim evidence lines scoped to skill.risk_focus, replacing the hardcoded keyword list that missed domain-specific terms (e.g. "data residency", "non-repudiation"). The old keyword scan is kept as a fallback for LLM failure.

P1 – Semantic query seed: _extract_query_seed scores document paragraphs by skill focus term frequency and packs the highest-scoring ones into the KB query, replacing a naive head-truncation that returned the wrong context for most security documents.

P1 – Large-document map-reduce: documents longer than 12 000 chars are split into 10 000-char overlapping sections; _draft_large_document runs drafter agents in parallel and consolidates with a MergeAgent, preventing silent truncation of appendices and exception clauses.

P2 – History storage granularity: add_history_response now stores each RiskItem and ComplianceGap as a separate vector document instead of character-chunking the whole report JSON, making history retrieval semantically meaningful.

Adds 27 unit tests covering these improvements.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

arthurpanhku merged commit e1dee20 into main May 28, 2026
0 of 2 checks passed

arthurpanhku mentioned this pull request Jun 4, 2026

fix: guard docs static mount and improve build correctness #11

Merged

3 tasks

arthurpanhku merged 1 commit into main from feat/pipeline-correctness
