refactor: improve assessment pipeline correctness (P0/P1/P2) #8
Merged
P0 – Citation traceability: reviewer now declares which chunk IDs it actually used; _resolve_citations_from_llm builds SourceCitation objects from those declarations, eliminating phantom citations that referenced chunks the LLM never consulted. Falls back to the old all-chunks path only when the reviewer returns no sources.

P0 – Semantic evidence extraction: _evidence_agent is now async and issues an LLM call to extract verbatim evidence lines scoped to skill.risk_focus, replacing the hardcoded keyword list that missed domain-specific terms (e.g. "data residency", "non-repudiation"). The old keyword scan is kept as a fallback for LLM failure.

P1 – Semantic query seed: _extract_query_seed scores document paragraphs by skill focus term frequency and packs the highest-scoring ones into the KB query, replacing a naive head-truncation that returned the wrong context for most security documents.

P1 – Large-document map-reduce: documents longer than 12 000 chars are split into 10 000-char overlapping sections; _draft_large_document runs drafter agents in parallel and consolidates with a MergeAgent, preventing silent truncation of appendices and exception clauses.

P2 – History storage granularity: add_history_response now stores each RiskItem and ComplianceGap as a separate vector document instead of character-chunking the whole report JSON, making history retrieval semantically meaningful.

Adds 27 unit tests covering all four improvements.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
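For orientation, the overlapping split mentioned in the commit message could look roughly like the sketch below. It is an illustration only, not the code in `orchestrator.py`: the function name `split_with_overlap`, the 500-character overlap, and the boundary handling are assumptions. Only the 12 000-char threshold and the 10 000-char section size come from the message above.

```python
from typing import List

LARGE_DOC_THRESHOLD = 12_000  # from the commit message
SECTION_SIZE = 10_000         # from the commit message
OVERLAP = 500                 # assumed; the PR does not state the overlap size


def split_with_overlap(text: str, size: int = SECTION_SIZE,
                       overlap: int = OVERLAP) -> List[str]:
    """Cut text into windows of at most `size` chars.

    Each window after the first starts `overlap` chars before the end of
    the previous one, so a clause that straddles a boundary appears whole
    in at least one section (as long as it is shorter than `overlap`).
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    if len(text) <= size:
        return [text]

    step = size - overlap
    sections: List[str] = []
    start = 0
    while start < len(text):
        sections.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the document
        start += step
    return sections


def needs_map_reduce(text: str) -> bool:
    """Hypothetical guard: only documents above the threshold are split."""
    return len(text) > LARGE_DOC_THRESHOLD
```

With these assumed values, a 25 000-char document yields three sections starting at offsets 0, 9 500, and 19 000. The last section is shorter than the others and still reaches the end of the text, so trailing appendices are not dropped.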
Summary
Four targeted fixes to the assessment pipeline, identified via first-principles review.
P0 – Citation traceability (`orchestrator.py`)

- The `_reviewer_agent` prompt now instructs the LLM to declare exactly which `chunk_id`s it used (`POL-N` / `HIS-N`) and to provide verbatim quotes.
- `_build_chunk_lookup` + `_resolve_citations_from_llm` build `SourceCitation` objects from those declarations, replacing the prior behaviour that silently attached all passed-in KB chunks as citations regardless of whether the LLM actually consulted them.
- Falls back to the old all-chunks path only when the reviewer returns no `sources` array.

P0 – Semantic evidence extraction (`orchestrator.py`)

- `_evidence_agent` is now `async` and issues an LLM call to extract verbatim evidence lines scoped to `skill.risk_focus`, replacing a hardcoded keyword list that missed domain-specific terms ("data residency", "non-repudiation", "privacy by design", etc.).
- The old keyword scan is kept as `_evidence_agent_keyword_fallback` and retained as a fallback when the LLM call fails.

P1 – Semantic KB query seed (`orchestrator.py`)

- `_extract_query_seed` scores document paragraphs by `skill.risk_focus` term frequency and greedily packs the highest-scoring ones into the KB query, replacing `combined_input[:2000]`, which returned context from the document header regardless of relevance.

P1 – Large-document map-reduce (`orchestrator.py`)

- Documents longer than 12 000 chars are split into 10 000-char overlapping sections by `_split_text_with_overlap`.
- `_draft_large_document` runs `_drafter_agent` in parallel across sections and consolidates with a new `_merge_drafts` (MergeAgent) call, preventing silent truncation of appendices, exception clauses, and definitions that often appear at the end of security documents.

P2 – History storage granularity (`kb/service.py`)

- `add_history_response` now stores each `RiskItem` and `ComplianceGap` as a separate `Document` in the history vector store, replacing character-chunked JSON blobs that split objects mid-structure and made similarity retrieval semantically meaningless.

Test plan
- `pytest tests/test_orchestrator.py tests/test_kb_history.py`: 27 new/updated unit tests, all green locally (Python 3.11)
- `test_orchestrator_uses_skill_prompt` still passes (no breaking changes to public API)

🤖 Generated with Claude Code
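For reviewers who want a mental model of the P0 citation change, the resolution step might look something like the sketch below. It is not the code in this PR: `Chunk`, `Citation`, `build_chunk_lookup`, and `resolve_citations` are hypothetical stand-ins for the real types and helpers. The shape of the reviewer's `sources` entries (`chunk_id` plus `quote`) and the choice to drop unknown IDs are also assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for a KB chunk passed to the reviewer."""
    chunk_id: str  # e.g. "POL-1" (policy) or "HIS-2" (history)
    text: str


@dataclass(frozen=True)
class Citation:
    """Hypothetical stand-in for SourceCitation."""
    chunk_id: str
    quote: str


def build_chunk_lookup(chunks: Iterable[Chunk]) -> Dict[str, Chunk]:
    """Index chunks by ID so declared sources can be checked quickly."""
    return {chunk.chunk_id: chunk for chunk in chunks}


def resolve_citations(declared: List[dict],
                      lookup: Dict[str, Chunk]) -> List[Citation]:
    """Build citations only for chunks the reviewer says it used."""
    if not declared:
        # Fallback described in the PR: no declared sources, so cite
        # every chunk that was passed in (the old behaviour).
        return [Citation(chunk.chunk_id, "") for chunk in lookup.values()]

    citations: List[Citation] = []
    for entry in declared:
        chunk = lookup.get(entry.get("chunk_id", ""))
        if chunk is None:
            # Assumed policy: an ID that was never supplied to the
            # reviewer is dropped rather than trusted.
            continue
        citations.append(Citation(chunk.chunk_id, entry.get("quote", "")))
    return citations
```

In this sketch a declared ID such as `POL-9` that was never supplied is discarded, so only chunks the reviewer both received and declared end up as citations.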