refactor: improve assessment pipeline correctness (P0/P1/P2) #8

Merged
arthurpanhku merged 1 commit into main from feat/pipeline-correctness on May 28, 2026

@arthurpanhku
Owner

Summary

Five targeted fixes to the assessment pipeline, identified via first-principles review.

P0 — Citation traceability (orchestrator.py)

  • _reviewer_agent prompt now instructs the LLM to declare exactly which chunk_ids it used (POL-N / HIS-N) and provide verbatim quotes.
  • New _build_chunk_lookup + _resolve_citations_from_llm build SourceCitation objects from those declarations, replacing the prior behaviour that silently attached all passed-in KB chunks as citations regardless of whether the LLM actually consulted them.
  • Falls back to the legacy all-chunks path only when the reviewer returns an empty sources array.
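
The resolution step described above can be sketched roughly as follows. This is a minimal illustration, not the code in orchestrator.py: the `SourceCitation` fields, the 1-based `POL-N` / `HIS-N` numbering, and the `chunk_id` / `quote` keys of each declared source are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SourceCitation:
    # Hypothetical shape; the real SourceCitation model is not shown in this PR.
    chunk_id: str
    quote: str
    text: str


def build_chunk_lookup(policy_chunks, history_chunks):
    """Map POL-N / HIS-N ids (assumed 1-based) to the chunk text sent to the reviewer."""
    lookup = {}
    for i, text in enumerate(policy_chunks, start=1):
        lookup[f"POL-{i}"] = text
    for i, text in enumerate(history_chunks, start=1):
        lookup[f"HIS-{i}"] = text
    return lookup


def resolve_citations(declared_sources, lookup):
    """Build citations only for chunk ids the reviewer declared.

    Ids that are not in the lookup are dropped. An empty declaration
    triggers the legacy behaviour of citing every chunk.
    """
    if not declared_sources:
        return [SourceCitation(cid, "", text) for cid, text in lookup.items()]
    citations = []
    for src in declared_sources:
        cid = src.get("chunk_id", "")
        if cid in lookup:
            citations.append(SourceCitation(cid, src.get("quote", ""), lookup[cid]))
    return citations
```

In this sketch a reviewer that declares only unknown ids yields no citations at all, since the fallback is reserved for an empty sources array.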

P0 — Semantic evidence extraction (orchestrator.py)

  • _evidence_agent is now async and issues an LLM call to extract verbatim evidence lines scoped to skill.risk_focus, replacing a hardcoded keyword list that missed domain-specific terms ("data residency", "non-repudiation", "privacy by design", etc.).
  • Old keyword scan extracted to _evidence_agent_keyword_fallback and retained as a fallback when the LLM call fails.
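
A rough sketch of the LLM-first, keyword-fallback pattern described above. The `llm_call` parameter, the prompt wording, the keyword tuple, and the verbatim filter are all assumptions for illustration; the real `_evidence_agent` may differ on each point.

```python
import asyncio

# Illustrative keyword list only; the actual fallback list is not shown in this PR.
FALLBACK_KEYWORDS = ("encrypt", "authentication", "password")


def keyword_fallback(text):
    """Return the lines that contain any of the hardcoded keywords."""
    return [
        line.strip()
        for line in text.splitlines()
        if any(keyword in line.lower() for keyword in FALLBACK_KEYWORDS)
    ]


async def extract_evidence(text, risk_focus, llm_call):
    """Ask the LLM for evidence lines; use the keyword scan if the call fails.

    llm_call is a hypothetical async callable: prompt -> list of strings.
    """
    prompt = (
        "Extract verbatim lines relevant to: "
        + ", ".join(risk_focus)
        + "\n\n"
        + text
    )
    try:
        lines = await llm_call(prompt)
    except Exception:
        return keyword_fallback(text)
    # Assumed extra guard: keep only lines that really occur in the document.
    return [line.strip() for line in lines if line.strip() and line.strip() in text]
```

Because the function is async, callers have to await it (or wrap it in `asyncio.run` in a script or test).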

P1 — Semantic KB query seed (orchestrator.py)

  • New _extract_query_seed scores document paragraphs by skill.risk_focus term frequency and greedily packs the highest-scoring ones into the KB query, replacing combined_input[:2000] which returned context from the document header regardless of relevance.
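
One way to read the scoring and greedy packing described above is sketched below. The blank-line paragraph split, the 2000-character budget (borrowed from the old `combined_input[:2000]`), and the head-truncation fallback when no paragraph matches are assumptions, not details confirmed by this PR.

```python
import re


def extract_query_seed(text, risk_focus, max_chars=2000):
    """Pick the paragraphs that mention risk_focus terms most often."""
    terms = [term.lower() for term in risk_focus]
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    scored = []
    for index, paragraph in enumerate(paragraphs):
        lowered = paragraph.lower()
        score = sum(lowered.count(term) for term in terms)
        scored.append((score, index, paragraph))

    # Highest score first; ties keep document order.
    scored.sort(key=lambda item: (-item[0], item[1]))

    picked, used = [], 0
    for score, index, paragraph in scored:
        if score == 0:
            break  # remaining paragraphs mention no focus term
        cost = len(paragraph) + (2 if picked else 0)  # 2 chars for the separator
        if used + cost > max_chars:
            continue  # too big for the remaining budget; try a smaller one
        picked.append((index, paragraph))
        used += cost

    if not picked:
        return text[:max_chars]  # assumed fallback: old head-truncation
    picked.sort()  # restore document order for readability
    return "\n\n".join(paragraph for _, paragraph in picked)
```

Sorting the chosen paragraphs back into document order is a design choice of this sketch; it keeps the query readable but is not required for retrieval.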

P1 — Large-document map-reduce (orchestrator.py)

  • Documents longer than 12 000 chars are split into overlapping 10 000-char sections via _split_text_with_overlap.
  • _draft_large_document runs _drafter_agent in parallel across sections and consolidates with a new _merge_drafts (MergeAgent) call, preventing silent truncation of appendices, exception clauses, and definitions that often appear at the end of security documents.
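
The split-then-merge flow above might look roughly like this. The 12 000 and 10 000 character figures come from the description; the 500-character overlap and the `drafter` / `merger` callables are assumptions standing in for `_drafter_agent` and `_merge_drafts`, whose real signatures are not shown here.

```python
import asyncio


def split_text_with_overlap(text, section_size=10_000, overlap=500):
    """Cut text into sections where each one repeats the tail of the previous."""
    if not 0 <= overlap < section_size:
        raise ValueError("overlap must be smaller than section_size")
    sections = []
    step = section_size - overlap
    start = 0
    while start < len(text):
        sections.append(text[start:start + section_size])
        if start + section_size >= len(text):
            break  # this section already reaches the end of the document
        start += step
    return sections


async def draft_large_document(text, drafter, merger, threshold=12_000):
    """Map: draft every section concurrently. Reduce: merge the drafts.

    drafter and merger are hypothetical async callables.
    """
    if len(text) <= threshold:
        return await drafter(text)
    sections = split_text_with_overlap(text)
    drafts = await asyncio.gather(*(drafter(section) for section in sections))
    return await merger(list(drafts))
```

`asyncio.gather` returns results in the order the sections were submitted, so the merger sees drafts in document order even though they run concurrently.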

P2 — History storage granularity (kb/service.py)

  • add_history_response now stores each RiskItem and ComplianceGap as a separate Document in the history vector store, replacing character-chunked JSON blobs that split objects mid-structure and made similarity retrieval semantically meaningless.
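
A small sketch of the per-item storage idea. The `Document` dataclass is a stand-in for whatever document type the history vector store uses, and the `risk_items` / `compliance_gaps` keys and metadata fields are hypothetical names chosen for the example.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Document:
    # Stand-in for the vector store's document type (assumption).
    page_content: str
    metadata: dict = field(default_factory=dict)


def history_documents(report, report_id):
    """Turn one report into one Document per RiskItem and ComplianceGap."""
    docs = []
    groups = (("risk_item", "risk_items"), ("compliance_gap", "compliance_gaps"))
    for kind, key in groups:
        for index, item in enumerate(report.get(key, [])):
            docs.append(
                Document(
                    # Each item is serialised whole, so no object is split mid-structure.
                    page_content=json.dumps(item, sort_keys=True),
                    metadata={"report_id": report_id, "kind": kind, "index": index},
                )
            )
    return docs
```

Keeping the report id and item kind in metadata is one way to trace a retrieved item back to its originating report.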

Test plan

  • pytest tests/test_orchestrator.py tests/test_kb_history.py — 27 new/updated unit tests, all green locally (Python 3.11)
  • Existing test test_orchestrator_uses_skill_prompt still passes (no breaking changes to public API)
  • Manual: run a STRIDE assessment on a multi-page PDF and verify citations map to real KB chunks
  • Manual: run on a document > 12 000 chars and confirm MergeAgent fires in logs

🤖 Generated with Claude Code

P0 – Citation traceability: reviewer now declares which chunk IDs it
actually used; _resolve_citations_from_llm builds SourceCitation objects
from those declarations, eliminating phantom citations that referenced
chunks the LLM never consulted.  Falls back to the old all-chunks path
only when the reviewer returns no sources.

P0 – Semantic evidence extraction: _evidence_agent is now async and
issues an LLM call to extract verbatim evidence lines scoped to
skill.risk_focus, replacing the hardcoded keyword list that missed
domain-specific terms (e.g. "data residency", "non-repudiation").
The old keyword scan is kept as a fallback for LLM failure.

P1 – Semantic query seed: _extract_query_seed scores document paragraphs
by skill focus term frequency and packs the highest-scoring ones into
the KB query, replacing a naive head-truncation that returned the wrong
context for most security documents.

P1 – Large-document map-reduce: documents longer than 12 000 chars are
split into 10 000-char overlapping sections; _draft_large_document runs
drafter agents in parallel and consolidates with a MergeAgent, preventing
silent truncation of appendices and exception clauses.

P2 – History storage granularity: add_history_response now stores each
RiskItem and ComplianceGap as a separate vector document instead of
character-chunking the whole report JSON, making history retrieval
semantically meaningful.

Adds 27 unit tests covering all five improvements.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@arthurpanhku merged commit e1dee20 into main on May 28, 2026
0 of 2 checks passed