Problem
The pipeline's 12 LLM-extracted entity types (Concept, Challenge, Artifact, etc.) do not include a description property in the extraction schema (extraction/schema.py). As a result:
- No entities have descriptions: the LLMEntityRelExtractor only extracts properties defined in the schema. Without a description property, the LLM extracts name, display_name, and type-specific fields but never a description.
- EntitySummarizer is a no-op: the summarizer (postprocessing/entity_summarizer.py) consolidates fragmented descriptions into coherent summaries, but finds zero entities to process because the description property doesn't exist in the database.
- Community summaries are shallower: CommunitySummarizer attempts to read n.description for richer community summaries but falls back to names and labels only, producing less informative results.
- RAG retrieval misses semantic context: entity matching currently relies on name only. A description like "the practice of linking requirements to downstream artifacts to ensure completeness and enable impact analysis" would give retrievers far richer semantic context.
Evidence from staging pipeline run (2026-02-24)
WARNING:neo4j.notifications: The property `description` does not exist in database `neo4j`.
2026-02-24 16:29:46 [info] No entities with fragmented descriptions found
Summarized 0 entities
The Neo4j warning confirms that no entity in the graph has a description property. The EntitySummarizer correctly returns immediately with zero work.
Verification query:
MATCH (n:__Entity__) WHERE n.description IS NOT NULL RETURN count(n)
// Returns 0
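Once the schema change lands, the same check can be broken down per entity type. The following is a minimal sketch, not part of the pipeline: COVERAGE_QUERY and coverage_report are hypothetical names, and the rows are assumed to arrive as dicts with label, total, and described keys from whatever Neo4j client the pipeline already uses.

```python
# Hypothetical per-label variant of the verification query above.
# count(n.description) only counts non-null values, so it gives the
# number of entities that actually carry a description.
COVERAGE_QUERY = """
MATCH (n:__Entity__)
UNWIND labels(n) AS label
WITH label, n WHERE NOT label STARTS WITH '__'
RETURN label, count(n) AS total, count(n.description) AS described
ORDER BY label
"""


def coverage_report(rows):
    """Map each label to the percentage of its entities that have a description."""
    report = {}
    for row in rows:
        total = row["total"]
        percent = 0.0 if total == 0 else round(100.0 * row["described"] / total, 1)
        report[row["label"]] = percent
    return report


# Hand-written rows standing in for query results (assumed shape, invented values).
SAMPLE_ROWS = [
    {"label": "Concept", "total": 4, "described": 1},
    {"label": "Tool", "total": 2, "described": 0},
]
SAMPLE_REPORT = coverage_report(SAMPLE_ROWS)
```

On the current staging graph every label would report 0.0, matching the count of 0 above.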
Root Cause
In src/graphrag_kg_pipeline/extraction/schema.py, the NODE_TYPES dict defines properties for each entity type. None of the 12 types include a description property:
- Concept: name, display_name, definition, aliases
- Challenge: name, display_name, severity
- Artifact: name, display_name, artifact_type
- Bestpractice: name, display_name, rationale
- Processstage: name, display_name, sequence
- Role: name, display_name, responsibilities
- Standard: name, display_name, organization, domain
- Tool: name, display_name, vendor, tool_type
- Methodology: name, display_name, approach
- Industry: name, display_name, regulated
- Organization: name, display_name, organization_type, domain
- Outcome: name, display_name, outcome_type
The SimpleKGPipeline LLM extraction prompt only asks the LLM to extract properties listed in the schema, so descriptions are never produced.
Proposed Solution
Add a description property to all 12 entity types in NODE_TYPES:
"description": {
    "type": "STRING",
    "required": False,
    "description": "One-sentence description of this entity in the context of requirements management",
},
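Rather than pasting the fragment twelve times, the property could be applied in one pass. This is a sketch under a stated assumption: it supposes NODE_TYPES maps each type name to a dict with a "properties" sub-dict keyed by property name, which may not match the real layout in extraction/schema.py. The helper with_description and the two-type excerpt are hypothetical.

```python
import copy

DESCRIPTION_PROPERTY = {
    "type": "STRING",
    "required": False,
    "description": "One-sentence description of this entity in the context of requirements management",
}


def with_description(node_types):
    """Return a deep copy of node_types in which every type has a description property."""
    updated = copy.deepcopy(node_types)
    for spec in updated.values():
        properties = spec.setdefault("properties", {})
        # setdefault keeps any description definition that already exists.
        properties.setdefault("description", dict(DESCRIPTION_PROPERTY))
    return updated


# Hypothetical excerpt; the real NODE_TYPES has 12 types and may be shaped differently.
NODE_TYPES = {
    "Concept": {"properties": {"name": {"type": "STRING", "required": True}}},
    "Tool": {"properties": {"name": {"type": "STRING", "required": True}}},
}
UPDATED_NODE_TYPES = with_description(NODE_TYPES)
```

Returning a copy leaves the original dict untouched, which makes it easy to compare extraction runs with and without the new property.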
Downstream effects (already implemented, will activate automatically)
- EntitySummarizer: will find entities with multi-fragment descriptions (>200 chars from multi-chunk extraction) and consolidate them via LLM into clean 1-3 sentence summaries. Currently implemented and tested but has zero work to do.
- CommunitySummarizer: already reads n.description in its community member query. Will produce richer community summaries without code changes.
- API repo retrieval: text2cypher.py and entity search will benefit from richer entity context. No changes needed in the API repo.
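The two summarizer behaviours described above can be pictured roughly as follows. This is an illustrative sketch only, not the code in entity_summarizer.py or the CommunitySummarizer: the function names, the record shape, and the sample entities are all hypothetical, and only the 200-character threshold and the name-and-labels fallback come from the text.

```python
FRAGMENT_THRESHOLD = 200  # characters; the ">200 chars" figure quoted above


def needs_consolidation(entity, threshold=FRAGMENT_THRESHOLD):
    """True when a description is long enough to look like merged fragments."""
    description = entity.get("description")
    return isinstance(description, str) and len(description) > threshold


def member_line(entity):
    """Format one community member, falling back to name and labels only."""
    labels = "/".join(entity.get("labels", []))
    base = f"{entity['name']} ({labels})"
    description = entity.get("description")
    return f"{base}: {description}" if description else base


# Hypothetical records standing in for rows read from the graph.
ENTITIES = [
    {"name": "traceability", "labels": ["Concept"], "description": "x" * 201},
    {"name": "review board", "labels": ["Role"], "description": "Approves changes."},
    {"name": "auditor", "labels": ["Role"]},
]
TO_SUMMARIZE = [e["name"] for e in ENTITIES if needs_consolidation(e)]
```

With no description property in the graph, needs_consolidation is false for every entity and member_line always takes the fallback branch, which is the state the staging run showed.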
Cost and runtime impact
- Extraction: ~10-20% more tokens per article (LLM must extract an additional property per entity). Estimated additional cost: ~$1-2 on a full pipeline run.
- EntitySummarizer: Will now make LLM calls for entities with fragmented descriptions. Estimated: ~$0.50-1.00 (gpt-4o, ~100 entities with fragments).
- Total additional cost per full run: ~$1.50-3.00 (on top of existing ~$9-17).
- No additional runtime for community embeddings or vector indexes — descriptions don't affect those.
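As a quick check on the arithmetic, the ranges above can be summed end to end. The sketch below uses only the dollar estimates quoted in this section; add_ranges is a hypothetical helper, and the resulting new total per run is derived from those estimates rather than measured.

```python
def add_ranges(*ranges):
    """Sum (low, high) cost ranges component-wise."""
    low = sum(r[0] for r in ranges)
    high = sum(r[1] for r in ranges)
    return (low, high)


EXTRACTION = (1.00, 2.00)   # extra extraction tokens per full run, USD
SUMMARIZER = (0.50, 1.00)   # EntitySummarizer LLM calls, USD
BASELINE = (9.00, 17.00)    # existing cost per full run, USD

ADDITIONAL = add_ranges(EXTRACTION, SUMMARIZER)
NEW_TOTAL = add_ranges(BASELINE, ADDITIONAL)
```

ADDITIONAL comes out at (1.5, 3.0), matching the stated total, which puts a full run at roughly $10.50-20.00 under these estimates.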
Implementation steps
- Add description property to all 12 entity types in extraction/schema.py
- Verify EntitySummarizer activates on a staging run (should find entities with >200 char descriptions)
- Compare community summary quality with/without entity descriptions
- Update tests if schema property counts change in assertions
- Run full staging pipeline to validate end-to-end
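For the test update step, one option is a regression check that names the offending types instead of asserting a hard-coded property count. This is a sketch only: types_missing_description is a hypothetical helper, and it assumes each NODE_TYPES entry holds its properties in a "properties" dict keyed by property name.

```python
def types_missing_description(node_types):
    """Return the sorted names of entity types that lack a description property."""
    return sorted(
        name
        for name, spec in node_types.items()
        if "description" not in spec.get("properties", {})
    )


# Hypothetical excerpt: one type already updated, one not.
EXAMPLE_TYPES = {
    "Concept": {"properties": {"name": {}, "description": {}}},
    "Tool": {"properties": {"name": {}, "vendor": {}}},
}
MISSING = types_missing_description(EXAMPLE_TYPES)
```

A schema test could then assert that the result is an empty list for the real NODE_TYPES, so a type added later without a description fails with its name in the message.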
What NOT to change
- EntitySummarizer code: already correctly implemented, just needs data to work with
- CommunitySummarizer code: already reads descriptions with graceful fallback
- Extraction prompts: SimpleKGPipeline auto-generates prompts from the schema; adding the property is sufficient
- Validation queries: no description-related checks currently exist
Context
- Concept has a definition property (specific to Concept), but description is a general-purpose field for all entity types
- The definition property on Concept serves a different purpose: it captures formal definitions from the glossary, not contextual descriptions from extraction
- This was identified during the first staging pipeline run against graphrag-api-db-stage (local Neo4j Desktop instance)
Labels
Enhancement, Pipeline