From 3834d346812ac0430e597a0cad3eb8f0a7285fc1 Mon Sep 17 00:00:00 2001 From: Claude Date: Mon, 25 May 2026 13:33:06 +0000 Subject: [PATCH 1/2] docs: add Neo4j SE interview prep guide Stage-by-stage talking points, Neo4j product positioning, and a Q&A cheat sheet grounded in the pipeline source. https://claude.ai/code/session_01BNW3MtX6cCufJN2esFDoFi --- NEO4J_INTERVIEW_PREP.md | 173 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 173 insertions(+) create mode 100644 NEO4J_INTERVIEW_PREP.md diff --git a/NEO4J_INTERVIEW_PREP.md b/NEO4J_INTERVIEW_PREP.md new file mode 100644 index 0000000..5876987 --- /dev/null +++ b/NEO4J_INTERVIEW_PREP.md @@ -0,0 +1,173 @@ +# Neo4j Sales Engineer — Technical Evaluation Prep + +**Project:** `graphrag-api-db` — an end-to-end GraphRAG knowledge-graph pipeline built on Neo4j and `neo4j_graphrag`. + +**How to use this doc:** The first section is your 5–6 bullet **elevator overview** — memorize the one-liners. Everything after is the **drill-down** for each point, following the pipeline stage by stage, with talking points, the Neo4j-product positioning, the deep technical detail you can pull from if pressed, and the likely interviewer questions. Lean sales-positioning, but every claim is anchored in real code (file:line references included so you can re-read the source). + +> **Framing note:** This repo is the *ingestion / graph-construction* side. It builds the Neo4j graph and the vector/community indexes that **power** GraphRAG retrieval. A separate repo holds the Retrieval & Chat UI. When the interviewer asks "so how do you query it?", that's your cue to bridge to the retrieval repo — talking points for that bridge are in the final section. + +--- + +## 1. Elevator Overview (the 5–6 bullets) + +**1. 
End-to-end GraphRAG pipeline on Neo4j, built on `neo4j_graphrag`.** +I built a complete ETL-to-graph pipeline that ingests an unstructured technical guide (~101 articles + glossary) and turns it into a queryable Neo4j knowledge graph using Neo4j's first-party `neo4j_graphrag` library and `SimpleKGPipeline`. It demonstrates the full GraphRAG value proposition — vector search *plus* graph structure — on Neo4j-native tooling rather than a bolted-on stack. + +**2. Schema-constrained LLM entity extraction (12 node types, 14 relationship types).** +Rather than letting the LLM emit an unconstrained graph, I defined a domain schema with ~50 validated (source, relationship, target) patterns. This prevents schema drift, keeps the graph queryable, and is exactly the governed-extraction story Neo4j positions against naive RAG. + +**3. Hybrid retrieval foundation: vector embeddings + graph traversal + community summaries.** +The graph supports semantic retrieval via Neo4j native vector indexes (Voyage AI voyage-4, 1024-dim, OpenAI fallback) layered on explicit relationships and document/chunk structure. I added Leiden community detection with LLM-generated, separately-embedded community summaries — the Microsoft-GraphRAG global-vs-local retrieval pattern, running natively against Neo4j. + +**4. Production-grade data quality and entity resolution.** +An 11-step, 3-phase post-processing layer handles normalization, same-name and cross-label deduplication, an industry taxonomy collapsing 100+ variants into ~18 canonical nodes, and a validation framework with dry-run repair. Reflects the real-world truth that GraphRAG quality lives or dies on entity resolution, not extraction. + +**5. 
Deep Neo4j platform fluency (constraints, APOC, vector indexes, graph analytics).** +The project navigates non-obvious Neo4j behaviors — why entity-type uniqueness constraints cause silent batch rollbacks, the `__Entity__`/`__KGBuilder__` labeling entity resolution depends on, APOC-driven label operations, and preflight checks for connectivity/APOC/vector-index dimensions. Hands-on platform knowledge a customer will probe. + +**6. Engineered like a product, with a safe staging/production workflow.** +Clean architecture (Protocol-based fetcher abstraction, dependency injection), strict tooling (Ruff, type-checking, pytest, CI/Codecov), cost-estimation dry-runs, and an environment-driven staging-vs-production switch with confirmation prompts before destructive operations against production Aura. Shows I can operationalize GraphRAG, not just prototype it. + +--- + +## 2. Drill-down — Stage 1 & 2: Ingestion → Neo4j + +**Arc to land:** *"I take unstructured web content and end up with a Neo4j graph queryable by both vectors and relationships — using Neo4j's own GraphRAG library."* + +### Stage 1 — Scrape +**One-liner:** "I scrape a full technical guide — ~101 articles plus a glossary — dynamically discovering structure from the site's TOC, and normalize everything to Markdown before it touches Neo4j." + +Talking points: +- **Dynamic discovery, not hard-coded** — chapters/articles come from the live `#chapter-menu` TOC in a single request, so the pipeline survives content changes (robust ingestion, not demo-ware). +- **Protocol-based fetcher abstraction** (httpx default; Playwright for JS-rendered content) via dependency injection — your "I architect for testability and extensibility" proof point. (`fetcher.py`, PEP 544 structural subtyping.) +- **Respectful scraping** — async with a concurrency semaphore (default 3), exponential-backoff retry, custom User-Agent (Cloudflare blocks the default httpx UA). 
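
The Stage 1 talking points (Protocol-based fetcher, concurrency semaphore, exponential-backoff retry) can be sketched roughly as below. This is a minimal illustration, not the code in `fetcher.py`: the names `Fetcher`, `FlakyFetcher`, `fetch_with_retry`, and `scrape`, the retry count, and the delay values are all assumptions made up for the example. Only the default concurrency of 3 comes from the notes above.

```python
import asyncio
from typing import Protocol


class Fetcher(Protocol):
    """Structural (PEP 544) interface: any object with a matching fetch() fits."""

    async def fetch(self, url: str) -> str: ...


class FlakyFetcher:
    """Hypothetical stand-in backend that fails the first attempt for each URL."""

    def __init__(self) -> None:
        self.attempts: dict[str, int] = {}

    async def fetch(self, url: str) -> str:
        self.attempts[url] = self.attempts.get(url, 0) + 1
        if self.attempts[url] == 1:
            raise ConnectionError(f"transient failure for {url}")
        return f"<html>{url}</html>"


async def fetch_with_retry(
    fetcher: Fetcher,
    url: str,
    semaphore: asyncio.Semaphore,
    max_attempts: int = 3,      # assumed value, not taken from the repo
    base_delay: float = 0.01,   # assumed value, kept tiny so the demo is fast
) -> str:
    """Fetch one URL under the shared semaphore, backing off exponentially."""
    for attempt in range(max_attempts):
        try:
            async with semaphore:  # caps the number of in-flight requests
                return await fetcher.fetch(url)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Sleep outside the semaphore so a waiting retry does not hold a slot.
            await asyncio.sleep(base_delay * 2**attempt)
    raise RuntimeError("max_attempts must be at least 1")


async def scrape(fetcher: Fetcher, urls: list[str], concurrency: int = 3) -> list[str]:
    """Fetch all URLs concurrently, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)
    return await asyncio.gather(*(fetch_with_retry(fetcher, u, semaphore) for u in urls))
```

Because `scrape` depends only on the `Fetcher` protocol, a test double like `FlakyFetcher` can replace a real HTTP or browser backend without inheritance, which is the testability argument the talking point makes.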
+ +### Stage 2 — Extract & embed (the GraphRAG core) +**One-liner:** "I use `neo4j_graphrag`'s `SimpleKGPipeline` to do schema-constrained LLM entity extraction and vector embedding in one pass, writing both a *lexical graph* (documents → chunks) and a *domain graph* (entities + relationships) into Neo4j." + +Neo4j product-positioning hooks: +- **First-party library.** `SimpleKGPipeline`, `LexicalGraphConfig`, `OpenAILLM`, and the `Embedder` interface are all `neo4j_graphrag` (`extraction/pipeline.py:204-284`). Message: *"Neo4j isn't just the database — its GraphRAG package gives you the whole construction pipeline."* +- **Dual-graph model = the differentiator.** `LexicalGraphConfig` (`pipeline.py:264-269`) wires `Chunk -[FROM_ARTICLE]-> Article` and entities `-[MENTIONED_IN]-> Chunk`. Every extracted entity is traceable to its source chunk and article — *provenance/grounding* that vector-only RAG can't provide. Your strongest "why graph beats a plain vector DB" line. +- **Vectors live *in* Neo4j**, on chunk nodes, served by Neo4j's native vector index. *"One database does semantic search AND graph traversal — no separate vector store to keep in sync."* +- **Quality levers:** schema-constrained extraction (12 node / 14 rel / ~50 patterns) prevents drift; **gleaning** runs 2 LLM passes catching 20–30% more entities (`pipeline.py:347-365`); `perform_entity_resolution=True` uses the built-in resolver; `temperature=0` for determinism; `on_error="IGNORE"` so one bad chunk can't kill a ~1.5-hour run. +- **Chunking:** two-stage hierarchical splitter — LangChain `HTMLHeaderTextSplitter` preserves document structure, then `RecursiveCharacterTextSplitter` (or optional Chonkie semantic chunker) keeps chunks within embedding/context limits (`chunking/hierarchical_chunker.py`). + +Likely questions: +- *"Why a graph instead of a vector DB?"* → provenance via the lexical graph + multi-hop traversal + community summaries. 
Vector similarity alone can't answer "what challenges does traceability address across industries." +- *"Why Neo4j specifically?"* → vectors + graph + graph analytics in one engine; first-party GraphRAG library; APOC. +- *"Hardest part of KG construction?"* → not extraction — **entity resolution and schema governance.** (Segue to Stage 3.) + +--- + +## 3. Drill-down — Stage 3: Normalization, Entity Resolution & Community Detection + +This is where the deepest Neo4j-product story lives. The pipeline is deliberately ordered into **3 phases** so all entity-creating steps run before any cleanup, and cleanup runs before graph analytics. + +**Phase A — Entity creation:** +1. `MentionedInBackfiller.backfill()` — creates `MENTIONED_IN` + `APPLIES_TO` relationships. +2. `LangExtractAugmenter.augment()` — post-extraction augmentation with **source grounding** (text-span provenance). + +**Phase B — Entity cleanup (runs after all entities exist):** +3. `EntityNormalizer.normalize_all_entities()` — lowercase + trim. +4. `deduplicate_by_name()` — merge same-name duplicates. +5. `deduplicate_cross_label()` — merge same-name entities with *different* type labels. +6. `EntityCleanupNormalizer.run_cleanup()` — drop generics, merge plural→singular. +7. `IndustryNormalizer.consolidate_industries()` — collapse 100+ variants → ~18 canonical industries (`postprocessing/industry_taxonomy.py`). +8. `EntitySummarizer.summarize()` — LLM-generated entity descriptions. + +**Phase C — Graph analysis (on clean entities):** +9. `CommunityDetector.detect_communities()` — **Leiden** clustering. +10. `CommunitySummarizer.summarize_communities()` — LLM summaries → `Community` nodes via `IN_COMMUNITY`. +11. `CommunityEmbedder.embed_community_summaries()` — vector embeddings of summaries. 
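
The first three Phase B steps listed above (normalize, same-name dedup, cross-label dedup) can be illustrated on plain in-memory records. This is only a sketch of the idea: the real pipeline operates on Neo4j nodes, and the `Entity` dataclass, the function bodies, and especially the label-priority rule for choosing the surviving label are assumptions invented for this example, not the repo's logic.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    label: str
    name: str


def normalize(entities: list[Entity]) -> list[Entity]:
    """Lowercase and trim names so surface variants compare equal."""
    return [Entity(e.label, e.name.strip().lower()) for e in entities]


def deduplicate_by_name(entities: list[Entity]) -> list[Entity]:
    """Collapse exact (label, name) duplicates, keeping first-seen order."""
    return list(dict.fromkeys(entities))


def deduplicate_cross_label(
    entities: list[Entity], label_priority: list[str]
) -> list[Entity]:
    """Merge same-name entities that carry different labels.

    Assumption: the label that appears earliest in `label_priority` wins;
    labels missing from the list rank last.
    """
    rank = {label: i for i, label in enumerate(label_priority)}
    by_name: dict[str, list[Entity]] = defaultdict(list)
    for entity in entities:
        by_name[entity.name].append(entity)
    return [
        min(group, key=lambda e: rank.get(e.label, len(rank)))
        for group in by_name.values()
    ]


raw = [
    Entity("Organization", " FDA "),
    Entity("Industry", "fda"),
    Entity("Organization", "FDA"),
    Entity("Concept", "Traceability"),
]
# Order matters: normalization must run first, or " FDA " and "fda" never match.
clean = deduplicate_cross_label(
    deduplicate_by_name(normalize(raw)),
    label_priority=["Organization", "Industry"],
)
```

Running the dedup passes before normalization would leave `" FDA "` and `"fda"` as distinct names, which is a small-scale version of why cleanup is sequenced after all entity-creating steps.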
+ +### Talking points +- **Entity resolution is the real work.** Normalization + same-name dedup + **cross-label dedup** (e.g., "FDA" extracted once as `Organization`, once as `Industry` → merged) is what makes the graph trustworthy. Lead with: *"Anyone can get an LLM to emit triples; the differentiator is resolving them into clean, canonical entities."* +- **Domain taxonomy.** `INDUSTRY_TAXONOMY` maps 100+ surface forms to ~18 canonical industries; `CANONICAL_INDUSTRIES` is derived (`industry_taxonomy.py:241`). An `ORGANIZATIONS_NOT_INDUSTRIES` set (`:190`) relabels NASA/FDA/IEEE from Industry → Organization. Shows you encode domain knowledge, not just generic NLP. +- **Phase ordering is intentional.** New entities (Phase A) must exist *before* cleanup (Phase B) so they get normalized too; analytics (Phase C) must run on *clean* entities or communities are noisy. This sequencing is a genuine engineering insight worth calling out. + +### The Microsoft-GraphRAG pattern, on Neo4j (your headline for this stage) +- **Leiden community detection** via `leidenalg` + `igraph` on **semantic edges only** (structural edges like `FROM_ARTICLE` excluded), reproducible with a fixed seed and `gamma` resolution parameter (`graph/community_detection.py:46-70`). Exports the entity graph, runs Leiden locally, writes community IDs back to Neo4j. +- **Community summaries** — each cluster gets an LLM-generated summary stored as a `Community` node, linked `(:Entity)-[:IN_COMMUNITY]->(:Community)`. +- **Community embeddings** — summaries are embedded (voyage-4, 1024d) into a dedicated `community_summary_embeddings` vector index, cosine similarity (`graph/constraints.py:328-350`). +- **Why this matters for retrieval:** this is the *global* (thematic, "what are the big themes") vs *local* (entity-specific) retrieval split that Microsoft's GraphRAG popularized — and you implemented it on Neo4j primitives. 
Positioning: *"Neo4j gives you the storage, the vector index, AND the graph the community algorithm runs on — three roles, one engine."* +- **GDS angle:** I ran Leiden client-side via `leidenalg` for reference-implementation fidelity and reproducibility; the natural enterprise path is Neo4j **Graph Data Science**, which ships Leiden/Louvain as a native, scalable procedure. Be ready to say *"I'd move this into GDS for production scale"* — it shows you know the product roadmap. + +Likely questions: +- *"What's Leiden and why not Louvain?"* → Leiden fixes Louvain's badly-connected-community defect, guarantees well-connected communities, converges better. Both are in Neo4j GDS. +- *"How do you handle duplicate entities?"* → built-in entity resolution at ingest + three explicit dedup passes (name, cross-label, plural) in post-processing. +- *"Resolution / number of communities?"* → `gamma` parameter; higher = more, smaller communities. + +--- + +## 4. Drill-down — Stage 4: Supplementary Graph Structure + +**One-liner:** "On top of the extracted knowledge graph, I add a navigational structure layer — Chapter nodes, Resource nodes (Image/Video/Webinar), and a glossary linked to the concepts it defines." + +Talking points: +- `SupplementaryGraphBuilder` (`graph/supplementary.py`) adds `Chapter` nodes with article relationships, `Resource` nodes, and glossary structure. +- **Glossary-to-concept linking** uses fuzzy matching (`rapidfuzz`) so a defined term connects to the extracted `Concept` it describes — enriching retrieval with authoritative definitions. +- Positioning: *"This is where graph shines over vectors — I can navigate from a chapter to its articles to the entities they mention to the community those entities belong to, all as first-class relationships."* It's the multi-hop story made concrete. + +--- + +## 5. 
Drill-down — Stage 5: Validation & Data Quality + +**One-liner:** "I built a validation framework with Cypher-based quality checks and safe, dry-run-previewable repair operations — because a knowledge graph you can't trust is worse than no graph." + +Talking points: +- `ValidationQueries` (`validation/queries.py`) runs checks: orphan chunks (chunks with no `FROM_ARTICLE`), duplicate entities, missing required properties, invalid relationship patterns. +- `ValidationFixer` applies repairs with a **dry-run preview** mode; fix ordering is deliberate (delete degenerate → re-index → chunk_ids → webinar titles → relabel → backfill `MENTIONED_IN` → definitions → generics → plurals). +- **Pass/fail gates:** orphan_chunks, duplicates, chunk_ids, chunk_index, plural_duplicates (industry count is advisory). +- Reports auto-archive prior versions with ISO-8601 timestamps before writing a new one. +- Positioning for an SE: *"This is the operational maturity question every customer eventually asks — how do you know the graph is correct, and how do you fix it safely in production?"* + +--- + +## 6. Drill-down — Neo4j Platform Fluency (the "shows you actually know the product" section) + +These are the non-obvious lessons that prove hands-on depth. Any one of them can win the room. + +- **Entity-type uniqueness constraints are a trap.** `neo4j_graphrag` 1.13+ creates entities with `CREATE` + `apoc.create.addLabels()`, not `MERGE`. Per-type uniqueness constraints (e.g., `Concept.name`) cause `IndexEntryConflictException` when the same name recurs across extraction batches — **silently rolling back entire batch transactions.** So I deliberately keep uniqueness constraints **only** on structural nodes — Article, Chunk, Chapter, Image, Video, Webinar, Definition (`graph/constraints.py:21-39`) — and let `neo4j_graphrag`'s entity resolution handle dedup. 
*This is the single best "I learned this the hard way in production" anecdote.* +- **The `__Entity__` / `__KGBuilder__` labels matter.** Gleaning and the LangExtract augmenter `MERGE` entities with `:__Entity__:__KGBuilder__` labels (`extraction/gleaning.py:271-272`). Without `__Entity__`, gleaned/augmented nodes are **invisible** to entity resolution and cross-label dedup. Demonstrates you understand `neo4j_graphrag`'s internal labeling contract. +- **APOC dependency** — label operations rely on `apoc.create.addLabels()`; the pipeline preflight verifies APOC is installed. +- **Native vector indexes** — chunk embeddings and a separate `community_summary_embeddings` index (1024d, cosine) are created and dimension-checked (`constraints.py:294-350`). Talk to the dimension-match requirement (index dim must equal embedder dim). +- **Preflight validation** — checks Neo4j connectivity, APOC availability, and vector-index dimensions before a run (`preflight.py`). Shows you fail fast instead of 90 minutes into a load. +- **Direct OpenAI calls (gleaning) need `response_format={"type": "json_object"}`** (`gleaning.py:183`) for reliable JSON — a concrete LLM-integration gotcha. +- **Embedding provider abstraction** — `VoyageAIEmbeddings` implements `neo4j_graphrag.embeddings.base.Embedder`, with **asymmetric input types** ("document" at index time, "query" at search time) and OpenAI fallback. Shows you understand asymmetric embeddings and clean interface design. + +--- + +## 7. Drill-down — Engineering & Ops Maturity + +**One-liner:** "I built this like a product I'd hand to a customer, not a notebook." + +- **Architecture patterns:** Protocol/structural subtyping (PEP 544) for the fetcher, Strategy pattern for swappable fetch backends, dependency injection, lazy initialization (Playwright only starts on first request). +- **Tooling:** Python 3.13, Ruff (broad rule set incl. security/bandit), `ty` type-checking, pytest, pre-commit, CI with Codecov patch-coverage gate. 
+- **Cost control:** `--dry-run` estimates LLM/embedding cost before committing to a ~1.5-hour run. +- **Staging vs production safety:** environment-driven targeting (shell-exported vars beat `.env`), a staging Neo4j Desktop DBMS, green "(staging)" / yellow "(production)" CLI labels, and **interactive confirmation prompts** before `--full` or `--fix` against production Aura. Promotion path dumps staging and uploads to Aura. +- Positioning: *"For a customer, the risk isn't 'can GraphRAG work' — it's 'can I run it safely against prod and know what it costs.' I built for both."* + +--- + +## 8. Bridge to Retrieval & the Chat UI (the payoff) + +When the interviewer asks "how do you actually query this?" — bridge to the separate Retrieval & Chat UI repo. Talking points you can give *from the graph you built here*, even before we review that repo together: + +- **Three retrieval modes the graph supports:** + 1. **Local / vector** — embed the user query (voyage-4, `input_type="query"`), hit the Neo4j chunk vector index, return top-k chunks *with* their `Article`/`Chapter` provenance. + 2. **Graph-augmented** — from the retrieved entities, traverse relationships (`ADDRESSES`, `REQUIRES`, `APPLIES_TO`, etc.) to pull in connected context a pure vector search would miss (multi-hop). + 3. **Global / community** — for thematic questions, vector-search the `community_summary_embeddings` index and answer from `Community` summaries (the Microsoft-GraphRAG "global search"). +- **Why this is the closing argument:** the construction repo isn't the point — it's that I built a graph deliberately shaped to make all three retrieval modes possible in one Neo4j instance. *"Vector-only RAG gives you one of these. The graph gives you all three, plus provenance."* + +> **Next session action:** review the Retrieval & Chat UI repo directly so these bridge points can be made specific (actual retriever classes, Cypher, prompt assembly). + +--- + +## 9. 
Rapid-fire Q&A cheat sheet + +- **"What is GraphRAG?"** — RAG where retrieval is over a knowledge graph (entities + relationships + community summaries), not just a flat vector store. Adds provenance, multi-hop reasoning, and global/thematic retrieval. +- **"Why Neo4j over a vector DB?"** — One engine for vectors + graph traversal + graph analytics; native vector indexes; first-party `neo4j_graphrag` + GDS; provenance via the lexical graph. +- **"Vector index details?"** — Cosine, 1024-dim (voyage-4); separate indexes for chunks and community summaries; index dimension must equal embedder dimension. +- **"How big / how long?"** — ~101 articles + glossary; full re-ingestion ~1.5 hours (~1 article/min), dominated by LLM extraction + gleaning. +- **"Biggest lesson?"** — Entity-type uniqueness constraints silently roll back batches under `neo4j_graphrag`'s CREATE+addLabels pattern; keep uniqueness on structural nodes only and let entity resolution dedup the rest. +- **"What would you do for enterprise scale?"** — Move Leiden into Neo4j GDS, batch/parallelize extraction, consider structured-output extraction when `SimpleKGPipeline` supports it. From 91196897124952ee62d448eb797c1cdae9b4a4b8 Mon Sep 17 00:00:00 2001 From: Claude Date: Mon, 25 May 2026 13:41:42 +0000 Subject: [PATCH 2/2] docs: add handoff brief for Retrieval & Chat UI prep session Carries goal, working style, verified graph contract, and a paste-ready kickoff prompt so the retrieval-repo session continues with full context. 
https://claude.ai/code/session_01BNW3MtX6cCufJN2esFDoFi --- RETRIEVAL_SESSION_HANDOFF.md | 81 ++++++++++++++++++++++++++++++++++++ 1 file changed, 81 insertions(+) create mode 100644 RETRIEVAL_SESSION_HANDOFF.md diff --git a/RETRIEVAL_SESSION_HANDOFF.md b/RETRIEVAL_SESSION_HANDOFF.md new file mode 100644 index 0000000..313cfc7 --- /dev/null +++ b/RETRIEVAL_SESSION_HANDOFF.md @@ -0,0 +1,81 @@ +# Handoff Brief → Retrieval & Chat UI Session + +**Purpose:** Carry the goal, framing, and graph-contract knowledge from the `graphrag-api-db` (construction) prep session into a new Claude Code session rooted in the **Retrieval & Chat UI** repo. Paste the "Kickoff Prompt" (bottom of this file) into the new session to start with full context. + +--- + +## 1. The goal (unchanged across sessions) + +Prepare to articulate this GraphRAG work during a **Neo4j Sales Engineer technical evaluation**. Audience leans **sales-positioning grounded in Neo4j product detail** — i.e., always tie technical choices back to *why Neo4j / why GraphRAG beats naive vector RAG*. Output is interview talking points the candidate can expand on demand, not shipped features. + +In the construction repo we produced `NEO4J_INTERVIEW_PREP.md` (PR #82): a 5–6 bullet overview + stage-by-stage drill-downs + a Neo4j platform-fluency section + a Q&A cheat sheet. **Section 8 of that doc is the retrieval bridge** — the new session's job is to make those bridge points concrete with real retriever code, Cypher, prompt assembly, and similarity scoring. + +## 2. Working style that worked last session + +- Ground every talking point in **actual source (file:line)**, not the README — the interviewer may drill. +- Lead with the one-liner the candidate says out loud, then layer the deeper detail "if pushed." +- For each topic, anticipate the **likely interviewer question** and the angle to answer it. 
+- Keep Neo4j-product positioning explicit: vectors + graph + analytics in one engine, provenance, multi-hop, global-vs-local retrieval. + +## 3. The graph contract the retrieval repo queries against (verified facts) + +The retrieval app reads a Neo4j graph built by `graphrag-api-db`. Key shape: + +**Lexical graph (provenance backbone):** +- `(c:Chunk)-[:FROM_ARTICLE]->(a:Article)` — every chunk traces to its source article. +- `(e:__Entity__)-[:MENTIONED_IN]->(c:Chunk)` — entities trace to the chunks that mention them. +- `Chunk` has `.text`, `.embedding`, `.chunk_id`, `.source_article_id`. `Article` has `.title`, `.url`, `.chapter_number`. + +**Domain graph (entities + relationships):** +- 12 entity node types (Concept, Challenge, Artifact, Bestpractice, Processstage, Role, Standard, Tool, Methodology, Industry, Organization, Outcome), all also labeled `:__Entity__:__KGBuilder__`. +- Entity `.name` is lowercased/normalized; `.display_name` keeps original casing; many have LLM-generated `.summary`. +- 14 semantic relationship types for traversal: ADDRESSES, REQUIRES, COMPONENT_OF, APPLIES_TO, PUBLISHES, REGULATES, DEVELOPS, ACHIEVES, etc. + +**Community layer (global / thematic retrieval):** +- `(e)-[:IN_COMMUNITY]->(:Community)`; `Community` nodes carry `.summary` and `.summary_embedding` (Leiden clustering + LLM summaries). + +**Indexes the retrievers rely on:** +- `chunk_embeddings` — VECTOR index on `(c:Chunk).embedding`, cosine. ⚠️ **Verify actual dimension in the live graph**: the construction config uses Voyage `voyage-4` = **1024d**, but the index-creation helper defaults to 1536 (OpenAI). The retrieval query embedding MUST match the index dimension and the embedding model used at ingest. +- `chunk_text_fulltext` — FULL-TEXT index on `(c:Chunk).text` → enables **hybrid (vector + BM25) retrieval**. +- `community_summary_embeddings` — VECTOR index on `(c:Community).summary_embedding`, **1024d, cosine**. 
+ +**Embedding gotcha (asymmetric):** ingest embeds with Voyage `input_type="document"`; **retrieval must embed the query with `input_type="query"`** or relevance degrades. This is a strong "I understand asymmetric embeddings" talking point — check the retrieval repo actually does this. + +## 4. Three retrieval modes to expect / verify in the repo + +1. **Local / vector** — embed query → `chunk_embeddings` top-k → return chunks *with* Article/Chapter provenance. +2. **Graph-augmented** — from retrieved chunks/entities, traverse semantic relationships (ADDRESSES, REQUIRES, APPLIES_TO…) to pull connected context a flat vector search misses (multi-hop). In `neo4j_graphrag` this is typically a `VectorCypherRetriever` with a retrieval-query Cypher. +3. **Global / community** — vector-search `community_summary_embeddings`, answer from `Community` summaries (Microsoft-GraphRAG "global search" for thematic questions). + +Plus possibly **hybrid** (vector + `chunk_text_fulltext` BM25) via `HybridRetriever` / `HybridCypherRetriever`, and **Text2Cypher** for structured questions. + +## 5. What to extract/produce in the new session + +Find and read (likely in the retrieval repo): retriever classes, the retrieval-query Cypher, the query-embedding call, the prompt-assembly/context-formatting code, similarity-score handling, and the chat/UI orchestration. Then produce concrete talking points covering: +- **Which `neo4j_graphrag` retriever(s)** are used (`VectorRetriever`, `VectorCypherRetriever`, `HybridRetriever`, `HybridCypherRetriever`, `Text2CypherRetriever`, `GraphRAG`) and why. +- **The actual Cypher** behind graph-augmented retrieval (the multi-hop expansion) — this is the money shot for "why graph." +- **Similarity scoring**: cosine, how top-k / thresholds are chosen, whether scores are surfaced/reranked, hybrid score fusion if present. 
+- **Prompt assembly**: how retrieved chunks + graph context + community summaries are formatted into the LLM prompt, and how provenance/citations are returned. +- **The closing argument**: vector-only RAG gives one retrieval mode; this graph gives local + graph-augmented + global, all in one Neo4j instance, with provenance. + +Deliver as an addendum that mirrors the structure of `NEO4J_INTERVIEW_PREP.md` Section 8, so the two docs read as one guide. + +--- + +## 6. Kickoff Prompt (paste this into the new session) + +> I'm interviewing for a **Neo4j Sales Engineer** role and have a technical evaluation. I built a two-repo GraphRAG project: a construction/ingestion repo (`graphrag-api-db`) and **this repo**, the Retrieval & Chat UI. In a prior session we produced an interview prep guide for the construction side (committed as `NEO4J_INTERVIEW_PREP.md` on PR #82 of `graphrag-api-db`); its Section 8 sketches the retrieval story but without real code. +> +> Your job this session: review **this** Retrieval & Chat UI repo and produce concrete, source-grounded interview talking points (file:line references) for the retrieval/query layer, written to mirror that prep guide so the two read as one. Audience leans **sales-positioning grounded in Neo4j product detail** — always connect choices back to why Neo4j / why GraphRAG beats naive vector RAG. For each point: a spoken one-liner, deeper "if pushed" detail, and the likely interviewer question. 
+> +> Specifically dig into and explain: (1) which `neo4j_graphrag` retriever(s) are used and why; (2) the actual retrieval Cypher behind graph-augmented/multi-hop retrieval; (3) similarity scoring — cosine, top-k/threshold selection, score surfacing/reranking, hybrid fusion; (4) prompt assembly — how chunks + graph context + community summaries become the LLM prompt, and how provenance/citations are returned; (5) the three retrieval modes (local/vector, graph-augmented, global/community) and any hybrid (vector + BM25) or Text2Cypher paths. +> +> **The graph this app queries (built by the other repo), so you can connect retrieval to construction:** +> - Lexical: `(Chunk)-[:FROM_ARTICLE]->(Article)`, `(Entity)-[:MENTIONED_IN]->(Chunk)`. Chunk has `.text`, `.embedding`. +> - Domain: 12 entity types labeled `:__Entity__:__KGBuilder__`; `.name` lowercased, `.display_name` original; 14 semantic rel types (ADDRESSES, REQUIRES, COMPONENT_OF, APPLIES_TO, PUBLISHES, REGULATES, DEVELOPS, ACHIEVES…). +> - Community: `(Entity)-[:IN_COMMUNITY]->(Community)`; Community has `.summary`, `.summary_embedding`. +> - Indexes: `chunk_embeddings` (vector, cosine, on `Chunk.embedding`), `chunk_text_fulltext` (BM25 full-text on `Chunk.text`), `community_summary_embeddings` (vector, 1024d, cosine, on `Community.summary_embedding`). +> - Embeddings: Voyage `voyage-4`, **asymmetric** — ingest uses `input_type="document"`, so retrieval must embed the query with `input_type="query"`. Verify the index dimension matches the query-embedding dimension (ingest uses 1024d; confirm the live index isn't the 1536 default). +> +> Start by mapping the repo and locating the retriever + Cypher + prompt-assembly code, then write the addendum. +