173 changes: 173 additions & 0 deletions NEO4J_INTERVIEW_PREP.md
@@ -0,0 +1,173 @@
# Neo4j Sales Engineer — Technical Evaluation Prep

**Project:** `graphrag-api-db` — an end-to-end GraphRAG knowledge-graph pipeline built on Neo4j and `neo4j_graphrag`.

**How to use this doc:** The first section is your 5–6 bullet **elevator overview** — memorize the one-liners. Everything after is the **drill-down** for each point, following the pipeline stage by stage, with talking points, the Neo4j-product positioning, the deep technical detail you can pull from if pressed, and the likely interviewer questions. Lean sales-positioning, but every claim is anchored in real code (file:line references included so you can re-read the source).

> **Framing note:** This repo is the *ingestion / graph-construction* side. It builds the Neo4j graph and the vector/community indexes that **power** GraphRAG retrieval. A separate repo holds the Retrieval & Chat UI. When the interviewer asks "so how do you query it?", that's your cue to bridge to the retrieval repo — talking points for that bridge are in the final section.

---

## 1. Elevator Overview (the 5–6 bullets)

**1. End-to-end GraphRAG pipeline on Neo4j, built on `neo4j_graphrag`.**
I built a complete ETL-to-graph pipeline that ingests an unstructured technical guide (~101 articles + glossary) and turns it into a queryable Neo4j knowledge graph using Neo4j's first-party `neo4j_graphrag` library and `SimpleKGPipeline`. It demonstrates the full GraphRAG value proposition — vector search *plus* graph structure — on Neo4j-native tooling rather than a bolted-on stack.

**2. Schema-constrained LLM entity extraction (12 node types, 14 relationship types).**
Rather than letting the LLM emit an unconstrained graph, I defined a domain schema with ~50 validated (source, relationship, target) patterns. This prevents schema drift, keeps the graph queryable, and is exactly the governed-extraction story Neo4j positions against naive RAG.

**3. Hybrid retrieval foundation: vector embeddings + graph traversal + community summaries.**
The graph supports semantic retrieval via Neo4j native vector indexes (Voyage AI voyage-4, 1024-dim, OpenAI fallback) layered on explicit relationships and document/chunk structure. I added Leiden community detection with LLM-generated, separately-embedded community summaries — the Microsoft-GraphRAG global-vs-local retrieval pattern, running natively against Neo4j.

**4. Production-grade data quality and entity resolution.**
An 11-step, 3-phase post-processing layer handles normalization, same-name and cross-label deduplication, an industry taxonomy collapsing 100+ variants into ~18 canonical nodes, and a validation framework with dry-run repair. Reflects the real-world truth that GraphRAG quality lives or dies on entity resolution, not extraction.

**5. Deep Neo4j platform fluency (constraints, APOC, vector indexes, graph analytics).**
The project navigates non-obvious Neo4j behaviors — why entity-type uniqueness constraints cause silent batch rollbacks, the `__Entity__`/`__KGBuilder__` labeling entity resolution depends on, APOC-driven label operations, and preflight checks for connectivity/APOC/vector-index dimensions. Hands-on platform knowledge a customer will probe.

**6. Engineered like a product, with a safe staging/production workflow.**
Clean architecture (Protocol-based fetcher abstraction, dependency injection), strict tooling (Ruff, type-checking, pytest, CI/Codecov), cost-estimation dry-runs, and an environment-driven staging-vs-production switch with confirmation prompts before destructive operations against production Aura. Shows I can operationalize GraphRAG, not just prototype it.

---

## 2. Drill-down — Stage 1 & 2: Ingestion → Neo4j

**Arc to land:** *"I take unstructured web content and end up with a Neo4j graph queryable by both vectors and relationships — using Neo4j's own GraphRAG library."*

### Stage 1 — Scrape
**One-liner:** "I scrape a full technical guide — ~101 articles plus a glossary — dynamically discovering structure from the site's TOC, and normalize everything to Markdown before it touches Neo4j."

Talking points:
- **Dynamic discovery, not hard-coded** — chapters/articles come from the live `#chapter-menu` TOC in a single request, so the pipeline survives content changes (robust ingestion, not demo-ware).
- **Protocol-based fetcher abstraction** (httpx default; Playwright for JS-rendered content) via dependency injection — your "I architect for testability and extensibility" proof point. (`fetcher.py`, PEP 544 structural subtyping.)
- **Respectful scraping** — async with a concurrency semaphore (default 3), exponential-backoff retry, custom User-Agent (Cloudflare blocks the default httpx UA).
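
If pressed on the fetcher design, the three ideas above (structural `Protocol`, concurrency semaphore, exponential backoff) can be sketched in plain `asyncio`. This is a minimal illustration, not the code in `fetcher.py`: `FlakyFetcher`, `fetch_with_retry`, `scrape`, and the delay values are hypothetical names and numbers chosen for the example.

```python
import asyncio
from typing import Protocol


class Fetcher(Protocol):
    """Structural interface (PEP 544): any object with a matching fetch() qualifies."""

    async def fetch(self, url: str) -> str: ...


class FlakyFetcher:
    """Hypothetical stand-in backend: fails a fixed number of times, then succeeds."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    async def fetch(self, url: str) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError(f"transient failure for {url}")
        return f"<html>{url}</html>"


async def fetch_with_retry(
    fetcher: Fetcher,
    url: str,
    semaphore: asyncio.Semaphore,
    retries: int = 3,
    base_delay: float = 0.01,
) -> str:
    # The semaphore caps how many requests are in flight at once.
    async with semaphore:
        for attempt in range(retries + 1):
            try:
                return await fetcher.fetch(url)
            except ConnectionError:
                if attempt == retries:
                    raise
                # Exponential backoff: base, 2x base, 4x base, ...
                await asyncio.sleep(base_delay * 2**attempt)
    raise RuntimeError("unreachable")


async def scrape(fetcher: Fetcher, urls: list[str], concurrency: int = 3) -> list[str]:
    semaphore = asyncio.Semaphore(concurrency)
    return await asyncio.gather(
        *(fetch_with_retry(fetcher, url, semaphore) for url in urls)
    )


pages = asyncio.run(scrape(FlakyFetcher(failures=2), ["a", "b"]))
```

Because `Fetcher` is a `Protocol`, `FlakyFetcher` never inherits from it; a test double or a Playwright-backed class can be injected the same way.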

### Stage 2 — Extract & embed (the GraphRAG core)
**One-liner:** "I use `neo4j_graphrag`'s `SimpleKGPipeline` to do schema-constrained LLM entity extraction and vector embedding in one pass, writing both a *lexical graph* (documents → chunks) and a *domain graph* (entities + relationships) into Neo4j."

Neo4j product-positioning hooks:
- **First-party library.** `SimpleKGPipeline`, `LexicalGraphConfig`, `OpenAILLM`, and the `Embedder` interface are all `neo4j_graphrag` (`extraction/pipeline.py:204-284`). Message: *"Neo4j isn't just the database — its GraphRAG package gives you the whole construction pipeline."*
- **Dual-graph model = the differentiator.** `LexicalGraphConfig` (`pipeline.py:264-269`) wires `Chunk -[FROM_ARTICLE]-> Article` and entities `-[MENTIONED_IN]-> Chunk`. Every extracted entity is traceable to its source chunk and article: *provenance/grounding* that vector-only RAG can't provide. Your strongest "why graph beats a plain vector DB" line.
- **Vectors live *in* Neo4j**, on chunk nodes, served by Neo4j's native vector index. *"One database does semantic search AND graph traversal — no separate vector store to keep in sync."*
- **Quality levers:** schema-constrained extraction (12 node / 14 rel / ~50 patterns) prevents drift; **gleaning** runs 2 LLM passes catching 20–30% more entities (`pipeline.py:347-365`); `perform_entity_resolution=True` uses the built-in resolver; `temperature=0` for determinism; `on_error="IGNORE"` so one bad chunk can't kill a ~1.5-hour run.
- **Chunking:** two-stage hierarchical splitter — LangChain `HTMLHeaderTextSplitter` preserves document structure, then `RecursiveCharacterTextSplitter` (or optional Chonkie semantic chunker) keeps chunks within embedding/context limits (`chunking/hierarchical_chunker.py`).
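
To make "schema-constrained" concrete, here is a small plain-Python sketch of the idea: only triples whose (source label, relationship, target label) pattern is on an allow-list survive. It illustrates the concept only and is not how `neo4j_graphrag` enforces the schema internally. The relationship names come from this doc; the `Challenge` and `Standard` labels and the helper `partition_triples` are assumptions made up for the example.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    source_label: str
    relationship: str
    target_label: str


# Illustrative subset only; the real schema has ~50 patterns.
# "Challenge" and "Standard" are hypothetical labels for this sketch.
ALLOWED_PATTERNS = {
    Triple("Concept", "ADDRESSES", "Challenge"),
    Triple("Standard", "APPLIES_TO", "Industry"),
    Triple("Concept", "REQUIRES", "Concept"),
}


def partition_triples(triples: list[Triple]) -> tuple[list[Triple], list[Triple]]:
    """Split extracted triples into schema-conforming and rejected lists."""
    kept = [t for t in triples if t in ALLOWED_PATTERNS]
    rejected = [t for t in triples if t not in ALLOWED_PATTERNS]
    return kept, rejected


extracted = [
    Triple("Concept", "ADDRESSES", "Challenge"),  # on the allow-list
    Triple("Industry", "ADDRESSES", "Concept"),   # drifted pattern, dropped
]
kept, rejected = partition_triples(extracted)
```

The talking point this supports: drift is caught at the pattern level, so downstream Cypher can rely on a fixed set of shapes.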

Likely questions:
- *"Why a graph instead of a vector DB?"* → provenance via the lexical graph + multi-hop traversal + community summaries. Vector similarity alone can't answer "what challenges does traceability address across industries."
- *"Why Neo4j specifically?"* → vectors + graph + graph analytics in one engine; first-party GraphRAG library; APOC.
- *"Hardest part of KG construction?"* → not extraction — **entity resolution and schema governance.** (Segue to Stage 3.)

---

## 3. Drill-down — Stage 3: Normalization, Entity Resolution & Community Detection

This is where the deepest Neo4j-product story lives. The pipeline is deliberately ordered into **3 phases** so all entity-creating steps run before any cleanup, and cleanup runs before graph analytics.

**Phase A — Entity creation:**
1. `MentionedInBackfiller.backfill()` — creates `MENTIONED_IN` + `APPLIES_TO` relationships.
2. `LangExtractAugmenter.augment()` — post-extraction augmentation with **source grounding** (text-span provenance).

**Phase B — Entity cleanup (runs after all entities exist):**
3. `EntityNormalizer.normalize_all_entities()` — lowercase + trim.
4. `deduplicate_by_name()` — merge same-name duplicates.
5. `deduplicate_cross_label()` — merge same-name entities with *different* type labels.
6. `EntityCleanupNormalizer.run_cleanup()` — drop generics, merge plural→singular.
7. `IndustryNormalizer.consolidate_industries()` — collapse 100+ variants → ~18 canonical industries (`postprocessing/industry_taxonomy.py`).
8. `EntitySummarizer.summarize()` — LLM-generated entity descriptions.

**Phase C — Graph analysis (on clean entities):**
9. `CommunityDetector.detect_communities()` — **Leiden** clustering.
10. `CommunitySummarizer.summarize_communities()` — LLM summaries → `Community` nodes via `IN_COMMUNITY`.
11. `CommunityEmbedder.embed_community_summaries()` — vector embeddings of summaries.

### Talking points
- **Entity resolution is the real work.** Normalization + same-name dedup + **cross-label dedup** (e.g., "FDA" extracted once as `Organization`, once as `Industry` → merged) is what makes the graph trustworthy. Lead with: *"Anyone can get an LLM to emit triples; the differentiator is resolving them into clean, canonical entities."*
- **Domain taxonomy.** `INDUSTRY_TAXONOMY` maps 100+ surface forms to ~18 canonical industries; `CANONICAL_INDUSTRIES` is derived (`industry_taxonomy.py:241`). An `ORGANIZATIONS_NOT_INDUSTRIES` set (`:190`) relabels NASA/FDA/IEEE from Industry → Organization. Shows you encode domain knowledge, not just generic NLP.
- **Phase ordering is intentional.** New entities (Phase A) must exist *before* cleanup (Phase B) so they get normalized too; analytics (Phase C) must run on *clean* entities or communities are noisy. This sequencing is a genuine engineering insight worth calling out.
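
The normalize, same-name dedup, and cross-label dedup steps can be illustrated in memory before talking about the Cypher that does the same work in Neo4j. This is a sketch under stated assumptions: the `LABEL_PRIORITY` ordering and the `resolve_entities` helper are hypothetical, and the real pipeline merges nodes (and their relationships) inside the database rather than in Python lists.

```python
from collections import defaultdict

# Hypothetical tie-break: when one name carries several labels, keep the
# label that appears first in this list.
LABEL_PRIORITY = ["Organization", "Concept", "Industry"]


def normalize(name: str) -> str:
    """Lowercase and collapse surrounding/internal whitespace."""
    return " ".join(name.lower().split())


def _rank(label: str) -> int:
    return LABEL_PRIORITY.index(label) if label in LABEL_PRIORITY else len(LABEL_PRIORITY)


def resolve_entities(entities: list[dict]) -> list[dict]:
    """Merge entities that share a normalized name, even across labels."""
    labels_by_name: dict[str, set[str]] = defaultdict(set)
    for entity in entities:
        labels_by_name[normalize(entity["name"])].add(entity["label"])

    resolved = []
    for name, labels in labels_by_name.items():
        ordered = sorted(labels)  # sorted first so ties are deterministic
        resolved.append(
            {"name": name, "label": min(ordered, key=_rank), "merged_labels": ordered}
        )
    return resolved


raw = [
    {"name": "FDA ", "label": "Organization"},
    {"name": "fda", "label": "Industry"},
    {"name": "Traceability", "label": "Concept"},
]
clean = resolve_entities(raw)
```

Note how normalization has to run first: without it, `"FDA "` and `"fda"` would never land in the same group, which is the in-miniature version of the phase-ordering argument.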

### The Microsoft-GraphRAG pattern, on Neo4j (your headline for this stage)
- **Leiden community detection** via `leidenalg` + `igraph` on **semantic edges only** (structural edges like `FROM_ARTICLE` excluded), reproducible with a fixed seed and `gamma` resolution parameter (`graph/community_detection.py:46-70`). Exports the entity graph, runs Leiden locally, writes community IDs back to Neo4j.
- **Community summaries** — each cluster gets an LLM-generated summary stored as a `Community` node, linked `(:Entity)-[:IN_COMMUNITY]->(:Community)`.
- **Community embeddings** — summaries are embedded (voyage-4, 1024d) into a dedicated `community_summary_embeddings` vector index, cosine similarity (`graph/constraints.py:328-350`).
- **Why this matters for retrieval:** this is the *global* (thematic, "what are the big themes") vs *local* (entity-specific) retrieval split that Microsoft's GraphRAG popularized — and you implemented it on Neo4j primitives. Positioning: *"Neo4j gives you the storage, the vector index, AND the graph the community algorithm runs on — three roles, one engine."*
- **GDS angle:** I ran Leiden client-side via `leidenalg` for reference-implementation fidelity and reproducibility; the natural enterprise path is Neo4j **Graph Data Science**, which ships Leiden/Louvain as a native, scalable procedure. Be ready to say *"I'd move this into GDS for production scale"*: it shows you know the product portfolio, not just the open-source tooling.
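
A hedged sketch of the client-side Leiden step, in case the interviewer asks what "export, cluster, write back" looks like. The set of structural relationship types, the choice of `RBConfigurationVertexPartition`, and the helper names are assumptions for illustration; check `graph/community_detection.py` for what the project actually does. `igraph` and `leidenalg` are imported lazily so the edge filter can be exercised without them installed.

```python
# Assumed set of structural relationship types to exclude; the doc only
# confirms that edges like FROM_ARTICLE are left out.
STRUCTURAL_TYPES = {"FROM_ARTICLE", "MENTIONED_IN", "IN_COMMUNITY"}


def semantic_edges(edges: list[tuple[str, str, str]]) -> list[tuple[str, str]]:
    """Keep only (source, target) pairs whose relationship type is semantic."""
    return [(src, dst) for src, rel, dst in edges if rel not in STRUCTURAL_TYPES]


def detect_communities(
    edges: list[tuple[str, str, str]], gamma: float = 1.0, seed: int = 42
) -> dict[str, int]:
    """Run Leiden on the semantic subgraph; returns {entity name: community id}."""
    import igraph as ig
    import leidenalg as la

    graph = ig.Graph.TupleList(semantic_edges(edges), directed=False)
    partition = la.find_partition(
        graph,
        la.RBConfigurationVertexPartition,  # supports a resolution parameter
        resolution_parameter=gamma,         # higher gamma: more, smaller communities
        seed=seed,                          # fixed seed for reproducible clusters
    )
    return dict(zip(graph.vs["name"], partition.membership))


exported = [
    ("traceability", "ADDRESSES", "compliance risk"),
    ("chunk-17", "FROM_ARTICLE", "article-3"),
    ("requirements management", "REQUIRES", "traceability"),
]
filtered = semantic_edges(exported)
```

The returned mapping is what would be written back to Neo4j as community IDs before the summarization step.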

Likely questions:
- *"What's Leiden and why not Louvain?"* → Leiden fixes Louvain's badly-connected-community defect, guarantees well-connected communities, converges better. Both are in Neo4j GDS.
- *"How do you handle duplicate entities?"* → built-in entity resolution at ingest + three explicit dedup passes (name, cross-label, plural) in post-processing.
- *"Resolution / number of communities?"* → `gamma` parameter; higher = more, smaller communities.

---

## 4. Drill-down — Stage 4: Supplementary Graph Structure

**One-liner:** "On top of the extracted knowledge graph, I add a navigational structure layer — Chapter nodes, Resource nodes (Image/Video/Webinar), and a glossary linked to the concepts it defines."

Talking points:
- `SupplementaryGraphBuilder` (`graph/supplementary.py`) adds `Chapter` nodes with article relationships, `Resource` nodes, and glossary structure.
- **Glossary-to-concept linking** uses fuzzy matching (`rapidfuzz`) so a defined term connects to the extracted `Concept` it describes — enriching retrieval with authoritative definitions.
- Positioning: *"This is where graph shines over vectors — I can navigate from a chapter to its articles to the entities they mention to the community those entities belong to, all as first-class relationships."* It's the multi-hop story made concrete.

---

## 5. Drill-down — Stage 5: Validation & Data Quality

**One-liner:** "I built a validation framework with Cypher-based quality checks and safe, dry-run-previewable repair operations — because a knowledge graph you can't trust is worse than no graph."

Talking points:
- `ValidationQueries` (`validation/queries.py`) runs checks: orphan chunks (chunks with no `FROM_ARTICLE`), duplicate entities, missing required properties, invalid relationship patterns.
- `ValidationFixer` applies repairs with a **dry-run preview** mode; fix ordering is deliberate (delete degenerate → re-index → chunk_ids → webinar titles → relabel → backfill `MENTIONED_IN` → definitions → generics → plurals).
- **Pass/fail gates:** orphan_chunks, duplicates, chunk_ids, chunk_index, plural_duplicates (industry count is advisory).
- Reports auto-archive prior versions with ISO-8601 timestamps before writing a new one.
- Positioning for an SE: *"This is the operational maturity question every customer eventually asks — how do you know the graph is correct, and how do you fix it safely in production?"*
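
The gate logic above can be sketched as a mapping from check name to Cypher plus a small evaluator. Treat the queries as assumptions: they presume `(:Chunk)-[:FROM_ARTICLE]->(:Article)` and a `name` property on `__Entity__` nodes, and they are not copied from `validation/queries.py`. The `evaluate` helper and the `industry_variants` advisory key are hypothetical.

```python
# Assumed Cypher for two of the checks; each returns a violation count `n`.
CHECKS = {
    "orphan_chunks": (
        "MATCH (c:Chunk) WHERE NOT (c)-[:FROM_ARTICLE]->(:Article) "
        "RETURN count(c) AS n"
    ),
    "duplicates": (
        "MATCH (e:__Entity__) WITH toLower(e.name) AS name, count(*) AS c "
        "WHERE c > 1 RETURN count(name) AS n"
    ),
}

# Checks that fail the run; anything else is reported but advisory.
GATING = {"orphan_chunks", "duplicates", "chunk_ids", "chunk_index", "plural_duplicates"}


def evaluate(violations: dict[str, int]) -> dict:
    """Turn per-check violation counts into a pass/fail report."""
    failures = sorted(k for k, n in violations.items() if k in GATING and n > 0)
    advisories = sorted(k for k, n in violations.items() if k not in GATING and n > 0)
    return {"passed": not failures, "failures": failures, "advisories": advisories}


report = evaluate({"orphan_chunks": 0, "duplicates": 2, "industry_variants": 1})
```

In a real run the counts would come from executing each query; a dry-run repair would print the matching nodes instead of deleting or merging them.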

---

## 6. Drill-down — Neo4j Platform Fluency (the "shows you actually know the product" section)

These are the non-obvious lessons that prove hands-on depth. Any one of them can win the room.

- **Entity-type uniqueness constraints are a trap.** `neo4j_graphrag` 1.13+ creates entities with `CREATE` + `apoc.create.addLabels()`, not `MERGE`. Per-type uniqueness constraints (e.g., `Concept.name`) cause `IndexEntryConflictException` when the same name recurs across extraction batches — **silently rolling back entire batch transactions.** So I deliberately keep uniqueness constraints **only** on structural nodes — Article, Chunk, Chapter, Image, Video, Webinar, Definition (`graph/constraints.py:21-39`) — and let `neo4j_graphrag`'s entity resolution handle dedup. *This is the single best "I learned this the hard way in production" anecdote.*
- **The `__Entity__` / `__KGBuilder__` labels matter.** Gleaning and the LangExtract augmenter `MERGE` entities with `:__Entity__:__KGBuilder__` labels (`extraction/gleaning.py:271-272`). Without `__Entity__`, gleaned/augmented nodes are **invisible** to entity resolution and cross-label dedup. Demonstrates you understand `neo4j_graphrag`'s internal labeling contract.
- **APOC dependency** — label operations rely on `apoc.create.addLabels()`; the pipeline preflight verifies APOC is installed.
- **Native vector indexes** — chunk embeddings and a separate `community_summary_embeddings` index (1024d, cosine) are created and dimension-checked (`constraints.py:294-350`). Talk to the dimension-match requirement (index dim must equal embedder dim).
- **Preflight validation** — checks Neo4j connectivity, APOC availability, and vector-index dimensions before a run (`preflight.py`). Shows you fail fast instead of 90 minutes into a load.
- **Direct OpenAI calls (gleaning) need `response_format={"type": "json_object"}`** (`gleaning.py:183`) for reliable JSON — a concrete LLM-integration gotcha.
- **Embedding provider abstraction** — `VoyageAIEmbeddings` implements `neo4j_graphrag.embeddings.base.Embedder`, with **asymmetric input types** ("document" at index time, "query" at search time) and OpenAI fallback. Shows you understand asymmetric embeddings and clean interface design.
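
For the dimension-match point, a minimal sketch of creating a vector index and failing fast on a mismatch. The `CREATE VECTOR INDEX ... OPTIONS {indexConfig: ...}` form is standard Neo4j 5 Cypher; the `embedding` property name, the helper names, and the shape of the row passed to `check_dimensions` (modeled on a `SHOW INDEXES YIELD name, options` result) are assumptions, not what `constraints.py` or `preflight.py` necessarily contain.

```python
def vector_index_cypher(
    name: str, label: str, prop: str, dimensions: int, similarity: str = "cosine"
) -> str:
    """Build an idempotent CREATE VECTOR INDEX statement."""
    return (
        f"CREATE VECTOR INDEX {name} IF NOT EXISTS "
        f"FOR (n:{label}) ON (n.{prop}) "
        "OPTIONS {indexConfig: {"
        f"`vector.dimensions`: {dimensions}, "
        f"`vector.similarity_function`: '{similarity}'"
        "}}"
    )


def check_dimensions(index_row: dict, embedder_dim: int) -> None:
    """Raise before the run starts if the index and embedder disagree."""
    index_dim = index_row["options"]["indexConfig"]["vector.dimensions"]
    if index_dim != embedder_dim:
        raise ValueError(
            f"index {index_row['name']!r} has {index_dim} dims, "
            f"embedder produces {embedder_dim}"
        )


statement = vector_index_cypher(
    "community_summary_embeddings", "Community", "embedding", 1024
)
row = {
    "name": "community_summary_embeddings",
    "options": {"indexConfig": {"vector.dimensions": 1024}},
}
check_dimensions(row, 1024)  # matches, so no exception
```

This is the practical reason an OpenAI fallback needs care: if the fallback model's output size differs from the 1024-dim index, the mismatch should surface in preflight, not mid-load.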

---

## 7. Drill-down — Engineering & Ops Maturity

**One-liner:** "I built this like a product I'd hand to a customer, not a notebook."

- **Architecture patterns:** Protocol/structural subtyping (PEP 544) for the fetcher, Strategy pattern for swappable fetch backends, dependency injection, lazy initialization (Playwright only starts on first request).
- **Tooling:** Python 3.13, Ruff (broad rule set incl. security/bandit), `ty` type-checking, pytest, pre-commit, CI with Codecov patch-coverage gate.
- **Cost control:** `--dry-run` estimates LLM/embedding cost before committing to a ~1.5-hour run.
- **Staging vs production safety:** environment-driven targeting (shell-exported vars beat `.env`), a staging Neo4j Desktop DBMS, green "(staging)" / yellow "(production)" CLI labels, and **interactive confirmation prompts** before `--full` or `--fix` against production Aura. Promotion path dumps staging and uploads to Aura.
- Positioning: *"For a customer, the risk isn't 'can GraphRAG work' — it's 'can I run it safely against prod and know what it costs.' I built for both."*

---

## 8. Bridge to Retrieval & the Chat UI (the payoff)

When the interviewer asks "how do you actually query this?", bridge to the separate Retrieval & Chat UI repo. The talking points below follow *from the graph you built here*, so you can give them even before reviewing that repo:

- **Three retrieval modes the graph supports:**
  1. **Local / vector:** embed the user query (voyage-4, `input_type="query"`), hit the Neo4j chunk vector index, return top-k chunks *with* their `Article`/`Chapter` provenance.
  2. **Graph-augmented:** from the retrieved entities, traverse relationships (`ADDRESSES`, `REQUIRES`, `APPLIES_TO`, etc.) to pull in connected context a pure vector search would miss (multi-hop).
  3. **Global / community:** for thematic questions, vector-search the `community_summary_embeddings` index and answer from `Community` summaries (the Microsoft-GraphRAG "global search").
- **Why this is the closing argument:** the construction repo isn't the point — it's that I built a graph deliberately shaped to make all three retrieval modes possible in one Neo4j instance. *"Vector-only RAG gives you one of these. The graph gives you all three, plus provenance."*
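
Until the retrieval repo is reviewed, a hedged sketch of what the local and global modes might look like against this graph. `db.index.vector.queryNodes` is Neo4j's built-in vector search procedure and `execute_query` is the Python driver's convenience method; the chunk index name `chunk_embeddings`, the `text`, `title`, and `summary` properties, and the `FROM_ARTICLE` direction are assumptions, not confirmed by either repo.

```python
# Local mode: nearest chunks plus their article provenance (assumed schema).
LOCAL_SEARCH = """
CALL db.index.vector.queryNodes($index, $k, $embedding)
YIELD node AS chunk, score
MATCH (chunk)-[:FROM_ARTICLE]->(article:Article)
RETURN chunk.text AS text, article.title AS source, score
"""

# Global mode: nearest community summaries for thematic questions.
GLOBAL_SEARCH = """
CALL db.index.vector.queryNodes($index, $k, $embedding)
YIELD node AS community, score
RETURN community.summary AS text, 'community' AS source, score
"""


def vector_search(driver, query: str, index: str, embedding: list[float], k: int = 5):
    """Run one of the queries above; keyword arguments become Cypher parameters."""
    records, _summary, _keys = driver.execute_query(
        query, index=index, k=k, embedding=embedding
    )
    return [
        {"text": r["text"], "source": r["source"], "score": r["score"]}
        for r in records
    ]
```

Usage would be along the lines of `vector_search(driver, LOCAL_SEARCH, "chunk_embeddings", query_vector)` for entity-specific questions and `vector_search(driver, GLOBAL_SEARCH, "community_summary_embeddings", query_vector)` for thematic ones, with `query_vector` produced by the embedder in query mode.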

> **Next session action:** review the Retrieval & Chat UI repo directly so these bridge points can be made specific (actual retriever classes, Cypher, prompt assembly).

---

## 9. Rapid-fire Q&A cheat sheet

- **"What is GraphRAG?"** — RAG where retrieval is over a knowledge graph (entities + relationships + community summaries), not just a flat vector store. Adds provenance, multi-hop reasoning, and global/thematic retrieval.
- **"Why Neo4j over a vector DB?"** — One engine for vectors + graph traversal + graph analytics; native vector indexes; first-party `neo4j_graphrag` + GDS; provenance via the lexical graph.
- **"Vector index details?"** — Cosine, 1024-dim (voyage-4); separate indexes for chunks and community summaries; index dimension must equal embedder dimension.
- **"How big / how long?"** — ~101 articles + glossary; full re-ingestion ~1.5 hours (~1 article/min), dominated by LLM extraction + gleaning.
- **"Biggest lesson?"** — Entity-type uniqueness constraints silently roll back batches under `neo4j_graphrag`'s CREATE+addLabels pattern; keep uniqueness on structural nodes only and let entity resolution dedup the rest.
- **"What would you do for enterprise scale?"** — Move Leiden into Neo4j GDS, batch/parallelize extraction, consider structured-output extraction when `SimpleKGPipeline` supports it.