Context
We have successfully implemented C++ code indexing via scip-clang (Sourcegraph). However, the current graph lacks awareness of the build system (CMake targets) and MLIR dialect definitions (TableGen). To enable "GraphRAG," we must bridge the gap between semantic vectors (LanceDB) and structural relationships (Neo4j). We need a "Compiler Introspection" approach—querying CMake and TableGen directly—rather than relying on fragile regex parsing.
Objective
Implement the ingest_build.py and ingest_dialects.py modules to query the CMake File API and iree-tblgen, respectively, and merge this data into the existing Neo4j graph to create a unified lineage: Build Target ↔ Source File ↔ C++ Symbol ↔ MLIR Op.
Scope of Work
- Build System Ingestion (scripts/ingest_build.py):
  - Mechanism: Inject a query object into .cmake/api/v1/query, trigger a CMake configure, and parse the resulting JSON reply.
  - Graph Schema:
    - Nodes: (:BuildTarget {name, type}), (:File {path}).
    - Edges: (:BuildTarget)-[:COMPILES]->(:File), (:BuildTarget)-[:LINKS_TO]->(:BuildTarget).
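The mechanism above can be sketched roughly as follows. This is a minimal sketch, not the final module: it assumes a shared stateless query file named codemodel-v2 and the documented reply layout (an index-*.json file whose objects list points at a codemodel file, which in turn points at one JSON file per target). The function names write_query and load_targets are hypothetical. Note that the dependencies field of a target records build-order dependencies, so mapping it to [:LINKS_TO] is an approximation.

```python
import json
from pathlib import Path


def write_query(build_dir: str) -> Path:
    """Create an empty shared query file requesting codemodel version 2."""
    query_dir = Path(build_dir) / ".cmake" / "api" / "v1" / "query"
    query_dir.mkdir(parents=True, exist_ok=True)
    query_file = query_dir / "codemodel-v2"
    query_file.touch()
    return query_file


def load_targets(build_dir: str) -> list[dict]:
    """Parse the File API reply into flat target records.

    A CMake configure must run between write_query() and this call,
    for example: cmake -S <source> -B <build>
    """
    reply_dir = Path(build_dir) / ".cmake" / "api" / "v1" / "reply"
    # The lexicographically last index file is the current one.
    index_path = sorted(reply_dir.glob("index-*.json"))[-1]
    index = json.loads(index_path.read_text())
    ref = next(o for o in index["objects"] if o["kind"] == "codemodel")
    codemodel = json.loads((reply_dir / ref["jsonFile"]).read_text())

    # Assumption: only the first configuration is of interest.
    config = codemodel["configurations"][0]
    id_to_name = {t["id"]: t["name"] for t in config["targets"]}

    targets = []
    for target_ref in config["targets"]:
        target = json.loads((reply_dir / target_ref["jsonFile"]).read_text())
        targets.append({
            "name": target["name"],
            "type": target["type"],
            "sources": [s["path"] for s in target.get("sources", [])],
            # Build-order dependencies, resolved from target ids to names.
            "depends_on": [
                id_to_name.get(d["id"], d["id"])
                for d in target.get("dependencies", [])
            ],
        })
    return targets
```

Each returned record carries what the schema needs: name and type for the (:BuildTarget) node, sources for [:COMPILES] edges, and depends_on as a first approximation of [:LINKS_TO].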
- Dialect Ingestion (scripts/ingest_dialects.py):
  - Mechanism: Invoke iree-tblgen --gen-dialect-doc (or -gen-json) on .td files to extract Op definitions, attributes, and types.
  - Graph Schema:
    - Nodes: (:Dialect {namespace}), (:Op {mnemonic}), (:TableGenFile {path}).
    - Edges: (:Op)-[:DEFINED_IN]->(:TableGenFile).
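As an illustration of the extraction step, the sketch below collects op records from an already parsed TableGen JSON dump. It assumes the dump follows the layout produced by llvm-tblgen --dump-json: a top-level "!instanceof" map from class name to def names, with references to other defs encoded as objects carrying a "def" key. Whether iree-tblgen emits the same layout, and under which flag, is an assumption to verify. The helper name extract_ops is hypothetical.

```python
def extract_ops(dump: dict, td_path: str) -> list[dict]:
    """Collect one record per def that derives from the ODS `Op` class."""
    ops = []
    for def_name in dump.get("!instanceof", {}).get("Op", []):
        record = dump[def_name]
        # `opDialect` refers to the Dialect def; its `name` is the namespace.
        dialect_def = record["opDialect"]["def"]
        namespace = dump[dialect_def]["name"]
        ops.append({
            "def": def_name,
            "dialect": namespace,
            # Fully qualified mnemonic, e.g. "linalg.matmul".
            "mnemonic": f"{namespace}.{record['opName']}",
            "defined_in": td_path,
        })
    return ops
```

The dialect, mnemonic, and defined_in fields map onto the (:Dialect), (:Op), and (:TableGenFile) nodes respectively.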
- Graph Unification Strategy:
  - Update src/ingest_pipeline.py to reconcile file paths between the SCIP index (C++) and the CMake File API (Build).
  - Establish the (:Op)-[:IMPLEMENTED_BY]->(:Class) relationship by correlating TableGen mnemonics with C++ class names (e.g., linalg.matmul -> mlir::linalg::MatmulOp).
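The two reconciliation steps above might look roughly like this. It is a sketch under stated assumptions: both indexes are keyed on paths relative to the source root, and an unqualified class name can be guessed from the mnemonic by CamelCasing the op part and appending "Op". The helper names normalize_path and candidate_class_name are hypothetical, and the naming rule is a heuristic that will miss ops whose class names use different casing.

```python
import os
import re


def normalize_path(path: str, source_root: str) -> str:
    """Return a '/'-separated path relative to source_root when possible.

    Paths outside source_root are returned as normalized absolute paths.
    """
    if not os.path.isabs(path):
        path = os.path.join(source_root, path)
    path = os.path.normpath(path)
    rel = os.path.relpath(path, source_root)
    if rel.startswith(".."):
        return path.replace(os.sep, "/")
    return rel.replace(os.sep, "/")


def candidate_class_name(mnemonic: str) -> str:
    """Guess an unqualified C++ class name from an op mnemonic (heuristic)."""
    op_part = mnemonic.split(".", 1)[-1]
    words = [w for w in re.split(r"[._]", op_part) if w]
    return "".join(w[:1].upper() + w[1:] for w in words) + "Op"
```

Because the guess is unreliable for some ops, comparing candidates case-insensitively and restricting matches to the dialect's namespace would likely reduce false links.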
Acceptance Criteria (Definition of Done)
We define success by the ability to traverse the graph across domains:
Test 1: CMake Target Resolution
- Input: The IREE build directory configured with File API.
- Condition: Run ingest_build.py → Query Neo4j for the target iree-compile (or a known library like IREEHal).
- Success: The query returns a (:BuildTarget) node that connects via [:COMPILES] to the specific list of .cpp source files defined in the CMakeLists.txt.
Test 2: TableGen to C++ Linkage
- Input: LinalgOps.td (upstream MLIR definition).
- Condition: Run ingest_dialects.py → Query Neo4j for the node (:Op {mnemonic: 'linalg.matmul'}).
- Success: The node exists and is linked via [:DEFINED_IN] to its definition file (:TableGenFile {path: '.../LinalgOps.td'}).
Test 3: Full Dependency Traversal
- Input: A specific source file, Tools/iree-compile/main.cpp.
- Condition: Execute the Cypher query MATCH (f:File {path: '...'})<-[:COMPILES]-(t:BuildTarget)-[:LINKS_TO]->(lib) RETURN lib.name.
- Success: Returns a list of static libraries (e.g., IREECompiler, LLVMSupport) that matches the actual CMake linker arguments.
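One way to automate this check is sketched below. It assumes a hypothetical run_query callable that takes a Cypher string and a parameter dict and returns rows indexable by column name (a thin wrapper around a Neo4j driver session would fit). The query is the one above with the path passed as a parameter, plus DISTINCT and a column alias added for convenience. The expected library list is supplied by the caller.

```python
LINKED_LIBS_QUERY = (
    "MATCH (f:File {path: $path})<-[:COMPILES]-(t:BuildTarget)"
    "-[:LINKS_TO]->(lib) "
    "RETURN DISTINCT lib.name AS name"
)


def linked_libraries(run_query, path: str) -> list[str]:
    """Return the sorted names of libraries linked by targets compiling `path`."""
    rows = run_query(LINKED_LIBS_QUERY, {"path": path})
    return sorted(row["name"] for row in rows)


def missing_libraries(graph_libs, expected_libs) -> list[str]:
    """Return expected libraries that the graph traversal did not find."""
    return sorted(set(expected_libs) - set(graph_libs))
```

The test passes when missing_libraries returns an empty list for the libraries taken from the actual link line.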
Test 4: Hybrid Retrieval Check
- Input: Natural language query "Find the build target for the linalg dialect".
- Condition: Run src/mlirAgent/tools/retriever.py.
- Success: The tool uses vector search to find the "Linalg" concept, then follows the graph edges to return the specific (:BuildTarget) name (e.g., MLIRLinalg).