[Infra][KG] Implement Unified Build & Dialect Graph Ingestion #4

@copparihollmann

Context

We have implemented C++ code indexing via scip-clang (Sourcegraph). However, the current graph has no knowledge of the build system (CMake targets) or of MLIR dialect definitions (TableGen). To enable "GraphRAG," we must connect the semantic vectors (LanceDB) to the structural relationships (Neo4j). We want a "Compiler Introspection" approach that queries CMake and TableGen directly instead of relying on fragile regex parsing.

Objective

Implement the ingest_build.py and ingest_dialects.py modules to query the CMake File API and iree-tblgen, respectively, and merge this data into the existing Neo4j graph to create a unified lineage: Build Target → Source File → C++ Symbol → MLIR Op.

Scope of Work

  1. Build System Ingestion (scripts/ingest_build.py):
    • Mechanism: Inject a query object into <build-dir>/.cmake/api/v1/query, trigger a CMake configure, and parse the resulting JSON reply from <build-dir>/.cmake/api/v1/reply.
    • Graph Schema:
      • Nodes: (:BuildTarget {name, type}), (:File {path}).
      • Edges: (:BuildTarget)-[:COMPILES]->(:File), (:BuildTarget)-[:LINKS_TO]->(:BuildTarget).
  2. Dialect Ingestion (scripts/ingest_dialects.py):
    • Mechanism: Invoke iree-tblgen --gen-dialect-doc (or -gen-json) on .td files to extract Op definitions, attributes, and types.
    • Graph Schema:
      • Nodes: (:Dialect {namespace}), (:Op {mnemonic}), (:TableGenFile {path}).
      • Edges: (:Op)-[:DEFINED_IN]->(:TableGenFile).
  3. Graph Unification Strategy:
    • Update src/ingest_pipeline.py to reconcile file paths between the SCIP index (C++) and the CMake API (Build).
    • Establish the (:Op)-[:IMPLEMENTED_BY]->(:Class) relationship by correlating TableGen mnemonics with C++ class names (e.g., linalg.matmul -> mlir::linalg::MatmulOp).
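The build-ingestion mechanism in item 1 could be sketched roughly as follows. This is a minimal sketch, not the final ingest_build.py: the function names request_codemodel and read_targets are hypothetical, and it assumes a stateless codemodel-v2 query and a single-config generator (only the first configuration is read). Note that the File API "dependencies" field lists target-level build dependencies, which only approximates [:LINKS_TO]; the exact link line lives under each target's "link" object.

```python
import json
from pathlib import Path


def request_codemodel(build_dir: Path) -> None:
    """Drop a stateless 'codemodel-v2' query file; CMake answers on the next configure."""
    query_dir = build_dir / ".cmake" / "api" / "v1" / "query"
    query_dir.mkdir(parents=True, exist_ok=True)
    (query_dir / "codemodel-v2").touch()


def read_targets(build_dir: Path) -> list[dict]:
    """Parse the current reply index and return one dict per CMake target."""
    reply_dir = build_dir / ".cmake" / "api" / "v1" / "reply"
    # The index with the lexicographically largest name is the current one.
    indexes = sorted(reply_dir.glob("index-*.json"))
    if not indexes:
        raise FileNotFoundError(f"no File API reply in {reply_dir}; run a CMake configure first")
    index = json.loads(indexes[-1].read_text())
    codemodel_file = index["reply"]["codemodel-v2"]["jsonFile"]
    codemodel = json.loads((reply_dir / codemodel_file).read_text())

    targets = []
    # Assumption: single-config generator (e.g. Ninja), so only configuration 0 matters.
    for ref in codemodel["configurations"][0]["targets"]:
        target = json.loads((reply_dir / ref["jsonFile"]).read_text())
        targets.append({
            "id": target["id"],
            "name": target["name"],
            "type": target["type"],
            "sources": [s["path"] for s in target.get("sources", [])],
            "depends_on": [d["id"] for d in target.get("dependencies", [])],
        })
    return targets
```

In use, request_codemodel(build_dir) would run before the configure step and read_targets(build_dir) after it; the returned dicts map directly onto (:BuildTarget {name, type}) and (:File {path}).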

Acceptance Criteria (Definition of Done)

We define success by the ability to traverse the graph across domains:

Test 1: CMake Target Resolution

  • Input: The IREE build directory configured with File API.
  • Condition: Run ingest_build.py → Query Neo4j for the target iree-compile (or a known library like IREEHal).
  • Success: The query returns a (:BuildTarget) node that connects via [:COMPILES] to the specific list of .cpp source files defined in the CMakeLists.txt.
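One way Test 1 might be exercised is sketched below. The Cypher statements follow the schema from the Scope of Work, but their exact shape, the constant names, and the compare_sources helper are assumptions made for illustration. The rows parameter is assumed to be a list of dicts with name, type, and sources keys.

```python
# Batched upsert of targets and their source files (sketch; assumes
# $rows = [{"name": ..., "type": ..., "sources": [...]}, ...]).
MERGE_TARGETS = """
UNWIND $rows AS row
MERGE (t:BuildTarget {name: row.name})
SET t.type = row.type
WITH t, row
UNWIND row.sources AS src
MERGE (f:File {path: src})
MERGE (t)-[:COMPILES]->(f)
"""

# Verification query for Test 1: list the files a given target compiles.
CHECK_TARGET = """
MATCH (t:BuildTarget {name: $name})-[:COMPILES]->(f:File)
RETURN f.path AS path
ORDER BY path
"""


def compare_sources(expected: list[str], returned: list[str]) -> tuple[list[str], list[str]]:
    """Return (missing, unexpected) paths; both empty means the test passes."""
    exp, got = set(expected), set(returned)
    return sorted(exp - got), sorted(got - exp)
```

With the official neo4j Python driver, these strings would typically be passed as session.run(MERGE_TARGETS, rows=rows) and session.run(CHECK_TARGET, name="iree-compile"); the expected list would come from the target's CMakeLists.txt.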

Test 2: TableGen to C++ Linkage

  • Input: LinalgOps.td (upstream MLIR definition).
  • Condition: Run ingest_dialects.py → Query Neo4j for the node (:Op {mnemonic: 'linalg.matmul'}).
  • Success: The node exists and is connected via [:DEFINED_IN] to its definition file (:TableGenFile {path: '.../LinalgOps.td'}).
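For the mnemonic-to-class correlation named in the Scope of Work, a naming heuristic could look like the sketch below. It is only a heuristic under stated assumptions: the helper name candidate_class is hypothetical, and the real C++ class name is derived from the TableGen def name, so the ingest should prefer that name whenever the TableGen output exposes it and fall back to guessing only when it does not.

```python
import re


def candidate_class(op_name: str) -> tuple[str, str]:
    """Guess (dialect, C++ class name) from a fully qualified op name.

    Heuristic only: CamelCase the mnemonic and append 'Op'.
    """
    dialect, _, mnemonic = op_name.partition(".")
    if not dialect or not mnemonic:
        raise ValueError(f"expected '<dialect>.<mnemonic>', got {op_name!r}")
    # Mnemonics may contain dots or underscores, e.g. 'executable.export'.
    parts = [p for p in re.split(r"[._]", mnemonic) if p]
    camel = "".join(p[0].upper() + p[1:] for p in parts)
    return dialect, camel + "Op"


print(candidate_class("linalg.matmul"))
print(candidate_class("hal.executable.export"))
```

The returned pair can then be matched against SCIP symbols whose enclosing namespace ends in the dialect name; ops with irregular casing will not match and need the def name instead.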

Test 3: Full Dependency Traversal

  • Input: A specific source file Tools/iree-compile/main.cpp.
  • Condition: Execute the Cypher query: MATCH (f:File {path: '...'})<-[:COMPILES]-(t:BuildTarget)-[:LINKS_TO]->(lib) RETURN lib.name.
  • Success: Returns a list of static libraries (e.g., IREECompiler, LLVMSupport) that matches the actual CMake linker arguments.
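This traversal only works if the (:File {path}) nodes written from the SCIP index and from the CMake File API use the same path form. The File API reports sources relative to the top-level source directory when they live inside it and as absolute paths otherwise, so one possible reconciliation is to normalize everything to a source-root-relative POSIX path. The helper name repo_relative is hypothetical, and the sketch assumes POSIX-style paths.

```python
import posixpath


def repo_relative(path: str, source_root: str) -> str:
    """Normalize a path to be relative to source_root when it lies inside it.

    Paths outside the source tree (e.g. generated files in the build
    directory) are returned as normalized absolute paths.
    """
    root = posixpath.normpath(source_root)
    if not posixpath.isabs(path):
        # Relative paths are interpreted against the source root.
        path = posixpath.join(root, path)
    path = posixpath.normpath(path)
    if path.startswith(root + "/"):
        return posixpath.relpath(path, root)
    return path
```

Applying the same function to both ingestion paths before the MERGE keeps the two sources from creating duplicate File nodes for one file.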

Test 4: Hybrid Retrieval Check

  • Input: Natural language query "Find the build target for the linalg dialect".
  • Condition: Run src/mlirAgent/tools/retriever.py.
  • Success: The tool uses vector search to find the "Linalg" concept, then follows the Graph edges to return the specific (:BuildTarget) name (e.g., MLIRLinalg).
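The two-stage flow in Test 4 can be sketched independently of LanceDB and Neo4j by passing both stages in as callables. Everything here is an assumption for illustration: the function name hybrid_retrieve, the callable signatures, and the stub data are not taken from retriever.py.

```python
from typing import Callable


def hybrid_retrieve(
    question: str,
    vector_search: Callable[[str, int], list[str]],
    graph_neighbors: Callable[[str], list[str]],
    k: int = 3,
) -> list[str]:
    """Stage 1: vector search for seed nodes. Stage 2: follow graph edges from each seed."""
    seen: set[str] = set()
    results: list[str] = []
    for node_id in vector_search(question, k):
        for target in graph_neighbors(node_id):
            if target not in seen:  # keep first-seen order, drop duplicates
                seen.add(target)
                results.append(target)
    return results


# Stubs standing in for the LanceDB lookup and the Neo4j traversal.
edges = {"dialect:linalg": ["MLIRLinalg"]}
hits = hybrid_retrieve(
    "Find the build target for the linalg dialect",
    vector_search=lambda q, k: ["dialect:linalg"][:k],
    graph_neighbors=lambda n: edges.get(n, []),
)
print(hits)
```

Keeping the stages injectable lets the test swap in fixed stubs, so a failure can be attributed to either the vector lookup or the graph traversal.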
