
Technical Proposal: Lance-Native Multimodal Knowledge Graph (MMKG) #91

Author: @beinan

Version: 1.0

Date: January 18, 2026

Target System: lance-context + lance-graph


1. Executive Summary

This proposal outlines the architecture for a Multimodal Knowledge Graph (MMKG) built natively on the Lance data format. By leveraging lance-graph (for Cypher-based graph traversal) and lance-context (for multimodal storage), we aim to create a system where nodes can be rich media objects (images, video clips, audio segments) linked by semantic and temporal relationships.

Unlike traditional MMKG approaches that store media in blob storage (S3) and graph topology in a separate graph DB (Neo4j), this architecture unifies both in a single, high-performance columnar format. This enables zero-copy retrieval of heavy media payloads during graph traversals, unlocking new capabilities for AI agents requiring deep multimodal reasoning.


2. Problem Statement: The "Split-Brain" of Multimodal AI

Current architectures for Multimodal RAG (Retrieval-Augmented Generation) typically fragment data across three distinct systems to handle the complexity of rich media:

  1. Vector Database (e.g., Milvus/Pinecone): Stores embeddings for similarity search (e.g., "Find images that look like this").
  2. Graph Database (e.g., Neo4j): Stores semantic relationships (e.g., "Person A appears in Video B").
  3. Blob Storage (e.g., S3/GCS): Stores the actual heavy media files (images, video clips, audio).

The Latency Tax: An agent reasoning about a video must perform a "Three-Hop Dance":

  1. Query Vector DB to find relevant timestamps.
  2. Query Graph DB to understand who is in the scene.
  3. Network request to S3 to fetch the actual frame to feed into a VLM (Vision-Language Model) like GPT-4o.

This introduces significant I/O latency, making real-time "watching and reasoning" agents sluggish.


3. The Solution: Unified Columnar MMKG

We propose a Lance-Native MMKG where the graph topology, vector indices, and binary media payloads coexist in a single, local-first columnar format.

By utilizing Lance's LargeBinary column type and efficient random access decoders, we can store video frames and audio clips directly inside the graph nodes.

Key Advantage: Zero-Copy Retrieval. A Cypher query doesn't just return a file path; it returns the actual image bytes in memory, ready for the VLM, with zero network overhead.


4. Schema Design

We define a schema optimized for Temporal and Semantic reasoning over media.

4.1 MediaNodes Table (The "Heavy" Nodes)

Unlike standard graphs where nodes are lightweight JSON objects, MediaNodes carry payloads.

| Column | Arrow Type | Description |
| --- | --- | --- |
| id | Utf8 | UUID |
| type | Dictionary<Int8, Utf8> | IMAGE, VIDEO_CLIP, AUDIO_SEGMENT |
| blob_data | LargeBinary | The raw bytes (JPEG, MP4 chunk, WAV). |
| embedding | FixedSizeList<Float32> | Multimodal embedding (e.g., CLIP-ViT, ImageBind). |
| start_time | Timestamp | Start time in source media timeline. |
| end_time | Timestamp | End time in source media timeline. |
| metadata | Struct | Tech specs (resolution, codec, source_url). |
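
For reference, this schema can be sketched in PyArrow as follows. The 512-dimension embedding width, millisecond timestamp resolution, and dataset path are illustrative assumptions, not part of the proposal:

import pyarrow as pa
import lance

# Sketch of the MediaNodes schema; embedding width (512) and millisecond
# timestamps are illustrative choices.
media_nodes_schema = pa.schema([
    pa.field("id", pa.utf8()),                               # UUID
    pa.field("type", pa.dictionary(pa.int8(), pa.utf8())),   # IMAGE / VIDEO_CLIP / AUDIO_SEGMENT
    pa.field("blob_data", pa.large_binary()),                # raw JPEG, MP4 chunk, WAV
    pa.field("embedding", pa.list_(pa.float32(), 512)),      # FixedSizeList<Float32>
    pa.field("start_time", pa.timestamp("ms")),
    pa.field("end_time", pa.timestamp("ms")),
    pa.field("metadata", pa.struct([
        pa.field("resolution", pa.utf8()),
        pa.field("codec", pa.utf8()),
        pa.field("source_url", pa.utf8()),
    ])),
])

# Create an (empty) MediaNodes dataset at a hypothetical path.
lance.write_dataset(media_nodes_schema.empty_table(), "mmkg/media_nodes.lance")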

4.2 TemporalEdges Table (The Time Layer)

Explicit edges for temporal ordering allow the graph engine to "play back" a sequence of nodes.

| Column | Arrow Type | Description |
| --- | --- | --- |
| source_id | Utf8 | Previous Clip ID |
| target_id | Utf8 | Next Clip ID |
| type | Dictionary | NEXT_SCENE, NEXT_FRAME |

4.3 SemanticEdges Table (The Knowledge Layer)

Standard edges linking media to concepts.

| Column | Arrow Type | Description |
| --- | --- | --- |
| source_id | Utf8 | MediaNode ID |
| target_id | Utf8 | Entity ID (Person, Object, Location) |
| type | Dictionary | CONTAINS, APPEARS_IN, MENTIONED |
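
Both edge tables can be sketched the same way (again a PyArrow sketch with hypothetical dataset paths):

import pyarrow as pa
import lance

edge_type = pa.dictionary(pa.int8(), pa.utf8())

# TemporalEdges: NEXT_SCENE / NEXT_FRAME ordering between clips.
temporal_edges_schema = pa.schema([
    pa.field("source_id", pa.utf8()),   # previous clip ID
    pa.field("target_id", pa.utf8()),   # next clip ID
    pa.field("type", edge_type),
])

# SemanticEdges: CONTAINS / APPEARS_IN / MENTIONED links from media to entities.
semantic_edges_schema = pa.schema([
    pa.field("source_id", pa.utf8()),   # MediaNode ID
    pa.field("target_id", pa.utf8()),   # entity ID (person, object, location)
    pa.field("type", edge_type),
])

lance.write_dataset(temporal_edges_schema.empty_table(), "mmkg/temporal_edges.lance")
lance.write_dataset(semantic_edges_schema.empty_table(), "mmkg/semantic_edges.lance")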

5. Enabled Capabilities

5.1 "Flashback" Queries (Temporal Traversal)

An agent can reason backwards in time without re-ingesting the video file.

  • User: "Why did the car crash?"
  • Agent Logic:
    • Find the "Crash" node via Vector Search.
    • Traverse (:Clip)-->(:Clip) to retrieve the 5 seconds before the crash.
    • Feed the blob_data of these 5 frames into GPT-4o.
    • Result: "The driver was texting 3 seconds prior."
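
A minimal sketch of this flashback flow using plain pylance/PyArrow, before any lance-graph integration. The dataset paths, the embed_text helper, and the five-step lookback are assumptions for illustration:

import lance
import pyarrow.compute as pc

media = lance.dataset("mmkg/media_nodes.lance")
edges = lance.dataset("mmkg/temporal_edges.lance").to_table()

# 1. Find the "crash" clip via vector search; embed_text is a hypothetical
#    wrapper around the same multimodal encoder used at ingest time.
crash_vec = embed_text("car crash")
hit = media.to_table(nearest={"column": "embedding", "q": crash_vec, "k": 1})
clip_id = hit["id"][0].as_py()

# 2. Walk NEXT edges backwards to collect the preceding clips.
history = []
for _ in range(5):
    prev = edges.filter(pc.field("target_id") == clip_id)
    if prev.num_rows == 0:
        break
    clip_id = prev["source_id"][0].as_py()
    history.append(clip_id)

# 3. Project only the frame bytes for those clips and hand them to the VLM.
frames = media.to_table(
    columns=["id", "blob_data"],
    filter=pc.field("id").isin(history),
)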

5.2 Cross-Modal Hybrid Search

Combine text semantics with visual similarity in a single query.

Option 1: Pure Cypher Query

  • User: "Find clips of the CEO smiling."
  • Query (Cypher):

MATCH (p:Person {name: "CEO"})<--(c:VideoClip)
WHERE vector.similarity(c.embedding, $smile_vector) > 0.8
RETURN c.blob_data

Option 2: Hybrid Query (see #83)

The execute_with_vector_rerank API allows users to run a Cypher query first, then filter/rerank the returned rows with a vector search. The rerank is performed in memory so it stays compatible with the Python execute API:

let query = CypherQuery::new(
        "MATCH (d:Document) \
         RETURN d.id, d.name, d.embedding",
    )?
    .with_config(config);

let results = query
    .execute_with_vector_rerank(
        datasets,
        VectorSearch::new("d.embedding")
            .query_vector(vec![1.0, 0.0, 0.0])
            .metric(DistanceMetric::L2)
            .top_k(3),
    )
    .await?;

Currently, the Lance-optimized vector search can only be run from the Rust side, e.g.:

let ann_results = VectorSearch::new("embedding")
    .query_vector(vec![1.0, 0.0, 0.0])
    .metric(DistanceMetric::L2)
    .top_k(5)
    .include_distance(true)
    .search_lance(&lance_dataset)
    .await?;

6. Implementation Roadmap

Phase 1: The MediaIngestor (Python)

Build a utility to slice and embed media into Lance.

  • Input: Video file (e.g., meeting.mp4).
  • Process:
    1. Detect scenes (PySceneDetect).
    2. Extract keyframes.
    3. Embed keyframes (CLIP/SigLIP).
    4. Write to MediaNodes table.
    5. Automatically generate NEXT edges in TemporalEdges.
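
A condensed sketch of the ingestor, assuming PySceneDetect and OpenCV for scene/keyframe extraction and a hypothetical embed_image wrapper around CLIP/SigLIP; the dataset paths and overwrite mode are illustrative:

import uuid
import cv2
import pyarrow as pa
import lance
from scenedetect import detect, ContentDetector

def ingest(video_path, embed_image):
    # 1. Detect scene boundaries.
    scenes = detect(video_path, ContentDetector())
    cap = cv2.VideoCapture(video_path)

    nodes, edges, prev_id = [], [], None
    for start, end in scenes:
        # 2. Keyframe = first frame of the scene, stored as JPEG bytes.
        cap.set(cv2.CAP_PROP_POS_FRAMES, start.get_frames())
        ok, frame = cap.read()
        if not ok:
            continue
        jpeg = cv2.imencode(".jpg", frame)[1].tobytes()
        node_id = str(uuid.uuid4())
        nodes.append({
            "id": node_id,
            "type": "VIDEO_CLIP",
            "blob_data": jpeg,
            "embedding": embed_image(jpeg),                 # 3. CLIP/SigLIP embedding
            "start_time": int(start.get_seconds() * 1000),  # ms offsets into the source
            "end_time": int(end.get_seconds() * 1000),
        })
        # 5. Automatic NEXT_SCENE edge to the previous clip.
        if prev_id is not None:
            edges.append({"source_id": prev_id, "target_id": node_id, "type": "NEXT_SCENE"})
        prev_id = node_id

    # 4. Persist both tables (a real ingestor would cast to the schemas in Section 4).
    lance.write_dataset(pa.Table.from_pylist(nodes), "mmkg/media_nodes.lance", mode="overwrite")
    lance.write_dataset(pa.Table.from_pylist(edges), "mmkg/temporal_edges.lance", mode="overwrite")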

Phase 2: Schema Extensions (Rust)

  • Modify lance-context to support lazy-loading of LargeBinary columns. We want to ensure we don't load the image bytes unless the query explicitly asks for RETURN n.blob_data.
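
Lance's column projection already provides the read-side primitive; the Cypher planner only needs to push the projection through. For example, with pylance (hypothetical path):

import lance

media = lance.dataset("mmkg/media_nodes.lance")

# Topology-only traversal: never touches the heavy blob_data pages.
light = media.to_table(columns=["id", "type", "start_time", "end_time"])

# Only when the query returns n.blob_data do we project the payload column.
heavy = media.to_table(columns=["id", "blob_data"])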

Phase 3: Graph Query Extensions

  • Add custom Cypher functions for multimodal operations:
    • visual_similarity(n, $image_bytes): Computes distance on the fly.
    • temporal_window(n, seconds=5): Returns the neighborhood of nodes within a time window.
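
Until these functions exist in the Cypher layer, the intended semantics of temporal_window can be expressed as a plain scan over MediaNodes (a sketch that assumes start_time/end_time are stored as millisecond offsets):

import lance
import pyarrow.compute as pc

def temporal_window(start_ms, end_ms, seconds=5):
    # All MediaNodes whose time range overlaps [start - seconds, end + seconds].
    media = lance.dataset("mmkg/media_nodes.lance")
    lo, hi = start_ms - seconds * 1000, end_ms + seconds * 1000
    return media.to_table(
        columns=["id", "type", "start_time", "end_time"],
        filter=(pc.field("end_time") >= lo) & (pc.field("start_time") <= hi),
    )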

7. Example Workflow

Scenario: An "AI Video Editor" Agent.

  1. Ingest: User drops raw footage into the agent's folder.
  2. Process: Agent runs MediaIngestor, populating the lance-graph.
  3. Prompt: "Make a supercut of all the funny moments."
  4. Execution:
    • Agent searches MediaNodes for "laughter", "smiling", "joke".
    • Agent retrieves blob_data for matching clips.
    • Agent concatenates bytes using ffmpeg.
  5. Output: funny_supercut.mp4 generated in seconds, purely from the graph data.
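
The concatenation step is a straightforward hand-off to ffmpeg once the bytes are out of the graph (a sketch; it assumes each blob_data chunk is an independently playable MP4 segment):

import os
import subprocess
import tempfile

def make_supercut(clips, out_path="funny_supercut.mp4"):
    # clips: iterable of (id, blob_data) rows retrieved from MediaNodes.
    tmpdir = tempfile.mkdtemp()
    list_path = os.path.join(tmpdir, "clips.txt")
    with open(list_path, "w") as listing:
        for clip_id, blob in clips:
            clip_path = os.path.join(tmpdir, f"{clip_id}.mp4")
            with open(clip_path, "wb") as f:
                f.write(blob)  # bytes come straight from the graph, no S3 round-trip
            listing.write(f"file '{clip_path}'\n")
    # Concatenate without re-encoding via ffmpeg's concat demuxer.
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", list_path, "-c", "copy", out_path],
        check=True,
    )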
