Version: 1.0
Date: January 18, 2026
Target System: lance-context + lance-graph
1. Executive Summary
This proposal outlines the architecture for a Multimodal Knowledge Graph (MMKG) built natively on the Lance data format. By leveraging lance-graph (for Cypher-based graph traversal) and lance-context (for multimodal storage), we aim to create a system where nodes can be rich media objects (images, video clips, audio segments) linked by semantic and temporal relationships.
Unlike traditional MMKG approaches that store media in blob storage (S3) and graph topology in a separate graph DB (Neo4j), this architecture unifies both in a single, high-performance columnar format. This enables zero-copy retrieval of heavy media payloads during graph traversals, unlocking new capabilities for AI agents requiring deep multimodal reasoning.
2. Problem Statement: The "Split-Brain" of Multimodal AI
Current architectures for Multimodal RAG (Retrieval-Augmented Generation) typically fragment data across three distinct systems to handle the complexity of rich media:
- Vector Database (e.g., Milvus/Pinecone): Stores embeddings for similarity search (e.g., "Find images that look like this").
- Graph Database (e.g., Neo4j): Stores semantic relationships (e.g., "Person A appears in Video B").
- Blob Storage (e.g., S3/GCS): Stores the actual heavy media files (images, video clips, audio).
The Latency Tax: An agent reasoning about a video must perform a "Three-Hop Dance":
- Query Vector DB to find relevant timestamps.
- Query Graph DB to understand who is in the scene.
- Issue a network request to S3 to fetch the actual frame to feed into a VLM (Vision-Language Model) like GPT-4o.
This introduces significant I/O latency, making real-time "watching and reasoning" agents sluggish.
3. The Solution: Unified Columnar MMKG
We propose a Lance-Native MMKG where the graph topology, vector indices, and binary media payloads coexist in a single, local-first columnar format.
By utilizing Lance's LargeBinary column type and efficient random access decoders, we can store video frames and audio clips directly inside the graph nodes.
Key Advantage: Zero-Copy Retrieval. A Cypher query doesn't just return a file path; it returns the actual image bytes in memory, ready for the VLM, with zero network overhead.
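As a minimal sketch of this access pattern (assuming a MediaNodes dataset at ./medianodes.lance, as defined in Section 4.1 below), the standard lance Python API can fetch the heavy payload for specific rows without a full scan or any call to an external blob store:

```python
import lance

# Open the MediaNodes dataset (the path is an assumption for this sketch).
ds = lance.dataset("./medianodes.lance")

# Random-access fetch of the payload column for specific rows.
# Lance decodes only the requested rows/columns, so there is no full
# scan and no round-trip to S3.
tbl = ds.take([3, 7, 42], columns=["id", "blob_data"])
frame_bytes = tbl.column("blob_data")[0].as_py()  # raw JPEG/MP4 bytes
```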
4. Schema Design
We define a schema optimized for Temporal and Semantic reasoning over media.
4.1 MediaNodes Table (The "Heavy" Nodes)
Unlike standard graphs where nodes are lightweight JSON objects, MediaNodes carry payloads.
| Column | Arrow Type | Description |
|---|---|---|
| id | Utf8 | UUID |
| type | Dictionary<Int8, Utf8> | IMAGE, VIDEO_CLIP, AUDIO_SEGMENT |
| blob_data | LargeBinary | The raw bytes (JPEG, MP4 chunk, WAV). |
| embedding | FixedSizeList<Float32> | Multimodal embedding (e.g., CLIP-ViT, ImageBind). |
| start_time | Timestamp | Start time in source media timeline. |
| end_time | Timestamp | End time in source media timeline. |
| metadata | Struct | Tech specs (resolution, codec, source_url). |
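For concreteness, a minimal pyarrow sketch of this schema (the 512-dimension embedding and the specific metadata fields are assumptions; both depend on the chosen encoder and pipeline):

```python
import pyarrow as pa

EMBED_DIM = 512  # assumption; depends on the embedding model (e.g., CLIP ViT-B/32)

media_nodes_schema = pa.schema([
    pa.field("id", pa.utf8()),
    pa.field("type", pa.dictionary(pa.int8(), pa.utf8())),
    pa.field("blob_data", pa.large_binary()),
    pa.field("embedding", pa.list_(pa.float32(), EMBED_DIM)),  # FixedSizeList
    pa.field("start_time", pa.timestamp("us")),  # microsecond unit assumed
    pa.field("end_time", pa.timestamp("us")),
    pa.field("metadata", pa.struct([
        pa.field("resolution", pa.utf8()),
        pa.field("codec", pa.utf8()),
        pa.field("source_url", pa.utf8()),
    ])),
])
```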
4.2 TemporalEdges Table (The Time Layer)
Explicit edges for temporal ordering allow the graph engine to "playback" a sequence of nodes.
| Column | Arrow Type | Description |
|---|---|---|
| source_id | Utf8 | Previous Clip ID |
| target_id | Utf8 | Next Clip ID |
| type | Dictionary<Int8, Utf8> | NEXT_SCENE, NEXT_FRAME |
4.3 SemanticEdges Table (The Knowledge Layer)
Standard edges linking media to concepts.
| Column | Arrow Type | Description |
|---|---|---|
| source_id | Utf8 | MediaNode ID |
| target_id | Utf8 | Entity ID (Person, Object, Location) |
| type | Dictionary<Int8, Utf8> | CONTAINS, APPEARS_IN, MENTIONED |
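Both edge tables share the same lightweight shape; sketched in the same pyarrow style as above:

```python
import pyarrow as pa

# Shared shape for TemporalEdges and SemanticEdges; only the dictionary
# values differ (NEXT_SCENE/NEXT_FRAME vs. CONTAINS/APPEARS_IN/MENTIONED).
edge_schema = pa.schema([
    pa.field("source_id", pa.utf8()),
    pa.field("target_id", pa.utf8()),
    pa.field("type", pa.dictionary(pa.int8(), pa.utf8())),
])
```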
5. Enabled Capabilities
5.1 "Flashback" Queries (Temporal Traversal)
An agent can reason backwards in time without re-ingesting the video file.
- User: "Why did the car crash?"
- Agent Logic:
  - Find the "Crash" node via Vector Search.
  - Traverse the temporal edges backwards from the crash node ((prior:Clip)-[:NEXT_FRAME]->(crash:Clip), sketched below) to retrieve the 5 seconds before the crash.
  - Feed the blob_data of these 5 frames into GPT-4o.
- Result: "The driver was texting 3 seconds prior."
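A sketch of this traversal in Cypher (the variable-length pattern, the 5-hop bound, and the $crash_id parameter are illustrative, not a fixed API):

```cypher
// Walk back up to 5 NEXT_FRAME hops from the crash node
MATCH (prior:Clip)-[:NEXT_FRAME*1..5]->(crash:Clip {id: $crash_id})
RETURN prior.blob_data, prior.start_time
ORDER BY prior.start_time
```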
5.2 Cross-Modal Hybrid Search
Combine text semantics with visual similarity in a single query.
Option 1: Pure Cypher Query
- User: "Find clips of the CEO smiling."
- Query:

```cypher
MATCH (p:Person {name: "CEO"})<--(c:VideoClip)
WHERE vector.similarity(c.embedding, $smile_vector) > 0.8
RETURN c.blob_data
```
Option 2: Hybrid Query
The execute_with_vector_rerank API runs the Cypher query first, then filters/reranks the returned rows with an in-memory vector search (in-memory only, so it stays compatible with the Python execute API):
```rust
// Cypher narrows the candidate set; the vector search reranks it in memory.
let query = CypherQuery::new(
    "MATCH (d:Document) \
     RETURN d.id, d.name, d.embedding",
)?
.with_config(config);

let results = query
    .execute_with_vector_rerank(
        datasets,
        VectorSearch::new("d.embedding")
            .query_vector(vec![1.0, 0.0, 0.0])
            .metric(DistanceMetric::L2)
            .top_k(3),
    )
    .await?;
```
The Lance-optimized vector search can currently be run only from the Rust side, e.g.:
```rust
// ANN search executed directly against the Lance dataset's vector index.
let ann_results = VectorSearch::new("embedding")
    .query_vector(vec![1.0, 0.0, 0.0])
    .metric(DistanceMetric::L2)
    .top_k(5)
    .include_distance(true)
    .search_lance(&lance_dataset)
    .await?;
```
6. Implementation Roadmap
Phase 1: The MediaIngestor (Python)
Build a utility to slice and embed media into Lance (a sketch follows this list).
- Input: Video file (e.g., meeting.mp4).
- Process:
  - Detect scenes (PySceneDetect).
  - Extract keyframes.
  - Embed keyframes (CLIP/SigLIP).
  - Write to the MediaNodes table.
  - Automatically generate NEXT edges in TemporalEdges.
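A minimal sketch of the ingestor under stated assumptions: PySceneDetect's detect API, a Hugging Face CLIP checkpoint, and the schemas from Section 4. The extract_keyframe helper is hypothetical, and the timestamp/metadata columns are omitted for brevity:

```python
import io
import uuid

import lance
import pyarrow as pa
import torch
from scenedetect import detect, ContentDetector
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def ingest(video_path: str):
    nodes, edges, prev_id = [], [], None
    for start, end in detect(video_path, ContentDetector()):
        # Keyframe extraction elided: assume a helper returning a PIL image
        # for the scene's first frame (e.g., via OpenCV seek + read).
        frame = extract_keyframe(video_path, start.get_frames())  # hypothetical helper

        # Embed the keyframe with CLIP.
        inputs = processor(images=frame, return_tensors="pt")
        with torch.no_grad():
            emb = model.get_image_features(**inputs)[0].tolist()

        # Serialize the keyframe as JPEG bytes for the blob_data column.
        buf = io.BytesIO()
        frame.save(buf, format="JPEG")

        node_id = str(uuid.uuid4())
        nodes.append({"id": node_id, "type": "VIDEO_CLIP",
                      "blob_data": buf.getvalue(), "embedding": emb})
        # Chain consecutive scenes with NEXT_SCENE edges.
        if prev_id is not None:
            edges.append({"source_id": prev_id, "target_id": node_id,
                          "type": "NEXT_SCENE"})
        prev_id = node_id

    lance.write_dataset(pa.Table.from_pylist(nodes), "medianodes.lance")
    lance.write_dataset(pa.Table.from_pylist(edges), "temporal_edges.lance")
```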
Phase 2: Schema Extensions (Rust)
- Modify lance-context to support lazy-loading of LargeBinary columns. We want to ensure we don't load the image bytes unless the query explicitly asks for RETURN n.blob_data.
Phase 3: Graph Query Extensions
- Add custom Cypher functions for multimodal operations (hypothetical usage sketched below):
- visual_similarity(n, $image_bytes): Computes distance on the fly.
- temporal_window(n, seconds=5): Returns the neighborhood of nodes within a time window.
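Hypothetical usage once these functions land (neither exists in lance-graph today; the signatures are this proposal's, not a current API):

```cypher
// Proposed functions; signatures follow the roadmap items above
MATCH (c:VideoClip)
WHERE visual_similarity(c, $image_bytes) > 0.9
RETURN temporal_window(c, 5)  // nodes within a 5-second window of each match
```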
7. Example Workflow
Scenario: An "AI Video Editor" Agent.
- Ingest: User drops raw footage into the agent's folder.
- Process: Agent runs MediaIngestor, populating the lance-graph.
- Prompt: "Make a supercut of all the funny moments."
- Execution:
  - Agent searches MediaNodes for "laughter", "smiling", "joke".
  - Agent retrieves blob_data for matching clips.
  - Agent concatenates the clip bytes using ffmpeg (see the sketch below).
- Output: funny_supercut.mp4 generated in seconds, purely from the graph data.
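A sketch of the concatenation step, assuming each blob is a self-contained MP4 chunk with compatible codecs, ffmpeg is on PATH, and funny_row_indices was already produced by the search step (all three are assumptions):

```python
import subprocess
import tempfile

import lance

ds = lance.dataset("./medianodes.lance")
# funny_row_indices: row indices returned by the search step (assumed)
clips = ds.take(funny_row_indices, columns=["blob_data"])

# Write each MP4 chunk to a temp file for ffmpeg's concat demuxer.
paths = []
for blob in clips.column("blob_data"):
    f = tempfile.NamedTemporaryFile(suffix=".mp4", delete=False)
    f.write(blob.as_py())
    f.close()
    paths.append(f.name)

# The concat demuxer takes one "file '<path>'" line per clip.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as listing:
    listing.write("\n".join(f"file '{p}'" for p in paths))

# Stream-copy concat: no re-encode, so the supercut is produced in seconds.
subprocess.run(
    ["ffmpeg", "-f", "concat", "-safe", "0", "-i", listing.name,
     "-c", "copy", "funny_supercut.mp4"],
    check=True,
)
```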