
Technical Proposal: Lance-Native Multimodal Knowledge Graph (MMKG) #91

Author: @beinan

Version: 1.0

Date: January 18, 2026

Target System: lance-context + lance-graph


1. Executive Summary

This proposal outlines the architecture for a Multimodal Knowledge Graph (MMKG) built natively on the Lance data format. By leveraging lance-graph (for Cypher-based graph traversal) and lance-context (for multimodal storage), we aim to create a system where nodes can be rich media objects (images, video clips, audio segments) linked by semantic and temporal relationships.

Unlike traditional MMKG approaches that store media in blob storage (S3) and graph topology in a separate graph DB (Neo4j), this architecture unifies both in a single, high-performance columnar format. This enables zero-copy retrieval of heavy media payloads during graph traversals, unlocking new capabilities for AI agents requiring deep multimodal reasoning.


2. Problem Statement: The "Split-Brain" of Multimodal AI

Current architectures for Multimodal RAG (Retrieval-Augmented Generation) typically fragment data across three distinct systems to handle the complexity of rich media:

  1. Vector Database (e.g., Milvus/Pinecone): Stores embeddings for similarity search (e.g., "Find images that look like this").
  2. Graph Database (e.g., Neo4j): Stores semantic relationships (e.g., "Person A appears in Video B").
  3. Blob Storage (e.g., S3/GCS): Stores the actual heavy media files (images, video clips, audio).

The Latency Tax: An agent reasoning about a video must perform a "Three-Hop Dance":

  1. Query Vector DB to find relevant timestamps.
  2. Query Graph DB to understand who is in the scene.
  3. Network request to S3 to fetch the actual frame to feed into a VLM (Vision-Language Model) like GPT-4o.

This introduces significant I/O latency, making real-time "watching and reasoning" agents sluggish.


3. The Solution: Unified Columnar MMKG

We propose a Lance-Native MMKG where the graph topology, vector indices, and binary media payloads coexist in a single, local-first columnar format.

By utilizing Lance's LargeBinary column type and efficient random access decoders, we can store video frames and audio clips directly inside the graph nodes.

Key Advantage: Zero-Copy Retrieval. A Cypher query doesn't just return a file path; it returns the actual image bytes in memory, ready for the VLM, with zero network overhead.


4. Schema Design

We define a schema optimized for Temporal and Semantic reasoning over media.

4.1 MediaNodes Table (The "Heavy" Nodes)

Unlike standard graphs where nodes are lightweight JSON objects, MediaNodes carry payloads.

| Column | Arrow Type | Description |
| --- | --- | --- |
| id | Utf8 | UUID |
| type | Dictionary<Int8, Utf8> | IMAGE, VIDEO_CLIP, AUDIO_SEGMENT |
| blob_data | LargeBinary | The raw bytes (JPEG, MP4 chunk, WAV). |
| embedding | FixedSizeList<Float32> | Multimodal embedding (e.g., CLIP-ViT, ImageBind). |
| start_time | Timestamp | Start time in source media timeline. |
| end_time | Timestamp | End time in source media timeline. |
| metadata | Struct | Tech specs (resolution, codec, source_url). |
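
For reference, this schema can be sketched in PyArrow as follows. The 512-dimension embedding width, millisecond timestamp resolution, and dataset path are illustrative assumptions, not part of the proposal:

import pyarrow as pa
import lance

# Sketch of the MediaNodes schema; embedding width (512) and millisecond
# timestamps are illustrative choices.
media_nodes_schema = pa.schema([
    pa.field("id", pa.utf8()),                               # UUID
    pa.field("type", pa.dictionary(pa.int8(), pa.utf8())),   # IMAGE / VIDEO_CLIP / AUDIO_SEGMENT
    pa.field("blob_data", pa.large_binary()),                # raw JPEG, MP4 chunk, WAV
    pa.field("embedding", pa.list_(pa.float32(), 512)),      # FixedSizeList<Float32>
    pa.field("start_time", pa.timestamp("ms")),
    pa.field("end_time", pa.timestamp("ms")),
    pa.field("metadata", pa.struct([
        pa.field("resolution", pa.utf8()),
        pa.field("codec", pa.utf8()),
        pa.field("source_url", pa.utf8()),
    ])),
])

# Create an (empty) MediaNodes dataset at a hypothetical path.
lance.write_dataset(media_nodes_schema.empty_table(), "mmkg/media_nodes.lance")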

4.2 TemporalEdges Table (The Time Layer)

Explicit edges for temporal ordering allow the graph engine to "play back" a sequence of nodes.

| Column | Arrow Type | Description |
| --- | --- | --- |
| source_id | Utf8 | Previous Clip ID |
| target_id | Utf8 | Next Clip ID |
| type | Dictionary | NEXT_SCENE, NEXT_FRAME |

4.3 SemanticEdges Table (The Knowledge Layer)

Standard edges linking media to concepts.

| Column | Arrow Type | Description |
| --- | --- | --- |
| source_id | Utf8 | MediaNode ID |
| target_id | Utf8 | Entity ID (Person, Object, Location) |
| type | Dictionary | CONTAINS, APPEARS_IN, MENTIONED |
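
Both edge tables can be sketched the same way (again a PyArrow sketch with hypothetical dataset paths):

import pyarrow as pa
import lance

edge_type = pa.dictionary(pa.int8(), pa.utf8())

# TemporalEdges: NEXT_SCENE / NEXT_FRAME ordering between clips.
temporal_edges_schema = pa.schema([
    pa.field("source_id", pa.utf8()),   # previous clip ID
    pa.field("target_id", pa.utf8()),   # next clip ID
    pa.field("type", edge_type),
])

# SemanticEdges: CONTAINS / APPEARS_IN / MENTIONED links from media to entities.
semantic_edges_schema = pa.schema([
    pa.field("source_id", pa.utf8()),   # MediaNode ID
    pa.field("target_id", pa.utf8()),   # entity ID (person, object, location)
    pa.field("type", edge_type),
])

lance.write_dataset(temporal_edges_schema.empty_table(), "mmkg/temporal_edges.lance")
lance.write_dataset(semantic_edges_schema.empty_table(), "mmkg/semantic_edges.lance")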

5. Enabled Capabilities

5.1 "Flashback" Queries (Temporal Traversal)

An agent can reason backwards in time without re-ingesting the video file.

  • User: "Why did the car crash?"
  • Agent Logic:
    • Find the "Crash" node via Vector Search.
    • Traverse (:Clip)-->(:Clip) to retrieve the 5 seconds before the crash.
    • Feed the blob_data of these 5 frames into GPT-4o.
    • Result: "The driver was texting 3 seconds prior."
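
A minimal sketch of this flashback flow using plain pylance/PyArrow, before any lance-graph integration. The dataset paths, the embed_text helper, and the five-step lookback are assumptions for illustration:

import lance
import pyarrow.compute as pc

media = lance.dataset("mmkg/media_nodes.lance")
edges = lance.dataset("mmkg/temporal_edges.lance").to_table()

# 1. Find the "crash" clip via vector search; embed_text is a hypothetical
#    wrapper around the same multimodal encoder used at ingest time.
crash_vec = embed_text("car crash")
hit = media.to_table(nearest={"column": "embedding", "q": crash_vec, "k": 1})
clip_id = hit["id"][0].as_py()

# 2. Walk NEXT edges backwards to collect the preceding clips.
history = []
for _ in range(5):
    prev = edges.filter(pc.field("target_id") == clip_id)
    if prev.num_rows == 0:
        break
    clip_id = prev["source_id"][0].as_py()
    history.append(clip_id)

# 3. Project only the frame bytes for those clips and hand them to the VLM.
frames = media.to_table(
    columns=["id", "blob_data"],
    filter=pc.field("id").isin(history),
)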

5.2 Cross-Modal Hybrid Search

Combine text semantics with visual similarity in a single query.

Option 1: Pure Cypher Query

  • User: "Find clips of the CEO smiling."
  • Query (Cypher):

MATCH (p:Person {name: "CEO"})<--(c:VideoClip)
WHERE vector.similarity(c.embedding, $smile_vector) > 0.8
RETURN c.blob_data

Option 2: Hybrid Query (see #83)

The execute_with_vector_rerank API allows users to run a Cypher query first, then filter/rerank the returned rows with a vector search. The rerank is performed in memory so it stays compatible with the Python execute API:

let query = CypherQuery::new(
        "MATCH (d:Document) \
         RETURN d.id, d.name, d.embedding",
    )?
    .with_config(config);

let results = query
    .execute_with_vector_rerank(
        datasets,
        VectorSearch::new("d.embedding")
            .query_vector(vec![1.0, 0.0, 0.0])
            .metric(DistanceMetric::L2)
            .top_k(3),
    )
    .await?;

Currently, the Lance-optimized vector search can only be run from the Rust side, e.g.:

let ann_results = VectorSearch::new("embedding")
    .query_vector(vec![1.0, 0.0, 0.0])
    .metric(DistanceMetric::L2)
    .top_k(5)
    .include_distance(true)
    .search_lance(&lance_dataset)
    .await?;

6. Implementation Roadmap

Phase 1: The MediaIngestor (Python)

Build a utility to slice and embed media into Lance.

  • Input: Video file (e.g., meeting.mp4).
  • Process:
    1. Detect scenes (PySceneDetect).
    2. Extract keyframes.
    3. Embed keyframes (CLIP/SigLIP).
    4. Write to MediaNodes table.
    5. Automatically generate NEXT edges in TemporalEdges.
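
A condensed sketch of the ingestor, assuming PySceneDetect and OpenCV for scene/keyframe extraction and a hypothetical embed_image wrapper around CLIP/SigLIP; the dataset paths and overwrite mode are illustrative:

import uuid
import cv2
import pyarrow as pa
import lance
from scenedetect import detect, ContentDetector

def ingest(video_path, embed_image):
    # 1. Detect scene boundaries.
    scenes = detect(video_path, ContentDetector())
    cap = cv2.VideoCapture(video_path)

    nodes, edges, prev_id = [], [], None
    for start, end in scenes:
        # 2. Keyframe = first frame of the scene, stored as JPEG bytes.
        cap.set(cv2.CAP_PROP_POS_FRAMES, start.get_frames())
        ok, frame = cap.read()
        if not ok:
            continue
        jpeg = cv2.imencode(".jpg", frame)[1].tobytes()
        node_id = str(uuid.uuid4())
        nodes.append({
            "id": node_id,
            "type": "VIDEO_CLIP",
            "blob_data": jpeg,
            "embedding": embed_image(jpeg),                 # 3. CLIP/SigLIP embedding
            "start_time": int(start.get_seconds() * 1000),  # ms offsets into the source
            "end_time": int(end.get_seconds() * 1000),
        })
        # 5. Automatic NEXT_SCENE edge to the previous clip.
        if prev_id is not None:
            edges.append({"source_id": prev_id, "target_id": node_id, "type": "NEXT_SCENE"})
        prev_id = node_id

    # 4. Persist both tables (a real ingestor would cast to the schemas in Section 4).
    lance.write_dataset(pa.Table.from_pylist(nodes), "mmkg/media_nodes.lance", mode="overwrite")
    lance.write_dataset(pa.Table.from_pylist(edges), "mmkg/temporal_edges.lance", mode="overwrite")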

Phase 2: Schema Extensions (Rust)

  • Modify lance-context to support lazy-loading of LargeBinary columns. We want to ensure we don't load the image bytes unless the query explicitly asks for RETURN n.blob_data.
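
Lance's column projection already provides the read-side primitive; the Cypher planner only needs to push the projection through. For example, with pylance (hypothetical path):

import lance

media = lance.dataset("mmkg/media_nodes.lance")

# Topology-only traversal: never touches the heavy blob_data pages.
light = media.to_table(columns=["id", "type", "start_time", "end_time"])

# Only when the query returns n.blob_data do we project the payload column.
heavy = media.to_table(columns=["id", "blob_data"])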

Phase 3: Graph Query Extensions

  • Add custom Cypher functions for multimodal operations:
    • visual_similarity(n, $image_bytes): Computes distance on the fly.
    • temporal_window(n, seconds=5): Returns the neighborhood of nodes within a time window.
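
Until these functions exist in the Cypher layer, the intended semantics of temporal_window can be expressed as a plain scan over MediaNodes (a sketch that assumes start_time/end_time are stored as millisecond offsets):

import lance
import pyarrow.compute as pc

def temporal_window(start_ms, end_ms, seconds=5):
    # All MediaNodes whose time range overlaps [start - seconds, end + seconds].
    media = lance.dataset("mmkg/media_nodes.lance")
    lo, hi = start_ms - seconds * 1000, end_ms + seconds * 1000
    return media.to_table(
        columns=["id", "type", "start_time", "end_time"],
        filter=(pc.field("end_time") >= lo) & (pc.field("start_time") <= hi),
    )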

7. Example Workflow

Scenario: An "AI Video Editor" Agent.

  1. Ingest: User drops raw footage into the agent's folder.
  2. Process: Agent runs MediaIngestor, populating the lance-graph.
  3. Prompt: "Make a supercut of all the funny moments."
  4. Execution:
    • Agent searches MediaNodes for "laughter", "smiling", "joke".
    • Agent retrieves blob_data for matching clips.
    • Agent concatenates bytes using ffmpeg.
  5. Output: funny_supercut.mp4 generated in seconds, purely from the graph data.
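
The concatenation step is a straightforward hand-off to ffmpeg once the bytes are out of the graph (a sketch; it assumes each blob_data chunk is an independently playable MP4 segment):

import os
import subprocess
import tempfile

def make_supercut(clips, out_path="funny_supercut.mp4"):
    # clips: iterable of (id, blob_data) rows retrieved from MediaNodes.
    tmpdir = tempfile.mkdtemp()
    list_path = os.path.join(tmpdir, "clips.txt")
    with open(list_path, "w") as listing:
        for clip_id, blob in clips:
            clip_path = os.path.join(tmpdir, f"{clip_id}.mp4")
            with open(clip_path, "wb") as f:
                f.write(blob)  # bytes come straight from the graph, no S3 round-trip
            listing.write(f"file '{clip_path}'\n")
    # Concatenate without re-encoding via ffmpeg's concat demuxer.
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", list_path, "-c", "copy", out_path],
        check=True,
    )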
