Knowledge Store Plugin Architecture - Implementation Plan
Version: 4.0 (Technology-Agnostic API)
Status: Draft for Review
Prerequisite: Issue #203 (KnowledgeStore Setups) must be implemented first
Executive Summary
This proposal extends the KnowledgeStore Setup infrastructure from Issue #203 to enable technology-agnostic retrieval backends for KB-Server. By expanding organization-level setup management to support multiple plugin types, we achieve:
- Technology-agnostic retrieval - Same API for vector DBs, graph DBs, keyword search, hybrid systems
- Per-collection backend selection - Different collections can use different technologies
- Seamless backend migration - Change backends without recreating collections
- Simplified service layer - Services work with chunks + metadata only, never touching embeddings/graphs/indexes
- Full backward compatibility - ChromaDB remains default, all existing functionality preserved
Relationship to Issue #203
Issue #203 provides:
- Organization-level setup management
- Collections reference setups (not inline configs)
- API key rotation capability
- Setup reusability across collections
This proposal extends #203 by:
- Renaming embeddings_setups → knowledge_store_setups
- Adding plugin_type field to specify backend technology
- Making plugin_config generic JSON (not embedding-specific)
- Implementing plugin architecture that reads from setups
- Enabling multi-backend support (vector, graph, keyword, hybrid)
Architecture Overview
Current State (After Issue #203)
Organizations
↓ 1:N
Embeddings Setups (embeddings-specific)
↓ 1:N
Collections
↓
ChromaDB (hardcoded, ONLY backend)
Problems:
- ChromaDB-only (can't support Neo4j, ElasticSearch, etc.)
- embeddings_setups name implies vector-only
- Setup config is embeddings-specific
- Services directly call ChromaDB APIs
Target State (Plugin Architecture)
Organizations
↓ 1:N
KnowledgeStore Setups (renamed & expanded)
Setup 1: ChromaDB + OpenAI (plugin_type: "chromadb")
Setup 2: Neo4j Graph (plugin_type: "neo4j")
Setup 3: ElasticSearch (plugin_type: "elasticsearch")
↓ 1:N
Collections (unchanged - still reference setup_id)
↓
Service Layer (technology-agnostic: chunks + metadata only)
↓
KnowledgeStorePlugin Interface (abstract)
↓
┌──────────┬──────────┬──────────┬──────────┐
│ ChromaDB │ Qdrant │ Neo4j │ElasticSch│
│ Plugin │ Plugin │ Plugin │ Plugin │
└──────────┴──────────┴──────────┴──────────┘
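The dispatch at the bottom of this diagram could be sketched roughly as follows. This is a minimal sketch, not the proposal's actual registry: only the `@PluginRegistry.register` decorator form appears later in this document, so the internal dictionary and the `get` and `available` methods are assumptions made for illustration, and the two plugin classes are empty placeholders.

```python
from typing import Dict, List


class PluginRegistry:
    """Hypothetical registry mapping a plugin_type string to a plugin class."""

    _plugins: Dict[str, type] = {}

    @classmethod
    def register(cls, plugin_cls: type) -> type:
        # Used as a class decorator; the plugin's `name` doubles as plugin_type.
        cls._plugins[plugin_cls.name] = plugin_cls
        return plugin_cls

    @classmethod
    def get(cls, plugin_type: str) -> type:
        # Assumed lookup helper: fail loudly on an unregistered plugin_type.
        try:
            return cls._plugins[plugin_type]
        except KeyError:
            raise ValueError(f"Unknown plugin_type: {plugin_type!r}") from None

    @classmethod
    def available(cls) -> List[str]:
        # Assumed helper that could back a "list plugin types" endpoint.
        return sorted(cls._plugins)


@PluginRegistry.register
class ChromaDBPlugin:  # placeholder body, for illustration only
    name = "chromadb"


@PluginRegistry.register
class Neo4jPlugin:  # placeholder body, for illustration only
    name = "neo4j"


print(PluginRegistry.available())
print(PluginRegistry.get("neo4j").__name__)
```

Keying the registry on the same string stored in a setup's plugin_type keeps the lookup to a single dictionary access, which is one plausible way to keep the service layer unaware of which backends exist.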
Key Concept: Enhanced KnowledgeStore Setup
Before (Issue #203):
{
"name": "OpenAI Production",
"setup_key": "openai-prod",
"vendor": "openai",
"model_name": "text-embedding-3-small",
"api_key": "sk-...",
"api_endpoint": "https://api.openai.com/v1/embeddings",
"embedding_dimensions": 1536
}
After (This Proposal):
{
"name": "ChromaDB with OpenAI",
"setup_key": "chromadb-openai-prod",
"plugin_type": "chromadb", // NEW: Specifies plugin
"plugin_config": { // NEW: Generic JSON config
"vendor": "openai",
"model": "text-embedding-3-small",
"api_key": "sk-...",
"api_endpoint": "https://api.openai.com/v1/embeddings",
"embedding_dimensions": 1536
}
}
Other Plugin Examples:
// Neo4j (graph-based, no embeddings)
{
"plugin_type": "neo4j",
"plugin_config": {
"uri": "bolt://localhost:7687",
"user": "neo4j",
"password": "secret",
"schema": {...}
}
}
// ElasticSearch (keyword search)
{
"plugin_type": "elasticsearch",
"plugin_config": {
"hosts": ["localhost:9200"],
"api_key": "...",
"analyzer": "english"
}
}
Database Schema Changes
Assumptions from Issue #203
- organizations table exists
- embeddings_setups table exists
- collections.organization_id exists
- collections.embeddings_setup_id exists
Required Migrations
-- 1. Rename table
ALTER TABLE embeddings_setups RENAME TO knowledge_store_setups;
-- 2. Add plugin_type field
ALTER TABLE knowledge_store_setups
ADD COLUMN plugin_type TEXT NOT NULL DEFAULT 'chromadb';
-- 3. Add plugin_config JSON field
ALTER TABLE knowledge_store_setups
ADD COLUMN plugin_config JSON;
-- 4. Migrate existing data into plugin_config
UPDATE knowledge_store_setups
SET plugin_config = json_object(
'vendor', vendor,
'api_endpoint', api_endpoint,
'api_key', api_key,
'model', model_name,
'embedding_dimensions', embedding_dimensions
)
WHERE plugin_type = 'chromadb';
-- 5. Add new column to collections
ALTER TABLE collections
ADD COLUMN knowledge_store_setup_id INTEGER
REFERENCES knowledge_store_setups(id);
-- 6. Copy data
UPDATE collections
SET knowledge_store_setup_id = embeddings_setup_id;
-- 7. Create indexes
CREATE INDEX idx_collections_knowledge_store_setup
ON collections(knowledge_store_setup_id);
CREATE INDEX idx_knowledge_store_setups_org_plugin
ON knowledge_store_setups(organization_id, plugin_type);
API Changes
Setup Endpoints (Extended from #203)
List Setups - Now includes plugin_type:
GET /organizations/{org_id}/knowledge-store-setups
Response:
{
"setups": [
{
"id": 1,
"name": "ChromaDB with OpenAI",
"setup_key": "chromadb-openai-prod",
"plugin_type": "chromadb", // NEW
"plugin_config": {...}, // NEW
"collections_count": 42
}
]
}
Create Setup - Requires plugin_type:
POST /organizations/{org_id}/knowledge-store-setups
{
"name": "Neo4j Knowledge Graph",
"setup_key": "neo4j-prod",
"plugin_type": "neo4j", // NEW: Required
"plugin_config": { // NEW: Plugin-specific
"uri": "bolt://localhost:7687",
"user": "neo4j",
"password": "secret"
}
}
New Endpoints
List Available Plugin Types:
GET /knowledge-store-plugins
Response:
{
"plugins": [
{
"plugin_type": "chromadb",
"name": "ChromaDB",
"description": "Vector database with persistent storage",
"supports_embeddings": true,
"supports_metadata": true
},
{
"plugin_type": "neo4j",
"name": "Neo4j",
"description": "Graph database for knowledge graphs",
"supports_embeddings": false,
"supports_metadata": true
}
]
}
Validate Plugin Config:
POST /knowledge-store-plugins/{plugin_type}/validate-config
{
"plugin_config": {
"uri": "bolt://localhost:7687",
"user": "neo4j",
"password": "test"
}
}
Response:
{
"valid": true,
"errors": [],
"warnings": []
}
Collection Creation (Unchanged from #203!)
POST /collections
{
"name": "Research Papers",
"organization_external_id": "org_123",
"knowledge_store_setup_key": "neo4j-prod" // Same as #203
}
Key Point: Collection API unchanged - backend selection happens via setup choice.
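To illustrate how a request like the one above might pick its backend purely through the setup it names, here is a minimal sketch. The `Setup` dataclass, the in-memory `SETUPS` table, and `backend_for_request` are hypothetical stand-ins for the real database lookup, which this proposal does not spell out.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Setup:
    setup_key: str
    plugin_type: str
    plugin_config: Dict = field(default_factory=dict)


# Hypothetical stand-in for the knowledge_store_setups table, keyed by
# (organization_external_id, setup_key).
SETUPS = {
    ("org_123", "chromadb-openai-prod"): Setup("chromadb-openai-prod", "chromadb"),
    ("org_123", "neo4j-prod"): Setup("neo4j-prod", "neo4j"),
}


def backend_for_request(payload: Dict) -> str:
    """Return the plugin_type that a POST /collections payload resolves to."""
    key = (payload["organization_external_id"], payload["knowledge_store_setup_key"])
    setup = SETUPS.get(key)
    if setup is None:
        raise LookupError(f"No setup {key[1]!r} for organization {key[0]!r}")
    return setup.plugin_type


request = {
    "name": "Research Papers",
    "organization_external_id": "org_123",
    "knowledge_store_setup_key": "neo4j-prod",
}
print(backend_for_request(request))
```

The payload never mentions a backend; swapping the setup key is the only change a client would need to make to land on a different plugin.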
Plugin Architecture
KnowledgeStorePlugin Interface
import abc
from typing import Any, Dict, List, Optional

class KnowledgeStorePlugin(abc.ABC):
    """Technology-agnostic interface for retrieval backends.

    Plugins work with CHUNKS and METADATA only.
    No embeddings, vectors, or graphs in the interface!
    """

    name: str = "plugin_name"
    supports_embeddings: bool = False
    supports_metadata: bool = True

    # Initialization
    @abc.abstractmethod
    def initialize(self, global_config: Dict) -> None:
        """Initialize plugin (e.g., ChromaDB storage path)."""
        pass

    @classmethod
    @abc.abstractmethod
    def validate_plugin_config(cls, plugin_config: Dict) -> Dict:
        """Validate setup's plugin_config."""
        pass

    # Collection Operations
    @abc.abstractmethod
    def create_collection(self, name: str, setup_id: int,
                          metadata: Optional[Dict] = None) -> Any:
        """Create collection. Plugin resolves setup_id to get config."""
        pass

    # Chunk Operations (technology-agnostic!)
    @abc.abstractmethod
    def add_chunks(self, collection, chunk_ids: List[str],
                   chunk_texts: List[str],
                   chunk_metadata: Optional[List[Dict]] = None) -> None:
        """Add chunks. Plugin decides: embeddings? graph? index?"""
        pass

    @abc.abstractmethod
    def query_chunks(self, collection, query_text: str,
                     n_results: int = 10,
                     filters: Optional[Dict] = None) -> Dict:
        """Query chunks. Plugin decides how to process query."""
        pass

    # Helper to resolve setup
    def _resolve_setup(self, setup_id: int) -> Dict:
        """Get plugin_config from setup (enables key rotation!)."""
        # `db` is the application's database access layer (not defined here).
        setup = db.get_knowledge_store_setup(setup_id)
        return setup.plugin_config

ChromaDB Plugin Implementation
@PluginRegistry.register
class ChromaDBPlugin(KnowledgeStorePlugin):
    name = "chromadb"
    supports_embeddings = True

    def create_collection(self, name, setup_id, metadata):
        # Resolve setup to get plugin_config
        plugin_config = self._resolve_setup(setup_id)

        # Store setup_id in collection metadata (for later resolution)
        collection_metadata = {
            "knowledge_store_setup_id": setup_id,
            "hnsw:space": "cosine"
        }
        return self._client.create_collection(
            name=name,
            metadata=collection_metadata
        )

    def add_chunks(self, collection, chunk_ids, chunk_texts, chunk_metadata):
        # Get embedding function from setup (INTERNAL to plugin!)
        embedding_func = self._get_embedding_function(collection)

        # Compute embeddings (service layer never touches this!)
        embeddings = embedding_func(chunk_texts)

        # Store chunks with embeddings
        collection.add(
            ids=chunk_ids,
            documents=chunk_texts,
            embeddings=embeddings,
            metadatas=chunk_metadata
        )

    def _get_embedding_function(self, collection):
        """Get embedding function from setup config.

        KEY CHANGE: Resolves setup each time (allows key rotation!)
        """
        setup_id = collection.metadata["knowledge_store_setup_id"]
        plugin_config = self._resolve_setup(setup_id)
        return get_embedding_function_by_params(
            vendor=plugin_config.get("vendor"),
            model_name=plugin_config.get("model"),
            api_key=plugin_config.get("api_key"),  # Gets CURRENT key!
            api_endpoint=plugin_config.get("api_endpoint")
        )

Neo4j Plugin (Example - No Embeddings!)
@PluginRegistry.register
class Neo4jPlugin(KnowledgeStorePlugin):
    name = "neo4j"
    supports_embeddings = False  # Graph-based, no embeddings!

    def add_chunks(self, collection, chunk_ids, chunk_texts, chunk_metadata):
        # Extract entities from texts (NO EMBEDDINGS!)
        for i, text in enumerate(chunk_texts):
            entities = self._extract_entities(text)

            # Build knowledge graph
            with self._driver.session() as session:
                session.run(
                    "CREATE (c:Chunk {id: $id, text: $text})",
                    id=chunk_ids[i], text=text
                )
                for entity in entities:
                    # Each run() is a separate query, so the chunk must be
                    # matched again by id; an unbound (c) would create a
                    # new, empty node instead of linking the chunk.
                    session.run(
                        "MATCH (c:Chunk {id: $id}) "
                        "MERGE (e:Entity {name: $name}) "
                        "MERGE (c)-[:MENTIONS]->(e)",
                        id=chunk_ids[i], name=entity
                    )

    def query_chunks(self, collection, query_text, n_results, filters):
        # Graph traversal (NO EMBEDDINGS!)
        entities = self._extract_entities(query_text)
        with self._driver.session() as session:
            result = session.run(
                """
                MATCH (c:Chunk)-[:MENTIONS]->(e:Entity)
                WHERE e.name IN $entities
                RETURN c.id, c.text, count(e) as relevance
                ORDER BY relevance DESC
                LIMIT $limit
                """,
                entities=entities, limit=n_results
            )
            return self._format_results(result)

Service Layer Changes
Before (Services Compute Embeddings)
# services/ingestion.py (OLD)
from services.embedding import EmbeddingService

def ingest_chunks(collection_name, chunk_ids, chunks, metadata):
    client = get_chroma_client()
    collection = client.get_collection(collection_name)

    # Service computes embeddings (TIGHT COUPLING!)
    embedding_config = get_collection_embedding_config(collection_name)
    embeddings = EmbeddingService.compute_embeddings(chunks, embedding_config)
    collection.add(
        ids=chunk_ids,
        documents=chunks,
        embeddings=embeddings,
        metadatas=metadata
    )

After (Technology-Agnostic)
# services/ingestion.py (NEW)
from database.connection import get_knowledge_store

def ingest_chunks(collection_name, chunk_ids, chunks, metadata):
    # Get collection from DB to find its setup
    # (`db` is the application's database access layer, not defined here)
    db_collection = db.get_collection_by_name(collection_name)
    setup = db.get_knowledge_store_setup(db_collection.knowledge_store_setup_id)

    # Get appropriate plugin
    plugin = get_knowledge_store(setup.plugin_type)
    collection = plugin.get_collection(collection_name)

    # Just pass chunks + metadata (technology-agnostic!)
    plugin.add_chunks(collection, chunk_ids, chunks, metadata)

    # Plugin handles the rest:
    # - ChromaDB plugin: computes embeddings, stores vectors
    # - Neo4j plugin: extracts entities, builds graph
    # - ElasticSearch plugin: tokenizes, builds index

Key Changes:
- ❌ Remove services/embedding.py entirely
- ✅ Services never compute embeddings
- ✅ Services work with chunks + metadata only
- ✅ Same code works for any plugin!
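One way to check the "same code works for any plugin" claim is to run the service against a throwaway backend. The sketch below assumes a hypothetical `InMemoryPlugin` that does substring matching; it is not one of the proposed plugins, and the `ingest_chunks` shown here takes the plugin as an argument purely to keep the example self-contained.

```python
from typing import Dict, List, Optional


class InMemoryPlugin:
    """Hypothetical test double: stores chunks in a dict, matches by substring."""

    name = "in_memory"

    def __init__(self) -> None:
        self._collections: Dict[str, Dict[str, Dict]] = {}

    def get_collection(self, name: str) -> Dict[str, Dict]:
        return self._collections.setdefault(name, {})

    def add_chunks(self, collection, chunk_ids: List[str], chunk_texts: List[str],
                   chunk_metadata: Optional[List[Dict]] = None) -> None:
        chunk_metadata = chunk_metadata or [{} for _ in chunk_ids]
        for cid, text, meta in zip(chunk_ids, chunk_texts, chunk_metadata):
            collection[cid] = {"text": text, "metadata": meta}

    def query_chunks(self, collection, query_text: str, n_results: int = 10,
                     filters: Optional[Dict] = None) -> Dict:
        # Case-insensitive substring match stands in for real retrieval.
        hits = [cid for cid, chunk in collection.items()
                if query_text.lower() in chunk["text"].lower()]
        return {"ids": hits[:n_results]}


def ingest_chunks(plugin, collection_name, chunk_ids, chunks, metadata=None):
    # The service only handles ids, texts, and metadata; no embeddings here.
    collection = plugin.get_collection(collection_name)
    plugin.add_chunks(collection, chunk_ids, chunks, metadata)
    return collection


plugin = InMemoryPlugin()
collection = ingest_chunks(
    plugin, "papers", ["c1", "c2"],
    ["Graph databases store relationships.", "Vectors encode meaning."],
)
result = plugin.query_chunks(collection, "graph")
print(result)
```

Because the service function never inspects what the plugin does with the text, the same call path could be pointed at a vector, graph, or keyword backend in tests without modification.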
Migration Plan
Phase 1: Database Schema (Week 1)
- Rename embeddings_setups → knowledge_store_setups
- Add plugin_type column (default 'chromadb')
- Add plugin_config JSON column
- Migrate existing fields into plugin_config
- Add knowledge_store_setup_id to collections
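The first four Phase 1 steps can be rehearsed against an in-memory SQLite database before touching real data. This is a rough sketch under stated assumptions: the `embeddings_setups` column list is inferred from the Issue #203 example payload rather than from the actual schema, and the field migration is done in Python with `json.dumps` instead of `json_object()` so the sketch does not depend on SQLite's JSON functions being available.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript("""
    -- Assumed pre-migration shape of the Issue #203 table.
    CREATE TABLE embeddings_setups (
        id INTEGER PRIMARY KEY,
        organization_id INTEGER,
        vendor TEXT, api_endpoint TEXT, api_key TEXT,
        model_name TEXT, embedding_dimensions INTEGER
    );
    INSERT INTO embeddings_setups VALUES
        (1, 10, 'openai', 'https://api.openai.com/v1/embeddings',
         'sk-test', 'text-embedding-3-small', 1536);

    -- Steps 1-3: rename the table and add the two new columns.
    ALTER TABLE embeddings_setups RENAME TO knowledge_store_setups;
    ALTER TABLE knowledge_store_setups
        ADD COLUMN plugin_type TEXT NOT NULL DEFAULT 'chromadb';
    ALTER TABLE knowledge_store_setups ADD COLUMN plugin_config JSON;
""")

# Step 4: fold the old per-field columns into plugin_config.
rows = conn.execute(
    "SELECT * FROM knowledge_store_setups WHERE plugin_type = 'chromadb'"
).fetchall()
for row in rows:
    config = {
        "vendor": row["vendor"],
        "api_endpoint": row["api_endpoint"],
        "api_key": row["api_key"],
        "model": row["model_name"],
        "embedding_dimensions": row["embedding_dimensions"],
    }
    conn.execute(
        "UPDATE knowledge_store_setups SET plugin_config = ? WHERE id = ?",
        (json.dumps(config), row["id"]),
    )
conn.commit()

migrated_row = conn.execute(
    "SELECT plugin_type, plugin_config FROM knowledge_store_setups WHERE id = 1"
).fetchone()
migrated = json.loads(migrated_row["plugin_config"])
print(migrated_row["plugin_type"], migrated["model"])
```

Running the rehearsal on a copy of production data and comparing each migrated `plugin_config` against the original columns would give an early signal before the old columns are ever dropped.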
Phase 2: Plugin Infrastructure (Week 2)
- Implement KnowledgeStorePlugin base class
- Implement ChromaDBPlugin (reads from setups)
- Implement KnowledgeStoreService
- Update PluginRegistry
Phase 3: Service Layer Refactoring (Week 3)
- Remove embedding computation from services
- Replace get_chroma_client() with get_knowledge_store()
- Delete services/embedding.py
- Update ingestion, query, collections services
Phase 4: API & Router Updates (Week 4)
- Add /knowledge-store-plugins endpoints
- Update router to use plugin resolution
- Configure KnowledgeStoreService in startup
Phase 5: LAMB Integration (Week 5)
- Update LAMB endpoints: /embeddings-setups → /knowledge-store-setups
- Update frontend labels
- Show plugin type in setup list
Phase 6: Testing (Week 6)
- Run KB-Server E2E tests (must pass 100%)
- Run Playwright tests (must pass 100%)
- Verify backward compatibility
Common Scenarios
Scenario 1: API Key Rotation (Same as #203)
Admin updates ChromaDB setup's plugin_config.api_key
→ All 42 collections using this setup immediately use new key
→ No collection-level updates needed!
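The rotation behaviour in this scenario follows from collections storing only a setup reference and resolving the config at call time. A minimal sketch, assuming an in-memory `SETUPS` dict in place of the database and a hypothetical `Collection` class that is not part of the proposal:

```python
from typing import Dict

# Hypothetical in-memory stand-in for the knowledge_store_setups table.
SETUPS: Dict[int, Dict] = {
    1: {"plugin_type": "chromadb", "plugin_config": {"api_key": "sk-old"}},
}


def resolve_setup(setup_id: int) -> Dict:
    """Look the config up on every call instead of caching it per collection."""
    return SETUPS[setup_id]["plugin_config"]


class Collection:
    def __init__(self, name: str, setup_id: int) -> None:
        self.name = name
        self.setup_id = setup_id  # only the reference is stored, never the key

    def current_api_key(self) -> str:
        return resolve_setup(self.setup_id)["api_key"]


collections = [Collection(f"collection-{i}", setup_id=1) for i in range(42)]
before = {c.current_api_key() for c in collections}

# The admin rotates the key once, on the setup.
SETUPS[1]["plugin_config"]["api_key"] = "sk-new"
after = {c.current_api_key() for c in collections}

print(before, after)
```

The trade-off is one extra lookup per operation; a plugin that caches resolved configs for performance would need some invalidation step for rotation to stay immediate.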
Scenario 2: Adding New Backend (NEW)
Admin creates new Neo4j setup:
plugin_type: "neo4j"
plugin_config: {uri, user, password, schema}
→ Users can now choose Neo4j when creating collections
→ Existing collections unaffected
Scenario 3: Choosing Backend by Use Case (NEW)
Use Case → Best Plugin
─────────────────────────────────────────────
Document Q&A, semantic search → ChromaDB/Qdrant (vector)
Knowledge graph, relationships → Neo4j (graph)
Exact phrase matching, legal → ElasticSearch (keyword)
Hybrid: semantic + keyword → Hybrid plugin
Benefits Summary
Technical Benefits
✅ Technology-agnostic - Same API for vector, graph, keyword, hybrid
✅ Plugin autonomy - Each plugin decides how to process chunks
✅ Simpler services - No embedding knowledge, cleaner code
✅ Better testing - Mock plugins easily
✅ Future-proof - New retrieval tech added without touching services
Business Benefits
✅ Flexibility - Organizations choose best tech for their needs
✅ Risk reduction - Not locked to single database vendor
✅ Innovation - Experiment with new retrieval approaches
✅ Cost optimization - Use cost-effective backends per use case
Comparison to Current State
| Aspect | Before (Issue #203) | After (Plugin Architecture) |
|---|---|---|
| Setup naming | embeddings_setups | knowledge_store_setups |
| Config structure | Individual fields | Generic plugin_config JSON |
| Backend support | ChromaDB only | ChromaDB, Neo4j, ElasticSearch, etc. |
| Plugin selection | N/A (implicit) | Via plugin_type field |
| Service layer | Knows embeddings | Technology-agnostic |
| Embedding computation | Service layer | Plugin-internal |
| Adding backend | Refactor 8+ files | Create plugin, register |
| API key rotation | ✅ Works | ✅ Works |
| Per-collection backend | ❌ All ChromaDB | ✅ Each chooses |
Validation & Testing
Backward Compatibility Requirements
- All existing collections continue working
- Can ingest to existing collections
- Can query existing collections
- API key rotation still works
- All E2E tests pass without modification
- All Playwright tests pass without modification
New Features to Verify
- Can list available plugin types
- Can validate plugin configs
- ChromaDB plugin works identically to pre-plugin architecture
- Organizations can create setups with different plugin types
- Collections can be created with any active setup
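For the config-validation check in the list above, a validator might look roughly like this. It is a sketch only: the required and optional field lists for Neo4j are assumptions drawn from the example payloads in this proposal, and the function is written as a free function rather than the `validate_plugin_config` classmethod so it can run on its own. The return shape mirrors the valid/errors/warnings response shown for the validate-config endpoint.

```python
from typing import Dict, List

# Assumed rules for illustration; each real plugin would declare its own.
REQUIRED_FIELDS = {"neo4j": ["uri", "user", "password"]}
OPTIONAL_FIELDS = {"neo4j": ["schema"]}


def validate_plugin_config(plugin_type: str, plugin_config: Dict) -> Dict:
    errors: List[str] = []
    warnings: List[str] = []

    if plugin_type not in REQUIRED_FIELDS:
        return {
            "valid": False,
            "errors": [f"Unknown plugin_type: {plugin_type}"],
            "warnings": [],
        }

    # Every required field must be present and non-empty.
    for field_name in REQUIRED_FIELDS[plugin_type]:
        if not plugin_config.get(field_name):
            errors.append(f"Missing required field: {field_name}")

    # Unknown keys are tolerated but flagged, since they are often typos.
    known = set(REQUIRED_FIELDS[plugin_type]) | set(OPTIONAL_FIELDS[plugin_type])
    for field_name in sorted(set(plugin_config) - known):
        warnings.append(f"Unrecognized field: {field_name}")

    return {"valid": not errors, "errors": errors, "warnings": warnings}


ok = validate_plugin_config(
    "neo4j", {"uri": "bolt://localhost:7687", "user": "neo4j", "password": "test"}
)
bad = validate_plugin_config("neo4j", {"uri": "bolt://localhost:7687", "usr": "neo4j"})
print(ok)
print(bad)
```

Returning warnings separately from errors lets the setup form accept a config that is usable while still surfacing likely mistakes.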
Open Questions
- Plugin Config Migration: Keep old individual fields for backward compatibility or require immediate migration to plugin_config?
- Cross-Plugin Migration: Support migrating collections between plugin types (e.g., ChromaDB → Qdrant)?
- Plugin Versioning: Should setups include plugin version info?
- System-Wide Setups: Should there be system setups available to all orgs?
- Plugin Discovery: Auto-discover from plugins directory or explicit registration?
Success Metrics
After implementation:
- API key rotation takes < 1 minute
- Can add plugin types without modifying service code
- Organizations can offer multiple backend choices
- ChromaDB plugin identical to pre-plugin architecture
- All existing tests pass
- Setup configs never expose credentials
- Collection creation shows available backends
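For the "setup configs never expose credentials" metric above, one possible approach is to mask secret fields before a plugin_config leaves the server. The sketch below is an assumption about how that could be done, not part of the proposal: the `SECRET_KEYS` set and the `redact_plugin_config` helper are hypothetical, and a real plugin would presumably declare which of its fields are sensitive.

```python
from typing import Any, Dict

# Assumed list of secret-bearing keys, taken from the example configs.
SECRET_KEYS = {"api_key", "password"}


def redact_plugin_config(plugin_config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of plugin_config that is safe to put in an API response."""
    redacted: Dict[str, Any] = {}
    for key, value in plugin_config.items():
        if key in SECRET_KEYS and value:
            redacted[key] = "****"
        elif isinstance(value, dict):
            # Recurse so nested sections are masked as well.
            redacted[key] = redact_plugin_config(value)
        else:
            redacted[key] = value
    return redacted


stored = {
    "uri": "bolt://localhost:7687",
    "user": "neo4j",
    "password": "secret",
    "schema": {"api_key": "nested-secret", "labels": ["Chunk", "Entity"]},
}
public = redact_plugin_config(stored)
print(public)
```

The function builds a new dict rather than mutating its input, so the stored config that plugins resolve at call time keeps the real credentials.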
Next Steps
- Review - Get team approval on architecture
- Phase 1 - Database schema migration
- Phase 2 - Plugin infrastructure
- Phase 3 - Service refactoring
- Phase 4 - API updates
- Phase 5 - LAMB integration
- Phase 6 - Testing & validation
- Deploy - Gradual rollout with monitoring