(ChromaDB Plugin) Convert current ChromaDB Implementation to be a Plugin

# Knowledge Store Plugin Architecture - Implementation Plan

**Version:** 4.0 (Technology-Agnostic API)
**Status:** Draft for Review
**Prerequisite:** Issue #203 (KnowledgeStore Setups) must be implemented first

---

## Executive Summary

This proposal extends the **KnowledgeStore Setup** infrastructure from Issue #203 to enable **technology-agnostic retrieval backends** for KB-Server. By expanding organization-level setup management to support multiple plugin types, we achieve:

1. **Technology-agnostic retrieval** - Same API for vector DBs, graph DBs, keyword search, hybrid systems
2. **Per-collection backend selection** - Different collections can use different technologies
3. **Seamless backend migration** - Change backends without recreating collections
4. **Simplified service layer** - Services work with chunks + metadata only, never touching embeddings/graphs/indexes
5. **Full backward compatibility** - ChromaDB remains default, all existing functionality preserved

### Relationship to Issue #203

**Issue #203 provides:**
- Organization-level setup management
- Collections reference setups (not inline configs)
- API key rotation capability
- Setup reusability across collections

**This proposal extends #203 by:**
- Renaming `embeddings_setups` → `knowledge_store_setups`
- Adding `plugin_type` field to specify backend technology
- Making `plugin_config` generic JSON (not embedding-specific)
- Implementing plugin architecture that reads from setups
- Enabling multi-backend support (vector, graph, keyword, hybrid)

---

## Architecture Overview

### Current State (After Issue #203)

```
Organizations
    ↓ 1:N
Embeddings Setups (embeddings-specific)
    ↓ 1:N
Collections
    ↓
ChromaDB (hardcoded, ONLY backend)
```

**Problems:**
- ChromaDB-only (can't support Neo4j, ElasticSearch, etc.)
- `embeddings_setups` name implies vector-only
- Setup config is embeddings-specific
- Services directly call ChromaDB APIs

### Target State (Plugin Architecture)

```
Organizations
    ↓ 1:N
KnowledgeStore Setups (renamed & expanded)
  Setup 1: ChromaDB + OpenAI (plugin_type: "chromadb")
  Setup 2: Neo4j Graph (plugin_type: "neo4j")
  Setup 3: ElasticSearch (plugin_type: "elasticsearch")
    ↓ 1:N
Collections (unchanged - still reference setup_id)
    ↓
Service Layer (technology-agnostic: chunks + metadata only)
    ↓
KnowledgeStorePlugin Interface (abstract)
    ↓
┌───────────────┬───────────────┬───────────────┬───────────────┐
│   ChromaDB    │    Qdrant     │     Neo4j     │ ElasticSearch │
│    Plugin     │    Plugin     │    Plugin     │    Plugin     │
└───────────────┴───────────────┴───────────────┴───────────────┘
```

---

## Key Concept: Enhanced KnowledgeStore Setup

**Before (Issue #203):**
```json
{
  "name": "OpenAI Production",
  "setup_key": "openai-prod",
  "vendor": "openai",
  "model_name": "text-embedding-3-small",
  "api_key": "sk-...",
  "api_endpoint": "https://api.openai.com/v1/embeddings",
  "embedding_dimensions": 1536
}
```

**After (This Proposal):**
```json
{
  "name": "ChromaDB with OpenAI",
  "setup_key": "chromadb-openai-prod",
  "plugin_type": "chromadb",  // NEW: Specifies plugin
  "plugin_config": {          // NEW: Generic JSON config
    "vendor": "openai",
    "model": "text-embedding-3-small",
    "api_key": "sk-...",
    "api_endpoint": "https://api.openai.com/v1/embeddings",
    "embedding_dimensions": 1536
  }
}
```

**Other Plugin Examples:**

```json
// Neo4j (graph-based, no embeddings)
{
  "plugin_type": "neo4j",
  "plugin_config": {
    "uri": "bolt://localhost:7687",
    "user": "neo4j",
    "password": "secret",
    "schema": {...}
  }
}

// ElasticSearch (keyword search)
{
  "plugin_type": "elasticsearch",
  "plugin_config": {
    "hosts": ["localhost:9200"],
    "api_key": "...",
    "analyzer": "english"
  }
}
```

---

## Database Schema Changes

### Assumptions from Issue #203
- `organizations` table exists
- `embeddings_setups` table exists
- `collections.organization_id` exists
- `collections.embeddings_setup_id` exists

### Required Migrations

```sql
-- 1. Rename table
ALTER TABLE embeddings_setups RENAME TO knowledge_store_setups;

-- 2. Add plugin_type field
ALTER TABLE knowledge_store_setups
ADD COLUMN plugin_type TEXT NOT NULL DEFAULT 'chromadb';

-- 3. Add plugin_config JSON field
ALTER TABLE knowledge_store_setups
ADD COLUMN plugin_config JSON;

-- 4. Migrate existing data into plugin_config
UPDATE knowledge_store_setups
SET plugin_config = json_object(
    'vendor', vendor,
    'api_endpoint', api_endpoint,
    'api_key', api_key,
    'model', model_name,
    'embedding_dimensions', embedding_dimensions
)
WHERE plugin_type = 'chromadb';

-- 5. Add new column to collections
ALTER TABLE collections
ADD COLUMN knowledge_store_setup_id INTEGER
REFERENCES knowledge_store_setups(id);

-- 6. Copy data
UPDATE collections
SET knowledge_store_setup_id = embeddings_setup_id;

-- 7. Create indexes
CREATE INDEX idx_collections_knowledge_store_setup
    ON collections(knowledge_store_setup_id);
CREATE INDEX idx_knowledge_store_setups_org_plugin
    ON knowledge_store_setups(organization_id, plugin_type);
```
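Step 4 uses `json_object`, which is SQLite's JSON function. Where that function is unavailable, the same packing could be done in application code. The following is a minimal sketch, assuming SQLite accessed through Python's `sqlite3` and the column names from the migration above; the helper name `pack_plugin_config` and the extra `plugin_config IS NULL` guard are assumptions, not part of the proposal.

```python
import json
import sqlite3

# Legacy column -> key inside plugin_config (mirrors step 4 above).
LEGACY_TO_CONFIG = {
    "vendor": "vendor",
    "api_endpoint": "api_endpoint",
    "api_key": "api_key",
    "model_name": "model",
    "embedding_dimensions": "embedding_dimensions",
}


def pack_plugin_config(conn: sqlite3.Connection) -> int:
    """Fill plugin_config for ChromaDB setups that do not have one yet.

    Returns the number of rows updated. Hypothetical helper.
    """
    columns = ", ".join(LEGACY_TO_CONFIG)
    rows = conn.execute(
        f"SELECT id, {columns} FROM knowledge_store_setups "
        "WHERE plugin_type = 'chromadb' AND plugin_config IS NULL"
    ).fetchall()
    for row in rows:
        setup_id, values = row[0], row[1:]
        # Pair each legacy value with its new key name.
        config = dict(zip(LEGACY_TO_CONFIG.values(), values))
        conn.execute(
            "UPDATE knowledge_store_setups SET plugin_config = ? WHERE id = ?",
            (json.dumps(config), setup_id),
        )
    conn.commit()
    return len(rows)
```

Because of the `IS NULL` guard, running the helper a second time updates nothing, which makes the step safe to retry.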

---

## API Changes

### Setup Endpoints (Extended from #203)

**List Setups** - Now includes plugin_type:
```http
GET /organizations/{org_id}/knowledge-store-setups

Response:
{
  "setups": [
    {
      "id": 1,
      "name": "ChromaDB with OpenAI",
      "setup_key": "chromadb-openai-prod",
      "plugin_type": "chromadb",  // NEW
      "plugin_config": {...},      // NEW (credentials redacted)
      "collections_count": 42
    }
  ]
}
```

**Create Setup** - Requires plugin_type:
```http
POST /organizations/{org_id}/knowledge-store-setups

{
  "name": "Neo4j Knowledge Graph",
  "setup_key": "neo4j-prod",
  "plugin_type": "neo4j",    // NEW: Required
  "plugin_config": {         // NEW: Plugin-specific
    "uri": "bolt://localhost:7687",
    "user": "neo4j",
    "password": "secret"
  }
}
```

### New Endpoints

**List Available Plugin Types:**
```http
GET /knowledge-store-plugins

Response:
{
  "plugins": [
    {
      "plugin_type": "chromadb",
      "name": "ChromaDB",
      "description": "Vector database with persistent storage",
      "supports_embeddings": true,
      "supports_metadata": true
    },
    {
      "plugin_type": "neo4j",
      "name": "Neo4j",
      "description": "Graph database for knowledge graphs",
      "supports_embeddings": false,
      "supports_metadata": true
    }
  ]
}
```
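The response above could be derived from class attributes on each registered plugin. A sketch, assuming plugin classes expose `name`, `supports_embeddings`, and `supports_metadata` as in the interface defined later in this plan; the `display_name` and `description` attributes, the stand-in classes, and the `describe_plugins` helper are hypothetical.

```python
from typing import Dict, List, Type


class ChromaDBInfo:
    """Stand-in for a plugin class; only descriptive attributes matter here."""
    name = "chromadb"
    display_name = "ChromaDB"
    description = "Vector database with persistent storage"
    supports_embeddings = True
    supports_metadata = True


class Neo4jInfo:
    """Stand-in for a graph plugin class."""
    name = "neo4j"
    display_name = "Neo4j"
    description = "Graph database for knowledge graphs"
    supports_embeddings = False
    supports_metadata = True


def describe_plugins(plugin_classes: List[Type]) -> Dict[str, List[Dict]]:
    """Build the GET /knowledge-store-plugins payload from plugin classes."""
    return {
        "plugins": [
            {
                "plugin_type": cls.name,
                "name": cls.display_name,
                "description": cls.description,
                "supports_embeddings": cls.supports_embeddings,
                "supports_metadata": cls.supports_metadata,
            }
            for cls in plugin_classes
        ]
    }
```

Keeping the description next to the plugin class means a new backend shows up in this endpoint as soon as it is registered, without a separate catalogue to maintain.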

**Validate Plugin Config:**
```http
POST /knowledge-store-plugins/{plugin_type}/validate-config

{
  "plugin_config": {
    "uri": "bolt://localhost:7687",
    "user": "neo4j",
    "password": "test"
  }
}

Response:
{
  "valid": true,
  "errors": [],
  "warnings": []
}
```
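One possible shape for the validation behind this endpoint, returning the same `valid` / `errors` / `warnings` structure as the response above. This sketch covers the Neo4j example only and does not connect to the database; the required and known key lists are assumptions drawn from the example configs in this plan, and `validate_neo4j_config` is a hypothetical name.

```python
from typing import Any, Dict, List

# Assumed from the Neo4j example configs in this plan.
NEO4J_REQUIRED_KEYS = ("uri", "user", "password")
NEO4J_KNOWN_KEYS = set(NEO4J_REQUIRED_KEYS) | {"schema"}


def validate_neo4j_config(plugin_config: Dict[str, Any]) -> Dict[str, Any]:
    """Check a Neo4j plugin_config structurally (no network access)."""
    errors: List[str] = []
    warnings: List[str] = []

    # Missing or empty required fields are hard errors.
    for key in NEO4J_REQUIRED_KEYS:
        if not plugin_config.get(key):
            errors.append(f"missing required field: {key}")

    # A URI without a scheme cannot be handed to a driver.
    uri = plugin_config.get("uri")
    if isinstance(uri, str) and uri and "://" not in uri:
        errors.append("uri must include a scheme, e.g. bolt://host:7687")

    # Unknown keys are reported but do not fail validation.
    for key in sorted(set(plugin_config) - NEO4J_KNOWN_KEYS):
        warnings.append(f"unrecognized field: {key}")

    return {"valid": not errors, "errors": errors, "warnings": warnings}
```

Separating errors from warnings lets the setup form block on the former while only surfacing the latter, for example when a config carries a key that a newer plugin version would understand.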

### Collection Creation (Unchanged from #203!)

```http
POST /collections

{
  "name": "Research Papers",
  "organization_external_id": "org_123",
  "knowledge_store_setup_key": "neo4j-prod"  // Same as #203
}
```

**Key Point:** Collection API unchanged - backend selection happens via setup choice.

---

## Plugin Architecture

### KnowledgeStorePlugin Interface

```python
import abc
from typing import Any, Dict, List, Optional


class KnowledgeStorePlugin(abc.ABC):
    """Technology-agnostic interface for retrieval backends.

    Plugins work with CHUNKS and METADATA only.
    No embeddings, vectors, or graphs in the interface!
    """

    name: str = "plugin_name"
    supports_embeddings: bool = False
    supports_metadata: bool = True

    # Initialization
    @abc.abstractmethod
    def initialize(self, global_config: Dict) -> None:
        """Initialize plugin (e.g., ChromaDB storage path)."""

    @classmethod
    @abc.abstractmethod
    def validate_plugin_config(cls, plugin_config: Dict) -> Dict:
        """Validate setup's plugin_config."""

    # Collection Operations
    @abc.abstractmethod
    def create_collection(self, name: str, setup_id: int,
                          metadata: Optional[Dict] = None) -> Any:
        """Create collection. Plugin resolves setup_id to get config."""

    @abc.abstractmethod
    def get_collection(self, name: str) -> Any:
        """Return the backend handle for an existing collection.

        Called by the service layer before add_chunks / query_chunks.
        """

    # Chunk Operations (technology-agnostic!)
    @abc.abstractmethod
    def add_chunks(self, collection, chunk_ids: List[str],
                   chunk_texts: List[str],
                   chunk_metadata: Optional[List[Dict]] = None) -> None:
        """Add chunks. Plugin decides: embeddings? graph? index?"""

    @abc.abstractmethod
    def query_chunks(self, collection, query_text: str,
                     n_results: int = 10,
                     filters: Optional[Dict] = None) -> Dict:
        """Query chunks. Plugin decides how to process query."""

    # Helper to resolve setup
    def _resolve_setup(self, setup_id: int) -> Dict:
        """Get plugin_config from setup (enables key rotation!)."""
        # `db` is the KB-Server database access layer (import not shown here).
        setup = db.get_knowledge_store_setup(setup_id)
        return setup.plugin_config
```

### ChromaDB Plugin Implementation

```python
@PluginRegistry.register
class ChromaDBPlugin(KnowledgeStorePlugin):
    # Excerpt: initialize, validate_plugin_config, get_collection and
    # query_chunks are omitted here.
    name = "chromadb"
    supports_embeddings = True

    def create_collection(self, name, setup_id, metadata=None):
        # Resolve the setup up front so an unknown setup_id fails
        # before anything is created
        self._resolve_setup(setup_id)

        # Store setup_id in collection metadata (for later resolution),
        # keeping any caller-supplied metadata
        collection_metadata = {
            **(metadata or {}),
            "knowledge_store_setup_id": setup_id,
            "hnsw:space": "cosine"
        }

        return self._client.create_collection(
            name=name,
            metadata=collection_metadata
        )

    def add_chunks(self, collection, chunk_ids, chunk_texts, chunk_metadata=None):
        # Get embedding function from setup (INTERNAL to plugin!)
        embedding_func = self._get_embedding_function(collection)

        # Compute embeddings (service layer never touches this!)
        embeddings = embedding_func(chunk_texts)

        # Store chunks with embeddings
        collection.add(
            ids=chunk_ids,
            documents=chunk_texts,
            embeddings=embeddings,
            metadatas=chunk_metadata
        )

    def _get_embedding_function(self, collection):
        """Get embedding function from setup config.

        KEY CHANGE: Resolves setup each time (allows key rotation!)
        """
        setup_id = collection.metadata["knowledge_store_setup_id"]
        plugin_config = self._resolve_setup(setup_id)

        return get_embedding_function_by_params(
            vendor=plugin_config.get("vendor"),
            model_name=plugin_config.get("model"),
            api_key=plugin_config.get("api_key"),  # Gets CURRENT key!
            api_endpoint=plugin_config.get("api_endpoint")
        )
```

### Neo4j Plugin (Example - No Embeddings!)

```python
@PluginRegistry.register
class Neo4jPlugin(KnowledgeStorePlugin):
    name = "neo4j"
    supports_embeddings = False  # Graph-based, no embeddings!

    def add_chunks(self, collection, chunk_ids, chunk_texts, chunk_metadata=None):
        # Build knowledge graph (NO EMBEDDINGS!), one session for the batch
        with self._driver.session() as session:
            for chunk_id, text in zip(chunk_ids, chunk_texts):
                session.run(
                    "CREATE (c:Chunk {id: $id, text: $text})",
                    id=chunk_id, text=text
                )
                # Extract entities from the text and link them to the chunk
                for entity in self._extract_entities(text):
                    # Re-match the chunk by id: variables such as `c` do not
                    # carry over between separate queries. MERGE on the
                    # relationship avoids duplicates for repeated entities.
                    session.run(
                        "MATCH (c:Chunk {id: $id}) "
                        "MERGE (e:Entity {name: $name}) "
                        "MERGE (c)-[:MENTIONS]->(e)",
                        id=chunk_id, name=entity
                    )

    def query_chunks(self, collection, query_text, n_results=10, filters=None):
        # Graph traversal (NO EMBEDDINGS!)
        entities = self._extract_entities(query_text)

        with self._driver.session() as session:
            # Rank chunks by how many of the query's entities they mention
            result = session.run(
                """
                MATCH (c:Chunk)-[:MENTIONS]->(e:Entity)
                WHERE e.name IN $entities
                RETURN c.id AS id, c.text AS text, count(e) AS relevance
                ORDER BY relevance DESC
                LIMIT $limit
                """,
                entities=entities, limit=n_results
            )
            return self._format_results(result)
```
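Both plugins above are registered with `@PluginRegistry.register`, but this plan does not show the registry itself. The following is a minimal sketch of the behavior that decorator is assumed to have, keyed on each class's `name` attribute. The `get` and `available` methods and the `EchoPlugin` stand-in are hypothetical, and the existing KB-Server registry may differ.

```python
from typing import Dict, List, Type


class PluginRegistry:
    """Maps a plugin_type string to the class that implements it."""

    _plugins: Dict[str, Type] = {}

    @classmethod
    def register(cls, plugin_cls: Type) -> Type:
        """Class decorator: index the plugin under its `name` attribute."""
        name = getattr(plugin_cls, "name", None)
        if not name:
            raise ValueError(f"{plugin_cls.__name__} must define a non-empty name")
        if name in cls._plugins:
            raise ValueError(f"plugin type already registered: {name}")
        cls._plugins[name] = plugin_cls
        return plugin_cls  # returned unchanged so the class stays usable

    @classmethod
    def get(cls, plugin_type: str) -> Type:
        """Look up a plugin class, failing loudly on unknown types."""
        try:
            return cls._plugins[plugin_type]
        except KeyError:
            raise ValueError(f"unknown plugin_type: {plugin_type}") from None

    @classmethod
    def available(cls) -> List[str]:
        """Return the registered plugin types in sorted order."""
        return sorted(cls._plugins)


@PluginRegistry.register
class EchoPlugin:
    """Hypothetical stand-in for a real KnowledgeStorePlugin subclass."""
    name = "echo"
```

Rejecting duplicate names at registration time surfaces a conflict when the module is imported, rather than later when a setup resolves to the wrong backend.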

---

## Service Layer Changes

### Before (Services Compute Embeddings)

```python
# services/ingestion.py (OLD)
from services.embedding import EmbeddingService

def ingest_chunks(collection_name, chunk_ids, chunks, metadata):
    client = get_chroma_client()
    collection = client.get_collection(collection_name)

    # Service computes embeddings (TIGHT COUPLING!)
    embedding_config = get_collection_embedding_config(collection_name)
    embeddings = EmbeddingService.compute_embeddings(chunks, embedding_config)

    # Keyword arguments: Chroma's positional order is
    # ids, embeddings, metadatas, documents
    collection.add(ids=chunk_ids, documents=chunks,
                   embeddings=embeddings, metadatas=metadata)
```

### After (Technology-Agnostic)

```python
# services/ingestion.py (NEW)
from database.connection import get_knowledge_store
# `db` below is the KB-Server database access layer (import not shown here)

def ingest_chunks(collection_name, chunk_ids, chunks, metadata):
    # Get collection from DB to find its setup
    db_collection = db.get_collection_by_name(collection_name)
    setup = db.get_knowledge_store_setup(db_collection.knowledge_store_setup_id)

    # Get appropriate plugin
    plugin = get_knowledge_store(setup.plugin_type)
    collection = plugin.get_collection(collection_name)

    # Just pass chunks + metadata (technology-agnostic!)
    plugin.add_chunks(collection, chunk_ids, chunks, metadata)

    # Plugin handles the rest:
    # - ChromaDB plugin: computes embeddings, stores vectors
    # - Neo4j plugin: extracts entities, builds graph
    # - ElasticSearch plugin: tokenizes, builds index
```

**Key Changes:**
- ❌ Remove `services/embedding.py` entirely
- ✅ Services never compute embeddings
- ✅ Services work with chunks + metadata only
- ✅ Same code works for any plugin!
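
To illustrate how the service layer stays backend-agnostic, and how a mock plugin could stand in during tests, here is a sketch of an in-memory plugin driven by the same service call. `InMemoryPlugin`, the simplified `ingest_chunks` signature (the plugin is passed in directly instead of being resolved from a setup), and the `ids` / `documents` result shape are all assumptions made for this example.

```python
from typing import Dict, List, Optional


class InMemoryPlugin:
    """Hypothetical test double: stores chunks in a dict, no embeddings."""
    name = "memory"

    def __init__(self) -> None:
        self._collections: Dict[str, Dict[str, Dict]] = {}

    def get_collection(self, name: str) -> Dict[str, Dict]:
        # Create the collection on first access to keep the example short.
        return self._collections.setdefault(name, {})

    def add_chunks(self, collection, chunk_ids: List[str],
                   chunk_texts: List[str],
                   chunk_metadata: Optional[List[Dict]] = None) -> None:
        chunk_metadata = chunk_metadata or [{} for _ in chunk_ids]
        for cid, text, meta in zip(chunk_ids, chunk_texts, chunk_metadata):
            collection[cid] = {"text": text, "metadata": meta}

    def query_chunks(self, collection, query_text: str,
                     n_results: int = 10,
                     filters: Optional[Dict] = None) -> Dict:
        # Naive case-insensitive substring match; `filters` is ignored here.
        needle = query_text.lower()
        hits = [cid for cid, chunk in collection.items()
                if needle in chunk["text"].lower()][:n_results]
        return {"ids": hits,
                "documents": [collection[cid]["text"] for cid in hits]}


def ingest_chunks(plugin, collection_name, chunk_ids, chunks, metadata=None):
    """Service-side code: identical for every plugin."""
    collection = plugin.get_collection(collection_name)
    plugin.add_chunks(collection, chunk_ids, chunks, metadata)
```

A service-level test can then call `ingest_chunks(InMemoryPlugin(), ...)` and assert on `query_chunks` without a running ChromaDB or Neo4j instance.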

---

## Migration Plan

### Phase 1: Database Schema (Week 1)
- Rename `embeddings_setups` → `knowledge_store_setups`
- Add `plugin_type` column (default 'chromadb')
- Add `plugin_config` JSON column
- Migrate existing fields into `plugin_config`
- Add `knowledge_store_setup_id` to collections

### Phase 2: Plugin Infrastructure (Week 2)
- Implement `KnowledgeStorePlugin` base class
- Implement `ChromaDBPlugin` (reads from setups)
- Implement `KnowledgeStoreService`
- Update `PluginRegistry`

### Phase 3: Service Layer Refactoring (Week 3)
- Remove embedding computation from services
- Replace `get_chroma_client()` with `get_knowledge_store()`
- Delete `services/embedding.py`
- Update ingestion, query, collections services

### Phase 4: API & Router Updates (Week 4)
- Add `/knowledge-store-plugins` endpoints
- Update router to use plugin resolution
- Configure `KnowledgeStoreService` in startup

### Phase 5: LAMB Integration (Week 5)
- Update LAMB endpoints: `/embeddings-setups` → `/knowledge-store-setups`
- Update frontend labels
- Show plugin type in setup list

### Phase 6: Testing (Week 6)
- Run KB-Server E2E tests (must pass 100%)
- Run Playwright tests (must pass 100%)
- Verify backward compatibility

---

## Common Scenarios

### Scenario 1: API Key Rotation (Same as #203)
```
Admin updates ChromaDB setup's plugin_config.api_key
→ All 42 collections using this setup immediately use new key
→ No collection-level updates needed!
```
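
The rotation scenario works because collections hold only a setup id and the config is read at call time. A toy sketch of that lookup pattern follows; the `SETUPS` dict stands in for the `knowledge_store_setups` table, and every name in it is hypothetical.

```python
from typing import Dict

# Hypothetical stand-in for the knowledge_store_setups table.
SETUPS: Dict[int, Dict] = {
    1: {"plugin_type": "chromadb", "plugin_config": {"api_key": "sk-old"}},
}


def resolve_setup(setup_id: int) -> Dict:
    """Read plugin_config at call time instead of caching it."""
    return SETUPS[setup_id]["plugin_config"]


class Collection:
    """A collection stores only the setup id, never the key itself."""

    def __init__(self, name: str, setup_id: int) -> None:
        self.name = name
        self.setup_id = setup_id

    def current_api_key(self) -> str:
        return resolve_setup(self.setup_id)["api_key"]


collections = [Collection(f"kb-{i}", setup_id=1) for i in range(3)]

# Admin rotates the key once, on the setup.
SETUPS[1]["plugin_config"]["api_key"] = "sk-new"

# Every collection now resolves to the rotated key.
keys = {c.current_api_key() for c in collections}
```

The trade-off is one setup lookup per operation. If that becomes a bottleneck, a short-lived cache could be added at the cost of a small delay before a rotated key takes effect.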

### Scenario 2: Adding New Backend (NEW)
```
Admin creates new Neo4j setup:
  plugin_type: "neo4j"
  plugin_config: {uri, user, password, schema}

→ Users can now choose Neo4j when creating collections
→ Existing collections unaffected
```

### Scenario 3: Choosing Backend by Use Case (NEW)
```
Use Case                      → Best Plugin
─────────────────────────────────────────────
Document Q&A, semantic search → ChromaDB/Qdrant (vector)
Knowledge graph, relationships → Neo4j (graph)
Exact phrase matching, legal  → ElasticSearch (keyword)
Hybrid: semantic + keyword    → Hybrid plugin
```

---

## Benefits Summary

### Technical Benefits
✅ **Technology-agnostic** - Same API for vector, graph, keyword, hybrid
✅ **Plugin autonomy** - Each plugin decides how to process chunks
✅ **Simpler services** - No embedding knowledge, cleaner code
✅ **Better testing** - Mock plugins easily
✅ **Future-proof** - New retrieval tech added without touching services

### Business Benefits
✅ **Flexibility** - Organizations choose best tech for their needs
✅ **Risk reduction** - Not locked to single database vendor
✅ **Innovation** - Experiment with new retrieval approaches
✅ **Cost optimization** - Use cost-effective backends per use case

### Comparison to Current State

| Aspect | Before (Issue #203) | After (Plugin Architecture) |
|--------|---------------------|------------------------------|
| Setup naming | `embeddings_setups` | `knowledge_store_setups` |
| Config structure | Individual fields | Generic `plugin_config` JSON |
| Backend support | ChromaDB only | ChromaDB, Neo4j, ElasticSearch, etc. |
| Plugin selection | N/A (implicit) | Via `plugin_type` field |
| Service layer | Knows embeddings | Technology-agnostic |
| Embedding computation | Service layer | Plugin-internal |
| Adding backend | Refactor 8+ files | Create plugin, register |
| API key rotation | ✅ Works | ✅ Works |
| Per-collection backend | ❌ All ChromaDB | ✅ Each chooses |

---

## Validation & Testing

### Backward Compatibility Requirements
- [ ] All existing collections continue working
- [ ] Can ingest to existing collections
- [ ] Can query existing collections
- [ ] API key rotation still works
- [ ] All E2E tests pass without modification
- [ ] All Playwright tests pass without modification

### New Features to Verify
- [ ] Can list available plugin types
- [ ] Can validate plugin configs
- [ ] ChromaDB plugin behaves identically to the pre-plugin implementation
- [ ] Organizations can create setups with different plugin types
- [ ] Collections can be created with any active setup

---

## Open Questions

1. **Plugin Config Migration:** Keep old individual fields for backward compatibility or require immediate migration to `plugin_config`?

2. **Cross-Plugin Migration:** Support migrating collections between plugin types (e.g., ChromaDB → Qdrant)?

3. **Plugin Versioning:** Should setups include plugin version info?

4. **System-Wide Setups:** Should there be system setups available to all orgs?

5. **Plugin Discovery:** Auto-discover from plugins directory or explicit registration?

---

## Success Metrics

After implementation:
- [ ] API key rotation takes < 1 minute
- [ ] Can add plugin types without modifying service code
- [ ] Organizations can offer multiple backend choices
- [ ] ChromaDB plugin behaves identically to the pre-plugin implementation
- [ ] All existing tests pass
- [ ] Setup configs never expose credentials
- [ ] Collection creation shows available backends

---

## Next Steps

1. **Review** - Get team approval on architecture
2. **Phase 1** - Database schema migration
3. **Phase 2** - Plugin infrastructure
4. **Phase 3** - Service refactoring
5. **Phase 4** - API updates
6. **Phase 5** - LAMB integration
7. **Phase 6** - Testing & validation
8. **Deploy** - Gradual rollout with monitoring

---

**Prerequisites:** Issue #203 (KnowledgeStore Setups) must be implemented first.

**Status:** Draft for Review



(ChromaDB Plugin) Convert current ChromaDB Implementation to be a Plugin #229
