Problem
After #459 (Entity Registry), entities are created by the curator during curation runs. Duplicate entities will inevitably appear:
- Same person referenced as "Seylan Cinar" in one session and "Seylan" in another → two entities
- A tool referenced as "GitHub Actions" and "GHA" → two entities with no alias overlap
- A service called "the cache" in one context and "Redis" in another
The current findDuplicateCandidates() uses alias overlap and Jaccard word-similarity, which catches exact alias collisions and word-level name overlap but misses semantic duplicates where the names are spelled differently but mean the same thing.
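To make the gap concrete, here is a minimal sketch of word-level Jaccard similarity. The `wordSet` and `jaccard` helpers are hypothetical and written only for illustration; they are not the code inside findDuplicateCandidates().

```typescript
// Hypothetical word-level Jaccard similarity, for illustration only.
function wordSet(name: string): Set<string> {
  return new Set(name.toLowerCase().split(/\s+/).filter((w) => w.length > 0));
}

function jaccard(a: string, b: string): number {
  const wa = wordSet(a);
  const wb = wordSet(b);
  if (wa.size === 0 && wb.size === 0) return 0;
  let shared = 0;
  wa.forEach((w) => {
    if (wb.has(w)) shared++;
  });
  // |A ∩ B| / |A ∪ B|
  return shared / (wa.size + wb.size - shared);
}

console.log(jaccard("Seylan Cinar", "Seylan")); // 0.5: one shared word, two distinct words
console.log(jaccard("GitHub Actions", "GHA")); // 0: no shared words
console.log(jaccard("the cache", "Redis")); // 0: no shared words
```

Word overlap gives a partial signal for the first pair but nothing at all for the other two, which is the case embedding similarity is meant to cover.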
Proposal: Embedding-Based Alias Clustering
Leverage Lore's existing embedding infrastructure (Nomic v1.5 local, with Voyage/OpenAI fallbacks) to compute vector similarity between entity names and aliases, then cluster and suggest merges.
How It Works
- Embed entity names + aliases: When an entity is created or updated, embed its canonical name and all alias values into a single composite vector (average of all alias embeddings). Store it in an embedding BLOB column on the entities table (same pattern as knowledge.embedding).
- Pairwise similarity scan: After the curator creates entities, compute cosine similarity between each new entity's embedding and all existing entities. Flag pairs above a threshold (e.g., 0.85 for Nomic v1.5, lower than the 0.935 knowledge dedup threshold since entity names are shorter and noisier).
- Cluster formation: Use star clustering (same algorithm as ltm.deduplicate()): pick the highest-similarity pair, merge, repeat. No transitive chains: if A~B and B~C but not A~C, only merge the highest pair.
- Auto-merge vs. suggest: Two modes:
  - Auto-merge (high confidence, similarity ≥ 0.92): merge silently, log the action
  - Suggest (moderate confidence, 0.85 ≤ similarity < 0.92): surface in the CLI (lore entity dedup) and the web dashboard (/ui/entities) as merge suggestions
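The first, second, and fourth steps above can be sketched roughly as follows. This assumes the per-alias vectors are already available as Float32Array values; `meanVector` and `classify` are hypothetical names, and `cosineSimilarity` is re-implemented here only so the snippet is self-contained. The 0.85 and 0.92 cut-offs are the starting values from this proposal, not calibrated numbers.

```typescript
type MergeDecision = "auto-merge" | "suggest" | "ignore";

// Starting thresholds from the proposal; expected to be calibrated later.
const SUGGEST_THRESHOLD = 0.85;
const AUTO_MERGE_THRESHOLD = 0.92;

// Average the canonical-name and alias vectors into one composite vector.
function meanVector(vectors: Float32Array[]): Float32Array {
  if (vectors.length === 0) throw new Error("need at least one vector");
  const dim = vectors[0].length;
  const out = new Float32Array(dim);
  for (const v of vectors) {
    if (v.length !== dim) throw new Error("dimension mismatch");
    for (let i = 0; i < dim; i++) out[i] += v[i] / vectors.length;
  }
  return out;
}

function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Map a similarity score onto the two modes described above.
function classify(similarity: number): MergeDecision {
  if (similarity >= AUTO_MERGE_THRESHOLD) return "auto-merge";
  if (similarity >= SUGGEST_THRESHOLD) return "suggest";
  return "ignore";
}
```

One consequence of averaging is that a single unusual alias shifts the composite vector less as the alias list grows, which may or may not be desirable for entities with many nicknames.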
Dedup Signals (Beyond Embedding Similarity)
Combine multiple signals for higher precision:
| Signal | Weight | Example |
| --- | --- | --- |
| Embedding cosine similarity | Primary | "GitHub Actions" ↔ "GHA" |
| Alias value overlap | Strong (auto-merge) | Both have alias email:bob@corp.com |
| Same entity_type | Required | Don't merge a person with a service |
| Canonical name Jaccard | Moderate | "Redis Labs" ↔ "Redis" |
| Linked knowledge overlap | Boost | Both referenced by same knowledge entry |
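One possible way to combine these signals is sketched below. The `PairSignals` shape, the `scorePair` name, and especially the two numeric weights are assumptions made for illustration; the proposal only fixes the qualitative roles (required, strong, primary, moderate, boost).

```typescript
// Hypothetical per-pair signal bundle; field names are illustrative.
interface PairSignals {
  embeddingSimilarity: number; // cosine similarity, expected in [0, 1]
  sameEntityType: boolean;
  sharedAliasValue: boolean; // e.g. both carry email:bob@corp.com
  nameJaccard: number; // word-level Jaccard of canonical names, in [0, 1]
  sharedKnowledgeLinks: number; // count of knowledge entries linking both
}

// Assumed weights, not taken from the proposal.
const JACCARD_WEIGHT = 0.05;
const KNOWLEDGE_BOOST = 0.03;

function scorePair(s: PairSignals): number {
  // Required: never merge across entity types.
  if (!s.sameEntityType) return 0;
  // Strong: an identical alias value is treated as a certain match.
  if (s.sharedAliasValue) return 1;
  // Primary signal plus small additive boosts, capped at 1.
  const boost =
    JACCARD_WEIGHT * s.nameJaccard + (s.sharedKnowledgeLinks > 0 ? KNOWLEDGE_BOOST : 0);
  return Math.min(1, s.embeddingSimilarity + boost);
}
```

The resulting score can then be compared against the same 0.85 and 0.92 bands used for raw embedding similarity.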
Adaptive Threshold
Mirror the knowledge dedup adaptive threshold system (dedup_feedback table):
- Record auto-merge signals (merged pair → accept, high-sim non-merged → reject)
- Calibrate the entity dedup threshold with enough samples (MIN_CALIBRATION_SAMPLES=20)
- Store per-project calibrated threshold in kv_meta
Implementation Plan
Step 1: Schema — Add embedding column to entities (~30min)
- Migration v28+: ALTER TABLE entities ADD COLUMN embedding BLOB
- Same pattern as knowledge.embedding (Float32Array stored as BLOB)
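A minimal sketch of the Float32Array to BLOB round trip, assuming the SQLite driver accepts and returns BLOBs as byte arrays (Node's Buffer is a Uint8Array subclass). The helper names `embeddingToBlob` and `blobToEmbedding` are hypothetical and may not match how knowledge.embedding is handled today.

```typescript
// Serialize an embedding to raw bytes for a BLOB column.
function embeddingToBlob(vec: Float32Array): Uint8Array {
  // Copy so the stored bytes do not alias the caller's buffer.
  return new Uint8Array(vec.buffer.slice(vec.byteOffset, vec.byteOffset + vec.byteLength));
}

// Deserialize BLOB bytes back into an embedding.
function blobToEmbedding(blob: Uint8Array): Float32Array {
  if (blob.byteLength % 4 !== 0) {
    throw new Error("BLOB length is not a multiple of 4 bytes");
  }
  // Copy into a fresh buffer: a driver-provided view may start at an
  // offset that is not 4-byte aligned, which Float32Array rejects.
  const copy = blob.buffer.slice(blob.byteOffset, blob.byteOffset + blob.byteLength);
  return new Float32Array(copy);
}

const original = Float32Array.from([0.25, -1.5, 3]);
const stored = embeddingToBlob(original);
const restored = blobToEmbedding(stored);
console.log(stored.byteLength, restored.length); // 12 3
```

Both directions copy the bytes, which costs a little memory per call but avoids surprises when the driver reuses its internal buffers. The bytes are written in the platform's native byte order, so the sketch assumes the database is read on machines with the same endianness.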
Step 2: Entity embedding pipeline (~1-2h)
- embedEntity(id): embed "${canonical_name} ${aliases.join(" ")}" as a single vector
- Call on create/update (fire-and-forget, same as embedKnowledgeEntry)
- vectorSearchEntities(queryVec, limit): search entities by embedding similarity
- Backfill existing entities on first run
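As a rough sketch of this step, the snippet below builds the text to embed and runs a brute-force similarity search over rows already loaded into memory. The `EntityRow` shape and the `entityEmbeddingText` and `searchEntitiesInMemory` names are assumptions; the real vectorSearchEntities(queryVec, limit) would read rows from the database rather than take them as an argument. It follows the concatenated-string variant described in this step, not the per-alias average mentioned under How It Works.

```typescript
// Hypothetical row shape; the real entities table may differ.
interface EntityRow {
  id: string;
  canonical_name: string;
  aliases: string[];
  embedding: Float32Array | null; // null until embedded or backfilled
}

// Text passed to the embedding model for one entity.
function entityEmbeddingText(e: Pick<EntityRow, "canonical_name" | "aliases">): string {
  return `${e.canonical_name} ${e.aliases.join(" ")}`.trim();
}

function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Brute-force top-k search; rows without an embedding are skipped.
function searchEntitiesInMemory(
  rows: EntityRow[],
  queryVec: Float32Array,
  limit: number,
): { id: string; similarity: number }[] {
  const hits: { id: string; similarity: number }[] = [];
  for (const row of rows) {
    if (row.embedding === null) continue;
    hits.push({ id: row.id, similarity: cosineSimilarity(queryVec, row.embedding) });
  }
  return hits.sort((a, b) => b.similarity - a.similarity).slice(0, limit);
}
```

A linear scan is likely adequate while a project has hundreds or a few thousand entities; an index would only matter at much larger counts.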
Step 3: Pairwise dedup scan (~2h)
- deduplicateEntities(projectPath, opts?): same structure as ltm.deduplicate()
- Multi-signal scoring: embedding similarity × type-match × alias-overlap boost
- Star clustering with configurable threshold
- Return { merged, suggested, pairSimilarities }
- Wire into post-curation flow (after entity creation in applyOps)
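The scan could look roughly like the sketch below. It is a simplified greedy pass in which each entity joins at most one pair per run, used here as a stand-in for the star clustering in ltm.deduplicate(), whose internals are not shown in this issue. The `planEntityMerges` name, the in-memory input, and the exact contents of pairSimilarities are assumptions; only the { merged, suggested, pairSimilarities } return shape comes from the plan above.

```typescript
interface EmbeddedEntity {
  id: string;
  entityType: string;
  embedding: Float32Array;
}

interface PairSimilarity {
  a: string;
  b: string;
  similarity: number;
}

interface DedupResult {
  merged: PairSimilarity[];
  suggested: PairSimilarity[];
  pairSimilarities: PairSimilarity[];
}

function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

function planEntityMerges(
  entities: EmbeddedEntity[],
  suggestThreshold = 0.85,
  autoMergeThreshold = 0.92,
): DedupResult {
  // Collect every same-type pair at or above the suggest threshold.
  const pairSimilarities: PairSimilarity[] = [];
  for (let i = 0; i < entities.length; i++) {
    for (let j = i + 1; j < entities.length; j++) {
      const x = entities[i];
      const y = entities[j];
      if (x.entityType !== y.entityType) continue; // same type is required
      const similarity = cosineSimilarity(x.embedding, y.embedding);
      if (similarity >= suggestThreshold) {
        pairSimilarities.push({ a: x.id, b: y.id, similarity });
      }
    }
  }
  pairSimilarities.sort((p, q) => q.similarity - p.similarity);

  // Highest pair first; an entity already paired is skipped, so A~B and
  // B~C never chain into a merge of A and C.
  const used = new Set<string>();
  const merged: PairSimilarity[] = [];
  const suggested: PairSimilarity[] = [];
  for (const pair of pairSimilarities) {
    if (used.has(pair.a) || used.has(pair.b)) continue;
    used.add(pair.a);
    used.add(pair.b);
    if (pair.similarity >= autoMergeThreshold) merged.push(pair);
    else suggested.push(pair);
  }
  return { merged, suggested, pairSimilarities };
}
```

The pairwise loop is quadratic in the number of entities, so for the post-curation hook it may be preferable to compare only newly created entities against the existing set.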
Step 4: CLI lore entity dedup (~1h)
- lore entity dedup: show merge suggestions with similarity scores
- lore entity dedup --auto: auto-merge above threshold
- lore entity dedup --dry-run: show what would be merged
- Accept/reject feedback stored in dedup_feedback (entity variant)
Step 5: Web dashboard merge suggestions (~1h)
- /ui/entities: show a banner when dedup candidates exist
- "Suggested merges" section with similarity scores and one-click merge buttons
- POST /ui/api/merge/entity/:targetId/:sourceId
Step 6: Adaptive threshold calibration (~1h)
- Reuse dedup_feedback table with entity_type discriminator
- calibrateEntityDedupThreshold(projectId): logistic regression on feedback
- Auto-recalibrate after each dedup run
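A minimal sketch of what the calibration might do, assuming feedback rows reduce to a similarity score plus an accept/reject flag. It fits a one-feature logistic regression with plain gradient descent and takes the similarity at which the predicted acceptance probability is 0.5 as the new threshold. The `FeedbackSample` shape, the `calibrateThreshold` name, the 0.5 cut-off, and the learning-rate and iteration defaults are all assumptions; only MIN_CALIBRATION_SAMPLES=20 comes from this proposal.

```typescript
// Hypothetical view of one dedup_feedback row.
interface FeedbackSample {
  similarity: number;
  accepted: boolean;
}

const MIN_CALIBRATION_SAMPLES = 20; // value named in the proposal

function sigmoid(z: number): number {
  return 1 / (1 + Math.exp(-z));
}

// Fit p(accept | similarity) = sigmoid(w * (similarity - mean) + b)
// and return the similarity where p = 0.5.
function calibrateThreshold(
  samples: FeedbackSample[],
  fallback: number,
  iterations = 5000,
  learningRate = 2,
): number {
  const hasBoth = samples.some((s) => s.accepted) && samples.some((s) => !s.accepted);
  if (samples.length < MIN_CALIBRATION_SAMPLES || !hasBoth) return fallback;

  // Center the feature so the intercept stays small and well conditioned.
  const mean = samples.reduce((sum, s) => sum + s.similarity, 0) / samples.length;
  let w = 0;
  let b = 0;
  for (let iter = 0; iter < iterations; iter++) {
    let gradW = 0;
    let gradB = 0;
    for (const s of samples) {
      const x = s.similarity - mean;
      const error = sigmoid(w * x + b) - (s.accepted ? 1 : 0);
      gradW += error * x;
      gradB += error;
    }
    w -= (learningRate * gradW) / samples.length;
    b -= (learningRate * gradB) / samples.length;
  }
  // A non-positive slope means higher similarity did not predict acceptance.
  if (w <= 0) return fallback;
  // Solve w * (t - mean) + b = 0 for t, then clamp to a valid similarity.
  return Math.min(1, Math.max(0, mean - b / w));
}
```

Falling back to the static threshold when there are too few samples, only one class of feedback, or a non-positive slope keeps a noisy early run from moving the threshold somewhere unsafe.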
Time Estimate
~6-8h of AI-assisted development. Can be done in 1-2 sessions.
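Fits With
- findDuplicateCandidates() already exists as a stub
- Existing embedding infrastructure: embed(), cosineSimilarity(), BLOB storage pattern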
Files to Modify
- packages/core/src/db.ts: migration adding the embedding column on entities
- packages/core/src/entities.ts: deduplicateEntities(), embedEntity(), vectorSearchEntities()
- packages/core/src/embedding.ts: embedEntityEntry(), vectorSearchEntities() (vector storage/search)
- packages/core/src/curator.ts: wire entity dedup after creation (same as knowledge post-curation dedup)
- packages/gateway/src/cli/entity.ts: dedup subcommand
- packages/gateway/src/ui.ts: merge suggestions in entity list page