
Entity Auto-Dedup: embedding-based alias clustering and merge suggestions #462

@BYK

Problem

After #459 (Entity Registry), entities are created by the curator during curation runs. Duplicate entities will inevitably appear:

  • Same person referenced as "Seylan Cinar" in one session and "Seylan" in another → two entities
  • A tool referenced as "GitHub Actions" and "GHA" → two entities with no alias overlap
  • A service called "the cache" in one context and "Redis" in another

The current findDuplicateCandidates() uses alias overlap and Jaccard word-similarity, which catches exact alias collisions and word-level name overlap but misses semantic duplicates where the names are spelled differently but mean the same thing.
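To make the gap concrete, here is a minimal sketch of word-level Jaccard similarity applied to the three examples above. It is not the actual findDuplicateCandidates() code: the helper name jaccardWords and the whitespace tokenization are assumptions made for illustration.

```typescript
// Hypothetical helper: word-level Jaccard similarity between two names.
// The real findDuplicateCandidates() may tokenize and score differently.
function jaccardWords(a: string, b: string): number {
  const tokens = (s: string): Set<string> =>
    new Set(s.toLowerCase().split(/\s+/).filter((w) => w.length > 0));
  const ta = tokens(a);
  const tb = tokens(b);
  if (ta.size === 0 && tb.size === 0) return 0;
  let shared = 0;
  ta.forEach((w) => {
    if (tb.has(w)) shared++;
  });
  // |A ∩ B| / |A ∪ B|
  return shared / (ta.size + tb.size - shared);
}

console.log(jaccardWords("Seylan Cinar", "Seylan")); // 0.5 (one shared word, two distinct words)
console.log(jaccardWords("GitHub Actions", "GHA")); // 0 (no shared words)
console.log(jaccardWords("the cache", "Redis")); // 0 (no shared words)
```

The first pair gets partial credit, but the abbreviation and the contextual nickname both score zero, which is the case embedding similarity is meant to cover.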

Proposal: Embedding-Based Alias Clustering

Leverage Lore's existing embedding infrastructure (Nomic v1.5 local, with Voyage/OpenAI fallbacks) to compute vector similarity between entity names and aliases, then cluster and suggest merges.

How It Works

  1. Embed entity names + aliases — When an entity is created or updated, embed its canonical name and all alias values into a single composite vector (average of all alias embeddings). Store in an embedding BLOB column on the entities table (same pattern as knowledge.embedding).

  2. Pairwise similarity scan — After curator creates entities, compute cosine similarity between the new entity's embedding and all existing entities. Flag pairs above a threshold (e.g., 0.85 for Nomic v1.5 — lower than the 0.935 knowledge dedup threshold since entity names are shorter and noisier).

  3. Cluster formation — Use star clustering (same algorithm as ltm.deduplicate()): pick highest-similarity pair, merge, repeat. No transitive chains: if A~B and B~C but not A~C, only merge the highest pair.

  4. Auto-merge vs. suggest — Two modes:

    • Auto-merge (high confidence, similarity ≥ 0.92): merge silently, log the action
    • Suggest (moderate confidence, 0.85 ≤ similarity < 0.92): surface in CLI lore entity dedup and web dashboard /ui/entities as merge suggestions
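The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not Lore's implementation: the function names (averageVectors, cosineSimilarity, classifyPair) are hypothetical, and only the 0.85 and 0.92 thresholds come from the proposal.

```typescript
type Vec = Float32Array;

// Step 1: average several alias embeddings into one composite vector.
function averageVectors(vectors: Vec[]): Vec {
  if (vectors.length === 0) throw new Error("need at least one vector");
  const dim = vectors[0].length;
  const out = new Float32Array(dim);
  for (const v of vectors) {
    if (v.length !== dim) throw new Error("dimension mismatch");
    for (let i = 0; i < dim; i++) out[i] += v[i] / vectors.length;
  }
  return out;
}

// Step 2: cosine similarity between two entity embeddings.
function cosineSimilarity(a: Vec, b: Vec): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / Math.sqrt(normA * normB);
}

// Step 4: thresholds as given in the proposal.
const AUTO_MERGE_THRESHOLD = 0.92;
const SUGGEST_THRESHOLD = 0.85;

type MergeDecision = "auto-merge" | "suggest" | "ignore";

function classifyPair(similarity: number): MergeDecision {
  if (similarity >= AUTO_MERGE_THRESHOLD) return "auto-merge";
  if (similarity >= SUGGEST_THRESHOLD) return "suggest";
  return "ignore";
}
```

For example, classifyPair(0.9) returns "suggest", so that pair would appear in lore entity dedup rather than being merged silently.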

Dedup Signals (Beyond Embedding Similarity)

Combine multiple signals for higher precision:

| Signal | Weight | Example |
|---|---|---|
| Embedding cosine similarity | Primary | "GitHub Actions" ↔ "GHA" |
| Alias value overlap | Strong (auto-merge) | Both have alias email:bob@corp.com |
| Same entity_type | Required | Don't merge a person with a service |
| Canonical name Jaccard | Moderate | "Redis Labs" ↔ "Redis" |
| Linked knowledge overlap | Boost | Both referenced by same knowledge entry |
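One possible way to combine these signals is sketched below. The proposal only gives qualitative weights, so the numeric values (JACCARD_WEIGHT, KNOWLEDGE_BOOST), the PairSignals shape, and the scorePair name are all assumptions chosen for illustration.

```typescript
// Hypothetical shape for the signals of one candidate pair.
interface PairSignals {
  embeddingSimilarity: number; // cosine similarity, the primary signal
  sameEntityType: boolean; // required: never merge across types
  sharedAlias: boolean; // strong: an identical alias value on both entities
  nameJaccard: number; // moderate: word overlap of canonical names, 0..1
  sharedKnowledge: boolean; // boost: both linked from the same knowledge entry
}

// Illustrative weights only; the proposal does not specify numbers.
const JACCARD_WEIGHT = 0.05;
const KNOWLEDGE_BOOST = 0.03;

function scorePair(s: PairSignals): number {
  if (!s.sameEntityType) return 0; // hard gate
  if (s.sharedAlias) return 1; // treated as an auto-merge signal
  const score =
    s.embeddingSimilarity +
    JACCARD_WEIGHT * s.nameJaccard +
    (s.sharedKnowledge ? KNOWLEDGE_BOOST : 0);
  return Math.min(1, Math.max(0, score)); // keep the result in [0, 1]
}
```

The type check acts as a gate rather than a weight, which matches the "Required" row: a person and a service never merge regardless of how similar their names are.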

Adaptive Threshold

Mirror the knowledge dedup adaptive threshold system (dedup_feedback table):

  • Record auto-merge signals (merged pair → accept, high-sim non-merged → reject)
  • Calibrate the entity dedup threshold with enough samples (MIN_CALIBRATION_SAMPLES=20)
  • Store per-project calibrated threshold in kv_meta

Implementation Plan

Step 1: Schema — Add embedding column to entities (~30min)

  • Migration v28+: ALTER TABLE entities ADD COLUMN embedding BLOB
  • Same pattern as knowledge.embedding (Float32Array stored as BLOB)
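As a rough illustration of the Float32Array-as-BLOB pattern, the round trip could look like the sketch below. The helper names vectorToBlob and blobToVector are hypothetical, and how knowledge.embedding actually serializes its vectors is not shown in this issue. The bytes are written in the platform's native byte order.

```typescript
// Hypothetical helpers for storing an embedding in a BLOB column.
function vectorToBlob(vec: Float32Array): Uint8Array {
  // View the vector's bytes, then copy so the blob owns its own buffer.
  return new Uint8Array(vec.buffer, vec.byteOffset, vec.byteLength).slice();
}

function blobToVector(blob: Uint8Array): Float32Array {
  if (blob.byteLength % 4 !== 0) {
    throw new Error("blob length is not a multiple of 4 bytes");
  }
  // Copy first: the blob may sit at an offset that is not 4-byte aligned,
  // which a Float32Array view would reject.
  const copy = blob.slice();
  return new Float32Array(copy.buffer, copy.byteOffset, copy.byteLength / 4);
}

const original = new Float32Array([0.5, -1, 2]);
const restored = blobToVector(vectorToBlob(original));
console.log(Array.from(restored)); // [ 0.5, -1, 2 ]
```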

Step 2: Entity embedding pipeline (~1-2h)

  • embedEntity(id) — embed "${canonical_name} ${aliases.join(" ")}" as a single vector
  • Call on create/update (fire-and-forget, same as embedKnowledgeEntry)
  • vectorSearchEntities(queryVec, limit) — search entities by embedding similarity
  • Backfill existing entities on first run
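A minimal in-memory sketch of the two pieces above: building the text that gets embedded, and ranking entities against a query vector. The EntityRow shape and the function names are assumptions. The proposed vectorSearchEntities(queryVec, limit) would read from the database, whereas this variant takes the rows as an argument so the example stays self-contained. The embedding call itself is model-specific and left out.

```typescript
// Hypothetical row shape; the real entities table may differ.
interface EntityRow {
  id: string;
  canonicalName: string;
  aliases: string[];
  embedding: Float32Array | null; // null until the entity has been embedded
}

// Text passed to the embedding model: canonical name followed by all aliases.
function entityEmbeddingText(e: { canonicalName: string; aliases: string[] }): string {
  return [e.canonicalName, ...e.aliases].join(" ").trim();
}

function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Brute-force search: score every embedded entity, return the top `limit`.
function vectorSearchEntitiesInMemory(
  queryVec: Float32Array,
  entities: EntityRow[],
  limit: number,
): { id: string; similarity: number }[] {
  const scored: { id: string; similarity: number }[] = [];
  for (const e of entities) {
    if (e.embedding === null) continue; // not backfilled yet
    scored.push({ id: e.id, similarity: cosine(queryVec, e.embedding) });
  }
  return scored.sort((x, y) => y.similarity - x.similarity).slice(0, limit);
}
```

Entities without an embedding are skipped rather than treated as errors, so the search keeps working while the backfill is still in progress.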

Step 3: Pairwise dedup scan (~2h)

  • deduplicateEntities(projectPath, opts?) — same structure as ltm.deduplicate()
  • Multi-signal scoring: embedding similarity × type-match × alias-overlap boost
  • Star clustering with configurable threshold
  • Return { merged, suggested, pairSimilarities }
  • Wire into post-curation flow (after entity creation in applyOps)
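The "pick highest pair, merge, repeat, no transitive chains" rule can be read as a greedy selection over scored pairs. The sketch below follows that reading; it may differ from what ltm.deduplicate() actually does, and the ScoredPair type and selectMerges name are hypothetical.

```typescript
// Hypothetical scored candidate pair.
interface ScoredPair {
  a: string; // entity id
  b: string; // entity id
  similarity: number;
}

// Greedy selection: take pairs from highest to lowest similarity and skip
// any pair that touches an entity already chosen for a merge. This is what
// prevents chains such as A~B, B~C collapsing A, B and C into one entity.
function selectMerges(pairs: ScoredPair[], threshold: number): ScoredPair[] {
  const used = new Set<string>();
  const selected: ScoredPair[] = [];
  const sorted = [...pairs].sort((x, y) => y.similarity - x.similarity);
  for (const p of sorted) {
    if (p.similarity < threshold) break; // remaining pairs are all lower
    if (used.has(p.a) || used.has(p.b)) continue;
    used.add(p.a);
    used.add(p.b);
    selected.push(p);
  }
  return selected;
}

const merges = selectMerges(
  [
    { a: "A", b: "B", similarity: 0.95 },
    { a: "B", b: "C", similarity: 0.9 },
  ],
  0.85,
);
console.log(merges.map((m) => `${m.a}+${m.b}`)); // [ 'A+B' ]
```

In the usage above, B~C is above the threshold but is skipped because B is already part of the higher-scoring A~B merge.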

Step 4: CLI lore entity dedup (~1h)

  • lore entity dedup — show merge suggestions with similarity scores
  • lore entity dedup --auto — auto-merge above threshold
  • lore entity dedup --dry-run — show what would be merged
  • Accept/reject feedback stored in dedup_feedback (entity variant)

Step 5: Web dashboard merge suggestions (~1h)

  • /ui/entities — show a banner when dedup candidates exist
  • "Suggested merges" section with similarity scores and one-click merge buttons
  • POST /ui/api/merge/entity/:targetId/:sourceId

Step 6: Adaptive threshold calibration (~1h)

  • Reuse dedup_feedback table with entity_type discriminator
  • calibrateEntityDedupThreshold(projectId) — logistic regression on feedback
  • Auto-recalibrate after each dedup run
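As a sketch of what the calibration step might look like, the function below fits a one-dimensional logistic regression on (similarity, accepted) feedback and returns the similarity at which the predicted acceptance probability is 0.5. The FeedbackSample shape, learning rate, iteration count, and clamp bounds are assumptions; only MIN_CALIBRATION_SAMPLES=20 and the 0.85 default come from this proposal.

```typescript
// Hypothetical feedback record derived from dedup_feedback rows.
interface FeedbackSample {
  similarity: number;
  accepted: boolean;
}

const MIN_CALIBRATION_SAMPLES = 20; // from the proposal
const DEFAULT_THRESHOLD = 0.85; // from the proposal

function calibrateThreshold(
  samples: FeedbackSample[],
  fallback: number = DEFAULT_THRESHOLD,
): number {
  if (samples.length < MIN_CALIBRATION_SAMPLES) return fallback;
  const n = samples.length;
  const mean = samples.reduce((s, x) => s + x.similarity, 0) / n;
  const variance =
    samples.reduce((s, x) => s + (x.similarity - mean) * (x.similarity - mean), 0) / n;
  const std = Math.sqrt(variance);
  if (std === 0) return fallback; // all similarities identical, nothing to fit

  // Batch gradient descent on the standardized feature z = (x - mean) / std.
  let w = 0;
  let b = 0;
  const learningRate = 0.1; // assumption
  for (let iter = 0; iter < 2000; iter++) {
    let gradW = 0;
    let gradB = 0;
    for (const s of samples) {
      const z = (s.similarity - mean) / std;
      const p = 1 / (1 + Math.exp(-(w * z + b)));
      const err = p - (s.accepted ? 1 : 0);
      gradW += err * z;
      gradB += err;
    }
    w -= (learningRate * gradW) / n;
    b -= (learningRate * gradB) / n;
  }
  if (w <= 0) return fallback; // higher similarity does not predict acceptance

  // p = 0.5 where w * z + b = 0, so z = -b / w; map back to similarity units.
  const threshold = mean - (b / w) * std;
  return Math.min(0.99, Math.max(0.5, threshold)); // clamp bounds are assumptions
}
```

Falling back to the default when there are too few samples, or when the fit is degenerate, keeps a handful of early accept/reject clicks from swinging the per-project threshold.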

Time Estimate

~6-8h of AI-assisted development. Can be done in 1-2 sessions.

Fits With

Files to Modify

  • packages/core/src/db.ts: migration, embedding column on entities
  • packages/core/src/entities.ts: deduplicateEntities(), embedEntity(), vectorSearchEntities()
  • packages/core/src/embedding.ts: embedEntityEntry(), vectorSearchEntities() (vector storage/search)
  • packages/core/src/curator.ts: wire entity dedup after creation (same as knowledge post-curation dedup)
  • packages/gateway/src/cli/entity.ts: dedup subcommand
  • packages/gateway/src/ui.ts: merge suggestions in entity list page
