Entity Auto-Dedup: embedding-based alias clustering and merge suggestions #462

## Problem

After #459 (Entity Registry), entities are created by the curator during curation runs. Duplicate entities will inevitably appear:
- The same person referenced as "Seylan Cinar" in one session and "Seylan" in another → two entities
- A tool referenced as "GitHub Actions" and "GHA" → two entities with no alias overlap
- A service called "the cache" in one context and "Redis" in another → two entities with no shared words

The current `findDuplicateCandidates()` uses alias overlap and Jaccard word-similarity. That catches exact alias collisions and word-level name overlap, but it misses **semantic** duplicates: names that share no alias and no words yet refer to the same thing, such as "GitHub Actions" and "GHA".

## Proposal: Embedding-Based Alias Clustering

Leverage Lore's existing embedding infrastructure (Nomic v1.5 local, with Voyage/OpenAI fallbacks) to compute vector similarity between entity names and aliases, then cluster and suggest merges.

### How It Works

1. **Embed entity names + aliases**: When an entity is created or updated, embed its canonical name and all alias values into a single composite vector (the average of the name and alias embeddings). Store it in an `embedding` BLOB column on the `entities` table (same pattern as `knowledge.embedding`).

2. **Pairwise similarity scan**: After the curator creates entities, compute the cosine similarity between each new entity's embedding and every existing entity's embedding. Flag pairs above a threshold (e.g., 0.85 for Nomic v1.5, lower than the 0.935 knowledge dedup threshold because entity names are shorter and noisier).

3. **Cluster formation**: Use star clustering (same algorithm as `ltm.deduplicate()`): pick the highest-similarity pair, merge it, and repeat. Chains are not followed transitively: if A~B and B~C but not A~C, only the higher-scoring of the two pairs is merged.

4. **Auto-merge vs. suggest**: merges are handled in one of two modes:
   - **Auto-merge** (high confidence, similarity ≥ 0.92): merge silently, log the action
   - **Suggest** (moderate confidence, 0.85 ≤ similarity < 0.92): surface in CLI `lore entity dedup` and web dashboard `/ui/entities` as merge suggestions
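
Steps 1, 2, and 4 above can be sketched roughly as follows. This is a minimal illustration, not Lore's implementation: `compositeEmbedding`, `cosine`, and `classifyPair` are hypothetical names (Lore's existing `embed()` and `cosineSimilarity()` would replace the stand-ins), and the two thresholds are simply the example values from the list.

```typescript
// Example thresholds from the proposal; both are expected to need tuning.
const SUGGEST_THRESHOLD = 0.85;
const AUTO_MERGE_THRESHOLD = 0.92;

type MergeDecision = "auto-merge" | "suggest" | "none";

/** Average the name and alias vectors element-wise into one composite vector. */
function compositeEmbedding(vectors: Float32Array[]): Float32Array {
  if (vectors.length === 0) throw new Error("need at least one vector");
  const dim = vectors[0].length;
  const out = new Float32Array(dim);
  for (const v of vectors) {
    if (v.length !== dim) throw new Error("dimension mismatch");
    for (let i = 0; i < dim; i++) out[i] += v[i] / vectors.length;
  }
  return out;
}

/** Stand-in for an existing cosine helper; returns 0 for zero-norm input. */
function cosine(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

/** Map a similarity score onto the two modes described in step 4. */
function classifyPair(similarity: number): MergeDecision {
  if (similarity >= AUTO_MERGE_THRESHOLD) return "auto-merge";
  if (similarity >= SUGGEST_THRESHOLD) return "suggest";
  return "none";
}
```

Averaging keeps one vector per entity regardless of alias count, at the cost of blurring aliases that are far apart in embedding space.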

### Dedup Signals (Beyond Embedding Similarity)

Combine multiple signals for higher precision:

| Signal | Weight | Example |
|--------|--------|---------|
| Embedding cosine similarity | Primary | "GitHub Actions" ↔ "GHA" |
| Alias value overlap | Strong (auto-merge) | Both have alias `email:bob@corp.com` |
| Same entity_type | Required | Don't merge a person with a service |
| Canonical name Jaccard | Moderate | "Redis Labs" ↔ "Redis" |
| Linked knowledge overlap | Boost | Both referenced by same knowledge entry |
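
One possible way to combine these signals is sketched below. The gate on `entity_type` and the alias short-circuit follow the table, but the `PairSignals` shape, the function names, and the numeric boosts (0.05 and 0.02) are hypothetical placeholders chosen only to make the example concrete.

```typescript
interface PairSignals {
  embeddingSimilarity: number; // cosine similarity of the two entity embeddings
  sameType: boolean;           // both entities share the same entity_type
  sharedAliasValue: boolean;   // at least one identical alias value on both sides
  nameJaccard: number;         // word-level Jaccard of the canonical names
  sharedKnowledgeLinks: number; // count of knowledge entries linked to both
}

/** Lower-cased word set of a name. */
function wordSet(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\s+/).filter((w) => w.length > 0));
}

/** Word-level Jaccard similarity between two names. */
function jaccard(a: string, b: string): number {
  const setA = wordSet(a);
  const setB = wordSet(b);
  if (setA.size === 0 && setB.size === 0) return 0;
  let shared = 0;
  setA.forEach((w) => {
    if (setB.has(w)) shared++;
  });
  return shared / (setA.size + setB.size - shared);
}

/** Combine the signals into one score in [0, 1]. Weights are assumptions. */
function scorePair(s: PairSignals): number {
  if (!s.sameType) return 0;         // "Required": never merge across types
  if (s.sharedAliasValue) return 1;  // "Strong": identical alias value
  const nameBoost = 0.05 * s.nameJaccard;                  // "Moderate"
  const linkBoost = s.sharedKnowledgeLinks > 0 ? 0.02 : 0; // "Boost"
  return Math.min(1, s.embeddingSimilarity + nameBoost + linkBoost);
}
```

Treating type match as a hard gate rather than a weight means a person and a service can never be merged, however similar their names are.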

### Adaptive Threshold

Mirror the knowledge dedup adaptive threshold system (`dedup_feedback` table):
- Record merge outcomes as feedback (a merged pair counts as an accept; a high-similarity pair left unmerged counts as a reject)
- Calibrate the entity dedup threshold once enough samples exist (`MIN_CALIBRATION_SAMPLES=20`)
- Store the per-project calibrated threshold in `kv_meta`

## Implementation Plan

### Step 1: Schema — Add embedding column to entities (~30min)
- Migration v28+: `ALTER TABLE entities ADD COLUMN embedding BLOB`
- Same pattern as `knowledge.embedding` (Float32Array stored as BLOB)
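
A minimal sketch of the Float32Array-to-BLOB round trip follows. It assumes the BLOB holds the raw float bytes in platform byte order; whether `knowledge.embedding` uses exactly this layout is not confirmed here, and `vectorToBlob` / `blobToVector` are hypothetical helper names.

```typescript
/** Serialize a Float32Array into raw bytes suitable for a BLOB column. */
function vectorToBlob(vec: Float32Array): Uint8Array {
  // Copy the exact byte range so the result does not alias the caller's buffer.
  return new Uint8Array(vec.buffer.slice(vec.byteOffset, vec.byteOffset + vec.byteLength));
}

/** Deserialize BLOB bytes back into a Float32Array. */
function blobToVector(blob: Uint8Array): Float32Array {
  if (blob.byteLength % 4 !== 0) {
    throw new Error("BLOB length is not a multiple of 4 bytes");
  }
  // Copy into a fresh buffer first: a Float32Array view requires a 4-byte-aligned
  // offset, which a slice of a driver-owned buffer does not guarantee.
  const copy = blob.slice();
  return new Float32Array(copy.buffer, 0, copy.byteLength / 4);
}
```

At 4 bytes per dimension, a 768-dimensional vector occupies 3,072 bytes per row.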

### Step 2: Entity embedding pipeline (~1-2h)
- `embedEntity(id)` — embed `"${canonical_name} ${aliases.join(" ")}"` as a single vector
- Call on create/update (fire-and-forget, same as `embedKnowledgeEntry`)
- `vectorSearchEntities(queryVec, limit)` — search entities by embedding similarity
- Backfill existing entities on first run
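
The text construction and the search can be sketched as below. This follows the concatenated-string form given in this step, uses a brute-force scan over rows already loaded in memory, and takes the rows as a parameter instead of querying the database. `EntityRow`, `entityEmbeddingText`, and `searchEntitiesInMemory` are hypothetical names, not the planned `embedEntity()` / `vectorSearchEntities()` signatures.

```typescript
interface EntityRow {
  id: string;
  canonicalName: string;
  aliases: string[];
  embedding: Float32Array | null; // null until the entity has been embedded
}

interface SearchHit {
  id: string;
  similarity: number;
}

/** Build the single string that would be passed to the embedding model. */
function entityEmbeddingText(entity: Pick<EntityRow, "canonicalName" | "aliases">): string {
  return [entity.canonicalName, ...entity.aliases].join(" ");
}

/** Local cosine helper; returns 0 for zero-norm input. */
function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

/** Return the `limit` most similar entities, best match first. */
function searchEntitiesInMemory(queryVec: Float32Array, rows: EntityRow[], limit: number): SearchHit[] {
  const hits: SearchHit[] = [];
  for (const row of rows) {
    if (row.embedding === null) continue; // skip rows still awaiting backfill
    hits.push({ id: row.id, similarity: cosine(queryVec, row.embedding) });
  }
  hits.sort((x, y) => y.similarity - x.similarity);
  return hits.slice(0, limit);
}
```

A linear scan is likely adequate while a project has at most a few thousand entities; an index would only matter well beyond that.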

### Step 3: Pairwise dedup scan (~2h)
- `deduplicateEntities(projectPath, opts?)` — same structure as `ltm.deduplicate()`
- Multi-signal scoring: embedding similarity × type-match × alias-overlap boost
- Star clustering with configurable threshold
- Return `{ merged, suggested, pairSimilarities }`
- Wire into post-curation flow (after entity creation in `applyOps`)
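
The clustering rule can be sketched as follows. This is one reading of "star clustering" as described in this issue, not a copy of `ltm.deduplicate()`: pairs are processed from highest to lowest score, an entity may join a star only through a direct match with its center, and two stars are never fused. `ScoredPair` and `starCluster` are hypothetical names.

```typescript
interface ScoredPair {
  a: string; // entity id
  b: string; // entity id
  similarity: number;
}

/**
 * Greedy star clustering.
 * Returns a map from each star's center id to the ids merged into it.
 */
function starCluster(pairs: ScoredPair[], threshold: number): Map<string, string[]> {
  const stars = new Map<string, string[]>();  // center id -> member ids
  const memberOf = new Map<string, string>(); // member id -> center id

  // filter() returns a new array, so the caller's input is not reordered.
  const ranked = pairs
    .filter((p) => p.similarity >= threshold && p.a !== p.b)
    .sort((x, y) => y.similarity - x.similarity);

  for (const { a, b } of ranked) {
    // An entity already merged into a star cannot pull others in (no A~B~C chains).
    if (memberOf.has(a) || memberOf.has(b)) continue;
    const aIsCenter = stars.has(a);
    const bIsCenter = stars.has(b);
    if (aIsCenter && bIsCenter) continue; // never fuse two existing stars

    const center = bIsCenter ? b : a;
    const member = bIsCenter ? a : b;
    const members = stars.get(center) ?? [];
    members.push(member);
    stars.set(center, members);
    memberOf.set(member, center);
  }
  return stars;
}
```

With pairs A~B (0.95) and B~C (0.90) and no A~C pair, only B is merged into A; C stays a separate entity because its only link runs through a non-center.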

### Step 4: CLI `lore entity dedup` (~1h)
- `lore entity dedup` — show merge suggestions with similarity scores
- `lore entity dedup --auto` — auto-merge above threshold
- `lore entity dedup --dry-run` — show what would be merged
- Accept/reject feedback stored in `dedup_feedback` (entity variant)

### Step 5: Web dashboard merge suggestions (~1h)
- `/ui/entities` — show a banner when dedup candidates exist
- "Suggested merges" section with similarity scores and one-click merge buttons
- `POST /ui/api/merge/entity/:targetId/:sourceId`

### Step 6: Adaptive threshold calibration (~1h)
- Reuse `dedup_feedback` table with `entity_type` discriminator
- `calibrateEntityDedupThreshold(projectId)` — logistic regression on feedback
- Auto-recalibrate after each dedup run
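
A rough sketch of the calibration step, under several assumptions: feedback is reduced to (similarity, accepted) pairs, a one-feature logistic regression is fit by plain gradient descent, and the calibrated threshold is taken as the similarity where the predicted acceptance probability crosses 0.5. `FeedbackSample`, `calibrateThreshold`, the learning rate, and the iteration count are hypothetical; only `MIN_CALIBRATION_SAMPLES=20` and the 0.85 default come from this issue.

```typescript
const MIN_CALIBRATION_SAMPLES = 20;
const DEFAULT_ENTITY_THRESHOLD = 0.85;

interface FeedbackSample {
  similarity: number;
  accepted: boolean; // true = merge accepted, false = rejected
}

/** Fit p(accept | similarity) and return the similarity where p = 0.5. */
function calibrateThreshold(
  samples: FeedbackSample[],
  fallback: number = DEFAULT_ENTITY_THRESHOLD,
): number {
  if (samples.length < MIN_CALIBRATION_SAMPLES) return fallback;

  // Standardize the feature so gradient descent is well conditioned.
  const n = samples.length;
  const mean = samples.reduce((sum, s) => sum + s.similarity, 0) / n;
  const variance = samples.reduce((sum, s) => {
    const d = s.similarity - mean;
    return sum + d * d;
  }, 0) / n;
  const std = Math.sqrt(variance);
  if (std === 0) return fallback; // all similarities identical: nothing to fit

  let weight = 0;
  let bias = 0;
  const learningRate = 0.5;
  for (let iter = 0; iter < 2000; iter++) {
    let gradWeight = 0;
    let gradBias = 0;
    for (const s of samples) {
      const z = (s.similarity - mean) / std;
      const predicted = 1 / (1 + Math.exp(-(weight * z + bias)));
      const error = predicted - (s.accepted ? 1 : 0);
      gradWeight += error * z;
      gradBias += error;
    }
    weight -= (learningRate * gradWeight) / n;
    bias -= (learningRate * gradBias) / n;
  }

  // A non-positive weight means similarity does not predict acceptance.
  if (weight <= 0) return fallback;
  const threshold = mean + std * (-bias / weight);
  return Math.min(1, Math.max(0, threshold));
}
```

Falling back to the default whenever the fit is degenerate keeps a few noisy feedback rows from pushing the threshold to an extreme.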

## Time Estimate

~6-8h of AI-assisted development. Can be done in 1-2 sessions.

## Fits With

- **Entity Registry (#459)**: Direct enhancement. `findDuplicateCandidates()` already exists but only covers alias overlap and Jaccard word-similarity
- **Existing embedding infra**: Reuses `embed()`, `cosineSimilarity()`, BLOB storage pattern
- **Knowledge dedup system**: Same adaptive threshold, star clustering, and feedback patterns
- **Curator pipeline**: Natural extension — dedup runs after entity creation, same as knowledge dedup

## Files to Modify

- `packages/core/src/db.ts` — migration: embedding column on entities
- `packages/core/src/entities.ts` — `deduplicateEntities()`, `embedEntity()`, `vectorSearchEntities()`
- `packages/core/src/embedding.ts` — `embedEntityEntry()`, `vectorSearchEntities()` (vector storage/search)
- `packages/core/src/curator.ts` — wire entity dedup after creation (same as knowledge post-curation dedup)
- `packages/gateway/src/cli/entity.ts` — `dedup` subcommand
- `packages/gateway/src/ui.ts` — merge suggestions in entity list page
