
Auto-memory dedup is exact-string only — semantic duplicates and cross-type dupes accumulate unchecked #34

@Stonelukas


Problem

After ~7 days of normal use, my brain had 208 entries — 131 over the prompt budget warning threshold. Running memory-consolidate revealed systemic patterns in how duplicates accumulate.

1. Exact-string dedup misses semantic duplicates

appendBrainEntryWithDedup compares via exact case-insensitive string match:

// extensions/rho/index.ts, storeLearningEntry
existing.some(e =>
  e.type === "learning" &&
  normalizeMemoryText((e as any).text || "").toLowerCase() === normalized.toLowerCase()
)

Any variation in wording bypasses this entirely. Real examples from my brain:

Varying numbers (8 duplicates):

"The user can run /rewind to undo all file changes... There are 37 snapshot(s) available."
"The user can run /rewind to undo all file changes... There are 15 snapshot(s) available."
"The user can run /rewind to undo all file changes... There are 11 snapshot(s) available."
... (5 more with different counts)

Slight rewording (4 duplicates):

"Always use the Grep tool instead of bash grep/rg/find commands..."
"Use the Grep tool for all file content searches instead of running bash grep/rg/find commands..."
"Use the Grep tool instead of bash grep/rg/find commands for file content searches..."
"Always use the Grep tool for all file content searches instead of running bash grep/rg/find commands..."

2. No cross-type dedup

storeLearningEntry only checks against existing learnings (e.type === "learning"), and storePreferenceEntry only checks against existing preferences (e.type === "preference"). The same fact gets stored twice:

[learning]    "Always check for and remove any stale .git/index.lock files..."
[preference]  "Always check for and remove any stale .git/index.lock files..."

[learning]    "Use the Whisper Large model in Handy for multilingual transcription..."
[preference]  "Use the Whisper Large model in Handy for multilingual transcription..."

I found 17 learning↔preference duplicate pairs in a single consolidation pass.
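
A cross-type check is straightforward to sketch. The following is hypothetical (the `BrainEntry` shape and the behavior of `normalizeMemoryText` are assumptions based on the snippet above, not the actual rho implementation): it compares the candidate text against entries of every type instead of filtering on `e.type` first.

```typescript
// Assumed entry shape; the real type in extensions/rho/index.ts may differ.
interface BrainEntry {
  type: string; // "learning" | "preference" | "behavior" | ...
  text: string;
}

// Assumed normalization: trim and collapse internal whitespace.
function normalizeMemoryText(text: string): string {
  return text.trim().replace(/\s+/g, " ");
}

// Returns true when the candidate text already exists under ANY type,
// so a fact stored as a preference blocks the same fact as a learning.
function existsAnyType(existing: BrainEntry[], candidate: string): boolean {
  const normalized = normalizeMemoryText(candidate).toLowerCase();
  return existing.some(
    e => normalizeMemoryText(e.text).toLowerCase() === normalized
  );
}
```

With a check like this, the second store of the `.git/index.lock` fact would be skipped regardless of whether it arrives as a learning or a preference.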

3. Transient information saved as durable learnings

The auto-memory extraction skill instructs the model to skip transient info, but the cheap model frequently ignores this. Examples that were saved:

  • "Pushed. Repo is clean, remote is up to date." — session state
  • "The CachyOS mirror is currently serving corrupted/incomplete package downloads" — temporary outage
  • "The review session shows as cancelled with 0 comments" — one-off event
  • "Errors were found in the pi-extensions workspace" — transient status
  • "The user ran ls in the current working directory" — meaningless

4. Existing memories don't effectively prevent re-extraction

The SKILL.md has a Step 2 ("Check Against Existing Memories") and the existing_memories parameter is marked optional. Even when passed, the cheap model doesn't reliably cross-reference — the same Grep tool instruction (already present as a behavior entry) got re-extracted as a learning 4 separate times across sessions.

Impact

  • Brain grows ~10-20 entries/day with duplicates and noise
  • Hits the prompt budget warning within a week of normal use
  • memory-consolidate is the only cleanup path, but it's manual and expensive
  • Duplicate entries waste prompt tokens on every session load

Consolidation stats (manual cleanup)

Type         Before  After  Removed
Learnings       136     98       38
Preferences      48     11       37
Total           208    132       76

Suggestions

Quick wins:

  • Add cross-type checking in dedup (check all types, not just same type)
  • Normalize numbers/counts before comparison (e.g., strip digits or replace with placeholder)
  • Substring/containment check: if the new text is contained within an existing entry (or vice versa), skip it
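
The first two quick wins can be combined into a single normalization pass. A hypothetical sketch (function names are illustrative): digit runs are replaced with a placeholder so varying counts collapse together, and containment in either direction counts as a duplicate.

```typescript
// Canonicalize text for comparison: lowercase, replace digit runs with a
// placeholder, collapse whitespace.
function canonical(text: string): string {
  return text
    .toLowerCase()
    .replace(/\d+/g, "<n>")  // "37 snapshot(s)" and "15 snapshot(s)" collapse
    .replace(/\s+/g, " ")
    .trim();
}

// Exact match after normalization, or one text subsumes the other.
// A real implementation would want a minimum-length guard on the
// containment check to avoid false positives on very short entries.
function isNearDuplicate(a: string, b: string): boolean {
  const ca = canonical(a);
  const cb = canonical(b);
  return ca === cb || ca.includes(cb) || cb.includes(ca);
}
```

This would collapse the eight snapshot-count variants outright; the Grep rewordings, which change word order rather than just adding words, would still need the token-overlap approach from the longer-term list.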

Longer term:

  • Token-overlap / Jaccard similarity threshold (e.g., >0.7 = duplicate) — no embeddings needed
  • Pass existing brain entries to the auto-memory model call so it can actually do Step 2 of the skill
  • Add a blocklist of patterns that should never be extracted (snapshot counts, "repo is clean", etc.)
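
The token-overlap idea needs no dependencies at all. A hypothetical illustration (the tokenization and the 0.7 threshold are assumptions, not tuned values): Jaccard similarity over lowercase word sets, flagging pairs at or above the threshold.

```typescript
// Split into a set of lowercase alphanumeric tokens.
function tokenSet(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, in [0, 1].
function jaccard(a: string, b: string): number {
  const sa = tokenSet(a);
  const sb = tokenSet(b);
  if (sa.size === 0 && sb.size === 0) return 1;
  let intersection = 0;
  for (const t of sa) if (sb.has(t)) intersection++;
  const union = sa.size + sb.size - intersection;
  return intersection / union;
}

const DUP_THRESHOLD = 0.7; // assumed tuning point, would need calibration

function isSemanticDuplicate(a: string, b: string): boolean {
  return jaccard(a, b) >= DUP_THRESHOLD;
}
```

Combining this with the digit-placeholder normalization from the quick wins would likely be necessary: raw Jaccard alone catches the snapshot-count variants, but heavier rewordings (like some of the Grep pairs) score lower and may need a lower threshold or normalization first.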

Environment

  • rho v0.1.8
  • pi v0.55.3
  • ~7 days of daily use across multiple pi sessions
