Auto-memory dedup is exact-string only — semantic duplicates and cross-type dupes accumulate unchecked
Problem
After ~7 days of normal use, my brain had 208 entries — 131 over the prompt budget warning threshold. Running memory-consolidate revealed systemic patterns in how duplicates accumulate.
1. Exact-string dedup misses semantic duplicates
appendBrainEntryWithDedup compares entries via an exact, case-insensitive string match:
```ts
// extensions/rho/index.ts, storeLearningEntry
existing.some(e =>
  e.type === "learning" &&
  normalizeMemoryText((e as any).text || "").toLowerCase() === normalized.toLowerCase()
)
```
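As a minimal standalone illustration (literal strings, not rho's actual entry objects), two entries that differ only in a count compare unequal under this check, so both are kept:

```typescript
// Two of the snapshot-count variants; identical except for the number.
const a = "There are 37 snapshot(s) available.";
const b = "There are 15 snapshot(s) available.";

// The dedup's comparison: exact match after lowercasing.
const isDup = a.toLowerCase() === b.toLowerCase();
console.log(isDup); // false → the second entry is appended anyway
```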
Any variation in wording bypasses this entirely. Real examples from my brain:
Varying numbers (8 duplicates):
"The user can run /rewind to undo all file changes... There are 37 snapshot(s) available."
"The user can run /rewind to undo all file changes... There are 15 snapshot(s) available."
"The user can run /rewind to undo all file changes... There are 11 snapshot(s) available."
... (5 more with different counts)
Slight rewording (4 duplicates):
"Always use the Grep tool instead of bash grep/rg/find commands..."
"Use the Grep tool for all file content searches instead of running bash grep/rg/find commands..."
"Use the Grep tool instead of bash grep/rg/find commands for file content searches..."
"Always use the Grep tool for all file content searches instead of running bash grep/rg/find commands..."
2. No cross-type dedup
storeLearningEntry only checks against existing learnings (e.type === "learning"), and storePreferenceEntry only checks against existing preferences (e.type === "preference"). The same fact gets stored twice:
[learning] "Always check for and remove any stale .git/index.lock files..."
[preference] "Always check for and remove any stale .git/index.lock files..."
[learning] "Use the Whisper Large model in Handy for multilingual transcription..."
[preference] "Use the Whisper Large model in Handy for multilingual transcription..."
I found 17 learning↔preference duplicate pairs in a single consolidation pass.
3. Transient information saved as durable learnings
The auto-memory extraction skill instructs the model to skip transient info, but the cheap model frequently ignores this. Examples that were saved:
"Pushed. Repo is clean, remote is up to date." — session state
"The CachyOS mirror is currently serving corrupted/incomplete package downloads" — temporary outage
"The review session shows as cancelled with 0 comments" — one-off event
"Errors were found in the pi-extensions workspace" — transient status
"The user ran ls in the current working directory" — meaningless
4. Existing memories not effectively preventing re-extraction
The SKILL.md has a Step 2 ("Check Against Existing Memories") and the existing_memories parameter is marked optional. Even when passed, the cheap model doesn't reliably cross-reference — the same Grep tool instruction (already present as a behavior entry) got re-extracted as a learning 4 separate times across sessions.
Impact
- Brain grows ~10-20 entries/day with duplicates and noise
- Hits the prompt budget warning within a week of normal use
- memory-consolidate is the only cleanup path, but it's manual and expensive
- Duplicate entries waste prompt tokens on every session load
Consolidation stats (manual cleanup)

| Type        | Before | After | Removed |
| ----------- | ------ | ----- | ------- |
| Learnings   | 136    | 98    | 38      |
| Preferences | 48     | 11    | 37      |
| Total       | 208    | 132   | 76      |
Suggestions
Quick wins:
- Add cross-type checking in dedup (check all types, not just same type)
- Normalize numbers/counts before comparison (e.g., strip digits or replace with placeholder)
- Substring/containment check: if new text is a subset of existing entry (or vice versa), skip it
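A sketch of these quick wins combined (function names are hypothetical, not from rho's codebase; the existing list is assumed to contain entries of every type, which also covers the cross-type case):

```typescript
// Hypothetical normalization: lowercase, collapse numbers to a placeholder,
// squash whitespace — so "37 snapshot(s)" and "15 snapshot(s)" compare equal.
function normalizeForDedup(text: string): string {
  return text
    .toLowerCase()
    .replace(/\d+/g, "<n>")
    .replace(/\s+/g, " ")
    .trim();
}

// Duplicate check against ALL existing entries (no type filter), with a
// containment check in either direction. A real implementation would want a
// minimum-length guard so short strings don't match inside everything.
function isDuplicate(candidate: string, existing: string[]): boolean {
  const c = normalizeForDedup(candidate);
  return existing.some(e => {
    const x = normalizeForDedup(e);
    return x === c || x.includes(c) || c.includes(x);
  });
}
```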
Longer term:
- Token-overlap / Jaccard similarity threshold (e.g., >0.7 = duplicate) — no embeddings needed
- Pass existing brain entries to the auto-memory model call so it can actually do Step 2 of the skill
- Add a blocklist of patterns that should never be extracted (snapshot counts, "repo is clean", etc.)
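The token-overlap idea can be sketched without embeddings (helper names are hypothetical; the 0.7 threshold is the one suggested above and would need tuning, since heavily reworded pairs can score lower):

```typescript
// Split on non-word characters and build a token set.
function tokenSet(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\W+/).filter(Boolean));
}

// Jaccard similarity: |intersection| / |union| of the two token sets.
function jaccard(a: string, b: string): number {
  const sa = tokenSet(a);
  const sb = tokenSet(b);
  let inter = 0;
  for (const t of sa) if (sb.has(t)) inter++;
  const union = sa.size + sb.size - inter;
  return union === 0 ? 1 : inter / union;
}

const isNearDuplicate = (a: string, b: string): boolean => jaccard(a, b) > 0.7;
```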
Environment
- rho v0.1.8
- pi v0.55.3
- ~7 days of daily use across multiple pi sessions