fix(memory): consolidation & recall correctness — live crisis graph, score clamp, reflection embedding, crash recovery #22

Merged
CodeAbra merged 21 commits into CodeAbra:main from Marsu6996:fix/consolidation-recall-correctness
Jun 22, 2026

@Marsu6996

Contributor

Summary

This is the consolidation- and recall-correctness half of #17 (the daemon
WAKE/idle CPU-storm half is the companion PR). After a healthy store accumulates
soft-deleted / deduped records, the sleep pipeline rebuilds its topology over the
tombstoned nodes too, so rich_club sits just under its crisis floor and
re-arms crisis_mode every cycle; meanwhile a degraded daemon serves recall as
flat 1.000 scores. Each commit fixes one root cause behind those symptoms, on
top of v1.1.4.

What changed (one logical change per commit)

  • fix(capture) — stop re-embedding seen turns, close the dedup race. Active
    sessions re-drain the whole transcript every turn, so the eager embed at the top
    of capture_turn re-ran the GIL-bound Rust matmul for every already-stored turn
    just to discard it on the idem-tag check. Embedding is now deferred behind a
    memoized _compute_embedding() that runs at most once and only for genuinely new
    turns; calling flush_record_buffer before the idem lookup, under
    _CAPTURE_DEDUP_LOCK, closes a check-then-insert race that produced live
    duplicate records.
  • fix(recall) — clamp the displayed score to [0,1], rank on unclamped
    sort_score. Multiplicative boosts (trigram×2, FTS×3, valence) push the
    internal score past 1.0, so a served hit could surface a "confidence" > 1 and a
    degraded state surfaced flat 1.000. The displayed score is clamped at
    serialization while ranking uses a new, unclamped MemoryHit.sort_score; the
    stale-downweight keeps the two in lock-step. Ordering is provably unchanged
    (regression test included).
  • fix(consolidation) — build crisis topology on the live graph; demote
    rich_club. The essential-variable tracker and crisis recluster rebuilt over
    ALL records including tombstoned ones, so communities / centrality / rich_club
    were computed on thousands of dead nodes (rich_club ~0.019, just under the 0.02
    floor → re-arms crisis on a healthy store). A shared build_live_graph()
    (tombstone-filtered nodes, live-only edges) feeds both paths, and
    rich_club_coefficient is demoted from a crisis trigger to a diagnostic-only
    signal (edge_density and community_count remain triggers). The runtime-graph
    tombstone guard is hardened to pd.isna() so a NaT/NA tombstoned_at on a
    reembedded datetime64 column reads as LIVE instead of collapsing the graph.
  • fix(dmn) — embed the reflection surface text instead of a zero placeholder.
    Daily reflections wrote an all-zero embedding that was never re-embedded, leaving
    them permanently unretrievable and feeding zero vectors into the scoring matmul.
    Now embeds literal_surface; on native failure falls back to a zero vector
    flagged embedding_pending=1 for deferred reembed.
  • fix(sigma) — bound small-worldness with a node-count ceiling. Adds
    SIGMA_N_CEIL (default 20000, env IAI_MCP_SIGMA_N_FLOOR / IAI_MCP_SIGMA_N_CEIL);
    compute_sigma returns None above the ceiling so the unbounded
    random-reference compute can't spin a core.
  • fix(watchdog) — monitor a retried-but-wedged sleep cycle for staleness. The
    staleness check only monitored attempt == 1, so a retried-and-still-wedged
    cycle (attempt >= 2) was short-circuited and ignored. The check now
    short-circuits only when attempt < 1 (bool values excluded), so every genuine
    running attempt is monitored.
  • feat(daemon) — surface a repeated store-empty count failure as telemetry.
    The companion PR stops a transient count failure from parking the tick but left
    the condition invisible; this emits a best-effort store_empty_check_failed
    warning event (buffered, never raises).
  • fix(hippo) — decrypt literal_surface before reembedding pending rows. On
    an encrypted store, reembed_pending_rows embedded the raw iai:enc:v1:
    ciphertext, producing a garbage vector for every embedding_pending=1 row. Now
    decrypts via the existing _decrypt_record_field (a no-op on a plaintext store);
    a decrypt failure leaves the row pending for retry rather than poisoning it.
    Distinct from the v1.1.4 reembed fix (04e62e2): that repaired the migration
    path (migrate/_reembed_from_text.py::migrate_reembed_from_text); this fixes the
    runtime/daemon path (hippo/_db.py::reembed_pending_rows), which v1.1.4 does
    not touch — so the two are complementary, not redundant.
  • fix(daemon) — recover cleanly from a crash mid-SLEEP at boot. A daemon
    killed mid-SLEEP leaves lifecycle_state.json at SLEEP with
    sleep_cycle_progress=None (incoherent); resuming wedges the daemon so it never
    reaches the recluster that clears crisis. A pure _normalize_boot_lifecycle_state
    resets exactly that case to a clean WAKE and clears the stale crisis flags at boot.
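
The fix(hippo) item above (decrypt before reembedding) can be sketched roughly as below. This is an illustration under stated assumptions, not the project's code: `reembed_pending`, `toy_decrypt`, `DecryptError` and the row-dict layout are hypothetical; only the `iai:enc:v1:` prefix and the `literal_surface` / `embedding_pending` field names come from the description.

```python
ENC_PREFIX = "iai:enc:v1:"


class DecryptError(RuntimeError):
    """Hypothetical stand-in for the store's decrypt failure."""


def toy_decrypt(value: str) -> str:
    """Toy cipher for illustration: ciphertext is the prefix plus reversed text."""
    if not value.startswith(ENC_PREFIX):
        return value  # plaintext store: no-op
    payload = value[len(ENC_PREFIX):]
    if not payload:
        raise DecryptError("empty ciphertext")
    return payload[::-1]


def reembed_pending(rows, decrypt, embed):
    """Re-embed rows flagged embedding_pending=1 from their decrypted text."""
    for row in rows:
        if row["embedding_pending"] != 1:
            continue
        try:
            text = decrypt(row["literal_surface"])
        except DecryptError:
            continue  # leave the row pending so a later pass can retry
        row["embedding"] = embed(text)
        row["embedding_pending"] = 0
    return rows
```

The key property is that the embedder never sees a string that still carries the ciphertext prefix, and a failed decrypt changes nothing on the row.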

Type of change

  • Bug fix
  • New feature
  • Refactor (no behaviour change)
  • Documentation
  • Build / tooling

Affected areas

  • Capture path
  • Recall / retrieval
  • Consolidation / sleep cycles
  • Daemon lifecycle / FSM
  • Storage / encryption at rest
  • MCP wrapper (TypeScript)
  • Bench harness
  • CLI / doctor
  • Other: ___

Testing

  • pytest passes locally — full default gate
    (pytest -m "not perf and not slow and not live", 3538 tests), rebased on
    v1.1.4: 3514 passed, 33 skipped, 1 xfailed, 1 failed. The single failure
    (test_rendered_plist_contains_fd_floor) is pre-existing: it fails
    identically on a clean v1.1.4 checkout and is unrelated to this branch.
    (Some test_doctor_* rows are environment-flaky on the dev machine — a large
    system subprocess output decoded as strict UTF-8 — and pass on a clean run.)
  • ruff check src/ tests/ — no new findings vs the v1.1.3 baseline on the
    touched files (the repo ships no ruff config).
  • New tests added for changed behaviour — score-clamp order preservation,
    tombstone/live-graph filtering on NaT/NA columns, dmn embedding norm,
    sigma ceiling + env override, watchdog stale gate, encrypted-store reembed
    decrypt, and crash-mid-SLEEP boot normalization. (The capture drain/dedup
    commit currently relies on the existing capture tests + a live-store check — a
    dedicated unit test would strengthen it; happy to add if you'd like one here.)

Benchmarks

The recall change is display-only and order-preserving: ranking moves to an
unclamped sort_score, the clamp touches only the serialized number, and the
regression test asserts identical ordering — so LongMemEval-S is unaffected by
construction. The dmn fix makes daily reflections retrievable again (previously
zero-vector), which can only help recall on reflection cues. A blind LongMemEval-S
run (python -m bench.longmemeval_blind --split S) can be attached on request; it
is not run by default to keep the blind split blind.

  • Bench command run: pending — blind LongMemEval-S queued before merge if desired
  • Before:
  • After:

Notes for reviewers

Marsu6996 and others added 20 commits June 21, 2026 17:02
…og status probes

The sleep/consolidation pipeline defers whenever _interrupt_check reports recent
activity. Two independent signals wrongly marked the daemon "active" on nearly
every tick, so it never completed a cycle, never hibernated, and the wake-hook
re-ran every 30s — a sustained ~200% CPU churn on any long-lived deployment:

1. _interrupt_check returned True whenever mcp_socket.active_connections > 0.
   Long-lived MCP clients hold their socket open permanently -> always True.
2. Even after removing (1), last_activity_ts was refreshed for EVERY inbound
   socket line — including the watchdog's own {"type": "status"} liveness probe
   sent every 7-30s (daemon/_watchdog.py::_probe_status_roundtrip). So the
   30s-activity window never elapsed.

Fix: _interrupt_check keys off last_activity_ts recency only, and SocketServer
refreshes last_activity_ts only for dispatched JSON-RPC method calls (real
recall/capture traffic), never for control-plane messages. A busy burst still
defers consolidation; a 30s lull now lets the cycle finish and the daemon
hibernate.

Adds tests/test_socket_activity_tracking.py locking in that a status probe does
not count as activity while a real method call does.
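
A minimal sketch of the activity rule described above, assuming newline-delimited JSON on the socket. `ActivityTracker`, `on_line` and the injected clock are hypothetical names for illustration; the 30s window and the `{"type": "status"}` probe shape are taken from this message.

```python
import json
import time


class ActivityTracker:
    """Tracks recency of real RPC traffic, ignoring control-plane messages."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.last_activity_ts = None

    def on_line(self, raw: str) -> None:
        try:
            msg = json.loads(raw)
        except json.JSONDecodeError:
            return
        # Only a JSON-RPC method call counts as activity; a status probe
        # ({"type": "status"}) has no "method" key and is ignored.
        if isinstance(msg, dict) and isinstance(msg.get("method"), str):
            self.last_activity_ts = self._clock()

    def interrupt_check(self, window_s: float = 30.0) -> bool:
        """True while real traffic was seen within the last window_s seconds."""
        if self.last_activity_ts is None:
            return False
        return (self._clock() - self.last_activity_ts) < window_s
```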

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ute storm

At WAKE several background subsystems (boot preload, sigma identity audit,
foraging weak-bridge detection, hippea cascade) each call build_runtime_graph
concurrently in their own asyncio.to_thread workers. On a cache miss each one
independently ran the full, GIL-bound community detection (mosaic). Three+ at
once contended for the GIL, starved the asyncio event loop, and the liveness
watchdog's socket probe timed out -> SIGKILL -> launchd relaunch -> loop.

Wrap build_runtime_graph in a single-flight gate keyed on the cache key: the
first caller (leader) computes and saves the on-disk cache; concurrent callers
(followers) wait on an Event and then reload the freshly-saved cache via the
existing cheap path. No mutable MemoryGraph is shared between callers (each
rebuilds its own shell + single-slot sync hook), and recall is independent of
the community assignment, so a slightly-stale shared result is harmless.

Followers re-contend in a bounded loop rather than recomputing unconditionally:
if the leader fails before saving, the cache key shifts mid-burst, or the wait
times out, the woken followers loop back and exactly one becomes the next
leader while the rest wait again — degrading those edge cases to sequential
single-flight (one compute at a time) instead of an N-way concurrent re-storm.
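
The leader/follower loop can be sketched as follows. This is an assumption-laden illustration rather than the actual gate: `SingleFlight`, `load`, `compute`, `timeout` and `max_rounds` are hypothetical, `compute` is assumed to save the cache before returning, and raising once the bound is exhausted is a choice made for the sketch.

```python
import threading


class SingleFlight:
    """Let one caller per key compute while concurrent callers wait and reload."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # cache key -> Event set when the leader finishes

    def run(self, key, load, compute, timeout=5.0, max_rounds=8):
        for _ in range(max_rounds):
            cached = load(key)  # cheap path: cache hit
            if cached is not None:
                return cached
            with self._lock:
                event = self._inflight.get(key)
                leader = event is None
                if leader:
                    event = self._inflight[key] = threading.Event()
            if not leader:
                event.wait(timeout)  # follower: wait, then loop and reload
                continue
            try:
                # Re-check: an earlier leader may have saved between our
                # cache miss and our election.
                cached = load(key)
                return cached if cached is not None else compute(key)
            finally:
                with self._lock:
                    self._inflight.pop(key, None)
                event.set()
        raise TimeoutError(f"no result for {key!r} after {max_rounds} rounds")
```

If the leader raises before saving, woken followers miss on reload, loop back, and exactly one of them is elected next, so the failure degrades to one compute at a time.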

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The cache key buckets on records//WINDOW and edges//WINDOW and try_load
requires an exact match. With WINDOW=10 a normal day of capture (+~150
records, +~1300 edges) crossed ~130 buckets, so the on-disk graph cache MISSED
on essentially every WAKE and the full community detection was recomputed each
time. Edges churn fastest, so they are the binding term.

WINDOW=250 keeps the cache valid across a normal day, so the common WAKE is now
a cheap cache HIT. The independent age/dirty fuse in consult_overlay (25h /
dirty>50) remains the real freshness backstop, and the single-flight gate makes
the rare genuine miss harmless.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
_boot_preload called build_runtime_graph (which already persists the cache,
with the full node_payload, on a miss) and then called save(..., node_payload=
None, ...) again, overwriting the good cache with a payload-less one. That
forced a pandas re-read of every record on the next cache hit. Just warm the
cache via build_runtime_graph and drop the redundant second save.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The daemon only enters WAKE at boot if ~/.iai-mcp/wake.signal exists, but
nothing ever created it — WakeHandler only consumed it. The CLI start/install
path (and the operator's capture hook) brought the daemon up with a plain
launchctl kickstart, so it re-read its persisted HIBERNATION state and
hibernate-exited within a tick, closing the socket before it ever served recall.

Add WakeHandler.signal_wake() (symmetric to consume_wake_signal) and create the
signal before the kickstart in daemon install/start, so the booting daemon
transitions HIBERNATION -> WAKE and serves its socket.
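
A rough sketch of the signal-file pairing, assuming the signal is an empty marker file. The free functions below are illustrative stand-ins for the `WakeHandler` methods, and taking the state directory as a parameter is an assumption made so the example is testable.

```python
from pathlib import Path

SIGNAL_NAME = "wake.signal"


def signal_wake(state_dir: Path) -> Path:
    """Create the wake signal so the next boot transitions to WAKE."""
    state_dir.mkdir(parents=True, exist_ok=True)
    signal = state_dir / SIGNAL_NAME
    signal.touch()
    return signal


def consume_wake_signal(state_dir: Path) -> bool:
    """Return True (and remove the file) if a wake signal was pending."""
    try:
        (state_dir / SIGNAL_NAME).unlink()
        return True
    except FileNotFoundError:
        return False
```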

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
build_runtime_graph added every record (and every edge) to the graph,
including soft-deleted / deduped / erased records (tombstoned_at IS NOT NULL).
That polluted communities, centrality, rich_club and the sigma topology audit
with dead nodes, and -- worse -- desynced the node count from
store.active_records_count() (the payload-cache validity anchor), so after any
tombstoning (e.g. migrate --dedupe-episodic) the cache was permanently invalid
and every WAKE did a full rebuild on an over-large graph.

Skip tombstoned rows in the node loop (matching active_records_count:
tombstoned_at IS NULL), skip edges whose endpoints are not live nodes (add_edge
does setdefault on both endpoints, so it would re-create dead nodes), and drop
the cached assignment/rich_club when the live node set changed so they are
recomputed on the fresh graph instead of referencing dead nodes.
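
The two filters can be illustrated with plain dicts and sets. `build_live_adjacency` and the record/edge shapes are hypothetical; the only rule carried over from the text is that a record is live when `tombstoned_at` is null and an edge survives only if both endpoints are live.

```python
def build_live_adjacency(records, edges):
    """Build an undirected adjacency map over live (non-tombstoned) records."""
    live = {r["id"] for r in records if r.get("tombstoned_at") is None}
    adjacency = {node: set() for node in live}
    for src, dst in edges:
        # Skip edges touching a dead node so they cannot re-create it.
        if src in live and dst in live:
            adjacency[src].add(dst)
            adjacency[dst].add(src)
    return adjacency
```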

On a real store this took the graph from 9733 nodes to 3612, rich_club from 974
to 362, and restored payload-cache hits across builds.

_store_is_empty() caught (OSError, ValueError, KeyError, RuntimeError) and
returned True. All Hippo store errors (HippoIntegrityError, HippoLockHeldError,
ConsolidationPendingError, HippoDecryptError) subclass RuntimeError, and
count_rows() raises HippoIntegrityError when the shared sqlite connection is
left in an error state by a concurrent heavy reader. Returning True there parks
the whole lifecycle tick (no idle-check, no drain) on a store that actually has
records. Treat the unknown case as NOT empty so the tick proceeds; a truly empty
store just does a little harmless no-op work.

The field was only ever set (on the empty_store/paused skip paths), never
cleared, so a single early skip (e.g. a first-tick count race at boot) left a
healthy, ticking, draining daemon permanently reporting skip=empty_store in
.daemon-state.json — misleading observability that reads as a parked lifecycle.

The lifecycle idle countdown only refreshed `_last_active_monotonic` when
the Node wrapper heartbeat file was fresh (`HeartbeatScanner.is_active`).
The wrappers dir can be empty (heartbeat stale) while the daemon is still
draining a continuously-fed deferred-capture backlog. In that state the
idle timer grew unconditionally and the FSM forced itself to SLEEP after
30 min even though drain threads were still writing to the store. Entering
the SLEEP pipeline escalates to an EXCLUSIVE store lock, so this contended
with the in-flight drain; and because crisis re-arming only runs in SLEEP,
an oscillating/never-settling daemon could silently stop re-arming crisis
detection.

Fold two more activity signals into the idle countdown, alongside the
wrapper heartbeat:

- in-flight drain state: `capture.is_drain_in_progress()`, a thread-safe
  depth counter set by `drain_deferred_captures` / `drain_active_live_captures`
  for their whole duration;
- recent real RPC traffic: `mcp_socket.last_activity_ts` (already used by
  the sleep-pipeline interrupt check, now also by the countdown).

The decision is centralized in a pure, unit-testable helper
`_idle_countdown_decision`. A genuinely idle daemon still advances to
DROWSY/SLEEP exactly as before, so crisis re-arming keeps running; only an
actively-working daemon is held awake. Explicit FORCE_SLEEP/user-sleep
requests are unaffected.
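
A sketch of how the three signals might combine, under assumptions: `DrainCounter`, the parameter names and the 30s RPC window are illustrative, and the real `_idle_countdown_decision` may take different inputs.

```python
import threading
from contextlib import contextmanager


class DrainCounter:
    """Thread-safe depth counter held for the whole duration of a drain."""

    def __init__(self):
        self._lock = threading.Lock()
        self._depth = 0

    @contextmanager
    def draining(self):
        with self._lock:
            self._depth += 1
        try:
            yield
        finally:
            with self._lock:
                self._depth -= 1  # released even if the drain raises

    def is_drain_in_progress(self) -> bool:
        with self._lock:
            return self._depth > 0


def idle_countdown_decision(heartbeat_fresh, drain_in_progress, last_rpc_ts,
                            now, rpc_window_s=30.0) -> bool:
    """Return True when the daemon counts as active and the idle timer resets."""
    rpc_recent = last_rpc_ts is not None and (now - last_rpc_ts) < rpc_window_s
    return bool(heartbeat_fresh or drain_in_progress or rpc_recent)
```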

Add tests asserting the daemon does NOT advance toward SLEEP while a drain
is in progress (or RPC is recent), that a truly idle daemon still sleeps,
and that the in-progress flag is set across the production drain wrappers
and released on exception.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Covers 2cffb35: a successful tick clears a stale last_tick_skipped_reason,
plus the paused-skip event/persistence and the no-run_rem_cycle routing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Active sessions re-drain the entire transcript every turn, so the eager
embed at the top of capture_turn re-ran the expensive GIL-bound Rust matmul
for every already-stored turn just to discard it on the idem-tag dedup check
-- a steady CPU drain proportional to transcript length.

Defer embedding behind a memoized _compute_embedding() closure that runs at
most once and only when a turn is actually new, and flush the record buffer
before the idem lookup under _CAPTURE_DEDUP_LOCK so a just-inserted but
unflushed turn is visible to the SQLite-backed dedup -- closing a
check-then-insert race that produced live duplicate records.
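
A simplified sketch of the deferred embed plus flush-before-lookup ordering. `BufferedStore` is a toy stand-in whose lookups only see flushed rows (mimicking the SQLite-backed dedup), and this `capture_turn` signature is hypothetical; whether the real code embeds while holding the lock is not stated here.

```python
import threading

_CAPTURE_DEDUP_LOCK = threading.Lock()


class BufferedStore:
    """Toy store: inserts sit in a buffer; lookups see flushed rows only."""

    def __init__(self):
        self.buffer = []
        self.rows = {}

    def insert(self, idem_tag, text, embedding):
        self.buffer.append((idem_tag, text, embedding))

    def flush(self):
        for idem_tag, text, embedding in self.buffer:
            self.rows[idem_tag] = (text, embedding)
        self.buffer.clear()

    def has(self, idem_tag):
        return idem_tag in self.rows


def capture_turn(store, idem_tag, text, embed):
    """Store a turn once; return False if it was already captured."""
    cache = []

    def compute_embedding():
        # Memoized: the expensive embed runs at most once per call.
        if not cache:
            cache.append(embed(text))
        return cache[0]

    with _CAPTURE_DEDUP_LOCK:
        store.flush()  # make just-inserted turns visible to the lookup
        if store.has(idem_tag):
            return False  # already stored: never embedded
        store.insert(idem_tag, text, compute_embedding())
        return True
```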

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…core

Multiplicative boosts (trigram*2, FTS*3, valence) can push the internal
score past 1.0, so a served recall hit could surface a "confidence" > 1, and
a degraded daemon state surfaced flat 1.000 scores. Clamp the *displayed*
score to [0,1] at serialization while ranking on a new, unclamped
MemoryHit.sort_score so ordering is provably unchanged; the stale-downweight
keeps sort_score in lock-step with score.
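
The split between ranking and display can be sketched like this. `Hit` and `rank_and_serialize` are illustrative names, and deriving the displayed score from `sort_score` via a property is a simplification of keeping two fields in lock-step.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    record_id: str
    sort_score: float  # unclamped; boosts may push it past 1.0

    @property
    def score(self) -> float:
        """Displayed score, clamped to [0, 1]."""
        return min(1.0, max(0.0, self.sort_score))


def rank_and_serialize(hits):
    """Rank on the unclamped score, then emit the clamped one."""
    ordered = sorted(hits, key=lambda h: h.sort_score, reverse=True)
    return [{"record_id": h.record_id, "score": h.score} for h in ordered]
```

Sorting on the clamped value instead would make every boosted hit tie at 1.0, which is why the clamp is applied only at serialization.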

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ich_club

The essential-variable tracker and crisis recluster rebuilt their graph over
ALL records -- including tombstoned (soft-deleted / deduped) ones -- so
communities, centrality and rich_club were computed on thousands of dead
nodes. On a real store rich_club sat at ~0.019, just under the 0.02 floor,
re-arming crisis_mode every sleep cycle on a healthy store.

Add a shared build_live_graph() helper (tombstone-filtered nodes, live-only
edges) used by both paths, and demote rich_club_coefficient from a crisis
*trigger* to a diagnostic-only signal (edge_density and community_count
remain the triggers). Harden the runtime-graph tombstone guard to pd.isna()
so a NaT/NA tombstoned_at on a reembedded datetime64 column is read as LIVE
instead of collapsing the graph to empty.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…lder

The daily-reflection step wrote an all-zero embedding placeholder that was
never re-embedded, leaving reflections permanently unretrievable and feeding
zero vectors into the scoring matmul. Embed literal_surface directly; on a
native embedder failure fall back to a zero vector flagged embedding_pending=1
for deferred reembed.
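
A minimal sketch of the embed-with-fallback step. `embed_reflection`, the `dim` parameter and catching a bare `Exception` are assumptions for illustration; the returned flag mirrors `embedding_pending`.

```python
def embed_reflection(literal_surface, embed, dim):
    """Return (embedding, embedding_pending) for a reflection's surface text."""
    try:
        return list(embed(literal_surface)), 0
    except Exception:
        # Embedder failed: store a zero vector but flag it for deferred reembed.
        return [0.0] * dim, 1
```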

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
compute_sigma runs an unbounded random-graph reference that can spin a core
on a large graph. Add SIGMA_N_CEIL (default 20000, env IAI_MCP_SIGMA_N_FLOOR /
IAI_MCP_SIGMA_N_CEIL) and return None above it instead of computing.
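
A sketch of the ceiling guard, assuming the env override is parsed as a positive integer and that malformed values fall back to the default (both are assumptions). `sigma_n_ceil` and `sigma_or_none` are hypothetical helpers; the default of 20000 and the `IAI_MCP_SIGMA_N_CEIL` variable come from this message.

```python
import os

DEFAULT_SIGMA_N_CEIL = 20000


def sigma_n_ceil(env=None) -> int:
    """Read the node-count ceiling, falling back to the default."""
    env = os.environ if env is None else env
    raw = env.get("IAI_MCP_SIGMA_N_CEIL")
    try:
        value = int(raw) if raw is not None else DEFAULT_SIGMA_N_CEIL
    except ValueError:
        return DEFAULT_SIGMA_N_CEIL
    return value if value > 0 else DEFAULT_SIGMA_N_CEIL


def sigma_or_none(node_count, compute, env=None):
    """Skip the expensive compute entirely above the ceiling."""
    if node_count > sigma_n_ceil(env):
        return None
    return compute()
```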

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The sleep-cycle staleness check only monitored attempt == 1, so a cycle that
had already been retried (attempt >= 2) and was still wedged -- exactly the
case the watchdog must catch -- was short-circuited and ignored. Gate on
attempt < 1 instead so every genuine running attempt is monitored, excluding
bool (isinstance(True, int) is True).
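
One reading of the new gate, written as a predicate. This assumes the `attempt < 1` test is the short-circuit (skip) condition, so attempts 1, 2, 3, ... are all monitored; `should_monitor` is a hypothetical name.

```python
def should_monitor(attempt) -> bool:
    """True when a sleep-cycle attempt should be checked for staleness."""
    # bool is a subclass of int, so reject it before the integer test.
    if isinstance(attempt, bool) or not isinstance(attempt, int):
        return False
    return attempt >= 1  # i.e. skip only when attempt < 1
```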

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
e8f3deb stopped a transient count failure from parking the lifecycle tick but
left the condition invisible (log.debug only). Emit a best-effort
store_empty_check_failed warning event -- buffered and never raising, so it is
safe even when the store connection is the thing failing -- so a sqlite
left-in-error-state failure surfaces to the operator.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
reembed_pending_rows fed the raw stored literal_surface to embedder.embed(); on
an encrypted store that is iai:enc:v1: ciphertext, so every embedding_pending=1
row re-embedded by this path got an embedding of the ciphertext (garbage). Decrypt
via _decrypt_record_field first (no-op on a plaintext store); a decrypt failure
leaves the row pending for retry rather than poisoning it with a garbage vector.

A daemon killed mid-SLEEP leaves lifecycle_state.json at current_state=SLEEP with
sleep_cycle_progress=None (incoherent: a real in-flight cycle carries a progress
dict). Resuming it wedged the daemon -- it never advanced the sleep pipeline,
never reached the recluster that clears crisis, and recall stayed degraded
(SLEEP + crisis both degrade recall). Normalize that one case to a clean WAKE at
boot (dropping the stale crisis flag) via _normalize_boot_lifecycle_state; a real
degeneration re-arms crisis on the next complete sleep cycle.
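
The normalization can be sketched as a pure function over the loaded state dict. `current_state` and `sleep_cycle_progress` are named above; the `crisis_mode` key and the function name without its leading underscore are assumptions made for this illustration.

```python
def normalize_boot_lifecycle_state(state: dict) -> dict:
    """Reset the incoherent SLEEP-without-progress case to a clean WAKE."""
    incoherent = (
        state.get("current_state") == "SLEEP"
        and state.get("sleep_cycle_progress") is None
    )
    if not incoherent:
        return state  # every other state is left untouched
    fixed = dict(state)  # pure: never mutate the caller's dict
    fixed["current_state"] = "WAKE"
    fixed["crisis_mode"] = False  # drop the stale crisis flag
    return fixed
```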
@Marsu6996 force-pushed the fix/consolidation-recall-correctness branch from 2631e96 to 642a8e8 on June 21, 2026 18:23

The final recall ranking sorted hits by score alone with a stable sort,
so equal-scoring hits kept their arrival order. Two code paths that
compute the same logical score via different float summation orders
(notably an empty profile_state falling back to the medium scale) could
therefore emit byte-different orderings, flaking
test_empty_profile_state_falls_back_to_medium_scale on CI.

Tie-break on str(record_id) — the same idiom already used elsewhere in
this module — so equal-scoring hits resolve deterministically. Behaviour
for distinctly-scored hits is unchanged.
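
The tie-break idiom, shown on plain `(record_id, score)` pairs. The pair shape and the `rank` name are assumptions for illustration.

```python
def rank(hits):
    """Sort by score descending, breaking ties on str(record_id)."""
    return sorted(hits, key=lambda h: (-h[1], str(h[0])))
```

With the secondary key in place, the result no longer depends on arrival order for equal scores.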

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@CodeAbra CodeAbra left a comment

Owner

Thanks — this carries the consolidation/recall-correctness root causes for the crisis loop: tombstoned-record exclusion from the runtime graph, crisis topology on the live graph, score clamp, reflection-embedding + crash recovery. Reviewed the diff, security-clean, CI green. Merging with credit to you.
