fix(memory): consolidation & recall correctness — live crisis graph, score clamp, reflection embedding, crash recovery#22
Merged
CodeAbra merged 21 commits on Jun 22, 2026
…og status probes
The sleep/consolidation pipeline defers whenever _interrupt_check reports recent
activity. Two independent signals wrongly marked the daemon "active" on nearly
every tick, so it never completed a cycle, never hibernated, and the wake-hook
re-ran every 30s — a sustained ~200% CPU churn on any long-lived deployment:
1. _interrupt_check returned True whenever mcp_socket.active_connections > 0.
Long-lived MCP clients hold their socket open permanently -> always True.
2. Even after removing (1), last_activity_ts was refreshed for EVERY inbound
socket line — including the watchdog's own {"type": "status"} liveness probe
sent every 7-30s (daemon/_watchdog.py::_probe_status_roundtrip). So the
30s-activity window never elapsed.
Fix: _interrupt_check keys off last_activity_ts recency only, and SocketServer
refreshes last_activity_ts only for dispatched JSON-RPC method calls (real
recall/capture traffic), never for control-plane messages. A busy burst still
defers consolidation; a 30s lull now lets the cycle finish and the daemon
hibernate.
Adds tests/test_socket_activity_tracking.py locking in that a status probe does
not count as activity while a real method call does.
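The rule above can be sketched as follows. This is a minimal illustration assuming newline-delimited JSON messages; `ActivityTracker`, `observe_line`, and `CONTROL_TYPES` are hypothetical names, not the daemon's actual API.

```python
import json
import time

# Hypothetical set of control-plane message types that never count as activity.
CONTROL_TYPES = {"status"}


class ActivityTracker:
    """Remembers when the last real JSON-RPC method call arrived."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.last_activity_ts = None

    def observe_line(self, raw_line):
        """Refresh last_activity_ts only for method calls, never for probes."""
        try:
            msg = json.loads(raw_line)
        except json.JSONDecodeError:
            return  # malformed input is not activity
        if not isinstance(msg, dict) or msg.get("type") in CONTROL_TYPES:
            return  # liveness probe or non-object payload: ignore
        if isinstance(msg.get("method"), str):
            self.last_activity_ts = self._clock()

    def interrupt_check(self, window_s=30.0):
        """True when a real call arrived within the last window_s seconds."""
        if self.last_activity_ts is None:
            return False
        return (self._clock() - self.last_activity_ts) < window_s
```

Note that the number of open connections never enters the decision, so a long-lived client that holds its socket open but stays silent does not block consolidation.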
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ute storm
At WAKE several background subsystems (boot preload, sigma identity audit, foraging weak-bridge detection, hippea cascade) each call build_runtime_graph concurrently in their own asyncio.to_thread workers. On a cache miss each one independently ran the full, GIL-bound community detection (mosaic). Three+ at once contended for the GIL, starved the asyncio event loop, and the liveness watchdog's socket probe timed out -> SIGKILL -> launchd relaunch -> loop.
Wrap build_runtime_graph in a single-flight gate keyed on the cache key: the first caller (leader) computes and saves the on-disk cache; concurrent callers (followers) wait on an Event and then reload the freshly-saved cache via the existing cheap path. No mutable MemoryGraph is shared between callers (each rebuilds its own shell + single-slot sync hook), and recall is independent of the community assignment, so a slightly-stale shared result is harmless.
Followers re-contend in a bounded loop rather than recomputing unconditionally: if the leader fails before saving, the cache key shifts mid-burst, or the wait times out, the woken followers loop back and exactly one becomes the next leader while the rest wait again, degrading those edge cases to sequential single-flight (one compute at a time) instead of an N-way concurrent re-storm.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
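A single-flight gate of this shape could look roughly like the sketch below. It is an illustration only: `single_flight`, `compute`, `load_cached`, `max_rounds`, and `wait_s` are hypothetical names, `compute()` is assumed to persist its result so that `load_cached()` can see it, and the final fallback after the bounded loop is an assumption rather than the PR's behaviour.

```python
import threading

_lock = threading.Lock()
_inflight = {}  # cache key -> Event that the current leader sets when done


def single_flight(key, compute, load_cached, max_rounds=5, wait_s=30.0):
    """Let one caller per key compute; the others wait and reload the cache."""
    for _ in range(max_rounds):
        with _lock:
            event = _inflight.get(key)
            leader = event is None
            if leader:
                event = _inflight[key] = threading.Event()
        if leader:
            try:
                # Re-check first: a previous leader may have just saved.
                cached = load_cached()
                return cached if cached is not None else compute()
            finally:
                with _lock:
                    _inflight.pop(key, None)
                event.set()  # wake followers on success and on failure
        # Follower: wait for the leader, then try the cheap reload path.
        event.wait(wait_s)
        cached = load_cached()
        if cached is not None:
            return cached
        # Leader failed or the wait timed out: loop back and re-contend.
    return compute()  # assumption: last-resort fallback once the bound is hit
```

Because a new leader is only elected after the previous one has left the gate, a failed leader is replaced by exactly one follower at a time, which is the sequential degradation described above.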
The cache key buckets on records//WINDOW and edges//WINDOW and try_load requires an exact match. With WINDOW=10 a normal day of capture (+~150 records, +~1300 edges) crossed ~130 buckets, so the on-disk graph cache MISSED on essentially every WAKE and the full community detection was recomputed each time. Edges churn fastest, so they are the binding term. WINDOW=250 keeps the cache valid across a normal day, so the common WAKE is now a cheap cache HIT. The independent age/dirty fuse in consult_overlay (25h / dirty>50) remains the real freshness backstop, and the single-flight gate makes the rare genuine miss harmless. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
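The bucket arithmetic can be checked with a few lines. This is a sketch: `buckets_crossed` is a hypothetical helper, and the starting edge count of 20000 is an arbitrary illustrative value, not a measurement from the store.

```python
def buckets_crossed(start, delta, window):
    """Number of bucket boundaries crossed when a count grows by delta."""
    return (start + delta) // window - start // window


# A day of roughly +1300 edges, the fastest-churning term in the cache key.
print(buckets_crossed(20000, 1300, 10))   # 130
print(buckets_crossed(20000, 1300, 250))  # 5
```

With the wider window, most consecutive WAKEs fall inside the same bucket as the last saved cache, which is why the common WAKE becomes a hit.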
_boot_preload called build_runtime_graph (which already persists the cache, with the full node_payload, on a miss) and then called save(..., node_payload=None, ...) again, overwriting the good cache with a payload-less one. That forced a pandas re-read of every record on the next cache hit. Just warm the cache via build_runtime_graph and drop the redundant second save. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The daemon only enters WAKE at boot if ~/.iai-mcp/wake.signal exists, but nothing ever created it — WakeHandler only consumed it. The CLI start/install path (and the operator's capture hook) brought the daemon up with a plain launchctl kickstart, so it re-read its persisted HIBERNATION state and hibernate-exited within a tick, closing the socket before it ever served recall. Add WakeHandler.signal_wake() (symmetric to consume_wake_signal) and create the signal before the kickstart in daemon install/start, so the booting daemon transitions HIBERNATION -> WAKE and serves its socket. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
build_runtime_graph added every record (and every edge) to the graph, including soft-deleted / deduped / erased records (tombstoned_at IS NOT NULL). That polluted communities, centrality, rich_club and the sigma topology audit with dead nodes, and -- worse -- desynced the node count from store.active_records_count() (the payload-cache validity anchor), so after any tombstoning (e.g. migrate --dedupe-episodic) the cache was permanently invalid and every WAKE did a full rebuild on an over-large graph. Skip tombstoned rows in the node loop (matching active_records_count: tombstoned_at IS NULL), skip edges whose endpoints are not live nodes (add_edge does setdefault on both endpoints, so it would re-create dead nodes), and drop the cached assignment/rich_club when the live node set changed so they are recomputed on the fresh graph instead of referencing dead nodes. On a real store this took the graph from 9733 nodes to 3612, rich_club from 974 to 362, and restored payload-cache hits across builds.
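The node and edge filtering described above can be sketched with plain dictionaries. `live_adjacency` is a hypothetical helper, and the sketch treats a `None` value of `tombstoned_at` as "live"; the real store schema and graph type are not shown in this PR text.

```python
def live_adjacency(records, edges):
    """Build an undirected adjacency map over non-tombstoned records only.

    records: iterable of dicts with 'id' and 'tombstoned_at'
    edges:   iterable of (src, dst) id pairs
    """
    live = {r["id"] for r in records if r.get("tombstoned_at") is None}
    adjacency = {node: set() for node in live}
    for src, dst in edges:
        # Skip edges touching a dead endpoint so dead nodes are not re-created.
        if src in live and dst in live:
            adjacency[src].add(dst)
            adjacency[dst].add(src)
    return adjacency
```

Filtering edges as well as nodes matters because an `add_edge` that creates missing endpoints would otherwise bring tombstoned nodes back into the graph.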
_store_is_empty() caught (OSError, ValueError, KeyError, RuntimeError) and returned True. All Hippo store errors (HippoIntegrityError, HippoLockHeldError, ConsolidationPendingError, HippoDecryptError) subclass RuntimeError, and count_rows() raises HippoIntegrityError when the shared sqlite connection is left in an error state by a concurrent heavy reader. Returning True there parks the whole lifecycle tick (no idle-check, no drain) on a store that actually has records. Treat the unknown case as NOT empty so the tick proceeds; a truly empty store just does a little harmless no-op work.
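In outline, the changed error handling looks like the sketch below. `store_is_empty` and the `count_rows` callable are stand-ins for the daemon's own functions; only the "unknown means not empty" rule is taken from the description above.

```python
import logging

log = logging.getLogger(__name__)


def store_is_empty(count_rows):
    """Return True only when the store is known to hold zero rows."""
    try:
        return count_rows() == 0
    except (OSError, ValueError, KeyError, RuntimeError) as exc:
        # Unknown state: assume NOT empty so the lifecycle tick still runs.
        log.debug("store empty check failed: %r", exc)
        return False
```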
The field was only ever set (on the empty_store/paused skip paths), never cleared, so a single early skip (e.g. a first-tick count race at boot) left a healthy, ticking, draining daemon permanently reporting skip=empty_store in .daemon-state.json — misleading observability that reads as a parked lifecycle.
The lifecycle idle countdown only refreshed `_last_active_monotonic` when the Node wrapper heartbeat file was fresh (`HeartbeatScanner.is_active`). The wrappers dir can be empty (heartbeat stale) while the daemon is still draining a continuously-fed deferred-capture backlog. In that state the idle timer grew unconditionally and the FSM forced itself to SLEEP after 30 min even though drain threads were still writing to the store. Entering the SLEEP pipeline escalates to an EXCLUSIVE store lock, so this contended with the in-flight drain; and because crisis re-arming only runs in SLEEP, an oscillating/never-settling daemon could silently stop re-arming crisis detection.
Fold two more activity signals into the idle countdown, alongside the wrapper heartbeat:
- in-flight drain state: `capture.is_drain_in_progress()`, a thread-safe depth counter set by `drain_deferred_captures` / `drain_active_live_captures` for their whole duration;
- recent real RPC traffic: `mcp_socket.last_activity_ts` (already used by the sleep-pipeline interrupt check, now also by the countdown).
The decision is centralized in a pure, unit-testable helper `_idle_countdown_decision`. A genuinely idle daemon still advances to DROWSY/SLEEP exactly as before, so crisis re-arming keeps running; only an actively-working daemon is held awake. Explicit FORCE_SLEEP/user-sleep requests are unaffected.
Add tests asserting the daemon does NOT advance toward SLEEP while a drain is in progress (or RPC is recent), that a truly idle daemon still sleeps, and that the in-progress flag is set across the production drain wrappers and released on exception.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
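A rough sketch of the two pieces described above: a depth counter that stays non-zero for the whole drain, and a pure decision function over the three signals. `DrainTracker`, `idle_countdown_decision`, and the 30-second `rpc_window_s` default are assumptions for illustration; the PR's helper is `_idle_countdown_decision` and its exact signature is not shown here.

```python
import threading
from contextlib import contextmanager


class DrainTracker:
    """Thread-safe depth counter: non-zero while any drain is running."""

    def __init__(self):
        self._lock = threading.Lock()
        self._depth = 0

    @contextmanager
    def draining(self):
        with self._lock:
            self._depth += 1
        try:
            yield
        finally:  # released even if the drain raises
            with self._lock:
                self._depth -= 1

    def is_drain_in_progress(self):
        with self._lock:
            return self._depth > 0


def idle_countdown_decision(heartbeat_active, drain_in_progress,
                            seconds_since_rpc, rpc_window_s=30.0):
    """Return True when the idle timer should be reset (daemon is working)."""
    rpc_recent = seconds_since_rpc is not None and seconds_since_rpc < rpc_window_s
    return bool(heartbeat_active or drain_in_progress or rpc_recent)
```

Keeping the decision free of I/O is what makes it straightforward to unit-test each signal in isolation.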
Covers 2cffb35: a successful tick clears a stale last_tick_skipped_reason, plus the paused-skip event/persistence and the no-run_rem_cycle routing. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Active sessions re-drain the entire transcript every turn, so the eager embed at the top of capture_turn re-ran the expensive GIL-bound Rust matmul for every already-stored turn just to discard it on the idem-tag dedup check -- a steady CPU drain proportional to transcript length. Defer embedding behind a memoized _compute_embedding() closure that runs at most once and only when a turn is actually new, and flush the record buffer before the idem lookup under _CAPTURE_DEDUP_LOCK so a just-inserted but unflushed turn is visible to the SQLite-backed dedup -- closing a check-then-insert race that produced live duplicate records. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
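The deferred-embedding idea can be sketched as below. This is not the real `capture_turn`: the dict-backed `store`, the `flush` callable, and the function name are hypothetical, and the lock stands in for `_CAPTURE_DEDUP_LOCK`.

```python
import threading

_DEDUP_LOCK = threading.Lock()  # stand-in for _CAPTURE_DEDUP_LOCK


def capture_turn_sketch(text, idem_tag, store, embed, flush=lambda: None):
    """Store a turn once; embed lazily and only when the turn is new."""
    cache = []

    def compute_embedding():
        # Memoized: the expensive embed() runs at most once per call.
        if not cache:
            cache.append(embed(text))
        return cache[0]

    with _DEDUP_LOCK:
        flush()  # make buffered inserts visible before the dedup lookup
        if idem_tag in store:
            return False  # already stored: embed() was never invoked
        store[idem_tag] = compute_embedding()
        return True
```

Doing the flush, the lookup, and the insert under one lock is what closes the check-then-insert window.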
…core
Multiplicative boosts (trigram*2, FTS*3, valence) can push the internal score past 1.0, so a served recall hit could surface a "confidence" > 1, and a degraded daemon state surfaced flat 1.000 scores. Clamp the *displayed* score to [0,1] at serialization while ranking on a new, unclamped MemoryHit.sort_score so ordering is provably unchanged; the stale-downweight keeps sort_score in lock-step with score.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 38c632a)
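The split between a ranking score and a displayed score can be illustrated as follows. `Hit` is a simplified stand-in for `MemoryHit`, and `rank` and `serialize` are hypothetical helpers; only the "clamp at serialization, rank on `sort_score`, fall back to `score`" behaviour comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    record_id: str
    score: float
    sort_score: Optional[float] = None  # unclamped ranking key


def rank(hits):
    """Order by the unclamped key, falling back to score when it is unset."""
    def key(h):
        return h.score if h.sort_score is None else h.sort_score
    return sorted(hits, key=key, reverse=True)


def serialize(hit):
    """Clamp only the number shown to the caller."""
    return {"record_id": hit.record_id, "score": min(1.0, max(0.0, hit.score))}


hits = [Hit("a", 1.3, 1.3), Hit("b", 2.4, 2.4), Hit("c", 0.7)]
print([serialize(h) for h in rank(hits)])
```

Two boosted hits both display as 1.0 here, yet the one with the larger internal score still ranks first.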
…ich_club
The essential-variable tracker and crisis recluster rebuilt their graph over ALL records -- including tombstoned (soft-deleted / deduped) ones -- so communities, centrality and rich_club were computed on thousands of dead nodes. On a real store rich_club sat at ~0.019, just under the 0.02 floor, re-arming crisis_mode every sleep cycle on a healthy store.
Add a shared build_live_graph() helper (tombstone-filtered nodes, live-only edges) used by both paths, and demote rich_club_coefficient from a crisis *trigger* to a diagnostic-only signal (edge_density and community_count remain the triggers). Harden the runtime-graph tombstone guard to pd.isna() so a NaT/NA tombstoned_at on a reembedded datetime64 column is read as LIVE instead of collapsing the graph to empty.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…lder
The daily-reflection step wrote an all-zero embedding placeholder that was never re-embedded, leaving reflections permanently unretrievable and feeding zero vectors into the scoring matmul. Embed literal_surface directly; on a native embedder failure fall back to a zero vector flagged embedding_pending=1 for deferred reembed.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
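The embed-or-flag fallback might be sketched like this. `embed_reflection`, the `embed` callable, and the explicit `dim` argument are hypothetical; the real embedder interface is not shown in this PR text.

```python
def embed_reflection(literal_surface, embed, dim):
    """Return (vector, embedding_pending) for a reflection's surface text."""
    try:
        return list(embed(literal_surface)), 0
    except Exception:
        # Embedder failed: keep a zero placeholder, but flag it so a later
        # reembed pass picks the row up instead of leaving it unretrievable.
        return [0.0] * dim, 1
```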
compute_sigma runs an unbounded random-graph reference that can spin a core on a large graph. Add SIGMA_N_CEIL (default 20000, env IAI_MCP_SIGMA_N_FLOOR / IAI_MCP_SIGMA_N_CEIL) and return None above it instead of computing. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
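A ceiling of this kind can be sketched as below. `bounded_sigma` and `sigma_n_ceil` are hypothetical names, and falling back to the default on an unparsable environment value is an assumption; only the variable name `IAI_MCP_SIGMA_N_CEIL` and the default of 20000 come from the commit message.

```python
import os


def sigma_n_ceil(default=20000):
    """Read the node-count ceiling from the environment, if set and valid."""
    raw = os.environ.get("IAI_MCP_SIGMA_N_CEIL")
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        return default  # assumption: ignore an unparsable override


def bounded_sigma(n_nodes, compute):
    """Skip the expensive computation above the ceiling and return None."""
    if n_nodes > sigma_n_ceil():
        return None
    return compute()
```

Callers then have to treat `None` as "not computed" rather than as a small-worldness value.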
The sleep-cycle staleness check only monitored attempt == 1, so a cycle that had already been retried (attempt >= 2) and was still wedged -- exactly the case the watchdog must catch -- was short-circuited and ignored. Short-circuit only when attempt < 1 instead, so every genuine running attempt (attempt >= 1) is monitored, and exclude bool explicitly (isinstance(True, int) is True). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
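The gate can be expressed as a small predicate. `is_monitored_attempt` is a hypothetical name used for illustration; the sketch only shows why `bool` needs explicit exclusion when testing an integer attempt counter.

```python
def is_monitored_attempt(attempt):
    """True for any genuine running attempt (1, 2, 3, ...)."""
    # bool is a subclass of int, so True would otherwise pass as attempt 1.
    if isinstance(attempt, bool) or not isinstance(attempt, int):
        return False
    return attempt >= 1
```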
e8f3deb stopped a transient count failure from parking the lifecycle tick but left the condition invisible (log.debug only). Emit a best-effort store_empty_check_failed warning event -- buffered and never raising, so it is safe even when the store connection is the thing failing -- so a sqlite left-in-error-state failure surfaces to the operator. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
reembed_pending_rows fed the raw stored literal_surface to embedder.embed(); on an encrypted store that is iai:enc:v1: ciphertext, so every embedding_pending=1 row re-embedded by this path got an embedding of the ciphertext (garbage). Decrypt via _decrypt_record_field first (no-op on a plaintext store); a decrypt failure leaves the row pending for retry rather than poisoning it with a garbage vector.
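The decrypt-then-embed ordering can be sketched per row as follows. `reembed_row` and the injected `decrypt` and `embed` callables are hypothetical; `decrypt` is assumed to return plaintext unchanged, mirroring the no-op behaviour described for a plaintext store.

```python
def reembed_row(row, decrypt, embed):
    """Return an updated copy of the row, or the row unchanged on failure."""
    try:
        surface = decrypt(row["literal_surface"])
    except Exception:
        return row  # stays embedding_pending=1 and is retried later
    updated = dict(row)
    updated["embedding"] = embed(surface)
    updated["embedding_pending"] = 0
    return updated
```

Leaving the row untouched on a decrypt failure means a later pass can still produce a correct vector, whereas embedding the ciphertext would clear the pending flag on a wrong one.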
A daemon killed mid-SLEEP leaves lifecycle_state.json at current_state=SLEEP with sleep_cycle_progress=None (incoherent: a real in-flight cycle carries a progress dict). Resuming it wedged the daemon -- it never advanced the sleep pipeline, never reached the recluster that clears crisis, and recall stayed degraded (SLEEP + crisis both degrade recall). Normalize that one case to a clean WAKE at boot (dropping the stale crisis flag) via _normalize_boot_lifecycle_state; a real degeneration re-arms crisis on the next complete sleep cycle.
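A pure normalizer for that one incoherent case might look like the sketch below. The keys `current_state` and `sleep_cycle_progress` come from the description above; the `crisis_mode` key and the function name are assumptions made for illustration.

```python
def normalize_boot_lifecycle_state(state):
    """Reset SLEEP-without-progress to a clean WAKE; leave everything else."""
    incoherent = (
        state.get("current_state") == "SLEEP"
        and state.get("sleep_cycle_progress") is None
    )
    if not incoherent:
        return state
    fixed = dict(state)
    fixed["current_state"] = "WAKE"
    fixed["crisis_mode"] = False  # hypothetical key for the stale crisis flag
    return fixed
```

A SLEEP state that does carry a progress dict is left alone, so a genuinely resumable cycle is not discarded.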
2631e96 to 642a8e8
The final recall ranking sorted hits by score alone with a stable sort, so equal-scoring hits kept their arrival order. Two code paths that compute the same logical score via different float summation orders (notably an empty profile_state falling back to the medium scale) could therefore emit byte-different orderings, flaking test_empty_profile_state_falls_back_to_medium_scale on CI. Tie-break on str(record_id) — the same idiom already used elsewhere in this module — so equal-scoring hits resolve deterministically. Behaviour for distinctly-scored hits is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
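The tie-break can be shown with a two-part sort key. `rank_hits` and the `(record_id, score)` tuple shape are simplifications for illustration, not the module's real types.

```python
def rank_hits(hits):
    """Sort (record_id, score) pairs: highest score first, ties by str(id)."""
    return sorted(hits, key=lambda h: (-h[1], str(h[0])))


arrival_a = [("b", 0.5), ("a", 0.5), ("c", 0.9)]
arrival_b = [("a", 0.5), ("c", 0.9), ("b", 0.5)]
print(rank_hits(arrival_a) == rank_hits(arrival_b))  # True
```

With score alone as the key, a stable sort would preserve the differing arrival orders of the two equal-scoring hits.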
CodeAbra approved these changes on Jun 22, 2026
CodeAbra (Owner) left a comment:
Thanks — this carries the consolidation/recall-correctness root causes for the crisis loop: tombstoned-record exclusion from the runtime graph, crisis topology on the live graph, score clamp, reflection-embedding + crash recovery. Reviewed the diff, security-clean, CI green. Merging with credit to you.
Summary
This is the consolidation- and recall-correctness half of #17 (the daemon WAKE/idle CPU-storm half is the companion PR). After a healthy store accumulates soft-deleted / deduped records, the sleep pipeline rebuilds its topology over the tombstoned nodes too, so `rich_club` sits just under its crisis floor and re-arms `crisis_mode` every cycle; meanwhile a degraded daemon serves recall as flat `1.000` scores. Each commit fixes one root cause behind those symptoms, on top of `v1.1.4`.
What changed (one logical change per commit)
- fix(capture): stop re-embedding seen turns, close the dedup race. Active sessions re-drain the whole transcript every turn, so the eager embed at the top of `capture_turn` re-ran the GIL-bound Rust matmul for every already-stored turn just to discard it on the idem-tag check. Embedding is now deferred behind a memoized `_compute_embedding()` that runs at most once and only for genuinely new turns; `flush_record_buffer` before the idem lookup under `_CAPTURE_DEDUP_LOCK` closes a check-then-insert race that produced live duplicate records.
- fix(recall): clamp the displayed score to [0,1], rank on unclamped `sort_score`. Multiplicative boosts (trigram×2, FTS×3, valence) push the internal score past 1.0, so a served hit could surface a "confidence" > 1 and a degraded state surfaced flat `1.000`. The displayed score is clamped at serialization while ranking uses a new, unclamped `MemoryHit.sort_score`; the stale-downweight keeps the two in lock-step. Ordering is provably unchanged (regression test included).
- fix(consolidation): build crisis topology on the live graph; demote `rich_club`. The essential-variable tracker and crisis recluster rebuilt over ALL records including tombstoned ones, so communities / centrality / `rich_club` were computed on thousands of dead nodes (`rich_club ~0.019`, just under the 0.02 floor → re-arms crisis on a healthy store). A shared `build_live_graph()` (tombstone-filtered nodes, live-only edges) feeds both paths, and `rich_club_coefficient` is demoted from a crisis trigger to a diagnostic-only signal (`edge_density` and `community_count` remain triggers). The runtime-graph tombstone guard is hardened to `pd.isna()` so a `NaT`/`NA` `tombstoned_at` on a reembedded `datetime64` column reads as LIVE instead of collapsing the graph.
- fix(dmn): embed the reflection surface text instead of a zero placeholder. Daily reflections wrote an all-zero embedding that was never re-embedded, leaving them permanently unretrievable and feeding zero vectors into the scoring matmul. Now embeds `literal_surface`; on native failure falls back to a zero vector flagged `embedding_pending=1` for deferred reembed.
- fix(sigma): bound small-worldness with a node-count ceiling. Adds `SIGMA_N_CEIL` (default 20000, env `IAI_MCP_SIGMA_N_FLOOR` / `..._CEIL`) returning `None` above it so the unbounded random-reference compute can't spin a core.
- fix(watchdog): monitor a retried-but-wedged sleep cycle for staleness. The staleness check only monitored `attempt == 1`, so a retried-and-still-wedged cycle (`attempt >= 2`) was short-circuited and ignored. Gates on `attempt < 1` instead (excluding `bool`), so every genuine running attempt is monitored.
- feat(daemon): surface a repeated store-empty count failure as telemetry. The companion PR stops a transient count failure from parking the tick but left the condition invisible; this emits a best-effort `store_empty_check_failed` warning event (buffered, never raises).
- fix(hippo): decrypt `literal_surface` before reembedding pending rows. On an encrypted store, `reembed_pending_rows` embedded the raw `iai:enc:v1:` ciphertext, producing a garbage vector for every `embedding_pending=1` row. Now decrypts via the existing `_decrypt_record_field` (a no-op on a plaintext store); a decrypt failure leaves the row pending for retry rather than poisoning it. Distinct from the v1.1.4 reembed fix (`04e62e2`): that repaired the migration path (`migrate/_reembed_from_text.py::migrate_reembed_from_text`); this fixes the runtime/daemon path (`hippo/_db.py::reembed_pending_rows`), which v1.1.4 does not touch, so the two are complementary, not redundant.
- fix(daemon): recover cleanly from a crash mid-SLEEP at boot. A daemon killed mid-SLEEP leaves `lifecycle_state.json` at `SLEEP` with `sleep_cycle_progress=None` (incoherent); resuming wedges the daemon so it never reaches the recluster that clears crisis. A pure `_normalize_boot_lifecycle_state` resets exactly that case to a clean WAKE and clears the stale crisis flags at boot.
Type of change
Affected areas
Testing
- `pytest` passes locally: full default gate (`pytest -m "not perf and not slow and not live"`, 3538 tests), rebased on `v1.1.4`: 3514 passed, 33 skipped, 1 xfailed, 1 failed. The single failure (`test_rendered_plist_contains_fd_floor`) is pre-existing: it fails identically on a clean `v1.1.4` checkout and is unrelated to this branch. (Some `test_doctor_*` rows are environment-flaky on the dev machine, where a large system `subprocess` output is decoded as strict UTF-8, and pass on a clean run.)
- `ruff check src/ tests/`: no new findings vs the `v1.1.3` baseline on the touched files (the repo ships no ruff config).
- …tombstone/live-graph filtering on `NaT`/`NA` columns, `dmn` embedding norm, sigma ceiling + env override, watchdog stale gate, encrypted-store reembed decrypt, and crash-mid-SLEEP boot normalization. (The `capture` drain/dedup commit currently relies on the existing capture tests + a live-store check; a dedicated unit test would strengthen it, and I am happy to add one here if you'd like.)
Benchmarks
The recall change is display-only and order-preserving: ranking moves to an unclamped `sort_score`, the clamp touches only the serialized number, and the regression test asserts identical ordering, so LongMemEval-S is unaffected by construction. The `dmn` fix makes daily reflections retrievable again (previously zero-vector), which can only help recall on reflection cues. A blind LongMemEval-S run (`python -m bench.longmemeval_blind --split S`) can be attached on request; it is not run by default to keep the blind split blind.
Notes for reviewers
- …Sleep daemon stuck in crisis_mode: cycle loops on a single step (never completes) while `sleep-cycle --force` works; served recall degrades to flat 1.000 / schema records #17 (intentionally not "Fixes": the companion WAKE/idle PR carries the closing keyword, since you flagged the CPU-storm half as the gate for closing #17). This PR addresses the consolidation/recall symptoms (crisis loop from a tombstoned-polluted graph, flat `1.000` served recall) at their source.
- …`capture.py` uses `is_drain_in_progress`, the runtime-graph tombstone guard, the store-empty path), so this is best reviewed/merged after it. Rebasing onto `main` once that lands is a clean fast-forward.
- …`crisis_mode_since_ts`, shipped v1.1.3) rather than duplicating it: the 72 h timeout clears a coherent crisis; this removes the false trigger so crisis is not (re-)armed on a healthy store in the first place.
- …`v1.1.4`. Our files are disjoint from v1.1.4's changes (reembed migration + analytics), so the rebase was conflict-free; the full gate above was run on the v1.1.4-based branch.
- …`rich_club_coefficient` is no longer a crisis trigger (kept as a diagnostic event field `is_crisis_trigger`). New env knobs: `IAI_MCP_SIGMA_N_FLOOR` / `IAI_MCP_SIGMA_N_CEIL`. New `MemoryHit.sort_score` field defaults to `None` (backward compatible: callers fall back to `score`).