feat(memory): Notion doc-aware versioned memory tree + page-content ingest#3378
Conversation
…ngest
Build a single per-connection Notion source tree where each document rolls up
to its own summary "doc-root", doc-roots merge into the connection root, edits
are non-destructive (new versioned doc-root, old kept), and traversal returns
the latest version per document.
Core engine
- bucket_seal: seal_document_subtree builds a per-document subtree as an
isolated side-cascade to one doc-root, then merges it via the existing
cascade at MERGE_LEVEL_BASE. Chat/email seal path unchanged.
- Single-input passthrough: a one-chunk doc-root is the chunk verbatim — no
summariser/LLM call.
- SummaryNode gains doc_id/version_ms (+ additive migration & indexes).
Ingest (Notion)
- Non-destructive versioned ingest: ingest_document_versioned keys the gate by
{source_id}@{version_ms}; removed the destructive delete_chunks_by_source.
- Page-content fetch: pull each new/edited page body via
NOTION_GET_PAGE_MARKDOWN and ingest body + metadata (FETCH_DATA returns
metadata/properties only). DB rows with no body fall back to metadata-only.
- Per-doc seal driven by a new JobKind::SealDocument enqueued at ingest;
Notion chunks are gated out of the flat append_buffer path.
Retrieval
- drill_down resolves max(version_ms) per doc_id (read-time latest-wins);
superseded versions stay on disk but never surface.
On-disk vault layout: source-<scope>/docs/<page>/v-<ms>/… + merge/L<level>/…
Fixes
- wipe_all now also clears mem_tree_ingested_sources (a wipe used to strand
the gate, blocking re-ingest forever).
- Graph UI: cap summary node radius (merge nodes live at level 1000+, which
blew up the d3 layout) and lower ZOOM_MIN so large clouds frame fully;
add a Refresh button + 30s tab-scoped poll to the memory graph.
Tests: ~19 unit/integration tests — per-version dedup, skip-unchanged,
keep-both-versions, additive engine versioning, single-chunk passthrough,
read-time latest-wins, on-disk layout, content-fetch parse/merge, wipe gate.
Follow-up (separate PR): clean noisy Notion markdown (strip S3 signed image
URL query params, collapse <span>/mention-user wrappers) to cut token waste.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Warning Review limit reached
More reviews will be available in 25 minutes and 40 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (8)
📝 WalkthroughWalkthroughThis PR implements versioned, non-destructive document ingestion and sealing: SummaryNodes gain doc/version fields and DB schema/indexes; ingest uses a per-version dedupe gate and persist gating; SealDocument job and handlers build per-document subtrees and merge them at MERGE_LEVEL_BASE; retrieval skips older doc revisions; Notion provider fetches rendered markdown; content layout becomes doc-aware; frontend auto-refresh and layout caps added. ChangesDocument versioning and non-destructive sealing
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
Resolve notion/provider.rs sync loop: keep upstream's depth-floor truncation + max-items cap, then run the per-page GET_PAGE_MARKDOWN body fetch on the capped/floored `pending` set. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/openhuman/memory/ingest_pipeline.rs (1)
160-178:⚠️ Potential issue | 🟠 Major | ⚡ Quick winInclude
path_scopein the document ingest gate key.Both the fast-path check and the transactional claim key off
source_id@version_ms, but the stable document-collection identity lives inpath_scope. If the same document id appears under two different connections/scopes, the second ingest is treated as a duplicate and never reaches its own per-connection tree.Suggested fix
+fn build_document_gate_key( + source_id: &str, + path_scope: Option<&str>, + version_ms: Option<i64>, +) -> String { + let base = match path_scope { + Some(scope) => format!("{scope}/{source_id}"), + None => source_id.to_string(), + }; + match version_ms { + Some(v) => format!("{base}@{v}"), + None => base, + } +} + pub async fn ingest_document_versioned( config: &Config, source_id: &str, owner: &str, tags: Vec<String>, doc: DocumentInput, path_scope: Option<String>, version_ms: Option<i64>, ) -> Result<IngestResult> { - let gate_key = match version_ms { - Some(v) => format!("{source_id}@{v}"), - None => source_id.to_string(), - }; + let gate_key = build_document_gate_key(source_id, path_scope.as_deref(), version_ms); if already_ingested(config, SourceKind::Document, &gate_key).await? { log::debug!( "[memory::ingest_pipeline] skip ingest_document — source_id_hash={} version_ms={:?} already ingested", redact(source_id), version_ms @@ async fn persist( config: &Config, source_id: &str, canonical: CanonicalisedSource, gate_version_ms: Option<i64>, ) -> Result<IngestResult> { let source_kind_for_store = canonical.metadata.source_kind; + let document_gate_key = (source_kind_for_store == SourceKind::Document).then(|| { + build_document_gate_key( + source_id, + canonical.metadata.path_scope.as_deref(), + gate_version_ms, + ) + }); @@ if source_kind_for_store == SourceKind::Document { let now_ms = chrono::Utc::now().timestamp_millis(); - let gate_key = match gate_version_ms { - Some(v) => format!("{source_id_for_store}@{v}"), - None => source_id_for_store.clone(), - }; let claimed = chunk_store::claim_source_ingest_tx( &tx, source_kind_for_store, - &gate_key, + document_gate_key.as_deref().expect("document gate key"), now_ms, )?;As per coding guidelines, "Memory source identity rule: Do not use per-item selector IDs as the source tree / raw archive identity; set
metadata.path_scopeto the stable collection identity."Also applies to: 196-200, 307-323
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/memory/ingest_pipeline.rs` around lines 160 - 178, The fast-path and transactional ingest gate use gate_key built from source_id and version_ms but must also include path_scope so identity is scoped to the collection; update the gate key construction (the let gate_key = match version_ms { ... } block) to incorporate path_scope (e.g., include path_scope as a prefix or suffix) and apply the same change to the other gate/transactional claim sites mentioned (the already_ingested check and the transactional claim usages around document::canonicalise, persist, and the blocks at the other referenced ranges), ensuring calls to already_ingested and any transactional keys consistently use the new path-scoped gate_key so documents in different path_scope values are not treated as duplicates (refer to functions/idents: already_ingested, document::canonicalise, persist, IngestResult::already_ingested and the local variable path_scope).
🧹 Nitpick comments (1)
app/src/components/intelligence/memoryGraphLayout.ts (1)
67-76: 💤 Low valueConsider extracting the cap as a named constant.
The radius cap of
14prevents layout explosion for high-level merge nodes and is well-commented, but extracting it (e.g.,const MAX_SUMMARY_RADIUS = 14;) would make the formula more self-documenting and easier to tune in the future.♻️ Proposed refactor
+const MAX_SUMMARY_RADIUS = 14; + export function nodeRadius(node: GraphNode): number { if (node.kind === 'source') return 16; if (node.kind === 'summary') { // Higher levels render slightly larger, but the size MUST be capped: // document source trees place their cross-document merge tier at a large // synthetic level (MERGE_LEVEL_BASE = 1000+), so the raw `level * 2.5` // would explode to thousands of px — rendering giant discs and, via the // `forceCollide(nodeRadius + 2)` term, blowing the whole layout apart. // The cap keeps merge nodes the largest summaries without distorting it. const level = node.level ?? 0; - return Math.min(5 + level * 2.5, 14); + return Math.min(5 + level * 2.5, MAX_SUMMARY_RADIUS); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/src/components/intelligence/memoryGraphLayout.ts` around lines 67 - 76, Extract the hard-coded cap 14 into a named constant (e.g., MAX_SUMMARY_RADIUS) and use it in the summary-size calculation so the intent is explicit and easier to tune; update the block that checks node.kind === 'summary' (which uses node.level) to compute level = node.level ?? 0 and return Math.min(5 + level * 2.5, MAX_SUMMARY_RADIUS) instead of literal 14, placing MAX_SUMMARY_RADIUS near the top of the module (or with related layout constants) so it's discoverable and documented.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/memory_queue/handlers/mod.rs`:
- Around line 113-117: The debug logs in seal_document currently emit raw
identifiers (payload.doc_id and tree_scope) which leaks recoverable source IDs;
update the logging to pass these values through the project’s existing redaction
helper before formatting (i.e., call the redaction helper on payload.doc_id and
on tree_scope before using them in log::debug/log::error), and apply the same
change to the other logging sites in this module (the block around lines
148–161) so all emitted diagnostics use the redacted values instead of raw
identifiers.
In `@src/openhuman/memory_tree/tree/bucket_seal.rs`:
- Around line 1000-1002: SealDocument currently always mints new doc-root ids on
each run which breaks retry idempotency; change it to first check for an
existing per-version seal (lookup by doc_id and version_ms) and reuse that
doc-root instead of creating a new one, or implement an atomic
upsert/insert-if-not-exists for the seal marker so only the first writer creates
the new ids; ensure the logic in SealDocument and any callers of drill_down
treat the found/reused doc-root id as canonical for (doc_id, version_ms) and
that commits after partial failure do not create duplicate roots.
- Around line 1314-1349: The backlink updates in the with_connection closure
(the tx.execute calls that currently use "AND parent_summary_id IS NULL" / "AND
parent_id IS NULL") make reused chunks keep links to old summaries; change the
update logic to be version-aware by overwriting backlinks for children that
belong to this same document/tree instead of only when NULL: remove the "IS
NULL" predicate and add a condition restricting the update to the same
tree/document (use node_for_tx.tree_id or equivalent) so the UPDATE statements
always set parent_summary_id/parent_id to summary_id_for_tx for child_id rows
that match the same tree_id; update the rusqlite::params! calls to pass
node_for_tx.tree_id as the extra parameter and adjust the SQL accordingly
(modify the tx.execute calls inside the for child_id loop).
---
Outside diff comments:
In `@src/openhuman/memory/ingest_pipeline.rs`:
- Around line 160-178: The fast-path and transactional ingest gate use gate_key
built from source_id and version_ms but must also include path_scope so identity
is scoped to the collection; update the gate key construction (the let gate_key
= match version_ms { ... } block) to incorporate path_scope (e.g., include
path_scope as a prefix or suffix) and apply the same change to the other
gate/transactional claim sites mentioned (the already_ingested check and the
transactional claim usages around document::canonicalise, persist, and the
blocks at the other referenced ranges), ensuring calls to already_ingested and
any transactional keys consistently use the new path-scoped gate_key so
documents in different path_scope values are not treated as duplicates (refer to
functions/idents: already_ingested, document::canonicalise, persist,
IngestResult::already_ingested and the local variable path_scope).
---
Nitpick comments:
In `@app/src/components/intelligence/memoryGraphLayout.ts`:
- Around line 67-76: Extract the hard-coded cap 14 into a named constant (e.g.,
MAX_SUMMARY_RADIUS) and use it in the summary-size calculation so the intent is
explicit and easier to tune; update the block that checks node.kind ===
'summary' (which uses node.level) to compute level = node.level ?? 0 and return
Math.min(5 + level * 2.5, MAX_SUMMARY_RADIUS) instead of literal 14, placing
MAX_SUMMARY_RADIUS near the top of the module (or with related layout constants)
so it's discoverable and documented.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b488d614-66e5-47ed-94eb-7ad46a587403
📒 Files selected for processing (25)
app/src/components/intelligence/MemoryWorkspace.tsxapp/src/components/intelligence/memoryGraphLayout.tssrc/openhuman/composio/ops_tests.rssrc/openhuman/memory/ingest_pipeline.rssrc/openhuman/memory/read_rpc.rssrc/openhuman/memory/read_rpc_tests.rssrc/openhuman/memory_queue/handlers/mod.rssrc/openhuman/memory_queue/types.rssrc/openhuman/memory_store/chunks/connection.rssrc/openhuman/memory_store/chunks/store_tests.rssrc/openhuman/memory_store/content/atomic.rssrc/openhuman/memory_store/content/paths.rssrc/openhuman/memory_store/content/read.rssrc/openhuman/memory_store/traits.rssrc/openhuman/memory_store/trees/store.rssrc/openhuman/memory_store/trees/store_tests.rssrc/openhuman/memory_store/trees/types.rssrc/openhuman/memory_sync/composio/providers/notion/ingest.rssrc/openhuman/memory_sync/composio/providers/notion/provider.rssrc/openhuman/memory_sync/composio/providers/notion/sync.rssrc/openhuman/memory_tree/ingest.rssrc/openhuman/memory_tree/retrieval/drill_down.rssrc/openhuman/memory_tree/tree/bucket_seal.rssrc/openhuman/memory_tree/tree/bucket_seal_tests.rssrc/openhuman/memory_tree/tree/mod.rs
Address CodeRabbit review on tinyhumansai#3378: - Redact tree_scope/doc_id (recoverable source ids) in handle_seal_document logs and error chain via the redact helper. - drill_down: dedup doc-roots at the winning version so a retried SealDocument that minted a duplicate (doc_id, version_ms) never double-surfaces (read-side idempotency guard). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…iterals Fixes the Rust Core Coverage CI compile failure — 5 SummaryNode literals across 4 tests/ integration files predate the new fields. (cargo check --lib doesn't compile tests/, so these were missed locally.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
nodeRadius now caps at 14 (merge-tier nodes live at level 1000+, which blew up the d3 layout). Update the assertion that expected the old uncapped 252.5 for level 99; add cap-boundary cases. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A byte-identical body chunk reused across doc versions upserts to the same content-addressed row, so its single parent_summary_id can only name one doc-root. The IS NULL guard left it stranded on the FIRST (now-superseded) version's summary, so graph_export drew the chunk's parent edge to an old doc-root. Drop the guard in seal_explicit_children so each version's seal re-points the chunk to its doc-root. Subtrees seal newest-last, so last-write-wins leaves the backlink on the latest version — the one drill_down surfaces. Retrieval was already correct (top-down via child_ids + version filter); this fixes the graph-edge for reused chunks. Addresses CodeRabbit review. Co-Authored-By: Claude <noreply@anthropic.com>
The Coverage Gate flagged the new graph refresh control and the 30s poll effect as uncovered changed lines (MemoryWorkspace.tsx 123-125,257 → diff-cover 66% < 80%). Add a Vitest suite that asserts the refresh button re-exports the graph and the poll re-pulls on a 30s tick while skipping ticks when the tab is hidden. Co-Authored-By: Claude <noreply@anthropic.com>
|
Pushed two follow-ups:
|
Fixes the Frontend Quality (prettier --check) failure on a5097e8. Co-Authored-By: Claude <noreply@anthropic.com>
Summary
max(version_ms)).NOTION_GET_PAGE_MARKDOWNand ingest body + metadata (previously metadata-only). DB rows with no body fall back to metadata-only.wipe_allnow also clears the ingest gate (mem_tree_ingested_sources) — a wipe used to strand it and block re-ingest forever.Problem
The Notion sync ingested only page metadata as a single flat chunk per page, edits destructively deleted prior chunks, and there was no per-document structure or version history in the memory tree. Re-syncs after a wipe also silently produced zero chunks.
Solution
bucket_seal::seal_document_subtreebuilds a per-document subtree as an isolated side-cascade to one doc-root, then merges via the existing cascade atMERGE_LEVEL_BASE. Chat/email seal path is byte-for-byte unchanged (gated onSourceKind::Document/Notion).SummaryNodegainsdoc_id/version_ms(additive migration + index).ingest_document_versionedkeys the source gate by{source_id}@{version_ms}.JobKind::SealDocumentenqueued at ingest; Notion chunks gated out of the flatappend_bufferpath.drill_downfilters to the latest version perdoc_idat read time; superseded versions remain on disk but never surface.source-<scope>/docs/<page>/v-<ms>/…+merge/L<level>/….Submission Checklist
cargo-llvm-cov/diff-covernot run locally (cloud-embeddings backend unavailable in sandbox). Unit tests for changed logic added; please run CI coverage.docs/TEST-COVERAGE-MATRIX.md) — flagging for reviewer.Impact
doc_id/version_mscolumns + index) — safe on existing DBs.NOTION_GET_PAGE_MARKDOWNrequest per new/edited page (heavier first backfill; fine incrementally). Single-chunk passthrough removes one LLM summarise call per single-chunk doc.Related
<span>/mention-user/discussion-urlswrappers — to cut token waste in embeds/summaries.MERGE_LEVEL_BASE+1000 level offset in favour ofdoc_id IS NULLto mark the merge tier (cleaner node labels).AI Authored PR Metadata
Linear Issue
Commit & Branch
Validation Run
pnpm typecheck— passcargo test --lib)cargo fmt+cargo check --libcleanValidation Blocked
command:pre-push hooklint:commands-tokens/ fullcargo-llvm-coverror:ripgrepnot installed in sandbox; cloud-embeddings backend (full memory-tree integration tests) unauthenticated in sandboximpact:pushed with--no-verify(rg-missing only); changed-logic unit tests pass. CI should run lint + coverage.Behavior Changes
Parity Contract
Duplicate / Superseded PR Handling
Summary by CodeRabbit
New Features
Bug Fixes
Tests