Tracking the content-hash passage ID work. 5 stacked sub-PRs, all open and CI-green on lint+ty.
Today
Passage IDs are `str(len(self.chunks))` at `api.py:463`: sequential integers assigned in insertion order. They are embedded as FAISS graph node labels, used as keys in the offset map for `passages.jsonl`, and returned in search results.
Problem
Move a file, reorder chunks, or interleave builds and the IDs shift. Tevfik (LeannVault, #237 thread) hit this directly: the index breaks when local folders are reorganized because the ID-to-path mapping is implicit in insertion order.
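To make the failure mode concrete, here is a minimal sketch of insertion-order IDs. `assign_sequential_ids` is a hypothetical helper that only mimics the `str(len(self.chunks))` rule; it is not code from the repository.

```python
def assign_sequential_ids(chunks: list[str]) -> dict[str, str]:
    """Map each chunk's text to an ID derived from its insertion position."""
    ids: dict[str, str] = {}
    seen: list[str] = []
    for text in chunks:
        ids[text] = str(len(seen))  # same rule as str(len(self.chunks))
        seen.append(text)
    return ids


first_build = assign_sequential_ids(["alpha", "beta", "gamma"])
# Same content, but the source files were visited in a different order.
second_build = assign_sequential_ids(["gamma", "alpha", "beta"])

print(first_build["alpha"], second_build["alpha"])  # 0 1
```

Anything that stored `"0"` as a reference to the "alpha" passage after the first build points at "gamma" after the second.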
Proposal
Default new builds to `sha256(text)[:16]`, the first 16 hex characters of the SHA-256 digest of the chunk text. These IDs are stable across file moves and reorderings, and dedup-friendly. Existing indexes are unaffected: the schema bump is `meta.json["version"]`: `"1.0"` → `"1.1"`, the scheme is recorded in `passage_id_scheme`, and old indexes (no field) default to `"sequential"` everywhere.
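A rough sketch of the two pieces described above, under stated assumptions: the text is encoded as UTF-8 before hashing (the proposal does not specify an encoding), and `content_hash_id` and `resolve_scheme` are illustrative names rather than functions from the PRs.

```python
import hashlib


def content_hash_id(text: str) -> str:
    """Return the first 16 hex characters (64 bits) of the SHA-256 digest."""
    # Assumption: UTF-8 encoding; the real implementation may differ.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def resolve_scheme(meta: dict) -> str:
    """Read the scheme from a parsed meta.json.

    Indexes written before the schema bump have no passage_id_scheme
    field, so they are treated as sequential.
    """
    return meta.get("passage_id_scheme", "sequential")


old_meta = {"version": "1.0"}
new_meta = {"version": "1.1", "passage_id_scheme": "content-hash"}

print(resolve_scheme(old_meta))  # sequential
print(resolve_scheme(new_meta))  # content-hash
print(len(content_hash_id("alpha")))  # 16
```

Because the ID depends only on the text, the same chunk gets the same ID regardless of which file it came from or when it was inserted.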
Sub-PRs
| # | PR | What |
|---|----|------|
| 1 | #330 | refactor: write `passage_id_scheme` field into meta.json (purely additive) |
| 2 | #331 | feat: `LeannBuilder(passage_id_scheme="content-hash")` + `leann build --id-scheme content-hash` |
| 3 | #336 | feat: incremental updates honor the existing index's scheme; `LeannSearcher.passage_id_scheme` exposed |
| 4 | #337 | feat: `leann migrate-ids <index>` to convert existing sequential indexes (rewrites `.passages.jsonl`, `.idx`, `.ids.txt`, `.meta.json`; FAISS labels untouched; dedups collisions) |
| 5 | #338 | feat: flip default scheme `"sequential"` → `"content-hash"` |
Open design questions still flagged
- Collision handling. Two identical-text chunks hash to the same ID. Current default: dedup (later occurrence wins in the offset map). Override (`--preserve-duplicates`, appends `-N`) hasn't been written; can land separately if anyone wants the non-dedup behavior.
- What's hashed. Just the text. Hashing path/chunk-index would defeat the file-move stability this is trying to fix. Open to argument.
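
The two collision behaviors can be sketched as follows. This is illustrative only: `build_offset_map` is a hypothetical helper, the list position stands in for the byte offset into `passages.jsonl`, UTF-8 encoding is assumed, and the exact `-N` numbering (first occurrence unsuffixed, later ones `-1`, `-2`, ...) is a guess, since the override has not been written.

```python
import hashlib


def content_hash_id(text: str) -> str:
    # Assumption: UTF-8 encoding before hashing.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def build_offset_map(chunks: list[str], preserve_duplicates: bool = False) -> dict[str, int]:
    """Map passage ID -> position of the chunk in the input list."""
    offsets: dict[str, int] = {}
    counts: dict[str, int] = {}
    for position, text in enumerate(chunks):
        base = content_hash_id(text)
        n = counts.get(base, 0)  # how many times this hash was seen before
        counts[base] = n + 1
        if preserve_duplicates and n > 0:
            key = f"{base}-{n}"  # hypothetical suffix format
        else:
            key = base
        offsets[key] = position  # in dedup mode a later occurrence overwrites
    return offsets


chunks = ["same text", "other", "same text"]
dedup = build_offset_map(chunks)
kept = build_offset_map(chunks, preserve_duplicates=True)

print(len(dedup), len(kept))  # 2 3
```

In dedup mode the ID for "same text" resolves to the last occurrence (position 2); with the override, both occurrences stay addressable.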
Each PR is independently reviewable. Recommended merge order: 1, 2, 3 first (additive, no default change), then 4 (migration tool) before 5 (default flip) so users have a way to convert before the default changes.