Design: content-hash passage IDs (file-move stability) #329

@raoabinav

Tracking the content-hash passage ID work. 5 stacked sub-PRs, all open and CI-green on lint+ty.

Today

Passage IDs are str(len(self.chunks)) at api.py:463: sequential ints assigned in insertion order. They are embedded as FAISS graph node labels, used as keys in passages.jsonl's offset map, and returned in search results.

Problem

Move a file, reorder chunks, or interleave builds and the IDs shift. Tevfik (LeannVault, #237 thread) hit this directly: the index breaks when local folders are reorganized because the ID-to-path mapping is implicit in insertion order.
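A minimal sketch of the shift, assuming a simplified stand-in for the builder. The `assign_sequential_ids` helper is hypothetical and only mimics the `str(len(self.chunks))` rule; it is not the code in api.py.

```python
def assign_sequential_ids(texts):
    """Assign each chunk the ID str(len(stored)) at the moment it is inserted."""
    stored = []  # stands in for self.chunks
    for text in texts:
        passage_id = str(len(stored))
        stored.append({"id": passage_id, "text": text})
    return {chunk["text"]: chunk["id"] for chunk in stored}


# Same three chunks, scanned in a different order (e.g. after a folder move).
first_build = assign_sequential_ids(["alpha", "beta", "gamma"])
second_build = assign_sequential_ids(["gamma", "alpha", "beta"])

# Every chunk whose ID differs between the two builds.
shifted = [text for text in first_build if first_build[text] != second_build[text]]
print(shifted)
```

Nothing about the chunk text changed between the two builds, yet any ID stored from the first build now points at a different passage in the second.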

Proposal

Default new builds to sha256(text)[:16], a 16-character hex prefix. IDs are content-stable across file moves and reorderings, and dedup-friendly. Existing indexes are unaffected: the schema bump is meta.json["version"]: "1.0" → "1.1", the scheme is recorded in passage_id_scheme, and old indexes (no field) default to "sequential" everywhere.
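A rough sketch of the proposed scheme and the backward-compatible default. The helper names `content_hash_id` and `resolve_scheme` are hypothetical, and UTF-8 encoding of the text before hashing is an assumption; the issue only specifies sha256(text)[:16] and the passage_id_scheme field.

```python
import hashlib


def content_hash_id(text: str) -> str:
    """First 16 hex characters (64 bits) of SHA-256 over the text.

    Assumption: the text is encoded as UTF-8 before hashing.
    """
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def resolve_scheme(meta: dict) -> str:
    """Read the scheme from meta.json contents; a missing field means an old index."""
    return meta.get("passage_id_scheme", "sequential")


# IDs depend only on the text, so scan order does not matter.
build_a = {t: content_hash_id(t) for t in ["alpha", "beta", "gamma"]}
build_b = {t: content_hash_id(t) for t in ["gamma", "alpha", "beta"]}
print(build_a == build_b)

old_meta = {"version": "1.0"}  # written before the field existed
new_meta = {"version": "1.1", "passage_id_scheme": "content-hash"}
print(resolve_scheme(old_meta), resolve_scheme(new_meta))
```

Truncating to 16 hex characters keeps 64 bits of the digest, so accidental collisions between different texts are unlikely at typical index sizes but not impossible.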

Sub-PRs

| # | PR | What |
|---|----|------|
| 1 | #330 | refactor: write passage_id_scheme field into meta.json (purely additive) |
| 2 | #331 | feat: LeannBuilder(passage_id_scheme="content-hash") + leann build --id-scheme content-hash |
| 3 | #336 | feat: incremental updates honor the existing index's scheme; LeannSearcher.passage_id_scheme exposed |
| 4 | #337 | feat: leann migrate-ids <index> to convert existing sequential indexes (rewrites .passages.jsonl, .idx, .ids.txt, .meta.json; FAISS labels untouched; dedups collisions) |
| 5 | #338 | feat: flip default scheme "sequential" → "content-hash" |

Open design questions still flagged

  • Collision handling. Two identical-text chunks hash to the same ID. Current default: dedup (later occurrence wins in the offset map). Override (--preserve-duplicates, appends -N) hasn't been written; can land separately if anyone wants the non-dedup behavior.
  • What's hashed. Just the text. Hashing path/chunk-index would defeat the file-move stability this is trying to fix. Open to argument.
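The two collision behaviors above can be sketched as follows. This is illustrative only: `build_offset_map` and its `preserve_duplicates` parameter are hypothetical names, and the exact numbering of the -N suffix is an assumption, since the override has not been written.

```python
import hashlib


def content_hash_id(text: str) -> str:
    # Assumption: UTF-8 encoding before hashing.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def build_offset_map(chunks, preserve_duplicates=False):
    """Map passage ID -> byte offset for an iterable of (text, offset) pairs.

    Default: identical texts share one ID and the later offset overwrites
    the earlier one. With preserve_duplicates, repeats get a -N suffix
    (assumed numbering: first repeat is -1, second is -2, ...).
    """
    offsets = {}
    seen = {}  # base ID -> number of occurrences so far
    for text, offset in chunks:
        base = content_hash_id(text)
        passage_id = base
        if preserve_duplicates:
            count = seen.get(base, 0)
            seen[base] = count + 1
            if count > 0:
                passage_id = f"{base}-{count}"
        offsets[passage_id] = offset
    return offsets


chunks = [("same", 0), ("other", 10), ("same", 20)]
deduped = build_offset_map(chunks)
kept = build_offset_map(chunks, preserve_duplicates=True)
print(len(deduped), len(kept))
```

One trade-off worth noting: under this sketch a suffixed ID depends on the order in which duplicates are seen, so the non-dedup mode would give up some of the reorder stability for the duplicated chunks only.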

Each PR is independently reviewable. Recommended merge order: 1, 2, 3 first (additive, no default change), then 4 (migration tool) before 5 (default flip) so users have a way to convert before the default changes.
