Fix intra-batch duplicate-ID orphaning (langchain, haystack) by RyanCodrai · Pull Request #98 · RyanCodrai/turbovec

RyanCodrai · 2026-06-09T13:08:04Z

Addresses #90.

Problem

When a single add/write call contained the same id more than once, the integration added one vector per row to the index, but the id→handle dict kept only the last handle for that id. Each earlier vector became an orphan: live in search, mapped to the wrong document, and unreachable for delete. The result was silent data corruption.

The duplicate check in each wrapper only looked at the existing store, never the batch-so-far, so intra-batch repeats slipped through.
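
A minimal sketch of the failure mode described above, using a toy index rather than the real turbovec API. `ToyIndex`, `naive_add`, and the integer handles are hypothetical stand-ins chosen for illustration; only the shape of the bug (one vector per row, one handle per id) comes from the description.

```python
class ToyIndex:
    """Hypothetical stand-in for a vector index: one integer handle per added row."""

    def __init__(self):
        self.rows = {}   # handle -> vector
        self._next = 0

    def add(self, vector):
        handle = self._next
        self._next += 1
        self.rows[handle] = vector
        return handle


def naive_add(index, id_to_handle, ids, vectors):
    """Buggy pattern: nothing checks the batch-so-far, so a repeated id
    silently replaces its earlier handle in the dict."""
    for doc_id, vec in zip(ids, vectors):
        id_to_handle[doc_id] = index.add(vec)


index = ToyIndex()
id_to_handle = {}
naive_add(index, id_to_handle, ["a", "a"], [[1.0, 0.0], [0.0, 1.0]])

# Handles that live in the index but are no longer reachable through any id.
orphans = set(index.rows) - set(id_to_handle.values())
print(len(index.rows), len(id_to_handle), orphans)  # prints: 2 1 {0}
```

The index holds two rows while the id map holds one entry, so handle 0 can still be returned by a search but can never be deleted by id.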

Fix

Resolve duplicates against the batch-so-far too, matching each wrapper's drop-in reference store (both of which write into a plain dict as they iterate):

langchain (InMemoryVectorStore parity): dedup the batch, keeping the last occurrence per id, before adding; the returned id list still mirrors the input.
haystack (InMemoryDocumentStore parity): resolve per DuplicatePolicy. FAIL raises on the second occurrence, SKIP keeps the first, and OVERWRITE keeps the last. The returned count matches the reference.
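
The keep-last behaviour for langchain can be sketched roughly as follows. This is an illustration of the idea, not the code in this PR: `dedup_keep_last` is a hypothetical helper, and it assumes the batch arrives as parallel lists of ids and items.

```python
def dedup_keep_last(ids, items):
    """Keep the last item seen for each id.

    Writing into a plain dict while iterating mirrors what a dict-backed
    store does on a repeated id: the later row overwrites the earlier one.
    """
    latest = {}
    for doc_id, item in zip(ids, items):
        latest[doc_id] = item
    return list(latest.keys()), list(latest.values())


ids = ["a", "b", "a"]
texts = ["old a", "b", "new a"]

unique_ids, unique_texts = dedup_keep_last(ids, texts)
returned_ids = list(ids)  # the caller still gets one id per input row

print(unique_ids, unique_texts)  # prints: ['a', 'b'] ['new a', 'b']
print(returned_ids)              # prints: ['a', 'b', 'a']
```

Only the deduplicated rows would be handed to the index, so the index and the id map stay at one entry per id.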

The raw Rust library and the llama_index wrapper already reject intra-batch duplicates, so they were unaffected.

agno — intentionally not changed

The same fix was tried on agno and reverted because it broke a pinned test (test_search_does_not_dedupe_distinct_documents_with_identical_content). This was verified against the real, installed Agno LanceDb backend: it derives doc_id identically and inserts one row per document with no uniqueness constraint, so two documents with the same content, distinct content_id values, and no explicit id legitimately coexist as two search hits in production. Dedup-keep-last would diverge from that behavior. agno's separate (smaller) wart is in delete_by_id, which removes one colliding row where LanceDb deletes all matching rows; that is out of scope for #90.

Tests

langchain: test_add_texts_intra_batch_duplicate_ids_keep_last asserts no orphan (the index and the id maps agree at one entry) and last-wins. 49 tests pass.
haystack: test_intra_batch_duplicate_{overwrite_keeps_last_no_orphan,fail_raises,skip_keeps_first}. 75 tests pass.
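
The three haystack cases named above can be illustrated against a toy resolver. This is a sketch under stated assumptions, not the wrapper's code: `Policy` is a local stand-in for the real DuplicatePolicy, `ValueError` stands in for whatever error the real store raises, and `resolve_batch` is a hypothetical helper.

```python
from enum import Enum


class Policy(Enum):
    """Local stand-in for the real duplicate policy enum."""
    FAIL = "fail"
    SKIP = "skip"
    OVERWRITE = "overwrite"


def resolve_batch(existing_ids, batch, policy):
    """Resolve a batch of (id, doc) pairs against the store AND the batch-so-far.

    Returns a dict of id -> doc holding exactly the rows that should be written.
    """
    accepted = {}
    for doc_id, doc in batch:
        already_seen = doc_id in existing_ids or doc_id in accepted
        if already_seen:
            if policy is Policy.FAIL:
                raise ValueError(f"duplicate id {doc_id!r}")
            if policy is Policy.SKIP:
                continue  # keep whichever row came first
        accepted[doc_id] = doc  # OVERWRITE (or a new id): later row wins
    return accepted


batch = [("a", "first"), ("a", "second")]

print(resolve_batch(set(), batch, Policy.OVERWRITE))  # prints: {'a': 'second'}
print(resolve_batch(set(), batch, Policy.SKIP))       # prints: {'a': 'first'}

try:
    resolve_batch(set(), batch, Policy.FAIL)
except ValueError as exc:
    print("raised:", exc)  # prints: raised: duplicate id 'a'
```

In each non-raising case the result holds one entry per id, which is the no-orphan condition the tests check.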

🤖 Generated with Claude Code

add_texts/add_documents added every row to the index but _str_to_u64 kept only the last handle per id, orphaning earlier vectors: live in search, mapped to the wrong document, unreachable for delete. Dedup the batch keeping the last occurrence per id before adding, matching InMemoryVectorStore, whose dict store silently overwrites on a repeated id. The returned id list still mirrors the input.

Closes part of #90.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

write_documents resolved duplicates only against the existing store, not the batch-so-far. A repeated id within one call slipped past the policy check: every row got its own vector while _str_to_u64 kept only the last handle, orphaning the earlier vectors. Track ids accepted earlier in the same batch so a repeat is resolved the way InMemoryDocumentStore does (it writes into its dict as it iterates): FAIL raises on the second, SKIP keeps the first, OVERWRITE keeps the last. Return count matches the reference.

Closes part of #90.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RyanCodrai and others added 2 commits June 9, 2026 13:53

RyanCodrai merged commit 38c8c22 into main Jun 9, 2026
6 checks passed

RyanCodrai deleted the fix-intrabatch-dup-ids branch June 9, 2026 13:15

RyanCodrai mentioned this pull request Jun 9, 2026

bug: intra-batch duplicate IDs create orphaned vectors (langchain, agno) #90

Closed

michaelxer mentioned this pull request Jun 9, 2026

bug: agno insert() orphans vectors on intra-batch duplicate derived doc_ids #104

Closed

RyanCodrai merged 2 commits into main from fix-intrabatch-dup-ids
