Fix intra-batch duplicate-ID orphaning (langchain, haystack) #98

Merged: RyanCodrai merged 2 commits into main from fix-intrabatch-dup-ids on Jun 9, 2026

@RyanCodrai

Owner

Addresses #90.

Problem

When a single add/write call contained the same id twice, the integration added one vector per row to the index but the id→handle dict kept only the last handle per id. The earlier vector became an orphan: live in search, mapped to the wrong document, and unreachable for delete — silent data corruption.

The duplicate check in each wrapper only looked at the existing store, never the batch-so-far, so intra-batch repeats slipped through.
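A minimal sketch of how the orphan arises, assuming a toy index that hands out one handle per added row. `ToyIndex`, `naive_add`, and `id_to_handle` are hypothetical names for illustration, not the wrappers' actual code:

```python
class ToyIndex:
    """Hypothetical stand-in for a vector index: one handle per added row."""

    def __init__(self):
        self.vectors = {}
        self._next_handle = 0

    def add(self, vector):
        handle = self._next_handle
        self._next_handle += 1
        self.vectors[handle] = vector
        return handle


def naive_add(index, id_to_handle, ids, vectors):
    """Pre-fix pattern: duplicates are checked against the existing store only."""
    existing = set(id_to_handle)  # snapshot taken before the batch starts
    for doc_id, vector in zip(ids, vectors):
        if doc_id in existing:
            raise ValueError(f"duplicate id: {doc_id!r}")
        # A repeat inside the batch overwrites the earlier handle here.
        id_to_handle[doc_id] = index.add(vector)


index = ToyIndex()
id_to_handle = {}
naive_add(index, id_to_handle, ["a", "a"], [[1.0], [2.0]])

# Handles that are live in the index but no longer reachable through any id.
orphans = set(index.vectors) - set(id_to_handle.values())
print(len(index.vectors), id_to_handle, orphans)
```

In this sketch the index ends up holding two vectors while the map holds one entry, so handle 0 stays searchable but can no longer be deleted by id.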

Fix

Resolve duplicates against the batch-so-far too, matching each wrapper's drop-in reference store (both of which write into a plain dict as they iterate):

  • langchain (InMemoryVectorStore parity) — dedup the batch keeping the last occurrence per id before adding; the returned id list still mirrors the input.
  • haystack (InMemoryDocumentStore parity) — resolve per DuplicatePolicy: FAIL raises on the second, SKIP keeps the first, OVERWRITE keeps the last. Return count matches the reference.
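The two resolution rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `Policy` is a hypothetical stand-in for haystack's `DuplicatePolicy`, and `dedup_keep_last` and `resolve_batch` are made-up helper names.

```python
from enum import Enum


class Policy(Enum):
    """Hypothetical stand-in for the three policies discussed above."""

    FAIL = "fail"
    SKIP = "skip"
    OVERWRITE = "overwrite"


def dedup_keep_last(ids, rows):
    """langchain-style rule: keep the last row seen for each id."""
    latest = {}
    for doc_id, row in zip(ids, rows):
        latest[doc_id] = row  # a repeated id replaces the earlier row
    return latest


def resolve_batch(ids, rows, existing_ids, policy):
    """haystack-style rule: check each id against the store AND the batch-so-far."""
    accepted = {}
    for doc_id, row in zip(ids, rows):
        seen = doc_id in existing_ids or doc_id in accepted
        if seen and policy is Policy.FAIL:
            raise ValueError(f"duplicate id: {doc_id!r}")
        if seen and policy is Policy.SKIP:
            continue  # the first occurrence wins
        accepted[doc_id] = row  # new id, or OVERWRITE: the last occurrence wins
    return accepted


kept = dedup_keep_last(["a", "b", "a"], ["old", "other", "new"])
skipped = resolve_batch(["a", "a"], ["first", "second"], set(), Policy.SKIP)
overwritten = resolve_batch(["a", "a"], ["first", "second"], set(), Policy.OVERWRITE)
```

Because every row is resolved before anything reaches the index, at most one vector per id is added, which is what keeps the index and the id map in agreement. How the real wrappers compute the returned id list and count is not shown here.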

The raw Rust library and the llama_index wrapper already reject intra-batch duplicates, so they were unaffected.

agno — intentionally not changed

The same fix was tried on agno and reverted because it broke a pinned test (test_search_does_not_dedupe_distinct_documents_with_identical_content). This was verified against the installed real Agno LanceDb backend: it derives doc_id identically and inserts one row per document with no uniqueness constraint, so two documents with the same content, distinct content_id values, and no id legitimately coexist as two search hits in production. Dedup-keep-last would diverge from that. agno's separate (smaller) wart is in delete_by_id, which removes one colliding row where LanceDb deletes all matching rows; that is out of scope for #90.

Tests

  • langchain: test_add_texts_intra_batch_duplicate_ids_keep_last — asserts no orphan (index / id maps agree at one entry) and last-wins. 49 pass.
  • haystack: test_intra_batch_duplicate_{overwrite_keeps_last_no_orphan,fail_raises,skip_keeps_first}. 75 pass.
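The "index / id maps agree" assertion can be pictured as a small invariant check. The sketch below is an assumption about the shape of such a check, not the tests' actual code; `check_maps_agree` is a hypothetical helper.

```python
def check_maps_agree(index_handles, id_to_handle):
    """Return (orphans, dangling) for an index and its id-to-handle map.

    orphans:  handles live in the index that no id points at
    dangling: handles an id points at that the index does not hold
    """
    mapped = set(id_to_handle.values())
    return index_handles - mapped, mapped - index_handles


# Post-fix shape for a batch with ids ["a", "a"]: one vector, one mapping.
healthy = check_maps_agree({1}, {"a": 1})

# Pre-fix shape for the same batch: two vectors, one mapping.
broken = check_maps_agree({0, 1}, {"a": 1})
```

A test in this style would assert that both returned sets are empty after the add, and then separately check that the surviving entry holds the last document.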

🤖 Generated with Claude Code

RyanCodrai and others added 2 commits June 9, 2026 13:53
add_texts/add_documents added every row to the index but _str_to_u64
kept only the last handle per id, orphaning earlier vectors: live in
search, mapped to the wrong document, unreachable for delete.

Dedup the batch keeping the last occurrence per id before adding —
matching InMemoryVectorStore, whose dict store silently overwrites on a
repeated id. The returned id list still mirrors the input. Closes part
of #90.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

write_documents resolved duplicates only against the existing store, not
the batch-so-far. A repeated id within one call slipped past the policy
check: every row got its own vector while _str_to_u64 kept only the last
handle, orphaning the earlier vectors.

Track ids accepted earlier in the same batch so a repeat is resolved the
way InMemoryDocumentStore does (it writes into its dict as it iterates):
FAIL raises on the second, SKIP keeps the first, OVERWRITE keeps the last.
Return count matches the reference. Closes part of #90.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RyanCodrai merged commit 38c8c22 into main on Jun 9, 2026
6 checks passed
RyanCodrai deleted the fix-intrabatch-dup-ids branch on June 9, 2026 13:15