Skip to content

Audit-driven integration + Rust core fidelity work (7 waves)#83

Merged
RyanCodrai merged 24 commits into
mainfrom
fix/haystack-blob-sparse-roundtrip
May 30, 2026
Merged

Audit-driven integration + Rust core fidelity work (7 waves)#83
RyanCodrai merged 24 commits into
mainfrom
fix/haystack-blob-sparse-roundtrip

Conversation

@RyanCodrai

@RyanCodrai RyanCodrai commented May 30, 2026

Copy link
Copy Markdown
Owner

Summary

Six waves of audit-driven work on the integrations and the Rust core. All in one PR because they share the same investigation thread.

Wave 1 — Structural fidelity bug fixes in integrations (4 commits)

Follow-up to #81. Audited each integration vs its reference store; found analogous "cherry-picked side-car schema silently drops fields" bugs in haystack, llama-index, agno. Plus the original #81 fix cherry-picked from #82.

Wave 2 — Reference-parity tests (4 commits, +30 tests)

Audited each library's test surface vs ours; pinned 30 behaviours their in-tree reference store tests but we didn't.

Wave 3 — Python integration field-completeness (4 commits, 1 fix + 17 tests)

LlamaIndex add() was silently corrupting state on intra-batch duplicate node_ids. Plus exhaustive return-field audit on all four integrations.

Wave 4 — Rust core audit (4 commits, 1 fix + 22 tests)

IdMapIndex::add_with_ids_2d was mutating ID tables before delegating to inner add; on Err returns the tables retained ghost entries pointing at slots that didn't exist. Plus TurboQuantIndex API gaps, state-sequence tests, internal module gaps.

Wave 5 — Numerical + bindings + concurrency (3 commits, 8 fixes + 29 tests)

4 silent data-corruption bugs in the encode/search pipeline:

  • NaN input → vec_scales poisoned, slot exists in len() but is silently unreachable.
  • Inf input → same NaN-poisoning path.
  • Magnitude >= 1e16 → f32 norm overflows to +Inf, slot wins top-k against every query.
  • NaN/Inf query → kernel produces NaN scores, arbitrary indices returned.

Fix: AddError::InvalidInputValue + finiteness/magnitude validation at add / add_2d / search / search_with_mask entry.

4 Python binding hygiene fixes: non-contiguous arrays → typed ValueError; wrong query dim → typed ValueError; swap_remove out-of-bounds → typed IndexError; cross-class shape consistency between TurboQuantIndex.search and IdMapIndex.search for empty queries.

Plus 3 new concurrency tests (load+search races, IdMapIndex concurrent search, prepare-with-search interleaving).

Wave 6 — TQ+ calibration + allowlist dedup + doc drift (3 commits, 3 fixes + 3 tests)

🔴 TQ+ F1 — v2-loaded index + add silently mis-encoded new vectors. Loading a v2 (pre-TQ+) file left tqplus_shift empty; the next add saw existing = None, encoded vectors with fresh-fitted calibration, but the else branch silently dropped the fitted shift/scale_tq. New vectors were encoded with calibration but searched as identity → silently corrupted scores. Fix in from_parts: populate explicit identity when loading a v2-shaped state.

🔴 TQ+ F2 — Empty first add silently froze identity calibration forever. add(&[]) returned identity from encode, the n_vectors == 0 branch wrote it to self.tqplus_shift, and every subsequent add (even 1M-vector batches) saw existing = Some(identity) and skipped fitting. Fix: add short-circuits to a true no-op when n == 0.

🔴 Wave-5 allowlist dedup bug (in my own wave-5 fix). For nq == 0, effective_k counted raw allowlist length but the kernel internally dedups via packed-bool mask. So allowlist=[1,1,1] returned shape (0, 3) for empty queries vs (N, 1) for non-empty. Fix: dedup via HashSet in the nq=0 path.

Plus doc updates for the wave-5 panic conditions (rustdoc on add / add_2d / search / search_with_mask), a deleted misleading comment in error.rs, and dropped the FAISS analogy from id_map.rs per the no-faiss-analogues memory entry.

Verified during the audits

  • agno's VectorDb ABC only requires name_exists / id_exists / content_hash_exists; an audit flag about missing content_id_exists was a false positive.

Out of scope — follow-ups worth filing

  • Full bm25_retrieval / bm25_retrieval_async for the haystack store.
  • agno Document.content_origin / size / embedding preservation (currently pinned as deliberate divergences from LanceDb).
  • Direct tests for the pack module (currently exercised transitively via kernel_correctness.rs).
  • Wider encode coverage (TQ+ identity-fallback branch on per-coord degeneracy).
  • TQPLUS_MIN_SAMPLES threshold could be made discoverable / logged on first-batch-too-small (currently silent).

Headline totals

  • 13 active bugs found and fixed across waves 1, 3, 4, 5, 6.
  • ~125 tests added across the 6 waves (regression + parity + field-completeness + state-sequence + numerical + concurrency + audit-driven coverage).

Test plan

  • Full Rust test suite locally: 135/135 pass.
  • Full Python test suite locally: 331/331 pass.
  • CI on Linux / macOS / Windows.

🤖 Generated with Claude Code

RyanCodrai and others added 3 commits May 30, 2026 11:12
The store's side-car schema cherry-picked `{id, content, meta}` on
write and reconstructed Documents with only those fields plus `score`.
`Document.blob` (binary payloads, e.g. images/audio) and
`Document.sparse_embedding` (hybrid-search sparse vectors) were
silently dropped on every write_documents call and never re-attached
on filter_documents / embedding_retrieval / storage / save_to_disk.

This is the same shape of bug as #81 — fields available at write time
were not propagated to the result-object constructor. The reference
`InMemoryDocumentStore` preserves both fields verbatim.

Fix:
- Store `blob` and `sparse_embedding` in the in-memory dict alongside
  `id` / `content` / `meta`.
- Pass them through `_reconstruct` so every retrieval path (filter,
  embedding_retrieval, storage, save/load) returns Documents with both
  fields populated when they were set on write.
- Add JSON serializers (`_serialize_doc_data` / `_deserialize_doc_data`)
  for save_to_disk / load_from_disk — `ByteStream.to_dict()` and
  `SparseEmbedding.to_dict()` give us JSON-safe forms.
- Bump docstore schema version 1 → 2; keep v1 loadable with blob /
  sparse_embedding defaulting to None so existing on-disk stores
  reload without manual migration.

Also in this PR:
- Tighten `_validate_filters` to match `InMemoryDocumentStore`
  (`document_store.py:504-509`): a bare `{"field": "x"}` with no
  `operator` or `conditions` is malformed; the reference rejects it
  and we now do too.
- Scope back the module docstring's drop-in parity claim — BM25
  retrieval (`bm25_retrieval` / `bm25_retrieval_async`) is not
  implemented, so the unqualified "matches the public surface of
  InMemoryDocumentStore" sentence was overpromising. Pipelines that
  need keyword search alongside vector search should wire in an
  explicit `InMemoryBM25Retriever` against a separate store.

Tests added (6):
- `test_blob_field_round_trips_through_filter_and_retrieval`
- `test_sparse_embedding_field_round_trips_through_filter_and_retrieval`
- `test_blob_and_sparse_embedding_survive_save_load_roundtrip`
- `test_documents_without_blob_or_sparse_embedding_round_trip_as_none`
- `test_load_accepts_v1_schema_with_no_blob_or_sparse_fields`
- `test_filter_documents_rejects_field_without_operator`

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The side-car schema cherry-picked {text, metadata, ref_doc_id} on
write, so `_reconstruct_node` could only build a bare
`TextNode(id_, text, metadata)` plus the SOURCE relationship. Every
other field LlamaIndex preserves through `SimpleVectorStore` was
silently dropped on every `query` / `get_nodes` / persist+load:

- relationships other than SOURCE (PREVIOUS / NEXT / PARENT / CHILD)
  — these power hierarchical RAG retrievers like AutoMergingRetriever
  and RecursiveRetriever
- `excluded_embed_metadata_keys` / `excluded_llm_metadata_keys`
  — silently changed downstream prompt content and embedding inputs
- `text_template`, `metadata_template`, `metadata_separator`
  — change `get_content()` output for callers using non-default
  templates
- `start_char_idx` / `end_char_idx` — citations / highlighting
- `mimetype` — defaulted to text/plain on retrieval

Fix: use llama-index's own `node_to_metadata_dict` /
`metadata_dict_to_node` (the framework's canonical full-fidelity
round-trip) and store the resulting `node_dict` alongside the existing
metadata + ref_doc_id (kept at top level for fast filter / doc-id
lookup without re-parsing _node_content per hit).

`_reconstruct_node` dispatches on schema shape: v2 entries carry
`node_dict` and round-trip via `metadata_dict_to_node`; v1 entries
(persisted before this fix) keep loading with their original narrow
fidelity rather than failing. Nodes schema bumps 1 → 2.

Also in this commit (filter / query parity against
`SimpleVectorStore` and `utils.build_metadata_filter_fn`):

- `FilterCondition.NOT` is implemented (was raising
  NotImplementedError): matches when none of the inner filters match,
  per the reference.
- `FilterOperator.TEXT_MATCH` is now case-SENSITIVE; previous impl
  silently lowercased both sides (semantic divergence — our results
  disagreed with SimpleVectorStore on mixed-case keys).
- `FilterOperator.TEXT_MATCH_INSENSITIVE` is now supported as the
  opt-in case-folding form.
- `FilterOperator.ALL` and `FilterOperator.ANY` are now supported
  (list-of-tags matching) instead of raising NotImplementedError.
- Both TEXT_MATCH variants raise `TypeError` on non-string operands,
  matching the reference (`utils.py:141-144, 148-151`).
- `query.mode != VectorStoreQueryMode.DEFAULT` now raises
  NotImplementedError instead of silently treating MMR / SVM / hybrid
  as DEFAULT — those modes need full-precision vectors which the
  quantized index discards.

Tests:
- Renamed `test_query_text_match_is_case_insensitive` →
  `test_query_text_match_is_case_sensitive` (semantics changed).
- Replaced the now-stale `test_query_unsupported_filter_operator_raises`
  with `test_query_text_match_raises_type_error_on_non_string_operands`.
- Added: TEXT_MATCH_INSENSITIVE, ALL, ANY, NOT, query.mode!=DEFAULT,
  v1 backward-compat load, full-node fidelity through query +
  get_nodes + persist (asserts every relationship type, excluded
  metadata keys, char_idx, mimetype, templates).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`_build_results` constructed returned Documents without `embedder=`,
so downstream code couldn't call `doc.embed()` / `doc.async_embed()`
on a retrieved hit — both methods fall back to `self.embedder`, which
was None, raising "No embedder provided" at runtime.

Matches `LanceDb._build_search_results`, which is the named drop-in
target in the module docstring and passes `embedder=self.embedder`
explicitly.

Tests:
- `test_search_results_carry_embedder` asserts the field is set on
  hits returned from `db.search`.
- `test_async_search_results_also_carry_embedder` asserts the same
  for `db.async_search`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@RyanCodrai RyanCodrai changed the title Round-trip Document.blob + sparse_embedding through Haystack store Field-fidelity audit fixes across Haystack, LlamaIndex, and Agno integrations May 30, 2026
RyanCodrai and others added 5 commits May 30, 2026 11:40
`_search_vector` constructed the returned Documents without `id=sid`,
so `similarity_search`, `similarity_search_with_score`, and
`similarity_search_by_vector` all returned `Document.id = None` —
breaking downstream LangChain code (retrievers, chains) that expects
to identify search hits by id. The fix is one kwarg in two places:

- The Document used in the callable-filter predicate (so user
  filters can match on `doc.id`, matching the InMemoryVectorStore
  convention).
- The Document attached to each (Document, score) result tuple.

Both call sites already had `sid` in scope; it just wasn't being
passed to the constructor. `get_by_ids` already used the right
pattern (`Document(id=sid, page_content=..., metadata=...)`).

Adds two regression tests in test_langchain.py covering:
- Each of the three similarity_search variants returns Documents
  with `.id` populated (both explicit ids and store-generated UUIDs).
- The callable filter receives a Document with `.id` set so
  predicates can filter on it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tests langchain-core's in-tree InMemoryVectorStore suite covers that we
didn't replicate, surfaced by an audit of the reference test file.
Each pins behaviour that's a drop-in regression risk:

- `test_async_methods_await_aembed_functions` — patches aembed_*
  with AsyncMock to verify our async paths actually await the async
  embedder methods, not silently fall back to sync (would block the
  event loop with a real async embedder).
- `test_add_documents_upsert_replaces_metadata` — re-adding a Document
  with the same id and new metadata must let the new metadata win.
- `test_add_documents_does_not_mutate_inputs` — caller-passed Document
  objects must not be mutated (no in-place .id assignment, metadata
  dict identity preserved).
- `test_add_documents_with_ids_is_idempotent` — re-running ingestion
  on an unchanged corpus must not accrete duplicates.
- `test_get_by_ids_empty_input_and_order_preserved` — empty input
  returns [] without error; non-empty preserves input id order so
  callers can zip with parallel arrays.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers operators we implement but don't exercise — each one is its own
branch in the filter dispatch (llama_index.py:382-433), and untested
branches are easy to silently swap or regress:

- `test_query_is_empty_treats_missing_key_as_match` — IS_EMPTY's
  trickiest semantic: missing key counts as a hit.
- `test_query_contains_operator_matches_list_membership` — CONTAINS
  (scalar-in-list), distinct from ALL / ANY.
- `test_query_with_lt_filter` / `test_query_with_lte_filter` — the
  entire LT / LTE half of the numeric range was untested.
- `test_query_with_gte_filter` — boundary case (GTE includes the
  threshold; GT excludes).
- `test_query_with_nin_filter` — complements existing IN coverage.
- `test_query_contradictive_same_key_and_returns_empty` — confirms
  AND is genuinely conjunctive over same-key duplicates rather than
  last-wins / OR.
- `test_query_returns_results_sorted_by_similarity` — top-1 of a
  self-query is the matching node; similarities are monotonically
  non-increasing. Only guard against a regression that returns
  hits in insertion order.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tests the haystack in-tree DocumentStoreBaseTests suite covers that we
didn't replicate. Each pins behaviour that's a drop-in regression risk:

- `test_filter_documents_rejects_malformed_filter_shapes` (parametrised
  over 3 shapes) — our outer `_validate_filters` catches one shape;
  haystack's `document_matches_filter` catches the others. Both must
  fail loudly, not silently match everything.
- `test_filter_documents_with_and_or_not_operators` — compound filter
  dicts (the standard production shape) work end-to-end through both
  `filter_documents` and `embedding_retrieval` (the latter as an
  allowlist).
- `test_embedding_retrieval_rejects_empty_query_embedding` — empty
  list surfaces as a clean ValueError, not a kernel panic.
- `test_filter_documents_equality_with_missing_meta_key` — `== None`
  matches docs where the field is absent.
- `test_delete_documents_on_lazy_empty_store_is_noop` — lazy store
  with no committed dim doesn't blow up on delete-unknown-id.
- `test_get_metadata_field_min_max_handles_float_meta_prefix_and_single_value`
  — three uncovered branches: float values, "meta.x" prefix, single-
  value collection (min == max).
- `test_two_stores_have_independent_state` — refactor canary against
  accidental class-level mutable state.
- `test_async_concurrent_embedding_retrievals_are_consistent` —
  10 concurrent async retrievals all produce the same top-k as a
  single sync call, proving the to_thread wrappers under load.
- `test_return_embedding_flag_is_inert_for_turbovec` — pins the
  deliberate divergence (quantization discards full-precision
  embeddings) so a caller doesn't quietly start relying on the flag.
- `test_shutdown_closes_async_executor` — owned executor rejects new
  tasks after shutdown.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tests agno's LanceDb unit-test suite covers that we didn't replicate.
Each pins behaviour that's a drop-in regression risk:

- `test_update_metadata_updates_all_docs_sharing_content_id` —
  content_id is many-to-one; update must hit every doc with that cid,
  not just the first match.
- `test_update_metadata_preserves_quantized_codes` — search results
  are bit-identical before and after a metadata update (no silent
  re-embed of stored content).
- `test_insert_reembeds_document_with_empty_list_embedding` — agno's
  async pipeline surfaces failed embeds as embedding=[] (distinct
  path from None); the integration must re-embed those.
- `test_insert_empty_document_list_is_noop` — empty input from
  upstream filtering doesn't change store state or raise.
- `test_search_with_empty_query_returns_empty` — empty / None query
  short-circuits to [], matching LanceDb. Without the short-circuit
  the embedder would hash "" and return arbitrary near-matches.
  Required a 2-line addition to `search` and `async_search`.
- `test_delete_by_metadata_handles_non_string_values` — bool / float
  values in metadata equality predicates work natively.
- `test_update_metadata_with_empty_dict_is_noop` — empty update dict
  preserves existing metadata.
- `test_search_does_not_dedupe_distinct_documents_with_identical_content`
  — pins our deliberate divergence from LanceDb (which dedupes by
  content string).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@RyanCodrai RyanCodrai changed the title Field-fidelity audit fixes across Haystack, LlamaIndex, and Agno integrations Audit-driven integration fidelity: bug fixes + 30 reference-parity tests May 30, 2026
RyanCodrai and others added 8 commits May 30, 2026 12:09
Follow-up round of audit surfaced four #81-family gaps where we
populate a field but no test asserts on it (silent-regression risk):

- `test_similarity_search_with_score_returns_descending_scores_and_self_match`
  — pins the actual semantics of the float in `(Document, float)`:
  self-match wins, scores are monotonically non-increasing. Existing
  tests only asserted `isinstance(score, float)`.
- `test_load_then_add_assigns_fresh_handles_without_collision` —
  `_next_u64` is persisted across dump/load; if it were dropped, new
  handles would collide with old ones and corrupt search results.
- `test_embeddings_property_returns_supplied_embedder` — pins the
  `embeddings` property override so a refactor dropping it doesn't
  silently break `similarity_search_with_relevance_scores` discovery.
- `test_aget_by_ids_preserves_order_and_returns_documents_with_id`
  — async mirror of the get_by_ids order test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Active bug fix: `add()` deduplicated against the existing store but
not against the input batch itself. If a single call contained two
nodes with the same node_id, the store ended up with N vectors in
the index but only the last node_id mapped back to one of them —
the earlier handles became orphans that `query` later resolved
through the duplicate node_id, returning the second node's payload
attached to the first node's vector.

Now raises `ValueError` on intra-batch duplicates rather than
silently corrupting state.

Plus three field-completeness tests:

- `test_add_raises_on_intra_batch_duplicate_node_id` — covers the
  fix above, and verifies the store is left in a clean state
  (no half-written index).
- `test_query_round_trips_image_node_subtype` — pins ImageNode
  fidelity (image_url, image_mimetype, etc.) through query.
  PR #83 verified full BaseNode fidelity but exclusively via
  TextNode; ImageNode / IndexNode coverage was missing.
- `test_query_round_trips_index_node_subtype` — same for IndexNode
  (used by composable indexes / routers / sub-question pipelines).
- `test_query_returned_node_always_has_none_embedding` — pins the
  contract that turbovec discards full-precision embeddings on add,
  so query / get_nodes results have `embedding=None`. Aligns with
  `get()` raising NotImplementedError.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Six gaps surfaced by an exhaustive return-field audit. Each pins a
value the current code returns but no existing test asserts — the
same shape as #81 but at finer granularity:

- `test_filter_documents_returns_documents_with_score_none` — score
  is only set on embedding_retrieval; filter_documents must return
  docs with `score=None`. Guards against a future cache leak that
  carries a stale score between read paths.
- `test_storage_property_documents_have_no_blob_sparse_or_score_when_unset`
  — extends the existing storage test to assert all four optional
  fields default to None when not set on write.
- `test_embedding_retrieval_preserves_content_and_meta` — no
  existing retrieval test asserts the actual `Document.content`
  string survives the round-trip; if `_reconstruct` ever dropped
  content, only the blob / sparse-embedding tests would notice
  indirectly.
- `test_to_dict_includes_all_init_params_and_type_key` — pins all
  four `init_parameters` keys + the outer `type` key (used by
  Haystack pipeline serialization to resolve the class) in a single
  test. Existing tests cover subsets but never the full shape.
- `test_save_load_preserves_similarity_function_and_return_embedding`
  — `load_from_disk` uses `.get()` with defaults for both non-bit_width
  params; if `save_to_disk` ever stopped writing them, load would
  silently fall back to defaults.
- `test_embedding_retrieval_all_results_have_finite_float_scores` —
  the existing top-k test only asserts `results[0].score is not None`;
  this extends to all results and adds `math.isfinite` so a kernel
  NaN-on-tail-hit regression doesn't slip through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five gaps from the round-2 audit. Three pin deliberate divergences
from LanceDb (the named drop-in target) so a future change to either
side becomes a deliberate decision rather than a silent surprise.
Two pin behavioural contracts that lacked assertions.

- `test_search_results_content_origin_divergence_from_lancedb` —
  LanceDb drops `Document.content_origin` on insert; turbovec mirrors
  that behaviour. Pin the divergence explicitly.
- `test_search_results_size_divergence_from_lancedb` — same pin for
  `Document.size`.
- `test_search_results_embedding_is_none_divergence_from_lancedb` —
  LanceDb sets `embedding=item["vector"]` on returned hits so callers
  can read the original vector. turbovec discards full-precision
  vectors after quantization, so we return `embedding=None`. Pin the
  divergence so callers don't get a runtime surprise.
- `test_reranker_output_documents_carry_reranking_score` — pins that
  `reranking_score` survives the post-rerank result list. Existing
  reranker tests only check ordering; this catches a refactor that
  drops fields the reranker mutated.
- `test_delete_by_metadata_returns_false_when_no_match` — pins the
  False branch of the bool return contract; other delete tests cover
  True but not False, so a regression always-returning-True wouldn't
  be detected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`add_with_ids_2d` was mutating the ID tables (`id_to_slot` /
`slot_to_id`) BEFORE delegating to `self.inner.add_2d`. If the inner
call returned `Err` — most commonly `DimMismatch` when a caller
passed the wrong dim on a committed-dim index — the ID tables
retained `n` ghost entries pointing at slots that don't exist in
the inner index. The next `search_with_allowlist` would index a
ghost id and read out-of-bounds packed-code bytes; the next
`remove` would corrupt the table further.

Fix: capture `base_slot = self.inner.len()` before the inner add,
call `self.inner.add_2d(vectors, dim)?` first, then mutate the ID
tables only on success.

`add_with_ids_2d_rolls_back_id_tables_on_inner_dim_mismatch` is the
regression test — without the fix it would either reject the second
correctly-dim'd add (ids 40/50 stuck as ghosts) or leak ghosts into
later searches.

Plus six audit-driven IdMapIndex tests:
- `add_with_ids_2d_rejects_non_multiple_buffer` — the
  `VectorBufferNotMultipleOfDim` error variant was never asserted.
- `add_with_ids_2d_rejects_zero_dim` — same variant, dim=0 sub-branch.
- `search_returns_descending_scores_aligned_with_ids` — the
  `(Vec<f32>, Vec<u64>)` parallel-array contract had no length /
  ordering / finiteness assertion at the IdMapIndex layer (same
  shape as #81: return value populated, never checked).
- `search_multi_query_results_are_row_major` — every existing test
  used a single query, leaving the documented row-major layout for
  multi-query unverified at this layer.
- `remove_keeps_swapped_id_addressable_in_both_tables` — the swap-
  and-pop consistency between `slot_to_id` and `id_to_slot` was
  only verified indirectly via self-query; a bug updating only one
  table could mask itself.
- `prepare_does_not_change_search_results` — `prepare()` had no
  observable post-condition assertion at the IdMapIndex layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three tests covering populated-but-unasserted state on TurboQuantIndex
that prior tests didn't reach:

- `search_single_query_sets_nq_to_one` — `SearchResults.nq` was only
  asserted in multi-query batches; a regression dropping nq to 0 in
  the single-query path wouldn't fail any existing test.
- `is_empty_tracks_len` — `is_empty()` was never called by any test;
  a wrong-polarity regression (e.g. `self.n_vectors > 0`) would
  compile and pass the suite.
- `add_2d_rejects_non_multiple_of_8_dim_on_lazy_index` — the
  `DimNotMultipleOf8` AddError variant is only reachable from a lazy
  index's first `add_2d` and was never asserted; a regression
  flipping the branch to Ok or DimMismatch wouldn't have failed.

Also extends `search_on_lazy_uncommitted_returns_empty` to assert
`res.nq == 0` (the lazy-empty path for nq).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Most existing tests exercise operations in isolation; bugs that live
in the transition between two operations (cache invalidation, slot
reuse, persistence capturing post-mutation state) can slip past those.
The audit that surfaced the LlamaIndex intra-batch `add()` corruption
pointed at the same risk in the Rust core.

Nine sequences pinned in a new test file:

High-priority (plausible bug surfaces):
- `second_add_after_search_lets_new_vectors_be_found` — blocked-cache
  invalidation correctness after a second add.
- `add_swap_remove_add_then_self_query_finds_all_three_phases` —
  mixed shrink+grow keeps packed_codes / n_vectors consistent.
- `swap_remove_after_load_produces_correct_search` — the
  load-then-mutate-then-search path through OnceLock state.
- `swap_remove_then_round_trip_matches_in_memory_search` —
  persistence captures the post-removal state, not stale tail bytes.
- `id_map_re_added_id_returns_new_vector_not_old` — re-add after
  remove returns the NEW vector; existing test only checked `contains`.
- `prepare_then_add_invalidates_blocked_cache` — prepare-warmed
  cache must still invalidate on subsequent add.
- `id_map_remove_last_then_add_keeps_slot_tables_consistent` —
  the no-swap branch of `remove(last)` leaves no stale tail entry
  in `slot_to_id` for the next add.

Defensive:
- `add_after_load_extends_index` — loaded index can be extended.
- `prepare_then_swap_remove_invalidates_cache` — the prepare-then-
  delete-then-search path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four tests pinning Rust internal-module branches the audit found
under-covered:

- `tv_truncated_payload_errors_cleanly` — write a v3 .tv file, then
  truncate it. Loader must surface UnexpectedEof, not panic. The
  `read_exact` error path was previously unverified.
- `tv_unsupported_version_errors_with_useful_message` — hand-construct
  a .tv with version=99. Loader must surface InvalidData with the
  "unsupported format version" message; the catch-all arm in
  `read_core_versioned` was unreached.
- `tv_v3_invalid_n_calib_errors_cleanly` — hand-construct a v3 file
  with n_calib=7 (neither 0 nor dim). Loader must reject with
  InvalidData per the io.rs contract; this branch was unreached.
- `produces_expected_shape_for_bit_width_three` — extends the
  parametrised shape test (which only covered 2 and 4) to bit_width=3,
  whose pack layout is its own branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@RyanCodrai RyanCodrai changed the title Audit-driven integration fidelity: bug fixes + 30 reference-parity tests Audit-driven integration + Rust core fidelity work May 30, 2026
RyanCodrai and others added 6 commits May 30, 2026 12:59
The encode and search pipelines silently corrupted state on degenerate
inputs:

- NaN coord: `simd_norm` returns NaN, the `n_val > 1e-10` branch goes
  to `inv = 0.0`, but `0.0 * NaN = NaN` propagates through rotation
  and poisons `vec_scales[slot] = NaN`. The kernel's `s > heap_min`
  comparison is false for NaN, so the slot exists in `len()` but is
  **silently unreachable** through search.
- +/- Inf coord: same NaN-poisoning path via `1.0 / Inf = 0.0`,
  followed by `0.0 * Inf = NaN`.
- Magnitude >= ~2e19: the f32 sum-of-squares in `simd_norm` overflows
  to `+Inf`, `scale[i] = Inf` gets stored, and the slot **silently
  wins top-k against every query** because `Inf * anything = Inf`.
- NaN/Inf query value: kernel produces NaN scores, heap silently
  drops them, returned indices are arbitrary with NEG_INFINITY scores.

Fix: validate at the entry of `add`, `add_2d`, `search`, and
`search_with_mask` that every value is finite AND `|value| < 1e16`
(conservative bound that prevents f32 overflow for dims up to 2^16).
The validation cost is one pass over the input — cheaper than making
every downstream stage NaN/Inf-safe.

API surfaces:
- `TurboQuantIndex::add_2d` and `IdMapIndex::add_with_ids_2d` return
  a new `AddError::InvalidInputValue { vector_index, coord_index,
  value }` variant. Callers handling untrusted input should prefer
  these `_2d` paths.
- `TurboQuantIndex::add` and the `search` family panic with a clear
  message — consistent with their existing "caller violated
  precondition" panic style.

`AddError` drops its `Eq` derive (still `PartialEq`) because the new
variant carries an `f32`, which is not `Eq` (NaN != NaN).

15 tests across the entry points (NaN, +Inf, -Inf, huge magnitude,
boundary value just below the threshold, multi-vector batch reporting,
lazy-uncommitted skip-validation path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four binding-layer bugs surfacing as Rust panics (PanicException to
Python) instead of typed exceptions:

- Non-contiguous numpy arrays passed to `add` / `search` /
  `add_with_ids` panicked via `as_slice().expect("vectors must be
  contiguous")`. Now raise `ValueError("vectors must be C-contiguous;
  call np.ascontiguousarray(...) first")` via `ok_or_else`.
- Query with wrong `ncols` propagated to the core's
  `assert_eq!(queries.len(), nq * dim)` and surfaced as PanicException.
  The binding now validates `ncols == self.dim_opt()` before calling
  inner and raises `ValueError("query dim X does not match index
  dim Y")`.
- `swap_remove(idx)` with idx >= len panicked via core's
  `assert!(idx < n_vectors)`. Binding now raises `IndexError` with a
  range-style message.
- `IdMapIndex.search` returned shape `(0, k)` for empty queries while
  `TurboQuantIndex.search` returned `(0, min(k, n_vectors, n_allowed))`.
  Cross-class inconsistency — fixed by computing the effective_k
  identically for the nq=0 path.

Plus 11 tests covering the new behaviour and the Rust core's
`InvalidInputValue` error surfacing through the binding as
`ValueError`:

- `test_add_noncontiguous_vectors_raises_value_error`
- `test_search_query_dim_mismatch_raises_value_error`
- `test_search_noncontiguous_query_raises_value_error`
- `test_swap_remove_out_of_bounds_raises_index_error`
- `test_add_rejects_nan_with_value_error`
- `test_add_rejects_huge_magnitude_with_value_error`
- `test_search_with_nan_query_raises`
- `test_search_empty_queries_returns_consistent_shape_across_index_types`
- `test_search_query_dim_mismatch_raises_value_error` (IdMap)
- `test_add_with_ids_noncontiguous_vectors_raises_value_error`
- `test_add_with_ids_rejects_nan_with_value_error`

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three test gaps surfaced by the concurrency audit. None of these
exercised actual bugs (Rust's borrow checker rules out the race
classes the audit considered), but they pin contracts the existing
`concurrent_search.rs` doesn't:

- `concurrent_search_after_load_is_safe` — a `TurboQuantIndex::load`d
  index starts with empty OnceLock caches; the race window mirrors
  the freshly-built one. The doc explicitly mentions `load` as a
  `prepare()`-skip target, but no test covered concurrent search
  immediately after `load`.
- `id_map_concurrent_search_is_deterministic_across_threads` —
  `IdMapIndex::search` had zero concurrent-search coverage.
- `concurrent_prepare_races_with_search_safely` — multiple `prepare`
  threads interleaved with multiple `search` threads on a fresh
  (unprepared) index. Pins that OnceLock's "closure runs at most
  once" guarantee holds end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two audit-found bugs in the TQ+ calibration state machine that silently
degrade index correctness without any error signal.

## Bug F1: v2-loaded index + add silently mis-encoded new vectors

A v2 (pre-TQ+) file loads with empty `tqplus_shift`. On the next
`add()`, the `existing` slot was set to `None` (the lazy-first-add
signal), `encode()` fit fresh calibration on the new batch and baked
it into the packed codes — but the `n_vectors != 0` else branch only
extended `packed_codes` / `scales`, never persisting the fitted
shift / scale_tq. The new vectors ended up encoded with calibration
that search later treated as identity, silently corrupting scores.

Fix: in `TurboQuantIndex::from_parts`, populate explicit identity TQ+
vectors when `tqplus_shift` is empty and `n_vectors > 0`. The loaded
state then matches the on-the-wire reality (those vectors were encoded
without calibration), and the next `add` sees `existing = Some(identity)`
and encodes new vectors against identity too — keeping the entire
index in a single coordinate system.

## Bug F2: Empty first add froze identity calibration forever

`add(&[])` with n=0 hit `n < TQPLUS_MIN_SAMPLES` in `encode`, returned
identity `(zeros, ones)` of length dim, and the `n_vectors == 0` branch
in `add` copied that identity into `self.tqplus_shift` /
`self.tqplus_scale`. Now `tqplus_shift.is_empty()` was false, so every
subsequent add — even a million-vector batch with rich distribution —
saw `existing = Some(identity)` and silently skipped fitting fresh
calibration. The user lost TQ+ entirely with no warning.

Fix: short-circuit `add` to a true no-op when n=0, before touching any
state. Empty input now means "no change", end-to-end.

## Regression tests

- `empty_first_add_does_not_freeze_identity_calibration` — writes the
  index after empty + 1500-vec add, parses the persisted TQ+ trailer,
  asserts at least one shift / scale value is non-trivial (pre-fix
  the trailer was exactly identity).
- `v2_loaded_index_populates_identity_calibration` — hand-constructs a
  v2 .tv file, loads it, adds 1500 vectors, writes back, asserts the
  trailer is exactly identity (pre-fix the fitted-then-discarded
  calibration would have leaked through silently).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug in the wave-5 fix for IdMap empty-query shape consistency: the
`effective_k` computation for `nq == 0` used `allow_slice.len()` (raw
length), but the kernel internally dedups the allowlist via a packed
bool mask. So `allowlist=[1, 1, 1]` would return shape `(0, 3)` for
empty queries and `(N, 1)` for non-empty queries — still divergent,
just in a different shape than before wave 5.

Fix: dedup via HashSet in the nq=0 path to match the kernel's
mask-based dedup for nq>0.

Test: `test_search_empty_queries_dedups_allowlist_for_effective_k`
pins both shapes (empty and real query) and asserts they share the
trailing `effective_k` dim.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wave-5 input validation (NaN/Inf/overflow rejection) added new
panic conditions to `add` / `add_2d` / `search` / `search_with_mask`,
but the rustdoc comments still listed only the pre-wave-5 panic
triggers. Users reading the docs would expect their indexes to handle
non-finite inputs gracefully when they actually panic loudly.

Doc updates:
- `TurboQuantIndex::add` — now lists all three panic conditions
  (dim-not-set, non-multiple buffer, invalid input value) in a # Panics
  section.
- `TurboQuantIndex::add_2d` — adds `InvalidInputValue` to the typed
  return list and documents that bad buffer length still panics.
- `TurboQuantIndex::search` — adds a # Panics section covering both
  buffer-length and invalid-query-value cases.
- `TurboQuantIndex::search_with_mask` — same plus the mask-length
  panic.

Plus two small cleanups:
- Delete a misleading internal comment in `error.rs` that referenced a
  manual `PartialEq` impl which doesn't exist (the derive handles it
  fine; the only change wave 5 made was dropping `Eq`).
- Drop the `IndexIDMap2` FAISS analogy from `id_map.rs` per the
  positioning memory entry: turbovec docs should describe features
  on their own terms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@RyanCodrai RyanCodrai changed the title Audit-driven integration + Rust core fidelity work Audit-driven integration + Rust core fidelity work (6 waves) May 30, 2026
RyanCodrai and others added 2 commits May 30, 2026 13:43
`search` on `x86_64` runtime-dispatches to AVX-512 BW or AVX2 inside
an `unsafe { if/else if }` block with no `else` branch. If neither
`is_x86_feature_detected!("avx512bw")` nor `is_x86_feature_detected!("avx2")`
returned true — pre-Haswell x86 (rare in 2026), or a VM / emulator
that doesn't expose AVX2 to userspace — neither kernel ran,
`heap_sizes` stayed at `[0; batch_nq]`, and `search` returned **empty
top-k results** for every query with no error signal.

The `cfg(not(any(target_arch = "aarch64", target_arch = "x86_64")))`
scalar fallback at the bottom of the function is gated on the
architecture, not the runtime feature flag, so it didn't cover this
case.

Fix: extract the per-query scalar scoring (previously inlined in the
catch-all `cfg` block) into a free helper `score_query_into_heap`,
then add an `else` branch inside the x86_64 `unsafe { }` that calls
it once per query in the current batch. The catch-all scalar block
is also refactored to call the same helper, eliminating the prior
duplication.

The workspace `.cargo/config.toml` sets `target-cpu=x86-64-v3` which
requires AVX2 at the compile level, so this affects downstream Rust
users building turbovec without that config, not turbovec's own
release wheels. But the silent-empty-results failure mode is the
worst kind, so it gets a fix even though the reproducer is rare.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`from_parts` is `pub(crate)` and the integrity gateway for both
`TurboQuantIndex::load` and `IdMapIndex::load`. The wave-6 fix added
identity-population logic that relies on length invariants holding
(`tqplus_shift.len() == tqplus_scale.len()`, packed/scales lengths
matching `n_vectors * dim * bit_width / 8` and `n_vectors`
respectively), but neither `from_parts` nor any test pinned them.

Today the read-side validation in `io::load_*` upholds these
invariants — but a future caller bypassing the IO layer (or a
refactor that drops a read-layer check) could silently construct a
malformed `TurboQuantIndex`. `search` and `swap_remove` index
`packed_codes`, `scales`, and `tqplus_*` assuming the invariants
hold; out-of-bounds reads / underflows would manifest as silent
garbage scores rather than a clear error.

Fix: assert structural invariants at `from_parts` entry, before the
identity-population logic that depends on them:

- `tqplus_shift.len() == tqplus_scale.len()` always.
- For an eager (`dim = Some(d)`) state:
  - `packed_codes.len() == n_vectors * d * bit_width / 8`
  - `scales.len() == n_vectors`
  - non-empty TQ+ must have `tqplus_shift.len() == d`
- For a lazy (`dim = None`) state: `n_vectors == 0` and every storage
  field empty.

Plus 7 unit tests in a new `from_parts_tests` module covering each
panic-on-violation case and two acceptance cases (lazy uncommitted,
eager with consistent lengths).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@RyanCodrai RyanCodrai changed the title Audit-driven integration + Rust core fidelity work (6 waves) Audit-driven integration + Rust core fidelity work (7 waves) May 30, 2026
@RyanCodrai RyanCodrai merged commit 73866a3 into main May 30, 2026
6 checks passed
@RyanCodrai RyanCodrai deleted the fix/haystack-blob-sparse-roundtrip branch May 30, 2026 12:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant