Security audit fixes: untrusted-load hardening, binding panics, integration data-integrity, x86 scalar fallback by RyanCodrai · Pull Request #108 · RyanCodrai/turbovec

RyanCodrai · 2026-06-10T16:19:32Z

Summary

Two adversarial security-audit passes over the core crate, the PyO3 binding, and all four framework integrations, fixing every confirmed finding. Each fix is its own commit (split into PRs later if you prefer) and ships with regression tests. No CI/infra changes are bundled.

Closes #104
Closes #105
Closes #106

Fixes (first pass)

	Severity	Area	What
V1+V2	Critical/High	`io.rs` load	Validate `bit_width`/`dim` and use `checked_mul` + a `take()`-capped reader before allocating. Closes the unchecked-multiply overflow, the header-driven multi-GB allocation DoS, a divide-by-zero panic, and a silent-wrong-results path (`bit_width` 5–8 passed the old length assert). Broader than #105 as filed.
V3	High	Python `search`	NaN/Inf/overflow query coords raised an uncatchable `PanicException`; now a typed `ValueError`, mirroring `add`.
V4	High	agno	Duplicate derived `doc_id` orphaned a vector (counted + searchable but undeletable; leaked on upsert). `_str_to_u64` is now one-to-many — keep-all, matching agno's upstream (LanceDb appends). NB: this is the opposite of the issue's suggested keep-last patch, which would have made agno diverge from its own reference store.
refactor	—	integrations	Extracted a shared, unit-tested `DuplicatePolicy`/`resolve_duplicates`; langchain (KEEP_LAST) and llama_index (REJECT) adopt it. Behaviour-preserving.
V5	Medium	`search.rs`	x86 scalar fallback read the perm0-interleaved code layout as if sequential → silently-wrong top-k on pre-AVX2 x86. De-interleave added. Verified end-to-end on real Sapphire Rapids hardware (forced scalar path matches the AVX-512/AVX2 kernels for 2/3/4-bit).
V6	Low	binding	The last unguarded `from_shape_vec().unwrap()` sites now map to a catchable `RuntimeError`.

Fixes (second pass)

	Severity	Area	What
V7	High	core	A 0-column array to a lazy `add()` slipped past the `dim%8` check (`0%8==0`), divided by zero (uncatchable panic), and wedged the index at `dim=0`. Now a `ValueError`; index stays uncommitted.
V8	Medium	agno	`delete_by_name`/`_content_id`/`_metadata` delegated to `delete_by_id`, which removes every handle under a shared `doc_id` — so deleting by one attribute also deleted content-identical id-twins (and left a stale name entry). Each now removes only predicate-matching handles, matching LanceDb.
V9	Medium	core/`io.rs`	`search` lazily builds a `dim×dim` f64 rotation matrix whose size isn't bounded by the file length, so a tiny crafted `.tv`/`.tvim` with a huge `dim` could OOM the process on first search. Cap `dim` at `MAX_DIM = 65536` at construct/add/load.
V10	Medium	integrations	A persisted JSON side-car out of sync with its `.tvim` index raised an opaque `KeyError` mid-query in langchain/haystack/llama_index. A shared `check_persisted_handles` (using only `len`/`contains`) now raises a clean `ValueError` at load. agno was already immune.
cleanup	—	`pack.rs`	Removed dead, untested `repack_3bit` (zero callers; 3-bit goes through `repack`).

Verification

Full Rust suite green; full Python suite green (378 tests), including the new regression tests for every finding.
V5 verified on a real x86 host (c3-standard-8, AVX-512BW), not just by cross-compile.
Also fixed a pre-existing, unrelated test failure: metadata_separator is dropped by LlamaIndex's own metadata_dict_to_node on reconstruction (verified against llama-index-core), so the assertion was testing framework behaviour, not ours.

Not in this PR

TQ+ calibration: silent identity-freeze on small first add; fit from a cumulative warm-up sample #107 (TQ+ calibration silent identity-freeze) — filed as a separate issue; the proper fix is a dual-mode warm-up redesign, out of scope here.
A few fixes have no corresponding issue (V6, V7, V8, V9, V10) — they're audit findings; happy to file issues for traceability if you want them.

The load path (io::load -> read_header_codes_scales -> from_parts) bypassed the constructor invariants, so an untrusted .tv/.tvim file could: - carry a bit_width outside 2..=4, causing a divide-by-zero panic in pack::repack (0 or >8) or silently-wrong scores (5..8 passes the length assert); - carry a dim that isn't a multiple of 8, diverging the read/validate size formulas into a panic; - overflow the unchecked (dim/8)*bit_width*n_vectors multiply (undersized alloc -> OOB on 32-bit targets); - declare billions of vectors and drive a multi-GB allocation from a tiny file (vec![0u8; n] before read). Validate bit_width and dim ranges before allocating, compute all sizes with checked_mul, and read payloads via a take()-capped reader that grows only to the bytes actually present. Add security regression tests covering each crafted-file case. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

TurboQuantIndex.search / IdMapIndex.search panicked inside the core on NaN/Inf/overflow-magnitude query coordinates, surfacing to Python as an uncatchable PanicException (a BaseException, missed by 'except Exception'). add() already maps the same condition to ValueError. Expose first_invalid_coord from the core and pre-validate queries in the binding (mirroring the existing dim-mismatch check), so search now raises ValueError consistently. Add regression tests for both index types. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

agno's reference store (LanceDb) is append-only with no unique constraint: two documents that derive the same doc_id (a repeated explicit doc.id, or identical content with no id) are BOTH stored. Our wrapper's one-to-one _str_to_u64 map could not represent that — the second write overwrote the first, orphaning the earlier vector: present in the index count and search results but unreachable by id, so undeletable, and leaked on every upsert. _deserialize re-orphaned them on load via a last-wins dict comprehension. Make _str_to_u64 a one-to-many map (doc_id -> set of handles), matching LanceDb's keep-all default and the behaviour the _remove_handle docstring already assumed. delete_by_id now removes every handle under the id. This deliberately does NOT dedup keep-last (the issue's suggested fix and langchain's behaviour): that would make agno diverge from its upstream, which keeps duplicates. The pinned identical-content/distinct-content_id test stays green. Add keep-both and persistence-roundtrip regression tests. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Extract the in-batch duplicate-resolution logic the wrappers each re-implemented into a pure, unit-tested _dedup.py exposing DuplicatePolicy {KEEP_LAST, KEEP_FIRST, REJECT, KEEP_ALL} and resolve_duplicates(keys, policy) -> kept indices. langchain adopts KEEP_LAST and llama_index adopts REJECT (both behaviour-preserving; their original error wording is retained). agno is KEEP_ALL (its keep-all map fix stands). Haystack keeps its own logic: its DuplicatePolicy resolution is stateful against the existing store with deferred issue-#89 removal and does not reduce to the pure in-batch function; the enum documents the mapping for reference. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

On x86_64 the blocked code buffer is perm0-interleaved hi/lo nibbles (the layout the AVX2/AVX-512 kernels consume). The scalar fallback — score_query_into_heap, taken on pre-AVX2 x86 or VMs that don't expose AVX2 at runtime — read it as the sequential layout used on other targets, decoding bytes for the wrong vectors and returning silently-wrong top-k results (indices stayed in bounds, so no crash). Add pack::deinterleave_x86_code_byte to reconstruct each vector's sequential code byte from the interleaved layout, and use it in the scalar fallback under cfg(target_arch = "x86_64"); other targets read directly as before. A round-trip unit test packs a block exactly as the x86 packer does and verifies recovery on every architecture, so the fix is verified on ARM even though the x86 search path can't run here. NOTE: the de-interleave math is verified by unit test on ARM and the x86 branch compiles (checked via --target x86_64-apple-darwin), but the full x86 scalar search path has not been executed on x86 hardware in this change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The four Array2::from_shape_vec(...).unwrap() calls that reassemble search results were the only remaining unguarded panic sites in the binding. They are provably unreachable today (the dimensions come from the core's own output), but a future change to result shaping would surface as an uncatchable PanicException. Map the ShapeError to a catchable RuntimeError via a shared shape_err helper. No behaviour change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Add a cfg(test)-only FORCE_SCALAR_FALLBACK switch on the x86 dispatch and a cfg(target_arch=x86_64) test that runs search() once on the SIMD kernels and once forced onto score_query_into_heap, asserting identical top-k for 2/3/4-bit. score_query_into_heap is not compiled on aarch64, so this is the only end-to-end coverage of its scoring path + the perm0 de-interleave. Verified passing on a GCP c3-standard-8 (Sapphire Rapids, AVX-512BW): pack::tests::deinterleave_x86_recovers_sequential_code_bytes ... ok x86_scalar_fallback_tests::scalar_fallback_matches_simd_topk ... ok Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

A 0-column array to TurboQuantIndex.add on a lazy index slipped past the dim%8 check (0%8==0), committed dim=0, then panicked in add() on vectors.len()/dim (divide by zero) — surfacing to Python as an uncatchable PanicException and leaving the index permanently wedged at dim=0. add_2d now rejects dim==0, mirroring IdMapIndex::add_with_ids_2d, so Python gets a ValueError and the lazy index stays uncommitted. Found by the second-pass audit. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…on (V8) The derived doc_id excludes name/content_id/metadata, so distinct documents that share content collide on one doc_id. delete_by_name/_by_content_id/ _by_metadata delegated to delete_by_id, which (post-#104, correctly) removes every handle under a doc_id — so deleting by one attribute also deleted an id-twin with a different name/content_id/metadata. delete_by_name also left a stale _name_to_ids entry because _remove_handle's cleanup matched on doc_id alone. Remove the handles matching the predicate directly via _remove_handle (LanceDb deletes predicate-matching rows), and match (id, name) in the name cleanup. delete_by_id (delete by explicit id) keeps its remove-all semantics. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

search/prepare lazily build a dim x dim f64 rotation matrix whose size scales with dim^2 and is NOT bounded by any loaded file's length. A crafted .tv/.tvim declaring a huge dim passes the dim%8 check, then the first search drives a multi-GB allocation from a tiny file (resource-exhaustion DoS) — the gap V2's file-length bound can't catch (the matrix isn't read from the file). Reject dim > MAX_DIM at construction, first add, and load. 65536 is far above any real embedding dim (~4096 max in common use); it bounds the catastrophic cases (e.g. dim=500000 -> ~2TB) though dim=65536 itself still permits ~34GB, so it's a sanity bound rather than a hard memory bound. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

repack_3bit had zero callers anywhere in the crate (3-bit support goes through the main repack, which handles bit_width 2/3/4 uniformly). Dead, untested code is latent-bug surface — remove it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

langchain/haystack/llama_index persist a binary .tvim index plus a JSON side-car of handle->payload maps. If the two drift out of sync (partial copy, stale backup, hand-edited/tampered side-car), an index-returned handle wouldn't resolve and surfaced as an opaque KeyError deep inside a query. agno was already immune (tolerant .get() + a dimensions check). Add a shared _persist.check_persisted_handles: using only IdMapIndex's len + contains, it verifies the side-car's handle set and the index are a bijection (equal size, every side-car handle present, no duplicates) and raises a clean ValueError at load otherwise. Wire it into all three loaders. Add a desync regression test per wrapper. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

LlamaIndex's metadata_dict_to_node does not restore metadata_separator: node_to_metadata_dict serializes it (it is present in _node_content), but reconstruction drops it back to the framework default. So no store built on the framework serializer preserves it, and the assertion only passed historically because the old default equalled the test's value. The newer llama-index-core default is '\n', exposing the drop. Remove the input kwarg and the assertion (verified directly against llama-index-core); the field is outside our control. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Audited docs/ against the merged behavior changes: - api.md: document that search() raises ValueError on non-finite/oversized query coords; dim must be a positive multiple of 8 and <= 65536 (MAX_DIM); zero-width add raises; load now validates the header before allocating. - agno.md: duplicate derived doc_id is now keep-all (both kept and deletable, matching LanceDb) — previous text implied last-write-wins; clarify delete_by_name/_content_id/_metadata target only matching docs. - agno/langchain/haystack/llama_index: loading a side-car out of sync with its .tvim now raises ValueError at load instead of a later KeyError. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* Release: turbovec 0.8.0 (Python) + 0.9.0 (Rust crate) Security-audit release (#108): untrusted-load hardening, binding panic fixes, integration data-integrity fixes, and the x86 scalar-fallback correctness fix. Resolves #104, #105, #106. Minor bump on both surfaces because a few inputs that previously panicked or were silently accepted now return typed errors. See CHANGELOG.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * docs: correct stale benchmark figures in README Fact-checked the README prose against benchmarks/results/. Several figures had drifted (mostly from before TQ+ landed): - ARM speed 12-20% -> 10-19% (actual range 10.3-19.4%) - OpenAI R@1 +0.4-3.4 pts -> +0.2-1.9 pts (no config reaches 3.4) - GloVe R@1 +0.3/-1.2 -> +0.9/tied (TQ+ closed the 2-bit gap) - x86 2-bit 'within ~1% ST, 2-4% MT' -> trails 3-8% on both ST and MT - softened 'matches the Shannon lower bound' -> 'near-optimal' (the how-it-works section already states within 2.7x of the bound) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * docs: fix benchmark + baseline-note drift in README - intro: 'matches the Shannon lower bound' -> 'near-optimal distortion'; trim to 'no separate training phase' - speed bullet/x86 prose: characterize x86 2-bit honestly (behind, most visibly d=1536 ST ~8%) instead of 'match-or-beat'/'3-8%' - recall: OpenAI converge to 1.0 by k=8 (>=0.997 at k=4) - baselines note: drop stale 'visible gap on GloVe' (TQ+ closed it; GloVe is now level at 2-bit, ahead at 4-bit) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * docs: fix drift vs the security release (#108) Audited docs/ against the merged behavior changes: - api.md: document that search() raises ValueError on non-finite/oversized query coords; dim must be a positive multiple of 8 and <= 65536 (MAX_DIM); zero-width add raises; load now validates the header before allocating. - agno.md: duplicate derived doc_id is now keep-all (both kept and deletable, matching LanceDb) — previous text implied last-write-wins; clarify delete_by_name/_content_id/_metadata target only matching docs. - agno/langchain/haystack/llama_index: loading a side-car out of sync with its .tvim now raises ValueError at load instead of a later KeyError. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * Mark AddError/ConstructError #[non_exhaustive] Adding the DimTooLarge variant this release is already breaking for downstream exhaustive matches. Mark both public error enums #[non_exhaustive] now so future variant additions stop being breaking changes — this release is the one-time free moment to do it. The Python binding only uses Display (e.to_string()), so it is unaffected. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * Require a 64-bit target via compile_error guard turbovec's usize size/offset arithmetic in encode/pack/search assumes a 64-bit pointer width; on 32-bit/wasm those products can overflow and index out of bounds. The untrusted-load path is already gated by checked_mul in io.rs, so this is not a vulnerability via file input — but a developer adding very large data on a 32-bit target could still overflow. Refuse to compile on non-64-bit targets rather than ship a silently-unsafe build. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * Add SECURITY.md (private vulnerability disclosure policy) Route security reports through GitHub private vulnerability reporting / Security Advisories instead of public issues, with reporting steps, what to include, expectations, supported-version policy, and scope. #105 came in as a public issue; this gives finders a private channel. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * SECURITY.md: drop the direct-contact fallback Private vulnerability reporting is the single channel; no alternate contact. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

…ure/ts Sync the Node-bindings branch onto the latest upstream before proposing it upstream. Upstream's RyanCodrai#108 security audit independently hardened the same core files this branch had touched, and RyanCodrai#109 cut crate 0.9.0 / Python 0.8.0. Reconciliation: - turbovec/src/io.rs, lib.rs: take upstream's hardening wholesale; drop our now -redundant copies. Removed our duplicate `pub use io::MAX_DIM` (upstream defines `pub const MAX_DIM` at the crate root); upstream already exposes `MAX_INPUT_MAGNITUDE` as `pub` for the bindings. Net core delta from this branch is now ~zero — the package is essentially additive. - turbovec-node/src/error.rs + ts/errors.ts + docs/api.md: handle upstream's new `DimTooLarge` variants (both enums are now `#[non_exhaustive]`) via a new `DIM_TOO_LARGE` JS error code, plus a wildcard arm so future core variants don't break the binding build. - turbovec/tests/io_versioning.rs: the lazy-header regression test now anchors on upstream's wording (clean InvalidData on a dim=0/n>0 header is unchanged). - CHANGELOG/docs/api.md: keep our Node section under Unreleased alongside the 0.9.0/0.8.0 release entry and upstream's richer Python tables. Verified against the synced core: turbovec-node clippy clean; 193 vitest pass; cargo test -p turbovec green (incl. io_versioning).

RyanCodrai and others added 13 commits June 10, 2026 14:34

chore: remove dead repack_3bit

073dffe

repack_3bit had zero callers anywhere in the crate (3-bit support goes through the main repack, which handles bit_width 2/3/4 uniformly). Dead, untested code is latent-bug surface — remove it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

RyanCodrai merged commit c9a305b into main Jun 10, 2026
6 checks passed

RyanCodrai deleted the security/audit-fixes branch June 10, 2026 16:26

RyanCodrai mentioned this pull request Jun 10, 2026

Release: turbovec 0.9.0 (Rust crate) + 0.8.0 (Python package) #109

Merged

This was referenced Jun 12, 2026

Add public Read/Write I/O API + pub TurboQuantIndex::from_parts (issue #70) ZeyadMohamad/turbovec#1

Open

Public Read/Write I/O API + pub TurboQuantIndex::from_parts #70

Open

a-tokyo mentioned this pull request Jun 13, 2026

Add Node.js/TypeScript bindings (turbovec-node) a-tokyo/turbovec#1

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Security audit fixes: untrusted-load hardening, binding panics, integration data-integrity, x86 scalar fallback#108

Security audit fixes: untrusted-load hardening, binding panics, integration data-integrity, x86 scalar fallback#108
RyanCodrai merged 13 commits into
mainfrom
security/audit-fixes

RyanCodrai commented Jun 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

RyanCodrai commented Jun 10, 2026

Summary

Fixes (first pass)

Fixes (second pass)

Verification

Not in this PR

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant