Security audit fixes: untrusted-load hardening, binding panics, integration data-integrity, x86 scalar fallback#108
Merged
Merged
Conversation
The load path (io::load -> read_header_codes_scales -> from_parts) bypassed
the constructor invariants, so an untrusted .tv/.tvim file could:
- carry a bit_width outside 2..=4, causing a divide-by-zero panic in
pack::repack (0 or >8) or silently-wrong scores (5..8 passes the
length assert);
- carry a dim that isn't a multiple of 8, diverging the read/validate
size formulas into a panic;
- overflow the unchecked (dim/8)*bit_width*n_vectors multiply (undersized
alloc -> OOB on 32-bit targets);
- declare billions of vectors and drive a multi-GB allocation from a
tiny file (vec![0u8; n] before read).
Validate bit_width and dim ranges before allocating, compute all sizes
with checked_mul, and read payloads via a take()-capped reader that grows
only to the bytes actually present. Add security regression tests covering
each crafted-file case.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
TurboQuantIndex.search / IdMapIndex.search panicked inside the core on NaN/Inf/overflow-magnitude query coordinates, surfacing to Python as an uncatchable PanicException (a BaseException, missed by 'except Exception'). add() already maps the same condition to ValueError. Expose first_invalid_coord from the core and pre-validate queries in the binding (mirroring the existing dim-mismatch check), so search now raises ValueError consistently. Add regression tests for both index types. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
agno's reference store (LanceDb) is append-only with no unique constraint: two documents that derive the same doc_id (a repeated explicit doc.id, or identical content with no id) are BOTH stored. Our wrapper's one-to-one _str_to_u64 map could not represent that — the second write overwrote the first, orphaning the earlier vector: present in the index count and search results but unreachable by id, so undeletable, and leaked on every upsert. _deserialize re-orphaned them on load via a last-wins dict comprehension. Make _str_to_u64 a one-to-many map (doc_id -> set of handles), matching LanceDb's keep-all default and the behaviour the _remove_handle docstring already assumed. delete_by_id now removes every handle under the id. This deliberately does NOT dedup keep-last (the issue's suggested fix and langchain's behaviour): that would make agno diverge from its upstream, which keeps duplicates. The pinned identical-content/distinct-content_id test stays green. Add keep-both and persistence-roundtrip regression tests. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Extract the in-batch duplicate-resolution logic the wrappers each
re-implemented into a pure, unit-tested _dedup.py exposing DuplicatePolicy
{KEEP_LAST, KEEP_FIRST, REJECT, KEEP_ALL} and resolve_duplicates(keys,
policy) -> kept indices.
langchain adopts KEEP_LAST and llama_index adopts REJECT (both
behaviour-preserving; their original error wording is retained). agno is
KEEP_ALL (its keep-all map fix stands). Haystack keeps its own logic: its
DuplicatePolicy resolution is stateful against the existing store with
deferred issue-#89 removal and does not reduce to the pure in-batch
function; the enum documents the mapping for reference.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
On x86_64 the blocked code buffer is perm0-interleaved hi/lo nibbles (the layout the AVX2/AVX-512 kernels consume). The scalar fallback — score_query_into_heap, taken on pre-AVX2 x86 or VMs that don't expose AVX2 at runtime — read it as the sequential layout used on other targets, decoding bytes for the wrong vectors and returning silently-wrong top-k results (indices stayed in bounds, so no crash). Add pack::deinterleave_x86_code_byte to reconstruct each vector's sequential code byte from the interleaved layout, and use it in the scalar fallback under cfg(target_arch = "x86_64"); other targets read directly as before. A round-trip unit test packs a block exactly as the x86 packer does and verifies recovery on every architecture, so the fix is verified on ARM even though the x86 search path can't run here. NOTE: the de-interleave math is verified by unit test on ARM and the x86 branch compiles (checked via --target x86_64-apple-darwin), but the full x86 scalar search path has not been executed on x86 hardware in this change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The four Array2::from_shape_vec(...).unwrap() calls that reassemble search results were the only remaining unguarded panic sites in the binding. They are provably unreachable today (the dimensions come from the core's own output), but a future change to result shaping would surface as an uncatchable PanicException. Map the ShapeError to a catchable RuntimeError via a shared shape_err helper. No behaviour change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add a cfg(test)-only FORCE_SCALAR_FALLBACK switch on the x86 dispatch and a cfg(target_arch=x86_64) test that runs search() once on the SIMD kernels and once forced onto score_query_into_heap, asserting identical top-k for 2/3/4-bit. score_query_into_heap is not compiled on aarch64, so this is the only end-to-end coverage of its scoring path + the perm0 de-interleave. Verified passing on a GCP c3-standard-8 (Sapphire Rapids, AVX-512BW): pack::tests::deinterleave_x86_recovers_sequential_code_bytes ... ok x86_scalar_fallback_tests::scalar_fallback_matches_simd_topk ... ok Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A 0-column array to TurboQuantIndex.add on a lazy index slipped past the dim%8 check (0%8==0), committed dim=0, then panicked in add() on vectors.len()/dim (divide by zero) — surfacing to Python as an uncatchable PanicException and leaving the index permanently wedged at dim=0. add_2d now rejects dim==0, mirroring IdMapIndex::add_with_ids_2d, so Python gets a ValueError and the lazy index stays uncommitted. Found by the second-pass audit. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…on (V8) The derived doc_id excludes name/content_id/metadata, so distinct documents that share content collide on one doc_id. delete_by_name/_by_content_id/ _by_metadata delegated to delete_by_id, which (post-#104, correctly) removes every handle under a doc_id — so deleting by one attribute also deleted an id-twin with a different name/content_id/metadata. delete_by_name also left a stale _name_to_ids entry because _remove_handle's cleanup matched on doc_id alone. Remove the handles matching the predicate directly via _remove_handle (LanceDb deletes predicate-matching rows), and match (id, name) in the name cleanup. delete_by_id (delete by explicit id) keeps its remove-all semantics. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
search/prepare lazily build a dim x dim f64 rotation matrix whose size scales with dim^2 and is NOT bounded by any loaded file's length. A crafted .tv/.tvim declaring a huge dim passes the dim%8 check, then the first search drives a multi-GB allocation from a tiny file (resource-exhaustion DoS) — the gap V2's file-length bound can't catch (the matrix isn't read from the file). Reject dim > MAX_DIM at construction, first add, and load. 65536 is far above any real embedding dim (~4096 max in common use); it bounds the catastrophic cases (e.g. dim=500000 -> ~2TB) though dim=65536 itself still permits ~34GB, so it's a sanity bound rather than a hard memory bound. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
repack_3bit had zero callers anywhere in the crate (3-bit support goes through the main repack, which handles bit_width 2/3/4 uniformly). Dead, untested code is latent-bug surface — remove it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
langchain/haystack/llama_index persist a binary .tvim index plus a JSON side-car of handle->payload maps. If the two drift out of sync (partial copy, stale backup, hand-edited/tampered side-car), an index-returned handle wouldn't resolve and surfaced as an opaque KeyError deep inside a query. agno was already immune (tolerant .get() + a dimensions check). Add a shared _persist.check_persisted_handles: using only IdMapIndex's len + contains, it verifies the side-car's handle set and the index are a bijection (equal size, every side-car handle present, no duplicates) and raises a clean ValueError at load otherwise. Wire it into all three loaders. Add a desync regression test per wrapper. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
LlamaIndex's metadata_dict_to_node does not restore metadata_separator: node_to_metadata_dict serializes it (it is present in _node_content), but reconstruction drops it back to the framework default. So no store built on the framework serializer preserves it, and the assertion only passed historically because the old default equalled the test's value. The newer llama-index-core default is '\n', exposing the drop. Remove the input kwarg and the assertion (verified directly against llama-index-core); the field is outside our control. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
RyanCodrai
added a commit
that referenced
this pull request
Jun 10, 2026
Audited docs/ against the merged behavior changes: - api.md: document that search() raises ValueError on non-finite/oversized query coords; dim must be a positive multiple of 8 and <= 65536 (MAX_DIM); zero-width add raises; load now validates the header before allocating. - agno.md: duplicate derived doc_id is now keep-all (both kept and deletable, matching LanceDb) — previous text implied last-write-wins; clarify delete_by_name/_content_id/_metadata target only matching docs. - agno/langchain/haystack/llama_index: loading a side-car out of sync with its .tvim now raises ValueError at load instead of a later KeyError. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
RyanCodrai
added a commit
that referenced
this pull request
Jun 10, 2026
* Release: turbovec 0.8.0 (Python) + 0.9.0 (Rust crate) Security-audit release (#108): untrusted-load hardening, binding panic fixes, integration data-integrity fixes, and the x86 scalar-fallback correctness fix. Resolves #104, #105, #106. Minor bump on both surfaces because a few inputs that previously panicked or were silently accepted now return typed errors. See CHANGELOG.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * docs: correct stale benchmark figures in README Fact-checked the README prose against benchmarks/results/. Several figures had drifted (mostly from before TQ+ landed): - ARM speed 12-20% -> 10-19% (actual range 10.3-19.4%) - OpenAI R@1 +0.4-3.4 pts -> +0.2-1.9 pts (no config reaches 3.4) - GloVe R@1 +0.3/-1.2 -> +0.9/tied (TQ+ closed the 2-bit gap) - x86 2-bit 'within ~1% ST, 2-4% MT' -> trails 3-8% on both ST and MT - softened 'matches the Shannon lower bound' -> 'near-optimal' (the how-it-works section already states within 2.7x of the bound) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * docs: fix benchmark + baseline-note drift in README - intro: 'matches the Shannon lower bound' -> 'near-optimal distortion'; trim to 'no separate training phase' - speed bullet/x86 prose: characterize x86 2-bit honestly (behind, most visibly d=1536 ST ~8%) instead of 'match-or-beat'/'3-8%' - recall: OpenAI converge to 1.0 by k=8 (>=0.997 at k=4) - baselines note: drop stale 'visible gap on GloVe' (TQ+ closed it; GloVe is now level at 2-bit, ahead at 4-bit) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * docs: fix drift vs the security release (#108) Audited docs/ against the merged behavior changes: - api.md: document that search() raises ValueError on non-finite/oversized query coords; dim must be a positive multiple of 8 and <= 65536 (MAX_DIM); zero-width add raises; load now validates the header before allocating. - agno.md: duplicate derived doc_id is now keep-all (both kept and deletable, matching LanceDb) — previous text implied last-write-wins; clarify delete_by_name/_content_id/_metadata target only matching docs. - agno/langchain/haystack/llama_index: loading a side-car out of sync with its .tvim now raises ValueError at load instead of a later KeyError. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * Mark AddError/ConstructError #[non_exhaustive] Adding the DimTooLarge variant this release is already breaking for downstream exhaustive matches. Mark both public error enums #[non_exhaustive] now so future variant additions stop being breaking changes — this release is the one-time free moment to do it. The Python binding only uses Display (e.to_string()), so it is unaffected. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * Require a 64-bit target via compile_error guard turbovec's usize size/offset arithmetic in encode/pack/search assumes a 64-bit pointer width; on 32-bit/wasm those products can overflow and index out of bounds. The untrusted-load path is already gated by checked_mul in io.rs, so this is not a vulnerability via file input — but a developer adding very large data on a 32-bit target could still overflow. Refuse to compile on non-64-bit targets rather than ship a silently-unsafe build. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * Add SECURITY.md (private vulnerability disclosure policy) Route security reports through GitHub private vulnerability reporting / Security Advisories instead of public issues, with reporting steps, what to include, expectations, supported-version policy, and scope. #105 came in as a public issue; this gives finders a private channel. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * SECURITY.md: drop the direct-contact fallback Private vulnerability reporting is the single channel; no alternate contact. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
This was referenced Jun 12, 2026
Open
a-tokyo
added a commit
to a-tokyo/turbovec
that referenced
this pull request
Jun 13, 2026
…ure/ts Sync the Node-bindings branch onto the latest upstream before proposing it upstream. Upstream's RyanCodrai#108 security audit independently hardened the same core files this branch had touched, and RyanCodrai#109 cut crate 0.9.0 / Python 0.8.0. Reconciliation: - turbovec/src/io.rs, lib.rs: take upstream's hardening wholesale; drop our now -redundant copies. Removed our duplicate `pub use io::MAX_DIM` (upstream defines `pub const MAX_DIM` at the crate root); upstream already exposes `MAX_INPUT_MAGNITUDE` as `pub` for the bindings. Net core delta from this branch is now ~zero — the package is essentially additive. - turbovec-node/src/error.rs + ts/errors.ts + docs/api.md: handle upstream's new `DimTooLarge` variants (both enums are now `#[non_exhaustive]`) via a new `DIM_TOO_LARGE` JS error code, plus a wildcard arm so future core variants don't break the binding build. - turbovec/tests/io_versioning.rs: the lazy-header regression test now anchors on upstream's wording (clean InvalidData on a dim=0/n>0 header is unchanged). - CHANGELOG/docs/api.md: keep our Node section under Unreleased alongside the 0.9.0/0.8.0 release entry and upstream's richer Python tables. Verified against the synced core: turbovec-node clippy clean; 193 vitest pass; cargo test -p turbovec green (incl. io_versioning).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two adversarial security-audit passes over the core crate, the PyO3 binding, and all four framework integrations, fixing every confirmed finding. Each fix is its own commit (split into PRs later if you prefer) and ships with regression tests. No CI/infra changes are bundled.
Closes #104
Closes #105
Closes #106
Fixes (first pass)
io.rsloadbit_width/dimand usechecked_mul+ atake()-capped reader before allocating. Closes the unchecked-multiply overflow, the header-driven multi-GB allocation DoS, a divide-by-zero panic, and a silent-wrong-results path (bit_width5–8 passed the old length assert). Broader than #105 as filed.searchPanicException; now a typedValueError, mirroringadd.doc_idorphaned a vector (counted + searchable but undeletable; leaked on upsert)._str_to_u64is now one-to-many — keep-all, matching agno's upstream (LanceDb appends). NB: this is the opposite of the issue's suggested keep-last patch, which would have made agno diverge from its own reference store.DuplicatePolicy/resolve_duplicates; langchain (KEEP_LAST) and llama_index (REJECT) adopt it. Behaviour-preserving.search.rsfrom_shape_vec().unwrap()sites now map to a catchableRuntimeError.Fixes (second pass)
add()slipped past thedim%8check (0%8==0), divided by zero (uncatchable panic), and wedged the index atdim=0. Now aValueError; index stays uncommitted.delete_by_name/_content_id/_metadatadelegated todelete_by_id, which removes every handle under a shareddoc_id— so deleting by one attribute also deleted content-identical id-twins (and left a stale name entry). Each now removes only predicate-matching handles, matching LanceDb.io.rssearchlazily builds adim×dimf64 rotation matrix whose size isn't bounded by the file length, so a tiny crafted.tv/.tvimwith a hugedimcould OOM the process on first search. CapdimatMAX_DIM = 65536at construct/add/load..tvimindex raised an opaqueKeyErrormid-query in langchain/haystack/llama_index. A sharedcheck_persisted_handles(using onlylen/contains) now raises a cleanValueErrorat load. agno was already immune.pack.rsrepack_3bit(zero callers; 3-bit goes throughrepack).Verification
metadata_separatoris dropped by LlamaIndex's ownmetadata_dict_to_nodeon reconstruction (verified againstllama-index-core), so the assertion was testing framework behaviour, not ours.Not in this PR