Skip to content

Security audit fixes: untrusted-load hardening, binding panics, integration data-integrity, x86 scalar fallback#108

Merged
RyanCodrai merged 13 commits into
mainfrom
security/audit-fixes
Jun 10, 2026
Merged

Security audit fixes: untrusted-load hardening, binding panics, integration data-integrity, x86 scalar fallback#108
RyanCodrai merged 13 commits into
mainfrom
security/audit-fixes

Conversation

@RyanCodrai

Copy link
Copy Markdown
Owner

Summary

Two adversarial security-audit passes over the core crate, the PyO3 binding, and all four framework integrations, fixing every confirmed finding. Each fix is its own commit (split into PRs later if you prefer) and ships with regression tests. No CI/infra changes are bundled.

Closes #104
Closes #105
Closes #106

Fixes (first pass)

Severity Area What
V1+V2 Critical/High io.rs load Validate bit_width/dim and use checked_mul + a take()-capped reader before allocating. Closes the unchecked-multiply overflow, the header-driven multi-GB allocation DoS, a divide-by-zero panic, and a silent-wrong-results path (bit_width 5–8 passed the old length assert). Broader than #105 as filed.
V3 High Python search NaN/Inf/overflow query coords raised an uncatchable PanicException; now a typed ValueError, mirroring add.
V4 High agno Duplicate derived doc_id orphaned a vector (counted + searchable but undeletable; leaked on upsert). _str_to_u64 is now one-to-many — keep-all, matching agno's upstream (LanceDb appends). NB: this is the opposite of the issue's suggested keep-last patch, which would have made agno diverge from its own reference store.
refactor integrations Extracted a shared, unit-tested DuplicatePolicy/resolve_duplicates; langchain (KEEP_LAST) and llama_index (REJECT) adopt it. Behaviour-preserving.
V5 Medium search.rs x86 scalar fallback read the perm0-interleaved code layout as if sequential → silently-wrong top-k on pre-AVX2 x86. De-interleave added. Verified end-to-end on real Sapphire Rapids hardware (forced scalar path matches the AVX-512/AVX2 kernels for 2/3/4-bit).
V6 Low binding The last unguarded from_shape_vec().unwrap() sites now map to a catchable RuntimeError.

Fixes (second pass)

Severity Area What
V7 High core A 0-column array to a lazy add() slipped past the dim%8 check (0%8==0), divided by zero (uncatchable panic), and wedged the index at dim=0. Now a ValueError; index stays uncommitted.
V8 Medium agno delete_by_name/_content_id/_metadata delegated to delete_by_id, which removes every handle under a shared doc_id — so deleting by one attribute also deleted content-identical id-twins (and left a stale name entry). Each now removes only predicate-matching handles, matching LanceDb.
V9 Medium core/io.rs search lazily builds a dim×dim f64 rotation matrix whose size isn't bounded by the file length, so a tiny crafted .tv/.tvim with a huge dim could OOM the process on first search. Cap dim at MAX_DIM = 65536 at construct/add/load.
V10 Medium integrations A persisted JSON side-car out of sync with its .tvim index raised an opaque KeyError mid-query in langchain/haystack/llama_index. A shared check_persisted_handles (using only len/contains) now raises a clean ValueError at load. agno was already immune.
cleanup pack.rs Removed dead, untested repack_3bit (zero callers; 3-bit goes through repack).

Verification

  • Full Rust suite green; full Python suite green (378 tests), including the new regression tests for every finding.
  • V5 verified on a real x86 host (c3-standard-8, AVX-512BW), not just by cross-compile.
  • Also fixed a pre-existing, unrelated test failure: metadata_separator is dropped by LlamaIndex's own metadata_dict_to_node on reconstruction (verified against llama-index-core), so the assertion was testing framework behaviour, not ours.

Not in this PR

RyanCodrai and others added 13 commits June 10, 2026 14:34
The load path (io::load -> read_header_codes_scales -> from_parts) bypassed
the constructor invariants, so an untrusted .tv/.tvim file could:
  - carry a bit_width outside 2..=4, causing a divide-by-zero panic in
    pack::repack (0 or >8) or silently-wrong scores (5..8 passes the
    length assert);
  - carry a dim that isn't a multiple of 8, diverging the read/validate
    size formulas into a panic;
  - overflow the unchecked (dim/8)*bit_width*n_vectors multiply (undersized
    alloc -> OOB on 32-bit targets);
  - declare billions of vectors and drive a multi-GB allocation from a
    tiny file (vec![0u8; n] before read).

Validate bit_width and dim ranges before allocating, compute all sizes
with checked_mul, and read payloads via a take()-capped reader that grows
only to the bytes actually present. Add security regression tests covering
each crafted-file case.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
TurboQuantIndex.search / IdMapIndex.search panicked inside the core on
NaN/Inf/overflow-magnitude query coordinates, surfacing to Python as an
uncatchable PanicException (a BaseException, missed by 'except Exception').
add() already maps the same condition to ValueError.

Expose first_invalid_coord from the core and pre-validate queries in the
binding (mirroring the existing dim-mismatch check), so search now raises
ValueError consistently. Add regression tests for both index types.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
agno's reference store (LanceDb) is append-only with no unique constraint:
two documents that derive the same doc_id (a repeated explicit doc.id, or
identical content with no id) are BOTH stored. Our wrapper's one-to-one
_str_to_u64 map could not represent that — the second write overwrote the
first, orphaning the earlier vector: present in the index count and search
results but unreachable by id, so undeletable, and leaked on every upsert.
_deserialize re-orphaned them on load via a last-wins dict comprehension.

Make _str_to_u64 a one-to-many map (doc_id -> set of handles), matching
LanceDb's keep-all default and the behaviour the _remove_handle docstring
already assumed. delete_by_id now removes every handle under the id.

This deliberately does NOT dedup keep-last (the issue's suggested fix and
langchain's behaviour): that would make agno diverge from its upstream,
which keeps duplicates. The pinned identical-content/distinct-content_id
test stays green. Add keep-both and persistence-roundtrip regression tests.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Extract the in-batch duplicate-resolution logic the wrappers each
re-implemented into a pure, unit-tested _dedup.py exposing DuplicatePolicy
{KEEP_LAST, KEEP_FIRST, REJECT, KEEP_ALL} and resolve_duplicates(keys,
policy) -> kept indices.

langchain adopts KEEP_LAST and llama_index adopts REJECT (both
behaviour-preserving; their original error wording is retained). agno is
KEEP_ALL (its keep-all map fix stands). Haystack keeps its own logic: its
DuplicatePolicy resolution is stateful against the existing store with
deferred issue-#89 removal and does not reduce to the pure in-batch
function; the enum documents the mapping for reference.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
On x86_64 the blocked code buffer is perm0-interleaved hi/lo nibbles (the
layout the AVX2/AVX-512 kernels consume). The scalar fallback —
score_query_into_heap, taken on pre-AVX2 x86 or VMs that don't expose AVX2
at runtime — read it as the sequential layout used on other targets,
decoding bytes for the wrong vectors and returning silently-wrong top-k
results (indices stayed in bounds, so no crash).

Add pack::deinterleave_x86_code_byte to reconstruct each vector's
sequential code byte from the interleaved layout, and use it in the scalar
fallback under cfg(target_arch = "x86_64"); other targets read directly as
before. A round-trip unit test packs a block exactly as the x86 packer does
and verifies recovery on every architecture, so the fix is verified on ARM
even though the x86 search path can't run here.

NOTE: the de-interleave math is verified by unit test on ARM and the x86
branch compiles (checked via --target x86_64-apple-darwin), but the full
x86 scalar search path has not been executed on x86 hardware in this change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The four Array2::from_shape_vec(...).unwrap() calls that reassemble search
results were the only remaining unguarded panic sites in the binding. They
are provably unreachable today (the dimensions come from the core's own
output), but a future change to result shaping would surface as an
uncatchable PanicException. Map the ShapeError to a catchable
RuntimeError via a shared shape_err helper. No behaviour change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add a cfg(test)-only FORCE_SCALAR_FALLBACK switch on the x86 dispatch and a
cfg(target_arch=x86_64) test that runs search() once on the SIMD kernels and
once forced onto score_query_into_heap, asserting identical top-k for
2/3/4-bit. score_query_into_heap is not compiled on aarch64, so this is the
only end-to-end coverage of its scoring path + the perm0 de-interleave.

Verified passing on a GCP c3-standard-8 (Sapphire Rapids, AVX-512BW):
  pack::tests::deinterleave_x86_recovers_sequential_code_bytes ... ok
  x86_scalar_fallback_tests::scalar_fallback_matches_simd_topk ... ok

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A 0-column array to TurboQuantIndex.add on a lazy index slipped past the
dim%8 check (0%8==0), committed dim=0, then panicked in add() on
vectors.len()/dim (divide by zero) — surfacing to Python as an uncatchable
PanicException and leaving the index permanently wedged at dim=0.
add_2d now rejects dim==0, mirroring IdMapIndex::add_with_ids_2d, so Python
gets a ValueError and the lazy index stays uncommitted. Found by the
second-pass audit.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…on (V8)

The derived doc_id excludes name/content_id/metadata, so distinct documents
that share content collide on one doc_id. delete_by_name/_by_content_id/
_by_metadata delegated to delete_by_id, which (post-#104, correctly) removes
every handle under a doc_id — so deleting by one attribute also deleted an
id-twin with a different name/content_id/metadata. delete_by_name also left
a stale _name_to_ids entry because _remove_handle's cleanup matched on
doc_id alone.

Remove the handles matching the predicate directly via _remove_handle
(LanceDb deletes predicate-matching rows), and match (id, name) in the name
cleanup. delete_by_id (delete by explicit id) keeps its remove-all semantics.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
search/prepare lazily build a dim x dim f64 rotation matrix whose size
scales with dim^2 and is NOT bounded by any loaded file's length. A crafted
.tv/.tvim declaring a huge dim passes the dim%8 check, then the first search
drives a multi-GB allocation from a tiny file (resource-exhaustion DoS) —
the gap V2's file-length bound can't catch (the matrix isn't read from the
file). Reject dim > MAX_DIM at construction, first add, and load. 65536 is
far above any real embedding dim (~4096 max in common use); it bounds the
catastrophic cases (e.g. dim=500000 -> ~2TB) though dim=65536 itself still
permits ~34GB, so it's a sanity bound rather than a hard memory bound.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
repack_3bit had zero callers anywhere in the crate (3-bit support goes
through the main repack, which handles bit_width 2/3/4 uniformly). Dead,
untested code is latent-bug surface — remove it.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
langchain/haystack/llama_index persist a binary .tvim index plus a JSON
side-car of handle->payload maps. If the two drift out of sync (partial
copy, stale backup, hand-edited/tampered side-car), an index-returned
handle wouldn't resolve and surfaced as an opaque KeyError deep inside a
query. agno was already immune (tolerant .get() + a dimensions check).

Add a shared _persist.check_persisted_handles: using only IdMapIndex's
len + contains, it verifies the side-car's handle set and the index are a
bijection (equal size, every side-car handle present, no duplicates) and
raises a clean ValueError at load otherwise. Wire it into all three
loaders. Add a desync regression test per wrapper.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
LlamaIndex's metadata_dict_to_node does not restore metadata_separator:
node_to_metadata_dict serializes it (it is present in _node_content), but
reconstruction drops it back to the framework default. So no store built on
the framework serializer preserves it, and the assertion only passed
historically because the old default equalled the test's value. The newer
llama-index-core default is '\n', exposing the drop. Remove the input
kwarg and the assertion (verified directly against llama-index-core); the
field is outside our control.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@RyanCodrai RyanCodrai merged commit c9a305b into main Jun 10, 2026
6 checks passed
@RyanCodrai RyanCodrai deleted the security/audit-fixes branch June 10, 2026 16:26
RyanCodrai added a commit that referenced this pull request Jun 10, 2026
Audited docs/ against the merged behavior changes:
- api.md: document that search() raises ValueError on non-finite/oversized
  query coords; dim must be a positive multiple of 8 and <= 65536 (MAX_DIM);
  zero-width add raises; load now validates the header before allocating.
- agno.md: duplicate derived doc_id is now keep-all (both kept and
  deletable, matching LanceDb) — previous text implied last-write-wins;
  clarify delete_by_name/_content_id/_metadata target only matching docs.
- agno/langchain/haystack/llama_index: loading a side-car out of sync with
  its .tvim now raises ValueError at load instead of a later KeyError.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
RyanCodrai added a commit that referenced this pull request Jun 10, 2026
* Release: turbovec 0.8.0 (Python) + 0.9.0 (Rust crate)

Security-audit release (#108): untrusted-load hardening, binding panic
fixes, integration data-integrity fixes, and the x86 scalar-fallback
correctness fix. Resolves #104, #105, #106. Minor bump on both surfaces
because a few inputs that previously panicked or were silently accepted now
return typed errors. See CHANGELOG.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* docs: correct stale benchmark figures in README

Fact-checked the README prose against benchmarks/results/. Several figures
had drifted (mostly from before TQ+ landed):
- ARM speed 12-20% -> 10-19% (actual range 10.3-19.4%)
- OpenAI R@1 +0.4-3.4 pts -> +0.2-1.9 pts (no config reaches 3.4)
- GloVe R@1 +0.3/-1.2 -> +0.9/tied (TQ+ closed the 2-bit gap)
- x86 2-bit 'within ~1% ST, 2-4% MT' -> trails 3-8% on both ST and MT
- softened 'matches the Shannon lower bound' -> 'near-optimal' (the
  how-it-works section already states within 2.7x of the bound)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* docs: fix benchmark + baseline-note drift in README

- intro: 'matches the Shannon lower bound' -> 'near-optimal distortion';
  trim to 'no separate training phase'
- speed bullet/x86 prose: characterize x86 2-bit honestly (behind, most
  visibly d=1536 ST ~8%) instead of 'match-or-beat'/'3-8%'
- recall: OpenAI converge to 1.0 by k=8 (>=0.997 at k=4)
- baselines note: drop stale 'visible gap on GloVe' (TQ+ closed it; GloVe is
  now level at 2-bit, ahead at 4-bit)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* docs: fix drift vs the security release (#108)

Audited docs/ against the merged behavior changes:
- api.md: document that search() raises ValueError on non-finite/oversized
  query coords; dim must be a positive multiple of 8 and <= 65536 (MAX_DIM);
  zero-width add raises; load now validates the header before allocating.
- agno.md: duplicate derived doc_id is now keep-all (both kept and
  deletable, matching LanceDb) — previous text implied last-write-wins;
  clarify delete_by_name/_content_id/_metadata target only matching docs.
- agno/langchain/haystack/llama_index: loading a side-car out of sync with
  its .tvim now raises ValueError at load instead of a later KeyError.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* Mark AddError/ConstructError #[non_exhaustive]

Adding the DimTooLarge variant this release is already breaking for
downstream exhaustive matches. Mark both public error enums
#[non_exhaustive] now so future variant additions stop being breaking
changes — this release is the one-time free moment to do it. The Python
binding only uses Display (e.to_string()), so it is unaffected.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* Require a 64-bit target via compile_error guard

turbovec's usize size/offset arithmetic in encode/pack/search assumes a
64-bit pointer width; on 32-bit/wasm those products can overflow and index
out of bounds. The untrusted-load path is already gated by checked_mul in
io.rs, so this is not a vulnerability via file input — but a developer
adding very large data on a 32-bit target could still overflow. Refuse to
compile on non-64-bit targets rather than ship a silently-unsafe build.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* Add SECURITY.md (private vulnerability disclosure policy)

Route security reports through GitHub private vulnerability reporting /
Security Advisories instead of public issues, with reporting steps, what to
include, expectations, supported-version policy, and scope. #105 came in as
a public issue; this gives finders a private channel.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* SECURITY.md: drop the direct-contact fallback

Private vulnerability reporting is the single channel; no alternate contact.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
a-tokyo added a commit to a-tokyo/turbovec that referenced this pull request Jun 13, 2026
…ure/ts

Sync the Node-bindings branch onto the latest upstream before proposing it
upstream. Upstream's RyanCodrai#108 security audit independently hardened the same core
files this branch had touched, and RyanCodrai#109 cut crate 0.9.0 / Python 0.8.0.

Reconciliation:
- turbovec/src/io.rs, lib.rs: take upstream's hardening wholesale; drop our now
  -redundant copies. Removed our duplicate `pub use io::MAX_DIM` (upstream
  defines `pub const MAX_DIM` at the crate root); upstream already exposes
  `MAX_INPUT_MAGNITUDE` as `pub` for the bindings. Net core delta from this
  branch is now ~zero — the package is essentially additive.
- turbovec-node/src/error.rs + ts/errors.ts + docs/api.md: handle upstream's new
  `DimTooLarge` variants (both enums are now `#[non_exhaustive]`) via a new
  `DIM_TOO_LARGE` JS error code, plus a wildcard arm so future core variants
  don't break the binding build.
- turbovec/tests/io_versioning.rs: the lazy-header regression test now anchors on
  upstream's wording (clean InvalidData on a dim=0/n>0 header is unchanged).
- CHANGELOG/docs/api.md: keep our Node section under Unreleased alongside the
  0.9.0/0.8.0 release entry and upstream's richer Python tables.

Verified against the synced core: turbovec-node clippy clean; 193 vitest pass;
cargo test -p turbovec green (incl. io_versioning).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant