Improved models for SIMD intrinsics#1481
Conversation
The executable bit-vector reference models the SIMD trust axioms are
validated against, unioned from the per-branch work:
- ARM/NEON core_arch (arm.rs + arm/{neon,neon_handwritten,interpretations}).
- x86 interpretations superset (3-way merge) incl. the per-lane lift bodies;
deduped _mm256_madd_epi16, reconciled the equivalent _mm256_bsrli_epi128 idiom.
- bitvec.rs keeps the hax-0.3.7 #_v_F fix + the u8/64-bit lane interps.
x86/interpretations.rs: mark the _mm256_{loadu,storeu}_si256_{i16,i32,u8}
typed wrappers #[hax_lib::exclude]. They call BitVec::{from_slice,to_vec}
(generic, themselves excluded) and are only the Rust-side reference for the
differential tests; left un-excluded they extract as dangling references and
break F* typechecking of the Int_vec module for any consumer that builds it.
The harness used to validate the SIMD intrinsic reference models against the
real hardware semantics and to audit the trust surface:
- scripts/cross-validate.py, intrinsics-audit.py, classify-nospec.py,
check-intrinsics-parity.sh.
- the trust-surface index (intrinsics-trust-index.{md,csv}), the no-spec
classification, and the cross-validation findings.
avx2_extract.rs / arm64_extract.rs = ml-kem ⊕ sha3 (trust-base avx2 dropped:
never F*-verified). cmpgt_epi16 uses ml-kem's sound VPCMPGTW form. sha3's
u64x4 Keccak lane view + ARM64 Keccak fallback bodies are folded in.
avx2_extract.rs:
- Relocate sha3's u64x4 lane view (vec256_as_u64x4 / get_lane_u64x4 /
lemma_get_lane_u64x4_bit) out of the Vec256 fstar::replace(interface) block
into a fstar::before(interface) block on get_lane_u64, so it follows the
Vec256/Vec128 struct typeclass instances. Keeps ml-kem's i16-view block
byte-identical to its verified original.
- Restore ml-kem's i16-view lemma_mm256_xor_si256 / lemma_mm256_srli_epi16
alongside sha3's u64-view lemmas; they are explicitly called by
Libcrux_ml_kem.Vector.Avx2.Compress.
avx2.rs: add the #[hax_lib::ensures(|_r| future(output).len() == output.len())]
length-preservation post to the six &mut [T] storeu wrappers, matching
mm_storeu_bytes_si128 which already had it. #[hax_lib::opaque] without an
explicit ensures extracts a trivial post under hax 0.3.7, dropping the
output-length fact ML-DSA's Simd.Avx2.Rejection_sample proofs depend on.
Affects only the real Libcrux_intrinsics.Avx2 (ML-DSA's path); ml-kem/sha3
use the Avx2_extract model and never build it.
fstar-helpers/fstar-bitvec: BitVec.Intrinsics.fsti union (ml-kem base + sha3
u64x4 keccak ops) + Bitvec.{U64Rotate,VxarqProof} keccak-rotate helpers.
…q/vbcaxq/vrax1q/vxarq)
The four ARM64 NEON SHA3-extension fallbacks in
crates/utils/intrinsics/src/arm64_extract.rs
_veor3q_u64 == (a ^ b) ^ c
_vbcaxq_u64 == a ^ (b & ~c)
_vrax1q_u64 == a ^ rotate_left(b, 1)
_vxarq_u64 == rotate_left(a ^ b, LEFT)
have REAL bodies (compositions of the basic NEON ops), not unimplemented!()
model stubs, and carry #[hax_lib::ensures] specs. But until now the only F*
artifact any consumer built for the intrinsics crate was the .fsti interface,
so these fallback specs were trusted as axioms — nobody proved the bodies.
This adds a dedicated, isolated F* verification that actually proves them:
- fstar-helpers/fstar-bitvec/Bitvec.Sha3FallbackProof.fst: states each
fallback body (mirrored verbatim from the hax-extracted Arm64_extract.fst)
as a Pure function whose ensures is the fallback's spec, and discharges it.
The proofs rest ONLY on the basic NEON op specs (e_veorq_u64 / e_vbicq_u64 /
e_vshlq_n_u64 / e_vshrq_n_u64 from Arm64_extract.fsti) plus the one auditable
u64 rotate identity in Bitvec.U64Rotate. They do NOT depend on any of the
unimplemented!() model ops that keep Arm64_extract.fst from verifying whole —
which is why the fallback proofs are isolated into their own module rather
than rooting Arm64_extract.fst / Avx2_extract.fst.
- crates/utils/intrinsics/proofs/fstar/Makefile: dedicated verification whose
ROOTS are Bitvec.U64Rotate / Bitvec.VxarqProof / Bitvec.Sha3FallbackProof.
Only Libcrux_intrinsics.Arm64_extract.fsti is pulled in (the .fst bodies and
the model ops stay trusted as .fsti axioms, as in every consumer crate).
- .github/workflows/intrinsics-hax.yml: extracts the intrinsics crate and runs
the verification, modeled on sha3-hax.yml (hax pinned via get-hax-ref,
F* v2026.03.24).
Verified clean from a cold cache: 41 modules, all VCs discharged, no admits on
the four fallbacks. The .gitignore is narrowed so the hand-written Makefile is
tracked while generated extraction artifacts stay ignored.
`get_lane_u64` stores a Vec256 into a `[u64; 4]` by reinterpreting it as `&mut [u8]` via `core::slice::from_raw_parts_mut`. hax rejects raw pointers (`reject_RawOrMutPointer`), so any consumer that extracts the real `avx2.rs` (the core-models world, i.e. without `--cfg pre_core_models` — ml-dsa) failed to extract the intrinsics crate. Every other raw-pointer store/load wrapper in this file is already `#[hax_lib::opaque]` (e.g. mm256_storeu_si256_u8, mm_storeu_si128_u128); get_lane_u64 was the lone exception. Mark it opaque too, so hax extracts it as an uninterpreted `val` instead of trying to model the pointer body. Consumers that need a functional spec use the `avx2_extract` model (pre_core_models world); ml-dsa does not use get_lane_u64 at all. Fixes the ML-DSA `extract` CI job (reject_RawOrMutPointer at avx2.rs).
franziskuskiefer
left a comment
There was a problem hiding this comment.
Looks like CI isn't happy.
Also, should this be against the hax-0.3.7 branch instead of main then?
yep, fixing. |
Review re-requested
…jection_sample
Upstreamed from the libcrux-ml-dsa-proofs campaign. Adds the trusted F*
spec layer (D6.3) for the AVX2 intrinsics ml-dsa relies on; each `val` is
the trusted spec for a CPU-differential-tested core-models model.
- Spec.Intrinsics.fsti: 15 `val` SMTPat lemmas — store/load/setzero/
blendv/cmpeq/or/sign (compute_hint / use_hint) and movemask /
storeu_si128 / loadu_si256_u8 + projections (rejection_sample).
- intrinsics-trust-index.{md,csv}: refresh the trust ledger to match
(T1 193 -> 194, D6.3 F* spec coverage 76.7% -> 86.6%).
franziskuskiefer
left a comment
There was a problem hiding this comment.
I think this needs a bit of work/cleanup/splitup.
The checked in F* code is stale. Should it be dropped?
| #!/usr/bin/env python3 | ||
| """ | ||
| classify-nospec.py — For each genuinely spec-free T1 wrapper, find call sites | ||
| in ml-kem / ml-dsa / sha-3 and report the verification status of the caller. |
There was a problem hiding this comment.
All these things look global from the outside but only do something for those three algorithms. That needs to be made clear somewhere, and we need a way to extend this for other algorithms like AES or anything else that's using intrinsics.
| /// `result[i] = saturate_to_i16( (2 * a[i] * b[i]) >> 16 )`. | ||
| /// Saturation only fires for the (i16::MIN, i16::MIN) corner case (which | ||
| /// would yield a value >= 2^15 = i16 overflow). | ||
| pub fn vqdmulhq_s16(a: i16x8, b: i16x8) -> i16x8 { |
There was a problem hiding this comment.
This doesn't appear to handle the (i16::MIN, i16::MIN) case correctly.
This makes me think that there may be more issues (vqdmulhq_n_s32 appears to have it too). Can you improve testing, esp. for corner cases and make the PR more reviewable, i.e. into smaller digestible chunks?
| @@ -0,0 +1,132 @@ | |||
| #!/usr/bin/env bash | |||
There was a problem hiding this comment.
This script fails. Please make sure it works and we should run it on CI to make sure it's an actual CI gate.
| in FStar.Classical.forall_intro aux | ||
| "# | ||
| )] | ||
| #[hax_lib::requires(LEFT >= 0 && LEFT <= 64)] |
There was a problem hiding this comment.
Is it <= 64 or <64. lemma_mm256_slli_epi64_u64x4 says <64.
| // clamped to 16), it is NOT taken modulo 16. The old `tmp % 16` here | ||
| // mirrored a Rust core bug whose fix we upstreamed, so the model must | ||
| // follow the spec rather than the (now-corrected) implementation. | ||
| let tmp = IMM8 % 256; |
There was a problem hiding this comment.
Some are using rem_euclid and other like this one % 256. We should make sure it's consistent and what we want.
| if args.print_summary: | ||
| print(summary) | ||
| else: | ||
| print(summary) | ||
| return 0 |
There was a problem hiding this comment.
This seems a bit silly.
| # SIMD intrinsics trust index | ||
|
|
||
| Generated by `crates/utils/core-models/scripts/intrinsics-audit.py`. | ||
| See `crates/utils/core-models/INTRINSICS-TRUST-PLAN.md` for the trust-ladder definition (L0..L4) and the D6 sub-percentages. |
There was a problem hiding this comment.
There's no INTRINSICS-TRUST-PLAN.md.
| Rust_primitives.Integers.v sh > 0 /\ Rust_primitives.Integers.v sh < 64 | ||
|
|
||
| (* Bridge lemma — see module header. *) | ||
| assume |
There was a problem hiding this comment.
I don't think this is listed as axiom?
…(parity with ml-dsa-proofs) coeff_gather_bv_lemma (masked 3-byte coefficient bit decomposition; analog of the existing shl_casted_u8_bv_lemma) + u8_to_bv_logand15_lemma / u8_to_bv_shr4_lemma (u8 low/high nibble bit lemmas). These are the D6.3 intrinsic-model val axioms backing the AVX2 rejection_sample functional proof landed on ml-dsa-proofs (6bcc4f0); CPU/Python-validatable, D6.5 (machine-int proof) deferred. Byte-identical to the versions verified there. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…8_as_i16x8 bit-decomp axiom For the ml-kem NEON from_bytes/to_bytes functional proofs, give the byte load/store intrinsics their bit_vec ensures (mirror of the AVX2 mm256_loadu/storeu_si256_u8 ensures), and add the canonical i16x8 lane-bit decomposition axiom for the 128-bit register: - e_vld1q_bytes: len==16 ==> bit_vec_equal result (bit_vec_of_int_t_array bytes 8) - e_vst1q_bytes: length + (len==16 ==> bit_vec_equal (bit_vec_of_int_t_array out 8) v) - bit_vec_of_int_t_array_vec128_as_i16x8_lemma (Arm64) — hardware-agnostic bit math, identical statement to the core-models-validated AVX2 analog. VALIDATION (hardware-differential, native arm64): the load/store LE-reinterpret semantics induced by these ensures are validated bit-exact against real NEON hardware — 2,000,000 differential checks, 0 fails (libcrux-notes/agent-status/neon_vld1q_bytes_hwdiff_validate-2026-06-22.rs). e_vst1q_bytes uses a single F* ensures (length /\ bit_vec) like the AVX2 storeu (the Lean-only Rust-expr length ensures is dropped; hax rejects two stacked ensures here). Shared crate — strengthening posts only, additive. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Give mm256_srai_epi32 its lane32-form ensures (was l_True): for 0<=s<32, lane32 result j == (lane32 vector j) / pow2 s. F* integer division is Euclidean floor, which matches the sign-filling arithmetic shift, and shift 0 is the identity (sound, unlike the logical srli). Needed by the ml-kem AVX2 montgomery_reduce_i32s proof (shift 16 sign-extends the low i16 lane). VALIDATION: the srai_epi32_general_lane_formula transcription test in interpretations.rs checks the axiom RHS against the core-models _mm256_srai_epi32 model (signed a[i] >> s) over random inputs incl. negative lanes and i32 edges; cargo test -p core-models green (23/23). The model is hardware-differential-tested via the existing mk! entry. Shared crate; additive (strengthens a post only). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… axiom
The i32 and u32 lane VIEWS of the same 128-bit NEON register
(vec128_as_i32x4 / vec128_as_u32x4) are independent abstract vals; the
existing _vreinterpretq_s32_u32 / _vreinterpretq_u32_s32 ensures only assert
register equality (result == a), so a value cannot be tracked across the
signedness reinterpret. Add e_vreinterpret_i32_u32_lane_bridge stating the two
views are 2's-complement related (the bit-preserving bitcast semantics).
Hardware-differential validated bit-exact against real arm64
vreinterpretq_{s32_u32,u32_s32}: 24,000,000 lane-checks, 0 fails
(neon_vreinterpret_s32_u32_hwdiff_validate-2026-06-23.rs). Used by the NEON
d-bit compress functional proof (ML-KEM A3).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Strengthen the F* model ensures of _vst1q_u8 in arm64_extract.rs from
length-only to per-byte content: when the output slice has length 16,
out_future[k] == get_lane_u8x16 v k for k<16 (lane k -> byte k, LE),
anchored to the same get_lane_u8x16 convention as the trusted _vld1q_u8.
Validated bit-exact against real arm64 hardware (14,400,000 differential
checks, 0 fails: load->store round-trip + store of vreinterpretq_u8_s64;
see libcrux-notes neon_vst1q_u8_hwdiff_validate-2026-06-24). The body
stays lean::replace_body("()"). This is the trusted axiom that the Neon
serialize_10/12 proofs (landing in the same session) consume.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This PR does some cleanup in the SIMD intrinsics models, with a view to unifying
the proofs in SHA-3, ML-KEM, and ML-DSA and adding more tests for these models:
-
crates/utils/core-models— executable reference models, differential tests, and lemmas-
crates/utils/intrinsics— the trusted F* axioms inavx2_extract.rs/arm64_extract.rsThis PR is layered on top of the upgrade-0.3.7 branch and is a pre-requisite for merging the SHA-3, ML-KEM, and ML-DSA proofs.