Skip to content

Improved models for SIMD intrinsics#1481

Open
karthikbhargavan wants to merge 17 commits into
upgrade-hax-0.3.7from
pr2-intrinsics
Open

Improved models for SIMD intrinsics#1481
karthikbhargavan wants to merge 17 commits into
upgrade-hax-0.3.7from
pr2-intrinsics

Conversation

@karthikbhargavan

@karthikbhargavan karthikbhargavan commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator

This PR does some cleanup in the SIMD intrinsics models, with a view to unifying
the proofs in SHA-3, ML-KEM, and ML-DSA and adding more tests for these models:

-crates/utils/core-models — executable reference models, differential tests, and lemmas
-crates/utils/intrinsics — the trusted F* axioms in avx2_extract.rs / arm64_extract.rs

This PR is layered on top of the upgrade-0.3.7 branch and is a pre-requisite for merging the SHA-3, ML-KEM, and ML-DSA proofs.

The executable bit-vector reference models the SIMD trust axioms are
validated against, unioned from the per-branch work:
- ARM/NEON core_arch (arm.rs + arm/{neon,neon_handwritten,interpretations}).
- x86 interpretations superset (3-way merge) incl. the per-lane lift bodies;
  deduped _mm256_madd_epi16, reconciled the equivalent _mm256_bsrli_epi128 idiom.
- bitvec.rs keeps the hax-0.3.7 #_v_F fix + the u8/64-bit lane interps.

x86/interpretations.rs: mark the _mm256_{loadu,storeu}_si256_{i16,i32,u8}
typed wrappers #[hax_lib::exclude]. They call BitVec::{from_slice,to_vec}
(generic, themselves excluded) and are only the Rust-side reference for the
differential tests; left un-excluded they extract as dangling references and
break F* typechecking of the Int_vec module for any consumer that builds it.
The harness used to validate the SIMD intrinsic reference models against the
real hardware semantics and to audit the trust surface:
- scripts/cross-validate.py, intrinsics-audit.py, classify-nospec.py,
  check-intrinsics-parity.sh.
- the trust-surface index (intrinsics-trust-index.{md,csv}), the no-spec
  classification, and the cross-validation findings.
avx2_extract.rs / arm64_extract.rs = ml-kem ⊕ sha3 (trust-base avx2 dropped:
never F*-verified). cmpgt_epi16 uses ml-kem's sound VPCMPGTW form. sha3's
u64x4 Keccak lane view + ARM64 Keccak fallback bodies are folded in.

avx2_extract.rs:
- Relocate sha3's u64x4 lane view (vec256_as_u64x4 / get_lane_u64x4 /
  lemma_get_lane_u64x4_bit) out of the Vec256 fstar::replace(interface) block
  into a fstar::before(interface) block on get_lane_u64, so it follows the
  Vec256/Vec128 struct typeclass instances. Keeps ml-kem's i16-view block
  byte-identical to its verified original.
- Restore ml-kem's i16-view lemma_mm256_xor_si256 / lemma_mm256_srli_epi16
  alongside sha3's u64-view lemmas; they are explicitly called by
  Libcrux_ml_kem.Vector.Avx2.Compress.

avx2.rs: add the #[hax_lib::ensures(|_r| future(output).len() == output.len())]
length-preservation post to the six &mut [T] storeu wrappers, matching
mm_storeu_bytes_si128 which already had it. #[hax_lib::opaque] without an
explicit ensures extracts a trivial post under hax 0.3.7, dropping the
output-length fact ML-DSA's Simd.Avx2.Rejection_sample proofs depend on.
Affects only the real Libcrux_intrinsics.Avx2 (ML-DSA's path); ml-kem/sha3
use the Avx2_extract model and never build it.

fstar-helpers/fstar-bitvec: BitVec.Intrinsics.fsti union (ml-kem base + sha3
u64x4 keccak ops) + Bitvec.{U64Rotate,VxarqProof} keccak-rotate helpers.
…q/vbcaxq/vrax1q/vxarq)

The four ARM64 NEON SHA3-extension fallbacks in
crates/utils/intrinsics/src/arm64_extract.rs
  _veor3q_u64  == (a ^ b) ^ c
  _vbcaxq_u64  == a ^ (b & ~c)
  _vrax1q_u64  == a ^ rotate_left(b, 1)
  _vxarq_u64   == rotate_left(a ^ b, LEFT)
have REAL bodies (compositions of the basic NEON ops), not unimplemented!()
model stubs, and carry #[hax_lib::ensures] specs. But until now the only F*
artifact any consumer built for the intrinsics crate was the .fsti interface,
so these fallback specs were trusted as axioms — nobody proved the bodies.

This adds a dedicated, isolated F* verification that actually proves them:

  - fstar-helpers/fstar-bitvec/Bitvec.Sha3FallbackProof.fst: states each
    fallback body (mirrored verbatim from the hax-extracted Arm64_extract.fst)
    as a Pure function whose ensures is the fallback's spec, and discharges it.
    The proofs rest ONLY on the basic NEON op specs (e_veorq_u64 / e_vbicq_u64 /
    e_vshlq_n_u64 / e_vshrq_n_u64 from Arm64_extract.fsti) plus the one auditable
    u64 rotate identity in Bitvec.U64Rotate. They do NOT depend on any of the
    unimplemented!() model ops that keep Arm64_extract.fst from verifying whole —
    which is why the fallback proofs are isolated into their own module rather
    than rooting Arm64_extract.fst / Avx2_extract.fst.

  - crates/utils/intrinsics/proofs/fstar/Makefile: dedicated verification whose
    ROOTS are Bitvec.U64Rotate / Bitvec.VxarqProof / Bitvec.Sha3FallbackProof.
    Only Libcrux_intrinsics.Arm64_extract.fsti is pulled in (the .fst bodies and
    the model ops stay trusted as .fsti axioms, as in every consumer crate).

  - .github/workflows/intrinsics-hax.yml: extracts the intrinsics crate and runs
    the verification, modeled on sha3-hax.yml (hax pinned via get-hax-ref,
    F* v2026.03.24).

Verified clean from a cold cache: 41 modules, all VCs discharged, no admits on
the four fallbacks. The .gitignore is narrowed so the hand-written Makefile is
tracked while generated extraction artifacts stay ignored.
@karthikbhargavan karthikbhargavan requested a review from a team as a code owner June 17, 2026 10:53
`get_lane_u64` stores a Vec256 into a `[u64; 4]` by reinterpreting it as
`&mut [u8]` via `core::slice::from_raw_parts_mut`. hax rejects raw pointers
(`reject_RawOrMutPointer`), so any consumer that extracts the real `avx2.rs`
(the core-models world, i.e. without `--cfg pre_core_models` — ml-dsa) failed
to extract the intrinsics crate.

Every other raw-pointer store/load wrapper in this file is already
`#[hax_lib::opaque]` (e.g. mm256_storeu_si256_u8, mm_storeu_si128_u128);
get_lane_u64 was the lone exception. Mark it opaque too, so hax extracts it as
an uninterpreted `val` instead of trying to model the pointer body. Consumers
that need a functional spec use the `avx2_extract` model (pre_core_models
world); ml-dsa does not use get_lane_u64 at all.

Fixes the ML-DSA `extract` CI job (reject_RawOrMutPointer at avx2.rs).

@franziskuskiefer franziskuskiefer left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like CI isn't happy.
Also, should this be against the hax-0.3.7 branch instead of main then?

@karthikbhargavan karthikbhargavan marked this pull request as draft June 17, 2026 11:21
@karthikbhargavan karthikbhargavan changed the base branch from main to upgrade-hax-0.3.7 June 17, 2026 11:23
@karthikbhargavan

Copy link
Copy Markdown
Collaborator Author

Looks like CI isn't happy. Also, should this be against the hax-0.3.7 branch instead of main then?

yep, fixing.

@karthikbhargavan karthikbhargavan marked this pull request as ready for review June 17, 2026 14:55
@github-actions github-actions Bot dismissed franziskuskiefer’s stale review June 17, 2026 14:55

Review re-requested

…jection_sample

Upstreamed from the libcrux-ml-dsa-proofs campaign. Adds the trusted F*
spec layer (D6.3) for the AVX2 intrinsics ml-dsa relies on; each `val` is
the trusted spec for a CPU-differential-tested core-models model.

- Spec.Intrinsics.fsti: 15 `val` SMTPat lemmas — store/load/setzero/
  blendv/cmpeq/or/sign (compute_hint / use_hint) and movemask /
  storeu_si128 / loadu_si256_u8 + projections (rejection_sample).
- intrinsics-trust-index.{md,csv}: refresh the trust ledger to match
  (T1 193 -> 194, D6.3 F* spec coverage 76.7% -> 86.6%).

@franziskuskiefer franziskuskiefer left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs a bit of work/cleanup/splitup.

The checked in F* code is stale. Should it be dropped?

#!/usr/bin/env python3
"""
classify-nospec.py — For each genuinely spec-free T1 wrapper, find call sites
in ml-kem / ml-dsa / sha-3 and report the verification status of the caller.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these things look global from the outside but only do something for those three algorithms. That needs to be made clear somewhere, and we need a way to extend this for other algorithms like AES or anything else that's using intrinsics.

/// `result[i] = saturate_to_i16( (2 * a[i] * b[i]) >> 16 )`.
/// Saturation only fires for the (i16::MIN, i16::MIN) corner case (which
/// would yield a value >= 2^15 = i16 overflow).
pub fn vqdmulhq_s16(a: i16x8, b: i16x8) -> i16x8 {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't appear to handle the (i16::MIN, i16::MIN) case correctly.
This makes me think that there may be more issues (vqdmulhq_n_s32 appears to have it too). Can you improve testing, esp. for corner cases and make the PR more reviewable, i.e. into smaller digestible chunks?

@@ -0,0 +1,132 @@
#!/usr/bin/env bash

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This script fails. Please make sure it works and we should run it on CI to make sure it's an actual CI gate.

in FStar.Classical.forall_intro aux
"#
)]
#[hax_lib::requires(LEFT >= 0 && LEFT <= 64)]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it <= 64 or <64. lemma_mm256_slli_epi64_u64x4 says <64.

// clamped to 16), it is NOT taken modulo 16. The old `tmp % 16` here
// mirrored a Rust core bug whose fix we upstreamed, so the model must
// follow the spec rather than the (now-corrected) implementation.
let tmp = IMM8 % 256;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some are using rem_euclid and other like this one % 256. We should make sure it's consistent and what we want.

Comment on lines +836 to +840
if args.print_summary:
print(summary)
else:
print(summary)
return 0

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit silly.

# SIMD intrinsics trust index

Generated by `crates/utils/core-models/scripts/intrinsics-audit.py`.
See `crates/utils/core-models/INTRINSICS-TRUST-PLAN.md` for the trust-ladder definition (L0..L4) and the D6 sub-percentages.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no INTRINSICS-TRUST-PLAN.md.

Rust_primitives.Integers.v sh > 0 /\ Rust_primitives.Integers.v sh < 64

(* Bridge lemma — see module header. *)
assume

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is listed as axiom?

karthikbhargavan and others added 5 commits June 19, 2026 18:26
…(parity with ml-dsa-proofs)

coeff_gather_bv_lemma (masked 3-byte coefficient bit decomposition; analog of the
existing shl_casted_u8_bv_lemma) + u8_to_bv_logand15_lemma / u8_to_bv_shr4_lemma
(u8 low/high nibble bit lemmas). These are the D6.3 intrinsic-model val axioms
backing the AVX2 rejection_sample functional proof landed on ml-dsa-proofs
(6bcc4f0); CPU/Python-validatable, D6.5 (machine-int proof) deferred.
Byte-identical to the versions verified there.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…8_as_i16x8 bit-decomp axiom

For the ml-kem NEON from_bytes/to_bytes functional proofs, give the byte
load/store intrinsics their bit_vec ensures (mirror of the AVX2
mm256_loadu/storeu_si256_u8 ensures), and add the canonical i16x8
lane-bit decomposition axiom for the 128-bit register:
 - e_vld1q_bytes: len==16 ==> bit_vec_equal result (bit_vec_of_int_t_array bytes 8)
 - e_vst1q_bytes: length + (len==16 ==> bit_vec_equal (bit_vec_of_int_t_array out 8) v)
 - bit_vec_of_int_t_array_vec128_as_i16x8_lemma (Arm64) — hardware-agnostic
   bit math, identical statement to the core-models-validated AVX2 analog.

VALIDATION (hardware-differential, native arm64): the load/store
LE-reinterpret semantics induced by these ensures are validated bit-exact
against real NEON hardware — 2,000,000 differential checks, 0 fails
(libcrux-notes/agent-status/neon_vld1q_bytes_hwdiff_validate-2026-06-22.rs).
e_vst1q_bytes uses a single F* ensures (length /\ bit_vec) like the AVX2
storeu (the Lean-only Rust-expr length ensures is dropped; hax rejects
two stacked ensures here). Shared crate — strengthening posts only, additive.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Give mm256_srai_epi32 its lane32-form ensures (was l_True): for 0<=s<32,
lane32 result j == (lane32 vector j) / pow2 s.  F* integer division is
Euclidean floor, which matches the sign-filling arithmetic shift, and shift
0 is the identity (sound, unlike the logical srli).  Needed by the ml-kem
AVX2 montgomery_reduce_i32s proof (shift 16 sign-extends the low i16 lane).

VALIDATION: the srai_epi32_general_lane_formula transcription test in
interpretations.rs checks the axiom RHS against the core-models
_mm256_srai_epi32 model (signed a[i] >> s) over random inputs incl.
negative lanes and i32 edges; cargo test -p core-models green (23/23).
The model is hardware-differential-tested via the existing mk! entry.
Shared crate; additive (strengthens a post only).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… axiom

The i32 and u32 lane VIEWS of the same 128-bit NEON register
(vec128_as_i32x4 / vec128_as_u32x4) are independent abstract vals; the
existing _vreinterpretq_s32_u32 / _vreinterpretq_u32_s32 ensures only assert
register equality (result == a), so a value cannot be tracked across the
signedness reinterpret. Add e_vreinterpret_i32_u32_lane_bridge stating the two
views are 2's-complement related (the bit-preserving bitcast semantics).

Hardware-differential validated bit-exact against real arm64
vreinterpretq_{s32_u32,u32_s32}: 24,000,000 lane-checks, 0 fails
(neon_vreinterpret_s32_u32_hwdiff_validate-2026-06-23.rs). Used by the NEON
d-bit compress functional proof (ML-KEM A3).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Strengthen the F* model ensures of _vst1q_u8 in arm64_extract.rs from
length-only to per-byte content: when the output slice has length 16,
out_future[k] == get_lane_u8x16 v k for k<16 (lane k -> byte k, LE),
anchored to the same get_lane_u8x16 convention as the trusted _vld1q_u8.

Validated bit-exact against real arm64 hardware (14,400,000 differential
checks, 0 fails: load->store round-trip + store of vreinterpretq_u8_s64;
see libcrux-notes neon_vst1q_u8_hwdiff_validate-2026-06-24). The body
stays lean::replace_body("()"). This is the trusted axiom that the Neon
serialize_10/12 proofs (landing in the same session) consume.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants