From 35579b430878640cdcc7d26f2a50728e5a471678 Mon Sep 17 00:00:00 2001 From: T Date: Sat, 27 Jun 2026 11:09:16 +0100 Subject: [PATCH 1/7] fix(802): defer string_length to Tier 3 for UTF-8 byte soundness (v0.0.182) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit vera verify translated string_length to Z3's Length, which counts Unicode code points, but Vera's runtime counts UTF-8 bytes — so ensures(string_length("é") == 1) verified at Tier 1 while the runtime returns 2 (a false proof for every non-ASCII string). Final sub-task of the #392 smt.py soundness audit. string_length is now modeled at Tier 1 only for a string literal (its exact byte length is known); a non-literal argument defers to a runtime-guarded Tier 3 obligation (Z3's string theory has no byte-length operator), matching the numeric-cast / quantifier / decimal precedent. The boolean predicates string_contains / string_starts_with / string_ends_with stay Tier 1 — UTF-8 is self-synchronizing, so a valid substring/prefix/suffix matches at the byte level exactly when it matches at the code-point level. The offset-sensitive builtins (string_slice / string_index_of / string_char_code / string_chars) were never modeled in Z3 and already defer. Tests that pinned the old z3.Length Tier-1 behavior are rewritten to the sound deferral; refinement / E501 / #667 tests that used string_length only as an example builtin switch to string_starts_with (a still-Tier-1 predicate). New test_string_length_soundness.py covers the literal byte model, slot deferral, and predicate soundness; ch04_string_builtins gains a non-ASCII byte-length case. Closes #802. Closes #392. Co-Authored-By: Claude --- CHANGELOG.md | 9 +- HISTORY.md | 3 +- README.md | 2 +- ROADMAP.md | 3 +- TESTING.md | 9 +- docs/index.html | 2 +- docs/index.md | 2 +- docs/llms-full.txt | 2 +- docs/llms.txt | 2 +- pyproject.toml | 2 +- scripts/check_spec_examples.py | 2 +- spec/06-contracts.md | 2 + tests/conformance/ch04_string_builtins.vera | 15 ++ tests/test_obligations.py | 12 +- tests/test_string_length_soundness.py | 149 ++++++++++++++++++++ tests/test_verifier.py | 96 ++++++++----- tests/test_verifier_coverage.py | 23 +-- uv.lock | 2 +- vera/__init__.py | 2 +- vera/smt.py | 28 ++-- 20 files changed, 283 insertions(+), 84 deletions(-) create mode 100644 tests/test_string_length_soundness.py diff --git a/CHANGELOG.md b/CHANGELOG.md index 078dc4f3..cae1a53a 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -6,6 +6,12 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). ## [Unreleased] +## [0.0.182] - 2026-06-27 + +### Fixed + +- **Verifier soundness: `string_length` no longer proves a false length for non-ASCII strings** (final sub-task of the [#392](https://github.com/aallan/vera/issues/392) `smt.py` soundness audit). `vera verify` translated `string_length` to Z3's `Length`, which counts Unicode **code points**, but Vera's runtime counts **UTF-8 bytes** — so `ensures(string_length("é") == 1)` verified at Tier 1 while the runtime returns `2`. `string_length` is now modeled at Tier 1 only for a string **literal** (whose exact byte length is known); on any non-literal argument it defers to a runtime-guarded **Tier 3** obligation (Z3's string theory has no byte-length operator), matching the numeric-cast / quantifier / decimal precedent. The boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay Tier 1 — UTF-8 is self-synchronizing, so a valid substring / prefix / suffix matches at the byte level exactly when it matches at the code-point level. This was the last of the six soundness bugs ([#797](https://github.com/aallan/vera/issues/797)–[#802](https://github.com/aallan/vera/issues/802)) found by the [#392](https://github.com/aallan/vera/issues/392) audit, which is now complete. ([#802](https://github.com/aallan/vera/issues/802)) + ## [0.0.181] - 2026-06-27 ### Fixed @@ -2604,7 +2610,8 @@ Small docs sweep — closes six aging documentation issues in one PR. No code c - Grammar: handler body simplified to avoid LALR reduce/reduce conflict - `pyproject.toml`: corrected build backend, package discovery, PEP 639 compliance -[Unreleased]: https://github.com/aallan/vera/compare/v0.0.181...HEAD +[Unreleased]: https://github.com/aallan/vera/compare/v0.0.182...HEAD +[0.0.182]: https://github.com/aallan/vera/compare/v0.0.181...v0.0.182 [0.0.181]: https://github.com/aallan/vera/compare/v0.0.180...v0.0.181 [0.0.180]: https://github.com/aallan/vera/compare/v0.0.179...v0.0.180 [0.0.179]: https://github.com/aallan/vera/compare/v0.0.178...v0.0.179 diff --git a/HISTORY.md b/HISTORY.md index ffe60430..e04bb8ec 100644 --- a/HISTORY.md +++ b/HISTORY.md @@ -392,6 +392,7 @@ Proof obligations became first-class records with a warm Z3 session, and the LSP | v0.0.179 | 26 Jun | **The `smt.py` soundness-audit batch lands: signed `/` and `%` truncate toward zero, a body `assert(P)` carries a Tier-1 obligation (loud `E507`), contract-position divisions get a `div_zero` obligation, and a prior `assert`/`assume` now discharges later obligations plus the postcondition at Tier 1** ([#392](https://github.com/aallan/vera/issues/392)). | | v0.0.180 | 26 Jun | **`@Float64` contracts are verified against Z3's IEEE-754 FloatingPoint sort instead of mathematical reals, so Tier-1 proofs respect `NaN`/`Inf`/signed-zero/rounding and `%` (C `fmod`) and match the runtime, so unsound relational / reflexive Float64 contracts are now correctly rejected** ([#797](https://github.com/aallan/vera/issues/797)). | | v0.0.181 | 27 Jun | **`@Int` / `@Nat` arithmetic overflow now traps: `+` / `-` / `*` are partial operations carrying an auto-synthesised `int_overflow` obligation, so a result that provably leaves i64 / u64 range is a compile error (E528) and a dynamic one is a runtime-guarded trap, instead of the verifier proving postconditions the two's-complement runtime then violated** ([#798](https://github.com/aallan/vera/issues/798)). | +| v0.0.182 | 27 Jun | **`string_length` UTF-8 byte soundness defers non-literal `string_length` to Tier 3 and models string literals at their exact byte length, completing the `smt.py` soundness audit** ([#802](https://github.com/aallan/vera/issues/802)). | --- @@ -407,4 +408,4 @@ Proof obligations became first-class records with a warm Z3 session, and the LSP | Spec chapters | 7 | 10 | 11 | 12 | 13 | 13 | 13 | 13 | | Code coverage | — | — | — | 90% | 91% | 96% | 96% | 95% | -Total: **1,400+ commits, 181 tagged releases, 77 active development days.** +Total: **1,400+ commits, 182 tagged releases, 77 active development days.** diff --git a/README.md b/README.md index cc05be26..f5160fc9 100644 --- a/README.md +++ b/README.md @@ -215,7 +215,7 @@ cp /path/to/vera/SKILL.md ~/.claude/skills/vera-language/SKILL.md ## Project status -Vera is in **active development** at v0.0.181 — 1,400+ commits, 181 releases, 5,156 tests, 91% code coverage, 93 conformance programs, 35 examples, and a 13-chapter specification. See **[HISTORY.md](HISTORY.md)** for how the compiler was built. +Vera is in **active development** at v0.0.182 — 1,400+ commits, 182 releases, 5,164 tests, 91% code coverage, 93 conformance programs, 35 examples, and a 13-chapter specification. See **[HISTORY.md](HISTORY.md)** for how the compiler was built. The reference compiler — parser, AST, type checker, contract verifier (Z3), WASM code generator, module system, browser runtime, and runtime contract insertion — is working. The language specification is in draft across [13 chapters](spec/). diff --git a/ROADMAP.md b/ROADMAP.md index 9a86edd5..63953d34 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -8,7 +8,7 @@ Priority lives in this file and nowhere else — issues carry kind and area labe ## Where we are -5,156 tests, 93 conformance programs, 35 examples, 13 spec chapters. +5,164 tests, 93 conformance programs, 35 examples, 13 spec chapters. ## The roadmap @@ -20,7 +20,6 @@ The infrastructure that catches the next regression before a user does, plus the | Issue | What | |---|---| -| [#392](https://github.com/aallan/vera/issues/392) | Audit the `smt.py` Z3 translation layer for soundness — a bug here silently bypasses verification. | | [#807](https://github.com/aallan/vera/issues/807) | Tier-1 model the deferred-but-modelable `@Float64` builtins from #797. **Blocked by #798**: `int_to_float`/`float_to_int` cross the Int↔Float boundary that the unbounded-`Int` idealisation (#798) makes unsound, so they wait on that fix; `float_clamp` is unblocked but needs a verify-vs-run codegen differential first. The four format/parse builtins (`float_to_string`, `parse_float64`, `decimal_from_float`/`_to_float`) are Tier-3 by necessity, not deferred. | | [#808](https://github.com/aallan/vera/issues/808) | Give the `@Int` / `@Nat` arithmetic-overflow runtime trap (#798) a precise `overflow` trap kind, so a dynamic overflow surfaces as a dedicated trap message rather than the generic `unreachable` kind it currently shares. | | [#592](https://github.com/aallan/vera/issues/592) | End-to-end behavioural tests for the five UTF-8 decode sites currently pinned only by structural greps. | diff --git a/TESTING.md b/TESTING.md index 551e5b8c..a64bdabc 100644 --- a/TESTING.md +++ b/TESTING.md @@ -6,7 +6,7 @@ This is the single source of truth for Vera's testing infrastructure, coverage d | Metric | Value | |--------|-------| -| **Tests** | 5,156 across 44 files (~63,000 lines of test code; 5,124 passed + 16 stress, 16 skipped) | +| **Tests** | 5,164 across 45 files (~63,000 lines of test code; 5,132 passed + 16 stress, 16 skipped) | | **Compiler code coverage** | 95% Python, 61% JavaScript — 91% combined (CI minimum: 80%) | | **Conformance programs** | 93 programs across 9 spec chapters, validating every language feature | | **Example programs** | 35, all validated through `vera check` + `vera verify` | @@ -58,8 +58,8 @@ python scripts/fix_allowlists.py --fix # auto-fix stale allowlists | `test_parser.py` | 129 | 968 | Grammar rules, operator precedence, parse errors | | `test_ast.py` | 128 | 1,122 | AST transformation, node structure, serialisation, string escape sequences, ability declarations | | `test_checker.py` | 548 | 6,347 | Type synthesis, slot resolution, effects, effect subtyping, contracts, exhaustiveness, cross-module typing, visibility, error codes, string built-ins, generic rejection, IO operation types, Markdown types, Regex types, abilities, Map collection, Set collection, Decimal type, Json type, Html type, Http effect, Inference effect, removed legacy name regression | -| `test_obligations.py` | 284 | 1083 | Reified proof obligations + warm `VerificationSession` (#222 Phase A): full-corpus differential oracle (warm session == cold `verify()` on diagnostics, summary, and obligation stream, plus warm-twice determinism, across all 35 examples and every verify/run-level conformance program), summary↔obligation tier-bookkeeping consistency, per-kind unit tests (requires / ensures / decreases / nat_sub / call_pre statuses, counterexamples, error codes), content-key stability + same-text-two-sites span disambiguation, session solver reuse, type-error short-circuit, ADT-registry resync between programs; plus the Phase B incremental suite — identical-source full replay, callee-body-edit replays callers while callee-contract-edit invalidates them, span-shift and ADT-edit conservative invalidation, cross-program isolation, timeout-status never cached (monkeypatched solver), FIFO eviction bound; plus the #727 dedup pin — a violating call in a let RHS records exactly one E501 diagnostic and one call_pre obligation | -| `test_verifier.py` | 473 | 9,270 | Z3 verification, counterexamples, tier classification, call-site preconditions, branch-aware preconditions, pipe operator, cross-module contracts, match/ADT verification, decreases verification, mutual recursion, refined Bool/String/Float64 param sorts, **@Nat subtraction underflow obligation** (#520 — Path-A obligation discharge via requires/path-conditions/path-aware Z3 refutation, pure-literal exclusion, Int-Int and Nat-Int → Int exemptions), **@Nat binding-site narrowing obligation** (#552 — Tier-1 `value >= 0` at let/call-arg/effect-op-arg/ctor-field/match-bind/literal-destructure narrowing, #520 double-emit disjointness, E503 counterexample-witness pin; #747 — generic-instantiation, ADT sub-pattern, non-literal tuple-destructure, and cross-module imported-ctor narrowing, with concrete-site Tier-3 narrowings (incl. the if-expr-source SMT gap, recorded per-component) classified as codegen-guarded `tier3_runtime`, generic function-formal call-args guarded on the monomorphised callee (keyed on `call_target`, not the erased generic decl), and the two genuinely-unguarded residuals — effect-op-arg narrowings (no codegen guard yet, #754) and generic-instantiated constructor-field narrowings (constructors carry no per-field `@Nat` mono metadata) — surfacing the `E504` unguarded warning; #749 — IndexExpr/InterpolatedString walker-recursion pins, `_fresh_slot_var` nat-alias unit test, `_narrows_into_nat` verifier/codegen soundness parity), **refinement-predicate verification** (#746 — Tier-1 discharge at let/call-arg/ctor-field/effect-op-arg/match-bind/tuple-destructure narrowings and refined return positions, E505 violation with counterexample, E506 Tier-3 for untranslatable/non-primitive-base predicates, the R3 already-refined-source exemption via predicate-AST equality, the R5 violating-return pin, the refinement-over-`@Nat` `>= 0 && P` conjoin, and the **refined-ADT-sub-pattern arm-fact carry** — a `Some(@PosInt)` bind on `Option` carries the field's source-type predicate into the arm body so a downstream `@Nat` narrowing discharges instead of a false E503, and reaches call **preconditions** in the arm body via the SMT match-translation fact hook so `Some(@PosInt) -> needs_positive(@PosInt.0)` discharges instead of a false E501, while a genuine `Option` narrowing stays obligated (E505/E501), never silently assumed; an **alias-base refined return** (`{ @Age \| @Age.0 >= 18 }`) is assumed by the caller via the predicate binder name; and a **refined return from a match arm** (`Some(@PosInt) -> @PosInt.0`) discharges via a global `arm-matched => source-fact` SMT implication on both the normal and generic-fast paths, a violating `Option` payload still E505), **per-monomorphization generic verification** (#732 — an instantiated `forall` body is verified per concrete instantiation: a generic-body `@Nat` underflow is caught (naming the instantiation, with a counterexample) or discharged when guarded, the never-instantiated residual stays Tier-3 `E520`, collapsed-type-var (`A=B=Int`) De Bruijn reindex soundness, one-diagnostic dedup across multiple instantiations, and discovery of a generic reached only through a `decreases(...)` measure, plus a mixed-`tier3`/`timeout` aggregate-label completeness pin and a recursive-generic-clone source-name pin so `decreases` resolves on the monomorphised instance), **primitive-operation safety obligations** (#680 — division/modulo by-zero `E526` and array-index-bounds `E527`, the in-bounds/out-of-bounds two-check with float-exemption, honest Tier-3 for opaque lengths, the off-by-one and `i >= 0` lower-bound pins, walker recursion through array-literal / assert / interpolated-string / let-destructure positions, the opaque-shadow Tier-3 routing that stops an untranslatable `let`/destructure from silently discharging a stale same-type-outer divisor, De Bruijn-correct `E526`/`E527` fix hints rendered from the actual operands, Tier-1 projection of literal-constructor destructure components (`let Tuple<@Int, @Int> = Tuple(10, 6); _ / @Int.1` discharges `10 != 0` rather than shadowing to Tier-3, while a non-literal source stays a tracked Tier-3 shadow), a tracked placeholder pushed for *every* destructured component so same-type De Bruijn positions don't collapse (`Tuple(10, )` keeps `@Int.0` on the opaque second component, never silently discharging against the literal `10`), `_contains_opaque_shadow` recursion so a *compound* divisor / subtraction operand embedding a shadow (`shadow + 1`) falls to Tier-3 instead of a spurious `E526`/`E502` counterexample, and tracked shadowing of match-arm pattern slots bound over an untranslatable (effect-op) scrutinee so a primitive op in the arm falls to Tier-3 rather than silently discharging against a stale same-name outer (`match Source.next(()) { Some(@Int) -> 1 / @Int.0 }`); plus a **multi-agent shadow/projection audit battery** (`TestShadowAuditDivision680` / `TestShadowAuditSubtraction680` / `TestShadowAuditIndex680` / `TestDestructureDeBruijnAlignment680` / `TestShadowAuditInteractions680`) — 57 differential tests pinning the invariant trichotomy (safe → verified, opaque/shadow → Tier-3, provably-unsafe → loud) across compound shadows, modulo parity, multi-component De Bruijn alignment with value-distinct siblings, intervening-type namespaces, stacked / nested destructures, opaque match scrutinees, nested blocks, and intra-block scoping; **mutation-validated** — every shadow/projection test was confirmed to flip RED when its target machinery (`_is_opaque_shadow` / `_contains_opaque_shadow` / the De Bruijn placeholder push / the `let`-shadow / the match-arm `_fresh_pattern_env` / literal projection) is deliberately broken, so none is green-for-the-wrong-reason) | +| `test_obligations.py` | 284 | 1087 | Reified proof obligations + warm `VerificationSession` (#222 Phase A): full-corpus differential oracle (warm session == cold `verify()` on diagnostics, summary, and obligation stream, plus warm-twice determinism, across all 35 examples and every verify/run-level conformance program), summary↔obligation tier-bookkeeping consistency, per-kind unit tests (requires / ensures / decreases / nat_sub / call_pre statuses, counterexamples, error codes), content-key stability + same-text-two-sites span disambiguation, session solver reuse, type-error short-circuit, ADT-registry resync between programs; plus the Phase B incremental suite — identical-source full replay, callee-body-edit replays callers while callee-contract-edit invalidates them, span-shift and ADT-edit conservative invalidation, cross-program isolation, timeout-status never cached (monkeypatched solver), FIFO eviction bound; plus the #727 dedup pin — a violating call in a let RHS records exactly one E501 diagnostic and one call_pre obligation | +| `test_verifier.py` | 473 | 9,288 | Z3 verification, counterexamples, tier classification, call-site preconditions, branch-aware preconditions, pipe operator, cross-module contracts, match/ADT verification, decreases verification, mutual recursion, refined Bool/String/Float64 param sorts, **@Nat subtraction underflow obligation** (#520 — Path-A obligation discharge via requires/path-conditions/path-aware Z3 refutation, pure-literal exclusion, Int-Int and Nat-Int → Int exemptions), **@Nat binding-site narrowing obligation** (#552 — Tier-1 `value >= 0` at let/call-arg/effect-op-arg/ctor-field/match-bind/literal-destructure narrowing, #520 double-emit disjointness, E503 counterexample-witness pin; #747 — generic-instantiation, ADT sub-pattern, non-literal tuple-destructure, and cross-module imported-ctor narrowing, with concrete-site Tier-3 narrowings (incl. the if-expr-source SMT gap, recorded per-component) classified as codegen-guarded `tier3_runtime`, generic function-formal call-args guarded on the monomorphised callee (keyed on `call_target`, not the erased generic decl), and the two genuinely-unguarded residuals — effect-op-arg narrowings (no codegen guard yet, #754) and generic-instantiated constructor-field narrowings (constructors carry no per-field `@Nat` mono metadata) — surfacing the `E504` unguarded warning; #749 — IndexExpr/InterpolatedString walker-recursion pins, `_fresh_slot_var` nat-alias unit test, `_narrows_into_nat` verifier/codegen soundness parity), **refinement-predicate verification** (#746 — Tier-1 discharge at let/call-arg/ctor-field/effect-op-arg/match-bind/tuple-destructure narrowings and refined return positions, E505 violation with counterexample, E506 Tier-3 for untranslatable/non-primitive-base predicates, the R3 already-refined-source exemption via predicate-AST equality, the R5 violating-return pin, the refinement-over-`@Nat` `>= 0 && P` conjoin, and the **refined-ADT-sub-pattern arm-fact carry** — a `Some(@PosInt)` bind on `Option` carries the field's source-type predicate into the arm body so a downstream `@Nat` narrowing discharges instead of a false E503, and reaches call **preconditions** in the arm body via the SMT match-translation fact hook so `Some(@PosInt) -> needs_positive(@PosInt.0)` discharges instead of a false E501, while a genuine `Option` narrowing stays obligated (E505/E501), never silently assumed; an **alias-base refined return** (`{ @Age \| @Age.0 >= 18 }`) is assumed by the caller via the predicate binder name; and a **refined return from a match arm** (`Some(@PosInt) -> @PosInt.0`) discharges via a global `arm-matched => source-fact` SMT implication on both the normal and generic-fast paths, a violating `Option` payload still E505), **per-monomorphization generic verification** (#732 — an instantiated `forall` body is verified per concrete instantiation: a generic-body `@Nat` underflow is caught (naming the instantiation, with a counterexample) or discharged when guarded, the never-instantiated residual stays Tier-3 `E520`, collapsed-type-var (`A=B=Int`) De Bruijn reindex soundness, one-diagnostic dedup across multiple instantiations, and discovery of a generic reached only through a `decreases(...)` measure, plus a mixed-`tier3`/`timeout` aggregate-label completeness pin and a recursive-generic-clone source-name pin so `decreases` resolves on the monomorphised instance), **primitive-operation safety obligations** (#680 — division/modulo by-zero `E526` and array-index-bounds `E527`, the in-bounds/out-of-bounds two-check with float-exemption, honest Tier-3 for opaque lengths, the off-by-one and `i >= 0` lower-bound pins, walker recursion through array-literal / assert / interpolated-string / let-destructure positions, the opaque-shadow Tier-3 routing that stops an untranslatable `let`/destructure from silently discharging a stale same-type-outer divisor, De Bruijn-correct `E526`/`E527` fix hints rendered from the actual operands, Tier-1 projection of literal-constructor destructure components (`let Tuple<@Int, @Int> = Tuple(10, 6); _ / @Int.1` discharges `10 != 0` rather than shadowing to Tier-3, while a non-literal source stays a tracked Tier-3 shadow), a tracked placeholder pushed for *every* destructured component so same-type De Bruijn positions don't collapse (`Tuple(10, )` keeps `@Int.0` on the opaque second component, never silently discharging against the literal `10`), `_contains_opaque_shadow` recursion so a *compound* divisor / subtraction operand embedding a shadow (`shadow + 1`) falls to Tier-3 instead of a spurious `E526`/`E502` counterexample, and tracked shadowing of match-arm pattern slots bound over an untranslatable (effect-op) scrutinee so a primitive op in the arm falls to Tier-3 rather than silently discharging against a stale same-name outer (`match Source.next(()) { Some(@Int) -> 1 / @Int.0 }`); plus a **multi-agent shadow/projection audit battery** (`TestShadowAuditDivision680` / `TestShadowAuditSubtraction680` / `TestShadowAuditIndex680` / `TestDestructureDeBruijnAlignment680` / `TestShadowAuditInteractions680`) — 57 differential tests pinning the invariant trichotomy (safe → verified, opaque/shadow → Tier-3, provably-unsafe → loud) across compound shadows, modulo parity, multi-component De Bruijn alignment with value-distinct siblings, intervening-type namespaces, stacked / nested destructures, opaque match scrutinees, nested blocks, and intra-block scoping; **mutation-validated** — every shadow/projection test was confirmed to flip RED when its target machinery (`_is_opaque_shadow` / `_contains_opaque_shadow` / the De Bruijn placeholder push / the `let`-shadow / the match-arm `_fresh_pattern_env` / literal projection) is deliberately broken, so none is green-for-the-wrong-reason) | | `test_soundness_392.py` | 36 | 584 | #392 audit batches 1–2 — verifier soundness/completeness fixes: signed div/mod truncate toward zero (#799), body `assert(P)` carries a Tier-1 obligation (#800), divisions in contract predicates carry a `div_zero` obligation (#801), and the #804 assume-half of #800's `assert` rule — a prior `assert`/`assume` discharges later obligations (including a later call's precondition) + the postcondition at Tier 1, removing false E501/E503/E500/E505. Test-first: each fails on the pre-fix verifier | | `test_int_overflow.py` | 6 | 143 | #798 — `@Int`/`@Nat` arithmetic-overflow obligations (part of the #392 `smt.py` soundness audit): `+`/`-`/`*` on `@Int`/`@Nat` now emit an `int_overflow` obligation (the analog of `nat_sub`/`div_zero`) rather than modelling the operands as Z3's unbounded integers, so a `ensures(@Int.result > @Int.0)` over `@Int.0 + 1` no longer proves a contract the i64/u64 runtime violates under two's-complement wraparound. Unbounded operands leave the obligation undischarged (Tier-3, runtime-guarded); operand bounds that prove the result stays in range discharge it at Tier 1. Test-first: each fails on the pre-fix verifier (no `int_overflow` obligation emitted) | | `test_int_overflow_codegen.py` | 50 | 567 | #798 Stage 3 — runtime overflow-trap codegen: the codegen emits a guard at *exactly* the `@Int`/`@Nat` `+`/`-`/`*` sites the verifier obligates, so `vera run`/`vera compile` programs trap on overflow instead of silently wrapping at the i64/u64 boundary. The trap is a bare `unreachable` (Option A — matches the #520 `nat_sub` and #552 nat-bind precedent), classified `kind="unreachable"` today (a precise `"overflow"` trap kind via a host import is a follow-up). Test-first: every `*_traps` test fails on the pre-Stage-3 codegen (the op wraps silently → no trap → `execute` returns a value); every `*_no_trap` test passes before and after (safe arithmetic unchanged) | @@ -75,6 +75,7 @@ python scripts/fix_allowlists.py --fix # auto-fix stale allowlists | `test_walker_defensive_branches_597.py` | 21 | 296 | Synthetic-AST tests for the 11 defensive `isinstance` branches added by #597 (`_scan_io_ops` / `_scan_expr_for_handlers` / `_infer_expr_wasm_type` / `_infer_vera_type`) plus the 5 pr-review fixes (#2/#3/#8 — ModuleCall/AnonFn/QualifiedCall return None; dead `is not None` guards on Block/HandleExpr removed) | | `test_check_walker_coverage_597.py` | 15 | 311 | Unit tests for `scripts/check_walker_coverage.py` parsing logic — Expr subclass extraction, isinstance flattening (incl. tuple form), checklist-block anchoring (incl. CR-3 regression test: `# Foo → bar` outside WALKER_COVERAGE block not counted), section-header tolerance, auto-discovery invariants, end-to-end main exit code | | `test_stress.py` | 16 | 553 | Scale-dependent regression tests (#596) — `@pytest.mark.stress`, skipped by default. 9 logical tests × eager-GC lane parametrisation = 16 test instances. 10K `array_map`, 5K nested-array `array_map`, 1K-deep tail recursion with allocating arg, 1M-deep tail recursion with allocating arg (#549 GC-aware TCO), 20×20 nested array-fold-of-array-fold, 100K `array_fold`, 10K String allocations, 1K `State` get/put cycles, 10K `IO.print` calls. Pins #570 / #515 / #593 / #549 / #487 / #348 / #573 regression coverage | +| `test_string_length_soundness.py` | 8 | 149 | #802 — string_length code-point vs UTF-8 byte soundness: a non-literal `string_length` defers to Tier 3 (the issue's `"é"` probe no longer proves `== 1` at Tier 1), a string-literal length is modeled at its exact UTF-8 byte count (`== 2` for `"é"`), and the boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay Tier 1 (sound under UTF-8 self-synchronization) | | `test_errors.py` | 52 | 525 | Error code registry, diagnostic formatting, serialisation, SourceLocation, error display sync (README/HTML/spec) | | `test_formatter.py` | 124 | 1,074 | Comment extraction, interior comment positioning, expression/declaration formatting, match arm block bodies, idempotency, parenthesization, spec rules, ability declarations | | `test_cli.py` | 242 | 3,504 | CLI commands (check, verify, compile, run, test, fmt, version, quiet), subprocess integration, JSON error paths, runtime traps, arg validation, multi-file resolution, IO exit codes, --explain-slots, `builtins`/`effects`/`errors` introspection dispatch | @@ -82,7 +83,7 @@ python scripts/fix_allowlists.py --fix # auto-fix stale allowlists | `test_resolver.py` | 15 | 411 | Module resolution, path lookup, parse caching, circular import detection | | `test_types.py` | 73 | 388 | Type operations: subtyping, effect subtyping, equality, substitution, pretty-printing, canonical names | | `test_wasm.py` | 24 | 344 | WASM internals: StringPool, WasmSlotEnv, translation edge cases via full pipeline | -| `test_verifier_coverage.py` | 91 | 1,589 | Verifier/SMT coverage gaps: SMT encoding paths, verifier edge cases, defensive branches, **#667 SMT translator coverage for `FloatLit` / `IndexExpr` / `ArrayLit`** (Tier 1 verification of float/array literal/index contract predicates) | +| `test_verifier_coverage.py` | 91 | 1,594 | Verifier/SMT coverage gaps: SMT encoding paths, verifier edge cases, defensive branches, **#667 SMT translator coverage for `FloatLit` / `IndexExpr` / `ArrayLit`** (Tier 1 verification of float/array literal/index contract predicates) | | `test_wasm_coverage.py` | 226 | 3,976 | WASM coverage gaps: helpers unit tests, inference branches, closure free-var walking, operator/data/context edge cases | | `test_tester.py` | 17 | 445 | Contract-driven testing: tier classification, input generation, test execution, skip message content | | `test_tester_coverage.py` | 35 | 930 | Tester coverage gaps: String/Float64/ADT parameter input generation, Bool/Byte parameters, unsatisfiable preconditions, type expression edge cases, FP model-value extraction (NaN/Inf/signed-zero, #797) | diff --git a/docs/index.html b/docs/index.html index 91d5d2e9..1f3f1aeb 100644 --- a/docs/index.html +++ b/docs/index.html @@ -268,7 +268,7 @@

A programming language designed for LLMs to write, not For Agents → SKILL.md

- v0.0.181 + v0.0.182 CI

diff --git a/docs/index.md b/docs/index.md index 9c30c61d..a48c7bda 100644 --- a/docs/index.md +++ b/docs/index.md @@ -4,7 +4,7 @@ From the Latin *veritas* — truth. In Vera, verification is a first-class citizen. -**Current version:** [0.0.181](https://github.com/aallan/vera/releases/tag/v0.0.181) · [GitHub](https://github.com/aallan/vera) · [SKILL.md](https://veralang.dev/SKILL.md) (agent language reference) +**Current version:** [0.0.182](https://github.com/aallan/vera/releases/tag/v0.0.182) · [GitHub](https://github.com/aallan/vera) · [SKILL.md](https://veralang.dev/SKILL.md) (agent language reference) ## Why? diff --git a/docs/llms-full.txt b/docs/llms-full.txt index 8381dbab..4c37f17d 100644 --- a/docs/llms-full.txt +++ b/docs/llms-full.txt @@ -2,7 +2,7 @@ > Vera is a statically typed, purely functional programming language designed for large language models to write. It uses typed slot references (@T.n) instead of variable names, requires contracts on every function, and compiles to WebAssembly. -This file contains the core Vera language documentation — language reference, agent instructions, FAQ, error codes, and formal grammar — compiled into a single document. Version 0.0.181. For the full documentation index including the 13-chapter specification and supplementary docs, see llms.txt. +This file contains the core Vera language documentation — language reference, agent instructions, FAQ, error codes, and formal grammar — compiled into a single document. Version 0.0.182. For the full documentation index including the 13-chapter specification and supplementary docs, see llms.txt. ======================================================================== diff --git a/docs/llms.txt b/docs/llms.txt index bf0d45a7..6df4f16d 100644 --- a/docs/llms.txt +++ b/docs/llms.txt @@ -4,7 +4,7 @@ Vera uses De Bruijn indexing for bindings: `@Int.0` is the most recent `Int` binding, `@Int.1` the one before. There are no variable names. Contracts are mandatory — every function must declare `requires(...)`, `ensures(...)`, and `effects(...)`. The Z3 SMT solver verifies contracts statically where possible; remaining contracts become runtime assertions. All side effects (IO, Http, State, Exceptions, Async, Inference, Random) are tracked in the type system via algebraic effects. -Current version: 0.0.181. The reference compiler is written in Python. Install with `pip install -e .` from the repository. +Current version: 0.0.182. The reference compiler is written in Python. Install with `pip install -e .` from the repository. ## Homepage diff --git a/pyproject.toml b/pyproject.toml index 7aaaf397..1c2abfb1 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,6 +1,6 @@ [project] name = "vera" -version = "0.0.181" +version = "0.0.182" description = "Vera: a programming language designed for LLMs, with full contracts, algebraic effects, and typed slot references" readme = "README.md" license = "MIT" diff --git a/scripts/check_spec_examples.py b/scripts/check_spec_examples.py index 6ea16b19..2001fd12 100644 --- a/scripts/check_spec_examples.py +++ b/scripts/check_spec_examples.py @@ -142,7 +142,7 @@ ("05-functions.md", 324): "FRAGMENT", # fn(A -> B) in param position # Chapter 6 — inline function type in type alias - ("06-contracts.md", 347): "FRAGMENT", # type SafeDiv = fn(...) + fn apply_div + ("06-contracts.md", 349): "FRAGMENT", # type SafeDiv = fn(...) + fn apply_div # Chapter 7 — anonymous function at top level ("07-effects.md", 116): "FRAGMENT", # effect Logger + anonymous fn body diff --git a/spec/06-contracts.md b/spec/06-contracts.md index f885e2dc..60cc420c 100644 --- a/spec/06-contracts.md +++ b/spec/06-contracts.md @@ -254,6 +254,8 @@ The `@Nat` obligations (E502 / E503) carry the most nuance, spanning many bindin **Integer overflow** ([#798](https://github.com/aallan/vera/issues/798)). `@Int` is a signed 64-bit machine integer and `@Nat` an unsigned one; `+` / `-` / `*` wrap at the i64 / u64 boundary. Like `@Nat` underflow and signed-division `MIN / -1`, an overflowing operation is a *partial* operation that **traps** at runtime rather than silently wrapping, so each `@Int` / `@Nat` `+` / `-` / `*` carries an obligation that the result stays in range. It is classified at the operands' **common (coerced) type** — `@Int` if either operand is `@Int` (since `@Nat <: @Int`), else `@Nat` — not one operand's self-type (a non-negative literal is `@Nat`, but `5 + @Int.0` is an i64 add) nor the possibly-narrowed result type (an `@Int.0 + 1` stored into a `@Nat` slot is still an i64 add). A two-check mirrors array indexing: the result provably in range → **Tier 1**; provably out of range (e.g. a literal `u64.MAX + 1`, or `@Int.0 + 1` under `requires(@Int.0 == i64.MAX)`) → a compile error (**E528**); otherwise — dynamic operands — a runtime-guarded **Tier 3** trap. `@Nat` subtraction is excluded — it is the underflow obligation (E502) above, never a high-overflow. +**String length** ([#802](https://github.com/aallan/vera/issues/802)). Vera strings are UTF-8 byte sequences and `string_length` returns the **byte** count, but Z3's string theory (SMT-LIB 2.6) models strings as sequences of Unicode **code points** — its `Length` counts code points, which disagrees with the runtime on every multibyte character (`string_length("é")` is `2`, not `1`). `string_length` is therefore modeled at **Tier 1** only for a string **literal**, whose exact byte length is known; on any non-literal argument it defers to a runtime-guarded **Tier 3** obligation (Z3's string theory has no byte-length operator). The boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay **Tier 1**: UTF-8 is self-synchronizing, so a valid substring / prefix / suffix matches at the byte level exactly when it matches at the code-point level. The other byte/offset-sensitive string builtins (`string_slice`, `string_index_of`, `string_char_code`, `string_chars`) are not translated to Z3 and already fall to **Tier 3**. + Runtime traps for unguarded primitives are Vera-native: each trap carries a kind label (`divide_by_zero`, `out_of_bounds`, etc.), a per-kind Fix paragraph naming the precondition that would have prevented it, and a source backtrace — so a missing static guarantee is still a recoverable signal. ### 6.4.4 SMT Solver Integration diff --git a/tests/conformance/ch04_string_builtins.vera b/tests/conformance/ch04_string_builtins.vera index 77a2c99d..4647e903 100644 --- a/tests/conformance/ch04_string_builtins.vera +++ b/tests/conformance/ch04_string_builtins.vera @@ -1,5 +1,8 @@ -- Conformance: string built-in functions (Chapter 4) -- Tests: string_length, string_concat, string_slice, string_char_code, to_string +-- Also covers #802: string_length counts UTF-8 bytes, and a string-literal +-- length is modeled at its exact byte count (so a non-ASCII length both +-- verifies at Tier 1 and runs). effect IO { op print(String -> Unit); } @@ -12,6 +15,18 @@ public fn test_length(@Unit -> @Int) string_length("hello") } +-- #802: "é" (U+00E9) is one Unicode code point but TWO UTF-8 bytes, and +-- string_length counts bytes. Pre-fix the verifier proved this == 1 (Z3's +-- code-point Length); now a string-literal length is modeled at its exact byte +-- count, so == 2 both verifies (Tier 1) and runs. +public fn test_length_non_ascii(@Unit -> @Int) + requires(true) + ensures(@Int.result == 2) + effects(pure) +{ + string_length("é") +} + public fn test_concat(@Unit -> @String) requires(true) ensures(true) diff --git a/tests/test_obligations.py b/tests/test_obligations.py index e1e99385..d4cfd22c 100644 --- a/tests/test_obligations.py +++ b/tests/test_obligations.py @@ -590,9 +590,13 @@ def test_e501_renders_precondition_in_call_site_terms(self) -> None: arguments substituted, and the fix shows concrete code — the guard with the rendered call, and the requires() to add. """ + # Uses string_starts_with (a Tier-1 string predicate) rather than + # string_length, which now defers to Tier 3 (#802) and so would not + # produce a static call-site E501 — the test is about E501 *rendering*, + # not string_length specifically. source = ( "private fn need_pos(@String -> @String)\n" - " requires(string_length(@String.0) > 0)\n" + ' requires(string_starts_with(@String.0, "x"))\n' " ensures(true)\n" " effects(pure)\n" "{\n" @@ -612,13 +616,13 @@ def test_e501_renders_precondition_in_call_site_terms(self) -> None: e501 = [d for d in result.diagnostics if d.error_code == "E501"] assert len(e501) == 1 d = e501[0] - assert 'At this call site: string_length("") > 0' in d.description + assert 'At this call site: string_starts_with("", "x")' in d.description assert d.fix is not None assert ( - 'if string_length("") > 0 then { need_pos("") } else { ... }' + 'if string_starts_with("", "x") then { need_pos("") } else { ... }' in d.fix ) - assert "requires(string_length(\"\") > 0)" in d.fix + assert 'requires(string_starts_with("", "x"))' in d.fix def test_e501_substitution_resolves_de_bruijn_order(self) -> None: """Slot substitution must honour De Bruijn most-recent-first: diff --git a/tests/test_string_length_soundness.py b/tests/test_string_length_soundness.py new file mode 100644 index 00000000..cb4b6372 --- /dev/null +++ b/tests/test_string_length_soundness.py @@ -0,0 +1,149 @@ +"""Regression tests for #802 — string_length code-point vs UTF-8 byte mismatch. + +Part of the #392 ``smt.py`` soundness audit. Vera's runtime ``string_length`` +counts UTF-8 **bytes**; Z3's ``Length`` over ``z3.String`` counts Unicode **code +points**. They disagree on every multibyte character, so modeling +``string_length`` with ``z3.Length`` proved *false* contracts at Tier 1 — e.g. +``string_length("é") == 1`` verified, yet the runtime returns ``2``: + + public fn f(@String -> @Int) + requires(@String.0 == "é") # U+00E9: 1 code point, 2 UTF-8 bytes + ensures(@Int.result == 1) # FALSE — runtime returns 2 + effects(pure) + { string_length(@String.0) } + +Fix (#802): a ``string_length`` of a **non-literal** argument defers to Tier 3 +(no byte-length operator exists in Z3's string theory), matching the numeric-cast +/ quantifier / decimal precedent; a string **literal** is modeled at its exact +UTF-8 byte length (sound *and* precise). + +Written test-first: the slot-deferral and literal-byte-count tests fail on the +pre-fix verifier, which proves the false contract at Tier 1 via ``z3.Length``. +""" + +from __future__ import annotations + +from vera.checker import typecheck_with_artifacts +from vera.parser import parse_to_ast +from vera.verifier import VerifyResult, verify + + +def _verify(source: str) -> VerifyResult: + ast_ = parse_to_ast(source) + _diags, arts = typecheck_with_artifacts(ast_, source) + return verify( + ast_, source, + expr_types=arts.expr_semantic_types, + expr_target_types=arts.expr_target_types, + ) + + +def _ok(result: VerifyResult) -> bool: + """Verification succeeded iff no error-severity diagnostics (mirrors the + ``ok`` field of ``vera verify --json``).""" + return not any(d.severity == "error" for d in result.diagnostics) + + +class TestStringLengthSoundness802: + def test_slot_arg_false_length_not_proved_at_tier1(self) -> None: + # The issue's probe: byte length of "é" is 2, so ensures(result == 1) is + # FALSE. Pre-fix the verifier PROVED it at Tier 1 (z3.Length("é") == 1). + # After the fix the slot-arg string_length defers to Tier 3, so the false + # postcondition is NOT proved — it is runtime-guarded, and the runtime + # correctly rejects it. + result = _verify(""" +public fn f(@String -> @Int) + requires(@String.0 == "é") + ensures(@Int.result == 1) + effects(pure) +{ string_length(@String.0) } +""") + # The string_length-derived postcondition must not be a Tier-1 proof; + # it must defer to a runtime-guarded Tier-3 obligation. + assert result.summary.tier3_runtime >= 1, ( + result.summary.tier1_verified, result.summary.tier3_runtime, + ) + + def test_slot_arg_true_length_also_defers(self) -> None: + # Even the TRUE byte-length contract defers for a slot argument (Z3 has no + # byte-length operator) — soundness over precision. + result = _verify(""" +public fn f(@String -> @Int) + requires(@String.0 == "é") + ensures(@Int.result == 2) + effects(pure) +{ string_length(@String.0) } +""") + assert result.summary.tier3_runtime >= 1, ( + result.summary.tier1_verified, result.summary.tier3_runtime, + ) + + def test_literal_arg_byte_length_proved_at_tier1(self) -> None: + # A string LITERAL has a known exact UTF-8 byte length, modeled precisely: + # string_length("é") == 2 verifies at Tier 1 (byte model). Pre-fix this + # was DISPROVED (z3.Length gave the code-point count 1). + ok = _verify(""" +public fn f(@Unit -> @Int) + requires(true) ensures(@Int.result == 2) effects(pure) +{ string_length("é") } +""") + assert _ok(ok), [d.description for d in ok.diagnostics] + assert ok.summary.tier1_verified >= 1 + + def test_literal_arg_wrong_codepoint_length_disproved(self) -> None: + # The code-point count (1) is the wrong answer for "é"; it must NOT + # verify. Pre-fix this was PROVED (z3.Length("é") == 1) — the bug. + bad = _verify(""" +public fn f(@Unit -> @Int) + requires(true) ensures(@Int.result == 1) effects(pure) +{ string_length("é") } +""") + assert not _ok(bad), "string_length(\"é\") == 1 is false (byte length is 2)" + + def test_ascii_literal_byte_length_still_proved(self) -> None: + # ASCII is 1 byte per code point, so the byte model agrees with the old + # code-point answer — ASCII literal lengths still verify at Tier 1. + ok = _verify(""" +public fn f(@Unit -> @Int) + requires(true) ensures(@Int.result == 2) effects(pure) +{ string_length("ab") } +""") + assert _ok(ok) and ok.summary.tier1_verified >= 1 + + +class TestStringPredicateSoundness802: + """contains / starts_with / ends_with are boolean predicates; under UTF-8 + self-synchronization a valid-UTF-8 needle matches at the byte level iff at + the code-point level, so Z3's Contains/PrefixOf/SuffixOf stay sound on + non-ASCII input — they remain Tier-1 (this is the verifier side; the + end-to-end verify↔run agreement is pinned in test_string_length_codegen).""" + + def test_starts_with_non_ascii_tier1(self) -> None: + result = _verify(""" +public fn f(@String -> @Bool) + requires(@String.0 == "été") + ensures(@Bool.result == true) + effects(pure) +{ string_starts_with(@String.0, "ét") } +""") + assert _ok(result) and result.summary.tier3_runtime == 0 + + def test_ends_with_non_ascii_tier1(self) -> None: + result = _verify(""" +public fn f(@String -> @Bool) + requires(@String.0 == "café") + ensures(@Bool.result == true) + effects(pure) +{ string_ends_with(@String.0, "fé") } +""") + assert _ok(result) and result.summary.tier3_runtime == 0 + + def test_contains_non_ascii_tier1(self) -> None: + result = _verify(""" +public fn f(@String -> @Bool) + requires(@String.0 == "café") + ensures(@Bool.result == true) + effects(pure) +{ string_contains(@String.0, "afé") } +""") + assert _ok(result) and result.summary.tier3_runtime == 0 diff --git a/tests/test_verifier.py b/tests/test_verifier.py index c2e57134..7afef80b 100644 --- a/tests/test_verifier.py +++ b/tests/test_verifier.py @@ -4033,6 +4033,12 @@ def test_overall_tier_counts(self) -> None: +8 T1, +47 T3, +55 total, +0 tier3_unguarded — verified by reconstructing the prior 265/31/296/0 baseline with int_overflow obligations excluded. + + #802: string_length on a non-literal argument now defers to Tier 3 + (Z3's Length counts code points, Vera counts UTF-8 bytes), so two + example contracts over a slot-arg string_length move T1 -> T3. Net: + -2 T1, +2 T3, +0 total (the obligations persist, only their tier + changes): 273/78/351 -> 271/80/351. """ t1 = t3 = total = t3u = 0 for f in sorted(EXAMPLES_DIR.glob("*.vera")): @@ -4045,8 +4051,8 @@ def test_overall_tier_counts(self) -> None: total += result.summary.total t3u += sum(1 for o in result.obligations if o.status == "tier3_unguarded") - assert t1 == 273, f"Expected 273 T1, got {t1}" - assert t3 == 78, f"Expected 78 T3, got {t3}" + assert t1 == 271, f"Expected 271 T1, got {t1}" + assert t3 == 80, f"Expected 80 T3, got {t3}" assert total == 351, f"Expected 351 total, got {total}" assert t3u == 0, f"Expected 0 tier3_unguarded, got {t3u}" @@ -4151,42 +4157,49 @@ def test_mutual_recursion_example_all_t1(self) -> None: class TestStringLengthVerification: - """string_length() on @String arguments uses z3.Length() — native Z3 string theory (Tier 1). - - The uninterpreted function path is only a fallback for non-SeqSort arguments and is - never reached in practice now that String params are correctly declared as SeqSort. + """string_length defers to Tier 3 for non-literal arguments (#802). + + Z3's ``Length`` over ``z3.String`` counts Unicode code points, but Vera's + runtime ``string_length`` counts UTF-8 bytes — so modeling a non-literal + ``string_length`` with ``z3.Length`` proved false contracts for non-ASCII + input. The only sound Tier-1 model is the exact byte length of a string + *literal*; everything else defers to a runtime-guarded Tier-3 obligation. + (Comprehensive coverage: tests/test_string_length_soundness.py.) """ - def test_string_length_gt_zero_requires_tier1(self) -> None: - """requires(string_length(@String.0) > 0) is verified Tier 1.""" + def test_string_length_slot_in_ensures_defers_to_tier3(self) -> None: + """ensures over a slot-arg string_length can't be Tier-1 proved (the + byte count is unknown to Z3); it defers to a runtime-guarded Tier-3 + obligation rather than the old unsound z3.Length proof.""" result = _verify(""" -private fn non_empty(@String -> @Int) - requires(string_length(@String.0) > 0) - ensures(true) +private fn get_length(@String -> @Int) + requires(true) + ensures(@Int.result >= 0) effects(pure) { string_length(@String.0) } """) - assert result.summary.tier1_verified >= 1 - assert result.summary.tier3_runtime == 0 + assert result.summary.tier3_runtime >= 1 - def test_string_length_ensures_tier1(self) -> None: - """ensures(@Int.result >= 0) on string_length return is verified Tier 1.""" + def test_string_length_literal_byte_length_tier1(self) -> None: + """string_length of a literal is modeled at its exact UTF-8 byte length + and verifies at Tier 1 (ASCII: 2 bytes for "hi").""" result = _verify(""" -private fn get_length(@String -> @Int) +private fn two(@Unit -> @Int) requires(true) - ensures(@Int.result >= 0) + ensures(@Int.result == 2) effects(pure) { - string_length(@String.0) + string_length("hi") } """) assert result.summary.tier1_verified >= 1 assert result.summary.tier3_runtime == 0 - def test_string_length_comparison_tier1(self) -> None: - """string_length in both requires and ensures resolves to Tier 1.""" + def test_string_length_slot_comparison_defers(self) -> None: + """A slot-arg string_length in an ensures comparison defers to Tier 3 + (was an unsound Tier-1 z3.Length proof).""" result = _verify(""" private fn longer_than(@String, @Int -> @Bool) requires(@Int.0 >= 0) @@ -4196,8 +4209,7 @@ def test_string_length_comparison_tier1(self) -> None: string_length(@String.0) > @Int.0 } """) - assert result.summary.tier1_verified >= 2 - assert result.summary.tier3_runtime == 0 + assert result.summary.tier3_runtime >= 1 class TestStringPredicateVerification: @@ -4297,19 +4309,21 @@ def test_refined_string_param_string_predicate_tier1(self) -> None: """RefinedType(STRING) param uses SeqSort — string predicates resolve to Tier 1. Without the RefinedType branch in _is_string_type, the parameter falls through to - declare_int (IntSort) and string_length uses the uninterpreted function, which cannot - prove string_length(@NonEmptyString.0) > 0 even with the requires assumption (Tier 3). - With the fix, z3.Length() is used and Z3 proves the ensures from the requires (Tier 1). + declare_int (IntSort) and the string predicate uses an uninterpreted function, which + cannot prove it even with the requires assumption (Tier 3). With the fix the param + is a SeqSort and Z3's PrefixOf proves the ensures from the requires (Tier 1). Uses + string_starts_with (a Tier-1 predicate); string_length now defers to Tier 3 for + non-literal arguments (#802), so it is no longer the right probe for SeqSort wiring. """ result = _verify(""" -type NonEmptyString = { @String | string_length(@String.0) > 0 }; +type HttpsUrl = { @String | string_starts_with(@String.0, "https://") }; -private fn pass_through(@NonEmptyString -> @Bool) - requires(string_length(@NonEmptyString.0) > 0) +private fn pass_through(@HttpsUrl -> @Bool) + requires(string_starts_with(@HttpsUrl.0, "https://")) ensures(@Bool.result) effects(pure) { - string_length(@NonEmptyString.0) > 0 + string_starts_with(@HttpsUrl.0, "https://") } """) assert result.summary.tier3_runtime == 0 @@ -4405,16 +4419,18 @@ def test_substitutes_binder_with_value(self) -> None: assert z3.is_false(z3.simplify(z3.substitute(result, (v, z3.IntVal(-1))))) def test_string_predicate_with_builtin_call(self) -> None: - """A predicate calling a builtin (`string_length(@String.0) > 0`) + """A predicate calling a builtin (`string_starts_with(@String.0, "h")`) translates with the binder substituted — same surface as a `requires` - clause, so `translate_expr` handles it.""" + clause, so `translate_expr` handles it. (string_length is no longer the + probe here: it defers to Tier 3 for non-literal args (#802), so a + string_length refinement translates to None.)""" import z3 from vera.smt import SmtContext from vera.types import RefinedType, STRING from vera.verifier import ContractVerifier pred = self._predicate_of( - "type NEStr = { @String | string_length(@String.0) > 0 };\n" + 'type HStr = { @String | string_starts_with(@String.0, "h") };\n' ) refined = RefinedType(STRING, pred) smt = SmtContext() @@ -4422,10 +4438,10 @@ def test_string_predicate_with_builtin_call(self) -> None: result = ContractVerifier._translate_refined_predicate(smt, refined, s) assert result is not None assert z3.is_true( - z3.simplify(z3.substitute(result, (s, z3.StringVal("ab")))) + z3.simplify(z3.substitute(result, (s, z3.StringVal("hi")))) ) assert z3.is_false( - z3.simplify(z3.substitute(result, (s, z3.StringVal("")))) + z3.simplify(z3.substitute(result, (s, z3.StringVal("x")))) ) def test_non_primitive_base_is_none(self) -> None: @@ -4561,13 +4577,15 @@ def test_multislot_and_predicate_discharges(self) -> None: assert [d for d in result.diagnostics if d.severity == "error"] == [] assert len(self._refine_obligations(result, "verified")) == 1 - def test_string_length_predicate_discharges(self) -> None: - """A predicate calling a builtin (`string_length(...) > 0`) discharges - a non-empty string literal at a call argument (R8).""" + def test_string_predicate_discharges(self) -> None: + """A predicate calling a builtin (`string_starts_with(...)`) discharges + a matching string literal at a call argument (R8). (string_length now + defers to Tier 3 for non-literal args (#802), so it is no longer the + probe for refinement discharge.)""" result = _verify(""" -type NonEmpty = { @String | string_length(@String.0) > 0 }; +type StartsH = { @String | string_starts_with(@String.0, "h") }; -private fn use(@NonEmpty -> @Int) +private fn use(@StartsH -> @Int) requires(true) ensures(true) effects(pure) { 0 } diff --git a/tests/test_verifier_coverage.py b/tests/test_verifier_coverage.py index 7a3b70f0..9725c371 100644 --- a/tests/test_verifier_coverage.py +++ b/tests/test_verifier_coverage.py @@ -1236,11 +1236,16 @@ def test_string_return_type_typed_at_call_site(self) -> None: """#667 follow-up — `_translate_call_with_info` now declares String-returning calls with a String Z3 var rather than falling back to Int. Pin: the caller's - postcondition `string_length(result) > 0` translates and - the predicate counts as Tier 1. Pre-fix the call's - return var was Int, so `string_length(int_var)` got an - Int-domain length function rather than the String-domain - one, breaking the predicate's typing. + postcondition `string_starts_with(result, "h")` translates + and the predicate counts as Tier 1. Pre-fix the call's + return var was Int, so the string predicate got an + Int-domain function rather than the String-domain one, + breaking the predicate's typing. + + Uses string_starts_with (a Tier-1 string predicate) rather + than string_length, which defers to Tier 3 for non-literal + arguments (#802) and so would warn regardless of the call's + return typing — masking this #667 test. Uses an @Int arg (not @Unit) so the call's argument translates cleanly — UnitLit returns None from @@ -1251,15 +1256,15 @@ def test_string_return_type_typed_at_call_site(self) -> None: result = _verify(""" private fn echo_str(@Int -> @String) requires(true) - ensures(string_length(@String.result) > 0) + ensures(string_starts_with(@String.result, "h")) effects(pure) { "hello" } -public fn use_str(@Int -> @Int) +public fn use_str(@Int -> @Bool) requires(true) - ensures(@Int.result > 0) + ensures(@Bool.result) effects(pure) -{ string_length(echo_str(@Int.0)) } +{ string_starts_with(echo_str(@Int.0), "h") } """) errors = [d for d in result.diagnostics if d.severity == "error"] assert errors == [], f"Expected no errors, got: {errors}" diff --git a/uv.lock b/uv.lock index 9087371d..a857b940 100644 --- a/uv.lock +++ b/uv.lock @@ -1540,7 +1540,7 @@ wheels = [ [[package]] name = "vera" -version = "0.0.181" +version = "0.0.182" source = { editable = "." } dependencies = [ { name = "lark" }, diff --git a/vera/__init__.py b/vera/__init__.py index e493ebaa..b5c6eb3a 100644 --- a/vera/__init__.py +++ b/vera/__init__.py @@ -1,4 +1,4 @@ """Vera: a programming language designed for LLMs.""" -__version__ = "0.0.181" +__version__ = "0.0.182" version = __version__ diff --git a/vera/smt.py b/vera/smt.py index 02a2c57a..60733b93 100644 --- a/vera/smt.py +++ b/vera/smt.py @@ -1053,22 +1053,20 @@ def _translate_call( if call.name == "nat_to_int" and len(call.args) == 1: return self.translate_expr(call.args[0], env) - # Built-in: string_length() — use z3.Length() for String sorts so that - # Z3's string theory gives exact lengths (e.g. for literal arguments at - # call sites). Fall back to an uninterpreted function for other sorts. + # Built-in: string_length() (#802). Vera's runtime string_length counts + # UTF-8 *bytes*, but Z3's Length over z3.String counts Unicode *code + # points* — they disagree on every multibyte character (e.g. "é" is 1 + # code point but 2 bytes), so z3.Length proved false contracts at Tier 1. + # A string LITERAL has a known exact byte length, so model it precisely; + # for any non-literal argument no byte-length operator exists in Z3's + # string theory, so defer to Tier 3 (return None), matching the + # numeric-cast / quantifier / decimal precedent. if call.name == "string_length" and len(call.args) == 1: - arg = self.translate_expr(call.args[0], env) - if arg is not None: - if isinstance(arg.sort(), z3.SeqSortRef): - result = z3.Length(arg) - else: - length_fn = z3.Function( - "string_length", arg.sort(), z3.IntSort(), - ) - result = length_fn(arg) - self.solver.add(result >= 0) - return result - return None # pragma: no cover + arg_node = call.args[0] + if isinstance(arg_node, ast.StringLit): + byte_len = len(arg_node.value.encode("utf-8")) + return z3.IntVal(byte_len) + return None # Built-ins: string_contains / string_starts_with / string_ends_with # Z3's native string theory encodes these exactly. From 0195d9cfdde27dae9c17bebf038f51fe4cb757b2 Mon Sep 17 00:00:00 2001 From: T Date: Sat, 27 Jun 2026 11:39:18 +0100 Subject: [PATCH 2/7] fix(802): also defer astral-plane string literals (review finding) + runtime/edge tests MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adversarial review of #810 found a second Z3-string unsoundness in the same class as #802: Z3's string sort alphabet only reaches U+2FFFF, and the Python binding silently stores any higher code point as the literal's escape string — so string_contains("\u{10FFFF}", "f") proved true at Tier 1 while the runtime returns false. Since this PR closes the #392 smt.py string-soundness audit, fixing it here keeps that close honest. - vera/smt.py: a string literal containing a code point > U+2FFFF defers to Tier 3 (StringVal returns None), so the predicates never match phantom escape bytes. string_length is unaffected (it byte-counts the decoded value, not StringVal), so an astral literal's byte length is still soundly Tier-1. - tests/test_string_length_soundness.py: TestAstralStringLiteral802 (the regression — pre-fix proved the false contract at Tier 1) plus literal-edge coverage (escaped-unicode decode, 4-byte emoji, empty string). - tests/test_codegen.py: a non-ASCII runtime byte-length test (string_length("é") == 2, "😀" == 4) — pins the runtime premise the verifier's byte model relies on, so a codegen regression to code-point counting can't hide behind green verifier-side tests. - spec/06 §6.4.3 + CHANGELOG: note the astral-literal Tier-3 deferral. Refs #802, #392. Co-Authored-By: Claude --- CHANGELOG.md | 2 +- README.md | 2 +- ROADMAP.md | 2 +- TESTING.md | 6 +-- spec/06-contracts.md | 2 +- tests/test_codegen.py | 20 ++++++++ tests/test_string_length_soundness.py | 70 ++++++++++++++++++++++++++- vera/smt.py | 11 +++++ 8 files changed, 107 insertions(+), 8 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index cae1a53a..86c2574f 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -10,7 +10,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). ### Fixed -- **Verifier soundness: `string_length` no longer proves a false length for non-ASCII strings** (final sub-task of the [#392](https://github.com/aallan/vera/issues/392) `smt.py` soundness audit). `vera verify` translated `string_length` to Z3's `Length`, which counts Unicode **code points**, but Vera's runtime counts **UTF-8 bytes** — so `ensures(string_length("é") == 1)` verified at Tier 1 while the runtime returns `2`. `string_length` is now modeled at Tier 1 only for a string **literal** (whose exact byte length is known); on any non-literal argument it defers to a runtime-guarded **Tier 3** obligation (Z3's string theory has no byte-length operator), matching the numeric-cast / quantifier / decimal precedent. The boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay Tier 1 — UTF-8 is self-synchronizing, so a valid substring / prefix / suffix matches at the byte level exactly when it matches at the code-point level. This was the last of the six soundness bugs ([#797](https://github.com/aallan/vera/issues/797)–[#802](https://github.com/aallan/vera/issues/802)) found by the [#392](https://github.com/aallan/vera/issues/392) audit, which is now complete. ([#802](https://github.com/aallan/vera/issues/802)) +- **Verifier soundness: `string_length` no longer proves a false length for non-ASCII strings** (final sub-task of the [#392](https://github.com/aallan/vera/issues/392) `smt.py` soundness audit). `vera verify` translated `string_length` to Z3's `Length`, which counts Unicode **code points**, but Vera's runtime counts **UTF-8 bytes** — so `ensures(string_length("é") == 1)` verified at Tier 1 while the runtime returns `2`. `string_length` is now modeled at Tier 1 only for a string **literal** (whose exact byte length is known); on any non-literal argument it defers to a runtime-guarded **Tier 3** obligation (Z3's string theory has no byte-length operator), matching the numeric-cast / quantifier / decimal precedent. The boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay Tier 1 — UTF-8 is self-synchronizing, so a valid substring / prefix / suffix matches at the byte level exactly when it matches at the code-point level — with one further guard: a string literal containing a code point above Z3's string-sort alphabet (> U+2FFFF) also defers to Tier 3, because the Z3 Python binding would otherwise store it as escape text and let the predicates match phantom bytes. This was the last of the six soundness bugs ([#797](https://github.com/aallan/vera/issues/797)–[#802](https://github.com/aallan/vera/issues/802)) found by the [#392](https://github.com/aallan/vera/issues/392) audit, which is now complete. ([#802](https://github.com/aallan/vera/issues/802)) ## [0.0.181] - 2026-06-27 diff --git a/README.md b/README.md index f5160fc9..2ccca6aa 100644 --- a/README.md +++ b/README.md @@ -215,7 +215,7 @@ cp /path/to/vera/SKILL.md ~/.claude/skills/vera-language/SKILL.md ## Project status -Vera is in **active development** at v0.0.182 — 1,400+ commits, 182 releases, 5,164 tests, 91% code coverage, 93 conformance programs, 35 examples, and a 13-chapter specification. See **[HISTORY.md](HISTORY.md)** for how the compiler was built. +Vera is in **active development** at v0.0.182 — 1,400+ commits, 182 releases, 5,170 tests, 91% code coverage, 93 conformance programs, 35 examples, and a 13-chapter specification. See **[HISTORY.md](HISTORY.md)** for how the compiler was built. The reference compiler — parser, AST, type checker, contract verifier (Z3), WASM code generator, module system, browser runtime, and runtime contract insertion — is working. The language specification is in draft across [13 chapters](spec/). diff --git a/ROADMAP.md b/ROADMAP.md index 63953d34..de442c94 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -8,7 +8,7 @@ Priority lives in this file and nowhere else — issues carry kind and area labe ## Where we are -5,164 tests, 93 conformance programs, 35 examples, 13 spec chapters. +5,170 tests, 93 conformance programs, 35 examples, 13 spec chapters. ## The roadmap diff --git a/TESTING.md b/TESTING.md index a64bdabc..50e03457 100644 --- a/TESTING.md +++ b/TESTING.md @@ -6,7 +6,7 @@ This is the single source of truth for Vera's testing infrastructure, coverage d | Metric | Value | |--------|-------| -| **Tests** | 5,164 across 45 files (~63,000 lines of test code; 5,132 passed + 16 stress, 16 skipped) | +| **Tests** | 5,170 across 45 files (~63,000 lines of test code; 5,138 passed + 16 stress, 16 skipped) | | **Compiler code coverage** | 95% Python, 61% JavaScript — 91% combined (CI minimum: 80%) | | **Conformance programs** | 93 programs across 9 spec chapters, validating every language feature | | **Example programs** | 35, all validated through `vera check` + `vera verify` | @@ -65,7 +65,7 @@ python scripts/fix_allowlists.py --fix # auto-fix stale allowlists | `test_int_overflow_codegen.py` | 50 | 567 | #798 Stage 3 — runtime overflow-trap codegen: the codegen emits a guard at *exactly* the `@Int`/`@Nat` `+`/`-`/`*` sites the verifier obligates, so `vera run`/`vera compile` programs trap on overflow instead of silently wrapping at the i64/u64 boundary. The trap is a bare `unreachable` (Option A — matches the #520 `nat_sub` and #552 nat-bind precedent), classified `kind="unreachable"` today (a precise `"overflow"` trap kind via a host import is a follow-up). Test-first: every `*_traps` test fails on the pre-Stage-3 codegen (the op wraps silently → no trap → `execute` returns a value); every `*_no_trap` test passes before and after (safe arithmetic unchanged) | | `test_int_overflow_differential.py` | 143 | 364 | #798 Stage 3 verifier↔codegen classification differential (cross-component soundness rule): the codegen overflow guard must fire at exactly the sites the verifier obligates *and* classify each site's operand type (`@Int` i64 vs `@Nat` u64) identically — else a Tier-1-clean program traps spuriously or a wrapping op slips through unguarded. Over a corpus exercising all five operand combos plus the literal-left ambiguity (which the pre-fix codegen mis-classified as `@Nat`), asserts the verifier's per-site gated classification equals the codegen's site for site, both sides driven by the same `ast.span_key` | | `test_monomorphize_differential.py` | 13 | 682 | #732 differential soundness: the verifier's per-monomorphization instantiation discovery covers every instantiation codegen emits (name coverage + per-generic count), over real generic programs (conformance ch02/ch09, `examples/generics.vera`) plus inline cases for the soundness-critical scenarios — collapsed type vars, **prelude combinator emission** (`option_map`), transitive generics, a generic whose type arg is fixed only by a **where-helper's return** (a `Float64`-returning helper, so the unresolved-var `"Bool"` phantom default cannot mask a miss), a generic whose type arg is fixed only by an **imported constructor** (`id2(MkBox(7))` — the verifier's mono-context must include `_module_constructors`, else it phantom-defaults and misses codegen's `id2`), a generic whose type arg is fixed only by an **imported function's return** (`id_g(make_int(...))` — the verifier's mono-context must seed `fn_ret_types` from imported functions, else it phantom-defaults and misses codegen's `id_g`, plus a **private-shadow** case pinning the imported-fn seeding stays unfiltered like codegen since filtering would diverge into a false Tier-1), and a generic reached only through a **contract clause or `where` helper** (codegen must seed Pass 1.5 from the shared node-level walk, not just `decl.body`, or it skips the clone → `CodegenSkip` at run time) — so a missed instantiation (a false Tier-1) is caught. Guards against a vacuous pass when codegen emits nothing, plus a **determinism guard** (`vera compile --wat` is byte-stable across `PYTHONHASHSEED` — the mono worklist sorts its instantiation sets) | -| `test_codegen.py` | 1,213 | 21,107 | WASM compilation, arithmetic, Float64, Byte, arrays (incl. compound element types), ADTs, match (incl. nested patterns), generics, State\, Exn\ handlers, control flow, strings, string escape sequences, IO (read\_line, read\_file, write\_file, args, exit, get\_env, sleep, time, stderr), bounds checking, quantifiers, assert/assume, refinement type aliases, pipe operator, string built-ins, built-in shadowing, parse\_nat Result, GC, Markdown host bindings, Regex host bindings, Map collection, Set collection, Decimal type, Json type, Html type, Http effect, Inference effect, Random effect, example round-trips, GC shadow stack overflow, **WASM tail-call optimization** (#517 — `return_call` emission for tail-position calls, 50K- and 1M-iteration stress, structural assertions on `return_call`/plain `call` boundary, **GC-aware TCO for allocating fns (#549 — `$gc_sp` restore before each `return_call`)**, postcondition-fallback regression (still reverts to plain `call`), analyzer unit tests covering Block-trailing / IfExpr-both-branches / MatchExpr-arm-bodies / let-value-NOT-marked / call-args-NOT-marked / ExprStmt-statement-NOT-marked / IfExpr-condition-NOT-marked / MatchExpr-scrutinee-NOT-marked), **@Nat narrowing runtime guards** (#552 let site; #747 extends to tuple-destructure / top-level match-bind / ADT sub-pattern / concrete ctor-field / call-arg sites — `i64.lt_s; unreachable` net, @Int targets exempt), **refinement-predicate runtime guards** (#746 — primitive-base (incl. `@Bool` / `@Float64`) refined params/returns traverse a `$vera.contract_fail` predicate guard at the function boundary: violating value traps, valid passes, call-args guarded transitively, generic refined returns guarded on the monomorphised instance; non-primitive `@Array` refinements (`array_length(...) > 0`) guarded too, the alias-aware `@Nat` base conjoins its implicit `>= 0`, and erased `@Unit` refinements emit no guard; **tuple-component boundary guards** decompose a `Tuple` param/return at the FFI boundary — loading each refined / `@Nat` component from the heap value and guarding it recursively for nested tuples, so a violating component traps instead of laundering past the verifier's component assumption; a **generic tuple alias** (`type Box = Tuple`) substitutes its argument so `Box` still guards its component, and a mutually-recursive (infinite) tuple alias **fails closed with E617** rather than silently emitting partial guards; a **refinement OVER a tuple** (`type Pair = { @Tuple \| true }`) is unwrapped so its components are guarded too, recursively through nested `Tuple`) | +| `test_codegen.py` | 1,214 | 21,127 | WASM compilation, arithmetic, Float64, Byte, arrays (incl. compound element types), ADTs, match (incl. nested patterns), generics, State\, Exn\ handlers, control flow, strings, string escape sequences, IO (read\_line, read\_file, write\_file, args, exit, get\_env, sleep, time, stderr), bounds checking, quantifiers, assert/assume, refinement type aliases, pipe operator, string built-ins, built-in shadowing, parse\_nat Result, GC, Markdown host bindings, Regex host bindings, Map collection, Set collection, Decimal type, Json type, Html type, Http effect, Inference effect, Random effect, example round-trips, GC shadow stack overflow, **WASM tail-call optimization** (#517 — `return_call` emission for tail-position calls, 50K- and 1M-iteration stress, structural assertions on `return_call`/plain `call` boundary, **GC-aware TCO for allocating fns (#549 — `$gc_sp` restore before each `return_call`)**, postcondition-fallback regression (still reverts to plain `call`), analyzer unit tests covering Block-trailing / IfExpr-both-branches / MatchExpr-arm-bodies / let-value-NOT-marked / call-args-NOT-marked / ExprStmt-statement-NOT-marked / IfExpr-condition-NOT-marked / MatchExpr-scrutinee-NOT-marked), **@Nat narrowing runtime guards** (#552 let site; #747 extends to tuple-destructure / top-level match-bind / ADT sub-pattern / concrete ctor-field / call-arg sites — `i64.lt_s; unreachable` net, @Int targets exempt), **refinement-predicate runtime guards** (#746 — primitive-base (incl. `@Bool` / `@Float64`) refined params/returns traverse a `$vera.contract_fail` predicate guard at the function boundary: violating value traps, valid passes, call-args guarded transitively, generic refined returns guarded on the monomorphised instance; non-primitive `@Array` refinements (`array_length(...) > 0`) guarded too, the alias-aware `@Nat` base conjoins its implicit `>= 0`, and erased `@Unit` refinements emit no guard; **tuple-component boundary guards** decompose a `Tuple` param/return at the FFI boundary — loading each refined / `@Nat` component from the heap value and guarding it recursively for nested tuples, so a violating component traps instead of laundering past the verifier's component assumption; a **generic tuple alias** (`type Box = Tuple`) substitutes its argument so `Box` still guards its component, and a mutually-recursive (infinite) tuple alias **fails closed with E617** rather than silently emitting partial guards; a **refinement OVER a tuple** (`type Pair = { @Tuple \| true }`) is unwrapped so its components are guarded too, recursively through nested `Tuple`) | | `test_codegen_contracts.py` | 32 | 570 | Runtime pre/postconditions, contract fail messages, old/new state postconditions | | `test_codegen_monomorphize.py` | 71 | 1,320 | Generic instantiation, type inference, monomorphization edge cases, ability constraint satisfaction (Eq/Ord/Hash/Show), operation rewriting (eq/compare), show/hash dispatch, ADT auto-derivation, array operations (slice/map/filter/fold) | | `test_codegen_closures.py` | 50 | 1,618 | Closure lifting, captured variables, higher-order functions, iterative-builder shadow-stack regressions (#570), closure return-value shadow-push balance for both i32-pair and i32-ADT branches across array_map and array_mapi, plus VERA_EAGER_GC injection self-test (#593), IndexExpr-of-FnCall element-type inference (#614), non-contiguous capture and walker-order miscompiles (#615) | @@ -75,7 +75,7 @@ python scripts/fix_allowlists.py --fix # auto-fix stale allowlists | `test_walker_defensive_branches_597.py` | 21 | 296 | Synthetic-AST tests for the 11 defensive `isinstance` branches added by #597 (`_scan_io_ops` / `_scan_expr_for_handlers` / `_infer_expr_wasm_type` / `_infer_vera_type`) plus the 5 pr-review fixes (#2/#3/#8 — ModuleCall/AnonFn/QualifiedCall return None; dead `is not None` guards on Block/HandleExpr removed) | | `test_check_walker_coverage_597.py` | 15 | 311 | Unit tests for `scripts/check_walker_coverage.py` parsing logic — Expr subclass extraction, isinstance flattening (incl. tuple form), checklist-block anchoring (incl. CR-3 regression test: `# Foo → bar` outside WALKER_COVERAGE block not counted), section-header tolerance, auto-discovery invariants, end-to-end main exit code | | `test_stress.py` | 16 | 553 | Scale-dependent regression tests (#596) — `@pytest.mark.stress`, skipped by default. 9 logical tests × eager-GC lane parametrisation = 16 test instances. 10K `array_map`, 5K nested-array `array_map`, 1K-deep tail recursion with allocating arg, 1M-deep tail recursion with allocating arg (#549 GC-aware TCO), 20×20 nested array-fold-of-array-fold, 100K `array_fold`, 10K String allocations, 1K `State` get/put cycles, 10K `IO.print` calls. Pins #570 / #515 / #593 / #549 / #487 / #348 / #573 regression coverage | -| `test_string_length_soundness.py` | 8 | 149 | #802 — string_length code-point vs UTF-8 byte soundness: a non-literal `string_length` defers to Tier 3 (the issue's `"é"` probe no longer proves `== 1` at Tier 1), a string-literal length is modeled at its exact UTF-8 byte count (`== 2` for `"é"`), and the boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay Tier 1 (sound under UTF-8 self-synchronization) | +| `test_string_length_soundness.py` | 13 | 217 | #802 — string_length code-point vs UTF-8 byte soundness: a non-literal `string_length` defers to Tier 3 (the issue's `"é"` probe no longer proves `== 1` at Tier 1), a string-literal length is modeled at its exact UTF-8 byte count (`== 2` for `"é"`), and the boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay Tier 1 (sound under UTF-8 self-synchronization) | | `test_errors.py` | 52 | 525 | Error code registry, diagnostic formatting, serialisation, SourceLocation, error display sync (README/HTML/spec) | | `test_formatter.py` | 124 | 1,074 | Comment extraction, interior comment positioning, expression/declaration formatting, match arm block bodies, idempotency, parenthesization, spec rules, ability declarations | | `test_cli.py` | 242 | 3,504 | CLI commands (check, verify, compile, run, test, fmt, version, quiet), subprocess integration, JSON error paths, runtime traps, arg validation, multi-file resolution, IO exit codes, --explain-slots, `builtins`/`effects`/`errors` introspection dispatch | diff --git a/spec/06-contracts.md b/spec/06-contracts.md index 60cc420c..35309b28 100644 --- a/spec/06-contracts.md +++ b/spec/06-contracts.md @@ -254,7 +254,7 @@ The `@Nat` obligations (E502 / E503) carry the most nuance, spanning many bindin **Integer overflow** ([#798](https://github.com/aallan/vera/issues/798)). `@Int` is a signed 64-bit machine integer and `@Nat` an unsigned one; `+` / `-` / `*` wrap at the i64 / u64 boundary. Like `@Nat` underflow and signed-division `MIN / -1`, an overflowing operation is a *partial* operation that **traps** at runtime rather than silently wrapping, so each `@Int` / `@Nat` `+` / `-` / `*` carries an obligation that the result stays in range. It is classified at the operands' **common (coerced) type** — `@Int` if either operand is `@Int` (since `@Nat <: @Int`), else `@Nat` — not one operand's self-type (a non-negative literal is `@Nat`, but `5 + @Int.0` is an i64 add) nor the possibly-narrowed result type (an `@Int.0 + 1` stored into a `@Nat` slot is still an i64 add). A two-check mirrors array indexing: the result provably in range → **Tier 1**; provably out of range (e.g. a literal `u64.MAX + 1`, or `@Int.0 + 1` under `requires(@Int.0 == i64.MAX)`) → a compile error (**E528**); otherwise — dynamic operands — a runtime-guarded **Tier 3** trap. `@Nat` subtraction is excluded — it is the underflow obligation (E502) above, never a high-overflow. -**String length** ([#802](https://github.com/aallan/vera/issues/802)). Vera strings are UTF-8 byte sequences and `string_length` returns the **byte** count, but Z3's string theory (SMT-LIB 2.6) models strings as sequences of Unicode **code points** — its `Length` counts code points, which disagrees with the runtime on every multibyte character (`string_length("é")` is `2`, not `1`). `string_length` is therefore modeled at **Tier 1** only for a string **literal**, whose exact byte length is known; on any non-literal argument it defers to a runtime-guarded **Tier 3** obligation (Z3's string theory has no byte-length operator). The boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay **Tier 1**: UTF-8 is self-synchronizing, so a valid substring / prefix / suffix matches at the byte level exactly when it matches at the code-point level. The other byte/offset-sensitive string builtins (`string_slice`, `string_index_of`, `string_char_code`, `string_chars`) are not translated to Z3 and already fall to **Tier 3**. +**String length** ([#802](https://github.com/aallan/vera/issues/802)). Vera strings are UTF-8 byte sequences and `string_length` returns the **byte** count, but Z3's string theory (SMT-LIB 2.6) models strings as sequences of Unicode **code points** — its `Length` counts code points, which disagrees with the runtime on every multibyte character (`string_length("é")` is `2`, not `1`). `string_length` is therefore modeled at **Tier 1** only for a string **literal**, whose exact byte length is known; on any non-literal argument it defers to a runtime-guarded **Tier 3** obligation (Z3's string theory has no byte-length operator). The boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay **Tier 1**: UTF-8 is self-synchronizing, so a valid substring / prefix / suffix matches at the byte level exactly when it matches at the code-point level. The other byte/offset-sensitive string builtins (`string_slice`, `string_index_of`, `string_char_code`, `string_chars`) are not translated to Z3 and already fall to **Tier 3**. One further deferral keeps the predicates honest: Z3's string-sort alphabet only reaches U+2FFFF, and its Python binding silently stores any higher code point as the literal's *escape text* rather than the character — so a **string literal containing a code point above U+2FFFF defers to Tier 3** (it cannot be faithfully modeled), instead of letting a predicate match phantom escape bytes the runtime never sees. Runtime traps for unguarded primitives are Vera-native: each trap carries a kind label (`divide_by_zero`, `out_of_bounds`, etc.), a per-kind Fix paragraph naming the precondition that would have prevented it, and a source backtrace — so a missing static guarantee is still a recoverable signal. diff --git a/tests/test_codegen.py b/tests/test_codegen.py index 1f97c7d8..13da7fcb 100644 --- a/tests/test_codegen.py +++ b/tests/test_codegen.py @@ -1203,6 +1203,26 @@ def test_string_length_with_escapes(self) -> None: ''' assert _run(source, fn="len") == 3 + def test_string_length_non_ascii_counts_utf8_bytes(self) -> None: + """#802: string_length counts UTF-8 BYTES at runtime — "é" (U+00E9) is + 2 bytes and "😀" (U+1F600) is 4 bytes, each a single code point. This + pins the runtime premise the verifier's literal byte-model relies on: + if codegen regressed to code-point counting, the verifier-side + soundness tests (test_string_length_soundness.py) would stay green but + this would catch it.""" + two = r''' +public fn len(@Unit -> @Nat) + requires(true) ensures(true) effects(pure) +{ string_length("\u{e9}") } +''' + assert _run(two, fn="len") == 2 + four = r''' +public fn len(@Unit -> @Nat) + requires(true) ensures(true) effects(pure) +{ string_length("\u{1F600}") } +''' + assert _run(four, fn="len") == 4 + # ===================================================================== # Bool comparison codegen (i32 path) diff --git a/tests/test_string_length_soundness.py b/tests/test_string_length_soundness.py index cb4b6372..96b87fd9 100644 --- a/tests/test_string_length_soundness.py +++ b/tests/test_string_length_soundness.py @@ -107,6 +107,35 @@ def test_ascii_literal_byte_length_still_proved(self) -> None: public fn f(@Unit -> @Int) requires(true) ensures(@Int.result == 2) effects(pure) { string_length("ab") } +""") + assert _ok(ok) and ok.summary.tier1_verified >= 1 + + def test_escaped_unicode_literal_byte_length(self) -> None: + # The byte count comes from the DECODED literal value, not the raw source + # text: "\\u{e9}" decodes to é (2 UTF-8 bytes), not its 6 source chars. + # A naive raw-source-length model would (wrongly) prove == 6. + ok = _verify(""" +public fn f(@Unit -> @Int) + requires(true) ensures(@Int.result == 2) effects(pure) +{ string_length("\\u{e9}") } +""") + assert _ok(ok) and ok.summary.tier1_verified >= 1 + + def test_four_byte_emoji_literal_byte_length(self) -> None: + # A 4-byte UTF-8 character (U+1F600) — code-point count 1, byte count 4, + # the case where code points and bytes diverge most. + ok = _verify(""" +public fn f(@Unit -> @Int) + requires(true) ensures(@Int.result == 4) effects(pure) +{ string_length("\\u{1F600}") } +""") + assert _ok(ok) and ok.summary.tier1_verified >= 1 + + def test_empty_string_literal_byte_length(self) -> None: + ok = _verify(""" +public fn f(@Unit -> @Int) + requires(true) ensures(@Int.result == 0) effects(pure) +{ string_length("") } """) assert _ok(ok) and ok.summary.tier1_verified >= 1 @@ -116,7 +145,8 @@ class TestStringPredicateSoundness802: self-synchronization a valid-UTF-8 needle matches at the byte level iff at the code-point level, so Z3's Contains/PrefixOf/SuffixOf stay sound on non-ASCII input — they remain Tier-1 (this is the verifier side; the - end-to-end verify↔run agreement is pinned in test_string_length_codegen).""" + end-to-end runtime byte semantics are pinned by the non-ASCII string_length + runtime tests in tests/test_codegen.py).""" def test_starts_with_non_ascii_tier1(self) -> None: result = _verify(""" @@ -147,3 +177,41 @@ def test_contains_non_ascii_tier1(self) -> None: { string_contains(@String.0, "afé") } """) assert _ok(result) and result.summary.tier3_runtime == 0 + + +class TestAstralStringLiteral802: + """Z3's string sort alphabet is U+0000..U+2FFFF. Above that the Python + binding's `z3.StringVal` silently stores the literal's *escape string* + instead of the character, so a predicate over such a literal could prove a + false contract (`string_contains("\\u{10FFFF}", "f")` — the astral char has + no `f` byte, yet the phantom escape string does). Such literals defer to + Tier 3 (smt.py returns None for them), so the verifier never falsely proves + over them. `string_length` is unaffected — it byte-counts the decoded value, + not `z3.StringVal` (covered in TestStringLengthSoundness802).""" + + def test_astral_predicate_not_proved_at_tier1(self) -> None: + # Pre-fix the verifier PROVED this false contract at Tier 1 (the phantom + # escape string matched "f"); now the astral literal defers, so nothing + # about it is a Tier-1 proof — no false proof. (U+10FFFF as bytes + # f4 8f bf bf contains no "f"; the runtime returns false.) + result = _verify(""" +public fn check(@String -> @Bool) + requires(@String.0 == "\\u{10FFFF}") + ensures(@Bool.result == true) + effects(pure) +{ string_contains(@String.0, "f") } +""") + assert result.summary.tier1_verified == 0, ( + result.summary.tier1_verified, result.summary.tier3_runtime, + ) + + def test_astral_string_length_still_byte_modeled(self) -> None: + # string_length bypasses z3.StringVal (it byte-counts the decoded value), + # so even an astral literal's byte length is soundly Tier-1: U+10FFFF is + # 4 UTF-8 bytes. + ok = _verify(""" +public fn f(@Unit -> @Int) + requires(true) ensures(@Int.result == 4) effects(pure) +{ string_length("\\u{10FFFF}") } +""") + assert _ok(ok) and ok.summary.tier1_verified >= 1 diff --git a/vera/smt.py b/vera/smt.py index 60733b93..1709cb93 100644 --- a/vera/smt.py +++ b/vera/smt.py @@ -572,6 +572,17 @@ def translate_expr( return z3.BoolVal(expr.value) if isinstance(expr, ast.StringLit): + # #802: Z3's string sort alphabet is U+0000..U+2FFFF. For any code + # point above that, the Python binding's z3.StringVal silently stores + # the *escape string* (the literal "\U000xxxxx" ASCII characters) + # instead of the character — so Contains/PrefixOf/SuffixOf over such + # a literal match phantom ASCII bytes the byte-level runtime never + # sees, proving false contracts at Tier 1. Defer the literal to + # Tier 3 (return None) rather than model it unsoundly. (string_length + # is unaffected: it models a literal via Python's UTF-8 byte count, + # not z3.StringVal.) + if any(ord(ch) > 0x2FFFF for ch in expr.value): + return None return z3.StringVal(expr.value) if isinstance(expr, ast.FloatLit): From 2961d91f2b929071d5ab6d1827914581a8bca0d6 Mon Sep 17 00:00:00 2001 From: T Date: Sat, 27 Jun 2026 12:08:10 +0100 Subject: [PATCH 3/7] =?UTF-8?q?fix(802):=20guard=20string=20literals=20Z3?= =?UTF-8?q?=20can't=20model=20=E2=80=94=20surrogate=20crash=20+=20cutoff/a?= =?UTF-8?q?stral=20tests=20(review)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit CodeRabbit review of #810: - vera/smt.py: a lone surrogate (U+D800..U+DFFF) is not UTF-8-encodable, so the new string_length literal byte-model (value.encode) AND z3.StringVal both crashed `vera verify` with UnicodeEncodeError. Guard both: string_length catches UnicodeEncodeError -> Tier 3; the StringVal literal-deferral now covers surrogates alongside the > U+2FFFF astral cutoff. - tests/test_string_length_soundness.py: a surrogate regression test (no crash, defers), an exact-cutoff boundary test (\u{2FFFF} Tier-1 vs \u{30000} Tier-3, pinning > vs >=), and the astral predicate test reworked to the literal-on-literal form so it asserts the clean Tier-3 fallback (_ok + tier3 >= 1). - spec/06 section 6.4.3 + CHANGELOG: note the surrogate deferral. Note: CR's finding 1 suggested a string_length boundary test, but string_length byte-counts via Python and does not exercise the z3.StringVal cutoff, so the boundary test is on a predicate instead. Refs #802, #392. Co-Authored-By: Claude --- CHANGELOG.md | 2 +- README.md | 2 +- ROADMAP.md | 2 +- TESTING.md | 4 +- spec/06-contracts.md | 2 +- tests/test_string_length_soundness.py | 66 +++++++++++++++++++++++---- vera/smt.py | 31 ++++++++----- 7 files changed, 83 insertions(+), 26 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 86c2574f..0f7e0752 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -10,7 +10,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). ### Fixed -- **Verifier soundness: `string_length` no longer proves a false length for non-ASCII strings** (final sub-task of the [#392](https://github.com/aallan/vera/issues/392) `smt.py` soundness audit). `vera verify` translated `string_length` to Z3's `Length`, which counts Unicode **code points**, but Vera's runtime counts **UTF-8 bytes** — so `ensures(string_length("é") == 1)` verified at Tier 1 while the runtime returns `2`. `string_length` is now modeled at Tier 1 only for a string **literal** (whose exact byte length is known); on any non-literal argument it defers to a runtime-guarded **Tier 3** obligation (Z3's string theory has no byte-length operator), matching the numeric-cast / quantifier / decimal precedent. The boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay Tier 1 — UTF-8 is self-synchronizing, so a valid substring / prefix / suffix matches at the byte level exactly when it matches at the code-point level — with one further guard: a string literal containing a code point above Z3's string-sort alphabet (> U+2FFFF) also defers to Tier 3, because the Z3 Python binding would otherwise store it as escape text and let the predicates match phantom bytes. This was the last of the six soundness bugs ([#797](https://github.com/aallan/vera/issues/797)–[#802](https://github.com/aallan/vera/issues/802)) found by the [#392](https://github.com/aallan/vera/issues/392) audit, which is now complete. ([#802](https://github.com/aallan/vera/issues/802)) +- **Verifier soundness: `string_length` no longer proves a false length for non-ASCII strings** (final sub-task of the [#392](https://github.com/aallan/vera/issues/392) `smt.py` soundness audit). `vera verify` translated `string_length` to Z3's `Length`, which counts Unicode **code points**, but Vera's runtime counts **UTF-8 bytes** — so `ensures(string_length("é") == 1)` verified at Tier 1 while the runtime returns `2`. `string_length` is now modeled at Tier 1 only for a string **literal** (whose exact byte length is known); on any non-literal argument it defers to a runtime-guarded **Tier 3** obligation (Z3's string theory has no byte-length operator), matching the numeric-cast / quantifier / decimal precedent. The boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay Tier 1 — UTF-8 is self-synchronizing, so a valid substring / prefix / suffix matches at the byte level exactly when it matches at the code-point level — with one further guard: a string literal containing a code point above Z3's string-sort alphabet (> U+2FFFF), or a lone surrogate (U+D800–U+DFFF) with no UTF-8 encoding, also defers to Tier 3 — the Z3 Python binding would otherwise store it as escape text (or raise) and let the predicates match phantom bytes. This was the last of the six soundness bugs ([#797](https://github.com/aallan/vera/issues/797)–[#802](https://github.com/aallan/vera/issues/802)) found by the [#392](https://github.com/aallan/vera/issues/392) audit, which is now complete. ([#802](https://github.com/aallan/vera/issues/802)) ## [0.0.181] - 2026-06-27 diff --git a/README.md b/README.md index 2ccca6aa..6ce76fa8 100644 --- a/README.md +++ b/README.md @@ -215,7 +215,7 @@ cp /path/to/vera/SKILL.md ~/.claude/skills/vera-language/SKILL.md ## Project status -Vera is in **active development** at v0.0.182 — 1,400+ commits, 182 releases, 5,170 tests, 91% code coverage, 93 conformance programs, 35 examples, and a 13-chapter specification. See **[HISTORY.md](HISTORY.md)** for how the compiler was built. +Vera is in **active development** at v0.0.182 — 1,400+ commits, 182 releases, 5,172 tests, 91% code coverage, 93 conformance programs, 35 examples, and a 13-chapter specification. See **[HISTORY.md](HISTORY.md)** for how the compiler was built. The reference compiler — parser, AST, type checker, contract verifier (Z3), WASM code generator, module system, browser runtime, and runtime contract insertion — is working. The language specification is in draft across [13 chapters](spec/). diff --git a/ROADMAP.md b/ROADMAP.md index de442c94..1cf51faf 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -8,7 +8,7 @@ Priority lives in this file and nowhere else — issues carry kind and area labe ## Where we are -5,170 tests, 93 conformance programs, 35 examples, 13 spec chapters. +5,172 tests, 93 conformance programs, 35 examples, 13 spec chapters. ## The roadmap diff --git a/TESTING.md b/TESTING.md index 50e03457..7dbed6e3 100644 --- a/TESTING.md +++ b/TESTING.md @@ -6,7 +6,7 @@ This is the single source of truth for Vera's testing infrastructure, coverage d | Metric | Value | |--------|-------| -| **Tests** | 5,170 across 45 files (~63,000 lines of test code; 5,138 passed + 16 stress, 16 skipped) | +| **Tests** | 5,172 across 45 files (~63,000 lines of test code; 5,140 passed + 16 stress, 16 skipped) | | **Compiler code coverage** | 95% Python, 61% JavaScript — 91% combined (CI minimum: 80%) | | **Conformance programs** | 93 programs across 9 spec chapters, validating every language feature | | **Example programs** | 35, all validated through `vera check` + `vera verify` | @@ -75,7 +75,7 @@ python scripts/fix_allowlists.py --fix # auto-fix stale allowlists | `test_walker_defensive_branches_597.py` | 21 | 296 | Synthetic-AST tests for the 11 defensive `isinstance` branches added by #597 (`_scan_io_ops` / `_scan_expr_for_handlers` / `_infer_expr_wasm_type` / `_infer_vera_type`) plus the 5 pr-review fixes (#2/#3/#8 — ModuleCall/AnonFn/QualifiedCall return None; dead `is not None` guards on Block/HandleExpr removed) | | `test_check_walker_coverage_597.py` | 15 | 311 | Unit tests for `scripts/check_walker_coverage.py` parsing logic — Expr subclass extraction, isinstance flattening (incl. tuple form), checklist-block anchoring (incl. CR-3 regression test: `# Foo → bar` outside WALKER_COVERAGE block not counted), section-header tolerance, auto-discovery invariants, end-to-end main exit code | | `test_stress.py` | 16 | 553 | Scale-dependent regression tests (#596) — `@pytest.mark.stress`, skipped by default. 9 logical tests × eager-GC lane parametrisation = 16 test instances. 10K `array_map`, 5K nested-array `array_map`, 1K-deep tail recursion with allocating arg, 1M-deep tail recursion with allocating arg (#549 GC-aware TCO), 20×20 nested array-fold-of-array-fold, 100K `array_fold`, 10K String allocations, 1K `State` get/put cycles, 10K `IO.print` calls. Pins #570 / #515 / #593 / #549 / #487 / #348 / #573 regression coverage | -| `test_string_length_soundness.py` | 13 | 217 | #802 — string_length code-point vs UTF-8 byte soundness: a non-literal `string_length` defers to Tier 3 (the issue's `"é"` probe no longer proves `== 1` at Tier 1), a string-literal length is modeled at its exact UTF-8 byte count (`== 2` for `"é"`), and the boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay Tier 1 (sound under UTF-8 self-synchronization) | +| `test_string_length_soundness.py` | 15 | 265 | #802 — string_length code-point vs UTF-8 byte soundness: a non-literal `string_length` defers to Tier 3 (the issue's `"é"` probe no longer proves `== 1` at Tier 1), a string-literal length is modeled at its exact UTF-8 byte count (`== 2` for `"é"`), and the boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay Tier 1 (sound under UTF-8 self-synchronization) | | `test_errors.py` | 52 | 525 | Error code registry, diagnostic formatting, serialisation, SourceLocation, error display sync (README/HTML/spec) | | `test_formatter.py` | 124 | 1,074 | Comment extraction, interior comment positioning, expression/declaration formatting, match arm block bodies, idempotency, parenthesization, spec rules, ability declarations | | `test_cli.py` | 242 | 3,504 | CLI commands (check, verify, compile, run, test, fmt, version, quiet), subprocess integration, JSON error paths, runtime traps, arg validation, multi-file resolution, IO exit codes, --explain-slots, `builtins`/`effects`/`errors` introspection dispatch | diff --git a/spec/06-contracts.md b/spec/06-contracts.md index 35309b28..7193c0ab 100644 --- a/spec/06-contracts.md +++ b/spec/06-contracts.md @@ -254,7 +254,7 @@ The `@Nat` obligations (E502 / E503) carry the most nuance, spanning many bindin **Integer overflow** ([#798](https://github.com/aallan/vera/issues/798)). `@Int` is a signed 64-bit machine integer and `@Nat` an unsigned one; `+` / `-` / `*` wrap at the i64 / u64 boundary. Like `@Nat` underflow and signed-division `MIN / -1`, an overflowing operation is a *partial* operation that **traps** at runtime rather than silently wrapping, so each `@Int` / `@Nat` `+` / `-` / `*` carries an obligation that the result stays in range. It is classified at the operands' **common (coerced) type** — `@Int` if either operand is `@Int` (since `@Nat <: @Int`), else `@Nat` — not one operand's self-type (a non-negative literal is `@Nat`, but `5 + @Int.0` is an i64 add) nor the possibly-narrowed result type (an `@Int.0 + 1` stored into a `@Nat` slot is still an i64 add). A two-check mirrors array indexing: the result provably in range → **Tier 1**; provably out of range (e.g. a literal `u64.MAX + 1`, or `@Int.0 + 1` under `requires(@Int.0 == i64.MAX)`) → a compile error (**E528**); otherwise — dynamic operands — a runtime-guarded **Tier 3** trap. `@Nat` subtraction is excluded — it is the underflow obligation (E502) above, never a high-overflow. -**String length** ([#802](https://github.com/aallan/vera/issues/802)). Vera strings are UTF-8 byte sequences and `string_length` returns the **byte** count, but Z3's string theory (SMT-LIB 2.6) models strings as sequences of Unicode **code points** — its `Length` counts code points, which disagrees with the runtime on every multibyte character (`string_length("é")` is `2`, not `1`). `string_length` is therefore modeled at **Tier 1** only for a string **literal**, whose exact byte length is known; on any non-literal argument it defers to a runtime-guarded **Tier 3** obligation (Z3's string theory has no byte-length operator). The boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay **Tier 1**: UTF-8 is self-synchronizing, so a valid substring / prefix / suffix matches at the byte level exactly when it matches at the code-point level. The other byte/offset-sensitive string builtins (`string_slice`, `string_index_of`, `string_char_code`, `string_chars`) are not translated to Z3 and already fall to **Tier 3**. One further deferral keeps the predicates honest: Z3's string-sort alphabet only reaches U+2FFFF, and its Python binding silently stores any higher code point as the literal's *escape text* rather than the character — so a **string literal containing a code point above U+2FFFF defers to Tier 3** (it cannot be faithfully modeled), instead of letting a predicate match phantom escape bytes the runtime never sees. +**String length** ([#802](https://github.com/aallan/vera/issues/802)). Vera strings are UTF-8 byte sequences and `string_length` returns the **byte** count, but Z3's string theory (SMT-LIB 2.6) models strings as sequences of Unicode **code points** — its `Length` counts code points, which disagrees with the runtime on every multibyte character (`string_length("é")` is `2`, not `1`). `string_length` is therefore modeled at **Tier 1** only for a string **literal**, whose exact byte length is known; on any non-literal argument it defers to a runtime-guarded **Tier 3** obligation (Z3's string theory has no byte-length operator). The boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay **Tier 1**: UTF-8 is self-synchronizing, so a valid substring / prefix / suffix matches at the byte level exactly when it matches at the code-point level. The other byte/offset-sensitive string builtins (`string_slice`, `string_index_of`, `string_char_code`, `string_chars`) are not translated to Z3 and already fall to **Tier 3**. One further deferral keeps the predicates honest: Z3's string-sort alphabet only reaches U+2FFFF, and its Python binding silently stores any higher code point as the literal's *escape text* rather than the character — so a **string literal containing a code point above U+2FFFF — or a lone surrogate (U+D800–U+DFFF), which has no UTF-8 encoding — defers to Tier 3** (it cannot be faithfully modeled), instead of letting a predicate match phantom escape bytes the runtime never sees. Runtime traps for unguarded primitives are Vera-native: each trap carries a kind label (`divide_by_zero`, `out_of_bounds`, etc.), a per-kind Fix paragraph naming the precondition that would have prevented it, and a source backtrace — so a missing static guarantee is still a recoverable signal. diff --git a/tests/test_string_length_soundness.py b/tests/test_string_length_soundness.py index 96b87fd9..c498e4c7 100644 --- a/tests/test_string_length_soundness.py +++ b/tests/test_string_length_soundness.py @@ -189,21 +189,69 @@ class TestAstralStringLiteral802: over them. `string_length` is unaffected — it byte-counts the decoded value, not `z3.StringVal` (covered in TestStringLengthSoundness802).""" - def test_astral_predicate_not_proved_at_tier1(self) -> None: + def test_astral_literal_predicate_defers_to_tier3(self) -> None: # Pre-fix the verifier PROVED this false contract at Tier 1 (the phantom - # escape string matched "f"); now the astral literal defers, so nothing - # about it is a Tier-1 proof — no false proof. (U+10FFFF as bytes - # f4 8f bf bf contains no "f"; the runtime returns false.) + # escape string matched "f"); now the astral literal defers cleanly to a + # runtime-guarded Tier-3 obligation. (U+10FFFF as bytes f4 8f bf bf + # contains no "f", so the runtime returns false — the contract is false, + # and the verifier must NOT prove it.) result = _verify(""" -public fn check(@String -> @Bool) - requires(@String.0 == "\\u{10FFFF}") +public fn f(@Unit -> @Bool) + requires(true) ensures(@Bool.result == true) effects(pure) +{ string_contains("\\u{10FFFF}", "f") } +""") + # Verification succeeds with no spurious error, no false Tier-1 proof: + # the astral predicate is a runtime-guarded Tier-3 obligation (pre-fix it + # was a Tier-1 proof, so tier3_runtime was 0). + assert _ok(result), [d.description for d in result.diagnostics] + assert result.summary.tier3_runtime >= 1, ( + result.summary.tier1_verified, result.summary.tier3_runtime, + ) + + def test_astral_cutoff_boundary(self) -> None: + # Pin the exact `ord(ch) > 0x2FFFF` cutoff (guards against a `>=` + # regression): U+2FFFF is on the modeled side — a predicate over it stays + # Tier 1 (tier3 == 0); U+30000, one above, defers (tier3 >= 1). Uses a + # predicate, not string_length, because string_length byte-counts via + # Python and so does not exercise the z3.StringVal alphabet cutoff. + modeled = _verify(""" +public fn f(@String -> @Bool) + requires(@String.0 == "\\u{2FFFF}") ensures(@Bool.result == true) effects(pure) -{ string_contains(@String.0, "f") } +{ string_starts_with(@String.0, "\\u{2FFFF}") } """) - assert result.summary.tier1_verified == 0, ( - result.summary.tier1_verified, result.summary.tier3_runtime, + assert _ok(modeled) and modeled.summary.tier3_runtime == 0, ( + modeled.summary.tier1_verified, modeled.summary.tier3_runtime, ) + deferred = _verify(""" +public fn f(@String -> @Bool) + requires(@String.0 == "\\u{30000}") + ensures(@Bool.result == true) + effects(pure) +{ string_starts_with(@String.0, "\\u{30000}") } +""") + assert deferred.summary.tier3_runtime >= 1, ( + deferred.summary.tier1_verified, deferred.summary.tier3_runtime, + ) + + def test_surrogate_literal_does_not_crash(self) -> None: + # A lone surrogate (U+D800) is not UTF-8-encodable; before the guard, + # string_length's `value.encode("utf-8")` and the predicate's + # `z3.StringVal` both raised UnicodeEncodeError and crashed `vera verify`. + # Both must now defer to Tier 3 without crashing. + length = _verify(""" +public fn f(@Unit -> @Int) + requires(true) ensures(@Int.result == 1) effects(pure) +{ string_length("\\u{D800}") } +""") + assert length.summary.tier3_runtime >= 1 # deferred, no crash + predicate = _verify(""" +public fn f(@Unit -> @Bool) + requires(true) ensures(@Bool.result == true) effects(pure) +{ string_contains("\\u{D800}", "x") } +""") + assert predicate.summary.tier3_runtime >= 1 # deferred, no crash def test_astral_string_length_still_byte_modeled(self) -> None: # string_length bypasses z3.StringVal (it byte-counts the decoded value), diff --git a/vera/smt.py b/vera/smt.py index 1709cb93..6cb516c9 100644 --- a/vera/smt.py +++ b/vera/smt.py @@ -572,16 +572,20 @@ def translate_expr( return z3.BoolVal(expr.value) if isinstance(expr, ast.StringLit): - # #802: Z3's string sort alphabet is U+0000..U+2FFFF. For any code - # point above that, the Python binding's z3.StringVal silently stores - # the *escape string* (the literal "\U000xxxxx" ASCII characters) - # instead of the character — so Contains/PrefixOf/SuffixOf over such - # a literal match phantom ASCII bytes the byte-level runtime never - # sees, proving false contracts at Tier 1. Defer the literal to - # Tier 3 (return None) rather than model it unsoundly. (string_length - # is unaffected: it models a literal via Python's UTF-8 byte count, - # not z3.StringVal.) - if any(ord(ch) > 0x2FFFF for ch in expr.value): + # #802: Z3's string sort cannot faithfully model two kinds of code + # point, so a literal containing either defers to Tier 3 (return + # None) rather than be reasoned over as a corrupted term: + # - above its alphabet (> U+2FFFF): the Python binding silently + # stores the *escape string* (literal "\U000xxxxx" ASCII chars) + # instead of the character, so Contains/PrefixOf/SuffixOf match + # phantom bytes the runtime never sees, proving false contracts; + # - a lone surrogate (U+D800..U+DFFF): not UTF-8-encodable, so + # z3.StringVal raises on it. + # (string_length models a literal via its UTF-8 byte count, not + # z3.StringVal, so it is unaffected by the alphabet limit — but it + # too must guard the surrogate case, below.) + if any(ord(ch) > 0x2FFFF or 0xD800 <= ord(ch) <= 0xDFFF + for ch in expr.value): return None return z3.StringVal(expr.value) @@ -1075,7 +1079,12 @@ def _translate_call( if call.name == "string_length" and len(call.args) == 1: arg_node = call.args[0] if isinstance(arg_node, ast.StringLit): - byte_len = len(arg_node.value.encode("utf-8")) + try: + byte_len = len(arg_node.value.encode("utf-8")) + except UnicodeEncodeError: + # A lone surrogate (U+D800..U+DFFF) is not UTF-8-encodable; + # its byte length is undefined, so defer to Tier 3. + return None return z3.IntVal(byte_len) return None From 7376df2bf13a16684b52ee5db5d64cc8458f6ad6 Mon Sep 17 00:00:00 2001 From: T Date: Sat, 27 Jun 2026 12:26:11 +0100 Subject: [PATCH 4/7] fix(802): scope spec U+2FFFF deferral to predicates; assert clean Tier-3 fallback (review) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit CodeRabbit review of #810: - spec/06 section 6.4.3: the string-length paragraph overstated the U+2FFFF cutoff. It applies only to the z3.StringVal-based predicate translation; string_length byte-counts the decoded literal, so an astral (> U+2FFFF) literal's length stays Tier 1. Reworded to scope the cutoff to the predicates, with the lone surrogate (no UTF-8 encoding) still deferring on both paths. - tests/test_string_length_soundness.py: the surrogate test now also asserts _ok(length) and _ok(predicate), so it fails on an E500 / checker error rather than only requiring tier3_runtime — pinning the graceful fallback, not just any deferral. Refs #802, #392. Co-Authored-By: Claude --- TESTING.md | 2 +- spec/06-contracts.md | 2 +- tests/test_string_length_soundness.py | 7 +++++-- 3 files changed, 7 insertions(+), 4 deletions(-) diff --git a/TESTING.md b/TESTING.md index 7dbed6e3..a8c8dd65 100644 --- a/TESTING.md +++ b/TESTING.md @@ -75,7 +75,7 @@ python scripts/fix_allowlists.py --fix # auto-fix stale allowlists | `test_walker_defensive_branches_597.py` | 21 | 296 | Synthetic-AST tests for the 11 defensive `isinstance` branches added by #597 (`_scan_io_ops` / `_scan_expr_for_handlers` / `_infer_expr_wasm_type` / `_infer_vera_type`) plus the 5 pr-review fixes (#2/#3/#8 — ModuleCall/AnonFn/QualifiedCall return None; dead `is not None` guards on Block/HandleExpr removed) | | `test_check_walker_coverage_597.py` | 15 | 311 | Unit tests for `scripts/check_walker_coverage.py` parsing logic — Expr subclass extraction, isinstance flattening (incl. tuple form), checklist-block anchoring (incl. CR-3 regression test: `# Foo → bar` outside WALKER_COVERAGE block not counted), section-header tolerance, auto-discovery invariants, end-to-end main exit code | | `test_stress.py` | 16 | 553 | Scale-dependent regression tests (#596) — `@pytest.mark.stress`, skipped by default. 9 logical tests × eager-GC lane parametrisation = 16 test instances. 10K `array_map`, 5K nested-array `array_map`, 1K-deep tail recursion with allocating arg, 1M-deep tail recursion with allocating arg (#549 GC-aware TCO), 20×20 nested array-fold-of-array-fold, 100K `array_fold`, 10K String allocations, 1K `State` get/put cycles, 10K `IO.print` calls. Pins #570 / #515 / #593 / #549 / #487 / #348 / #573 regression coverage | -| `test_string_length_soundness.py` | 15 | 265 | #802 — string_length code-point vs UTF-8 byte soundness: a non-literal `string_length` defers to Tier 3 (the issue's `"é"` probe no longer proves `== 1` at Tier 1), a string-literal length is modeled at its exact UTF-8 byte count (`== 2` for `"é"`), and the boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay Tier 1 (sound under UTF-8 self-synchronization) | +| `test_string_length_soundness.py` | 15 | 268 | #802 — string_length code-point vs UTF-8 byte soundness: a non-literal `string_length` defers to Tier 3 (the issue's `"é"` probe no longer proves `== 1` at Tier 1), a string-literal length is modeled at its exact UTF-8 byte count (`== 2` for `"é"`), and the boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay Tier 1 (sound under UTF-8 self-synchronization) | | `test_errors.py` | 52 | 525 | Error code registry, diagnostic formatting, serialisation, SourceLocation, error display sync (README/HTML/spec) | | `test_formatter.py` | 124 | 1,074 | Comment extraction, interior comment positioning, expression/declaration formatting, match arm block bodies, idempotency, parenthesization, spec rules, ability declarations | | `test_cli.py` | 242 | 3,504 | CLI commands (check, verify, compile, run, test, fmt, version, quiet), subprocess integration, JSON error paths, runtime traps, arg validation, multi-file resolution, IO exit codes, --explain-slots, `builtins`/`effects`/`errors` introspection dispatch | diff --git a/spec/06-contracts.md b/spec/06-contracts.md index 7193c0ab..83cc64b0 100644 --- a/spec/06-contracts.md +++ b/spec/06-contracts.md @@ -254,7 +254,7 @@ The `@Nat` obligations (E502 / E503) carry the most nuance, spanning many bindin **Integer overflow** ([#798](https://github.com/aallan/vera/issues/798)). `@Int` is a signed 64-bit machine integer and `@Nat` an unsigned one; `+` / `-` / `*` wrap at the i64 / u64 boundary. Like `@Nat` underflow and signed-division `MIN / -1`, an overflowing operation is a *partial* operation that **traps** at runtime rather than silently wrapping, so each `@Int` / `@Nat` `+` / `-` / `*` carries an obligation that the result stays in range. It is classified at the operands' **common (coerced) type** — `@Int` if either operand is `@Int` (since `@Nat <: @Int`), else `@Nat` — not one operand's self-type (a non-negative literal is `@Nat`, but `5 + @Int.0` is an i64 add) nor the possibly-narrowed result type (an `@Int.0 + 1` stored into a `@Nat` slot is still an i64 add). A two-check mirrors array indexing: the result provably in range → **Tier 1**; provably out of range (e.g. a literal `u64.MAX + 1`, or `@Int.0 + 1` under `requires(@Int.0 == i64.MAX)`) → a compile error (**E528**); otherwise — dynamic operands — a runtime-guarded **Tier 3** trap. `@Nat` subtraction is excluded — it is the underflow obligation (E502) above, never a high-overflow. -**String length** ([#802](https://github.com/aallan/vera/issues/802)). Vera strings are UTF-8 byte sequences and `string_length` returns the **byte** count, but Z3's string theory (SMT-LIB 2.6) models strings as sequences of Unicode **code points** — its `Length` counts code points, which disagrees with the runtime on every multibyte character (`string_length("é")` is `2`, not `1`). `string_length` is therefore modeled at **Tier 1** only for a string **literal**, whose exact byte length is known; on any non-literal argument it defers to a runtime-guarded **Tier 3** obligation (Z3's string theory has no byte-length operator). The boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay **Tier 1**: UTF-8 is self-synchronizing, so a valid substring / prefix / suffix matches at the byte level exactly when it matches at the code-point level. The other byte/offset-sensitive string builtins (`string_slice`, `string_index_of`, `string_char_code`, `string_chars`) are not translated to Z3 and already fall to **Tier 3**. One further deferral keeps the predicates honest: Z3's string-sort alphabet only reaches U+2FFFF, and its Python binding silently stores any higher code point as the literal's *escape text* rather than the character — so a **string literal containing a code point above U+2FFFF — or a lone surrogate (U+D800–U+DFFF), which has no UTF-8 encoding — defers to Tier 3** (it cannot be faithfully modeled), instead of letting a predicate match phantom escape bytes the runtime never sees. +**String length** ([#802](https://github.com/aallan/vera/issues/802)). Vera strings are UTF-8 byte sequences and `string_length` returns the **byte** count, but Z3's string theory (SMT-LIB 2.6) models strings as sequences of Unicode **code points** — its `Length` counts code points, which disagrees with the runtime on every multibyte character (`string_length("é")` is `2`, not `1`). `string_length` is therefore modeled at **Tier 1** only for a string **literal**, whose exact byte length is known; on any non-literal argument it defers to a runtime-guarded **Tier 3** obligation (Z3's string theory has no byte-length operator). The boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay **Tier 1**: UTF-8 is self-synchronizing, so a valid substring / prefix / suffix matches at the byte level exactly when it matches at the code-point level. The other byte/offset-sensitive string builtins (`string_slice`, `string_index_of`, `string_char_code`, `string_chars`) are not translated to Z3 and already fall to **Tier 3**. Two further deferrals keep the predicates honest. Z3's string-sort alphabet only reaches U+2FFFF, and its Python binding silently stores any higher code point as the literal's *escape text* rather than the character — so a literal containing a code point **above U+2FFFF** is unusable in the `z3.StringVal`-based predicate translation (`string_contains` / `string_starts_with` / `string_ends_with`) and defers to **Tier 3** there, instead of letting a predicate match phantom escape bytes the runtime never sees. `string_length` is unaffected by this one — it byte-counts the *decoded* literal, so an astral literal's length stays **Tier 1**. A **lone surrogate** (U+D800–U+DFFF) has no UTF-8 encoding at all, so it defers on **both** paths: `z3.StringVal` raises on it, and `string_length`'s byte count cannot be taken. Runtime traps for unguarded primitives are Vera-native: each trap carries a kind label (`divide_by_zero`, `out_of_bounds`, etc.), a per-kind Fix paragraph naming the precondition that would have prevented it, and a source backtrace — so a missing static guarantee is still a recoverable signal. diff --git a/tests/test_string_length_soundness.py b/tests/test_string_length_soundness.py index c498e4c7..7e24ca6a 100644 --- a/tests/test_string_length_soundness.py +++ b/tests/test_string_length_soundness.py @@ -245,13 +245,16 @@ def test_surrogate_literal_does_not_crash(self) -> None: requires(true) ensures(@Int.result == 1) effects(pure) { string_length("\\u{D800}") } """) - assert length.summary.tier3_runtime >= 1 # deferred, no crash + # Graceful Tier-3 fallback, not a crash and not an E500/checker error. + assert _ok(length), [d.description for d in length.diagnostics] + assert length.summary.tier3_runtime >= 1 predicate = _verify(""" public fn f(@Unit -> @Bool) requires(true) ensures(@Bool.result == true) effects(pure) { string_contains("\\u{D800}", "x") } """) - assert predicate.summary.tier3_runtime >= 1 # deferred, no crash + assert _ok(predicate), [d.description for d in predicate.diagnostics] + assert predicate.summary.tier3_runtime >= 1 def test_astral_string_length_still_byte_modeled(self) -> None: # string_length bypasses z3.StringVal (it byte-counts the decoded value), From b350578e8defe86f96ea2f329eb28cca3d152693 Mon Sep 17 00:00:00 2001 From: T Date: Sat, 27 Jun 2026 13:58:13 +0100 Subject: [PATCH 5/7] docs(802): correct surrogate "raises" rationale; add runtime soundness-loop test (review-pr) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit pr-review-toolkit review of #810 (4 agents — all found the fix sound; 0 critical/important code issues). Addressed the agreed findings: - Doc accuracy: comments / spec / CHANGELOG / test-docstring claimed z3.StringVal "raises" on a lone surrogate. It does not (on z3 4.16 it silently stores the code point's escape text, same as the > U+2FFFF astral case); only string_length's value.encode("utf-8") genuinely raises. Reworded vera/smt.py, spec/06, CHANGELOG, and the surrogate test docstring; fixed the astral escape-text illustration ("\U000xxxxx" -> the actual "\u{...}" form). - tests/test_codegen.py: added a runtime soundness-loop test — a slot-arg string_length contract that `vera verify` DEFERS to Tier 3 is enforced at runtime (ensures(@Int.result == 1) over "é" raises a postcondition violation), closing the "deferral is sound, not merely imprecise" loop the verifier-side tests leave open. No behaviour change (comment / test / doc only). Refs #802, #392. Co-Authored-By: Claude --- CHANGELOG.md | 2 +- README.md | 2 +- ROADMAP.md | 2 +- TESTING.md | 6 +++--- spec/06-contracts.md | 2 +- tests/test_codegen.py | 19 +++++++++++++++++++ tests/test_string_length_soundness.py | 9 +++++---- vera/smt.py | 23 +++++++++++++---------- 8 files changed, 44 insertions(+), 21 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 0f7e0752..ac4bc4d4 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -10,7 +10,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). ### Fixed -- **Verifier soundness: `string_length` no longer proves a false length for non-ASCII strings** (final sub-task of the [#392](https://github.com/aallan/vera/issues/392) `smt.py` soundness audit). `vera verify` translated `string_length` to Z3's `Length`, which counts Unicode **code points**, but Vera's runtime counts **UTF-8 bytes** — so `ensures(string_length("é") == 1)` verified at Tier 1 while the runtime returns `2`. `string_length` is now modeled at Tier 1 only for a string **literal** (whose exact byte length is known); on any non-literal argument it defers to a runtime-guarded **Tier 3** obligation (Z3's string theory has no byte-length operator), matching the numeric-cast / quantifier / decimal precedent. The boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay Tier 1 — UTF-8 is self-synchronizing, so a valid substring / prefix / suffix matches at the byte level exactly when it matches at the code-point level — with one further guard: a string literal containing a code point above Z3's string-sort alphabet (> U+2FFFF), or a lone surrogate (U+D800–U+DFFF) with no UTF-8 encoding, also defers to Tier 3 — the Z3 Python binding would otherwise store it as escape text (or raise) and let the predicates match phantom bytes. This was the last of the six soundness bugs ([#797](https://github.com/aallan/vera/issues/797)–[#802](https://github.com/aallan/vera/issues/802)) found by the [#392](https://github.com/aallan/vera/issues/392) audit, which is now complete. ([#802](https://github.com/aallan/vera/issues/802)) +- **Verifier soundness: `string_length` no longer proves a false length for non-ASCII strings** (final sub-task of the [#392](https://github.com/aallan/vera/issues/392) `smt.py` soundness audit). `vera verify` translated `string_length` to Z3's `Length`, which counts Unicode **code points**, but Vera's runtime counts **UTF-8 bytes** — so `ensures(string_length("é") == 1)` verified at Tier 1 while the runtime returns `2`. `string_length` is now modeled at Tier 1 only for a string **literal** (whose exact byte length is known); on any non-literal argument it defers to a runtime-guarded **Tier 3** obligation (Z3's string theory has no byte-length operator), matching the numeric-cast / quantifier / decimal precedent. The boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay Tier 1 — UTF-8 is self-synchronizing, so a valid substring / prefix / suffix matches at the byte level exactly when it matches at the code-point level — with one further guard: a string literal containing a code point above Z3's string-sort alphabet (> U+2FFFF), or a lone surrogate (U+D800–U+DFFF) with no UTF-8 encoding, also defers to Tier 3 — the Z3 Python binding would otherwise store it as escape text and let the predicates match phantom bytes (and a surrogate has no UTF-8 byte length to take). This was the last of the six soundness bugs ([#797](https://github.com/aallan/vera/issues/797)–[#802](https://github.com/aallan/vera/issues/802)) found by the [#392](https://github.com/aallan/vera/issues/392) audit, which is now complete. ([#802](https://github.com/aallan/vera/issues/802)) ## [0.0.181] - 2026-06-27 diff --git a/README.md b/README.md index 6ce76fa8..62a2517a 100644 --- a/README.md +++ b/README.md @@ -215,7 +215,7 @@ cp /path/to/vera/SKILL.md ~/.claude/skills/vera-language/SKILL.md ## Project status -Vera is in **active development** at v0.0.182 — 1,400+ commits, 182 releases, 5,172 tests, 91% code coverage, 93 conformance programs, 35 examples, and a 13-chapter specification. See **[HISTORY.md](HISTORY.md)** for how the compiler was built. +Vera is in **active development** at v0.0.182 — 1,400+ commits, 182 releases, 5,173 tests, 91% code coverage, 93 conformance programs, 35 examples, and a 13-chapter specification. See **[HISTORY.md](HISTORY.md)** for how the compiler was built. The reference compiler — parser, AST, type checker, contract verifier (Z3), WASM code generator, module system, browser runtime, and runtime contract insertion — is working. The language specification is in draft across [13 chapters](spec/). diff --git a/ROADMAP.md b/ROADMAP.md index 1cf51faf..4e01b129 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -8,7 +8,7 @@ Priority lives in this file and nowhere else — issues carry kind and area labe ## Where we are -5,172 tests, 93 conformance programs, 35 examples, 13 spec chapters. +5,173 tests, 93 conformance programs, 35 examples, 13 spec chapters. ## The roadmap diff --git a/TESTING.md b/TESTING.md index a8c8dd65..a15fb678 100644 --- a/TESTING.md +++ b/TESTING.md @@ -6,7 +6,7 @@ This is the single source of truth for Vera's testing infrastructure, coverage d | Metric | Value | |--------|-------| -| **Tests** | 5,172 across 45 files (~63,000 lines of test code; 5,140 passed + 16 stress, 16 skipped) | +| **Tests** | 5,173 across 45 files (~63,000 lines of test code; 5,141 passed + 16 stress, 16 skipped) | | **Compiler code coverage** | 95% Python, 61% JavaScript — 91% combined (CI minimum: 80%) | | **Conformance programs** | 93 programs across 9 spec chapters, validating every language feature | | **Example programs** | 35, all validated through `vera check` + `vera verify` | @@ -65,7 +65,7 @@ python scripts/fix_allowlists.py --fix # auto-fix stale allowlists | `test_int_overflow_codegen.py` | 50 | 567 | #798 Stage 3 — runtime overflow-trap codegen: the codegen emits a guard at *exactly* the `@Int`/`@Nat` `+`/`-`/`*` sites the verifier obligates, so `vera run`/`vera compile` programs trap on overflow instead of silently wrapping at the i64/u64 boundary. The trap is a bare `unreachable` (Option A — matches the #520 `nat_sub` and #552 nat-bind precedent), classified `kind="unreachable"` today (a precise `"overflow"` trap kind via a host import is a follow-up). Test-first: every `*_traps` test fails on the pre-Stage-3 codegen (the op wraps silently → no trap → `execute` returns a value); every `*_no_trap` test passes before and after (safe arithmetic unchanged) | | `test_int_overflow_differential.py` | 143 | 364 | #798 Stage 3 verifier↔codegen classification differential (cross-component soundness rule): the codegen overflow guard must fire at exactly the sites the verifier obligates *and* classify each site's operand type (`@Int` i64 vs `@Nat` u64) identically — else a Tier-1-clean program traps spuriously or a wrapping op slips through unguarded. Over a corpus exercising all five operand combos plus the literal-left ambiguity (which the pre-fix codegen mis-classified as `@Nat`), asserts the verifier's per-site gated classification equals the codegen's site for site, both sides driven by the same `ast.span_key` | | `test_monomorphize_differential.py` | 13 | 682 | #732 differential soundness: the verifier's per-monomorphization instantiation discovery covers every instantiation codegen emits (name coverage + per-generic count), over real generic programs (conformance ch02/ch09, `examples/generics.vera`) plus inline cases for the soundness-critical scenarios — collapsed type vars, **prelude combinator emission** (`option_map`), transitive generics, a generic whose type arg is fixed only by a **where-helper's return** (a `Float64`-returning helper, so the unresolved-var `"Bool"` phantom default cannot mask a miss), a generic whose type arg is fixed only by an **imported constructor** (`id2(MkBox(7))` — the verifier's mono-context must include `_module_constructors`, else it phantom-defaults and misses codegen's `id2`), a generic whose type arg is fixed only by an **imported function's return** (`id_g(make_int(...))` — the verifier's mono-context must seed `fn_ret_types` from imported functions, else it phantom-defaults and misses codegen's `id_g`, plus a **private-shadow** case pinning the imported-fn seeding stays unfiltered like codegen since filtering would diverge into a false Tier-1), and a generic reached only through a **contract clause or `where` helper** (codegen must seed Pass 1.5 from the shared node-level walk, not just `decl.body`, or it skips the clone → `CodegenSkip` at run time) — so a missed instantiation (a false Tier-1) is caught. Guards against a vacuous pass when codegen emits nothing, plus a **determinism guard** (`vera compile --wat` is byte-stable across `PYTHONHASHSEED` — the mono worklist sorts its instantiation sets) | -| `test_codegen.py` | 1,214 | 21,127 | WASM compilation, arithmetic, Float64, Byte, arrays (incl. compound element types), ADTs, match (incl. nested patterns), generics, State\, Exn\ handlers, control flow, strings, string escape sequences, IO (read\_line, read\_file, write\_file, args, exit, get\_env, sleep, time, stderr), bounds checking, quantifiers, assert/assume, refinement type aliases, pipe operator, string built-ins, built-in shadowing, parse\_nat Result, GC, Markdown host bindings, Regex host bindings, Map collection, Set collection, Decimal type, Json type, Html type, Http effect, Inference effect, Random effect, example round-trips, GC shadow stack overflow, **WASM tail-call optimization** (#517 — `return_call` emission for tail-position calls, 50K- and 1M-iteration stress, structural assertions on `return_call`/plain `call` boundary, **GC-aware TCO for allocating fns (#549 — `$gc_sp` restore before each `return_call`)**, postcondition-fallback regression (still reverts to plain `call`), analyzer unit tests covering Block-trailing / IfExpr-both-branches / MatchExpr-arm-bodies / let-value-NOT-marked / call-args-NOT-marked / ExprStmt-statement-NOT-marked / IfExpr-condition-NOT-marked / MatchExpr-scrutinee-NOT-marked), **@Nat narrowing runtime guards** (#552 let site; #747 extends to tuple-destructure / top-level match-bind / ADT sub-pattern / concrete ctor-field / call-arg sites — `i64.lt_s; unreachable` net, @Int targets exempt), **refinement-predicate runtime guards** (#746 — primitive-base (incl. `@Bool` / `@Float64`) refined params/returns traverse a `$vera.contract_fail` predicate guard at the function boundary: violating value traps, valid passes, call-args guarded transitively, generic refined returns guarded on the monomorphised instance; non-primitive `@Array` refinements (`array_length(...) > 0`) guarded too, the alias-aware `@Nat` base conjoins its implicit `>= 0`, and erased `@Unit` refinements emit no guard; **tuple-component boundary guards** decompose a `Tuple` param/return at the FFI boundary — loading each refined / `@Nat` component from the heap value and guarding it recursively for nested tuples, so a violating component traps instead of laundering past the verifier's component assumption; a **generic tuple alias** (`type Box = Tuple`) substitutes its argument so `Box` still guards its component, and a mutually-recursive (infinite) tuple alias **fails closed with E617** rather than silently emitting partial guards; a **refinement OVER a tuple** (`type Pair = { @Tuple \| true }`) is unwrapped so its components are guarded too, recursively through nested `Tuple`) | +| `test_codegen.py` | 1,215 | 21,146 | WASM compilation, arithmetic, Float64, Byte, arrays (incl. compound element types), ADTs, match (incl. nested patterns), generics, State\, Exn\ handlers, control flow, strings, string escape sequences, IO (read\_line, read\_file, write\_file, args, exit, get\_env, sleep, time, stderr), bounds checking, quantifiers, assert/assume, refinement type aliases, pipe operator, string built-ins, built-in shadowing, parse\_nat Result, GC, Markdown host bindings, Regex host bindings, Map collection, Set collection, Decimal type, Json type, Html type, Http effect, Inference effect, Random effect, example round-trips, GC shadow stack overflow, **WASM tail-call optimization** (#517 — `return_call` emission for tail-position calls, 50K- and 1M-iteration stress, structural assertions on `return_call`/plain `call` boundary, **GC-aware TCO for allocating fns (#549 — `$gc_sp` restore before each `return_call`)**, postcondition-fallback regression (still reverts to plain `call`), analyzer unit tests covering Block-trailing / IfExpr-both-branches / MatchExpr-arm-bodies / let-value-NOT-marked / call-args-NOT-marked / ExprStmt-statement-NOT-marked / IfExpr-condition-NOT-marked / MatchExpr-scrutinee-NOT-marked), **@Nat narrowing runtime guards** (#552 let site; #747 extends to tuple-destructure / top-level match-bind / ADT sub-pattern / concrete ctor-field / call-arg sites — `i64.lt_s; unreachable` net, @Int targets exempt), **refinement-predicate runtime guards** (#746 — primitive-base (incl. `@Bool` / `@Float64`) refined params/returns traverse a `$vera.contract_fail` predicate guard at the function boundary: violating value traps, valid passes, call-args guarded transitively, generic refined returns guarded on the monomorphised instance; non-primitive `@Array` refinements (`array_length(...) > 0`) guarded too, the alias-aware `@Nat` base conjoins its implicit `>= 0`, and erased `@Unit` refinements emit no guard; **tuple-component boundary guards** decompose a `Tuple` param/return at the FFI boundary — loading each refined / `@Nat` component from the heap value and guarding it recursively for nested tuples, so a violating component traps instead of laundering past the verifier's component assumption; a **generic tuple alias** (`type Box = Tuple`) substitutes its argument so `Box` still guards its component, and a mutually-recursive (infinite) tuple alias **fails closed with E617** rather than silently emitting partial guards; a **refinement OVER a tuple** (`type Pair = { @Tuple \| true }`) is unwrapped so its components are guarded too, recursively through nested `Tuple`) | | `test_codegen_contracts.py` | 32 | 570 | Runtime pre/postconditions, contract fail messages, old/new state postconditions | | `test_codegen_monomorphize.py` | 71 | 1,320 | Generic instantiation, type inference, monomorphization edge cases, ability constraint satisfaction (Eq/Ord/Hash/Show), operation rewriting (eq/compare), show/hash dispatch, ADT auto-derivation, array operations (slice/map/filter/fold) | | `test_codegen_closures.py` | 50 | 1,618 | Closure lifting, captured variables, higher-order functions, iterative-builder shadow-stack regressions (#570), closure return-value shadow-push balance for both i32-pair and i32-ADT branches across array_map and array_mapi, plus VERA_EAGER_GC injection self-test (#593), IndexExpr-of-FnCall element-type inference (#614), non-contiguous capture and walker-order miscompiles (#615) | @@ -75,7 +75,7 @@ python scripts/fix_allowlists.py --fix # auto-fix stale allowlists | `test_walker_defensive_branches_597.py` | 21 | 296 | Synthetic-AST tests for the 11 defensive `isinstance` branches added by #597 (`_scan_io_ops` / `_scan_expr_for_handlers` / `_infer_expr_wasm_type` / `_infer_vera_type`) plus the 5 pr-review fixes (#2/#3/#8 — ModuleCall/AnonFn/QualifiedCall return None; dead `is not None` guards on Block/HandleExpr removed) | | `test_check_walker_coverage_597.py` | 15 | 311 | Unit tests for `scripts/check_walker_coverage.py` parsing logic — Expr subclass extraction, isinstance flattening (incl. tuple form), checklist-block anchoring (incl. CR-3 regression test: `# Foo → bar` outside WALKER_COVERAGE block not counted), section-header tolerance, auto-discovery invariants, end-to-end main exit code | | `test_stress.py` | 16 | 553 | Scale-dependent regression tests (#596) — `@pytest.mark.stress`, skipped by default. 9 logical tests × eager-GC lane parametrisation = 16 test instances. 10K `array_map`, 5K nested-array `array_map`, 1K-deep tail recursion with allocating arg, 1M-deep tail recursion with allocating arg (#549 GC-aware TCO), 20×20 nested array-fold-of-array-fold, 100K `array_fold`, 10K String allocations, 1K `State` get/put cycles, 10K `IO.print` calls. Pins #570 / #515 / #593 / #549 / #487 / #348 / #573 regression coverage | -| `test_string_length_soundness.py` | 15 | 268 | #802 — string_length code-point vs UTF-8 byte soundness: a non-literal `string_length` defers to Tier 3 (the issue's `"é"` probe no longer proves `== 1` at Tier 1), a string-literal length is modeled at its exact UTF-8 byte count (`== 2` for `"é"`), and the boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay Tier 1 (sound under UTF-8 self-synchronization) | +| `test_string_length_soundness.py` | 15 | 269 | #802 — string_length code-point vs UTF-8 byte soundness: a non-literal `string_length` defers to Tier 3 (the issue's `"é"` probe no longer proves `== 1` at Tier 1), a string-literal length is modeled at its exact UTF-8 byte count (`== 2` for `"é"`), and the boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay Tier 1 (sound under UTF-8 self-synchronization) | | `test_errors.py` | 52 | 525 | Error code registry, diagnostic formatting, serialisation, SourceLocation, error display sync (README/HTML/spec) | | `test_formatter.py` | 124 | 1,074 | Comment extraction, interior comment positioning, expression/declaration formatting, match arm block bodies, idempotency, parenthesization, spec rules, ability declarations | | `test_cli.py` | 242 | 3,504 | CLI commands (check, verify, compile, run, test, fmt, version, quiet), subprocess integration, JSON error paths, runtime traps, arg validation, multi-file resolution, IO exit codes, --explain-slots, `builtins`/`effects`/`errors` introspection dispatch | diff --git a/spec/06-contracts.md b/spec/06-contracts.md index 83cc64b0..a438c0b7 100644 --- a/spec/06-contracts.md +++ b/spec/06-contracts.md @@ -254,7 +254,7 @@ The `@Nat` obligations (E502 / E503) carry the most nuance, spanning many bindin **Integer overflow** ([#798](https://github.com/aallan/vera/issues/798)). `@Int` is a signed 64-bit machine integer and `@Nat` an unsigned one; `+` / `-` / `*` wrap at the i64 / u64 boundary. Like `@Nat` underflow and signed-division `MIN / -1`, an overflowing operation is a *partial* operation that **traps** at runtime rather than silently wrapping, so each `@Int` / `@Nat` `+` / `-` / `*` carries an obligation that the result stays in range. It is classified at the operands' **common (coerced) type** — `@Int` if either operand is `@Int` (since `@Nat <: @Int`), else `@Nat` — not one operand's self-type (a non-negative literal is `@Nat`, but `5 + @Int.0` is an i64 add) nor the possibly-narrowed result type (an `@Int.0 + 1` stored into a `@Nat` slot is still an i64 add). A two-check mirrors array indexing: the result provably in range → **Tier 1**; provably out of range (e.g. a literal `u64.MAX + 1`, or `@Int.0 + 1` under `requires(@Int.0 == i64.MAX)`) → a compile error (**E528**); otherwise — dynamic operands — a runtime-guarded **Tier 3** trap. `@Nat` subtraction is excluded — it is the underflow obligation (E502) above, never a high-overflow. -**String length** ([#802](https://github.com/aallan/vera/issues/802)). Vera strings are UTF-8 byte sequences and `string_length` returns the **byte** count, but Z3's string theory (SMT-LIB 2.6) models strings as sequences of Unicode **code points** — its `Length` counts code points, which disagrees with the runtime on every multibyte character (`string_length("é")` is `2`, not `1`). `string_length` is therefore modeled at **Tier 1** only for a string **literal**, whose exact byte length is known; on any non-literal argument it defers to a runtime-guarded **Tier 3** obligation (Z3's string theory has no byte-length operator). The boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay **Tier 1**: UTF-8 is self-synchronizing, so a valid substring / prefix / suffix matches at the byte level exactly when it matches at the code-point level. The other byte/offset-sensitive string builtins (`string_slice`, `string_index_of`, `string_char_code`, `string_chars`) are not translated to Z3 and already fall to **Tier 3**. Two further deferrals keep the predicates honest. Z3's string-sort alphabet only reaches U+2FFFF, and its Python binding silently stores any higher code point as the literal's *escape text* rather than the character — so a literal containing a code point **above U+2FFFF** is unusable in the `z3.StringVal`-based predicate translation (`string_contains` / `string_starts_with` / `string_ends_with`) and defers to **Tier 3** there, instead of letting a predicate match phantom escape bytes the runtime never sees. `string_length` is unaffected by this one — it byte-counts the *decoded* literal, so an astral literal's length stays **Tier 1**. A **lone surrogate** (U+D800–U+DFFF) has no UTF-8 encoding at all, so it defers on **both** paths: `z3.StringVal` raises on it, and `string_length`'s byte count cannot be taken. +**String length** ([#802](https://github.com/aallan/vera/issues/802)). Vera strings are UTF-8 byte sequences and `string_length` returns the **byte** count, but Z3's string theory (SMT-LIB 2.6) models strings as sequences of Unicode **code points** — its `Length` counts code points, which disagrees with the runtime on every multibyte character (`string_length("é")` is `2`, not `1`). `string_length` is therefore modeled at **Tier 1** only for a string **literal**, whose exact byte length is known; on any non-literal argument it defers to a runtime-guarded **Tier 3** obligation (Z3's string theory has no byte-length operator). The boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay **Tier 1**: UTF-8 is self-synchronizing, so a valid substring / prefix / suffix matches at the byte level exactly when it matches at the code-point level. The other byte/offset-sensitive string builtins (`string_slice`, `string_index_of`, `string_char_code`, `string_chars`) are not translated to Z3 and already fall to **Tier 3**. Two further deferrals keep the predicates honest. Z3's string-sort alphabet only reaches U+2FFFF, and its Python binding silently stores any higher code point as the literal's *escape text* rather than the character — so a literal containing a code point **above U+2FFFF** is unusable in the `z3.StringVal`-based predicate translation (`string_contains` / `string_starts_with` / `string_ends_with`) and defers to **Tier 3** there, instead of letting a predicate match phantom escape bytes the runtime never sees. `string_length` is unaffected by this one — it byte-counts the *decoded* literal, so an astral literal's length stays **Tier 1**. A **lone surrogate** (U+D800–U+DFFF) defers on **both** paths: `z3.StringVal` stores it as phantom escape text just like the astral case, and — since it has no UTF-8 encoding at all — `string_length`'s byte count cannot be taken either. Runtime traps for unguarded primitives are Vera-native: each trap carries a kind label (`divide_by_zero`, `out_of_bounds`, etc.), a per-kind Fix paragraph naming the precondition that would have prevented it, and a source backtrace — so a missing static guarantee is still a recoverable signal. diff --git a/tests/test_codegen.py b/tests/test_codegen.py index 13da7fcb..66dd7a60 100644 --- a/tests/test_codegen.py +++ b/tests/test_codegen.py @@ -1223,6 +1223,25 @@ def test_string_length_non_ascii_counts_utf8_bytes(self) -> None: ''' assert _run(four, fn="len") == 4 + def test_string_length_deferred_contract_enforced_at_runtime(self) -> None: + """#802 soundness loop: a slot-arg string_length contract that `vera + verify` DEFERS to Tier 3 is still enforced at runtime — the false + `ensures(@Int.result == 1)` over "é" (2 bytes) raises a postcondition + violation. This proves the deferral is *sound* (the runtime catches the + false contract verify could not prove), not merely imprecise — it closes + the loop the verifier-side deferral tests in + tests/test_string_length_soundness.py leave open.""" + source = r''' +public fn f(@String -> @Int) + requires(true) ensures(@Int.result == 1) effects(pure) +{ string_length(@String.0) } +''' + result = _compile_ok(source) + with pytest.raises( + (wasmtime.WasmtimeError, wasmtime.Trap, RuntimeError) + ): + execute(result, fn_name="f", raw_args=["é"]) + # ===================================================================== # Bool comparison codegen (i32 path) diff --git a/tests/test_string_length_soundness.py b/tests/test_string_length_soundness.py index 7e24ca6a..8dc3a03e 100644 --- a/tests/test_string_length_soundness.py +++ b/tests/test_string_length_soundness.py @@ -236,10 +236,11 @@ def test_astral_cutoff_boundary(self) -> None: ) def test_surrogate_literal_does_not_crash(self) -> None: - # A lone surrogate (U+D800) is not UTF-8-encodable; before the guard, - # string_length's `value.encode("utf-8")` and the predicate's - # `z3.StringVal` both raised UnicodeEncodeError and crashed `vera verify`. - # Both must now defer to Tier 3 without crashing. + # A lone surrogate (U+D800) is not UTF-8-encodable, so string_length's + # `value.encode("utf-8")` raised UnicodeEncodeError and crashed + # `vera verify` before the guard. (The predicate's `z3.StringVal` does + # not raise — it stores phantom escape text like the astral case — but is + # guarded the same way.) Both must now defer to Tier 3 without crashing. length = _verify(""" public fn f(@Unit -> @Int) requires(true) ensures(@Int.result == 1) effects(pure) diff --git a/vera/smt.py b/vera/smt.py index 6cb516c9..108497f9 100644 --- a/vera/smt.py +++ b/vera/smt.py @@ -574,16 +574,19 @@ def translate_expr( if isinstance(expr, ast.StringLit): # #802: Z3's string sort cannot faithfully model two kinds of code # point, so a literal containing either defers to Tier 3 (return - # None) rather than be reasoned over as a corrupted term: - # - above its alphabet (> U+2FFFF): the Python binding silently - # stores the *escape string* (literal "\U000xxxxx" ASCII chars) - # instead of the character, so Contains/PrefixOf/SuffixOf match - # phantom bytes the runtime never sees, proving false contracts; - # - a lone surrogate (U+D800..U+DFFF): not UTF-8-encodable, so - # z3.StringVal raises on it. - # (string_length models a literal via its UTF-8 byte count, not - # z3.StringVal, so it is unaffected by the alphabet limit — but it - # too must guard the surrogate case, below.) + # None) rather than be reasoned over as a corrupted term. For both, + # the Python binding silently stores the code point's *escape text* + # instead of the character (e.g. z3.StringVal("\U00030000") holds + # "\u{5c}u{30000}" — the backslash itself becomes "\u{5c}"), so + # Contains/PrefixOf/SuffixOf match phantom ASCII bytes the runtime + # never sees, proving false contracts. The two cases: + # - above the alphabet (> U+2FFFF); and + # - a lone surrogate (U+D800..U+DFFF), which additionally has no + # UTF-8 encoding at all (see string_length below). + # We must return None *before* z3.StringVal sees either. (string_length + # models a literal via its UTF-8 byte count, not z3.StringVal, so it + # is unaffected by the alphabet limit — but it too must guard the + # surrogate case, below, where the byte encoding genuinely fails.) if any(ord(ch) > 0x2FFFF or 0xD800 <= ord(ch) <= 0xDFFF for ch in expr.value): return None From 2d2db6e0b741c66e855f3d8ab4533432c9d0bd80 Mon Sep 17 00:00:00 2001 From: T Date: Sat, 27 Jun 2026 14:34:35 +0100 Subject: [PATCH 6/7] test(802): pin exact observables in two soundness tests (CodeRabbit) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit CodeRabbit review of b350578 (1 inline + 1 outside-diff, both "Functional Correctness | Minor"). Both findings are the same anti-pattern — an assertion satisfiable by something other than its named target — verified valid by reproduction + mutation: - tests/test_codegen.py: the round-5 runtime soundness-loop test used pytest.raises((WasmtimeError, Trap, RuntimeError)). execute() normalises every wasmtime trap into WasmTrapError (a RuntimeError subclass), so the broad form green-passes on any failure — a setup error or a regression that traps for a different reason. Narrowed to raises(WasmTrapError) + assert kind == "contract_violation" (confirmed empirically: a postcondition violation raises exactly this kind). Mutation: asserting the wrong kind flips the test RED. - tests/test_string_length_soundness.py (test_astral_cutoff_boundary): the slot-arg precondition form (a String param compared to an astral literal) produces TWO tier3 obligations — the astral requires literal also defers — so tier3 >= 1 would still pass if only the requires deferred and the body predicate regressed to Tier 1, masking the soundness-critical path. Rewrote to literal-on-literal with a trivial precondition so a SINGLE postcondition obligation is driven solely by the body predicate's astral literal, and added an explicit clean-deferral assertion. Mutation: removing the > 0x2FFFF guard now flips the isolated assertion RED. Test-only, plus the two TESTING.md line counts. No behaviour change. Refs #802, #392. Co-Authored-By: Claude --- TESTING.md | 4 ++-- tests/test_codegen.py | 12 +++++++++--- tests/test_string_length_soundness.py | 21 +++++++++++++++------ 3 files changed, 26 insertions(+), 11 deletions(-) diff --git a/TESTING.md b/TESTING.md index a15fb678..eadca89a 100644 --- a/TESTING.md +++ b/TESTING.md @@ -65,7 +65,7 @@ python scripts/fix_allowlists.py --fix # auto-fix stale allowlists | `test_int_overflow_codegen.py` | 50 | 567 | #798 Stage 3 — runtime overflow-trap codegen: the codegen emits a guard at *exactly* the `@Int`/`@Nat` `+`/`-`/`*` sites the verifier obligates, so `vera run`/`vera compile` programs trap on overflow instead of silently wrapping at the i64/u64 boundary. The trap is a bare `unreachable` (Option A — matches the #520 `nat_sub` and #552 nat-bind precedent), classified `kind="unreachable"` today (a precise `"overflow"` trap kind via a host import is a follow-up). Test-first: every `*_traps` test fails on the pre-Stage-3 codegen (the op wraps silently → no trap → `execute` returns a value); every `*_no_trap` test passes before and after (safe arithmetic unchanged) | | `test_int_overflow_differential.py` | 143 | 364 | #798 Stage 3 verifier↔codegen classification differential (cross-component soundness rule): the codegen overflow guard must fire at exactly the sites the verifier obligates *and* classify each site's operand type (`@Int` i64 vs `@Nat` u64) identically — else a Tier-1-clean program traps spuriously or a wrapping op slips through unguarded. Over a corpus exercising all five operand combos plus the literal-left ambiguity (which the pre-fix codegen mis-classified as `@Nat`), asserts the verifier's per-site gated classification equals the codegen's site for site, both sides driven by the same `ast.span_key` | | `test_monomorphize_differential.py` | 13 | 682 | #732 differential soundness: the verifier's per-monomorphization instantiation discovery covers every instantiation codegen emits (name coverage + per-generic count), over real generic programs (conformance ch02/ch09, `examples/generics.vera`) plus inline cases for the soundness-critical scenarios — collapsed type vars, **prelude combinator emission** (`option_map`), transitive generics, a generic whose type arg is fixed only by a **where-helper's return** (a `Float64`-returning helper, so the unresolved-var `"Bool"` phantom default cannot mask a miss), a generic whose type arg is fixed only by an **imported constructor** (`id2(MkBox(7))` — the verifier's mono-context must include `_module_constructors`, else it phantom-defaults and misses codegen's `id2`), a generic whose type arg is fixed only by an **imported function's return** (`id_g(make_int(...))` — the verifier's mono-context must seed `fn_ret_types` from imported functions, else it phantom-defaults and misses codegen's `id_g`, plus a **private-shadow** case pinning the imported-fn seeding stays unfiltered like codegen since filtering would diverge into a false Tier-1), and a generic reached only through a **contract clause or `where` helper** (codegen must seed Pass 1.5 from the shared node-level walk, not just `decl.body`, or it skips the clone → `CodegenSkip` at run time) — so a missed instantiation (a false Tier-1) is caught. Guards against a vacuous pass when codegen emits nothing, plus a **determinism guard** (`vera compile --wat` is byte-stable across `PYTHONHASHSEED` — the mono worklist sorts its instantiation sets) | -| `test_codegen.py` | 1,215 | 21,146 | WASM compilation, arithmetic, Float64, Byte, arrays (incl. compound element types), ADTs, match (incl. nested patterns), generics, State\, Exn\ handlers, control flow, strings, string escape sequences, IO (read\_line, read\_file, write\_file, args, exit, get\_env, sleep, time, stderr), bounds checking, quantifiers, assert/assume, refinement type aliases, pipe operator, string built-ins, built-in shadowing, parse\_nat Result, GC, Markdown host bindings, Regex host bindings, Map collection, Set collection, Decimal type, Json type, Html type, Http effect, Inference effect, Random effect, example round-trips, GC shadow stack overflow, **WASM tail-call optimization** (#517 — `return_call` emission for tail-position calls, 50K- and 1M-iteration stress, structural assertions on `return_call`/plain `call` boundary, **GC-aware TCO for allocating fns (#549 — `$gc_sp` restore before each `return_call`)**, postcondition-fallback regression (still reverts to plain `call`), analyzer unit tests covering Block-trailing / IfExpr-both-branches / MatchExpr-arm-bodies / let-value-NOT-marked / call-args-NOT-marked / ExprStmt-statement-NOT-marked / IfExpr-condition-NOT-marked / MatchExpr-scrutinee-NOT-marked), **@Nat narrowing runtime guards** (#552 let site; #747 extends to tuple-destructure / top-level match-bind / ADT sub-pattern / concrete ctor-field / call-arg sites — `i64.lt_s; unreachable` net, @Int targets exempt), **refinement-predicate runtime guards** (#746 — primitive-base (incl. `@Bool` / `@Float64`) refined params/returns traverse a `$vera.contract_fail` predicate guard at the function boundary: violating value traps, valid passes, call-args guarded transitively, generic refined returns guarded on the monomorphised instance; non-primitive `@Array` refinements (`array_length(...) > 0`) guarded too, the alias-aware `@Nat` base conjoins its implicit `>= 0`, and erased `@Unit` refinements emit no guard; **tuple-component boundary guards** decompose a `Tuple` param/return at the FFI boundary — loading each refined / `@Nat` component from the heap value and guarding it recursively for nested tuples, so a violating component traps instead of laundering past the verifier's component assumption; a **generic tuple alias** (`type Box = Tuple`) substitutes its argument so `Box` still guards its component, and a mutually-recursive (infinite) tuple alias **fails closed with E617** rather than silently emitting partial guards; a **refinement OVER a tuple** (`type Pair = { @Tuple \| true }`) is unwrapped so its components are guarded too, recursively through nested `Tuple`) | +| `test_codegen.py` | 1,215 | 21,152 | WASM compilation, arithmetic, Float64, Byte, arrays (incl. compound element types), ADTs, match (incl. nested patterns), generics, State\, Exn\ handlers, control flow, strings, string escape sequences, IO (read\_line, read\_file, write\_file, args, exit, get\_env, sleep, time, stderr), bounds checking, quantifiers, assert/assume, refinement type aliases, pipe operator, string built-ins, built-in shadowing, parse\_nat Result, GC, Markdown host bindings, Regex host bindings, Map collection, Set collection, Decimal type, Json type, Html type, Http effect, Inference effect, Random effect, example round-trips, GC shadow stack overflow, **WASM tail-call optimization** (#517 — `return_call` emission for tail-position calls, 50K- and 1M-iteration stress, structural assertions on `return_call`/plain `call` boundary, **GC-aware TCO for allocating fns (#549 — `$gc_sp` restore before each `return_call`)**, postcondition-fallback regression (still reverts to plain `call`), analyzer unit tests covering Block-trailing / IfExpr-both-branches / MatchExpr-arm-bodies / let-value-NOT-marked / call-args-NOT-marked / ExprStmt-statement-NOT-marked / IfExpr-condition-NOT-marked / MatchExpr-scrutinee-NOT-marked), **@Nat narrowing runtime guards** (#552 let site; #747 extends to tuple-destructure / top-level match-bind / ADT sub-pattern / concrete ctor-field / call-arg sites — `i64.lt_s; unreachable` net, @Int targets exempt), **refinement-predicate runtime guards** (#746 — primitive-base (incl. `@Bool` / `@Float64`) refined params/returns traverse a `$vera.contract_fail` predicate guard at the function boundary: violating value traps, valid passes, call-args guarded transitively, generic refined returns guarded on the monomorphised instance; non-primitive `@Array` refinements (`array_length(...) > 0`) guarded too, the alias-aware `@Nat` base conjoins its implicit `>= 0`, and erased `@Unit` refinements emit no guard; **tuple-component boundary guards** decompose a `Tuple` param/return at the FFI boundary — loading each refined / `@Nat` component from the heap value and guarding it recursively for nested tuples, so a violating component traps instead of laundering past the verifier's component assumption; a **generic tuple alias** (`type Box = Tuple`) substitutes its argument so `Box` still guards its component, and a mutually-recursive (infinite) tuple alias **fails closed with E617** rather than silently emitting partial guards; a **refinement OVER a tuple** (`type Pair = { @Tuple \| true }`) is unwrapped so its components are guarded too, recursively through nested `Tuple`) | | `test_codegen_contracts.py` | 32 | 570 | Runtime pre/postconditions, contract fail messages, old/new state postconditions | | `test_codegen_monomorphize.py` | 71 | 1,320 | Generic instantiation, type inference, monomorphization edge cases, ability constraint satisfaction (Eq/Ord/Hash/Show), operation rewriting (eq/compare), show/hash dispatch, ADT auto-derivation, array operations (slice/map/filter/fold) | | `test_codegen_closures.py` | 50 | 1,618 | Closure lifting, captured variables, higher-order functions, iterative-builder shadow-stack regressions (#570), closure return-value shadow-push balance for both i32-pair and i32-ADT branches across array_map and array_mapi, plus VERA_EAGER_GC injection self-test (#593), IndexExpr-of-FnCall element-type inference (#614), non-contiguous capture and walker-order miscompiles (#615) | @@ -75,7 +75,7 @@ python scripts/fix_allowlists.py --fix # auto-fix stale allowlists | `test_walker_defensive_branches_597.py` | 21 | 296 | Synthetic-AST tests for the 11 defensive `isinstance` branches added by #597 (`_scan_io_ops` / `_scan_expr_for_handlers` / `_infer_expr_wasm_type` / `_infer_vera_type`) plus the 5 pr-review fixes (#2/#3/#8 — ModuleCall/AnonFn/QualifiedCall return None; dead `is not None` guards on Block/HandleExpr removed) | | `test_check_walker_coverage_597.py` | 15 | 311 | Unit tests for `scripts/check_walker_coverage.py` parsing logic — Expr subclass extraction, isinstance flattening (incl. tuple form), checklist-block anchoring (incl. CR-3 regression test: `# Foo → bar` outside WALKER_COVERAGE block not counted), section-header tolerance, auto-discovery invariants, end-to-end main exit code | | `test_stress.py` | 16 | 553 | Scale-dependent regression tests (#596) — `@pytest.mark.stress`, skipped by default. 9 logical tests × eager-GC lane parametrisation = 16 test instances. 10K `array_map`, 5K nested-array `array_map`, 1K-deep tail recursion with allocating arg, 1M-deep tail recursion with allocating arg (#549 GC-aware TCO), 20×20 nested array-fold-of-array-fold, 100K `array_fold`, 10K String allocations, 1K `State` get/put cycles, 10K `IO.print` calls. Pins #570 / #515 / #593 / #549 / #487 / #348 / #573 regression coverage | -| `test_string_length_soundness.py` | 15 | 269 | #802 — string_length code-point vs UTF-8 byte soundness: a non-literal `string_length` defers to Tier 3 (the issue's `"é"` probe no longer proves `== 1` at Tier 1), a string-literal length is modeled at its exact UTF-8 byte count (`== 2` for `"é"`), and the boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay Tier 1 (sound under UTF-8 self-synchronization) | +| `test_string_length_soundness.py` | 15 | 278 | #802 — string_length code-point vs UTF-8 byte soundness: a non-literal `string_length` defers to Tier 3 (the issue's `"é"` probe no longer proves `== 1` at Tier 1), a string-literal length is modeled at its exact UTF-8 byte count (`== 2` for `"é"`), and the boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay Tier 1 (sound under UTF-8 self-synchronization) | | `test_errors.py` | 52 | 525 | Error code registry, diagnostic formatting, serialisation, SourceLocation, error display sync (README/HTML/spec) | | `test_formatter.py` | 124 | 1,074 | Comment extraction, interior comment positioning, expression/declaration formatting, match arm block bodies, idempotency, parenthesization, spec rules, ability declarations | | `test_cli.py` | 242 | 3,504 | CLI commands (check, verify, compile, run, test, fmt, version, quiet), subprocess integration, JSON error paths, runtime traps, arg validation, multi-file resolution, IO exit codes, --explain-slots, `builtins`/`effects`/`errors` introspection dispatch | diff --git a/tests/test_codegen.py b/tests/test_codegen.py index 66dd7a60..93631b17 100644 --- a/tests/test_codegen.py +++ b/tests/test_codegen.py @@ -25,6 +25,7 @@ compile, execute, ) +from vera.codegen.api import WasmTrapError from vera.parser import parse_file from vera.transform import transform @@ -1237,10 +1238,15 @@ def test_string_length_deferred_contract_enforced_at_runtime(self) -> None: { string_length(@String.0) } ''' result = _compile_ok(source) - with pytest.raises( - (wasmtime.WasmtimeError, wasmtime.Trap, RuntimeError) - ): + # Pin the *exact* observable: execute() normalises every wasmtime trap + # into a WasmTrapError, so a broad raises((WasmtimeError, Trap, + # RuntimeError)) would green-pass on any failure — a compile/setup + # error or a regression that traps for a different reason. Assert the + # contract-violation kind so the test fails iff this specific deferral + # stops being caught. + with pytest.raises(WasmTrapError) as excinfo: execute(result, fn_name="f", raw_args=["é"]) + assert excinfo.value.kind == "contract_violation" # ===================================================================== diff --git a/tests/test_string_length_soundness.py b/tests/test_string_length_soundness.py index 8dc3a03e..7ee8f0c7 100644 --- a/tests/test_string_length_soundness.py +++ b/tests/test_string_length_soundness.py @@ -214,23 +214,32 @@ def test_astral_cutoff_boundary(self) -> None: # Tier 1 (tier3 == 0); U+30000, one above, defers (tier3 >= 1). Uses a # predicate, not string_length, because string_length byte-counts via # Python and so does not exercise the z3.StringVal alphabet cutoff. + # + # Literal-on-literal with requires(true) so the *single* ensures + # obligation is driven solely by the body predicate's astral literal. + # A slot-arg requires(@String.0 == "\u{30000}") form adds a second + # tier3 obligation (the astral requires literal also defers), so + # `tier3 >= 1` there would still pass if only the requires deferred and + # the body predicate regressed to Tier 1 — masking the soundness- + # critical path. Pinning one obligation removes that ambiguity. modeled = _verify(""" -public fn f(@String -> @Bool) - requires(@String.0 == "\\u{2FFFF}") +public fn f(@Unit -> @Bool) + requires(true) ensures(@Bool.result == true) effects(pure) -{ string_starts_with(@String.0, "\\u{2FFFF}") } +{ string_starts_with("\\u{2FFFF}", "\\u{2FFFF}") } """) assert _ok(modeled) and modeled.summary.tier3_runtime == 0, ( modeled.summary.tier1_verified, modeled.summary.tier3_runtime, ) deferred = _verify(""" -public fn f(@String -> @Bool) - requires(@String.0 == "\\u{30000}") +public fn f(@Unit -> @Bool) + requires(true) ensures(@Bool.result == true) effects(pure) -{ string_starts_with(@String.0, "\\u{30000}") } +{ string_starts_with("\\u{30000}", "\\u{30000}") } """) + assert _ok(deferred), [d.description for d in deferred.diagnostics] assert deferred.summary.tier3_runtime >= 1, ( deferred.summary.tier1_verified, deferred.summary.tier3_runtime, ) From 2f4ba0b6ca3ac5656bb8557d29f133035ad92b6c Mon Sep 17 00:00:00 2001 From: T Date: Sat, 27 Jun 2026 15:04:29 +0100 Subject: [PATCH 7/7] docs(802): note astral/surrogate predicate Tier-3 deferral in TESTING row (CodeRabbit) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit CodeRabbit full review of 2d2db6e flagged that the test_string_length_soundness.py summary row described the string_length byte-count fix and the Tier-1 predicate cases but omitted the astral/surrogate literal predicate path the file also covers (TestAstralStringLiteral802) — which defers to Tier 3 because z3.StringVal cannot model those code points. Appended that clause so the row matches the file's actual coverage. Doc-only. Refs #802, #392. Co-Authored-By: Claude --- TESTING.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/TESTING.md b/TESTING.md index eadca89a..8a24c0dd 100644 --- a/TESTING.md +++ b/TESTING.md @@ -75,7 +75,7 @@ python scripts/fix_allowlists.py --fix # auto-fix stale allowlists | `test_walker_defensive_branches_597.py` | 21 | 296 | Synthetic-AST tests for the 11 defensive `isinstance` branches added by #597 (`_scan_io_ops` / `_scan_expr_for_handlers` / `_infer_expr_wasm_type` / `_infer_vera_type`) plus the 5 pr-review fixes (#2/#3/#8 — ModuleCall/AnonFn/QualifiedCall return None; dead `is not None` guards on Block/HandleExpr removed) | | `test_check_walker_coverage_597.py` | 15 | 311 | Unit tests for `scripts/check_walker_coverage.py` parsing logic — Expr subclass extraction, isinstance flattening (incl. tuple form), checklist-block anchoring (incl. CR-3 regression test: `# Foo → bar` outside WALKER_COVERAGE block not counted), section-header tolerance, auto-discovery invariants, end-to-end main exit code | | `test_stress.py` | 16 | 553 | Scale-dependent regression tests (#596) — `@pytest.mark.stress`, skipped by default. 9 logical tests × eager-GC lane parametrisation = 16 test instances. 10K `array_map`, 5K nested-array `array_map`, 1K-deep tail recursion with allocating arg, 1M-deep tail recursion with allocating arg (#549 GC-aware TCO), 20×20 nested array-fold-of-array-fold, 100K `array_fold`, 10K String allocations, 1K `State` get/put cycles, 10K `IO.print` calls. Pins #570 / #515 / #593 / #549 / #487 / #348 / #573 regression coverage | -| `test_string_length_soundness.py` | 15 | 278 | #802 — string_length code-point vs UTF-8 byte soundness: a non-literal `string_length` defers to Tier 3 (the issue's `"é"` probe no longer proves `== 1` at Tier 1), a string-literal length is modeled at its exact UTF-8 byte count (`== 2` for `"é"`), and the boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay Tier 1 (sound under UTF-8 self-synchronization) | +| `test_string_length_soundness.py` | 15 | 278 | #802 — string_length code-point vs UTF-8 byte soundness: a non-literal `string_length` defers to Tier 3 (the issue's `"é"` probe no longer proves `== 1` at Tier 1), a string-literal length is modeled at its exact UTF-8 byte count (`== 2` for `"é"`), and the boolean predicates `string_contains` / `string_starts_with` / `string_ends_with` stay Tier 1 (sound under UTF-8 self-synchronization), while a predicate over an astral (> U+2FFFF) or lone-surrogate literal defers to Tier 3 (z3.StringVal cannot model those code points) | | `test_errors.py` | 52 | 525 | Error code registry, diagnostic formatting, serialisation, SourceLocation, error display sync (README/HTML/spec) | | `test_formatter.py` | 124 | 1,074 | Comment extraction, interior comment positioning, expression/declaration formatting, match arm block bodies, idempotency, parenthesization, spec rules, ability declarations | | `test_cli.py` | 242 | 3,504 | CLI commands (check, verify, compile, run, test, fmt, version, quiet), subprocess integration, JSON error paths, runtime traps, arg validation, multi-file resolution, IO exit codes, --explain-slots, `builtins`/`effects`/`errors` introspection dispatch |