Skip to content

fix(802): defer string_length to Tier 3 for UTF-8 byte soundness (v0.0.182)#810

Merged
aallan merged 7 commits into
mainfrom
fix/802-string-utf8-bytes
Jun 27, 2026
Merged

fix(802): defer string_length to Tier 3 for UTF-8 byte soundness (v0.0.182)#810
aallan merged 7 commits into
mainfrom
fix/802-string-utf8-bytes

Conversation

@aallan

@aallan aallan commented Jun 27, 2026

Copy link
Copy Markdown
Owner

Closes #802 — the final sub-task of the #392 smt.py soundness audit. With this, all six soundness bugs the audit found (#797#802) are fixed, so this PR also Closes #392.

The bug

vera verify translated string_length to Z3's Length, which counts Unicode code points, but Vera's runtime string_length counts UTF-8 bytes. They disagree on every multibyte character, so the verifier proved false contracts:

public fn f(@String -> @Int)
  requires(@String.0 == "é")      # U+00E9: 1 code point, 2 UTF-8 bytes
  ensures(@Int.result == 1)       # FALSE — runtime returns 2
  effects(pure)
{ string_length(@String.0) }

vera verify → proved at Tier 1 (Z3 Length("é") == 1), yet vera run raises a postcondition violation (string_length("é") is 2). Silent for ASCII (1 byte = 1 code point); wrong for any multibyte character.

The fix

  • string_length (vera/smt.py): modeled at Tier 1 only for a string literal, whose exact UTF-8 byte length is known (string_length("é") now proves == 2, disproves == 1); on any non-literal argument it defers to a runtime-guarded Tier 3 obligation — Z3's string theory has no byte-length operator, so this matches the numeric-cast / quantifier / decimal precedent the audit established. Sound, and it preserves the literal-arg precision the original z3.Length was added for (with the correct byte count).
  • Audited the other modeled string builtins and confirmed they're sound, so they're untouched: string_contains / string_starts_with / string_ends_with stay Tier 1 because UTF-8 is self-synchronizing — a valid substring/prefix/suffix matches at the byte level exactly when it matches at the code-point level (verified by a verify↔run differential on non-ASCII inputs). The offset-sensitive builtins (string_slice, string_index_of, string_char_code, string_chars) were never translated to Z3 and already defer to Tier 3.

Also found in review — astral-plane string literals (folded in)

The adversarial review of this PR surfaced a second Z3-string unsoundness in the same class, so I fixed it here (this PR closes the #392 audit — shipping a known Z3-string hole would make that close dishonest). Z3's string-sort alphabet only reaches U+2FFFF, and its Python binding silently stores any higher code point as the literal's escape text — so string_contains("\u{10FFFF}", "f") proved true at Tier 1 (matching a phantom escape byte) while the runtime returns false. Reproduced, then fixed in vera/smt.py: a string literal containing a code point > U+2FFFF defers to Tier 3 (StringVal returns None), so the predicates never match phantom bytes. string_length is unaffected (it byte-counts the decoded value, not StringVal). Regression test: TestAstralStringLiteral802. The review also prompted a runtime byte-length test (string_length("é") == 2, "😀" == 4) — pinning the runtime premise the verifier's byte model relies on — and escaped-unicode / 4-byte / empty-literal coverage.

Exit checklist (§0)

  • pytest tests/ — green (5,138 passed, 16 skipped, 16 deselected)
  • mypy vera/ — clean
  • check_conformance.py — 93/93 (ch04_string_builtins.vera gains a non-ASCII byte-length case that verifies at Tier 1 and runs as 2)
  • check_examples.py — 35/35
  • check_version_sync.py — 0.0.182
  • CHANGELOG [0.0.182] + link refs; doc-counts / site-assets / spec-examples green
  • New behaviour covered by tests that fail on the pre-fix verifiertests/test_string_length_soundness.py (RED-confirmed: pre-fix proves == 1 / disproves == 2)

Tests

  • New tests/test_string_length_soundness.py (8): the literal byte-model ("é"2, not 1), slot-arg deferral (the issue's probe no longer proves at Tier 1), ASCII preservation, and the predicate-soundness guards (verify↔run agree on non-ASCII).
  • Rewritten tests that pinned the old unsound behaviourTestStringLengthVerification asserted string_length was Tier-1 via z3.Length (it now asserts the deferral / literal byte-model). Refinement, E501-rendering, and Expand smt.translate_expr coverage: FloatLit, ArrayLit, IndexExpr #667 call-typing tests that used string_length only as an example builtin switch to string_starts_with (a still-Tier-1 predicate), preserving each test's actual purpose. The aggregate tier-count test moves -2 T1 / +2 T3 (two example contracts defer), total unchanged.

Reproduce-before-fix note (CR's suggested option)

The issue offered a byte-indexed Z3 model as the alternative to deferral. A full byte model would require re-encoding Vera strings as Z3 byte-sequences (SeqSort of BitVec8) — a large rework that would itself need an audit. The literal-byte-length model here fixes the unsoundness, keeps the documented literal-arg precision, and minimises collateral weakening (sound checks like string_length("") > 0 over a literal still decide statically); a richer byte model can follow if Tier-1 string-length reasoning over variables is ever needed.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Improved verifier soundness for string_length by modelling supported string literals using exact UTF-8 byte length; non-literal inputs now defer to runtime-guarded Tier 3 checks, with added safeguards for astral and lone-surrogate edge cases.
    • Retained Tier 1 handling for string_contains, string_starts_with, and string_ends_with where sound, with Tier 3 deferrals for problematic literals.
  • Tests
    • Added a new string_length soundness regression suite and updated tiering/verification expectations, including an end-to-end contract-violation case for issue 802.
  • Documentation
    • Updated spec chapter 6.4.3 and release materials for v0.0.182, refreshing documented release and test-status totals.

…0.182)

vera verify translated string_length to Z3's Length, which counts Unicode code
points, but Vera's runtime counts UTF-8 bytes — so ensures(string_length("é")
== 1) verified at Tier 1 while the runtime returns 2 (a false proof for every
non-ASCII string). Final sub-task of the #392 smt.py soundness audit.

string_length is now modeled at Tier 1 only for a string literal (its exact byte
length is known); a non-literal argument defers to a runtime-guarded Tier 3
obligation (Z3's string theory has no byte-length operator), matching the
numeric-cast / quantifier / decimal precedent. The boolean predicates
string_contains / string_starts_with / string_ends_with stay Tier 1 — UTF-8 is
self-synchronizing, so a valid substring/prefix/suffix matches at the byte level
exactly when it matches at the code-point level. The offset-sensitive builtins
(string_slice / string_index_of / string_char_code / string_chars) were never
modeled in Z3 and already defer.

Tests that pinned the old z3.Length Tier-1 behavior are rewritten to the sound
deferral; refinement / E501 / #667 tests that used string_length only as an
example builtin switch to string_starts_with (a still-Tier-1 predicate). New
test_string_length_soundness.py covers the literal byte model, slot deferral,
and predicate soundness; ch04_string_builtins gains a non-ASCII byte-length case.

Closes #802.
Closes #392.

Co-Authored-By: Claude <noreply@anthropic.invalid>
@coderabbitai

coderabbitai Bot commented Jun 27, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1d219998-b6ce-47e3-834a-c7ce0cc8eda4

📥 Commits

Reviewing files that changed from the base of the PR and between 2690a08 and 2f4ba0b.

⛔ Files ignored due to path filters (6)
  • docs/index.html is excluded by !docs/**
  • docs/index.md is excluded by !docs/**
  • docs/llms-full.txt is excluded by !docs/**
  • docs/llms.txt is excluded by !docs/**
  • tests/conformance/ch04_string_builtins.vera is excluded by !**/*.vera
  • uv.lock is excluded by !**/*.lock, !uv.lock
📒 Files selected for processing (15)
  • CHANGELOG.md
  • HISTORY.md
  • README.md
  • ROADMAP.md
  • TESTING.md
  • pyproject.toml
  • scripts/check_spec_examples.py
  • spec/06-contracts.md
  • tests/test_codegen.py
  • tests/test_obligations.py
  • tests/test_string_length_soundness.py
  • tests/test_verifier.py
  • tests/test_verifier_coverage.py
  • vera/__init__.py
  • vera/smt.py

📝 Walkthrough

Walkthrough

This PR changes string_length to use UTF-8 byte counts for string literals and defer non-literal cases to Tier 3. It adds regression tests, updates verifier expectations for related string predicates, and bumps release/version metadata to 0.0.182.

Changes

String length soundness

Layer / File(s) Summary
Translation and contract
spec/06-contracts.md, scripts/check_spec_examples.py, CHANGELOG.md, vera/smt.py
string_length is reclassified for literals vs non-literals, the spec example allowlist is updated, the release note is added, and SMT translation now uses UTF-8 byte lengths or Tier 3 fallback.
Regression coverage
tests/test_string_length_soundness.py, tests/test_obligations.py, tests/test_verifier.py, tests/test_verifier_coverage.py, tests/test_codegen.py
New regression cases cover literal and non-literal string_length verification, non-ASCII string predicates, and updated verifier/codegen expectations for the revised string semantics.
Release metadata
HISTORY.md, README.md, ROADMAP.md, TESTING.md, pyproject.toml, vera/__init__.py
Release notes, status text, test counts, version fields, and package version are bumped to 0.0.182.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Possibly related PRs

  • aallan/vera#456: Directly overlaps with the vera/smt.py string_length translation path changed here.

Suggested labels

compiler, tests, spec, docs, ci

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 54.29% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and accurately summarises the main change: deferring non-literal string_length to Tier 3 for UTF-8 byte soundness.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/802-string-utf8-bytes

Comment @coderabbitai help to get the list of available commands.

@codecov

codecov Bot commented Jun 27, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.03%. Comparing base (2690a08) to head (2f4ba0b).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #810   +/-   ##
=======================================
  Coverage   92.03%   92.03%           
=======================================
  Files          89       89           
  Lines       26853    26855    +2     
  Branches      321      321           
=======================================
+ Hits        24713    24717    +4     
+ Misses       2132     2130    -2     
  Partials        8        8           
Flag Coverage Δ
javascript 65.33% <ø> (ø)
python 95.09% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…runtime/edge tests

Adversarial review of #810 found a second Z3-string unsoundness in the same class
as #802: Z3's string sort alphabet only reaches U+2FFFF, and the Python binding
silently stores any higher code point as the literal's escape string — so
string_contains("\u{10FFFF}", "f") proved true at Tier 1 while the runtime returns
false. Since this PR closes the #392 smt.py string-soundness audit, fixing it here
keeps that close honest.

- vera/smt.py: a string literal containing a code point > U+2FFFF defers to Tier 3
  (StringVal returns None), so the predicates never match phantom escape bytes.
  string_length is unaffected (it byte-counts the decoded value, not StringVal),
  so an astral literal's byte length is still soundly Tier-1.
- tests/test_string_length_soundness.py: TestAstralStringLiteral802 (the
  regression — pre-fix proved the false contract at Tier 1) plus literal-edge
  coverage (escaped-unicode decode, 4-byte emoji, empty string).
- tests/test_codegen.py: a non-ASCII runtime byte-length test
  (string_length("é") == 2, "😀" == 4) — pins the runtime premise the verifier's
  byte model relies on, so a codegen regression to code-point counting can't hide
  behind green verifier-side tests.
- spec/06 §6.4.3 + CHANGELOG: note the astral-literal Tier-3 deferral.

Refs #802, #392.

Co-Authored-By: Claude <noreply@anthropic.invalid>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
vera/smt.py (1)

1075-1079: 🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Reject lone surrogate escapes before UTF-8 sizing vera/smt.py:1075-1079

_decode_string_escapes() accepts \u{D800}-style values, so len(arg_node.value.encode("utf-8")) can raise UnicodeEncodeError instead of falling back to Tier 3. Reject surrogate code points upstream or handle the encode failure here.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@vera/smt.py` around lines 1075 - 1079, The string_length fast path in
`vera.smt` can crash on lone surrogate escapes because
`arg_node.value.encode("utf-8")` assumes valid UTF-8. Update the `string_length`
handling in the SMT translation logic to either reject surrogate code points
before calling `encode` or catch `UnicodeEncodeError` and fall back to the Tier
3 path, using the existing `call`/`arg_node` string-literal branch as the place
to make the change.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/test_string_length_soundness.py`:
- Around line 124-132: Add an explicit boundary test around the ord(ch) >
0x2FFFF check in the string length soundness suite: extend
test_four_byte_emoji_literal_byte_length or add a nearby case that verifies both
\u{2FFFF} and \u{30000} so the compiler behavior at the exact cutoff is covered.
Use the existing _verify helper and string_length assertions to ensure \u{2FFFF}
is treated on the lower side and \u{30000} on the higher side, catching any
accidental > versus >= regression.
- Around line 192-206: The astral-literal test only checks that Tier 1 did not
verify, which can still pass if verification fails differently or the obligation
disappears. In test_astral_predicate_not_proved_at_tier1, keep the existing
tier1_verified assertion but also assert _ok(result) and that
result.summary.tier3_runtime is at least 1 so the test explicitly covers the
intended Tier-3 fallback behavior.

---

Outside diff comments:
In `@vera/smt.py`:
- Around line 1075-1079: The string_length fast path in `vera.smt` can crash on
lone surrogate escapes because `arg_node.value.encode("utf-8")` assumes valid
UTF-8. Update the `string_length` handling in the SMT translation logic to
either reject surrogate code points before calling `encode` or catch
`UnicodeEncodeError` and fall back to the Tier 3 path, using the existing
`call`/`arg_node` string-literal branch as the place to make the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c7827713-5433-4a0e-bc61-ad9a8cebf0dd

📥 Commits

Reviewing files that changed from the base of the PR and between 35579b4 and 0195d9c.

📒 Files selected for processing (8)
  • CHANGELOG.md
  • README.md
  • ROADMAP.md
  • TESTING.md
  • spec/06-contracts.md
  • tests/test_codegen.py
  • tests/test_string_length_soundness.py
  • vera/smt.py

Comment thread tests/test_string_length_soundness.py
Comment thread tests/test_string_length_soundness.py Outdated
…toff/astral tests (review)

CodeRabbit review of #810:
- vera/smt.py: a lone surrogate (U+D800..U+DFFF) is not UTF-8-encodable, so the
  new string_length literal byte-model (value.encode) AND z3.StringVal both
  crashed `vera verify` with UnicodeEncodeError. Guard both: string_length
  catches UnicodeEncodeError -> Tier 3; the StringVal literal-deferral now covers
  surrogates alongside the > U+2FFFF astral cutoff.
- tests/test_string_length_soundness.py: a surrogate regression test (no crash,
  defers), an exact-cutoff boundary test (\u{2FFFF} Tier-1 vs \u{30000} Tier-3,
  pinning > vs >=), and the astral predicate test reworked to the
  literal-on-literal form so it asserts the clean Tier-3 fallback (_ok +
  tier3 >= 1).
- spec/06 section 6.4.3 + CHANGELOG: note the surrogate deferral.

Note: CR's finding 1 suggested a string_length boundary test, but string_length
byte-counts via Python and does not exercise the z3.StringVal cutoff, so the
boundary test is on a predicate instead.

Refs #802, #392.

Co-Authored-By: Claude <noreply@anthropic.invalid>
@aallan

aallan commented Jun 27, 2026

Copy link
Copy Markdown
Owner Author

@coderabbitai Good catch — reproduced (it crashed vera verify with UnicodeEncodeError) and fixed in 2961d91. A lone surrogate broke both paths: string_length's value.encode("utf-8") (now wrapped → Tier 3 on UnicodeEncodeError) and z3.StringVal in the predicate path (the literal-deferral guard now covers U+D800–U+DFFF alongside the > U+2FFFF astral cutoff). Regression: test_surrogate_literal_does_not_crash exercises both a string_length and a string_contains over "\u{D800}".

@coderabbitai

coderabbitai Bot commented Jun 27, 2026

Copy link
Copy Markdown

Tip

For best results, initiate chat on the files or code changes.

@aallan Nice fix — that closes the hole I was worried about.

Catching UnicodeEncodeError in the string_length literal path and widening the literal-deferral guard to include U+D800–U+DFFF on the predicate path is the right soundness-preserving behavior here. The regression coverage in test_surrogate_literal_does_not_crash also looks well targeted since it exercises both string_length and string_contains on "\u{D800}".


✏️ Learnings added
Learnt from: aallan
URL: https://github.com/aallan/vera/pull/810

Timestamp: 2026-06-27T11:15:31.091Z
Learning: In `vera/smt.py`, string literals containing lone surrogates (`U+D800–U+DFFF`) must not be translated at Tier 1: the `string_length` literal path can raise `UnicodeEncodeError` on UTF-8 encoding and should defer to Tier 3, and the predicate literal path should also defer rather than calling `z3.StringVal`. Regression coverage exists in `tests/test_string_length_soundness.py` via `test_surrogate_literal_does_not_crash`, which exercises both `string_length` and `string_contains` on `"\u{D800}"`.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@spec/06-contracts.md`:
- Around line 257-258: The String length/spec paragraph overstates the U+2FFFF
fallback and conflicts with `vera/smt.py` and
`tests/test_string_length_soundness.py`. Update the wording in the string
builtins section so the U+2FFFF cutoff applies only to `z3.StringVal`-based
predicate translation (with lone surrogates still deferring to Tier 3), while
`string_length` on literals remains Tier 1 via byte-counting. Keep the existing
references to `string_length`, `string_contains`, `string_starts_with`,
`string_ends_with`, and `z3.StringVal` to anchor the revised scope.

In `@tests/test_string_length_soundness.py`:
- Around line 243-254: The current assertions in test_string_length_soundness.py
only check that tier3_runtime is reached, which still allows a deferred
verification error; update the soundness checks around _verify for string_length
and string_contains to also assert _ok(length) and _ok(predicate). Keep the
existing Tier-3 fallback assertion, but make the test fail if the verifier
reports an E500/checker error instead of a clean graceful fallback, using the
same length and predicate variables.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 161dfdbf-8c72-49e1-9346-2798caa1e07e

📥 Commits

Reviewing files that changed from the base of the PR and between 0195d9c and 2961d91.

📒 Files selected for processing (7)
  • CHANGELOG.md
  • README.md
  • ROADMAP.md
  • TESTING.md
  • spec/06-contracts.md
  • tests/test_string_length_soundness.py
  • vera/smt.py

Comment thread spec/06-contracts.md Outdated
Comment thread tests/test_string_length_soundness.py Outdated
T and others added 2 commits June 27, 2026 12:26
…r-3 fallback (review)

CodeRabbit review of #810:
- spec/06 section 6.4.3: the string-length paragraph overstated the U+2FFFF
  cutoff. It applies only to the z3.StringVal-based predicate translation;
  string_length byte-counts the decoded literal, so an astral (> U+2FFFF)
  literal's length stays Tier 1. Reworded to scope the cutoff to the predicates,
  with the lone surrogate (no UTF-8 encoding) still deferring on both paths.
- tests/test_string_length_soundness.py: the surrogate test now also asserts
  _ok(length) and _ok(predicate), so it fails on an E500 / checker error rather
  than only requiring tier3_runtime — pinning the graceful fallback, not just any
  deferral.

Refs #802, #392.

Co-Authored-By: Claude <noreply@anthropic.invalid>
…s-loop test (review-pr)

pr-review-toolkit review of #810 (4 agents — all found the fix sound; 0
critical/important code issues). Addressed the agreed findings:
- Doc accuracy: comments / spec / CHANGELOG / test-docstring claimed z3.StringVal
  "raises" on a lone surrogate. It does not (on z3 4.16 it silently stores the
  code point's escape text, same as the > U+2FFFF astral case); only
  string_length's value.encode("utf-8") genuinely raises. Reworded vera/smt.py,
  spec/06, CHANGELOG, and the surrogate test docstring; fixed the astral
  escape-text illustration ("\U000xxxxx" -> the actual "\u{...}" form).
- tests/test_codegen.py: added a runtime soundness-loop test — a slot-arg
  string_length contract that `vera verify` DEFERS to Tier 3 is enforced at
  runtime (ensures(@Int.result == 1) over "é" raises a postcondition violation),
  closing the "deferral is sound, not merely imprecise" loop the verifier-side
  tests leave open.

No behaviour change (comment / test / doc only). Refs #802, #392.

Co-Authored-By: Claude <noreply@anthropic.invalid>
@aallan

aallan commented Jun 27, 2026

Copy link
Copy Markdown
Owner Author

pr-review-toolkit review (4 agents, parallel)

Ran a comprehensive review of this PR — code-reviewer, pr-test-analyzer, silent-failure-hunter, comment-analyzer — against head b350578.

Critical / Important: none. All four agents independently validated the fix as sound:

  • The literal byte-model + non-literal Tier-3 deferral is conservative (None → Tier 3 verified at every translate_expr consumer — never a false discharge); the except UnicodeEncodeError is correctly narrow.
  • The astral cutoff (> U+2FFFF) and surrogate guards are complete w.r.t. the inputs Z3 can't model; mutation-validated (reverting the byte-model / flipping > to >= / removing the surrogate guard each flips the right tests RED).
  • The UTF-8 self-synchronization soundness of the kept predicates was re-checked over ~20k randomized cases (0 mismatches).

Addressed (the only findings — doc accuracy + one test), in b350578:

  • Two agents flagged that the comments/spec/CHANGELOG/test-docstring said z3.StringVal raises on a lone surrogate. Verified on z3 4.16: it does not — it silently stores the code point's escape text (same as the astral case); only string_length's encode("utf-8") genuinely raises. The guard returns None before StringVal either way, so behaviour was always sound — only the rationale was wrong. Reworded vera/smt.py, spec/06, CHANGELOG, the surrogate test docstring; corrected the astral escape-text illustration to the actual \u{...} form.
  • Added a runtime soundness-loop test (tests/test_codegen.py): a slot-arg string_length contract that vera verify defers to Tier 3 raises a postcondition violation at runtime — closing the "deferral is sound, not just imprecise" loop the verifier-side tests leave open.

No behaviour change in this round.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/test_string_length_soundness.py (1)

217-236: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

This boundary test is exercising the requires(...) fallback, not just the z3.StringVal cut-off.

The "\u{30000}" literal in requires(@String.0 == ...) takes the repo’s known untranslatable-literal path before string_starts_with(...) becomes the interesting part, so tier3_runtime >= 1 can still pass without pinning the > 0x2FFFF predicate boundary itself. Use literal-on-literal probes for both sides of the cut-off and assert the deferred case stays _ok(...).

Suggested fix
-        modeled = _verify("""
-public fn f(`@String` -> `@Bool`)
-  requires(`@String.0` == "\\u{2FFFF}")
+        modeled = _verify("""
+public fn f(`@Unit` -> `@Bool`)
+  requires(true)
   ensures(`@Bool.result` == true)
   effects(pure)
-{ string_starts_with(`@String.0`, "\\u{2FFFF}") }
+{ string_starts_with("\\u{2FFFF}", "\\u{2FFFF}") }
 """)
         assert _ok(modeled) and modeled.summary.tier3_runtime == 0, (
             modeled.summary.tier1_verified, modeled.summary.tier3_runtime,
         )
         deferred = _verify("""
-public fn f(`@String` -> `@Bool`)
-  requires(`@String.0` == "\\u{30000}")
+public fn f(`@Unit` -> `@Bool`)
+  requires(true)
   ensures(`@Bool.result` == true)
   effects(pure)
-{ string_starts_with(`@String.0`, "\\u{30000}") }
+{ string_starts_with("\\u{30000}", "\\u{30000}") }
 """)
+        assert _ok(deferred), [d.description for d in deferred.diagnostics]
         assert deferred.summary.tier3_runtime >= 1, (
             deferred.summary.tier1_verified, deferred.summary.tier3_runtime,
         )

Based on learnings, astral literals inside requires(...) are dropped as untranslatable in this repo, and as per path instructions this test should pin the exact observable it claims to cover.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_string_length_soundness.py` around lines 217 - 236, The current
test in test_string_length_soundness is hitting the untranslatable-literal
fallback in requires(...) instead of validating the string boundary in
string_starts_with(...). Update the probe to use literal-on-literal checks on
both sides of the 0x2FFFF cutoff, and make the deferred case assert _ok(...)
with the expected tier3_runtime behavior so the test directly covers the
intended boundary rather than the fallback path. Use the existing _verify, _ok,
and string_starts_with symbols to keep the assertions focused on the cutoff
behavior.

Sources: Path instructions, Learnings

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/test_codegen.py`:
- Around line 1239-1243: The test in `test_codegen.py` is too broad because it
accepts raw `wasmtime.Trap`, `WasmtimeError`, and any `RuntimeError`, which can
hide unrelated failures in `execute()`. Narrow the assertion to the
contract-violation wrapper exposed by `vera/codegen/api.py::execute`, and verify
the observable trap shape it normalises to instead of matching generic exception
types. Keep the test focused on the `execute` call with the invalid raw argument
and assert the specific wrapper/error class that represents contract violations.

---

Outside diff comments:
In `@tests/test_string_length_soundness.py`:
- Around line 217-236: The current test in test_string_length_soundness is
hitting the untranslatable-literal fallback in requires(...) instead of
validating the string boundary in string_starts_with(...). Update the probe to
use literal-on-literal checks on both sides of the 0x2FFFF cutoff, and make the
deferred case assert _ok(...) with the expected tier3_runtime behavior so the
test directly covers the intended boundary rather than the fallback path. Use
the existing _verify, _ok, and string_starts_with symbols to keep the assertions
focused on the cutoff behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: b261ff19-6151-4a8c-9540-3dd409c58d5a

📥 Commits

Reviewing files that changed from the base of the PR and between 7376df2 and b350578.

📒 Files selected for processing (8)
  • CHANGELOG.md
  • README.md
  • ROADMAP.md
  • TESTING.md
  • spec/06-contracts.md
  • tests/test_codegen.py
  • tests/test_string_length_soundness.py
  • vera/smt.py

Comment thread tests/test_codegen.py
CodeRabbit review of b350578 (1 inline + 1 outside-diff, both "Functional
Correctness | Minor").  Both findings are the same anti-pattern — an assertion
satisfiable by something other than its named target — verified valid by
reproduction + mutation:

- tests/test_codegen.py: the round-5 runtime soundness-loop test used
  pytest.raises((WasmtimeError, Trap, RuntimeError)).  execute() normalises
  every wasmtime trap into WasmTrapError (a RuntimeError subclass), so the
  broad form green-passes on any failure — a setup error or a regression that
  traps for a different reason.  Narrowed to raises(WasmTrapError) + assert
  kind == "contract_violation" (confirmed empirically: a postcondition
  violation raises exactly this kind).  Mutation: asserting the wrong kind
  flips the test RED.

- tests/test_string_length_soundness.py (test_astral_cutoff_boundary): the
  slot-arg precondition form (a String param compared to an astral literal)
  produces TWO tier3 obligations — the astral requires literal also defers —
  so tier3 >= 1 would still pass if only the requires deferred and the body
  predicate regressed to Tier 1, masking the soundness-critical path.  Rewrote
  to literal-on-literal with a trivial precondition so a SINGLE postcondition
  obligation is driven solely by the body predicate's astral literal, and
  added an explicit clean-deferral assertion.  Mutation: removing the
  > 0x2FFFF guard now flips the isolated assertion RED.

Test-only, plus the two TESTING.md line counts.  No behaviour change.
Refs #802, #392.

Co-Authored-By: Claude <noreply@anthropic.invalid>
@aallan

aallan commented Jun 27, 2026

Copy link
Copy Markdown
Owner Author

@coderabbitai Re: the outside-diff finding on tests/test_string_length_soundness.py test_astral_cutoff_boundary (217-236) — valid on a precision point, fixed in 2d2db6e. The mechanism is slightly different from the writeup: the slot-arg requires(@String.0 == "\u{30000}") form produces two tier3 obligations (the astral requires literal defers as an obligation, not just an assumption), so tier3 >= 1 couldn't isolate the body-predicate cutoff — a body-only regression to Tier 1 would still pass. Rewrote both sides to literal-on-literal with requires(true), so a single ensures obligation is driven solely by the body predicate's astral literal, and added assert _ok(deferred). Mutation-checked: removing the > 0x2FFFF guard now flips the isolated tier3 >= 1 assertion RED.

@coderabbitai

coderabbitai Bot commented Jun 27, 2026

Copy link
Copy Markdown

Tip

For best results, initiate chat on the files or code changes.

@aallan Nice catch on the precision issue — thanks for tightening that.

You're right: with requires(@String.0 == "\u{30000}"), the test was conflating two Tier 3 sources, so tier3 >= 1 wasn't sufficient to prove the body-predicate cutoff itself was what deferred. Rewriting it to literal-on-literal under requires(true) makes the obligation accounting much cleaner, and adding assert _ok(deferred) plus the mutation check is exactly the kind of isolation I was hoping for.

That addresses my concern.


✏️ Learnings added
Learnt from: aallan
URL: https://github.com/aallan/vera/pull/810

Timestamp: 2026-06-27T13:40:10.152Z
Learning: In `tests/test_string_length_soundness.py`, the `test_astral_cutoff_boundary` regression must isolate the body-predicate astral cutoff in `vera/smt.py` without introducing extra Tier 3 obligations from assumptions. A form like `requires(String.0 == "\u{30000}")` creates two Tier 3 obligations, so `tier3 >= 1` is not specific enough. The robust pattern is literal-on-literal with `requires(true)`, so a single `ensures` obligation is driven solely by the body predicate's astral literal, together with `assert _ok(deferred)` and a mutation check that removing the `> 0x2FFFF` guard flips the isolated assertion red.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

@aallan

aallan commented Jun 27, 2026

Copy link
Copy Markdown
Owner Author

@coderabbitai resume

@aallan

aallan commented Jun 27, 2026

Copy link
Copy Markdown
Owner Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jun 27, 2026

Copy link
Copy Markdown
✅ Action performed

Reviews resumed.

@coderabbitai

coderabbitai Bot commented Jun 27, 2026

Copy link
Copy Markdown
✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@TESTING.md`:
- Line 78: Update the TESTING.md table summary for
test_string_length_soundness.py to also mention the astral/surrogate-literal
predicate regression; the current row only describes the string_length
byte-count behavior and Tier 1 predicates. Edit the row text so it references
the astral/surrogate literal path and its Tier 3 deferral, using the existing
test_string_length_soundness.py entry as the anchor.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: bb3cf7ef-55e5-494f-9c17-021c199aef95

📥 Commits

Reviewing files that changed from the base of the PR and between 7376df2 and 2d2db6e.

📒 Files selected for processing (8)
  • CHANGELOG.md
  • README.md
  • ROADMAP.md
  • TESTING.md
  • spec/06-contracts.md
  • tests/test_codegen.py
  • tests/test_string_length_soundness.py
  • vera/smt.py

Comment thread TESTING.md Outdated
… row (CodeRabbit)

CodeRabbit full review of 2d2db6e flagged that the test_string_length_soundness.py
summary row described the string_length byte-count fix and the Tier-1 predicate
cases but omitted the astral/surrogate literal predicate path the file also
covers (TestAstralStringLiteral802) — which defers to Tier 3 because z3.StringVal
cannot model those code points.  Appended that clause so the row matches the
file's actual coverage.

Doc-only. Refs #802, #392.

Co-Authored-By: Claude <noreply@anthropic.invalid>
@aallan

aallan commented Jun 27, 2026

Copy link
Copy Markdown
Owner Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jun 27, 2026

Copy link
Copy Markdown
✅ Action performed

Full review finished.

@aallan aallan merged commit d725d2d into main Jun 27, 2026
27 checks passed
@aallan aallan deleted the fix/802-string-utf8-bytes branch June 27, 2026 15:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant