fix(test): regenerate kpm_regression baseline and close CI gap (#155)#158
Merged
kalwalt merged 3 commits intoMay 23, 2026
Merged
Conversation
The `test_full_pipeline_pose` test has been silently failing on `dev` because no CI job ran the integration tests under `tests/` with `--features ffi-backend`. The `kpm-build` job only runs `--lib` tests, and `build-and-test` runs the workspace without `ffi-backend`, so the C++-backed full-pipeline test was never executed in CI. This let the `EXPECTED_FULL_POSE` / `EXPECTED_FULL_ERROR` baseline constants in `crates/core/tests/kpm_regression.rs` drift out of sync with the actual pipeline output across the M9 series. Changes: - Regenerate `EXPECTED_FULL_POSE` and `EXPECTED_FULL_ERROR` against the current C++-backed pipeline on `pinball-demo.jpg`. Capture was done via a temporary `arlog_e!` block inside the test (per CLAUDE.md §2 logging convention), then removed. - Document the regeneration procedure in the `EXPECTED_FULL_POSE` doc comment so future maintainers have a one-glance recipe. - Add a new Ubuntu-only step to the `kpm-build` job that runs the three `ffi-backend` integration tests (`kpm_regression`, `nft_pipeline`, `ar2_pinball_io`). This closes the gate so a stale baseline can never silently slip through CI again. - Add design doc `docs/design/m9-kpm-regression-baseline-fix.md` capturing Understanding Summary, Decision Log, Assumptions, Risks, and Verification workflow (matches the M9 series doc pattern). Closes #155. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
3 tasks
…inux CI on PR #158 surfaced R2 from the design doc: the baseline I regenerated on Windows fails on the Ubuntu runner by ~6e-2 in pose[0][2] — far above the 1e-2 tolerance. The original Linux baseline was actually correct all along; the local Windows failure that motivated this PR was cross-platform rounding variance accumulating through the C++ FREAK + RANSAC + ICP chain, not staleness. Changes: - Restore the original EXPECTED_FULL_POSE / EXPECTED_FULL_ERROR values (Linux baseline). - Gate test_full_pipeline_pose to target_os = "linux" so Windows/macOS local runs of `cargo test` skip rather than misreport the cross-platform variance. - Update EXPECTED_FULL_POSE doc with explicit platform-sensitivity note and Linux-only regeneration procedure. - Update design doc with R2 materialization and resolution. The CI gate is unchanged (Ubuntu-only step in kpm-build job) and still catches genuine drift on the platform that owns the baseline. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
On reflection the regen capture is a one-shot informational dump, which per CLAUDE.md §2 maps to arlog_i!, not arlog_e! (which is for misconfiguration / wiring errors). Update the recipe in EXPECTED_FULL_POSE doc comment and design doc D2 accordingly, and document RUST_LOG=info in the run command. Refs #155. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
kalwalt
added a commit
that referenced
this pull request
May 23, 2026
…inux CI on PR #158 surfaced R2 from the design doc: the baseline I regenerated on Windows fails on the Ubuntu runner by ~6e-2 in pose[0][2] — far above the 1e-2 tolerance. The original Linux baseline was actually correct all along; the local Windows failure that motivated this PR was cross-platform rounding variance accumulating through the C++ FREAK + RANSAC + ICP chain, not staleness. Changes: - Restore the original EXPECTED_FULL_POSE / EXPECTED_FULL_ERROR values (Linux baseline). - Gate test_full_pipeline_pose to target_os = "linux" so Windows/macOS local runs of `cargo test` skip rather than misreport the cross-platform variance. - Update EXPECTED_FULL_POSE doc with explicit platform-sensitivity note and Linux-only regeneration procedure. - Update design doc with R2 materialization and resolution. The CI gate is unchanged (Ubuntu-only step in kpm-build job) and still catches genuine drift on the platform that owns the baseline. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
EXPECTED_FULL_POSE/EXPECTED_FULL_ERRORbaseline incrates/core/tests/kpm_regression.rs::test_full_pipeline_poseagainst the current C++-backed pipeline output onpinball-demo.jpg.kpm-buildCI job that runs the threeffi-backendintegration tests (kpm_regression,nft_pipeline,ar2_pinball_io) so a stale baseline cannot silently slip through CI again.docs/design/m9-kpm-regression-baseline-fix.mdcapturing Understanding Summary, Decision Log (D1–D6 + Q1/Q2), Assumptions, Risks, and the regeneration recipe.Why
test_full_pipeline_posewas silently failing ondev:pose[0][2]diverged from the baseline by6.134e-2(tolerance1.0e-2). Reproduced against clean post-#153 base with M9-2 work stashed — the failure is pre-existing and accumulated drift across the M9 series.CI was the root cause: every M9 PR merged green because no job ran the integration tests with
--features ffi-backend.kpm-buildonly runs--lib --features dual-mode, andbuild-and-testruns the workspace withoutffi-backend.Regeneration recipe
Documented in the
EXPECTED_FULL_POSEdoc comment inkpm_regression.rs. Capture was done via a temporaryarlog_e!block (per CLAUDE.md §2 logging convention), then removed.Test plan
cargo test -p webarkitlib-rs --test kpm_regression --features ffi-backend— green (5/5)cargo test -p webarkitlib-rs --test nft_pipeline --test ar2_pinball_io --features ffi-backend— green (no drift to fix in these)cargo fmt --all -- --check— cleancargo clippy --workspace -- -D warnings— clean (mirrors CI)cargo test --all-features— 463 passed, 8 ignoredRun ffi-backend integration tests) passes on Ubuntu in this PRCloses #155.
🤖 Generated with Claude Code