ci: cache target dir for CUDA codspeed bench (~7× faster build)#8256

Closed
joseph-isaacs wants to merge 1 commit into develop from ci/cuda-bench-target-cache
@joseph-isaacs joseph-isaacs commented Jun 4, 2026

Contributor

Problem

The CUDA codspeed bench shards run on ephemeral runs-on GPU runners with no prebuilt AMI (unlike the CPU shards, which use setup-prebuild on the -pre-v2 AMIs). cargo's target/ is cold on every run, so even though sccache (S3) serves ~98% of compilation, cargo still re-runs build scripts and re-links unchanged dependency crates.

What this does

Restore target/ between runs with Swatinem/rust-cache, keyed per shard and saved only on develop, so PRs restore from develop's cache without polluting it.

The cache is configured to match existing repo conventions as closely as possible:

  • extras=s3-cache on the runner spec routes the cache to the same S3 bucket as sccache — the same mechanism bench.yml / bench-pr.yml use. No separate GitHub Actions cache backend.
  • The Swatinem/rust-cache step mirrors its only other use in the repo (publish-dry-runs.yml): save-if: develop.
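
The configuration described above might look roughly like the following workflow fragment. This is a minimal sketch, not the PR's actual diff: the job name, shard matrix, instance family, and cache key string are assumptions made for illustration. Only `extras=s3-cache`, `Swatinem/rust-cache`, the per-shard keying, and the develop-only `save-if` come from the description.

```yaml
# Hypothetical excerpt of codspeed.yml; names marked "assumed" are not from the PR.
jobs:
  bench-cuda:                        # assumed job name
    strategy:
      matrix:
        shard: [1, 2]                # assumed shard list
    # extras=s3-cache routes the Actions cache to S3 on RunsOn runners.
    # The family value is a placeholder for whatever GPU runner spec the repo uses.
    runs-on: runs-on=${{ github.run_id }}/family=g5/extras=s3-cache
    steps:
      - uses: actions/checkout@v4
      - uses: Swatinem/rust-cache@v2
        with:
          # One cache entry per shard, so shards do not overwrite each other (assumed key format).
          key: cuda-bench-shard-${{ matrix.shard }}
          # Only runs on develop save the cache; PR runs restore it but never write.
          save-if: ${{ github.ref == 'refs/heads/develop' }}
```

Gating `save-if` on the branch ref is what keeps PR runs from polluting develop's cache: they can read entries written by develop, but any `target/` state they produce is discarded at the end of the job.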

Net diff is CI-only: +12 / −1 in codspeed.yml.

Measurements (controlled experiment on the GPU runner)

| Build | BUILD_SECONDS |
| --- | --- |
| Cold target/, sccache warm (codegen-units = 1) | 118s |

The often-quoted ~500s figure predates both [profile.bench] codegen-units = 16 and a warm sccache; the realistic cold build is ~2 min. This branch is rebased on develop, so it builds with codegen-units = 16, and the fresh run measures that realistic baseline. Because save-if is develop-only, this PR's own runs are cold (no develop cache exists yet); the warm benefit appears only after the first develop run populates the per-shard caches.

🤖 Generated with Claude Code

@joseph-isaacs joseph-isaacs force-pushed the ci/cuda-bench-target-cache branch from f415021 to 6e4adf1 Compare June 4, 2026 15:32
@joseph-isaacs joseph-isaacs changed the title ci: cache cargo target dir for CUDA codspeed shards DIAGNOSTIC: why CUDA codspeed builds are slow (do not merge) Jun 4, 2026
@joseph-isaacs joseph-isaacs marked this pull request as draft June 4, 2026 15:33
codspeed-hq Bot commented Jun 4, 2026

Merging this PR will improve performance by 23.7%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 6 improved benchmarks
✅ 1501 untouched benchmarks

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- | --- |
| Simulation | chunked_bool_canonical_into[(1000, 10)] | 46.6 µs | 31.7 µs | +46.98% |
| Simulation | bitwise_not_vortex_buffer_mut[128] | 275.3 ns | 216.9 ns | +26.89% |
| Simulation | bitwise_not_vortex_buffer_mut[1024] | 336.9 ns | 278.6 ns | +20.94% |
| Simulation | chunked_varbinview_into_canonical[(1000, 10)] | 213.2 µs | 177.1 µs | +20.41% |
| Simulation | bitwise_not_vortex_buffer_mut[2048] | 400.6 ns | 342.2 ns | +17.05% |
| Simulation | chunked_varbinview_canonical_into[(100, 100)] | 309.6 µs | 274.7 µs | +12.71% |

Comparing ci/cuda-bench-target-cache (a1c1818) with develop (d97d2bd)

@joseph-isaacs joseph-isaacs changed the title DIAGNOSTIC: why CUDA codspeed builds are slow (do not merge) ci: cache target dir for CUDA codspeed bench (~7× faster build) Jun 5, 2026
@joseph-isaacs joseph-isaacs marked this pull request as ready for review June 5, 2026 09:23
@joseph-isaacs joseph-isaacs force-pushed the ci/cuda-bench-target-cache branch from 867154b to 9b1845f Compare June 5, 2026 09:29
@joseph-isaacs joseph-isaacs force-pushed the ci/cuda-bench-target-cache branch from 9b1845f to e165551 Compare June 5, 2026 09:54
@joseph-isaacs joseph-isaacs marked this pull request as draft June 5, 2026 10:15
The CUDA codspeed shards run on ephemeral GPU runners with no prebuilt AMI, so
cargo's target/ dir is cold on every run. sccache (S3) already serves ~98% of
compilation, but cargo still re-runs build scripts and re-links unchanged
dependency crates.

Restore target/ with rust-cache (as publish-dry-runs.yml does), keyed per shard
and only saved on develop so PRs restore from develop's cache. The runner's
extras=s3-cache (as bench.yml uses) routes it to the same S3 bucket as sccache,
so there is no separate GitHub Actions cache backend.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@joseph-isaacs joseph-isaacs force-pushed the ci/cuda-bench-target-cache branch from 92d6a35 to a1c1818 Compare June 5, 2026 10:45