Skip to content

Add PGO Applicability Feature (Prototype) #5193

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 27 commits into
base: main
Choose a base branch
from

Conversation

valeriepieger
Copy link

Changes

Add initial PGO prototype per Feature Request #3456

  • tools/devtool: add pgo_build subcommand with four parts:
    • instrument (build with -Cprofile-generate)
    • profile (redirects to reference the pgo-getting-started guide)
    • merge (invoke llvm-profdata merge)
    • optimize (rebuild with -Cprofile-use)
  • docs: add docs/pgo-getting-started.md, a step-by-step PGO guide
  • .gitignore: ignore generated PGO files (rootfs.ext4, vmlinux, .profraw)

Reason

To provide Firecracker contributors with a prototype to measure and apply Profile-Guided Optimization (PGO). These changes combine the common PGO workflow into devtool, document it with the pgo-getting-started guide, and include a community benchmark table so others can reproduce and share results.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

valeriepieger and others added 27 commits May 6, 2025 20:59
Added commands to build with instrumentation, profile, merge the
.profraw files, and optimize/rebuild

Signed-off-by: Valerie <[email protected]>
Added detailed PGO steps and benchmarking table in .md file

Signed-off-by: Valerie <[email protected]>
Adds functionality for pvtime, which displays steal time
to guest on ARM machines. PVTime is persisted across
snapshots as well (snapshot ver updated).

- Added ipa per vCPU for mem region storing steal time info.
- Persists this ipa per vCPU across snapshots.
- Shared steal time mem region is setup on boot and restore
  from snapshot in builder.rs.

Signed-off-by: Dakshin Devanand <[email protected]>
Added integration tests checking:
- steal time increase
- steal time persistence across snapshots
- pvtime existence on ARM

Motivated by addition of PVTime functionality
for ARM.

Signed-off-by: Dakshin Devanand <[email protected]>
Add a changelog entry to inform about addition
of PVTime (steal time) functionality on ARM.

Signed-off-by: Dakshin Devanand <[email protected]>
Bumps the firecracker group with 12 updates:

| Package | From | To |
| --- | --- | --- |
| [zerocopy](https://github.com/google/zerocopy) | `0.8.24` | `0.8.25` |
| [syn](https://github.com/dtolnay/syn) | `2.0.100` | `2.0.101` |
| [micro_http](https://github.com/firecracker-microvm/micro-http) | ``e854e50`` | ``4f62153`` |
| [aws-lc-sys](https://github.com/aws/aws-lc-rs) | `0.28.1` | `0.28.2` |
| [cc](https://github.com/rust-lang/cc-rs) | `1.2.19` | `1.2.20` |
| [getrandom](https://github.com/rust-random/getrandom) | `0.2.15` | `0.2.16` |
| [jiff](https://github.com/BurntSushi/jiff) | `0.2.9` | `0.2.10` |
| [jiff-static](https://github.com/BurntSushi/jiff) | `0.2.9` | `0.2.10` |
| [toml](https://github.com/toml-rs/toml) | `0.8.20` | `0.8.21` |
| [toml_datetime](https://github.com/toml-rs/toml) | `0.6.8` | `0.6.9` |
| [toml_edit](https://github.com/toml-rs/toml) | `0.22.24` | `0.22.25` |
| [winnow](https://github.com/winnow-rs/winnow) | `0.7.6` | `0.7.7` |


Updates `zerocopy` from 0.8.24 to 0.8.25
- [Release notes](https://github.com/google/zerocopy/releases)
- [Changelog](https://github.com/google/zerocopy/blob/main/CHANGELOG.md)
- [Commits](google/zerocopy@v0.8.24...v0.8.25)

Updates `syn` from 2.0.100 to 2.0.101
- [Release notes](https://github.com/dtolnay/syn/releases)
- [Commits](dtolnay/syn@2.0.100...2.0.101)

Updates `micro_http` from `e854e50` to `4f62153`
- [Commits](firecracker-microvm/micro-http@e854e50...4f62153)

Updates `aws-lc-sys` from 0.28.1 to 0.28.2
- [Release notes](https://github.com/aws/aws-lc-rs/releases)
- [Commits](aws/aws-lc-rs@aws-lc-sys/v0.28.1...aws-lc-sys/v0.28.2)

Updates `cc` from 1.2.19 to 1.2.20
- [Release notes](https://github.com/rust-lang/cc-rs/releases)
- [Changelog](https://github.com/rust-lang/cc-rs/blob/main/CHANGELOG.md)
- [Commits](rust-lang/cc-rs@cc-v1.2.19...cc-v1.2.20)

Updates `getrandom` from 0.2.15 to 0.2.16
- [Changelog](https://github.com/rust-random/getrandom/blob/master/CHANGELOG.md)
- [Commits](rust-random/getrandom@v0.2.15...v0.2.16)

Updates `jiff` from 0.2.9 to 0.2.10
- [Release notes](https://github.com/BurntSushi/jiff/releases)
- [Changelog](https://github.com/BurntSushi/jiff/blob/master/CHANGELOG.md)
- [Commits](BurntSushi/jiff@jiff-static-0.2.9...jiff-static-0.2.10)

Updates `jiff-static` from 0.2.9 to 0.2.10
- [Release notes](https://github.com/BurntSushi/jiff/releases)
- [Changelog](https://github.com/BurntSushi/jiff/blob/master/CHANGELOG.md)
- [Commits](BurntSushi/jiff@jiff-static-0.2.9...jiff-static-0.2.10)

Updates `toml` from 0.8.20 to 0.8.21
- [Commits](toml-rs/toml@toml-v0.8.20...toml-v0.8.21)

Updates `toml_datetime` from 0.6.8 to 0.6.9
- [Commits](toml-rs/toml@toml_datetime-v0.6.8...toml_datetime-v0.6.9)

Updates `toml_edit` from 0.22.24 to 0.22.25
- [Commits](toml-rs/toml@v0.22.24...v0.22.25)

Updates `winnow` from 0.7.6 to 0.7.7
- [Changelog](https://github.com/winnow-rs/winnow/blob/main/CHANGELOG.md)
- [Commits](winnow-rs/winnow@v0.7.6...v0.7.7)

---
updated-dependencies:
- dependency-name: zerocopy
  dependency-version: 0.8.25
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: syn
  dependency-version: 2.0.101
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: micro_http
  dependency-version: 4f621532e81ee2ad096a9c9592fdacc40d19de48
  dependency-type: direct:production
  dependency-group: firecracker
- dependency-name: aws-lc-sys
  dependency-version: 0.28.2
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: cc
  dependency-version: 1.2.20
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: getrandom
  dependency-version: 0.2.16
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: jiff
  dependency-version: 0.2.10
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: jiff-static
  dependency-version: 0.2.10
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: toml
  dependency-version: 0.8.21
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: toml_datetime
  dependency-version: 0.6.9
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: toml_edit
  dependency-version: 0.22.25
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: winnow
  dependency-version: 0.7.7
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
...

Signed-off-by: dependabot[bot] <[email protected]>
The signal handler unit test registers the signal handlers, spins up a
new thread, and sends signals to itself (using kill) to check that the
metrics get updated. This is a very complicated test for a unit test,
and it's already covered in the integration test test_signals.py.
As this test is seldomly failing in our CI, it's best to get rid of its
complexity and rely on the integration tests, which are better suited
for this kind of test.

Signed-off-by: Riccardo Mancini <[email protected]>
We seldom have failures in the CI where the test_mknod_and_own_dev fails
because a file already exists.
The test is using the actual /dev to create tmp devices. As there's no
reason to use the actual /dev, move it to use a random folder and clean
it up after the test.

Signed-off-by: Riccardo Mancini <[email protected]>
The test was reimplementing the logic for creating a temporary directory
instead of using TempDir, so I've changed it to simplify it.
Also, save the PathBuf object instead of the String to be able to do
Path operations in a canonical way without formatting strings.

Signed-off-by: Riccardo Mancini <[email protected]>
This test has been flaky for a while, where sometimes the file is empty.
As we're just interested that the message got delivered, not that the
file was created in a timely manner, I'm adding a small retry.

Signed-off-by: Riccardo Mancini <[email protected]>
test_balloon_snapshot started being flaky after we changed the logic on
how we wait for the RSS to become stable.
One theory is that we are not waiting enough time for the stats to
refresh, so this change adds a sleep to ensure we have waited enough for
the stats to be "fresh".

Failure:
```
assert 189022208 > 189022208
```

Signed-off-by: Riccardo Mancini <[email protected]>
We found a single failure for which the steal time between snapshot
and restore went slightly above 2s on an AMD instance.
As the purpose of this check is to ensure the value is "sane" (iow not
a completely random number), not that it's really accurate (that's a
kernel problem), I'm bumping it to 10s.

Signed-off-by: Riccardo Mancini <[email protected]>
An upstream patch was backported to ubuntu 24.04 6.8.0-58 kernel that
makes the nx hugepages recover thread a child of the firecracker
process, thus increasing process count to 7.
As we're not really interested in knowing how many threads we have in
this test, let's remove the assertion altogether.

Signed-off-by: Riccardo Mancini <[email protected]>
Currently we measure boottime by looking at the Firecracker
logs and waiting for the guest to trigger a special boot device
inside Firecracker.
There is an alternative way of measuring boot time from
a guest perspective by using systemd-analyze. Will help us to understand
boot times better.

Signed-off-by: Egor Lazarchuk <[email protected]>
Emit both boot time metrics for boot device: clock time and cpu time.
Additionally, stop subtracting VM build time from a guest boot time at
the metric emit time. This can be done later at the visualization stage.

Signed-off-by: Egor Lazarchuk <[email protected]>
Since we already collect this delta, put it into
metrics as well.

Signed-off-by: Egor Lazarchuk <[email protected]>
This metrics has very small numbers which makes it too
volatile for A/B tests.

Signed-off-by: Egor Lazarchuk <[email protected]>
It's --pytest-opts, not --test anymore.

Fixes: 316d955 ("test(ab): generalize --test to --pytest-opts")
Signed-off-by: Patrick Roy <[email protected]>
By reducing the dimension set to eliminate all dimensions that only
every take on a single value straight during parsing, we broke the
ignore list, which does require all dimensions to be present. Fix this
by moving the "dimension reduction" to only happen during printing of
failure messages.

Fixes: fcb39a6 ("test(ab): do not print dimension that are the same across all metrics")
Signed-off-by: Patrick Roy <[email protected]>
block throughput metrics on m8g.metal instances for test scenarios using
the async fio engine and more than 1 vcpu are volatile, so exclude them
from A/B-testing.

Suggested-by: Riccardo Mancini <[email protected]>
Signed-off-by: Patrick Roy <[email protected]>
Bumps the firecracker group with 9 updates:

| Package | From | To |
| --- | --- | --- |
| [aws-lc-fips-sys](https://github.com/aws/aws-lc-rs) | `0.13.5` | `0.13.6` |
| [cc](https://github.com/rust-lang/cc-rs) | `1.2.20` | `1.2.21` |
| [hashbrown](https://github.com/rust-lang/hashbrown) | `0.15.2` | `0.15.3` |
| [jiff](https://github.com/BurntSushi/jiff) | `0.2.10` | `0.2.12` |
| [jiff-static](https://github.com/BurntSushi/jiff) | `0.2.10` | `0.2.12` |
| [toml](https://github.com/toml-rs/toml) | `0.8.21` | `0.8.22` |
| [toml_edit](https://github.com/toml-rs/toml) | `0.22.25` | `0.22.26` |
| [toml_write](https://github.com/toml-rs/toml) | `0.1.0` | `0.1.1` |
| [winnow](https://github.com/winnow-rs/winnow) | `0.7.7` | `0.7.9` |


Updates `aws-lc-fips-sys` from 0.13.5 to 0.13.6
- [Release notes](https://github.com/aws/aws-lc-rs/releases)
- [Commits](aws/aws-lc-rs@aws-lc-fips-sys/v0.13.5...aws-lc-fips-sys/v0.13.6)

Updates `cc` from 1.2.20 to 1.2.21
- [Release notes](https://github.com/rust-lang/cc-rs/releases)
- [Changelog](https://github.com/rust-lang/cc-rs/blob/main/CHANGELOG.md)
- [Commits](rust-lang/cc-rs@cc-v1.2.20...cc-v1.2.21)

Updates `hashbrown` from 0.15.2 to 0.15.3
- [Release notes](https://github.com/rust-lang/hashbrown/releases)
- [Changelog](https://github.com/rust-lang/hashbrown/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rust-lang/hashbrown/commits/v0.15.3)

Updates `jiff` from 0.2.10 to 0.2.12
- [Release notes](https://github.com/BurntSushi/jiff/releases)
- [Changelog](https://github.com/BurntSushi/jiff/blob/master/CHANGELOG.md)
- [Commits](BurntSushi/jiff@jiff-static-0.2.10...jiff-static-0.2.12)

Updates `jiff-static` from 0.2.10 to 0.2.12
- [Release notes](https://github.com/BurntSushi/jiff/releases)
- [Changelog](https://github.com/BurntSushi/jiff/blob/master/CHANGELOG.md)
- [Commits](BurntSushi/jiff@jiff-static-0.2.10...jiff-static-0.2.12)

Updates `toml` from 0.8.21 to 0.8.22
- [Commits](toml-rs/toml@toml-v0.8.21...toml-v0.8.22)

Updates `toml_edit` from 0.22.25 to 0.22.26
- [Commits](toml-rs/toml@v0.22.25...v0.22.26)

Updates `toml_write` from 0.1.0 to 0.1.1
- [Commits](toml-rs/toml@toml_write-v0.1.0...toml_write-v0.1.1)

Updates `winnow` from 0.7.7 to 0.7.9
- [Changelog](https://github.com/winnow-rs/winnow/blob/main/CHANGELOG.md)
- [Commits](winnow-rs/winnow@v0.7.7...v0.7.9)

---
updated-dependencies:
- dependency-name: aws-lc-fips-sys
  dependency-version: 0.13.6
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: cc
  dependency-version: 1.2.21
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: hashbrown
  dependency-version: 0.15.3
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: jiff
  dependency-version: 0.2.12
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: jiff-static
  dependency-version: 0.2.12
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: toml
  dependency-version: 0.8.22
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: toml_edit
  dependency-version: 0.22.26
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: toml_write
  dependency-version: 0.1.1
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: winnow
  dependency-version: 0.7.9
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
...

Signed-off-by: dependabot[bot] <[email protected]>
When starting firecracker without the API server and configuring it
through a config files, specifying a custom cpu template requires
writing this cpu template into a separate json file, and then passing
the path to this separate json file in the config json file. This is a
bit silly, since everything is just json. So additionally allow just
directly specifying the custom cpu template inside the config file.

Signed-off-by: Patrick Roy <[email protected]>
Creates a new [Unreleased] section and move changes to be included in
v1.12.0 to [1.12.0] section.

Signed-off-by: Takahiro Itazuri <[email protected]>
Adds v1.12 to the release status table and marks v1.10 as unsupported.

Signed-off-by: Takahiro Itazuri <[email protected]>
Intel Sapphire Rapids and ARM Graviton4 are not supported.

Signed-off-by: Takahiro Itazuri <[email protected]>
Runs `cargo update`.

Signed-off-by: Takahiro Itazuri <[email protected]>
We're going to release v1.12.0 tomorrow. Bumps the version to 1.13.0-dev
in main branch.

Signed-off-by: Takahiro Itazuri <[email protected]>
@Manciukic Manciukic linked an issue May 7, 2025 that may be closed by this pull request
3 tasks
@Manciukic
Copy link
Contributor

Hey, it looks like something went wrong with the rebase pull from main to your branch. It looks like main has been rebased on top of your branch rather than the other way around. Could you fix that?

Copy link
Contributor

@Manciukic Manciukic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the proposal.
The biggest comment is that this should use the devctr for reproducibility across developers. However, let's first focus on understanding what improvement can we expect from this before making it nice and all.
Were you able to collect profile information for more complex workloads (like our performance tests)?
What's the 15% improvement reported in the table?
What's the performance like in other tests?

case "$1" in
instrument)
echo "Building instrumented Firecracker binary"
RUSTFLAGS="-Cprofile-generate=/tmp/firecracker-profdata" cargo build --release
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these commands should be run inside the devctr to avoid the missing dependencies issues mentioned in the doc. Maybe we can extend cmd_build to add pgo-instrument and pgo-optimize profiles

Try to touch all major systems you personally care about optimizing so that you
can benchmark it against the base build later.

Here's an example process of booting a minimal microVM:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we generate all the binaries with pgo profile information enabled in a folder (build/cargo_target/${toolchain}/pgo-instrument), we could then run the devtool tests directly passing the --binary-dir option.

However, this will introduce the problem of extracting the profiling information from the container


| Machine (CPU/RAM) | Firecracker (non-PGO) | Firecracker (PGO) | Δ (PGO vs. baseline) | Notes |
| ----------------------------- | --------------------: | ----------------: | -------------------: | -------------------------------------------- |
| AMD Ryzen 7 7700X; 32 GiB RAM | 0.01275 | 0.01079 | -15.37% | Ubuntu 24.04; used test_boottime.py for both |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this metric? which is the unit?
it'd be interesting to see the other performance tests as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra file (vmlinux as well)

Comment on lines +201 to +202
Once you have this PGO build, you can run any of the repository's existing
performance tests to observe the speed-ups.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is missing instructions on how to run the tests on the newly generated binaries

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature Request] PGO applicability
6 participants