propolis-server: collect and report VSS minus VM mappings #889
Conversation
```rust
// Safety: getpid() does not alter any state, or even read mutable state.
let mypid = unsafe { libc::getpid() };
let xmap_path = format!("/proc/{}/xmap", mypid);
```
I think this could just be "/proc/self/xmap" ?
i hoped and thought so, but i don't have a /proc/self* at all locally. i was kinda curious about it: if you have it maybe it's added by something downstream of illumos, and helios doesn't get the fun?
pmooney informs me that /proc/self/ is in fact there, it's just not listed by ls /proc (or, consequently, tab completion for /proc/s :) )
the s-e-l-f is silent
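(a quick sketch, not part of this PR, to see that for yourself: open something under /proc/self even though `ls /proc` won't list a "self" entry)

```rust
fn main() {
    // stat a file under the hidden /proc/self entry; on illumos this should
    // succeed even though "self" doesn't show up in a directory listing
    match std::fs::metadata("/proc/self/psinfo") {
        Ok(m) => println!("/proc/self/psinfo is there, {} bytes", m.len()),
        Err(e) => println!("no /proc/self here: {e}"),
    }
}
```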
not suitable for merging, for a few reasons. but this does get reasonable numbers!
`propolis-server` memory use is in no small part... the binary itself.
given a 32 GiB VM with a few cores, this takes almost five seconds on
my workstation to get through all of /proc/{getpid()}/xmap. since
there's no I/O here, that's five seconds of CPU time just to calculate
memory stats. not good, not something i'd want to merge.
additionally, while walking /proc/{getpid()}/xmap we at various points
hold the process' address space lock (for writing!), which can be very
much observed by the guest VM itself. i happened to log into the
above-mentioned VM while memory stat collection was going on, and login
hung for two seconds while that finished. steady-state behavior is less
bad (`while true; do date; sleep 0.2; done` does not show any
disruption for example), but, again, not shippable behavior without at
*least* a feature flag.
using `pmap -x <pid>` does similar operations with respect to the
system, opening `xmap` and reading entries and on, but it seems somewhat
faster than the code here. hundreds of milliseconds instead of plural
seconds. i'm not sure what's different, but the service-interrupting
nature of the metric collection makes it kind of a showstopper either
way.
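(for a feel of the cost described above, a minimal sketch, not the PR's code and assuming illumos-only paths: a plain read of /proc/self/xmap makes the kernel walk the address space, so timing the read approximates the walk itself)

```rust
use std::io::Read;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let start = Instant::now();
    let mut buf = Vec::new();
    // reading the whole file forces the kernel to enumerate every mapping
    std::fs::File::open("/proc/self/xmap")?.read_to_end(&mut buf)?;
    println!("read {} bytes of xmap in {:?}", buf.len(), start.elapsed());
    Ok(())
}
```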
(force-pushed 78512ad to 1392309)
this now collects useful information with at most 30 microseconds holding locks we'd care about - i'm not certain it would get through to oximeter yet (at least i haven't seen it work locally) but this is not outright harmful anymore!
…l rev

this obviously wouldn't be for landing, but it's demonstrative. to get an updated Oximeter schema here, we'd need to bump Omicron. then Propolis depends on one Omicron while Crucible depends on another, or this is sequenced like:
* merge Propolis schema change in Omicron
* update Omicron in Crucible
* update Omicron and Crucible in Propolis
* happy

... conveniently i already want to do this for Progenitor anyway.
the Omicron change this points to is oxidecomputer/omicron#8071.
i've left a comment on the Cargo.toml diff but i do not intend for the Cargo.toml or most of the Cargo.lock changes to land here. the CI failure is due to three tests that seem downstream of the version bumping, but i... really do not understand why.
```toml
[patch."https://github.com/oxidecomputer/omicron"]
#internal-dns = { path = "../omicron/internal-dns" }
nexus-client = { git = "https://github.com/oxidecomputer//omicron.git", rev = "3950a44ba51c1696b19392d20fd2cbb9457a000d" }
omicron-uuid-kinds = { git = "https://github.com/oxidecomputer//omicron.git", rev = "3950a44ba51c1696b19392d20fd2cbb9457a000d" }
nexus-types = { git = "https://github.com/oxidecomputer//omicron.git", rev = "3950a44ba51c1696b19392d20fd2cbb9457a000d" }
illumos-utils = { git = "https://github.com/oxidecomputer//omicron.git", rev = "3950a44ba51c1696b19392d20fd2cbb9457a000d" }
omicron-common = { git = "https://github.com/oxidecomputer//omicron.git", rev = "3950a44ba51c1696b19392d20fd2cbb9457a000d" }
oximeter-instruments = { git = "https://github.com/oxidecomputer//omicron.git", rev = "3950a44ba51c1696b19392d20fd2cbb9457a000d" }
oximeter-producer = { git = "https://github.com/oxidecomputer//omicron.git", rev = "3950a44ba51c1696b19392d20fd2cbb9457a000d" }
oximeter = { git = "https://github.com/oxidecomputer//omicron.git", rev = "3950a44ba51c1696b19392d20fd2cbb9457a000d" }

# i've just picked what Omicron happens to currently point to.
tufaceous-artifact = { git = "https://github.com/oxidecomputer//tufaceous.git", rev = "04681f26ba09e144e5467dd6bd22c4887692a670" }
```
this all specifically is not intended to merge. to get a new Oximeter metric here, that's a change to the schema (and a bump of oximeter). having Oximeter at one rev while Omicron is at other ones got me to a state where Propolis wouldn't build. bumping all of Omicron here left Propolis and Crucible using different Omicrons, which again did not build because Cargo.toml had an old tufaceous-artifact.
this has somehow ended up with the three test failures in the last build, which all look like:
```
phd-runner: [INSTANCE_ENSURE_INTERNAL - EVENT] permanent error from instance_spec_ensure
e = Error Response: status: 400 Bad Request; headers: {
    "content-type": "application/json", "x-request-id": "58e1ca30-797a-48bb-9191-a36b956e243b",
    "content-length": "533", "date": "Wed, 30 Apr 2025 08:59:40 GMT"
}; value: Error {
    error_code: None,
    message: "unable to parse JSON body: init.value.spec.components.boot-disk.type: unknown variant `NvmeDisk`,
    expected one of `virtio_disk`, `nvme_disk`, `virtio_nic`, `serial_port`, `pci_pci_bridge`, `qemu_pvpanic`,
    `boot_settings`, `soft_npu_pci_port`, `soft_npu_port`, `soft_npu_p9`, `p9fs`, `migration_failure_injector`,
    `crucible_storage_backend`, `file_storage_backend`, `blob_storage_backend`, `virtio_network_backend`,
    `dlpi_network_backend` at line 1 column 180",
    request_id: "58e1ca30-797a-48bb-9191-a36b956e243b"
}
```
and frankly i'm at a loss about what changed the casing of variants here. it is presumably related to my wanton version bumping.
anyway, this PR is at "i think it's good, would folks +1 the new metric?" - if others agree i'll go PR the schema change in Omicron and figure out getting that through to Propolis.
```rust
while !tx.is_closed() {
    let new_process_stats = tokio::task::spawn_blocking(process_stats)
        .await
        .expect("collecting address space stats does not panic")
        .expect("process_stats() does not error");

    // [there isn't a nice way to plumb a slog Logger here, so this is an
    // eprintln for demonstrative purposes but otherwise should be removed.]
    eprintln!(
        "Sending new stats at least.. {:?}, took {}ms",
        new_process_stats,
        new_process_stats.measurement_time.as_millis()
    );
    tx.send_replace(new_process_stats);

    tokio::time::sleep(std::time::Duration::from_millis(15000)).await;
}
```
turbo nit, not actually important: consider using tokio::time::interval rather than sleep, it should be a bit more efficient as it reuses the same entry in the tokio timer wheel every time:
```diff
-    while !tx.is_closed() {
-        let new_process_stats = tokio::task::spawn_blocking(process_stats)
-            .await
-            .expect("collecting address space stats does not panic")
-            .expect("process_stats() does not error");
-        // [there isn't a nice way to plumb a slog Logger here, so this is an
-        // eprintln for demonstrative purposes but otherwise should be removed.]
-        eprintln!(
-            "Sending new stats at least.. {:?}, took {}ms",
-            new_process_stats,
-            new_process_stats.measurement_time.as_millis()
-        );
-        tx.send_replace(new_process_stats);
-        tokio::time::sleep(std::time::Duration::from_millis(15000)).await;
-    }
+    let mut interval = tokio::time::interval(std::time::Duration::from_millis(15000));
+    while !tx.is_closed() {
+        interval.tick().await;
+        let new_process_stats = tokio::task::spawn_blocking(process_stats)
+            .await
+            .expect("collecting address space stats does not panic")
+            .expect("process_stats() does not error");
+        // [there isn't a nice way to plumb a slog Logger here, so this is an
+        // eprintln for demonstrative purposes but otherwise should be removed.]
+        eprintln!(
+            "Sending new stats at least.. {:?}, took {}ms",
+            new_process_stats,
+            new_process_stats.measurement_time.as_millis()
+        );
+        tx.send_replace(new_process_stats);
+    }
```
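(one small follow-on sketch, assuming tokio 1.x, not part of the suggestion above: since process_stats runs on the blocking pool and could overrun the period, opting into MissedTickBehavior::Delay avoids the default burst of missed ticks)

```rust
use std::time::Duration;
use tokio::time::{interval, MissedTickBehavior};

async fn sampler_loop() {
    let mut interval = interval(Duration::from_millis(15000));
    // the default is Burst; Delay waits a full period after a late tick instead
    interval.set_missed_tick_behavior(MissedTickBehavior::Delay);
    loop {
        interval.tick().await;
        // collect and publish stats here
    }
}
```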
```rust
async fn process_stats_task(tx: watch::Sender<ProcessStats>) {
    while !tx.is_closed() {
        let new_process_stats = tokio::task::spawn_blocking(process_stats)
```
why not use tokio::fs rather than spawn_blocking? not a big deal as tokio::fs is essentially spawn_blocking anyway, but might be a bit nicer...
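(a rough sketch of that alternative, assumed shape rather than this PR's code: tokio::fs::read does the blocking-pool hop internally, so the caller doesn't need an explicit spawn_blocking)

```rust
// read the whole psinfo file via tokio's blocking pool, without spawn_blocking
async fn read_psinfo_bytes() -> std::io::Result<Vec<u8>> {
    tokio::fs::read("/proc/self/psinfo").await
}
```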
```rust
// [there isn't a nice way to plumb a slog Logger here, so this is an
// eprintln for demonstrative purposes but otherwise should be removed.]
eprintln!(
    "Sending new stats at least.. {:?}, took {}ms",
    new_process_stats,
    new_process_stats.measurement_time.as_millis()
);
```
hmm, is it really that painful to pass a Logger into ServerStats::new?
```rust
let mut info: psinfo = FromZeroes::new_zeroed();

let stats_read_start = Instant::now();

psinfo_file
    .read(info.as_bytes_mut())
    .context("reading struct psinfo from file")?;
```
cute use of zerocopy here :3
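(for anyone unfamiliar with the pattern, a minimal self-contained illustration, assuming the zerocopy 0.7 derives used here; the struct is a stand-in, not the real psinfo layout)

```rust
use std::io::Read;
use zerocopy::{AsBytes, FromBytes, FromZeroes};

// a stand-in struct; a real psinfo would have to match the illumos definition
#[derive(FromZeroes, FromBytes, AsBytes)]
#[repr(C)]
struct Header {
    magic: u32,
    len: u32,
}

fn read_header(mut src: impl Read) -> std::io::Result<Header> {
    // zero-initialize, then let the kernel's bytes land directly in the struct
    let mut hdr: Header = FromZeroes::new_zeroed();
    src.read_exact(hdr.as_bytes_mut())?;
    Ok(hdr)
}
```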
```rust
/// in line with the 64-bit version of the struct, and ignores a few fields at
/// the end of the struct.
#[derive(Copy, Clone, Debug, FromZeroes, AsBytes, FromBytes)]
#[cfg(target_arch = "x86_64")]
```
shouldn't the whole module have this cfg? none of the code that references psinfo, which is ... all of it ... will compile if this isn't there...
on the other hand, nothing else here is dependent on the target architecture - i was kind of thinking that given an aarch64 target or something we'd be able to fill in the right struct for psinfo on that arch. but, yeah, as-is this just breaks even compiling propolis-server on other architectures.
i'll cfg(target_arch = "x86_64") the psinfo read and have it return 0 with a warn!() on other architectures. seeing that would imply running propolis-server on non-x86 targets which would be super neat..
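(roughly what that would look like - a sketch with hypothetical names, not what will actually land; the x86_64 branch is stubbed to keep it self-contained)

```rust
#[cfg(target_arch = "x86_64")]
fn sample_vss_kib(_log: &slog::Logger) -> u64 {
    // real version: read /proc/self/psinfo into the psinfo struct and return
    // its size field; stubbed here for the sketch
    0
}

#[cfg(not(target_arch = "x86_64"))]
fn sample_vss_kib(log: &slog::Logger) -> u64 {
    slog::warn!(log, "psinfo layout not defined for this arch; reporting 0");
    0
}
```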
If memory serves, there's other stuff that will just panic on any other architecture (certainly I believe all the CPUID stuff will), but that seems fine to me!
```rust
/// Process RSS is sampled in a standalone task. The `tokio::sync::watch`
/// here is to observe that sampling and update the `vmm_vsz` tracked
/// here.
process_stats_rx: tokio::sync::watch::Receiver<process::ProcessStats>,
```
I'm not convinced that this actually needs to be a tokio::sync::watch --- a watch channel is basically just an Arced read-write lock plus a thing for notifying the readers when the value inside the RwLock changes, and we never use watch::Receiver::changed in this code, just borrow(). So, it's an Arc<RwLock<ProcessStats>> with extra overhead for functionality we're not using here...and, because there's only a single reader and a single writer, it may as well be a Mutex instead of a RwLock.
I would also maybe consider just stuffing vsz in an AtomicUsize, but Patrick will probably disagree with me...
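(a sketch of that atomic variant - names are illustrative, not this PR's fields: the sampler stores the latest value, the metrics producer just loads it, no locking anywhere)

```rust
use std::sync::{
    atomic::{AtomicU64, Ordering},
    Arc,
};

fn main() {
    let vss_kib = Arc::new(AtomicU64::new(0));

    // sampler side: publish a fresh measurement
    let writer = Arc::clone(&vss_kib);
    writer.store(123_456, Ordering::Relaxed);

    // metrics side: read whatever is current
    assert_eq!(vss_kib.load(Ordering::Relaxed), 123_456);
}
```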
collect VSS from `/proc/self/psinfo` and report it as an Oximeter metric for the VM. VSS is maintained as mappings are adjusted so it's about as close to free as a syscall can get. Propolis' VSS includes guest-mapped regions; this change subtracts out the size of a guest's `ASpace` so we only get the amount Propolis has a say in. no RSS measurement here, but that's really only useful to tell us if the virtual mappings are overly-large (like, do we have a 128 MiB heap when we're only using 4 MiB?) - that'd be nice to know in the limit, but somewhat less interesting now. most of our VSS comes from `propolis-server` itself anyway.

the original approach here was walking through `/proc/{getpid()}/xmap` to get both RSS and VSS, and letting us exclude mappings that are part of the guest's accessible memory - in some sense this is aesthetically pleasant, but collecting and iterating the thousands of mappings meant this took up to seconds(!) of time that potentially could block the guest VM (!) if it needed the process lock at any point while we were measuring.
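(in arithmetic terms the reported metric is just this subtraction - a sketch with illustrative names, not the PR's identifiers; psinfo reports the process image size in KiB)

```rust
// process VSS (from psinfo, KiB) minus the guest's mapped address space (bytes)
fn vss_minus_vm_bytes(process_vss_kib: u64, guest_aspace_bytes: u64) -> u64 {
    (process_vss_kib * 1024).saturating_sub(guest_aspace_bytes)
}
```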