firecracker-microvm · ShadowCurse · Sep 26, 2025 · Sep 30, 2025 · Sep 22, 2025 · Sep 22, 2025
@@ -10,6 +10,10 @@ and this project adheres to
 
 ### Added
 
+- [#5463](https://github.com/firecracker-microvm/firecracker/pull/5463): Added
+  support for `virtio-pmem` devices. See [documentation](docs/pmem.md) for more
+  information.
+
 ### Changed
 
 ### Deprecated

diff --git a/docs/device-api.md b/docs/device-api.md
diff --git a/docs/pmem.md b/docs/pmem.md
@@ -0,0 +1,210 @@
+# Using the Firecracker `virtio-pmem` device
+
+## What is a persistent memory device
+
+Persistent memory is a type of non-volatile, CPU accessible (with usual
+load/store instructions) memory that does not lose its content on power loss. In
+other words all writes to the memory persist over the power cycle. In hardware
+this known as NVDIMM memory (Non Volatile Double Inline Memory Module).
+
+## What is a `virtio-pmem` device:
+
+[`virtio-pmem`](https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html#x1-68900019)
+is a device which emulates a persistent memory device without requiring a
+physical NVDIMM device be present on the host system. `virtio-pmem` is backed by
+a memory mapped file on the host side and is exposed to the guest kernel as a
+region in the guest physical memory. This allows the guest to directly access
+the host memory pages without a need to use guest driver or interact with VMM.
+From guest user-space perspective `virtio-pmem` devices are presented as normal
+block device like `/dev/pmem0`. This allows `virtio-pmem` to be used as rootfs
+device and make VM boot from it.
+
+> [!NOTE]
+>
+> Since `virtio-pmem` is located fully in memory, when used as a block device
+> there is no need to use guest page cache for its operations. This behaviour
+> can be configured by using `DAX` feature of the kernel.
+>
+> - To mount a device with `DAX` add `--flags=dax` to the `mount` command.
+> - To configure a root device with `DAX` append `rootflags=dax` to the kernel
+>   arguments.
+>
+> `DAX` support is not uniform for all file systems. Check the kernel
+> [documentation](https://github.com/torvalds/linux/blob/master/Documentation/filesystems/dax.rst)
+> for more information.
+
+## Prerequisites
+
+In order to use `virtio-pmem` device, guest kernel needs to built with support
+for it. The full list of configuration options needed for `virtio-pmem` and
+`DAX`:
+
+```
+# Needed for DAX on aarch64. Will be ignored on x86_64
+CONFIG_ARM64_PMEM=y
+
+CONFIG_DEVICE_MIGRATION=y
+CONFIG_ZONE_DEVICE=y
+CONFIG_VIRTIO_PMEM=y
+CONFIG_LIBNVDIMM=y
+CONFIG_BLK_DEV_PMEM=y
+CONFIG_ND_CLAIM=y
+CONFIG_ND_BTT=y
+CONFIG_BTT=y
+CONFIG_ND_PFN=y
+CONFIG_NVDIMM_PFN=y
+CONFIG_NVDIMM_DAX=y
+CONFIG_OF_PMEM=y
+CONFIG_NVDIMM_KEYS=y
+CONFIG_DAX=y
+CONFIG_DEV_DAX=y
+CONFIG_DEV_DAX_PMEM=y
+CONFIG_DEV_DAX_KMEM=y
+CONFIG_FS_DAX=y
+CONFIG_FS_DAX_PMD=y
+```
+
+## Configuration
+
+Firecracker implementation exposes these config options for the `virtio-pmem`
+device:
+
+- `id` - id of the device for internal use
+- `path_on_host` - path to the backing file
+- `root_device` - toggle to use this device as root device. Device will be
+  marked as `rw` in the kernel arguments
+- `read_only` - tells Firecracker to `mmap` the backing file in read-only mode.
+  If this device is also configured as `root_device`, it will be marked as `ro`
+  in the kernel arguments
+
+> [!NOTE]
+>
+> Devices will be exposed to the guest in the order in which they are configured
+> with sequential names in the form of `/dev/pmem{N}` like: `/dev/pmem0`,
+> `/dev/pmem1` ...
+
+> [!WARNING]
+>
+> Setting `virtio-pmem` device to `read-only` mode can lead to VM shutting down
+> on any attempt to write to the device. This is because from guest kernel
+> perspective `virtio-pmem` is always `read-write` capable. Use `read-only` mode
+> only if you want to ensure the underlying file is never written to.
+>
+> To mount the `pmem` device with `read-only` options add `-o ro` to the `mount`
+> command.
+>
+> The exact behaviour differs per platform:
+>
+> - x86_64 - if KVM is able to decode the write instruction used by the guest,
+>   it will return a MMIO_WRITE to the Firecracker where it will be discarded
+>   and the warning log will be printed.
+> - aarch64 - the instruction emulation is much stricter. Writes will result in
+>   an internal KVM error which will be returned to Firecracker in a form of an
+>   `ENOSYS` error. This will make Firecracker stop the VM with appropriate log
+>   message.
+
+> [!WARNING]
+>
+> `virtio-pmem` requires for the guest exposed memory region to be 2MB aligned.
+> This requirement is transitively carried to the backing file of the
+> `virtio-pmem`. Firecracker allows users to configure `virtio-pmem` with
+> backing file of any size and fills the memory gap between the end of the file
+> and the 2MB boundary with empty `PRIVATE | ANONYMOUS` memory pages. Users must
+> be careful to not write to this memory gap since it will not be synchronized
+> with backing file. This is not an issue if `virtio-pmem` is configured in
+> `read-only` mode.
+
+### Config file
+
+Configuration of the `virtio-pmem` device from config file follows similar
+pattern to `virtio-block` section. Here is an example configuration for a single
+`virtio-pmem` device:
+
+```json
+"pmem": [
+    {
+      "id": "pmem0",
+      "path_on_host": "./some_file",
+      "root_device": true,
+      "read_only": false
+    }
+]
+```
+
+### API
+
+Similar to other devices `virtio-pmem` can be configured with API calls. An
+example of configuration request:
+
+```console
+curl --unix-socket $socket_location -i \
+    -X PUT 'http://localhost/pmem/pmem0' \
+    -H 'Accept: application/json' \
+    -H 'Content-Type: application/json' \
+    -d "{
+         \"id\": \"pmem0\",
+         \"path_on_host\": \"./some_file\",
+         \"root_device\": true,
+         \"read_only\": false
+    }"
+```
+
+## Security
+
+It is not recommended to use the same backing file for `virtio-pmem` across
+different VMs, as this causes the same physical pages to be mapped to different
+VMs, whcih could be exploited as a side channel by an attacker inside the
+microVM. Users that want to use `virtio-pmem` to share memory are encouraged to
+carefully evaluate the security risk according to their threat model.
+
+## Snapshot support
+
+`virtio-pmem` works with snapshot functionality of Firecracker. Snapshot will
+contain the configuration options provided by the user. During restoration
+process, Firecracker will attempt to restore `virtio-pmem` device by opening
+same backing file as it was configured in the first place. This means all
+`virtio-pmem` backing files should be present in the same locations during
+restore as they were during initial `virtio-pmem` configuration.
+
+## Performance
+
+Event thought `virtio-pmem` allows for the direct access of host pages from the
+guest, the performance of the first access of each page will suffer from the
+internal KVM page fault which will have to set up Guest physical address to Host
+Virtual address translation. Consecutive accesses will not need to go through
+this process again.
+
+Since the number of page faults correlate to the size of the pages used to back
+`virtio-pmem` memory, it is possible to use huge pages to reduce number of
+required page fault. This can be done by using
+[`tmpfs`](https://www.kernel.org/doc/html/latest/filesystems/tmpfs.html) with
+transparent huge pages enabled or by using
+[`hugetblfs`](https://www.kernel.org/doc/html/latest/admin-guide/mm/hugetlbpage.html)
+if `virtio-pmem` is used for memory sharing.
+
+## Memory usage
+
+> [!NOTE] `virtio-pmem` memory can be paged out by the host, because it is
+> backed by a file with `MAP_SHARED` mapping type. To prevent this from
+> happening, you can use `vmtouch` or similar tool to lock file pages from being
+> evicted.
+
+`virtio-pmem` resides in host memory and does increase the maximum possible
+memory usage of a VM since now VM can use all of its RAM and access all of the
+`virtio-pmem` memory. In order to minimize the overhead, it is highly
+recommended to use `DAX` mode to avoid unnecessary duplication of data in guest
+page cache.
+
+As an example, a single VM with 128MB of memory booted from `virtio-pmem` device
+without `DAX` has `RSS` value of ~120MB, while with `DAX` it is ~96MB. The ~96MB
+is similar to memory usage of a VM booted using `virtio-block` as a root device.
+
+In the case where multiple VMs have `virtio-pmem` devices that point to the same
+underlying file the memory overhead can be amortized since total maximum memory
+usage will only include a single instance of `virtio-pmem` memory.
+
+As an example 2 VMs configured with 128MB of RAM without `virtio-pmem` devices
+can consume maximum of 128 + 128 = 256MB of host memory. If each of VMs will
+have a 100MB `virtio-pmem` device attached with shared backing file, the maximum
+memory consumption will be 128 + 128 + 100 = 356MB because 100MB of
+`virtio-pmem` will be shared between VMs.
diff --git a/resources/seccomp/aarch64-unknown-linux-musl.json b/resources/seccomp/aarch64-unknown-linux-musl.json
@@ -215,6 +215,19 @@
                 "syscall": "madvise",
                 "comment": "Used by the VirtIO balloon device and by musl for some customer workloads. It is also used by aws-lc during random number generation. They setup a memory page that mark with MADV_WIPEONFORK to be able to detect forks. They also call it with -1 to see if madvise is supported in certain platforms." 
             },
+            {
+                "syscall": "msync",
+                "comment": "Used by the VirtIO pmem device to sync the file content with the backing file.",
+                "args": [
+                    {
+                        "index": 2,
+                        "type": "dword",
+                        "op": "eq",
+                        "val": 4,
+                        "comment": "libc::MS_SYNC"
+                    }
+                ]
+            },
             {
                 "syscall": "mmap",
                 "comment": "Used by the VirtIO balloon device",

diff --git a/resources/seccomp/x86_64-unknown-linux-musl.json b/resources/seccomp/x86_64-unknown-linux-musl.json
@@ -215,6 +215,19 @@
                 "syscall": "madvise",
                 "comment": "Used by the VirtIO balloon device and by musl for some customer workloads. It is also used by aws-lc during random number generation. They setup a memory page that mark with MADV_WIPEONFORK to be able to detect forks. They also call it with -1 to see if madvise is supported in certain platforms."
             },
+            {
+                "syscall": "msync",
+                "comment": "Used by the VirtIO pmem device to sync the file content with the backing file.",
+                "args": [
+                    {
+                        "index": 2,
+                        "type": "dword",
+                        "op": "eq",
+                        "val": 4,
+                        "comment": "libc::MS_SYNC"
+                    }
+                ]
+            },
             {
                 "syscall": "mmap",
                 "comment": "Used by the VirtIO balloon device",

diff --git a/src/firecracker/src/api_server/parsed_request.rs b/src/firecracker/src/api_server/parsed_request.rs
@@ -24,6 +24,7 @@ use super::request::machine_configuration::{
 use super::request::metrics::parse_put_metrics;
 use super::request::mmds::{parse_get_mmds, parse_patch_mmds, parse_put_mmds};
 use super::request::net::{parse_patch_net, parse_put_net};
+use super::request::pmem::parse_put_pmem;
 use super::request::snapshot::{parse_patch_vm_state, parse_put_snapshot};
 use super::request::version::parse_get_version;
 use super::request::vsock::parse_put_vsock;
@@ -90,6 +91,7 @@ impl TryFrom<&Request> for ParsedRequest {
             (Method::Put, "boot-source", Some(body)) => parse_put_boot_source(body),
             (Method::Put, "cpu-config", Some(body)) => parse_put_cpu_config(body),
             (Method::Put, "drives", Some(body)) => parse_put_drive(body, path_tokens.next()),
+            (Method::Put, "pmem", Some(body)) => parse_put_pmem(body, path_tokens.next()),
             (Method::Put, "logger", Some(body)) => parse_put_logger(body),
             (Method::Put, "serial", Some(body)) => parse_put_serial(body),
             (Method::Put, "machine-config", Some(body)) => parse_put_machine_config(body),

diff --git a/src/firecracker/src/api_server/request/mod.rs b/src/firecracker/src/api_server/request/mod.rs
@@ -13,6 +13,7 @@ pub mod machine_configuration;
 pub mod metrics;
 pub mod mmds;
 pub mod net;
+pub mod pmem;
 pub mod serial;
 pub mod snapshot;
 pub mod version;

diff --git a/src/firecracker/src/api_server/request/pmem.rs b/src/firecracker/src/api_server/request/pmem.rs
@@ -0,0 +1,75 @@
+// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+// SPDX-License-Identifier: Apache-2.0
+
+use vmm::logger::{IncMetric, METRICS};
+use vmm::rpc_interface::VmmAction;
+use vmm::vmm_config::pmem::PmemConfig;
+
+use super::super::parsed_request::{ParsedRequest, RequestError, checked_id};
+use super::{Body, StatusCode};
+
+pub(crate) fn parse_put_pmem(
+    body: &Body,
+    id_from_path: Option<&str>,
+) -> Result<ParsedRequest, RequestError> {
+    METRICS.put_api_requests.pmem_count.inc();
+    let id = if let Some(id) = id_from_path {
+        checked_id(id)?
+    } else {
+        METRICS.put_api_requests.pmem_fails.inc();
+        return Err(RequestError::EmptyID);
+    };
+
+    let device_cfg = serde_json::from_slice::<PmemConfig>(body.raw()).inspect_err(|_| {
+        METRICS.put_api_requests.pmem_fails.inc();
+    })?;
+
+    if id != device_cfg.id {
+        METRICS.put_api_requests.pmem_fails.inc();
+        Err(RequestError::Generic(
+            StatusCode::BadRequest,
+            "The id from the path does not match the id from the body!".to_string(),
+        ))
+    } else {
+        Ok(ParsedRequest::new_sync(VmmAction::InsertPmemDevice(
+            device_cfg,
+        )))
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+    use crate::api_server::parsed_request::tests::vmm_action_from_request;
+
+    #[test]
+    fn test_parse_put_pmem_request() {
+        parse_put_pmem(&Body::new("invalid_payload"), None).unwrap_err();
+        parse_put_pmem(&Body::new("invalid_payload"), Some("id")).unwrap_err();
+
+        let body = r#"{
+            "id": "bar",
+        }"#;
+        parse_put_pmem(&Body::new(body), Some("1")).unwrap_err();
+        let body = r#"{
+            "foo": "1",
+        }"#;
+        parse_put_pmem(&Body::new(body), Some("1")).unwrap_err();
+
+        let body = r#"{
+            "id": "1000",
+            "path_on_host": "dummy",
+            "root_device": true,
+            "read_only": true
+        }"#;
+        let r = vmm_action_from_request(parse_put_pmem(&Body::new(body), Some("1000")).unwrap());
+
+        let expected_config = PmemConfig {
+            id: "1000".to_string(),
+            path_on_host: "dummy".to_string(),
+            root_device: true,
+            read_only: true,
+        };
+        assert_eq!(r, VmmAction::InsertPmemDevice(expected_config));
+    }
+}