[virtio-pmem] Implementation #5463
Open

ShadowCurse wants to merge 14 commits into firecracker-microvm:main from ShadowCurse:virtio_pmem

14 commits

- 0c26352 chore(virtio-pmem): add msync syscall to seccomp filters (ShadowCurse)
- 8ace79e feat(virtio-pmem): add device configs and implementation (ShadowCurse)
- 84dd9ce feat(virtio-pmem): add device to the VmResources (ShadowCurse)
- f0b78e7 feat: add a check for a single root device (ShadowCurse)
- 072d7be feat(virtio-pmem): add API requests parsing to Firecracker (ShadowCurse)
- 478e6bb feat: add allocator for past 64bit memory region (ShadowCurse)
- fab616e feat: add a counter for KVM slots (ShadowCurse)
- fd62b21 feat(virtio-pmem): add device to the Vmm (ShadowCurse)
- fc241c1 feat(virtio-pmem): add snapshot support (ShadowCurse)
- 59cfd48 feat(virtio-pmem): add integration tests (ShadowCurse)
- 7009a6a feat(virtio-pmem): export device metrics (ShadowCurse)
- 54d709a chore(virtio-pmem): document new APIs (ShadowCurse)
- df0fd2b doc(virtio-pmem): add documentation (ShadowCurse)
- 5bac831 chore: update CHANGELOG (ShadowCurse)

Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,210 @@ | ||
# Using the Firecracker `virtio-pmem` device | ||
|
||
## What is a persistent memory device | ||
|
||
Persistent memory is a type of non-volatile, CPU-accessible (with the usual | ||
load/store instructions) memory that does not lose its content on power loss. In | ||
other words, all writes to the memory persist across power cycles. In hardware | ||
this is known as NVDIMM (Non-Volatile Dual In-line Memory Module). | ||
|
||
## What is a `virtio-pmem` device | ||
|
||
[`virtio-pmem`](https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html#x1-68900019) | ||
is a device which emulates a persistent memory device without requiring a | ||
physical NVDIMM device to be present on the host system. `virtio-pmem` is backed | ||
by a memory-mapped file on the host side and is exposed to the guest kernel as a | ||
region in the guest physical memory. This allows the guest to directly access | ||
the host memory pages without a need to use a guest driver or interact with the | ||
VMM. From the guest user-space perspective, `virtio-pmem` devices are presented | ||
as normal block devices such as `/dev/pmem0`. This allows `virtio-pmem` to be | ||
used as a rootfs device so that the VM can boot from it. | ||
|
||
> [!NOTE] | ||
> | ||
> Since `virtio-pmem` is located fully in memory, when used as a block device | ||
> there is no need to use the guest page cache for its operations. This | ||
> behaviour can be configured by using the `DAX` feature of the kernel. | ||
> | ||
> - To mount a device with `DAX`, add `-o dax` to the `mount` command. | ||
> - To configure a root device with `DAX`, append `rootflags=dax` to the kernel | ||
> arguments. | ||
> | ||
> `DAX` support is not uniform for all file systems. Check the kernel | ||
> [documentation](https://github.com/torvalds/linux/blob/master/Documentation/filesystems/dax.rst) | ||
> for more information. | ||
|
||
## Prerequisites | ||
|
||
In order to use a `virtio-pmem` device, the guest kernel needs to be built with | ||
support for it. The full list of configuration options needed for `virtio-pmem` | ||
and `DAX` is: | ||
|
||
``` | ||
# Needed for DAX on aarch64. Will be ignored on x86_64 | ||
CONFIG_ARM64_PMEM=y | ||
|
||
CONFIG_DEVICE_MIGRATION=y | ||
CONFIG_ZONE_DEVICE=y | ||
CONFIG_VIRTIO_PMEM=y | ||
CONFIG_LIBNVDIMM=y | ||
CONFIG_BLK_DEV_PMEM=y | ||
CONFIG_ND_CLAIM=y | ||
CONFIG_ND_BTT=y | ||
CONFIG_BTT=y | ||
CONFIG_ND_PFN=y | ||
CONFIG_NVDIMM_PFN=y | ||
CONFIG_NVDIMM_DAX=y | ||
CONFIG_OF_PMEM=y | ||
CONFIG_NVDIMM_KEYS=y | ||
CONFIG_DAX=y | ||
CONFIG_DEV_DAX=y | ||
CONFIG_DEV_DAX_PMEM=y | ||
CONFIG_DEV_DAX_KMEM=y | ||
CONFIG_FS_DAX=y | ||
CONFIG_FS_DAX_PMD=y | ||
``` | ||
|
||
## Configuration | ||
|
||
The Firecracker implementation exposes these config options for the | ||
`virtio-pmem` device: | ||
|
||
- `id` - id of the device for internal use | ||
- `path_on_host` - path to the backing file | ||
- `root_device` - toggle to use this device as the root device. The device will | ||
be marked as `rw` in the kernel arguments | ||
- `read_only` - tells Firecracker to `mmap` the backing file in read-only mode. | ||
If this device is also configured as `root_device`, it will be marked as `ro` | ||
in the kernel arguments | ||
|
||
> [!NOTE] | ||
> | ||
> Devices will be exposed to the guest in the order in which they are configured | ||
> with sequential names in the form of `/dev/pmem{N}` like: `/dev/pmem0`, | ||
> `/dev/pmem1` ... | ||
|
||
> [!WARNING] | ||
> | ||
> Setting a `virtio-pmem` device to `read-only` mode can lead to the VM shutting | ||
> down on any attempt to write to the device. This is because, from the guest | ||
> kernel's perspective, `virtio-pmem` is always `read-write` capable. Use | ||
> `read-only` mode only if you want to ensure the underlying file is never | ||
> written to. | ||
ShadowCurse marked this conversation as resolved.
> | ||
> To mount the `pmem` device with the `read-only` option, add `-o ro` to the | ||
> `mount` command. | ||
> | ||
> The exact behaviour differs per platform: | ||
> | ||
> - x86_64 - if KVM is able to decode the write instruction used by the guest, | ||
> it will return an MMIO_WRITE to Firecracker, where it will be discarded and | ||
> a warning will be logged. | ||
> - aarch64 - the instruction emulation is much stricter. Writes will result in | ||
> an internal KVM error, which will be returned to Firecracker in the form of | ||
> an `ENOSYS` error. This will make Firecracker stop the VM with an appropriate | ||
> log message. | ||
|
||
> [!WARNING] | ||
> | ||
> `virtio-pmem` requires the guest-exposed memory region to be 2MB aligned. | ||
> This requirement is transitively carried over to the backing file of the | ||
> `virtio-pmem` device. Firecracker allows users to configure `virtio-pmem` with | ||
> a backing file of any size and fills the memory gap between the end of the | ||
> file and the next 2MB boundary with empty `PRIVATE | ANONYMOUS` memory pages. | ||
> Users must be careful not to write to this memory gap, since it will not be | ||
> synchronized with the backing file. This is not an issue if `virtio-pmem` is | ||
> configured in `read-only` mode. | ||
|
||
### Config file | ||
|
||
Configuration of the `virtio-pmem` device from a config file follows a pattern | ||
similar to the `virtio-block` section. Here is an example configuration for a | ||
single `virtio-pmem` device: | ||
|
||
```json | ||
"pmem": [ | ||
  { | ||
    "id": "pmem0", | ||
    "path_on_host": "./some_file", | ||
    "root_device": true, | ||
    "read_only": false | ||
  } | ||
] | ||
``` | ||
|
||
### API | ||
|
||
Similar to other devices, `virtio-pmem` can be configured with API calls. An | ||
example of a configuration request: | ||
|
||
```console | ||
curl --unix-socket $socket_location -i \ | ||
    -X PUT 'http://localhost/pmem/pmem0' \ | ||
    -H 'Accept: application/json' \ | ||
    -H 'Content-Type: application/json' \ | ||
    -d "{ | ||
        \"id\": \"pmem0\", | ||
        \"path_on_host\": \"./some_file\", | ||
        \"root_device\": true, | ||
        \"read_only\": false | ||
    }" | ||
``` | ||
ShadowCurse marked this conversation as resolved.
|
||
## Security | ||
|
||
It is not recommended to use the same backing file for `virtio-pmem` across | ||
different VMs, as this causes the same physical pages to be mapped to different | ||
VMs, which could be exploited as a side channel by an attacker inside the | ||
microVM. Users that want to use `virtio-pmem` to share memory are encouraged to | ||
carefully evaluate the security risk according to their threat model. | ||
|
||
## Snapshot support | ||
|
||
`virtio-pmem` works with the snapshot functionality of Firecracker. A snapshot | ||
contains the configuration options provided by the user. During the restoration | ||
process, Firecracker will attempt to restore each `virtio-pmem` device by | ||
opening the same backing file it was originally configured with. This means all | ||
`virtio-pmem` backing files should be present in the same locations during | ||
restore as they were during the initial `virtio-pmem` configuration. | ||
|
||
## Performance | ||
|
||
Even though `virtio-pmem` allows direct access to host pages from the guest, | ||
the first access to each page will be slower because it triggers an internal | ||
KVM page fault, which has to set up the guest physical address to host virtual | ||
address translation. Subsequent accesses will not need to go through this | ||
process again. | ||
|
||
Since the number of page faults depends on the size of the pages used to back | ||
`virtio-pmem` memory, it is possible to use huge pages to reduce the number of | ||
required page faults. This can be done by using | ||
[`tmpfs`](https://www.kernel.org/doc/html/latest/filesystems/tmpfs.html) with | ||
transparent huge pages enabled or by using | ||
[`hugetlbfs`](https://www.kernel.org/doc/html/latest/admin-guide/mm/hugetlbpage.html) | ||
if `virtio-pmem` is used for memory sharing. | ||
|
||
## Memory usage | ||
|
||
> [!NOTE] | ||
> | ||
> `virtio-pmem` memory can be paged out by the host, because it is backed by a | ||
> file with the `MAP_SHARED` mapping type. To prevent this from happening, you | ||
> can use `vmtouch` or a similar tool to lock the file pages in memory. | ||
|
||
`virtio-pmem` resides in host memory and increases the maximum possible memory | ||
usage of a VM, since the VM can now use all of its RAM and also access all of | ||
the `virtio-pmem` memory. In order to minimize the overhead, it is highly | ||
recommended to use `DAX` mode to avoid unnecessary duplication of data in the | ||
guest page cache. | ||
|
||
As an example, a single VM with 128MB of memory booted from a `virtio-pmem` | ||
device without `DAX` has an `RSS` value of ~120MB, while with `DAX` it is ~96MB. | ||
The ~96MB is similar to the memory usage of a VM booted using `virtio-block` as | ||
a root device. | ||
|
||
In the case where multiple VMs have `virtio-pmem` devices that point to the same | ||
underlying file, the memory overhead can be amortized, since the total maximum | ||
memory usage will only include a single instance of the `virtio-pmem` memory. | ||
|
||
As an example, 2 VMs configured with 128MB of RAM without `virtio-pmem` devices | ||
can consume a maximum of 128 + 128 = 256MB of host memory. If each of the VMs | ||
has a 100MB `virtio-pmem` device attached with a shared backing file, the | ||
maximum memory consumption will be 128 + 128 + 100 = 356MB because 100MB of | ||
`virtio-pmem` will be shared between VMs. |
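
The 2MB alignment rule in the second warning of the "Configuration" section comes down to rounding the backing file size up to the next boundary. A minimal sketch of that arithmetic, assuming "2MB" means 2 MiB (2 * 1024 * 1024 bytes); `ALIGNMENT`, `padded_len` and `gap_len` are hypothetical names chosen for illustration and are not taken from the Firecracker sources:

```rust
/// Assumed alignment of the guest-exposed region: 2 MiB.
/// Assumption: the "2MB" in the document is read as 2 * 1024 * 1024 bytes.
const ALIGNMENT: u64 = 2 * 1024 * 1024;

/// Size of the guest-exposed region for a backing file of `file_len` bytes,
/// i.e. `file_len` rounded up to the next multiple of `ALIGNMENT`.
/// Returns `None` if the rounded value does not fit in a `u64`.
fn padded_len(file_len: u64) -> Option<u64> {
    let remainder = file_len % ALIGNMENT;
    if remainder == 0 {
        Some(file_len)
    } else {
        file_len.checked_add(ALIGNMENT - remainder)
    }
}

/// Number of bytes between the end of the file and the next boundary.
/// This is the range that is not synchronized with the backing file.
fn gap_len(file_len: u64) -> Option<u64> {
    padded_len(file_len).map(|padded| padded - file_len)
}
```

With these definitions, `gap_len(3 * 1024 * 1024)` is `Some(1024 * 1024)`: a 3 MiB file is exposed as a 4 MiB region, and the last 1 MiB is the gap that guest writes must avoid. A file whose size is already a multiple of 2 MiB has no gap.
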
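
The arithmetic in the "Memory usage" section can be written as a small function that counts guest RAM per VM and each distinct backing file once. This is a sketch of the document's model only: `Vm` and `max_host_memory_mb` are hypothetical names, backing files are identified by comparing path strings (so two different paths to the same file would be counted twice), and other host memory overheads are ignored.

```rust
use std::collections::HashSet;

/// Hypothetical description of one microVM, for illustration only.
struct Vm<'a> {
    /// Guest RAM in MB.
    ram_mb: u64,
    /// `(backing file path, size in MB)` for each attached `virtio-pmem` device.
    pmem: Vec<(&'a str, u64)>,
}

/// Upper bound on host memory in MB under the document's model:
/// every VM's RAM is counted, and each distinct backing file is counted
/// once, no matter how many VMs map it.
fn max_host_memory_mb(vms: &[Vm<'_>]) -> u64 {
    let mut seen = HashSet::new();
    let mut total = 0u64;
    for vm in vms {
        total += vm.ram_mb;
        for &(path, size_mb) in &vm.pmem {
            // `insert` returns true only the first time a path is seen.
            if seen.insert(path) {
                total += size_mb;
            }
        }
    }
    total
}
```

Two VMs with 128MB of RAM that share one 100MB backing file give 356, matching the example in the text; with two separate 100MB files the same function gives 456.
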
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,75 @@ | ||
// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. | ||
// SPDX-License-Identifier: Apache-2.0 | ||
|
||
use vmm::logger::{IncMetric, METRICS}; | ||
use vmm::rpc_interface::VmmAction; | ||
use vmm::vmm_config::pmem::PmemConfig; | ||
|
||
use super::super::parsed_request::{ParsedRequest, RequestError, checked_id}; | ||
use super::{Body, StatusCode}; | ||
|
||
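/// Parses a `PUT` request for a `virtio-pmem` device into a | ||
/// `VmmAction::InsertPmemDevice` action. | ||
/// | ||
/// Fails if the path carries no id or an id rejected by `checked_id`, if the | ||
/// body does not deserialize into a `PmemConfig`, or if the id in the path | ||
/// differs from the id in the body. | ||
pub(crate) fn parse_put_pmem( | ||
    body: &Body, | ||
    id_from_path: Option<&str>, | ||
) -> Result<ParsedRequest, RequestError> { | ||
    METRICS.put_api_requests.pmem_count.inc(); | ||
    // The device id must be present in the request path. | ||
    let id = if let Some(id) = id_from_path { | ||
        checked_id(id)? | ||
    } else { | ||
        METRICS.put_api_requests.pmem_fails.inc(); | ||
        return Err(RequestError::EmptyID); | ||
    }; | ||
|
||
    // Count a failure if the body is not a valid `PmemConfig`. | ||
    let device_cfg = serde_json::from_slice::<PmemConfig>(body.raw()).inspect_err(|_| { | ||
        METRICS.put_api_requests.pmem_fails.inc(); | ||
    })?; | ||
|
||
    // The id in the path and the id in the body must agree. | ||
    if id != device_cfg.id { | ||
        METRICS.put_api_requests.pmem_fails.inc(); | ||
        Err(RequestError::Generic( | ||
            StatusCode::BadRequest, | ||
            "The id from the path does not match the id from the body!".to_string(), | ||
        )) | ||
    } else { | ||
        Ok(ParsedRequest::new_sync(VmmAction::InsertPmemDevice( | ||
            device_cfg, | ||
        ))) | ||
    } | ||
} | ||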
|
||
#[cfg(test)] | ||
mod tests { | ||
    use super::*; | ||
    use crate::api_server::parsed_request::tests::vmm_action_from_request; | ||
|
||
    #[test] | ||
    fn test_parse_put_pmem_request() { | ||
        // A missing path id or an invalid payload is rejected. | ||
        parse_put_pmem(&Body::new("invalid_payload"), None).unwrap_err(); | ||
        parse_put_pmem(&Body::new("invalid_payload"), Some("id")).unwrap_err(); | ||
|
||
        // Bodies that do not deserialize into a `PmemConfig` are rejected. | ||
        let body = r#"{ | ||
            "id": "bar", | ||
        }"#; | ||
        parse_put_pmem(&Body::new(body), Some("1")).unwrap_err(); | ||
        let body = r#"{ | ||
            "foo": "1", | ||
        }"#; | ||
        parse_put_pmem(&Body::new(body), Some("1")).unwrap_err(); | ||
|
||
        // A complete config whose id matches the path is accepted. | ||
        let body = r#"{ | ||
            "id": "1000", | ||
            "path_on_host": "dummy", | ||
            "root_device": true, | ||
            "read_only": true | ||
        }"#; | ||
        let r = vmm_action_from_request(parse_put_pmem(&Body::new(body), Some("1000")).unwrap()); | ||
|
||
        let expected_config = PmemConfig { | ||
            id: "1000".to_string(), | ||
            path_on_host: "dummy".to_string(), | ||
            root_device: true, | ||
            read_only: true, | ||
        }; | ||
        assert_eq!(r, VmmAction::InsertPmemDevice(expected_config)); | ||
    } | ||
} |
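
`parse_put_pmem` can reject a request for several reasons; two of them concern only the device id: no id in the path, and a path id that differs from the body id. The sketch below isolates those two checks using standard-library types only. `IdError` and `check_ids` are hypothetical stand-ins introduced for illustration; the `checked_id` validation, the JSON parsing and the metrics of the real function are deliberately left out.

```rust
/// Hypothetical stand-in for the two id-related failures of `parse_put_pmem`.
#[derive(Debug, PartialEq)]
enum IdError {
    /// The request path carried no id (`RequestError::EmptyID` in the diff).
    EmptyId,
    /// The id in the path differs from the `id` field of the body.
    Mismatch,
}

/// Mirrors the order of the id checks in `parse_put_pmem`: the path id must
/// exist before it is compared with the id from the body.
fn check_ids(id_from_path: Option<&str>, id_from_body: &str) -> Result<(), IdError> {
    let id = id_from_path.ok_or(IdError::EmptyId)?;
    if id != id_from_body {
        return Err(IdError::Mismatch);
    }
    Ok(())
}
```

For example, `check_ids(Some("1"), "bar")` yields `Err(IdError::Mismatch)`, which corresponds to the case the real function answers with `StatusCode::BadRequest`.
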

We should mention these in `kernel-policy.md`. Maybe put a link here?

I put a link into `kernel-policy` that points to this section (since there are a lot of configs).
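
The kernel options this thread refers to are the ones listed in the "Prerequisites" section of the new document. A minimal sketch of checking a guest kernel `.config` for them, assuming the usual `NAME=y` line format; `REQUIRED_OPTIONS` copies that list without the aarch64-only `CONFIG_ARM64_PMEM`, and `missing_options` is a hypothetical helper, not part of Firecracker:

```rust
use std::collections::HashSet;

/// Options from the "Prerequisites" list that apply to every architecture.
/// `CONFIG_ARM64_PMEM` is left out because the document marks it aarch64-only.
const REQUIRED_OPTIONS: &[&str] = &[
    "CONFIG_DEVICE_MIGRATION",
    "CONFIG_ZONE_DEVICE",
    "CONFIG_VIRTIO_PMEM",
    "CONFIG_LIBNVDIMM",
    "CONFIG_BLK_DEV_PMEM",
    "CONFIG_ND_CLAIM",
    "CONFIG_ND_BTT",
    "CONFIG_BTT",
    "CONFIG_ND_PFN",
    "CONFIG_NVDIMM_PFN",
    "CONFIG_NVDIMM_DAX",
    "CONFIG_OF_PMEM",
    "CONFIG_NVDIMM_KEYS",
    "CONFIG_DAX",
    "CONFIG_DEV_DAX",
    "CONFIG_DEV_DAX_PMEM",
    "CONFIG_DEV_DAX_KMEM",
    "CONFIG_FS_DAX",
    "CONFIG_FS_DAX_PMD",
];

/// Returns the entries of `required` that are not set to `y` in `config_text`,
/// in the order in which they appear in `required`.
///
/// Assumption: only `NAME=y` counts as enabled, as in the document's list;
/// options built as modules (`=m`) are reported as missing.
fn missing_options<'a>(config_text: &str, required: &[&'a str]) -> Vec<&'a str> {
    let enabled: HashSet<&str> = config_text
        .lines()
        .filter_map(|line| line.trim().strip_suffix("=y"))
        .collect();
    required
        .iter()
        .copied()
        .filter(|name| !enabled.contains(*name))
        .collect()
}
```

Calling `missing_options(&config_text, REQUIRED_OPTIONS)` on the contents of a guest kernel `.config` returns the options that still need to be enabled; on aarch64, `CONFIG_ARM64_PMEM` has to be checked as well.
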