Memory hotplug
This document explains in technical detail the interactions between the NEMU hypervisor (relying on the new virt machine type) and the Linux guest OS running inside the virtual machine. It highlights how the ACPI tables enable communication between the hypervisor and the guest OS, and explains how memory hotplug is performed.
This same documentation can be applied to any machine type leveraging the Hardware-Reduced ACPI specification.
Here is a quick overview of the different components involved in the memory hotplug mechanism:
- The user is the one triggering the insertion or the removal of a block of memory, relying on the command line provided by the hypervisor.
- Once the hypervisor gets the information that some hotplug operation needs to be applied on a memory slot, it notifies the guest OS about it.
- The guest OS relies on the ACPI tables, and particularly the DSDT one, to evaluate the ACPI method associated with the event received. At some point, this method will notify the guest OS itself to trigger the appropriate driver handling the memory hotplug internally. The method is MSCN, which is used by the hypervisor as the mechanism to notify the guest OS.
- The guest OS and the hypervisor then exchange information back and forth through the range of I/O ports defined in the ACPI tables. The hypervisor uses it to report the status of the memory slot, and the guest OS uses it to report the status of the hotplug operation.
A range of I/O ports is a convenient way for the hypervisor to establish some communication with the guest OS. By creating those regions and the ACPI methods accessing them, the hypervisor defines the expected memory accesses the guest OS will perform when evaluating one of those ACPI methods. Anytime a memory access to one of those regions is performed by the guest OS, the hypervisor traps it and takes the appropriate actions.
Here are the details of the range of I/O ports used by NEMU to communicate regarding its memory slots:
- Base address: 0x0A00
- Size of the region: 0x18 bytes
Here is how this range of I/O ports is defined through ACPI tables:
DSDT table
Device (\_SB.MHPD)
{
Name (_HID, "PNP0A06" /* Generic Container Device */) // _HID: Hardware ID
Name (_UID, "Memory hotplug resources") // _UID: Unique ID
Name (_CRS, ResourceTemplate () // _CRS: Current Resource Settings
{
IO (Decode16,
0x0A00, // Range Minimum
0x0A00, // Range Maximum
0x00, // Alignment
0x18, // Length
)
})
OperationRegion (HPMR, SystemIO, 0x0A00, 0x18)
}
Device (\_SB.MHPC)
{
Name (_HID, "PNP0A06" /* Generic Container Device */) // _HID: Hardware ID
Name (_UID, "DIMM devices") // _UID: Unique ID
Name (MDNR, 0x03)
Field (\_SB.MHPD.HPMR, DWordAcc, NoLock, Preserve)
{
MRBL, 32,
MRBH, 32,
MRLL, 32,
MRLH, 32,
MPX, 32
}
Field (\_SB.MHPD.HPMR, ByteAcc, NoLock, WriteAsZeros)
{
Offset (0x14),
MES, 1,
MINS, 1,
MRMV, 1,
MEJ, 1
}
Field (\_SB.MHPD.HPMR, DWordAcc, NoLock, Preserve)
{
MSEL, 32,
MOEV, 32,
MOSC, 32
}
...
}
- MDNR: Number of memory slots chosen by the user through the NEMU CLI, by using the flag -m with the option slots=3. This value is directly hardcoded into the DSDT.
- MRBL: Memory resource base address, low part. Lower 32 bits of the 64-bit memory slot address, read as the double word at offset 0x00 from the base address 0x0A00.
- MRBH: Memory resource base address, high part. Upper 32 bits of the 64-bit memory slot address, read as the double word at offset 0x04 from the base address 0x0A00.
- MRLL: Memory slot size, low part. Lower 32 bits of the 64-bit memory slot size, read as the double word at offset 0x08 from the base address 0x0A00.
- MRLH: Memory slot size, high part. Upper 32 bits of the 64-bit memory slot size, read as the double word at offset 0x0C from the base address 0x0A00.
- MPX: Node proximity. It is read as the double word at offset 0x10 from the base address 0x0A00.
- MES: Memory slot enable. It is a flag indicating if the memory slot is enabled, defined by bit 0 of the byte at offset 0x14 from the base address 0x0A00.
- MINS: Memory slot insert. It is a flag indicating if the memory slot needs to be inserted, defined by bit 1 of the byte at offset 0x14 from the base address 0x0A00.
- MRMV: Memory slot remove. It is a flag indicating if the memory slot needs to be removed, defined by bit 2 of the byte at offset 0x14 from the base address 0x0A00.
- MEJ: Memory slot eject. It is a flag used to eject the memory slot, defined by bit 3 of the byte at offset 0x14 from the base address 0x0A00.
- MSEL: Memory slot selector. It is a value indicating the memory slot index, written as the double word at offset 0x00 from the base address 0x0A00.
- MOEV: Guest OS event. It is a value always provided by the guest OS to define the type of event that happened regarding a specific memory slot. It is written as the double word at offset 0x04 from the base address 0x0A00.
- MOSC: Guest OS status. It is a value always provided by the guest OS to provide a status regarding a specific memory slot. It is written as the double word at offset 0x08 from the base address 0x0A00.

MSEL, MOEV and MOSC share their offsets with MRBL, MRBH and MRLL: the former are only meaningful on writes and the latter only on reads, as the read and write callbacks below show.
Note: Both ByteAcc and DWordAcc specify the type of access for each field of the whole range of I/O ports.
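The register layout described above can be summarized as a handful of constants. The following is only a sketch: the offsets and bit positions come from the DSDT excerpt, but every identifier (MHP_IO_BASE, mhp_port(), mhp_flag() and so on) is a hypothetical name chosen for illustration and does not exist in NEMU.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets inside the 0x18-byte I/O window based at 0x0A00.
 * All names here are illustrative, not NEMU identifiers. */
enum {
    MHP_IO_BASE      = 0x0A00,
    MHP_IO_LEN       = 0x18,
    /* Registers as seen by a read */
    MHP_R_ADDR_LO    = 0x00, /* MRBL */
    MHP_R_ADDR_HI    = 0x04, /* MRBH */
    MHP_R_SIZE_LO    = 0x08, /* MRLL */
    MHP_R_SIZE_HI    = 0x0C, /* MRLH */
    MHP_R_PXM        = 0x10, /* MPX  */
    /* Registers as seen by a write (same offsets, different meaning) */
    MHP_W_SELECTOR   = 0x00, /* MSEL */
    MHP_W_OST_EVENT  = 0x04, /* MOEV */
    MHP_W_OST_STATUS = 0x08, /* MOSC */
    /* Flags byte, used by both reads and writes */
    MHP_FLAGS        = 0x14
};

/* Bit positions inside the flags byte at offset 0x14. */
enum {
    MHP_FLAG_ENABLED   = 1u << 0, /* MES  */
    MHP_FLAG_INSERTING = 1u << 1, /* MINS */
    MHP_FLAG_REMOVING  = 1u << 2, /* MRMV */
    MHP_FLAG_EJECT     = 1u << 3  /* MEJ  */
};

/* Absolute I/O port number of a register. */
static uint16_t mhp_port(uint8_t offset)
{
    return (uint16_t)(MHP_IO_BASE + offset);
}

/* Value of a one-bit field as AML would see it (0 or 1). */
static unsigned mhp_flag(uint8_t flags, unsigned mask)
{
    return (flags & mask) ? 1u : 0u;
}
```

For instance, a flags byte of 0x03 decodes as MES = 1 and MINS = 1, which is the state of a slot that has been plugged by the hypervisor but not yet acknowledged by the guest OS.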
And here is the code from NEMU taking care of handling any read/write operation from/to this range:
hw/acpi/memory_hotplug.c: Callbacks declaration
static const MemoryRegionOps acpi_memory_hotplug_ops = {
.read = acpi_memory_hotplug_read,
.write = acpi_memory_hotplug_write,
.endianness = DEVICE_LITTLE_ENDIAN,
.valid = {
.min_access_size = 1,
.max_access_size = 4,
},
};
Here is the important structure MemHotplugState used by those two callbacks:
typedef struct MemHotplugState {
bool is_enabled; /* true if memory hotplug is supported */
MemoryRegion io;
uint32_t selector;
uint32_t dev_count;
MemStatus *devs;
} MemHotplugState;
hw/acpi/memory_hotplug.c: Read operations
static uint64_t acpi_memory_hotplug_read(void *opaque, hwaddr addr,
unsigned int size)
{
uint32_t val = 0;
MemHotplugState *mem_st = opaque;
MemStatus *mdev;
Object *o;
if (mem_st->selector >= mem_st->dev_count) {
trace_mhp_acpi_invalid_slot_selected(mem_st->selector);
return 0;
}
mdev = &mem_st->devs[mem_st->selector];
o = OBJECT(mdev->dimm);
switch (addr) {
case 0x0: /* Lo part of phys address where DIMM is mapped */
val = o ? object_property_get_uint(o, PC_DIMM_ADDR_PROP, NULL) : 0;
trace_mhp_acpi_read_addr_lo(mem_st->selector, val);
break;
case 0x4: /* Hi part of phys address where DIMM is mapped */
val =
o ? object_property_get_uint(o, PC_DIMM_ADDR_PROP, NULL) >> 32 : 0;
trace_mhp_acpi_read_addr_hi(mem_st->selector, val);
break;
case 0x8: /* Lo part of DIMM size */
val = o ? object_property_get_uint(o, PC_DIMM_SIZE_PROP, NULL) : 0;
trace_mhp_acpi_read_size_lo(mem_st->selector, val);
break;
case 0xc: /* Hi part of DIMM size */
val =
o ? object_property_get_uint(o, PC_DIMM_SIZE_PROP, NULL) >> 32 : 0;
trace_mhp_acpi_read_size_hi(mem_st->selector, val);
break;
case 0x10: /* node proximity for _PXM method */
val = o ? object_property_get_uint(o, PC_DIMM_NODE_PROP, NULL) : 0;
trace_mhp_acpi_read_pxm(mem_st->selector, val);
break;
case 0x14: /* pack and return is_* fields */
val |= mdev->is_enabled ? 1 : 0;
val |= mdev->is_inserting ? 2 : 0;
val |= mdev->is_removing ? 4 : 0;
trace_mhp_acpi_read_flags(mem_st->selector, val);
break;
default:
val = ~0;
break;
}
return val;
}
This function is the callback handling any read to the range of I/O ports defined above. Depending on the address offset being read, the hypervisor returns different values:
- 0x00: Accessed when MRBL is read.
- 0x04: Accessed when MRBH is read.
- 0x08: Accessed when MRLL is read.
- 0x0C: Accessed when MRLH is read.
- 0x10: Accessed when MPX is read.
- 0x14: Accessed when one of the flags MES, MINS, or MRMV is read. The hypervisor simply returns the value based on its internal structures.
For any of these operations, the memory slot mdev has to be selected first, by writing the memory slot index to the memory slot selector field MSEL. This way, a subsequent read returns the data of the right memory slot.
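The select-then-read protocol can be illustrated with a small stand-alone model. This is a simplified sketch, not NEMU code: struct mhp_dev, dev_read(), dev_write() and guest_slot_addr() are hypothetical names, and only offsets 0x00 to 0x0C are modelled.

```c
#include <assert.h>
#include <stdint.h>

#define NSLOTS 3

/* Hypothetical, simplified stand-in for the hypervisor state. */
struct slot { uint64_t addr; uint64_t size; };
struct mhp_dev { uint32_t selector; struct slot slots[NSLOTS]; };

/* Same shape as the write callback, for offset 0x00 (MSEL) only. */
static void dev_write(struct mhp_dev *d, uint32_t off, uint32_t val)
{
    if (off == 0x00) {
        d->selector = val;
    }
}

/* Same shape as the read callback, for offsets 0x00..0x0C. */
static uint32_t dev_read(const struct mhp_dev *d, uint32_t off)
{
    const struct slot *s;

    if (d->selector >= NSLOTS) {
        return 0; /* invalid slot selected */
    }
    s = &d->slots[d->selector];
    switch (off) {
    case 0x00: return (uint32_t)s->addr;         /* MRBL */
    case 0x04: return (uint32_t)(s->addr >> 32); /* MRBH */
    case 0x08: return (uint32_t)s->size;         /* MRLL */
    case 0x0C: return (uint32_t)(s->size >> 32); /* MRLH */
    default:   return ~0u;
    }
}

/* Guest side: select a slot, then rebuild its 64-bit base address. */
static uint64_t guest_slot_addr(struct mhp_dev *d, uint32_t slot)
{
    uint64_t lo, hi;

    dev_write(d, 0x00, slot);  /* MSEL = slot */
    lo = dev_read(d, 0x00);    /* MRBL */
    hi = dev_read(d, 0x04);    /* MRBH */
    return (hi << 32) | lo;
}
```

Because the selector is shared state, a sequence of accesses for one slot must not be interleaved with accesses for another one, which is presumably why the AML methods shown later take the MLCK lock around them.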
hw/acpi/memory_hotplug.c: Write operations
static void acpi_memory_hotplug_write(void *opaque, hwaddr addr, uint64_t data,
unsigned int size)
{
MemHotplugState *mem_st = opaque;
MemStatus *mdev;
ACPIOSTInfo *info;
DeviceState *dev = NULL;
HotplugHandler *hotplug_ctrl = NULL;
Error *local_err = NULL;
if (!mem_st->dev_count) {
return;
}
if (addr) {
if (mem_st->selector >= mem_st->dev_count) {
trace_mhp_acpi_invalid_slot_selected(mem_st->selector);
return;
}
}
switch (addr) {
case 0x0: /* DIMM slot selector */
mem_st->selector = data;
trace_mhp_acpi_write_slot(mem_st->selector);
break;
case 0x4: /* _OST event */
mdev = &mem_st->devs[mem_st->selector];
if (data == 1) {
/* TODO: handle device insert OST event */
} else if (data == 3) {
/* TODO: handle device remove OST event */
}
mdev->ost_event = data;
trace_mhp_acpi_write_ost_ev(mem_st->selector, mdev->ost_event);
break;
case 0x8: /* _OST status */
mdev = &mem_st->devs[mem_st->selector];
mdev->ost_status = data;
trace_mhp_acpi_write_ost_status(mem_st->selector, mdev->ost_status);
/* TODO: implement memory removal on guest signal */
info = acpi_memory_device_status(mem_st->selector, mdev);
qapi_event_send_acpi_device_ost(info, &error_abort);
qapi_free_ACPIOSTInfo(info);
break;
case 0x14: /* set is_* fields */
mdev = &mem_st->devs[mem_st->selector];
if (data & 2) { /* clear insert event */
mdev->is_inserting = false;
trace_mhp_acpi_clear_insert_evt(mem_st->selector);
} else if (data & 4) {
mdev->is_removing = false;
trace_mhp_acpi_clear_remove_evt(mem_st->selector);
} else if (data & 8) {
if (!mdev->is_enabled) {
trace_mhp_acpi_ejecting_invalid_slot(mem_st->selector);
break;
}
dev = DEVICE(mdev->dimm);
hotplug_ctrl = qdev_get_hotplug_handler(dev);
/* call pc-dimm unplug cb */
hotplug_handler_unplug(hotplug_ctrl, dev, &local_err);
if (local_err) {
trace_mhp_acpi_pc_dimm_delete_failed(mem_st->selector);
qapi_event_send_mem_unplug_error(dev->id,
error_get_pretty(local_err),
&error_abort);
error_free(local_err);
break;
}
trace_mhp_acpi_pc_dimm_deleted(mem_st->selector);
}
break;
default:
break;
}
}
This function is the callback handling any write to the I/O region defined above.
Here is what each write to the I/O region does:
- 0x00: Accessed when writing to MSEL, setting the memory slot selector with the value from the write access.
- 0x04: Accessed when writing to MOEV, setting the guest OS event regarding the memory slot, based on the value from the write access.
- 0x08: Accessed when writing to MOSC, setting the guest OS status regarding the memory slot, based on the value from the write access.
- 0x14: Accessed when one of the flags MINS, MRMV or MEJ is written. Writing 1 to MINS or MRMV clears the flag indicating that the memory slot needs to be inserted or removed. Writing 1 to MEJ triggers the ejection of the memory slot, and the hypervisor takes care of completing the memory slot removal.
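The behaviour of a write to offset 0x14 can be condensed into the following sketch. It keeps the if / else-if structure of the write callback but replaces the actual unplug call with a return value; struct slot_status and flags_write() are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the per-slot state kept by the hypervisor. */
struct slot_status {
    bool is_enabled;
    bool is_inserting;
    bool is_removing;
};

/*
 * Model of a write to the flags byte. Returns true when the write
 * would lead the hypervisor to call the unplug handler for the slot.
 */
static bool flags_write(struct slot_status *s, uint32_t data)
{
    if (data & 2) {          /* MINS written with 1: clear insert event */
        s->is_inserting = false;
    } else if (data & 4) {   /* MRMV written with 1: clear remove event */
        s->is_removing = false;
    } else if (data & 8) {   /* MEJ written with 1: eject request */
        return s->is_enabled; /* ejecting a disabled slot is rejected */
    }
    return false;
}
```

Note that the branches are mutually exclusive: a single write handles at most one flag, the lowest set bit winning. This matches the way the AML accesses the register, since the field is declared WriteAsZeros and each store sets one bit only.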
This section will focus on the interactions between the components, from the moment the memory block is added by the user through the NEMU CLI up to the guest OS.
Let's start where everything starts, the main() function in vl.c.
vl.c: main()
-> monitor_init_globals()
monitor.c: monitor_init_globals()
-> monitor_init_qmp_commands()
-> qmp_register_command()
When calling into qmp_register_command(), NEMU registers the callback qmp_device_add() that is called whenever a device is hotplugged using the monitor (QMP).
Let's create an extra block of 1 GiB of memory by hotplugging a new memory slot. First, we need to create the memory object representing 1 GiB, and then we can hotplug it through one of the available slots of the VM:
QEMU 3.0.0 monitor - type 'help' for more information
(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
Once the command is issued, the callback previously registered is triggered:
qdev-monitor.c: qmp_device_add()
-> qdev_device_add()
qdev_device_add() is the central piece of code where the device options are parsed, which eventually triggers the creation of the device based on those parameters.
Once the driver name has been retrieved (pc-dimm in this case), here is the sequence creating the device internally:
qdev-monitor.c: qdev_device_add()
-> object_new(driver)
object.c: object_new()
-> object_new_with_type()
-> object_initialize_with_type()
object.c: object_initialize_with_type()
-> type_initialize()
-> class_init()
which calls into the callback provided by the specific driver. In this case, pc_dimm_class_init() from hw/mem/pc-dimm.c is the one that matters.
object.c: object_initialize_with_type()
-> object_init_with_type()
-> instance_init()
which calls into the callback provided by the specific driver. In this case, pc_dimm_init() from hw/mem/pc-dimm.c is the one that matters.
Because every object in NEMU is tied to others through parent/child relationships, the memory device being added here has TYPE_DEVICE as a parent. Therefore, this memory is also considered a device, and because the parent has to be initialized too, the callback device_initfn() (from hw/core/qdev.c) gets called whenever instance_init() is invoked.
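The way parent types get initialized before the type itself can be pictured with a miniature type tree. This is only a sketch of the idea under simplifying assumptions: struct type_info, init_with_type() and the three init functions are hypothetical, and the real object model carries much more state than a log string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature of a type with an optional parent. */
struct type_info {
    const char *name;
    const struct type_info *parent;
    void (*instance_init)(char *log);
};

/* Each init function only records that it ran. */
static void object_init(char *log) { strcat(log, "object;"); }
static void device_init(char *log) { strcat(log, "device;"); }
static void dimm_init(char *log)   { strcat(log, "pc-dimm;"); }

static const struct type_info type_object = { "object", NULL, object_init };
static const struct type_info type_device = { "device", &type_object, device_init };
static const struct type_info type_dimm   = { "pc-dimm", &type_device, dimm_init };

/* Initialize the ancestors first, then the type itself. */
static void init_with_type(const struct type_info *t, char *log)
{
    if (t->parent) {
        init_with_type(t->parent, log);
    }
    if (t->instance_init) {
        t->instance_init(log);
    }
}
```

Calling init_with_type(&type_dimm, log) on an empty log runs the three functions from the root of the tree down to the leaf, which is why the device-level initialization has already happened by the time the DIMM-specific one runs.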
static const TypeInfo device_type_info = {
.name = TYPE_DEVICE,
.parent = TYPE_OBJECT,
.instance_size = sizeof(DeviceState),
.instance_init = device_initfn,
...
};
hw/core/qdev.c: device_initfn() creates an interesting boolean property called realized, for which it registers the callback device_set_realized():
static void device_initfn(Object *obj)
{
...
object_property_add_bool(obj, "realized",
device_get_realized, device_set_realized, NULL);
So whenever the property realized is set, it triggers device_set_realized(), which eventually calls into qdev_get_hotplug_handler() in order to retrieve the hotplug handler that will be used to hotplug the device:
HotplugHandler *qdev_get_hotplug_handler(DeviceState *dev)
{
HotplugHandler *hotplug_ctrl;
if (dev->parent_bus && dev->parent_bus->hotplug_handler) {
hotplug_ctrl = dev->parent_bus->hotplug_handler;
} else {
hotplug_ctrl = qdev_get_machine_hotplug_handler(dev);
}
return hotplug_ctrl;
}
This handler is directly retrieved from what has been registered by the machine type (virt machine type from hw/i386/virt/virt.c):
static void virt_machine_class_init(MachineClass *mc)
{
VirtMachineClass *vmc = VIRT_MACHINE_CLASS(mc);
HotplugHandlerClass *hc = HOTPLUG_HANDLER_CLASS(mc);
...
/* Hotplug handlers */
hc->pre_plug = virt_machine_device_pre_plug_cb;
hc->plug = virt_machine_device_plug_cb;
hc->unplug_request = virt_machine_device_unplug_request_cb;
hc->unplug = virt_machine_device_unplug_cb;
...
}
And right after retrieving this handler, it is used to call into the pre_plug and plug callbacks:
static void device_set_realized(Object *obj, bool value, Error **errp)
{
...
hotplug_ctrl = qdev_get_hotplug_handler(dev);
if (hotplug_ctrl) {
hotplug_handler_pre_plug(hotplug_ctrl, dev, &local_err);
if (local_err != NULL) {
goto fail;
}
}
if (dc->realize) {
dc->realize(dev, &local_err);
}
if (local_err != NULL) {
goto fail;
}
DEVICE_LISTENER_CALL(realize, Forward, dev);
if (hotplug_ctrl) {
hotplug_handler_plug(hotplug_ctrl, dev, &local_err);
}
...
}
Right between the pre_plug and plug callbacks, the device is realized by calling into the realize callback of this type of device. After the device has been realized, the device exists internally.
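The ordering enforced by device_set_realized() can be reduced to the sketch below, where each stage may fail and stop the sequence. The names struct hooks, set_realized() and the ok_*/bad_* helpers are hypothetical, and error reporting is simplified to a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical bundle of the three stages; each returns false on error. */
struct hooks {
    bool (*pre_plug)(char *log);
    bool (*realize)(char *log);
    bool (*plug)(char *log);
};

/* Example stages that only record that they ran. */
static bool ok_pre(char *log)      { strcat(log, "pre_plug,"); return true; }
static bool ok_realize(char *log)  { strcat(log, "realize,"); return true; }
static bool bad_realize(char *log) { strcat(log, "realize(failed),"); return false; }
static bool ok_plug(char *log)     { strcat(log, "plug,"); return true; }

/* pre_plug, then realize, then plug; the first failure aborts the rest. */
static bool set_realized(const struct hooks *h, char *log)
{
    if (h->pre_plug && !h->pre_plug(log)) {
        return false;
    }
    if (h->realize && !h->realize(log)) {
        return false;
    }
    if (h->plug && !h->plug(log)) {
        return false;
    }
    return true;
}
```

With { ok_pre, bad_realize, ok_plug } the plug stage never runs, which mirrors the goto fail paths in the excerpt above: the guest OS is not notified about a device that failed to realize.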
At this point the device has been created, and all the callbacks and handlers have been registered. Now, remember the function qdev_device_add() from qdev-monitor.c: it creates the new device, but nothing shown so far actually triggers the hotplug handler.
The missing piece comes from the last part of the function qdev_device_add():
DeviceState *qdev_device_add(QemuOpts *opts, Error **errp)
{
...
/* create device */
dev = DEVICE(object_new(driver));
...
object_property_set_bool(OBJECT(dev), true, "realized", &err);
if (err != NULL) {
dev->opts = NULL;
goto err_del_dev;
}
...
After the device and its parents have been created and properly instantiated, the property realized is set to true, triggering the entire hotplug chain and calling the hotplug callbacks registered by the machine type virt. Let's look in detail at plug, which calls into virt_machine_device_plug_cb():
static void virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
DeviceState *dev, Error **errp)
{
if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
virt_cpu_plug(hotplug_dev, dev, errp);
} else if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
virt_dimm_plug(hotplug_dev, dev, errp);
} else {
error_setg(errp, "virt: device plug for unsupported device"
" type: %s", object_get_typename(OBJECT(dev)));
}
}
Because those callbacks are generic and can be used both for CPU and memory, different functions are triggered depending on the type of device being plugged. In case of memory, the type TYPE_PC_DIMM is the one being used, which calls into virt_dimm_plug(). This function takes care of initializing the memory region needed by the memory slot. Once this is done, it calls into the plug callback defined by its ACPI implementation in hw/i386/virt/acpi.c:
static void virt_acpi_class_init(ObjectClass *class, void *data)
{
...
HotplugHandlerClass *hc = HOTPLUG_HANDLER_CLASS(class);
...
hc->plug = virt_device_plug_cb;
hc->unplug_request = virt_device_unplug_request_cb;
hc->unplug = virt_device_unplug_cb;
...
}
hw/i386/virt/acpi.c: virt_device_plug_cb()
-> acpi_memory_plug_cb()
-> acpi_send_event()
hw/acpi/acpi_interface.c: acpi_send_event()
-> send_event()
where it refers to another callback that has been registered earlier with the machine type and its ACPI implementation:
static void virt_acpi_class_init(ObjectClass *class, void *data)
{
...
AcpiDeviceIfClass *adevc = ACPI_DEVICE_IF_CLASS(class);
...
adevc->send_event = virt_send_ged;
...
}
Eventually, this is going to call the function virt_send_ged(), responsible for sending an interrupt to the guest OS using GED events defined in the DSDT table with the GED object:
static void virt_send_ged(AcpiDeviceIf *adev, AcpiEventStatusBits ev)
{
VirtAcpiState *s = VIRT_ACPI(adev);
if (ev & ACPI_CPU_HOTPLUG_STATUS) {
/* We inject the CPU hotplug interrupt */
qemu_irq_pulse(s->gsi[VIRT_GED_CPU_HOTPLUG_IRQ]);
} else if (ev & ACPI_MEMORY_HOTPLUG_STATUS) {
/* We inject the memory hotplug interrupt */
qemu_irq_pulse(s->gsi[VIRT_GED_MEMORY_HOTPLUG_IRQ]);
} else if (ev & ACPI_NVDIMM_HOTPLUG_STATUS) {
qemu_irq_pulse(s->gsi[VIRT_GED_NVDIMM_HOTPLUG_IRQ]);
} else if (ev & ACPI_PCI_HOTPLUG_STATUS) {
/* Inject PCI hotplug interrupt */
qemu_irq_pulse(s->gsi[VIRT_GED_PCI_HOTPLUG_IRQ]);
}
}
In case of memory hotplug, the ACPI table that matters the most is the DSDT table.
The starting point here will be the definition of the GED object. GED stands for Generic Event Device, and describes all interrupts associated with event generation. When an interrupt is asserted, the guest OS will execute the event method _EVT declared in the GED object:
Device (\_SB.GED)
{
Name (_HID, "ACPI0013") // _HID: Hardware ID
Name (_UID, Zero) // _UID: Unique ID
Name (_CRS, ResourceTemplate () // _CRS: Current Resource Settings
{
Interrupt (ResourceConsumer, Level, ActiveHigh, Exclusive, ,, )
{
0x00000010,
}
Interrupt (ResourceConsumer, Level, ActiveHigh, Exclusive, ,, )
{
0x00000011,
}
Interrupt (ResourceConsumer, Level, ActiveHigh, Exclusive, ,, )
{
0x00000013,
}
Interrupt (ResourceConsumer, Level, ActiveHigh, Exclusive, ,, )
{
0x00000012,
}
})
Method (_EVT, 1, Serialized) // _EVT: Event
{
Local0 = One
While ((Local0 == One))
{
Local0 = Zero
If ((Arg0 == 0x10))
{
\_SB.CPUS.CSCN ()
}
ElseIf ((Arg0 == 0x11))
{
\_SB.MHPC.MSCN ()
}
ElseIf ((Arg0 == 0x13))
{
Notify (\_SB.NVDR, 0x80) // Status Change
}
ElseIf ((Arg0 == 0x12))
{
Acquire (\_SB.PCI0.BLCK, 0xFFFF)
\_SB.PCI0.PCNT ()
Release (\_SB.PCI0.BLCK)
}
}
}
}
After the hypervisor sends the interrupt registered for the GED device, the method \_SB.MHPC.MSCN() is invoked:
Method (MSCN, 0, NotSerialized)
{
If ((MDNR == Zero))
{
Return (Zero)
}
Local0 = Zero
Acquire (MLCK, 0xFFFF)
While ((Local0 < MDNR))
{
MSEL = Local0
If ((MINS == One))
{
MTFY (Local0, One)
MINS = One
}
ElseIf ((MRMV == One))
{
MTFY (Local0, 0x03)
MRMV = One
}
Local0 += One
}
Release (MLCK)
Return (One)
}
If the number of memory slots is 0x0, this function returns 0x0 without adding or removing any memory.
Local0 is assigned with 0x0, as the index value of the current memory slot.
MLCK lock is acquired.
The code enters a loop that ends once every slot has been examined, that is, after it has checked every single slot for a pending insertion or removal. Here are the details about what's performed inside this loop:
MSEL is written with the value of Local0 (holding the memory slot index). From the hypervisor perspective, this write translates into the following code setting the value of the memory slot selector:
case 0x0: /* DIMM slot selector */
mem_st->selector = data;
Now that the selector has been set, the value of MINS is read:
case 0x14: /* pack and return is_* fields */
...
val |= mdev->is_inserting ? 2 : 0;
If MINS value is 0x1, this means the current memory slot identified by the memory slot selector needs to be inserted.
MTFY method is invoked with Local0 as Arg0 and 0x1 as Arg1.
Method (MTFY, 2, NotSerialized)
{
If ((Arg0 == Zero))
{
Notify (MP00, Arg1)
}
...
}
Note: The number of memory slots listed here is the maximum number of memory slots that can be supported by the VM. This number differs from the number of slots enabled when the VM is started.
MTFY is the method used to notify the guest OS about a device with a notification type. In this case, if the Arg0 representing the memory slot index matches one of the If statements of MTFY, Notify() is called with the memory device corresponding to the memory index as its first argument, and the notification type as its second argument. The notification type being 0x1, it translates into ACPI_NOTIFY_DEVICE_CHECK. Here is an example of the memory device referred to by this notification:
Device (MP00)
{
Name (_UID, "0x00") // _UID: Unique ID
Name (_HID, EisaId ("PNP0C80") /* Memory Device */) // _HID: Hardware ID
Method (_CRS, 0, NotSerialized) // _CRS: Current Resource Settings
{
Return (MCRS (_UID))
}
Method (_STA, 0, NotSerialized) // _STA: Status
{
Return (MRST (_UID))
}
Method (_PXM, 0, NotSerialized) // _PXM: Device Proximity
{
Return (MPXM (_UID))
}
Method (_OST, 3, NotSerialized) // _OST: OSPM Status Indication
{
MOST (_UID, Arg0, Arg1, Arg2)
}
Method (_EJ0, 1, NotSerialized) // _EJx: Eject Device
{
MEJ0 (_UID, Arg0)
}
}
The guest OS receives this notification through the handler registered previously (detailed through guest OS flow).
MINS is written with 0x1 to clear the is_inserting flag, since the guest OS has now been notified about the insertion.
case 0x14: /* set is_* fields */
mdev = &mem_st->devs[mem_st->selector];
if (data & 2) { /* clear insert event */
mdev->is_inserting = false;
Here the selector is still the same value as before, pointing to the same memory slot index. In case MINS was read but the memory didn’t need to be inserted, the code will check if the memory needs to be removed. The value of MRMV is being read:
case 0x14: /* pack and return is_* fields */
...
val |= mdev->is_removing ? 4 : 0;
If MRMV value is 0x1, this means the current memory slot identified by the memory slot selector needs to be removed.
MTFY method is invoked with Local0 as Arg0 and 0x3 as Arg1. In this case, if the Arg0 representing the memory slot index matches one of the If statements of MTFY, Notify() is called with the memory device corresponding to the memory index as its first argument, and the notification type as its second argument. The notification type being 0x3, it translates into ACPI_NOTIFY_EJECT_REQUEST.
The guest OS receives this notification through the handler registered previously (detailed through guest OS flow).
MRMV is written with 0x1 to clear the is_removing flag, since the eject request has now been delivered to the guest OS.
case 0x14: /* set is_* fields */
mdev = &mem_st->devs[mem_st->selector];
if (data & 2) { /* clear insert event */
...
} else if (data & 4) {
mdev->is_removing = false;
Local0 is incremented, this way it represents the next memory slot. If the slot index goes over the number of existing slots, the code exits the loop.
MLCK lock is released.
MSCN() returns 0x1, which is curious since no one uses the returned value.
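Seen from a distance, MSCN is a plain scan over all slots. The following C sketch mirrors its control flow under simplifying assumptions: the I/O accesses are replaced by direct accesses to a hypothetical struct slot_status array, the lock is omitted, and the Notify() call is replaced by recording the notification code in an output array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSLOTS 3 /* plays the role of MDNR */

enum { NOTIFY_DEVICE_CHECK = 1, NOTIFY_EJECT_REQUEST = 3 };

/* Hypothetical per-slot state, as tracked by the hypervisor. */
struct slot_status { bool is_inserting; bool is_removing; };

/*
 * Model of the MSCN loop. notified[i] receives the notification code
 * sent for slot i, or 0 if none. Returns the number of notifications.
 */
static int scan_slots(struct slot_status slots[NSLOTS], int notified[NSLOTS])
{
    int count = 0;
    uint32_t i;

    for (i = 0; i < NSLOTS; i++) {                /* MSEL = Local0 */
        notified[i] = 0;
        if (slots[i].is_inserting) {              /* MINS == One */
            notified[i] = NOTIFY_DEVICE_CHECK;    /* MTFY (Local0, One) */
            slots[i].is_inserting = false;        /* MINS = One */
            count++;
        } else if (slots[i].is_removing) {        /* MRMV == One */
            notified[i] = NOTIFY_EJECT_REQUEST;   /* MTFY (Local0, 0x03) */
            slots[i].is_removing = false;         /* MRMV = One */
            count++;
        }
    }
    return count;
}
```

Since the pending flags are cleared as part of the scan, running it a second time without any new hotplug request sends no notification at all.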
Note: memory hotplug AML code generation in NEMU is part of hw/acpi/memory_hotplug.c. It should probably be moved to hw/acpi/aml-build.c.
All the code mentioned in this section belongs to the implementation of the different ACPI drivers coming from the Linux kernel sources under drivers/acpi.
Even before any hotplug event is actually being triggered, the Linux kernel will parse every single ACPI table, including the one we're interested in here, the DSDT table. Part of this table is the description of two generic devices taking care of memory hotplug:
Device (\_SB.MHPD)
{
Name (_HID, "PNP0A06" /* Generic Container Device */) // _HID: Hardware ID
Name (_UID, "Memory hotplug resources") // _UID: Unique ID
...
}
Device (\_SB.MHPC)
{
Name (_HID, "PNP0A06" /* Generic Container Device */) // _HID: Hardware ID
Name (_UID, "DIMM devices") // _UID: Unique ID
...
}
Note: Why does \_SB.MHPC depend on \_SB.MHPD? Wouldn't it be easier to define everything under the same device?
Because of this definition, the ACPI driver will initialize the corresponding generic driver. This driver, among other things, will register the memory handler that will be called anytime a memory slot is being added or removed:
bus.c: acpi_init()
--> acpi_scan_init()
scan.c: acpi_scan_init()
--> acpi_memory_hotplug_init()
acpi_memhotplug.c: acpi_memory_hotplug_init() registers the memory hotplug handler through acpi_scan_add_handler(), passing the structure memory_device_handler:
void __init acpi_memory_hotplug_init(void)
{
if (acpi_no_memhotplug) {
memory_device_handler.attach = NULL;
acpi_scan_add_handler(&memory_device_handler);
return;
}
acpi_scan_add_handler_with_hotplug(&memory_device_handler, "memory");
}
The structure memory_device_handler defines two handlers, attach and detach, for memory insertion and removal:
static struct acpi_scan_handler memory_device_handler = {
.ids = memory_device_ids,
.attach = acpi_memory_device_add,
.detach = acpi_memory_device_remove,
.hotplug = {
.enabled = true,
},
};
Prior to that, acpi_init() also initialized a generic notification handler by calling acpi_bus_init().
bus.c: acpi_bus_init()
--> acpi_install_notify_handler()
by providing acpi_bus_notify() as the generic callback to be triggered in case of a notification.
Whenever the guest OS receives a notification following the execution of a Notify() method from the DSDT, the following flow applies:
Because the memory hotplug device is a generic device, it will trigger the handling of the notification through the generic handler acpi_bus_notify():
bus.c: acpi_bus_notify()
--> acpi_hotplug_schedule()
osl.c: acpi_hotplug_schedule()
--> acpi_hotplug_work_fn()
scan.c: acpi_hotplug_work_fn()
--> acpi_device_hotplug()
--> acpi_generic_hotplug_event()
acpi_generic_hotplug_event()
gets invoked with the type of notification passed as argument.
In case of memory insertion (ACPI_NOTIFY_DEVICE_CHECK type):
case ACPI_NOTIFY_DEVICE_CHECK:
return acpi_scan_device_check(adev);
scan.c: acpi_scan_device_check()
--> acpi_bus_scan()
--> acpi_bus_attach()
--> acpi_scan_attach_handler()
--> handler->attach(device, devid)
As described by the flow above, the notification ends up calling into the attach() handler of every scan handler matching the current device. For the memory_device_handler previously registered, the following callback is going to be invoked:
static int acpi_memory_device_add(struct acpi_device *device,
const struct acpi_device_id *not_used)
This function will create a new memory device for the guest OS. In order to provision this device with the appropriate information, it will walk through _CRS resources of this memory device:
Method (_CRS, 0, NotSerialized) // _CRS: Current Resource Settings
{
Return (MCRS (_UID))
}
By evaluating _CRS method defined by any memory device in the DSDT, the code will end up executing the internally defined method MCRS with the _UID of the current device as Arg0:
Method (MCRS, 1, Serialized)
{
Acquire (MLCK, 0xFFFF)
MSEL = ToInteger (Arg0)
Name (MR64, ResourceTemplate ()
{
QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, Cacheable, ReadWrite,
0x0000000000000000, // Granularity
0x0000000000000000, // Range Minimum
0xFFFFFFFFFFFFFFFE, // Range Maximum
0x0000000000000000, // Translation Offset
0xFFFFFFFFFFFFFFFF, // Length
,, _Y00, AddressRangeMemory, TypeStatic)
})
CreateDWordField (MR64, \_SB.MHPC.MCRS._Y00._MIN, MINL) // _MIN: Minimum Base Address
CreateDWordField (MR64, 0x12, MINH)
CreateDWordField (MR64, \_SB.MHPC.MCRS._Y00._LEN, LENL) // _LEN: Length
CreateDWordField (MR64, 0x2A, LENH)
CreateDWordField (MR64, \_SB.MHPC.MCRS._Y00._MAX, MAXL) // _MAX: Maximum Base Address
CreateDWordField (MR64, 0x1A, MAXH)
MINH = MRBH /* \_SB_.MHPC.MRBH */
MINL = MRBL /* \_SB_.MHPC.MRBL */
LENH = MRLH /* \_SB_.MHPC.MRLH */
LENL = MRLL /* \_SB_.MHPC.MRLL */
MAXL = (MINL + LENL) /* \_SB_.MHPC.MCRS.LENL */
MAXH = (MINH + LENH) /* \_SB_.MHPC.MCRS.LENH */
If ((MAXL < MINL))
{
MAXH += One
}
If ((MAXL < One))
{
MAXH -= One
}
MAXL -= One
If ((MAXH == Zero))
{
Name (MR32, ResourceTemplate ()
{
DWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, Cacheable, ReadWrite,
0x00000000, // Granularity
0x00000000, // Range Minimum
0xFFFFFFFE, // Range Maximum
0x00000000, // Translation Offset
0xFFFFFFFF, // Length
,, _Y01, AddressRangeMemory, TypeStatic)
})
CreateDWordField (MR32, \_SB.MHPC.MCRS._Y01._MIN, MIN) // _MIN: Minimum Base Address
CreateDWordField (MR32, \_SB.MHPC.MCRS._Y01._MAX, MAX) // _MAX: Maximum Base Address
CreateDWordField (MR32, \_SB.MHPC.MCRS._Y01._LEN, LEN) // _LEN: Length
MIN = MINL /* \_SB_.MHPC.MCRS.MINL */
MAX = MAXL /* \_SB_.MHPC.MCRS.MAXL */
LEN = LENL /* \_SB_.MHPC.MCRS.LENL */
Release (MLCK)
Return (MR32) /* \_SB_.MHPC.MCRS.MR32 */
}
Release (MLCK)
Return (MR64) /* \_SB_.MHPC.MCRS.MR64 */
}
MLCK lock is acquired.
MSEL is written with the Arg0 value, which is the device _UID and also matches the slot index. This way, whenever a read occurs, the memory slot selector points to the right memory slot.
A 64-bit memory resource MR64 is defined. This is the description of the resources for the current device, and it is the one that will eventually be returned to the guest OS. It is set with extreme values by default (0x0000000000000000 for the minimum base address, 0xFFFFFFFFFFFFFFFE for the maximum base address, and 0xFFFFFFFFFFFFFFFF as the region size). It defines the descriptor name _Y00 in order to be able to access the fields (_MIN, _LEN, _MAX) of the QWordMemory descriptor:
Name (MR64, ResourceTemplate ()
{
QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, Cacheable, ReadWrite,
0x0000000000000000, // Granularity
0x0000000000000000, // Range Minimum
0xFFFFFFFFFFFFFFFE, // Range Maximum
0x0000000000000000, // Translation Offset
0xFFFFFFFFFFFFFFFF, // Length
,, _Y00, AddressRangeMemory, TypeStatic)
})
Several fields are defined to access and update this memory resource:
- MINL: Minimum Low Base Address. Set with the value read from MRBL, corresponding to the lower 32 bits of PC_DIMM_ADDR_PROP. This property is set by the hypervisor when the PC-DIMM device is created.
- MINH: Minimum High Base Address. Set with the value read from MRBH, corresponding to the upper 32 bits of PC_DIMM_ADDR_PROP. This property is set by the hypervisor when the PC-DIMM device is created.
- LENL: Memory Low Size. Set with the value read from MRLL, corresponding to the lower 32 bits of PC_DIMM_SIZE_PROP. This property is read from the hypervisor, from the memory region representing the device.
- LENH: Memory High Size. Set with the value read from MRLH, corresponding to the upper 32 bits of PC_DIMM_SIZE_PROP. This property is read from the hypervisor, from the memory region representing the device.
- MAXL: Maximum Low Base Address. Sum of MINL and LENL.
- MAXH: Maximum High Base Address. Sum of MINH and LENH.
MAXH is adjusted regarding the carried number, according to the value of MAXL compared to MINL.
And MAXL is subtracted by 0x1
to be assigned with the correct value.
If the value of MAXH is 0x0, this means the resource fits in a 32-bit memory region, and instead of returning the MR64 resource, the code builds an MR32 resource:
Name (MR32, ResourceTemplate ()
{
DWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, Cacheable, ReadWrite,
0x00000000, // Granularity
0x00000000, // Range Minimum
0xFFFFFFFE, // Range Maximum
0x00000000, // Translation Offset
0xFFFFFFFF, // Length
,, _Y01, AddressRangeMemory, TypeStatic)
})
Several fields are defined to access and update this memory resource MR32:
- MIN: Minimum Base Address. Set with MINL value.
- LEN: Memory Region Size. Set with LENL value.
- MAX: Maximum Base Address. Set with MAXL value.
MLCK lock is released.
MR32 resource description is returned as the result of the MCRS method.
In case MAXH is different from 0x0, this means the resource needs a 64-bit memory region to be described.
MLCK lock is released.
MR64 resource is returned as the result of the MCRS method.
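The range computation performed by MCRS can be sketched in plain C. This is a minimal illustration, not the code NEMU generates: the `mem_range` structure and both function names are hypothetical, and the borrow applied before the final decrement is an assumption added here so that the 32-bit arithmetic stays correct when MAXL wraps to zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the values MCRS derives; not a NEMU or kernel type. */
struct mem_range {
    uint32_t minl, minh;   /* base address halves, as read from MRBL/MRBH */
    uint32_t lenl, lenh;   /* size halves, as read from MRLL/MRLH */
    uint32_t maxl, maxh;   /* halves of the last address inside the region */
};

/* Compute MAXL/MAXH from the base and size halves using 32-bit arithmetic only. */
static struct mem_range compute_range(uint32_t minl, uint32_t minh,
                                      uint32_t lenl, uint32_t lenh)
{
    struct mem_range r = { minl, minh, lenl, lenh, 0, 0 };

    assert(lenl != 0 || lenh != 0);   /* an empty region has no last address */

    r.maxl = minl + lenl;             /* wraps modulo 2^32 on overflow */
    r.maxh = minh + lenh;
    if (r.maxl < minl)                /* low half wrapped: carry into the high half */
        r.maxh += 1;
    if (r.maxl == 0)                  /* assumption: borrow needed by the decrement */
        r.maxh -= 1;
    r.maxl -= 1;                      /* make the maximum inclusive */
    return r;
}

/* True when the range can be described by the 32-bit MR32 resource. */
static bool fits_in_32_bits(const struct mem_range *r)
{
    return r->maxh == 0;
}
```

For a DIMM based at 0x100000000 with a size of 0x40000000, the sketch yields MAXH = 0x1 and MAXL = 0x3FFFFFFF, so the 64-bit description is needed. A DIMM based at 0xC0000000 with the same size ends at 0xFFFFFFFF and fits the 32-bit description.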
Once the memory device resources have been retrieved through _CRS, the device method _STA is evaluated by the ACPI driver. The driver checks the status of the memory device, and in particular it expects the ACPI_STA_DEVICE_PRESENT, ACPI_STA_DEVICE_ENABLED and ACPI_STA_DEVICE_FUNCTIONING flags to be set.
Method (_STA, 0, NotSerialized) // _STA: Status
{
Return (MRST (_UID))
}
_STA invokes the internally defined function MRST with _UID as Arg0:
Method (MRST, 1, NotSerialized)
{
Local0 = Zero
Acquire (MLCK, 0xFFFF)
MSEL = ToInteger (Arg0)
If ((MES == One))
{
Local0 = 0x0F
}
Release (MLCK)
Return (Local0)
}
Local0 is set to 0x0, meaning the device is reported as not enabled by default.
MLCK lock is acquired.
MSEL is written with the Arg0 value, which is the _UID of the device, meaning the code is setting the memory slot selector to the right memory slot.
MES value is read from the offset 0x14 of the base address 0x0A00, which is assigned with the value of the NEMU flag is_enabled from the current memory device:
case 0x14: /* pack and return is_* fields */
val |= mdev->is_enabled ? 1 : 0;
If MES value is 0x1, this means the device is enabled, and the Local0 variable is assigned with 0xF, which means all the following flags are set:
include/acpi/actypes.h
/* Flags for _STA method */
#define ACPI_STA_DEVICE_PRESENT 0x01
#define ACPI_STA_DEVICE_ENABLED 0x02
#define ACPI_STA_DEVICE_UI 0x04
#define ACPI_STA_DEVICE_FUNCTIONING 0x08
#define ACPI_STA_DEVICE_OK 0x08 /* Synonym */
MLCK lock is released.
Local0 value is returned, which means the device is either enabled (0xF) or disabled (0x0).
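The status logic above is small enough to model directly. The following is a sketch under assumptions: `mrst_status`, `memory_device_usable` and `sta_example` are hypothetical names, and the model only reproduces the MES-to-status mapping described in this section, not the locking or the I/O access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as listed from include/acpi/actypes.h in this document. */
#define ACPI_STA_DEVICE_PRESENT      0x01
#define ACPI_STA_DEVICE_ENABLED      0x02
#define ACPI_STA_DEVICE_UI           0x04
#define ACPI_STA_DEVICE_FUNCTIONING  0x08

/* Hypothetical model of MRST: the MES bit is the only input. */
static uint32_t mrst_status(bool mes)
{
    uint32_t status = 0;          /* Local0 = Zero */

    if (mes)                      /* If ((MES == One)) */
        status = 0x0F;            /* Local0 = 0x0F */
    return status;
}

/* Hypothetical helper: checks the three flags the driver is said to expect. */
static bool memory_device_usable(uint32_t sta)
{
    const uint32_t required = ACPI_STA_DEVICE_PRESENT |
                              ACPI_STA_DEVICE_ENABLED |
                              ACPI_STA_DEVICE_FUNCTIONING;

    return (sta & required) == required;
}

/* Usage example: 0x0F is exactly the four flags OR-ed together. */
static void sta_example(void)
{
    assert(mrst_status(true) == (ACPI_STA_DEVICE_PRESENT |
                                 ACPI_STA_DEVICE_ENABLED |
                                 ACPI_STA_DEVICE_UI |
                                 ACPI_STA_DEVICE_FUNCTIONING));
    assert(memory_device_usable(mrst_status(true)));
    assert(!memory_device_usable(mrst_status(false)));
}
```

Returning a single constant keeps the ASL trivial: the hypervisor exposes one bit, and the method expands it into the full set of flags the guest OS checks.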
At this point, the guest OS knows about the device resources and it also knows if the device is ready to be used. As the last step, it enables the memory device so that it can be used by any process inside the guest.
In case of memory removal (ACPI_NOTIFY_EJECT_REQUEST type):
case ACPI_NOTIFY_EJECT_REQUEST:
if (adev->handler && !adev->handler->hotplug.enabled) {
dev_info(&adev->dev, "Eject disabled\n");
return -EPERM;
}
acpi_evaluate_ost(adev->handle, ACPI_NOTIFY_EJECT_REQUEST,
ACPI_OST_SC_EJECT_IN_PROGRESS, NULL);
return acpi_scan_hot_remove(adev);
The call to acpi_evaluate_ost() is the first important step. The guest OS sends a message to the hypervisor by evaluating the _OST method of the memory device. The message notifies the hypervisor that the ejection of a specific memory slot is in progress.
Device (MP00)
{
Name (_UID, "0x00") // _UID: Unique ID
...
Method (_OST, 3, NotSerialized) // _OST: OSPM Status Indication
{
MOST (_UID, Arg0, Arg1, Arg2)
}
...
This method calls into a DSDT internal function MOST, which takes the memory slot index as Arg0, followed by all the parameters received by _OST:
Method (MOST, 4, NotSerialized)
{
Acquire (MLCK, 0xFFFF)
MSEL = ToInteger (Arg0)
MOEV = Arg1
MOSC = Arg2
Release (MLCK)
}
MLCK lock is acquired.
MSEL is written with Arg0 (0x0 in the example of device MP00) to set the memory slot selector.
MOEV is written with Arg1 to pass down to the hypervisor the OST event from the guest OS:
case 0x4: /* _OST event */
mdev = &mem_st->devs[mem_st->selector];
...
mdev->ost_event = data;
MOSC is written with Arg2 to pass down to the hypervisor the OST status from the guest OS (ACPI_OST_SC_EJECT_IN_PROGRESS in this case):
case 0x8: /* _OST status */
mdev = &mem_st->devs[mem_st->selector];
mdev->ost_status = data;
trace_mhp_acpi_write_ost_status(mem_st->selector, mdev->ost_status);
/* TODO: implement memory removal on guest signal */
info = acpi_memory_device_status(mem_st->selector, mdev);
qapi_event_send_acpi_device_ost(info, &error_abort);
qapi_free_ACPIOSTInfo(info);
break;
MLCK lock is released.
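The three writes performed by MOST can be modelled as follows. This is a sketch, assuming a hypothetical `mem_hotplug_model` structure with four slots. The register offsets come from the handler excerpts in this document, while the two numeric constants are believed to match include/acpi/actypes.h and should be verified against the kernel tree in use.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SLOTS 4            /* hypothetical slot count for this sketch */

/* Register offsets taken from the handler excerpts in this document. */
#define REG_SELECTOR   0x0
#define REG_OST_EVENT  0x4
#define REG_OST_STATUS 0x8

/* Assumed values; check include/acpi/actypes.h before relying on them. */
#define ACPI_NOTIFY_EJECT_REQUEST      0x03
#define ACPI_OST_SC_EJECT_IN_PROGRESS  0x84

struct slot_model {
    uint32_t ost_event;
    uint32_t ost_status;
};

struct mem_hotplug_model {
    uint32_t selector;
    struct slot_model devs[NUM_SLOTS];
};

/* Hypervisor side: one trapped write to the hotplug I/O region. */
static void model_write(struct mem_hotplug_model *st, uint32_t addr, uint32_t data)
{
    switch (addr) {
    case REG_SELECTOR:
        assert(data < NUM_SLOTS);
        st->selector = data;
        break;
    case REG_OST_EVENT:
        st->devs[st->selector].ost_event = data;
        break;
    case REG_OST_STATUS:
        st->devs[st->selector].ost_status = data;
        break;
    default:
        break;                 /* other registers are out of scope here */
    }
}

/* Guest side: the writes MOST performs, in order (MLCK locking omitted). */
static void most(struct mem_hotplug_model *st, uint32_t uid,
                 uint32_t event, uint32_t status)
{
    model_write(st, REG_SELECTOR, uid);       /* MSEL = ToInteger (Arg0) */
    model_write(st, REG_OST_EVENT, event);    /* MOEV = Arg1 */
    model_write(st, REG_OST_STATUS, status);  /* MOSC = Arg2 */
}
```

Because the event and status registers are shared by all slots and routed through the selector, the three writes must not be interleaved with another method's accesses, which is why the ASL holds MLCK around them.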
Here is the second important step, the actual ejection of the memory device.
scan.c: acpi_scan_hot_remove()
--> acpi_bus_trim()
--> handler->detach(adev)
As described by the flow above, the notification ends up calling into every detach() handler registered for the current device. In the case of the memory_device_handler previously registered, the following callback is invoked:
static void acpi_memory_device_remove(struct acpi_device *device)
{
struct acpi_memory_device *mem_device;
if (!device || !acpi_driver_data(device))
return;
mem_device = acpi_driver_data(device);
acpi_memory_remove_memory(mem_device);
acpi_memory_device_free(mem_device);
}
This function will take care of removing the memory device from the guest OS by unbinding the memory blocks and by freeing the related structures.
Now, getting back to acpi_scan_hot_remove(), when acpi_bus_trim() returns, a few ACPI methods are invoked before the removal is completed.
The guest OS will try to evaluate the _LCK method if present. In our case, NEMU does not define such a method, meaning the guest OS has no way to lock or unlock the device.
Then, _EJ0 method is evaluated:
acpi_status acpi_evaluate_ej0(acpi_handle handle)
{
acpi_status status;
status = acpi_execute_simple_method(handle, "_EJ0", 1);
if (status == AE_NOT_FOUND)
acpi_handle_warn(handle, "No _EJ0 support for device\n");
else if (ACPI_FAILURE(status))
acpi_handle_warn(handle, "Eject failed (0x%x)\n", status);
return status;
}
Device (MP00)
{
Name (_UID, "0x00") // _UID: Unique ID
...
Method (_EJ0, 1, NotSerialized) // _EJx: Eject Device
{
MEJ0 (_UID, Arg0)
}
}
_EJ0 calls into the internal MEJ0 method:
Method (MEJ0, 2, NotSerialized)
{
Acquire (MLCK, 0xFFFF)
MSEL = ToInteger (Arg0)
MEJ = One
Release (MLCK)
}
MLCK lock is acquired.
MSEL is written with the _UID value corresponding to the memory slot index. This updates the selector value on the hypervisor side:
case 0x0: /* DIMM slot selector */
mem_st->selector = data;
MEJ is written with the value 0x1. This means the hypervisor will complete the removal of this memory device by calling into the appropriate unplug handler, using hotplug_handler_unplug():
case 0x14: /* set is_* fields */
mdev = &mem_st->devs[mem_st->selector];
if (data & 2) { /* clear insert event */
...
} else if (data & 8) {
...
dev = DEVICE(mdev->dimm);
hotplug_ctrl = qdev_get_hotplug_handler(dev);
/* call pc-dimm unplug cb */
hotplug_handler_unplug(hotplug_ctrl, dev, &local_err);
...
}
break;
MLCK lock is released.
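The interaction between MEJ0 and a later status read can be sketched with a small model. Several points are assumptions: the structure and function names are hypothetical, the bit masks are inferred from the `data & 2` and `data & 8` tests in the handler excerpt (which suggest that setting the one-bit MEJ field reaches the hypervisor as the value 0x8), and clearing `is_enabled` stands in for whatever the real unplug handler does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_SLOTS    4      /* hypothetical slot count for this sketch */
#define REG_SELECTOR 0x0
#define REG_FLAGS    0x14

/* Masks for the byte at offset 0x14, inferred from the excerpts. */
#define FLAG_ENABLED      0x1   /* read: is_enabled */
#define FLAG_CLEAR_INSERT 0x2   /* write: clear insert event */
#define FLAG_EJECT        0x8   /* write: eject the selected slot */

struct slot_model {
    bool is_enabled;
    bool is_inserting;
};

struct mem_hotplug_model {
    uint32_t selector;
    struct slot_model devs[NUM_SLOTS];
};

/* Hypervisor side: one trapped write to the hotplug I/O region. */
static void model_write(struct mem_hotplug_model *st, uint32_t addr, uint32_t data)
{
    struct slot_model *mdev;

    switch (addr) {
    case REG_SELECTOR:
        assert(data < NUM_SLOTS);
        st->selector = data;
        break;
    case REG_FLAGS:
        mdev = &st->devs[st->selector];
        if (data & FLAG_CLEAR_INSERT) {
            mdev->is_inserting = false;
        } else if (data & FLAG_EJECT) {
            /* Stand-in for hotplug_handler_unplug(): slot no longer enabled. */
            mdev->is_enabled = false;
        }
        break;
    default:
        break;
    }
}

/* Hypervisor side: read of offset 0x14 for the selected slot. */
static uint32_t model_read_flags(const struct mem_hotplug_model *st)
{
    return st->devs[st->selector].is_enabled ? FLAG_ENABLED : 0;
}

/* Guest side: MEJ0 selects the slot, then sets the eject bit. */
static void mej0(struct mem_hotplug_model *st, uint32_t uid)
{
    model_write(st, REG_SELECTOR, uid);
    model_write(st, REG_FLAGS, FLAG_EJECT);
}

/* Guest side: MRST selects the slot, then expands the enabled bit to 0x0F. */
static uint32_t mrst(struct mem_hotplug_model *st, uint32_t uid)
{
    model_write(st, REG_SELECTOR, uid);
    return (model_read_flags(st) & FLAG_ENABLED) ? 0x0F : 0x00;
}
```

In this model, calling `mej0()` on an enabled slot makes a following `mrst()` on the same slot return 0x0, while the other slots keep reporting 0xF.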
Once _EJ0 successfully returns, the guest OS needs to check the device status. It evaluates the _STA method associated with the current memory device. This check verifies the state of the memory device and logs a warning if it is not in the expected state:
static int acpi_scan_hot_remove(struct acpi_device *device)
{
...
/*
* Verify if eject was indeed successful. If not, log an error
* message. No need to call _OST since _EJ0 call was made OK.
*/
status = acpi_evaluate_integer(handle, "_STA", NULL, &sta);
if (ACPI_FAILURE(status)) {
acpi_handle_warn(handle,
"Status check after eject failed (0x%x)\n", status);
} else if (sta & ACPI_STA_DEVICE_ENABLED) {
acpi_handle_warn(handle,
"Eject incomplete - status 0x%llx\n", sta);
}
...
}
Device (MP00)
{
Name (_UID, "0x00") // _UID: Unique ID
...
Method (_STA, 0, NotSerialized) // _STA: Status
{
Return (MRST (_UID))
}
...
_STA calls into the internal MRST method with _UID as the Arg0, the same way it's explained in the previous section. In this case, because we're talking about the removal of the memory, the expected status is for the memory to be disabled at this point.
Once the insertion or the removal of a memory slot is done, the status is sent from the guest OS to the hypervisor. The error value returned by the hotplug action is translated into an OST status type:
void acpi_device_hotplug(struct acpi_device *adev, u32 src)
{
...
switch (error) {
case 0:
ost_code = ACPI_OST_SC_SUCCESS;
break;
case -EPERM:
ost_code = ACPI_OST_SC_EJECT_NOT_SUPPORTED;
break;
case -EBUSY:
ost_code = ACPI_OST_SC_DEVICE_BUSY;
break;
default:
ost_code = ACPI_OST_SC_NON_SPECIFIC_FAILURE;
break;
}
err_out:
acpi_evaluate_ost(adev->handle, src, ost_code, NULL);
...
}
As in the eject-in-progress case described earlier, the ACPI method _OST invokes the internal MOST method in order to notify the hypervisor about the status of the hotplug action.
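To summarize the removal path, the sequence of _OST status codes the hypervisor should observe can be sketched as below. This is an illustration only: `ost_log`, `report_ost` and `hot_remove` are hypothetical names, `remove_result` stands for the value acpi_scan_hot_remove() would return, and the numeric constants are believed to match include/acpi/actypes.h but should be verified.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values; check include/acpi/actypes.h before relying on them. */
#define ACPI_OST_SC_SUCCESS               0x00
#define ACPI_OST_SC_NON_SPECIFIC_FAILURE  0x01
#define ACPI_OST_SC_EJECT_NOT_SUPPORTED   0x80
#define ACPI_OST_SC_DEVICE_BUSY           0x82
#define ACPI_OST_SC_EJECT_IN_PROGRESS     0x84

#define MAX_REPORTS 4

/* Hypothetical record of the status codes written through _OST. */
struct ost_log {
    uint32_t codes[MAX_REPORTS];
    int count;
};

static void report_ost(struct ost_log *log, uint32_t code)
{
    assert(log->count < MAX_REPORTS);
    log->codes[log->count++] = code;
}

/* Same mapping as the switch statement in acpi_device_hotplug(). */
static uint32_t ost_code_from_error(int error)
{
    switch (error) {
    case 0:
        return ACPI_OST_SC_SUCCESS;
    case -EPERM:
        return ACPI_OST_SC_EJECT_NOT_SUPPORTED;
    case -EBUSY:
        return ACPI_OST_SC_DEVICE_BUSY;
    default:
        return ACPI_OST_SC_NON_SPECIFIC_FAILURE;
    }
}

/* Model of an eject request: optional in-progress report, then the final one. */
static void hot_remove(struct ost_log *log, bool eject_enabled, int remove_result)
{
    int error;

    if (!eject_enabled) {
        error = -EPERM;        /* "Eject disabled": no in-progress report */
    } else {
        report_ost(log, ACPI_OST_SC_EJECT_IN_PROGRESS);
        error = remove_result; /* outcome of detach, _EJ0 and the _STA check */
    }
    report_ost(log, ost_code_from_error(error));
}
```

Under these assumptions, a successful removal produces the codes 0x84 then 0x0, a busy device produces 0x84 then 0x82, and a handler with hotplug disabled produces only 0x80.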