ACPI PCI discovery hotplug
This document attempts to call out the elements and mechanisms involved in the discovery as well as hotplug of PCI devices including
- QEMU framework
- ACPI Tables and Methods
- Linux Kernel functions and tables
The focus of this document is ACPI based PCI Hotplug as it is used by the virt platform.
The virt platform chose ACPI based hotplug so that the user is not forced to create PCIe hierarchies just to support hotplug. It also eliminates the need to model a PCIe controller that supports hotplug. The virt platform already uses ACPI based hotplug for CPU and memory, so using it for PCI results in no additional code. It also leads to a flat PCI tree, which is easier to understand.
Having an ACPI based implementation allows OS behaviour to be modified in the future without requiring changes to the OS kernel. This may give us flexibility for further optimizations.
The alternative, native PCIe hotplug, emulates the full PCIe model, including the PCIe/SHPC controller. The hotplug of a device is performed through the emulated SHPC/PCIe controller, and the guest OS is notified through the driver of that controller whenever the device topology changes. This has been implemented as part of the q35 machine type.
Note: The PCIe native hotplug capability of a particular device is discovered via the _OSC method of the host bridge on which the device resides. More details can be found in subsequent sections.
QEMU requires an object implementing hotplug to implement the interface defined by HotplugHandlerClass. These callbacks are invoked when a QEMU QMP command is issued to hotplug/unplug a device. They are universal callbacks, invoked for any type of hotplug. The handlers that implement the callback will
/**
* HotplugDeviceClass:
*
* Interface to be implemented by a device performing
* hardware (un)plug functions.
*
* @parent: Opaque parent interface.
* @pre_plug: pre plug callback called at start of device.realize(true)
* @plug: plug callback called at end of device.realize(true).
* @post_plug: post plug callback called after device.realize(true) and device
* reset
* @unplug_request: unplug request callback.
* Used as a means to initiate device unplug for devices that
* require asynchronous unplug handling.
* @unplug: unplug callback.
* Used for device removal with devices that implement
* asynchronous and synchronous (surprise) removal.
*/
typedef struct HotplugHandlerClass {
/* <private> */
InterfaceClass parent;
/* <public> */
hotplug_fn pre_plug;
hotplug_fn plug;
void (*post_plug)(HotplugHandler *plug_handler, DeviceState *plugged_dev);
hotplug_fn unplug_request;
hotplug_fn unplug;
} HotplugHandlerClass;
Note: To be filled
The virt platform includes the definition of a new type which implements the hotplug handler interface and the ACPI device interface.
static const TypeInfo virt_acpi_info = {
.name = TYPE_VIRT_ACPI,
.parent = TYPE_SYS_BUS_DEVICE,
.instance_size = sizeof(VirtAcpiState),
.class_init = virt_acpi_class_init,
.interfaces = (InterfaceInfo[]) {
{ TYPE_HOTPLUG_HANDLER },
{ TYPE_ACPI_DEVICE_IF },
{ }
}
};
typedef struct VirtAcpiState {
SysBusDevice parent_obj;
AcpiCpuHotplug cpuhp;
CPUHotplugState cpuhp_state;
MemHotplugState memhp_state;
qemu_irq *gsi;
AcpiPciHpState pcihp_state;
PCIBus *pci_bus;
MemoryRegion sleep_iomem;
MemoryRegion reset_iomem;
} VirtAcpiState;
Enhancement: The PCIBus and AcpiPciHpState today can handle a single PCI domain (root bus). This needs to be converted to a list to allow for multiple domains.
The hotplug handlers for the virt platform are implicitly defined as part of the class init.
As part of virt_acpi_class_init we register the callbacks to be invoked for hotplug events. We also implement the mechanism to notify the guest OS of the hotplug event. These callbacks are invoked when QEMU performs hotplug/unplug of a device.
static void virt_acpi_class_init(ObjectClass *class, void *data)
{
...
HotplugHandlerClass *hc = HOTPLUG_HANDLER_CLASS(class);
AcpiDeviceIfClass *adevc = ACPI_DEVICE_IF_CLASS(class);
....
hc->plug = virt_device_plug_cb;
hc->unplug_request = virt_device_unplug_request_cb;
hc->unplug = virt_device_unplug_cb;
...
adevc->send_event = virt_send_ged;
...
}
The handlers listed above, virt_device_plug_cb, virt_device_unplug_request_cb and virt_device_unplug_cb, are invoked whenever a hotplug/unplug is initiated.
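The callback-table pattern used by the hotplug handler interface can be shown in isolation. The following is a minimal standalone sketch, not QEMU code: `Handler`, `HandlerClass`, `handler_call` and the `demo_*` callbacks are hypothetical names, and the callbacks only flip flags so that the plug, unplug_request and unplug ordering is visible.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Handler Handler;
typedef void (*hotplug_fn)(Handler *h);

/* Hypothetical miniature of a hotplug handler class: a table of callbacks. */
typedef struct {
    hotplug_fn plug;
    hotplug_fn unplug_request;
    hotplug_fn unplug;
} HandlerClass;

struct Handler {
    const HandlerClass *klass;
    int present;        /* 1 while a device is plugged */
    int unplug_pending; /* 1 between unplug_request and unplug */
};

void demo_plug(Handler *h) { h->present = 1; }
void demo_unplug_request(Handler *h) { h->unplug_pending = 1; }
void demo_unplug(Handler *h) { h->present = 0; h->unplug_pending = 0; }

/* Filled once per type, like the assignments made in a class_init function. */
const HandlerClass demo_class = {
    .plug = demo_plug,
    .unplug_request = demo_unplug_request,
    .unplug = demo_unplug,
};

/* Generic caller: it only knows the interface, not the implementation. */
void handler_call(Handler *h, hotplug_fn fn)
{
    assert(h != NULL && fn != NULL);
    fn(h);
}
```

Because the table belongs to the class rather than the instance, every object of the type shares the same callbacks, which is the property the open question below is asking about.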
Opens:
- Document when virt_device_unplug_cb is typically invoked.
- Why are we implementing hotplug callback registration as part of class init and not instance init? If we associated it with instance init, it would make it generic. It would also allow the platform code more control.
Note: We do not support surprise removal of PCI devices.
In VirtMachineState we maintain pointers to the objects that contain ACPI relevant state as well as the implementation of the hotplug interfaces:
typedef struct {
....
/* ACPI configuration */
AcpiConfiguration *acpi_configuration;
...
PCIBus *pci_bus;
/* ACPI device for hotplug and PM */
HotplugHandler *acpi_dev;
...
DeviceState *acpi;
} VirtMachineState;
The object acpi is of type virt_acpi_info, and hence implements both the hotplug and ACPI interfaces.
Enhancements:
- VirtMachineState assumes a single domain today. This should be converted to a list.
The acpi object is created in virt_machine_state_init by virt_acpi_init.
static void virt_machine_state_init(MachineState *machine)
{
...
MachineClass *mc = MACHINE_GET_CLASS(machine);
VirtMachineState *vms = VIRT_MACHINE(machine);
...
virt_ioapic_init(vms);
virt_pci_init(vms);
...
vms->acpi = virt_acpi_init(vms->gsi, vms->pci_bus);
...
object_property_add_link(OBJECT(machine), "acpi-device",
TYPE_HOTPLUG_HANDLER,
(Object **)&vms->acpi_dev,
object_property_allow_set_link,
OBJ_PROP_LINK_STRONG, &error_abort);
object_property_set_link(OBJECT(machine), OBJECT(vms->acpi),
"acpi-device", &error_abort);
…
}
OPEN: Need to understand better the relationship between acpi_dev and acpi. acpi contains the actual implementation. acpi_dev is a pointer to the hotplug interface. This seems to indicate to QEMU that the acpi object implements the TYPE_HOTPLUG_HANDLER interface.
Note: Here vms->gsi and vms->pci_bus are set up by virt_ioapic_init and virt_pci_init respectively.
static void virt_ioapic_init(VirtMachineState *vms)
{
...
vms->gsi = qemu_allocate_irqs(virt_gsi_handler, ioapic_irq, IOAPIC_NUM_PINS);
...
}
static void virt_pci_init(VirtMachineState *vms)
{
...
vms->pci_bus = pci_lite_init(get_system_memory(), get_system_io(),
pci_memory);
...
}
Note: Here vms->pci_bus points to the root bus for a given PCI domain.
The PCI bus is registered as a hotpluggable resource as part of virt_acpi_init.
DeviceState *virt_acpi_init(qemu_irq *gsi, PCIBus *pci_bus, PCIBus *pci_virt_bus)
{
DeviceState *dev;
VirtAcpiState *s;
...
dev = sysbus_create_simple(TYPE_VIRT_ACPI, -1, NULL);
...
s = VIRT_ACPI(dev);
s->gsi = gsi;
s->pci_bus = pci_bus;
...
if (pci_bus) {
/* Initialize PCI hotplug */
qbus_set_hotplug_handler(BUS(pci_bus), dev, NULL);
acpi_pcihp_init(OBJECT(s), &s->pcihp_state, s->pci_bus,
get_system_io(), true);
acpi_pcihp_reset(&s->pcihp_state);
}
...
return dev;
}
- qbus_set_hotplug_handler sets the QDEV_HOTPLUG_HANDLER_PROPERTY "hotplug-handler" for pci_bus to the acpi object that is created here.
- acpi_pcihp_init() sets up the ACPI resources used for hotplug for a particular bus.
- acpi_pcihp_reset() enables hotplug on all child buses under this bus.
acpi_pcihp_init sets up the IO registers used to communicate between QEMU and ACPI.
Note: This IO Range is used by ACPI to manage hotplug for a specific device/bus/interaction. It is not a PCI resource (like an IO BAR).
void acpi_pcihp_init(Object *owner, AcpiPciHpState *s, PCIBus *root_bus,
MemoryRegion *address_space_io, bool bridges_enabled)
{
s->io_len = ACPI_PCIHP_SIZE;
s->io_base = ACPI_PCIHP_ADDR;
s->root= root_bus;
s->legacy_piix = !bridges_enabled;
memory_region_init_io(&s->io, owner, &acpi_pcihp_io_ops, s,
"acpi-pci-hotplug", s->io_len);
memory_region_add_subregion(address_space_io, s->io_base, &s->io);
object_property_add_uint16_ptr(owner, ACPI_PCIHP_IO_BASE_PROP, &s->io_base,
&error_abort);
object_property_add_uint16_ptr(owner, ACPI_PCIHP_IO_LEN_PROP, &s->io_len,
&error_abort);
}
acpi_pcihp_io_ops sets up handlers in QEMU to respond to reads and writes to the IO registers from the OS, which typically originate from ACPI code running within the OS.
OPEN:
- virt_device_realize sets up the ACPI registers for CPU and Memory. virt_acpi_init sets up the remaining registers. It may be better for all of the initialization to be done in virt_acpi_init, as it is explicit. This allows the platform IO register setup to be done explicitly and platform driven.
Enhancement:
- The PCI hotplug register space is also set up as a property of the acpi object. This does not work when there are multiple domains.
- The register address here is hardcoded. This will be modified and parameterized to support multiple domains.
- The register space should be defined as part of the platform definition.
acpi_pcihp_io_ops defines the pci_read and pci_write callbacks invoked when QEMU detects reads and writes to the registers.
Note: The AcpiPciHpState corresponding to that region is passed down to the handlers.
static const MemoryRegionOps acpi_pcihp_io_ops = {
.read = pci_read,
.write = pci_write,
.endianness = DEVICE_LITTLE_ENDIAN,
.valid = {
.min_access_size = 4,
.max_access_size = 4,
},
};
In acpi_pcihp_init() QEMU allocates an IO register range to interact with the OS. The offsets within that register range define fields.
#define PCI_UP_BASE 0x0000
#define PCI_DOWN_BASE 0x0004
#define PCI_EJ_BASE 0x0008
#define PCI_RMV_BASE 0x000c
#define PCI_SEL_BASE 0x0010
These callbacks get the context AcpiPciHpState *s = opaque and the offset within the hotplug register space, allowing them to respond with the required information in the case of a read and to react to information coming in from the OS in the case of a write. These IO registers are read and written by ACPI methods running in the OS, which have been constructed by QEMU as part of the platform setup.
static uint64_t pci_read(void *opaque, hwaddr addr, unsigned int size)
{
AcpiPciHpState *s = opaque;
...
switch (addr) {
case PCI_UP_BASE:
val = s->acpi_pcihp_pci_status[bsel].up;
s->acpi_pcihp_pci_status[bsel].up = 0;
break;
case PCI_DOWN_BASE:
val = s->acpi_pcihp_pci_status[bsel].down;
break;
...
}
return val;
}
static void pci_write(void *opaque, hwaddr addr, uint64_t data,
unsigned int size)
{
AcpiPciHpState *s = opaque;
switch (addr) {
case PCI_EJ_BASE:
acpi_pcihp_eject_slot(s, s->hotplug_select, data);
break;
case PCI_SEL_BASE:
s->hotplug_select = s->legacy_piix ? ACPI_PCIHP_BSEL_DEFAULT : data;
...
}
}
Note: In some cases multiple reads or writes may be issued by a single serialized ACPI method, so the sequence of reads or writes can be used to infer intent.
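The register semantics described above can be modelled in standalone C. This is a minimal sketch under stated assumptions, not QEMU code: `HpRegs`, `hp_read`, `hp_write` and `MAX_BUS` are hypothetical names, the bus index is assumed to be the value last written to the select register, and the eject path is assumed to clear the slot bits in both masks. PCI_RMV_BASE is not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Register offsets, copied from the defines quoted above. */
#define PCI_UP_BASE   0x0000
#define PCI_DOWN_BASE 0x0004
#define PCI_EJ_BASE   0x0008
#define PCI_SEL_BASE  0x0010

#define MAX_BUS 4 /* hypothetical; stands in for ACPI_PCIHP_MAX_HOTPLUG_BUS */

/* Hypothetical stand-in for AcpiPciHpState: one up/down mask per bus. */
typedef struct {
    uint32_t up[MAX_BUS];
    uint32_t down[MAX_BUS];
    uint32_t select; /* bus chosen by the last write to PCI_SEL_BASE */
} HpRegs;

/* Guest read: "up" is read-to-clear, "down" stays set until eject. */
uint32_t hp_read(HpRegs *r, uint32_t addr)
{
    uint32_t val = 0;

    assert(r->select < MAX_BUS);
    switch (addr) {
    case PCI_UP_BASE:
        val = r->up[r->select];
        r->up[r->select] = 0;
        break;
    case PCI_DOWN_BASE:
        val = r->down[r->select];
        break;
    default:
        break;
    }
    return val;
}

/* Guest write: select a bus, or eject the slots named in the mask. */
void hp_write(HpRegs *r, uint32_t addr, uint32_t data)
{
    switch (addr) {
    case PCI_SEL_BASE:
        assert(data < MAX_BUS);
        r->select = data;
        break;
    case PCI_EJ_BASE:
        /* Assumption: eject completes the request by clearing the bits. */
        r->down[r->select] &= ~data;
        r->up[r->select] &= ~data;
        break;
    default:
        break;
    }
}
```

Reading the up register twice in a row returns the pending mask and then zero, while the down register keeps returning the same mask until the guest writes the matching bit to the eject register.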
QEMU generates an appropriate event whenever it needs to hotplug a device into a guest. The event types that QEMU can generate are defined as a part of TYPE_ACPI_DEVICE_IF "acpi-device-interface".
/* These values are part of guest ABI, and can not be changed */
typedef enum {
ACPI_PCI_HOTPLUG_STATUS = 2,
ACPI_CPU_HOTPLUG_STATUS = 4,
ACPI_MEMORY_HOTPLUG_STATUS = 8,
ACPI_NVDIMM_HOTPLUG_STATUS = 16,
ACPI_VMGENID_CHANGE_STATUS = 32,
} AcpiEventStatusBits;
In the case of PCI, the PCI hotplug event is generated by QEMU whenever a PCI device is hot plugged. acpi_pcihp_device_plug_cb sets the bit in the IO register that corresponds to the slot into which the PCI device was hotplugged and raises an event.
void acpi_pcihp_device_plug_cb(HotplugHandler *hotplug_dev, AcpiPciHpState *s,
                               DeviceState *dev, Error **errp)
{
    ...
    s->acpi_pcihp_pci_status[bsel].up |= (1U << slot);
    acpi_send_event(DEVICE(hotplug_dev), ACPI_PCI_HOTPLUG_STATUS);
}
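The two steps taken by the plug callback (mark the slot, then queue the event) can be sketched on their own. `PlugState`, `mark_slot_plugged` and `SLOTS_PER_BUS` are hypothetical names used only for this illustration; the event bit value is the one quoted in AcpiEventStatusBits.

```c
#include <assert.h>
#include <stdint.h>

/* Event bit quoted from AcpiEventStatusBits above. */
#define ACPI_PCI_HOTPLUG_STATUS 2u

#define SLOTS_PER_BUS 32u /* one bit per slot in the 32-bit masks */

/* Hypothetical, simplified view of the state touched by a plug. */
typedef struct {
    uint32_t up;      /* slots with a newly plugged device */
    uint32_t pending; /* event bits waiting to be delivered */
} PlugState;

/* Record that a device appeared in `slot` and queue the PCI event. */
void mark_slot_plugged(PlugState *s, unsigned slot)
{
    assert(slot < SLOTS_PER_BUS);
    s->up |= 1u << slot;
    s->pending |= ACPI_PCI_HOTPLUG_STATUS;
}
```

Plugging two devices before the guest reacts simply accumulates two bits in the mask while the event bit stays set once, which is why a single interrupt is enough for the guest to find every new slot.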
QEMU needs to convert its internal event to a guest interrupt. In the case of the virt platform this is done using interrupts associated with the GED device. (In legacy platforms this is typically done via the SCI interrupt.)
As part of acpi_conf_virt_init a set of IRQs is registered in the virt platform to notify the OS of certain classes of events.
static void acpi_conf_virt_init(MachineState *machine)
{
...
conf->cpu_hotplug_io_base = VIRT_CPU_HOTPLUG_IO_BASE;
conf->acpi_nvdimm_state = vms->acpi_nvdimm_state;
...
/* GED events */
GedEvent events[] = {
....
{
.irq = VIRT_GED_PCI_HOTPLUG_IRQ,
.event = GED_PCI_HOTPLUG,
},
};
...
}
Open: Why are we setting up the cpu_hotplug_io_base and acpi_nvdimm_state here? This should be done at platform init and not here.
virt_send_ged is registered as the method to invoke for hotplug event delivery into the OS.
static void virt_acpi_class_init(ObjectClass *class, void *data)
{
DeviceClass *dc = DEVICE_CLASS(class);
SysBusDeviceClass *sbc = SYS_BUS_DEVICE_CLASS(class);
HotplugHandlerClass *hc = HOTPLUG_HANDLER_CLASS(class);
AcpiDeviceIfClass *adevc = ACPI_DEVICE_IF_CLASS(class);
...
adevc->ospm_status = virt_ospm_status;
adevc->send_event = virt_send_ged;
adevc->madt_cpu = pc_madt_cpu_entry;
}
QEMU invokes the send_event handler when acpi_send_event is invoked. virt_send_ged then injects a specific interrupt into the guest.
static void virt_send_ged(AcpiDeviceIf *adev, AcpiEventStatusBits ev)
{
VirtAcpiState *s = VIRT_ACPI(adev);
...
} else if (ev & ACPI_PCI_HOTPLUG_STATUS) {
/* Inject PCI hotplug interrupt */
qemu_irq_pulse(s->gsi[VIRT_GED_PCI_HOTPLUG_IRQ]);
}
}
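The translation from an event bit to a GED interrupt can be pictured as a small lookup. This is only a sketch: `EventIrq`, `event_irq_map` and `irq_for_event` are hypothetical names, the interrupt numbers 0x10 (CPU) and 0x12 (PCI) are taken from the _EVT excerpt of the GED device in the DSDT, and the order in which the real handler tests the bits is not shown in the source.

```c
#include <assert.h>
#include <stddef.h>

/* Event bits quoted from AcpiEventStatusBits. */
#define ACPI_PCI_HOTPLUG_STATUS 2
#define ACPI_CPU_HOTPLUG_STATUS 4

/* Hypothetical pairing of an event bit with a GED interrupt number. */
typedef struct {
    int event;
    int irq;
} EventIrq;

static const EventIrq event_irq_map[] = {
    { ACPI_CPU_HOTPLUG_STATUS, 0x10 }, /* _EVT branch calling CPUS.CSCN */
    { ACPI_PCI_HOTPLUG_STATUS, 0x12 }, /* _EVT branch calling PCI0.PCNT */
};

/* Return the interrupt to pulse for `ev`, or -1 if no entry matches. */
int irq_for_event(int ev)
{
    assert(ev != 0);
    for (size_t i = 0; i < sizeof(event_irq_map) / sizeof(event_irq_map[0]); i++) {
        if (ev & event_irq_map[i].event) {
            return event_irq_map[i].irq;
        }
    }
    return -1;
}
```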
The DSDT table is populated by QEMU so that the Linux guest OS and QEMU can interact.
The DSDT table tells the guest OS the location of the IO registers and provides the implementation of the methods that the OS can use to interact with QEMU.
The DSDT table also informs the OS about the existence of a PCI Host Bridge/Domain/Root bus, its topology, and the resources that are reserved or required by the host bridge.
Note: The OS only cares about the IO range to the extent that it ensures that this range is not used by itself or allocated to other devices. All other interactions using this IO range are contained within the methods defined by ACPI. So the OS does not need to be modified if we wish to extend this implementation.
The DSDT reports every host bridge as a device with a particular _HID and _CID. This indicates to the OS that the device is the start of a PCI Host Bridge/Domain. The device also reports its PCI Segment/Domain number using _SEG and its unique identifier for that class of device using _UID. The _OSC method is used to return the capabilities supported by this segment (like PCIe hotplug).
DefinitionBlock ("", "DSDT", 2, "BOCHS ", "BXPCDSDT", 0x00000001)
{
External (_SB_.NVDR, UnknownObj)
Device (\_SB.PCI2)
{
Name (_HID, EisaId ("PNP0A08") /* PCI Express Bus */) // _HID: Hardware ID
Name (_CID, EisaId ("PNP0A03") /* PCI Bus */) // _CID: Compatible ID
Name (_ADR, Zero) // _ADR: Address
Name (_SEG, One) // _SEG: PCI Segment
Name (_UID, 0x02) // _UID: Unique ID
Name (SUPP, Zero)
Name (CTRL, Zero)
Method (_OSC, 4, NotSerialized) // _OSC: Operating System Capabilities
{
CreateDWordField (Arg3, Zero, CDW1)
If ((Arg0 == ToUUID ("33db4d5b-1ff7-401c-9657-7441c03dd766") /* PCI Host Bridge Device */))
{
CreateDWordField (Arg3, 0x04, CDW2)
CreateDWordField (Arg3, 0x08, CDW3)
SUPP = CDW2 /* \_SB_.PCI2._OSC.CDW2 */
CTRL = CDW3 /* \_SB_.PCI2._OSC.CDW3 */
CTRL = (CTRL & 0x1F)
If ((Arg1 != One))
{
CDW1 = (CDW1 | 0x08)
}
If ((CDW3 != CTRL))
{
CDW1 = (CDW1 | 0x10)
}
CDW3 = CTRL /* \_SB_.PCI2.CTRL */
Return (Arg3)
}
Else
{
CDW1 = (CDW1 | 0x04)
Return (Arg3)
}
}
}
_OSC capabilities
The PCI Host Bridge capabilities are discovered based on the following definitions:
/* PCI Host Bridge _OSC: Capabilities DWORD 2: Support Field */
#define OSC_PCI_EXT_CONFIG_SUPPORT 0x00000001
#define OSC_PCI_ASPM_SUPPORT 0x00000002
#define OSC_PCI_CLOCK_PM_SUPPORT 0x00000004
#define OSC_PCI_SEGMENT_GROUPS_SUPPORT 0x00000008
#define OSC_PCI_MSI_SUPPORT 0x00000010
#define OSC_PCI_SUPPORT_MASKS 0x0000001f
/* PCI Host Bridge _OSC: Capabilities DWORD 3: Control Field */
#define OSC_PCI_EXPRESS_NATIVE_HP_CONTROL 0x00000001
#define OSC_PCI_SHPC_NATIVE_HP_CONTROL 0x00000002
#define OSC_PCI_EXPRESS_PME_CONTROL 0x00000004
#define OSC_PCI_EXPRESS_AER_CONTROL 0x00000008
#define OSC_PCI_EXPRESS_CAPABILITY_CONTROL 0x00000010
#define OSC_PCI_CONTROL_MASKS 0x0000001f
In the example above the host bridge reports we support MSI but not PCIe Native hotplug.
Note: This information is also reported by the Kernel at bootup
[ 0.176004] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig Segments MSI]
[ 0.176515] acpi PNP0A08:00: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]
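The negotiation performed by the _OSC method shown earlier can be restated in C to make the bit handling easier to follow. This is a sketch that mirrors the ASL line by line; `osc_evaluate` and the `OSC_ERR_*` names are hypothetical, and the UUID comparison is reduced to a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define OSC_PCI_CONTROL_MASKS 0x0000001f

/* Status bits the ASL ORs into CDW1 (names are hypothetical). */
#define OSC_ERR_UNKNOWN_UUID     0x04
#define OSC_ERR_UNKNOWN_REVISION 0x08
#define OSC_ERR_CONTROL_MASKED   0x10

/*
 * cdw[0] = status (CDW1), cdw[1] = support (CDW2), cdw[2] = control (CDW3).
 * The buffer is updated in place, like Arg3 in the ASL.
 */
void osc_evaluate(bool uuid_matches, uint32_t revision, uint32_t cdw[3])
{
    assert(cdw != NULL);

    if (!uuid_matches) {
        cdw[0] |= OSC_ERR_UNKNOWN_UUID;
        return;
    }

    /* Only the control bits covered by the mask can be granted. */
    uint32_t granted = cdw[2] & OSC_PCI_CONTROL_MASKS;

    if (revision != 1) {
        cdw[0] |= OSC_ERR_UNKNOWN_REVISION;
    }
    if (cdw[2] != granted) {
        cdw[0] |= OSC_ERR_CONTROL_MASKED;
    }
    cdw[2] = granted;
}
```

A request whose control DWORD contains bits outside the mask comes back with those bits cleared and the 0x10 status bit set, which is how the OS learns that part of its request was refused.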
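On the Linux side the host bridge is set up through the call chain acpi_pci_root_add() -> pci_acpi_scan_root() -> acpi_pci_root_create().
pci_root_handler registers acpi_pci_root_add as the attach callback for the following device IDs: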
static const struct acpi_device_id root_device_ids[] = {
{"PNP0A03", 0},
{"", 0},
};
The OS scans the ACPI tables for specific methods (_EJ0) implemented by devices to detect that a device supports ACPI hotplug.
Device (S00)
{
Name (_ADR, Zero) // _ADR: Address
}
Device (S08)
{
Name (_ADR, 0x00010000) // _ADR: Address
Name (_SUN, One) // _SUN: Slot User Number
Method (_EJ0, 1, NotSerialized) // _EJx: Eject Device
{
PCEJ (BSEL, _SUN)
}
}
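The numbers in the S08 device above can be decoded with a few shifts. The sketch below relies on the ACPI convention that a PCI _ADR carries the device number in its high word and the function number in its low word, and on the PCEJ method shown elsewhere in this document, which writes `One << _SUN` to the eject register. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* PCI _ADR layout: high word = device number, low word = function number. */
unsigned adr_device(uint32_t adr)
{
    return adr >> 16;
}

unsigned adr_function(uint32_t adr)
{
    return adr & 0xFFFFu;
}

/* Mask written to the eject register for a given _SUN (One << _SUN). */
uint32_t eject_mask(unsigned sun)
{
    assert(sun < 32);
    return 1u << sun;
}
```

For S08, `_ADR` 0x00010000 decodes to device 1, function 0, and `_SUN` 1 yields an eject mask with only bit 1 set.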
For each PCI Host Bridge the OS needs to reserve the memory required for the devices it holds. That is conveyed using _CRS, which specifies the amount and range from which the OS should allocate IO and MMIO memory for each device.
Note: These ranges correspond to the PCI holes that are set up for each host bridge.
The number of buses supported, as well as the bus range a given host bridge supports, are also conveyed to the OS using _CRS.
Note: Limiting the number of buses on a host bridge limits the amount of ECAM memory that needs to be reserved for a given host bridge.
Scope (\_SB.PCI2)
{
Name (_CRS, ResourceTemplate () // _CRS: Current Resource Settings
{
WordBusNumber (ResourceProducer, MinFixed, MaxFixed, PosDecode,
0x0000, // Granularity
0x0000, // Range Minimum
0x0000, // Range Maximum
0x0000, // Translation Offset
0x0001, // Length
,, )
DWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, NonCacheable, ReadWrite,
0x00000000, // Granularity
0x70000000, // Range Minimum
0x70100000, // Range Maximum
0x00000000, // Translation Offset
0x00100001, // Length
,, , AddressRangeMemory, TypeStatic)
QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, Cacheable, ReadWrite,
0x0000000000000000, // Granularity
0x0000000900000000, // Range Minimum
0x000000093FFFFFFF, // Range Maximum
0x0000000000000000, // Translation Offset
0x0000000040000000, // Length
,, , AddressRangeMemory, TypeStatic)
})
}
PCI Enhanced Configuration Access Mechanism (ECAM) reservation for each PCI segment is communicated via the PCI MCFG table.
In the snippet below QEMU reports two segments and their ECAM ranges. The OS can then discover and interact with the PCI configuration space of all devices under that host bridge using MMIO.
/*
* Intel ACPI Component Architecture
* AML/ASL+ Disassembler version 20180810 (64-bit version)
* Copyright (c) 2000 - 2018 Intel Corporation
*
* Disassembly of mcfg.dat, Fri Sep 28 00:42:05 2018
*
* ACPI Data Table [MCFG]
*
* Format: [HexOffset DecimalOffset ByteLength] FieldName : FieldValue
*/
[000h 0000 4] Signature : "MCFG" [Memory Mapped Configuration table]
[004h 0004 4] Table Length : 0000004C
[008h 0008 1] Revision : 01
[009h 0009 1] Checksum : AE
[00Ah 0010 6] Oem ID : "BOCHS "
[010h 0016 8] Oem Table ID : "BXPCMCFG"
[018h 0024 4] Oem Revision : 00000001
[01Ch 0028 4] Asl Compiler ID : "BXPC"
[020h 0032 4] Asl Compiler Revision : 00000001
[024h 0036 8] Reserved : 0000000000000000
[02Ch 0044 8] Base Address : 0000000080000000
[034h 0052 2] Segment Group Number : 0000
[036h 0054 1] Start Bus Number : 00
[037h 0055 1] End Bus Number : FF
[038h 0056 4] Reserved : 00000000
[03Ch 0060 8] Base Address : 0000000060000000
[044h 0068 2] Segment Group Number : 0001
[046h 0070 1] Start Bus Number : 00
[047h 0071 1] End Bus Number : 00
[048h 0072 4] Reserved : 00000000
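The relationship between the bus range and the ECAM reservation can be checked with a little arithmetic. In ECAM each function owns 4 KiB of configuration space, so a bus (32 devices of 8 functions) takes 1 MiB, and the base address in an MCFG entry corresponds to bus 0 of the segment. The sketch below applies that layout to the two entries in the table above; `McfgEntry`, `ecam_size` and `ecam_addr` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-memory form of one MCFG allocation entry. */
typedef struct {
    uint64_t base;    /* Base Address (address of bus 0) */
    uint16_t segment; /* Segment Group Number */
    uint8_t start_bus;
    uint8_t end_bus;
} McfgEntry;

/* Bytes of ECAM space decoded for the entry: 1 MiB per bus. */
uint64_t ecam_size(const McfgEntry *e)
{
    assert(e->end_bus >= e->start_bus);
    return ((uint64_t)(e->end_bus - e->start_bus) + 1) << 20;
}

/* MMIO address of a config register: bus[27:20] dev[19:15] fn[14:12]. */
uint64_t ecam_addr(const McfgEntry *e, uint8_t bus, uint8_t dev,
                   uint8_t fn, uint16_t off)
{
    assert(bus >= e->start_bus && bus <= e->end_bus);
    assert(dev < 32 && fn < 8 && off < 4096);
    return e->base + (((uint64_t)bus << 20) | ((uint64_t)dev << 15) |
                      ((uint64_t)fn << 12) | off);
}
```

With these definitions segment 0 (buses 0x00 to 0xFF) needs 256 MiB of ECAM space, while segment 1, limited to a single bus, needs only 1 MiB.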
The locations of the IO registers set up by QEMU are communicated as part of the resource definition of the DSDT:
Scope (\_SB.PCI0)
{
OperationRegion (PCST, SystemIO, 0xAE00, 0x08)
Field (PCST, DWordAcc, NoLock, WriteAsZeros)
{
PCIU, 32,
PCID, 32
}
OperationRegion (SEJ, SystemIO, 0xAE08, 0x04)
Field (SEJ, DWordAcc, NoLock, WriteAsZeros)
{
B0EJ, 32
}
OperationRegion (BNMR, SystemIO, 0xAE10, 0x04)
Field (BNMR, DWordAcc, NoLock, WriteAsZeros)
{
BNUM, 32
}
Mutex (BLCK, 0x00)
Method (PCEJ, 2, NotSerialized)
{
Acquire (BLCK, 0xFFFF)
BNUM = Arg0
B0EJ = (One << Arg1)
Release (BLCK)
Return (Zero)
}
Name (BSEL, Zero)
...
Correlating what we saw in acpi_pcihp_init with the ACPI field definitions, we can observe that
QEMU                      DSDT
ACPI_PCIHP_ADDR 0xae00 -> OperationRegion (PCST, SystemIO, 0xAE00, 0x08)
PCI_UP_BASE     0x0000 -> PCIU, 32
PCI_DOWN_BASE   0x0004 -> PCID, 32
PCI_EJ_BASE     0x0008 -> B0EJ
PCI_RMV_BASE    0x000c -> Unused
PCI_SEL_BASE    0x0010 -> BNUM
Where the PCIU (PCI UP) and PCID (PCI DOWN) values are 32-bit masks describing the 32 slots on the host bridge. The DSDT table defines them under the PCST operation region.
They are also mapped in QEMU to the virt platform data structures as follows:
typedef struct AcpiPciHpPciStatus {
uint32_t up;
uint32_t down;
uint32_t hotplug_enable;
} AcpiPciHpPciStatus;
typedef struct AcpiPciHpState {
AcpiPciHpPciStatus acpi_pcihp_pci_status[ACPI_PCIHP_MAX_HOTPLUG_BUS];
uint32_t hotplug_select;
PCIBus *root;
MemoryRegion io;
bool legacy_piix;
uint16_t io_base;
uint16_t io_len;
} AcpiPciHpState;
QEMU notifies the OS of hotplug/unplug events using the ACPI interrupts set up through the GED device.
The ACPI table corresponding to the GED device maps each event to an interrupt, which in turn maps to the actions the OS needs to perform on receipt of that event.
Device (\_SB.GED)
{
Name (_HID, "ACPI0013" /* Generic Event Device */) // _HID: Hardware ID
Name (_UID, Zero) // _UID: Unique ID
Name (_CRS, ResourceTemplate () // _CRS: Current Resource Settings
{
...
Interrupt (ResourceConsumer, Level, ActiveHigh, Exclusive, ,, )
{
0x00000012,
}
})
Method (_EVT, 1, Serialized) // _EVT: Event
{
Local0 = One
While ((Local0 == One))
{
Local0 = Zero
If ((Arg0 == 0x10))
{
\_SB.CPUS.CSCN ()
}
....
ElseIf ((Arg0 == 0x12))
{
Acquire (\_SB.PCI0.BLCK, 0xFFFF)
\_SB.PCI0.PCNT ()
Release (\_SB.PCI0.BLCK)
}
}
}
}
As a result of this table, when QEMU injects an interrupt that is mapped to the GED device based on the interrupt number (0x12 in the case above), there is a method defined in the ACPI table that the OS needs to invoke to process it. That method is \_SB.PCI0.PCNT in the case above.
The PCNT method is defined in the DSDT for each bus (PCI host bridge) to perform the actions required in the OS on receipt of a hotplug event.
Method (PCNT, 0, NotSerialized)
{
BNUM = Zero
DVNT (PCIU, One)
DVNT (PCID, 0x03)
}
…
Method (DVNT, 2, NotSerialized)
{
If ((Arg0 & 0x02))
{
Notify (S08, Arg1)
}
…
}
Here the OS, using the ACPI method, will
- Write 0 to BNUM
  - This triggers a write event to QEMU, invoking pci_write
  - Which results in QEMU setting s->hotplug_select = s->legacy_piix ? ACPI_PCIHP_BSEL_DEFAULT : data;
- Call DVNT (PCIU, One), which then calls Notify() with argument One for every slot that is marked as UP
- Call DVNT (PCID, 0x03), which then calls Notify() with argument 0x03 for every slot that is marked as DOWN
Notify() is a built-in ACPI operator handled by the OS; details are found in a subsequent section.
When the OS calls these methods, QEMU detects that the guest OS reads from PCI_UP_BASE and PCI_DOWN_BASE, returns the values and clears the UP state internally.
case PCI_UP_BASE:
val = s->acpi_pcihp_pci_status[bsel].up;
s->acpi_pcihp_pci_status[bsel].up = 0;
break;
case PCI_DOWN_BASE:
val = s->acpi_pcihp_pci_status[bsel].down;
break;
Both 0x01 and 0x03 correspond to the ACPI definition
ACPI_NOTIFY_BUS_CHECK (u8) 0x00 [Unused]
ACPI_NOTIFY_DEVICE_CHECK (u8) 0x01
ACPI_NOTIFY_EJECT_REQUEST (u8) 0x03
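The fan-out performed by DVNT, one Notify() per set bit, can be sketched in C. This is an illustration of the idea rather than a translation of the generated AML: `dvnt`, `notify_fn`, `NotifyLog` and `log_notify` are hypothetical names, and the sketch loops over all 32 bits where the AML has one `If` block per slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ACPI_NOTIFY_DEVICE_CHECK  0x01
#define ACPI_NOTIFY_EJECT_REQUEST 0x03

/* Callback standing in for the ACPI Notify() operator. */
typedef void (*notify_fn)(unsigned slot, uint8_t value, void *ctx);

/* Call `notify` once for every slot whose bit is set in `mask`. */
unsigned dvnt(uint32_t mask, uint8_t value, notify_fn notify, void *ctx)
{
    unsigned count = 0;

    assert(notify != NULL);
    for (unsigned slot = 0; slot < 32; slot++) {
        if (mask & (1u << slot)) {
            notify(slot, value, ctx);
            count++;
        }
    }
    return count;
}

/* Hypothetical recorder used to observe which notifications were sent. */
typedef struct {
    unsigned slots[32];
    uint8_t values[32];
    unsigned n;
} NotifyLog;

void log_notify(unsigned slot, uint8_t value, void *ctx)
{
    NotifyLog *log = ctx;

    log->slots[log->n] = slot;
    log->values[log->n] = value;
    log->n++;
}
```

Passing the UP mask with ACPI_NOTIFY_DEVICE_CHECK and then the DOWN mask with ACPI_NOTIFY_EJECT_REQUEST reproduces the two calls made by PCNT.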
The ACPI code to support the PCI hotplug through ACPI comes from the logic in acpiphp_add_context()
if ((acpi_pci_check_ejectable(pbus, handle) || is_dock_device(adev))
&& !(pdev && pdev->is_hotplug_bridge && pciehp_is_native(pdev)))
So an ejectable device (i.e. one with an _EJ0 method or a non-zero _RMV object) that is not a pciehp native device, as determined by OSC_PCI_EXPRESS_NATIVE_HP_CONTROL being set on its host bridge, will be handled via ACPI PCI hotplug.
acpi_pci_check_ejectable()
if (!acpi_has_method(handle, "_ADR"))
return 0;
if (acpi_has_method(handle, "_EJ0"))
return 1;
status = acpi_evaluate_integer(handle, "_RMV", NULL, &removable);
if (ACPI_SUCCESS(status) && removable)
return 1;
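The decision described above can be condensed into two predicates. The sketch below is a simplification for illustration only: `AcpiNode`, `check_ejectable` and `uses_acpi_hotplug` are hypothetical names, the namespace lookups are replaced by booleans, and the native hotplug test is collapsed into a single flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of what the namespace exposes for one device. */
typedef struct {
    bool has_adr;     /* device has an _ADR object */
    bool has_ej0;     /* device has an _EJ0 method */
    bool rmv_nonzero; /* _RMV evaluates successfully to non-zero */
} AcpiNode;

/* Same ordering of checks as the excerpt above. */
bool check_ejectable(const AcpiNode *n)
{
    assert(n != NULL);
    if (!n->has_adr) {
        return false;
    }
    if (n->has_ej0) {
        return true;
    }
    return n->rmv_nonzero;
}

/* ACPI hotplug is used unless a native hotplug bridge claims the device. */
bool uses_acpi_hotplug(const AcpiNode *n, bool is_dock, bool native_hp_bridge)
{
    return (check_ejectable(n) || is_dock) && !native_hp_bridge;
}
```

Applied to the DSDT excerpt shown earlier, S08 (which has _ADR and _EJ0) qualifies for ACPI hotplug while S00 (which has only _ADR) does not.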
Linux Notify handling
Notify() is a standard ACPI operator that calls into the guest OS with those parameters. In Linux it is mapped to acpiphp_hotplug_notify() in drivers/pci/hotplug/acpiphp_glue.c.
The notify handler is registered as part of acpiphp_init_context by the following sequence:
pci_acpi_scan_root()
pci_create_root_bus()
pci_register_host_bridge()
pcibios_add_bus()
acpi_pci_add_bus()
acpiphp_enumerate_slots()
acpiphp_init_context()
This results in the invocation of hotplug_event, which performs the BUS_CHECK or EJECT logic based on the parameter used in the ACPI Notify call.
static void hotplug_event(u32 type, struct acpiphp_context *context)
{
    ...
    switch (type) {
    case ACPI_NOTIFY_BUS_CHECK:
        ...
    case ACPI_NOTIFY_DEVICE_CHECK:
        ...
    case ACPI_NOTIFY_EJECT_REQUEST:
        ...
    }
    ...
}
Which map to
DVNT (PCIU, One) -> ACPI_NOTIFY_DEVICE_CHECK
DVNT (PCID, 0x03) -> ACPI_NOTIFY_EJECT_REQUEST
/*
* Standard notify values
*/
#define ACPI_NOTIFY_BUS_CHECK (u8) 0x00
#define ACPI_NOTIFY_DEVICE_CHECK (u8) 0x01
#define ACPI_NOTIFY_DEVICE_WAKE (u8) 0x02
#define ACPI_NOTIFY_EJECT_REQUEST (u8) 0x03
#define ACPI_NOTIFY_DEVICE_CHECK_LIGHT (u8) 0x04
#define ACPI_NOTIFY_FREQUENCY_MISMATCH (u8) 0x05
#define ACPI_NOTIFY_BUS_MODE_MISMATCH (u8) 0x06
#define ACPI_NOTIFY_POWER_FAULT (u8) 0x07
#define ACPI_NOTIFY_CAPABILITIES_CHECK (u8) 0x08
#define ACPI_NOTIFY_DEVICE_PLD_CHECK (u8) 0x09
#define ACPI_NOTIFY_RESERVED (u8) 0x0A
#define ACPI_NOTIFY_LOCALITY_UPDATE (u8) 0x0B
#define ACPI_NOTIFY_SHUTDOWN_REQUEST (u8) 0x0C
#define ACPI_NOTIFY_AFFINITY_UPDATE (u8) 0x0D
#define ACPI_NOTIFY_MEMORY_UPDATE (u8) 0x0E
To summarize:
- ACPI hotplug tables are created by QEMU, populated with the IO aperture, field definitions and methods.
- The VM is started. The DSDT table is provisioned with the methods and the devices to describe how the PCI bus needs to be rescanned.
- The operating system scans the DSDT table and registers the (notify) handlers for each bus.
- We also define the GED device with a set of interrupts mapped to it and the associated methods to invoke.
- For each GED interrupt we map an ACPI method. In the case of PCI, PCNT.
- When hotplugging a device through the QEMU monitor, qdev_device_add() gets called, which invokes the hotplug handler that will trigger the interrupt associated with the event for PCI hotplug.
- Upon reception of the interrupt, the guest OS invokes the ACPI method defined through the DSDT table that is associated with that particular bus.
- PCNT, the method associated with the interrupt, invokes DVNT on PCIU and PCID.
- DVNT calls ACPI Notify(), which triggers a scan of every slot marked as UP and ejects every slot marked as DOWN from the PCI bus. The guest OS probes drivers for every new PCI device discovered.
- Notify() is handled in the Linux kernel, which performs the discovery within the OS.
- PCNT also results in reads/writes to the IO fields, which are handled synchronously by QEMU callback functions. This lets QEMU know that the ACPI event has been processed within the kernel.