diff --git a/doc/source/specs/rocky/device-passthrough.rst b/doc/source/specs/rocky/device-passthrough.rst
new file mode 100644
index 00000000..24c4dc6f
--- /dev/null
+++ b/doc/source/specs/rocky/device-passthrough.rst
@@ -0,0 +1,414 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+==================
+Device Passthrough
+==================
+
+https://blueprints.launchpad.net/nova-powervm/+spec/device-passthrough
+
+Provide a generic way to identify hardware devices such as GPUs and attach them
+to VMs.
+
+Problem description
+===================
+
+Deployers want to be able to attach accelerators and other adapters to their
+VMs. Today in Nova this is possible only in very restricted circumstances. The
+goal of this blueprint is to enable generic passthrough of devices for
+consumers of the nova-powervm driver.
+
+While these efforts may enable more, and should be extensible going forward,
+the primary goal for the current release is to pass through entire physical
+GPUs. That is, we are not attempting to pass through:
+
+* Physical functions, virtual functions, regions, etc. I.e. granularity smaller
+  than "whole adapter". This requires device type-specific support at the
+  platform level to perform operations such as discovery/inventorying,
+  configuration, and attach/detach.
+* Devices with "a wire out the back" - i.e. those which are physically
+  connected to anything (networks, storage, etc.) external to the host. These
+  will require the operator to understand and be able to specify/select
+  specific connection parameters for proper placement.
+
+Use Cases
+---------
+As an admin, I wish to be able to configure my host and flavors to allow
+passthrough of whole physical GPUs to VMs.
+
+As a user, I wish to make use of appropriate flavors to create VMs with GPUs
+attached.
+
+Proposed change
+===============
+
+Device Identification and Whitelisting
+--------------------------------------
+The administrator can identify and allow (explicitly) or deny (by omission)
+passthrough of devices by way of a YAML file per compute host.
+
+.. note:: **Future:** We may someday figure out a way to support a config file
+          on the controller. This would allow e.g. cloud-wide whitelisting and
+          specification for particular device types by vendor/product ID, which
+          could then be overridden (or not) by the files on the compute nodes.
+
+The path to the config will be hardcoded as ``/etc/nova/inventory.yaml``.
+
+The file shall contain paragraphs, each of which will:
+
+* Identify zero or more devices based on information available on the
+  ``IOSlot`` NovaLink REST object. In pypowervm, given a ManagedSystem wrapper
+  ``sys_w``, a list of ``IOSlot`` wrappers is available via
+  ``sys_w.asio_config.io_slots``. See `identification`_. Any device not
+  identified by any paragraph in the file is denied for passthrough. But see
+  the `allow`_ section for future plans around supporting explicit denials.
+* Name the resource class to associate with the resource provider inventory unit
+  by which the device will be exposed in the driver. If not specified,
+  ``CUSTOM_IOSLOT`` is used. See `resource_class`_.
+* List traits to include on the resource provider in addition to those generated
+  automatically. See `traits`_.
+
+A `formal schema`_ is proposed for review.
+
+.. _formal schema: https://review.openstack.org/#/c/579289/3/nova_powervm/virt/powervm/passthrough_schema.yaml
+
+Here is a summary description of each section.
+
+Name
+~~~~
+Each paragraph will be introduced by a key which is a human-readable name for
+the paragraph. The name has no programmatic significance other than to separate
+paragraphs. Each paragraph's name must be unique within the file.
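As a rough illustration of the file structure summarized above, the sketch below shows what the config might look like once loaded into Python, and how the stated defaults could be filled in. The paragraph names, the ID values, and the ``normalize_paragraph`` helper are hypothetical. Only the section names and the ``CUSTOM_IOSLOT`` default come from this spec.

```python
# Hypothetical result of loading /etc/nova/inventory.yaml: a mapping of
# paragraph name -> paragraph body. Names and ID values are made up.
PARSED_CONFIG = {
    "gpus": {
        "identification": {"vendor_id": "10DE", "class": "0302"},
        "resource_class": "CUSTOM_GPU",
        "traits": ["CUSTOM_CAPABILITY_WHIZBANG"],
    },
    "slot_c7": {
        "identification": {"drc_index": "1C0FFEE1"},
    },
}

# Default named by the spec for paragraphs without a resource_class key.
DEFAULT_RESOURCE_CLASS = "CUSTOM_IOSLOT"


def normalize_paragraph(name, body):
    """Return a copy of one paragraph with the spec's defaults filled in."""
    if not body.get("identification"):
        raise ValueError("paragraph %r needs an identification section" % name)
    return {
        "name": name,
        "identification": dict(body["identification"]),
        "resource_class": body.get("resource_class", DEFAULT_RESOURCE_CLASS),
        "traits": list(body.get("traits", [])),
    }


paragraphs = [normalize_paragraph(n, b) for n, b in PARSED_CONFIG.items()]
```

Because the paragraph name is the mapping key, it serves only to separate paragraphs, as described above.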
+
+identification
+~~~~~~~~~~~~~~
+Each paragraph will have an ``identification`` section, which is an object
+containing one or more keys corresponding to ``IOSlot`` properties, as follows:
+
+  ================ ==================== =====================================
+  YAML key         IOSlot property      Description
+  ================ ==================== =====================================
+  vendor_id        pci_vendor_id        \X{4} (four uppercase hex digits)
+  device_id        pci_dev_id           \X{4} "
+  subsys_vendor_id pci_subsys_vendor_id \X{4} "
+  subsys_device_id pci_subsys_dev_id    \X{4} "
+  class            pci_class            \X{4} "
+  revision_id      pci_rev_id           \X{2} (two uppercase hex digits)
+  drc_index        drc_index            \X{8} (eight uppercase hex digits)
+  drc_name         drc_name             String (physical location code)
+  ================ ==================== =====================================
+
+The values are expected to match those produced by ``pvmctl ioslot list -d
+<property>`` for a given property.
+
+The ``identification`` section is required, and must contain at least one of
+the above keys.
+
+When multiple keys are provided in a paragraph, they are matched with ``AND``
+logic.
+
+.. note:: It is a stretch goal of this blueprint to allow wildcards in (some
+          of) the values. E.g. ``drc_name: U78CB.001.WZS0JZB-P1-*`` would
+          allow everything on the ``P1`` planar of the ``U78CB.001.WZS0JZB``
+          enclosure. If we get that far, a spec amendment will be proposed with
+          the specifics (what syntax, which fields, etc.).
+
+allow
+~~~~~
+.. note:: The ``allow`` section will not be supported initially, but is
+          documented here because we thought through what it should look like.
+          In the initial implementation, any device encompassed by a paragraph
+          is allowed for passthrough.
+
+Each paragraph will support a boolean ``allow`` keyword.
+
+If omitted, the default is ``true`` - i.e. devices identified by this
+paragraph's ``identification`` section are permitted for passthrough. (Note,
+however, that devices not encompassed by the union of all the
+``identification`` paragraphs in the file are denied for passthrough.)
+
+If ``allow`` is ``false``, the only other section allowed is
+``identification``, since the rest don't make sense.
+
+A given device can only be represented once across all ``allow=true``
+paragraphs (implicit or explicit); an "allowed" device found more than once
+will result in an error.
+
+A given device can be represented zero or more times across all ``allow=false``
+paragraphs.
+
+We will first apply the ``allow=true`` paragraphs to construct a preliminary
+list of devices; and then apply each ``allow=false`` paragraph and remove
+explicitly denied devices from that list.
+
+.. note:: Again, we're not going to support the ``allow`` section at all
+          initially. It will be a stretch goal to add it as part of this
+          release, or it may be added in a subsequent release.
+
+resource_class
+~~~~~~~~~~~~~~
+If ``allow`` is omitted or ``true``, an optional ``resource_class`` key is
+supported. Its string value allows the author to designate the resource class
+to be used for the inventory unit representing the device on the resource
+provider. If omitted, ``CUSTOM_IOSLOT`` will be used as the default.
+
+.. note:: **Future:** We may be able to get smarter about dynamically
+          defaulting the resource class based on inspecting the device
+          metadata. For now, we have to rely on the author of the config file
+          to tell us what kind of device we're looking at.
+
+traits
+~~~~~~
+If ``allow`` is omitted or ``true``, an optional ``traits`` subsection is
+supported. Its value is an array of strings, each of which is the name of a
+trait to be added to the resource providers of each device represented by this
+paragraph. If the ``traits`` section is included, it must have at least one
+value in the list. (If no additional traits are desired, omit the section.)
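The rules above (required ``identification``, per-key value formats, ``AND`` matching, at most one allowing paragraph per device, non-empty ``traits``) could be checked roughly as in the following sketch. It is not the driver's implementation: the function names are hypothetical, a slot is modeled as a plain dict keyed by the YAML key names rather than as a pypowervm ``IOSlot`` wrapper, and the ``allow`` keyword is ignored because it is not supported initially.

```python
import re

# Expected value formats, following the identification table in this spec.
ID_PATTERNS = {
    "vendor_id": r"[0-9A-F]{4}",
    "device_id": r"[0-9A-F]{4}",
    "subsys_vendor_id": r"[0-9A-F]{4}",
    "subsys_device_id": r"[0-9A-F]{4}",
    "class": r"[0-9A-F]{4}",
    "revision_id": r"[0-9A-F]{2}",
    "drc_index": r"[0-9A-F]{8}",
    "drc_name": r".+",
}


def validate_paragraph(name, body):
    """Raise ValueError if a paragraph breaks the rules described above."""
    ident = body.get("identification")
    if not ident:
        raise ValueError("%s: identification needs at least one key" % name)
    for key, value in ident.items():
        pattern = ID_PATTERNS.get(key)
        if pattern is None:
            raise ValueError("%s: unknown identification key %r" % (name, key))
        if not isinstance(value, str) or not re.fullmatch(pattern, value):
            raise ValueError("%s: bad value %r for %s" % (name, value, key))
    if "traits" in body and not body["traits"]:
        raise ValueError("%s: traits, if present, must not be empty" % name)


def matches(ident, slot):
    """AND logic: every key in the paragraph must equal the slot's value."""
    return all(slot.get(key) == value for key, value in ident.items())


def select_paragraph(config, slot):
    """Return the one paragraph name allowing this slot, or None (denied)."""
    hits = [name for name, body in config.items()
            if matches(body["identification"], slot)]
    if len(hits) > 1:
        raise ValueError("slot matched more than once: %s" % ", ".join(hits))
    return hits[0] if hits else None
```

Values are compared as strings here. Depending on the YAML loader, an unquoted all-digit value such as ``0302`` may be loaded as an integer rather than a string, so quoting such values in the file is the safer choice.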
+
+The values must be valid trait names (either standard from ``os-traits`` or
+custom, matching ``CUSTOM_[A-Z0-9_]*``). These will be in addition to the
+traits automatically added by the driver - see `Generated Traits`_ below.
+Traits which conflict with automatically-generated traits will result in an
+error: the driver must be the single source of truth for the traits it
+generates.
+
+Traits may be used to indicate any static attribute of a device - for example,
+a capability (``CUSTOM_CAPABILITY_WHIZBANG``) not otherwise indicated by
+`Generated Traits`_.
+
+Resource Providers
+------------------
+The driver shall create nested resource providers, one per device (slot), as
+children of the compute node provider generated by Nova.
+
+.. TODO: Figure out how NVLink devices appear and how to handle them - ideally
+   by hiding them and automatically attaching them with their corresponding
+   device.
+
+The provider name shall be generated as ``PowerVM IOSlot %(drc_index)08X`` e.g.
+``PowerVM IOSlot 1C0FFEE1``. We shall let the placement service generate the
+UUID. This naming scheme allows us to identify the full set of providers we
+"own". This includes identifying providers we may have created on a previous
+iteration (potentially in a different process) which now need to be purged
+(e.g. because the slot no longer exists on the system). It also helps us
+provide a clear migration path in the future, if, for example, Cyborg takes
+over generating these providers. It also paves the way for providers
+corresponding to things smaller than a slot; e.g. PFs might be namespaced
+``PowerVM PF %(drc_index)08X``.
+
+Inventory
+~~~~~~~~~
+Each device RP shall have an inventory of::
+
+    total: 1
+    reserved: 0
+    min_unit: 1
+    max_unit: 1
+    step_size: 1
+    allocation_ratio: 1.0
+
+of the `resource_class`_ specified in the config file for the paragraph
+matching this device (``CUSTOM_IOSLOT`` by default).
+
+.. note:: **Future:** Some day we will provide SR-IOV VFs, vGPUs, FPGA
+          regions/functions, etc. At that point we will conceivably have
+          inventory of multiple units of multiple resource classes, etc.
+
+Generated Traits
+~~~~~~~~~~~~~~~~
+The provider for a device shall be decorated with the following
+automatically-generated traits:
+
+* ``CUSTOM_POWERVM_IOSLOT_VENDOR_ID_%(vendor_id)04X``
+* ``CUSTOM_POWERVM_IOSLOT_DEVICE_ID_%(device_id)04X``
+* ``CUSTOM_POWERVM_IOSLOT_SUBSYS_VENDOR_ID_%(subsys_vendor_id)04X``
+* ``CUSTOM_POWERVM_IOSLOT_SUBSYS_DEVICE_ID_%(subsys_device_id)04X``
+* ``CUSTOM_POWERVM_IOSLOT_CLASS_%(class)04X``
+* ``CUSTOM_POWERVM_IOSLOT_REVISION_ID_%(revision_id)02X``
+* ``CUSTOM_POWERVM_IOSLOT_DRC_INDEX_%(drc_index)08X``
+* ``CUSTOM_POWERVM_IOSLOT_DRC_NAME_%(drc_name)s`` where ``drc_name`` is
+  normalized via ``os_traits.normalize_name``.
+
+In addition, the driver shall decorate the provider with any `traits`_
+specified in the config file paragraph identifying this device. If that
+paragraph specifies any of the above generated traits, an exception shall be
+raised (we'll blow up the compute service).
+
+update_provider_tree
+~~~~~~~~~~~~~~~~~~~~
+The above provider tree structure/data shall be provided to Nova by overriding
+the ``ComputeDriver.update_provider_tree`` method. The algorithm shall be as
+follows:
+
+* Parse the config file.
+* Discover devices (``GET /ManagedSystem``, pull out
+  ``.asio_config.io_slots``).
+* Merge the config data with the discovered devices to produce a list of
+  devices to pass through, along with inventory of the appropriate resource
+  class name, and traits (generated and specified).
+* Ensure the tree contains entries according to this calculated passthrough
+  list, with appropriate inventory and traits.
+* Set-subtract the names of the providers in the calculated passthrough list
+  from those in the provider tree whose names are prefixed with ``PowerVM
+  IOSlot`` and delete the resulting "orphans".
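The merge and orphan-detection steps above can be sketched as follows. This is a simplified model under stated assumptions, not the driver code: slots are plain dicts of already-formatted hex strings keyed by the YAML key names, the provider tree is reduced to a collection of provider names, and ``sanitize`` is a hypothetical stand-in for the normalization the spec delegates to ``os_traits.normalize_name`` (whose exact behavior is not reproduced here). All function names are illustrative.

```python
import re

PROVIDER_PREFIX = "PowerVM IOSlot "
ID_KEYS = ("vendor_id", "device_id", "subsys_vendor_id", "subsys_device_id",
           "class", "revision_id", "drc_index")


def sanitize(name):
    """Hypothetical stand-in: uppercase, non-alphanumerics become '_'."""
    return re.sub(r"[^0-9A-Za-z]+", "_", name).upper()


def generated_traits(slot):
    """Build the automatically generated traits for one slot."""
    traits = {"CUSTOM_POWERVM_IOSLOT_%s_%s" % (key.upper(), slot[key])
              for key in ID_KEYS}
    traits.add("CUSTOM_POWERVM_IOSLOT_DRC_NAME_" + sanitize(slot["drc_name"]))
    return traits


def plan_providers(paragraphs, slots):
    """Merge config with discovered slots: provider name -> (rc, traits)."""
    plan = {}
    for slot in slots:
        for para in paragraphs:
            ident = para["identification"]
            if not all(slot.get(k) == v for k, v in ident.items()):
                continue
            auto = generated_traits(slot)
            extra = set(para.get("traits", []))
            # Assumption: "conflict" means naming a trait the driver generates.
            if auto & extra:
                raise ValueError("config repeats generated traits")
            name = PROVIDER_PREFIX + slot["drc_index"]
            plan[name] = (para.get("resource_class", "CUSTOM_IOSLOT"),
                          auto | extra)
            break  # first matching paragraph wins in this sketch
    return plan


def find_orphans(existing_names, plan):
    """Providers we own (by name prefix) that are no longer in the plan."""
    owned = {n for n in existing_names if n.startswith(PROVIDER_PREFIX)}
    return owned - set(plan)
```

Filtering on the name prefix before the set subtraction is what keeps the compute node provider, and any providers created by other components, out of the orphan list.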
+
+This is in addition to the standard ``update_provider_tree`` contract of
+ensuring appropriate ``VCPU``, ``MEMORY_MB``, and ``DISK_GB`` resources on the
+compute node provider.
+
+.. note:: It is a stretch goal of this blueprint to implement caching and/or
+          other enhancements to the above algorithm to optimize performance by
+          minimizing the need to call PowerVM REST and/or process whitelist
+          files every time.
+
+Flavor Support
+--------------
+Existing Nova support for generic resource specification via flavor extra specs
+should "just work". For example, a flavor requesting two GPUs might look like::
+
+    resources:VCPU=1
+    resources:MEMORY_MB=2048
+    resources:DISK_GB=100
+    resources1:CUSTOM_GPU=1
+    trait1:CUSTOM_POWERVM_IOSLOT_VENDOR_ID_600D=required
+    trait1:CUSTOM_POWERVM_IOSLOT_DEVICE_ID_F00D=required
+    resources2:CUSTOM_GPU=1
+    trait2:CUSTOM_POWERVM_IOSLOT_DRC_INDEX_1C0FFEE1=required
+
+PowerVMDriver
+-------------
+
+spawn
+~~~~~
+During ``spawn``, we will query placement to retrieve the resource provider
+records listed in the ``allocations`` parameter. Any provider names which are
+prefixed with ``PowerVM IOSlot`` will be parsed to extract the DRC index (the
+last eight characters of the provider name). The corresponding slots will be
+extracted from the ``ManagedSystem`` payload and added to the
+``LogicalPartition`` payload for the instance as it is being created.
+
+destroy
+~~~~~~~
+IOSlots are detached automatically when we ``DELETE`` the ``LogicalPartition``,
+so no changes should be required here.
+
+Live Migration
+~~~~~~~~~~~~~~
+Since we can't migrate the state of an active GPU, we will block live migration
+of a VM with an attached IOSlot.
+
+.. _`Cold Migration`:
+
+Cold Migration, Rebuild, Remote Restart
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+We should get these for free, but need to make sure they're tested.
+
+Hot plug/unplug
+~~~~~~~~~~~~~~~
+This is not in the scope of the current effort. For now, attaching/detaching
+devices to/from existing VMs can only be accomplished via resize (`Cold
+Migration`_).
+
+Alternatives
+------------
+Use Nova's PCI passthrough subsystem. We've all agreed this sucks and is not
+the way forward.
+
+Use oslo.config instead of a YAML file. Experience with the
+``[pci]passthrough_whitelist`` has led us to conclude that config format is too
+restrictive/awkward. The direction for Nova (as discussed at the Queens PTG in
+Denver) will be toward some kind of YAML format; we're going to be the pioneers
+on this front.
+
+Security impact
+---------------
+It is the operator's responsibility to ensure that the passthrough YAML config
+file has appropriate permissions, and lists only devices which do not
+themselves pose a security risk if attached to a malicious VM.
+
+End user impact
+---------------
+Users get acceleration for their workloads \o/
+
+Performance Impact
+------------------
+
+Discovery
+~~~~~~~~~
+For the `update_provider_tree`_ flow, we're adding the step of loading and
+parsing the passthrough YAML config file. This should be negligible compared to
+e.g. retrieving the ``ManagedSystem`` object (which we're already doing, so no
+impact there).
+
+spawn/destroy
+~~~~~~~~~~~~~
+There's no impact from the community side. It may take longer to create or
+destroy a LogicalPartition with attached IOSlots.
+
+Deployer impact
+---------------
+None.
+
+Developer impact
+----------------
+None.
+
+Upgrade impact
+--------------
+None.
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+Primary assignee:
+  efried
+
+Other contributors:
+  edmondsw, mdrabe
+
+Work Items
+----------
+See `Proposed change`_.
+
+
+Dependencies
+============
+os-traits 0.9.0 to pick up the ``normalize_name`` method.
+
+Testing
+=======
+Testing this in the CI will be challenging, given that we are not likely to
+score GPUs for all of our nodes.
+
+We will likely need to rely on manual testing and PowerVC to cover the code
+paths described under `PowerVMDriver`_ with a handful of various device
+configurations.
+
+
+Documentation Impact
+====================
+* Add a section to our support matrix for generic device passthrough.
+* User documentation for:
+    * How to build the passthrough YAML file.
+    * How to construct flavors accordingly.
+
+References
+==========
+None.
+
+
+History
+=======
+
+.. list-table:: Revisions
+   :header-rows: 1
+
+   * - Release Name
+     - Description
+   * - Rocky
+     - Introduced
diff --git a/doc/source/specs/rocky/index.rst b/doc/source/specs/rocky/index.rst
new file mode 100644
index 00000000..ce48b5d9
--- /dev/null
+++ b/doc/source/specs/rocky/index.rst
@@ -0,0 +1,7 @@
+Rocky Specifications
+====================
+
+.. toctree::
+   :glob:
+
+   *