Merge "Specification for Cyborg/Nova/Glance interaction in the compute node, including os-acc library API."

This commit is contained in:
Zuul 2018-06-08 02:55:25 +00:00 committed by Gerrit Code Review
commit c962039a11
1 changed files with 413 additions and 0 deletions

View File

@ -0,0 +1,413 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
==============================================
Cyborg-Nova-Glance Interaction in Compute Node
==============================================
Cyborg is a service for managing accelerators, such as FPGAs, GPUs, etc. For
scheduling an instance that needs accelerators, Cyborg needs to work with Nova
at three levels:
* Representation and Discovery: Cyborg shall represent accelerators
as resources in Placement. When a device is discovered, Cyborg
updates resource inventories in Placement.
* Instance placement/scheduling: Cyborg may provide a weigher
that prioritizes hosts based on available accelerator resources.
* Attaching accelerators to instances: In the compute node, Cyborg
shall define a workflow based on interacting with Nova through a
new os-acc library (like os-vif and os-brick).
The first two aspects are addressed in [#CyborgNovaSched]_. This spec
addresses the attachment of accelerators to instances, via os-acc. For
FPGAs, Cyborg also needs to interact with Glance for fetching bitstreams.
Some aspects of that are covered in [#BitstreamSpec]_. This spec will
address the interaction of Cyborg and Glance in the compute node.
This spec is common to all accelerators, including GPUs, High Precision
Time Synchronization (HPTS) cards, etc. Since FPGAs have more aspects
to be considered than other devices, some sections focus on
FPGA-specific factors; the spec calls out such aspects explicitly.
Smart NICs based on FPGAs fall into two categories: those which
expose the FPGA explicitly to the host, and those that do not. Cyborg's
current scope includes the former. This spec includes such devices,
though the Cyborg-Neutron interaction is out of scope.
The scope of this spec is the Rocky release.
Terminology
===========
* Accelerator: The unit that can be assigned to an instance for
offloading specific functionality. For non-FPGA devices, it is either the
device itself or a virtualized version of it (e.g. vGPUs). For FPGAs, an
accelerator is either the entire device, a region within the device or a
function.
* Bitstream: An FPGA image, usually a binary file, possibly with
vendor-specific metadata. A bitstream may implement one or more functions.
* Function: A specific functionality, such as matrix multiplication or video
transcoding, usually represented as a string or UUID. This term may be used
with multi-function devices, including FPGAs and other fixed function
hardware like Intel QuickAssist.
* Region: A part of the FPGA which can be programmed without disrupting
other parts of that FPGA. If an FPGA does not support Partial
Reconfiguration, the entire device constitutes one region. A region
may implement one or more functions.
Here is an example diagram for an FPGA with multiple regions and multiple
functions in a region::
PCI A PCI B
| |
+-------|--------|-------------------+
| | | |
| +----|--------|---+ +--------+ |
| | +--|--+ +---|-+ | | | |
| | | Fn A| | Fn B| | | | |
| | +-----+ +-----+ | | | |
| +-----------------+ +--------+ |
| Region 1 Region 2 |
| |
+------------------------------------+
Problem description
===================
Once Nova has picked a compute node for placement of an instance that needs
accelerators, the following steps need to happen:
* Nova compute on that node has to invoke Cyborg Agent for handling the needed
accelerators. This needs to happen through a library named os-acc, patterned
after os-vif (Neutron) and os-brick (Cinder).
* Cyborg Agent may call Glance to fetch a bitstream, either by ID or based on
tags.
* Cyborg Agent may need to call into a Cyborg driver to program said bitstream.
* Cyborg Agent needs to call into a Cyborg driver to prepare a device and/or
obtain an attach handle (e.g. PCI BDF) that can be attached to the instance.
* Cyborg Agent returns enough information to Nova compute via os-acc for the
instance to be launched.
The behavior of each of these steps needs to be specified.
In addition, the OpenStack Compute API [#ServerConcepts]_ specifies the
operations that can be done on an instance. The behavior with respect to
accelerators must be defined for each of these operations. That in turn is
related to when Nova compute calls os-acc.
Use Cases
---------
Please see [#CyborgNovaSched]_. We intend to support FPGAaaS with
request time programming, and AFaaS (both pre-programmed and
orchestrator-programmed scenarios).
Cyborg will discover accelerator resources whenever the Cyborg agent starts up.
PCI hot plug can be supported after the Rocky release.
Cyborg must support all instance operations mentioned in the OpenStack Compute
API [#ServerConcepts]_ in Rocky, except booting off a snapshot and live
migration.
Proposed change
===============
OpenStack Server API Behavior
-----------------------------
The OpenStack Compute API [#ServerConcepts]_ mentions the list of operations
that can be performed on an instance. Of these, some will not be supported by
Cyborg in Rocky. The list of supported operations (with
the intended behaviors) is as follows:
* When an instance is started, the accelerators requested by that instance's
flavor must be attached to the instance. On termination, those resources are
released.
* When an instance is paused, suspended or locked, the accelerator resources
are left intact, and not detached from the instance. So, when the instance is
unpaused, resumed or unlocked, there is nothing to do.
* When an instance is shelved, the accelerator resources are detached. On an
unshelve, it is expected that the build operation will go through the
scheduler again, so it is equivalent to an instance start.
* When an instance is deleted, the accelerator resources are detached. On a
restore, it is expected that the build operation will go through the
scheduler again, so it is equivalent to an instance start.
* Reboot: The accelerator resources are left intact. It is up to the instance
software to rediscover attached resources.
* Rebuild: Prior to the instance image replacement, all device access must be
quiesced, i.e., accesses to devices from that instance must be completed and
further accesses must be prohibited. The mechanics of such quiescing are
outside the scope of this document. With that precondition, accelerator
resources are left attached to the instance during the rebuild.
* Resize (with change of flavor): It is equivalent to a termination followed by
re-scheduling and restart. The accelerator resources are detached on
termination, and re-attached when the instance is scheduled again.
* Cold migration: It is equivalent to a termination followed by re-scheduling
and restart. The accelerator resources are detached on termination, and
re-attached when the instance is scheduled again.
* Evacuate: This is a forcible rebuild by the administrator. As the semantics
of evacuation are left open even without accelerators, Cyborg's behavior is
also left undefined.
* Set administrator password, trigger crash dump: These are supported; they
are no-ops as far as accelerators are concerned.
The following instance operations are not supported in this release:
* Booting off a snapshot: The snapshot may have been taken when the attached
accelerators were in a particular state. When booting off a previous
snapshot, the current configuration and state of accelerators may not match
the snapshot. So, this is unsupported.
* Live migration: Until a mechanism is defined to migrate accelerator state
along with the instance, this is unsupported.
os-acc Structure
----------------
Cyborg will develop a new library named os-acc. That library will offer the
APIs listed later in this section. Nova Compute calls these APIs if it sees
that the requested flavor refers to the ``CUSTOM_ACCELERATOR`` resource class,
except for the initialize() call, which is called unconditionally. Nova Compute
calls these APIs asynchronously, as suggested below::
    from concurrent.futures import ThreadPoolExecutor

    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(os_acc.<api>, *args)
        # do other stuff
        try:
            data = future.result()
        except Exception:
            # handle exceptions
            raise
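For illustration, a flavor that asks for one pre-programmed FPGA accelerator
might carry extra specs along these lines; the ``resources:`` and ``trait:``
prefixes are standard Nova extra-spec syntax, while the trait pattern and the
``bitstream:<uuid>`` key are assumptions based on the examples later in this
spec::

    resources:CUSTOM_ACCELERATOR=1
    trait:CUSTOM_FPGA_<vendor>_REGION_<uuid>=required
    bitstream:<uuid>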
The APIs of os-acc are as below:
* initialize()
* Called once at start of day. Waits for Cyborg Agent to be ready to accept
requests, i.e., all devices enumerated and traits published.
* Returns None on success.
* Throws ``CyborgAgentUnavailable`` exception if Cyborg Agent cannot be
contacted.
* plug(instance_info, selected_rp, flavor_extra_specs)
* Parameters are all read-only. Here are their descriptions:
* instance_info: dictionary containing instance UUID, instance name,
project/tenant ID and VM image UUID. The instance name is needed for
better logging, the project/tenant ID may be passed to some accelerator
policy engine in the future and the VM image UUID may be used to query
Glance for metadata about accelerator requirements that may be stored
with the VM image.
* selected_rp: Information about the selected resource provider is
passed as a dictionary.
* flavor_extra_specs: the extra_specs field in the flavor, including
resource classes, traits and other fields interpreted by Cyborg.
* Called by Nova compute when an instance is started, unshelved, or
restored and after a resize or cold migration.
* Called before an instance is built, i.e., before the specification of
the instance is created. For libvirt-based hypervisors, this means
the call happens before the instance's domain XML is created.
* As part of this call, Cyborg Agent may fetch bitstreams from Glance and
initiate programming. It may fetch the bitstream specified in the
request's flavor extra specs, if any. If the request refers to a
function ID/name, Cyborg Agent would query Glance to find bitstreams
that provide the function and match the chosen device, and would then
fetch the needed bitstream.
* As part of this call, Cyborg Agent will locate the Deployable corresponding
to the chosen RP, locate the attach handles (e.g. PCI BDF) needed, update
its internal data structures in a persistent way, and return the needed
information back to Nova.
* Returns an array, with one entry per requested accelerator, each entry
being a dictionary. The dictionary is structured as below for Rocky:
| { "pci_id": <pci bdf> }
* unplug(instance_info)
* Parameters are all read-only. Here are their descriptions:
* instance_info: dictionary containing instance UUID and instance
name. The instance name is needed for better logging.
* Called when an instance is stopped, shelved, or deleted and before
a resize or cold migration.
* As part of this call, Cyborg Agent will clean up internal resources, call
the appropriate Cyborg driver to clean up the device resources and update
its data structures persistently.
* Returns the number of accelerators that were released. Errors may cause
exceptions to be thrown.
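To illustrate how the returned ``pci_id`` entries might be consumed, here is a
minimal sketch (not part of os-acc; the helper name is hypothetical) that turns
a PCI BDF into the libvirt ``<hostdev>`` element a libvirt-based hypervisor
would add to the instance's domain XML::

    import re

    def bdf_to_hostdev_xml(pci_bdf):
        # Hypothetical helper: converts a BDF such as '0000:3b:00.0' into a
        # libvirt <hostdev> element for PCI passthrough. Shown only to
        # illustrate consuming the {"pci_id": <pci bdf>} entries from plug().
        domain, bus, slot, function = re.match(
            r'(\w+):(\w+):(\w+)\.(\w+)', pci_bdf).groups()
        return (
            "<hostdev mode='subsystem' type='pci' managed='yes'>\n"
            "  <source>\n"
            "    <address domain='0x{}' bus='0x{}' slot='0x{}' "
            "function='0x{}'/>\n"
            "  </source>\n"
            "</hostdev>".format(domain, bus, slot, function)
        )

    # One <hostdev> element per accelerator entry returned by plug():
    for entry in [{"pci_id": "0000:3b:00.0"}]:
        print(bdf_to_hostdev_xml(entry["pci_id"]))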
Workflows
---------
The pseudocode for each os-acc API can be expressed as below::
    def initialize():
        # checks that all devices are discovered and their traits published
        # waits if any discovery operation is ongoing
        return None

    def plug(instance_info, selected_rp, flavor_extra_specs):
        validate_params(....)
        glance = glanceclient.Client(...)
        driver = # select Cyborg driver for chosen RP
        rp_deployable = # get deployable for RP
        if extra specs refer to ``CUSTOM_FPGA_<vendor>_REGION_<uuid>``
           and to ``bitstream:<uuid>``:
            bitstream_uuid = # from ``bitstream:<uuid>`` in the extra specs
            bitstream = glance.images.data(bitstream_uuid)
            driver.program(bitstream, rp_deployable, ...)
        if extra specs refer to ``CUSTOM_FPGA_<vendor>_FUNCTION_<uuid>``
           and to a function UUID/name:
            region_type_uuid = # fetch from selected RP
            bitstreams = glance.images.list(...)
            # queries Glance by function UUID/name property and region type
            # UUID to get matching bitstreams
            if len(bitstreams) > 1:
                error(...)  # bitstream choice policy is outside Cyborg
            driver.program(bitstreams[0], rp_deployable, ...)
        pci_bdf = driver.allocate_handle(...)
        # update Cyborg DB with instance_info and BDF usage
        return [{"pci_id": pci_bdf}]  # one entry per requested accelerator

    def unplug(instance_info):
        bdf_list = # fetch BDF usage from Cyborg DB for instance
        # call the Cyborg driver to release the device resources
        # update Cyborg DB to mark those BDFs as free
        return len(bdf_list)
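To make the Glance lookup concrete, here is a hedged sketch of how the agent
might query for matching bitstreams, assuming bitstream images carry
hypothetical properties such as ``accel:function_uuid`` and
``accel:region_type_uuid`` (the actual property names are the subject of
[#BitstreamSpec]_)::

    from glanceclient import Client

    def find_bitstreams(glance, function_uuid, region_type_uuid):
        # The 'accel:*' property names below are assumptions used only for
        # illustration; Glance v2 supports filtering on image properties.
        filters = {
            'accel:function_uuid': function_uuid,
            'accel:region_type_uuid': region_type_uuid,
        }
        return list(glance.images.list(filters=filters))

    # glance = Client('2', endpoint=image_endpoint, token=auth_token)
    # bitstreams = find_bitstreams(glance, fn_uuid, region_uuid)
    # If more than one bitstream matches, the choice policy is outside Cyborg.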
Alternatives
------------
N/A
Data model impact
-----------------
None
REST API impact
---------------
None
Security impact
---------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
None
Performance Impact
------------------
None
Other deployer impact
---------------------
None
Developer impact
----------------
None
Implementation
==============
Assignee(s)
-----------
None
Work Items
----------
* Decide how to associate multiple functions/bitstreams in extra specs
with multiple devices in the flavor.
* Decide specific changes needed in Cyborg conductor, db, agent and drivers.
* Others: TBD
Dependencies
============
* Nested Resource Provider support in Nova
* `Nova Granular Requests
<https://specs.openstack.org/openstack/nova-specs/specs/queens/approved/granular-resource-requests.html>`_
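For reference, granular requests let a flavor tie each requested accelerator
to its own group of traits; a request for two accelerators of different region
types might look as follows (the trait names are illustrative)::

    resources1:CUSTOM_ACCELERATOR=1
    trait1:CUSTOM_FPGA_<vendor>_REGION_<uuid1>=required
    resources2:CUSTOM_ACCELERATOR=1
    trait2:CUSTOM_FPGA_<vendor>_REGION_<uuid2>=required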
Testing
=======
For each vendor driver supported in this release, we need to integrate the
corresponding FPGA type(s) in the CI infrastructure.
Documentation Impact
====================
The behavior with respect to accelerators during various instance operations
(reboot, pause, etc.) must be documented. The procedure to upload a bitstream,
including applying Glance properties, must also be documented.
References
==========
.. [#CyborgNovaSched] `Cyborg Nova Scheduling Specification
<https://review.openstack.org/#/c/554717/>`_
.. [#BitstreamSpec] `Cyborg bitstream metadata standardization spec
<https://review.openstack.org/#/c/558265/>`_
.. [#ServerConcepts] `OpenStack Server API Concepts
<https://developer.openstack.org/api-guide/compute/server_concepts.html>`_
History
=======
Optional section, intended to be used each time the spec is updated, to
describe new designs, APIs, or database schema updates. Useful to let the
reader understand what has happened over time.
.. list-table:: Revisions
:header-rows: 1
* - Release Name
- Description
* - Rocky
- Introduced