Merge "Specification for Cyborg/Nova interaction for scheduling."

2018-06-11 03:45:04 +00:00 · 2018-06-11 03:45:04 +00:00 · 4ec459afb2
parent 524703c347 ebb94796bd
commit 4ec459afb2
1 changed files with 486 additions and 0 deletions
--- a/doc/specs/rocky/cyborg-nova-sched.rst
+++ b/doc/specs/rocky/cyborg-nova-sched.rst
@ -0,0 +1,486 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+==========================================
+Cyborg-Nova Interaction for Scheduling
+==========================================
+
+https://blueprints.launchpad.net/cyborg/+spec/cyborg-nova-interaction
+
+Cyborg provides a general management framework for accelerators, such
+as FPGAs, GPUs, etc. For scheduling an instance that needs accelerators,
+Cyborg needs to work with Nova on three levels:
+
+* Representation and Discovery: Cyborg shall represent accelerators as
+  resources in Placement. When a device is discovered, Cyborg updates
+  resource providers, inventories, traits, etc. in Placement.
+
+* Instance placement/scheduling: Cyborg may provide a filter and/or weigher
+  that limit or prioritize hosts based on available accelerator resources,
+  but it is expected that Placement itself can handle most requirements.
+
+* Attaching accelerators to instances. In the compute node, Cyborg shall
+  define a workflow based on interacting with Nova through a new os-acc
+  library (similar to os-vif and os-brick).
+
+This spec addresses the first two aspects. There is another spec to
+address the attachment of accelerators to instances [#os-acc]_.
+Cyborg also needs to handle some aspects for FPGAs without involving
+Nova, specifically FPGA programming and bitstream management. They
+will be covered in other specs. This spec is independent of those specs.
+
+This spec is common to all accelerators, including GPUs, High Precision
+Time Synchronization (HPTS) cards, etc. Since FPGAs have more aspects to
+be considered than other devices, some sections may focus on FPGA-specific
+factors. The spec calls out the FPGA-specific aspects.
+
+Smart NICs based on FPGAs fall into two categories: those which expose
+the FPGA explicitly to the host, and those that do not. Cyborg's scope
+includes the former. This spec includes such devices, though the
+Cyborg-Neutron interaction is out of scope.
+
+The scope of this spec is Rocky release.
+
+Terminology
+===========
+* Accelerator: The unit that can be assigned to an instance for
+  offloading specific functionality. For non-FPGA devices, it is either the
+  device itself or a virtualized version of it (e.g. vGPUs). For FPGAs, an
+  accelerator is either the entire device, a region within the device or a
+  function.
+
+* Bitstream: An FPGA image, usually a binary file, possibly with
+  vendor-specific metadata. A bitstream may implement one or more functions.
+
+* Function: A specific functionality, such as matrix multiplication or video
+  transcoding, usually represented as a string or UUID. This term may be used
+  with multi-function devices, including FPGAs and other fixed function
+  hardware like Intel QuickAssist.
+
+* Region: A part of the FPGA which can be programmed without disrupting
+  other parts of that FPGA. If an FPGA does not support Partial
+  Reconfiguration, the entire device constitutes one region. A region
+  may implement one or more functions.
+
+Here is an example diagram for an FPGA with multiple regions, and multiple
+functions in a region::
+
+         PCI A     PCI B
+          |        |
+  +-------|--------|-------------------+
+  |       |        |                   |
+  |  +----|--------|---+   +--------+  |
+  |  | +--|--+ +---|-+ |   |        |  |
+  |  | | Fn A| | Fn B| |   |        |  |
+  |  | +-----+ +-----+ |   |        |  |
+  |  +-----------------+   +--------+  |
+  |  Region 1              Region 2    |
+  |                                    |
+  +------------------------------------+
+
+Problem description
+===================
+Cyborg's representation and handling of accelerators needs to be consistent
+with Nova's Placement API. Specifically, they must be modeled in terms of
+Resource Providers (RPs), Resource Classes (RCs) and Traits.
+
+Though PCI Express is entrenched in the data center, some accelerators
+may be exposed to the host via some other protocol. Even with PCI, the
+connections between accelerator components and PCI functions
+may vary across devices. Accordingly, Cyborg should not represent
+accelerators as PCI functions.
+
+For instances that need accelerators, we need to define a way for Cyborg
+to be included seamlessly in the Nova scheduling workflow.
+
+Use Cases
+---------
+We need to satisfy the following use cases for the tenant role:
+
+* Device as a Service (DaaS): The flavor asks for a device.
+
+  * FPGA variation: The flavor asks for a device to which specific
+    bitstream(s) can be applied. There are three variations, the first
+    two of which delegate bitstream programming to Cyborg for secure
+    programming:
+
+    * Request-time Programming: The flavor specifies a bitstream. (Cyborg
+      applies the bitstream before instance bringup. This is similar to
+      AWS flow.)
+
+    * Run-time Programming: The instance may request one or more
+      bitstreams dynamically. (Cyborg receives the request and does
+      the programming.)
+
+    * Direct Programming: The instance directly programs the FPGA
+      region assigned to it, without delegating it to Cyborg. The
+      security questions that this raises need to be addressed in
+      the future. (This is listed only for completeness; this is not
+      going to be addressed in Rocky, or even future releases till
+      the security concerns are fully addressed.)
+
+* Accelerated Function as a Service (AFaaS): The flavor asks for a
+  function (e.g. ipsec) attached to the instance. The operator may
+  satisfy this use case in two ways:
+
+  * Pre-programmed: Do not allow orchestration to modify any function,
+    for any of these reasons:
+
+    * Only fixed function hardware is available. (E.g. ASICs.)
+
+    * Operational simplicity.
+
+    * Assure tenants of programming security, by doing all programming offline
+      through some audited process.
+
+  * For FPGAs, allow orchestration to program as needed, to maximize
+    flexibility and availability of resources.
+
+An operator must be able to provide both Device as a Service and Accelerated
+Function as a Service in the same cluster, to serve all
+kinds of users: those who are device-agnostic, those using 3rd party
+bitstreams, and those using their own bitstreams (incl. developers).
+
+The goal for Cyborg is to provide the mechanisms to enable all these use
+cases.
+
+In this spec, we do not consider bitstream developer or device developer
+roles. Also, we assume that each accelerator device is dedicated to a
+compute node, rather than shared among several nodes.
+
+Proposed change
+===============
+
+Representation
+--------------
+
+  * Cyborg will represent a generic accelerator for a device type as a
+    custom Resource Class (RC) for that type, of the form
+    CUSTOM_ACCELERATOR_<device-type>. E.g. CUSTOM_ACCELERATOR_GPU,
+    CUSTOM_ACCELERATOR_FPGA, etc. This helps in defining separate quotas
+    for different device types.
+
+  * Device-local memory is the memory available to the device alone,
+    usually in the form of DDR, QDR or High Bandwidth Memory in the
+    PCIe board along with the device. It can also be represented as an
+    RC of the form CUSTOM_ACCELERATOR_MEMORY_<memory-type>. E.g.
+    CUSTOM_ACCELERATOR_MEMORY_DDR. A single PCIe board may have more
+    than one type of memory.
+
+  * In addition, each device/region is represented as a Resource Provider
+    (RP). This enables traits to be applied to it and other RPs/RCs to
+    be contained within it. So, a device RP provides one or more instances
+    of that device type's RC. This depends on nested RP support in
+    Nova [#nRP]_.
+
+       * For FPGAs, both the device and the regions within it will be
+         represented as RPs. This allows the hierarchy within an FPGA
+         to be naturally modelled as an RP hierarchy.
+
+       * Using Nested RPs is the preferred way. But, until Nova
+         supports nested RPs, Cyborg shall associate the
+         RCs and traits (described below) with the compute node RPs. This
+         requires that all devices on a single host must share the same
+         traits. If nested RP support becomes usable after Rocky release,
+         the operator needs to handle the upgrade as below:
+
+         * Terminate all instances using accelerators.
+
+         * Remove all Cyborg traits and inventory on all compute node RPs,
+           perhaps by running a script.
+
+         * Perform the Cyborg upgrade. Post-upgrade, the new agent/driver(s)
+           will create RPs for the devices and publish the traits
+           and inventory.
+
+  * Cyborg will associate a Device Type trait with each device, of the
+    form CUSTOM_<device-type>-<vendor>. E.g. CUSTOM_GPU_AMD or
+    CUSTOM_FPGA_XILINX. This trait is intended to help match the
+    software drivers/libraries in the instance image. This is meant to
+    be used in a flavor when a single driver/library in the instance
+    image can handle most or all of device types from a vendor.
+
+       * For FPGAs, this trait and others will be applied to the region
+         RPs which are children of the device RPs as well.
+
+  * Cyborg will associate a Device Family trait with each device as
+    needed, of the form CUSTOM_<device-type>_<vendor>_<family>.
+    E.g. CUSTOM_FPGA_INTEL_ARRIA10.
+    This is not a product name, but the name of a device family, used to
+    match software in the instance image with the device family. This is
+    a refinement of the Device Type Trait. It is meant to be used in
+    a flavor when there are different drivers/libraries for different
+    device families. Since it may be tough to forecast whether a new
+    device family will need a new driver/library, it may make sense to
+    associate both these traits with the same device RP.
+
+  * For FPGAs, Cyborg will associate a region type trait with each region
+    (or with the FPGA itself if there is no Partial Reconfiguration
+    support), of the form CUSTOM_FPGA_REGION_<vendor>__<uuid>.
+    E.g.  CUSTOM_FPGA_REGION_INTEL_<uuid>. This is needed for Device as a
+    Service with FPGAs.
+
+  * For FPGAs, Cyborg may associate a function type trait with a region
+    when the region gets programmed, of the form
+    CUSTOM_FPGA_FUNCTION_<vendor>_<uuid>. E.g.
+    CUSTOM_FPGA_FUNCTION_INTEL_<gzip-uuid>.
+    This is needed for AFaaS use case. This is updated when Cyborg
+    reprograms a region as part of AFaaS request.
+
+  * For FPGAs, Cyborg should associate a CUSTOM_PROGRAMMABLE trait with
+    every region. This is needed to lay the groundwork for
+    multi-function accelerators in the future. Flavors should ask for
+    this trait, except in the pre-programmed case.
+
+  * For FPGAs, since they may implement a wide variety of functionality,
+    we may also attach a Functionality Trait.
+    E.g. CUSTOM_FPGA_COMPUTE, CUSTOM_FPGA_NETWORK, CUSTOM_FPGA_STORAGE.
+
+  * The Cyborg agent needs to get enough information from the Cyborg driver
+    to create the RPs, RCs and traits. In particular, it needs to get the
+    device type string, region IDs and function IDs from the driver. This
+    requires the driver/agent interface to be enhanced [#drv-api]_.
+
+  * The modeling in Placement represents generic virtual accelerators as
+    resource classes, and devices/regions as RPs. This is PCI-agnostic.
+    However, many FPGA implementations use PCI Express in general, and
+    SR-IOV in particular. In those cases, it is expected that Cyborg will
+    pass PCI VFs to instances via PCI Passthrough, and retain the PCI PF
+    in the host for management.
+
+Flavors
+-------
+  For the sake of illustrating how the device representation in Nova
+  can be used, and for completeness, we now show how to define flavors
+  for various use cases. Please see [#flavor]_ for more details.
+
+  * A flavor that needs device access always asks for one or more instances
+    of 'resource:CUSTOM_ACCELERATOR_<device-type>'. In addition, it
+    needs to specify the right traits.
+
+  * Example flavor for DaaS:
+
+    | ``resources:CUSTOM_ACCELERATOR_HPTS=1``
+    | ``trait:CUSTOM_HPTS_ZTE=required``
+
+    NOTE: For FPGAs, the flavor should also include CUSTOM_PROGRAMMABLE trait.
+
+  * Example flavor for AFaaS Pre-programed:
+
+    | ``resources:CUSTOM_ACCELERATOR_FPGA=1``
+    | ``trait:CUSTOM_FPGA_INTEL_ARRIA10=required``
+    | ``trait:CUSTOM_FPGA_FUNCTION_INTEL_<gzip-uuid>=required``
+
+  * Example flavor for AFaaS Orchestration-Programmed:
+
+    | ``resources:CUSTOM_ACCELERATOR_FPGA=1``
+    | ``trait:CUSTOM_FPGA_INTEL_ARRIA10=required``
+    | ``trait:CUSTOM_PROGRAMMABLE=required``
+    | ``function:CUSTOM_FPGA_FUNCTION_INTEL_<gzip-uuid>=required``
+      (Not interpreted by Nova.)
+
+    * NOTE: When Nova supports preferred traits, we can use that instead
+      of 'function' keyword in extra specs.
+
+    * NOTE: For Cyborg to fetch the bitstream for this function, it
+      is assumed that the operator has configured the function UUID
+      as a property of the bitstream image in Glance.
+
+  * Another example flavor for AFaaS Orchestration-Programmed which
+    refers to a function by name instead of UUID for ease of use:
+
+    | ``resources:CUSTOM_ACCELERATOR_FPGA=1``
+    | ``trait:CUSTOM_FPGA_INTEL_ARRIA10=required``
+    | ``trait:CUSTOM_PROGRAMMABLE=required``
+    | ``function_name:<string>=required``
+      (Not interpreted by Nova.)
+
+    * NOTE: This assumes the operator has configured the function name
+      as a property of the bitstream image in Glance. The FPGA
+      hardware is not expected to expose function names, and so
+      Cyborg will not represent function names as traits.
+
+  * A flavor may ask for other RCs, such as local memory.
+
+  * A flavor may ask for multiple accelerators, using the granular resource
+    request syntax. Cyborg can tie function and bitstream fields in
+    the extra_specs to resources/traits using an extension of the granular
+    resource request syntax (see References) which is not interpreted by Nova.
+
+    | ``resourcesN: CUSTOM_ACCELERATOR_FPGA=1``
+    | ``traitsN: CUSTOM_FPGA_INTEL_ARRIA10=required``
+    | ``othersN: function:CUSTOM_FPGA_FUNCTION_INTEL_<gzip-uuid>=required``
+
+Scheduling workflow
+--------------------
+We now look at the scheduling flow when each device implements only
+one function. Devices with multiple functions are outside the scope for now.
+
+  * A request spec with a flavor comes to Nova conductor/scheduler.
+
+  * Placement API returns the list of RPs which contain the requested
+    resources with matching traits. (With nested RP support, the returned
+    RPs are device/region RPs. Without it, they are compute node RPs.)
+
+  * FPGA-specific: For AFaaS orchestration-programmed use case, Placement
+    will return matching devices but they may not have the requested
+    function. So, Cyborg may provide a weigher which checks the
+    allocation candidates to see which ones have the required function trait,
+    and ranks them higher. This requires no change to Cyborg DB.
+
+  * The request_spec goes to compute node (ignoring Cells for now).
+
+    NOTE: When one device/region implements multiple functions and
+    orchestration-driven programming is desired, the inventory of that
+    device needs to be adjusted.
+    This can be addressed later and is not a priority for Rocky release.
+    See References.
+
+  * Nova compute calls os-acc/Cyborg [#os-acc]_.
+
+  * FPGA-specific: If the request spec asks for a function X in extra specs,
+    but X is not present in the selected region RP, Cyborg should program
+    that region.
+
+  * Cyborg should associate RPs/RCs and PFs/VFs with Deployables in its
+    internal DB. It can use such mappings associating the requested resource
+    (device/function) with some attach handle that can be used to
+    attach the resource to an instance (such as a PCI function).
+
+NOTE : This flow is PCI-agnostic: no PCI whitelists involved.
+
+Handling Multiple Functions Per Device
+--------------------------------------
+
+Alternatives
+------------
+
+N/A
+
+Data model impact
+-----------------
+
+Following changes are needed in Cyborg.
+
+* Do not publish PCI functions as resources in Nova. Instead, publish
+  RC/RP info to Nova, and keep RP-PCI mapping internally.
+
+* Cyborg should associate RPs/RCs and PFs/VFs with Deployables in its
+  internal DB.
+
+* Driver/agent interface needs to report device/region types so that
+  RCs can be created.
+
+* Deployables table should track which RP corresponds to each Deployable.
+
+REST API impact
+---------------
+
+None
+
+Security impact
+---------------
+
+This change allows tenants to initiate FPGA bitstream programming. To mitigate
+the security impact, it is proposed that only 2 methods are offered for
+programming (flavor asks for a bitstream, or the running instance asks for
+specific bitstreams) and both are handled through Cyborg. There is no direct
+access from an instance to an FPGA.
+
+Notifications impact
+--------------------
+
+None
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+Other deployer impact
+---------------------
+
+None
+
+Developer impact
+----------------
+
+None
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+None
+
+Work Items
+----------
+
+* Decide specific changes needed in Cyborg conductor, db, agent and drivers.
+
+Dependencies
+============
+
+* `Nested Resource Providers
+  <http://specs.openstack.org/openstack/nova-specs/specs/rocky/approved/nested-resource-providers-allocation-candidates.html>`_
+
+* `Nova Granular Requests
+  <https://specs.openstack.org/openstack/nova-specs/specs/rocky/approved/granular-resource-requests.html>`_
+
+NOTE: the granular requests feature is needed to define a flavor that requests
+non-identical accelerators, but is not needed for Cyborg development in Rocky.
+
+Testing
+=======
+
+For each vendor driver supported in this release, we need to integrate the
+corresponding FPGA type(s) in the CI infrastructure.
+
+Documentation Impact
+====================
+
+None
+
+References
+==========
+
+.. [#os-acc] `Specification for Compute Node <https://review.openstack.org/#/c/566798/>`_
+
+.. [#nRP] `Nested RPs in Rocky <http://specs.openstack.org/openstack/nova-specs/specs/rocky/approved/nested-resource-providers-allocation-candidates.html>`_
+
+.. [#drv-api] `Specification for Cyborg Agent-Driver API <https://review.openstack.org/#/c/561849/>`_
+
+.. [#flavor] `Custom Resource Classes in Flavors <https://specs.openstack.org/openstack/nova-specs/specs/pike/implemented/custom-resource-classes-in-flavors.html>`_
+
+.. [#qspec] `Cyborg Nova Queens Spec <https://github.com/openstack/cyborg/blob/master/doc/specs/queens/approved/cyborg-nova-interaction.rst>`_
+
+.. [#ptg] `Rocky PTG Etherpad for Cyborg Nova Interaction <https://etherpad.openstack.org/p/cyborg-ptg-rocky-nova-cyborg-interaction>`_
+
+.. [#multifn] `Detailed Cyborg/Nova scheduling <https://etherpad.openstack.org/p/Cyborg-Nova-Multifunction>`_
+
+.. [#mails] `Openstack-dev email discussion <http://lists.openstack.org/pipermail/openstack-dev/2018-April/128951.html>`_
+
+
+
+History
+=======
+
+Optional section intended to be used each time the spec is updated to describe
+new design, API or any database schema updated. Useful to let reader know
+what happened over time.
+
+.. list-table:: Revisions
+   :header-rows: 1
+
+   * - Release Name
+     - Description
+   * - Rocky
+     - Introduced