Specification for Cyborg Agent-Driver API.

Enhanced the device description model (which is structured
as a Python dictionary) to align it with RCs and traits
that are published by Cyborg in Nova.

Change-Id: I27a4bcab6379652e6893aee7fcc684036db93d61
Proposes: spec cyborg-agent-driver-api
Sundar Nadathur 2018-04-17 02:25:26 -07:00
parent ce0b6e5a95
commit cda3ca7b35
1 changed files with 222 additions and 0 deletions

@ -0,0 +1,222 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
==========================================
Cyborg Agent-Driver API
==========================================
The Cyborg agent interacts with each Cyborg driver on the compute node to
discover the available devices. This spec defines the structure of the
agent-driver API.
No change is proposed to the way the agent discovers the drivers on
start or restart.
This spec is common to all accelerators, including GPUs, High Precision
Time Synchronization (HPTS) cards, etc. Since FPGAs have more aspects to
be considered than other devices, some sections may focus on FPGA-specific
factors. The spec calls out the FPGA-specific aspects.
The scope of this spec is the Rocky release, but the API has been designed
to be extensible for future releases. Accordingly, the spec calls out
the Rocky-specific aspects.
Problem description
===================
The [#Cyborg_Nova_scheduling_spec]_ specifies that devices are
represented using Resource Providers (RPs), Resource Classes (RCs)
and traits. The information needed to create them has to come from
the Cyborg driver to the Cyborg agent, which in turn needs to
push it to the Cyborg Conductor.
The main challenge is discovering the device topology for FPGAs.
An FPGA may have one or more Partial Reconfiguration regions,
and those regions may have one or more accelerators nested inside them.
Further, it may have local memory that is either partitioned or
shared among the regions.
Use Cases
---------
* Devices of different types (GPUs, FPGAs, HPTS cards, Quick Assist) are
present in the same host.
* FPGAs of different types, possibly from different vendors, are present
in the same host.
* An FPGA may have one or more regions. Each region may have one
or more accelerators.
* In Rocky, we may support only one region per FPGA, and only one
accelerator per region.
* For Rocky, it is proposed that local memory need not be exposed as
a resource to orchestration. With only one region per FPGA, an
instance attached to that region can access all of the local memory,
however much there is. For non-FPGA devices such as GPUs, there does
not seem to be a requirement to expose video RAM.
Cyborg will assume and handle the following component relationships:
* One product (e.g. Intel PAC Arria 10) may correspond to multiple
PCI vendor/device IDs.
* One PCI vendor/device ID may correspond to different region type IDs.
This could be either because there are multiple regions in the same device
or because there are different versions/revisions of the same device.
* But the same region type ID will never show up in products with
different PCI IDs.
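The last relationship above is an invariant that could be checked when
device data is collected. The following is a minimal sketch under stated
assumptions: the function name, the tuple layout, and the sample IDs are
hypothetical and are not part of the Cyborg code base.

```python
from collections import defaultdict


def find_region_type_conflicts(pairs):
    """Return region type IDs that were seen with more than one PCI ID.

    ``pairs`` is an iterable of (pci_id, region_type_id) tuples. This
    layout is an assumption made for the example, not a Cyborg structure.
    """
    seen = defaultdict(set)
    for pci_id, region_type_id in pairs:
        seen[region_type_id].add(pci_id)
    # An empty result means the invariant holds for the given pairs.
    return {rid: ids for rid, ids in seen.items() if len(ids) > 1}


# Made-up IDs: one PCI ID maps to two region types, which is allowed.
allowed = [("aaaa:0001", "region-a"), ("aaaa:0001", "region-b")]
# The same region type under two PCI IDs violates the assumption.
violating = allowed + [("aaaa:0002", "region-a")]

print(find_region_type_conflicts(allowed))
print(find_region_type_conflicts(violating))
```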
Proposed change
===============
Today, the Cyborg agent invokes the discover() API of each driver
that it finds. The discover() API returns a dictionary keyed by
the PCI BDF (bus, device, function) of each device. The value for
each key describes the components and characteristics of the device
with that BDF.
We propose to retain the same model, but enhance the dictionary to
include enough information to create the resource providers and traits
needed to populate Placement. Here are the additional proposed keys
in the device dictionary for each PF:
| ``"type": <enum-string>`` # One of GPU, FPGA, etc.
| ``"vendor": <string>``
| ``"product": <string>``
Also, in the ``regions`` entry for each PF, it is proposed to add
the following keys:
| ``"region-type-uuid": <uuid>`` # Optional, default: NULL
| ``"bitstream-id": <uuid>`` # Glance/other UUID, optional, default: NULL
| ``"function-uuid": <uuid>`` # Optional, default: NULL
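As an illustration only, a device dictionary carrying the proposed keys
might look like the sketch below. The BDF, the UUID values, and the shape
of the ``regions`` entry (shown here as a list of dictionaries) are
assumptions made for the example; this spec defines only the key names.
Python ``None`` stands in for the NULL default.

```python
# Hypothetical discover() result for one FPGA physical function (PF).
# All values are made up; only the key names come from the spec.
example_devices = {
    "0000:5e:00.0": {  # PCI BDF of the PF (made-up value)
        "type": "FPGA",
        "vendor": "Intel",
        "product": "PAC Arria10",
        "regions": [  # assumed to be a list of per-region dictionaries
            {
                "region-type-uuid": "11111111-2222-3333-4444-555555555555",
                "bitstream-id": None,  # optional, no bitstream recorded
                "function-uuid": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
            },
        ],
    },
}

# Walk the dictionary the way a consumer might, tolerating missing keys.
for bdf, device in example_devices.items():
    print(bdf, device["type"], device["vendor"], device["product"])
    for region in device.get("regions", []):
        print("  region type:", region.get("region-type-uuid"))
        print("  function:", region.get("function-uuid"))
```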
When the agent receives this dictionary for a device, it will do
the following:
* If there is nested RP support, create an RP for the device and each
region within.
* Create a device type trait: ``CUSTOM_<type>_<vendor>_<product>``.
Apply it to the device RP (if nRP support exists) or the compute node RP.
* E.g. CUSTOM_FPGA_INTEL_PAC_ARRIA10.
* NOTE: The agent will convert all characters to upper case, replace
spaces with underscores, and check for conformance to custom trait
syntax (see [#Custom_traits]_)
* Create region type traits for each region, of the form:
``CUSTOM_<type>_<vendor>_REGION_<type-uuid>``. Apply them to the
corresponding region RP (if nRP support exists) or the compute node RP.
* E.g. CUSTOM_FPGA_INTEL_REGION_<type-uuid>
* NOTE: For UUIDs, the agent will convert all hexadecimal digits to upper
case, replace hyphens with underscores and validate all characters.
* Create function type traits for each function in each region, of the form:
``CUSTOM_<type>_<vendor>_FUNCTION_<function-uuid>``. Apply them to the
corresponding region RP (if nRP support exists) or the compute node RP.
* E.g. CUSTOM_FPGA_INTEL_FUNCTION_<function-uuid>
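The trait naming rules above can be sketched as follows. This is not the
agent's implementation: the helper names are hypothetical, and the syntax
check assumes that a custom trait is ``CUSTOM_`` followed by upper-case
letters, digits, or underscores, with a maximum length of 255 characters.

```python
import re
import uuid

# Assumed custom trait syntax; see the lead-in above.
_CUSTOM_TRAIT_RE = re.compile(r"CUSTOM_[A-Z0-9_]+")
_MAX_TRAIT_LEN = 255


def _normalize(text):
    """Upper-case the text and replace spaces with underscores."""
    return text.upper().replace(" ", "_")


def _normalize_uuid(value):
    """Validate a UUID string, upper-case it, and replace hyphens."""
    parsed = uuid.UUID(value)  # raises ValueError on malformed input
    return str(parsed).upper().replace("-", "_")


def _checked(trait):
    """Return the trait unchanged, or raise if it breaks the assumed syntax."""
    if len(trait) > _MAX_TRAIT_LEN or not _CUSTOM_TRAIT_RE.fullmatch(trait):
        raise ValueError("invalid custom trait: %r" % trait)
    return trait


def device_trait(dev_type, vendor, product):
    """Build CUSTOM_<type>_<vendor>_<product>."""
    parts = (_normalize(dev_type), _normalize(vendor), _normalize(product))
    return _checked("CUSTOM_%s_%s_%s" % parts)


def region_trait(dev_type, vendor, region_type_uuid):
    """Build CUSTOM_<type>_<vendor>_REGION_<type-uuid>."""
    parts = (_normalize(dev_type), _normalize(vendor),
             _normalize_uuid(region_type_uuid))
    return _checked("CUSTOM_%s_%s_REGION_%s" % parts)


def function_trait(dev_type, vendor, function_uuid):
    """Build CUSTOM_<type>_<vendor>_FUNCTION_<function-uuid>."""
    parts = (_normalize(dev_type), _normalize(vendor),
             _normalize_uuid(function_uuid))
    return _checked("CUSTOM_%s_%s_FUNCTION_%s" % parts)


print(device_trait("FPGA", "Intel", "PAC Arria10"))
```

A product string that contains characters outside the assumed syntax, such
as a hyphen, is rejected with ``ValueError`` rather than silently altered.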
Alternatives
------------
N/A
Data model impact
-----------------
Add the new fields to the database under Deployables and Attributes.
REST API impact
---------------
None
Security impact
---------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
None
Performance Impact
------------------
None
Other deployer impact
---------------------
None
Developer impact
----------------
None
Implementation
==============
Assignee(s)
-----------
None
Work Items
----------
Dependencies
============
None
Testing
=======
Unit tests need to be updated to check for the newly added fields.
Documentation Impact
====================
None
References
==========
.. [#Cyborg_Nova_scheduling_spec] `Cyborg/Nova Scheduling spec <https://review.openstack.org/#/c/554717>`_
.. [#Custom_traits] `Custom Traits <http://specs.openstack.org/openstack/nova-specs/specs/pike/implemented/resource-provider-traits.html#rest-api-impact>`_
History
=======
Optional section, intended to be used each time the spec is updated, to
describe any new design, API, or database schema change. It helps readers
understand how the spec has evolved over time.

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Rocky
     - Introduced