Merge "Cyborg NVIDIA GPU Driver support vGPU management"

This commit is contained in:
Zuul 2020-11-27 09:07:43 +00:00 committed by Gerrit Code Review
commit 635eb882ee
1 changed files with 337 additions and 0 deletions

View File

@ -0,0 +1,337 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
================================================
Cyborg NVIDIA GPU Driver support vGPU management
================================================
The Cyborg NVIDIA GPU Driver has implemented pGPU management in the Train
release, this spec proposes the specification of supporting vGPU management
in the same driver.
Problem description
===================
GPU devices can provide supercomputing capabilities, and can replace the CPU
to provide users with more efficient computing power at a lower cost. GPU cloud
servers have great value in the following application scenarios, including:
video encoding and decoding, scientific research and artificial intelligence
(deep learning, machine learning).
In the OpenStack ecosystem, users can now use Nova to pass gpu resources to
guest by two methods:
* Pass the GPU hardware to the guest (PCI pass-through).
* Pass the Mediated Device(vGPU) to the guest.
With the long-term goal that Cyborg will manage heterogeneous accelerators
including GPUs, Cyborg needs to support GPU management and integrate with Nova
to provide users with gpu resources allocation in the aforementioned methods.
The existing Cyborg GPU driver, NVIDIA GPU Driver, has supported the first
method (PCI pass-through), while the second method is not yet supported.
Please see ref [1]_ for Nova-Cyborg vGPU integration spec.
Use Cases
---------
* When the user is using Cyborg to manage GPU devices, he/she wants to boot
up a VM with Nvidia GPU (pGPU or vGPU) attached in order to accelerate the
video coding and decoding, Cyborg should be able to manage this kind of
acceleration resources and to assign it to the VM(binding).
Proposed changes
================
To be clear, in the following, we will describe the whole process of how does
the NVIDIA GPU Driver discover, generate Cyborg specific driver objects of the
vGPU devices(comply with Cyborg Database Model), and report it to cyborg-db
and Placement by cyborg-conductor. Features that are aleady supported in
current branch is marked as DONE, new changes are marked as NEW CHANGES.
1. Collect raw info of GPU devices from compute node by "lspci" and grep
nvidia related keyword.(DONE)
2. Parsing details from each record including ``vendor_id``, ``product_id``
and ``pci_address``.(DONE)
3. Generate Cyborg specific driver objects and resource provider modeling
for the GPU device as well as its mdiated devices. Below is the objects to
describe a vGPU devices which complies with the Cyborg database mode [4]_
and placement data model [5]_.(NEW CHANGE)
::
Hardware Driver objects Placement data model
| | |
1 GPU 1 device |
| | |
| 1 deployable ---> resource_provider
| | ---> parent resource_provider: compute node
| | |
4 vGPUs 4 attach_handles ---> inventories(total:4)
4. Supporting set the vGPU type for a specific GPU device in cyborg.conf. The
implementation is similar to that in Nova [9]_.(NEW CHANGE)
* Firstly, we propose [gpu]/enabled_vgpu_types to define which vgpu type Cyborg
driver can use:
::
[gpu]
enabled_vgpu_types = [str_vgpu_type_1, str_vgpu_type_2, ...]
* And also, we propose that Cyborg driver will accept configuration sections
that are related to the [gpu]/enabled_vgpu_types and specifies which
exact pGPUs are related to the enabled vGPU types and will have a
device_addresses option defined like this:
::
cfg.ListOpt('device_addresses',
default=[],
help="""
List of physical PCI addresses to associate with a specific GPU type.
The particular physical GPU device address needs to be mapped to the vendor
vGPU type which that physical GPU is configured to accept. In order to
provide this mapping, there will be a CONF section with a name
corresponding to the following template: "vgpu_type_%(vgpu_type_name)s
The vGPU type to associate with the PCI devices has to be the section name
prefixed by ``vgpu_``. For example, for 'nvidia-11', you would declare
``[vgpu_nvidia-11]/device_addresses``.
Each vGPU type also has to be declared in ``[devices]/enabled_vgpu_types``.
Related options:
* ``[gpu]/enabled_vgpu_types``
"""),
For example, it would be set in cyborg.conf
::
[gpu]
enabled_vgpu_types = nvidia-223,nvidia-224
[vgpu_nvidia-223]
device_addresses = 0000:af:00.0,0000:86:00.0
[vgpu_nvidia-224]
device_addresses = 0000:87:00.0
5. Generate resource_class and traits for device, which later will also be
reported to Placement, and used by nova-scheduler to filter appropriate
accelerators.(NEW CHANGE)
* ``resource class`` follows standard resources classes used by OpenStack [6]_.
Pass-through GPU device will report 'PGPU' as its resource class,
Virtualized GPU device will report 'VGPU' as its resource class.
* ``traits`` follows the placement custom trait format [7]_. In the Cyborg
driver, it will report two traits for vGPU accelerator using the format
below:
trait1: **OWNER_CYBORG**.
trait2: **CUSTOM_<VENDOR_NAME>_<PRODUCT_ID>_<Virtual_GPU_Type>**.
Meaning of each parameter is listed below.
* OWNER_CYBORG: a new namespace in os-traits to remark that a device is
reported by Cyborg when the inventory is reported to placement. It is used
to distinguish GPU devices reported by Nova.
* VENDOR_NAME: vendor name of the GPU device.
* PRODUCT_ID: product ID of the GPU device.
* Virtual_GPU_Type: this parameter is actually another format of the
enabled_vgpu_types for a specific device set by admin in cyborg.conf.
In order to generate this param, driver will first retrieve
``enabled_vgpu_type`` and then map it to Virtual_GPU_Type by the way
showed below. The name is exactly the Virtual_GPU_Type that will be
reported in traits. For more details about the valid Virtual GPU Types
for supported GPUs, please refer to [8]_.
::
# find mapping relation between Virtual_GPU_Type and enabled_vgpu_type.
# The value in "name" file contains its corresponding Virtual_GPU_Type.
cat /sys/class/mdev_bus/{device_address}/mdev_supported_types/{enabled_vgpu_type}/name
* Here is a example to show the traits of a GPU device in the real world.
* A Nvidia Tesla T4 device has been successfully installed on host,
device address is 0000:af:00.0. In addition, the vendors vGPU driver
software must be installed and configured on the host at the same time.
::
[vtu@ubuntudbs ~]# lspci -nnn -D|grep 1eb8
0000:af:00.0 3D controller [0302]: NVIDIA Corporation TU104GL [Tesla T4] [10de:1eb8] (rev a1)
* Enable GPU types (Accelerator)
1. Specify which specific GPU type(s) the instances would get from this
specific device.
Edit devices.enabled_vgpu_types and device_address in cyborg.conf:
::
[gpu]
enabled_vgpu_types=nvidia-223
[vgpu_nvidia-223]
device_addresses = 0000:af:00.0
2. Restart the cyborg-agent service.
* Finally, traits reported for this device(RP) will be:
**OWNER_CYBORG** and **CUSTOM_NVIDIA_1EB8_T4_2B**
.. NOTE::
For the last parameter "T4_2B" (<Virtual_GPU_type>), we can validate the
mapping relation between "nvidia-223" and "T4_2B" by check from the mdev
sys path:
::
[vtu@ubuntudbs mdev_supported_types]$ pwd
/sys/class/mdev_bus/0000:af:00.0/mdev_supported_types
[vtu@ubuntudbs mdev_supported_types]$ ls
nvidia-222 nvidia-225 nvidia-228 nvidia-231 nvidia-234 nvidia-320
nvidia-223 nvidia-226 nvidia-229 nvidia-232 nvidia-252 nvidia-321
nvidia-224 nvidia-227 nvidia-230 nvidia-233 nvidia-319
[vtu@ubuntudbs mdev_supported_types]$ cat nvidia-223/name
GRID T4-2B
6. Generate ``controlpath_id``, ``deployable``, ``attach_handle``,
``attribute`` for vGPU.(NEW CHANGE)
7. Create a mdev device in the sys by echo its UUID (actually is the
attach_handle UUID) to the create file when vgpu is bind to a VM.(NEW CHANGE)
create_file_path=
/sys/class/mdev_bus/{pci_address}/mdev_supported_types/{type-id}/create
8. Delete a mdev device from sys by echo "1" to the remove file when vgpu is
unbind from a VM.(NEW CHANGE)
remove_file_path=
/sys/class/mdev_bus/{pci_address}/mdev_supported_types/{type-id}/UUID/remove
Alternatives
------------
Using Nova to manage vGPU device [10]_.
Data model impact
-----------------
None
REST API impact
---------------
None
Security impact
---------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
None
Performance Impact
------------------
None
Other deployer impact
---------------------
This feature is highly dependent on the version of libvirt and the physical
devices present on the host.
For vGPU management, deployers need to make sure that the GPU device has been
successfully virtualized. Otherwise, Cyborg will report it as a pGPU device.
Please see ref [2]_ and [3]_ for how to install the Virtual GPU Manager package
to virtualize your GPU devices.
Developer impact
----------------
None
Implementation
==============
Assignee(s)
-----------
Primary assignee:
<yumeng-bao>
Work Items
----------
* Implement NVIDIA GPU Driver enhancement in Cyborg
* Add related test cases.
* Add test report to wiki and update the supported driver doc page
Dependencies
============
None
Testing
========
* Unit tests will be added to test this driver.
Documentation Impact
====================
Document Nvidia GPU driver in Cyborg project.
References
==========
.. [1] https://review.opendev.org/#/c/750116/
.. [2] https://docs.nvidia.com/grid/6.0/grid-vgpu-user-guide/index.html
.. [3] https://docs.nvidia.com/grid/6.0/grid-vgpu-user-guide/index.html#install-vgpu-package-generic-linux-kvm
.. [4] https://specs.openstack.org/openstack/cyborg-specs/specs/stein/implemented/cyborg-database-model-proposal.html
.. [5] https://docs.openstack.org/nova/rocky/user/placement.html#references
.. [6] https://github.com/openstack/os-resource-classes/blob/master/os_resource_classes/__init__.py#L41
.. [7] https://specs.openstack.org/openstack/nova-specs/specs/pike/implemented/resource-provider-traits.html
.. [8] https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#virtual-gpu-types-grid-reference
.. [9] https://specs.openstack.org/openstack/nova-specs/specs/ussuri/implemented/vgpu-multiple-types.html
.. [10] https://docs.openstack.org/nova/latest/admin/virtual-gpu.html
History
=======
.. list-table:: Revisions
:header-rows: 1
* - Release
- Description
* - Wallaby
- Introduced