Add NVIDIA MIG management spec

implement blueprint nvidia-a100-vgpu
Change-Id: I43f35e6781b84917b835834dc8d530ccbf1f1a9a
songwenping 2023-06-12 15:35:04 +08:00
parent 3cd247ece4
commit 7b1bbd34c3
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

===============================================
Cyborg NVIDIA GPU Driver support MIG management
===============================================
Multi-Instance GPU (MIG) is a new feature of NVIDIA A-series cards. MIG can
partition a GPU into as many as seven instances, each fully isolated with
its own high-bandwidth memory, cache, and compute cores [1]_. Virtualizing
these cards requires enabling SR-IOV, as with QAT and SmartNIC devices,
which differs from the V100 and other GPU cards. This spec proposes
supporting MIG management in a new NVIDIA MIG driver.
Problem description
===================
MIG is a new feature of A-series cards, and it uses a different
virtualization type from V-series and other GPU cards. Cyborg needs a new
driver so that it can manage MIG resources like other accelerators.
Use Cases
---------
An admin or operator wants to use MIG resources like other vGPU resources
and boot a VM with a MIG device attached.
Proposed change
===============
Add a new driver to manage MIG resources. The driver will:

1. Collect raw information about the GPU devices on the compute node by
   running ``lspci`` and filtering for NVIDIA-related keywords.
2. Create the default GPU instance partitions on every A-series card.
3. Find the MIG resources by filtering the VFs of each GPU.
4. Report the available VFs to cyborg-conductor and Placement.

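
The discovery steps above (1 and 3) can be sketched roughly as follows. This is
a minimal illustration, not the proposed driver: the function names, the
``lspci -nn -D`` output format used for parsing, and the sample text are all
assumptions made for the example. It relies on two general facts: NVIDIA's PCI
vendor ID is ``10de``, and on Linux an SR-IOV VF exposes a ``physfn`` link in
sysfs.

```python
import os
import re

NVIDIA_VENDOR_ID = "10de"  # PCI vendor ID assigned to NVIDIA

# Matches lines in the style printed by `lspci -nn -D` (assumed format).
LSPCI_LINE = re.compile(
    r"^(?P<addr>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+"
    r"(?P<cls>.+?)\s+\[[0-9a-f]{4}\]:\s+"
    r"(?P<name>.+)\s+\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
)


def parse_nvidia_devices(lspci_output):
    """Return one dict per NVIDIA PCI function found in the lspci text."""
    devices = []
    for line in lspci_output.splitlines():
        match = LSPCI_LINE.match(line.strip())
        if match and match.group("vendor") == NVIDIA_VENDOR_ID:
            devices.append({
                "address": match.group("addr"),
                "name": match.group("name"),
                "device_id": match.group("device"),
            })
    return devices


def is_virtual_function(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """A VF has a 'physfn' link pointing to its parent physical function."""
    return os.path.lexists(os.path.join(sysfs_root, pci_addr, "physfn"))


# Illustrative sample only; the addresses and device IDs are made up.
SAMPLE_LSPCI = """\
0000:3b:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:20f1] (rev a1)
0000:3b:00.4 3D controller [0302]: NVIDIA Corporation Device [10de:20f1] (rev a1)
0000:18:00.0 Ethernet controller [0200]: Intel Corporation Device [8086:1572] (rev 02)
"""

devices = parse_nvidia_devices(SAMPLE_LSPCI)
# On a real compute node, keep only the VFs before reporting them.
vfs = [d for d in devices if is_virtual_function(d["address"])]
```

A real driver would obtain the ``lspci`` text from the compute node instead of
a hard-coded sample, and would translate the filtered VFs into whatever
objects Cyborg reports to cyborg-conductor and Placement.
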
The MIG partition can be changed through cyborg-api, in the same way as
changing vgpu_type. Cyborg-api will call cyborg-agent to repartition the GPU
and run the periodic task that reports the new MIG information.
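
As a rough sketch of what the repartitioning step on the agent side might
involve, the helper below builds the ``nvidia-smi`` command lines that enable
MIG mode and create GPU instances. The spec does not say that the agent shells
out to ``nvidia-smi``, so that is an assumption, as are the function name and
the example profile IDs. The helper only builds the commands and does not run
them.

```python
# The spec notes that a GPU can be split into at most seven instances.
MAX_MIG_INSTANCES = 7


def build_mig_partition_commands(gpu_index, profile_ids):
    """Build (but do not run) nvidia-smi commands to partition one GPU.

    gpu_index:   index of the GPU as enumerated by nvidia-smi
    profile_ids: GPU instance profile IDs, one per instance to create
                 (the available IDs are listed by `nvidia-smi mig -lgip`)
    """
    if not profile_ids:
        raise ValueError("at least one MIG profile ID is required")
    if len(profile_ids) > MAX_MIG_INSTANCES:
        raise ValueError(
            "a GPU supports at most %d MIG instances" % MAX_MIG_INSTANCES)

    gpu = str(gpu_index)
    profiles = ",".join(str(p) for p in profile_ids)
    return [
        # Enable MIG mode on the selected GPU.
        ["nvidia-smi", "-i", gpu, "-mig", "1"],
        # Create the GPU instances and, with -C, their compute instances.
        ["nvidia-smi", "mig", "-i", gpu, "-cgi", profiles, "-C"],
    ]


# Hypothetical request: two instances using profile ID 19 on GPU 0.
commands = build_mig_partition_commands(0, [19, 19])
```

The instance-count check reflects only the seven-instance limit quoted in this
spec. Whether a particular combination of profiles fits on a card is decided
by the NVIDIA driver, so a real implementation would still need to handle
failures from the commands themselves.
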
Alternatives
------------
None
Data model impact
-----------------
None
REST API impact
---------------
None
Security impact
---------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
None
Performance Impact
------------------
None
Other deployer impact
---------------------
There is no need to configure vgpu_type for A-series GPU cards.
Deployers add the new driver, nvidia_mig_driver, to enabled_drivers.
Developer impact
----------------
Users who want to use this feature should upgrade Cyborg to the latest
release that includes these changes.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
songwenping
Work Items
----------
* Add a new driver to manage MIG resources.
* Support changing the MIG partition through cyborg-api.
* Update cyborgclient to support the action that changes the MIG partition.
* Add related tests.
Dependencies
============
None
Testing
=======
Appropriate unit and functional tests should be added.
Documentation Impact
====================
* Documentation is needed to explain how to use nvidia_mig_driver.
References
==========
.. [1] https://www.nvidia.com/en-in/technologies/multi-instance-gpu/
History
=======
.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Bobcat
     - Introduced