Add NVIDIA MIG management spec
implement blueprint nvidia-a100-vgpu

Change-Id: I43f35e6781b84917b835834dc8d530ccbf1f1a9a

commit 7b1bbd34c3 (parent 3cd247ece4)

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

===============================================
Cyborg NVIDIA GPU Driver support MIG management
===============================================

Multi-Instance GPU (MIG) is a new feature of NVIDIA A-series cards such as
the A100. MIG can partition a GPU into as many as seven instances, each fully
isolated with its own high-bandwidth memory, cache, and compute cores [1]_.
Virtualizing these cards requires SR-IOV to be enabled, as with QAT and
SmartNIC devices, which differs from the V100 and other GPU cards. This spec
proposes adding MIG management to the NVIDIA MIG driver.

Problem description
===================

MIG is a new feature of A-series cards, and its virtualization model differs
from that of V-series and other GPU cards. Cyborg needs a new driver so that
it can manage MIG resources in the same way as other accelerators.

Use Cases
---------

An admin or operator wants to use MIG resources like other vGPU resources and
boot a VM with a MIG device attached.

Proposed change
===============

Add a new driver to manage MIG resources. The driver will:

1. Collect raw information about GPU devices from the compute node by running
   ``lspci`` and filtering for NVIDIA-related keywords.
2. Create the default GPU instance partitions on every A-series card.
3. Find the MIG resources by filtering the VFs of each GPU.
4. Report the available VFs to cyborg-conductor and Placement.

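The discovery and VF-filtering steps above could be sketched roughly as
follows. This is a minimal illustration under stated assumptions, not the
driver's actual code: the helper names ``parse_nvidia_devices`` and
``is_virtual_function`` are hypothetical, the input is assumed to be the text
printed by ``lspci -nnD``, and a VF is assumed to be recognizable by the
``physfn`` link that Linux exposes in sysfs for SR-IOV virtual functions.

```python
import os
import re

# PCI vendor ID assigned to NVIDIA.
NVIDIA_VENDOR_ID = "10de"

# Assumed line shape of `lspci -nnD` output:
#   <domain:bus:dev.fn> <class> [cccc]: <description> [vvvv:dddd] (rev xx)
LSPCI_LINE = re.compile(
    r"^(?P<address>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+"
    r".*\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
)


def parse_nvidia_devices(lspci_output):
    """Return one dict per NVIDIA PCI function found in lspci text."""
    devices = []
    for line in lspci_output.splitlines():
        match = LSPCI_LINE.match(line.strip())
        if match and match.group("vendor") == NVIDIA_VENDOR_ID:
            devices.append({
                "address": match.group("address"),
                "vendor_id": match.group("vendor"),
                "device_id": match.group("device"),
            })
    return devices


def is_virtual_function(address, sysfs_root="/sys/bus/pci/devices"):
    """Report whether a PCI function is an SR-IOV VF.

    A VF carries a `physfn` entry pointing back at its physical function;
    a physical function does not.
    """
    return os.path.lexists(os.path.join(sysfs_root, address, "physfn"))
```

Keeping the parser separate from the command that produces its input makes
it possible to unit-test the filtering logic against captured ``lspci``
output without NVIDIA hardware on the test node.
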
The MIG partition can be changed through cyborg-api, in the same way as
``vgpu_type``. cyborg-api will call cyborg-agent to repartition the GPU and
then run the periodic task to report the new MIG information.

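The spec does not say how cyborg-agent performs the repartitioning itself.
One plausible approach, shown here only as a sketch, is to drive the
``nvidia-smi`` command-line tool. The helper name ``build_mig_commands`` is
hypothetical, and the GPU instance profile IDs are passed in by the caller
because they depend on the card model (``nvidia-smi mig -lgip`` lists them).
The function only builds the argument vectors and does not execute them.

```python
def build_mig_commands(gpu_index, profile_ids):
    """Build the nvidia-smi invocations that repartition one GPU.

    Returns a list of argument vectors in the order they should run.
    Assumption: the caller executes them with root privileges and treats
    a failing destroy step as "nothing to destroy".
    """
    if not profile_ids:
        raise ValueError("at least one GPU instance profile ID is required")

    gpu = str(gpu_index)
    profiles = ",".join(str(profile) for profile in profile_ids)
    return [
        # Enable MIG mode on the selected GPU.
        ["nvidia-smi", "-i", gpu, "-mig", "1"],
        # Destroy existing compute instances, then GPU instances.
        ["nvidia-smi", "mig", "-i", gpu, "-dci"],
        ["nvidia-smi", "mig", "-i", gpu, "-dgi"],
        # Create the requested GPU instances; -C also creates the default
        # compute instance inside each one.
        ["nvidia-smi", "mig", "-i", gpu, "-cgi", profiles, "-C"],
    ]
```

Returning argument vectors rather than shell strings lets the agent pass
them to its own process-execution helper without shell quoting, and lets
tests check the planned commands on machines without a GPU.
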
Alternatives
------------

None

Data model impact
-----------------

None

REST API impact
---------------

None

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

None

Other deployer impact
---------------------

Deployers do not need to configure ``vgpu_type`` for A-series GPU cards.
They need to add the new driver, named ``nvidia_mig_driver``, to
``enabled_drivers``.

Developer impact
----------------

Users who want this feature must upgrade Cyborg to the latest release that
includes these changes.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  songwenping

Work Items
----------

* Add a new driver to manage MIG resources.
* Support changing the MIG partition through cyborg-api.
* Update cyborgclient to support the change-MIG-partition action.
* Add related tests.

Dependencies
============

None

Testing
=======

Appropriate unit and functional tests should be added.

Documentation Impact
====================

* Documentation is needed to explain how to use ``nvidia_mig_driver``.

References
==========

.. [1] https://www.nvidia.com/en-in/technologies/multi-instance-gpu/

History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Bobcat
     - Introduced