Merge "Virt driver large page allocation for guest RAM"
This commit is contained in:
commit
f706bc603d
|
@ -0,0 +1,340 @@
|
|||
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

===============================================
Virt driver large page allocation for guest RAM
===============================================

https://blueprints.launchpad.net/nova/+spec/virt-driver-large-pages

This feature aims to improve the libvirt driver so that it can use large
pages for backing the guest RAM allocation. This will improve the
performance of guest workloads by increasing TLB cache efficiency. It will
ensure that the guest has 100% dedicated RAM that will never be swapped out.

Problem description
===================

Most modern virtualization hosts support a variety of memory page sizes. On
x86 the smallest, used by the kernel by default, is 4 KB, while larger sizes
include 2 MB and 1 GB. The CPU TLB cache has a limited size, so when there
is a very large amount of RAM present and utilized, the cache efficiency can
be fairly low, which in turn increases memory access latency. By using
larger page sizes, fewer entries are needed in the TLB and thus its
efficiency goes up.

The use of huge pages for backing guests implies that the guest is running
with a dedicated resource allocation, i.e. memory overcommit can no longer
be provided. This is a tradeoff that cloud administrators may be willing to
make to support workloads that require predictable memory access times, such
as NFV.

While large pages are better than small pages, it cannot be assumed that the
benefit increases as the page size increases. For some workloads, a 2 MB
page size can be better overall than a 1 GB page size. The choice of page
size also affects the granularity of the guest RAM size, i.e. a 1.5 GB guest
would not be able to use 1 GB pages since its RAM size is not a multiple of
the page size.

Although it is theoretically possible to reserve large pages on the fly,
after a host has been booted for a period of time its physical memory will
have become very fragmented. This means that even if the host has lots of
free memory, it may be unable to find the contiguous chunks required to
provide large pages. This is a particular problem for 1 GB sized pages. To
deal with this problem, it is usual practice to reserve all required large
pages upfront at host boot time, by specifying a reservation count on the
kernel command line of the host. This would be a one-time setup task done
when deploying new compute node hosts.

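For example, on an x86 host the boot-time reservation is typically done with
kernel command line parameters along these lines (the counts shown are
purely illustrative):

::

  default_hugepagesz=1G hugepagesz=1G hugepages=8 hugepagesz=2M hugepages=1024
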
Proposed change
===============

The flavour extra specs will be enhanced to support a new parameter:

* hw:mem_page_size=large|small|any|2MB|1GB

In the absence of any page size setting in the flavour, the current
behaviour of using the small, default, page size will continue. A setting of
'large' says to use only larger page sizes for guest RAM, e.g. either 2 MB
or 1 GB on x86; 'small' says to use only the small page size, e.g. 4 KB on
x86, and is the default; 'any' leaves the policy up to the compute driver
implementation to decide. Given 'any', the libvirt driver might try to find
large pages but fall back to small pages; other drivers may choose alternate
policies for 'any'. Finally, an explicit page size can be set if the
workload has very precise requirements for a specific large page size. It is
expected that the common case would be to use page_size=large or
page_size=any. The specification of explicit page sizes would be something
that NFV workloads would require.

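As an illustration, an administrator could request large pages for a flavour
with the existing flavour extra specs CLI (the flavour name is illustrative):

::

  nova flavor-key m1.large.nfv set hw:mem_page_size=large
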
The property defined for the flavour can also be set against the image, but
the use of large pages will only be honoured if the flavour already has a
policy of 'large' or 'any', i.e. if the flavour says 'small', or a specific
numeric page size, the image is not permitted to override this to access
other large page sizes. Such an invalid override in the image would result
in an exception being raised, and the attempt to boot the instance would
result in an error. While the ultimate validation is done in the virt
driver, this can also be caught and reported at the API layer.

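As a sketch, assuming the image property mirrors the flavour key with the
usual convention of replacing ':' with '_' (the property and image names
here are illustrative, not yet settled by this spec):

::

  glance image-update --property hw_mem_page_size=large my-nfv-image
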
If the flavour memory size is not a multiple of the specified large page
size, this would be considered an error which would cause the instance to
fail to boot. If the page size is 'large' or 'any', then the compute driver
would instead attempt to pick a page size of which the RAM size is a
multiple, rather than erroring. This is only likely to be a significant
problem when using 1 GB page sizes, which imply that the RAM size must be in
1 GB increments.

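A minimal sketch of this validation follows; the function name and exception
are illustrative rather than the actual Nova code:

::

  def validate_page_size(mem_mb, page_size_kb):
      # The flavour RAM size must be a whole number of pages of the
      # explicitly requested size, otherwise the boot must fail
      mem_kb = mem_mb * 1024
      if mem_kb % page_size_kb != 0:
          raise ValueError("Memory size %d MB is not a multiple of the "
                           "requested %d KB page size" %
                           (mem_mb, page_size_kb))
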
The libvirt driver will be enhanced to honour this parameter when
configuring the guest RAM allocation policy. This will effectively introduce
the concept of a "dedicated memory" guest, since large pages must be
associated 1-to-1 with guests - there is no facility to overcommit by
allowing one large page to be used by multiple guests, or to swap large
pages.

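For a guest backed by 2 MB pages, the libvirt driver might generate a guest
memory backing element along these lines (the exact XML emitted will depend
on the libvirt version and on the NUMA placement work):

::

  <memoryBacking>
    <hugepages>
      <page size="2048" unit="KiB" nodeset="0"/>
    </hugepages>
  </memoryBacking>
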
The libvirt driver will be enhanced to report on large page availability per
NUMA node, building on previously added NUMA topology reporting.

The scheduler will be enhanced to take account of the page size setting on
the flavour and pick hosts which have sufficient large pages available when
scheduling the instance. Conversely, if large pages are not requested, the
scheduler needs to avoid placing the instance on a host which has
pre-reserved large pages. The enhancements for the scheduler will be done as
part of the new filter that is implemented as part of the NUMA topology
blueprint. This involves altering the logic done in that blueprint so that
instead of just looking at the free memory in each NUMA node, it looks at
the free page count for the desired page size.

As illustrated later in this document, each host will report on all the page
sizes available, and this information will be available to the scheduler. So
when it interprets 'small', it will consider the smallest page size reported
by the compute node. Conversely, when interpreting 'large' it will consider
any page size except the smallest one. This obviously implies that there is
potential for 'large' and 'small' to have different meanings depending on
the host being considered. For the use cases where this would be a problem,
an explicit page size would be requested instead of using these symbolic
named sizes. The scheduler will also have to consider whether the flavour
memory size is a multiple of the page size. If the instance is using
multiple NUMA nodes, it will have to consider whether the RAM in each guest
node is a multiple of the page size, rather than the total memory size.

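A minimal sketch of the per-NUMA-node check the enhanced filter might
perform is shown here; the data structure names are hypothetical, not the
actual Nova code:

::

  def cell_can_fit(cell_mempages, mem_kb, page_size_kb):
      # Guest RAM for this NUMA node must be a whole number of pages,
      # and the host cell must have enough free pages of that size
      if mem_kb % page_size_kb != 0:
          return False
      pages_needed = mem_kb // page_size_kb
      return cell_mempages.get(page_size_kb, 0) >= pages_needed
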
Alternatives
------------

Recent Linux hosts have a concept of "transparent huge pages", where the
kernel will opportunistically allocate large pages for guest VMs. The
problem with this is that, over time, the kernel's memory allocations become
very fragmented, making it increasingly hard to find the contiguous blocks
of RAM required for large pages. This makes transparent huge pages
impractical for use with 1 GB page sizes. The opportunistic approach also
means that users do not have any hard guarantee that their instance will be
backed by large pages. This makes it an unusable approach for NFV workloads,
which require hard guarantees.

Data model impact
-----------------

The previously added data in the host state structure for reporting NUMA
topology would be enhanced to further include information on page size
availability per node. It would then look like:

::

  hw_numa = {
      nodes = [
          {
              id = 0
              cpus = 0, 2, 4, 6
              mem = {
                  total = 10737418240
                  free = 3221225472
              },
              mempages = {
                  4096 = {
                      total = 262144
                      free = 262144
                  },
                  2097152 = {
                      total = 1024
                      free = 1024
                  },
                  1073741824 = {
                      total = 7
                      free = 0
                  }
              },
              distances = [ 10, 20 ],
          },
          {
              id = 1
              cpus = 1, 3, 5, 7
              mem = {
                  total = 10737418240
                  free = 5368709120
              },
              mempages = {
                  4096 = {
                      total = 262144
                      free = 262144
                  },
                  2097152 = {
                      total = 1024
                      free = 1024
                  },
                  1073741824 = {
                      total = 7
                      free = 2
                  }
              },
              distances = [ 20, 10 ],
          }
      ],
  }

The data provided to the extensible resource tracker would be similarly
enhanced to include this page info in a flattened format, which can be
efficiently queried based on the key name:

* hw_numa_nodes=2
* hw_numa_node0_cpus=4
* hw_numa_node0_mem_total=10737418240
* hw_numa_node0_mem_avail=3221225472
* hw_numa_node0_mem_page_total_4=262144
* hw_numa_node0_mem_page_avail_4=262144
* hw_numa_node0_mem_page_total_2048=1024
* hw_numa_node0_mem_page_avail_2048=1024
* hw_numa_node0_mem_page_total_1048576=7
* hw_numa_node0_mem_page_avail_1048576=0
* hw_numa_node0_distance_node0=10
* hw_numa_node0_distance_node1=20
* hw_numa_node1_cpus=4
* hw_numa_node1_mem_total=10737418240
* hw_numa_node1_mem_avail=5368709120
* hw_numa_node1_mem_page_total_4=262144
* hw_numa_node1_mem_page_avail_4=262144
* hw_numa_node1_mem_page_total_2048=1024
* hw_numa_node1_mem_page_avail_2048=1024
* hw_numa_node1_mem_page_total_1048576=7
* hw_numa_node1_mem_page_avail_1048576=2
* hw_numa_node1_distance_node0=20
* hw_numa_node1_distance_node1=10

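For illustration, a scheduler extension could derive the key name for a
given node and page size and look it up in the reported stats (the 'stats'
dict and helper are hypothetical):

::

  def free_pages(stats, node, page_size_kb):
      key = "hw_numa_node%d_mem_page_avail_%d" % (node, page_size_kb)
      return int(stats.get(key, 0))

  # e.g. free_pages(stats, 1, 2048) returns 1024 for the data above
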
REST API impact
---------------

No impact.

The existing APIs already support arbitrary data in the flavour extra specs.

Security impact
---------------

No impact.

Notifications impact
--------------------

No impact.

The notifications system is not used by this change.

Other end user impact
---------------------

There are no changes that directly impact the end user, other than the fact
that their guest should have more predictable memory access latency.

Performance Impact
------------------

The scheduler will have more logic added to take into account large page
availability per NUMA node when placing guests. Most of this impact will
have already been incurred when the initial NUMA support was added to the
scheduler. This change merely alters the NUMA support so that it considers
the free large pages instead of the overall RAM size.

Other deployer impact
---------------------

The cloud administrator will gain the ability to set a large page policy on
the flavours they configure. The administrator will also have to configure
their compute hosts to reserve large pages at boot time, and place those
hosts into a group using aggregates.

It is possible that there might be a need to expose information on the page
counts to host administrators via the Nova API. Such a need can be
considered for followup work once the work referenced in this basic spec is
completed.

Developer impact
----------------

If other hypervisors allow control over large page usage, they could be
enhanced to support the same flavour extra specs settings. If the hypervisor
has self-determined control over large page usage, then it is valid to
simply ignore this new flavour setting, i.e. do nothing.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  berrange

Other contributors:
  ndipanov

Work Items
----------

* Enhance the libvirt driver to report available large pages per NUMA node
  in the host state data
* Enhance the libvirt driver to configure guests based on the flavour
  parameter for page sizes
* Add support to the scheduler to place instances on hosts according to the
  availability of the required large pages

Dependencies
============

* Virt driver guest NUMA node placement & topology. This blueprint is going
  to be an extension of the work done in the compute driver and scheduler
  for NUMA placement, since large pages must be allocated from matching
  guest & host NUMA nodes to avoid cross-node memory access.

  https://blueprints.launchpad.net/nova/+spec/virt-driver-numa-placement

* Libvirt / KVM need to be enhanced to allow Nova to indicate that large
  pages should be allocated from specific NUMA nodes on the host. This is
  not a blocker to supporting large pages in Nova, since it can use the more
  general large page support in libvirt; however, the performance benefits
  won't be fully realized until per-NUMA-node large page allocation can be
  done.

* Extensible resource tracker

  https://blueprints.launchpad.net/nova/+spec/extensible-resource-tracking

Testing
=======

Testing this in the gate would be difficult since the hosts which run the
gate tests would have to be pre-configured with large pages allocated at
initial OS boot time. This in turn would preclude running gate tests with
guests that do not want to use large pages.

Documentation Impact
====================

The new flavour parameter available to the cloud administrator needs to be
documented, along with recommendations about effective usage. The docs will
also need to mention the compute host deployment prerequisites, such as the
need to pre-allocate large pages at boot time and set up aggregates.

References
==========

Current "big picture" research and design for the topic of CPU and memory
resource utilization and placement. vCPU topology is a subset of this work.

* https://wiki.openstack.org/wiki/VirtDriverGuestCPUMemoryPlacement