Merge "Virt driver large page allocation for guest RAM"
This commit is contained in:
commit
f706bc603d
|
@ -0,0 +1,340 @@
|
|||
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

===============================================
Virt driver large page allocation for guest RAM
===============================================

https://blueprints.launchpad.net/nova/+spec/virt-driver-large-pages

This feature aims to improve the libvirt driver so that it can use large
pages for backing the guest RAM allocation. This will improve the
performance of guest workloads by increasing TLB cache efficiency. It will
ensure that the guest has 100% dedicated RAM that will never be swapped out.

Problem description
===================

Most modern virtualization hosts support a variety of memory page sizes. On
x86 the smallest, used by the kernel by default, is 4 KB, while larger sizes
include 2 MB and 1 GB. The CPU TLB cache has a limited size, so when there
is a very large amount of RAM present and utilized, the cache efficiency can
be fairly low, which in turn increases memory access latency. By using
larger page sizes, fewer entries are needed in the TLB and thus its
efficiency goes up.

The use of huge pages for backing guests implies that the guest is running
with a dedicated resource allocation, i.e. memory overcommit can no longer
be provided. This is a tradeoff that cloud administrators may be willing to
make to support workloads that require predictable memory access times, such
as NFV.

While large pages are better than small pages, it cannot be assumed that the
benefit increases as the page size increases. For some workloads, a 2 MB
page size can be better overall than a 1 GB page size. The choice of page
size also affects the granularity of the guest RAM size, i.e. a 1.5 GB guest
would not be able to use 1 GB pages since its RAM size is not a multiple of
the page size.

Although it is theoretically possible to reserve large pages on the fly,
after a host has been booted for a period of time its physical memory will
have become very fragmented. This means that even if the host has lots of
free memory, it may be unable to find the contiguous chunks required to
provide large pages. This is a particular problem for 1 GB sized pages. To
deal with this problem, it is usual practice to reserve all required large
pages upfront at host boot time, by specifying a reservation count on the
kernel command line of the host. This would be a one-time setup task done
when deploying new compute node hosts.

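For example, on an x86 host the boot-time reservation is typically done with
kernel command line parameters along these lines (the counts shown are
purely illustrative):

::

  default_hugepagesz=1G hugepagesz=1G hugepages=8 hugepagesz=2M hugepages=1024
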
Proposed change
===============

The flavour extra specs will be enhanced to support a new parameter:

* hw:mem_page_size=large|small|any|2MB|1GB

In the absence of any page size setting in the flavour, the current
behaviour of using the small, default, page size will continue. A setting of
'large' says to use only larger page sizes for guest RAM, e.g. either 2 MB
or 1 GB on x86; 'small' says to use only the small page size, e.g. 4 KB on
x86, and is the default; 'any' leaves the policy up to the compute driver
implementation to decide. Given 'any', the libvirt driver might try to find
large pages but fall back to small pages; other drivers may choose alternate
policies for 'any'. Finally, an explicit page size can be set if the
workload has very precise requirements for a specific large page size. It is
expected that the common case would be to use page_size=large or
page_size=any. The specification of explicit page sizes would be something
that NFV workloads would require.

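As an illustration, an administrator could request large pages for a flavour
with the existing flavour extra specs CLI (the flavour name is illustrative):

::

  nova flavor-key m1.large.nfv set hw:mem_page_size=large
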
The property defined for the flavour can also be set against the image, but
the use of large pages will only be honoured if the flavour already has a
policy of 'large' or 'any', i.e. if the flavour says 'small', or a specific
numeric page size, the image is not permitted to override this to access
other large page sizes. Such an invalid override in the image would result
in an exception being raised, and the attempt to boot the instance would
result in an error. While the ultimate validation is done in the virt
driver, this can also be caught and reported at the API layer.

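As a sketch, assuming the image property mirrors the flavour key with the
usual convention of replacing ':' with '_' (the property and image names
here are illustrative, not yet settled by this spec):

::

  glance image-update --property hw_mem_page_size=large my-nfv-image
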
If the flavour memory size is not a multiple of the specified large page
size, this would be considered an error which would cause the instance to
fail to boot. If the page size is 'large' or 'any', then the compute driver
would instead attempt to pick a page size of which the RAM size is a
multiple, rather than erroring. This is only likely to be a significant
problem when using 1 GB page sizes, which imply that the RAM size must be in
1 GB increments.

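A minimal sketch of this validation follows; the function name and exception
are illustrative rather than the actual Nova code:

::

  def validate_page_size(mem_mb, page_size_kb):
      # The flavour RAM size must be a whole number of pages of the
      # explicitly requested size, otherwise the boot must fail
      mem_kb = mem_mb * 1024
      if mem_kb % page_size_kb != 0:
          raise ValueError("Memory size %d MB is not a multiple of the "
                           "requested %d KB page size" %
                           (mem_mb, page_size_kb))
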
The libvirt driver will be enhanced to honour this parameter when
configuring the guest RAM allocation policy. This will effectively introduce
the concept of a "dedicated memory" guest, since large pages must be
associated 1-to-1 with guests - there is no facility to overcommit by
allowing one large page to be used by multiple guests, or to swap large
pages.

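For a guest backed by 2 MB pages, the libvirt driver might generate a guest
memory backing element along these lines (the exact XML emitted will depend
on the libvirt version and on the NUMA placement work):

::

  <memoryBacking>
    <hugepages>
      <page size="2048" unit="KiB" nodeset="0"/>
    </hugepages>
  </memoryBacking>
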
The libvirt driver will be enhanced to report on large page availability per
NUMA node, building on previously added NUMA topology reporting.

The scheduler will be enhanced to take account of the page size setting on
the flavour and pick hosts which have sufficient large pages available when
scheduling the instance. Conversely, if large pages are not requested, the
scheduler needs to avoid placing the instance on a host which has
pre-reserved large pages. The enhancements for the scheduler will be done as
part of the new filter that is implemented as part of the NUMA topology
blueprint. This involves altering the logic done in that blueprint so that
instead of just looking at the free memory in each NUMA node, it looks at
the free page count for the desired page size.

As illustrated later in this document, each host will report on all the page
sizes available, and this information will be available to the scheduler. So
when it interprets 'small', it will consider the smallest page size reported
by the compute node. Conversely, when interpreting 'large' it will consider
any page size except the smallest one. This obviously implies that there is
potential for 'large' and 'small' to have different meanings depending on
the host being considered. For the use cases where this would be a problem,
an explicit page size would be requested instead of using these symbolic
named sizes. The scheduler will also have to consider whether the flavour
memory size is a multiple of the page size. If the instance is using
multiple NUMA nodes, it will have to consider whether the RAM in each guest
node is a multiple of the page size, rather than the total memory size.

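A minimal sketch of the per-NUMA-node check the enhanced filter might
perform is shown here; the data structure names are hypothetical, not the
actual Nova code:

::

  def cell_can_fit(cell_mempages, mem_kb, page_size_kb):
      # Guest RAM for this NUMA node must be a whole number of pages,
      # and the host cell must have enough free pages of that size
      if mem_kb % page_size_kb != 0:
          return False
      pages_needed = mem_kb // page_size_kb
      return cell_mempages.get(page_size_kb, 0) >= pages_needed
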
Alternatives
------------

Recent Linux hosts have a concept of "transparent huge pages", where the
kernel will opportunistically allocate large pages for guest VMs. The
problem with this is that, over time, the kernel's memory allocations become
very fragmented, making it increasingly hard to find the contiguous blocks
of RAM required for large pages. This makes transparent huge pages
impractical for use with 1 GB page sizes. The opportunistic approach also
means that users do not have any hard guarantee that their instance will be
backed by large pages. This makes it an unusable approach for NFV workloads,
which require hard guarantees.

Data model impact
-----------------

The previously added data in the host state structure for reporting NUMA
topology would be enhanced to further include information on page size
availability per node. It would then look like:

::

  hw_numa = {
      nodes = [
          {
              id = 0
              cpus = 0, 2, 4, 6
              mem = {
                  total = 10737418240
                  free = 3221225472
              },
              mempages = {
                  4096 = {
                      total = 262144
                      free = 262144
                  },
                  2097152 = {
                      total = 1024
                      free = 1024
                  },
                  1073741824 = {
                      total = 7
                      free = 0
                  }
              },
              distances = [ 10, 20 ],
          },
          {
              id = 1
              cpus = 1, 3, 5, 7
              mem = {
                  total = 10737418240
                  free = 5368709120
              },
              mempages = {
                  4096 = {
                      total = 262144
                      free = 262144
                  },
                  2097152 = {
                      total = 1024
                      free = 1024
                  },
                  1073741824 = {
                      total = 7
                      free = 2
                  }
              },
              distances = [ 20, 10 ],
          }
      ],
  }

The data provided to the extensible resource tracker would be similarly
enhanced to include this page info in a flattened format, which can be
efficiently queried based on the key name:

* hw_numa_nodes=2
* hw_numa_node0_cpus=4
* hw_numa_node0_mem_total=10737418240
* hw_numa_node0_mem_avail=3221225472
* hw_numa_node0_mem_page_total_4=262144
* hw_numa_node0_mem_page_avail_4=262144
* hw_numa_node0_mem_page_total_2048=1024
* hw_numa_node0_mem_page_avail_2048=1024
* hw_numa_node0_mem_page_total_1048576=7
* hw_numa_node0_mem_page_avail_1048576=0
* hw_numa_node0_distance_node0=10
* hw_numa_node0_distance_node1=20
* hw_numa_node1_cpus=4
* hw_numa_node1_mem_total=10737418240
* hw_numa_node1_mem_avail=5368709120
* hw_numa_node1_mem_page_total_4=262144
* hw_numa_node1_mem_page_avail_4=262144
* hw_numa_node1_mem_page_total_2048=1024
* hw_numa_node1_mem_page_avail_2048=1024
* hw_numa_node1_mem_page_total_1048576=7
* hw_numa_node1_mem_page_avail_1048576=2
* hw_numa_node1_distance_node0=20
* hw_numa_node1_distance_node1=10

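For illustration, a scheduler extension could derive the key name for a
given node and page size and look it up in the reported stats (the 'stats'
dict and helper are hypothetical):

::

  def free_pages(stats, node, page_size_kb):
      key = "hw_numa_node%d_mem_page_avail_%d" % (node, page_size_kb)
      return int(stats.get(key, 0))

  # e.g. free_pages(stats, 1, 2048) returns 1024 for the data above
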
REST API impact
---------------

No impact.

The existing APIs already support arbitrary data in the flavour extra specs.

Security impact
---------------

No impact.

Notifications impact
--------------------

No impact.

The notifications system is not used by this change.

Other end user impact
---------------------

There are no changes that directly impact the end user, other than the fact
that their guest should have more predictable memory access latency.

Performance Impact
------------------

The scheduler will have more logic added to take into account large page
availability per NUMA node when placing guests. Most of this impact will
have already been incurred when the initial NUMA support was added to the
scheduler. This change merely alters the NUMA support so that it considers
the free large pages instead of the overall RAM size.

Other deployer impact
---------------------

The cloud administrator will gain the ability to set a large page policy on
the flavours they configure. The administrator will also have to configure
their compute hosts to reserve large pages at boot time, and place those
hosts into a group using aggregates.

It is possible that there might be a need to expose information on the page
counts to host administrators via the Nova API. Such a need can be
considered for followup work once the work referenced in this basic spec is
completed.

Developer impact
----------------

If other hypervisors allow control over large page usage, they could be
enhanced to support the same flavour extra specs settings. If the hypervisor
has self-determined control over large page usage, then it is valid to
simply ignore this new flavour setting, i.e. do nothing.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  berrange

Other contributors:
  ndipanov

Work Items
----------

* Enhance the libvirt driver to report available large pages per NUMA node
  in the host state data
* Enhance the libvirt driver to configure guests based on the flavour
  parameter for page sizes
* Add support to the scheduler to place instances on hosts according to the
  availability of the required large pages

Dependencies
============

* Virt driver guest NUMA node placement & topology. This blueprint is going
  to be an extension of the work done in the compute driver and scheduler
  for NUMA placement, since large pages must be allocated from matching
  guest & host NUMA nodes to avoid cross-node memory access.

  https://blueprints.launchpad.net/nova/+spec/virt-driver-numa-placement

* Libvirt / KVM need to be enhanced to allow Nova to indicate that large
  pages should be allocated from specific NUMA nodes on the host. This is
  not a blocker to supporting large pages in Nova, since it can use the more
  general large page support in libvirt; however, the performance benefits
  won't be fully realized until per-NUMA-node large page allocation can be
  done.

* Extensible resource tracker

  https://blueprints.launchpad.net/nova/+spec/extensible-resource-tracking

Testing
=======

Testing this in the gate would be difficult since the hosts which run the
gate tests would have to be pre-configured with large pages allocated at
initial OS boot time. This in turn would preclude running gate tests with
guests that do not want to use large pages.

Documentation Impact
====================

The new flavour parameter available to the cloud administrator needs to be
documented, along with recommendations about effective usage. The docs will
also need to mention the compute host deployment prerequisites, such as the
need to pre-allocate large pages at boot time and set up aggregates.

References
==========

Current "big picture" research and design for the topic of CPU and memory
resource utilization and placement. vCPU topology is a subset of this work.

* https://wiki.openstack.org/wiki/VirtDriverGuestCPUMemoryPlacement