..
 Copyright 2021 Canonical Ltd.

 This work is licensed under a Creative Commons Attribution 3.0
 Unported License.

 http://creativecommons.org/licenses/by/3.0/legalcode

..
 This template should be in ReSTructured text. Please do not delete
 any of the sections in this template. If you have nothing to say
 for a whole section, just write: "None". For help with syntax, see
 http://sphinx-doc.org/rest.html To test out your formatting, see
 http://www.tele3.cz/jbar/rest/rest.html
===========================
Memory Fragmentation Tuning
===========================

In some high memory pressure scenarios, memory shortage makes high-order
pages hard to allocate. In addition, page allocation frequently falls into
synchronous reclaim, because the default gaps between the min, low, and high
watermarks are too small to wake up kswapd (asynchronous reclaim) early
enough.

Problem Description
===================

On an OpenStack compute node, especially a hyperconverged machine where the
Ceph OSDs use a lot of page cache, it is easy to run into memory allocation
stalls. The stalls lead to several problems: new instances cannot be brought
up (KVM needs to allocate order-6 pages), VMs get stuck, etc. The reasons
are:

1). Compaction for high-order pages

If THP (Transparent Huge Pages) is used with the VM, the problem is more
severe than with persistent huge pages reserved for the VM's dedicated
usage, because THP needs to allocate the 2MB (x86) huge pages at run time.
These are order-9 allocations (2^9 * 4K = 2MB). On a running system it is
hard to find 512 (2^9) contiguous 4K pages, as /proc/pagetypeinfo shows.

2). Synchronous reclaim

There are three watermark levels in the system: 1). min 2). low 3). high.
When the number of free pages drops below the low watermark, kswapd is woken
up to do asynchronous reclaim, and it does not stop until the number of free
pages reaches the high watermark. However, when the memory allocation
pressure is strong enough, the free pages keep falling toward the min
watermark. At that point the min pages are reserved for emergency usage, and
the allocation goes into direct-reclaim (synchronous) mode, which stalls the
allocating process.
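The per-zone min/low/high watermarks described above can be read from
/proc/zoneinfo. The following is a minimal illustrative sketch (not part of
the proposed change; the sample text is made up) that parses those values
from zoneinfo-style output:

```python
# Illustrative sketch: parse the min/low/high watermarks (in pages) of a
# zone from /proc/zoneinfo-style text. SAMPLE is made-up example data.
import re

SAMPLE = """\
Node 0, zone   Normal
  pages free     123456
        min      65536
        low      81920
        high     98304
"""


def parse_watermarks(zoneinfo_text):
    """Return {'min': pages, 'low': pages, 'high': pages} for the first zone."""
    marks = {}
    for name in ("min", "low", "high"):
        m = re.search(r"^\s*%s\s+(\d+)\s*$" % name, zoneinfo_text, re.M)
        if m:
            marks[name] = int(m.group(1))
    return marks


print(parse_watermarks(SAMPLE))  # {'min': 65536, 'low': 81920, 'high': 98304}
```

On a real system the same parsing could be applied to the contents of
/proc/zoneinfo, per zone.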

Proposed Change
===============

Past experience shows that a 1GB gap between the min, low, and high
watermarks is good practice in server environments. The bigger gap wakes up
kswapd earlier, avoids synchronous reclaim, and thereby reduces allocation
latency. The sysctl parameters related to the watermark gap calculation are:

- vm.min_free_kbytes
- vm.watermark_scale_factor

For Ubuntu kernels before 4.15 (Bionic), the only way to tune the watermarks
is to modify vm.min_free_kbytes; the gap is then 1/4 of vm.min_free_kbytes.
However, increasing min_free_kbytes also raises the min watermark
reservation, which decreases the memory actually usable by the running
system.
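To make that trade-off concrete, a small arithmetic sketch (the helper name
is hypothetical): since the gap on these kernels is 1/4 of
vm.min_free_kbytes, a 1GB gap forces a 4GB min reservation.

```python
# Illustrative arithmetic for pre-4.15 kernels, where the watermark gap
# is min_free_kbytes / 4 (per this spec). The helper name is hypothetical.

def min_free_kbytes_for_gap(gap_kbytes):
    """Return the vm.min_free_kbytes needed for a watermark gap of gap_kbytes."""
    return gap_kbytes * 4


one_gb_kb = 1024 * 1024
# A 1GB gap requires reserving 4GB via the min watermark:
print(min_free_kbytes_for_gap(one_gb_kb))  # 4194304
```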

For Ubuntu kernels 4.15 and later, vm.watermark_scale_factor can be used to
increase the gap without increasing the min watermark reservation. The gap
is calculated as "watermark_scale_factor / 10000 * managed_pages".
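Inverting that formula gives the factor needed for a roughly 1GB gap. The
sketch below is an assumption of how the calculation could look, not the
charm's actual code; the helper name, the clamping, and the managed_pages
approximation are all illustrative.

```python
# Illustrative: derive vm.watermark_scale_factor for a ~1GB gap from
# gap_pages = watermark_scale_factor / 10000 * managed_pages.
PAGE_SIZE = 4096  # 4K pages on x86


def scale_factor_for_gap(gap_bytes, total_mem_bytes):
    """Return the watermark_scale_factor giving roughly gap_bytes of gap.

    managed_pages is approximated by total memory / page size; the real
    per-zone values come from /proc/zoneinfo and are slightly smaller.
    """
    managed_pages = total_mem_bytes // PAGE_SIZE
    factor = round(gap_bytes / PAGE_SIZE / managed_pages * 10000)
    # The kernel caps this sysctl (1000 on 4.15-era kernels).
    return max(1, min(factor, 1000))


one_gb = 1 << 30
print(scale_factor_for_gap(one_gb, 128 * (1 << 30)))  # 78 on a 128GB node
```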

The proposed solution is to set a 1GB watermark gap through the above two
parameters when the compute node is rebooted.

The feature will be designed to be flexible:

1). There will be a switch to turn the feature on/off. By default it is
turned off, since for compute nodes with little memory (<32GB) reserving 1GB
of low memory is too much.

2). A manual config value has higher priority and overrides the default
calculation.

Alternatives
------------

The config can also be set at run time with the following commands:

.. code-block:: bash

    juju deploy cs:sysconfig-2
    juju add-relation sysconfig nova-compute
    juju config sysconfig sysctl="{vm.extfrag_threshold: 200, vm.watermark_scale_factor: 50}"

However, each system might have a different memory capacity, so the
watermark_scale_factor would need to be calculated manually for every node.

Implementation
==============

Assignee(s)
-----------

Primary assignee:

- Gavin Guo <gavin.guo@canonical.com>

Gerrit Topic
------------

Use Gerrit topic "memory-fragmentation-tuning" for all patches related to
this spec.

.. code-block:: bash

    git-review -t memory-fragmentation-tuning

Work Items
----------

Implement the watermark_scale_factor value calculation that sets the gap to
1GB.

Repositories
------------

No new git repository is required.

Documentation
-------------

Documentation is needed for the switch that turns the feature on/off.

Security
--------

The use of this feature exposes no additional security attack surface.

Testing
-------

Verify that the calculated watermark values are correct. Also verify that
the right parameter is used on each kernel version (min_free_kbytes vs.
watermark_scale_factor).
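The kernel-version branch could be checked with a sketch like the following
(the function name is illustrative; the 4.15 cutoff comes from this spec):

```python
# Illustrative: pick the sysctl to tune based on the running kernel
# version, matching the 4.15 cutoff described in this spec.
import platform


def watermark_sysctl(release=None):
    """Return the sysctl knob to use for the watermark gap on this kernel."""
    release = release or platform.release()  # e.g. "4.15.0-112-generic"
    major, minor = (int(part) for part in release.split(".")[:2])
    if (major, minor) >= (4, 15):
        return "vm.watermark_scale_factor"
    return "vm.min_free_kbytes"


print(watermark_sysctl("4.15.0-112-generic"))  # vm.watermark_scale_factor
print(watermark_sysctl("4.4.0-187-generic"))   # vm.min_free_kbytes
```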

Dependencies
============

None