.. Copyright 2021 Canonical Ltd.
   This work is licensed under a Creative Commons Attribution 3.0 Unported
   License.
   http://creativecommons.org/licenses/by/3.0/legalcode

.. This template should be in ReSTructured text. Please do not delete
   any of the sections in this template. If you have nothing to say for
   a whole section, just write: "None". For help with syntax, see
   http://sphinx-doc.org/rest.html To test out your formatting, see
   http://www.tele3.cz/jbar/rest/rest.html

===========================
Memory Fragmentation Tuning
===========================

Under high memory pressure, memory shortage makes high-order pages hard to
allocate. In addition, page allocations frequently fall into synchronous
(direct) reclaim, because the default gaps between the min, low, and high
watermarks are too small to wake up kswapd (asynchronous reclaim) early
enough.

Problem Description
===================

On OpenStack compute nodes, especially hyperconverged machines where the
Ceph OSDs use a lot of page cache, memory allocation stalls are easy to
hit. The stalls lead to several problems: new instances cannot be brought
up (KVM needs to allocate order-6 pages), running VMs get stuck, and so
on. The reasons are:

1). Compaction for high-order pages

    If THP (Transparent Huge Pages) is used for the VMs, the situation is
    more severe than with persistent huge pages reserved for the VMs'
    dedicated usage, because THP allocates its 2MB (x86) huge pages at run
    time. A 2MB page is an order-9 allocation (2^9 * 4K = 2MB), and on a
    long-running system it is hard to find 512 (2^9) contiguous 4K pages,
    as /proc/pagetypeinfo shows.

2). Synchronous reclaim

    There are three watermark levels in the system: min, low, and high.
    When the number of free pages drops below the low watermark, kswapd is
    woken up to do asynchronous reclaim, and it does not stop until the
    number of free pages reaches the high watermark. However, if the
    memory allocation pressure is strong enough, the free pages keep
    dropping towards the min watermark. The min watermark pages are
    reserved for emergency use, so at that point allocations fall into
    direct-reclaim (synchronous) mode, which stalls the allocating
    process.

Proposed Change
===============

In past experience, a 1GB gap between the min, low, and high watermarks
has been good practice in server environments. The bigger gap wakes up
kswapd earlier, avoids synchronous reclaim, and thereby reduces allocation
latency.

The sysctl parameters related to the watermark gap calculation are
vm.min_free_kbytes and vm.watermark_scale_factor.

For Ubuntu kernels before 4.15 (Bionic), the only way to tune the
watermarks is to modify vm.min_free_kbytes; the gap is then 1/4 of
vm.min_free_kbytes. However, increasing min_free_kbytes also increases the
min watermark reservation, which decreases the memory actually usable by
the running system.

For Ubuntu kernels from 4.15 on, vm.watermark_scale_factor can be used to
increase the gap without increasing the min watermark reservation. The gap
is calculated as ``watermark_scale_factor / 10000 * managed_pages``.

The proposed solution is to set up the 1GB watermark gap with the two
parameters above when the compute node is rebooted (a calculation sketch
is shown after this list). The feature will be designed flexibly:

1). There will be a switch to turn the feature on/off. By default it is
    turned off, because for small-memory compute nodes (<32GB) a 1GB gap
    of low memory is too much.

2). Manual configuration has a higher priority and overrides the default
    calculation.
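The following is a minimal sketch of the intended calculation, assuming a
4K page size and approximating the kernel's managed pages with MemTotal;
the exact rounding, clamping, and fallback behaviour are implementation
details still to be settled, not the final code.

.. code-block:: bash

    #!/bin/bash
    # Sketch: derive vm.watermark_scale_factor so that each watermark gap
    # (min->low, low->high) totals roughly 1GB across all zones.
    # Assumption: managed_pages is approximated from MemTotal with a 4K
    # page size; a real implementation might instead sum the per-zone
    # managed counts from /proc/zoneinfo.

    memtotal_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    managed_pages=$((memtotal_kb / 4))     # 4K pages

    gap_pages=$((1024 * 1024 / 4))         # 1GB expressed in 4K pages

    # Kernel >= 4.15: gap = managed_pages * watermark_scale_factor / 10000,
    # so factor = gap * 10000 / managed_pages, rounded up.
    factor=$(( (gap_pages * 10000 + managed_pages - 1) / managed_pages ))

    # Clamp to the sysctl's valid maximum (1000 on the kernels targeted
    # here).
    (( factor > 1000 )) && factor=1000

    sysctl -w vm.watermark_scale_factor="$factor"

    # On kernels before 4.15, the gap is 1/4 of vm.min_free_kbytes, so the
    # equivalent (much more costly) knob would be:
    #   sysctl -w vm.min_free_kbytes=$((4 * 1024 * 1024))

For example, on a 256GB compute node this yields a factor of about 40
(262144 * 10000 / 67108864 is roughly 39.1, rounded up).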
Alternatives
------------

The configuration can also be set at run time with the following
commands:

.. code-block:: bash

    juju deploy cs:sysconfig-2
    juju add-relation sysconfig nova-compute
    juju config sysconfig sysctl="{vm.extfrag_threshold: 200, vm.watermark_scale_factor: 50}"

However, each system may have a different memory capacity, so the
watermark_scale_factor would need to be calculated manually for each one.

Implementation
==============

Assignee(s)
-----------

Primary assignee:

- Gavin Guo

Gerrit Topic
------------

Use Gerrit topic "memory-fragmentation-tuning" for all patches related to
this spec.

.. code-block:: bash

    git-review -t memory-fragmentation-tuning

Work Items
----------

Implement the watermark_scale_factor value calculation to set the gap to
1GB.

Repositories
------------

No new git repository is required.

Documentation
-------------

Documentation is needed for the switch that turns the feature on/off.

Security
--------

This feature exposes no additional security attack surface.

Testing
-------

Verify that the calculated watermark values are correct. Also verify that
the right parameter is used for each kernel version (min_free_kbytes vs.
watermark_scale_factor). A minimal verification sketch is included at the
end of this document.

Dependencies
============

None
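As a companion to the Testing section above, the following is a minimal
sketch of how the resulting gap could be verified by summing the per-zone
watermarks from /proc/zoneinfo; the parsing and the 4K page-size
assumption are illustrative, not the final test implementation.

.. code-block:: bash

    #!/bin/bash
    # Sketch: sum the per-zone min and low watermarks from /proc/zoneinfo
    # and report the total min->low gap in MB. Assumes a 4K page size;
    # after tuning, the reported gap should be roughly 1024 MB.

    awk '$1 == "min" { min += $2 }
         $1 == "low" { low += $2 }
         END { printf "min->low gap: %.1f MB\n", (low - min) * 4 / 1024 }' \
        /proc/zoneinfo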