From 76de3fd931ca2e1e162fc2f5f14fcdd76cbbc851 Mon Sep 17 00:00:00 2001 From: Michele Baldessari Date: Thu, 12 Mar 2020 15:01:25 +0100 Subject: [PATCH] Use exec when spawning dnsmasq inside sidecar container We see some deployment failures where the overcloud is unable to PXE/DHCP boot during the initial bits of the deployments. The following errors are seen in neutron dhcp logs: 2020-03-11 17:58:33.737 54481 DEBUG neutron.agent.dhcp.agent [req-6caace19-095f-4115-be85-644f7a8baa7f - - - - -] Resync event has been scheduled _periodic_resync_helper /usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py:277 2020-03-11 17:58:33.737 54481 DEBUG neutron.common.utils [req-6caace19-095f-4115-be85-644f7a8baa7f - - - - -] Calling throttled function clear wrapper /usr/lib/python3.6/site-packages/neutron/common/utils.py:110 2020-03-11 17:58:33.738 54481 DEBUG neutron.agent.dhcp.agent [req-6caace19-095f-4115-be85-644f7a8baa7f - - - - -] resync (a187b137-b68c-476e-bd37-39253158e762): [ProcessExecutionError("Exit code: 125; Stdin: ; Stdout: ; Stderr: + exec\n+ trap 'exec 2>&4 1>&3' 0 1 2 3\n+ exec\n",)] _periodic_resync_helper /usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py:294 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent [-] Unable to reload_allocations dhcp for a187b137-b68c-476e-bd37-39253158e762.: neutron_lib.exceptions.ProcessExecutionError: Exit code: 125; Stdin: ; Stdout: ; Stderr: + exec + trap 'exec 2>&4 1>&3' 0 1 2 3 + exec 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent Traceback (most recent call last): 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 160, in call_driver 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent getattr(driver, action)(**action_kwargs) 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 528, in reload_allocations 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent self._spawn_or_reload_process(reload_with_HUP=True) 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 470, in _spawn_or_reload_process 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent pm.enable(reload_cfg=reload_with_HUP, ensure_active=True) 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py", line 92, in enable 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent self.reload_cfg() 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py", line 100, in reload_cfg 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent self.disable('HUP') 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py", line 113, in disable 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent utils.execute(cmd, run_as_root=self.run_as_root) 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py", line 147, in execute 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent returncode=returncode) 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent neutron_lib.exceptions.ProcessExecutionError: Exit code: 125; Stdin: ; Stdout: ; Stderr: + exec 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent + trap 'exec 2>&4 1>&3' 0 1 2 3 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent + exec 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent 2020-03-11 17:58:33.740 54481 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/a187b137-b68c-476e-bd37-39253158e762.pid.haproxy get_value_from_file /usr/lib/pyt The issue is that the dhcp side containers are spawned with the following processes: |-conmon(375906)-+-dumb-init(375918)---bash(375935)---dnsmasq(375938) | `-{conmon}(375908) Now when neutron wants to send a SIGHUP to the dnsmasq it actually invokes the following command: nsenter --net=/run/netns/qdhcp-e11ac152-745f-4292-b423-d282fbf97f13 --preserve-credentials -m -t 1 podman kill --signal HUP 0b7371f6de52cfe377858... The problem is that podman kill will send the signal to "dumb-init --single-child" (pid1 for this container) which will then forward it only to bash, which will cause dnsmasq to be terminated and will eventually be later respawned with a different pid (stored in /var/lib/neutron/dhcp/86884abc-f7d7-4118-923f-38b247fee8e9/pid). So if multiple ports are created concurrently this is racy and one of them will fail to reload dnsmasq with the error above, because one process might use a pid file that is no longer valid. TLDR: this all works if SIGHUP to the dnsmasq process does not change pids under the hood all of a sudden. Otherwise a SIGHUP used to reload dnsmasq will be sent to the bash process father of dnsmasq which will then terminate dnsmasq and break all the things Tested this on three runs and did not experience the issue any longer Co-Authored-By: Bernard Cafarelli Co-Authored-By: Slawomir Kaplonski Closes-Bug: #1867192 Change-Id: I1af2ecd9e3996de4f43224f66a8bdb81eab07022 (cherry picked from commit 3ca7e8f03faab0ee599a18bce4217430583e2050) --- deployment/neutron/neutron-dhcp-container-puppet.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/deployment/neutron/neutron-dhcp-container-puppet.yaml b/deployment/neutron/neutron-dhcp-container-puppet.yaml index 3e5728d7e1..114d7cfeb5 100644 --- a/deployment/neutron/neutron-dhcp-container-puppet.yaml +++ b/deployment/neutron/neutron-dhcp-container-puppet.yaml @@ -333,7 +333,7 @@ outputs: loop_var: dhcp_wrapper_item loop: - name: dhcp_dnsmasq - cmd: /usr/sbin/dnsmasq -k + cmd: exec /usr/sbin/dnsmasq -k kill_script: dnsmasq-kill - name: dhcp_haproxy cmd: >-