Set SSH server keep alive options

When os-net-config applies the network configuration on the overcloud nodes,
SSH connections can be dropped.

Because SSH retries are set to 8 in ansible.cfg, Ansible retries the task
when it fails with an SSH connection error.

However, the first task is actually still running and eventually succeeds.

The second task, started by Ansible as a retry, sees that the deployment is
already applied, but the notification file (*.notify.json) does not yet exist
because the first task is still in progress. This causes the second task to
fail with the error reported in the bug, which in turn fails the whole
ansible-playbook run.
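
The race described above can be modelled with a small sketch. This is only an
illustration of the reasoning, not the actual deployment script: the function
name retry_outcome and the "<name>.json" applied marker are hypothetical, and
only the "*.notify.json" suffix comes from the description above.

```python
import os
import tempfile


def retry_outcome(state_dir, name):
    """Model what a retried task observes (hypothetical file layout)."""
    # Hypothetical marker meaning "the deployment has been applied".
    applied = os.path.exists(os.path.join(state_dir, name + ".json"))
    # Notification file, written only once the first task has finished.
    notified = os.path.exists(os.path.join(state_dir, name + ".notify.json"))
    if applied and not notified:
        # The retry arrives while the first task is still in progress.
        return "error"
    if applied and notified:
        return "done"
    return "not-applied"


state_dir = tempfile.mkdtemp()
# The first task has marked the deployment as applied but is still running.
open(os.path.join(state_dir, "example.json"), "w").close()
print(retry_outcome(state_dir, "example"))
```

In this model the retry only fails inside the window between the two files
being written, which is why keeping the first SSH connection alive avoids the
error.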

Setting the ServerAliveInterval and ServerAliveCountMax SSH options seems to
fix the issue, as ssh doesn't drop the first connection when they are
configured.
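
As a rough guide to what the two options mean together: the ssh client sends a
keepalive probe after ServerAliveInterval seconds of silence and disconnects
once ServerAliveCountMax probes go unanswered. The sketch below works out the
approximate window for the values used in this change; the helper name
keepalive_window is hypothetical.

```python
def keepalive_window(interval, count_max):
    """Approximate seconds of server silence before ssh disconnects."""
    return interval * count_max


# Values from this change: ServerAliveInterval=5, ServerAliveCountMax=5.
window = keepalive_window(5, 5)
print(window)
```

With these values the client tolerates roughly 25 seconds without a response
before it gives up on the connection.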

Change-Id: I08781fe2aa6472d3fae5c5f5d0babd1f7a3b9b2d
Closes-Bug: #1792343
(cherry picked from commit c0f41cae9f)
James Slagle 2018-09-20 13:36:03 -04:00 committed by Radoslaw Smigielski
parent 9c856b0101
commit 56bf1d6db5
2 changed files with 8 additions and 1 deletions


@@ -0,0 +1,5 @@
+---
+fixes:
+  - The ServerAliveInterval and ServerAliveCountMax SSH options are now set in
+    the mistral ansible action so that when networking configuration is
+    performed on the overcloud nodes SSH will not drop the connection.


@@ -48,7 +48,9 @@ def write_default_ansible_cfg(work_dir,
                '-o UserKnownHostsFile=/dev/null '
                '-o StrictHostKeyChecking=no '
                '-o ControlMaster=auto '
-               '-o ControlPersist=30m')
+               '-o ControlPersist=30m '
+               '-o ServerAliveInterval=5 '
+               '-o ServerAliveCountMax=5')
     config.set('ssh_connection', 'control_path_dir',
                os.path.join(work_dir, 'ansible-ssh'))
     config.set('ssh_connection', 'retries', '8')
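
The effect of the hunk above on the generated ansible.cfg can be sketched with
the standard library alone. This is a minimal, self-contained approximation,
not the real write_default_ansible_cfg: the function name
write_ssh_connection_cfg is hypothetical, and only the [ssh_connection]
settings visible in the diff are reproduced.

```python
import configparser
import os
import tempfile


def write_ssh_connection_cfg(work_dir):
    """Write an ansible.cfg holding only the ssh_connection settings."""
    config = configparser.ConfigParser()
    config.add_section('ssh_connection')
    # Adjacent string literals are concatenated into one ssh_args value.
    config.set('ssh_connection', 'ssh_args',
               '-o UserKnownHostsFile=/dev/null '
               '-o StrictHostKeyChecking=no '
               '-o ControlMaster=auto '
               '-o ControlPersist=30m '
               '-o ServerAliveInterval=5 '
               '-o ServerAliveCountMax=5')
    config.set('ssh_connection', 'control_path_dir',
               os.path.join(work_dir, 'ansible-ssh'))
    config.set('ssh_connection', 'retries', '8')
    path = os.path.join(work_dir, 'ansible.cfg')
    with open(path, 'w') as cfg_file:
        config.write(cfg_file)
    return path


work_dir = tempfile.mkdtemp()
cfg_path = write_ssh_connection_cfg(work_dir)
with open(cfg_path) as cfg_file:
    print(cfg_file.read())
```

Note that the removed line ended the string with a closing parenthesis; the new
version adds a trailing space after ControlPersist=30m so the two extra options
are not fused onto it.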