Use long_rpc_timeout in conductor migrate_server RPC API call

The conductor migrate_server RPC method is a blocking RPC call used by both the API during a resize / cold migrate request and by the compute service if rescheduling from a failed prep_resize operation on the selected dest host (or alternate).

Currently the RPC call is using the global rpc_response_timeout which defaults to 60 seconds. When coming from the API request, we're going from API to conductor to scheduler and don't return the response to the API caller until conductor casts to the first selected destination host's prep_resize method. In a large deployment, or under heavy load on the control plane, this could take long enough to trip the rpc_response_timeout and result in a MessagingTimeout 500 error response to the user. Reschedules from the compute should be faster since they don't involve a roundtrip call to the scheduler (since we have alternate selections since Queens).

This makes the migrate_server method use the long_rpc_timeout config for the overall timeout which defaults to 1800 seconds. The rpc_response_timeout becomes the heartbeat value to make sure the call is still alive.

This was noticed during at least one particularly slow resize call that timed out in the gate [1].

Related-Bug: #1763070

[1] http://lists.openstack.org/pipermail/openstack-discuss/2019-October/010494.html

Change-Id: I9115ef6df59844cd6e702f19ba38ffbf9f8b35d3
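
The interplay of the two timeouts described in the commit message can be illustrated with a small toy model. This is only a sketch under stated assumptions: `wait_for_reply`, its `events` argument, and the local `MessagingTimeout` class are hypothetical names invented for illustration, and the code does not reproduce how oslo.messaging actually implements call monitoring. It assumes the client gives up when either no heartbeat or reply arrives within `call_monitor_timeout` of the last message, or the overall `timeout` elapses.

```python
class MessagingTimeout(Exception):
    """Local stand-in for a messaging timeout error (toy model only)."""


def wait_for_reply(events, call_monitor_timeout, timeout):
    """Toy model of a monitored, blocking RPC call.

    events: list of (seconds_since_call_start, kind) tuples sorted by
    time, where kind is 'heartbeat' or 'reply' (hypothetical format).
    Returns the time at which the reply arrived, or raises
    MessagingTimeout if one of the two deadlines passes first.
    """
    last_seen = 0.0
    for when, kind in events:
        # The call fails at whichever deadline comes first: the
        # heartbeat window since the last message, or the hard limit.
        deadline = min(last_seen + call_monitor_timeout, timeout)
        if when > deadline:
            raise MessagingTimeout('timed out at %.0fs' % deadline)
        if kind == 'reply':
            return when
        last_seen = when
    # No reply in the event list at all: the next deadline fires.
    deadline = min(last_seen + call_monitor_timeout, timeout)
    raise MessagingTimeout('timed out at %.0fs' % deadline)


# A slow call that sends a heartbeat every 30s and replies after 150s.
slow_call = [(t, 'heartbeat') for t in range(30, 150, 30)] + [(150, 'reply')]
print(wait_for_reply(slow_call, call_monitor_timeout=60, timeout=1800))
```

In this model, a reply at 150 seconds succeeds as long as heartbeats keep arriving inside the 60-second window, whereas the same reply with no heartbeats would fail at 60 seconds, and a call that only ever sends heartbeats would fail at the 1800-second hard limit.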
2019-11-01 10:14:34 -04:00 · 2019-11-01 10:14:34 -04:00 · cd0021157b
parent 46a02d5eb5
commit cd0021157b
3 changed files with 35 additions and 1 deletions
--- a/nova/conductor/rpcapi.py
+++ b/nova/conductor/rpcapi.py
@@ -338,7 +338,10 @@ class ComputeTaskAPI(object):
            kw['instance'] = jsonutils.to_primitive(
                    objects_base.obj_to_primitive(instance))
            version = '1.4'
-        cctxt = self.client.prepare(version=version)
+        cctxt = self.client.prepare(
+            version=version,
+            call_monitor_timeout=CONF.rpc_response_timeout,
+            timeout=CONF.long_rpc_timeout)
        return cctxt.call(context, 'migrate_server', **kw)

    def build_instances(self, context, instances, image, filter_properties,
--- a/nova/conf/rpc.py
+++ b/nova/conf/rpc.py
@@ -31,6 +31,7 @@ Operations with RPC calls that utilize this value:
 * enabling/disabling a compute service
 * image pre-caching
 * snapshot-based / cross-cell resize
+* resize / cold migration

 Related options:

--- a/nova/tests/unit/conductor/test_conductor.py
+++ b/nova/tests/unit/conductor/test_conductor.py
@@ -3584,6 +3584,36 @@ class ConductorTaskRPCAPITestCase(_BaseTaskTestCase,
                              self.context, mock.sentinel.aggregate,
                              [mock.sentinel.image])

+    def test_migrate_server(self):
+        self.flags(rpc_response_timeout=10, long_rpc_timeout=120)
+        instance = objects.Instance()
+        scheduler_hint = {}
+        live = rebuild = False
+        flavor = objects.Flavor()
+        block_migration = disk_over_commit = None
+
+        @mock.patch.object(self.conductor.client, 'can_send_version',
+                           return_value=True)
+        @mock.patch.object(self.conductor.client, 'prepare')
+        def _test(prepare_mock, can_send_mock):
+            self.conductor.migrate_server(
+                self.context, instance, scheduler_hint, live, rebuild,
+                flavor, block_migration, disk_over_commit)
+            kw = {'instance': instance, 'scheduler_hint': scheduler_hint,
+                  'live': live, 'rebuild': rebuild, 'flavor': flavor,
+                  'block_migration': block_migration,
+                  'disk_over_commit': disk_over_commit,
+                  'reservations': None, 'clean_shutdown': True,
+                  'request_spec': None, 'host_list': None}
+            prepare_mock.assert_called_once_with(
+                version=test.MatchType(str),  # version
+                call_monitor_timeout=10,
+                timeout=120)
+            prepare_mock.return_value.call.assert_called_once_with(
+                self.context, 'migrate_server', **kw)
+
+        _test()
+

 class ConductorTaskAPITestCase(_BaseTaskTestCase, test_compute.BaseTestCase):
    """Compute task API Tests."""