Use long_rpc_timeout in conductor migrate_server RPC API call

The conductor migrate_server RPC method is a blocking RPC call
used by both the API during a resize / cold migrate request and
by the compute service if rescheduling from a failed prep_resize
operation on the selected dest host (or alternate).

Currently the RPC call is using the global rpc_response_timeout
which defaults to 60 seconds. When coming from the API request,
we're going from API to conductor to scheduler and don't return
the response to the API caller until conductor casts to the
first selected destination host's prep_resize method. In a large
deployment, or under heavy load on the control plane, this could
take long enough to trip the rpc_response_timeout and result in
a MessagingTimeout 500 error response to the user.

Reschedules from the compute service should be faster because they
do not involve a round trip to the scheduler: alternate hosts have
been selected up front since Queens.

This makes the migrate_server method use the long_rpc_timeout
config option, which defaults to 1800 seconds, as the overall
timeout. The rpc_response_timeout becomes the heartbeat timeout
used to check that the call is still alive.
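
As a rough illustration of how an overall timeout and a heartbeat
timeout interact, here is a minimal sketch in simulated time. It is
not the oslo.messaging implementation; CallTimeout and wait_for_reply
are hypothetical names invented for this example, and the event list
stands in for heartbeats and replies arriving from the server.

```python
class CallTimeout(Exception):
    """Hypothetical stand-in for a messaging timeout error."""


def wait_for_reply(events, overall_timeout, heartbeat_timeout):
    """Return the time at which the reply arrives, or raise CallTimeout.

    events: (seconds_since_call, kind) pairs sorted by time, where kind
    is 'heartbeat' or 'reply'. Simulated time only; nothing sleeps.
    """
    last_seen = 0  # time of the call or of the most recent heartbeat
    for ts, kind in events:
        # The call fails at whichever limit is reached first.
        deadline = min(overall_timeout, last_seen + heartbeat_timeout)
        if ts > deadline:
            raise CallTimeout('nothing received by %ss' % deadline)
        if kind == 'reply':
            return ts
        last_seen = ts
    raise CallTimeout('server went silent after %ss' % last_seen)


# A reply that takes 90 seconds survives because heartbeats keep arriving.
slow_call = [(30, 'heartbeat'), (60, 'heartbeat'), (90, 'reply')]
print(wait_for_reply(slow_call, overall_timeout=1800, heartbeat_timeout=60))  # prints 90
```

In this sketch the same 90 second reply without any heartbeats raises
CallTimeout at the 60 second mark, which mirrors the failure described
above when only rpc_response_timeout applied.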

This was noticed during at least one particularly slow resize
call that timed out in the gate [1].

Related-Bug: #1763070

[1] http://lists.openstack.org/pipermail/openstack-discuss/2019-October/010494.html

Change-Id: I9115ef6df59844cd6e702f19ba38ffbf9f8b35d3
Matt Riedemann 2019-11-01 10:14:34 -04:00
parent 46a02d5eb5
commit cd0021157b
3 changed files with 35 additions and 1 deletions

@@ -338,7 +338,10 @@ class ComputeTaskAPI(object):
             kw['instance'] = jsonutils.to_primitive(
                     objects_base.obj_to_primitive(instance))
             version = '1.4'
-        cctxt = self.client.prepare(version=version)
+        cctxt = self.client.prepare(
+            version=version,
+            call_monitor_timeout=CONF.rpc_response_timeout,
+            timeout=CONF.long_rpc_timeout)
         return cctxt.call(context, 'migrate_server', **kw)
 
     def build_instances(self, context, instances, image, filter_properties,

@@ -31,6 +31,7 @@ Operations with RPC calls that utilize this value:
 * enabling/disabling a compute service
 * image pre-caching
 * snapshot-based / cross-cell resize
+* resize / cold migration
 
 Related options:

@@ -3584,6 +3584,36 @@ class ConductorTaskRPCAPITestCase(_BaseTaskTestCase,
                 self.context, mock.sentinel.aggregate,
                 [mock.sentinel.image])
 
+    def test_migrate_server(self):
+        self.flags(rpc_response_timeout=10, long_rpc_timeout=120)
+        instance = objects.Instance()
+        scheduler_hint = {}
+        live = rebuild = False
+        flavor = objects.Flavor()
+        block_migration = disk_over_commit = None
+
+        @mock.patch.object(self.conductor.client, 'can_send_version',
+                           return_value=True)
+        @mock.patch.object(self.conductor.client, 'prepare')
+        def _test(prepare_mock, can_send_mock):
+            self.conductor.migrate_server(
+                self.context, instance, scheduler_hint, live, rebuild,
+                flavor, block_migration, disk_over_commit)
+            kw = {'instance': instance, 'scheduler_hint': scheduler_hint,
+                  'live': live, 'rebuild': rebuild, 'flavor': flavor,
+                  'block_migration': block_migration,
+                  'disk_over_commit': disk_over_commit,
+                  'reservations': None, 'clean_shutdown': True,
+                  'request_spec': None, 'host_list': None}
+            prepare_mock.assert_called_once_with(
+                version=test.MatchType(str),  # version
+                call_monitor_timeout=10,
+                timeout=120)
+            prepare_mock.return_value.call.assert_called_once_with(
+                self.context, 'migrate_server', **kw)
+
+        _test()
+
 
 class ConductorTaskAPITestCase(_BaseTaskTestCase, test_compute.BaseTestCase):
     """Compute task API Tests."""
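
The assertion pattern used in the new test can be tried outside the
Nova tree with only the standard library. The sketch below is an
assumption-laden stand-in, not Nova code: FakeTaskAPI is a
hypothetical class that mimics the prepare()/call() sequence from the
first hunk, and the client is a plain mock rather than a real RPC
client.

```python
from unittest import mock


class FakeTaskAPI:
    """Hypothetical stand-in that mimics the prepare()/call() sequence."""

    def __init__(self, client, rpc_response_timeout, long_rpc_timeout):
        self.client = client
        self.rpc_response_timeout = rpc_response_timeout
        self.long_rpc_timeout = long_rpc_timeout

    def migrate_server(self, context, **kw):
        version = '1.4'  # value borrowed from the hunk context above
        cctxt = self.client.prepare(
            version=version,
            call_monitor_timeout=self.rpc_response_timeout,
            timeout=self.long_rpc_timeout)
        return cctxt.call(context, 'migrate_server', **kw)


client = mock.Mock()
api = FakeTaskAPI(client, rpc_response_timeout=10, long_rpc_timeout=120)
api.migrate_server('ctx', instance='inst')

# Both timeouts must be passed to prepare(); the version is not pinned.
client.prepare.assert_called_once_with(
    version=mock.ANY, call_monitor_timeout=10, timeout=120)
# The call itself goes through the context returned by prepare().
client.prepare.return_value.call.assert_called_once_with(
    'ctx', 'migrate_server', instance='inst')
```

Asserting on prepare() rather than on call() is what makes the test
sensitive to the timeout arguments, since call() never sees them.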