Add workaround to disable group policy check upcall

Right now, the compute node is responsible for validating that the scheduler
honored the affinity policy of a server group. It does this by listing all
the instances in a server group and validating that affinity or anti-affinity
was kept true. With cellsv2, this check requires an upcall to the API
database, as server groups are stored there. This violates our goal of not
allowing any upcalls, which keeps us congruent with the known restrictions
that actual cells(v1) deployers require.
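
For context, the check being short-circuited works roughly like the sketch
below. This is a simplified illustration, not the exact Nova code: the helper
name, the get_hosts() call shape, and the error handling are approximations of
what the compute manager does when it validates the policy.

# Rough sketch of the late group-policy check (illustrative only; the real
# code lives in ComputeManager, where nova.objects and nova.exception are
# imported and self.host is the compute service's own hostname).
def _check_group_policy(self, context, instance, group_hint):
    # This lookup reads server groups from the API database -- under
    # cells v2 that is an upcall from the cell to the API level.
    group = objects.InstanceGroup.get_by_hint(context, group_hint)
    hosts = group.get_hosts(exclude=[instance.uuid])
    if 'affinity' in group.policies and hosts and self.host not in hosts:
        raise exception.RescheduledException(
            instance_uuid=instance.uuid,
            reason='Affinity server group policy was violated.')
    if 'anti-affinity' in group.policies and self.host in hosts:
        raise exception.RescheduledException(
            instance_uuid=instance.uuid,
            reason='Anti-affinity server group policy was violated.')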

This adds a workaround flag, disabled by default, which defeats this check
in the compute node. For smaller deployments, isolation of the cell and api
services is not as big of a deal, so by default, this check will continue
to work as it does today. Larger deployments which care more about the
isolation than they do the absolute affinity guarantees can enable this
workaround to avoid the upcall check. A user can still detect a violation of
their affinity request by examining the obfuscated host identifiers (the
hostId field) of the instances in the group.
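
For illustration, enabling the workaround on the compute nodes of such a
deployment would look like the following nova.conf snippet (the group and
option name come from this change; the default remains False, so smaller
deployments keep the late check without doing anything):

[workarounds]
disable_group_policy_check_upcall = True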

Longer-term, this problem goes away when we're doing claims in the scheduler
and placement has a notion of affinity that it can enforce early in the
boot process, eliminating the race and the need for this check entirely.

Related to blueprint cells-aware-api

Change-Id: I22e1a0736a269d89b55b71c2413fa763d5f1fd0c
Author: Dan Smith
Date:   2017-03-07 11:24:31 -08:00
Commit: ef1c539ad1
Parent: 1226c57884
5 changed files with 55 additions and 1 deletion

@@ -1304,7 +1304,8 @@ class ComputeManager(manager.Manager):
                         instance_uuid=instance.uuid,
                         reason=msg)
 
-        _do_validation(context, instance, group_hint)
+        if not CONF.workarounds.disable_group_policy_check_upcall:
+            _do_validation(context, instance, group_hint)
 
     def _log_original_error(self, exc_info, instance_uuid):
         LOG.error(_LE('Error: %s'), exc_info[1], instance_uuid=instance_uuid,

@@ -230,6 +230,11 @@ usage data to query the database on each request instead.
 
 This option is only used by the FilterScheduler and its subclasses; if you use
 a different scheduler, this option has no effect.
+
+NOTE: In a multi-cell (v2) setup where the cell MQ is separated from the
+top-level, computes cannot directly communicate with the scheduler. Thus,
+this option cannot be enabled in that scenario. See also the
+[workarounds]/disable_group_policy_check_upcall option.
 """),
     cfg.MultiStrOpt("available_filters",
         default=["nova.scheduler.filters.all_filters"],
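
As a rough companion example (not part of this change, and assuming the option
lives in the [filter_scheduler] group on the deployed release), a multi-cell
deployment whose computes cannot reach the scheduler would also keep this
tracking disabled in nova.conf:

[filter_scheduler]
# Upcalls from computes to the scheduler are not possible in this topology,
# so instance-change tracking must stay off.
track_instance_changes = False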

@@ -123,6 +123,24 @@ Interdependencies to other options:
 
 * If ``sync_power_state_interval`` is negative and this feature is disabled,
   then instances that get out of sync between the hypervisor and the Nova
   database will have to be synchronized manually.
 """),
+    cfg.BoolOpt(
+        'disable_group_policy_check_upcall',
+        default=False,
+        help="""
+Disable the server group policy check upcall in compute.
+
+In order to detect races with server group affinity policy, the compute
+service attempts to validate that the policy was not violated by the
+scheduler. It does this by making an upcall to the API database to list
+the instances in the server group of the instance it is booting, which
+violates our api/cell isolation goals. Eventually this will be solved by
+proper affinity guarantees in the scheduler and placement service, but
+until then, this late check is needed to ensure proper affinity policy.
+
+Operators that desire api/cell isolation over this check should
+enable this flag, which will avoid making that upcall from compute.
+
+"""),
 ]

@@ -4317,6 +4317,26 @@ class ComputeManagerBuildInstanceTestCase(test.NoDBTestCase):
         mock_prep.assert_called_once_with(self.context, self.instance,
                                           self.block_device_mapping)
 
+    @mock.patch('nova.objects.InstanceGroup.get_by_hint')
+    def test_validate_policy_honors_workaround_disabled(self, mock_get):
+        instance = objects.Instance(uuid=uuids.instance)
+        filter_props = {'scheduler_hints': {'group': 'foo'}}
+        mock_get.return_value = objects.InstanceGroup(policies=[])
+        self.compute._validate_instance_group_policy(self.context,
+                                                     instance,
+                                                     filter_props)
+        mock_get.assert_called_once_with(self.context, 'foo')
+
+    @mock.patch('nova.objects.InstanceGroup.get_by_hint')
+    def test_validate_policy_honors_workaround_enabled(self, mock_get):
+        self.flags(disable_group_policy_check_upcall=True, group='workarounds')
+        instance = objects.Instance(uuid=uuids.instance)
+        filter_props = {'scheduler_hints': {'group': 'foo'}}
+        self.compute._validate_instance_group_policy(self.context,
+                                                     instance,
+                                                     filter_props)
+        self.assertFalse(mock_get.called)
+
     def test_failed_bdm_prep_from_delete_raises_unexpected(self):
         with test.nested(
             mock.patch.object(self.compute,

@@ -0,0 +1,10 @@
+---
+issues:
+  - |
+    In deployments with multiple (v2) cells, upcalls from the computes to the scheduler
+    (or other control services) cannot occur. This prevents certain things from happening,
+    such as the track_instance_changes updates, as well as the late affinity checks for
+    server groups. See the related documentation on the `scheduler.track_instance_changes`
+    and `workarounds.disable_group_policy_check_upcall` configuration options for more
+    details. Single-cell deployments without any MQ isolation will continue to operate as
+    they have for the time being.