nova/compute at 11cb42f396fdbc1d973e1a1b592c00896f646015 - nova

Matt Riedemann 11cb42f396 Restore RT.old_resources if ComputeNode.save() fails

When starting nova-compute for the first time with a new node, the ResourceTracker will create a new ComputeNode record in _init_compute_node but without all of the fields set on the ComputeNode, for example "free_disk_gb". Later _update_usage_from_instances will set some fields on the ComputeNode record (even if there are no instances on the node, why - I don't know) like free_disk_gb. This will make the eventual call from _update() to _resource_change() update the value in the old_resources dict and return True, and then _update() will try to save those ComputeNode changes to the database. If that update fails, for example due to a DBConnectionError, the value in old_resources will still reflect the current version of the node in memory rather than what is actually in the database.

Note that this failure does not result in the compute service failing to start because ComputeManager._update_available_resource_for_node traps the Exception and just logs it.

A subsequent trip through the RT._update() method - because of the update_available_resource periodic task - will call _resource_change but because old_resources matches the current state of the node, it returns False and the RT does not attempt to persist the changes to the DB. _update() will then go on to call _update_to_placement which will create the resource provider in placement along with its inventory, making it potentially a candidate for scheduling. This can be a problem later in the scheduler because the HostState._update_from_compute_node method may skip setting fields on the HostState object if free_disk_gb is not set in the ComputeNode record - which can then break filters and weighers later in the scheduling process (see bug 1834691 and bug 1834694).

The fix proposed here is simple: if the ComputeNode.save() in RT._update() fails, restore the previous value in old_resources so that the subsequent run through _resource_change will compare the correct state of the object and retry the update.

An alternative to this would be killing the compute service on startup if there is a DB error but that could have unintended side effects, especially if the DB error is transient and can be fixed on the next try.

Obviously the scheduler code needs to be more robust also, but those improvements are left for separate changes related to the other bugs mentioned above.

Also, ComputeNode.update_from_virt_driver could be updated to set free_disk_gb if possible to work around the tight coupling in the HostState._update_from_compute_node code, but that's also sort of a whack-a-mole type change best made separately.

Change-Id: Id3c847be32d8a1037722d08bf52e4b88dc5adc97
Closes-Bug: #1834712

2019-07-17 10:29:10 +01:00
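
The restore-on-failure idea described in the commit message can be illustrated with a minimal, self-contained sketch. This is not the code from resource_tracker.py: FakeComputeNode, SketchTracker, the fail_next_save flag and the use of RuntimeError in place of a DBConnectionError are all hypothetical stand-ins chosen for illustration. Only the names old_resources, _resource_change and save() come from the commit message.

```python
import copy


class FakeComputeNode:
    """Hypothetical stand-in for a ComputeNode record (not the nova object)."""

    def __init__(self, **fields):
        self.fields = dict(fields)
        self.saved = None            # what the pretend database currently holds
        self.fail_next_save = False  # test hook to simulate a transient DB error

    def save(self):
        if self.fail_next_save:
            self.fail_next_save = False
            raise RuntimeError("simulated DB connection error")
        self.saved = dict(self.fields)


class SketchTracker:
    """Simplified model of the old_resources bookkeeping, for illustration."""

    def __init__(self):
        self.old_resources = {}

    def _resource_change(self, nodename, node):
        # Compare against the last snapshot and, as a side effect, record
        # the new snapshot before anything has been written to the database.
        if self.old_resources.get(nodename) != node.fields:
            self.old_resources[nodename] = copy.deepcopy(node.fields)
            return True
        return False

    def update(self, nodename, node):
        # Remember the snapshot from before _resource_change overwrites it.
        previous = self.old_resources.get(nodename)
        if self._resource_change(nodename, node):
            try:
                node.save()
            except Exception:
                # The save failed, so the database still holds the old state.
                # Put the old snapshot back so the next pass sees a change
                # and retries the save.
                self.old_resources[nodename] = previous
                raise


rt = SketchTracker()
node = FakeComputeNode(free_disk_gb=10)

node.fail_next_save = True
try:
    rt.update("node1", node)
except RuntimeError:
    pass  # the caller traps and logs the error, as the periodic task would

rt.update("node1", node)  # second pass detects the change again and saves
```

Without the restore in the except branch, the second call to update() would find old_resources already matching the in-memory node, skip save(), and leave the pretend database without free_disk_gb, which is the situation the commit message describes.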
monitors	hacking: Resolve W503 (line break occurred before a binary operator)	2019-06-24 14:24:06 -05:00
__init__.py	Remove nova.compute.*API() shims	2019-06-12 16:09:46 +01:00
api.py	Merge "Remove Rocky-era min compute trusted certs compat check"	2019-07-16 14:00:26 +00:00
build_results.py	Compute Add build_instance hook in compute manager	2014-12-04 10:12:00 -05:00
claims.py	Make Claim._claim_test handle SchedulerLimits object	2019-02-12 11:59:51 -05:00
flavors.py	Remove deprecated 'default_flavor' config option	2019-04-30 13:01:40 +00:00
instance_actions.py	Add instance action record for snapshot instances	2017-12-11 17:46:38 +08:00
instance_list.py	Plumbing for ignoring list_records_by_skipping_down_cells	2019-02-08 16:28:28 -05:00
manager.py	Revert resize: wait for events according to hybrid plug	2019-07-10 19:56:31 -04:00
migration_list.py	Refactor scatter-gather utility to return exception objects	2018-10-31 15:18:07 -04:00
multi_cell_list.py	Bump to hacking 1.1.0	2019-04-12 16:23:49 +01:00
power_state.py	Removed enum duplication from nova.compute	2016-09-02 07:30:44 +00:00
provider_tree.py	Perf: Use dicts for ProviderTree roots	2019-07-09 16:24:39 -05:00
resource_tracker.py	Restore RT.old_resources if ComputeNode.save() fails	2019-07-17 10:29:10 +01:00
rpcapi.py	Sync COMPUTE_STATUS_DISABLED from API	2019-07-02 18:57:38 -04:00
stats.py	Change consecutive build failure limit to a weigher	2018-06-06 15:18:50 -07:00
task_states.py	Fix resource tracker updates during instance evacuation	2018-09-12 13:05:29 +03:00
utils.py	Merge "Share snapshot image membership with instance owner"	2019-03-12 18:43:12 +00:00
vm_states.py	Removed enum duplication from nova.compute	2016-09-02 07:30:44 +00:00