From c33767c97b0c3cffd4d982bad3aef309d48471d0 Mon Sep 17 00:00:00 2001
From: lixiaoy1
Date: Thu, 13 Dec 2018 20:11:00 +0800
Subject: [PATCH] Driver reinitialization after failure

This is an update to the driver reinitialization spec after the
discussion we had at the Cinder meeting.

Change-Id: Ic8b58efb240efa535cc5cb039139b497b48f1570
Implements: blueprint driver-initialization-after-fail
---
 .../driver-reinitialization-after-fail.rst | 63 +++++++------------
 1 file changed, 22 insertions(+), 41 deletions(-)

diff --git a/specs/stein/driver-reinitialization-after-fail.rst b/specs/stein/driver-reinitialization-after-fail.rst
index a5400d01..37982ad1 100644
--- a/specs/stein/driver-reinitialization-after-fail.rst
+++ b/specs/stein/driver-reinitialization-after-fail.rst
@@ -16,27 +16,13 @@ during startup.
 Problem description
 ===================
 
-When a service starts, at first it does initialization. If something wrong
-happens, the service ends.
-After initialization completes, it cleans up resources, initializes its driver
-and RPC servers. Later the volume service reports its capabilities to scheduler
-in fixed interval.
+During Cinder initialization, the storage backend might not be ready and
+responding for many reasons. In this case, the driver will not be loaded,
+even if the array becomes available right afterwards.
 
-During above progress, some errors lead to the service process exiting,
-leaving the volume driver not functioning.
-
-1) Driver not supported.
-2) Driver initialization fails.
-3) Volume cleanups are processed by parent class CleanableManager.
-4) Something wrong with 'publish_service_capabilities' [2].
-
-The driver will not be initialized in above case 1), 2) and 3). As a result,
-although the volume service process exists, and tries to publish its service
-capabilities to scheduler, but fails every time. This means users have to
-restart cinder volume to re-initialize drivers.
-
-When case 4 happens the volume service moves on and can become available if
-it succeeds to publish its service to scheduler next time.
+As there is no retry in the Cinder volume service, it cannot recover by
+itself even when the backend storage becomes ready later. Users need to
+restart the volume service manually.
 
 Use Cases
 =========
@@ -44,7 +30,7 @@ Use Cases
 When a Cinder volume service starts, sometimes its corresponding storage
 services are not ready. But later the storage services become ready. As a
 result the volume service can't work properly and can't recover by itself.
-But the administrator probably prefer Cinder to automatically recover from
+But administrators probably prefer Cinder to automatically recover from
 the temporary failures without manual intervention of restarting the service.
 
 Proposed change
 ===============
@@ -52,31 +38,30 @@ Proposed change
 
 The proposal is to
 
-- Reintialize the volume driver when it failed to initialize. This reinitialization
-  happens except when the error is unrecoverable. The following lists
-  unrecoverable errors:
+- Allow reinitialization of a volume driver when it fails to initialize.
 
-  1) config error
-  2) Driver not supported
-  3) lack of python driver library
+- Provide a configuration option to set the maximum number of retries.
 
-- Every time when a volume service publishes its capability to scheduler,
-  it checks whether the driver is initialized. If it is not initialized
-  and can be recovered, it calls init_host[1] to do reinitialization.
+- The retry interval will back off exponentially. Each interval is two to
+  the power of the retry count: the first interval is 1s, the second 2s,
+  the third 4s, and so on.
+
+- Retries will be handled in init_host.
 
 For this, the following additional config option would be needed:
 
 - 'reinit_driver_count' (default: 3) Set the maximum times to reinitialize
   the driver if volume initialization fails.
-  Default number is 3. The value 0 means no limitation.
+  Default number is 3.
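+
+As an illustration, below is a minimal sketch of the intended retry loop.
+The helper name 'init_host_with_retry' and the direct driver calls here
+are only illustrative; the actual retry would live in the init_host
+path::
+
+    import time
+
+    def init_host_with_retry(driver, context, max_retries):
+        """Initialize the driver, retrying with exponential backoff.
+
+        max_retries maps to the proposed 'reinit_driver_count' option;
+        the sleep before retry N is 2 ** (N - 1) seconds: 1s, 2s, 4s, ...
+        """
+        for attempt in range(max_retries + 1):
+            try:
+                # Backend-specific setup; this is what fails while the
+                # storage backend is not yet ready and responding.
+                driver.do_setup(context)
+                driver.check_for_setup_error()
+                return True
+            except Exception:
+                if attempt == max_retries:
+                    return False  # retries exhausted; stay uninitialized
+                time.sleep(2 ** attempt)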
 
 Alternatives
 ------------
 
-Compared with listing unrecoverable errors when checking whether retrying, another
-way is to keep a list of all recoverable errors and reinitialize on the errors in
-the list. The problem is that it tightly depends on exceptions raised by drivers which
-may change on different versions.
+- We could also decide whether to retry based on the exception type: on an
+  import error or a config error, for example, there would be no retry.
+  But the benefit is not very significant, and implementing the
+  differentiation needs work in every driver, as drivers currently don't
+  distinguish such errors from backend storage errors.
 
 Data model impact
 -----------------
@@ -137,10 +122,7 @@ Work Items
 ----------
 
 * Add the option 'reinit_driver_count'.
-* Need to differentiate config and library error in drivers, and update these
-  exceptions to inherit from same base exceptions. So that we can skip these
-  errors when reinitializing.
-* Reinitialize volume drivers in 'publish_service_capabilities' [2].
+* Retry volume driver initialization when it fails.
 * Add related unit test cases.
 
 Dependencies
 ============
@@ -162,5 +144,4 @@ Documentation Impact
 References
 ==========
 
-_`[1]`: https://github.com/openstack/cinder/blob/master/cinder/volume/manager.py#L408
-_`[2]`: https://github.com/openstack/cinder/blob/master/cinder/volume/manager.py#L2539
+* None