Merge "Add worker retry and future updates support"

This commit is contained in:
Jenkins 2014-12-22 12:13:06 +00:00 committed by Gerrit Code Review
commit 19e877e79c
1 changed files with 349 additions and 0 deletions

View File

@ -0,0 +1,349 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode
===========================================
Add worker retry and future updates support
===========================================
Launchpad blueprint:
https://blueprints.launchpad.net/barbican/+spec/add-worker-retry-update-support
The Barbican worker processes need a means to retry failed yet recoverable
tasks (such as when remote systems are unavailable) and to handle updates for
long-running order processes such as certificate generation. This blueprint
defines the requirements for this retry and update processing, and proposes an
implementation to add this feature.
Problem Description
===================
Barbican manages asynchronous tasks, such as generating secrets, via datastore
tracking entities such as orders (currently the only tracking entity in
Barbican). These entities have a status field that tracks their state, starting
with PENDING for new entities, and moving to either ACTIVE or ERROR states for
successful or unsuccessful termination of the asynchronous task respectively.
Barbican worker processes implement these asynchronous tasks, as depicted on
this wiki page: https://github.com/cloudkeep/barbican/wiki/Architecture
As shown in the diagram, a typical deployment can include multiple worker
processes operating in parallel off a tasking queue. The queue invokes task
methods on the worker processes via RPC. In some cases, these invoked tasks
require the entity (e.g. an order) to stay PENDING, either to allow for
follow-on processing in the future or else to retry processing due to a
temporary blocking condition (e.g. a remote service is not available at this
time).
The following are requirements for retrying tasks in the future and thus
keeping the tracking entity in the PENDING state::
R-1) Barbican needs to support extended workflow processes whereby an entity
might be PENDING for a long time, requiring periodic status checks to
see if the workflow is completed
R-2) Barbican needs to support re-attempting an RPC task at some point in
the future if dependent services are temporarily unavailable
Note that this blueprint does not handle concurrent updates made to the
same entity, say to perform a periodic status check on an order and also apply
client updates to that same order. This will be addressed in a future
blueprint.
Note also that this blueprint does not handle entities that are 'stuck' in the
PENDING state because of lost messages in the queue or workers that crash while
processing an entity. This will also be addressed in a future blueprint.
In addition, the following non-functional requirements are needed in the final
implementation::
NF-1) To keep entity state consistent, only one worker can work on an
entity or manage retrying tasks at a time.
NF-2) For resilience of the worker cluster:
a) Any worker process (of a cluster of workers) should be able to
handle retrying entities independently of other worker processes,
even if these worker processes are intermittently available.
b) If a worker comes back online after going down, it should be able to
start processing retry tasks again, without needing to synchronize with
other workers.
NF-3) In the default standalone Barbican implementation, it should be
possible to demonstrate the periodic status check feature via the
SimpleCertificatePlugin class in
barbican.plugin.simple_certificate_manager.py.
The following assumptions are made::
A-1) Accurate retry times are not required:
a) For example, if a task is to be retried in 5 minutes, it would be
acceptable if the task was actually retried after more than 5
minutes. For SSL certificate workflows, where some certificate types
can take days to process, such retry delays would not be
significant.
b) Relaxed retry schedules allow for more granular retry-checking
intervals, and accommodate delays due to excessive tasks in queues
during busy times.
c) Retry times delayed well beyond their expected values could indicate
that worker nodes are overloaded. This blueprint does not address
this issue, deferring instead to deployment monitoring and scaling
processes.
Proposed Change
===============
This blueprint proposes that for requirements R-1 and R-2, the plugins used by
worker tasks (such as the certificate plugin) determine if tasks should be
retried and at what time in the future. If plugins determine that a task
should be retried, then these tasks will be scheduled for a future retry
attempt.
To implement this scheduling process, this blueprint proposes using the Oslo
periodic task feature, described here:
http://docs.openstack.org/developer/oslo-incubator/api/openstack.common.periodic_task.html
A working example implementation with an older code base is shown here:
https://github.com/cloudkeep/barbican/blob/verify-resource/barbican/queue/server.py#L174
Each worker node could then execute a periodic task service that invokes a
method on a scheduled basis (configurable, say every 15 seconds). This method
would then query which tasks need to be retried (say if current time >=
retry time), and for each one issue a retry task message to the queue. Once
tasks are enqueued, this method would remove the retry records from the retry
list. Eventually the queue would invoke workers to implement these retry tasks.
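As a rough illustration only, the periodic retry method might resemble the
following sketch. It assumes the Oslo periodic task decorator (shown here with
the oslo.service namespace rather than the oslo-incubator module referenced
above), and the queue client and OrderRetryTask repository helpers are
hypothetical::

    from oslo_service import periodic_task


    class RetryScheduler(periodic_task.PeriodicTasks):
        """Scans OrderRetryTask records and re-enqueues tasks that are due."""

        def __init__(self, conf, queue_client, retry_repo):
            super(RetryScheduler, self).__init__(conf)
            self.queue = queue_client      # RPC client used to send retry tasks
            self.retry_repo = retry_repo   # data access for OrderRetryTask rows

        @periodic_task.periodic_task(spacing=15)  # 'schedule_period_seconds'
        def process_retry_tasks(self, context):
            # Query retry records whose retry_at time has passed ...
            for task in self.retry_repo.get_due_tasks():
                # ... re-enqueue each one for a worker to pick up later
                # (args/kwargs are assumed de-JSON-ified by the repository) ...
                self.queue.send(task.retry_task,
                                *task.retry_args, **task.retry_kwargs)
                # ... and remove the record so it is not enqueued twice.
                self.retry_repo.delete(task)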
To provide a means to evaluate the retry feature in standalone Barbican per
NF-3, the SimpleCertificatePlugin class in
barbican.plugin.simple_certificate_manager.py would be modified to have the
issue_certificate_request() method return a retry time of 5 seconds
(configurable). The check_certificate_status() method would then return a
successful execution to terminate the order in the ACTIVE state.
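A hedged sketch of that demo change follows. The CertificatePluginBase,
ResultDTO and CertificateStatus names come from Barbican's certificate plugin
interface, but the exact method signatures and the retry field name are
assumptions about how the final implementation might look::

    from barbican.plugin.interface import certificate_manager as cert

    DELAY_BEFORE_UPDATE_SECONDS = 5  # 'delay_before_update_seconds' option


    class SimpleCertificatePlugin(cert.CertificatePluginBase):

        def issue_certificate_request(self, order_id, order_meta, plugin_meta):
            # Ask the worker to retry (check certificate status) after a
            # short, configurable delay.
            return cert.ResultDTO(
                cert.CertificateStatus.WAITING_FOR_CA,
                retry_msec=DELAY_BEFORE_UPDATE_SECONDS * 1000)

        def check_certificate_status(self, order_id, order_meta, plugin_meta):
            # On the follow-up check, report success so the order goes ACTIVE.
            return cert.ResultDTO(
                cert.CertificateStatus.CERTIFICATE_GENERATED)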
This blueprint proposes adding two entities to the data model: OrderRetryTask
and EntityLock.
The OrderRetryTask entity would manage which tasks need to be retried on which
entities, and would have the following attributes::
1) id: Primary key for this record
2) order_id: FK to the order record the retry task is intended for
3) retry_task: The RPC method to invoke for the retry. This method could be
a different method from the current one, for example to support
an SSL certificate plugin checking for certificate updates
after initiating the certificate process
4) retry_at: The timestamp at or after which to retry the task
5) retry_args: A list of args to send to the retry_task. This list includes
the entity ID, so no additional entity FK is needed in this record
6) retry_kwargs: A JSON-ified dict of the kwargs to send to retry_task
7) retry_count: A count of how many times this task has been retried
New retry records would be added for tasks that need to be retried in the
future, as determined by the plugin as part of workflow processing. The next
periodic task method invocation would then send this task to the queue for
another worker to implement later.
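For illustration, a rough SQLAlchemy sketch of this entity is shown below; the
column types and the use of Barbican's model base class are assumptions about
how it would slot into barbican.model.models::

    import sqlalchemy as sa

    from barbican.model import models


    class OrderRetryTask(models.BASE):
        """Tracks a worker task to be retried at (or after) retry_at."""

        # id/created_at/etc. are assumed to come from Barbican's ModelBase.
        __tablename__ = 'order_retry_tasks'

        order_id = sa.Column(sa.String(36), sa.ForeignKey('orders.id'),
                             nullable=False)
        retry_task = sa.Column(sa.Text, nullable=False)    # RPC method name
        retry_at = sa.Column(sa.DateTime, nullable=False)  # earliest retry time
        retry_args = sa.Column(sa.Text, nullable=False)    # JSON-ified args
        retry_kwargs = sa.Column(sa.Text, nullable=False)  # JSON-ified kwargs
        retry_count = sa.Column(sa.Integer, nullable=False, default=0)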
The EntityLock entity would manage which worker is allowed to delete from the
OrderRetryTask table, since per NF-1 above only one worker should be able to
delete from this table. This entity would have the following attributes::
1) entity_to_lock: The name of the entity to lock ('OrderRetryTask' here).
This would be a primary key.
2) worker_host_name: The host name of the worker that has the
OrderRetryTask entity 'locked'.
3) created_at: When this table was locked.
This table would only ever contain zero or one record. The periodic method
above would therefore execute the following pseudo code::
    Start SQLAlchemy session/transaction
    try:
        Attempt to insert a new record into the EntityLock table
        session.commit()
    except:
        session.rollback()
        Handle 'stuck' locks (see paragraph below)
        return
    try:
        Query for retry tasks
        Send retry tasks to the queue
        Remove enqueued retry tasks from OrderRetryTask table
        session.commit()
    except:
        session.rollback()
    finally:
        Remove record from EntityLock table
    Clear SQLAlchemy session/transaction
Lock tables can be problematic if the locking process crashes without removing
the locks. However, the overall time a worker holds a lock should be brief, so
the lock-acquisition rollback path above should check for and remove a stale
lock based on the lock's 'created_at' time.
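The following hedged SQLAlchemy sketch makes the lock handling above concrete;
the EntityLock declarative model, session usage, and the assumed default for
'retry_lock_timeout_seconds' are illustrative only::

    import datetime

    import sqlalchemy as sa
    from sqlalchemy import exc
    from sqlalchemy.ext import declarative

    Base = declarative.declarative_base()
    RETRY_LOCK_TIMEOUT_SECONDS = 120  # 'retry_lock_timeout_seconds' (assumed)


    class EntityLock(Base):
        """Single-row table recording which worker holds the retry-task lock."""
        __tablename__ = 'entity_locks'

        entity_to_lock = sa.Column(sa.String(255), primary_key=True)
        worker_host_name = sa.Column(sa.String(255), nullable=False)
        created_at = sa.Column(sa.DateTime, nullable=False)


    def acquire_retry_lock(session, host_name):
        """Try to insert the EntityLock row; False if another worker holds it."""
        try:
            session.add(EntityLock(entity_to_lock='OrderRetryTask',
                                   worker_host_name=host_name,
                                   created_at=datetime.datetime.utcnow()))
            session.commit()
            return True
        except exc.IntegrityError:
            # Another worker holds the lock; clear it only if it looks stale.
            session.rollback()
            stale_before = (datetime.datetime.utcnow() - datetime.timedelta(
                seconds=RETRY_LOCK_TIMEOUT_SECONDS))
            session.query(EntityLock).filter(
                EntityLock.created_at < stale_before).delete()
            session.commit()
            return False


    def release_retry_lock(session):
        """Remove the lock record once retry tasks have been enqueued."""
        session.query(EntityLock).filter_by(
            entity_to_lock='OrderRetryTask').delete()
        session.commit()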
To separate coding concerns, it makes sense to implement this process in a
separate Oslo 'service' server process, similar to the `Keystone listener
approach <https://github.com/openstack/barbican/blob/master/barbican/queue/keystone_listener.py#L130>`_.
This service would only run the Oslo periodic task method to perform the retry
updating process. If the method failed to operate, say due to another worker
locking the resource, it could simply return/exit. The next periodic call would
then start the process again.
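For illustration, such a standalone scheduler process might be wired together
roughly as follows, assuming oslo.service and the RetryScheduler sketch shown
earlier::

    from oslo_config import cfg
    from oslo_service import service


    class TaskSchedulerServer(service.Service):
        """Runs only the periodic retry-scheduling method, per this proposal."""

        def __init__(self, scheduler, interval=15):
            super(TaskSchedulerServer, self).__init__()
            self.scheduler = scheduler   # e.g. the RetryScheduler sketched above
            self.interval = interval     # 'schedule_period_seconds'

        def start(self):
            super(TaskSchedulerServer, self).start()
            # Fire the periodic task manager on a fixed timer; if another
            # worker holds the lock, the method simply returns and the next
            # tick tries again.
            self.tg.add_timer(self.interval,
                              self.scheduler.run_periodic_tasks,
                              context=None)


    def main():
        # bin/barbican-task-scheduler.py (proposed) would build the scheduler
        # and then block in the launcher until the process is stopped.
        scheduler = None  # e.g. RetryScheduler(cfg.CONF, queue_client, repo)
        launcher = service.launch(cfg.CONF, TaskSchedulerServer(scheduler))
        launcher.wait()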
Alternatives
------------
Rather than having each worker process manage retrying tasks, a separate node
could be designated to manage these retries. This would eliminate the need for
the EntityLock entity. However, this approach would require configuring yet
another node in the Barbican network, adding to deployment complexity. This
manager node would also be a single point of failure for managing retry tasks.
Data model impact
-----------------
As mentioned above, two new entities would be required. No migrations would be
needed.
REST API impact
---------------
None
Security impact
---------------
None
Notifications & Audit Impact
----------------------------
None
Other end user impact
---------------------
None
Performance Impact
------------------
The addition of a periodic task to identify tasks to be retried places extra
load on the worker nodes (assuming the scheduler processes are co-located with
the normal worker processes, as expected). However, this process does not
perform the retry work itself; it only issues tasks into the queue, which then
distributes them evenly back to the worker processes. Hence the additional load
on a given worker should be minimal.
This proposal includes utilizing locks to deal with concurrency concerns
across the multiple worker nodes that could be handling retry tasks. This can
result in two performance impacts: (1) multiple workers might contend for the
lock simultaneously, degrading performance for the workers that fail to grab
the lock, and (2) a lock could become 'stuck' if a worker holding the lock
crashes.
Regarding (1), locks are only utilized on the worker nodes involved in
processing asynchronous tasks, which are not time-sensitive. Also, the time the
lock is held will be very brief, just long enough to perform a query for
retry tasks and to send those tasks to the queue for follow-on processing. In
addition, the periodic process of each worker node handles these retry tasks,
so if the deployment of worker nodes is staggered the retry processes should
not conflict. Another option is to randomly dither the periodic interval (e.g.
30 seconds ± 5 seconds) so that worker nodes are less likely to conflict with
each other.
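A trivial illustration of such dithering (the base interval and jitter values
simply mirror the example above)::

    import random

    BASE_INTERVAL = 30   # seconds
    JITTER = 5           # seconds

    # Each worker computes its own schedule period once at startup, so the
    # periodic methods of different workers drift apart over time.
    schedule_period = BASE_INTERVAL + random.uniform(-JITTER, JITTER)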
Regarding concern (2) about 'stuck' locks: the conditions that involve locks
are either long-running orders, which can tolerate delays until locks are
restored, or (hopefully) rare conditions in which resources are unavailable, so
this condition should not be critical to resolve. The proposal does, however,
suggest a means to remove stuck locks based on their 'created_at' times.
Other deployer impact
---------------------
The Barbican configuration file will need a configuration parameter,
'schedule_period_seconds', that controls how often the retry-query process
runs, with a default value of 15 seconds. This parameter would be placed in a
new '[scheduler]' group.
A configuration parameter called 'retry_lock_timeout_seconds' would be used to
release 'stuck' locks on the retry tasks table, as described in the 'Proposed
Change' section above. This parameter would also be added to the '[scheduler]'
group.
A configuration parameter called 'delay_before_update_seconds' would be used to
configure the amount of time the SimpleCertificatePlugin delays from
initiating a demo certificate order to the time the update certificate method
is invoked. This parameter would be placed in a new '[simple_certificate]'
group.
These configurations would be applied and utilized once the revised code base
is deployed.
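A sketch of how these options might be registered with oslo.config is shown
below; the default for 'retry_lock_timeout_seconds' is an assumption, since
this section does not specify one::

    from oslo_config import cfg

    scheduler_opts = [
        cfg.IntOpt('schedule_period_seconds', default=15,
                   help='How often the retry-query periodic method runs.'),
        cfg.IntOpt('retry_lock_timeout_seconds', default=120,
                   help='Age after which a lock on the OrderRetryTask table '
                        'is considered stuck and may be removed.'),
    ]

    simple_certificate_opts = [
        cfg.IntOpt('delay_before_update_seconds', default=5,
                   help='Delay between initiating a demo certificate order '
                        'and invoking the update-certificate method.'),
    ]

    CONF = cfg.CONF
    CONF.register_opts(scheduler_opts, group='scheduler')
    CONF.register_opts(simple_certificate_opts, group='simple_certificate')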
Developer impact
----------------
None
Implementation
==============
Assignee(s)
-----------
Primary assignee:
john-wood-w
Other contributors:
Chelsea Winfree
Work Items
----------
1) Add data model entities and unit tests, for OrderRetryTask and EntityLock
2) Add logic to SimpleCertificatePlugin per the Proposed Change section above, to allow demonstration of the retry feature
3) Modify barbican.tasks.certificate_resources.py's _schedule_retry_task to add retry records into OrderRetryTask table
4) Add Oslo periodic task support
5) Implement periodic method, that performs the query for tasks that need to be retried
6) Implement workers sending retry RPC messages back to the queue (see the note below)
7) Add new scripts to launch the Oslo periodic task called bin/barbican-task-scheduler.py and .sh, similar to bin/barbican-keystone-listener.py and .sh
8) Add to the Barbican Devstack gate functional tests a test of the new retry feature via the SimpleCertificatePlugin logic added above
9) Add logic to handle expired locks on the OrderRetryTask table
Note that for #6, the 'queue' and 'tasks' packages have to be modified somewhat
to allow the server logic to send messages to the queue via the client logic,
mainly to break circular dependencies. See `here <https://github.com/cloudkeep/barbican/tree/verify-resource/barbican>`_
for a working example of this server/client/retry processing.
Dependencies
============
None
Testing
=======
In addition to planned unit testing, the functional Tempest-based tests in the
Barbican repository would be augmented to add a test of the new retry feature
for the default certificate plugin.
Documentation Impact
====================
Developer guides will need to be updated to include the additional periodic
retry process detailed above. Deployment guides will need to be updated to
specify that a new process needs to be executed (the
bin/barbican-task-scheduler.sh process).
References
==========
None