..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

===========================================
Add worker retry and future updates support
===========================================

Launchpad blueprint:

https://blueprints.launchpad.net/barbican/+spec/add-worker-retry-update-support

The Barbican worker processes need a means to retry failed yet recoverable
tasks (such as when remote systems are unavailable) and to handle updates
for long-running order processes such as certificate generation. This
blueprint defines the requirements for this retry and update processing, and
proposes an implementation to add this feature.

Problem Description
===================

Barbican manages asynchronous tasks, such as generating secrets, via
datastore tracking entities such as orders (currently the only tracking
entity in Barbican). These entities have a status field that tracks their
state, starting with PENDING for new entities, and moving to either ACTIVE
or ERROR states for successful or unsuccessful termination of the
asynchronous task, respectively.

Barbican worker processes implement these asynchronous tasks, as depicted on
this wiki page: https://github.com/cloudkeep/barbican/wiki/Architecture

As shown in the diagram, a typical deployment can include multiple worker
processes operating in parallel off a tasking queue. The queue invokes task
methods on the worker processes via RPC. In some cases, these invoked tasks
require the entity (e.g. an order) to stay PENDING, either to allow for
follow-on processing in the future or else to retry processing due to a
temporary blocking condition (e.g. a remote service is not available at this
time).

The following are requirements for retrying tasks in the future and thus
keeping the tracking entity in the PENDING state::

  R-1) Barbican needs to support extended workflow processes whereby an
       entity might be PENDING for a long time, requiring periodic status
       checks to see if the workflow is completed

  R-2) Barbican needs to support re-attempting an RPC task at some point in
       the future if dependent services are temporarily unavailable

Note that this blueprint does not handle concurrent updates made to the same
entity, say to perform a periodic status check on an order and also apply
client updates to that same order. This will be addressed in a future
blueprint.

Note also that this blueprint does not handle entities that are 'stuck' in
the PENDING state because of lost messages in the queue or workers that
crash while processing an entity. This will also be addressed in a future
blueprint.

In addition, the following non-functional requirements are needed in the
final implementation::

  NF-1) To keep entity state consistent, only one worker can work on an
        entity or manage retrying tasks at a time.

  NF-2) For resilience of the worker cluster:

        a) Any worker process (of a cluster of workers) should be able to
           handle retrying entities independently of other worker
           processes, even if these worker processes are intermittently
           available.

        b) If a worker comes back online after going down, it should be
           able to start processing retry tasks again, without needing to
           synchronize with other workers.

  NF-3) In the default standalone Barbican implementation, it should be
        possible to demonstrate the periodic status check feature via the
        SimpleCertificatePlugin class in
        barbican.plugin.simple_certificate_manager.py.

The following assumptions are made::

  A-1) Accurate retry times are not required:

       a) For example, if a task is to be retried in 5 minutes, it would be
          acceptable if the task was actually retried after more than 5
          minutes. For SSL certificate workflows, where some certificate
          types can take days to process, such retry delays would not be
          significant.

       b) Relaxed retry schedules allow for coarser-grained retry checking
          intervals, and allow for delays due to excessive tasks in queues
          during busy times.

       c) Retry times that are excessively delayed from those expected
          could indicate that worker nodes are overloaded. This blueprint
          does not address this issue, deferring to deployment monitoring
          and scaling processes.

Proposed Change
===============

This blueprint proposes that for requirements R-1 and R-2, the plugins used
by worker tasks (such as the certificate plugin) determine if tasks should
be retried and at what time in the future. If plugins determine that a task
should be retried, then these tasks will be scheduled for a future retry
attempt.

To implement this scheduling process, this blueprint proposes using the Oslo
periodic task feature, described here:

https://docs.openstack.org/developer/oslo-incubator/api/openstack.common.periodic_task.html

A working example implementation with an older code base is shown here:

https://github.com/cloudkeep/barbican/blob/verify-resource/barbican/queue/server.py#L174

Each worker node could then execute a periodic task service that invokes a
method on a scheduled basis (configurable, say every 15 seconds). This
method would then query which tasks need to be retried (say if the current
time >= the retry time), and for each one issue a retry task message to the
queue. Once tasks are enqueued, this method would remove the retry records
from the retry list. Eventually the queue would invoke workers to perform
these retry tasks.

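As a rough illustration only (the incubator module referenced above later
graduated into oslo.service, and the class, method, and option wiring below
are assumptions invented for this sketch), the periodic hook might look
like::

  from oslo_config import cfg
  from oslo_service import periodic_task

  CONF = cfg.CONF

  class RetrySchedulerTasks(periodic_task.PeriodicTasks):
      """Periodically looks for retry tasks that have come due."""

      # The spacing would come from the 'schedule_period_seconds' option
      # described in the 'Other deployer impact' section below.
      @periodic_task.periodic_task(spacing=15)
      def schedule_retries(self, context):
          # Query the OrderRetryTask table for due tasks, enqueue each as
          # an RPC message, then delete the enqueued records (see the
          # pseudo code later in this section).
          pass

  # A service process would drive the tasks on each timer tick:
  tasks = RetrySchedulerTasks(CONF)
  tasks.run_periodic_tasks(context=None)
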
To provide a means to evaluate the retry feature in standalone Barbican per
NF-3, the SimpleCertificatePlugin class in
barbican.plugin.simple_certificate_manager.py would be modified so that the
issue_certificate_request() method returns a retry time of 5 seconds
(configurable). The check_certificate_status() method would then return a
successful result to terminate the order in the ACTIVE state.

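A minimal sketch of that change follows. It assumes Barbican's certificate
plugin interface in barbican.plugin.interface.certificate_manager and treats
the ResultDTO retry fields as the new plumbing this blueprint would add;
only the two methods this blueprint touches are shown::

  from oslo_config import cfg

  from barbican.plugin.interface import certificate_manager as cert

  CONF = cfg.CONF
  CONF.register_opts(
      [cfg.IntOpt('delay_before_update_seconds', default=5)],
      group='simple_certificate')

  class SimpleCertificatePlugin(cert.CertificatePluginBase):

      def issue_certificate_request(self, order_id, order_meta, plugin_meta):
          # Ask the worker to check back after the configured delay,
          # keeping the order PENDING in the meantime.
          delay = CONF.simple_certificate.delay_before_update_seconds
          return cert.ResultDTO(cert.CertificateStatus.WAITING_FOR_CA,
                                retry_msec=delay * 1000)

      def check_certificate_status(self, order_id, order_meta, plugin_meta):
          # Second pass: report success so the order terminates as ACTIVE.
          return cert.ResultDTO(cert.CertificateStatus.CERTIFICATE_GENERATED)
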
This blueprint proposes adding two entities to the data model: OrderRetryTask
and EntityLock.

The OrderRetryTask entity would manage which tasks need to be retried on
which entities, and would have the following attributes::

  1) id: Primary key for this record

  2) order_id: FK to the order record the retry task is intended for

  3) retry_task: The RPC method to invoke for the retry. This method could
     be a different method than the current one, such as to support an SSL
     certificate plugin checking for certificate updates after initiating
     the certificate process

  4) retry_at: The timestamp at or after which to retry the task

  5) retry_args: A list of args to send to the retry_task. This list
     includes the entity ID, so there is no need for an entity FK in this
     entity

  6) retry_kwargs: A JSON-ified dict of the kwargs to send to retry_task

  7) retry_count: A count of how many times this task has been retried

New retry records would be added for tasks that need to be retried in the
future, as determined by the plugin as part of workflow processing. The next
periodic task method invocation would then send this task to the queue for
another worker to implement later.

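As an illustration only, the record-creation step might look as follows,
assuming a hypothetical plugin result object carrying the retry delay (per
the ResultDTO sketch earlier) and the OrderRetryTask model sketched further
below::

  import datetime
  import json

  from oslo_utils import timeutils

  def _schedule_retry_task(session, order, result):
      # 'result' is the plugin's response; 'session' is the SQLAlchemy
      # session for this task.
      retry = OrderRetryTask(
          order_id=order.id,
          retry_task='check_certificate_status',
          retry_at=timeutils.utcnow() + datetime.timedelta(
              milliseconds=result.retry_msec),
          retry_args=json.dumps([order.id]),
          retry_kwargs=json.dumps({}),
          retry_count=0)
      session.add(retry)
      session.commit()
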
The EntityLock entity would manage which worker is allowed to delete from
the OrderRetryTask table, since per NF-1 above only one worker should be
able to delete from this table. This entity would have the following
attributes::

  1) entity_to_lock: The name of the entity to lock ('OrderRetryTask'
     here). This would be a primary key.

  2) worker_host_name: The host name of the worker that has the
     OrderRetryTask entity 'locked'.

  3) created_at: When this table was locked.

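A sketch of the two entities as SQLAlchemy models follows; the column types
and table names are assumptions, and real Barbican models would derive from
its own declarative base::

  import sqlalchemy as sa
  from sqlalchemy.ext import declarative

  BASE = declarative.declarative_base()

  class OrderRetryTask(BASE):
      __tablename__ = 'order_retry_tasks'

      id = sa.Column(sa.String(36), primary_key=True)
      order_id = sa.Column(sa.String(36), sa.ForeignKey('orders.id'),
                           nullable=False)
      retry_task = sa.Column(sa.Text, nullable=False)    # RPC method name
      retry_at = sa.Column(sa.DateTime, nullable=False)  # earliest retry
      retry_args = sa.Column(sa.Text, nullable=False)    # JSON-ified list
      retry_kwargs = sa.Column(sa.Text, nullable=False)  # JSON-ified dict
      retry_count = sa.Column(sa.Integer, nullable=False, default=0)

  class EntityLock(BASE):
      __tablename__ = 'entity_locks'

      # The primary key means a second INSERT for the same entity name
      # fails, which is what makes lock acquisition atomic.
      entity_to_lock = sa.Column(sa.String(255), primary_key=True)
      worker_host_name = sa.Column(sa.String(255), nullable=False)
      created_at = sa.Column(sa.DateTime, nullable=False)
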
This entity would only have zero or one records. So the periodic method
above would execute the following pseudo code::

  Start SQLAlchemy session/transaction
  try:
      Attempt to insert a new record into the EntityLock table
      session.commit()
  except:
      session.rollback()
      Handle 'stuck' locks (see paragraph below)
      return

  try:
      Query for retry tasks
      Send retry tasks to the queue
      Remove enqueued retry tasks from OrderRetryTask table
      session.commit()
  except:
      session.rollback()
  finally:
      Remove record from EntityLock table
      Clear SQLAlchemy session/transaction

Lock tables can be problematic if the locking process crashes without
removing the locks. However, the overall time a worker holds a lock should
be brief, so the lock-attempt rollback path above should check for and
remove a stale lock based on the 'created_at' time on the lock.

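Combining the pseudo code with the stale-lock check, the periodic method
might be implemented roughly as follows. This is a sketch only: queue_client
stands in for Barbican's RPC client, lock_timeout maps to the
'retry_lock_timeout_seconds' option described later, and the models are
those sketched above::

  import datetime
  import socket

  from oslo_utils import timeutils
  from sqlalchemy import exc as sa_exc

  def _run_retry_scheduling(session, queue_client, lock_timeout):
      now = timeutils.utcnow()
      try:
          # Atomic lock grab: the INSERT fails if a lock row exists.
          session.add(EntityLock(entity_to_lock='OrderRetryTask',
                                 worker_host_name=socket.gethostname(),
                                 created_at=now))
          session.commit()
      except sa_exc.IntegrityError:
          session.rollback()
          # Another worker holds the lock; break it only if it is stale.
          stale = now - datetime.timedelta(seconds=lock_timeout)
          session.query(EntityLock).filter(
              EntityLock.created_at < stale).delete()
          session.commit()
          return

      try:
          due_tasks = session.query(OrderRetryTask).filter(
              OrderRetryTask.retry_at <= now).all()
          for task in due_tasks:
              # Hypothetical RPC client call to re-enqueue the task.
              queue_client.cast(task.retry_task, task.retry_args,
                                task.retry_kwargs)
              session.delete(task)
          session.commit()
      except Exception:
          session.rollback()
      finally:
          session.query(EntityLock).filter_by(
              entity_to_lock='OrderRetryTask').delete()
          session.commit()
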
To separate coding concerns, it makes sense to implement this process in a
separate Oslo 'service' server process, similar to the `Keystone listener
approach <https://github.com/openstack/barbican/blob/master/barbican/queue/keystone_listener.py#L130>`_.
This service would only run the Oslo periodic task method, to perform the
retry updating process. If the method failed to operate, say due to another
worker locking the resource, it could just return/exit. The next periodic
call would then start the process again.

Alternatives
------------

Rather than having each worker process manage retrying tasks, a separate
node could be designated to manage these retries. This would eliminate the
need for the EntityLock entity. However, this approach would require
configuring yet another node in the Barbican network, adding to deployment
complexity. This manager node would also be a single point of failure for
managing retry tasks.

Data model impact
-----------------

As mentioned above, two new entities would be required. No migrations would
be needed.

REST API impact
---------------

None

Security impact
---------------

None

Notifications & Audit Impact
----------------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

The addition of a periodic task to identify tasks to be retried presents an
extra load on the worker nodes (assuming they are co-located with the normal
worker processes, as expected). However, this process does not perform the
retry work itself, but rather issues tasks into the queue to then be evenly
distributed back to the worker processes. Hence the additional load on a
given worker should be minimal.

This proposal includes utilizing locks to deal with concurrency concerns
across the multiple worker nodes that could be handling retry tasks. This
can result in two performance impacts: (1) multiple workers might fight to
grab the lock simultaneously, leading to degraded performance for the
workers that fail to grab the lock, and (2) a lock could become 'stuck' if a
worker holding the lock crashes.

Regarding (1), locks are only utilized on the worker nodes involved in
processing asynchronous tasks, which are not time sensitive. Also, the time
the lock is held will be very brief, just long enough to perform a query for
retry tasks and to send those tasks to the queue for follow-on processing.
In addition, the periodic process of each worker node handles these retry
tasks, so if the deployment of worker nodes is staggered the retry processes
should not conflict. Another option is to randomly dither the periodic
interval (e.g. 30 seconds +/- 5 seconds) so that worker nodes are less
likely to conflict with each other.

Regarding concern (2) about 'stuck' locks: the conditions that involve locks
are either long-running orders that can tolerate delays until locks are
restored, or else (hopefully) rare conditions when resources aren't
available, so this condition should not be critical to resolve. The proposal
does however suggest a means to remove stuck locks utilizing their
'created_at' times.

Other deployer impact
---------------------

The Barbican configuration file will need a configuration parameter that
controls how often the retry-query process runs, called
'schedule_period_seconds', with a default value of 15 seconds. This
parameter would be placed in a new '[scheduler]' group.

A configuration parameter called 'retry_lock_timeout_seconds' would be used
to release 'stuck' locks on the retry tasks table, as described in the
'Proposed Change' section above. This parameter would also be added to the
'[scheduler]' group.

A configuration parameter called 'delay_before_update_seconds' would be used
to configure the amount of time the SimpleCertificatePlugin delays from
initiating a demo certificate order to the time the update certificate
method is invoked. This parameter would be placed in a new
'[simple_certificate]' group.

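Taken together, a deployment's configuration file might gain entries along
these lines (the 'retry_lock_timeout_seconds' value shown is illustrative,
as this spec does not fix its default)::

  [scheduler]
  schedule_period_seconds = 15
  retry_lock_timeout_seconds = 120

  [simple_certificate]
  delay_before_update_seconds = 5
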
These configurations would be applied and utilized once the revised code
base is deployed.

Developer impact
----------------

None

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  john-wood-w

Other contributors:
  Chelsea Winfree

Work Items
----------

1) Add data model entities and unit tests, for OrderRetryTask and EntityLock

2) Add logic to SimpleCertificatePlugin per the 'Proposed Change' section,
   to allow demonstration of the retry feature

3) Modify barbican.tasks.certificate_resources.py's _schedule_retry_task to
   add retry records into the OrderRetryTask table

4) Add Oslo periodic task support

5) Implement the periodic method that performs the query for tasks that
   need to be retried

6) Implement workers sending retry RPC messages back to the queue (see the
   note below)

7) Add new scripts to launch the Oslo periodic task, called
   bin/barbican-task-scheduler.py and .sh, similar to
   bin/barbican-keystone-listener.py and .sh

8) Add to the Barbican Devstack gate functional tests a test of the new
   retry feature via the SimpleCertificatePlugin logic added above

9) Add logic to handle expired locks on the OrderRetryTask table

Note that for #6, the 'queue' and 'tasks' packages have to be modified
somewhat to allow the server logic to send messages to the queue via the
client logic, mainly to break circular dependencies. Again, see
`this branch <https://github.com/cloudkeep/barbican/tree/verify-resource/barbican>`_
for a working example of this server/client/retry processing.

Dependencies
============

None

Testing
=======

In addition to planned unit testing, the functional Tempest-based tests in
the Barbican repository would be augmented to add a test of the new retry
feature for the default certificate plugin.

Documentation Impact
====================

Developer guides will need to be updated to include the additional periodic
retry process detailed above. Deployment guides will need to be updated to
specify that a new process needs to be executed (the
bin/barbican-task-scheduler.sh process).

References
==========

None