..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

===========================================
Add worker retry and future updates support
===========================================

Launchpad blueprint:

https://blueprints.launchpad.net/barbican/+spec/add-worker-retry-update-support

The Barbican worker processes need a means to retry failed yet recoverable
tasks (such as when remote systems are unavailable) and to handle updates
for long-running order processes such as certificate generation. This
blueprint defines the requirements for this retry and update processing, and
proposes an implementation to add this feature.

Problem Description
===================

Barbican manages asynchronous tasks, such as generating secrets, via
datastore tracking entities such as orders (currently the only tracking
entity in Barbican). These entities have a status field that tracks their
state, starting with PENDING for new entities, and moving to either ACTIVE
or ERROR states for successful or unsuccessful termination of the
asynchronous task, respectively.

Barbican worker processes implement these asynchronous tasks, as depicted on
this wiki page: https://github.com/cloudkeep/barbican/wiki/Architecture

As shown in the diagram, a typical deployment can include multiple worker
processes operating in parallel off a tasking queue. The queue invokes task
methods on the worker processes via RPC. In some cases, these invoked tasks
require the entity (e.g. an order) to stay PENDING, either to allow for
follow-on processing in the future or else to retry processing due to a
temporary blocking condition (e.g. a remote service is not available at this
time).

The following are requirements for retrying tasks in the future and thus
keeping the tracking entity in the PENDING state::

  R-1) Barbican needs to support extended workflow processes whereby an
       entity might be PENDING for a long time, requiring periodic status
       checks to see if the workflow is completed

  R-2) Barbican needs to support re-attempting an RPC task at some point in
       the future if dependent services are temporarily unavailable

Note that this blueprint does not handle concurrent updates made to the same
entity, say to perform a periodic status check on an order and also apply
client updates to that same order. This will be addressed in a future
blueprint.

Note also that this blueprint does not handle entities that are 'stuck' in
the PENDING state because of lost messages in the queue or workers that
crash while processing an entity. This will also be addressed in a future
blueprint.

In addition, the following non-functional requirements are needed in the
final implementation::

  NF-1) To keep entity state consistent, only one worker can work on an
        entity or manage retrying tasks at a time.

  NF-2) For resilience of the worker cluster:

        a) Any worker process (of a cluster of workers) should be able to
           handle retrying entities independently of other worker
           processes, even if these worker processes are intermittently
           available.

        b) If a worker comes back online after going down, it should be
           able to start processing retry tasks again, without needing to
           synchronize with other workers.

  NF-3) In the default standalone Barbican implementation, it should be
        possible to demonstrate the periodic status check feature via the
        SimpleCertificatePlugin class in
        barbican.plugin.simple_certificate_manager.py.

The following assumptions are made::

  A-1) Accurate retry times are not required:

       a) For example, if a task is to be retried in 5 minutes, it would be
          acceptable if the task was actually retried after more than 5
          minutes. For SSL certificate workflows, where some certificate
          types can take days to process, such retry delays would not be
          significant.

       b) Relaxed retry schedules allow for coarser-grained retry checking
          intervals, and allow for delays due to excessive tasks in queues
          during busy times.

       c) Retry times that are excessively delayed from those expected
          could indicate that worker nodes are overloaded. This blueprint
          does not address this issue, deferring to deployment monitoring
          and scaling processes.

Proposed Change
===============

This blueprint proposes that for requirements R-1 and R-2, the plugins used
by worker tasks (such as the certificate plugin) determine if tasks should
be retried and at what time in the future. If plugins determine that a task
should be retried, then these tasks will be scheduled for a future retry
attempt.

To implement this scheduling process, this blueprint proposes using the Oslo
periodic task feature, described here:

https://docs.openstack.org/developer/oslo-incubator/api/openstack.common.periodic_task.html

A working example implementation with an older code base is shown here:

https://github.com/cloudkeep/barbican/blob/verify-resource/barbican/queue/server.py#L174

Each worker node could then execute a periodic task service that invokes a
method on a scheduled basis (configurable, say every 15 seconds). This
method would then query which tasks need to be retried (say if the current
time >= the retry time), and for each one issue a retry task message to the
queue. Once tasks are enqueued, this method would remove the retry records
from the retry list. Eventually the queue would invoke workers to perform
these retry tasks.

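As a rough illustration only (the incubator module referenced above later
graduated into oslo.service, and the class, method, and option wiring below
are assumptions invented for this sketch), the periodic hook might look
like::

  from oslo_config import cfg
  from oslo_service import periodic_task

  CONF = cfg.CONF

  class RetrySchedulerTasks(periodic_task.PeriodicTasks):
      """Periodically looks for retry tasks that have come due."""

      # The spacing would come from the 'schedule_period_seconds' option
      # described in the 'Other deployer impact' section below.
      @periodic_task.periodic_task(spacing=15)
      def schedule_retries(self, context):
          # Query the OrderRetryTask table for due tasks, enqueue each as
          # an RPC message, then delete the enqueued records (see the
          # pseudo code later in this section).
          pass

  # A service process would drive the tasks on each timer tick:
  tasks = RetrySchedulerTasks(CONF)
  tasks.run_periodic_tasks(context=None)
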
To provide a means to evaluate the retry feature in standalone Barbican per
NF-3, the SimpleCertificatePlugin class in
barbican.plugin.simple_certificate_manager.py would be modified so that the
issue_certificate_request() method returns a retry time of 5 seconds
(configurable). The check_certificate_status() method would then return a
successful result to terminate the order in the ACTIVE state.

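A minimal sketch of that change follows. It assumes Barbican's certificate
plugin interface in barbican.plugin.interface.certificate_manager and treats
the ResultDTO retry fields as the new plumbing this blueprint would add;
only the two methods this blueprint touches are shown::

  from oslo_config import cfg

  from barbican.plugin.interface import certificate_manager as cert

  CONF = cfg.CONF
  CONF.register_opts(
      [cfg.IntOpt('delay_before_update_seconds', default=5)],
      group='simple_certificate')

  class SimpleCertificatePlugin(cert.CertificatePluginBase):

      def issue_certificate_request(self, order_id, order_meta, plugin_meta):
          # Ask the worker to check back after the configured delay,
          # keeping the order PENDING in the meantime.
          delay = CONF.simple_certificate.delay_before_update_seconds
          return cert.ResultDTO(cert.CertificateStatus.WAITING_FOR_CA,
                                retry_msec=delay * 1000)

      def check_certificate_status(self, order_id, order_meta, plugin_meta):
          # Second pass: report success so the order terminates as ACTIVE.
          return cert.ResultDTO(cert.CertificateStatus.CERTIFICATE_GENERATED)
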
This blueprint proposes adding two entities to the data model: OrderRetryTask
and EntityLock.

The OrderRetryTask entity would manage which tasks need to be retried on
which entities, and would have the following attributes::

  1) id: Primary key for this record

  2) order_id: FK to the order record the retry task is intended for

  3) retry_task: The RPC method to invoke for the retry. This method could
     be a different method than the current one, such as to support an SSL
     certificate plugin checking for certificate updates after initiating
     the certificate process

  4) retry_at: The timestamp at or after which to retry the task

  5) retry_args: A list of args to send to the retry_task. This list
     includes the entity ID, so there is no need for an entity FK in this
     entity

  6) retry_kwargs: A JSON-ified dict of the kwargs to send to retry_task

  7) retry_count: A count of how many times this task has been retried

New retry records would be added for tasks that need to be retried in the
future, as determined by the plugin as part of workflow processing. The next
periodic task method invocation would then send this task to the queue for
another worker to implement later.

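As an illustration only, the record-creation step might look as follows,
assuming a hypothetical plugin result object carrying the retry delay (per
the ResultDTO sketch earlier) and the OrderRetryTask model sketched further
below::

  import datetime
  import json

  from oslo_utils import timeutils

  def _schedule_retry_task(session, order, result):
      # 'result' is the plugin's response; 'session' is the SQLAlchemy
      # session for this task.
      retry = OrderRetryTask(
          order_id=order.id,
          retry_task='check_certificate_status',
          retry_at=timeutils.utcnow() + datetime.timedelta(
              milliseconds=result.retry_msec),
          retry_args=json.dumps([order.id]),
          retry_kwargs=json.dumps({}),
          retry_count=0)
      session.add(retry)
      session.commit()
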
The EntityLock entity would manage which worker is allowed to delete from
the OrderRetryTask table, since per NF-1 above only one worker should be
able to delete from this table. This entity would have the following
attributes::

  1) entity_to_lock: The name of the entity to lock ('OrderRetryTask'
     here). This would be a primary key.

  2) worker_host_name: The host name of the worker that has the
     OrderRetryTask entity 'locked'.

  3) created_at: When this table was locked.

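A sketch of the two entities as SQLAlchemy models follows; the column types
and table names are assumptions, and real Barbican models would derive from
its own declarative base::

  import sqlalchemy as sa
  from sqlalchemy.ext import declarative

  BASE = declarative.declarative_base()

  class OrderRetryTask(BASE):
      __tablename__ = 'order_retry_tasks'

      id = sa.Column(sa.String(36), primary_key=True)
      order_id = sa.Column(sa.String(36), sa.ForeignKey('orders.id'),
                           nullable=False)
      retry_task = sa.Column(sa.Text, nullable=False)    # RPC method name
      retry_at = sa.Column(sa.DateTime, nullable=False)  # earliest retry
      retry_args = sa.Column(sa.Text, nullable=False)    # JSON-ified list
      retry_kwargs = sa.Column(sa.Text, nullable=False)  # JSON-ified dict
      retry_count = sa.Column(sa.Integer, nullable=False, default=0)

  class EntityLock(BASE):
      __tablename__ = 'entity_locks'

      # The primary key means a second INSERT for the same entity name
      # fails, which is what makes lock acquisition atomic.
      entity_to_lock = sa.Column(sa.String(255), primary_key=True)
      worker_host_name = sa.Column(sa.String(255), nullable=False)
      created_at = sa.Column(sa.DateTime, nullable=False)
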
This entity would only have zero or one records. So the periodic method
above would execute the following pseudo code::

  Start SQLAlchemy session/transaction
  try:
      Attempt to insert a new record into the EntityLock table
      session.commit()
  except:
      session.rollback()
      Handle 'stuck' locks (see paragraph below)
      return

  try:
      Query for retry tasks
      Send retry tasks to the queue
      Remove enqueued retry tasks from OrderRetryTask table
      session.commit()
  except:
      session.rollback()
  finally:
      Remove record from EntityLock table
      Clear SQLAlchemy session/transaction

Lock tables can be problematic if the locking process crashes without
removing the locks. However, the overall time a worker holds a lock should
be brief, so the lock-attempt rollback path above should check for and
remove a stale lock based on the 'created_at' time on the lock.

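Combining the pseudo code with the stale-lock check, the periodic method
might be implemented roughly as follows. This is a sketch only: queue_client
stands in for Barbican's RPC client, lock_timeout maps to the
'retry_lock_timeout_seconds' option described later, and the models are
those sketched above::

  import datetime
  import socket

  from oslo_utils import timeutils
  from sqlalchemy import exc as sa_exc

  def _run_retry_scheduling(session, queue_client, lock_timeout):
      now = timeutils.utcnow()
      try:
          # Atomic lock grab: the INSERT fails if a lock row exists.
          session.add(EntityLock(entity_to_lock='OrderRetryTask',
                                 worker_host_name=socket.gethostname(),
                                 created_at=now))
          session.commit()
      except sa_exc.IntegrityError:
          session.rollback()
          # Another worker holds the lock; break it only if it is stale.
          stale = now - datetime.timedelta(seconds=lock_timeout)
          session.query(EntityLock).filter(
              EntityLock.created_at < stale).delete()
          session.commit()
          return

      try:
          due_tasks = session.query(OrderRetryTask).filter(
              OrderRetryTask.retry_at <= now).all()
          for task in due_tasks:
              # Hypothetical RPC client call to re-enqueue the task.
              queue_client.cast(task.retry_task, task.retry_args,
                                task.retry_kwargs)
              session.delete(task)
          session.commit()
      except Exception:
          session.rollback()
      finally:
          session.query(EntityLock).filter_by(
              entity_to_lock='OrderRetryTask').delete()
          session.commit()
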
To separate coding concerns, it makes sense to implement this process in a
separate Oslo 'service' server process, similar to the `Keystone listener
approach <https://github.com/openstack/barbican/blob/master/barbican/queue/keystone_listener.py#L130>`_.
This service would only run the Oslo periodic task method, to perform the
retry updating process. If the method failed to operate, say due to another
worker locking the resource, it could just return/exit. The next periodic
call would then start the process again.

Alternatives
------------

Rather than having each worker process manage retrying tasks, a separate
node could be designated to manage these retries. This would eliminate the
need for the EntityLock entity. However, this approach would require
configuring yet another node in the Barbican network, adding to deployment
complexity. This manager node would also be a single point of failure for
managing retry tasks.

Data model impact
-----------------

As mentioned above, two new entities would be required. No migrations would
be needed.

REST API impact
---------------

None

Security impact
---------------

None

Notifications & Audit Impact
----------------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

The addition of a periodic task to identify tasks to be retried presents an
extra load on the worker nodes (assuming they are co-located with the normal
worker processes, as expected). However, this process does not perform the
retry work itself, but rather issues tasks into the queue to then be evenly
distributed back to the worker processes. Hence the additional load on a
given worker should be minimal.

This proposal includes utilizing locks to deal with concurrency concerns
across the multiple worker nodes that could be handling retry tasks. This
can result in two performance impacts: (1) multiple workers might fight to
grab the lock simultaneously, leading to degraded performance for the
workers that fail to grab the lock, and (2) a lock could become 'stuck' if a
worker holding the lock crashes.

Regarding (1), locks are only utilized on the worker nodes involved in
processing asynchronous tasks, which are not time sensitive. Also, the time
the lock is held will be very brief, just long enough to perform a query for
retry tasks and to send those tasks to the queue for follow-on processing.
In addition, the periodic process of each worker node handles these retry
tasks, so if the deployment of worker nodes is staggered the retry processes
should not conflict. Another option is to randomly dither the periodic
interval (e.g. 30 seconds +/- 5 seconds) so that worker nodes are less
likely to conflict with each other.

Regarding concern (2) about 'stuck' locks: the conditions that involve locks
are either long-running orders that can tolerate delays until locks are
restored, or else (hopefully) rare conditions when resources aren't
available, so this condition should not be critical to resolve. The proposal
does however suggest a means to remove stuck locks utilizing their
'created_at' times.

Other deployer impact
---------------------

The Barbican configuration file will need a configuration parameter that
controls how often the retry-query process runs, called
'schedule_period_seconds', with a default value of 15 seconds. This
parameter would be placed in a new '[scheduler]' group.

A configuration parameter called 'retry_lock_timeout_seconds' would be used
to release 'stuck' locks on the retry tasks table, as described in the
'Proposed Change' section above. This parameter would also be added to the
'[scheduler]' group.

A configuration parameter called 'delay_before_update_seconds' would be used
to configure the amount of time the SimpleCertificatePlugin delays from
initiating a demo certificate order to the time the update certificate
method is invoked. This parameter would be placed in a new
'[simple_certificate]' group.

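Taken together, a deployment's configuration file might gain entries along
these lines (the 'retry_lock_timeout_seconds' value shown is illustrative,
as this spec does not fix its default)::

  [scheduler]
  schedule_period_seconds = 15
  retry_lock_timeout_seconds = 120

  [simple_certificate]
  delay_before_update_seconds = 5
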
These configurations would be applied and utilized once the revised code
base is deployed.

Developer impact
----------------

None

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  john-wood-w

Other contributors:
  Chelsea Winfree

Work Items
----------

1) Add data model entities and unit tests, for OrderRetryTask and EntityLock

2) Add logic to SimpleCertificatePlugin per the 'Proposed Change' section,
   to allow demonstration of the retry feature

3) Modify barbican.tasks.certificate_resources.py's _schedule_retry_task to
   add retry records into the OrderRetryTask table

4) Add Oslo periodic task support

5) Implement the periodic method that performs the query for tasks that
   need to be retried

6) Implement workers sending retry RPC messages back to the queue (see the
   note below)

7) Add new scripts to launch the Oslo periodic task, called
   bin/barbican-task-scheduler.py and .sh, similar to
   bin/barbican-keystone-listener.py and .sh

8) Add to the Barbican Devstack gate functional tests a test of the new
   retry feature via the SimpleCertificatePlugin logic added above

9) Add logic to handle expired locks on the OrderRetryTask table

Note that for #6, the 'queue' and 'tasks' packages have to be modified
somewhat to allow the server logic to send messages to the queue via the
client logic, mainly to break circular dependencies. Again, see
`this branch <https://github.com/cloudkeep/barbican/tree/verify-resource/barbican>`_
for a working example of this server/client/retry processing.

Dependencies
============

None

Testing
=======

In addition to planned unit testing, the functional Tempest-based tests in
the Barbican repository would be augmented to add a test of the new retry
feature for the default certificate plugin.

Documentation Impact
====================

Developer guides will need to be updated to include the additional periodic
retry process detailed above. Deployment guides will need to be updated to
specify that a new process needs to be executed (the
bin/barbican-task-scheduler.sh process).

References
==========

None