oslo-specs/specs/juno/taskflow-redis-jobs.rst

Redis backed jobs and boards

Include the URL of your launchpad blueprint:

https://blueprints.launchpad.net/taskflow/+spec/redis-jobboard

Taskflow currently has just one jobboard implementation, which is based on zookeeper. Providing a job posting and consumption mechanism based on at least one other capable system would be advantageous from a design perspective (having more than one implementation usually helps ensure that the design is correct) and from a user perspective (not everyone wants to, or can, run zookeeper). Redis is one of the next best potential implementations; it does have issues and is not ideal, but it will likely be acceptable.

Problem description

Taskflow provides a job (and associated jobboard) mechanism for submitting tasks and flows to be worked on as jobs (the posting and work creation process) and for ensuring that those jobs are executed in a manner that is reliable, atomic and scalable (the consumption process). This mechanism transfers work from producers to capable consumers (typically conductors) in a reliable and inherently fault-tolerant manner.

Currently the implementation that exists requires zookeeper to provide the primitives that are used to implement the features that form the basis of the job and jobboard API.

At a high-level this is done via the following zookeeper primitives:

  • Workflow/job postings; publishing of non-ephemeral nodes to a given zookeeper directory by some set of producers.
  • Atomic ownership; implemented by acquisition of zookeeper ephemeral nodes (these nodes act as distributed locks) by some set of workers (those workers can be selective in what work they attempt to take ownership of).
  • Automatic ownership release; loss of previously gained ephemeral nodes, which happens automatically when the client heartbeat is lost (zookeeper will also notify others, via watches, that this ephemeral node has been destroyed, which makes it very easy for other workers to then attempt to acquire & finish that newly lost/abandoned work).

The issue is that there is currently only one implementation and that implementation has the following (supposed) drawbacks:

  • Requires zookeeper, plus the supporting infrastructure and brainpower needed to maintain and run java, zookeeper and the surrounding ecosystem. This makes certain people sad (java -Xmx -Xms ... apparently doesn't make them feel so happy about life).
  • Can be complicated to set up (due to the previously stated expertise) and maintain, which can be painful for new developers (those without a zookeeper setup) and new operators (or those that just don't want to run java).

To gain most of the above features using redis we need to flesh out how to do so while avoiding some of the landmines that come with implementing those primitives in redis (for example, atomic ownership & release will be more problematic in redis, since it lacks the built-in primitives and support that release owned items when a client disconnects or stops sending the appropriate heartbeat).

Proposed change

Add a redis backed jobboard mechanism using tooz to implement the base primitives required (using existing or new tooz concepts/abstractions). Use these primitives, with tooz + redis as the backing implementation, to mirror as much as possible of the previously described functionality that currently exists for jobs and jobboards.

This will knowingly have the following problems:

  • Redis does not support client heartbeats; this will need to be done using timeouts and selected client keys (?), along with associated job recovery that is performed when a job's owner dies or is lost. This is added complexity that zookeeper or raft come with built-in, and is a problem/debt that will be incurred with this solution.
  • Redis lacks a multi-master strategy (this is getting better with redis clustering in 3.0, but that feature does not yet exist as a recommended production ready solution). It is also unknown how this clustering strategy will really work under partitions and high load.
    • Without this feature being production ready, the redis server that clients (job workers and job posters) connect to will be of limited use at large scale unless client side static partitioning is involved (dynamic partitioning means the key-space will be split across many servers, which can lead to inconsistencies during network partitions or server loss...). Until that feature is proven out and known to be production ready, I would personally rather not recommend it.
  • Redis does not support a concept of watches, which the current zookeeper backed implementation uses to allow clients (aka workers or conductors) to asynchronously (without polling) become aware of new jobs appearing (and disappearing). This will need to be worked around by polling, or by using the pub/sub capabilities of redis to trigger workers to react to jobs appearing or disappearing.
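
The timeout based workaround for the missing client heartbeats could look roughly like the minimal sketch below. It is an illustration under stated assumptions, not the proposed implementation: `FakeRedis`, `Worker`, the `owner:` key layout and the lease length are all hypothetical names, and the in-memory store only imitates the behaviour of a redis `SET` with the `NX` option and an expiry, so that the example runs without a server.

```python
import time
import uuid


class FakeRedis:
    """In-memory stand-in imitating only the redis semantics used here.

    Hypothetical helper for illustration; a real backend would talk to a
    redis server (for example through the redis python client or tooz).
    """

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at)

    def _live(self, key):
        # Return the entry for key, dropping it first if its TTL has passed.
        entry = self._data.get(key)
        if entry is not None and entry[1] <= self._clock():
            del self._data[key]
            entry = None
        return entry

    def set_if_absent(self, key, value, ttl):
        """Imitate SET with NX and an expiry: succeed only if key is unset."""
        if self._live(key) is not None:
            return False
        self._data[key] = (value, self._clock() + ttl)
        return True

    def refresh_if_owner(self, key, value, ttl):
        """Extend the TTL only while ``value`` still owns the key.

        Against a real server this compare-and-extend would have to be
        made atomic (an assumption here: for example via a server-side
        script); in this sketch it is a single Python call.
        """
        entry = self._live(key)
        if entry is None or entry[0] != value:
            return False
        self._data[key] = (value, self._clock() + ttl)
        return True


class Worker:
    """A job consumer that owns jobs through expiring lease keys."""

    def __init__(self, store, lease_seconds=5.0):
        self.store = store
        self.lease_seconds = lease_seconds
        self.token = uuid.uuid4().hex  # unique identity of this worker

    def claim(self, job_id):
        # Atomic ownership: only one worker can create the owner key.
        return self.store.set_if_absent(
            "owner:" + job_id, self.token, self.lease_seconds)

    def heartbeat(self, job_id):
        # Client driven heartbeat: keep the lease alive while working.
        return self.store.refresh_if_owner(
            "owner:" + job_id, self.token, self.lease_seconds)
```

A worker that stops calling `heartbeat()` loses ownership once its lease lapses, at which point another worker's `claim()` succeeds; this approximates the automatic release that zookeeper ephemeral nodes provide, at the cost of a recovery delay of up to one lease period.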

Alternatives

A few alternatives are possible; from reviewing the current state of the python world, they appear to be:

  • RQ; this project does nearly the same thing that is described above, and even uses redis internally. It could be a potential alternative, although from looking at the source code it has a few inherent flaws (such as using pickle to serialize the worker function to be executed). It is also not integrated into taskflow, but it could serve as a reference, or be worked into taskflow and simply used (assuming the pickle usage is fixed/removed...). We should likely consider this pretty strongly as a way to make this work (and just help improve the RQ library?).
  • Raft; provide a comparable implementation to the existing zookeeper based one, but back it by a raft client and require individuals and operators who want to take advantage of this feature to set up a raft based cluster and associated quorum. This is desirable and likely one of the best alternatives in terms of feature parity and capabilities, since the primitives and API exposed by raft based implementations are nearly identical to what zookeeper provides. Sadly, the existing raft implementations do not yet seem mature enough for this to be a realistic alternative; I strongly believe we can revisit this in the near future as those implementations mature and are more extensively proven out (and adopted).
  • Don't change anything; this doesn't seem so elegant (and it also restricts the API of the current design to just one system, which means the API could be unknowingly and unnecessarily fragile).
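
To illustrate the pickle concern raised for RQ above, the sketch below shows one possible alternative: posting a job as a JSON payload that refers to the work by a registered name instead of serializing the function itself. Everything here (`REGISTRY`, `register`, `dump_job`, `run_job` and the payload layout) is a hypothetical illustration, not RQ's or taskflow's API.

```python
import json

# Hypothetical registry mapping job names to callables known to workers.
REGISTRY = {}


def register(name):
    """Decorator that makes a callable available to workers under name."""
    def decorator(func):
        REGISTRY[name] = func
        return func
    return decorator


@register("add")
def add(x, y):
    return x + y


def dump_job(name, **kwargs):
    """Serialize a job as plain JSON data; no code is ever serialized."""
    if name not in REGISTRY:
        raise KeyError("unknown job name: %r" % (name,))
    return json.dumps({"name": name, "kwargs": kwargs}, sort_keys=True)


def run_job(payload):
    """Look the job up by name on the consumer side and execute it."""
    job = json.loads(payload)
    func = REGISTRY[job["name"]]  # only registered callables can run
    return func(**job["kwargs"])
```

Because the payload carries only a name and JSON arguments, a consumer can run nothing beyond what it has explicitly registered, whereas unpickling untrusted data can execute arbitrary code.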

Impact on Existing APIs

No API changes; this should be an API compatible new backend that can be selected using stevedore. If the existing API is too specific to the current implementation, then we will need to consider how to adjust it in a backward-compatible manner.

Security impact

Redis has limited support for authentication, as it is designed to be run inside trusted environments. This should be taken into account when selecting redis as a deployed implementation. The existing zookeeper implementation is better off, since it supports SASL (since 3.4.0) and has restricted ACLs natively built in.

Performance Impact

None expected.

Configuration Impact

A new set of configuration options will be required when selecting the new backend. It will likely involve at least the following:

  • The redis server IP and port.
  • The key prefix that should be used (used for name-spacing servers and clients).
  • One or more pub/sub channels (used so that workers become aware of new work being posted).
  • Likely a few others.
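
As a rough idea of what such a configuration might look like, here is a minimal sketch. The class name `RedisJobboardConf`, its field names and its default values are assumptions made for illustration only; the actual option names would be decided during implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedisJobboardConf:
    """Hypothetical container for the options listed above."""

    host: str = "127.0.0.1"                # redis server IP
    port: int = 6379                       # redis server port
    key_prefix: str = "taskflow"           # name-spaces all keys
    notify_channel: str = "taskflow.jobs"  # pub/sub channel for new work

    def __post_init__(self):
        # Fail early on values that could never work.
        if not 0 < self.port < 65536:
            raise ValueError("port must be in 1..65535, got %r" % (self.port,))
        if not self.key_prefix:
            raise ValueError("key_prefix must not be empty")

    def key(self, *parts):
        """Build a namespaced key from the prefix and the given parts."""
        return ":".join((self.key_prefix,) + parts)
```

For example, `RedisJobboardConf(key_prefix="demo").key("jobs", "42")` yields `demo:jobs:42`, so several deployments can share one redis server without their keys colliding.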

Developer Impact

This should make it easier for developers (and deployers) to start using the job and jobboard functionality that taskflow offers. They can test locally and deploy to small and medium sized environments using redis, and for larger environments they can use the alternative, feature compatible implementation based on zookeeper (or later raft, when that is ready).

Implementation

Assignee(s)

Primary assignee:

  • <TBD>

Other contributors:

  • <TBD>

Milestones

K (or at least end of J).

Work Items

  • Investigate a prototype with the RQ library (and report back on failures or successes). If this seems like a feasible implementation, consider just using it instead.
  • If RQ is not a feasible implementation, then create an implementation using tooz primitives (the tooz library likely requires redis additions to make this possible). If the tooz change turns out not to be feasible, then just use the redis python library and work on making the other solutions more feasible (eventually deprecating/replacing the created implementation when those other solutions become feasible).
  • Test like crazy.
  • Provide/update documentation so that people know how to use it.

Incubation

N/A

Adoption

N/A

Library

N/A

Anticipated API Stabilization

Hopefully the existing API just works and no tweaks are required to make the redis implementation operate correctly. If stabilization is required, I would expect it to take no more than one release cycle to flesh out/adjust.

Documentation Impact

New documentation describing the feature, how to use it and the capabilities (and drawbacks, see above) that come along with using it. It is expected that the documentation will be updated with this new addition so that users can easily reference how to take advantage of it (extra brownie points for adding working and understandable examples as well).

Dependencies

  • Redis client in requirements: the redis python client is already part of the global requirements repository (it was added before or during the havana cycle, so it has been there for quite a while).
  • RQ in requirements (if it is feasible) or tooz in requirements (neither is currently in the requirements repository).

References

If tooz works out, then we can later consider also moving the zookeeper based implementation to complementary tooz primitives, and remove/deprecate or augment some or all of that code in taskflow's existing implementation.

Note

This work is licensed under a Creative Commons Attribution 3.0 Unported License. http://creativecommons.org/licenses/by/3.0/legalcode