* The logic of calculating a task result in case of "with-items" was
overcomplicated and broke encapsulation of a "with-items" task.
This patch makes it simpler, so that the method doesn't need to
peek into the internals of a "with-items" task (e.g. runtime_context).
Change-Id: I036193cbae15d7f3c3414b123525ceafa91fdeb1
* The purpose of this patch is to improve encapsulation of task
execution state management. We already have the class Task
(engine.tasks.Task) that represents an engine task and it is
supposed to be responsible for everything related to managing
persistent state of the corresponding task execution object.
However, we break this encapsulation in many places and various
modules manipulate task execution state directly. This leads to
what is called "spaghetti code" because important things are
spread out across the system and are hard to maintain. It also
leads to a lot of duplication. So this patch refactors policies
so that they manipulate a task execution through an instance of
Task, which hides low-level aspects.
Change-Id: Ie728bf950c4244db3fec0f3dadd5e195ad42081d
The fail-on policy allows failing a successful task by condition. It is
useful when a task has to fail because its result is unacceptable, and it
makes the workflow definition more readable.
Change-Id: I57b4f3d1533982d3b9b7063925f8d70f044aefea
Implements: blueprint fail-on-policy
Signed-off-by: Oleg Ovcharuk <vgvoleg@gmail.com>
* After this patch we can switch scheduler implementations in the
configuration. All functionality related to scheduling jobs is
now expressed via the internal API classes Scheduler and
SchedulerJob. The patch also adds another entry point into setup.cfg
where we can register a new scheduler implementation.
* The new scheduler (which is now called DefaultScheduler) still
should be considered experimental and requires a lot of testing
and optimisations.
* Fixed and refactored "with-items" tests. Before the patch they
were breaking the "black box" testing principle and relied on
data that is either a pure implementation detail or volatile
(e.g. checks of the internal 'capacity' property).
* Fixed all other relevant tests.
Change-Id: I340f886615d416a1db08e4516f825d200f76860d
Delayed calls for nonexistent entities should not fail; they should do
nothing and be deleted in the normal way.
Change-Id: I1b818d671468b95ce8ae06416b57fd4a22cc6eb2
Signed-off-by: Oleg Ovcharuk <vgvoleg@gmail.com>
* The action_queue module is replaced with the more generic
post_tx_queue module that allows registering operations that must
run after the main DB transaction associated with processing a
workflow event, such as completing an action.
* Instead of calling the workflow completion check from all places
where a task may possibly complete, Mistral now registers a
post-transactional operation right inside the task completion
logic. The operation runs after the main DB transaction, which
ensures that at least one consistent DB read happens. This
reduces clutter significantly.
* Workflow completion check is now registered only if the just
completed task may lead to workflow completion, i.e. if it's the
last one in a workflow branch.
* Join now checks delayed calls to reduce the number of join
completion checks created with the scheduler, and also uses the
post-transactional queue for that.
Closes-Bug: #1801872
Change-Id: I90741d4121c48c42606dfa850cfe824557b095d0
* The workflow completion algorithm uses periodic scheduled jobs to
poll the DB and determine when a workflow is finished. The problem
with this approach is that if Mistral runs another iteration
of such a job too soon, then running these jobs creates a big
load on the system. If too late, then a workflow may stay in the
RUNNING state for too long after all its tasks are completed.
The current implementation tries to predict the delay with which
the next job should run, based on the number of incomplete tasks.
This approach was initially taken because we switched to a
non-blocking transactional model (previously we locked the entire
workflow execution graph in order to change a state of anything)
and in this architecture, when we have parallel branches, i.e.
parallel DB transactions, we can't make a consistent read from
DB from any of these transactions to make a reliable decision
about whether the workflow is completed or not. Using periodic
jobs was a solution. However, this approach has proven to
work unreliably because such a prediction about the delay before
the next job iteration doesn't work well across the variety of
use cases that we have.
This patch removes the use of periodic jobs in favor of the
"two transactions" approach: in the first transaction we
handle the action completion event (and task completion if it
causes it) and in the second transaction, if a task is completed,
we check if the workflow is completed. This approach guarantees
that at least one of the "second" transactions in parallel
branches will make the needed consistent read from DB (i.e. will
see the actual state of all needed objects) to make the right
decision.
Closes-Bug: #1799382
Change-Id: I2333507503b3b8226c184beb0bd783e1dcfa397f
* Previously we stored the data structure describing the current
task execution (id and name) in the inbound task execution context
directly so that it'd be saved to DB. This was needed to evaluate
YAQL/Jinja function task() without parameters properly. However,
it's not needed: we can just build a context view on the fly
just before evaluating an expression.
Change-Id: If523039446ab3e2ccc9542617de2a170168f6e20
Closes-Bug: #1764704
* Commands going after 'pause' in 'on-XXX' clauses
were never processed after workflow resume. The
solution is to introduce a notion of a workflow
execution backlog where we can save these commands
in a serialized form so that the engine dispatcher
could see and process them after resume.
* Other minor changes
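A minimal sketch of the backlog idea, assuming commands are plain
dictionaries and the workflow execution is a dictionary as well. The
dispatch() and resume() functions and the 'backlog' key are hypothetical
names used only for illustration, not the engine's actual interfaces.

```python
import json


def dispatch(commands, wf_ex):
    """Process commands in order; stash whatever follows 'pause'."""
    for index, command in enumerate(commands):
        if command["name"] == "pause":
            wf_ex["state"] = "PAUSED"
            # Serialize the remaining commands so they survive until resume.
            wf_ex["backlog"] = json.dumps(commands[index + 1:])
            return
        wf_ex["processed"].append(command["name"])


def resume(wf_ex):
    """Resume the workflow and process the saved backlog, if any."""
    wf_ex["state"] = "RUNNING"
    backlog = json.loads(wf_ex.pop("backlog", "[]"))
    dispatch(backlog, wf_ex)


wf_ex = {"state": "RUNNING", "processed": []}
on_success = [{"name": "task1"}, {"name": "pause"}, {"name": "task2"}]

dispatch(on_success, wf_ex)
processed_before_resume = list(wf_ex["processed"])
resume(wf_ex)
```

Here 'task2' is not lost when the workflow pauses; it is processed once
resume() reads the serialized backlog.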
Change-Id: I963b5660daf528d1caf6a785311de4fb272cafd0
Closes-Bug: #1714054
It should be possible to specify a timeout for Mistral actions in order
to cancel long-running actions and so provide a predictable
execution time for the client service.
Currently Mistral allows configuring a timeout on a task and automatically
changes the task status to error. However, Mistral doesn't interrupt the
action execution.
We need Mistral to terminate a timed-out action execution, because
otherwise the following issues may occur:
* several executions of the same action can run at the same time,
breaking data consistency
* stale action executions may lead to massive resource
consumption (memory, CPU, ...)
Change-Id: I2a960110663627a54b8150917fd01eec68e8933d
Signed-off-by: Vitalii Solodilov <mcdkr@yandex.ru>
RetryPolicy: prevent break_on from being evaluated before task execution.
Sometimes expressions in break_on require the existence of a task execution
(see the example in the updated test). But if break_on is evaluated before
the first execution of the task, it may raise an exception.
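The guard can be sketched roughly as follows. This is not the RetryPolicy
code: should_break() is a hypothetical helper, the break_on expression is
modelled as a plain Python function, and a missing task execution is
modelled as None.

```python
def should_break(break_on, task_ex):
    """Evaluate break_on only once the task has run at least once."""
    if task_ex is None:
        # No task execution yet, so there is nothing to evaluate against.
        return False
    return bool(break_on(task_ex))


def break_on(task_ex):
    """Stand-in for an expression that reads the task result."""
    return task_ex["result"]["code"] == 404


before_first_run = should_break(break_on, None)
after_failed_run = should_break(break_on, {"result": {"code": 404}})
```

Without the None check, calling break_on(None) in this sketch would raise
a TypeError, which mirrors the failure described above.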
Change-Id: Ia836c0330dbed62954d79059df1bef3758f7c5e5
Signed-off-by: Anton Kazakov <ton.kazakov@gmail.com>
Signed-off-by: Vitalii Solodilov <mcdkr@yandex.ru>
When the DB is disconnected, the Mistral API should retry the
operation for a predefined amount of time, at least for GET
type requests, as this error is highly likely to be caused
by temporary failures. The handling of operational errors
was already implemented.
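A time-bounded retry of this kind could look like the sketch below. The
DBConnectionError class is a local stand-in, and the decorator name and
its 'timeout' and 'interval' parameters are assumptions made for
illustration, not the names used in the Mistral API code.

```python
import functools
import time


class DBConnectionError(Exception):
    """Local stand-in for a 'connection lost' DB error (hypothetical)."""


def retry_on_disconnect(timeout=1.0, interval=0.01):
    """Retry the wrapped call until it succeeds or the deadline passes."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            deadline = time.monotonic() + timeout
            while True:
                try:
                    return func(*args, **kwargs)
                except DBConnectionError:
                    if time.monotonic() >= deadline:
                        raise  # give up once the time budget is spent
                    time.sleep(interval)
        return wrapper
    return decorator


attempts = {"count": 0}


@retry_on_disconnect(timeout=1.0, interval=0.01)
def get_execution():
    """Simulated GET handler that fails twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise DBConnectionError("connection lost")
    return {"id": "123", "state": "RUNNING"}


result = get_execution()
```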
Change-Id: I3adb94dd695aeaa40d37956beae088d5618422c3
* Deletion of delayed calls is incorrect. A list of delayed calls
gets deleted within one DB transaction and if at least one object
is not deleted because of a DBDeadlock exception (on MySQL) then
the entire transaction fails and, what's more important, the
exception is swallowed by the try-finally block without reraising
it so that it could be handled by the "retry_on_deadlock" decorator.
This patch fixes this problem by reraising the initial exception.
* Added the "retry_on_deadlock" decorator to all methods that
open DB transactions and where we have a risk of hitting a deadlock.
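Why the re-raise matters can be shown with a small sketch. The DBDeadlock
class and the decorator below are local stand-ins written for
illustration; they only mimic the behaviour described above and are not
the decorator used in Mistral.

```python
import functools


class DBDeadlock(Exception):
    """Local stand-in for a DB deadlock error (hypothetical)."""


def retry_on_deadlock(max_attempts=3):
    """Re-run the wrapped function when it raises DBDeadlock."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except DBDeadlock:
                    if attempt == max_attempts:
                        raise
        return wrapper
    return decorator


state = {"calls": 0, "rollbacks": 0}


@retry_on_deadlock(max_attempts=3)
def delete_delayed_calls(call_ids):
    """Simulated deletion that deadlocks on the first attempt only."""
    state["calls"] += 1
    try:
        if state["calls"] == 1:
            raise DBDeadlock("deadlock while deleting")
        return len(call_ids)
    except Exception:
        state["rollbacks"] += 1  # clean up the failed transaction...
        raise  # ...then re-raise so the decorator can see the deadlock


deleted = delete_delayed_calls(["a", "b"])
```

If the except block in this sketch did not re-raise, the decorator would
never observe the deadlock and the deletion would not be retried.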
Change-Id: I816c8c2a940e38cf1698d76e1019671249238598
When a workflow is paused by pause-before, the state will cascade down
to other subworkflows and up to the parent workflow.
Change-Id: Ied178fe08f8308455bf05b3168635a3b69799cec
Closes-Bug: #1700196
If a task is specified with a retry count of 1, it is not retried
on error. This patch changes the retries_remain condition so that
a count of 1 results in one retry.
Change-Id: Ib0ede7a119bb57108141e50722928d53dd904d5f
Closes-Bug: #1631140
Allow action executions to be cancelled, specifically for async actions, and
handle the cancellation for task and with-items task appropriately. For
with-items tasks, if one of the action executions is cancelled, then the
task is cancelled. Previously, if there was a mix of errors and cancels, the
task was marked with error. But this led to on-complete being processed,
which shouldn't happen since the with-items task is incomplete due to being
partially cancelled.
Change-Id: Iafc2263735f75fe06ae5f03a885cda8f965a7cc4
Implements: blueprint mistral-cancel-state
* 'in_context' field of task executions changed its semantics to
not store workflow input and other data stored in the initial
workflow context such as openstack security context and workflow
variables, therefore task executions occupy less space in DB
* Introduced ContextView class to avoid having to merge
dictionaries every time we need to evaluate YAQL functions
against some context. This class is a composite structure
built on top of regular dictionaries that provides priority
based lookup algorithm over these dictionaries. For example,
if we need to evaluate an expression against a task inbound
context we just need to build a context view including
task 'in_context', workflow initial context (wf_ex.context)
and workflow input dictionary (wf_ex.input). Using this
class is a significant performance boost
* Fixed unit tests
* Other minor changes
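The priority-based lookup described for ContextView is similar in spirit
to Python's standard collections.ChainMap, which the sketch below uses as
an analogy. This is not the ContextView implementation, and the lookup
order shown (task 'in_context' first, then the workflow context, then the
workflow input) is an assumption made for illustration.

```python
from collections import ChainMap

# Sample data; the keys and values are made up for the example.
task_in_context = {"vm_name": "from-task"}
wf_context = {"vm_name": "from-wf-context", "region": "eu"}
wf_input = {"region": "us", "flavor": "small"}

# No dictionaries are merged or copied; lookups walk the maps in order.
view = ChainMap(task_in_context, wf_context, wf_input)

vm_name = view["vm_name"]  # found in the first dictionary
region = view["region"]    # falls through to the second one
flavor = view["flavor"]    # falls through to the third one
```

Building such a view costs almost nothing compared to merging the three
dictionaries on every expression evaluation.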
Change-Id: I7fe90533e260e7d78818b69a087fb5175b9d5199
* Having different types of execution objects in different
tables will give less contention on DB tables and hence better
performance so DB schema was changed accordingly
* Fixed all unit tests and places in the code where we assumed
polymorphic access to execution objects
* Other minor fixes
TODO(in upcoming patches):
* DB migration script
Change-Id: Ibc8408e12dd85e143302d7fdddace32954551ac5
* If a task needs to be continued, e.g. in case of the 'wait-before'
policy which inserts a delay into normal task execution flow (between
creation of task policy and scheduling actions), possible exceptions
also need to be handled properly (move task and workflow into ERROR).
This patch adds error handling and a test to check this.
* Other minor changes related to addressing a few TODO's across engine
code.
Change-Id: I525f193a149e3b0341aa8d0ffa0858ded96ba94f
* Introduced class hierarchies Task and Action used by Mistral engine.
Note: Action here is different from the executor Action and rather
represents actions of different types: regular Python action, ad-hoc
action and workflow action (since for a task, action and workflow are
polymorphic)
* Refactored task_handler.py and action_handler.py with Task and Action
hierarchies
* Rebuilt the call chain so that the entire action processing looks
like a chain of calls Action -> Task -> Workflow where each level
knows only about the next level and can influence it (e.g. if an
ad-hoc action has failed due to a YAQL error in the 'output'
transformer, the action itself fails its task)
* Refactored policies according to new object model
* Fixed some of the tests to match the idea of having two types of
exceptions, MistralException and MistralError, where the latter
is considered either a harsh environmental problem or a logical
issue in the system itself so that it must not be handled anywhere
in the code
TODO(in subsequent patches):
* Refactor WithItemsTask w/o using with_items.py
* Remove DB transaction in Scheduler when making a delayed call,
helper policy methods like 'continue_workflow'
* Refactor policies test so that workflow definitions live right
in test methods
* Refactor workflow_handler with Workflow abstraction
* Get rid of RunExistingTask workflow command, it should be just
one command with various properties
* Refactor resume and rerun with Task abstraction (same way as
other methods, e.g. on_action_complete())
* Add error handling to all required places such as
task_handler.continue_task()
* More tests for error handling
P.S. This patch is very big but it was nearly impossible to split
it into multiple smaller patches just because of how entangled
everything was in Mistral Engine.
Partially implements: blueprint mistral-engine-error-handling
Implements: blueprint mistral-action-result-processing-pipeline
Implements: blueprint mistral-refactor-task-handler
Closes-Bug: #1568909
Change-Id: I0668e695c60dde31efc690563fc891387d44d6ba
* Adding state_info to fail_task_if_incomplete solves the issue
* Unskip test TaskDefaultsReverseWorkflowEngineTest#test_task_defaults_timeout_policy
Closes-Bug: #1527976
Change-Id: I1f44f648ea71d2dcf8bdca77e6bcca0023963be0
While creating a policy, the '>' operator is used, which in py34
raises an exception when the value is provided from an
input parameter.
The exception was:
TypeError: unorderable types: str() > int()
TODO: Add more unit tests to catch such scenarios.
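The failure mode is easy to reproduce in isolation. The sketch below is
not the policy code; is_positive_unsafe() and is_positive() are
hypothetical helpers, and converting with int() is just one possible way
to normalize a value that arrives as a string from workflow input.

```python
def is_positive_unsafe(value):
    """Compare directly; fails on Python 3 when value is a string."""
    return value > 0


def is_positive(value):
    """Normalize to int first so string input is compared numerically."""
    return int(value) > 0


try:
    is_positive_unsafe("3")
    raised = False
except TypeError:
    # Python 3 refuses to order str and int against each other.
    raised = True

safe_result = is_positive("3")
```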
Partially-Implements: blueprint mistral-py3
Change-Id: I2c652812ae4a04cd7610f2a6684da76c582a4e32
Since the task execution API get_all method runs in a transaction block,
a lot of reads against the task execution API GET method will lead to
unnecessary DB locks that can result in deadlocks and, consequently,
WF execution failures.
Change-Id: I5a6b7829176178bb6e06768e9d52e94202cf4347
Closes-Bug: #1501433
* As discussed in the mailing list it's better to rename DELAYED
to RUNNING_DELAYED so that we semantically express it as a substate
of RUNNING whereas WAITING is not.
Closes-Bug: #1470369
Change-Id: I3b7033d894d29fe755d4d0262c1029c4576421cd
When the retry policy is used without a continue-on clause, the retry
iteration will still be scheduled even if the task succeeds.
This patch fixes the problem by returning when the task succeeds with the
retry policy.
Change-Id: I9f07ed3565fe7169f2831a435e4e76a49af34f6c
Closes-Bug: #1469330