Commit Graph

80 Commits

Author SHA1 Message Date
Stephen Finucane 43a5f3984e db: Remove layer of indirection
We don't have another ORM to contend with here. Simplify
'heat.db.sqlalchemy' to 'heat.db'.

Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
Change-Id: Id1db6c0ff126859f436c6c9b1187c250f38ebb62
2023-03-25 12:02:27 +09:00
Zane Bitter 5c326c22df Simplify logic in retrigger_check_resource()
The node to retrigger (cleanup or update) depends only on whether the
update node appears in the new traversal's graph, not on what type of
node in the old traversal was blocking the new one. Simplify the logic
and remove the unused parameter.

Also use the ConvergenceNode named tuple instead of raw tuples
everywhere.

Change-Id: I00aecb2b4b52d3d759446f22c69891fb85c4c735
2020-04-30 10:51:45 -04:00
Zane Bitter 38614a78c1 Add unit test for nested stack cancel
Test that when cancelling a nested stack, its children also get
cancelled.

Change-Id: Icfd4ef1654dd141d17541bed48fee412001efdec
2019-10-29 23:18:13 -04:00
Zane Bitter e63778efc9 Eliminate client race condition in convergence delete
Previously when doing a delete in convergence, we spawned a new thread to
start the delete. This was to ensure the request returned without waiting
for potentially slow operations like deleting snapshots and stopping
existing workers (which could have caused RPC timeouts).

The result, however, was that the stack was not guaranteed to be
DELETE_IN_PROGRESS by the time the request returned. In the case where a
previous delete had failed, a client request to show the stack issued soon
after the delete had returned would likely show the stack status as
DELETE_FAILED still. Only a careful examination of the updated_at timestamp
would reveal that this corresponded to the previous delete and not the one
just issued. In the case of a nested stack, this could leave the parent
stack effectively undeletable. (Since the updated_at time is not modified
on delete in the legacy path, we never checked it when deleting a nested
stack.)

To prevent this, change the order of operations so that the stack is first
put into the DELETE_IN_PROGRESS state before the delete_stack call returns.
Only after the state is stored, spawn a thread to complete the operation.

Since there is no stack lock in convergence, this gives us the flexibility
to cancel other in-progress workers after we've already written to the
Stack itself to start a new traversal.

The previous patch in the series means that snapshots are now also deleted
after the stack is marked as DELETE_IN_PROGRESS. This is consistent with
the legacy path.

Change-Id: Ib767ce8b39293c2279bf570d8399c49799cbaa70
Story: #1669608
Task: 23174
2018-07-30 20:48:28 -04:00
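The reordering this commit describes can be sketched as follows. This is an illustrative Python sketch, not Heat's actual code; the class and function names are hypothetical.

```python
import threading

class FakeStack:
    """Stands in for a Heat stack; state_set models the DB write."""
    def __init__(self):
        self.state = 'DELETE_FAILED'   # a previous delete failed

    def state_set(self, state):
        self.state = state

def delete_stack(stack, slow_cleanup):
    # Persist DELETE_IN_PROGRESS first, so a client that shows the stack
    # right after this call returns sees the correct status.
    stack.state_set('DELETE_IN_PROGRESS')
    # Only then spawn a thread for the potentially slow operations
    # (deleting snapshots, stopping existing workers).
    t = threading.Thread(target=slow_cleanup)
    t.start()
    return t

stack = FakeStack()
done = []
t = delete_stack(stack, lambda: done.append(True))
assert stack.state == 'DELETE_IN_PROGRESS'  # visible immediately
t.join()
assert done == [True]                       # slow work still completes
```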
Zane Bitter 6a176a270c Use a namedtuple for convergence graph nodes
The node key in the convergence graph is a (resource id, update/!cleanup)
tuple. Sometimes it would be convenient to access the members by name, so
convert to a namedtuple.

Change-Id: Id8c159b0137df091e96f1f8d2312395d4a5664ee
2017-09-26 16:46:17 -04:00
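The conversion can be illustrated with a minimal sketch; the field names below are assumptions, not necessarily the ones Heat uses.

```python
import collections

# Graph nodes were (resource id, update/!cleanup) tuples; a namedtuple
# keeps tuple behaviour while allowing access by name.
ConvergenceNode = collections.namedtuple('ConvergenceNode',
                                         ['rsrc_id', 'is_update'])

node = ConvergenceNode(rsrc_id=42, is_update=True)

assert node.rsrc_id == 42          # access by name...
assert node[1] is True             # ...or by index, as before
assert node == (42, True)          # still compares equal to a raw tuple
```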
Jenkins b4a1ad2bd5 Merge "Avoid creating two Stacks when loading Resource" 2017-08-14 12:23:17 +00:00
ricolin 552f94b928 Add converge flag in stack update for observing on reality
Add a converge parameter to the stack update API and RPC call,
allowing a stack update to trigger observation of reality. It is
triggered by an API call with a converge argument (with a True or
False value). This flag also works for resources within nested
stacks.
Implements bp get-reality-for-resources

Change-Id: I151b575b714dcc9a5971a1573c126152ecd7ea93
2017-08-07 05:39:29 +00:00
Zane Bitter 960f626c24 Avoid creating two Stacks when loading Resource
When load()ing a Resource in order to check it, we must load its definition
from whatever version of the template it was created or last updated with.
Previously we created a second Stack object with that template in order to
obtain the resource definition. Since all we really need in order to obtain
this is the StackDefinition, create just that instead.

Change-Id: Ia05983c3d1b838d2e28bb5eca38d13e83ccaf368
Implements: blueprint stack-definition
2017-07-21 10:44:51 -04:00
Jenkins 224a83821a Merge "Fix _retrigger_replaced in convergence worker" 2017-07-21 10:53:03 +00:00
Zane Bitter 33a16aa7a8 Log unhandled exceptions in worker
RPC calls to the worker use 'cast', so nothing is listening to find out the
result. If an exception occurs we will never hear about it. This change
logs such unhandled exceptions as errors.

Change-Id: I51365a9dee8fd4eff85e77d3e42bf33be814a22c
Partial-Bug: #1703043
2017-07-10 16:43:38 -04:00
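The pattern can be sketched with a decorator. This is an illustrative sketch under assumed names; check_resource here is a hypothetical handler, not Heat's actual implementation.

```python
import functools
import logging

LOG = logging.getLogger(__name__)

# Since 'cast' RPC calls have no caller waiting for a result, wrap
# handlers so any unhandled exception is at least logged as an error
# instead of vanishing with the worker thread.
def log_exceptions(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            LOG.exception('Unhandled exception in %s', func.__name__)
    return wrapper

@log_exceptions
def check_resource(resource_id):  # hypothetical handler name
    raise RuntimeError('boom')

check_resource(1)  # logs the error; does not propagate
```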
ricolin 6d7506c690 Fix _retrigger_replaced in convergence worker
Fix missing argument in _retrigger_replaced when calling
CheckResource.
Closes-Bug: #1702487

Change-Id: Idc81b50fcc7036aa90f1489a348572ef03aa3381
2017-07-06 10:25:23 +08:00
Zane Bitter 5681e237c5 Avoid creating new resource with old template
If a traversal is interrupted by a fresh update before a particular
resource is created, then the resource is left stored in the DB with the
old template ID. While an update always uses the new template, a create
assumes that the template ID in the DB is correct. Since the resource has
never been created, the new traversal will create it using the old
template.

To resolve this, detect the case where the resource has not been created
yet and we are about to create it and the traversal ID is still current,
and always use the new resource definition in that case.

Change-Id: Ifa0ce9e1e08f86b30df00d92488301ea05b45b14
Closes-Bug: #1663745
2017-06-05 23:14:19 -04:00
liyi 8f10215ffd Remove log translations
Log messages are no longer being translated. This removes all use of
the _LE, _LI, and _LW translation markers to simplify logging and to
avoid confusion with new contributions.

See:
http://lists.openstack.org/pipermail/openstack-i18n/2016-November/002574.html
http://lists.openstack.org/pipermail/openstack-dev/2017-March/113365.html

Change-Id: Ieec8028305099422e1b0f8fc84bc90c9ca6c694f
2017-03-25 17:11:50 +08:00
Zane Bitter bc4fde4dce Add a NodeData class to roll up resource data
Formalise the format for the output data from a node in the convergence
graph (i.e. resource reference ID, attributes, &c.) by creating an object
with an API rather than ad-hoc dicts.

Change-Id: I7a705b41046bfbf81777e233e56aba24f3166510
Partially-Implements: blueprint stack-definition
2017-02-24 10:10:26 -05:00
Thomas Herve 84067dba88 Remove db.api wrapper
The db.api module provides a useless indirection to the only
implementation we ever had, sqlalchemy. Let's use that directly instead
of the wrapper.

Change-Id: I80353cfed801b95571523515fd3228eae45c96ae
2016-12-13 09:40:29 +01:00
Jenkins bcf3889774 Merge "Cleanup service usage" 2016-11-22 13:14:09 +00:00
Crag Wolfe 892a4eac36 Do not load templates in stop_traversal
When iterating through nested stacks in stop_traversal, there is no
need to load or process templates.

Change-Id: If2795cff4a9e7052e2186c811cdcd3e9451f9ff6
2016-11-07 11:27:21 -08:00
Thomas Herve 34f6ff920e Cleanup service usage
oslo_service Service usage in the engine was slightly wrong: we
inherited from the base class without using its threadgroup, and we also
inherited from it in utility classes that were not real services. This
cleans those up.

Change-Id: I0f902afb2b4fb03c579d071f9b502e3108aa460a
2016-11-03 07:59:10 +01:00
zhufl 5c74723f5e Add missing %s in print message
This adds the missing %s to a print message.

Change-Id: Ibfc88c579442c38b5c58babae358d113c85c4172
2016-09-21 10:58:35 +08:00
Jenkins 07808e280a Merge "Re-trigger on update-replace" 2016-09-20 23:26:40 +00:00
Anant Patil 99b055b423 Re-trigger on update-replace
It was found that the interleaving of locks when an update-replace of a
resource is needed is the reason the new traversal was not triggered.

Consider the order of events below:
1. A server is being updated. The worker locks the server resource.
2. A rollback is triggered because someone cancelled the stack.
3. As part of rollback, new update using old template is started.
4. The new update tries to take the lock but it has already been
acquired in (1). The new update now expects that when the old
resource is done, it will re-trigger the new traversal.
5. The old update decides to create a new resource for replacement. The
replacement resource is initiated for creation, a check_resource RPC
call is made for new resource.
6. A worker, possibly in another engine, receives the call and then it
bails out when it finds that there is a new traversal initiated (from
2). Now, there is no progress from here because it is expected (from 4)
that there will be a re-trigger when the old resource is done.

This change takes care of re-triggering the new traversal from worker
when it finds that there is a new traversal and an update-replace. Note
that this issue will not be seen when there is no update-replace
because the old resource will finish (either fail or complete) and in
the same thread it will find the new traversal and trigger it.

Closes-Bug: #1625073
Change-Id: Icea5ba498ef8ca45cd85a9721937da2f4ac304e0
2016-09-20 11:58:24 +00:00
Anant Patil bc2e136fe3 Cancel traversal of nested stack
The stack cancel update would halt the parent stack from propagating,
but the nested stacks kept going until they either failed or completed.
This is not desired: the cancel update should stop all the nested
stacks from moving further, though it shouldn't abruptly stop the
currently running workers.

Change-Id: I3e1c58bbe4f92e2d2bfea539f3d0e861a3a7cef1
Co-Authored-By: Zane Bitter <zbitter@redhat.com>
Closes-Bug: #1623201
2016-09-15 10:30:58 -04:00
Anant Patil 2e281df428 Fix sync point delete
When a resource failed, the stack state was set to FAILED and the
current traversal was set to an empty string. The actual traversal was
lost and there was no way to delete the sync points belonging to that
traversal.

This change keeps the current traversal when you do a state set, so that
later you can delete the sync points belonging to it. Also, the current
traversal is set to empty when the stack has failed and there is no need
to rollback.

Closes-Bug: #1618155

Change-Id: Iec3922af92b70b0628fb94b7b2d597247e6d42c4
2016-09-14 17:04:22 +05:30
Anant Patil 873a40851d Convergence: basic framework for cancelling workers
Implements mechanism to cancel existing workers (in_progress resources).
The stack-cancel-update request lands in one of the engines, and if
there are any workers in that engine which are working for the stack,
they are cancelled first and then other engines are requested to cancel
the workers.

Change-Id: I464c4fdb760247d436473af49448f7797dc0130d
2016-09-10 09:22:36 +02:00
Zane Bitter 9c79ee4d69 Add interrupt points for convergence check-resource operations
This allows a convergence operation to be cancelled at an appropriate point
(i.e. between steps in a task) by sending a message to a queue.

Note that there's no code yet to actually cancel any operations
(specifically, sending a cancel message to the stack will _not_ cause the
check_resource operations to be cancelled under convergence).

Change-Id: I9469c31de5e40334083ef1dd20243f2f6779549e
Related-Bug: #1545063
Co-Authored-By: Anant Patil <anant.patil@hpe.com>
2016-08-26 11:02:45 +00:00
Anant Patil 084d0eb20f Convergence cancel update implementation
Implements:
(1) stack-cancel-update <stack_id> will start another update using the
previous template/environment. We'll start rolling back; in-progress
resources will be allowed to complete normally.
(2) stack-cancel-update <stack_id> --no-rollback will set the
traversal_id to None so no further resources will be updated;
in-progress resources will be allowed to complete normally.

Change-Id: I46ebdebb130be7410abe3e0b62f85da9856287b6
2016-08-23 17:01:57 +05:30
Anant Patil 459086f984 Convergence: Cancel message
Implements a cancel message sending mechanism.

A cancel message is sent to heat engines working on the stack.

Change-Id: I3b529addbd02a79364f7f2a041fc87d5019dd5d9
Partial-Bug: #1533176
2016-07-05 07:52:03 +00:00
Jenkins 98b5f3b79c Merge "Convergence: Refactor worker" 2016-05-12 07:13:23 +00:00
Rabi Mishra 51d913a30d Check for worker_service initialization
When stopping the engine, check whether the worker_service is
initialized before stopping it.

Change-Id: I876c2cef4bf6589b9bc45f58b5cd52ed0323c9e9
Closes-Bug: #1572851
2016-04-25 08:33:30 -05:00
Anant Patil 829e80d06e Convergence: Refactor worker
Refactor the worker service; move the check resource code to its own
class in another file and keep the convergence worker RPC API clean.

This refactor will help us contain the convergence logic in a separate
class file instead of in the RPC API. The RPC service class should only
have the APIs it implements.

Change-Id: Ie9cf4daba7e6bf61f4cac3388494e8c9efefa4d7
2016-04-22 12:52:16 +00:00
Jenkins 9d03183ab5 Merge "Use EntityNotFound instead of SyncPointNotFound" 2016-03-30 07:35:50 +00:00
Anant Patil afd08e07b5 Convergence: Avoid cache when resolving input data
While constructing input-data for building the cache, the resource
attributes must resolve without hitting the cache again. It is
unnecessary to look into the cache when resolving attributes of a
freshly baked resource.

Change-Id: I0893c17d87c687ca5cf370c4443f471160bd2f3c
2016-03-08 06:54:06 +00:00
Thomas Herve c4f8db9681 Add function tests for event sinks
Add a new functional test using Zaqar as a target for event sinks. This
fixes the behavior when convergence is on.

Change-Id: I4bbdec55b98d0a261168229540a411d423e9406d
2016-02-22 09:41:13 +00:00
ricolin 0c8d9145da Use EntityNotFound instead of SyncPointNotFound
Unify the NotFound exception with EntityNotFound.

Change-Id: I0c69596eb332b768a606c7b11ef768c4a1404d2e
Depends-On: I782c372723f188bab38656e5b7cc401d23808ffb
2016-01-17 06:19:52 +00:00
Anant Patil b84417b6ce Convergence: Pick resource from dead engine worker
When an engine worker crashes or is restarted, the resources being
provisioned in it remain in the IN_PROGRESS state. The next stack update
should pick up these resources and work on them. The implementation is to
set the status of the resource to FAILED and re-trigger check_resource.

Change-Id: Ib7fd73eadd0127f8fae47881b59388b31131daf4
Closes-Bug: #1501161
2016-01-06 16:01:08 +05:30
Rakesh H S 24d265327e Convergence: Re-trigger failed resource for latest traversal
Presently, when a resource of a previous traversal completes its action
successfully, we re-trigger this resource for the latest traversal
(since the latest traversal will be waiting for its completion).

However, if a resource of a previous traversal fails, we do not
re-trigger it, which leaves the latest traversal waiting endlessly.

This patch re-triggers the resource for latest traversal even when
the resource fails.

Change-Id: I9f70878ad7f1ff7c2facb950e496681425b54fc4
Partial-Bug: #1512343
2015-11-26 09:46:08 +00:00
Anant Patil 634c24ecfe Convergence: Concurrency subtle issues
To avoid certain concurrency related issues, the DB update API needs to
be given the traversal ID of the stack intended to be updated. By making
this change, we can avoid having the following at all the places:

    if current_traversal != stack.current_traversal:
        return

The check for the current traversal should be implicit, as part of the
stack's store and state_set methods, where self.current_traversal should
be used as the expected traversal to be updated. All the state changes
or updates to the stack object in the DB go through this implicit check
(using update ... where).

When stack updates are triggered, the current traversal should be backed
up as the previous traversal, a new traversal should be generated, and
the stack should be stored in the DB with the expected traversal set to
the previous traversal. This will ensure that no two updates can
simultaneously succeed on the same stack with the same traversal ID.
This was one of our primary goals.

Following example cases describe the issues we encounter:

1. When 2 updates, U1 and U2 try to update a stack concurrently:

    1. Current traversal(CT) is X
    2. U1 loads stack with CT=X
    3. U2 loads stack with CT=X
    4. U2 stores the stack and updates CT=Y
    5. U1 stores the stack and updates the CT=Z

    Both the updates have succeeded, and both would be running until
    one of the workers checks stack.current_traversal == current_traversal
    and bails out.

    Ideally, U1 should have failed: only one should be allowed in case
    of concurrent update. When both U1 and U2 pass X as the expected
    traversal ID of the stack, then this problem is solved.

2. A resource R is being provisioned for stack with current traversal
   CT=X:

    1. A new update U is issued; it loads the stack with CT=X.
    2. Resource R fails and loads the stack with CT=X to mark it as FAILED.
    3. Update U updates the stack with CT=Y and goes ahead with sync_point
       etc., marks stack as UPDATE_IN_PROGRESS
    4. Resource R marks the stack as UPDATE_FAILED, which to the user means
       that update U has failed, but it is actually still in progress.

    With this patch, when Resource R fails, it will supply CT=X as
    expected traversal to be updated and will eventually fail because
    update U with CT=Y has taken over.

Partial-Bug: #1512343
Change-Id: I6ca11bed1f353786bb05fec62c89708d98159050
2015-11-26 09:45:49 +00:00
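The update...where pattern described above can be sketched with SQLite; the schema and function names are illustrative, not Heat's actual ones. The WHERE clause makes the traversal check atomic with the write, so only one of two concurrent updates can succeed.

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE stack (id INTEGER PRIMARY KEY, traversal TEXT)')
db.execute("INSERT INTO stack VALUES (1, 'X')")

def stack_update(stack_id, new_traversal, expected_traversal):
    # Only writes if the stored traversal still matches the one the
    # caller loaded; a lost race means zero rows affected.
    cur = db.execute(
        'UPDATE stack SET traversal = ? WHERE id = ? AND traversal = ?',
        (new_traversal, stack_id, expected_traversal))
    return cur.rowcount == 1

# U1 and U2 both loaded the stack at traversal X; only one may win.
assert stack_update(1, 'Y', 'X') is True    # U2 wins
assert stack_update(1, 'Z', 'X') is False   # U1 fails: traversal is now Y
```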
Rakesh H S 77c11d037c Convergence: Load resource stack with correct template
When loading a resource, load the stack with the template of the
resource. The appropriate stack needs to be assigned to the resource
(resource.stack), else resource actions will fail.

Co-Authored-By: Anant Patil <anant.patil@hp.com>
Partial-Bug: #1512343

Change-Id: Ic4526152c8fd027049514b71554036321a61efd2
2015-11-26 14:05:21 +05:30
Peter Razumovsky 2da170c435 Fix [H405] pep rule in heat/engine
Fix [H405] rule in heat/engine python
files.

Implements bp docstring-improvements

Change-Id: Iaa1541eb03c4db837ef3a0e4eb22393ba32e270f
2015-09-21 14:51:46 +03:00
Rakesh H S 1956ddd2a6 Convergence: Store resource status in cache data
Fix failing convergence gate functional tests
- store resource uuid, action, status in cache data. Most of the code
requires the resource to have proper status and uuid to work.
- initialize rsrc._data to None so that the resource data is fetched
from the db the first time.

Change-Id: I7309c7da8fe1ce3e1c7e3d3027dea2e400111015
Co-Authored-By: Anant Patil <anant.patil@hp.com>
Partial-Bug: #1492116
Closes-Bug: #1495094
2015-09-14 17:29:18 +05:30
Oleksii Chuprykov f1b2d9add5 Move Resource exceptions to common module (4)
It is convenient to have all exceptions in the exception module.
It also reduces namespace cluttering of the resource module and
decreases the number of dependencies in other modules (we do not need
to import resource in some cases now).
The UpdateInProgress exception is moved in this patch.

Change-Id: If694c264639bbce5334e1e6e7403b225ce1d3aee
2015-09-04 11:24:47 +00:00
Oleksii Chuprykov 4e2cfb991a Move Resource exceptions to common module (1)
It is convenient to have all exceptions in the exception module.
It also reduces namespace cluttering of the resource module and
decreases the number of dependencies in other modules (we do not need
to import resource in some cases now).
The UpdateReplace exception is moved in this patch.

Change-Id: Ief441ca2022a0d50e88d709d1a062631479715b7
2015-09-04 14:23:53 +03:00
Angus Salkeld dd0859a080 Convergence: add support for the path_component
store the attr name and path so attributes don't get shadowed
e.g. get_attr: [res1, attr_x, show]
     get_attr: [res1, attr_x, something]

Change-Id: I724e91b32776aa5813d2b821c2062424e0635a69
2015-09-01 12:53:05 +05:30
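A minimal sketch of the idea, with hypothetical names: cache attribute values under the full (name, path) key rather than the bare attribute name, so different paths of the same attribute don't shadow each other.

```python
cache = {}

def cache_attr(name, path, value):
    # Key on the whole path, not just the attribute name, so that
    # [attr_x, show] and [attr_x, something] are distinct entries.
    cache[(name, tuple(path))] = value

cache_attr('attr_x', ['show'], 'a')
cache_attr('attr_x', ['something'], 'b')

assert cache[('attr_x', ('show',))] == 'a'
assert cache[('attr_x', ('something',))] == 'b'   # not shadowed
```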
Angus Salkeld 881e4d051a Convergence: input_data physical_resource_id -> reference_id
1. we are caching the result of FnGetRefId which can be the name
2. cache_data_resource_attribute() was trying to access "attributes"
   instead of "attrs".

Change-Id: I59d55dcee2af521924fdb5da14e012dcc7b4dd3f
2015-08-18 12:06:36 +10:00
Anant Patil b5968ef068 Convergence: Implementation of timeout
The resource provisioning work is distributed among heat engines, so the
timeout also has to be distributed and brought down to resource-level
granularity.

Thus,
1. Before invoking check_resource on a resource, ensure that the stack
has not timed out.
2. Pass the remaining amount of time to the resource converge method so
that it can raise a timeout exception if it cannot finish in the
remaining time.

Once a timeout exception is raised by a resource converge method, the
corresponding stack is marked as FAILED with "Timed out" as the failure
reason. Then, if rollback is enabled on the stack, it is triggered.

Change-Id: Id1806d546c67505137f57f72d5b463dc229a666d
2015-08-07 10:05:30 +05:30
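The two steps above can be sketched as a single helper; this is an illustrative sketch with assumed names, not Heat's actual code.

```python
import time

def remaining_time(stack_started_at, timeout_secs):
    # Called before each check_resource: fail fast if the stack has
    # already timed out, otherwise return the budget left so the
    # resource's converge method can bound its own work.
    elapsed = time.monotonic() - stack_started_at
    left = timeout_secs - elapsed
    if left <= 0:
        raise TimeoutError('Timed out')  # stack marked FAILED upstream
    return left

start = time.monotonic()
assert remaining_time(start, 3600) > 0      # plenty of budget left
try:
    remaining_time(start - 7200, 3600)      # started two hours ago
    assert False, 'expected TimeoutError'
except TimeoutError:
    pass
```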
Jenkins ae7cb9bfcb Merge "Convergence: Refactor convergence dependency" 2015-08-04 11:26:22 +00:00
Sirushti Murugesan 5d1027a135 Convergence: Do create operation only if action is INIT
All resources that are new will have an INIT state. Instead
of having a complex strategy to decide whether the resource
should be created or updated, just check for the action
to see if it is in the INIT state or not. If it is not, then
always trigger the update workflow.

Also, this fixes a bug where we triggered a create for a resource
without a resource id that should originally have been updated: it was
in UPDATE_FAILED, which was the unhandled case.

Change-Id: I3f2318fecfe76592e8b54e9c09fdf1614197e83f
2015-08-03 19:12:27 +05:30
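The simplified decision described above can be sketched in a few lines; the helper name and action strings are illustrative assumptions.

```python
INIT = 'INIT'

def choose_workflow(action):  # hypothetical helper, not Heat's code
    # A resource that has never been created still has the INIT action;
    # anything else (including a failed update) routes to update.
    return 'create' if action == INIT else 'update'

assert choose_workflow('INIT') == 'create'
assert choose_workflow('UPDATE') == 'update'
# The previously unhandled case: a failed update with no physical
# resource id still goes through update, not create.
assert choose_workflow('UPDATE_FAILED') == 'update'
```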
Anant Patil cd3931c635 Convergence: Refactor convergence dependency
A new property is added to fetch convergence dependencies from the
stack.

Change-Id: If2eb29f9222f21390513fad5702dc4718d5c4165
2015-08-01 04:20:56 +00:00
Angus Salkeld d23ebb6065 Convergence: clarify what "data" is
Mostly in the worker we have arguments called "data"; it is not clear
whether these are serialized or not (and whether they have adopt data
in them).

1. split adopt data out (add RPC support for the new argument)
2. name arguments "resource_data" for deserialized data
3. name arguments "rpc_data" for serialized data
4. make sure all data into client.check_resource() is serialized

Change-Id: Ie6bd0e45d2857d3a23235776c2b96cce02cb711a
2015-08-01 04:19:33 +00:00
Angus Salkeld 14897230fb Clean up the worker service logging
1. remove the duplication between service.py and worker.py
2. use the topic, version & engine_id when logging

Change-Id: I2b7dfbbe1d5a68a9f1739ab53ba5c08691b495e1
2015-08-01 04:19:15 +00:00