This could cause the entire stack to remain in memory until
garbage-collected. We only need the identifier, so store that instead.
Change-Id: If965b4415d7640b93edd153f2893a7e0c04bc8d6
Partial-Bug: #1626675
(cherry picked from commit 82b8fd8c17)
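As a minimal sketch of the idea (class and attribute names here are illustrative, not Heat's actual ones), long-lived bookkeeping should key off the small identifier rather than hold the stack object itself:

```python
# Sketch: store only the stack's identifier in long-lived state, so the
# Stack object (and everything it references) can be freed promptly.
# These class names are hypothetical, not Heat's real API.

class ThreadGroupManager:
    def __init__(self):
        # Maps a stack identifier (a small string) to bookkeeping data.
        # Holding the full Stack object here would pin its whole object
        # graph in memory until garbage collection.
        self.groups = {}

    def start(self, stack):
        # Store stack.id (cheap) rather than stack (expensive).
        self.groups[stack.id] = {'threads': []}
        return stack.id

class Stack:
    def __init__(self, stack_id, resources):
        self.id = stack_id
        self.resources = resources

mgr = ThreadGroupManager()
big_stack = Stack('s-1', ['r%d' % i for i in range(1000)])
key = mgr.start(big_stack)
print(key)                        # s-1  (only the identifier is kept)
```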
Previously, the stop_stack message accidentally used the
engine_life_check_timeout (by default, 2s). But unlike other messages sent
using that timeout, stop_stack needs to synchronously kill all running
threads operating on the stack. For a very large stack, this can easily
take much longer than a couple of seconds. This patch increases the timeout
to give a better chance of being able to start the delete.
Change-Id: I4b36ed7f1025b6439aeab63d71041bb2000363a0
Closes-Bug: #1499669
(cherry picked from commit e56fc689e1)
A stack may be in a transient state where it is DELETE_COMPLETE but
has not actually been soft-deleted yet. For the purposes of
delete_stack in service.py, consider a DELETE_COMPLETE stack as
equivalent to a soft-deleted one (it soon will be), thereby avoiding a
race where we would have attempted to update the stack, running into a
foreign-key constraint issue for a non-existing user_cred.
Change-Id: Iec021e6a0df262d447fdf9ee1789603c7a1c55f8
Closes-Bug: #1626173
Closes-Bug: #1626107
(cherry picked from commit e1f161a19a)
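The race-avoidance check can be sketched roughly as follows (the field and function names are illustrative, not Heat's actual schema):

```python
# Sketch: treat a stack whose state is already (DELETE, COMPLETE) the
# same as one that has been soft-deleted, since soft-deletion is
# imminent and updating it would hit a foreign-key error on the
# already-removed user_creds row. Illustrative names only.

DELETE, COMPLETE = 'DELETE', 'COMPLETE'

def is_effectively_deleted(stack):
    return stack['deleted_at'] is not None or (
        stack['action'] == DELETE and stack['status'] == COMPLETE)

racy = {'action': DELETE, 'status': COMPLETE, 'deleted_at': None}
gone = {'action': DELETE, 'status': COMPLETE, 'deleted_at': '2016-09-21'}
live = {'action': 'CREATE', 'status': COMPLETE, 'deleted_at': None}

print(is_effectively_deleted(racy))   # True  (soon to be soft-deleted)
print(is_effectively_deleted(gone))   # True
print(is_effectively_deleted(live))   # False
```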
If an exception was raised in delete_stack when deleting a nested stack,
the parent stack would never hear about it because we were accidentally
using cast() instead of call() to do the stack delete. This meant the
parent resource would remain DELETE_IN_PROGRESS until timeout when the
nested stack had already failed and raised an exception.
In the case of bug 1499669, the exception being missed was
StopActionFailed.
Change-Id: I039eb8f6c6a262653c1e9edc8173e5680d81e31b
Partial-Bug: #1499669
(cherry picked from commit e5cec71e52)
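The difference can be illustrated with a toy stand-in for the RPC layer (this is not oslo.messaging, just a sketch of fire-and-forget vs. synchronous semantics):

```python
# Sketch: a cast is fire-and-forget, so an exception raised by the
# remote handler never reaches the caller; a call waits for the result
# and re-raises the remote failure. Toy functions, not oslo.messaging.

def remote_delete(ok):
    if not ok:
        raise RuntimeError('StopActionFailed')
    return 'deleted'

def cast(fn, *args):
    # Fire-and-forget: any remote error is lost, caller sees nothing.
    try:
        fn(*args)
    except Exception:
        pass
    return None

def call(fn, *args):
    # Synchronous: the caller sees the remote exception.
    return fn(*args)

print(cast(remote_delete, False))   # None - the failure is silently lost
try:
    call(remote_delete, False)
except RuntimeError as exc:
    print(exc)                      # StopActionFailed
```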
This particular patch fixes the behaviour of cancel update for a
nova server with defined ports, so that no ports are managed by
nova. We have these issues while restoring ports after a rollback:
1) We don't detach any ports from the current server, because we
don't save them to resource data (we only store this data after a
successful create of the server).
2) Detaching an interface from the current server will fail if the
server is in the BUILD state, so we need to wait until the server is
in the ACTIVE or ERROR state.
Refresh the ports list to solve problem (1).
Wait until nova moves to the ACTIVE/ERROR state to solve problem (2).
A functional test to prove the fix was added. Note that this test is
skipped for convergence engine tests until cancel update works
properly in convergence mode (see bug 1533176).
Partial-Bug: #1570908
Change-Id: If6fd916068a425eea6dc795192f286cb5ffcb794
(cherry picked from commit 584efe3329)
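The wait in (2) can be sketched as a simple poll loop (illustrative names; a real implementation would also sleep between polls and enforce a timeout):

```python
# Sketch: interface detach fails while a server is still in BUILD, so
# poll until it reaches ACTIVE or ERROR before detaching ports.
# Fake status source; not the actual nova client.

def wait_for_detachable(get_status, states=('ACTIVE', 'ERROR')):
    while True:
        status = get_status()
        if status in states:
            return status
        # A real loop would sleep here and give up after a timeout.

# Simulated nova status sequence: two polls in BUILD, then ACTIVE.
statuses = iter(['BUILD', 'BUILD', 'ACTIVE'])
final = wait_for_detachable(lambda: next(statuses))
print(final)   # ACTIVE
```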
It was found that the interleaving of locks when an update-replace of a
resource is needed is the reason the new traversal is not triggered.
Consider the order of events below:
1. A server is being updated. The worker locks the server resource.
2. A rollback is triggered because someone cancelled the stack.
3. As part of the rollback, a new update using the old template is
started.
4. The new update tries to take the lock, but it was already acquired
in (1). The new update now expects that when the old resource is done,
it will re-trigger the new traversal.
5. The old update decides to create a new resource as a replacement.
The replacement resource is initiated for creation, and a
check_resource RPC call is made for the new resource.
6. A worker, possibly in another engine, receives the call and then
bails out when it finds that a new traversal has been initiated (from
2). Now there is no progress from here, because it is expected (from 4)
that there will be a re-trigger when the old resource is done.
This change takes care of re-triggering the new traversal from worker
when it finds that there is a new traversal and an update-replace. Note
that this issue will not be seen when there is no update-replace
because the old resource will finish (either fail or complete) and in
the same thread it will find the new traversal and trigger it.
Closes-Bug: #1625073
Change-Id: Icea5ba498ef8ca45cd85a9721937da2f4ac304e0
(cherry picked from commit 99b055b423)
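A much-simplified sketch of the re-trigger logic (illustrative structure and field names, not Heat's real worker code):

```python
# Sketch: when a worker sees that the stack's current traversal no
# longer matches its own AND the resource it was asked to check is a
# replacement, it must re-trigger the new traversal itself instead of
# silently bailing out. All names here are hypothetical.

def check_resource(stack, rsrc, my_traversal, retrigger):
    if stack['current_traversal'] != my_traversal:
        # Old behaviour: bail out and rely on the old resource's thread
        # to re-trigger -- which never happens for a replacement
        # resource. New behaviour: re-trigger the new traversal here.
        if rsrc.get('replaces') is not None:
            retrigger(stack['current_traversal'])
        return False
    return True

triggered = []
stack = {'current_traversal': 'T2'}
replacement = {'replaces': 'old-server'}
progressed = check_resource(stack, replacement, 'T1', triggered.append)
print(progressed)   # False - stale traversal, so no direct progress
print(triggered)    # ['T2'] - but the new traversal was re-triggered
```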
The error messages 'Command Out of Sync' are due to the threads being
stopped in the middle of the database operations. This happens in the
legacy action when delete is requested during a stack create.
We have the thread cancel message but that was not being used in this
case. Thread cancel should provide a more graceful way of ensuring the
stack is in a FAILED state before the delete is attempted.
This change does the following in the delete_stack service method for
the legacy engine:
- if the stack is still locked, send thread cancel message
- in a subthread wait for the lock to be released, or until a
timeout based on the 4 minute cancel grace period
- if the stack is still locked, do a thread stop as before
Closes-Bug: #1499669
Closes-Bug: #1546431
Closes-Bug: #1536451
Change-Id: I4cd613681f07d295955c4d8a06505d72d83728a0
(cherry picked from commit 3000f90408)
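The sequence above can be sketched as follows (hypothetical helper names, not Heat's exact API):

```python
# Sketch of the cancel-then-stop sequence: try a graceful cancel first,
# wait for the lock to be released, and fall back to a hard thread stop
# only if the stack is still locked afterwards. Illustrative names.

def delete_locked_stack(is_locked, send_cancel, wait_for_unlock, stop):
    if is_locked():
        send_cancel()                  # graceful: let workers wind down
        wait_for_unlock(timeout=240)   # ~4 minute cancel grace period
        if is_locked():
            stop()                     # last resort: kill the threads

events = []
state = {'locked': True}

def fake_wait(timeout):
    events.append('waited %ds' % timeout)
    state['locked'] = False            # cancel succeeded in time

delete_locked_stack(lambda: state['locked'],
                    lambda: events.append('cancel'),
                    fake_wait,
                    lambda: events.append('stop'))
print(events)   # ['cancel', 'waited 240s'] - no hard stop was needed
```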
Stack cancel update would halt the parent stack from propagating, but
the nested stacks kept going until they either failed or completed.
This is not desired: cancel update should stop all the nested stacks
from moving further, although it shouldn't abruptly stop the currently
running workers.
Change-Id: I3e1c58bbe4f92e2d2bfea539f3d0e861a3a7cef1
Co-Authored-By: Zane Bitter <zbitter@redhat.com>
Closes-Bug: #1623201
(cherry picked from commit bc2e136fe3)
When a resource failed, the stack state was set to FAILED and the
current traversal was set to an empty string. The actual traversal was
lost and
there was no way to delete the sync points belonging to the actual
traversal.
This change keeps the current traversal when you do a state set, so that
later you can delete the sync points belonging to it. Also, the current
traversal is set to empty when the stack has failed and there is no need
to rollback.
Closes-Bug: #1618155
Change-Id: Iec3922af92b70b0628fb94b7b2d597247e6d42c4
(cherry picked from commit 2e281df428)
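A rough sketch of the changed behaviour (illustrative fields and helpers, not Heat's actual stack object):

```python
# Sketch: a failing state_set now keeps current_traversal so the
# traversal's sync points can still be deleted; the traversal is
# cleared only once the failed stack will not be rolled back.
# All names here are hypothetical.

def state_set(stack, action, status):
    stack['action'], stack['status'] = action, status
    # Previously current_traversal was blanked here, losing the handle
    # needed to clean up the traversal's sync points.

def mark_failed_no_rollback(stack):
    delete_sync_points(stack['current_traversal'])
    stack['current_traversal'] = ''

deleted = []
def delete_sync_points(traversal):
    deleted.append(traversal)

stack = {'current_traversal': 'T7', 'action': 'UPDATE',
         'status': 'IN_PROGRESS'}
state_set(stack, 'UPDATE', 'FAILED')
print(stack['current_traversal'])         # T7 - still available
mark_failed_no_rollback(stack)
print(deleted)                            # ['T7']
print(repr(stack['current_traversal']))   # '' - cleared only now
```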
We used to try to acquire the stack lock in order to find out which engine
to cancel a running update on, in the misguided belief that it could never
succeed. Accordingly, we never released the lock.
Since it is entirely possible to encounter a race where the lock has
already been released, use the get_engine_id() method instead to look up
the ID of the engine holding the lock without attempting to acquire it.
Change-Id: I1d026f8c67dddcf840ccbc2f3f1537693dc266fb
Closes-Bug: #1624538
(cherry picked from commit e2ba3390cd)
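The contrast can be sketched with a toy lock table (illustrative names only, not Heat's stack-lock implementation):

```python
# Sketch: acquiring the lock just to learn who holds it leaks the lock
# whenever the acquire unexpectedly succeeds; a read-only lookup of the
# holder has no side effects. Toy in-memory lock table.

locks = {}   # stack_id -> engine_id currently holding the lock

def try_acquire(stack_id, engine_id):
    # Returns the current holder, or None if we (accidentally) took it.
    holder = locks.get(stack_id)
    if holder is None:
        locks[stack_id] = engine_id   # oops: now *we* hold the lock
        return None
    return holder

def get_engine_id(stack_id):
    # Read-only lookup: never mutates the lock table.
    return locks.get(stack_id)

# Race: the running engine released the lock just before we looked.
print(try_acquire('s1', 'me'))  # None - and the lock is now stuck on 'me'
locks.clear()
print(get_engine_id('s1'))      # None - and nothing was acquired
```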
This patch allows in-place updates of the allowed_address_pairs
properties. The scenario mentioned in the bug now works correctly.
Also add a couple of fixes to the related test:
- Add an explicit translation of name to a string; otherwise it
  returns objects, which raise an error when resolving the Property
  name, which should be a string.
- Add a check that updating any of the mentioned properties does not
  cause a replacement.
Change-Id: I913fd36012179f2fdd602f2cca06a89e3fa896f3
Closes-Bug: #1623821
(cherry picked from commit 353e7319db)
It's expected that during a convergence traversal, we may encounter a
resource that is still locked by a previous traversal. Don't log an
ERROR-level message about what is a normal condition. Instead, log at
INFO level describing what is happening, with more details at DEBUG
level.
Change-Id: I645c2a173b828d4a983ba874037d059ee645955f
Related-Bug: #1607814
(cherry picked from commit 7f5bd76f7a)