Commit Graph

27 Commits

Author SHA1 Message Date
marios 521a897322 Cleanup no longer used upgrade files
Removes some scripts and templates that were used by the upgrades
workflow in previous versions and are no longer needed.

Change-Id: I7831d20eae6ab9668a919b451301fe669e2b1346
2017-03-15 10:28:17 +02:00
Jenkins 46aeff7515 Merge "Fix a typo" 2017-03-06 21:11:53 +00:00
Luca Lorenzetto 864cb733a3 Adding definition of backup_flags
During the upgrade from M to N I encountered an error in the step
that upgrades the mysql version: the variable backup_flags is
undefined at that point.
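
A minimal sketch of the kind of definition this implies; the actual flag
values and the dump target used by the upgrade script are assumptions here:

    # sketch only: define the backup options before the mysql upgrade step uses them
    # (flag values and the output file below are illustrative assumptions)
    backup_flags="--single-transaction --all-databases"
    mysqldump $backup_flags > /root/openstack-database-backup.sql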

Change-Id: Ic6681c40934b27a03d00a75007d7f12d6d540de3
Closes-Bug: #1667731
2017-02-24 17:10:30 +01:00
marios afcb6e01f3 Make the openvswitch 2.4->2.5 upgrade more robust
In I9b1f0eaa0d36a28e20b507bec6a4e9b3af1781ae and
I11fcf688982ceda5eef7afc8904afae44300c2d9 we added a manual step
for upgrading openvswitch in order to specify the --nopostun
as discussed in the bug below.

This change adds a minor update to make this workaround more
robust. It removes any rpms that may be left over from an earlier
run, and also checks that the rpms being installed are newer than
the version we are currently running.
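
A rough sketch of that behaviour; the staging directory, package glob and
version comparison below are assumptions, not the exact code from this change:

    # sketch: clean up leftovers from a previous attempt, then only apply the
    # --nopostun workaround if the downloaded rpm is actually newer
    rm -rf /root/OVS_UPGRADE
    mkdir -p /root/OVS_UPGRADE && cd /root/OVS_UPGRADE
    yumdownloader --resolve openvswitch
    installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}' openvswitch)
    candidate_rpm=$(ls openvswitch-2.5*.rpm 2>/dev/null | head -1)
    candidate=$(rpm -qp --qf '%{VERSION}-%{RELEASE}' "$candidate_rpm")
    if [ "$(printf '%s\n%s\n' "$installed" "$candidate" | sort -V | tail -1)" != "$installed" ]; then
        rpm -U --replacepkgs --nopostun ./*.rpm
    fi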

This also refactors the code into a common definition in
pacemaker_common_functions.sh, which is included even for the
heredocs that generate upgrade scripts during init. Thanks to
Sofer Athlan-Guyot and Jirka Stransky for help with that.

Change-Id: Idc863de7b5a8c116c990ee8c1472cfe377836d37
Related-Bug: 1635205
2016-12-14 19:15:11 +02:00
gengchc2 1066241957 Fix a typo
TrivialFix

Change-Id: Ibc072af7bbcb39c4469d4e4a6b0ed202c98221c2
2016-12-07 13:59:23 +00:00
Michele Baldessari dde12b075f Fix race during major-upgrade-pacemaker step
Currently when we call the major-upgrade step we do the following:
"""
...
if [[ -n $(is_bootstrap_node) ]]; then
    check_clean_cluster
fi
...
if [[ -n $(is_bootstrap_node) ]]; then
    migrate_full_to_ng_ha
fi
...
for service in $(services_to_migrate); do
    manage_systemd_service stop "${service%%-clone}"
    ...
done
"""

The problem with the above code is that it is open to the following race
condition:
1. Code gets run first on a non-bootstrap controller node so we start
stopping a bunch of services
2. Pacemaker will notice that the services are down and will mark
them as stopped
3. Code gets run on the bootstrap node (controller-0) and the
check_clean_cluster function will fail and exit
4. Eventually the script on the non-bootstrap controller node will also
time out and exit because the cluster never shut down (it never actually
started the shutdown because we failed at step 3)

Let's make sure we first call only the HA NG migration as a separate
heat step, and only afterwards start shutting down the systemd
services on all nodes.

We also need to move the STONITH_STATE variable into a file because it
is being used across two different scripts (1 and 2) and we need to
store that state.
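
A minimal sketch of persisting that state across the two scripts; the file
path and the pcs property query used here are assumptions:

    # script 1 (sketch): remember whether stonith was enabled, then disable it
    STONITH_STATE=$(pcs property show stonith-enabled | grep stonith-enabled | awk '{ print $2 }')
    echo "$STONITH_STATE" > /var/tmp/stonith-state   # hypothetical location
    pcs property set stonith-enabled=false

    # script 2 (sketch): read the stored value back and restore it
    STONITH_STATE=$(cat /var/tmp/stonith-state)
    pcs property set stonith-enabled="$STONITH_STATE"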

Co-Authored-By: Athlan-Guyot Sofer <sathlang@redhat.com>

Closes-Bug: #1640407
Change-Id: Ifb9b9e633fcc77604cca2590071656f4b2275c60
2016-11-09 14:51:51 +01:00
Pradeep Kilambi a8e119094f Rework gnocchi-upgrade to run in a separate upgrade step
When gnocchi is configured with swift it requires keystone
to be available for authentication in order to migrate to v3. At
this step keystone is not available yet, so the gnocchi upgrade
fails with an auth error. Instead, start apache first in step 3
and then run gnocchi-upgrade in a separate step, letting the
upgrade happen there.
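
A hedged sketch of that ordering; the use of check_resource_systemd here and
the account used to run the schema upgrade are assumptions:

    # step 3 (sketch): bring httpd, and therefore keystone, up first
    systemctl start httpd
    check_resource_systemd httpd started 600

    # later, separate step (sketch): run the gnocchi schema upgrade once auth works
    su - gnocchi -s /bin/sh -c "gnocchi-upgrade"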

Closes-Bug: #1634897

Change-Id: I22d02528420e4456f84b80905a7b3a80653fa7b0
2016-11-01 08:33:23 -04:00
marios 2e6cc07c1a Adds Environment File for Removing Sahara during M/N upgrade
The default path, if the operator does nothing, is to keep the
sahara services during mitaka to newton upgrades.

If the operator wishes to remove sahara services then they
need to specify the provided major-upgrade-remove-sahara.yaml
environment file in the stack upgrade commands.
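
For example, the upgrade command would pass the environment file explicitly;
the path placeholder and the set of other -e files are deployment-specific
and only sketched here:

    openstack overcloud deploy --templates \
      -e <the environment files already used by the deployment> \
      -e <path to>/major-upgrade-remove-sahara.yaml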

The existing migration to ha arch already removes the constraints
and pcs resource for sahara api/engine so we just need to stop
it from starting again if we want to remove it.

This adds a KeepSaharaServiceOnUpgrade parameter that controls whether
Sahara is allowed to start up again after the controllers are
upgraded (defaults to true).

Finally it is worth noting that we default the sahara services
to 'on' during converge, in the resource_registry of the
converge environment file; any subsequent stack update where
the deployment contains sahara services will need to
include the -e /environments/services/sahara.yaml environment
file.

Related-Bug: 1630247
Change-Id: I59536cae3260e3df52589289b4f63e9ea0129407
2016-10-05 16:32:31 +03:00
Jenkins f5f41504e5 Merge "Fix typo in fixing gnocchi upgrade." 2016-09-29 23:57:43 +00:00
Sofer Athlan-Guyot 371698a203 Fix typo in fixing gnocchi upgrade.
Change-Id: I44451a280dd928cd694dd6845d5d83040ad1f482
Related-Bug: #1626592
2016-09-29 15:22:16 +02:00
Michele Baldessari ad07a29f94 Fix races in major-upgrade-pacemaker Step2
tripleo-heat-templates/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh
has the following code:
"""
...
check_resource mongod started 600

if [[ -n $(is_bootstrap_node) ]]; then
...
    tstart=$(date +%s)
    while ! clustercheck; do
        sleep 5
        tnow=$(date +%s)
        if (( tnow-tstart > galera_sync_timeout )) ; then
            echo_error "ERROR galera sync timed out"
            exit 1
        fi
    done

    # Run all the db syncs
    cinder-manage db sync
...
fi

start_or_enable_service rabbitmq
check_resource rabbitmq started 600
start_or_enable_service redis
check_resource redis started 600
start_or_enable_service openstack-cinder-volume
check_resource openstack-cinder-volume started 600

systemctl_swift start

for service in $(services_to_migrate); do
    manage_systemd_service start "${service%%-clone}"
    check_resource_systemd "${service%%-clone}" started 600
done
"""

The problem with the above code is that it is open to the following race
condition:
1) Bootstrap node is busy checking the galera status via clustercheck
2) Non-bootstrap node has already reached: start_or_enable_service
   rabbitmq and later lines. These lines will be skipped because
   start_or_enable_service is a noop on non-bootstrap nodes and
   check_resource rabbitmq only checks that pcs status |grep rabbitmq
   returns true.
3) Non-bootstrap node can then reach the manage_systemd_service start
   and it will fail with stuff like:
  "Job for openstack-nova-scheduler.service failed because the control
  process exited with error code. See \"systemctl status
  openstack-nova-scheduler.service\" and \"journalctl -xe\" for
  details.\n" (because the db tables are not migrated yet)

This happens because 3) was started on non-bootstrap nodes before the
db-sync statements completed on the bootstrap node. I did not feel
like changing the semantics of check_resource and removing the noop on
non-bootstrap nodes, as other parts of the tree might rely on this
behaviour.

Depends-On: Ia016264b51f485b97fa150ebd357b109581342ed
Change-Id: I663313e183bb05b35d0c5af016c2d1705c772bd9
Closes-Bug: #1627965
2016-09-29 07:41:28 +02:00
Sofer Athlan-Guyot 89efa79599 Update gnocchi database during M/N upgrade.
We call gnocchi-upgrade to make sure we update all the needed schemas
during the major-upgrade-pacemaker step.

We also make sure that redis is started before we call gnocchi-upgrade,
otherwise the command will be stuck in a loop trying to contact redis.
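
A minimal sketch of that ordering, reusing the helpers that appear in the
other upgrade scripts; the bare gnocchi-upgrade invocation is an assumption:

    # make sure redis is up before touching gnocchi (sketch)
    start_or_enable_service redis
    check_resource redis started 600

    # then update the gnocchi schemas
    gnocchi-upgrade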

Closes-Bug: #1626592
Change-Id: Ia016264b51f485b97fa150ebd357b109581342ed
2016-09-28 22:46:01 +02:00
Michele Baldessari da53e9c00b Fix "Not all flavors have been migrated to the API database"
After a successful upgrade to Newton, I ran the tripleo.sh
--overcloud-pingtest and it failed with the following:

resources.test_flavor: Not all flavors have been migrated to the API database (HTTP 409)

The issue is that some tables have been moved to the
nova_api db and we need to migrate the data as well.

Currently we do:
    nova-manage db sync
    nova-manage api_db sync

We want to add:
    nova-manage db online_data_migrations

After launching this command the overcloud-pingtest works correctly:
tripleo.sh -- Overcloud pingtest SUCCEEDED

Change-Id: Id2d5b28b5d4ade7dff6c5e760be0f509b4fe5096
Closes-Bug: #1628450
2016-09-28 12:20:33 +02:00
Jenkins 9e1d7f0495 Merge "Disable openstack-cinder-volume in step1 and reenable it in step2" 2016-09-27 06:50:12 +00:00
Jenkins 7565e03a82 Merge "A few major-upgrade issues" 2016-09-27 01:11:46 +00:00
Jenkins 9023746e1f Merge "Start mongod before calling ceilometer-dbsync" 2016-09-27 01:11:39 +00:00
Michele Baldessari f9e6a26f32 A few major-upgrade issues
This commit does the following:
1. We now explicitly disable/stop and then remove the resources that are
   moving to systemd (see the sketch after this list). We do this because
   we want to make sure they are all stopped before doing a yum upgrade,
   which otherwise would take ages due to rabbitmq and galera being down.
   It is best if we do this via pcs while we do the HA Full -> HA NG
   migration because it is simpler to make sure all the services are
   stopped at that stage. For extra safety we can still do a check by
   hand. By doing it via pacemaker we have the guarantee that all the
   migrated services are down already when we stop the cluster (which
   happens to be a synchronization point between all controller nodes).
   That way we can be certain that they are all down on all nodes before
   starting the yum upgrade process.

2. We actually need to start the systemd services in
   major_upgrade_controller_pacemaker_2.sh and not stop them.

3. We need to use the proper bash variable name

4. Use is_bootstrap_node everywhere to make the code more consistent
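
A hedged sketch of what point 1 looks like in pcs terms; the use of a
"stopped" check_resource call and the exact ordering are assumptions:

    # sketch: on the bootstrap node, stop and then remove each resource that is
    # moving to systemd control (pcs calls are cluster-wide, so bootstrap only)
    if [[ -n $(is_bootstrap_node) ]]; then
        for resource in $(services_to_migrate); do
            pcs resource disable "$resource"
            check_resource "$resource" stopped 600
            pcs resource delete "$resource"
        done
    fi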

Change-Id: Ic565c781b80357bed9483df45a4a94ec0423487c
Closes-Bug: #1627490
2016-09-25 14:10:31 +02:00
Michele Baldessari b70d6e6f34 Disable openstack-cinder-volume in step1 and reenable it in step2
Currently we do not disable openstack-cinder-volume during our
major-upgrade-pacemaker step. This leads to the following scenario. In
major_upgrade_controller_pacemaker_2.sh we do:

  start_or_enable_service galera
  check_resource galera started 600
  ....
  if [[ -n $(is_bootstrap_node) ]]; then
  ...
      cinder-manage db sync
  ...

What happens here is that, since openstack-cinder-volume was never
disabled, it will already have been started by pacemaker before we call
cinder-manage, and this gives us the following errors during the
start:
06:05:21.861 19482 ERROR cinder.cmd.volume DBError:
                   (pymysql.err.InternalError) (1054, u"Unknown column 'services.cluster_name' in 'field list'")
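
A hedged sketch of the resulting ordering across the two steps; the
surrounding code is elided and the exact placement is an assumption:

    # step 1 (sketch): keep cinder-volume down across the upgrade
    if [[ -n $(is_bootstrap_node) ]]; then
        pcs resource disable openstack-cinder-volume
    fi
    check_resource openstack-cinder-volume stopped 600

    # step 2 (sketch): re-enable it only after cinder-manage db sync has run
    start_or_enable_service openstack-cinder-volume
    check_resource openstack-cinder-volume started 600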

Change-Id: I01b2daf956c30b9a4985ea62cbf4c941ec66dcdf
Closes-Bug: #1627470
2016-09-25 11:52:04 +02:00
Michele Baldessari 9593981149 Start mongod before calling ceilometer-dbsync
Currently, in major_upgrade_controller_pacemaker_2.sh we call
ceilometer-dbsync before mongod is actually started (only galera is
started at this point). This makes the dbsync hang indefinitely
until the heat stack times out.

Now this approach should be okay, but do note that when we start mongod
via systemctl we are not guaranteed that it will be up on all nodes
before we call ceilometer-dbsync. This *should* be okay because
ceilometer-dbsync keeps retrying and eventually one of the nodes will
be available. A completely clean fix here would be to add another
step in heat to guarantee that all mongo servers are up and
running before the dbsync call.
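
A minimal sketch of the ordering this change introduces; the wait loop below
is an extra guard of that kind, not necessarily what the script does:

    # sketch: start mongod (systemd managed) before any ceilometer db sync
    systemctl start mongod
    # optionally wait for the local mongod to answer before proceeding
    timeout 600 sh -c 'until mongo --quiet --eval "db.serverStatus()" >/dev/null 2>&1; do sleep 5; done'
    ceilometer-dbsync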

Change-Id: I10c960b1e0efdeb1e55d77c25aebf1e3e67f17ca
Closes-Bug: #1627453
2016-09-25 10:49:15 +02:00
Michele Baldessari 24a73efdd0 Reinstantiate parts of code that were accidentally removed
With commit fb25385d34
"Rework the pacemaker_common_functions for M..N upgrades" we
accidentally removed some lines that fixed M/N upgrade issues.
Namely:
extraconfig/tasks/major_upgrade_controller_pacemaker_1.sh

  -# https://bugzilla.redhat.com/show_bug.cgi?id=1284047
  -# Change-Id: Ib3f6c12ff5471e1f017f28b16b1e6496a4a4b435
  -crudini --set /etc/ceilometer/ceilometer.conf DEFAULT rpc_backend rabbit
  -# https://bugzilla.redhat.com/show_bug.cgi?id=1284058
  -# Ifd1861e3df46fad0e44ff9b5cbd58711bbc87c97 Swift Ceilometer middleware no longer exists
  -crudini --set /etc/swift/proxy-server.conf pipeline:main pipeline "catch_errors healthcheck cache ratelimit tempurl formpost authtoken keystone staticweb proxy-logging proxy-server"
  -# LP: 1615035, required only for M/N upgrade.
  -crudini --set /etc/nova/nova.conf DEFAULT scheduler_host_manager host_manager

extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh
  nova-manage db sync
- nova-manage api_db sync

This patch simply puts that code back without reverting the
whole commit that broke things, because that commit is still needed.

Closes-Bug: #1627448

Change-Id: I89124ead8928ff33e6b6907a7c2178169e91f4e6
2016-09-25 10:18:57 +02:00
Michele Baldessari 63421ca73d Add a function to upgrade from full HA to NG HA
This is the initial work to have a function that migrates a full HA
architecture as deployed in Mitaka to the HA architecture as deployed in
Newton where only a few resources are managed by pacemaker.

The sequence is the following (a rough sketch follows the list):
1) We remove the desired services from pacemaker's control. The services
   at this point are still running normally via the systemd service as
   invoked by pacemaker
2) We do a "systemctl stop <service>" on all controllers for all the
   services that were removed from pacemaker's control. We do this to make
   sure that during the yum upgrade, the %post sections that call
   "systemctl try-restart" do not take ages, because at this point during
   the upgrade rabbit is down. The only exceptions are "openstack-core"
   and "delay" which are dummy pacemaker resources that do not exist on
   the system
3) We do a "systemctl start <service>" on all nodes for all the services
   mentioned above.
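
A rough sketch of such a migration helper, reusing names that appear
elsewhere in these scripts; the pcs handling details are assumptions:

    # sketch of a full HA -> NG HA migration helper (the real function may differ)
    function migrate_full_to_ng_ha {
        if [[ -n $(is_bootstrap_node) ]]; then
            # 1) take the services out of pacemaker's control (cluster-wide, so bootstrap only)
            for resource in $(services_to_migrate); do
                pcs resource delete "$resource"
            done
        fi
        # 2) stop the underlying systemd units on every controller so the
        #    yum %post "systemctl try-restart" calls return quickly
        for service in $(services_to_migrate); do
            manage_systemd_service stop "${service%%-clone}"
        done
        # 3) (later, as a separate phase) start them again on all nodes
    }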

We should probably merge this patch only when newton has branched as it
is very specific to the M/N upgrade.

Closes-Bug: 1617520
Change-Id: I4c409ce58c1a57b6e0decc3cf168b62698b32e39
2016-09-19 12:48:00 +02:00
marios fb25385d34 Rework the pacemaker_common_functions for M..N upgrades
For N we cannot assume services are managed by pacemaker.
This adds functions to check whether a service is systemd or
pcmk managed and to start/stop it accordingly. For pcmk, for
example, we only stop/disable on the bootstrap node, whereas
systemd services should be stopped/started on all controllers.

There is also an equivalent change to check_resource, which
has been reworked to handle both pcmk and systemd.
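
A hedged sketch of what such a check and dispatcher could look like; the
helper names and the pcs-based detection are illustrative, not necessarily
the ones introduced by this change:

    # sketch: decide how a service is managed and act accordingly
    function is_pacemaker_managed {
        # prints the service name if pacemaker knows about it (illustrative check)
        pcs status --full 2>/dev/null | grep -q "$1" && echo "$1"
    }

    function manage_service {
        local action=$1 service=$2     # action: start|stop
        if [[ -n $(is_pacemaker_managed "$service") ]]; then
            # pcs operations are cluster-wide, so only run them on the bootstrap node
            if [[ -n $(is_bootstrap_node) ]]; then
                case $action in
                    start) pcs resource enable "$service" ;;
                    stop)  pcs resource disable "$service" ;;
                esac
            fi
        else
            # systemd-managed services are started/stopped on every controller
            systemctl "$action" "$service"
        fi
    }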

Implements: blueprint overcloud-upgrades-workflow-mitaka-to-newton
Change-Id: Ic8252736781dc906b3aef8fc756eb8b2f3bb1f02
2016-09-17 04:46:24 +00:00
Sofer Athlan-Guyot cb894b4509 M/N upgrade fail to restart nova-scheduler.
The nova api db needs to be synchronized as well.
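
The resulting sync calls are essentially the pair already quoted elsewhere
in this log:

    nova-manage db sync
    nova-manage api_db sync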

Change-Id: I2628b24ff1153c84cbf388455666ae42570cb10f
Closes-Bug: 1615042
2016-08-24 15:13:11 +02:00
Ian Pilcher 6e65c8fc0a Disable VIPs before stopping cluster during version upgrade
If "pcs cluster stop --all" is executed on a controller that
happens to have a VIP on the internal network, pcs may use the
VIP as the source address for communication with another cluster
node.  When pacemaker is stopped this VIP goes away, and pcs never
receives a response from the other node.  This causes pcs to hang
indefinitely; eventually the upgrade times out and fails.

Disabling the VIPs before stopping the cluster avoids this
situation.
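
A hedged sketch of that ordering; the way the VIP resources are discovered
here is an assumption:

    # sketch: take down every VIP resource first, then stop the cluster
    for vip in $(pcs resource show | grep IPaddr2 | awk '{ print $1 }'); do
        pcs resource disable "$vip"
    done
    pcs cluster stop --all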

Change-Id: I6bc59120211af28456018640033ce3763c373bbb
Closes-Bug: 1577570
2016-05-02 16:26:49 -05:00
Giulio Fidente 2d92911838 Update .sh references from openstack-keystone to openstack-core
The update and upgrade shell scripts were still referencing the
old openstack-keystone service which got removed with
Ie26908ac9bfc0b84b6b65ae3bda711236b03d9d4

Also removes kilo and liberty specific workarounds and config changes.

Change-Id: Icc80904908ee3558930d4639a21812f14b2fd12e
2016-04-11 14:27:55 +02:00
marios 31b884a2a8 Moves the swift start/stop into the common_functions.sh file
Since swift isn't managed by pacemaker we need to manually (systemctl)
stop and start the swift services. This moves the duplicate blocks for
start/stop into a common function (we already include
pacemaker_common_functions.sh here, so we may as well).
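
A rough sketch of such a common function; the exact swift service list is
an assumption:

    # sketch: start/stop the swift services with systemctl on every controller
    function systemctl_swift {
        local action=$1   # start or stop
        for service in openstack-swift-account openstack-swift-container \
                       openstack-swift-object openstack-swift-proxy; do
            systemctl "$action" "$service"
        done
    }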

Change-Id: Ic4f23212594c1bf9edc39143bf60c7f6d648fd1d
2016-03-02 18:31:51 +02:00
Jiri Stransky 0dd10ffe4f Introduce update/upgrade workflow
Change-Id: I7226070aa87416e79f25625647f8e3076c9e2c9a
2016-02-23 16:28:43 +01:00