Commit Graph

27 Commits

Author SHA1 Message Date
marios 521a897322 Cleanup no longer used upgrade files
Removes some scripts and templates that were used by the upgrades
workflow in previous versions and are no longer needed.

Change-Id: I7831d20eae6ab9668a919b451301fe669e2b1346
2017-03-15 10:28:17 +02:00
Jenkins 46aeff7515 Merge "Fix a typo" 2017-03-06 21:11:53 +00:00
Luca Lorenzetto 864cb733a3 Adding definition of backup_flags
During the upgrade from M to N I encountered an error in the step
that upgrades the mysql version: the variable backup_flags is
undefined at that point.
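
A minimal sketch of the kind of definition this implies; the actual flag
values and the dump target used by the upgrade script are assumptions here:

    # sketch only: define the backup options before the mysql upgrade step uses them
    # (flag values and the output file below are illustrative assumptions)
    backup_flags="--single-transaction --all-databases"
    mysqldump $backup_flags > /root/openstack-database-backup.sql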

Change-Id: Ic6681c40934b27a03d00a75007d7f12d6d540de3
Closes-Bug: #1667731
2017-02-24 17:10:30 +01:00
marios afcb6e01f3 Make the openvswitch 2.4->2.5 upgrade more robust
In I9b1f0eaa0d36a28e20b507bec6a4e9b3af1781ae and
I11fcf688982ceda5eef7afc8904afae44300c2d9 we added a manual step
for upgrading openvswitch in order to specify the --nopostun
as discussed in the bug below.

This change adds a minor update to make this workaround more
robust. It removes any rpms that may be left over from an earlier
run, and also checks that the rpms being installed are newer than
the version we are currently running.
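
A rough sketch of that behaviour; the staging directory, package glob and
version comparison below are assumptions, not the exact code from this change:

    # sketch: clean up leftovers from a previous attempt, then only apply the
    # --nopostun workaround if the downloaded rpm is actually newer
    rm -rf /root/OVS_UPGRADE
    mkdir -p /root/OVS_UPGRADE && cd /root/OVS_UPGRADE
    yumdownloader --resolve openvswitch
    installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}' openvswitch)
    candidate_rpm=$(ls openvswitch-2.5*.rpm 2>/dev/null | head -1)
    candidate=$(rpm -qp --qf '%{VERSION}-%{RELEASE}' "$candidate_rpm")
    if [ "$(printf '%s\n%s\n' "$installed" "$candidate" | sort -V | tail -1)" != "$installed" ]; then
        rpm -U --replacepkgs --nopostun ./*.rpm
    fi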

This also refactors the code into a common definition in
pacemaker_common_functions.sh, which is included even for the
heredocs that generate upgrade scripts during init. Thanks to
Sofer Athlan-Guyot and Jirka Stransky for help with that.

Change-Id: Idc863de7b5a8c116c990ee8c1472cfe377836d37
Related-Bug: 1635205
2016-12-14 19:15:11 +02:00
gengchc2 1066241957 Fix a typo
TrivialFix

Change-Id: Ibc072af7bbcb39c4469d4e4a6b0ed202c98221c2
2016-12-07 13:59:23 +00:00
Michele Baldessari dde12b075f Fix race during major-upgrade-pacemaker step
Currently when we call the major-upgrade step we do the following:
"""
...
if [[ -n $(is_bootstrap_node) ]]; then
    check_clean_cluster
fi
...
if [[ -n $(is_bootstrap_node) ]]; then
    migrate_full_to_ng_ha
fi
...
for service in $(services_to_migrate); do
    manage_systemd_service stop "${service%%-clone}"
    ...
done
"""

The problem with the above code is that it is open to the following race
condition:
1. Code gets run first on a non-bootstrap controller node so we start
stopping a bunch of services
2. Pacemaker will notice that the services are down and will mark
them as stopped
3. Code gets run on the bootstrap node (controller-0) and the
check_clean_cluster function will fail and exit
4. Eventually the script on the non-bootstrap controller node will also
time out and exit because the cluster never shut down (it never actually
started the shutdown because we failed at step 3)

Let's make sure we first call only the HA NG migration as a separate
heat step, and only afterwards start shutting down the systemd
services on all nodes.

We also need to move the STONITH_STATE variable into a file because it
is being used across two different scripts (1 and 2) and we need to
store that state.
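
A minimal sketch of persisting that state across the two scripts; the file
path and the pcs property query used here are assumptions:

    # script 1 (sketch): remember whether stonith was enabled, then disable it
    STONITH_STATE=$(pcs property show stonith-enabled | grep stonith-enabled | awk '{ print $2 }')
    echo "$STONITH_STATE" > /var/tmp/stonith-state   # hypothetical location
    pcs property set stonith-enabled=false

    # script 2 (sketch): read the stored value back and restore it
    STONITH_STATE=$(cat /var/tmp/stonith-state)
    pcs property set stonith-enabled="$STONITH_STATE"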

Co-Authored-By: Athlan-Guyot Sofer <sathlang@redhat.com>

Closes-Bug: #1640407
Change-Id: Ifb9b9e633fcc77604cca2590071656f4b2275c60
2016-11-09 14:51:51 +01:00
Pradeep Kilambi a8e119094f Rework gnocchi-upgrade to run in a separate upgrade step
When gnocchi is configured with swift it requires keystone
to be available for authentication in order to migrate to v3. At
this step keystone is not available yet, so the gnocchi upgrade
fails with an auth error. Instead, start apache first in step 3
and then run gnocchi-upgrade in a separate step, letting the
upgrade happen there.
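
A hedged sketch of that ordering; the use of check_resource_systemd here and
the account used to run the schema upgrade are assumptions:

    # step 3 (sketch): bring httpd, and therefore keystone, up first
    systemctl start httpd
    check_resource_systemd httpd started 600

    # later, separate step (sketch): run the gnocchi schema upgrade once auth works
    su - gnocchi -s /bin/sh -c "gnocchi-upgrade"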

Closes-Bug: #1634897

Change-Id: I22d02528420e4456f84b80905a7b3a80653fa7b0
2016-11-01 08:33:23 -04:00
marios 2e6cc07c1a Adds Environment File for Removing Sahara during M/N upgrade
The default path, if the operator does nothing, is to keep the
sahara services during mitaka to newton upgrades.

If the operator wishes to remove sahara services then they
need to specify the provided major-upgrade-remove-sahara.yaml
environment file in the stack upgrade commands.
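
For example, the upgrade command would pass the environment file explicitly;
the path placeholder and the set of other -e files are deployment-specific
and only sketched here:

    openstack overcloud deploy --templates \
      -e <the environment files already used by the deployment> \
      -e <path to>/major-upgrade-remove-sahara.yaml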

The existing migration to ha arch already removes the constraints
and pcs resource for sahara api/engine so we just need to stop
it from starting again if we want to remove it.

This adds a KeepSaharaServiceOnUpgrade parameter that controls whether
Sahara is allowed to start up again after the controllers are
upgraded (defaults to true).

Finally it is worth noting that we default the sahara services
to 'on' during converge, in the resource_registry of the
converge environment file; any subsequent stack update where
the deployment contains sahara services will need to
include the -e /environments/services/sahara.yaml environment
file.

Related-Bug: 1630247
Change-Id: I59536cae3260e3df52589289b4f63e9ea0129407
2016-10-05 16:32:31 +03:00
Jenkins f5f41504e5 Merge "Fix typo in fixing gnocchi upgrade." 2016-09-29 23:57:43 +00:00
Sofer Athlan-Guyot 371698a203 Fix typo in fixing gnocchi upgrade.
Change-Id: I44451a280dd928cd694dd6845d5d83040ad1f482
Related-Bug: #1626592
2016-09-29 15:22:16 +02:00
Michele Baldessari ad07a29f94 Fix races in major-upgrade-pacemaker Step2
tripleo-heat-templates/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh
has the following code:
"""
...
check_resource mongod started 600

if [[ -n $(is_bootstrap_node) ]]; then
...
    tstart=$(date +%s)
    while ! clustercheck; do
        sleep 5
        tnow=$(date +%s)
        if (( tnow-tstart > galera_sync_timeout )) ; then
            echo_error "ERROR galera sync timed out"
            exit 1
        fi
    done

    # Run all the db syncs
    cinder-manage db sync
...
fi

start_or_enable_service rabbitmq
check_resource rabbitmq started 600
start_or_enable_service redis
check_resource redis started 600
start_or_enable_service openstack-cinder-volume
check_resource openstack-cinder-volume started 600

systemctl_swift start

for service in $(services_to_migrate); do
    manage_systemd_service start "${service%%-clone}"
    check_resource_systemd "${service%%-clone}" started 600
done
"""

The problem with the above code is that it is open to the following race
condition:
1) Bootstrap node is busy checking the galera status via clustercheck
2) Non-bootstrap node has already reached: start_or_enable_service
   rabbitmq and later lines. These lines will be skipped because
   start_or_enable_service is a noop on non-bootstrap nodes and
   check_resource rabbitmq only checks that pcs status |grep rabbitmq
   returns true.
3) Non-bootstrap node can then reach the manage_systemd_service start
   and it will fail with stuff like:
  "Job for openstack-nova-scheduler.service failed because the control
  process exited with error code. See \"systemctl status
  openstack-nova-scheduler.service\" and \"journalctl -xe\" for
  details.\n" (because the db tables are not migrated yet)

This happens because 3) was started on non-bootstrap nodes before the
db-sync statements completed on the bootstrap node. I did not feel
like changing the semantics of check_resource and removing the noop on
non-bootstrap nodes, as other parts of the tree might rely on this
behaviour.

Depends-On: Ia016264b51f485b97fa150ebd357b109581342ed
Change-Id: I663313e183bb05b35d0c5af016c2d1705c772bd9
Closes-Bug: #1627965
2016-09-29 07:41:28 +02:00
Sofer Athlan-Guyot 89efa79599 Update gnocchi database during M/N upgrade.
We call gnocchi-upgrade to make sure we update all the needed schemas
during the major-upgrade-pacemaker step.

We also make sure that redis is started before we call gnocchi-upgrade,
otherwise the command will be stuck in a loop trying to contact redis.
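
A minimal sketch of that ordering, reusing the helpers that appear in the
other upgrade scripts; the bare gnocchi-upgrade invocation is an assumption:

    # make sure redis is up before touching gnocchi (sketch)
    start_or_enable_service redis
    check_resource redis started 600

    # then update the gnocchi schemas
    gnocchi-upgrade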

Closes-Bug: #1626592
Change-Id: Ia016264b51f485b97fa150ebd357b109581342ed
2016-09-28 22:46:01 +02:00
Michele Baldessari da53e9c00b Fix "Not all flavors have been migrated to the API database"
After a successful upgrade to Newton, I ran the tripleo.sh
--overcloud-pingtest and it failed with the following:

resources.test_flavor: Not all flavors have been migrated to the API database (HTTP 409)

The issue is that some tables have been moved to the
nova_api db and we need to migrate the data as well.

Currently we do:
    nova-manage db sync
    nova-manage api_db sync

We want to add:
    nova-manage db online_data_migrations

After launching this command the overcloud-pingtest works correctly:
tripleo.sh -- Overcloud pingtest SUCCEEDED

Change-Id: Id2d5b28b5d4ade7dff6c5e760be0f509b4fe5096
Closes-Bug: #1628450
2016-09-28 12:20:33 +02:00
Jenkins 9e1d7f0495 Merge "Disable openstack-cinder-volume in step1 and reenable it in step2" 2016-09-27 06:50:12 +00:00
Jenkins 7565e03a82 Merge "A few major-upgrade issues" 2016-09-27 01:11:46 +00:00
Jenkins 9023746e1f Merge "Start mongod before calling ceilometer-dbsync" 2016-09-27 01:11:39 +00:00
Michele Baldessari f9e6a26f32 A few major-upgrade issues
This commit does the following:
1. We now explicitly disable/stop and then remove the resources that are
   moving to systemd (see the sketch after this list). We do this because
   we want to make sure they are all stopped before doing a yum upgrade,
   which otherwise would take ages due to rabbitmq and galera being down.
   It is best if we do this via pcs while we do the HA Full -> HA NG
   migration because it is simpler to make sure all the services are
   stopped at that stage. For extra safety we can still do a check by
   hand. By doing it via pacemaker we have the guarantee that all the
   migrated services are down already when we stop the cluster (which
   happens to be a synchronization point between all controller nodes).
   That way we can be certain that they are all down on all nodes before
   starting the yum upgrade process.

2. We actually need to start the systemd services in
   major_upgrade_controller_pacemaker_2.sh and not stop them.

3. We need to use the proper bash variable name

4. Use is_bootstrap_node everywhere to make the code more consistent
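
A hedged sketch of what point 1 looks like in pcs terms; the use of a
"stopped" check_resource call and the exact ordering are assumptions:

    # sketch: on the bootstrap node, stop and then remove each resource that is
    # moving to systemd control (pcs calls are cluster-wide, so bootstrap only)
    if [[ -n $(is_bootstrap_node) ]]; then
        for resource in $(services_to_migrate); do
            pcs resource disable "$resource"
            check_resource "$resource" stopped 600
            pcs resource delete "$resource"
        done
    fi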

Change-Id: Ic565c781b80357bed9483df45a4a94ec0423487c
Closes-Bug: #1627490
2016-09-25 14:10:31 +02:00
Michele Baldessari b70d6e6f34 Disable openstack-cinder-volume in step1 and reenable it in step2
Currently we do not disable openstack-cinder-volume during our
major-upgrade-pacemaker step. This leads to the following scenario. In
major_upgrade_controller_pacemaker_2.sh we do:

  start_or_enable_service galera
  check_resource galera started 600
  ....
  if [[ -n $(is_bootstrap_node) ]]; then
  ...
      cinder-manage db sync
  ...

What happens here is that, since openstack-cinder-volume was never
disabled, it will already have been started by pacemaker before we call
cinder-manage, and this gives us the following errors during the
start:
06:05:21.861 19482 ERROR cinder.cmd.volume DBError:
                   (pymysql.err.InternalError) (1054, u"Unknown column 'services.cluster_name' in 'field list'")
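
A hedged sketch of the resulting ordering across the two steps; the
surrounding code is elided and the exact placement is an assumption:

    # step 1 (sketch): keep cinder-volume down across the upgrade
    if [[ -n $(is_bootstrap_node) ]]; then
        pcs resource disable openstack-cinder-volume
    fi
    check_resource openstack-cinder-volume stopped 600

    # step 2 (sketch): re-enable it only after cinder-manage db sync has run
    start_or_enable_service openstack-cinder-volume
    check_resource openstack-cinder-volume started 600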

Change-Id: I01b2daf956c30b9a4985ea62cbf4c941ec66dcdf
Closes-Bug: #1627470
2016-09-25 11:52:04 +02:00
Michele Baldessari 9593981149 Start mongod before calling ceilometer-dbsync
Currently, in major_upgrade_controller_pacemaker_2.sh we call
ceilometer-dbsync before mongod is actually started (only galera is
started at this point). This makes the dbsync hang indefinitely
until the heat stack times out.

Now this approach should be okay, but do note that when we start mongod
via systemctl we are not guaranteed that it will be up on all nodes
before we call ceilometer-dbsync. This *should* be okay because
ceilometer-dbsync keeps retrying and eventually one of the nodes will
be available. A completely clean fix here would be to add another
step in heat to guarantee that all mongo servers are up and
running before the dbsync call.
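
A minimal sketch of the ordering this change introduces; the wait loop below
is an extra guard of that kind, not necessarily what the script does:

    # sketch: start mongod (systemd managed) before any ceilometer db sync
    systemctl start mongod
    # optionally wait for the local mongod to answer before proceeding
    timeout 600 sh -c 'until mongo --quiet --eval "db.serverStatus()" >/dev/null 2>&1; do sleep 5; done'
    ceilometer-dbsync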

Change-Id: I10c960b1e0efdeb1e55d77c25aebf1e3e67f17ca
Closes-Bug: #1627453
2016-09-25 10:49:15 +02:00
Michele Baldessari 24a73efdd0 Reinstantiate parts of code that were accidentally removed
With commit fb25385d34
"Rework the pacemaker_common_functions for M..N upgrades" we
accidentally removed some lines that fixed M/N upgrade issues.
Namely:
extraconfig/tasks/major_upgrade_controller_pacemaker_1.sh

  -# https://bugzilla.redhat.com/show_bug.cgi?id=1284047
  -# Change-Id: Ib3f6c12ff5471e1f017f28b16b1e6496a4a4b435
  -crudini --set /etc/ceilometer/ceilometer.conf DEFAULT rpc_backend rabbit
  -# https://bugzilla.redhat.com/show_bug.cgi?id=1284058
  -# Ifd1861e3df46fad0e44ff9b5cbd58711bbc87c97 Swift Ceilometer middleware no longer exists
  -crudini --set /etc/swift/proxy-server.conf pipeline:main pipeline "catch_errors healthcheck cache ratelimit tempurl formpost authtoken keystone staticweb proxy-logging proxy-server"
  -# LP: 1615035, required only for M/N upgrade.
  -crudini --set /etc/nova/nova.conf DEFAULT scheduler_host_manager host_manager

extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh
  nova-manage db sync
- nova-manage api_db sync

This patch simply puts that code back without reverting the
whole commit that broke things, because that commit is still needed.

Closes-Bug: #1627448

Change-Id: I89124ead8928ff33e6b6907a7c2178169e91f4e6
2016-09-25 10:18:57 +02:00
Michele Baldessari 63421ca73d Add a function to upgrade from full HA to NG HA
This is the initial work to have a function that migrates a full HA
architecture as deployed in Mitaka to the HA architecture as deployed in
Newton where only a few resources are managed by pacemaker.

The sequence is the following (a rough sketch follows the list):
1) We remove the desired services from pacemaker's control. The services
   at this point are still running normally via the systemd service as
   invoked by pacemaker
2) We do a "systemctl stop <service>" on all controllers for all the
   services that were removed from pacemaker's control. We do this to make
   sure that during the yum upgrade, the %post sections that call
   "systemctl try-restart" do not take ages, because at this point during
   the upgrade rabbit is down. The only exceptions are "openstack-core"
   and "delay" which are dummy pacemaker resources that do not exist on
   the system
3) We do a "systemctl start <service>" on all nodes for all the services
   mentioned above.
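
A rough sketch of such a migration helper, reusing names that appear
elsewhere in these scripts; the pcs handling details are assumptions:

    # sketch of a full HA -> NG HA migration helper (the real function may differ)
    function migrate_full_to_ng_ha {
        if [[ -n $(is_bootstrap_node) ]]; then
            # 1) take the services out of pacemaker's control (cluster-wide, so bootstrap only)
            for resource in $(services_to_migrate); do
                pcs resource delete "$resource"
            done
        fi
        # 2) stop the underlying systemd units on every controller so the
        #    yum %post "systemctl try-restart" calls return quickly
        for service in $(services_to_migrate); do
            manage_systemd_service stop "${service%%-clone}"
        done
        # 3) (later, as a separate phase) start them again on all nodes
    }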

We should probably merge this patch only when newton has branched as it
is very specific to the M/N upgrade.

Closes-Bug: 1617520
Change-Id: I4c409ce58c1a57b6e0decc3cf168b62698b32e39
2016-09-19 12:48:00 +02:00
marios fb25385d34 Rework the pacemaker_common_functions for M..N upgrades
For N we cannot assume services are managed by pacemaker.
This adds functions to check whether a service is systemd or
pcmk managed and to start/stop it accordingly. For pcmk, for
example, we only stop/disable on the bootstrap node, whereas
systemd services should be stopped/started on all controllers.

There is also an equivalent change to check_resource, which
has been reworked to handle both pcmk and systemd.
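
A hedged sketch of what such a check and dispatcher could look like; the
helper names and the pcs-based detection are illustrative, not necessarily
the ones introduced by this change:

    # sketch: decide how a service is managed and act accordingly
    function is_pacemaker_managed {
        # prints the service name if pacemaker knows about it (illustrative check)
        pcs status --full 2>/dev/null | grep -q "$1" && echo "$1"
    }

    function manage_service {
        local action=$1 service=$2     # action: start|stop
        if [[ -n $(is_pacemaker_managed "$service") ]]; then
            # pcs operations are cluster-wide, so only run them on the bootstrap node
            if [[ -n $(is_bootstrap_node) ]]; then
                case $action in
                    start) pcs resource enable "$service" ;;
                    stop)  pcs resource disable "$service" ;;
                esac
            fi
        else
            # systemd-managed services are started/stopped on every controller
            systemctl "$action" "$service"
        fi
    }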

Implements: blueprint overcloud-upgrades-workflow-mitaka-to-newton
Change-Id: Ic8252736781dc906b3aef8fc756eb8b2f3bb1f02
2016-09-17 04:46:24 +00:00
Sofer Athlan-Guyot cb894b4509 M/N upgrade fail to restart nova-scheduler.
The nova api db needs to be synchronized as well.
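
The resulting sync calls are essentially the pair already quoted elsewhere
in this log:

    nova-manage db sync
    nova-manage api_db sync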

Change-Id: I2628b24ff1153c84cbf388455666ae42570cb10f
Closes-Bug: 1615042
2016-08-24 15:13:11 +02:00
Ian Pilcher 6e65c8fc0a Disable VIPs before stopping cluster during version upgrade
If "pcs cluster stop --all" is executed on a controller that
happens to have a VIP on the internal network, pcs may use the
VIP as the source address for communication with another cluster
node.  When pacemaker is stopped this VIP goes away, and pcs never
receives a response from the other node.  This causes pcs to hang
indefinitely; eventually the upgrade times out and fails.

Disabling the VIPs before stopping the cluster avoids this
situation.
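
A hedged sketch of that ordering; the way the VIP resources are discovered
here is an assumption:

    # sketch: take down every VIP resource first, then stop the cluster
    for vip in $(pcs resource show | grep IPaddr2 | awk '{ print $1 }'); do
        pcs resource disable "$vip"
    done
    pcs cluster stop --all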

Change-Id: I6bc59120211af28456018640033ce3763c373bbb
Closes-Bug: 1577570
2016-05-02 16:26:49 -05:00
Giulio Fidente 2d92911838 Update .sh references from openstack-keystone to openstack-core
The update and upgrade shell scripts were still referencing the
old openstack-keystone service which got removed with
Ie26908ac9bfc0b84b6b65ae3bda711236b03d9d4

Also removes kilo and liberty specific workarounds and config changes.

Change-Id: Icc80904908ee3558930d4639a21812f14b2fd12e
2016-04-11 14:27:55 +02:00
marios 31b884a2a8 Moves the swift start/stop into the common_functions.sh file
Since swift isn't managed by pacemaker we need to manually (systemctl)
stop and start the swift services. This moves the duplicate blocks for
start/stop into a common function (we already include
pacemaker_common_functions.sh here, so we may as well).
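
A rough sketch of such a common function; the exact swift service list is
an assumption:

    # sketch: start/stop the swift services with systemctl on every controller
    function systemctl_swift {
        local action=$1   # start or stop
        for service in openstack-swift-account openstack-swift-container \
                       openstack-swift-object openstack-swift-proxy; do
            systemctl "$action" "$service"
        done
    }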

Change-Id: Ic4f23212594c1bf9edc39143bf60c7f6d648fd1d
2016-03-02 18:31:51 +02:00
Jiri Stransky 0dd10ffe4f Introduce update/upgrade workflow
Change-Id: I7226070aa87416e79f25625647f8e3076c9e2c9a
2016-02-23 16:28:43 +01:00