This fixes the wrong edit made by [1] and ensures the log driver is
configured.
[1] Iffe9c9a1d7ca736f273d2da43928d7da4a99d1d6
Change-Id: I1e679ba4f30cb7f9eb827b91e36cb8feed9afd8f
mysql_bundle.pp has been modified to be able to configure --pids-limit option
using a template. By default the parameter remains 'undef' when not specified.
When "tripleo::profile::pacemaker::database::mysql_bundle::pids_limit:" is used
as an ExtraConfig it will automatically set the value in the Galera cluster
resource.
Closes-Bug: #1982751
Change-Id: Iffe9c9a1d7ca736f273d2da43928d7da4a99d1d6
This commit allows injecting arbitrary arguments into the
wsrep_provider_options string.
Operators should be extremely careful in doing so as there is no
validation or syntax checking whatsoever.
Example:
ExtraConfig:
tripleo::profile::pacemaker::database::mysql_bundle::provider_options: 'evs.suspect_timeout=PT30S'
results in:
wsrep_provider_options = evs.suspect_timeout=PT30S;gcache.recover=no;gmcast.listen_addr=tcp://172.17.0.151:4567;socket.ssl_key=/etc/pki/tls/private/mysql.key;socket.ssl_cert=/etc/pki/tls/certs/mysql.crt;socket.ssl_cipher=AES128-SHA256;socket.ssl_ca=/etc/ipa/ca.crt;
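As an illustration only (the real logic lives in the Puppet manifest and
its template), the behaviour shown above can be sketched in Python. The
function name and the shortened list of generated options are
assumptions made for this sketch; the point is that the operator string
is prepended verbatim, with no validation.

```python
def build_wsrep_provider_options(generated_options, extra_options=None):
    """Prepend operator-supplied options to the generated ones.

    Hypothetical helper: nothing is validated, mirroring the warning above.
    """
    parts = []
    if extra_options:
        # Drop stray leading/trailing ';' so we do not emit empty fields.
        parts.append(extra_options.strip(";"))
    parts.extend(generated_options)
    return ";".join(parts) + ";"


# Shortened, illustrative subset of the generated options.
generated = [
    "gcache.recover=no",
    "gmcast.listen_addr=tcp://172.17.0.151:4567",
]
result = build_wsrep_provider_options(generated, "evs.suspect_timeout=PT30S")
print("wsrep_provider_options = " + result)
```

A typo in the injected string ends up in the final value unchanged,
which is why operators need to double-check what they pass.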
Change-Id: Ie4711ace66846b10252bccdddae84e045af3f604
The hiera function is deprecated and does not work with the latest
hieradata version 5. It should be replaced by the new lookup
function[1].
[1] https://puppet.com/docs/puppet/7/hiera_automatic.html
With the lookup function, we can define the value type and merge
behavior, but these are kept at their defaults for now to limit the
scope of this change to a simple replacement. Adding a value type might
be useful to make sure the value has the expected type (especially when
a boolean value is expected), but we will revisit that later.
example:
lookup(<NAME>, [<VALUE TYPE>], [<MERGE BEHAVIOR>], [<DEFAULT VALUE>])
This covers the remaining manifests that set up pacemaker resources.
Change-Id: I749b979a7333f68a646f36afa912603b1af0a943
On startup, mariadb 10.5 now reloads previous writesets cached on
disk to serve IST to other joiner nodes instead of SST.
This cache on disk is unreliable, so make this feature configurable
and disable it by default (as it was in mariadb 10.3).
Change-Id: I176ac88f9d91080926556690319941bd18bd34f6
Closes-Bug: #1984264
In addition to rsync and rsync tunnelled over socat,
expose the ability to use mariabackup in galera config.
Related-Bug: #1973872
Change-Id: I4d5983ada52a58cdd853bd7b701bb11eeb88be8d
... because Docker support has been removed from tht and these are no
longer used.
Depends-on: https://review.opendev.org/843755
Change-Id: I5719d06464ba2c1d37898b44f70ac5521ceaaf7e
We started seeing some lint failures which were not caught properly
before. This change fixes all these failures to unblock the lint job.
Change-Id: I8efbf29e0d153d48f114d8799ffb67e3c7a8185f
Depending on the host history, some directory content may not have the
correct SELinux type. This has been seen with the OVN service, during a
Queens -> Train FFU:
while the /var/lib/openvswitch/ovn directory had the correct
container_file_t type, some files in this location were typed with
openvswitch_var_lib_t, leading to errors during the deploy part of the
upgrade (after the OS upgrade, when the deploy is running on the cleaned
host).
The specific issue depends on the actual files with the wrong label, but
usually it involves a container crash/error, leading to a deploy error,
and a manual intervention in order to correct the SELinux type in the
location.
This situation may happen when first deployed on Queens, since it was
using Docker. For the record, back then the Docker daemon was configured
to disable SELinux support, so it didn't really care about labels; but
the situation is different with Podman, and we now have full SELinux
support at all levels of the OS, leading to the issue.
For the record, tripleo-heat-templates as well as tripleo-ansible
set "setype: container_file_t" on the directories, but we don't
use "recurse: true" in order to avoid performance issues: some
locations might be huge, and it would take too much time to relabel
everything via ansible.
This patch aims to converge all the mounts to the same options, and
ensure no SELinux denial can prevent the actual container startup and
function.
Change-Id: Ic3e427156fc82c524c763d1896937fcc3c49fabb
Closes-Bug: #1943459
Appropriate gcache.size values allow a cluster node to perform IST
instead of SST if the writes that occurred during its downtime do not
exceed the size of the cache itself.
This is especially beneficial during maintenance windows.
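The rule of thumb above can be written down as a deliberately
simplified Python sketch. It only compares byte counts; Galera's real
decision is based on whether the needed writesets are still present in
the donor's gcache, so treat the function and its names as illustrative
assumptions.

```python
def expected_state_transfer(missed_writes_bytes, gcache_size_bytes):
    """Return 'IST' when the missed writes still fit in the cache, else 'SST'.

    Simplified model: a real donor checks the writeset range it holds.
    """
    if missed_writes_bytes <= gcache_size_bytes:
        return "IST"
    return "SST"


MIB = 1024 * 1024
# A node that missed 200 MiB of writes while it was down for maintenance.
small_cache = expected_state_transfer(200 * MIB, 128 * MIB)
large_cache = expected_state_transfer(200 * MIB, 1024 * MIB)
print(small_cache, large_cache)
```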
Change-Id: I483484c58ab703f3a4dcede636d733e23b051f63
This change introduces several timeout parameters so that users can
tune the operation timeouts of the mysql resource in pacemaker.
Change-Id: Ib1e0e687e2f8dd5dba53a37f6d4e5f149b933fd4
This change fixes the lint errors detected since we removed pins of
lint packages.
Note that this change also replaces the absolute name used to call
the tripleo::stunnel::service_proxy resource type, which is not yet
detected by the latest lint rules.
Closes-Bug: #1928079
Change-Id: I12ba801db92cb3df1d05f14f4c150ac765f0b874
The default of the MySQL / MariaDB server option log_warnings was
changed from 1 to 2 between MariaDB 10.1 and 10.2 [1]. The
result of this change is that any database connection which is
not gracefully closed results in a log message
"Got an error reading communication packets" in the MySQL server
log, which is misleading as it does not usually refer to any
actionable issue; real connectivity issues are always seen in
application logs and most of these messages in the server
logs are likely to be false positives due to the behavior of HAProxy.
While applications can reduce the occurrence of this error by
ensuring that database connections are gracefully closed, this
is already the behavior of oslo.db and SQLAlchemy, which maintain
a connection pool that closes out stale connections explicitly
when requests are made.
The majority of these warnings are likely the result of normal HAProxy
operation, where the settings "timeout client" and "timeout server"
are set to 90 minutes, such that any connection older than this
time will be non-gracefully closed by the proxy, generating
the warning. An idle application server process will not have attended
to connections that are older than the timeout period,
leading to these connections being left for HAProxy to handle;
HAProxy's timeout behavior leading to this message in the logs has been
confirmed in local experimentation.
The application server itself is never exposed to this, as upon
the start of work it will always recycle any connection that is older
than its own timeout, which defaults to 60 minutes for applications
using oslo.config + oslo.db. Without HAProxy having the capability
to close out these connections using MySQL's protocol, the messages
are unavoidable.
The message will also occur anytime an OpenStack process is stopped
or killed, for all connections that are pooled in that process.
The correct way to diagnose if an application is having connectivity
issues is to look in the application server log itself for error
messages and stack traces that have much more detail as to the context
that produced a particular error message. This warning is also
known to occur when an application server is not able to respond
to packets quickly enough as has been observed with services
such as Cinder where eventlet monkeypatching causes the PyMySQL
client to be blocked; however when this occurs, there is an
informative stack trace and error message in the application logs
that shows what's going on.
As this particular warning message is not useful, in that most
occurrences will refer to normal behavior as designed, the
log level should be forced to "1" to prevent these messages,
as they are causing confusion in downstream environments.
[1] https://mariadb.com/kb/en/upgrading-from-mariadb-101-to-mariadb-102/#incompatible-changes-between-101-and-102
Change-Id: I0efb4f77aaceda635c8983d6b7a240171a7accdc
When deploying a 2-node HA overcloud, the galera resource
agent can be configured to enable a "2-node mode" heuristic,
that allows it to restart a galera node in the event of a
network split.
Make this resource agent's option available in puppet via
the new parameter "two_node_mode".
Closes-Bug: #1903051
Change-Id: I543ee77ec38b6429989435122ae0c257d279e507
During scale up, two galera resources are being updated in the
pacemaker cluster. Force a specific ordering in puppet to make
sure the galera resource agent always picks up the up-to-date
config when it starts new replicas.
Closes-Bug: #1892530
Change-Id: Id40ac8c10fd0348ce4fd99ce319dab933312acfa
Allow override of galera promote timeout
This commit removes the hard-coded value of the pacemaker promote
timeout (currently 300s), and allows operators to override it via:
tripleo::profile::pacemaker::database::mysql::promote_timeout
tripleo::profile::pacemaker::database::mysql_bundle::promote_timeout
Closes-Bug: #1883896
Change-Id: I96f5d349b94f05f4f66db6b85ba481deba0015d9
Function mysql_password is deprecated and has been removed
in recent puppetlabs-mysql [1]. It has been replaced with
the equivalent, namespaced function mysql::password. Use it
instead.
[1] 5a70627674
Change-Id: I405a986f78f865d89b54dffea17e84d75c068ed7
Closes-Bug: #1878153
While moving to running pcs commands on the host and off short-lived
containers, we are confronted with the issue that pcs usually checks
for the resource agent's existence on the host before creating the
resource. Since we'd rather avoid installing the needed resource agents
on the host (as they are inside a container), we allow a new 'force_ocf'
parameter to be passed in those situations where we might need it.
Depends-On: I20eb78a061a334b20f6b2274591c5d313a0af532
Related-Bug: #1863442
Change-Id: If9048196b5c03e3cfaba72f043b7f7275568bdc4
When podman dropped the journald log-driver we rushed to move to the supported
k8s-file driver. This had the side effect of us losing the stdout logs of the
HA containers.
In fact previously we were easily able to troubleshoot haproxy startup failures
just by looking in the journal. These days instead if haproxy fails to start we
have no traces whatsoever in the logs, because when a container fails it gets
stopped by pacemaker (and consequently removed) and no logs on the system are
available any longer.
Tested as follows:
1) Redeploy a previously deployed overcloud that did not have the patch
and observe that we now log the startup of HA bundles in /var/log/containers/stdouts/*bundle.log
[root@controller-0 stdouts]# ls -l *bundle.log |grep -v -e init -e restart
-rw-------. 1 root root 16032 Apr 14 14:13 openstack-cinder-volume.log
-rw-------. 1 root root 19515 Apr 14 14:00 haproxy-bundle.log
-rw-------. 1 root root 10509 Apr 14 14:03 ovn-dbs-bundle.log
-rw-------. 1 root root 6451 Apr 14 14:00 redis-bundle.log
2) Deploy a composable HA overcloud from scratch with the patch above
and observe that we obtain the stdout on disk.
Note that most HA containers log to their usual on-host files just
fine; we are mainly missing haproxy logs and/or the kolla startup output
of the HA containers.
Closes-Bug: #1872734
Change-Id: I4270b398366e90206adffe32f812632b50df615b
Downcase in puppet 6.14 throws an error if the input to it is Undef. We
can avoid this by checking for a value before trying to downcase.
See context https://review.rdoproject.org/r/#/c/26297/
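The guard described above, shown as a Python analogue since the Puppet
code is not quoted here. The function name is hypothetical and None
stands in for Puppet's Undef.

```python
def safe_downcase(value):
    """Lower-case a string, passing None (Puppet's Undef) through untouched."""
    if value is None:
        # Nothing to downcase: skip the call instead of raising an error.
        return None
    return value.lower()


print(safe_downcase("Controller-0.LocalDomain"))
print(safe_downcase(None))
```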
Change-Id: Ib2e97060523a4198a14949a15c9171b56928699c
This commit removes the hard coded value of open-file-limit (16384)
and allows operators to override it via:
tripleo::profile::pacemaker::database::mysql::open_files_limit
tripleo::profile::pacemaker::database::mysql_bundle::open_files_limit
Change-Id: I4927eb65a2dc1b5a86fc2141c7b2713f36ad49de
Resolves: rhbz#1812969
During a major upgrade, upgrade tasks can rebuild a new pacemaker
cluster by adding nodes one at a time. This is implemented by
using two special hiera variables mysql_node_names_override and
mysql_short_node_names_override.
Make sure the mysql_bundle puppet module uses both variables
when such cluster rebuild is in progress.
Change-Id: I6a06269f55a38071c34d2a95109d213fe7e2452c
Closes-Bug: #1859961
Co-Authored-By: Jose Luis Franco Arza <jfrancoa@redhat.com>
Allow all bundles --user option to be overridden as some of them might
prefer switching to a non-root user when possible.
The ovn-dbs bundle is a bit special because it never specified any user.
Hence we default that user to undef and do not set anything.
Tested as follows:
1. deployed an overcloud
2. patched it with this change
3. redeployed and then observed that no HA container has restarted at all
4. verified cinder-volume runs with root by default:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 4204 716 ? Ss 09:01 0:00 dumb-init --single-child -- /bin/bash /usr/local/bin/kolla_start
root 7 0.7 0.7 912976 145760 ? S 09:01 1:04 /usr/bin/python3 /usr/bin/cinder-volume --config-file /usr/share/cinder/cinder-dist.conf --config-file /etc/cinder/cinder.conf
root 71 0.1 0.6 925800 124640 ? S 09:01 0:14 /usr/bin/python3 /usr/bin/cinder-volume --config-file /usr/share/cinder/cinder-dist.conf --config-file /etc/cinder/cinder.conf
5. added 'tripleo::profile::pacemaker::cinder::volume_bundle::bundle_user: cinder' to
the templates and redeployed
6. Observed that cinder-volume got restarted and now runs with cinder
user:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
cinder 1 0.0 0.0 4204 804 ? Ss 12:23 0:00 dumb-init --single-child -- /bin/bash /usr/local/bin/kolla_start
cinder 7 2.1 0.7 912976 145432 ? S 12:23 0:04 /usr/bin/python3 /usr/bin/cinder-volume --config-file /usr/share/cinder/cinder-dist.conf --config-file /etc/cinder/cinder.conf
cinder 64 0.3 0.5 919908 118452 ? S 12:23 0:00 /usr/bin/python3 /usr/bin/cinder-volume --config-file /usr/share/cinder/cinder-dist.conf --config-file /etc/cinder/cinder.conf
Change-Id: I985d0d192ef3accf7fdd31503348de80713fded4
Currently in puppet-tripleo for the HA container we hardcode the following:
options => "--user=root --log-driver=journald -e KOLLA_CONFIG_STRATEGY=COPY_ALWAYS${tls_priorities_real}",
Since at least podman had some changes in terms of supported driver
backends (and bugs) it's best if we make this configurable. While we're
at it we should also switch to k8s-file as a driver when podman is being
used which is what all other containers are using. When docker is the
default container_cli we will stick to journald as usual.
Tested this on a Train environment and successfully verified that
we still see the correct logs in /var/log/containers/.../...
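The default selection described above amounts to the following, sketched
in Python rather than Puppet. The function and argument names are
assumptions for illustration; an explicit override always wins, which is
the "configurable" part of this change.

```python
def pick_log_driver(container_cli, log_driver=None):
    """Return the container log driver to use for an HA bundle."""
    if log_driver is not None:
        # Operator override takes precedence over any default.
        return log_driver
    # Defaults: k8s-file with podman, journald with docker as before.
    return "k8s-file" if container_cli == "podman" else "journald"


print(pick_log_driver("podman"))
print(pick_log_driver("docker"))
```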
Change-Id: I5b1483826f816d11a064a937d59f9a8f468315a5
Closes-Bug: #1853517
Currently when adding some tuning options via hiera, galera won't start because
overriding even a single mysql option will reset the whole key in the hash. So
for example, when adding:
tripleo::profile::base::database::mysql::mysql_server_options:
mysqld:
# MySQL InnoDB equally divided in 1GB instances
innodb_buffer_pool_instances: 2
# Query network write timeout raised to 120 seconds
net_write_timeout: 120
# Query network read timeout raised to 120 seconds
net_read_timeout: 120
# MySQL connection timeout set to 8 hours
connect_timeout: 28800
Things will break because all the wsrep options that are normally set
will be overridden and galera will refuse to start.
Tested by passing the above hiera keys and observing the deploy complete
successfully and the settings correctly applied to galera/mysql on the overcloud.
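The difference between replacing the whole 'mysqld' key and merging into
it can be illustrated with a small Python sketch. The default option
values below are made up for the example, and the helper names are
assumptions; the actual fix is implemented in Puppet.

```python
import copy


def shallow_override(defaults, overrides):
    """Old behaviour: a top-level key in overrides replaces the whole hash."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged


def deep_merge(defaults, overrides):
    """Fixed behaviour: nested hashes are merged key by key."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Illustrative defaults only; the real manifest sets many more options.
defaults = {"mysqld": {"wsrep_on": "ON", "wsrep_cluster_name": "galera_cluster"}}
overrides = {"mysqld": {"net_write_timeout": 120, "connect_timeout": 28800}}

broken = shallow_override(defaults, overrides)
fixed = deep_merge(defaults, overrides)
print(sorted(broken["mysqld"]))  # wsrep_* keys are gone
print(sorted(fixed["mysqld"]))   # wsrep_* keys survive next to the tuning
```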
Change-Id: I30f03bc8eb81db0243c137d4af08924adeebc951
Closes-Bug: #1848060
We add initial support for being able to specify tls priorities in
pacemaker. For bundles this will happen via an env variable because
pacemaker_remote is started normally as a process and there is no
sourcing of /etc/sysconfig/pacemaker.
Tested on both Queens and Stein, via a deploy and a redeploy against an
existing cloud. Observed that:
A) We got PCMK_tls_priorities inside /etc/sysconfig/pacemaker with the
value that was passed in THT
B) Containers had the following env variable set:
"PCMK_tls_priorities=normal",
The '-e' addition is a noop in case the PCMK_tls_priorities is unset
so that we do not change the signature of the resources and hence do
not needlessly restart the HA resource.
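A minimal Python sketch of the no-op behaviour, modelled on the bundle
options string quoted elsewhere in this log. The function name and
defaults are assumptions; the point is that an unset value contributes
an empty suffix, so the resulting string, and hence the resource
signature, is unchanged.

```python
def bundle_options(tls_priorities=None, user="root", log_driver="journald"):
    """Build the container options string for an HA bundle."""
    # Only add the env variable when a value was actually provided.
    tls_suffix = ""
    if tls_priorities:
        tls_suffix = f" -e PCMK_tls_priorities={tls_priorities}"
    return (
        f"--user={user} --log-driver={log_driver} "
        f"-e KOLLA_CONFIG_STRATEGY=COPY_ALWAYS{tls_suffix}"
    )


unset = bundle_options()
with_tls = bundle_options("normal")
print(unset)
print(with_tls)
```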
Depends-On: I1971810f6a90f244ed5ced972a5fe7fde29dde86
Change-Id: I703b5a429f48063474aace85bc45d948f5c91435
For the upgrade we have to re-provision the controller cluster, one
node at a time.
Using an extra override variable set in hiera, we are able to specify
to pacemaker which nodes should be added to the cluster.
Change-Id: I2f6ef4679265718fbbe8726ee6c81832bc468f3e
Implements: blueprint upgrades-with-os
- move nova dbsync from nova-api to nova-conductor
- nova db is more tightly coupled to conductor/computes
- we don't have a nova-api service on a CellController
- super-conductor on Controller will sync cell0 db
- when additional cell
- duplicate service node name hiera for transport_urls on cell stack
- nova -> oslo_messaging_rpc_cell_node_names
- neutron agent -> oslo_messaging_rpc_node_names
- rabbit -> rabbit nodes are cell controllers
bp tripleo-multicell-basic
Co-Authored-By: Martin Schuppert <mschuppert@redhat.com>
Change-Id: I79c1080605611c5c7748a28d2afcc9c7275a2e5d
The retry is needed in a composable HA environment because two nodes
might be modifying the CIB at the same time, so we need to retry more
than once to get the freshest CIB, modify it and push it back. Currently
all HA resources have it but we did not add it in the bundles. While it
is a rare race, we should still plug it.
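The fetch/modify/push retry described above follows the usual
optimistic-concurrency pattern, sketched here in Python. CibConflict,
update_cib_with_retry and the fake push function are all hypothetical
names for illustration; they are not part of pcs or puppet-pacemaker.

```python
class CibConflict(Exception):
    """Raised when the CIB changed between our fetch and our push."""


def update_cib_with_retry(fetch_modify_push, tries=3):
    """Run one fetch/modify/push cycle, retrying when another node won."""
    for attempt in range(1, tries + 1):
        try:
            return fetch_modify_push()
        except CibConflict:
            if attempt == tries:
                raise  # Out of attempts: surface the conflict.


# Fake operation that loses the race twice before succeeding.
calls = {"count": 0}


def flaky_push():
    calls["count"] += 1
    if calls["count"] < 3:
        raise CibConflict("CIB was updated by another node")
    return "pushed"


result = update_cib_with_retry(flaky_push, tries=5)
print(result, "after", calls["count"], "attempts")
```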
Change-Id: Ib9d9c76c83f103e329a9c575ae5c110d5ad3c048
Closes-Bug: #1809223
It seems that pre-10.3 mariadb did not accept square brackets when
setting the wsrep gcomm address, whereas 10.3 requires them.
Also 10.3 seems to have issues binding to a specified ipv6 address.
For the time being, until we investigate in more detail, let's just
not bind an ip address for the gcomm address when using ipv6.
This will unblock all promotions while we get to the bottom of the
issue.
Change-Id: I0b49019065c71edc4497c777f33af7926e8b1238
Closes-Bug: #1808536
Introduce a generic mechanism to override mysql options for the HA
bundle. We reuse the profile::base hiera key so that non-HA and HA
hiera settings stay unified.
Change-Id: I6dc048882e7e9be44710829e98c90d2e9663b372
Co-Authored-By: <dciabrin@redhat.com>
Since the mysql service has been containerized, we lost the ability
to update the root password during a stack update.
When the mysql root password in hiera differs from the one currently
set in the mysql DB, connect to the DB with the password from .my.cnf
and update the credentials of the root user before the puppet mysql
module tries to access the database. Also update other root DB users.
Change-Id: I8fe9a640ba36288a1f9cb18563b363159d4731c0
Depends-On: I5bdbc89897a6dcd5bd57f2132e2acf99702b28ea
Closes-Bug: #1792416
We added a container backend in puppet-pacemaker via
Ia4a7b58d14d80e85d51e98acec1aad2ba90b69de. Let's now
let tripleo override it when needed.
Tested this via some hiera keys overrides and it works correctly.
Change-Id: I610923327462b901840131316a4984c8fe98faaa
Since the introduction of I62870c055097569ceab2ff67cf0fe63122277c5b
"Introduce restart_bundle containers to detect config changes and
restart pacemaker resources" we actually use paunch to detect any
config changes (by verifying an md5 hash over the generated config
files of the service).
With this new way of detecting changes there is no need to use the
old 'tripleo::pacemaker::resource_restart_flag' method to restart
pcmk services.
Let's just remove this unused code.
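For readers unfamiliar with the hash-based approach, here is a rough
Python sketch of detecting a config change by hashing the generated
files. It is not paunch's actual implementation; the function names, the
directory layout and the sample config content are assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def config_hash(config_dir):
    """Return an md5 digest over every file under config_dir."""
    digest = hashlib.md5()
    for path in sorted(Path(config_dir).rglob("*")):
        if path.is_file():
            # Hash the relative name too, so renames are detected.
            digest.update(str(path.relative_to(config_dir)).encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()


def needs_restart(config_dir, previous_hash):
    """True when the generated config differs from what was last applied."""
    return config_hash(config_dir) != previous_hash


with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "galera.cnf"
    cfg.write_text("[mysqld]\nopen_files_limit=16384\n")
    applied = config_hash(tmp)
    unchanged = needs_restart(tmp, applied)
    cfg.write_text("[mysqld]\nopen_files_limit=32768\n")
    changed = needs_restart(tmp, applied)

print(unchanged, changed)
```

With such a check in place, a separate restart flag carries no extra
information, which is why the old mechanism can be dropped.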
Change-Id: Ib12dbe66575e3d54a8ec7d2c72c2b4619bc39b03