HAProxy exposes a Prometheus metrics endpoint; it just needs to be
enabled. Enable it and remove the configuration for
prometheus-haproxy-exporter. Any remaining prometheus-haproxy-exporter
containers will be removed automatically.
Change-Id: If6e75691d2a996b06a9b95cb0aae772db54389fb
Co-Authored-By: Matt Anson <matta@stackhpc.com>
This allows us to continue execution until a certain proportion of
hosts have failed. This can be useful at scale, where failures are
common and restarting a deployment is time-consuming.
The default max failure percentage is 100, which preserves the
existing behaviour. A global max failure percentage may be set via
kolla_max_fail_percentage, and individual services may define a max
failure percentage via <service>_max_fail_percentage.
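For example, a minimal globals.yml sketch (the percentages and the
nova prefix are illustrative, not defaults):

    kolla_max_fail_percentage: 50
    nova_max_fail_percentage: 20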
Note that all hosts in the inventory must be reachable for fact
gathering, even those not included in a --limit.
Closes-Bug: #1833737
Change-Id: I808474a75c0f0e8b539dc0421374b06cea44be4f
This commit addresses a few shortcomings in the etcd service:
* Adding or removing etcd nodes required manual intervention.
* The etcd service would have brief outages during upgrades or
reconfigures because restarts weren't always serialised.
This makes the etcd service follow a similar pattern to mariadb:
* There is now a distinction between bootstrapping the cluster
and adding / removing another member.
* This more closely follows etcd's upstream bootstrapping
guidelines.
* The etcd role now serialises restarts internally so the
kolla_serial pattern is no longer appropriate (or necessary).
This does not remove the need for manual intervention in all
failure modes: the documentation has been updated to address the
most common issues.
Note that there's repetition in the container specifications: this
is somewhat deliberate. A future cleanup is intended to reduce the
duplication.
Change-Id: I39829ba0c5894f8e549f9b83b416e6db4fafd96f
Ansible 2.14.3 introduced a change that broke the method used for
restarting MariaDB and RabbitMQ serially [1][2]. In
I57425680a4cdbf0daeb9b2cc35920f1b933aa4a8 we limited to 2.14.2 to work
around this. Ansible upstream claim this behaviour was unintentional,
and will not fix it.
This change moves to a different approach where we use separate plays
with a 'serial' keyword to execute the restart.
This change also removes the 2.14.2 maximum version restriction on
ansible-core - any 2.14 release is now supported.
[1] 65366f663d
[2] https://github.com/ansible/ansible/issues/80848
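A minimal sketch of the pattern, assuming one dedicated play per
service; the restart task is illustrative, not the exact
implementation:

    - name: Restart MariaDB serially
      hosts: mariadb
      serial: 1
      tasks:
        - name: Restart the mariadb container (illustrative)
          ansible.builtin.command: docker restart mariadb
          changed_when: true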
Depends-On: https://review.opendev.org/c/openstack/kolla/+/884208
Change-Id: I5a12670d07077d24047aaff57ce8d33ccf7156ff
etcd-compatible tooz drivers do not support multiple endpoints via
backend_url. We can put a loadbalancer in front of etcd and configure
backend_url to use the VIP instead. The issue with hard-coding the
first host is that coordination breaks if that host is taken offline.
In the case of Cinder, we would not be able to perform any
volume-related operations.
Co-Authored-By: Mark Goddard <mark@stackhpc.com>
Change-Id: Ib684501ba03c386dc5ac71e5cbea05c99f191665
ovn-controller should be deployed first, according to the OVN upgrade
guide. Since we get newer OVN/OVS versions from RDO/Ubuntu within a
cycle, let's apply the same ordering to deployment.
Closes-Bug: #1979329
Change-Id: I017aec611a057db1634cfc2634164b21cb210193
This change replaces ElasticSearch with OpenSearch, and Kibana
with OpenSearch Dashboards. It migrates the data from ElasticSearch
to OpenSearch upon upgrade.
TLS support is not included in this patch (it will be a follow-up).
A replacement for ElasticSearch Curator will also be added as a
follow-up.
Depends-On: https://review.opendev.org/c/openstack/kolla/+/830373
Co-authored-by: Doug Szumski <doug@stackhpc.com>
Co-authored-by: Kyle Dean <kyle@stackhpc.com>
Change-Id: Iab10ce7ea5d5f21a40b1f99b28e3290b7e9ce895
Instead of handling everything in one role, let's have small,
fit-for-purpose roles: in reality these are roles for two different
sets of hosts, and performance should be better with this approach.
[1]: https://docs.ovn.org/en/latest/intro/install/ovn-upgrades.html
Change-Id: I8f9dbe9d950323f16375ad5e1dbaedfb1be6585f
Kolla Ansible is switching to OpenSearch and is dropping support for
deploying ElasticSearch. This is because the final OSS release of
ElasticSearch has exceeded its end of life.
Monasca is affected because it uses both Logstash and ElasticSearch.
Whilst it may continue to work with OpenSearch, Logstash remains an
issue.
In the absence of any renewed interest in the project, we remove
support for deploying it. This helps to reduce the complexity
of log processing configuration in Kolla Ansible, freeing up
development time.
Change-Id: I6fc7842bcda18e417a3fd21c11e28979a470f1cf
Use facts to define the group key evaluated in included roles, and
remove `when` statements that never execute, to speed up execution.
Partially-Implements: blueprint performance-improvements
Change-Id: If22255f1adc07ab16b46f8ad1280efdf7d713d28
This project [1] can provide a one-stop solution for log collection,
cleaning, indexing, analysis, alarming, visualization, report
generation and other needs. It helps operators and maintainers to
quickly solve retrieval problems, grasp the operational health of the
platform, and improve the level of platform management.
[1] https://wiki.openstack.org/wiki/Venus
Change-Id: If3562bbed6181002b76831bab54f863041c5a885
Starting from version 3.8.0, RabbitMQ has built-in Prometheus support,
and the prometheus plugins are enabled by default. When
enable_prometheus is set to "no", the rabbitmq role will now disable
the prometheus plugins.
Closes-Bug: #1885106
Change-Id: I4d694d6224c813285d228d6bc7eece5731db1078
Add support for deploying the Kolla Prometheus libvirt exporter image to
facilitate gathering metrics from the Nova libvirt service.
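A minimal globals.yml sketch, assuming an
enable_prometheus_libvirt_exporter flag (the flag name is an
assumption, not stated in this message):

    enable_prometheus_libvirt_exporter: "yes"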
Co-Authored-by: Dr. Jens Harbott <harbott@osism.tech>
Change-Id: Ib27e60c39297b86ae674297370f9543ab08cda05
Partially-Implements: blueprint libvirt-exporter
chrony is no longer supported in the Xena cycle, so remove it from
Kolla. Tasks were moved from the chrony role to a chrony-cleanup.yml
playbook to avoid a vestigial chrony role.
Co-Authored-By: Mark Goddard <mark@stackhpc.com>
Change-Id: I5a730d55afb49d517c85aeb9208188c81e2c84cf
* Register Swift-compatible endpoints in Keystone
* Load balance across RadosGW API servers using HAProxy
Support is added to the cephadm CI jobs, but since RGW is not
currently enabled via cephadm, it is not yet exercised.
https://docs.ceph.com/en/latest/radosgw/keystone/
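A minimal globals.yml sketch, assuming an enable_ceph_rgw flag is what
switches this support on (flag name assumed):

    enable_ceph_rgw: "yes"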
Implements: blueprint ceph-rgw
Change-Id: I891c3ed4ed93512607afe65a42dd99596fd4dbf9
Currently the haproxy role maintains both HAProxy and Keepalived.
In follow-up changes ProxySQL is also added.
This patch is *only* renaming/moving things to the more prominent
loadbalancer role, and moving service-specific templates to a
subdirectory.
This was done only to produce better diffs in the follow-up changes.
Change-Id: I1d39d5bcaefc4016983bf267a2736b742cc3a555
Adds the HAcluster Ansible role. This role contains a High
Availability clustering solution composed of Corosync, Pacemaker and
Pacemaker Remote.
HAcluster is added as a helper role for Masakari, which requires it
for its host monitoring, allowing it to provide HA to instances on a
failed compute host.
Kolla hacluster images merged in [1].
[1] https://review.opendev.org/#/c/668765/
Change-Id: I91e5c1840ace8f567daf462c4eb3ec1f0c503823
Implements: blueprint ansible-pacemaker-support
Co-Authored-By: Radosław Piliszek <radoslaw.piliszek@gmail.com>
Co-Authored-By: Mark Goddard <mark@stackhpc.com>
This trivial patch just turns off the Ansible changed report for
group_by tasks, as it could be confusing for users.
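A sketch of the change, assuming changed_when is the mechanism used to
suppress the report; the group key is illustrative:

    - name: Group hosts by whether HAProxy is enabled
      group_by:
        key: "enable_haproxy_{{ enable_haproxy | bool }}"
      changed_when: false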
Change-Id: I7512af573782359a6f01290a55291ac7eb0de867
This makes it possible for services to fetch the Elasticsearch endpoint
from Keystone. It is useful for both operators and the Monasca Tempest
tests.
Change-Id: Id60298582496a8959e82b970676669ca17e2e9d4
Some plays were not applied to all groups referenced by the services
they deploy. In most cases this works fine, but if the default
inventory is modified, containers may not be deployed to hosts in the
missing groups if those hosts are not members of other groups targeted
by the play.
This change syncs up the play hosts for all services.
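For illustration, a synced play targets every group referenced by its
service (group names are examples from the default inventory):

    - name: Apply role neutron
      hosts:
        - neutron-server
        - neutron-dhcp-agent
        - neutron-l3-agent
      roles:
        - role: neutron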
Closes-Bug: #1889387
Change-Id: I6b92d8e53a29b06a065e0611840140d09c8a6695
The common role was previously added as a dependency to all other roles.
It would set a fact after running on a host to avoid running twice. This
had the nice effect that deploying any service would automatically pull
in the common services for that host. When using tags, any services with
matching tags would also run the common role. This could be both
surprising and sometimes useful.
When using Ansible at large scale, there is a penalty associated with
executing a task against a large number of hosts, even if it is skipped.
The common role introduces some overhead, just in determining that it
has already run.
This change extracts the common role into a separate play, and removes
the dependency on it from all other roles. New groups have been added
for cron, fluentd, and kolla-toolbox, similar to other services. This
changes the behaviour in the following ways:
* The common role is now run for all hosts at the beginning, rather than
prior to their first enabled service
* Hosts must be in the necessary group for each of the common services
in order to have that service deployed. This is mostly to avoid
deploying on localhost or the deployment host
* If tags are specified for another service e.g. nova, the common role
will *not* automatically run for matching hosts. The common tag must
be specified explicitly
The last of these is probably the largest behaviour change. While it
would be possible to determine which hosts should automatically run the
common role, it would be quite complex, and would introduce some
overhead that would probably negate the benefit of splitting out the
common role.
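A sketch of the new standalone play, using the groups mentioned above
(exact structure assumed):

    - name: Apply role common
      hosts:
        - cron
        - fluentd
        - kolla-toolbox
      roles:
        - role: common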
Partially-Implements: blueprint performance-improvements
Change-Id: I6a4676bf6efeebc61383ec7a406db07c7a868b2a