Once the opendev launchers are handling these duties and these servers
have all been removed from the system-config inventory we can go ahead
and land this change to clean up the unused config files.
Change-Id: I9792620eea81a07b6cbbfee37c08807114d2b390
Once these new servers are up and running in a happy idle state we are
clear to flip the configs around so the new focal servers take over node
provisioning duties. This change makes that happen.
Change-Id: I6ad57218805e28b555e1e3a0dc959ee4f00428cc
This serves as a sanity check that we don't have any fedora-30 usage
hiding somewhere. If this goes in safely then we can remove the image
from the builders.
Change-Id: I09b21e812081f5855a069ca8ab1eedadf090c1b8
Ubuntu Focal has a newer libvirt version than Bionic (6.0.0 vs 4.0.0).
By adding a Focal-flavored nested-virt label, features made possible
by a more recent libvirt version can be tested in the gate.
Specifically, whitebox-tempest-plugin tests Nova's hw_video_type image
property. Support for the 'none' value was added in libvirt 4.6.0.
Change-Id: Id48fff64d13c258d9f22908debfad86c5f089bf5
Needed-by: https://review.opendev.org/#/c/742014/
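As a purely illustrative fragment (the image name here is hypothetical, and
current Nova documentation spells the property hw_video_model, with the
'none' value requiring libvirt >= 4.6.0), requesting the new value on an
image might look like:

```shell
# Illustrative only: the image name is made up; Nova documents the
# property as hw_video_model, and 'none' needs libvirt >= 4.6.0.
openstack image set --property hw_video_model=none my-guest-image
```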
Change [1] added nested virtualization labels for Ubuntu Bionic and
CentOS 7. This patch extends that to CentOS 8.
Additionally, we extend nl04 to include these labels too, as OVH is a
nested-virt-enabled nodepool provider.
[1] https://review.opendev.org/#/c/683431/
Change-Id: Ibf5ac5fa0371cc70dbe58806d147568278afcfea
This should only be landed once we've landed the dependency and
confirmed all clouds have the new key value.
This does our semi-regular key rotation.
Depends-On: https://review.opendev.org/727865
Change-Id: Ic55c96ad5dd867b70fa52c396e792d5a2e2e0470
New mirror servers have been built, so turn our utilization of these
regions back on again.
This reverts commit 37d292ee74.
Change-Id: Id86b578cec163e264c93fbbbda32a2cc4603492a
Focal images were built with [0] and the result looks successful, so
let's start launching them.
[0] https://review.opendev.org/720719
Change-Id: I2b825178df230d13d75e782c60dd247e6d65ac8b
All jobs using Fedora 29 have been removed, so we can now remove it from
nodepool and thus OpenDev.
Depends-On: https://review.opendev.org/711969
Change-Id: I75c0713d164c29a47db9a0cdfc43fadb370e81f8
This removes trusty from the repo and thus from OpenDev.
Afterwards the AFS volume mirror.wheel.trustyx64 can be deleted.
Depends-On: https://review.opendev.org/702771
Depends-On: https://review.opendev.org/702818
Change-Id: I3fa4c26b0c8aeacf1af76f9046ea98edb2fcdbd0
The opensuse-150 image is being removed as the 15.0 release is EOL.
As with CentOS, the expectation is that users keep up to date with
minor releases. For this we have the opensuse-15 image, which should be
used instead.
Depends-On: https://review.opendev.org/#/c/682844/
Change-Id: I8db99f8f2fd4b1b7b9a5e06148ca2dc185ed682b
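For jobs still referencing the old image, the switch is typically a
one-line label change in the job's nodeset. A hypothetical Zuul snippet
(nodeset and node names are assumptions) illustrating the move to the
rolling label:

```yaml
# Hypothetical Zuul nodeset update: point jobs at the rolling label.
- nodeset:
    name: single-opensuse-node
    nodes:
      - name: primary
        label: opensuse-15   # was: opensuse-150
```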
OVH is having issues on some hypervisors in the OSF aggregate.
As a result, the other hypervisors are under heavy load and some
instances are not able to boot correctly.
To avoid this, I propose reducing the number of instances to 120 for a
while.
Change-Id: Ic5f4b279e7222e9ec242aeb80e69612d2e6ef70f
Signed-off-by: Arnaud Morin <arnaud.morin@gmail.com>
The main reason we can do this cleanup now is a policy change with
openSUSE Leap 15: minor releases like 15.1 and 15.2 are backwards
compatible, much as minor releases in CentOS 7.x are. As such, we can
build a single opensuse-15 image in the CI and update all jobs to use
it, reducing the ongoing effort of maintaining the openSUSE builds.
Depends-On: https://review.opendev.org/#/c/660137/
Change-Id: I2b1f21fb6e01558c8cee27de116dfc857a1a1c91
OVH updated its infrastructure so that a correct network_data.json is
provided by the metadata API and/or config drive, so the glean override
for this is no longer needed.
Change-Id: Id97aceb78019b7b71bc231778d7ea7e0f3964e0d
Signed-off-by: Arnaud Morin <arnaud.morin@corp.ovh.com>
This reverts commit 9467a1e51b.
With a different nameserver in use, everything should be fine again.
Change-Id: Icd388dd5b96526c10bd4452a2c1d9f83f656edc6
This reverts commit 1911815832.
Seems the DNS lookup failures are continuing, based on recent job
logs.
Change-Id: I55690b005eb1a393041f93f2512c783f59bec6d2
We seem to be seeing network issues there, such as Ansible timing out
trying to connect to machines and VMs failing to reach 1.1.1.1
to resolve hosts such as git.openstack.org on node startup.
Change-Id: Id4af1ec98899afd1f2e55ad7b7bd397ceca43a62
This reverts commit 32e63aa0c8 (and
small follow-on fix 0eeb4395d1).
The base CentOS node has been switched to NetworkManager support.
Change-Id: Ic254273afdf0637194b608b781ea9e3ff4bd73a3
We updated the kernel on the aggregate, so we don't have the memory-leak
issue anymore.
We can safely re-enable GRA1 nodepool.
Change-Id: Ie1d4e188c352d427e2e2113daedc38c1eea2e92a
Signed-off-by: Arnaud Morin <arnaud.morin@corp.ovh.com>
This enables NetworkManager control of interfaces on a new centos7-nm
node type. This is intended only for short-term initial testing.
Change-Id: I43318f33d206c28e1f06ac7a8f07c3fb8c8f0626
I recently applied a new kernel on BHS1. If everything is fine with
that, I propose applying the same one to GRA1, which should help fix
some timeout errors.
Change-Id: I489f8b84871c18f2dad079cae5b53fb1a504f1bd
Signed-off-by: Arnaud Morin <arnaud.morin@corp.ovh.com>
Set ovh-bhs1 max-servers to 150. OVH (thank you amorin) has debugged
and corrected a memory leak there that we believe to be the cause of the
test node slowness.
Frickler and I have run fio tests on VMs running on each hypervisor in
the region and they look happy. We've also run spot tests of devstack
and tempest which also appear happy.
Change-Id: If6fd5a6194a9996e8b031f74918f373dc7bbe758
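The contiguous-write checks mentioned above can be approximated without
fio; a minimal Python sketch (the data volume and block size here are
arbitrary assumptions, and results are much rougher than fio's):

```python
import os
import tempfile
import time


def seq_write_mbps(size_mb=64, block_kb=1024):
    """Rough sequential write throughput in MB/s, fsync'd at the end."""
    buf = b"\0" * (block_kb * 1024)
    blocks = size_mb * 1024 // block_kb
    with tempfile.NamedTemporaryFile() as f:
        start = time.monotonic()
        for _ in range(blocks):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())  # make sure data actually hit the disk
        elapsed = time.monotonic() - start
    return size_mb / elapsed


print(f"{seq_write_mbps(size_mb=16):.1f} MB/s")
```

A node writing in the ~10-15 MB/s range for this kind of workload, as
described above, would explain the observed job timeouts.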
We are seeing excessive job timeouts in this region [0]. Disable it
until things are more stable again.
[0] https://ethercalc.openstack.org/jg8f4p7jow5o
Change-Id: I7969cca2cdd99526294a4bf7a0f44f059823dae7
We are debugging slow nodes in bhs1. Looking at dstat data we clearly
have some jobs that end up spending a lot of cpu time in sys and wai
columns while other similar jobs do not.
One thought was that this is due to an unhappy hypervisor or two, but
amorin has dug in and found that these slow jobs run on multiple unique
hypervisors implying that isn't likely.
My next thought is that we are our own noisy neighbors. Reducing the
max-servers should improve things if we are indeed our own noisy
neighbors.
Change-Id: Idd7804778a141d38da38b739294c6c6a62016053
I'd like to isolate one host from the aggregate, but to do that cleanly
it's better to reduce the number of instances nodepool is trying to
boot; this will avoid spurious "no valid host found" errors.
Change-Id: Iddbfba1c3093e9f128c41db91d6b5b3e1d467ce8
Signed-off-by: Arnaud Morin <arnaud.morin@corp.ovh.com>
This reverts commit 3f40af4296.
Can be approved once the slow disk performance in this region is
resolved.
Change-Id: Idda585116ae9dc09b55f6794ab5ee7bda47f455a
We've gotten reports of frequent slow job runs in the BHS1 region
leading to job timeouts. Further investigation indicates these
instances top out around ~10-15MB/sec for contiguous writes to their
rootfs while instances booted from the same image and flavor in GRA1
see 250MB/sec or better with the same write patterns. Disable BHS1
in nodepool for now while we work with OVH staff to see if they can
determine the root cause.
Change-Id: I8b9a79b64dd7da6d3a33f24797ca597bd2426c86
We've gotten reports of frequent slow job runs in the BHS1 region
leading to job timeouts and OVH staff have confirmed we're running a
CPU oversubscription ratio of 2:1 there, so try dropping our
utilization by half to confirm whether this could be due to CPU
contention during peak load.
Change-Id: If7e5f3c0dec71813f5bcb974a0217dc031801115
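A back-of-the-envelope sketch of the reasoning above (all concrete
numbers are hypothetical; the commit only tells us the ratio is 2:1 and
that max-servers is being dropped by half):

```python
# Hypothetical illustration: with a fixed vCPU oversubscription ratio,
# peak demand on physical cores scales linearly with max-servers, so
# halving max-servers halves potential CPU contention at peak load.
def physical_core_demand(servers, vcpus_per_server=8, ratio=2.0):
    """Physical cores' worth of demand if every guest vCPU is busy."""
    return servers * vcpus_per_server / ratio


full = physical_core_demand(100)   # hypothetical original max-servers
halved = physical_core_demand(50)  # after dropping utilization by half
print(full, halved)
```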
This should happen at the same time as we switch the zuul scheduler over
to the new zk cluster and after the nodepool builders have populated
image data on the new zk cluster.
This gets us off the old nodepool.o.o server and onto the newer HA
cluster.
Change-Id: I9cea03f726d4acb21ad5584f8db7a4d15bc556db
This partially reverts commit
bfdd3e6a42.
After fruitful discussions with amorin on IRC, we have nodes working
again in this region. This puts a small load on the region for us to
monitor for a while. A follow-on will do a full revert so we don't
forget.
Story: #2004090
Task: #27492
Change-Id: Id01f85fcee150f9360f508b09003a8d0043155bd
This reverts commit 19e7cf09d9.
The issues in OVH BHS1 around networking configuration have been worked
around with updates to glean and configuration to the labels in zuul.
New images are in place for each supported image in BHS1. We can go
ahead and start using this region again.
I have manually tested this by booting an ubuntu-xenial node with
glean_ignore_interfaces='True' set in metadata, and the networking comes
up as expected using DHCP. The mirror in that region is reachable from
this test node.
Change-Id: I29746686217a62709c4afc6656d95829ace6fb3b
Instruct glean via metadata properties to ignore the config drive
network_data.json interface data on OVH and instead fall back to DHCP.
This is necessary because, post-upgrade, the OVH config drive
network_data.json provides inaccurate network configuration details, and
DHCP is what is actually needed there for working L2 networking.
Change-Id: I51f16d34a96ee8d964e8b540ce5113a662a56f6d
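In nodepool's OpenStack driver, this kind of metadata can be attached at
the provider-pool level; a hypothetical fragment (provider and pool
names are assumptions) of what the change looks like:

```yaml
# Hypothetical nodepool provider-pool fragment: instance-properties
# become server metadata, which glean reads at boot.
- name: ovh-bhs1
  pools:
    - name: main
      instance-properties:
        glean_ignore_interfaces: 'True'
```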
This reverts commit 756a8f43f7, which
was where we re-enabled OVH BHS1 after maintenance. I strongly
suspect that this has something to do with the issues ...
It appears that VMs in BHS1 cannot communicate with the mirror.
From a sample host 158.69.64.62 to mirror01.bhs1.ovh.openstack.org:
---
root@ubuntu-bionic-ovh-bhs1-0002154210:~# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether fa:16:3e:1b:4b:32 brd ff:ff:ff:ff:ff:ff
inet 158.69.64.62/19 brd 158.69.95.255 scope global ens3
valid_lft forever preferred_lft forever
inet6 fe80::f816:3eff:fe1b:4b32/64 scope link
valid_lft forever preferred_lft forever
root@ubuntu-bionic-ovh-bhs1-0002154210:~# traceroute -n mirror01.bhs1.ovh.openstack.org
traceroute to mirror01.bhs1.ovh.openstack.org (158.69.80.87), 30 hops max, 60 byte packets
1 158.69.64.62 2140.650 ms !H 2140.627 ms !H 2140.615 ms !H
root@ubuntu-bionic-ovh-bhs1-0002154210:~# ping mirror01.bhs1.ovh.openstack.org
PING mirror01.bhs1.ovh.openstack.org (158.69.80.87) 56(84) bytes of data.
From ubuntu-bionic-ovh-bhs1-0002154210 (158.69.64.62) icmp_seq=1 Destination Host Unreachable
From ubuntu-bionic-ovh-bhs1-0002154210 (158.69.64.62) icmp_seq=2 Destination Host Unreachable
From ubuntu-bionic-ovh-bhs1-0002154210 (158.69.64.62) icmp_seq=3 Destination Host Unreachable
--- mirror01.bhs1.ovh.openstack.org ping statistics ---
4 packets transmitted, 0 received, +3 errors, 100% packet loss, time 3049ms
---
However, *external* access to the mirror host and all other hosts
seems fine. It appears to be an internal OVH BHS1 networking issue.
I have raised ticket #9721374795 with OVH about this issue. It needs
to be escalated so is currently pending (further details should come
to infra-root@openstack.org).
In the meantime, all jobs are failing in the region. Disable it
until we have a solution.
Change-Id: I748ca1c10d98cc2d7acf2e1821d4d0f886db86eb