Commit Graph

1924 Commits

Author SHA1 Message Date
Clark Boylan 8c088d17f1 Enable nodepool delete after upload option
This enables the nodepool delete-after-upload option with keep-formats
set to qcow2 on x86 image builders. This should clear out vhd and raw
files after uploads for those formats are completed keeping only qcow2
longer term. This should reduce disk space overhead while still enabling
us to convert from qcow2 to the other formats if that becomes necessary.

Note that we do not enable this for arm64 because arm64 builders
currently build raw images only and we still want at least one copy of
the image to be kept even if it is raw (and not qcow2).

Change-Id: I6cf481e0f9a5eaff35b5d961a084ae34a49ea6c6
2024-03-26 15:10:36 -07:00
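In nodepool's builder configuration, delete-after-upload and keep-formats are attributes of a diskimage entry. A minimal sketch of what such an entry might look like (the image name, format list, and layout are illustrative, not copied from the actual repo):

```yaml
# Illustrative diskimage entry; names and format lists are examples only.
diskimages:
  - name: ubuntu-jammy
    formats:
      - qcow2
      - raw
      - vhd
    # Delete local copies of built formats once all provider uploads finish...
    delete-after-upload: true
    # ...except these, kept on disk so other formats can be regenerated later.
    keep-formats:
      - qcow2
```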
Clark Boylan aabaf95b49 Remove centos-7 nodepool image builds
This is the last step in cleaning centos-7 out of nodepool. The previous
change will have cleaned up uploads and now we can stop building the
images entirely.

Change-Id: Ie81d6d516cd6cd42ae9797025a39521ceede7b71
2024-03-13 08:30:16 -07:00
Clark Boylan b8c53b9c03 Remove centos-7 image uploads from Nodepool
This removal of centos-7 image uploads should cause Nodepool to clean up
the existing images in the clouds. Once that is done we can completely
remove the image builds in a followup change.

We are performing this cleanup because CentOS 7 is near its EOL and
cleaning it up will create room on nodepool builders and our mirrors for
other more modern test platforms.

Depends-On: https://review.opendev.org/c/opendev/base-jobs/+/912786
Change-Id: I48f6845bc7c97e0a8feb75fc0d540bdbe067e769
2024-03-13 08:21:46 -07:00
James E. Blair f5c200181a Revert "Try switching Rackspace DFW to an API key"
This reverts commit eca3bde9cb.

This was successful, but we want to make the change without altering
the cloud name.  So switch this back, and separately we will update
the config of the rax cloud.

Change-Id: I8cdbd7777a2da866e54ef9210aff2f913a7a0211
2024-03-07 08:46:25 -08:00
Jeremy Stanley eca3bde9cb Try switching Rackspace DFW to an API key
Switch the Rackspace region with the smallest quota to uploading
images and booting server instances with our account's API key
instead of its password, in preparation for their MFA transition. If
this works as expected, we'll make a similar switch for the
remaining two regions.

Change-Id: I97887063c735c96d200ce2cbd8950bbec0ef7240
Depends-On: https://review.opendev.org/911164
2024-03-06 15:06:34 +00:00
Clark Boylan 56c5fefcf6 CentOS 7 removal prep changes
This drops min-ready for centos-7 to 0 and removes use of some centos-7
jobs from puppet-midonet. We will clean up those removed jobs in a
followup change to openstack-zuul-jobs.

We also remove x/collected-openstack-plugins from zuul. This repo uses
centos 7 nodesets that we want to clean up and it last merged a change
in 2019. That change was written by the infra team as part of global
cleanups. I think we can remove it from zuul for now and if interest
restarts it can be added and fixed up.

Change-Id: I06f8b0243d2083aacb44fe12c0c850991ce3ef63
2024-03-04 10:25:58 -08:00
Clark Boylan c41bc6e5c2 Remove debian-buster image builds from nodepool
This should be landed after the parent change has landed and nodepool
has successfully deleted all debian-buster image uploads from our cloud
providers. At this point it should be safe to remove the image builds
entirely.

Change-Id: I7fae65204ca825665c2e168f85d3630686d0cc75
2024-02-23 13:23:22 -08:00
Clark Boylan feff36e424 Drop debian-buster image uploads from nodepool
Debian buster has been replaced by bullseye and bookworm, both of which
are releases we have images for. It is time to remove the unused debian
buster images as a result.

This change follows the process in nodepool docs for removing a provider
[0] (which isn't quite what we are doing) to properly remove images so
that they can be deleted by nodepool before we remove nodepool's
knowledge of them. The followup change will remove the image builds from
nodepool.

[0] https://zuul-ci.org/docs/nodepool/latest/operation.html#removing-a-provider

Depends-On: https://review.opendev.org/c/opendev/base-jobs/+/910015
Change-Id: I37cb3779944ff9eb1b774ecaf6df3c6929596155
2024-02-23 13:19:49 -08:00
Clark Boylan 8eb9cb661e Set debian-buster min servers to 0
This is in preparation for the removal of this distro release from
Nodepool. Setting this value to 0 will prevent nodepool from
automatically booting new nodes under this label once we clean up any
existing nodes.

Change-Id: I90b6c84a92a0ebc4f40ac3a632667c8338d477f1
2024-02-23 08:41:20 -08:00
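Setting min-ready to 0 is done on the label entry in nodepool's configuration. A hedged sketch of what that change might look like (surrounding context omitted):

```yaml
# Illustrative label entry; only min-ready is the point here.
labels:
  - name: debian-buster
    min-ready: 0   # stop pre-booting ready nodes; the label itself still exists
```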
Clark Boylan 211fe14946 Remove opensuse-15 image builds from nodepool
This should be landed after the parent change has landed and nodepool
has successfully deleted all opensuse-15 image uploads from our cloud
providers. At this point it should be safe to remove the image builds
entirely.

Change-Id: Icc870ce04b0f0b26df673f85dd6380234979906f
2024-02-22 10:27:37 -08:00
Clark Boylan 5635e67866 Drop opensuse image uploads from nodepool
These images are old opensuse 15.2 and there doesn't seem to be interest
in keeping them running (very few jobs ever ran on them, rarely
successfully, and no one is trying to update to 15.5 or 15.6).

This change follows the process in nodepool docs for removing a provider
[0] (which isn't quite what we are doing) to properly remove images so
that they can be deleted by nodepool before we remove nodepool's
knowledge of them. The followup change will remove the image builds from
nodepool.

[0] https://zuul-ci.org/docs/nodepool/latest/operation.html#removing-a-provider

Depends-On: https://review.opendev.org/c/opendev/base-jobs/+/909773
Change-Id: Id9373762ed5de5c7c5131811cec989c2e6e51910
2024-02-22 10:25:15 -08:00
Clark Boylan b8b984e5b6 Set opensuse-15 min-ready to 0
This is in preparation for the followup changes that will drop opensuse
nodes and images entirely. We set min-ready to 0 first so that we can
manually delete any running nodes before cleaning things up further.

Change-Id: I6cae355fd99dd90b5e48f804ca0d63b641c5da11
2024-02-21 09:32:56 -08:00
Jeremy Stanley cedfb950de Temporarily lower max-servers for linaro
The launcher is seeing "Error in creating the server. Compute
service reports fault: No valid host was found." from Linaro's
cloud, leading to NODE_FAILURE results for many jobs when our other
ARM-based node provider is running at quota. According to graphs,
we've been able to sustain 16 nodes in-use in this cloud, so
temporarily cap max-servers at that in order to avoid further
failures until someone has a chance to look into what has broken
there.

Change-Id: I3f79e9cc70e848b9ebc6728205f806693209dfd5
2023-12-14 16:33:20 +00:00
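Capping a provider's capacity is done with max-servers on the provider pool. A minimal sketch under the assumption that the linaro provider uses a single pool (pool layout is illustrative):

```yaml
# Illustrative provider pool; only max-servers is the point here.
providers:
  - name: linaro
    pools:
      - name: main
        max-servers: 16   # temporarily capped to avoid NODE_FAILURE results
```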
Michal Nasiadka 4ba928c675 Add nested-virt-debian-bookworm
Change-Id: I17a202cc82ff19a788fde7b34415542c1b354fae
2023-10-04 14:47:23 +02:00
Clark Boylan 4a3c87dbcd Set a six hour nodepool image upload timeout
This was the old timeout; then some refactoring happened and we ended up
with the openstacksdk timeout of one hour. Since then Nodepool added the
ability to configure the timeout so we set it back to the original six
hour value.

Change-Id: I29d0fa9d0077bd8e95f68f74143b2d18dc62014b
2023-09-15 12:57:25 -07:00
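Nodepool's OpenStack driver exposes the upload timeout as a provider attribute. A hedged sketch of the setting (provider name illustrative; six hours expressed in seconds):

```yaml
# Illustrative provider entry; 21600 seconds = 6 hours.
providers:
  - name: rax-iad
    image-upload-timeout: 21600   # override openstacksdk's one-hour timeout
```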
Clark Boylan 3b9c5d2f07 Remove fedora image builds
This removes the fedora image builds from nodepool. At this point
Nodepool should no longer have any knowledge of fedora.

There is potential for other cleanups for things like dib elements, but
leaving those in place doesn't hurt much.

Change-Id: I3e6984bc060e9d21f7ad851f3a64db8bb555b38a
2023-09-06 09:16:34 -07:00
Clark Boylan d83736575e Remove fedora-35 and fedora-36 from nodepool providers
This will stop providing the node label entirely and should result in
nodepool cleaning up the existing uploads for these images in our cloud
providers. It does not remove the fedora diskimages; that will happen
next.

Change-Id: Ic1361ff4e159509103a6436c88c9f3b5ca447777
2023-09-06 09:12:33 -07:00
Clark Boylan 8d32d45da2 Set fedora labels min-ready to 0
In preparation for fedora node label removal we set min-ready to 0. This
is the first step to removing the images entirely.

Change-Id: I8c2a91cc43a0dbc633857a2733d66dc935ce32fa
2023-09-06 09:07:13 -07:00
Jeremy Stanley 16ddb49e48 Drop libvirt-python from suse in bindep fallback
The bindep fallback list includes a libvirt-python package for all
RPM-based distros, but it appears that OpenSuse Leap has recently
dropped this (likely as part of removing Python 2.7 related
packages). Exclude the package on that platform so that the
opensuse-15 job will stop failing.

Change-Id: I0bb7d9b7b34f4f6c392374182538b7e433617e13
2023-09-06 15:15:03 +00:00
Dr. Jens Harbott d0c0ddb977 Reduce frequency of image rebuilds
In order to reduce the load on our builder nodes and reduce the strain
on our providers' image stores, build most images only once per week.

Exceptions are ubuntu-jammy, our most often used distro image, which we
keep rebuilding daily, and some other more frequently used images built
every 2 days.

Change-Id: Ibba7f864b15e478fda59c998843c3b2ace0022d8
2023-09-02 13:18:19 +02:00
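The rebuild interval is controlled per diskimage via rebuild-age (in seconds). A sketch of the two ends of the spectrum described above (image names illustrative):

```yaml
# Illustrative diskimage entries showing per-image rebuild cadence.
diskimages:
  - name: ubuntu-jammy
    rebuild-age: 86400    # most-used image: keep rebuilding daily
  - name: opensuse-15
    rebuild-age: 604800   # infrequently used image: rebuild weekly
```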
Dr. Jens Harbott 407f859232 Unpause image uploads for rax-iad part 2
Enable uploads for all images again for rax-iad. We have configured the
nodepool-builders to run with only 1 upload thread, so we will have at
most two parallel uploads (one per builder).

Change-Id: Ia2b737e197483f9080b719bab0ca23461850e157
2023-08-30 21:07:27 +02:00
Dr. Jens Harbott c8b1b1c3b6 Unpause image upload for rax-iad part 1
This is a partial revert of d50921e66b.

We want to slowly re-enable image uploads for rax-iad, start with a
single image, choosing the one that is getting used most often.

Change-Id: I0816f7da73e66085fe6c52372531477e140cfb76
Depends-On: https://review.opendev.org/892056
2023-08-19 20:12:40 +00:00
Dr. Jens Harbott d50921e66b Revert "Revert "Temporarily pause image uploads to rax-iad""
This reverts commit 27a3da2e53.

Reason for revert: Uploads are still not working properly

Change-Id: I2a75dd9ff0731a4113a362f9f17f510a9a236ebb
2023-08-10 07:24:07 +00:00
Jeremy Stanley 27a3da2e53 Revert "Temporarily pause image uploads to rax-iad"
Manual cleanup of approximately 1200 images in this region, some as
much as 4 years old, has completed. Start attempting uploads again
to see if they'll complete now.

This reverts commit 71d1f02164.

Change-Id: I850acb3926a3fdedad599767b99be466bf45daef
2023-08-09 11:43:54 +00:00
Jeremy Stanley 71d1f02164 Temporarily pause image uploads to rax-iad
We're getting Glance task timeout errors when trying to upload new
images into rax-iad, which seems to be resulting in rapidly leaking
images and may be creating an ever-worsening feedback loop. Let's
pause uploads for now since they're not working anyway, and
hopefully that will allow us to clean up the mess that's been
created more rapidly as well.

Change-Id: I0cc93a80e2cfa2ef761c6f538e134505bf4dc53c
2023-08-08 15:48:48 +00:00
Vladimir Kozhukalov 50c4046096 Add 32GB Ubuntu Focal and Jammy nodes
Some deployment projects (e.g. OpenStack-Helm) test their code as
"all-in-one" deployments, i.e. a test job deploys all OpenStack
components on a single node on top of a minimalistic K8s cluster
running on the same node.

This requires more memory than 8GB to make jobs reliable. We add these
new *-32GB labels only in the Vexxhost ca-ymq-1 region because the
v3-standard-8 flavor in this region has 32GB nodes.

At the same time we should not rely on this kind of node, since the
number of such nodes is very limited. It is highly recommended to
redesign the test jobs so they use multinode nodesets.

Change-Id: Icfd58a88a12d13f093c08f41ab4be85c26051149
2023-07-19 15:42:11 +03:00
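A region-specific label like this needs both a top-level label definition and a pool label that pins the flavor. A hedged sketch of the shape such a change might take (label and pool names illustrative; the flavor name comes from the commit message):

```yaml
# Illustrative: a 32GB label offered only in one region via flavor-name.
labels:
  - name: ubuntu-jammy-32GB
    min-ready: 0

providers:
  - name: vexxhost-ca-ymq-1
    pools:
      - name: main
        labels:
          - name: ubuntu-jammy-32GB
            diskimage: ubuntu-jammy
            flavor-name: v3-standard-8   # the 32GB flavor in this region
```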
Zuul 4eaf4e973a Merge "Fix unbound setup for debian-bookworm" 2023-07-04 09:46:55 +00:00
Dr. Jens Harbott 3df7459924 Fix unbound setup for debian-bookworm
dns-root-data has been demoted to a "Recommends" dependency of unbound,
which we don't install. Sadly the default unbound configuration is
broken without it.

Change-Id: I93e6928d30db8a90b45329ca00f066b4ec1b4ae7
2023-07-04 09:37:49 +02:00
Dr. Jens Harbott 5aa792f1ae Start booting bookworm nodes
Image builds have been successful

Change-Id: If286eb3e1a75c643f67f3d6d3d7e2d31c205ac1b
2023-07-03 18:47:46 +02:00
Dr. Jens Harbott 4c16313ad2 Build debian bookworm images
Release is done, mirror is in place, ready to go.

Adopt using systemd-timesyncd like we do for recent Ubuntu releases.

Change-Id: I3fbdc151177bf2dba81920a4a2e3966f271b50ad
2023-07-03 06:05:36 +00:00
Dr. Jens Harbott 6b1cfbe079 Cache new cirros images
The cirros project has released new images, add them to our cache prior
to actually using them in the CI. We can remove the old images once the
migration is completed and not too many stable branches using the old
images are still active, but comparing the size of these in relation to
the total size of our images, the impact of this shouldn't be too large
in comparison to the benefit in CI stability.

Signed-off-by: Dr. Jens Harbott <harbott@osism.tech>
Change-Id: I6d6bcc0e9cfef059de70bbb19e4254e8d29d415b
2023-06-01 16:26:54 +00:00
Zuul 7e255132bf Merge "Stop caching infrequently-used CirrOS images" 2023-06-01 09:29:26 +00:00
Rodolfo Alonso 2a0657cf70 Revert "Temporary disable nested-virt labels in vexxhost-ca-ymq-1"
This reverts commit 4df959c449.

Reason for revert: the VEXXHOST provider has informed us that they have
performed some optimizations, so we can now enable this pool again.

Change-Id: Ifbd26a676c8c64c974e061e12d6d1a9d1ae47676
2023-05-11 15:44:42 +00:00
yatinkarel 4df959c449 Temporary disable nested-virt labels in vexxhost-ca-ymq-1
The jobs running with nested-virt labels on this provider have been
impacted by mirror issues for the last couple of weeks.

At least jobs running on compute nodes[1] are impacted.

Until the issue is resolved, let's disable the provider.

[1] https://paste.opendev.org/show/bCbxIrXR1P01q4JYUh3i/

Change-Id: Id8b7d214789568565a07770cc3c8b095a4c0122d
2023-04-28 17:40:33 +05:30
Dr. Jens Harbott ac5c9ccc5b Add nested-virt-debian-bullseye label to nodepool
kolla wants to have testing parity between Ubuntu and Debian, so add a
nested-virt-debian-bullseye label to nodepool matching the existing
nested-virt-ubuntu-bionic label.

Change-Id: I27766140120fb55a2eab35552f0321b1da9c67ff
2023-03-31 18:15:25 +02:00
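Adding a label that mirrors an existing one mostly means duplicating its label and pool entries under the new name. A sketch under the assumption that the nested-virt labels live in a dedicated pool (pool and provider names illustrative):

```yaml
# Illustrative: new label mirroring nested-virt-ubuntu-bionic for Debian.
labels:
  - name: nested-virt-debian-bullseye
    min-ready: 0

providers:
  - name: vexxhost-ca-ymq-1
    pools:
      - name: nested-virt   # pool whose flavors expose nested virtualization
        labels:
          - name: nested-virt-debian-bullseye
            diskimage: debian-bullseye
```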
Jeremy Stanley 8f916dc736 Restore rax-ord quota but lower max-concurrency
Looking at our graphs, we're still spiking up into the 30-60
concurrent building range at times, which seems to result in some
launches exceeding the already lengthy timeout and wasting quota,
but when things do manage to boot we effectively utilize most of
max-servers nicely. The variability is because max-concurrency is
the maximum number of in-flight node requests the launcher will
accept for a provider, but the number of nodes in a request can be
quite large sometimes.

Raise max-servers back to its earlier value reflecting our available
quota in this provider, but halve the max-concurrency so we don't
try to boot so many at a time.

Change-Id: I683cdf92edeacd7ccf7b550c5bf906e75dfc90e8
2023-03-16 19:53:55 +00:00
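max-concurrency is a provider-level attribute while max-servers sits on the pool, so the two knobs described above live at different levels. A hedged sketch (values illustrative except max-servers, which the commit names):

```yaml
# Illustrative: restore pool capacity but throttle in-flight node requests.
providers:
  - name: rax-ord
    max-concurrency: 8    # halved cap on node requests accepted at once
    pools:
      - name: main
        max-servers: 195  # back to reflecting available quota
```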
Jeremy Stanley d0481326bf Limit rax-ord launch concurrency and don't retry
This region seems to take a very long time to launch nodes when we
have a burst of requests for them, like a thundering herd sort of
behavior causing launch times to increase substantially. We have a
lot of capacity in this region though, so want to boot as many
instances as we can here. Attempt to reduce the effect by limiting
the number of instances nodepool will launch at the same time.

Also, mitigate the higher timeout for this provider by not retrying
launch failures, so that we won't ever lock a request for multiples
of the timeout.

Change-Id: I179ab22df37b2f996288820074ec69b8e0a202a5
2023-03-10 18:09:33 +00:00
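The "don't retry" part maps to the OpenStack driver's launch-retries attribute. A sketch under the assumption that a value of 1 means a single launch attempt (the exact value used is not shown in this log):

```yaml
# Illustrative: avoid locking a node request for multiples of the timeout.
providers:
  - name: rax-ord
    launch-retries: 1   # one attempt only; assumed value, not confirmed here
```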
Jeremy Stanley bc7d946ca2 Wait longer for rax-ord nodes and ease up API rate
We're still seeing a lot of timeouts waiting for instances to become
active in this provider, and are observing fairly long delays
between API calls at times. Increase the launch wait from 10 to 15
minutes, and increase the minimum delay between API calls by an
order of magnitude from 0.001 to 0.01 seconds.

Change-Id: Ib13ff03629481009a838a581d98d50accbf81de2
2023-03-08 14:39:38 +00:00
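Both knobs mentioned above are OpenStack driver provider attributes: boot-timeout is in seconds and rate is the minimum delay between API calls. A sketch with the values from the commit message:

```yaml
# Illustrative provider entry with the values described above.
providers:
  - name: rax-ord
    boot-timeout: 900   # 15 minutes (up from 10) to wait for ACTIVE
    rate: 0.01          # min seconds between API calls, up from 0.001
```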
Jeremy Stanley 6f5c773b6e Try halving max-servers for rax-ord region
Reduce the max-servers in rax-ord from 195 to 100, and revert the
boot-timeout from the 300 we tried back down to 120 like the others.
We're continuing to see server create calls taking longer to report
active than nodepool is willing to wait, but also may be witnessing
the results of API rate limiting or systemic slowness. Reducing the
number of instances we attempt to boot there may give us a clearer
picture of whether that's the case.

Change-Id: Ife7035ba64b457d964c8497da0d9872e41769123
2023-03-07 18:39:00 +00:00
Clark Boylan 53cdd2d990 Revert "Revert "Revert "Temporarily stop booting nodes in inmotion iad3"""
This reverts commit 4a2253aac3.

We've made some modifications to the nova installation in this cloud
which should prevent nodes other than the mirror from launching on its
hypervisor. This should protect it from OOMs.

Change-Id: I9e0f6dac3c13f21e676f44c44206861cea289b34
2023-03-06 16:06:59 -08:00
Jeremy Stanley a177d641f2 Increase boot-timeout for rax-ord
For a while we've been seeing a lot of "Timeout waiting for instance
creation" in Rackspace's ORD region, but checking behind the
launcher it appears these instances do eventually boot, so we're
wasting significant resources discarding quota we never use.
Increase the timeout for this from 2 minutes to 5, but only in this
region as 2 minutes appears to be sufficient in the others.

Change-Id: I1cf91a606eefc4aa65507f491a20182770b99f09
2023-03-06 16:56:45 +00:00
Jeremy Stanley 4a2253aac3 Revert "Revert "Temporarily stop booting nodes in inmotion iad3""
The mirror server spontaneously powered off again. It's been booted
back up, but let's take the region out of service until someone has
a chance to investigate the reason and hopefully fix it so that it
doesn't keep happening.

This reverts commit f45f51fdd7.

Change-Id: If24a375f3b0cbf7f9d60157ae7597bb0b1c4835a
2023-03-03 14:23:03 +00:00
Jeremy Stanley f45f51fdd7 Revert "Temporarily stop booting nodes in inmotion iad3"
Merge once the mirror for this provider has returned to service.

This reverts commit 17888a4a03.

Change-Id: I480d695a63f0a695631c97294740c8443dd6981c
2023-02-27 13:43:27 +00:00
Jeremy Stanley 17888a4a03 Temporarily stop booting nodes in inmotion iad3
The mirror server in the inmotion iad3 region is down. Don't boot
nodes there for now, since jobs run on them will almost certainly
fail. This can be reverted once the mirror is back in service.

Change-Id: I369b87f97446a3b927e98b59e2fd1ac1e772b8f8
2023-02-27 12:46:26 +00:00
Zuul b046d8837d Merge "nodepool: size linaro max-servers to subnet" 2023-02-15 23:05:45 +00:00
Zuul 3e52d32877 Merge "Cache Cirros 0.6.1 images" 2023-02-14 17:34:50 +00:00
Jeremy Stanley 5262094f9e Stop caching infrequently-used CirrOS images
According to Ic8b3e790fe332cf68bad7aaa3d5f85229600380b review
comments, OpenSearch indexing indicates jobs aren't often using
CirrOS 0.3.4, 0.3.5, 0.4.0 or 0.5.1 images any longer. If jobs
occasionally used them and have to retrieve them from the Internet
then that's fine, we really only need to cache images which are used
frequently. Remove the rest in order to shrink our node images
somewhat.

Change-Id: Ibada405e0c1183559f428c749d0e54d0a45a2223
2023-02-14 17:25:45 +00:00
Jeremy Stanley 92814f9b71 Remove empty limestone nodepool providers
Once the builders have a chance to clear out all uploaded images,
this will remove the remaining references in Nodepool. Then
system-config cleanup can proceed.

Change-Id: I69b96b690918a9145d2e7ccbc79968c5341480bb
2023-02-14 08:25:25 +11:00
Jeremy Stanley 7c81cf6eda Farewell limestone
The mirror in our Limestone Networks donor environment is now
unreachable, but we ceased using this region years ago due to
persistent networking trouble and the admin hasn't been around for
roughly as long, so it's probably time to go ahead and say goodbye
to it.

In preparation for cleanup of credentials in system-config, first
remove configuration here except leave the nodepool provider with an
empty diskimages list so that it will have a chance to pick up after
itself.

Change-Id: I504682884a1439fac84d514880757c2cd041ada6
2023-02-14 08:25:10 +11:00
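"Leave the nodepool provider with an empty diskimages list" is the standard way to let nodepool delete its uploads before the provider is removed entirely. A hedged sketch of the intermediate state (provider name illustrative):

```yaml
# Illustrative: provider kept temporarily with no images or pools, so
# nodepool can delete its previously uploaded images before full removal.
providers:
  - name: limestone-regionone
    diskimages: []
```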
yatinkarel 10abfbe573 Cache Cirros 0.6.1 images
0.6.1 is the latest cirros release and, with [1][2], is being used in
neutron jobs.

Add these to the nodepool image cache to avoid pulling them in jobs and
hitting external connectivity issues.

[1] https://review.opendev.org/c/openstack/neutron/+/869154
[2] https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/869152

Change-Id: Ic8b3e790fe332cf68bad7aaa3d5f85229600380b
2023-02-09 17:03:31 +05:30