We use the internal "azul" interface to azure, which only requires
requests. We can drop these dependencies.
Change-Id: I0383a365083a2060375d0d40d0ea24079fc717b1
This allows the metastatic driver to gracefully remove a backing
node from service after a certain amount of time. This forced
retirement can be used to periodically ensure that fresh backing
nodes are used even in busy systems (which helps prevent job
behavior from changing, over time, based on the accumulated
contents of the backing node).
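For example (a sketch only; this assumes the new setting is a
max-age label attribute of the metastatic driver, with the value in
seconds):

  providers:
    - name: meta-provider
      driver: metastatic
      pools:
        - name: main
          labels:
            - name: small-node
              backing-label: large-node
              # Retire the backing node an hour after it is
              # created, even if the system stays busy.
              max-age: 3600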
Change-Id: I62a95411a5d0b75185739a3c2553c75124c78c25
This adds the ability to configure the minimum retention time for
the backing node of a metastatic resource. This is useful for
cloud resources with minimum billing intervals (e.g., you are
billed for 24 hours of an instance even if you use it for less).
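A sketch for a cloud with 24-hour minimum billing, assuming the new
setting is a min-retention-time label attribute (in seconds):

  providers:
    - name: meta-provider
      driver: metastatic
      pools:
        - name: main
          labels:
            - name: small-node
              backing-label: large-node
              # We pay for a full day regardless, so keep the
              # backing node at least that long.
              min-retention-time: 86400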
Change-Id: Ibb0588f244fc6697950f0401145e0ec5ad2482c3
As with the recently demoted timeouts, these exceptions are not
useful since they are internally generated. Log them as warnings
without tracebacks.
Change-Id: I84c04b65c3006f9173e5880b38694acc368b8f44
If we hit the internal timeout while launching or deleting a
server, we raise an exception and then log the traceback. This is a
not-unexpected occurrence, and the traceback is not useful since
it's just one stack frame within the same class, so instead, let's
log these timeouts at warning level without the traceback.
Change-Id: Id4806d8ea2d0a232504e5a75d69cec239bcac670
Since this only adds a symlink, we can do it in the Dockerfile and
avoid having to pull a whole new debootstrap from unstable.
Change-Id: I92941816e90e029fcff8c8afd1be87c02ae1e374
The delete-after-upload option would sometimes delete before upload
if there were intermediate build files matching its criteria. To
ensure that we don't delete any build files until the build is at
least complete, only run that method if the ZK build state is
marked as ready.
Change-Id: Ia263b478b0a2b9d77833bc19d0967a65dcbf5b27
This fixes an issue where the aws driver errors when it encounters
an unknown instance family while listing instances, making it
unable to launch new instances.
Change-Id: I5c0a6eaeebe6038806149a9c5592899db7406574
Co-Authored-By: James E. Blair <jim@acmegating.com>
In the metric name, we use the builder's FQDN as a key, but in the
test we used the hostname, so the test fails on systems where the
two differ.
Change-Id: If286f19371d1fd70dc9bee4b7af814d13396357b
The cleanup routine for leaked image uploads based its detection
on upload ids, but they are not unique except in the context of
a provider and build. This meant that, for example, as long as
there was an upload with id 0000000001 for any image build for
the provider (very likely!) we would skip cleaning up any leaked
uploads with id 0000000001.
Correct this by using a key generated from build+upload (provider
is implied because we only consider uploads for our current
provider).
Update the tests relevant to this code to exercise this condition.
Change-Id: Ic68932b735d7439ca39e2fbfbe1f73c7942152d6
This allows operators to delete large diskimage files after uploads
are complete, in order to save space.
A setting is also provided to keep certain formats, so that if
operators would like to delete large formats such as "raw" while
retaining a qcow2 copy (which, in an emergency, could be used to
inspect the image, or manually converted and uploaded for use),
that is possible.
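For example, assuming the new diskimage settings are named
delete-after-upload and keep-formats, the raw/qcow2 case above
could be configured as:

  diskimages:
    - name: ubuntu-jammy
      formats:
        - raw
        - qcow2
      # Delete local image files once all uploads are complete...
      delete-after-upload: true
      # ...except the qcow2 copy, kept for emergencies.
      keep-formats:
        - qcow2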
Change-Id: I97ca3422044174f956d6c5c3c35c2dbba9b4cadf
We have observed GCE returning bad machine type data which we
then cache. If that happens, clear the cache to avoid getting
stuck with the bad data.
Change-Id: I32fac2a92d4f9d400fe2db41fffd8d189d097542
Rackspace has announced that MFA will be required starting on March
26, 2024. When MFA is enabled on an account, you will no longer be
able to log in to Rackspace using a username and password with
openstacksdk/openstackclient/etc., as the APIs apparently don't
support negotiating the MFA token. Instead we can use either a
Rackspace-specific api_key or a Keystone bearer token.
We opt for the Rackspace-specific api_key because it doesn't expire
the way the bearer tokens do. But using the api_key does require a
keystoneauth1 plugin called `rackspaceauth` to be installed, which
this change adds to nodepool.
This new dep is Apache2 licensed according to the license file in
the sdist. It has minimal deps of its own, and they are all already
shared by the existing dep tree. It seems reasonable to install
this small lib in hopes that we can keep Rackspace working with
nodepool.
As a final note, the OpenDev team plans to test use of the api_key
with this library against a single Rackspace region. It is possible
this won't work out of the box and we may need to make additional
updates. Unfortunately, it isn't easy to test this without talking
directly to Rackspace, so we opt for the lib install and testing
via OpenDev.
Change-Id: Ibff32bb44e05413391dd7a320ba356f521bb30e8
On startup, the launcher waits up to 5 seconds until it has seen
its own registry entry because it uses the registry to decide if
other components are able to handle a request, and if not, fail
the request.
In the case of a ZK disconnection, we will lose all information
about registered components as well as the tree caches. Upon
reconnection, we will repopulate the tree caches and re-register
our component.
If the tree cache repopulation happens first, our component
registration may be in line behind several thousand ZK events. It
may take more than 5 seconds to repopulate and it would be better
for the launcher to wait until the component registry is up to date
before it resumes processing.
To fix this, instead of only waiting on the initial registration,
we check each time through the launcher's main loop that the registry
is up-to-date before we start processing. This also covers
disconnections, because we expect the main loop to abort with an
error and restart in those cases.
This operates only on local cached data, so it doesn't generate any
extra ZK traffic.
Change-Id: I1949ec56610fe810d9e088b00666053f2cc37a9a
As is done with several other metadata attributes, copy the `cloud`
attribute from the backing node to the metastatic node.
Change-Id: Id83b3e09147baaab8a85ace4d5beba77d1eb87bd
gp3 is better in almost every way (cheaper, faster, more configurable).
It seems difficult to find a situation where gp2 would be a better
choice, so update the default when creating images to use gp3.
There are two locations where we can specify volume-type: image creation
(where the volume type becomes the default type for the image) and
instance creation (where we can override what the image specifies).
This change updates only the first (image creation), but not the second,
which has no default (which means to use whatever the image specified).
https://aws.amazon.com/ebs/general-purpose/
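A sketch of the first location, an AWS provider diskimage (the
provider name and image are illustrative):

  providers:
    - name: ec2-us-east-1
      driver: aws
      region-name: us-east-1
      diskimages:
        - name: ubuntu-jammy
          # This is now the default; set gp2 explicitly to keep the
          # old behavior.
          volume-type: gp3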
Change-Id: Ibfc5dfd3958e5b7dbd73c26584d6a5b8d3a1b4eb
This adds some stats keys that may be useful when monitoring
the operation of individual nodepool builders.
Change-Id: Iffdeccd39b3a157a997cf37062064100c17b1cb3
If a long-running backing node used by the metastatic driver
develops problems, performing a host-key check each time we
allocate a new metastatic node may detect them. If that happens,
mark the backing node as failed so that no more nodes are allocated
to it and it is eventually removed.
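A sketch, assuming the check is controlled by the metastatic
driver's host-key-checking label attribute:

  providers:
    - name: meta-provider
      driver: metastatic
      pools:
        - name: main
          labels:
            - name: small-node
              backing-label: large-node
              # Re-verify the backing node's host key on every
              # allocation; a failure marks the backing node failed.
              host-key-checking: true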
Change-Id: Ib1763cf8c6e694a4957cb158b3b6afa53d20e606
The dnf-plugins-core repo updated its download command to use a
dnf.utils method that is not present in the dnf version packaged by
Debian. Update the fetch of dnf-plugins-core to use the last
version of the download plugin that is compatible with the dnf
package in Debian.
Note that we don't use the bookworm dnf-plugins-core package to address
this because dnf-plugins-core specifies that it breaks and replaces
zypper. There doesn't seem to be a good reason for this as there is no
file overlap between the packages according to `apt-file list`.
Change-Id: I6fbf7db87a8272dae2552f9075addec2d5c82e56
It appears that centos-9 stream image builds are broken. We don't
actually care what image we build in this job, so switch to jammy
which should be working.
Change-Id: If574a4b6d26230d7bb98cb2c9eab819a08f10eff
Some drivers were missing docs and/or validation for options that
they actually support. This change (see the sketch below):

adds launch-timeout to:
  - metastatic docs and validation
  - aws validation
  - gce docs and validation

adds post-upload-hook to:
  - aws validation

adds boot-timeout to:
  - metastatic docs and validation

adds launch-retries to:
  - metastatic docs and validation
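A metastatic provider sketch with the newly documented options (the
exact placement of each option is an assumption; values are in
seconds except launch-retries):

  providers:
    - name: meta-provider
      driver: metastatic
      # Overall time to wait for a node launch to complete.
      launch-timeout: 300
      # Time to wait for the node to boot and become reachable.
      boot-timeout: 120
      # Number of launch attempts before failing the request.
      launch-retries: 3
      pools:
        - name: main
          labels:
            - name: small-node
              backing-label: large-node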
Change-Id: Id3f4bb687c1b2c39a1feb926a50c46b23ae9df9a
This change adds the ability to use the k8s (and friends) drivers
to create pods with custom specs. This will allow nodepool admins
to define labels that create pods with options not otherwise supported
by Nodepool, as well as pods with multiple containers.
This can be used to implement the versatile sidecar pattern, which
is useful for running jobs that require a background system process
(such as a database server or container runtime) on systems where
backgrounding such a process is otherwise difficult.
It is still the case that a single resource is returned to Zuul, so
a single pod will be added to the inventory. We therefore document
the expectation that it should be possible to shell into the first
container in the pod.
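A configuration sketch, assuming the new label attribute is named
spec (the images and commands are illustrative):

  providers:
    - name: kubes
      driver: kubernetes
      context: my-context
      pools:
        - name: main
          labels:
            - name: db-sidecar-pod
              type: pod
              spec:
                containers:
                  # Zuul shells into this first container to run
                  # the job.
                  - name: work
                    image: docker.io/library/ubuntu:22.04
                    command: ["/bin/sh", "-c", "sleep infinity"]
                  # Sidecar providing a database to the job.
                  - name: db
                    image: docker.io/library/mariadb:10.11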
Change-Id: I4a24a953a61239a8a52c9e7a2b68a7ec779f7a3d
In I93400cc156d09ea1add4fc753846df923242c0e6 we refactored the
launcher config loading to use the last-modified timestamps of the
config files to detect if a reload is necessary.
In the builder the situation is even worse, as we reload and
compare the config much more often, e.g. in the build worker when
checking for manual or scheduled image updates.
With a larger config (in the 2-3 MB range) this is a significant
performance problem that can lead to builders being busy with
config loading instead of building images.
Yappi profile (performed with the optimization proposed in
I786daa20ca428039a44d14b1e389d4d3fd62a735, which doesn't fully solve the
problem):
name                                    ncall  tsub      ttot      tavg
..py:880 AwsProviderDiskImage.__eq__    812..  17346.57  27435.41  0.000034
..odepool/config.py:281 Label.__eq__    155..  1.189220  27403.11  0.176285
..643 BuildWorker._checkConfigRecent    58     0.000000  27031.40  466.0586
..depool/config.py:118 Config.__eq__    58     0.000000  26733.50  460.9225
Change-Id: I929bdb757eb9e077012b530f6f872bea96ec8bbc