After reviewing reports of multiple ccache files cropping up in
logs, we found an issue in the way novajoin initializes and updates
cache files containing keytabs. The result was numerous extra cache
files being created and overwritten.
With this change we ensure that the credentials cache is properly
shared across workers and that when new credentials are being
created, the cache files are locked to avoid potential conflicts.
This change also updates DEBUG-level logging to include useful
cache-troubleshooting breadcrumbs.
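The locking approach can be sketched roughly as follows, assuming a
POSIX flock-style exclusive lock around ccache updates (the function
name and callback are illustrative, not novajoin's actual API):

```python
import fcntl


def update_ccache(path, write_credentials):
    # Hold an exclusive lock while refreshing the cache so that
    # concurrent workers do not clobber each other's writes.
    with open(path, "a+b") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            write_credentials(f)
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```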
Change-Id: I07e0004f77e0d52ab2a2707c5fe50f48f718b717
Co-Authored-By: Ade Lee <alee@redhat.com>
Presently, when novajoin fails to connect to the IPA server, for
any reason, it immediately re-attempts the connection when backoff
is unset (it is off by default). As a result, any timing-related
issue that is the source of the connection problem will likely
result in no connection at all.
This change adds a new configuration option, retry_delay, which
delays subsequent connection attempts by N seconds, where N is the
retry_delay value. By default this is set to 5 seconds, mirroring
internal ipalib behavior [1].
[1] - https://github.com/freeipa/freeipa/blob/master/ipalib/install/kinit.py#L29-L30
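The intended behavior can be sketched like this (the function and
parameter names are illustrative, not novajoin's actual API):

```python
import time


def connect_with_retries(connect, retries, retry_delay=5):
    # Wait retry_delay seconds between attempts instead of
    # hammering the IPA server with immediate reconnects.
    last_error = None
    for attempt in range(retries + 1):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            if attempt < retries:
                time.sleep(retry_delay)
    raise last_error
```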
Change-Id: Iec96e4bd6643c0a657c8db424cc72deb10f170bd
Presently, novajoin has no way of differentiating between hosts and
hostnames. As a result, it is possible for a host to be inadvertently
deleted under certain conditions.
This fix aims to resolve this and other join/delete edge cases by
passing the instance-id (server uuid) from nova along in the
description field that is passed to IdM. We can use this
description and id to ensure we delete only the hosts we meant to.
Overview of changes:
- Persist nova instance-id in IdM's Description field
- Update join logic to handle hosts with old Description field
- Update join logic to cause nova deploy failure when attempting to
add a host with a hostname that is already enrolled
- Add new DuplicateInstanceError exception type
- Add new DeleteInstanceIdMismatch exception type
- Add inline comments documenting code flow
- Add docstrings to IPAClient add_host for clarity
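The delete-side check can be sketched as a pure predicate. This is
illustrative only; the function name and the legacy-host policy
(treating hosts with no recorded id as deletable) are assumptions
for the sketch, not necessarily novajoin's exact behavior:

```python
def may_delete_host(host_description, instance_id):
    # Only delete when the instance-id recorded in IdM's description
    # field matches the instance being torn down. Hosts enrolled
    # before this change have no recorded id (old Description field);
    # this sketch treats them as deletable.
    if not host_description:
        return True
    return host_description == instance_id
```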
Change-Id: I676bac162a6ec35366c506bdb660cf3913131afd
* Fix cloud-init error message when OTP is missing
* Add a log message in novajoin-server
Change-Id: Ib299269c564744af6a5fcded9195d27be1c14ce7
Related-Bug: 1836529
We are having a hard time keeping track of which operations
correspond to which request. This patch adds the ability to track
operations in the notifier with the message_id of the notification
being processed. This message_id (which is generated by
oslo.messaging) is a UUID.
For the server, we could also set the message_id to the request_id
of the python-requests object received, but this is already
logged as part of the server logs.
Change-Id: Ie8b885a2b5cba6684e92c49eed4a99d24621402e
Right now, the backoff mechanism is broken when the backoff is set
to something non-zero: the service retries ad infinitum, leading to
inconsistent behavior.
This change fixes the mechanism so that you only get a fixed number
of retries. You can choose (through a new config parameter) to allow
backoff (or not).
To restore some of the old behavior, the default for the
connect_retries parameter has been increased from 2 to 4, and the
max backoff time has been decreased from 1024 to 512 seconds. It's
unlikely that we'd ever reach that backoff time without a large
number of retries, but 1024 seems too long.
There is also a new exception that is thrown when the connection
fails. This results in proper 500 errors from novajoin-server, and
in log messages from the notifier.
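The resulting schedule can be sketched as a generator over a fixed
number of retries (parameter names mirror the config options above;
the function itself is illustrative):

```python
def backoff_schedule(connect_retries=4, backoff=True, max_backoff=512):
    # Yield the wait (in seconds) before each of a *fixed* number of
    # retries: a doubling backoff capped at max_backoff when enabled,
    # or no wait at all when backoff is disabled.
    delay = 2
    for _ in range(connect_retries):
        yield min(delay, max_backoff) if backoff else 0
        delay *= 2
```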
Change-Id: I10547fbde8966c8694346ed8c054e627bee2ee51
In CI we get a random ResponseNotReady exception,
which is caused by the server closing the keepalive socket.
This will close and retry the connection.
This patch adds this reconnect in a second place that was missed.
Change-Id: I745aea8dcb51598ca7d7a371dce66c7dd6ae8005
This patch also moves the novajoin-install and novajoin-ipa-setup
scripts to the default python scripts directory. This is because
there is no other way to fixup the #! line for python3, apart from
modifying setup.py, which is managed by the global requirements repo.
Change-Id: I21ccb475905feebdb91aa158ce3845744b2f0a5f
This adds support for creating and removing DNS A records when
floating IPs are associated and disassociated in neutron.
novajoin-install and functional tests are enhanced to test it.
Change-Id: I82c83ad9e8c84ddfd4ecfc4d5c3b31a418af97a7
A basic test to check that a spawned instance will be added to and
then deleted from FreeIPA.
This also fixes the novajoin-install script to
work by default on devstack.
Change-Id: Id7e940360ade74d605fef9004c6a5454790c55a4
In CI we get a random ResponseNotReady exception,
which is caused by the server closing the keepalive socket.
This will close and retry the connection.
Change-Id: I28e51450cbfea8bf7a18e5783355b68f806eb999
Change: Id107000b3a667f5724331e281912560cff6f92f0 implemented
caching in the IPAClient. We need to store the OTP in the cache
and return the cached OTP, not the one generated on the join
request in case there is a cache hit, since we do not update
the OTP in FreeIPA when the host is in the cache.
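The cache-hit logic can be sketched as follows (the function name
and cache shape are illustrative):

```python
def join_otp(hostname, new_otp, cache):
    # On a cache hit, return the OTP stored at enrollment time: the
    # host's OTP in FreeIPA was not regenerated, so the freshly
    # generated value would be wrong.
    if hostname in cache:
        return cache[hostname]
    cache[hostname] = new_otp
    return new_otp
```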
Closes-Bug: #1796415
Change-Id: Ic19ee7c2228d275397bc4be04432126fd2f228ec
This adds two caches: one for hosts and another one for services. The
service cache also contains which hosts are managing the service.
This was done in order to reduce the calls to FreeIPA and to try to make
novajoin slightly more efficient.
Note that this was only added to the "add" functions, and the
delete functions merely update the cache. This is because checking
for hosts managing a group would require the cache to be consistent
across all the processes (novajoin can be run in several), and for
that the best approach would be a distributed cache. As this is a
first attempt, we leave that functionality out of scope for this
patch.
Change-Id: Id107000b3a667f5724331e281912560cff6f92f0
Centralize ipalib_imported so that all primary IPA imports are
done in one place.
The issue was that IPA imports were being done in two places, one
of which was simpler and not aware of the import errors of the
other. The symptom was that if one of the moved IPA v4.5.0 imports
failed, this was handled properly in ipa.py, but in util.py, where
only ipalib.api is imported, the import would still succeed. This
mismatch meant that api.finalize() wouldn't have been called in
ipa.py, so any references to api.env.* in util.py would fail.
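The centralization pattern can be sketched like this (module layout
and flag name are illustrative; other modules consult the flag
instead of importing ipalib themselves):

```python
# ipa.py -- the one place primary IPA imports happen.
try:
    from ipalib import api  # noqa: F401
    ipalib_imported = True
except ImportError:
    # ipalib is unavailable, or one of its sub-imports failed; every
    # other module checks this flag rather than importing ipalib.
    ipalib_imported = False
```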
Change-Id: I6016892630510b816721cea5b20c8d6e7f8d34fa
Notifications can be multi-threaded but each thread was sharing
the same IPAClient instance which was causing contention in the
retry code (and likely the ccache).
Move the IPA object closer to execution so each notification will
have its own instance. This will mean more kinit activity but
each one will be isolated and be able to handle expired tickets,
IPA being down, network issues, etc.
Add backoff code on failures so we don't spam the IPA server with
retries. It is a doubling backoff from 2 to 1024 seconds.
NOTE: novajoin-server will not use this backoff code. It will
continue to use the retries configuration setting, because the
server has a limited window in which to respond and so cannot
retry indefinitely the way the notification handler can.
Change-Id: Ia18d3f97f7549c89dcf4e6f014f44c3fcebc919f
IPA < 4.4.0 will fail if updatedns is True and no DNS entries
exist when deleting a host, so we disabled updatedns for the
subhost case.
This can result in leftover DNS entries.
So instead, drop updatedns for all delete cases and make a call to
dnsrecord-del to delete all DNS entries for the hostname.
Change-Id: I6650c99811001adcb4417c3369d5df077b06d765
I intended to use upstream code for splitting the principal, but
the semantics were just too different, so I'm using a version
similar to IPA v4.0.0's, which does some extra validation that we
probably don't need but will split a principal without requiring a
REALM.
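A minimal sketch of such a splitter, with the realm made optional
(the validation here is illustrative, not the exact IPA v4.0.0
logic):

```python
def split_principal(principal):
    # Split "service/hostname[@REALM]" into its parts; unlike the
    # upstream helper, the realm may be absent.
    realm = None
    if "@" in principal:
        principal, realm = principal.rsplit("@", 1)
    service, sep, hostname = principal.partition("/")
    if not sep or not service or not hostname:
        raise ValueError("malformed principal: %r" % principal)
    return service, hostname, realm
```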
Change-Id: Ib6c8c6bab7694380c85dda230e7ec018c90d44d1
Remove some duplicated code, handle KerberosError and make
get_host_and_realm() a top-level method.
Add integration test for IPA connection code. It must be run
manually against a current installation at the moment.
Make the default IPA retry count two so we actually do a retry.
Change-Id: I6c0b52ed964851eff43c2d3fb209a90f9b1539e4
IPA 4.4 added thin client capabilities. This is done by downloading
the call schema from the IPA server and is done during the
finalize() step. This requires a TGT.
So we need to ensure that a kinit is done before finalize() is
called both in the standalone installer and in the ipa code.
Change-Id: Id87b83cb945c946cf78c425aae19c311d900249a
These methods are used to decide whether a given host or service
can be removed or not. We want to avoid deleting hosts that
manage other services or services being managed by other hosts.
Change-Id: I45f3d49e53a3f1bdeed149ae0790d820c87a9a58
With all the managed services being added to the controllers,
the operations with IPA were taking too long and the nova metadata
read timeout was expiring. One solution is to add the services
operations in a single batch, reducing round trip and re-auth time.
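The batching idea can be sketched as a builder for a single IPA
batch call; the payload shape follows ipalib's batch plugin
convention, and the principal values and the passing of the result
to api.Command.batch are assumptions of this sketch:

```python
def build_service_batch(principals):
    # Build the argument list for one IPA `batch` call, so N
    # service-adds cost a single round trip (and a single re-auth)
    # instead of N. Each entry pairs a method name with its
    # (args, options) parameters.
    return [{"method": "service_add",
             "params": [[principal], {}]}
            for principal in principals]
```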
Change-Id: I78cacb1deb876185857e80154cf9fbec5b7d65d1
This will use the IPA domain by default. If a domain is specified
in the configuration, it will be used instead.
For testing, if there is no config option and IPA is not available
then the domain 'test' will be mocked.
Change-Id: I21b66c61830c49a79094927f42c22c76ec1b447c
flake8 is quite a bit more picky and discovered a lot of
issues.
Don't let missing configuration blow things up in order to be
able to run unit tests.
Hacky workaround for missing ipalib/ipapython on PyPI
It appears that the metadata can be retrieved multiple times
when standing up a new instance. This is problematic for OTP
because we only have the clear value during the add phase so
if the host already exists we blow up.
I had been using a cache to store the value based on instance_id
but it wouldn't scale or work with HA without using a real
database backend. So instead the last update wins.
The previous fix was incomplete. The exception handling was
wrong so it wasn't being caught and even if it had been the
old connection was still "open" which caused creating a new
one to throw its own exception.
Use ipalib to handle the RPC rather than python-requests. This
is more robust and is a whole lot less work. It also solves
the problem of re-using credentials and better handling stale
ccache.
Since I have credentials I don't need a cache to determine an
instance hostname so do away with that.
This will also allow enrollment if an image has the ipa_enroll
property set to True.
Finally, don't re-enforce ipa_enroll in the IPA code; that should
happen beforehand.
The nova team doesn't want to pass in system_metadata which
contained the image metadata because the format/naming may
change over time. They'd prefer the caller fetch the image
metadata.
Added that, using the current authenticated token.
At some point will need to add the ability to generate a
token using some special service user given that some/most/all
requests to metadata will be unauthenticated.
Set the IPA domain in join.conf so hostnames will get the
IPA domain, instance_name + domain.
Don't blow up if metadata or system_metadata comes in as None.
Add some missing variable definitions caught by pylint.
Read join.conf in the notify server as well.
Re-order the kinit in the installation script to not fail
if the user has no pre-existing ticket.
Keep join.conf and api-paste.ini from going into
/usr/share/novajoin.
This is based heavily on the WSGI code in cinder.
There are two services: a REST service and a notification
listener.
Currently both log only to stdout.
The configuration file join.conf controls the REST service.
The nova configuration should look like this (assuming the REST
service is running on the nova compute host):
vendordata_providers = StaticJSON, DynamicJSON
vendordata_dynamic_targets = 'join@http://127.0.0.1:9999/v1/'
vendordata_driver = nova.api.metadata.vendordata_http.HTTPFileVendorData
vendordata_dynamic_connect_timeout = 5
vendordata_dynamic_read_timeout = 30
vendordata_jsonfile_path = /etc/nova/cloud-config.json
For the notification service, the configuration looks like this:
notification_driver = messaging
notification_topic = notifications
notify_on_state_change = vm_state
Authentication is disabled in api-paste.ini for now.