Bump workers_pool_size to 300 and remove queueing of tasks

Especially in a single-conductor environment, the number of threads
should be larger than max_concurrent_deploy; otherwise, the latter cannot
be reached in practice, or reaching it will cause issues with heartbeats.

Additionally, this change fixes an issue with how we use futurist.
Due to a misunderstanding, we ended up setting the worker pool size to
100 while also allowing 100 more requests to be queued.

In short, this change moves from 100 threads + 100 queued to
300 threads and no queue.

Partial-Bug: #2038438
Change-Id: I1aeeda89a8925fbbc2dae752742f0be4bc23bee0
Dmitry Tantsur 2023-10-02 18:42:07 +02:00
parent db549850e0
commit 224cdd726c
3 changed files with 32 additions and 4 deletions

@@ -125,9 +125,12 @@ class BaseConductorManager(object):
         self._keepalive_evt = threading.Event()
         """Event for the keepalive thread."""
-        # TODO(dtantsur): make the threshold configurable?
-        rejection_func = rejection.reject_when_reached(
-            CONF.conductor.workers_pool_size)
+        # NOTE(dtantsur): do not allow queuing work. Given our model, it's
+        # better to reject an incoming request with HTTP 503 or reschedule
+        # a periodic task than end up with hidden backlog that is hard
+        # to track and debug. Using 1 instead of 0 because of how things are
+        # ordered in futurist (it checks for rejection first).
+        rejection_func = rejection.reject_when_reached(1)
         self._executor = futurist.GreenThreadPoolExecutor(
             max_workers=CONF.conductor.workers_pool_size,
             check_and_reject=rejection_func)
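To make the difference concrete, here is a minimal, self-contained Python model of the admission logic. It is a sketch under stated assumptions, not futurist's implementation: `SimulatedPool`, `accepted()` and the local `RejectedSubmission` and `reject_when_reached` are hypothetical stand-ins, tasks never complete, and, following the NOTE comment and the commit message, the rejection check runs before the pool looks for a free worker and only counts queued (not running) tasks.

```python
"""Toy model of a worker pool with a backlog-based rejection check.

Hypothetical sketch: these names are illustrative stand-ins, not futurist's
classes. Tasks never finish, so every accepted task stays running or queued.
"""


class RejectedSubmission(Exception):
    """Work refused by the pool (surfaced to API users as HTTP 503)."""


def reject_when_reached(max_backlog):
    """Build a check that rejects once the queued backlog reaches max_backlog."""
    def _rejector(backlog):
        if backlog >= max_backlog:
            raise RejectedSubmission(
                f"backlog {backlog} is not allowed to reach {max_backlog}")
    return _rejector


class SimulatedPool:
    def __init__(self, max_workers, check_and_reject):
        self.max_workers = max_workers
        self.check_and_reject = check_and_reject
        self.running = 0
        self.queued = 0

    def submit(self):
        # Assumption from the NOTE comment: rejection is checked first,
        # against the current queue length, before any worker lookup.
        self.check_and_reject(self.queued)
        if self.running < self.max_workers:
            self.running += 1  # a free worker picks the task up immediately
        else:
            self.queued += 1   # otherwise the task waits in the backlog


def accepted(pool, attempts):
    """Submit `attempts` tasks and return how many the pool accepted."""
    count = 0
    for _ in range(attempts):
        try:
            pool.submit()
            count += 1
        except RejectedSubmission:
            pass
    return count


# Old setup: backlog limit equal to the pool size.
old = SimulatedPool(100, reject_when_reached(100))
print(accepted(old, 1000), old.running, old.queued)   # 200 100 100

# New setup: 300 workers, backlog limit of 1.
new = SimulatedPool(300, reject_when_reached(1))
print(accepted(new, 1000), new.running, new.queued)   # 301 300 1

# A limit of 0 rejects everything, because the check precedes the worker lookup.
zero = SimulatedPool(300, reject_when_reached(0))
print(accepted(zero, 10), zero.running, zero.queued)  # 0 0 0
```

In this model the old settings admit 200 tasks (100 running, 100 queued), the new ones admit 301 (a single task can still wait), and a threshold of 0 rejects even the very first task, which matches the reason given for using 1.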

@@ -22,7 +22,7 @@ from ironic.common.i18n import _
 opts = [
     cfg.IntOpt('workers_pool_size',
-               default=100, min=3,
+               default=300, min=3,
                help=_('The size of the workers greenthread pool. '
                       'Note that 2 threads will be reserved by the conductor '
                       'itself for handling heart beats and periodic tasks. '

@@ -0,0 +1,25 @@
+---
+issues:
+  - |
+    When configuring a single-conductor environment, make sure the size
+    of the worker pool (``[conductor]workers_pool_size``) is larger than the
+    maximum number of parallel deployments (``[conductor]max_concurrent_deploy``).
+    This was not the case by default previously (the options used to be set
+    to 100 and 250 respectively).
+upgrade:
+  - |
+    Because of a fix in the internal worker pool handling, you may now see
+    requests rejected with HTTP 503 under very high load sooner than
+    before. In this case, try increasing the ``[conductor]workers_pool_size``
+    option or consider adding more conductors.
+  - |
+    The default worker pool size (the ``[conductor]workers_pool_size`` option)
+    has been increased from 100 to 300. You may want to consider increasing
+    it even further if your environment allows that.
+fixes:
+  - |
+    Fixes the handling of new requests when the maximum number of internal
+    workers is reached. Previously, after reaching the maximum number of workers
+    (100 by default), we would queue the same number of requests (100 again).
+    This was not intentional, and now Ironic no longer queues requests if
+    there are no free threads to run them.
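Operators who want to act on the `issues` note before upgrading could run a check along these lines. This is a hypothetical helper, not an Ironic or oslo.config tool: it parses an INI fragment with Python's standard `configparser`, assumes the 2 reserved threads mentioned in the option help text, and falls back to 300 and 250 (the defaults named in this change and its release note) when an option is absent.

```python
import configparser

# Hypothetical pre-upgrade check, not part of Ironic.
RESERVED_THREADS = 2          # per the option help: heartbeats/periodic tasks
DEFAULT_POOL_SIZE = 300       # new workers_pool_size default from this change
DEFAULT_MAX_DEPLOYS = 250     # max_concurrent_deploy default per the release note

SAMPLE_CONF = """
[conductor]
workers_pool_size = 100
max_concurrent_deploy = 250
"""


def check_single_conductor(conf_text):
    """Return a warning string, or None if the pool looks large enough."""
    # interpolation=None so that '%' characters in real config files are
    # not treated as interpolation syntax.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    # fallback= also covers a missing [conductor] section.
    pool = parser.getint("conductor", "workers_pool_size",
                         fallback=DEFAULT_POOL_SIZE)
    deploys = parser.getint("conductor", "max_concurrent_deploy",
                            fallback=DEFAULT_MAX_DEPLOYS)
    if pool - RESERVED_THREADS < deploys:
        return (f"workers_pool_size={pool} minus {RESERVED_THREADS} reserved "
                f"threads is below max_concurrent_deploy={deploys}")
    return None


print(check_single_conductor(SAMPLE_CONF))     # prints a warning (100 vs 250)
print(check_single_conductor("[conductor]\n"))  # None: defaults 300 vs 250
```

The threshold treats each in-flight deployment as needing one worker thread, which is a simplification chosen for illustration; a larger margin leaves room for API requests and other background work.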