Zuul test: tests.unit.test_scheduler.TestSchedulerSSL.test_jobs_executed
fails on ubuntu focal with the following exception:
Traceback (most recent call last):
File "/home/gchauvel/zuul/zuul/.tox/py38/lib/python3.8/site-packages/gear/__init__.py", line 2835, in _doConnectLoop
self.connectLoop()
File "/home/gchauvel/zuul/zuul/.tox/py38/lib/python3.8/site-packages/gear/__init__.py", line 2865, in connectLoop
c = context.wrap_socket(c, server_side=True)
File "/usr/lib/python3.8/ssl.py", line 500, in wrap_socket
return self.sslsocket_class._create(
File "/usr/lib/python3.8/ssl.py", line 1040, in _create
self.do_handshake()
File "/usr/lib/python3.8/ssl.py", line 1309, in do_handshake
self._sslobj.do_handshake()
ssl.SSLError: [SSL] internal error (_ssl.c:1108)
This is due to libssl1.1 being compiled with
"-DOPENSSL_TLS_SECURITY_LEVEL=2" and gear forcing TLSv1.0
Extracted from ubuntu source package:
The default security level for TLS connections was increased from
level 1 to level 2. This moves from the 80 bit security level to the
112 bit security level and will require 2048 bit or larger RSA and
DHE keys, 224 bit or larger ECC keys, SHA-2, TLSv1.2 or DTLSv1.2.
Allowing to negotiate TLS to the highest available version between
server and client solves the issue, provided that TLSv1.2 is useable.
The option is supported by in the latest version of all pythons >=3.5
[1][2][3]. Unfortunately Xenial doesn't have latest 3.5 and lacks the
ssl.PROTOCOL_TLS definition. We provide a fallback to select the highest
version of TLS supported in that case.
There is some risk using the fallback beacuse both the client and server
need to agree on the version supported in this case. Xenial python 3.5
does support TLSv1_2 which means that for all practical purposes TLS
v1.2 should be available on all platforms that gear runs avoiding this
problem.
Disable TLSv1.3:
According to https://bugs.python.org/issue43622#msg389497, an event on
ssl socket can happen without data being available at application level.
As gear is using a polling loop with multiple file descriptors and ssl
socket used as a blocking one, a blocked state could happen.
This is highlighted by Zuul SSL test: TestSchedulerSSL, where such
blocked state appears consistently.
note: gear tests and zuul tests are ok when using TLSv1.2 but the
previous behavior could also happen
[1] https://docs.python.org/2.7/library/ssl.html?highlight=protocol_tls#ssl.PROTOCOL_TLS
[2] https://docs.python.org/3.5/library/ssl.html?highlight=protocol_tls#ssl.PROTOCOL_TLS
[3] https://docs.python.org/3/library/ssl.html?highlight=protocol_tls#ssl.PROTOCOL_TLS
Change-Id: I5efb6c0576987815c5b93f8bc4020cdee2898d04
gear/__init__.py:
Modified Server.wakeConnections() so that it randomizes the
list of active connections before sending out NOOP's to them.
This will hopefully spread workload across machines more evenly
when there are multiple workers per machine.
Reference: https://phabricator.wikimedia.org/T258630
Change-Id: I05dcb9fa383f3aefc8b5b1bb9dd8b3ff6ff7f37d
Depends-on: https://review.opendev.org/742165
On Fedora rawhide the gear package no longer build.
https://koschei.fedoraproject.org/package/python-gear?
This patch ensures that the ssl engine does not complains about:
- ssl.SSLError: [SSL: EE_KEY_TOO_SMALL] ee key too small (_ssl.c:2951)
- ssl.SSLError: [SSL: CA_MD_TOO_WEAK] ca md too weak (_ssl.c:2951)
To reproduce the issue:
podman run -it --root fedora:rawhide
dnf install git libffi-devel python-devel tox gcc
git clone https://opendev.org/opendev/gear.git && cd gear
tox -epy39
tox -epy38
Change-Id: I57cd9c4750f27b7b76e92a0eef03e7de70c13dd5
This floods the zuul test logs with poll messages which indicates that
something changed to a busy loop.
This reverts commit 103ad3e8ed.
Change-Id: Id3347136507e7e65ccde937f1c2fd303aa3dfbbe
Gear currently supports keepalive only on linux platforms. On mac the
socket must be configured differently. For now just ignore the
keepalive flag in this case and emit a warning.
Change-Id: I276967b720742fa64e5bc6eb769c75590141275c
Switch between Epoll and Poll depending on the OS capabilities.
Change-Id: Iaf1324d0c9d43c76e3f228f1a176e453a82a71a4
Co-Authored-By: Jan Kubovy <jan.kubovy@bmw.de>
Co-Authored-By: Tobias Henkel <tobias.henkel@bmw.de>
According to the python docs [1] it is recommended to use
SSLContext.wrap_socket to create an ssl connection since Python 3.2
and 2.7.9. This enables us to also leverage server name indication
(SNI).
One use case where SNI is beneficial is an easy and standard way to
route traffic into an Openshift cluster. The most common way to get
traffic into an Openshift cluster is using a routes. The routes in an
openshift cluster work with either HTTP, HTTPS with SNI or TLS with
SNI [2]. TLS with SNI in this case also works with non-http
connections like gearman is using.
[1] https://docs.python.org/3/library/ssl.html#socket-creation
[2] https://docs.okd.io/3.11/dev_guide/expose_service/expose_internal_ip_router.html#overview
Change-Id: I19c1edc4a14a303d2a91894e0065c8d31f89ce24
If the listen address allows for ipv4 or ipv6 values we want to prefer
ipv6 if the host is configured with working ivp6. We add the ai_flag
AF_ADDRCONFIG to filter out ipv6 (and ipv4) if the host isn't configure
for this AF_INET version. Then we sort based on the family to prefer
ipv6 over ipv4.
The reason for this is clients will prefer ipv6 before falling back to
ipv4 when attempting to connect to a hostname. If the server isn't
listening on ipv6 this makes new connections happen slowly.
Change-Id: I9f7a235b04068856c6cceeb2c230f3b56945572e
A gearman client only waiting for jobs will wait indefinitely if the
gearman server vanishes (e.g. due to a VM crash). In this case there
is no traffic on the connection and the client blocks forever if there
is nothing in between that forcefully terminates the connection.
Adding tcp keepalive can mitigate that and the connection will be
terminated by the kernel in this situation which then triggers a
reconnect.
Change-Id: I8589cd45450245a25539c051355b38d16ee9f4b9
Using Python3, Gearman unexpectedly closed the socket with a client when typing the 'workers' command.
gearman-debug.log:
File "/usr/local/lib/python3.5/dist-packages/gear/__init__.py", line 3250, in handleWorkers
(fd, ip, client_id.decode('utf8'),
AttributeError: 'str' object has no attribute 'decode'
Change-Id: I610bd44c76a0e52f8d4e8f24c82c636d4ebef0ae
Add the ability for an operator to control which interface to listen
on. By default we use None to maintain backwards compatibility.
Change-Id: I14c13ff500317d5a7b580e1b2a7f798a8db5de1d
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
After registering a function, if a connection is sleeping (ie,
waiting for a NOOP to wake it up), cause it to send another
grab_job in case the newly registered job is something it can
handle.
Change-Id: Ibea13726f4a451ebc67850b17e168bd4accfbc0b
When connection to gearman server is timeout, it's necessary to
release lock before raising a exception, otherwise the client may
be dead locked.
Change-Id: Idc9c9c4bd439b122dc7855ca05d962c4e6687829
Signed-off-by: shangxdy <shang.xiaodong@zte.com.cn>
Keep track of the three main queue stats we report, so that we don't
have to iterate over all the queued jobs to determine whether they
are running.
Also, drop the workers stat because it's not very useful and not
entirely trivial to calculate.
Change-Id: Id42a05e5626096d1100a9cb9e8166c8ec5103b41
The sendRaw method (and therefore sendPacket) was originally
thread safe by virtue of consisting of a single socket.send()
call, but when we added SSL, we added a loop within the method
to handle the increased likelihood that not all data would be
sent in one call. Of course, the method should have been written
this way to start with.
However, this means that we can end up with partial packets being
sent before a context switch to another thread which may also
want to send a packet. To handle that case, place the entire
method in a lock.
Note, this doesn't affect server connections as they use a
non-blocking connection which has a send queue, so only one thread
ever actuall transmits.
Change-Id: I3bda6fda5f762d18f28b56a43b7dc28f37dbc427
These classes will provide a smoother path from python2 -> python3 when
using gear, as it will relieve the need for encoding/decoding.
Change-Id: I93bfe33f898294f30a82c0a24a18a081f9752354
The method assertEquals has been deprecated since python 2.7.
http://docs.python.org/2/library/unittest.html#deprecated-aliases
Also in Python 3, a deprecated warning is raised when using assertEquals
therefore we should use assertEqual instead.
Change-Id: Ic99ea3f57bb6fd986ce2626e800e3d310a2e3df7
Closes-Bug: #1218185
This breaks the API in subtle ways, but only really for people who want
to use Python 3 and/or not utf-8 function names.
Change-Id: If6bfc35d916cfb84d630af59f4fde4ccae5187d4
The line in tox.ini "select=H231" did not add H231 to the list of flake8
checks. It made us only run H231.
Remove the H231 line and then fix the flake8 issues. The most
substantial of these was the shadowing of the queue module with local
variables named queue. Made that one import queue as queue_mod because
changing the variable names in the code was yuckier to look at.
Change-Id: I4a26a20889132f7f4525b17392088dda6bd5bbd2
Both returning the incomplete data from the isComplete method, and
scanning the entire incomplete string for the terminator each time
are O(n^2).
Also, returning the string is *very* slow, accounting for 99.6% of
the speedup; scanning the string each time accounts for 0.6%.
Both changes together find the terminus of a 481MiB response in
about 1.6 seconds compared to 76 minutes prior to this change.
Change-Id: I3614e8f3ff8ad7c3d9ca6da1b520b89cd9d5d603
If a connection is closed while sending a large amount of data,
the send() call may return EAGAIN, even though the socket is in
CLOSE_WAIT. In that case, just continue queuing the data, and
let the main loop handle the disconnect (which it will detect
via the poll call followed by a null read).
Change-Id: Ib15eae81077b58d2f95fb0989ed1139c6d542f49
Support configuration of TCP keepalives in gear server.
When connections are closed not gracefully they stays opened
for a long time, which leads to file descriptors leak.
Default values for interval, probe and count are such as in
most Linux distributions.
By default is disabled to keep compatibility with older versions.
Change-Id: I335f68a24dda409b1a48c6e4e520ec08dfd38078
Don't encode string type in the job name in handleRequest(). Without
the explicit .decode() before passing to the string formatter it
would end up looking like b"b'<jobname'\t\n...." which made for
working with gear on the other side near impossible because a
decode of the job data would not result in a consumable string
matching expected values.
Now there is an explicit decode prior to formatting the string
data.
Change-Id: Ib2996b84bce719a2f91c166aaa4278c18f89f88f
I noticed fedora doesn't yet ship python3-gear so this is the main
reason to add six support.
Change-Id: I3d8d37870a0ab865dc0a32e123bfd16e91fa5b72
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
In `gear.Server._lostConnection()` the jobs list pulls .values() from
`conn.related_jobs`, in `gear.server._removeJob()` it is possible to
change the conn.related_jobs dictionary which will cause a
`RuntimeError: dictionary changed size during iteration` exception to
be raised in python 3 since .values() is an iterator instead of a
rendered list() like in python 2. This RuntimeError is eliminated
by forcing the python 2 behavior by wrapping the `.values()` call
to be wrapped into an explicit list.
Change-Id: I730473f3200a427cce774df534f1ff4958866dac
The shebang is not needed in cmd/geard.py. The file
is not intended to be used directly as it is managed
by an entry_point.
Change-Id: I4e2a8715cca81a8626f6bcd86a368a537305bc0b
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
If we have already performed a shutdown we can prevent exploding when
writing to self.wake_write if we short circuit.
Change-Id: If5352a373d5fa61dd1ee661f4a37976b3447dd9d
This was an attempt to find a code path that could cause
stopWaitingForJobs to raise an exception. So far, that has failed,
but add this test anyway to exercise some of that code.
Change-Id: I39955fc7250af9b93f336c7c3e0315528bd8d4a9
When EAGAIN is returned on a socket write, do not raise RetryIOError.
This exception is used principally to indicate that we have encountered
a blocked read, and a method farther up the call stack may need to
perform some cleanup. However, within the sendQueuedData method,
all necessary cleanup is performed by the finally handler itself.
Therefore, it is safe to ignore a blocked write, and count on either
a subsequent write (which will append data to the queue and retry
sending the oldest data) or the poll edge trigger to indicate that
we should retry the write.
There is nothing in the call stack above this method that has an
exception handler for RetryIOError. Therefore, in the current state,
the raised exception was causing connections to be disconnected
due to the error.
Change-Id: I9211fb6365f8f3b6dd0310430cf97926ce1f5d07
The Python 2.7 threading.Condition.wait() function will block the
process from responding to any signals until the condition triggers.
This was especially bad for commandline applications. Before this
commit, if an application didn't manage to connect to the Gearman server
it would block forever and ignore CTRL+C.
Adding a 'timeout' works around this issue. Behaviour is unchanged
except that the 'Waiting for at least one active connection' debug
message will now be logged once per second.
This problem is fixed upstream in Python 3.2:
<https://bugs.python.org/issue8844>
Change-Id: Ib1043948b1b37a4a6732176314b8a243aad73397
In d5e03ac930, an optional argument
to wakeConnections was added so that only connections related to
a job would be awoken. However, the default value, None, meant
that no connection would be awoken. This doesn't make much sense,
and indeed, Zuul, in its unit tests, was relying on all connections
being awoken. Fix this by awakening all connections if the job
argument is None (which never happens within gear itself).
Change-Id: Ib7632ae0353c414cc1b68dbe430cb57292058012