Commit Graph

11003 Commits

Author SHA1 Message Date
Willy Tarreau
8081abe26a CLEANUP: connection: conn->xprt is never NULL
Let's remove this outdated test that's been there since 1.5. For quite
some time now xprt hasn't been NULL anymore on an initialized connection.
2019-12-27 14:04:33 +01:00
Willy Tarreau
70ccb2cddf BUG/MINOR: connection: only wake send/recv callbacks if the FD is active
Since commit c3df4507fa ("MEDIUM: connections: Wake the upper layer even
if sending/receiving is disabled.") the send/recv callbacks are called
on I/O if the FD is ready and not just if it's active. This means that
in some situations (e.g. send ready but nothing to send) we may
needlessly enter the if() block, notice we're not subscribed, set
io_available=1 and call the wake() callback even if we're just called
for read activity. Better make sure we only do this when the FD is
active in that direction..

This may be backported as far as 2.0 though it should remain under
observation for a few weeks first as the risk of harm by a mistake
is higher than the trouble it should cause.
2019-12-27 14:04:33 +01:00
Willy Tarreau
c8dc20a825 BUG/MINOR: checks: refine which errno values are really errors.
Two regtest regularly fail in a random fashion depending on the machine's
load (one could really wonder if it's really worth keeping such
unreproducible tests) :
  - tcp-check_multiple_ports.vtc
  - 4be_1srv_smtpchk_httpchk_layer47errors.vtc

It happens that one of the reason is the time it takes to connect to
the local socket (hence the load-dependent aspect): if connect() on the
loopback returns EINPROGRESS then this status is reported instead of a
real error. Normally such a test is expected to see the error cleaned
by tcp_connect_probe() but it really depends on the timing and instead
we may very well send() first and see this error. The problem is that
everything is collected based on errno, hoping it won't get molested
in the way from the last unsuccesful syscall to wake_srv_chk(), which
obviously is hard to guarantee.

This patch at least makes sure that a few non-errors are reported as
zero just like EAGAIN. It doesn't fix the root cause but makes it less
likely to report incorrect failures.

This fix could be backported as far as 1.9.
2019-12-27 14:04:33 +01:00
Ilya Shipitsin
7aed6ef8e3 BUILD: travis-ci: reenable address sanitizer for clang builds
address sanitizer was temporarily disabled. after getting rid of
LD_LIBRARY_PATH manipulation it works again, so let us enable it
2019-12-26 06:30:21 +01:00
Ilya Shipitsin
1afd2359eb BUILD: travis-ci: link with ssl libraries using rpath instead of LD_LIBRARY_PATH/DYLD_LIBRARY_PATH
modifying LD_LIBRARY_PATH/DYLD_LIBRARY_PATH also affects other utilities like curl
to avoid side effects let us use rpath for ssl library linking

Fixes #418
2019-12-21 12:14:21 +01:00
Lukas Tribus
a26d1e1324 BUILD: ssl: improve SSL_CTX_set_ecdh_auto compatibility
SSL_CTX_set_ecdh_auto() is not defined when OpenSSL 1.1.1 is compiled
with the no-deprecated option. Remove existing, incomplete guards and
add a compatibility macro in openssl-compat.h, just as OpenSSL does:

bf4006a6f9/include/openssl/ssl.h (L1486)

This should be backported as far as 2.0 and probably even 1.9.
2019-12-21 06:46:55 +01:00
Christopher Faulet
eec7f8ac01 BUG/MEDIUM: stream: Be sure to never assign a TCP backend to an HTX stream
With a TCP frontend, it is possible to upgrade a connection to HTTP when the
backend is in HTTP mode. Concretly the upgrade install a new mux. So, once it is
done, the downgrade to TCP is no longer possible. So we must take care to never
assign a TCP backend to a stream on this connection. Otherwise, HAProxy crashes
because raw data from the server are handled as structured data on the client
side.

This patch fixes the issue #420. It must be backported to all versions
supporting the HTX.
2019-12-20 18:09:49 +01:00
Christopher Faulet
6716cc2b93 BUG/MAJOR: mux-h1: Don't pretend the input channel's buffer is full if empty
A regression was introduced by the commit 76014fd1 ("MEDIUM: h1-htx: Add HTX EOM
block when the message is in H1_MSG_DONE state"). When nothing is copied in the
channel's buffer when the input message is parsed, we erroneously pretend it is
because there is not enough room by setting the CS_FL_WANT_ROOM flag on the
conn-stream. This happens when a partial request is parsed. Because of this
flag, we never try anymore to get input data from the mux because we first wait
for more room in the channel's buffer, which is empty. Because of this bug, it
is pretty easy to freeze a h1 connection.

To fix the bug, we must obsiously set the CS_FL_WANT_ROOM flag only when there
are still data to transfer while the channel's buffer is not empty.

This patch must be backported if the patch 76014fd1 is backported too. So for
now, no backport needed.
2019-12-20 18:09:19 +01:00
Willy Tarreau
ca7a5af664 BUG/MINOR: state-file: do not leak memory on parse errors
Issue #417 reports a possible memory leak in the state-file loading code.
There's one such place in the loop which corresponds to parsing errors
where the curreently allocated line is not freed when dropped. In any
case this is very minor in that no more than the file's length may be
lost in the worst case, considering that the whole file is kept anyway
in case of success. This fix addresses this.

It should be backported to 2.1.
2019-12-20 17:33:05 +01:00
Willy Tarreau
fd1aa01f72 BUG/MINOR: state-file: do not store duplicates in the global tree
The global state file tree isn't configured for unique keys, so if an
entry appears multiple times, e.g. due to a bogus script that concatenates
entries multiple times, this will needlessly eat memory. Let's just drop
duplicates.

This should be backported to 2.1.
2019-12-20 17:23:40 +01:00
Willy Tarreau
7d6a1fa311 BUG/MEDIUM: state-file: do not allocate a full buffer for each server entry
Starting haproxy with a state file of 700k servers eats 11.2 GB of RAM
due to a mistake in the function that loads the strings into a tree: it
allocates a full buffer for each backend+server name instead of allocating
just the required string. By just fixing this we're down to 80 MB.

This should be backported to 2.1.
2019-12-20 17:18:13 +01:00
Rosen Penev
b3814c2ca8 BUG/MINOR: ssl: openssl-compat: Fix getm_ defines
LIBRESSL_VERSION_NUMBER evaluates to 0 under OpenSSL, making the condition
always true. Check for the define before checking it.

Signed-off-by: Rosen Penev <rosenp@gmail.com>

[wt: to be backported as far as 1.9]
2019-12-20 16:01:31 +01:00
Willy Tarreau
8c4c1d4299 REGTEST: make the "set ssl cert" require version 2.1
This test fails on 2.0 and earlier since the feature was introduced in 2.1,
let's add the REQUIRE_VERSION tag.
2019-12-20 14:35:18 +01:00
Olivier Houchard
fc51f0f588 BUG/MEDIUM: fd/threads: fix a concurrency issue between add and rm on the same fd
There's a very hard-to-trigger bug in the FD list code where the
fd_add_to_fd_list() function assumes that if the FD it's trying to add
is already locked, it's in the process of being added. Unfortunately, it
can also be in the process of being removed. It is very hard to trigger
because it requires that one thread is removing the FD while another one
is adding it. First very few FDs run on multiple threads (listeners and
DNS), and second, it does not make sense to add and remove the FD at the
same time.

In practice the DNS code built on the older callback-only model does
perform bursts of fd_want_send() for all resolvers at once when it wants
to send a new query (dns_send_query()). And this is more likely to happen
when here are lots of resolutions in parallel and many resolvers, because
the dns_response_recv() callback can also trigger a series of queries on
all resolvers for each invalid response it receives. This means that it
really is perfectly possible to both stop and start in parallel during
short periods of time there.

This issue was not reported before 2.1, but 2.1 had the FD cache, built
on the exact same code base. It's very possible that the issue caused
exactly the opposite situation, where an event was occasionally lost,
causing a DNS retry that worked, and nobody noticing the problem in the
end. In 2.1 the lost entries are the updates asking for not polling for
writes anymore, and the effect is that the poller contiuously reports
writability on the socket when the issue happens.

This patch fixes bug #416 and must be backported as far as 1.8, and
absolutely requires that previous commit "MINOR: fd/threads: make
_GET_NEXT()/_GET_PREV() use the volatile attribute" is backported as
well otherwise it will make the issue worse.

Special thanks to Julien Pivotto for setting up a reliable reproducer
for this difficult issue.
2019-12-20 08:09:28 +01:00
Willy Tarreau
337fb719ee MINOR: fd/threads: make _GET_NEXT()/_GET_PREV() use the volatile attribute
These macros are either used between atomic ops which cause the volatile
to be implicit, or with an explicit volatile cast. However not having it
in the macro causes some traps in the code because certain loop paths
cannot safely be used without risking infinite loops if one isn't careful
enough.

Let's place the volatile attribute inside the macros and remove them from
the explicit places to avoid this. It was verified that the output executable
remains exactly the same byte-wise.
2019-12-20 08:09:28 +01:00
Olivier Houchard
54907bb848 BUG/MEDIUM: ssl: Revamp the way early data are handled.
Instead of attempting to read the early data only when the upper layer asks
for data, allocate a temporary buffer, stored in the ssl_sock_ctx, and put
all the early data in there. Requiring that the upper layer takes care of it
means that if for some reason the upper layer wants to emit data before it
has totally read the early data, we will be stuck forever.

This should be backported to 2.1 and 2.0.
This may fix github issue #411.
2019-12-19 15:22:04 +01:00
Willy Tarreau
dd0e89a084 BUG/MAJOR: task: add a new TASK_SHARED_WQ flag to fix foreing requeuing
Since 1.9 with commit b20aa9eef3 ("MAJOR: tasks: create per-thread wait
queues") a task bound to a single thread will not use locks when being
queued or dequeued because the wait queue is assumed to be the owner
thread's.

But there exists a rare situation where this is not true: the health
check tasks may be running on one thread waiting for a response, and
may in parallel be requeued by another thread calling health_adjust()
after a detecting a response error in traffic when "observe l7" is set,
and "fastinter" is lower than "inter", requiring to shorten the running
check's timeout. In this case, the task being requeued was present in
another thread's wait queue, thus opening a race during task_unlink_wq(),
and gets requeued into the calling thread's wait queue instead of the
running one's, opening a second race here.

This patch aims at protecting against the risk of calling task_unlink_wq()
from one thread while the task is queued on another thread, hence unlocked,
by introducing a new TASK_SHARED_WQ flag.

This new flag indicates that a task's position in the wait queue may be
adjusted by other threads than then one currently executing it. This means
that such WQ manipulations must be performed under a lock. There are two
types of such tasks:
  - the global ones, using the global wait queue (technically speaking,
    those whose thread_mask has at least 2 bits set).
  - some local ones, which for now will be placed into the global wait
    queue as well in order to benefit from its lock.

The flag is automatically set on initialization if the task's thread mask
indicates more than one thread. The caller must also set it if it intends
to let other threads update the task's expiration delay (e.g. delegated
I/Os), or if it intends to change the task's affinity over time as this
could lead to the same situation.

Right now only the situation described above seems to be affected by this
issue, and it is very difficult to trigger, and even then, will often have
no visible effect beyond stopping the checks for example once the race is
met. On my laptop it is feasible with the following config, chained to
httpterm:

    global
        maxconn 400 # provoke FD errors, calling health_adjust()

    defaults
        mode http
        timeout client 10s
        timeout server 10s
        timeout connect 10s

    listen px
        bind :8001
        option httpchk /?t=50
        server sback 127.0.0.1:8000 backup
        server-template s 0-999 127.0.0.1:8000 check port 8001 inter 100 fastinter 10 observe layer7

This patch will automatically address the case for the checks because
check tasks are created with multiple threads bound and will get the
TASK_SHARED_WQ flag set.

If in the future more tasks need to rely on this (multi-threaded muxes
for example) and the use of the global wait queue becomes a bottleneck
again, then it should not be too difficult to place locks on the local
wait queues and queue the task on its bound thread.

This patch needs to be backported to 2.1, 2.0 and 1.9. It depends on
previous patch "MINOR: task: only check TASK_WOKEN_ANY to decide to
requeue a task".

Many thanks to William Dauchy for providing detailed traces allowing to
spot the problem.
2019-12-19 14:42:22 +01:00
Willy Tarreau
8fe4253bf6 MINOR: task: only check TASK_WOKEN_ANY to decide to requeue a task
After processing a task, its RUNNING bit is cleared and at the same time
we check for other bits to decide whether to requeue the task or not. It
happens that we only want to check the TASK_WOKEN_* bits, because :
  - TASK_RUNNING was just cleared
  - TASK_GLOBAL and TASK_QUEUE cannot be set yet as the task was running,
    preventing it from being requeued

It's important not to catch yet undefined flags there because it would
prevent addition of new task flags. This also shows more clearly that
waking a task up with flags 0 is not something safe to do as the task
will not be woken up if it's already running.
2019-12-19 14:42:22 +01:00
William Lallemand
d5b464bfee REGTEST: run-regtests: implement #REQUIRE_BINARIES
Implement #REQUIRE_BINARIES for vtc files.

The run-regtests.sh script will check if the binary is available in the
environment, if not, it wil disable the vtc.
2019-12-19 14:36:46 +01:00
William Lallemand
9c1aa0a2a1 REGTEST: ssl: test the "set ssl cert" CLI command
Add a reg-test which test the update of a certificate over the CLI. This
test requires socat and curl.

This commit also adds an ECDSA certificate in the ssl directory.
2019-12-19 13:51:38 +01:00
Willy Tarreau
262c3f1a00 MINOR: http: add a new "replace-path" action
This action is very similar to "replace-uri" except that it only acts on the
path component. This is assumed to better match users' expectations when they
used to rely on "replace-uri" in HTTP/1 because mostly origin forms were used
in H1 while mostly absolute URI form is used in H2, and their rules very often
start with a '/', and as such do not match.

It could help users to get this backported to 2.0 and 2.1.
2019-12-19 09:24:57 +01:00
Willy Tarreau
0851fd5eef MINOR: debug: support logging to various sinks
As discussed in the thread below [1], the debug converter is currently
not of much use given that it's only built when DEBUG_EXPR is set, and
it is limited to stderr only.

This patch changes this to make it take an optional prefix and an optional
target sink so that it can log to stdout, stderr or a ring buffer. The
default output is the "buf0" ring buffer, that can be consulted from the
CLI.

[1] https://www.mail-archive.com/haproxy@formilux.org/msg35671.html

Note: if this patch is backported, it also requires the following commit to
work: 46dfd78cbf ("BUG/MINOR: sample: always check converters' arguments").
2019-12-19 09:19:13 +01:00
William Lallemand
ba22e901b3 BUG/MINOR: ssl/cli: fix build for openssl < 1.0.2
Commit d4f946c ("MINOR: ssl/cli: 'show ssl cert' give information on the
certificates") introduced a build issue with openssl version < 1.0.2
because it uses the certificate bundles.
2019-12-18 20:40:20 +01:00
William Lallemand
d4f946c469 MINOR: ssl/cli: 'show ssl cert' give information on the certificates
Implement the 'show ssl cert' command on the CLI which list the frontend
certificates. With a certificate name in parameter it will show more
details.
2019-12-18 18:16:34 +01:00
Olivier Houchard
545989f37f BUG/MEDIUM: ssl: Don't set the max early data we can receive too early.
When accepting the max early data, don't set it on the SSL_CTX while parsing
the configuration, as at this point global.tune.maxrewrite may still be -1,
either because it was not set, or because it hasn't been set yet. Instead,
set it for each connection, just after we created the new SSL.
Not doing so meant that we could pretend to accept early data bigger than one
of our buffer.

This should be backported to 2.1, 2.0, 1.9 and 1.8.
2019-12-17 15:45:38 +01:00
Tim Duesterhus
cd3732456b MINOR: sample: Validate the number of bits for the sha2 converter
Instead of failing the conversion when an invalid number of bits is
given the sha2 converter now fails with an appropriate error message
during startup.

The sha2 converter was introduced in d437630237,
which is in 2.1 and higher.
2019-12-17 13:28:00 +01:00
Willy Tarreau
46dfd78cbf BUG/MINOR: sample: always check converters' arguments
In 1.5-dev20, sample-fetch arguments parsing was addresse by commit
689a1df0a1 ("BUG/MEDIUM: sample: simplify and fix the argument parsing").
The issue was that argument checks were not run for sample-fetches if
parenthesis were not present. Surprisingly, the fix was mde only for
sample-fetches and not for converters which suffer from the exact same
problem. There are even a few comments in the code mentioning that some
argument validation functions are not called when arguments are missing.

This fix applies the exact same method as the one above. The impact of
this bug is limited because over the years the code has learned to work
around this issue instead of fixing it.

This may be backported to all maintained versions.
2019-12-17 10:44:49 +01:00
Willy Tarreau
5060326798 BUG/MINOR: sample: fix the closing bracket and LF in the debug converter
The closing bracket was emitted for the "debug" converter even when the
opening one was not sent, and the new line was not always emitted. Let's
fix this. This is harmless since this converter is not built by default.
2019-12-17 09:04:38 +01:00
Willy Tarreau
62b5913380 DOC: clarify the fact that replace-uri works on a full URI
With H2 deployments becoming more common, replace-uri starts to hit
users by not always matching absolute URIs due to rules expecting the
URI to start with a '/'.
2019-12-17 06:55:15 +01:00
Christopher Faulet
1eee6ca89e REGTEST: Add an HTX reg-test to check an edge case
This test checks that an HTTP message is properly processed when we failed to
add the HTX EOM block in an HTX message during the parsing because the buffer is
full. Some space must be released in the buffer to make it possible. This
requires an extra pass in the H1 multiplexer. Here, we must be sure the mux is
called while there is no more incoming data.

It is a "devel" test because conditions to run the test successfully is highly
dependent on the implementation. So if it fail, it is not necessarily a bug. It
may be due of an internal change. It relies on internal HTX sample fetches.
2019-12-11 16:46:16 +01:00
Christopher Faulet
29f7284333 MINOR: http-htx: Add some htx sample fetches for debugging purpose
These sample fetches are internal and must be used for debugging purpose. Idea
is to have a way to add some checks on the HTX content from http rules. The main
purpose is to ease reg-tests writing.
2019-12-11 16:46:16 +01:00
Christopher Faulet
76014fd118 MEDIUM: h1-htx: Add HTX EOM block when the message is in H1_MSG_DONE state
During H1 parsing, the HTX EOM block is added before switching the message state
to H1_MSG_DONE. It is an exception in the way to convert an H1 message to
HTX. Except for this block, the message is first switched to the right state
before starting to add the corresponding HTX blocks. For instance, the message
is switched in H1_MSG_DATA state and then the HTX DATA blocks are added.

With this patch, the message is switched to the H1_MSG_DONE state when all data
blocks or trailers were processed. It is the caller responsibility to call
h1_parse_msg_eom() when the H1_MSG_DONE state is reached. This way, it is far
easier to catch failures when the HTX buffer is full.

The H1 and FCGI muxes have been updated accordingly.

This patch may eventually be backported to 2.1 if it helps other backports.
2019-12-11 16:46:16 +01:00
Willy Tarreau
719e07c989 BUILD/MINOR: unix sockets: silence an absurd gcc warning about strncpy()
Apparently gcc developers decided that strncpy() semantics are no longer
valid and now deserve a warning, especially if used exactly as designed.
This results in issue #304. Let's just remove one to the target size to
please her majesty gcc, the God of C Compilers, who tries hard to make
users completely eliminate any use of string.h and reimplement it by
themselves at much higher risks. Pfff....

This can be backported to stable version, the fix is harmless since it
ignores the last zero that is already set on next line.
2019-12-11 16:29:10 +01:00
Willy Tarreau
fec56c6a76 BUG/MINOR: listener: fix off-by-one in state name check
As reported in issue #380, the state check in listener_state_str() is
invalid as it allows state value 9 to report crap. We don't use such
a state value so the issue should never happen unless the memory is
already corrupted, but better clean this now while it's harmless.

This should be backported to all maintained branches.
2019-12-11 15:51:37 +01:00
Willy Tarreau
2444108f16 BUG/MINOR: server: make "agent-addr" work on default-server line
As reported in issue #408, "agent-addr" doesn't work on default-server
lines. This is due to the transcription of the old "addr" option in commit
6e5e0d8f9e ("MINOR: server: Make 'default-server' support 'addr' keyword.")
which correctly assigns it to the check.addr and agent.addr fields, but
which also copies the default check.addr into both the check's and the
agent's addr fields. Thus the default agent's address is never used.

This fix makes sure to copy the check from the check and the agent from
the agent. However it's worth noting that if "addr" is specified on the
server line, it will still overwrite both the check and the agent's
addresses.

This must be backported as far as 1.8.
2019-12-11 15:43:45 +01:00
Willy Tarreau
cdcba115b8 BUG/MINOR: listener: do not immediately resume on transient error
The listener supports a "transient error" situation, which corresponds
to those situations where accept fails badly but poll() reports an event.
This happens for example when a listener is paused, or on out of FD. The
same mechanism is used when facing a maxconn or maxsessrate limitation.
When this happens, the listener is disabled for up to 100ms and put back
into the global listener queue so that it automatically wakes up again
as soon as the conditions change from an existing connection releasing
one resource, or the system recovers from a transient issue.

The listener_accept() function has a bug in its exit path causing a
freshly limited listener to be immediately enabled again because all
the conditions are met (connection count < max). It doesn't take into
account the fact that the listener might have been queued and must
first wait for the timeout to expire before doing so. The impact is
that upon certain errors, the faulty process will busy loop on the
accept code without sleeping. This is the scenario reported and
diagnosed by @hedong0411 in issue #382.

This commit fixes it by verifying that the global queue's delay is
at least expired before deciding to resume the listener. Another
approach could consist in having an extra state like LI_DELAY for
situations where only a delay is acceptable, but this would probably
not bring anything except more complex code.

This issue was introduced with the lock-free listener accept code
(commits 3f0d02b and 82c9789a) that were backported to 1.8.20+ and
1.9.7+, so this fix must be backported to the relevant branches.
2019-12-11 15:06:30 +01:00
Willy Tarreau
d26c9f9465 BUG/MINOR: mworker: properly pass SIGTTOU/SIGTTIN to workers
If a new process is started with -sf and it fails to bind, it may send
a SIGTTOU to the master process in hope that it will temporarily unbind.
Unfortunately this one doesn't catch it and stops to background instead
of forwarding the signal to the workers. The same is true for SIGTTIN.

This commit simply implements an extra signal handler for the master to
deal with such signals that must be passed down to the workers. It must
be backported as far as 1.8, though there the code differs in that it's
entirely in haproxy.c and doesn't require an extra sig handler.
2019-12-11 14:26:53 +01:00
Willy Tarreau
51013e82d4 BUG/MINOR: log: fix minor resource leaks on logformat error path
As reported by Ilya in issue #392, Coverity found that we're leaking
allocated strings on error paths in parse_logformat(). Let's use a
proper exit label for failures instead of seeding return 0 everywhere.

This should be backported to all supported versions.
2019-12-11 12:05:39 +01:00
Willy Tarreau
9ef75ecea1 DOC: remove references to the outdated architecture.txt
As mentionned in bug #405 we continue to reference architecture.txt from
places in the doc despite this file not being packaged for many years.
Better drop the reference if it's confusing.
2019-12-11 11:55:52 +01:00
Julien Pivotto
21ad315316 DOC: proxies: HAProxy only supports 3 connection modes
The 4th one (forceclose) has been deprecated and deleted from the
documentation in 10c6c16cde

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2019-12-11 10:17:23 +01:00
Willy Tarreau
c49ba52524 MINOR: tasks: split wake_expired_tasks() in two parts to avoid useless wakeups
We used to have wake_expired_tasks() wake up tasks and return the next
expiration delay. The problem this causes is that we have to call it just
before poll() in order to consider latest timers, but this also means that
we don't wake up all newly expired tasks upon return from poll(), which
thus systematically requires a second poll() round.

This is visible when running any scheduled task like a health check, as there
are systematically two poll() calls, one with the interval, nothing is done
after it, and another one with a zero delay, and the task is called:

  listen test
    bind *:8001
    server s1 127.0.0.1:1111 check

  09:37:38.200959 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8696843}) = 0
  09:37:38.200967 epoll_wait(3, [], 200, 1000) = 0
  09:37:39.202459 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8712467}) = 0
>> nothing run here, as the expired task was not woken up yet.
  09:37:39.202497 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8715766}) = 0
  09:37:39.202505 epoll_wait(3, [], 200, 0) = 0
  09:37:39.202513 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8719064}) = 0
>> now the expired task was woken up
  09:37:39.202522 socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 7
  09:37:39.202537 fcntl(7, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
  09:37:39.202565 setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0
  09:37:39.202577 setsockopt(7, SOL_TCP, TCP_QUICKACK, [0], 4) = 0
  09:37:39.202585 connect(7, {sa_family=AF_INET, sin_port=htons(1111), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
  09:37:39.202659 epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLOUT, {u32=7, u64=7}}) = 0
  09:37:39.202673 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8814713}) = 0
  09:37:39.202683 epoll_wait(3, [{EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=7, u64=7}}], 200, 1000) = 1
  09:37:39.202693 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8818617}) = 0
  09:37:39.202701 getsockopt(7, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
  09:37:39.202715 close(7)                = 0

Let's instead split the function in two parts:
  - the first part, wake_expired_tasks(), called just before
    process_runnable_tasks(), wakes up all expired tasks; it doesn't
    compute any timeout.
  - the second part, next_timer_expiry(), called just before poll(),
    only computes the next timeout for the current thread.

Thanks to this, all expired tasks are properly woken up when leaving
poll, and each poll call's timeout remains up to date:

  09:41:16.270449 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=10223556}) = 0
  09:41:16.270457 epoll_wait(3, [], 200, 999) = 0
  09:41:17.270130 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=10238572}) = 0
  09:41:17.270157 socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 7
  09:41:17.270194 fcntl(7, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
  09:41:17.270204 setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0
  09:41:17.270216 setsockopt(7, SOL_TCP, TCP_QUICKACK, [0], 4) = 0
  09:41:17.270224 connect(7, {sa_family=AF_INET, sin_port=htons(1111), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
  09:41:17.270299 epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLOUT, {u32=7, u64=7}}) = 0
  09:41:17.270314 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=10337841}) = 0
  09:41:17.270323 epoll_wait(3, [{EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=7, u64=7}}], 200, 1000) = 1
  09:41:17.270332 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=10341860}) = 0
  09:41:17.270340 getsockopt(7, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
  09:41:17.270367 close(7)                = 0

This may be backported to 2.1 and 2.0 though it's unlikely to bring any
user-visible improvement except to clarify debugging.
2019-12-11 09:42:58 +01:00
Willy Tarreau
440d09b244 BUG/MINOR: tasks: only requeue a task if it was already in the queue
Commit 0742c314c3 ("BUG/MEDIUM: tasks: Make sure we switch wait queues
in task_set_affinity().") had a slight side effect on expired timeouts,
which is that when used before a timeout is updated, it will cause an
existing task to be requeued earlier than its expected timeout when done
before being updated, resulting in the next poll wakup timeout too early
or even instantly if the previous wake up was done on a timeout. This is
visible in strace when health checks are enabled because there are two
poll calls, one of which has a short or zero delay. The correct solution
is to only requeue a task if it was already in the queue.

This can be backported to all branches having the fix above.
2019-12-11 09:21:36 +01:00
Willy Tarreau
4ac36d691a DOC: listeners: add a few missing transitions
Some disable() transitions were missing, and the distinction between
multi-threaded and single-threaded transitions was not mentioned.
2019-12-11 07:44:34 +01:00
Willy Tarreau
d7f76a0a50 BUG/MEDIUM: proto_udp/threads: recv() and send() must not be exclusive.
This is a complement to previous fix for bug #399. The exclusion between
the recv() and send() calls prevents send handlers from being called if
rx readiness is reported. The DNS code can trigger this situations with
threads where the fd_recv_ready() flag disappears between the test in
dgram_fd_handler() and the second test in dns_resolve_recv() while a
thread calls fd_cant_recv(), and this situation can sustain itself for
a while. With 8 threads and an error in the socket queue, placing a
printf on the return statement in dns_resolve_recv() scrolls very fast.

Simply removing the "else" in dgram_fd_handler() addresses the issue.

This fix must be backported as far as 1.6.
2019-12-10 19:09:15 +01:00
Willy Tarreau
1c75995611 BUG/MAJOR: dns: add minimalist error processing on the Rx path
It was reported in bug #399 that the DNS sometimes enters endless loops
after hours working fine. The issue is caused by a lack of error
processing in the DNS's recv() path combined with an exclusive recv OR
send in the UDP layer, resulting in some errors causing CPU loops that
will never stop until the process is restarted.

The basic cause is that the FD_POLL_ERR and FD_POLL_HUP flags are sticky
on the FD, and contrary to a stream socket, receiving an error on a
datagram socket doesn't indicate that this socket cannot be used anymore.
Thus the Rx code must at least handle this situation and flush the error
otherwise it will constantly be reported. In theory this should not be a
big issue but in practise it is due to another bug in the UDP datagram
handler which prevents the send() callback from being called when Rx
readiness was reported, so the situation cannot go away. It happens way
more easily with threads enabled, so that there is no dead time between
the moment the FD is disabled and another recv() is called, such as in
the example below where the request was sent to a closed port on the
loopback provoking an ICMP unreachable to be sent back:

  [pid 20888] 18:26:57.826408 sendto(29, ";\340\1\0\0\1\0\0\0\0\0\1\0031wt\2eu\0\0\34\0\1\0\0)\2\0\0\0\0\0\0\0", 35, 0, NULL, >
  [pid 20893] 18:26:57.826566 recvfrom(29, 0x7f97c54ef2f0, 513, 0, NULL, NULL) = -1 ECONNREFUSED (Connection refused)
  [pid 20889] 18:26:57.826601 recvfrom(29, 0x7f97c76182f0, 513, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
  [pid 20892] 18:26:57.826630 recvfrom(29, 0x7f97c5cf02f0, 513, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
  [pid 20891] 18:26:57.826684 recvfrom(29, 0x7f97c66162f0, 513, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
  [pid 20895] 18:26:57.826716 recvfrom(29, 0x7f97bffda2f0, 513, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
  [pid 20894] 18:26:57.826747 recvfrom(29, 0x7f97c4cee2f0, 513, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
  [pid 20888] 18:26:58.419838 recvfrom(29, 0x7ffcc8712c20, 513, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
  [pid 20893] 18:26:58.419900 recvfrom(29, 0x7f97c54ef2f0, 513, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
  (... hundreds before next sendto() ...)

This situation was handled by clearing HUP and ERR when recv()
returns <0.

A second case was handled, there was a control for a missing dgram
handler, but it does nothing, causing the FD to ring again if this
situation ever happens. After looking at the rest of the code, it
doesn't seem possible to face such a situation because these handlers
are registered during startup, but at least we need to handle it
properly.

A third case was handled, that's mainly a small optimization. With
threads and massive responses, due to the large lock around the loop,
it's likely that some threads will have seen fd_recv_ready() and will
wait at the lock(). But if they wait here, chances are that other
threads will have eliminated pending data and issued fd_cant_recv().
In this case, better re-check fd_recv_ready() before performing the
recv() call to avoid the huge amounts of syscalls that happen on
massively threaded setups.

This patch must be backported as far as 1.6 (the atomic AND just
needs to be turned to a regular AND).
2019-12-10 19:09:15 +01:00
Olivier Houchard
eaefc3c503 BUG/MEDIUM: kqueue: Make sure we report read events even when no data.
When we have a EVFILT_READ event, an optimization was made, and the FD was
not reported as ready to receive if there were no data available. That way,
if the socket was closed by our peer (the EV8EOF flag was set), and there were
no remaining data to read, we would just close(), and avoid doing a recv().
However, it may be fine for TCP socket, but it is not for UDP.
If we send data via UDP, and we receive an error, the only way to detect it
is to attempt a recv(). However, in this case, kevent() will report a read
event, but with no data, so we'd just ignore that read event, nothing would be
done about it, and the poller would be woken up by it over and over.
To fix this, report read events if either we have data, or the EV_EOF flag
is not set.

This should be backported to 2.1, 2.0, 1.9 and 1.8.
2019-12-10 18:27:17 +01:00
Willy Tarreau
977afab3f8 DOC: document the listener state transitions
This was done by reading all the code affecting a listener's state,
hopefully it will save some time in the future.
2019-12-10 16:06:53 +01:00
Willy Tarreau
a1d97f88e0 REORG: listener: move the global listener queue code to listener.c
The global listener queue code and declarations were still lying in
haproxy.c while not needed there anymore at all. This complicates
the code for no reason. As a result, the global_listener_queue_task
and the global_listener_queue were made static.
2019-12-10 14:16:03 +01:00
Willy Tarreau
241797a3fc MINOR: listener: split dequeue_all_listener() in two
We use it half times for the global_listener_queue and half times
for a proxy's queue and this requires the callers to take care of
these. Let's split it in two versions, the current one working only
on the global queue and another one dedicated to proxies for the
per-proxy queues. This cleans up quite a bit of code.
2019-12-10 14:14:09 +01:00
Willy Tarreau
0591bf7deb MINOR: listener: make the wait paths cleaner and more reliable
In listener_accept() there are several situations where we have to wait
for an event or a delay. These ones all implement their own call to
limit_listener() and the associated task_schedule(). In addition to
being ugly and confusing, one expire date computation is even wrong as
it doesn't take in account the fact that we're using threads and that
the value might change in the middle. Fortunately task_schedule() gets
it right for us.

This patch creates two jump locations, one for the global queue and
one for the proxy queue, allowing the rest of the code to only compute
the expire delay and jump to the right location.
2019-12-10 12:04:27 +01:00