Since 1.9 with commit b20aa9eef3 ("MAJOR: tasks: create per-thread wait
queues") a task bound to a single thread will not use locks when being
queued or dequeued because the wait queue is assumed to be the owner
thread's.
But there exists a rare situation where this is not true: the health
check tasks may be running on one thread waiting for a response, and
may in parallel be requeued by another thread calling health_adjust()
after a detecting a response error in traffic when "observe l7" is set,
and "fastinter" is lower than "inter", requiring to shorten the running
check's timeout. In this case, the task being requeued was present in
another thread's wait queue, thus opening a race during task_unlink_wq(),
and gets requeued into the calling thread's wait queue instead of the
running one's, opening a second race here.
This patch aims at protecting against the risk of calling task_unlink_wq()
from one thread while the task is queued on another thread, hence unlocked,
by introducing a new TASK_SHARED_WQ flag.
This new flag indicates that a task's position in the wait queue may be
adjusted by other threads than then one currently executing it. This means
that such WQ manipulations must be performed under a lock. There are two
types of such tasks:
- the global ones, using the global wait queue (technically speaking,
those whose thread_mask has at least 2 bits set).
- some local ones, which for now will be placed into the global wait
queue as well in order to benefit from its lock.
The flag is automatically set on initialization if the task's thread mask
indicates more than one thread. The caller must also set it if it intends
to let other threads update the task's expiration delay (e.g. delegated
I/Os), or if it intends to change the task's affinity over time as this
could lead to the same situation.
Right now only the situation described above seems to be affected by this
issue, and it is very difficult to trigger, and even then, will often have
no visible effect beyond stopping the checks for example once the race is
met. On my laptop it is feasible with the following config, chained to
httpterm:
global
maxconn 400 # provoke FD errors, calling health_adjust()
defaults
mode http
timeout client 10s
timeout server 10s
timeout connect 10s
listen px
bind :8001
option httpchk /?t=50
server sback 127.0.0.1:8000 backup
server-template s 0-999 127.0.0.1:8000 check port 8001 inter 100 fastinter 10 observe layer7
This patch will automatically address the case for the checks because
check tasks are created with multiple threads bound and will get the
TASK_SHARED_WQ flag set.
If in the future more tasks need to rely on this (multi-threaded muxes
for example) and the use of the global wait queue becomes a bottleneck
again, then it should not be too difficult to place locks on the local
wait queues and queue the task on its bound thread.
This patch needs to be backported to 2.1, 2.0 and 1.9. It depends on
previous patch "MINOR: task: only check TASK_WOKEN_ANY to decide to
requeue a task".
Many thanks to William Dauchy for providing detailed traces allowing to
spot the problem.
During H1 parsing, the HTX EOM block is added before switching the message state
to H1_MSG_DONE. It is an exception in the way to convert an H1 message to
HTX. Except for this block, the message is first switched to the right state
before starting to add the corresponding HTX blocks. For instance, the message
is switched in H1_MSG_DATA state and then the HTX DATA blocks are added.
With this patch, the message is switched to the H1_MSG_DONE state when all data
blocks or trailers were processed. It is the caller responsibility to call
h1_parse_msg_eom() when the H1_MSG_DONE state is reached. This way, it is far
easier to catch failures when the HTX buffer is full.
The H1 and FCGI muxes have been updated accordingly.
This patch may eventually be backported to 2.1 if it helps other backports.
As reported in issue #380, the state check in listener_state_str() is
invalid as it allows state value 9 to report crap. We don't use such
a state value so the issue should never happen unless the memory is
already corrupted, but better clean this now while it's harmless.
This should be backported to all maintained branches.
If a new process is started with -sf and it fails to bind, it may send
a SIGTTOU to the master process in hope that it will temporarily unbind.
Unfortunately this one doesn't catch it and stops to background instead
of forwarding the signal to the workers. The same is true for SIGTTIN.
This commit simply implements an extra signal handler for the master to
deal with such signals that must be passed down to the workers. It must
be backported as far as 1.8, though there the code differs in that it's
entirely in haproxy.c and doesn't require an extra sig handler.
We used to have wake_expired_tasks() wake up tasks and return the next
expiration delay. The problem this causes is that we have to call it just
before poll() in order to consider latest timers, but this also means that
we don't wake up all newly expired tasks upon return from poll(), which
thus systematically requires a second poll() round.
This is visible when running any scheduled task like a health check, as there
are systematically two poll() calls, one with the interval, nothing is done
after it, and another one with a zero delay, and the task is called:
listen test
bind *:8001
server s1 127.0.0.1:1111 check
09:37:38.200959 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8696843}) = 0
09:37:38.200967 epoll_wait(3, [], 200, 1000) = 0
09:37:39.202459 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8712467}) = 0
>> nothing run here, as the expired task was not woken up yet.
09:37:39.202497 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8715766}) = 0
09:37:39.202505 epoll_wait(3, [], 200, 0) = 0
09:37:39.202513 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8719064}) = 0
>> now the expired task was woken up
09:37:39.202522 socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 7
09:37:39.202537 fcntl(7, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
09:37:39.202565 setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0
09:37:39.202577 setsockopt(7, SOL_TCP, TCP_QUICKACK, [0], 4) = 0
09:37:39.202585 connect(7, {sa_family=AF_INET, sin_port=htons(1111), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
09:37:39.202659 epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLOUT, {u32=7, u64=7}}) = 0
09:37:39.202673 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8814713}) = 0
09:37:39.202683 epoll_wait(3, [{EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=7, u64=7}}], 200, 1000) = 1
09:37:39.202693 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8818617}) = 0
09:37:39.202701 getsockopt(7, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
09:37:39.202715 close(7) = 0
Let's instead split the function in two parts:
- the first part, wake_expired_tasks(), called just before
process_runnable_tasks(), wakes up all expired tasks; it doesn't
compute any timeout.
- the second part, next_timer_expiry(), called just before poll(),
only computes the next timeout for the current thread.
Thanks to this, all expired tasks are properly woken up when leaving
poll, and each poll call's timeout remains up to date:
09:41:16.270449 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=10223556}) = 0
09:41:16.270457 epoll_wait(3, [], 200, 999) = 0
09:41:17.270130 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=10238572}) = 0
09:41:17.270157 socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 7
09:41:17.270194 fcntl(7, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
09:41:17.270204 setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0
09:41:17.270216 setsockopt(7, SOL_TCP, TCP_QUICKACK, [0], 4) = 0
09:41:17.270224 connect(7, {sa_family=AF_INET, sin_port=htons(1111), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
09:41:17.270299 epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLOUT, {u32=7, u64=7}}) = 0
09:41:17.270314 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=10337841}) = 0
09:41:17.270323 epoll_wait(3, [{EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=7, u64=7}}], 200, 1000) = 1
09:41:17.270332 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=10341860}) = 0
09:41:17.270340 getsockopt(7, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
09:41:17.270367 close(7) = 0
This may be backported to 2.1 and 2.0 though it's unlikely to bring any
user-visible improvement except to clarify debugging.
Commit 0742c314c3 ("BUG/MEDIUM: tasks: Make sure we switch wait queues
in task_set_affinity().") had a slight side effect on expired timeouts,
which is that when used before a timeout is updated, it will cause an
existing task to be requeued earlier than its expected timeout when done
before being updated, resulting in the next poll wakup timeout too early
or even instantly if the previous wake up was done on a timeout. This is
visible in strace when health checks are enabled because there are two
poll calls, one of which has a short or zero delay. The correct solution
is to only requeue a task if it was already in the queue.
This can be backported to all branches having the fix above.
We use it half times for the global_listener_queue and half times
for a proxy's queue and this requires the callers to take care of
these. Let's split it in two versions, the current one working only
on the global queue and another one dedicated to proxies for the
per-proxy queues. This cleans up quite a bit of code.
In task_set_affinity(), leave the wait_queue if any before changing the
affinity, and re-enter a wait queue once it is done. If we don't do that,
the task may stay in the wait queue of another thread, and we later may
end up modifying that wait queue while holding no lock, which could lead
to memory corruption.
THis should be backported to 2.1, 2.0 and 1.9.
Rework ssl_sock_prepare_ctx() so it fills a buffer with the error
messages instead of using ha_alert()/ha_warning(). Also returns an error
code (ERR_*) instead of the number of errors.
This change make the payload filtering uniform between TCP and HTTP
filters. Now, in TCP, like in HTTP, there is only one callback responsible to
forward data. Thus, old callbacks, tcp_data() and tcp_forward_data(), are
replaced by a single callback function, tcp_payload(). This new callback gets
the offset in the payload to (re)start the filtering and the maximum amount of
data it can forward. It is the filter's responsibility to be compatible with HTX
streams. If not, it must not set the flag FLT_CFG_FL_HTX.
Because of this change, nxt and fwd offsets are no longer needed. Thus they are
removed from the filter structure with their update functions,
flt_change_next_size() and flt_change_forward_size(). Moreover, the trace filter
has been updated accordingly.
This patch breaks the compatibility with the old API. Thus it should probably
not be backported. But, AFAIK, there is no TCP filter, thus the breakage is very
limited.
In tasklet_remove_from_tasket_list(), we can be called for a tasklet that is
either in the private task list, or in the shared tasklet list. Take that into
account and always use MT_LIST_DEL() to remove it, otherwise if we're in the
shared list and another thread attempts to add a tasklet in it, bad things
will happen.
__tasklet_remove_from_tasklet_list() is left unchanged, it's only supposed
to be used by process_runnable_task() to remove task/tasklets from the private
tast list.
This should not be backported.
This should fix github issue #357.
Since the legacy HTTP mode was removed, the stream is always released at the end
of each HTTP transaction and a new is created to handle the next request for
keep-alive connections. So the HTTP transaction is no longer reset and the
function http_reset_txn() can be removed.
Runtime traces are now supported for the streams, only if compiled with
debug. process_stream() is covered as well as TCP/HTTP analyzers and filters.
In traces, the first argument is always a stream. So it is easy to get the info
about the channels and the stream-interfaces. The second argument, when defined,
is always a HTTP transaction. And the third one is an HTTP message. The trace
message is adapted to report HTTP info when possible.
The macros DBG_TRACE_*() can be used instead of existing trace macros to emit
trace messages in debug mode only, ie, when HAProxy is compiled with DEBUG_FULL
or DEBUG_DEV. Otherwise, these macros do nothing. So it is possible to add
traces for development purpose without impacting performance of production
instances.
It can be sometimes interesting to have a timestamp with a
resolution of less than a second.
It is currently painful to obtain this, because concatenation
of date and date_us lead to a shorter timestamp during first
100ms of a second, which is not parseable and needs ugly ACLs
in configuration to prepend 0s when needed.
To improve this, add an optional <unit> parameter to date sample
to report an integer with desired unit.
Also support this unit in http_date converter to report
a date string with sub-second precision.
In si_connect(), only switch the strema_interface status to SI_ST_RDY if
we're reusing a connection and if the connection's mux is ready. Otherwise,
maybe we're reusing a connection that is not fully established yet, and may
fail, and setting SI_ST_RDY would mean we would not be able to retry to
connect.
This should be backported to 1.9 and 2.0.
This commit depends on 55234e33708c5a584fb9efea81d71ac47235d518.
Tasklets may be woken up to run on the calling thread or by a specific thread
(the owner). But since we use a non-thread safe mechanism when the calling
thread is also the for the owner, there may sometimes be collisions when two
threads decide to wake the same tasklet up at the same time and one of them
is the owner.
This is more of a matter of usage than code, in that a tasklet usually is
designed to be woken up and executed on the calling thread only (most cases)
or on a specific thread. Thus it is a property of the tasklet itself as this
solely depends how the code is constructed around it.
This patch performs a small change to address this. By default tasklet_new()
creates a "local" tasklet, which will run on the calling thread, like in 2.0.
This is done by setting tl->tid to a negative value. If the caller wants the
tasklet to run exclusively on a specific thread, it just has to set tl->tid,
which is already what shared tasklet callers do anyway.
No backport is needed.
The use of ~(1 << tid) to compute the sleeping_mask in tasklet_wakeup()
will result in breakage above 32 threads, because (1<<31) = 0xFFFFFFFF8000000,
and upper values will lead to theorically undefined results, but practically
will wrap over 0x1 to 0x80000000 again and indicate wrong sleeping masks. It
seems that the main visible effect maybe extra latency on some threads or
short CPU loops on others.
No backport is needed.
$ echo -e "set ssl cert certificate.pem <<\n$(cat certificate2.pem)\n" | \
socat stdio /var/run/haproxy.stat
Certificate updated!
The operation is locked at the ckch level with a HA_SPINLOCK_T which
prevents the ckch architecture (ckch_store, ckch_inst..) to be modified
at the same time. So you can't do a certificate update at the same time
from multiple CLI connections.
SNI trees are also locked with a HA_RWLOCK_T so reading operations are
locked only during a certificate update.
Bundles are supported but you need to update each file (.rsa|ecdsa|.dsa)
independently. If a file is used in the configuration as a bundle AND
as a unique certificate, both will be updated.
Bundles, directories and crt-list are supported, however filters in
crt-list are currently unsupported.
The code tries to allocate every SNIs and certificate instances first,
so it can rollback the operation if that was unsuccessful.
If you have too much instances of the certificate (at least 20000 in my
tests on my laptop), the function can take too much time and be killed
by the watchdog. This will be fixed later. Also with too much
certificates it's possible that socat exits before the end of the
generation without displaying a message, consider changing the socat
timeout in this case (-t2 for example).
The size of the certificate is currently limited by the maximum size of
a payload, that must fit in a buffer.
As using an mt_list for the tasklet list is costly, instead use a regular list,
but add an mt_list for tasklet woken up by other threads, to be run on the
current thread. At the beginning of process_runnable_tasks(), we just take
the new list, and merge it into the task_list.
This should give us performances comparable to before we started using a
mt_list, but allow us to use tasklet_wakeup() from other threads.
Several times some users have expressed the non-intuitive aspect of some
of our stat/info metrics and suggested to add some help. This patch
replaces the char* arrays with an array of name_desc so that we now have
some reserved room to store a description with each stat or info field.
These descriptions are currently empty and not reported yet.
This function now uses the address of the pointer to the htx message where the
copy must be performed. This way, when a zero-copy is performed, there is no
need to refresh the caller's htx message. It is a bit easier to do that way,
especially to add traces in the mux-h1.
William reported that since commit 6b3089856f ("MEDIUM: fd: do not use
the FD_POLL_* flags in the pollers anymore") the master's CLI often
fails to access sub-processes. There are two causes to this. One is
that we did report FD_POLL_ERR on an FD as soon as FD_EV_SHUT_W was
seen, which is automatically inherited from POLLHUP. And since we do
not store the current shutdown state of an FD we can't know if the
poller reports a sudden close resulting from an error or just a
byproduct of a previous shutdown(WR) followed by a read0. The current
patch addresses this by only considering this when the FD was active,
since a shutdown FD is not active. The second issue is that *somewhere*
down the chain, channel data are ignored if an error is reported on a
channel. This results in content truncation, but this cause was not
figured yet.
No backport is needed.
It is now possible to format stats counters as floats. But the stats applet does
not use it.
This patch is required by the Prometheus exporter to send the time averages in
seconds. If the promex change is backported, this patch must be backported
first.
Now that we can wake tasklet for other threads, make sure that if the thread
is sleeping, we wake it up, or the tasklet won't be executed until it's
done sleeping.
That also means that, before going to sleep, and after we put our bit
in sleeping_thread_mask, we have to check that nobody added a tasklet for
us, just checking for global_tasks_mask isn't enough anymore.
The aim is to rassemble all scheduler information related to the current
thread. It simply points to task_per_thread[tid] without having to perform
the operation at each time. We save around 1.2 kB of code on performance
sensitive paths and increase the request rate by almost 1%.
There are a number of tests there which are enforced on tasklets while
they will never apply (various handlers, destroyed task or not, arguments,
results, ...). Instead let's have a single TASK_IS_TASKLET() test and call
the tasklet processing function directly, skipping all the rest.
It now appears visible that the only unneeded code is the update to
curr_task that is never used for tasklets, except for opportunistic
reporting in the debug handler, which can only catch si_cs_io_cb,
which in practice doesn't appear in any report so the extra cost
incurred there is pointless.
This change alone removes 700 bytes of code, mostly in
process_runnable_tasks() and increases the performance by about
1%.
Change the tasklet code so that the tasklet list is now a mt_list.
That means that tasklet now do have an associated tid, for the thread it
is expected to run on, and any thread can now call tasklet_wakeup() for
that tasklet.
One can change the associated tid with tasklet_set_tid().
In srv_add_to_idle_list(), use LIST_DEL_INIT instead of just LIST_DEL.
We're about to add the connection to a mt_list, and MT_LIST_ADD/MT_LIST_ADDQ
will be modified to make sure we're not adding the element if it's already
in a list.
Instead of using the same type for regular linked lists and "autolocked"
linked lists, use a separate type, "struct mt_list", for the autolocked one,
and introduce a set of macros, similar to the LIST_* macros, with the
MT_ prefix.
When we use the same entry for both regular list and autolocked list, as
is done for the "list" field in struct connection, we know have to explicitely
cast it to struct mt_list when using MT_ macros.
The FCGI application handles all the configuration parameters used to format
requests sent to an application. The configuration of an application is grouped
in a dedicated section (fcgi-app <name>) and referenced in a backend to be used
(use-fcgi-app <name>). To be valid, a FCGI application must at least define a
document root. But it is also possible to set the default index, a regex to
split the script name and the path-info from the request URI, parameters to set
or unset... In addition, this patch also adds a FCGI filter, responsible for
all processing on a stream.
To avoid code duplication in the futur mux FCGI, functions parsing H1 messages
and converting them into HTX have been moved in the file h1_htx.c. Some
specific parts remain in the mux H1. But most of the parsing is now generic.
Application is a generic term here. It is a modules which handle its own log
server list, with no dependency on a proxy. Such applications can now call the
function app_log() to log messages, passing a log server list and a tag as
parameters. Internally, the function __send_log() has been adapted accordingly.
When using "http-reuse safe", which is the default, a new incoming connection
does not automatically reuse an existing connection for the first request, as
we don't want to risk to lose the contents if we know the client will not be
able to replay the request. A side effect to this is that when dealing with
mostly http-close traffic, the reuse rate is extremely low and we keep
accumulating server-side connections that may even never be reused. At some
point we're limited to a ratio of file descriptors, but when the system is
configured with very high FD limits, we can still reach the limit of outgoing
source ports and make the system significantly slow down trying to find an
available port for outgoing connections. A simple test on my laptop with
ulimit 100000 and with the following config results in the load immediately
dropping after a few seconds :
listen l1
bind :4445
mode http
server s1 127.0.0.1:8000
As can be seen, the load falls from 38k cps to 400 cps during the first 200ms
(in fact when the source port table is full and connect() takes ages to find
a spare port for a new connection):
$ injectl464 -p 4 -o 1 -u 10 -G 127.0.0.1:4445/ -F -c -w 100
hits ^hits hits/s ^h/s bytes kB/s last errs tout htime sdht ptime
2439 2439 39338 39338 356094 5743 5743 0 0 0.4 0.5 0.4
7637 5198 38185 37666 1115002 5575 5499 0 0 0.7 0.5 0.7
7719 82 25730 820 1127002 3756 120 0 0 21.8 18.8 21.8
7797 78 19492 780 1138446 2846 114 0 0 61.4 2.5 61.4
7877 80 15754 800 1150182 2300 117 0 0 58.6 0.5 58.6
7920 43 13200 430 1156488 1927 63 0 0 58.9 0.3 58.9
At this point, lots of connections are indeed in use, for only 10 connections
on the frontend side:
$ ss -ant state established | wc -l
39022
This patch makes sure we never keep more idle connections than we've ever
had outstanding requests on a server. This way the total number of idle
connections will never exceed the sum of maximum connections. Thus highly
loaded servers will be able to get many connections and slightly loaded
servers will keep less. Ideally we should apply similar limits per process
and the per backend, but in practice this already addresses the issues
pretty well:
$ injectl464 -p 4 -o 1 -u 10 -G 127.0.0.1:4445/ -F -c -w 100
hits ^hits hits/s ^h/s bytes kB/s last errs tout htime sdht ptime
4423 4423 40209 40209 645758 5870 5870 0 0 0.2 0.4 0.2
8020 3597 40100 39966 1170920 5854 5835 0 0 0.2 0.4 0.2
12037 4017 40123 40170 1757402 5858 5864 0 0 0.2 0.4 0.2
16069 4032 40172 40320 2346074 5865 5886 0 0 0.2 0.4 0.2
20047 3978 40013 39386 2926862 5842 5750 0 0 0.3 0.4 0.3
24005 3958 40008 39979 3504730 5841 5837 0 0 0.2 0.4 0.2
$ ss -ant state established | wc -l
234
This patch must be backported to 2.0. It could be useful in 1.9 as well
eventhough pools and reuse are not enabled by default there.
As mentioned in previous commit, these flags do not map well to
modern poller capabilities. Let's use the FD_EV_*_{R,W} flags instead.
This first patch only performs a 1-to-1 mapping making sure that the
previously reported flags are still reported identically while using
the closest possible semantics in the pollers.
It's worth noting that kqueue will now support improvements such as
returning distinctions between shut and errors on each direction,
though this is not exploited for now.
These two functions are used to enable recv/send but only if the FD is
not marked as active yet. The purpose is to conditionally mark them as
tentatively usable without interfering with the polling if polling was
already enabled, when it's supposed to be likely true.
Given that all our I/Os are now directed from top to bottom and not the
opposite way around, and the FD cache was removed, it doesn't make sense
anymore to create FDs that are marked not ready since this would prevent
the first accesses unless the caller explicitly does an fd_may_recv()
which is not expected to be its job (which conn_ctrl_init() has to do
by the way). Let's move this into fd_insert() instead, and have a single
atomic operation for both directions via fd_may_both().
Now that we don't have to update FD_EV_POLLED_* at the same time as
FD_EV_ACTIVE_*, we don't need to use a CAS anymore, a bit-test-and-set
operation is enough. Doing so reduces the code size by a bit more than
1 kB. One function was special, fd_done_recv(), whose comments and doc
were inaccurate for the part related to the lack of polling.
Since commit 7ac0e35f2 in 1.9-dev1 ("MAJOR: fd: compute the new fd polling
state out of the fd lock") we've started to update the FD POLLED bit a
bit more aggressively. Lately with the removal of the FD cache, this bit
is always equal to the ACTIVE bit. There's no point continuing to watch
it and update it anymore, all it does is create confusion and complicate
the code. One interesting side effect is that it now becomes visible that
all fd_*_{send,recv}() operations systematically call updt_fd_polling(),
except fd_cant_recv()/fd_cant_send() which never saw it change.
The detail level initially based on syslog levels is not used, while
something related is missing, trace verbosity, to indicate whether or
not we want to call the decoding callback and what level of decoding
we want (raw captures etc). Let's change the field to "verbosity" for
this. A verbosity of zero means that the decoding callback is not
called, and all other levels are handled by this callback and are
source-specific. The source is now prompted to list the levels that
are proposed to the user. When the source doesn't define anything,
"quiet" and "default" are available.
Working on adding traces to mux-h2 revealed that the function names are
manually copied a lot in developer traces. The reason is that they are
not preprocessor macros and as such cannot be concatenated. Let's
slightly adjust the trace() function call to take a function name just
after the file:line argument. This argument is only added for the
TRACE_DEVEL and 3 new TRACE_ENTER, TRACE_LEAVE, and TRACE_POINT macros
and left NULL for others. This way the function name is only reported
for traces aimed at the developers. The pretty-print callback was also
extended to benefit from this. This will also significantly shrink the
data segment as the "entering" and "leaving" strings will now be merged.
One technical point worth mentioning is that the function name is *not*
passed as an ist to the inline function because it's not considered as
a builtin constant by the compiler, and would lead to strlen() being
run on it from all call places before calling the inline function. Thus
instead we pass the const char * (that the compiler knows where to find)
and it's the __trace() function that converts it to an ist for internal
consumption and for the pretty-print callback. Doing this avoids losing
5-10% peak performance.
The "payload" trace level was ambigous because its initial purpose was
to be able to dump received data. But it doesn't make sense to force to
report data transfers just to be able to report state changes. For
example, all snd_buf()/rcv_buf() operations coming from the application
layer should be tagged at this level. So here we move this payload level
above the state transitions and rename it to avoid the ambiguity making
one think it's only about request/response payload. Now it clearly is
about any data transfer and is thus just below the developer level. The
help messages on the CLI and the doc were slightly reworded to help
remove this ambiguity.
Save the authority TLV in a PROXYv2 header from the client connection,
if present, and make it available as fc_pp_authority.
The fetch can be used, for example, to set the SNI for a backend TLS
connection.
Previously the callback was almost mandatory so it made sense to have it
before the message. Now that it can default to the one declared in the
trace source, most TRACE() calls contain series of empty args and callbacks,
which make them suitable for being at the end and being totally omitted.
This patch thus reverses the TRACE arguments so that the message appears
first, then the mask, then arg1..arg4, then the callback. In practice
we'll mostly see 1 arg, or 2 args and nothing else, and it will not be
needed anymore to pass long series of commas in the middle of the
arguments. However if a source is enforced, the empty commas will still
be needed for all omitted arguments.
The principle is that when emitting a message, if some dropped events
were logged, we first attempt to report this counter before going
further. This is done under an exclusive lock while all logs are
produced under a shared lock. This ensures that the dropped line is
accurately reported and doesn't accidently arrive after a later
event.
The three functions (attach, IO handler, and release) are meant to be
called by any CLI command which requires to dump the contents of a ring
buffer. We do not implement anything generic to dump any ring buffer on
the CLI since it's meant to be used by other functionalities above.
However these functions deal with locking and everything so it's trivial
to embed them in other code.
This function tries to write to the ring buffer, possibly removing enough
old messages to make room for the new one. It takes two arrays of fragments
on input to ease the insertion of prefixes by the caller. It atomically
writes the message, possibly truncating it if desired, and returns the
operation's status.
Our circular buffers are well suited for being used as ring buffers for
not-so-structured data. The machanism here consists in making room in a
buffer before inserting a new record which is prefixed by its size, and
looking up next record based on the previous one's offset and size. We
can have up to 255 consumers watching for data (dump in progress, tail)
which guarantee that entrees are not recycled while they're being dumped.
The complete representation is described in the header file. For now only
ring_new(), ring_resize() and ring_free() are created.
Currently both logs and event sinks may use a file descriptor to
atomically emit some output contents. The two may use the same FD though
nothing is done to make sure they use the same lock. Also there is quite
some redundancy between the two. Better make a specific function to send
a fragmented message to a file descriptor which will take care of the
locking via the fd's lock. The function is also able to truncate a
message and to enforce addition of a trailing LF when building the
output message.
The new TRACE_<level>() macros take a mask, 4 args, a callback and a
static message. From this they also inherit the TRACE_SOURCE macro from
the caller, which contains the pointer to the trace source (so that it's
not required to paste it everywhere), and an ist string is also made by
the concatenation of the file name and the line number. This uses string
concatenation by the preprocessor, and turns it into an ist by the compiler
so that there is no operation at all to perform to adjust the data length
as the compiler knows where to cut during the optimization phase. Last,
the message is also automatically turned into an ist so that it's trivial
to put it into an iovec without having to run strlen() on it.
All arguments and the callback may be empty and will then automatically
be replaced with a NULL pointer. This makes the TRACE calls slightly
lighter especially since arguments are not always used. Several other
options were considered to use variadic macros but there's no outstanding
rule that justifies to place an argument before another one, and it still
looks convenient to have the message be the last one to encourage copy-
pasting of the trace statements.
A generic TRACE() macro takes TRACE_LEVEL in from the source file as the
trace level instead of taking it from its name. This may slightly simplify
the production of traces that always run at the same level (internal core
parts may probably only be called at developer level).
The trace() call will support an optional decoding callback and 4
arguments that this function is supposed to know how to use to provide
extra information. The output remains unchanged when the function is
NULL. Otherwise, the message is pre-filled into the thread-local
trace_buf, and the function is called with all arguments so that it
completes the buffer in a readable form depending on the expected
level of detail.
This new "level" argument will allow the trace sources to label the
traces for different purposes, and filter out some of them if they
are not relevant to the current target. Right now we have 5 different
levels:
- USER : the least verbose one, only a few functional information
- PAYLOAD: like user but also displays some payload-related information
- PROTO: focuses on the protocol's framing
- STATE: also indicate state internal transitions or non-transitions
- DEVELOPER: adds extra info about branches taken in the code (break
points, return points)
We now pass an extra argument "where" to the trace() call, which
is supposed to be an ist made of the concatenation of the filename
and the line number. We only keep the last 10 chars from this string
since the end of file names is most often easy to recognize. This
gives developers useful information at very low cost.
For now it remains quite basic. It performs a few state checks, calls
the source's sink if defined, and performs the transitions between
RUNNING, STOPPED and WAITING when the configured events match.
For now it lists the sources if one is not provided, and checks
for the source's existence. It lists the events if not provided,
checks for their existence if provided, and adjusts reported
events/start/stop/pause events, and performs state transitions.
It lists sinks and adjusts them as well. Filters, lock, and
level are not implemented yet.
The principle of this subsystem will be to support taking live traces
at various places in the code with conditional triggers, filters, and
ability to lock on some elements. The traces will support typed events
and will be sent into sinks made of ring buffers, file descriptors or
remote servers.
This is the most basic type of sink. It pre-registers "stdout" and
"stderr", and is able to use writev() on them. The writev() operation
is locked to avoid mixing outputs. It's likely that the registration
should move somewhere else to take into account the fact that stdout
and stderr are still opened or are closed.
The principle will be to be able to dispatch events to various destinations
called "sinks". This is already done in part in logs where log servers can
be either a UDP socket or a file descriptor. This will be needed with the
new trace subsystem where we may also want to add ring buffers. And it turns
out that all such destinations make sense at all places. Logs may need to be
sent to a TCP server via a ring buffer, or consulted from the CLI. Trace
events may need to be sent to stdout/stderr as well as to remote log servers.
This patch creates a new structure "sink" aiming at addressing these similar
needs. The goal is to merge together what is common to all of them, such as
the output format, the dropped events count, etc, and also keep separately
the target identification (network address, file descriptor). Provisions
were made to have a "waiter" on the sink. For a TCP log server it will be
the task to wake up after writing to the log buffer. For a ring buffer, it
could be the list of watchers on the CLI running a "tail" operation and
waiting for new events. A lock was also placed in the struct since many
operations will require some locking, including the FD ones. The output
formats covers those in use by logs and two extra ones prepending the ISO
time in front of the message (convenient for stdio/buffer).
For now only the generic infrastructure is present, no type-specific
output is implemented. There's the sink_write() function which prepares
and formats a message to be sent, trying hard to avoid copies and only
using pointer manipulation, where the type-specific code just has to be
added. Dropped messages are already counted (for now 100% drop). The
message is put into an iovec array as it will be trivial to use with
file descriptors and sockets.
The current functions are seen outside from the debugging code and are
convenient to export so that we can improve the thread dump output :
void hlua_applet_tcp_fct(struct appctx *ctx);
void hlua_applet_http_fct(struct appctx *ctx);
struct task *hlua_process_task(struct task *task, void *context, unsigned short state);
Of course they are only available when USE_LUA is defined.
When I/O events are being processed, we want to make sure to mark the
thread as not stuck. The reason is that some pollers (like poll()) which
do not limit the number of FDs they report could possibly report a huge
amount of FD all having to perform moderately expensive operations in
the I/O callback (e.g. via mux-pt which forwards to the upper layers),
making the watchdog think the thread is stuck since it does not schedule.
Of course this must never happen but if it ever does we must be liberal
about it.
This should be backported to 2.0, where the situation may happen more
easily due to the FD cache which can start to collect a large amount of
events. It may be related to the report in issue #201 though nothing is
certain about it.
These functions perform all the boring filling of the appctx's
cli struct needed by CLI parsers to return a message or an error,
and they return 1 so that they can be used as a single-line return
statement. They may be used for const messages or dynamic messages.
When parsing references to stick-tables declared as backends, they are added to
a list of proxies (they are proxies!) which refer to this stick-tables.
Before this patch we added them to these list without checking they were already
present, making the silly hypothesis the actions/sample were checked/resolved in the same
order the proxies are parsed.
This patch implement a simple inline function to in_proxies_list() to test
the presence of a proxy in a list of proxies. We use this function when resolving
/checking samples/actions.
This bug was introduced by 015e4d7 commit.
Must be backported to 2.0.
In the poller code, instead of just remembering if we're currently polling
a fd or not, remember if we're polling it for writing and/or for reading, that
way, we can avoid to modify the polling if it's already polled as needed.
Now that the architecture was changed so that attempts to receive/send data
always come from the upper layers, instead of them only trying to do so when
the lower layer let them know they could try, we can finally get rid of the
fd cache. We don't really need it anymore, and removing it gives us a small
performance boost.
Dragan Dosen found that the listeners lock is not sufficient to protect
the listeners list when proxies are stopping because the listeners are
also unlinked from the protocol list, and under certain situations like
bombing with soft-stop signals or shutting down many frontends in parallel
from multiple CLI connections, it could be possible to provoke multiple
instances of delete_listener() to be called in parallel for different
listeners, thus corrupting the protocol lists.
Such operations are pretty rare, they are performed once per proxy upon
startup and once per proxy on shut down. Thus there is no point trying
to optimize anything and we can use a global lock to protect the protocol
lists during these manipulations.
This fix (or a variant) will have to be backported as far as 1.8.
Empty error files may be used to disable the sending of any message for specific
error codes. A common use-case is to use the file "/dev/null". This way the
default error message is overridden and no message is returned to the client. It
was supported in the legacy HTTP mode, but not in HTX. Because of a bug, such
messages triggered an error.
This patch must be backported to 2.0 and 1.9. However, the patch will have to be
adapted.
When forcing the outgoing address of a connection, till now we used to
allocate this outgoing connection and set the address into it, then set
SF_ADDR_SET. With connection reuse this causes a whole lot of issues and
difficulties in the code.
Thanks to the previous changes, it is now possible to store the target
address into the stream instead, and copy the address from the stream to
the connection when initializing the connection. assign_server_address()
does this and as a result SF_ADDR_SET now reflects the presence of the
target address in the stream, not in the connection. The http_proxy mode,
the peers and the master's CLI now use the same mechanism. For now the
existing connection code was not removed to limit the amount of tricky
changes, but the allocated connection is not used anymore.
This change also revealed a latent issue that we've been having around
option http_proxy : the address was set in the connection but neither the
SF_ADDR_SET nor the SF_ASSIGNED flags were set. It looks like the connection
could establish only due to the fact that it existed with a non-null
destination address.
Now addresses are dynamically allocated when needed. Each connection is
created with src=dst=NULL, these entries are allocated on the fly, and
released when the connection is released.
This commit places calls to sockaddr_alloc() at the places where an address
is needed, and makes sure that the allocation is properly tested. This does
not add too many error paths since connection allocations are already in the
vicinity and share the same error paths. For the two cases where a
clear_addr() was called, instead the address was not allocated.
This pool will be used to allocate storage for source and destination
addresses used in connections. Two functions sockaddr_{alloc,free}()
were added and will have to be used everywhere an address is needed.
These ones are safe for progressive replacement as they check that the
existing pointer is set before replacing it. The pool is not yet used
during allocation nor freeing. Also they operate on pointers to pointers
so they will perform checks and replace values. The free one nulls the
pointer.
This is in preparation for the switch to dynamic address allocation,
let's migrate the code using the old fields to the pointers instead.
Note that no extra check was added for now, the purpose is only to
get the code to use the pointers and still work.
In the proxy protocol message handling we make sure the addresses are
properly allocated before declaring them unset.
At the moment we're facing difficulties with connection reuse based on
the fact that connections may be allocated very early only to set a
target address in transparent mode. With the imminent removal of the
legacy mode, the connection reuse by a same stream will not exist
anymore and all this awful complexity is not justified anymore. However
we still need to be able to assign addresses somewhere.
Thus instead of allocating a connection, we'll only place addresses where
needed in the stream during operations. But this takes quite some room
(typically 128 bytes). This is a nice opportunity for cleaning all this
up and dynamically allocatating the addresses fields, which will result
in actually saving memory from connection structs since most of the time
the client's "to" address is not used and the server's "from" is not used
either, thus saving ~256 bytes per end-to-end connection.
For now these new "src" and "dst" pointers point to addr.from and addr.to.
This will allow us to smoothly update the whole code to use these pointers
prior to going further and switching them to pools.
These functions are not used anymore. They didn't report failures and
as such were often misused. conn_get_src() and conn_get_dst() now
replaced them everywhere.
The backend connect code uses conn_get_{from,to}_addr to forward addresses
in transparent mode and to map server ports, without really checking if the
operation succeeds. In preparation of future changes, let's switch to
conn_get_{src,dst}() and integrate status check for possible failures.
These functions currently are the same as conn_get_from_addr() and
conn_get_to_addr() respectively except that they return a status for
the operation that the caller can test.
Default HTTP error messages are stored in an array of chunks. And since the HTX
was added, these messages are also converted in HTX and stored in another
array. But now, the first array is not used anymore because the legacy HTTP mode
was removed.
So now, only the array with the HTX messages are kept. The other one was
removed.
The old module proto_http does not exist anymore. All code dedicated to the HTTP
analysis is now grouped in the file proto_htx.c. So, to finish the polishing
after removing the legacy HTTP code, proto_htx.{c,h} files have been moved in
http_ana.{c,h} files.
In addition, all HTX analyzers and related functions prefixed with "htx_" have
been renamed to start with "http_" instead.
First of all, all legacy HTTP analyzers and all functions exclusively used by
them were removed. So the most of the functions in proto_http.{c,h} were
removed. Only functions to deal with the HTTP transaction have been kept. Then,
http_msg and hdr_idx modules were entirely removed. And finally the structure
http_msg was lightened of all its useless information about the legacy HTTP. The
structure hdr_ctx was also removed because unused now, just like unused states
in the enum h1_state. Note that the memory pool "hdr_idx" was removed and
"http_txn" is now smaller.
This commit breaks the compatibility with filters still relying on the legacy
HTTP code. The legacy callbacks were removed (http_data, http_chunk_trailers and
http_forward_data).
For now, the filters must still set the flag FLT_CFG_FL_HTX to be used on HTX
streams.
Since the legacy HTTP mode is disbabled, all HTTP sample fetches work on HTX
streams. So it is safe to remove all code relying on HTTP legacy mode. Among
other things, the function smp_prefetch_http() was removed with the associated
macros CHECK_HTTP_MESSAGE_FIRST() and CHECK_HTTP_MESSAGE_FIRST_PERM().
Since the legacy HTTP mode is disabled and no multiplexer relies on it anymore,
there is no reason to have 2 multiplexer protocols for the HTTP. So the protocol
PROTO_MODE_HTX was removed and all HTTP multiplexers use now PROTO_MODE_HTTP.
A long time ago, applets were seen as an alternative to connections,
and since their respective sizes were roughly equal it appeared wise
to share the same pool. Nowadays, connections got significantly larger
but applets are not that often used, except for the cache. However
applets are mostly complementary and not alternatives anymore, as
it's very possible not to have a back connection or to share one with
other streams.
The connections will soon lose their addresses and their size will
shrink so much that appctx won't fit anymore. Given that the old
benefits of sharing these pools have long disappeared, let's stop
doing this and have a dedicated pool for appctx.
Move the logic to decide if we redispatch to a new server from
sess_update_st_cer() to a new inline function, stream_choose_redispatch(), and
use it in do_l7_retry() instead of just setting the state to SI_ST_REQ.
That way, when using L7 retries, we won't redispatch the request to another
server except if "option redispatch" is used.
This should be backported to 2.0.
Sometimes we need to delegate some list processing to a function running
on another thread. In this case the list element will simply be queued
into a dedicated self-locked list and the task responsible for this list
will be woken up, calling the associated function which will run over the
list.
This is what work_list does. Such lists will be dedicated to a limited
type of work but will significantly ease such remote handling. A function
is provided to create these per-thread lists, their tasks and to properly
bind each task to a distinct thread, so that the caller only has to store
the resulting pointer to the start of the structure.
These structures should not be abused though as each head will consume
4 pointers per thread, hence 32 bytes per thread or 2 kB for 64 threads.
When we're purging idle connections, there's a race condition, when we're
removing the connection from the idle list, to add it to the list of
connections to free, if the thread owning the connection tries to free it
at the same time.
To fix this, simply add a per-thread lock, that has to be hold before
removing the connection from the idle list, and when, in conn_free(), we're
about to remove the connection from every list. That way, we know for sure
the connection will stay valid while we remove it from the idle list, to add
it to the list of connections to free.
This should happen rarely enough that it shouldn't have any impact on
performances.
This has not been reported yet, but could provoke random segfaults.
This should be backported to 2.0.
The maximum number of idle connections for a server can be configured by setting
the server option "pool-max-conn". But when we try to add a connection in its
idle list, because of a wrong comparison, it may be rejected because there are
already "pool-max-conn - 1" idle connections.
This patch must be backported to 2.0 and 1.9.
While experimenting with potentially improved fairness and latency using
ticket locks on a Ryzen 16-thread/8-core, a very strange situation happened
a lot for some levels of traffic. Around 300k connections per second, no
more connections would be accepted on the multi-threaded listener but all
others would continue to work fine. All attempts to trace showed that the
threads were all in the trylock in the fd cache, or in the spinlock of
fd_update_events(), or in the one of fd_may_recv(). But as indicated this
was not a deadlock since the process continues to work fine.
After quite some investigation it appeared that the issue is caused by a
lack of fairness between the fdcache's trylock and these functions' spin
locks above. In fact, regardless of the success or failure of the fdcache's
attempt at grabbing the lock, the poller was calling fd_update_events()
which locks the FD once for something that can be done with a CAS, and
then calls fd_may_recv() with another lock for something that most often
didn't change. The high contention on these spinlocks leaves no chance to
any other thread to grab the lock using trylock(), and once this happens,
there is no thread left to process incoming connection events nor to stop
polling on the FD, leaving all threads at 100% CPU but partially operational.
This patch addresses the issue by using bit-test-and-set instead of the OR
in fd_may_recv() / fd_may_send() so that nothing is done if the FD was
already configured as expected. It does the same in fd_update_events()
using a CAS to check if the FD's events need to be changed at all or not.
With this patch applied, it became impossible to reproduce the issue, and
now there's no way to saturate all 16 CPUs with the load used for testing,
as no more than 1350-1400 were noticed at 300+kcps vs 1600.
Ideally this patch should go further and try to remove the remaining
incarnations of the fdlock as this seems possible, but it's difficult
enough to be done in a distinct patch that will not have to be backported.
It is possible that workloads involving a high connection rate may slightly
benefit from this patch and observe a slightly lower CPU usage even when
the service doesn't misbehave.
This patch must be backported to 2.0 and 1.9.
In the function stream_int_notify(), when the opposite stream-interface is
blocked because there is no more room into the input buffer, if the flag
CF_WRITE_PARTIAL is set on this buffer, it is unblocked. It is a way to unblock
the reads on the other side because some data was sent.
But it is a problem during the fast-forwarding because only the stream is able
to remove the flag CF_WRITE_PARTIAL. So it is possible to have this flag because
of a previous send while the input buffer of the opposite stream-interface is
now full. In such case, the opposite stream-interface will be woken up for
nothing because its input buffer is full. If the same happens on the opposite
side, we will have a loop consumming all the CPU.
To fix the bug, the opposite side is now only notify if there is some available
room in its input buffer in the function si_cs_send(), so only if some data was
sent.
This patch must be backported to 2.0 and 1.9.
When deciding if we keep an idle connection in the session, check if
the number of connections currently in the session is >= the max allowed,
not >, or we'll keep an extra connection.
This should be backported to 1.9 and 2.0.
Just calling conn_force_unsubscribe() from conn_upgrade_mux_fe() is not
enough, as there may be multiple XPRT involved. Instead, require that
any user of conn_upgrade_mux_fe() unsubscribe itself before calling it.
This should fix upgrading a TCP connection to HTX when using SSL.
This should be backported to 2.0.