We already have my_ffsl() to find the lowest bit set in a word, and
this patch implements the search for the highest bit set in a word.
On x86 it uses the bsr instruction and on other architectures it
uses an efficient implementation.
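For illustration, the generic fallback could look like the following sketch (a dichotomic search; this is only an approximation, not the exact code of the patch) :

  /* Hedged sketch: return the 1-based position of the highest bit set
   * in <a>, or 0 if no bit is set. A dichotomic search keeps this cheap
   * on platforms without a bsr-like instruction.
   */
  static inline unsigned int my_flsl(unsigned long a)
  {
          unsigned int r = 1;

          if (!a)
                  return 0;
          if (sizeof(long) > 4 && (a >> 16 >> 16)) { a = a >> 16 >> 16; r += 32; }
          if (a & 0xffff0000UL) { a >>= 16; r += 16; }
          if (a & 0xff00)       { a >>= 8;  r += 8;  }
          if (a & 0xf0)         { a >>= 4;  r += 4;  }
          if (a & 0xc)          { a >>= 2;  r += 2;  }
          if (a & 0x2)          { a >>= 1;  r += 1;  }
          return r;
  }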
By picking two randoms following the P2C algorithm, we sometimes observe
asymmetric loads on bursts of small session counts. This is typically
what makes h2load take a bit of time to complete the last 100% because
if a thread gets two connections while the other ones only have one,
it takes twice the time to complete its work.
This patch proposes a modification of the p2c algorithm which seems
more suitable to this case : it mixes a rotating index with a random.
This way, we're certain that all threads are consulted in turn, and at
the same time we're not forced to use the ones we're giving a chance to.
This significantly increases the traffic rate. Now h2load shows faster
completion, and the average request rates on H2 and the TLS resume rate
increase by a bit more than 5% compared to pure p2c.
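As a rough illustration of the idea (the names below are assumptions, not the actual code) :

  #include <stdlib.h>

  /* Hedged sketch: one candidate thread comes from a rotating index so
   * that all threads are consulted in turn, the other from a random;
   * the least loaded of the two is kept, as in plain p2c.
   */
  static unsigned int pick_thread(unsigned int *rot_idx,
                                  const unsigned int *conn_count,
                                  unsigned int nb_threads)
  {
          unsigned int t1 = (*rot_idx)++ % nb_threads; /* rotating candidate */
          unsigned int t2 = rand() % nb_threads;       /* random candidate */

          return conn_count[t1] <= conn_count[t2] ? t1 : t2;
  }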
The index was placed into the struct bind_conf because 1) it's faster
there and 2) it's the best place to optimally distribute traffic among a
group of listeners. It's the only runtime-modified element there and
it will be quite cache-hot.
By using LIST_DEL_INIT() instead of LIST_DEL()+LIST_INIT() we manage
to bump the peak connection rate by no less than 3% on 8 threads.
The perf top profile shows much less contention in this area which
suffered from the second reload of the pointers.
It turns out that we call LIST_DEL+LIST_INIT very frequently and that
the compiler doesn't know what pointers get modified in the e->n->p
and e->p->n dance, so when LIST_INIT() is called, it reloads these
pointers, which is quite a bit of a mess in terms of performance.
This patch adds LIST_DEL_INIT() to perform the two operations at once
using local temporary variables so that the compiler knows these
pointers are left unaffected.
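The idea roughly looks like this (a sketch, not necessarily the exact macro added by the patch) :

  /* Hedged sketch: load n and p once into locals, unlink the element,
   * then point it to itself. The compiler never has to reload e->n and
   * e->p between the two steps.
   */
  struct list { struct list *n, *p; };

  #define LIST_DEL_INIT(el) ({                     \
          struct list *_el = (el);                 \
          struct list *_n  = _el->n;               \
          struct list *_p  = _el->p;               \
          _n->p = _p;                              \
          _p->n = _n;                              \
          _el->n = _el->p = _el;                   \
          _el;                                     \
  })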
This patch enables the part of this reg test which could not work due to a vtest
(formerly varnishtest) bug.
NOTE: You must have a vtest version which includes commit 4e43cc1 for this
script to succeed (see 4e43cc1fec for more information).
This patch adds "protobuf" protocol buffers specific converter wich
may used in combination with "ungrpc" as first converter to extract
a protocol buffers field value. It is simply implemented reusing
protobuf_field_lookup() which is the protocol buffers specific parser already
used by "ungrpc" converter which only parse a gRPC header in addition of
parsing protocol buffers message.
Update the documentation for this new "protobuf" converter.
We move the code responsible for parsing protocol buffers messages
inside gRPC messages from sample.c to include/proto/protocol_buffers.h
so that it can be reused to cascade the "ungrpc" converter.
In 84e417d8 ("MINOR: ssl: support Openssl 1.1.1 early callback for
switchctx") the code was extended to also support OpenSSL 1.1.1
(code already supported BoringSSL). A configuration check warning
was updated but with the wrong logic : the #ifdef needs a && instead
of an ||.
Reported in #54.
Should be backported to 1.8.
Emeric reports that when MAX_THREADS and/or MAX_PROCS are set to lower
values, referencing thread or process numbers higher than these limits
in cpu-map returns errors. This is annoying because these typically are
silent settings that are expected to be used only when set. Let's switch
back to LONGBITS for this limit.
Since the "wurfl" device detection engine was merged slightly more than
two years ago (2016-11-04), it never received a single fix nor update.
For almost two years it didn't receive even the minimal review or changes
needed to be compatible with threads, and it's remained build-broken for
about the last 9 months, consecutive to the last buffer API changes,
without anyone ever noticing! When asked on the list, nobody confirmed
using it :
https://www.mail-archive.com/haproxy@formilux.org/msg32516.html
And obviously nobody even cared to verify that it did still build. So we
are left with this broken code with no user and no maintainer. It might
even suffer from remotely exploitable vulnerabilities without anyone
being able to check if it presents any risk. It's a pain to update each
time there is an API change because it doesn't build as it depends on
external libraries that are not publicly accessible, leading to careful
blind changes. It slows down the whole project. This situation is not
acceptable at all.
It's time to cure the problem where it is. This patch removes all this
dead, non-buildable, non-working code. If anyone ever decides to use it,
which I seriously doubt based on history, it could be reintegrated, but
this time the following guarantees will be required :
- someone has to step up as a maintainer and have his name listed in
the MAINTAINERS file (I should have been more careful last time).
This person will take the sole blame for all issues and will be
responsible for fixing the bugs and incompatibilities affecting
this code, and for making it evolve to follow regular internal API
updates.
- support building on a standard distro with automated tools (i.e. no
more "click on this site, register your e-mail and download an
archive then figure how to place this into your build system").
Dummy libs are OK though as long as they allow the mainline code to
build and start.
- multi-threaded support must be fixed. I mean seriously, not worked
around with a check saying "please disable threads, we've been busy
fishing for the last two years".
This may be backported to 1.9 given that the code has never worked there
either, thus at least we're certain nobody will miss it.
From now on, "ungrpc" may take a second optional argument to provide
the protocol buffers types used to encode the field value to be extracted.
When absent, the field value is extracted as a binary sample which may then
be followed by other converters like "hex" which takes a binary sample as input.
When this second argument is a type which does not match the one found by "ungrpc",
this field is considered as not found even if present.
With this patch we also remove the useless "varint" and "svarint" converters.
Update the documentation about "ungrpc" converters.
Parsing protocol buffer fields always consists in skipping the field
when it is not the one looked up, or storing the field value when it is found.
So, with this patch we factorize a little bit the code for the "ungrpc" converter.
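For illustration, the skip-or-store step roughly follows the protocol buffers wire types (a hedged sketch; the helper names in the patch differ) :

  #include <stddef.h>
  #include <stdint.h>

  /* Hedged sketch: consume one field value, either skipping it or
   * storing its raw bytes when it is the one looked up. Wire types:
   * 0 = varint, 1 = 64-bit, 2 = length-delimited, 5 = 32-bit.
   * Returns the number of bytes consumed, or 0 on truncated input.
   */
  static size_t pbuf_field(const uint8_t *p, size_t len, int wire_type,
                           int store, const uint8_t **val, size_t *val_len)
  {
          size_t i = 0, flen;
          uint64_t l = 0;
          int shift = 0;

          switch (wire_type) {
          case 0: /* varint: MSB set means "one more byte follows" */
                  while (i < len && (p[i] & 0x80))
                          i++;
                  if (i >= len)
                          return 0;
                  flen = i + 1;
                  break;
          case 1: /* fixed 64-bit */
                  flen = 8;
                  break;
          case 2: /* length-delimited: a varint length then the payload */
                  while (i < len && (p[i] & 0x80)) {
                          l |= (uint64_t)(p[i] & 0x7f) << shift;
                          shift += 7;
                          i++;
                  }
                  if (i >= len)
                          return 0;
                  l |= (uint64_t)(p[i] & 0x7f) << shift;
                  flen = i + 1 + l;
                  break;
          case 5: /* fixed 32-bit */
                  flen = 4;
                  break;
          default:
                  return 0;
          }
          if (flen > len)
                  return 0;
          if (store) {
                  *val = p;
                  *val_len = flen;
          }
          return flen;
  }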
While the legacy code converts h2 to h1 and provides some control over
what is passed, in htx mode there is no such control and it is possible
to pass control chars and linear white spaces in the path, which are
possibly reencoded differently once passed to the H1 side.
HTX supports parse error reporting using a special flag. Let's check
the correctness of the :path pseudo header and report any anomaly in
the HTX flag.
Thanks to Jérôme Magnin for reporting this bug with a working reproducer.
This fix must be backported to 1.9 along with the two previous patches
("MINOR: htx: unconditionally handle parsing errors in requests or
responses" and "MINOR: mux-h2: always pass HTX_FL_PARSING_ERROR
between h2s and buf on RX").
In order to allow the H2 parser to report parsing errors, we must make
sure to always pass the HTX_FL_PARSING_ERROR flag from the h2s htx to
the conn_stream's htx.
The htx request and response processing functions currently only check
for HTX_FL_PARSING_ERROR on incomplete messages because that's how mux_h1
delivers these. However with H2 we have to detect some parsing errors in
the format of certain pseudo-headers (e.g. :path), so we do have a complete
message but we want to report an error.
Let's move the parse error check earlier so that it always triggers when
the flag is present. It was also moved for htx_wait_for_request_body()
since we definitely want to be able to abort processing such an invalid
request even if it appears complete, but it was not changed in the forward
functions so as not to truncate contents before the position of the first
error.
Well, that's becoming embarrassing. Now this fixes commit 4ef6801c
("BUG/MEDIUM: list: correct fix for LIST_POP_LOCKED's removal of last
element") which itself tried to fix commit 285192564. This fix only
works under low contention and was tested with the listener's queue.
With the idle conns it's obvious that it's still wrong since adding
more than one element to the list leaves a LLIST_BUSY pointer into
the list's head. This was visible when accumulating idle connections
in a server's list.
This new version of the fix almost goes back to the original code,
except that since then we addressed issues with expectedly idempotent
operations that were not. Now the code has been verified on paper again
and has survived 300 million connections spread over 4 threads.
This will have to be backported if the commit above is backported.
This patch simply extracts the code of smp_fetch_req_ungrpc() for "req.ungrpc"
from http_fetch.c to move it to sample.c with very few modifications.
Furthermore smp_fetch_body_buf(), which was used to fetch the body contents, is no longer needed.
Update the documentation for gRPC.
A crash in H2 was reported in issue #52. It turns out that there is a
small but existing race by which a conn_stream could detach itself
using h2_detach(), not being able to destroy the h2s due to pending
output data blocked by flow control, then upon next h2s activity
(transfer_data or trailers parsing), an ES flag may need to be turned
into a CS_FL_REOS bit, causing a dereference of a NULL stream. This is
a side effect of the fact that we still have a few places which
incorrectly depend on the CS flags, while these flags should only be
set by h2_rcv_buf() and h2_snd_buf().
All candidate locations along this path have been secured against this
risk, but the code should really evolve to stop depending on CS anymore.
This fix must be backported to 1.9 and possibly partially to 1.8.
Commit 26f6ae12c ("MAJOR: config: disable support for nbproc and nbthread
in parallel") revealed that there was accidently nbproc+nbthread in this
test while nbproc is the one expected. This likely is a leftover from a
previous attempt at reproducing the issue.
The global maxconn value is often a pain to configure :
- in development the user never has the permissions to increase the
rlim_cur value too high and gets warnings all the time ;
- in some production environments, users may have limited actions on
it or may only be able to act on rlim_fd_cur using ulimit -n. This
is sometimes particularly true in containers or whatever environment
where the user has no privilege to upgrade the limits.
- keeping config homogeneous between machines is even less easy.
We already had the ability to automatically compute maxconn from the
memory limits when they were set. This patch goes a bit further by also
computing the limit permitted by the configured limit on the number of
FDs. For this it simply reverses the rlim_fd_cur calculation to determine
maxconn based on the number of reserved sockets for listeners & checks,
the number of SSL engines and the number of pipes (absolute or relative).
This way it becomes possible to make maxconn always be the highest possible
value resulting in maxsock matching what was set using "ulimit -n", without
ever setting it. Note that we adjust to the soft limit, not the hard one,
since it's what is configured with ulimit -n. This allows users to also
limit to low values if needed.
Just like before, the calculated value is reported in verbose mode.
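The reversed computation is roughly the following (names and the exact set of reserved FDs are simplified here) :

  /* Hedged sketch: starting from the allowed number of FDs, subtract
   * the sockets reserved for listeners, checks and SSL engines, then
   * divide by the number of FDs each connection may need (two sides
   * plus possible pipes).
   */
  static int compute_maxconn_from_fds(int fd_limit, int reserved_socks,
                                      int ssl_engine_fds, int fds_per_conn)
  {
          int remain = fd_limit - reserved_socks - ssl_engine_fds;

          if (remain <= 0 || fds_per_conn <= 0)
                  return 0;
          return remain / fds_per_conn;
  }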
We'll need to know the global maxsock before the maxconn calculation.
Actually only two components were calculated too late, the peers FD
and the stats FD. Let's move them a few lines upward.
The default number of pipes is adjusted based on the sum of frontends
and backends maxconn/fullconn settings. Now that it is possible to have
a null maxconn on a frontend to indicate "unlimited" with commit
c8d5b95e6 ("MEDIUM: config: don't enforce a low frontend maxconn value
anymore"), the sum of maxconn may remain low and limited to the only
frontends/backends where this limit is set.
This patch considers this new unlimited case when doing the check, and
automatically switches to the default value which is maxconn/4 in this
case. All the calculation was moved to a distinct function for ease of
use. This function also supports returning unlimited (-1) when the
value depends on global.maxconn and this latter is not yet set.
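A hedged sketch of what such a function may look like (names are illustrative) :

  /* Return the default number of pipes. When at least one frontend has
   * an unlimited (null) maxconn, the sum is meaningless, so fall back
   * to global.maxconn/4, or return -1 (unlimited) when global.maxconn
   * is not yet set.
   */
  static int compute_default_pipes(long sum_maxconn, int unlimited_seen,
                                   int global_maxconn)
  {
          if (unlimited_seen) {
                  if (!global_maxconn)
                          return -1;           /* not known yet */
                  return global_maxconn / 4;
          }
          return sum_maxconn / 4;
  }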
When the master re-execs itself on reload, it doesn't restore the initial
rlim_fd_cur/rlim_fd_max values, which have been modified by the ulimit-n
or global maxconn directives. This is a problem, because if these values
were set really low it could prevent the process from restarting, and if
they were set very high, this could have some implications on the restart
time, or later on the computed maxconn.
Let's simply reset these values to the ones we had at boot to maintain
the system in a consistent state.
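A hedged sketch of the restore step before the re-exec (variable names are assumptions) :

  #include <sys/resource.h>

  /* Put back the boot-time RLIMIT_NOFILE values, saved by an early
   * getrlimit() call, so that the re-executed master starts clean.
   */
  static void restore_initial_fd_limits(rlim_t cur_at_boot, rlim_t max_at_boot)
  {
          struct rlimit limit;

          limit.rlim_cur = cur_at_boot;
          limit.rlim_max = max_at_boot;
          setrlimit(RLIMIT_NOFILE, &limit);
  }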
A backport could be performed to 1.9 and maybe 1.8. This patch depends on
the two previous ones.
It's not normal that external processes are run with high FD limits,
as quite often such processes (especially shell scripts) will iterate
over all FDs to close them. Ideally we should even provide a tunable
with the external-check directive to adjust this value, but at least
we need to restore it to the value that was active when starting
haproxy (before it was adjusted for maxconn). Additionally with very
low maxconn values causing rlim_fd_cur to be low, some heavy checks
could possibly fail. This was also mentioned in issue #45.
Currently the following config and scripts report this :
$ cat rlim.cfg
global
    maxconn 500000
    external-check

listen www
    bind :8001
    timeout client 5s
    timeout server 5s
    timeout connect 5s
    option external-check
    external-check command "$PWD/sleep1.sh"
    server local 127.0.0.1:80 check inter 1s
$ cat sleep1.sh
#!/bin/sh
/bin/sleep 0.1
echo -n "soft: ";ulimit -S -n
echo -n "hard: ";ulimit -H -n
# ./haproxy -db -f rlim.cfg
soft: 1000012
hard: 1000012
soft: 1000012
hard: 1000012
Now with the fix :
# ./haproxy -db -f rlim.cfg
soft: 1024
hard: 4096
soft: 1024
hard: 4096
This fix should be backported to stable versions but it depends on
"MINOR: global: keep a copy of the initial rlim_fd_cur and rlim_fd_max
values" and "BUG/MINOR: init: never lower rlim_fd_max".
If a ulimit-n value is set, we must not lower the rlim_max value if the
new value is lower, but only adjust the rlim_cur one. The effect is
that on very low values, this could prevent a master-worker reload, or
make an external check fail by lack of FDs.
This may be backported to 1.9 and earlier, but it depends on this patch
"MINOR: global: keep a copy of the initial rlim_fd_cur and rlim_fd_max
values".
Let's keep a copy of these initial values. They will be useful to
compute automatic maxconn, as well as to restore proper limits when
doing an execve() on external checks.
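For instance (a minimal sketch; the variable name is an assumption) :

  #include <sys/resource.h>

  /* Boot-time FD limits, saved once early in main() and never modified. */
  static struct rlimit limit_fd_at_boot;

  static void save_initial_fd_limits(void)
  {
          getrlimit(RLIMIT_NOFILE, &limit_fd_at_boot);
  }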
This patch implements a peer heartbeat feature to prevent any haproxy peer
from reconnecting too often, consuming sockets for nothing.
To do so, we add a new PEER_MSG_CTRL_HEARTBEAT message to the
PEER_MSG_CLASS_CONTROL class of peers control messages. A ->heartbeat field
is added to peer structs to store the heartbeat timeout value, which is handled
by the same function as ->reconnect to control the session timeouts. A 2-byte
heartbeat message is sent every 3s when no updates have to be sent. This way,
the peer which receives such a message is sure
the remote peer is still alive. So, it resets the ->reconnect peer session
timeout to its initial value (5s). This prevents any reconnection to an
already connected alive peer.
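For illustration, the emission side could look like this sketch (the byte values and the function are assumptions; only the message names come from the patch) :

  #include <sys/socket.h>

  /* Hedged sketch: the heartbeat is a 2-byte control message, one byte
   * for the class and one for the type, emitted when nothing was sent
   * for 3 seconds.
   */
  enum { PEER_MSG_CLASS_CONTROL = 0 };   /* assumed value */
  enum { PEER_MSG_CTRL_HEARTBEAT = 3 };  /* assumed value */

  static int peer_send_heartbeat(int fd)
  {
          const char msg[2] = { PEER_MSG_CLASS_CONTROL, PEER_MSG_CTRL_HEARTBEAT };

          return send(fd, msg, sizeof(msg), 0) == sizeof(msg);
  }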
Historically the default frontend's maxconn used to be quite low (2000),
which was sufficient two decades ago but often proved to be a problem
when users had purposely set the global maxconn value but forgot to set
the frontend's.
There is no point in keeping this arbitrary limit for frontends : when
the global maxconn is lower, it's already too high and when the global
maxconn is much higher, it becomes a limiting factor which causes trouble
in production.
This commit allows the value to be set to zero, which becomes the new
default value, to mean it's not directly limited, or in fact it's set
to the global maxconn. Since this operation used to be performed before
computing a possibly automatic global maxconn based on memory limits,
the calculation of the maxconn value and its propagation to the backends'
fullconn has now moved to a dedicated function, proxy_adjust_all_maxconn(),
which is called once the global maxconn is stabilized.
This comes with two benefits :
1) a configuration missing "maxconn" in the defaults section will not
limit itself to a magically hardcoded value but will scale up to the
global maxconn ;
2) when the global maxconn is not set and memory limits are used instead,
the frontends' maxconn automatically adapts, and the backends' fullconn
as well.
It is possible to update a frontend's maxconn from the CLI. Unfortunately
when doing this it scratches all listeners' maxconn values and sets them
all to the new frontend's value. This can be problematic when mixing
different traffic classes (bind to interface or private networks, etc).
Now that the listener's maxconn is allowed to remain unset, let's not
change these values when setting the frontend's maxconn. This way the
overall frontend's limit can be raised but if certain specific listeners
had their own value forced in the config, they will be preserved. This
makes more sense and is more in line with the principle of defaults
propagation.
It's pointless to always set and maintain l->maxconn because the accept
loop already enforces the frontend's limit anyway. Thus let's stop setting
this value by default and keep it to zero meaning "no limit". This way the
frontend's maxconn will be used by default. Of course if a value is set,
it will be enforced.
In an attempt to try to provide automatic maxconn settings, we need to
decorrelate a listener's backlog and maxconn so that these values can be
independent. This introduces a listener_backlog() function which retrieves
the backlog value from the listener's backlog, the frontend's, the
listener's maxconn, the frontend's or falls back to 1024. This
corresponds to what was done in cfgparse.c to force a value there except
the last fallback which was not set since the frontend's maxconn is always
known.
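The fallback order described above roughly gives (a sketch with simplified struct fields) :

  /* Minimal stand-in types for the sketch; the real structs differ. */
  struct proxy    { int backlog, maxconn; };
  struct listener { int backlog, maxconn; struct proxy *frontend; };

  static int listener_backlog(const struct listener *l)
  {
          if (l->backlog)
                  return l->backlog;
          if (l->frontend->backlog)
                  return l->frontend->backlog;
          if (l->maxconn)
                  return l->maxconn;
          if (l->frontend->maxconn)
                  return l->frontend->maxconn;
          return 1024;
  }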
As seen with Olivier, in the end the fix in commit 285192564 ("BUG/MEDIUM:
list: fix LIST_POP_LOCKED's removal of the last pointer") is wrong :
the code there was right, but the bug was triggered by another bug in
LIST_ADDQ_LOCKED() which doesn't properly update the list's head because
it inserts in the wrong order.
This will have to be backported if the commit above is backported.
We were not checking p->feconn nor the global actconn soon enough. In
older versions this could result in a frontend accepting more connections
than allowed by its maxconn or the global maxconn, exactly N-1 extra
connections where N is the number of threads, provided each of these
threads were running a different listener. But with the lock removal,
it became worse, the excess could be the listener's maxconn multiplied
by the number of threads. Among the nasty side effects was that LI_FULL
could be removed while the limit was still exceeded and in some cases the
polling on the socket was not re-enabled.
This commit takes care of updating and checking p->feconn and the global
actconn *before* processing the connection, so that the listener can be
turned off before accepting the socket if needed. This requires moving
some of the bookkeeping operations from the session to the listener, which
totally makes sense in this context.
Now the limits are properly respected, even if a listener's maxconn is
over a frontend's. This only applies on top of the listener lock removal
series and doesn't have to be backported.
There is a very difficult to reproduce race in the listener's accept
code, which is much easier to reproduce once connection limits are
properly enforced. It's an ABBA lock issue :
- the following functions take l->lock then lq_lock :
disable_listener, pause_listener, listener_full, limit_listener,
do_unbind_listener
- the following ones take lq_lock then l->lock :
resume_listener, dequeue_all_listeners
This is because __resume_listener() only takes the listener's lock
and expects to be called with lq_lock held. The problem can easily
happen when listener_full() and limit_listener() are called a lot
while in parallel another thread releases sessions for the same
listener using listener_release() which in turn calls resume_listener().
This scenario is more prevalent in 2.0-dev since the removal of the
accept lock in listener_accept(). However in 1.9 and before, a different
but extremely unlikely scenario can happen :
  thread1                           thread2
  ............................      enter listener_accept()
  limit_listener()
  ............................      long pause before taking the lock
  session_free()
    dequeue_all_listeners()
      lock(lq_lock) [1]
  ............................      try_lock(l->lock) [2]
        __resume_listener()
          spin_lock(l->lock) =>WAIT[2]
  ............................      accept()
                                    l->accept()
                                    nbconn==maxconn =>
                                      listener_full()
                                        state==LI_LIMITED =>
                                          lock(lq_lock) =>DEADLOCK[1]!
In practice it is almost impossible to trigger it because it requires
to limit both on the listener's maxconn and the frontend's rate limit,
at the same time, and to release the listener when the connection rate
goes below the limit between poll() returns the FD and the lock is
taken (a few nanoseconds). But maybe with threads competing on the
same core it has more chances to appear.
This patch removes the lq_lock and replaces it with a lockless queue
for the listener's wait queue (well, technically speaking a self-locked
queue) brought by commit a8434ec14 ("MINOR: lists: Implement locked
variations.") and its few subsequent fixes. This relieves us from the
need of the lq_lock and removes the deadlock. It also gets rid of the
distinction between __resume_listener() and resume_listener() since the
only difference was the lq_lock. All listener removals from the list
are now unconditional to avoid races on the state. It's worth noting
that the list used to never be initialized and that it used to work
only thanks to the state tests, so the initialization has now been
added.
This patch must carefully be backported to 1.9 and very likely 1.8.
It is mandatory to be careful about replacing all manipulations of
l->wait_queue, global.listener_queue and p->listener_queue.
Since LIST_DEL_LOCKED() and LIST_POP_LOCKED() now automatically reinitialize
the removed element, there's no need for keeping this LIST_INIT() call in the
idle connection code.
These operations previously used to return a "locked" element, which is
a constraint when multiple threads try to delete the same element, because
the second one will block indefinitely. Instead, let's make sure that both
LIST_DEL_LOCKED() and LIST_POP_LOCKED() always reinitialize the element
after deleting it. This ensures that the second thread will immediately
unblock and succeed with the removal. It also secures the pop vs delete
competition that may happen when trying to remove an element that's about
to be dequeued.
Commit a8434ec14 ("MINOR: lists: Implement locked variations.")
introduced locked lists which use the elements pointers as locks
for concurrent operations. Under heavy stress the lists occasionally
fail. The cause is a missing barrier at some points when updating
the list element and the head : nothing prevents the compiler (or
CPU) from updating the list head first before updating the element,
making another thread jump to a wrong location. This patch simply
adds the missing barriers before these two operations.
This will have to be backported if the commit above is backported.
There was a typo making the last updated pointer be the pre-last element's
prev instead of the last element's prev. It didn't show up during early
tests because the contention is very rare on this one and it's implicitly
recovered when updating the pointers to go to the next element, but it was
clearly visible in the listener_accept() tests by having all threads block
on LIST_POP_LOCKED() with n==p==LLIST_BUSY.
This will have to be backported if commit a8434ec14 ("MINOR: lists:
Implement locked variations.") is backported.
Commit a8434ec14 ("MINOR: lists: Implement locked variations.")
introduced locked lists which use the elements pointers as locks
for concurrent operations. A copy-paste typo in LIST_ADDQ_LOCKED()
causes corruption in the list in case the next pointer is already
held, as it restores the previous pointer into the next one. It
may impact the server pools.
This will have to be backported if the commit above is backported.
global.maxsock used to be augmented by the frontend's maxconn value
for each frontend listener, which is absurd when there are many
listeners in a frontend because the frontend's maxconn fixes an
upper limit to how many connections will be accepted on all of its
listeners anyway. What is needed instead is to add one to count the
listening socket.
In addition, the CLI's and peers' value was incremented twice, the
first time when creating the listener and the second time in the
main init code.
Let's now make sure we only increment global.maxsock by the required
amount of sockets. This means not adding maxconn for each listener,
and relying on the global values when they are correct.
Threads have long matured by now, still for most users their usage is
not trivial. It's about time to enable them by default on platforms
where we know the number of CPUs bound. This patch does this, it counts
the number of CPUs the process is bound to upon startup, and enables as
many threads by default. Of course, "nbthread" still overrides this, but
if it's not set the default behaviour is to start one thread per CPU.
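Counting the bound CPUs can be done roughly like this (a sketch; the actual code may use a different per-platform mechanism) :

  #define _GNU_SOURCE
  #include <sched.h>

  /* Hedged sketch: number of CPUs the process is bound to at startup,
   * used as the default thread count when "nbthread" is not set.
   */
  static int default_nbthread(void)
  {
          cpu_set_t set;

          if (sched_getaffinity(0, sizeof(set), &set) == 0)
                  return CPU_COUNT(&set);
          return 1; /* fallback when the affinity cannot be read */
  }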
The default number of threads is reported in "haproxy -vv". Simply using
"taskset -c" is now enough to adjust this number of threads so that there
is no more need for playing with cpu-map. And thanks to the previous
patches on the listener, the vast majority of configurations will not
need to duplicate "bind" lines with the "process x/y" statement anymore
either, so a simple config will automatically adapt to the number of
processors available.
tune.listener.multi-queue { on | off }
Enables ('on') or disables ('off') the listener's multi-queue accept which
spreads the incoming traffic to all threads a "bind" line is allowed to run
on instead of taking them for itself. This provides a smoother traffic
distribution and scales much better, especially in environments where threads
may be unevenly loaded due to external activity (network interrupts colliding
with one thread for example). This option is enabled by default, but it may
be forcefully disabled for troubleshooting or for situations where it is
estimated that the operating system already provides a good enough
distribution and connections are extremely short-lived.
It's important to monitor the accept queues to know if some incoming
connections had to be handled by their originating thread due to an
overflow. It's also important to be able to confirm thread fairness.
This patch adds "accq_pushed" to activity reporting, which reports
the number of connections that were successfully pushed into each
thread's queue, and "accq_full", which indicates the number of
connections that couldn't be pushed because the thread's queue was
full.
The idea is to redistribute an incoming connection to one of the
threads a bind_conf is bound to when there is more than one. We do this
using a random improved by the p2c algorithm : a random() call returns
two different thread numbers. We then compare their respective connection
count and the length of their accept queues, and pick the least loaded
one. We even use this deferred accept mechanism if the target thread
ends up being the local thread, because this maintains fairness between
all connections and tests show that it's about 1% faster this way,
likely due to cache locality. If the target thread's accept queue is
full, the connection is accepted synchronously by the current thread.
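The pick itself roughly looks like this (a hedged sketch; names are illustrative) :

  #include <stdlib.h>

  /* Derive two candidate threads from one random value, then keep the
   * one with the fewest connections, using the accept queue length as
   * a tie-break.
   */
  static int pick_target_thread(const int *conn_count, const int *queue_len,
                                int nb_threads)
  {
          unsigned int r = rand();
          int t1 = r % nb_threads;
          int t2 = (r / nb_threads) % nb_threads;

          if (conn_count[t1] != conn_count[t2])
                  return conn_count[t1] < conn_count[t2] ? t1 : t2;
          return queue_len[t1] <= queue_len[t2] ? t1 : t2;
  }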
There is one point where we can migrate a connection to another thread
without taking risk, it's when we accept it : the new FD is not yet in
the fd cache and no task was created yet. It's still possible to assign
it a different thread than the one which accepted the connection. The
only requirement for this is to have one accept queue per thread and
their respective processing tasks that have to be woken up each time
an entry is added to the queue.
This is a multiple-producer, single-consumer model. Entries are added
at the queue's tail and the processing task is woken up. The consumer
picks entries at the head and processes them in order. The accept queue
contains the fd, the source address, and the listener. Each entry of
the accept queue was rounded up to 64 bytes (one cache line) to avoid
cache aliasing because tests have shown that otherwise performance
suffers a lot (5%). A test has shown that it's important to have at
least 256 entries for the rings, as at 128 it's still possible to fill
them often at high loads on small thread counts.
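For illustration, the layout could look like this sketch (field names and the address storage are assumptions; only the ring size and the cache-line rounding come from the text above) :

  /* One ring per thread: multiple producers push at the tail, the owner
   * thread consumes at the head. Each entry is rounded up to one cache
   * line to avoid aliasing between producers.
   */
  #define ACCEPT_QUEUE_SIZE 256

  struct listener; /* opaque here */

  struct accept_queue_entry {
          struct listener *listener;    /* listener that accepted the fd */
          int              fd;          /* the freshly accepted socket */
          unsigned char    addr[48];    /* compact copy of the source address */
  } __attribute__((aligned(64)));

  struct accept_queue_ring {
          unsigned int head;            /* consumer index */
          unsigned int tail;            /* producer index */
          struct accept_queue_entry entry[ACCEPT_QUEUE_SIZE];
  };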
The processing task does almost nothing except calling the listener's
accept() function and updating the global session and SSL rate counters
just like listener_accept() does on synchronous calls.
At this point the accept queue is implemented but not used.
In order to quickly pick a thread ID when accepting a connection, we'll
need to know certain pre-computed values derived from the thread mask,
which are counts of bits per position multiples of 1, 2, 4, 8, 16 and
32. In practice it is sufficient to compute only the 4 first ones and
store them in the bind_conf. We update the count every time the
bind_thread value is adjusted.
The fields in the bind_conf struct have been moved around a little bit
to make it easier to group all thread bit values into the same cache
line.
The function used to return a thread number is bind_map_thread_id(),
and it maps a number between 0 and 31/63 to a thread ID between 0 and
31/63, starting from the left.
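A naive, loop-based equivalent of such a mapping would be the following (the real function uses the pre-computed counts above to avoid scanning the mask) :

  /* Hedged sketch: map <num> (0..31/63) to the ID of the <num>-th set
   * bit of <mask>, modulo the number of set bits.
   */
  static unsigned int map_thread_id(unsigned long mask, unsigned int num)
  {
          unsigned int bit, count = 0;
          unsigned int nbits = __builtin_popcountl(mask);

          if (!nbits)
                  return 0;
          num %= nbits;
          for (bit = 0; bit < 8 * sizeof(mask); bit++) {
                  if (!(mask & (1UL << bit)))
                          continue;
                  if (count++ == num)
                          return bit;
          }
          return 0;
  }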