RAND_keep_random_devices_open is OpenSSL specific function, not
implemented in LibreSSL and BoringSSL. Let us define guard
HAVE_SSL_RAND_KEEP_RANDOM_DEVICES_OPEN in include/haproxy/openssl-compat.h
That guard does not depend anymore on HA_OPENSSL_VERSION
The runqueue_ticks counts the number of task wakeups and is used to
position new tasks in the run queue, but since we've had per-thread
run queues, the values there are not very relevant anymore and the
nice value doesn't apply well if some threads are more loaded than
others. In addition, letting all threads compete over a shared counter
is not smart as this may cause some excessive contention.
Let's move this index close to the run queues themselves, i.e. one per
thread and a global one. In addition to improving fairness, this has
increased global performance by 2% on 16 threads thanks to the lower
contention on rqueue_ticks.
Fairness issues were not observed, but if any were to be, this patch
could be backported as far as 2.0 to address them.
Historically this function would try to wake the most accurate number of
process_stream() waiters. But since the introduction of filters which could
also require buffers (e.g. for compression), things started not to be as
accurate anymore. Nowadays muxes and transport layers also use buffers, so
the runqueue size has nothing to do anymore with the number of supposed
users to come.
In addition to this, the threshold was compared to the number of free buffer
calculated as allocated minus used, but this didn't work anymore with local
pools since these counts are not updated upon alloc/free!
Let's clean this up and pass the number of released buffers instead, and
consider that each waiter successfully called counts as one buffer. This
is not rocket science and will not suddenly fix everything, but at least
it cannot be as wrong as it is today.
This could have been marked as a bug given that the current situation is
totally broken regarding this, but this probably doesn't completely fix
it, it only goes in a better direction. It is possible however that it
makes sense in the future to backport this as part of a larger series if
the situation significantly improves.
The buffer wait queue used to be global historically but this doest not
make any sense anymore given that the most common use case is to have
thread-local pools. Thus there's no point waking up waiters of other
threads after releasing an entry, as they won't benefit from it.
Let's move the queue head to the thread_info structure and use
ti->buffer_wq from now on.
This revert the commit 63e6cba12 ("MEDIUM: server: add server-states version
2"), but keeping all recent features added to the server-sate file. Instead
of adding a 2nd version for the server-state file format to handle the 5 new
fields added during the 2.4 development, these fields are considered as
optionnal during the parsing. So it is possible to load a server-state file
from HAProxy 2.3. However, from 2.4, these new fields are always dumped in
the server-state file. But it should not be a problem to load it on the 2.3.
This patch seems a bit huge but the diff ignoring the space is much smaller.
The version 2 of the server-state file format is reserved for a real
refactoring to address all issues of the current format.
Remove ebmb_node entry from struct connection and create a dedicated
struct conn_hash_node. struct connection contains now only a pointer to
a conn_hash_node, allocated only for connections where target is of type
OBJ_TYPE_SERVER. This will reduce memory footprints for every
connections that does not need http-reuse such as frontend connections.
In MT_LIST_TRY_ADDQ(), deal with the "prev" field of the element before the
"next". If the element is the first in the list, then its next will
already have been locked when we locked list->prev->next, so locking it
again will fail, and we'll start over and over.
This should be backported to 2.3.
The maximum number of connections accepted at once by a thread for a single
listener used to default to 64 divided by the number of processes but the
tasklet-based model is much more scalable and benefits from smaller values.
Experimentation has shown that 4 gives the highest accept rate for all
thread values, and that 3 and 5 come very close, as shown below (HTTP/1
connections forwarded per second at multi-accept 4 and 64):
ac\thr| 1 2 4 8 16
------+------------------------------
4| 80k 106k 168k 270k 336k
64| 63k 89k 145k 230k 274k
Some tests were also conducted on SSL and absolutely no change was observed.
The value was placed into a define because it used to be spread all over the
code.
It might be useful at some point to backport this to 2.3 and 2.2 to help
those who observed some performance regressions from 1.6.
Since a lot of internal callbacks were turned to tasklets, the runqueue
depth had not been readjusted from the default 200 which was initially
used to favor batched processing. But nowadays it appears too large
already based on the following tests conducted on a 8c16t machine with
a simple config involving "balance leastconn" and one server. The setup
always involved the two threads of a same CPU core except for 1 thread,
and the client was running over 1000 concurrent H1 connections. The
number of requests per second is reported for each (runqueue-depth,
nbthread) couple:
rq\thr| 1 2 4 8 16
------+------------------------------
32| 120k 159k 276k 477k 698k
40| 122k 160k 276k 478k 722k
48| 121k 159k 274k 482k 720k
64| 121k 160k 274k 469k 710k
200| 114k 150k 247k 415k 613k <-- default
It's possible to save up to about 18% performance by lowering the
default value to 40. One possible explanation to this is that checking
I/Os more frequently allows to flush buffers faster and to smooth the
I/O wait time over multiple operations instead of alternating phases
of processing, waiting for locks and waiting for new I/Os.
The total round trip time also fell from 1.62ms to 1.40ms on average,
among which at least 0.5ms is attributed to the testing tools since
this is the minimum attainable on the loopback.
After some observation it would be nice to backport this to 2.3 and
2.2 which observe similar improvements, since some users have already
observed some perf regressions between 1.6 and 2.2.
SCTL (signed certificate timestamp list) specified in RFC6962
was implemented in c74ce24cd22e8c683ba0e5353c0762f8616e597d, let
us introduce macro HAVE_SSL_SCTL for the HAVE_SSL_SCTL sake,
which in turn is based on SN_ct_cert_scts, which comes in the same commit
smp_is_safe() function is used to be sure a sample may be safely
modified. For string samples, a test is performed to verify if there is a
null-terminated byte. If not, one is added, if possible. It means if the
sample is not const and if there is some free space in the buffer, after
data. However, we must not try to read the null-terminated byte if the
string sample is too long (data >= size) or if the size is equal to
zero. This last test was not performed. Thus it was possible to consider a
string sample as safe by testing a byte outside the buffer.
Now, a zero size string sample is always considered as unsafe and is
duplicated when smp_make_safe() is called.
This patch must be backported in all stable versions.
It's pretty easy to pre-initialize the index, change it on free() and check
it during the wakeup, so let's do this to ease detection of any accidental
task_wakeup() after a task_free() or tasklet_wakeup() after a tasklet_free().
If this would ever happen we'd then get a backtrace and a core now. The
index's parity is respected so that the call history remains exploitable.
The idea is to know who woke a task up, by recording the last two
callers in a rotating mode. For now it's trivial with task_wakeup()
but tasklet_wakeup_on() will require quite some more changes.
This typically gives this from the debugger:
(gdb) p t->debug
$2 = {
caller_file = {0x0, 0x8c0d80 "src/task.c"},
caller_line = {0, 260},
caller_idx = 1
}
or this:
(gdb) p t->debug
$6 = {
caller_file = {0x7fffe40329e0 "", 0x885feb "src/stream.c"},
caller_line = {284, 284},
caller_idx = 1
}
But it also provides a trivial macro allowing to simply place a call in
a task/tasklet handler that needs to be observed:
DEBUG_TASK_PRINT_CALLER(t);
Then starting haproxy this way would trivially yield such info:
$ ./haproxy -db -f test.cfg | sort | uniq -c | sort -nr
199992 h1_io_cb woken up from src/sock.c:797
51764 h1_io_cb woken up from src/mux_h1.c:3634
65 h1_io_cb woken up from src/connection.c:169
45 h1_io_cb woken up from src/sock.c:777
The two algos defining these functions (first and leastconn) do not need the
server's lock. However it's already present in pendconn_process_next_strm()
so the API must be updated so that the functions may take it if needed and
that the callers indicate whether they already own it.
As such, the call places (backend.c and stream.c) now do not take it
anymore, queue.c was unchanged since it's already held, and both "first"
and "leastconn" were updated to take it if not already held.
A quick test on the "first" algo showed a jump from 432 to 565k rps by
just dropping the lock in stream.c!
This reverts commit 8f1f177ed0.
Repeated tests have shown a small perforamnce degradation of ~1.8%
caused by this patch at high request rates on 16 threads. The exact
cause is not yet perfectly known but it probably stems in slower
accesses for non-64-bit aligned atomic accesses.
The remaining contention on the server lock solely comes from
sess_change_server() which takes the lock to add and remove a
stream from the server's actconn list. This is both expensive
and pointless since we have mt-lists, and this list is only
used by the CLI's "shutdown server sessions" command!
Let's migrate to an mt-list and remove the need for this costly
lock. By doing so, the request rate increased by ~1.8%.
Since OTHER_LOCK is commonly used it's become much more difficult to
profile lock contention by temporarily changing a lock label. Let's
add DEBUG1..5 to serve only for debugging. These ones must not be
used in committed code. We could decide to only define them when
DEBUG_THREAD is set but that would complicate attempts at measuring
performance with debugging turned off.
The server lock was taken preventively for anything in health_adjust(),
including the static config checks needed to detect that the lock was not
needed, while the function is always called on the response path to update
a server's status. This was responsible for huge contention causing a
performance drop of about 17% on 16 threads. Let's move the lock only
where it should be, i.e. inside the function around the critical sections
only. By doing this, a 16-thread process jumped back from 575 to 675 krps.
This should be backported to 2.3 as the situation degraded there, and
maybe later to 2.2.
Always try to remove a connexion from its toremove_list in conn_free.
This prevents a double-free in case the connection is freed but was
already added in toremove_list.
This bug was easily reproduced by running 4-5 runs of inject on a
single-thread instance of haproxy :
$ inject -u 10000 -d 10 -G 127.0.0.1:20080
A crash would soon be triggered in srv_cleanup_toremove_connections.
This does not need to be backported.
move listen status to a helper, defining both status enum and string
definition.
this will be helpful to be reused in prometheus code. It also removes
this hard-to-read nested ternary.
Signed-off-by: William Dauchy <wdauchy@gmail.com>
prometheus approach requires to output all values for a given metric
name; meaning we iterate through all metrics, and then iterate in the
inner loop on all objects for this metric.
In order to allow more code reuse, adapt the stats API to be able to
select one field or fill them all otherwise.
From this patch it should be possible to add support for listen stats in
prometheus.
Signed-off-by: William Dauchy <wdauchy@gmail.com>
This patch introduce the "dns_stream_nameserver" to use DNS over
TCP on strict nameservers. For the upper layer it is analog to
the api used with udp nameservers except that the user que switch
the name server in "stream" mode at the init using "dns_stream_init".
The fallback from UDP to TCP is not handled and this is not the
purpose of this feature. This is done to choose the transport layer
during the initialization.
Currently there is a hardcoded limit of 4 pipelined transactions
per TCP connections. A batch of idle connections is expired every 5s.
This code is designed to support a maximum DNS message size on TCP: 64k.
Note: this code won't perform retry on unanswered queries this
should be handled by the upper layer
This patch splits current dns.c into two files:
The first dns.c contains code related to DNS message exchange over UDP
and in future other TCP. We try to remove depencies to resolving
to make it usable by other stuff as DNS load balancing.
The new resolvers.c inherit of the code specific to the actual
resolvers.
Note:
It was really difficult to obtain a clean diff dur to the amount
of moved code.
Note2:
Counters and stuff related to stats is not cleany separated because
currently counters for both layers are merged and hard to separate
for now.
This patch splits recv and send functions in two layers. the
lowest is responsible of DNS message transactions over
the network. Doing this we could use DNS message layer
for something else than resolving. Load balancing for instance.
This patch also re-works the way to init a nameserver and
introduce the new struct dns_dgram_server to prepare the arrival
of dns_stream_server and the support of DNS over TCP.
The way to retry a send failure of a request because of EAGAIN
was re-worked. Previously there was no control and all "pending"
queries were re-played each time it reaches a EAGAIN. This
patch introduce a ring to stack messages in case of sent
failure. This patch is emptied if poller shows that the
socket is ready again to push messages.
Counters are currently stored into lowlevel nameservers struct but
most of them are resolving layer data and increased in the upper layer
So this patch renames the prototype used to allocate/dump them with prefix
'resolv' waiting for a clean split.
Some types are specific to resolver code and a renamed using
the 'resolv' prefix instead 'dns'.
-struct dns_query_item {
+struct resolv_query_item {
-struct dns_answer_item {
+struct resolv_answer_item {
-struct dns_response_packet {
+struct resolv_response {
This patch adds the attribute packed on struct dns_question
because it is directly memcpy to network building a response.
This patch also removes the commented line:
// struct list options; /* list of option records */
because it is also used directly using memcpy to build a request
and must not contain host data.
Resolv callbacks are also updated to rely on counters and not on
nameservers.
"show stat domain dns" will now show the parent id (i.e. resolvers
section name).
The "show peers" output has become huge due to the dictionaries making it
less readable. Now this feature has reached a certain level of maturity
which doesn't warrant to dump it all the time, given that it was essentially
needed by developers. Let's make it optional, and disabled by default, only
when "show peers dict" is requested. The default output reminds about the
command. The output has been divided by 5 :
$ socat - /tmp/sock1 <<< "show peers dict" | wc -l
125
$ socat - /tmp/sock1 <<< "show peers" | wc -l
26
It could be useful to backport this to recent stable versions.
Now default proxies are stored into a dedicated tree, sorted by name.
Only unnamed entries are not kept upon new section creation. The very
first call to cfg_parse_listen() will automatically allocate a dummy
defaults section which corresponds to the previous static one, since
the code requires to have one at a few places.
The first immediately visible benefit is that it allows to reuse
alloc_new_proxy() to allocate a defaults section instead of doing it by
hand. And the secret goal is to allow to keep multiple named defaults
section in memory to reuse them from various proxies.
Now we'll have a tree of named defaults sections. The regular insertion
and lookup functions take care of the capability in order to select the
appropriate tree. A new function proxy_destroy_defaults() removes a
proxy from this tree and frees it entirely.
We don't want to expose this one anymore as we'll soon keep multiple
default proxies. Let's move it inside the parser which is the only
place which still uses it, and initialize it on the fly once needed
instead of doing it at boot time.
The default proxy was passed as a variable, which in addition to being
a PITA to deal with in the config parser, doesn't feel safe to use when
it ought to be const.
This will only affect new code so no backport is needed.
The default proxy was passed as a variable, which in addition to being
a PITA to deal with in the config parser, doesn't feel safe to use when
it ought to be const.
This will only affect new code so no backport is needed.
The default proxy was passed as a variable, which in addition to being
a PITA to deal with in the config parser, doesn't feel safe to use when
it ought to be const.
This will only affect new code so no backport is needed.
This used to be open-coded in cfgparse-listen.c when facing a "defaults"
keyword. Let's move this into proxy_free_defaults(). This code is ugly and
doesn't even reset the just freed pointers. Let's not change this yet.
This code should probably be merged with a generic proxy deinit function
called from deinit(). However there's a catch on uri_auth which cannot be
freed because it might be used by one or several proxies. We definitely
need refcounts there!
This new function takes over the old open-coding that used to be done
for too long in cfg_parse_listen() and it now does everything at once
in a proxy-centric function. The function does all the job of allocating
the structure, initializing it, presetting its defaults from the default
proxy and checking for errors. The code was almost unchanged except for
defproxy being passed as a pointer, and the error message being passed
using memprintf().
This change will be needed to ease reuse of multiple default proxies,
or to create dynamic backends in a distant future.
init_default_instance() was still left in cfgparse.c which is not the
best place to pre-initialize a proxy. Let's place it in proxy.c just
after init_new_proxy(), take this opportunity for renaming it to
proxy_preset_defaults() and taking out init_new_proxy() from it, and
let's pass it the pointer to the default proxy to be initialized instead
of implicitly assuming defproxy. We'll soon be able to exploit this.
Only two call places had to be updated.
struct comp is used in struct proxy but never declared prior to this
so depending on where proxy.h is included, touching the <comp> field
can break the build.