As reported in issue #724, openbsd fails to build in haproxy.c
due to a faulty comma in the middle of a warning message. This code
is only compiled when RLIMIT_AS is not defined, which seems to be
rare these days.
This may be backported to older versions as the problem was likely
introduced when strict limits were added.
Commit 69f591e3b ("MINOR: cli/proxy: add a new "show servers conn" command")
added the ability to dump the idle connections state for a server, but we
must not do this if idle connections were not allocated, which happens if
the server is configured with pool-max-conn 0.
This is 2.2, no backport is needed.
In the various timeout functions, make sure nobody stole the connection from
us before attempting to doing anything with it, there's a very small race
condition between the time we access the task context, and the time we
actually check it again with the lock, where it could have been free'd.
task_kill() may fail to queue a task if this task has never ever run,
because its equivalent (tasklet->list) member has never been "emptied"
since it didn't pass through the LIST_DEL_INIT() that's performed by
run_tasks_from_lists(). This results in these tasks to never be freed.
It happens during the mux takeover since the target task usually is
the timeout task which, by definition, has never run yet.
This fixes commit eb8c2c69f ("MEDIUM: sched: implement task_kill() to
kill a task") which was introduced after 2.2-dev11 and doesn't need to
be backported.
Now it's possible to preserve spacing everywhere except in "log-format",
"log-format-sd" and "unique-id-format" directives, where spaces are
delimiters and are merged. That may be useful when the response payload
is specified as a log format string by "lf-file" or "lf-string", or even
for headers or anything else.
In order to merge spaces, a new option LOG_OPT_MERGE_SPACES is applied
exclusively on options passed to function parse_logformat_string().
This patch fixes an issue #701 ("http-request return log-format file
evaluation altering spacing of ASCII output/art").
Now when building with -DDEBUG_MEM_STATS, some malloc/calloc/strdup/realloc
stats are kept per file+line number and may be displayed and even reset on
the CLI using "debug dev memstats". This allows to easily track potential
leakers or abnormal usages.
Enables ('on') or disables ('off') sharing of idle connection pools between
threads for a same server. The default is to share them between threads in
order to minimize the number of persistent connections to a server, and to
optimize the connection reuse rate. But to help with debugging or when
suspecting a bug in HAProxy around connection reuse, it can be convenient to
forcefully disable this idle pool sharing between multiple threads, and force
this option to "off". The default is on.
This could have been nice to have during the idle connections debugging,
but it's not too late to add it!
In addition they were in the wrong alphabetical order in the doc. They
were added in 2.0 by commit 88698d966 ("MEDIUM: connections: Add a way
to control the number of idling connections.") so this must be backported
to 2.0.
These two keywords didn't have an entry in the index. They were added in
2.0 by commit 88698d966 ("MEDIUM: connections: Add a way to control the
number of idling connections.") so this must be backported to 2.0.
Add two new macros, MT_LIST_DEL_SAFE_NOINIT makes sure we remove the
element from the list, without reinitializing its next and prev, and
MT_LIST_ADDQ_NOCHECK is similar to MT_LIST_ADDQ(), except it doesn't check
if the element is already in a list.
The goal is to be able to move an element from a list we're currently
parsing to another, keeping it locked in the meanwhile.
task_wakeup() passes the task through the global run queue under the
global RQ lock, which is expensive when dealing with large amounts of
fcgi_takeover() calls. Let's use the new task_kill() instead to kill the
task.
task_wakeup() passes the task through the global run queue under the
global RQ lock, which is expensive when dealing with large amounts of
h2_takeover() calls. Let's use the new task_kill() instead to kill the
task.
task_wakeup() passes the task through the global run queue under the
global RQ lock, which is expensive when dealing with large amounts of
h1_takeover() calls. Let's use the new task_kill() instead to kill the
task.
By doing so, a scenario involving approximately 130k takeover/s running on
16 threads gained almost 3% performance from 319k req/s to 328k.
task_kill() may be used by any thread to kill any task with less overhead
than a regular wakeup. In order to achieve this, it bypasses the priority
tree and inserts the task directly into the shared tasklets list, cast as
a tasklet. The task_list_size is updated to make sure it is properly
decremented after execution of this task. The task will thus be picked by
process_runnable_tasks() after checking the tree and sent to the TL_URGENT
list, where it will be processed and killed.
If the task is bound to more than one thread, its first thread will be the
one notified.
If the task was already queued or running, nothing is done, only the flag
is added so that it gets killed before or after execution. Of course it's
the caller's responsibility to make sur any resources allocated by this
task were already cleaned up or taken over.
This flag, when set, will be used to indicate that the task must die.
At the moment this may only be placed by the task itself or by the
scheduler when placing it into the TL_NORMAL queue.
The next thread walking algorithm in commit 566df309c ("MEDIUM:
connections: Attempt to get idle connections from other threads.")
proved to be sufficient for most cases, but it still has some rough
edges when threads are unevenly loaded. If one thread wakes up with
10 streams to process in a burst, it will mainly take over connections
from the next one until it doesn't have anymore.
This patch implements a rotating index that is stored into the server
list and that any thread taking over a connection is responsible for
updating. This way it starts mostly random and avoids always picking
from the same place. This results in a smoother distribution overall
and a slightly lower takeover rate.
There's a tricky behavior that was lost when the idle connections were
made sharable between thread in commit 566df309c ("MEDIUM: connections:
Attempt to get idle connections from other threads."), it is the ability
to retry from the safe list when looking for any type of idle connection
and not finding one in the idle list.
It is already important when dealing with long-lived connections since
they ultimately all become safe, but that case is already covered by
the fact that safe conns not being used end up closing and are not
looked up anymore since connect_server() sees there are none.
But it's even more important when using server-side connections which
periodically close, because the new connections may spend half of their
time in safe state and the other half in the idle state, and failing
to grab one such connection from the right list results in establishing
a new connection.
This patch makes sure that a failure to find an idle connection results
in a new attempt at finding one from the safe list if available. In order
to avoid locking twice, connections are attempted alternatively from the
idle then safe list when picking from siblings. Tests have shown a ~2%
performance increase by avoiding to lock twice.
A typical test with 10000 connections over 16 threads with 210 servers
having a 1 millisecond response time and closing every 5 requests shows
a degrading performance starting at 120k req/s down to 60-90k and an
average reuse rate of 44%. After the fix, the reuse rate raises to 79%
and the performance becomes stable at 254k req/s. Similarly the previous
test with full keep-alive has now increased from 96% reuse rate to 99%
and from 352k to 375k req/s.
No backport is needed as this is 2.2-only.
The problem with the way idle connections currently work is that it's
easy for a thread to steal all of its siblings' connections, then release
them, then it's done by another one, etc. This happens even more easily
due to scheduling latencies, or merged events inside the same pool loop,
which, when dealing with a fast server responding in sub-millisecond
delays, can really result in one thread being fully at work at a time.
In such a case, we perform a huge amount of takeover() which consumes
CPU and requires quite some locking, sometimes resulting in lower
performance than expected.
In order to fight against this problem, this patch introduces a new server
setting "pool-low-conn", whose purpose is to dictate when it is allowed to
steal connections from a sibling. As long as the number of idle connections
remains at least as high as this value, it is permitted to take over another
connection. When the idle connection count becomes lower, a thread may only
use its own connections or create a new one. By proceeding like this even
with a low number (typically 2*nbthreads), we quickly end up in a situation
where all active threads have a few connections. It then becomes possible
to connect to a server without bothering other threads the vast majority
of the time, while still being able to use these connections when the
number of available FDs becomes low.
We also use this threshold instead of global.nbthread in the connection
release logic, allowing to keep more extra connections if needed.
A test performed with 10000 concurrent HTTP/1 connections, 16 threads
and 210 servers with 1 millisecond of server response time showed the
following numbers:
haproxy 2.1.7: 185000 requests per second
haproxy 2.2: 314000 requests per second
haproxy 2.2 lowconn 32: 352000 requests per second
The takeover rate goes down from 300k/s to 13k/s. The difference is
further amplified as the response time shrinks.
There was a logic bug in commit ddfe0743d ("MEDIUM: server: use the two
thresholds for the connection release algorithm"): instead of keeping
only our first idle connection when FDs become scarce, the condition was
inverted resulting in enforcing this constraint unless FDs are scarce.
This results in less idle connections than permitted to be kept under
normal condition.
No backport needed.
In conn_backend_get() we can avoid locking other servers when trying
to steal their connections when we know for sure they will not have
one, so let's do it to lower the contention on the lock.
This command reuses the existing "show servers state" to also dump the
state of active and idle connections. The main use is to serve as a
debugging tool to troubleshot connection reuse issues.
Actually the cleanup in commit 6ff8143f7 ("BUG/MINOR: proxy: fix
dump_server_state()'s misuse of the trash") allowed to spot that the
trash is never reset when dumping a servers state. I couldn't manage
to make it dump garbage even with large setups but didn't find either
where it's cleared between successive calls while other handlers do
explicitly invoke chunk_reset(), so it seems to happen a bit by luck.
Let's use chunk_printf() here for each turn, it makes things clearer.
This could be backported along with previous patch, especially if any
user reports occasional garbage appearing in the show servers output.
dump_server_state() claims to dump into a buffer but instead it writes
into a buffer then dumps the trash into the channel, so it only supports
being called with buf=&trash and doesn't need this buffer. There doesn't
seem to be any current impact of this mistake since the function is called
from one location only.
A backport may be performed if it helps fixing other bugs but it will not
fix an existing bug by itself.
This patch adds a missing break to end the loop in case when '%[' is not
properly closed with ']'.
The issue has been introduced with commit cd0d2ed ("MEDIUM: log-format:
make the LF parser aware of sample expressions' end").
In pat_match_str() and pat_math_beg() functions, a trailing zero is
systematically added at the end of the string, even if the buffer is not large
enough to accommodate it. It is a possible buffer overflow. For instance, when
the alpn is matched against a list of strings, the sample fetch is filled with a
non-null terminated string returned by the SSL library. No trailing zero must be
added at the end of this string, because it is outside the buffer.
So, to fix the bug, a trailing zero is added only if the buffer is large enough
to accommodate it. Otherwise, the sample fetch is duplicated. smp_dup() function
adds a trailing zero to the duplicated string, truncating it if it is too long.
This patch should fix the issue #718. It must be backported to all supported
versions.
When not sharing pools (i.e. when building with -DDEBUG_DONT_SHARE_POOLS)
we have about 47 pools right now, while MAX_BASE_POOLS is only 32, meaning
that only the first 32 ones will benefit from a per-thread cache entry.
This totally kills performance when pools are not shared (roughly -20%).
Let's double the limit to gain some margin, and make it possible to set
it as a build option.
It might be useful to backport this to stable versions as they're likely
to be affected as well.
If the owning task is already dying (context was destroyed by fcgi_takeover)
there's no point taking the lock then removing it later since all the code
in between is conditionned by a non-null context. Let's simplify this.
If the owning task is already dying (context was destroyed by h2_takeover)
there's no point taking the lock then removing it later since all the code
in between is conditionned by a non-null context. Let's simplify this.
If the owning task is already dying (context was destroyed by h1_takeover)
there's no point taking the lock then removing it later since all the code
in between is conditionned by a non-null context. Let's simplify this.
In commit 3ef7a190b ("MEDIUM: tasks: apply a fair CPU distribution
between tasklet classes") we compute a total weight to be used to
split the CPU time between queues. There is a mention that the
total cannot be null, wihch is based on the fact that we only get
there if thread_has_task() returns non-zero. But there is a very
small race which can break this assumption: if two threads conflict
on MT_LIST_ADDQ() on an empty shared list and both roll back before
trying again, there is the possibility that a first call to
MT_LIST_ISEMPTY() sees the first thread install itself, then the
second call will see the list empty when both roll back. Thus we
could proceed with the queue while it's temporarily empty and
compute max lengths using a divide by zero. This case is very
hard to trigger, it seldom happens on 16 threads at 400k req/s.
Let's simply test for max_total and leave the loop when we've not
found any work.
No backport is needed, that's 2.2-only.
The parsing of http deny rules with no argument or only the deny_status argument
is buggy if followed by an ACLs expression (starting with "if" or "unless"
keyword). Instead of using the proxy errorfiles, a dummy error is used. To fix
the bug, the parsing function must also check for "if" or "unless" keyword in
such cases.
This patch should fix the issue #720. No backport is needed.
The algorithm improvement in bdb86bd ("MEDIUM: server: improve estimate
of the need for idle connections") is still not enough because there's
a hard limit between below and above the FD count, so it continues to
end up with many killed connections.
Here we're proceeding differently. Given that there are two configured
limits, a low and a high one, what we do is that we drop connections
when the high limit is reached (what's already done by the killing task
anyway), when we're between the low and the high threshold, we only keep
the connection if our idle entries are empty (with a preference for safe
ones), and below the low threshold, we keep any connection so as to give
them a chance of being reused or taken over by another thread.
Proceeding like this results in much less dropped connections, we
typically see a 99.3% reuse rate (76k conns for 10M requests over 200
servers and 4 threads, with 335k takeovers or 3%), and much less CPU
usage variations because there are no more bursts to try to kill extra
connections.
It should be possible to further improve this by counting the number
of threads exploiting a server and trying to optimize the amount of
per-thread idle connections so that it is approximately balanced among
the threads.
The idle server connection estimates brought in commit bdb86bd ("MEDIUM:
server: improve estimate of the need for idle connections") were committed
without the minimum of 1 idle conn needed for the current thread. The net
effect is that there are bursts of dropped connections when the load varies
because there's no provision for the last connection.
No backport needed, this is 2.2-dev.
Commit d645574 ("MINOR: soft-stop: let the first stopper only signal
other threads") introduced a minor mistake which is that when a stopping
thread signals all other threads, it also signals itself. When
single-threaded, the process constantly wakes up while waiting for
last connections to exit. Let's reintroduce the lost mask to avoid
this.
No backport is needed, this is 2.2-dev only.
This reverts previous commit 347bbf79d20e1cff57075a8a378355dfac2475e2i.
The original code was correct. This patch resulted from a mistaken analysis
and breaks the scheduler:
########################## Starting vtest ##########################
Testing with haproxy version: 2.2-dev11-90b7d9-23
# top TEST reg-tests/lua/close_wait_lf.vtc TIMED OUT (kill -9)
# top TEST reg-tests/lua/close_wait_lf.vtc FAILED (10.008) signal=9
1 tests failed, 0 tests skipped, 88 tests passed
Program terminated with signal SIGABRT, Aborted.
[Current thread is 1 (Thread 0x7fb0dac2c700 (LWP 11292))]
(gdb) bt
#0 0x00007fb0e7c143f8 in raise () from /lib64/libc.so.6
#1 0x00007fb0e7c15ffa in abort () from /lib64/libc.so.6
#2 0x000000000053f5d6 in ha_panic () at src/debug.c:269
#3 0x00000000005a6248 in wdt_handler (sig=14, si=<optimized out>, arg=<optimized out>) at src/wdt.c:119
#4 <signal handler called>
#5 0x00000000004fbccd in tasklet_wakeup (tl=0x1b5abc0) at include/haproxy/task.h:351
#6 listener_accept (fd=<optimized out>) at src/listener.c:999
#7 0x00000000004262df in fd_update_events (evts=<optimized out>, fd=6) at include/haproxy/fd.h:418
#8 _do_poll (p=<optimized out>, exp=<optimized out>, wake=<optimized out>) at src/ev_epoll.c:251
#9 0x0000000000548d0f in run_poll_loop () at src/haproxy.c:2949
#10 0x000000000054908b in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3067
#11 0x00007fb0e902b684 in start_thread () from /lib64/libpthread.so.0
#12 0x00007fb0e7ce5eed in clone () from /lib64/libc.so.6
(gdb) up
#5 0x00000000004fbccd in tasklet_wakeup (tl=0x1b5abc0) at include/haproxy/task.h:351
351 if (MT_LIST_ADDQ(&task_per_thread[tl->tid].shared_tasklet_list, (struct mt_list *)&tl->list) == 1) {
If the commit above is ever backported, this one must be as well!
In MT_LIST_ADDQ() and MT_LIST_ADD() we can't just check if the element is
already in a list, because there's a small race condition, it could be added
between the time we checked, and the time we actually set its next and prev.
So we have to lock it first.
This should be backported to 2.1.
The max_used_conns value is used as an estimate of the needed number of
connections on a server to know how many to keep open. But this one is
not reported, making it hard to troubleshoot reuse issues. Let's export
it in the sessions/current column.
Starting with commit 079cb9a ("MEDIUM: connections: Revamp the way idle
connections are killed") we started to improve the way to compute the
need for idle connections. But the condition to keep a connection idle
or drop it when releasing it was not updated. This often results in
storms of close when certain thresholds are met, and long series of
takeover() when there aren't enough connections left for a thread on
a server.
This patch tries to improve the situation this way:
- it keeps an estimate of the number of connections needed for a server.
This estimate is a copy of the max over previous purge period, or is a
max of what is seen over current period; it differs from max_used_conns
in that this one is a counter that's reset on each purge period ;
- when releasing, if the number of current idle+used connections is
lower than this last estimate, then we'll keep the connection;
- when releasing, if the current thread's idle conns head is empty,
and we don't exceed the estimate by the number of threads, then
we'll keep the connection.
- when cleaning up connections, we consider the max of the last two
periods to avoid killing too many idle conns when facing bursty
traffic.
Thanks to this we can better converge towards a situation where, provided
there are enough FDs, each active server keeps at least one idle connection
per thread all the time, with a total number close to what was needed over
the previous measurement period (as defined by pool-purge-delay).
On tests with large numbers of concurrent connections (30k) and many
servers (200), this has quite smoothed the CPU usage pattern, increased
the reuse rate and roughly halved the takeover rate.
There's a minor glitch with the way idle connections start to be evicted.
The lookup always goes from thread 0 to thread N-1. This causes depletion
of connections on the first threads and abundance on the last ones. This
is visible with the takeover() stats below:
$ socat - /tmp/sock1 <<< "show activity"|grep ^fd ; \
sleep 10 ; \
socat -/tmp/sock1 <<< "show activity"|grep ^fd
fd_takeover: 300144 [ 91887 84029 66254 57974 ]
fd_takeover: 359631 [ 111369 99699 79145 69418 ]
There are respectively 19k, 15k, 13k and 11k takeovers for only 4 threads,
indicating that the first thread needs a foreign FD twice more often than
the 4th one.
This patch changes this si that all threads are scanned in round robin
starting with the current one. The takeovers now happen in a much more
distributed way (about 4 times 9k) :
fd_takeover: 1420081 [ 359562 359453 346586 354480 ]
fd_takeover: 1457044 [ 368779 368429 355990 363846 ]
There is no need to backport this, as this happened along a few patches
that were merged during 2.2 development.
The FD takeover operation might have certain impacts explaining
unexpected activities, so it's important to report such a counter
there. We thus count the number of times a thread has stolen an
FD from another thread.
The servers have internal states describing the status of idle connections,
unfortunately these were not exported in the stats. This patch adds the 3
following gauges:
- idle_conn_cur : Current number of unsafe idle connections
- safe_conn_cur : Current number of safe idle connections
- used_conn_cur : Current number of connections in use
DEBUG_FD was added by commit 38e8a1c in 2.2-dev, and "show fd" was
slightly modified to still allow to print orphaned/closed FDs if their
count is non-null. But bypassing the existing test made it possible
to dereference fdt.owner which can be null. Let's adjust the condition
to avoid this.
No backport is needed.
The LRU cache head was an array of list, which causes false sharing
between 4 to 8 threads in the same cache line. Let's move it to the
thread_info structure instead. There's no need to do the same for the
pool_cache[] array since it's already quite large (32 pointers each).
By doing this the request rate increased by 1% on a 16-thread machine.