These function names are unbearably long, they don't even fit into the
screen in "show profiling", let's trim the "_connections" to "_conns",
which happens to match the name of the lists there.
There's a fairness issue between SSL and clear text. A full end-to-end
cleartext connection can require up to ~7.7 wakeups on average, plus 3.3
for the SSL tasklet, one of which is particularly expensive. So if we
accept to process many handshakes taking 1ms each, we significantly
increase the processing time of regular tasks just by adding an extra
delay between their calls. Ideally in order to be fair we should have a
1:18 call ratio, but this requires a bit more accounting. With very little
effort we can mark the SSL handshake tasklet as TASK_HEAVY until the
handshake completes, and remove it once done.
Doing so reduces from 14 to 3.0 ms the total response time experienced
by HTTP clients running in parallel to 1000 SSL clients doing full
handshakes in loops. Better, when tune.sched.low-latency is set to "on",
the latency further drops to 1.8 ms.
The tasks latency distribution explain pretty well what is happening:
Without the patch:
$ socat - /tmp/sock1 <<< "show profiling"
Per-task CPU profiling : on # set profiling tasks {on|auto|off}
Tasks activity:
function calls cpu_tot cpu_avg lat_tot lat_avg
ssl_sock_io_cb 2785375 19.35m 416.9us 5.401h 6.980ms
h1_io_cb 1868949 9.853s 5.271us 4.829h 9.302ms
process_stream 1864066 7.582s 4.067us 2.058h 3.974ms
si_cs_io_cb 1733808 1.932s 1.114us 26.83m 928.5us
h1_timeout_task 935760 - - 1.033h 3.975ms
accept_queue_process 303606 4.627s 15.24us 16.65m 3.291ms
srv_cleanup_toremove_connections452 64.31ms 142.3us 2.447s 5.415ms
task_run_applet 47 5.149ms 109.6us 57.09ms 1.215ms
srv_cleanup_idle_connections 34 2.210ms 65.00us 87.49ms 2.573ms
With the patch:
$ socat - /tmp/sock1 <<< "show profiling"
Per-task CPU profiling : on # set profiling tasks {on|auto|off}
Tasks activity:
function calls cpu_tot cpu_avg lat_tot lat_avg
ssl_sock_io_cb 3000365 21.08m 421.6us 20.30h 24.36ms
h1_io_cb 2031932 9.278s 4.565us 46.70m 1.379ms
process_stream 2010682 7.391s 3.675us 22.83m 681.2us
si_cs_io_cb 1702070 1.571s 922.0ns 8.732m 307.8us
h1_timeout_task 1009594 - - 17.63m 1.048ms
accept_queue_process 339595 4.792s 14.11us 3.714m 656.2us
srv_cleanup_toremove_connections779 75.42ms 96.81us 438.3ms 562.6us
srv_cleanup_idle_connections 48 2.498ms 52.05us 178.1us 3.709us
task_run_applet 17 1.738ms 102.3us 11.29ms 663.9us
other 1 947.8us 947.8us 202.6us 202.6us
=> h1_io_cb() and process_stream() are divided by 6 while ssl_sock_io_cb() is
multipled by 4
And with low-latency on:
$ socat - /tmp/sock1 <<< "show profiling"
Per-task CPU profiling : on # set profiling tasks {on|auto|off}
Tasks activity:
function calls cpu_tot cpu_avg lat_tot lat_avg
ssl_sock_io_cb 3000565 20.96m 419.1us 20.74h 24.89ms
h1_io_cb 2019702 9.294s 4.601us 49.22m 1.462ms
process_stream 2009755 6.570s 3.269us 1.493m 44.57us
si_cs_io_cb 1997820 1.566s 783.0ns 2.985m 89.66us
h1_timeout_task 1009742 - - 1.647m 97.86us
accept_queue_process 494509 4.697s 9.498us 1.240m 150.4us
srv_cleanup_toremove_connections1120 92.32ms 82.43us 463.0ms 413.4us
srv_cleanup_idle_connections 70 2.703ms 38.61us 204.5us 2.921us
task_run_applet 13 1.303ms 100.3us 85.12us 6.548us
=> process_stream() is divided by 100 while ssl_sock_io_cb() is
multipled by 4
Interestingly, the total HTTPS response time doesn't increase and even very
slightly decreases, with an overall ~1% higher request rate. The net effect
here is a redistribution of the CPU resources between internal tasks, and
in the case of SSL, handshakes wait bit more but everything after completes
faster.
This was made simple enough to be backportable if it helps some users
suffering from high latencies in mixed traffic.
While the scheduler is priority-aware and class-aware, and consistently
tries to maintain fairness between all classes, it doesn't make use of a
fine execution budget to compensate for high-latency tasks such as TLS
handshakes. This can result in many subsequent calls adding multiple
milliseconds of latency between the various steps of other tasklets that
don't even depend on this.
An ideal solution would be to add a 4th queue, have all tasks announce
their estimated cost upfront and let the scheduler maintain an auto-
refilling budget to pick from the most suitable queue.
But it turns out that a very simplified version of this already provides
impressive gains with very tiny changes and could easily be backported.
The principle is to reserve a new task flag "TASK_HEAVY" that indicates
that a task is expected to take a lot of time without yielding (e.g. an
SSL handshake typically takes 700 microseconds of crypto computation).
When the scheduler sees this flag when queuing a tasklet, it will place
it into the bulk queue. And during dequeuing, we accept only one of
these in a full round. This means that the first one will be accepted,
will not prevent other lower priority tasks from running, but if a new
one arrives, then the queue stops here and goes back to the polling.
This will allow to collect more important updates for other tasks that
will be batched before the next call of a heavy task.
Preliminary tests consisting in placing this flag on the SSL handshake
tasklet show that response times under SSL stress fell from 14 ms
before the patch to 3.0 ms with the patch, and even 1.8 ms if
tune.sched.low-latency is set to "on".
Only the first 3 characters are compared for ';no-maint' suffix in
http_handle_stats. Fix it by doing a full match over the entire suffix.
As a side effect, the ';norefresh' suffix matched the inaccurate
comparison, so the maintenance servers were always hidden on the stats
page in this case.
no-maint suffix is present since commit
3e32036701
MINOR: stats: also support a "no-maint" show stat modifier
It should be backported up to 2.3.
This fixes github issue #1147.
In H1, H2 and FCGI muxes, in the show_fd function, there is duplicated test on
the stream's subs field.
This patch fixes the issue #1142. It may be backported as far as 2.2.
Some static functions are now exported and renamed to follow the same
pattern of other exported functions. Here is the list :
* update_server_fqdn: Renamed to srv_update_fqdn and exported
* update_server_check_addr_port: renamed to srv_update_check_addr_port and exported
* update_server_agent_addr_port: renamed to srv_update_agent_addr_port and exported
* update_server_addr: renamed to srv_update_addr
* update_server_addr_potr: renamed to srv_update_addr_port
* srv_prepare_for_resolution: exported
This change is mandatory to move all functions dealing with the server-state
files in a separate file.
This change is not huge but may have a visible impact for users. Now, if a
line of a server-state file is corrupted, the whole file is ignored. A
warning is emitted with the corrupted line number.
In fact, there is no way to recover from a corrupted line. A line is
considered as corrupted if it is too long (truncated line) or if it contains
the wrong number of arguments. In both cases, it means the file was forged
(or at least manually edited). It is safer to ignore it.
Note for now, memory allocation errors are not reported and the
corresponding line is silently ignored.
Now, srv_state_parse_and_store_line() function is used to parse and store a
line in a tree. It is used for global and local server-state files. This
significatly simplies the apply_server_state() function.
Just like for the global server-state file, the line of a local server-state
file are now stored in a tree. This way, the file is fully parsed before
loading the servers state. And with this change, global and local
server-state files are now handled the same way. This will be the
opportunity to factorize the code. It is also a good way to validate the
file before loading any server state.
The loop on the servers of a proxy to load the server states was moved in
the function srv_state_px_update(). This simplify a bit the
apply_server_state() function. It is aslo mandatory to simplify the loading
of local server-state file.
When a server for a given backend is found in the tree containing all lines
of the global server-state file, the node is removed from the tree. It is
useless to keep it longer. It is a small improvement, but it may also be
usefull to track the orphan lines (not used for now).
Parsed parameters are now stored in the tree of server-state lines. This
way, a line from the global server-state file is only parsed once. Before,
it was parsed a first time to store it in the tree and one more time to load
the server state. To do so, the server-state line object must be allocated
before parsing a line. This means its size must no longer depend on the
length of first parsed parameters (backend and server names). Thus the node
type was changed to use a hashed key instead of a string.
Now, we read a full line and expects to found an integer only on it. And if
the line is empty or truncated, an error is returned. If the version is not
valid, an error is also returned. This way, the first line is no longer
partially read.
There is no reason to use a global variable to store the lines of the global
server-state file. This tree is only used during the file parsing, as a line
cache. Now the eb-tree is declared as a local variable in the
apply_server_state() function.
The structure used to store a server-state line in an eb-tree has a too
generic name. Instead of state_line, the structure is renamed as
server_state_line.
<state_line.name_name> field is a node in an eb-tree. Thus, instead of
"name_name", we now use "node" to name this field. If is a more explicit
name and not too strange.
The apply_server_state() function is really hard to read. Thus it was
refactored to be more maintainable. First, an helper function is used to get
the server-state file path. Some useless variables were removed and most of
other variables were renamed to be more readable. The error messages are now
prefixed to know the context (global vs per-proxy). Finally, the loop on the
proxies list was simplified.
This patch may seem a bit huge, but the changes are not so important.
There is no reason to fill two parameter arrays in srv_state_parse_line()
function. Now, only one array is used. The 4th first entries are just
skipped when srv_update_state() is called.
The srv_state_parse_line() function was rewritten to be more strict. First
of all, it is possible to make the difference between an ignored line and an
malformed one. Then, only blank characters (spaces and tabs) are now allowed
as field separator. An error is reported for truncated lines or for lines
with an unexpected number of arguments regarding the provided version.
However, for now, errors are ignored by the caller, invalid lines are just
skipped.
First, we don't want to measure wakeup times if the call date had not
been set before profiling was enabled at run time. And second, we may
only collect the value before clearing the TASK_IN_LIST bit, otherwise
another wakeup might happen on another thread and replace the call date
we're about to use, hence artificially lower the wakeup times.
It is extremely useful to be able to observe the wakeup latency of some
important I/O operations, so let's accept to inflate the tasklet struct
by 8 extra bytes when DEBUG_TASK is set. With just this we have enough
to get live reports like this:
$ socat - /tmp/sock1 <<< "show profiling"
Per-task CPU profiling : on # set profiling tasks {on|auto|off}
Tasks activity:
function calls cpu_tot cpu_avg lat_tot lat_avg
si_cs_io_cb 8099492 4.833s 596.0ns 8.974m 66.48us
h1_io_cb 7460365 11.55s 1.548us 2.477m 19.92us
process_stream 7383828 22.79s 3.086us 18.39m 149.5us
h1_timeout_task 4157 - - 348.4ms 83.81us
srv_cleanup_toremove_connections751 39.70ms 52.86us 10.54ms 14.04us
srv_cleanup_idle_connections 21 1.405ms 66.89us 30.82us 1.467us
task_run_applet 16 1.058ms 66.13us 446.2us 27.89us
accept_queue_process 7 34.53us 4.933us 333.1us 47.58us
Instead of decrementing grq_total once per task picked from the global
run queue, let's do it at once after the loop like we do for other
counters. This simplifies the code everywhere. It is not expected to
bring noticeable improvements however, since global tasks tend to be
less common nowadays.
Now we don't need to decrement rq_total when we pick a tack in the tree
to immediately increment it again after installing it into the local
list. Instead, we simply add to the local queue count the number of
globally picked tasks. Avoiding this shows ~0.5% performance gains at
1Mreq/s (2M task switches/s).
As indicated in previous commit, this function tries to guess which tree
the task is in to figure what counters to update, while we already have
that info in the caller. Let's just pick the relevant parts to place them
in the caller.
In process_runnable_tasks() we're still calling __task_unlink_rq() to
pick a task, and this function tries to guess where to pick the task
from and which counter to update while the caller's context already
has everything. Worse, the number of local tasks is decremented then
recredited, doubling the operations. In order to avoid this we first
need to keep separate counters for local and global tasks that were
picked. This is what this patch does.
The function htx_reserve_max_data() should be used to get an HTX DATA block
with the max possible size. A current block may be extended or a new one
created, depending on the HTX message state. But the idea is to let the
caller to copy a bunch of data without requesting many new blocks. It is its
responsibility to resize the block at the end, to set the final block size.
This function will be used to parse messages with small chunks. Indeed, we
can have more than 2700 1-byte chunks in a 16Kb of input data. So it is easy
to understand how this function may help to improve the parsing of chunk
messages.
If the DNS resolution failed for a server, its ip address must be
removed. Otherwise, the server is stopped but keeps its ip. This may be
confusing when the servers state are retrieved on the CLI and it may lead to
undefined behavior if HAproxy is configured to load its servers state from a
file.
This patch should be backported as far as 2.0.
When a SRV record expires, the ip/port assigned to the associated server are
now removed. Otherwise, the server is stopped but keeps its ip/port while
the server hostname is removed. It is confusing when the servers state are
retrieve on the CLI and may be a problem if saved in a server-state
file. Because the reload may fail because of this inconsistency.
Here is an example:
* Declare a server template in a backend, using the resolver <dns>
server-template test 2 _http._tcp.example.com resolvers dns check
* 2 SRV records are announced with the corresponding additional
records. Thus, 2 servers are filled. Here is the "show servers state"
output :
2 frt 1 test1 192.168.1.1 2 64 0 1 2 15 3 4 6 0 0 0 http1.example.com 8001 _http._tcp.example.com 0 0 - - 0
2 frt 2 test2 192.168.1.2 2 64 0 1 1 15 3 4 6 0 0 0 http2.example.com 8002 _http._tcp.example.com 0 0 - - 0
* Then, one additional record is removed (or a SRV record is removed, the
result is the same). Here is the new "show servers state" output :
2 frt 1 test1 192.168.1.1 2 64 0 1 38 15 3 4 6 0 0 0 http1.example.com 8001 _http._tcp.example.com 0 0 - - 0
2 frt 2 test2 192.168.1.2 0 96 0 1 19 15 3 0 14 0 0 0 - 8002 _http._tcp.example.com 0 0 - - 0
On reload, if a server-state file is used, this leads to undefined behaviors
depending on the configuration.
This patch should be backported as far as 2.0.
When a SRV record was created, it used to register the regular server name
resolution callbacks. That said, SRV records and regular server name
resolution don't work the same way, furthermore on error management.
This patch introduces a new call back to manage DNS errors related to
the SRV queries.
this fixes github issue #50.
Backport status: 2.3, 2.2, 2.1, 2.0
If no additional record is associated to a SRV record, its TTL must not be
renewed. Otherwise the entry never expires. Thus once announced a first
time, the entry remains blocked on the same IP/port except if a new announce
replaces the old one.
Now, the TTL is updated if a SRV record is received while a matching
existing one is found with an additional record or when an new additional
record is assigned to an existing SRV record.
This patch should be backported as far as 2.2.
At the end of resolv_validate_dns_response(), if a received additionnal
record is not assigned to an existing server record, it is released. But the
condition to do so is buggy. If "answer_record" (the received AR) is not
assigned, "tmp_record" is not a valid record object. It is just a dummy
record "representing" the head of the record list.
Now, the condition is far cleaner. This patch must be backported as far as
2.2.
This function has become large with the multi-queue scheduler. We need
to keep the fast path and the debugging parts inlined, but the rest now
moves to task.c just like was done for task_wakeup(). This has reduced
the code size by 6kB due to less inlining of large parts that are always
context-dependent, and as a side effect, has increased the overall
performance by 1%.
The nb_tasks counter was still global and gets incremented and decremented
for each task_new()/task_free(), and was read in process_runnable_tasks().
But it's only used for stats reporting, so doing this this often is
pointless and expensive. Let's move it to the task_per_thread struct and
have the stats sum it when needed.
The test in __task_wakeup() to figure if the remote threads are sleeping
doesn't make sense outside of the global runqueue test, since there are
only two possibilities here: local runqueue or global runqueue, hence a
sleeping thread is another one and can only happen when sending to the
global run queue. Let's move the test inside the "if" block.
Historically we used to call __task_wakeup() with a known tree root but
this is not the case and the code has remained needlessly complicated
with the root calculation in task_wakeup() passed in argument to
__task_wakeup() which compares it again.
Let's get rid of this and just move the detection code there. This
eliminates some ifdefs and allows to simplify the test conditions quite
a bit.
This one is systematically misunderstood due to its unclear name. It
is in fact the number of tasks in the local tasklet list. Let's call
it "tasks_in_list" to remove some of the confusion.
This one is exclusively used as a boolean nowadays and is non-zero only
when the thread-local run queue is not empty. Better check the root tree's
pointer and avoid updating this counter all the time.
This counter is solely used for reporting in the stats and is the hottest
thread contention point to date. Moving it to the scheduler and having a
separate one for the global run queue dramatically improves the performance,
showing a 12% boost on the request rate on 16 threads!
In addition, the thread debugging output which used to rely on rqueue_size
was not totally accurate as it would only report task counts. Now we can
return the exact thread's run queue length.
It is also interesting to note that there are still a few other task/tasklet
counters in the scheduler that are not efficiently updated because some cover
a single area and others cover multiple areas. It looks like having a distinct
counter for each of the following entries would help and would keep the code
a bit cleaner:
- global run queue (tree)
- per-thread run queue (tree)
- per-thread shared tasklets list
- per-thread local lists
Maybe even splitting the shared tasklets lists between pure tasklets and
tasks instead of having the whole and tasks would simplify the code because
there remain a number of places where several counters have to be updated.
dns_session_release() only uses its struct dns_stream_server to access
the lock, so a warning is emitted when threads are disabled. Let's mark
it __maybe_unused.
The lock was still used exclusively to deal with the concurrency between
the "show sess" release handler and a stream_new() or stream_free() on
another thread. All other accesses made by "show sess" are already done
under thread isolation. The release handler only requires to unlink its
node when stopping in the middle of a dump (error, timeout etc). Let's
just isolate the thread to deal with this case so that it's compatible
with the dump conditions, and remove all remaining locking on the streams.
This effectively kills the streams lock. The measured gain here is around
1.6% with 4 threads (374krps -> 380k).
The global streams list is exclusively used for "show sess", to look up
a stream to shut down, and for the hard-stop. Having all of them in a
single list is extremely expensive in terms of locking when using threads,
with performance losses as high as 7% having been observed just due to
this.
This patch makes the list per-thread, since there's no need to have a
global one in this situation. All call places just iterate over all
threads. The most "invasive" changes was in "show sess" where the end
of list needs to go back to the beginning of next thread's list until
the last thread is seen. For now the lock was maintained to keep the
code auditable but a next commit should get rid of it.
The observed performance gain here with only 4 threads is already 7%
(350krps -> 374krps).
Instead of placing the current stream at the end of the stream list when
issuing a "show sess" on the CLI as was done in 2.2 with commit c6e7a1b8e
("MINOR: cli: make "show sess" stop at the last known session"), now we
compare the listed stream's epoch with the dumping stream's and stop on
more recent ones.
This way we're certain to always only dump known streams at the moment we
issue the dump command without having to modify the list. In theory we
could miss some streams if more than 2^31 "show sess" requests are issued
while an old stream remains present, but that's 68 years at 1 "show sess"
per second and it's unlikely we'll keep a process, let alone a stream, that
long.
It could be verified that the count of dumped streams still matches the
one before this change.
The "show sess" CLI command currently lists all streams and needs to
stop at a given position to avoid dumping forever. Since 2.2 with
commit c6e7a1b8e ("MINOR: cli: make "show sess" stop at the last known
session"), a hack consists in unlinking the stream running the applet
and linking it again at the current end of the list, in order to serve
as a delimiter. But this forces the stream list to be global, which
affects scalability.
This patch introduces an epoch, which is a global 32-bit counter that
is incremented by the "show sess" command, and which is copied by newly
created streams. This way any stream can know whether any other one is
newer or older than itself.
For now it's only stored and not exploited.
The hard-stop event didn't wake threads up. In the past it wasn't an issue
as the poll timeout was limited to 1 second, but since commit 4f59d3861
("MINOR: time: increase the minimum wakeup interval to 60s") it has become
a problem because old processes can remain live for up to one minute after
the hard-stop-after delay. Let's just wake them up.
This may be backported to older releases, though before 2.4 the extra
delay was only one second.
There's no locking around the lookup of a stream nor its shutdown
when issuing "shutdown sessions" over the CLI so the risk of crashing
the process is particularly high.
Let's use a thread_isolate() there which is suitable for this task, and
there are not that many alternatives.
This must be backported to 1.8.
When setting hard-stop-after, hard_stop() is called at the end to kill
last pending streams. Unfortunately there's no locking there while
walking over the streams list nor when shutting them down, so it's
very likely that some old processes have been crashing or gone wild
due to this. Let's use a thread_isolate() call for this as we don't
have much other choice (and it happens once in the process' life,
that's OK).
This must be backported to 1.8.
This patch adds a lock to functions vars_get_by_name() and
vars_get_by_desc() to protect accesses to the list of variables.
After the variable is fetched, a sample data is duplicated by using
smp_dup() because the variable may be modified by another thread.
This should be backported to all versions supporting vars along with
"BUG/MINOR: sample: secure convs that accept base64 string and var name
as args" which this patch depends on.