Released version 3.0-dev11 with the following main changes :
- BUILD: clock: improve check for pthread_getcpuclockid()
- CI: add Illumos scheduled workflow
- CI: netbsd: limit scheduled workflow to parent repo only
- OPTIM: log: resolve logformat options during postparsing
- BUG/MINOR: haproxy: only tid 0 must not sleep if got signal
- REGTEST: add tests for acl() sample fetch
- BUG/MINOR: acl: support built-in ACLs with acl() sample
- BUG/MINOR: cfgparse: use curproxy global var from config post validation
- MEDIUM: stconn/muxes: Add an abort reason for SE shutdowns on muxes
- MINOR: mux-h2: Set the SE abort reason when a RST_STREAM frame is received
- MEDIUM: mux-h2: Forward h2 client cancellations to h2 servers
- MINOR: mux-quic: Set the SE abort reason when a STOP_SENDING frame is received
- MINOR: stconn: Add samples to retrieve about stream aborts
- MINOR: mux-quic: Add .ctl callback function to get info about a mux connection
- MINOR: muxes: Add ctl commands to get info on streams for a connection
- MINOR: connection: Add samples to retrieve info on streams for a connection
- BUG/MEDIUM: log/ring: broken syslog octet counting
- BUG/MEDIUM: mux-quic: fix crash on STOP_SENDING received without SD
- DOC: lua: fix filters.txt file location
- MINOR: dynbuf: pass a criticality argument to b_alloc()
- MINOR: dynbuf: add functions to help queue/requeue buffer_wait fields
- MINOR: dynbuf: use the b_queue()/b_requeue() functions everywhere
- MEDIUM: dynbuf: make the buffer_wq an array of list heads
- CLEANUP: tinfo: better align fields in thread_ctx
- MINOR: dynbuf: provide a b_dequeue() function to detach a bw from the queue
- MEDIUM: dynbuf: generalize the use of b_dequeue() to detach buffer_wait
- MEDIUM: dynbuf/stream: re-enable queueing upon failed buffer allocation
- MEDIUM: dynbuf/stream: do not allocate the buffers in the callback
- MEDIUM: applet: make appctx_buf_available() only wake the applet up, not allocate
- MINOR: applet: set the blocking flag in the buffer allocation function
- MINOR: applet: adjust the allocation criticality based on the requested buffer
- MINOR: dynbuf/mux-h1: use different criticalities for buffer allocations
- MEDIUM: dynbuf/mux-h1: do not allocate the buffers in the callback
- MEDIUM: dynbuf: refrain from offering a buffer if more critical ones are waiting
- MINOR: stconn: report that a buffer allocation succeeded
- MINOR: stream: report that a buffer allocation succeeded
- MINOR: applet: report about buffer allocation success
- MINOR: mux-h1: report that a buffer allocation succeeded
- MEDIUM: stream: allocate without queuing when retrying
- MEDIUM: channel: allocate without queuing when retrying
- MEDIUM: mux-h1: allocate without queuing when retrying
- MEDIUM: dynbuf: implement emergency buffers
- MEDIUM: dynbuf: use emergency buffers upon failed memory allocations
Now, if a pool_alloc() fails for a buffer and if conditions are met
based on the queue number, we'll try to get an emergency buffer.
Thanks to this, the situation is much more stable now. With only 4 reserve
buffers and 1 buffer it's possible to reliably serve 500 concurrent end-
to-end H1 connections while consulting the stats in parallel in loops,
watching the growing number of buf_wait events in "show activity", without
facing an instant stall like in the past. Lower values still cause quick
stalls though.
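As a rough illustration of the fallback described above, the allocation
path can be pictured like this (a minimal sketch; the emergency-buffer
field and constant names are assumptions, not the actual code):

    /* hypothetical sketch: allocate a buffer area, falling back to the
     * per-thread emergency buffers when the pool is exhausted and the
     * caller's criticality maps to a sufficiently important queue.
     */
    static void *alloc_buffer_area(enum dynbuf_crit crit)
    {
        void *area = pool_alloc(pool_head_buffer);

        if (!area && DB_CRIT_TO_QUEUE(crit) >= MIN_EMERGENCY_QUEUE &&
            th_ctx->emergency_bufs_left) {
            /* assumption: emergency buffers sit in a small per-thread
             * array that b_free() refills when below the reserve
             */
            area = th_ctx->emergency_bufs[--th_ctx->emergency_bufs_left];
        }
        return area;
    }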
It's also apparent that some subsystems do not seem to detach from the
buffer_wait lists when leaving. For example several crashes in the H1
part showed list elements still present after a free(), so maybe some
operations performed inside h1_release() after the b_dequeue() call
can sometimes result in a new allocation. Same for streams, where
the dequeue is done relatively early.
The buffer reserve set by tune.buffers.reserve has long been unused, and
in order to deal gracefully with failed memory allocations we'll need to
resort to a few emergency buffers that are pre-allocated per thread.
These buffers are only for emergency use, so every time their count is
below the configured number a b_free() will refill them. For this reason
their count can remain pretty low. We changed the default number from 2
to 4 per thread, and the minimum value is now zero (e.g. for low-memory
systems). The tune.buffers.limit setting has always been a problem when
trying to deal with the reserve, but now it can be simplified by pushing
the limit (if set) up to match the reserve. That was already done in the
past with a static value, but with threads it was a bit trickier, which is
why the per-thread allocators increment the limit on the fly before
allocating their own buffers. This also means that the configured
limit is saner and now corresponds to the regular buffers that can be
allocated on top of emergency buffers.
At the moment these emergency buffers are not used upon allocation
failure. The only reason is to ease bisecting later if needed, since
this commit only has to deal with resource management.
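For reference, both tunables live in the global section; a minimal
illustration (the values below are only examples, not recommendations):

    global
        # per-thread emergency buffers kept aside for allocation failures
        # (default is now 4; 0 disables them on low-memory systems)
        tune.buffers.reserve 4
        # optional cap on the regular buffers allocated on top of the
        # emergency ones (0 means no limit)
        tune.buffers.limit 0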
Now when trying to allocate a buffer, we check whether we've been notified
of availability via the callback, in which case we do not consult the
queue, or whether this is a first allocation, in which case the queue is
checked. At this point it doesn't change much yet since the stream still
doesn't make use of it, but some progress is expected.
Now when trying to allocate a channel buffer, we check whether we've been
notified of availability via the producer stream connector callback, in
which case we do not consult the queue, or whether this is a first
allocation, in which case the queue is checked.
Now when trying to allocate the work buffer, we check whether we've been
notified of availability via the buf_wait callback, in which case we do
not consult the queue, or whether this is a first allocation, in which
case the queue is checked.
When the buffer allocation callback is notified of a buffer availability,
it will now set a MAYALLOC flag in addition to clearing the ALLOC one, for
each of the 3 levels where we may fail an allocation. The flag will be
cleared upon a successful allocation. This will soon be used to decide to
re-allocate without waiting again in the queue. For now it has no effect.
There's just one trick: we need to clear the various *_ALLOC flags before
testing h1_recv_allowed(), otherwise it will return false!
When appctx_buf_available() is called, it now sets APPCTX_FL_IN_MAYALLOC
or APPCTX_FL_OUT_MAYALLOC depending on the reportedly permitted buffer
allocation, and these flags are cleared when the said buffers are
allocated. For now they're not used for anything else.
When the buffer allocation callback is notified of a buffer availability,
it will now set a MAYALLOC flag on the stream so that the stream knows it
is allowed to bypass the queue checks. For now this is not used.
We used to have two states for the channel's input buffer used by the SC,
NEED_BUFF or not, flipped by sc_need_buff() and sc_have_buff(). We want to
have a 3rd state, indicating that we've just got a desired buffer. Let's
add a HAVE_BUFF flag that is set by sc_have_buff() and cleared by
sc_used_buff(). This way by looking at HAVE_BUFF we know that we're coming
back from the allocation callback and that the offered buffer has not yet
been used.
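A condensed sketch of the resulting three-state logic (flag names are
abbreviated as in the text above; only the transitions come from this
commit):

    sc_need_buff(sc);  /* allocation failed: set NEED_BUFF, wait in queue   */
    sc_have_buff(sc);  /* allocation callback: clear NEED_BUFF, set HAVE_BUFF */
    sc_used_buff(sc);  /* buffer finally allocated: clear HAVE_BUFF         */

    /* seeing HAVE_BUFF thus means "just woken up by the allocation
     * callback, the offered buffer has not been consumed yet", which is
     * what later allows bypassing the queue checks when retrying.
     */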
Now b_alloc() will check the queues at the same and higher criticality
levels before allocating a buffer, and will refrain from allocating one
if these are not empty. The purpose is to put some priorities in the
allocation order so that most critical allocators are offered a chance
to complete.
However, in order to permit a freshly dequeued task to allocate again
while siblings are still in the queue, a special DB_F_NOQUEUE flag can be
passed to b_alloc() to take care of this situation.
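A minimal sketch of the intended calling pattern (simplified; the exact
b_alloc() prototype and flag combination are assumptions):

    /* normal allocation attempt: b_alloc() refrains from allocating if
     * a queue of same or higher criticality already has waiters.
     */
    buf = b_alloc(&chn->buf, DB_CHANNEL);

    /* retry after having been woken up from the queue: bypass the
     * queue check so the freshly dequeued task is not starved by the
     * siblings still waiting behind it.
     */
    buf = b_alloc(&chn->buf, DB_CHANNEL | DB_F_NOQUEUE);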
One of the problematic designs with the buffer_wait mechanism is that
the callbacks pre-allocate the buffers and stay in the run queue for
a while, resulting in all of the few buffers being assigned to waiting
tasks instead of being all available to one task that needs them all at
once.
Here we simply stop doing this, the callback clears the waiting flags
and wakes the task up so that it has a chance of still finding some
buffers.
While it could certainly still be improved, this first approach consists
in assigning buffers like this in the H1 mux:
- h1c->obuf : DB_MUX_TX
- h1c->ibuf : DB_MUX_RX
- h1s->rxbuf: DB_SE_RX
That's done via 3 distinct functions for better code clarity, and it also
made it possible to move the missing-buffer flags assignment there.
Among possible improvements would be to take the state of the parser into
consideration (i.e. no data yet vs data, or headers vs payload) so that
even the beginning of a server response, or pure payload, can be lowered
in priority.
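Put together, the allocation calls roughly look like this (simplified;
only the buffer/criticality pairs come from the text above):

    b_alloc(&h1c->obuf,  DB_MUX_TX);  /* data leaving haproxy           */
    b_alloc(&h1c->ibuf,  DB_MUX_RX);  /* raw data received from the OS  */
    b_alloc(&h1s->rxbuf, DB_SE_RX);   /* decoded data handed to the SE  */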
When we want to allocate an in buffer, it's in order to pass data to the
applet, which will consume it, so it must be seen as the equivalent of a
send() from the higher level, i.e. MUX_TX. And for the outbuf, it's a
stream endpoint returning data, i.e. DB_SE_RX.
Instead of having each caller of appctx_get_buf() think about setting
the blocking flag, better have the function do it, since it's already
handling the queue anyway. This way we're sure that both are consistent.
Now we don't want bufwait handlers to preallocate the resources they
were expecting since it contributes to the shortage. Let's just wake
the applet up and that's all.
One of the problematic designs with the buffer_wait mechanism is that
the callbacks pre-allocate the buffers and stay in the run queue for
a while, resulting in all of the few buffers being assigned to waiting
tasks instead of being all available to one task that needs them all at
once.
Here we simply stop doing this, the callback clears the waiting flags
and wakes the task up so that it has a chance of still finding some
buffers.
The errors were not working well anyway since we know that under low
memory conditions everything freezes. However we have a chance to do
better now, so let's start by re-enabling queueing when allocations fail.
Now that we need to keep the bitmap in sync with the list heads, we don't
want tasks to leave just doing a LIST_DEL_INIT() without updating the map.
Let's provide a b_dequeue() function for that purpose. The function
detects when it's about to remove the last element and, in that case,
figures out the queue number from the pointer, since it then points to
the queue's root. It's not used yet.
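A rough sketch of what such a helper can look like (the bitmap field name
and the exact detection logic are assumptions):

    /* hypothetical sketch: detach <bw> from its wait queue; when it is
     * the last element, its next pointer is the queue's root, from
     * which the queue number is derived to clear the matching bit.
     */
    static inline void b_dequeue(struct buffer_wait *bw)
    {
        if (!LIST_INLIST(&bw->list))
            return;
        if (bw->list.next == bw->list.prev)  /* last element in the queue */
            th_ctx->bufq_map &= ~(1U << (bw->list.next - th_ctx->buffer_wq));
        LIST_DEL_INIT(&bw->list);
    }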
The introduction of buffer_wq[] in thread_ctx pushed a few fields around
and the cache line alignment is less satisfying. And more importantly, even
before this, all the lists in the local parts were 8-aligned, with the first
one split across two cache lines.
We can do better:
- sched_profile_entry is not atomic at all, the data it points to is
atomic, so it doesn't need to be in the atomic-only region, and it can
fill the 8-byte hole before the lists
- the align(2*void) that was only applied before tasklets[] moves before
all the lists (and it's a nop for now)
This now makes the lists and buffer_wq[] start on a cache line boundary,
leaves 48 bytes after the lists before the atomic-only cache line, and
leaves a full cache line at the end for 128-alignment. This way we still
have plenty of room in both parts with better aligned fields.
Let's turn the buffer_wq into an array of 4 list heads. These are chosen
by criticality. The DB_CRIT_TO_QUEUE() macro maps each criticality level
into one of these 4 queues. The goal here clearly is to make it possible
to wake up the most critical queues in priority in order to let some tasks
finish their job and release buffers that others can use.
In order to avoid having to look up all queues, a bitmap indicates which
queues are in use, which also makes it possible to avoid looping in the
most common case where all queues are empty.
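On the wake-up path, the lookup can then be reduced to something like the
following sketch (the bitmap field name is an assumption):

    uint map = th_ctx->bufq_map;   /* bit set = queue has waiters */

    while (map) {
        uint q = my_ffsl(map) - 1; /* first non-empty queue       */

        map &= ~(1UL << q);
        /* ... offer buffers to waiters in th_ctx->buffer_wq[q] ... */
    }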
The places in the code that used to manipulate the buffer_wq manually now
just call b_queue() or b_requeue(). This will simplify the multiple list
management later.
When failing an allocation we always do the same dance: add the
buffer_wait struct to a list if it's not already in one, and return. Let's
just add dedicated functions to centralize this; it will be useful to
implement slightly more complex logic.
For now they're not used.
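The centralized "dance" then boils down to something like this (a sketch
only; at this point buffer_wq is still a single list and the actual
parameters may differ):

    /* hypothetical sketch: append <bw> to the wait queue unless it is
     * already registered there.
     */
    static inline void b_requeue(struct buffer_wait *bw)
    {
        if (LIST_INLIST(&bw->list))
            return;
        LIST_APPEND(&th_ctx->buffer_wq, &bw->list);
    }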
The goal is to indicate how critical the allocation is, between the
least one (growing an existing buffer ring) and the topmost one (boot
time allocation for the life of the process).
The 3 TCP-based muxes (h1, h2, fcgi) use a common allocation function to
try to allocate, otherwise to subscribe. There's currently no distinction
of direction nor of which part tries to allocate, and this should be
revisited to improve the situation, particularly when we consider that
mux-h2 can reduce its Tx allocations if needed.
For now, 4 main levels are planned, to translate how the data travels
inside haproxy from a producer to a consumer:
- MUX_RX: buffer used to receive data from the OS
- SE_RX: buffer used to place a transformation of the RX data for
a mux, or to produce a response for an applet
- CHANNEL: the channel buffer for sync recv
- MUX_TX: buffer used to transfer data from the channel to the outside,
generally a mux but there can be a few specificities (e.g.
http client's response buffer passed to the application,
which also gets a transformation of the channel data).
The other levels are a bit different: the first two don't strictly need
to allocate, and the last one is permanent (used by compression).
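Schematically, the criticality scale can be pictured as follows (only the
four main levels are named in the text above; the names of the other
levels are assumptions used for illustration):

    enum dynbuf_crit {
        DB_GROW_RING = 0, /* assumption: growing an existing buffer ring  */
        DB_UNLIKELY,      /* assumption: allocation that rarely happens   */
        DB_MUX_RX,        /* buffer used to receive data from the OS      */
        DB_SE_RX,         /* transformed RX data, or an applet response   */
        DB_CHANNEL,       /* the channel buffer for sync recv             */
        DB_MUX_TX,        /* data leaving haproxy, generally via a mux    */
        DB_PERMANENT,     /* assumption: for-life allocation (compression) */
    };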
At the beginning of the filter class section, we encourage the user to
check out the filters.txt file to learn how the filters API works within
haproxy.
However the file location is incorrect. The proper directory to look for
the file is: doc/internals/api.
It should be backported up to 2.5.
The abort reason code received on STOP_SENDING is notified to the upper
layer since the following commit :
367ce1ebf3
MINOR: mux-quic: Set the SE abort reason when a STOP_SENDING frame is received
However, this causes a crash when a STOP_SENDING is received on a QCS
instance without any stream instantiated. Fix this by first checking that
qcs->sd is not NULL before setting the abort code.
This bug can easily be reproduced by emitting a STOP_SENDING as first
frame of a stream.
This should fix github issue #2563.
This does not need to be backported.
As reported by Tristan in GH #2561, syslog messages sent over rings are
malformed since commit 01aa0a05 ("MEDIUM: ring: change the ring reader
to use the new vector-based API now").
Indeed, take a look at the following log message produced prior to
01aa0a05:
181 <134>1 2024-05-07T09:45:21.543263+02:00 - haproxy 113700 - - 127.0.0.1:56136 [07/May/2024:09:45:21.491] front front/s1 0/0/21/30/51 404 369 - - ---- 1/1/0/0/0 0/0 "GET / HTTP/1.1"
Starting with 01aa0a05, here's the equivalent log message:
<134>1 2024-05-07T09:45:21.543263+02:00 - haproxy 112729 - - 127.0.0.1:56136 [07/May/2024:09:45:21.491] front front/s1 0/0/66/39/105 404 369 - - ---- 1/1/0/0/0 0/0 "GET / HTTP/1.1"-fwr
-> The message is missing its octet counting header, and garbage bytes
are found at the end of the payload.
This bug is caused by a small mistake in syslog_applet_append_event():
when the function was refactored to use the vector API instead of the
buffer API, we used 'trash.area' as the starting pointer to write the
event instead of 'trash.area + trash.data', causing the existing octet
counting prefix (already written in trash) to be overwritten and
trash.data to be wrongly incremented.
No backport needed (01aa0a05 was introduced during 3.0 development)
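In other words, the fix boils down to appending at the current write
position instead of at the start of the buffer, along the lines of (the
variable name below is only illustrative):

    /* before (wrong): overwrites the octet-counting prefix already
     * written at the start of trash */
    write_at = trash.area;

    /* after: append the event right after what is already in trash */
    write_at = trash.area + trash.data;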
Thanks to the previous fix, it is now possible to get the number of opened
streams for a connection and the negotiated limit. Here, corresponding
sample fetches are added, in the fc_ and bc_ scopes.
On the frontend side, the limit of streams is imposed by HAProxy. But on
the backend side, the limit is defined by the server. It may be useful for
debugging purposes because it may explain slow-downs in some processing.
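A possible usage sketch in a configuration (the exact fetch names below
are assumptions; refer to the configuration manual for the final ones):

    # expose how loaded the front and back connections are, for debugging
    http-response set-header X-FC-Streams "%[fc_nb_streams]/%[fc_streams_limit]"
    http-response set-header X-BC-Streams "%[bc_nb_streams]/%[bc_streams_limit]"

This would typically help spot a backend whose advertised stream limit is
the cause of slow-downs.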
There are 2 new ctl commands that may be used to retrieve the current
number of streams opened for a connection and its limit (the maximum
number of streams a mux connection supports).
For the PT and H1 muxes, the limit is always 1 and the current number of
streams is 0 for idle connections, otherwise 1 is returned.
For the H2 and FCGI muxes, the info is already available in the mux
connection.
For the QUIC mux, the limit is also directly available. It is the maximum
initial sub-ID of bidirectional streams allowed for the connection. For
the current number of streams, it is the number of SCs attached to the
connection plus the number of not-yet-attached streams present in the
"opening_list" list.
Other muxes implement this callback function. It was not implemented for
the QUIC mux because it was useless. It will now be used to retrieve the
current/max number of streams for a QUIC connection. So let's add it, with
default support for the MUX_CTL_EXIT_STATUS command.
It is now possible to retrieve some info about the abort received for a
server or a client stream, if any.
* fs.aborted and bs.aborted can be used to know if an abort was received
on frontend or backend side. A boolean is returned.
* fs.rst_code and bs.rst_code return the code of the received RST_STREAM
frame for an H2 stream or the code of the received STOP_SENDING frame for
a QUIC stream. In both cases, the error code attached to the frame is
returned. The sample fetch fails if no such frame was received or if the
stream is not an H2/QUIC stream.
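For instance, these fetches can be used to trace aborts in the logs
(illustrative snippet only):

    # log whether the frontend stream was aborted and with which code
    log-format "%ci:%cp %HM %HU %ST fs_aborted=%[fs.aborted] fs_rst=%[fs.rst_code]"

When no such frame was received, fs.rst_code fails and the field should
simply come out empty.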
When a STOP_SENDING frame is received for a QUIC stream, the error code
is now saved in the SE abort reason. To do so, we use the QUIC source
(SE_ABRT_SRC_MUX_QUIC). For now, this code is only set but not used on the
opposite side.
When an H2 client sends a RST_STREAM(CANCEL) frame to abort a request,
the abort reason is now used on the server side, in the H2 mux, to set the
RST_STREAM code. The main use case is to forward client cancellations to
gRPC applications.
This patch should fix issue #172.
When an RST_STREAM frame is received, the error code is now saved in the SE
abort reason. To do so, we use the H2 source (SE_ABRT_SRC_MUX_H2). For now,
this code is only set but not used on the opposite side.
A reason is now passed as a parameter to mux shutdowns to convey
additional info about the abort, if any. No info means no abort, or only a
generic one.
For now, the reason is composed of two 32-bit integers. The first one
represents the abort code and the other one carries info about the code
(for instance the source). The code should be interpreted according to the
associated info.
One piece of info is the source, encoded on 5 bits. The other bits are
reserved for now. For now, the muxes are the only supported source. But we
can imagine extending it to applets, streams, health-checks...
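Schematically, the reason can be pictured like this (the struct and field
names are illustrative, not necessarily the ones used in the code):

    struct se_abort_info {
        uint32_t info;  /* bits 0-4: source (e.g. SE_ABRT_SRC_MUX_H2),
                         * remaining bits reserved for now */
        uint32_t code;  /* the abort code, interpreted according to <info> */
    };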
The current design is quite simple and will most probably evolve. But the
idea is to let the opposite side forward some errors and to let a mux know
why its stream was aborted. At first glance, an abort reason must only be
evaluated if the SE_SHW_SILENT flag is set.
The main short-term goal is to forward some H2 RST_STREAM codes because it
is mandatory for gRPC applications, mainly to forward gRPC cancellations
from an H2 client to an H2 server. But we can imagine altering this reason
at the applicative level to enrich it. It would also be used to report
more accurate errors in logs.
Previously check_config_validity() had its own curproxy variable. This
resulted in the acl() sample fetch being unable to determine which
proxy was in use when used from within log-format statements. This
change addresses the issue by having the check_config_validity()
function use the global variable instead.
This patch fixes the commit eea152ee68
("BUG/MINOR: signals/poller: ensure wakeup from signals").
There is some probability that run_poll_loop() becomes infinite, if
TH_FL_SLEEPING is withdrawn from all threads in the second
signal_queue_len check, when a signal has been received just after the
first one.
In such a particular case, the 'wake' variable, which is used to terminate
the thread's poll loop, is never reset to 0. So we never enter the
"stopping" part of run_poll_loop(), and the threads, except the one with
id 0 (tid 0 handles signals), will continue to call _do_poll() eternally
and will never sleep, as their TH_FL_SLEEPING flag was unset.
This flag needs to be removed only for tid 0, as is done in the first
signal_queue_len check.
This fixes issue #2537 "infinite loop when shutting down".
This fix must be backported to every stable version.
In lf_buildctx_prepare(), we perform costly bitwise operations for every
node to resolve node options and check for incompatibilities with global
options.
In fact, all this logic may safely be performed during postparsing. This
is what we're doing in this commit. Doing so saves us from unnecessary
runtime checks and could help speed up sess_build_logline().
Since checks are not as costly as before (due to them being performed
during postparsing and not on the log building path anymore), a
complementary check for the OPT_HTTP vs OPT_ENCODE incompatibility was
added: encoding is ignored if the HTTP option is set, unless the HTTP
option wasn't set globally while encoding was set globally, in which case
encoding takes precedence.
Thanks to this patch, lf_buildctx_prepare() now only takes care of
assigning the proper typecast and option settings depending on whether
it's used from a global or per-node context, and prepares CBOR-specific
structure members when the CBOR encode option is set.
Released version 3.0-dev10 with the following main changes :
- BUG/MEDIUM: cache: Vary not working properly on anything other than accept-encoding
- REGTESTS: cache: Add test on 'vary' other than accept-encoding
- BUG/MINOR: stats: replace objt_* by __objt_* macros
- CLEANUP: tools/cbor: rename cbor_encode_ctx struct members
- MINOR: log/cbor: _lf_cbor_encode_byte() explicitly requires non-NULL ctx
- BUG/MINOR: log: fix global lf_expr node options behavior
- CLEANUP: log: add a macro to know if a lf_node is configurable
- MINOR: httpclient: allow to use absolute URI with new flag HC_F_HTTPROXY
- MINOR: ssl: introduce ocsp_update.http_proxy for ocsp-update keyword
- BUG/MINOR: log/encode: consider global options for key encoding
- BUG/MINOR: log/encode: fix potential NULL-dereference in LOGCHAR()
- BUG/MINOR: log: fix global lf_expr node options behavior (2nd try)
- MINOR: log/cbor: _lf_cbor_encode_byte() explicitly requires non-NULL ctx (again)
- BUG/MEDIUM: log: don't ignore disabled node's options
- BUG/MINOR: stconn: don't wake up an applet waiting on buffer allocation
- MINOR: sock: rename sock to sock_fd in sock_create_server_socket
- MEDIUM: proto_uxst: take in account server namespace
- MEDIUM: unix sock: use my_socketat to create bind socket
- MINOR: sock_set_mark: take sock family in account
- MEDIUM: proto: make common fd checks in sock_create_server_socket
- MINOR: sock: add EPERM case in sock_handle_system_err
- MINOR: capabilities: add cap_sys_admin support
- CLEANUP: ssl: clean the includes in ssl_ocsp.c
- CLEANUP: ssl: move the global ocsp-update options parsing to ssl_ocsp.c
- MINOR: stats: fix visual alignment for stat_cols_px definition
- MINOR: stats: convert req_tot as generic column
- MINOR: stats: prepare stats-file support for values other than FN_COUNTER
- MINOR: counters: move freq-ctr from proxy/server into counters struct
- MINOR: stats: support rate in stats-file
- MINOR: stats: convert rate as generic column for proxy stats
- MINOR: counters: move last_change into counters struct
- MINOR: stats: support age in stats-file
- MINOR: stats: convert age as generic column for proxy stat
- CLEANUP: ssl: rename new_ckch_store_load_files_path() to ckch_store_new_load_files_path()
- MINOR: ssl: rename ocsp_update.http_proxy into ocsp-update.httpproxy
- REORG: stats: define stats-proxy source module
- MINOR: stats: extract proxy clear-counter in a dedicated function
- REGTESTS: stats: add test stats-file counters preload
- CI: netbsd: adjust packages after NetBSD-10 released
- CLEANUP: assorted typo fixes in the code and comments
- REGTESTS: replace REQUIRE_VERSION by version_atleast
- MEDIUM: log: optimizing tmp->type handling in sess_build_logline()
- BUG/MINOR: log: prevent double spaces emission in sess_build_logline()
- OPTIM: log: declare empty buffer as global variable
- OPTIM: log: use thread local lf_buildctx to stop pushing it on the stack
- OPTIM: log: use lf_buildctx's buffer instead of temporary stack buffers
- OPTIM: log: speedup date printing in sess_build_logline() when no encoding is used
In sess_build_logline(), we have multiple fields such as '%t' that build
a fixed-length string out of a date struct and then print it using
lf_rawtext(). In fact, printing it using lf_rawtext() is only mandatory
to deal with encoding options, but when no encoding is used we can output
the result to tmplog directly. Since most dates generate between 25 and 30
chars, doing so spares us from writing them twice and could help make
sess_build_logline() a bit faster when no encoding is used (to match the
pre-encoding patch series performance).
Now that lf_buildctx isn't pushed on the stack anymore, let's take this
opportunity to store a small buffer of 256 bytes within it, and then use
this buffer as general purpose buffer to build fixed-length strings that
are then printed using the lf_{raw}text() functions. By doing so we stop
relying on temporary stack buffers.
Following the previous commit's logic, let's move the lf_buildctx ctx
away from sess_build_logline() to stop abusing the stack by pushing a
large structure each time sess_build_logline() is called. Also, don't
memset the structure for each invocation, but only reset members
explicitly when required.
For that we now declare one static lf_buildctx per thread (using
THREAD_LOCAL) and make sess_build_logline() refer to it using a pointer.
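The change can be summarized by the following sketch (simplified):

    /* one build context per thread instead of one pushed on the stack
     * for each sess_build_logline() call
     */
    static THREAD_LOCAL struct lf_buildctx lf_buildctx;

    /* inside sess_build_logline(): */
    struct lf_buildctx *ctx = &lf_buildctx;
    /* only reset the members that need it, no memset() of the whole
     * structure anymore */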
The 'empty' buffer used in sess_build_logline() is declared inside a
loop, and since it is only being read from and not modified, until
recently it ended up being cached most of the time and didn't cause
overhead despite being systematically pushed on the stack.
However, due to the recent encoding work and newly added variables on the
stack, we're starting to reach a stack limit, and declaring the 'empty'
buffer within the loop seems to cause non-negligible CPU overhead.
Since the variable isn't modified during log generation, let's declare
the 'empty' buffer as a global variable outside of sess_build_logline()
to prevent pushing it on the stack for each node evaluation.