This commit introduces the keyword "client-sigalgs" for the bind line,
which does the same as "sigalgs" but for the client authentication.
"ssl-default-bind-client-sigalgs" allows to set the default parameter
for all the bind lines.
This patch should fix issue #2081.
This patch introduces the "sigalgs" keyword for the bind line, which
allows to configure the list of server signature algorithms negociated
during the handshake. Also available as "ssl-default-bind-sigalgs" in
the default section.
This patch was originally written by Bruno Henc.
The thread dump mechanism that is used by "show threads" and by the
panic dump is overly complicated due to an initial misdesign. It
firsts wakes all threads, then serializes their dumps, then releases
them, while taking extreme care not to face colliding dumps. In fact
this is not what we need and it reached a limit where big machines
cannot dump all their threads anymore due to buffer size limitations.
What is needed instead is to be able to dump *one* thread, and to let
the requester iterate on all threads.
That's what this patch does. It adds the thread_dump_buffer to the
struct thread_ctx so that the requester offers the buffer to the
thread that is about to be dumped. This buffer also serves as a lock.
A thread at rest has a NULL, a valid pointer indicates the thread is
using it, and 0x1 (NULL+1) is used by the dumped thread to tell the
requester it's done. This makes sure that a given thread is dumped
once at a time. In addition to this, the calling thread decides
whether it accesses the thread by itself or via the debug signal
handler, in order to get a backtrace. This is much saner because the
calling thread is free to do whatever it wants with the buffer after
each thread is dumped, and there is no dependency between threads,
once they've dumped, they're free to continue (and possibly to dump
for another requester if needed). Finally, when the THREAD_DUMP
feature is disabled and the debug signal is not used, the requester
accesses the thread by itself like before.
For now we still have the buffer size limitation but it will be
addressed in future patches.
NS_TO_TV helper was implemented in 591fa59 ("MINOR: time: add conversions
to/from nanosecond timestamps")
Due to NS_TO_TV being implemented as a macro and not a function, we must
take extra care when manipulating user input.
In current implementation, 't' argument is not isolated within the macro.
Because of this, NS_TO_TV(1 + 1) will expand to:
((const struct timeval){ .tv_sec = 1 + 1 / 1000000000ULL, .tv_usec = (1 + 1 % 1000000000ULL) / 1000U })
Instead of:
((const struct timeval){ .tv_sec = 2 / 1000000000ULL, .tv_usec = (2 % 1000000000ULL) / 1000U })
As such, NS_TO_TV usage in hlua_now() is currently incorrect and this
results in unexpected values being passed to lua.
In this patch, we're adding an extra parenthesis around 't' in NS_TO_TV()
macro to make it safe against such usages. (that is: ensure proper
argument expansion as if NS_TO_TV was implemented as a function)
This is a 2.8 specific bug, no backport needed.
When a fatal error is detected by the QUIC MUX or H3 layer, the
connection should be closed with a CONNECTION_CLOSE with an error code
as the reason.
Previously, a direct call was used to the quic_conn layer to try to
close the connection. This API was adjusted to be more flexible. Now,
when an error is detected, the function qcc_set_error() is called. This
set the flag QC_CF_ERRL with the error code stored by the MUX. The
connection will be closed soon so most of the operations are not
conducted anymore. Connection is then finally closed during qc_send()
via quic_conn layer if QC_CF_ERRL is set. This will set the flag
QC_CF_ERRL_DONE which indicates that the MUX instance can be freed.
This model is cleaner and brings the following improvments :
- interaction with quic_conn layer for closure is centralized on a
single function
- CO_FL_ERROR is not set anymore. This was incorrect as this should be
reserved to errors reported by the transport layer to be similar with
other haproxy components. As a consequence, qcc_is_dead() has been
adjusted to check for QC_CF_ERRL_DONE to release the MUX instance.
This should be backported up to 2.7.
Add a dedicated trace event QMUX_EV_QCC_ERR. This is used for locally
detected error when a CONNECTION_CLOSE should be emitted.
This should be backported up to 2.7.
Now that "now" is no more a timeval, there's no point keeping a copy
of it as a timeval, let's also switch start_time to nanoseconds, it
simplifies operations.
This puts an end to the occasional confusion between the "now" date
that is internal, monotonic and not synchronized with the system's
date, and "date" which is the system's date and not necessarily
monotonic. Variable "now" was removed and replaced with a 64-bit
integer "now_ns" which is a counter of nanoseconds. It wraps every
585 years, so if all goes well (i.e. if humanity does not need
haproxy anymore in 500 years), it will just never wrap. This implies
that now_ns is never nul and that the zero value can reliably be used
as "not set yet" for a timestamp if needed. This will also simplify
date checks where it becomes possible again to do "date1<date2".
All occurrences of "tv_to_ns(&now)" were simply replaced by "now_ns".
Due to the intricacies between now, global_now and now_offset, all 3
had to be turned to nanoseconds at once. It's not a problem since all
of them were solely used in 3 functions in clock.c, but they make the
patch look bigger than it really is.
The clock_update_local_date() and clock_update_global_date() functions
are now much simpler as there's no need anymore to perform conversions
nor to round the timeval up or down.
The wrapping continues to happen by presetting the internal offset in
the short future so that the 32-bit now_ms continues to wrap 20 seconds
after boot.
The start_time used to calculate uptime can still be turned to
nanoseconds now. One interrogation concerns global_now_ms which is used
only for the freq counters. It's unclear whether there's more value in
using two variables that need to be synchronized sequentially like today
or to just use global_now_ns divided by 1 million. Both approaches will
work equally well on modern systems, the difference might come from
smaller ones. Better not change anyhting for now.
One benefit of the new approach is that we now have an internal date
with a resolution of the nanosecond and the precision of the microsecond,
which can be useful to extend some measurements given that timestamps
also have this resolution.
Instead we're using ns_to_sec(tv_to_ns(&now)) which allows the tv_sec
part to disappear. At this point, "now" is only used as a timeval in
clock.c where it is updated.
Let's get rid of timeval in storage of internal timestamps so that they
are no longer mistaken for wall clock time. These were exclusively used
subtracted from each other or to/from "now" after being converted to ns,
so this patch removes the tv_to_ns() conversion to use them natively. Two
occurrences of tv_isge() were turned to a regular wrapping subtract.
Instead of operating on {sec, usec} now we convert both operands to
ns then subtract them and convert to ms. This is a first step towards
dropping timeval from these timestamps.
Interestingly, tv_ms_elapsed() and tv_ms_remain() are no longer used at
all and could be removed.
In order to ease the transition away from the timeval used in internal
timestamps, let's first create a few functions and macro to return a
counter from a timeval and conversely, as well as ease the conversions
to/from ns/us/ms/sec to save the user from having to count zeroes and
to think about appending ULL in conversions.
SC_FL_SND_NEVERWAIT and SC_FL_SND_EXP_MORE flags have the same value. It is
not critical because these flags are only used to know if MSG_MORE flag must
be set on a send().
No backport needed.
It is possible to start too many applets on sporadic burst of events after
an inactivity period. It is due to the way we estimate if a new applet must
be created or not. It is based on a frequency counter. We compare the events
processing rate against the number of events currently processed (in
progress or waiting to be processed). But we should also take care of the
number of idle applets.
We already track the number of idle applets, but it is global and not
per-thread. Thus we now also track the number of idle applets per-thread. It
is not a big deal because this fills a hole in the spoe_agent structure.
Thanks to this counter, we can refrain applets creation if there is enough
idle applets to handle currently processed events.
This patch should be backported to every stable versions.
During accept, a quic-conn is rebind to a new thread. This process is
done in two times :
* first on the original thread via qc_set_tid_affinity()
* then on the newly assigned thread via qc_finalize_affinity_rebind()
Most quic_conn operations (I/O tasklet, task and quic_conn FD socket
read) are reactivated ony after the second step. However, there is a
possibility that datagrams are handled before it via quic_dgram_parse()
when using listener sockets. This does not seem to cause any issue but
this may cause unexpected behavior in the future.
To simplify this, qc_finalize_affinity_rebind() will be called both by
qc_xprt_start() and quic_dgram_parse(). Only one invocation will be
performed thanks to the new flag QUIC_FL_CONN_AFFINITY_CHANGED.
This should be backported up to 2.7.
Some HTX responses may not always contain a EOM block. For example this
is the case if content-length header is missing from the HTTP server
response. Stream termination is thus signaled to QUIC mux via shutw
callback. However, this is interpreted inconditionnally as an early
close by the mux with a RESET_STREAM emission. Most of the times, QUIC
clients report this as an error.
To fix this, check if htx.extra is set to HTX_UNKOWN_PAYLOAD_LENGTH for
a qcs instance. If true, shutw will never be used to emit a
RESET_STREAM. Instead, the stream will be closed properly with a FIN
STREAM frame. If all data were already transfered, an empty STREAM frame
is sent.
This fix may help with the github issue #2004 where chrome browser stop
to use QUIC after receiving RESET_STREAM frames.
This issue was reported by Vladimir Zakharychev. Thanks to him for his
help and testing. It was also reproduced locally using httpterm with the
query string "/?s=1k&b=0&C=1".
This should be backported up to 2.7.
On aarch64 there's also a guaranted invalid instruction, called UDF, and
which even supports an optional 16-bit immediate operand:
https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/UDF--Permanently-Undefined-?lang=en
It's conveniently encoded as 4 zeroes (when the operand is zero). It's
unclear when support for it was added into GAS, if at all; even a
not-so-old 2.27 doesn't know about it. Let's byte-encode it.
Tested on an A72 and works as expected.
BUG_ON() calls currently trigger a segfault. This is more convenient
than abort() as it doesn't rely on any function call nor signal handler
and never causes non-unwindable stacks when opening cores. But it adds
quite some confusion in bug reports which are rightfully tagged "segv"
and do not instantly allow to distinguish real segv (e.g. null derefs)
from code asserts.
Some CPU architectures offer various crashing methods. On x86 we have
INT3 (0xCC), which stops into the debugger, and UD0/UD1/UD2. INT3 looks
appealing but for whatever reason (maybe signal handling somewhere) it
loses the last call point in the stack, making backtraces unusable. UD2
has the merit of being only 2 bytes and causing an invalid instruction,
which almost never happens normally, so it's easily distinguishable.
Here it was defined as a macro so that the line number in the core
matches the one where the BUG_ON() macro is called, and the debugger
shows the last frame exactly at its calligg point.
E.g. when calling "debug dev bug":
Program terminated with signal SIGILL, Illegal instruction.
#0 debug_parse_cli_bug (args=<optimized out>, payload=<optimized out>, appctx=<optimized out>, private=<optimized out>) at src/debug.c:408
408 BUG_ON(one > zero);
[Current thread is 1 (Thread 0x7f7a660cc1c0 (LWP 14238))]
(gdb) bt
#0 debug_parse_cli_bug (args=<optimized out>, payload=<optimized out>, appctx=<optimized out>, private=<optimized out>) at src/debug.c:408
#1 debug_parse_cli_bug (args=<optimized out>, payload=<optimized out>, appctx=<optimized out>, private=<optimized out>) at src/debug.c:402
#2 0x000000000061a69f in cli_parse_request (appctx=appctx@entry=0x181c0160) at src/cli.c:832
#3 0x000000000061af86 in cli_io_handler (appctx=0x181c0160) at src/cli.c:1035
#4 0x00000000006ca2f2 in task_run_applet (t=0x181c0290, context=0x181c0160, state=<optimized out>) at src/applet.c:449
Rename all frame variables with the suffix _frm. This helps to
differentiate frame instances from other internal objects.
This should be backported up to 2.7.
Each frame type used in quic_frame union has been renamed with the
following prefix "qf_". This helps to differentiate frame instances from
other internal objects.
This should be backported up to 2.7.
This new setting accepts "by-process", "by-group" and "by-thread" and
will dictate how listeners will be sharded by default when nothing is
specified. While the default remains "by-process", "by-group" should be
much more efficient with many threads, while not changing anything for
single-group setups.
When testing if a protocol supports SO_REUSEPORT, we're now able to
verify if the OS does really support it. While it may be supported at
build time, it may possibly have been blocked in a container for
example so we'd rather know what it's like.
The new function _sock_supports_reuseport() will be used to check if a
protocol type supports SO_REUSEPORT or not. This will be useful to verify
that shards can really work.
The new function protocol_supports_flag() checks the protocol flags
to verify if some features are supported, but will support being
extended to refine the tests. Let's use it to check for REUSEPORT.
Some protocol support SO_REUSEPORT and others not. Some have such a
limitation in the kernel, and others in haproxy itself (e.g. sock_unix
cannot support multiple bindings since each one will unbind the previous
one). Also it's really protocol-dependent and not just family-dependent
because on Linux for some time it was supported for TCP and not UDP.
Let's move the definition to the protocols instead. Now it's preset in
tcp/udp/quic when SO_REUSEPORT is defined, and is otherwise left unset.
The enabled() config condition test validates IPv4 (generally sufficient),
and -dR / noreuseport all protocols at once.
We'll use these flags to know if some protocols are supported, and if
so, with what options/extensions. Reuseport will move there for example.
Two functions were added to globally set/clear a flag.
What used to be only two lines to apply a mask in a loop in
check_config_validity() grew into a 130-line block that performs deeply
listener-specific operations that do not have their place there anymore.
In addition it's worth noting that the peers code still doesn't support
shards nor being bound to more than one group, which is a second reason
for moving that code to its own function. Nothing was changed except
recreating the missing variables from the bind_conf itself (the fe only).
This field forces an unaligned hole between two list heads. Let's move
it up where it will be more easily combined with other fields. In
addition, turn it to unsigned while it's still not used.
There's a two-byte hole in proto_fam after sock_family, let's move the
l3_addrlen there as a ushort. Note that contrary to what the comment
says, it's still not used by hash algorithms though it could.
One limitation of the current thread index mechanism is that if the
values are assigned multiple times to the same thread and the index
loops, it can match again the old value, which will not prevent a
competing thread from finishing its CAS and assigning traffic to a
thread that's not the optimal one. The probability is low but the
solution is simple enough and consists in implementing an update
counter in the high bits of the index to force a mismatch in this
case (assuming we don't try to cover for extremely unlikely cases
where the update counter loops while the index remains equal). So
let's do that. In order to improve the situation a little bit, we
now set the index to a ulong so that in 32 bits we have 8 bits of
counter and in 64 bits we have 40 bits.
There has always been a race when checking the length of an accept queue
to determine which one is more loaded that another, because the head and
tail are read at two different moments. This is not required, we can merge
them as two 16 bit numbers inside a single 32-bit index that is always
accessed atomically. This way we read both values at once and always have
a consistent measurement.
The purpose of this new flag will be to mark that some listeners
duplicate their reference's FD instead of trying to setup a completely
new listener from scratch. This will be used when multiple groups want
to listen to the same socket, via multiple FDs.
In order to create multiple receivers for one multi-group shard, we'll
need some more info about the shard. Here we store:
- the number of groups (= number of receivers)
- the number of threads (will be used for accept LB)
- pointer to the reference rx (to get the FD and to find all threads)
- pointers to the other members (to iterate over all threads)
For now since there's only one group per shard it remains simple. The
listener deletion code already takes care of removing the current
member from its shards list and moving others' reference to the last
one if it was their reference (so as to avoid o(n^2) updates during
ordered deletes).
Since the vast majority of setups will not use multi-group shards, we
try to save memory usage by only allocating the shard_info when it is
needed, so the principle here is that a receiver shard_info==NULL is
alone and doesn't share its socket with another group.
Various approaches were considered and tests show that the management
of the listeners during boot makes it easier to just attach to or
detach from a shard_info and automatically allocate it if it does not
exist, which is what is being done here.
For now the attach code is not called, but detach is already called
on delete.
This new algorithm for rebalancing incoming connections to multiple
threads is simpler and instead of considering the threads load, it will
only cycle through all of them, offering a fair share of the traffic to
each thread. It may be well suited for short-lived connections but is
also convenient for very large thread counts where it's not always certain
that the least loaded thread will always be found.
There's a li_per_thread array in each listener for use with QUIC
listeners. Since thread groups were introduced, this array can be
allocated too large because global.nbthread is allocated for each
listener, while only no more than MIN(nbthread,MAX_THREADS_PER_GROUP)
may be used by a single listener. This was because the global thread
ID is used as the index instead of the local ID (since a listener may
only be used by a single group). Let's just switch to local ID and
reduce the allocated size.
When migrating a quic_conn to another thread, we may need to also
switch the listener if the thread belongs to another group. When
this happens, the freshly created connection will already have the
target listener, so let's just pick it from the connection and use
it in qc_set_tid_affinity(). Note that it will be the caller's
responsibility to guarantee this.
Operational and administrative state change causes are not propagated
through srv_update_status(), instead they are directly consumed within
the function to provide additional info during the call when required.
Thus, there is no valid reason for keeping adm and op causes within
server struct. We are wasting space and keeping uneeded complexity.
We now exlicitly pass change type (operational or administrative) and
associated cause to srv_update_status() so that no extra storage is
needed since those values are only relevant from srv_update_status().
This one is greatly inspired by "MINOR: server: change adm_st_chg_cause storage type".
While looking at current srv_op_st_chg_cause usage, it was clear that
the struct needed some cleanup since some leftovers from asynchronous server
state change updates were left behind and resulted in some useless code
duplication, and making the whole thing harder to maintain.
Two observations were made:
- by tracking down srv_set_{running, stopped, stopping} usage,
we can see that the <reason> argument is always a fixed statically
allocated string.
- check-related state change context (duration, status, code...) is
not used anymore since srv_append_status() directly extracts the
values from the server->check. This is pure legacy from when
the state changes were applied asynchronously.
To prevent code duplication, useless string copies and make the reason/cause
more exportable, we store it as an enum now, and we provide
srv_op_st_chg_cause() function to fetch the related description string.
HEALTH and AGENT causes (check related) are now explicitly identified to
make consumers like srv_append_op_chg_cause() able to fetch checks info
from the server itself if they need to.
srv_append_status() has become a swiss-knife function over time.
It is used from server code and also from checks code, with various
inputs and distincts code paths, making it very hard to guess the
actual behavior of the function (resulting string output).
To simplify the logic behind it, we're dividing it in multiple contextual
functions that take simple inputs and do explicit things, making them
more predictable and easier to maintain.
Even though it doesn't look like it at first glance, this is more like
a cleanup than an actual code improvement:
Given that srv->adm_st_chg_cause has been used to exclusively store
static strings ever since it was implemented, we make the choice to
store it as an enum instead of a fixed-size string within server
struct.
This will allow to save some space in server struct, and will make
it more easily exportable (ie: event handlers) because of the
reduced memory footprint during handling and the ability to later get
the corresponding human-readable message when it's explicitly needed.
For advanced async handlers only
(Registered using EVENT_HDL_ASYNC_TASK() macro):
event->when is provided as a struct timeval and fetched from 'date'
haproxy global variable.
Thanks to 'when', related event consumers will be able to timestamp
events, even if they don't work in real-time or near real-time.
Indeed, unlike sync or normal async handlers, advanced async handlers
could purposely delay the consumption of pending events, which means
that the date wouldn't be accurate if computed directly from within
the handler.
Add the ability to provide a cleanup function for event data passed
via the publishing function.
One use case could be the need to provide valid pointers in the safe
section of the data struct.
Cleanup function will be automatically called with data (or copy of data)
as argument when all handlers consumed the event, which provides an easy
way to release some memory or decrement refcounts to ressources that were
provided through the data struct.
data in itself may not be freed by the cleanup function, it is handled
by the API.
This would allow passing large (allocated) data blocks through the data
struct while keeping data struct size under the EVENT_HDL_ASYNC_EVENT_DATA
size limit.
To do so, when publishing an event, where we would currently do:
struct event_hdl_cb_data_new_family event_data;
/* safe data, available from both sync and async contexts
* may not use pointers to short-living resources
*/
event_data.safe.my_custom_data = x;
/* unsafe data, only available from sync contexts */
event_data.unsafe.my_unsafe_data = y;
/* once data is prepared, we can publish the event */
event_hdl_publish(NULL,
EVENT_HDL_SUB_NEW_FAMILY_SUBTYPE_1,
EVENT_HDL_CB_DATA(&event_data));
We could do:
struct event_hdl_cb_data_new_family event_data;
/* safe data, available from both sync and async contexts
* may not use pointers to short-living resources,
* unless EVENT_HDL_CB_DATA_DM is used to ensure pointer
* consistency (ie: refcount)
*/
event_data.safe.my_custom_static_data = x;
event_data.safe.my_custom_dynamic_data = malloc(1);
/* unsafe data, only available from sync contexts */
event_data.unsafe.my_unsafe_data = y;
/* once data is prepared, we can publish the event */
event_hdl_publish(NULL,
EVENT_HDL_SUB_NEW_FAMILY_SUBTYPE_1,
EVENT_HDL_CB_DATA_DM(&event_data, data_new_family_cleanup));
With data_new_family_cleanup func which would look like this:
void data_new_family_cleanup(const void *data)
{
const struct event_hdl_cb_data_new_family *event_data = ptr;
/* some data members require specific cleanup once the event
* is consumed
*/
free(event_data.safe.my_custom_dynamic_data);
/* don't ever free data! it is not ours */
}
Not sure if this feature will become relevant in the future, so I prefer not
to mention it in the doc for now.
But given that the implementation is trivial and does not put a burden
on the existing API, it's a good thing to have it there, just in case.
ESUB_INDEX(n) index macro is used exclusively with n > 0
Fixing it so that it starts numbering at 1 instead of 2.
This way, we don't waste a subtype slot in event_hdl_sub_type
struct, and we comply with the structure comments about max
supported event subtypes (currently set at 16).
If 68e692da0 ("MINOR: event_hdl: add event handler base api")
is being backported, then this commit should be backported with it.
This commit does nothing that ought to be mentioned, except that
it adds missing comments and slighty moves some function calls
out of "sensitive" code in preparation of some server code refactors.
Expose proxy_uuid variable in event_hdl_cb_data_server struct to
overcome proxy_name fixed length limitation.
proxy_uuid may be used by the handler to perform proxy lookups.
This should be preferred over lookups relying proxy_name.
(proxy_name is suitable for printing / logging purposes but not for
ID lookups since it has a maximum fixed length)
Since this commit:
BUG/MINOR: quic: Possible wrapped values used as ACK tree purging limit.
There are more chances that ack ranges may be removed from their trees when
building a packet. It is preferable to impose a limit to these trees. This
will be the subject of the a next commit to come.
For now on, it is sufficient to stop deleting ack range from their trees.
Remove quic_ack_frm_reduce_sz() and quic_rm_last_ack_ranges() which were
there to do that.
Make qc_frm_len() support ACK frames and calls it to ensure an ACK frame
may be added to a packet before building it.
Must be backported to 2.6 and 2.7.
Not all hlua "time" variables use the same time logic.
hlua->wake_time relies on ticks since its meant to be used in conjunction
with task scheduling. Thus, it should be stored as a signed int and
manipulated using the tick api.
Adding a few comments about that to prevent mixups with hlua internal
timer api which doesn't rely on the ticks api.
For non yieldable lua handlers (converters, fetches or yield
incompatible lua functions), current timeout detection relies on now_ms
thread local variable.
But within non-yieldable contexts, now_ms won't be updated if not by us
(because we're momentarily stuck in lua context so we won't
re-enter the polling loop, which is responsible for clock updates).
To circumvent this, clock_update_date(0, 1) was manually performed right
before now_ms is being read for the timeout checks.
But this fails to work consistently, because if no other concurrent
threads periodically run clock_update_global_date(), which do happen if
we're the only active thread (nbthread=1 or low traffic), our
clock_update_date() call won't reliably update our local now_ms variable
Moreover, clock_update_date() is not the right tool for this anyway, as
it was initially meant to be used from the polling context.
Using it could have negative impact on other threads relying on now_ms
to be stable. (because clock_update_date() performs global clock update
from time to time)
-> Introducing hlua multipurpose timer, which is internally based on
now_cpu_time_fast() that provides per-thread consistent clock readings.
Thanks to this new hlua timer API, hlua timeout logic is less error-prone
and more robust.
This allows the timeout detection to work as expected for both yieldable
and non-yieldable lua handlers.
This patch depends on commit "MINOR: clock: add now_cpu_time_fast() function"
While this could theorically be backported to all stable versions,
it is advisable to avoid backports unless we're confident enough
since it could cause slight behavior changes (timing related) in
existing setups.
Same as now_cpu_time(), but for fast queries (less accurate)
Relies on now_cpu_time() and now_mono_time_fast() is used
as a cache expiration hint to prevent now_cpu_time() from being
called too often since it is known to be quite expensive.
Depends on commit "MINOR: clock: add now_mono_time_fast() function"
Same as now_mono_time(), but for fast queries (less accurate)
Relies on coarse clock source (also known as fast clock source on
some systems).
Fallback to now_mono_time() if coarse source is not supported on the system.
Remove the receiver RX_F_LOCAL_ACCEPT flag. This was used by QUIC
protocol before thread rebinding was supported by the quic_conn layer.
This should be backported up to 2.7 after the previous patch has also
been taken.
When a quic_conn instance is rebinded on a new thread its tasks and
tasklet are destroyed and new ones created. Its socket is also migrated
to a new thread which stop reception on it.
To properly reactivate a quic_conn after rebind, wake up its tasks and
tasklet if they were active before thread rebind. Also reactivate
reading on the socket FD. These operations are implemented on a new
function qc_finalize_affinity_rebind().
This should be backported up to 2.7 after a period of observation.
Implement a new function qc_set_tid_affinity(). This function is
responsible to rebind a quic_conn instance to a new thread.
This operation consists mostly of releasing existing tasks and tasklet
and allocating new instances on the new thread. If the quic_conn uses
its owned socket, it is also migrated to the new thread. The migration
is finally completed with updated the CID TID to the new thread. After
this step, the connection is thus accessible to the new thread and
cannot be access anymore on the old one without risking race condition.
To ensure rebinding is either done completely or not at all, tasks and
tasklet are pre-allocated before all operations. If this fails, an error
is returned and rebiding is not done.
To destroy the older tasklet, its context is set to NULL before wake up.
In I/O callbacks, a new function qc_process() is used to check context
and free the tasklet if NULL.
The thread rebinding can cause a race condition if the older thread
quic_dghdlrs::dgrams list contains datagram for the connection after
rebinding is done. To prevent this, quic_rx_pkt_retrieve_conn() always
check if the packet CID is still associated to the current thread or
not. In the latter case, no connection is returned and the new thread is
returned to allow to redispatch the datagram to the new thread in a
thread-safe way.
This should be backported up to 2.7 after a period of observation.
When QUIC handshake is completed on our side, some frames are prepared
to be sent :
* HANDSHAKE_DONE
* several NEW_CONNECTION_ID with CIDs allocated
This step was previously executed in quic_conn_io_cb() directly after
CRYPTO frames parsing. This patch delays it to be completed after
accept. Special care have been taken to ensure it is still functional
with 0-RTT activated.
For the moment, this patch should have no impact. However, when
quic_conn thread migration on accept will be implemented, it will be
easier to remap only one CID to the new thread. New CIDs will be
allocated after migration on the new thread.
This should be backported up to 2.7 after a period of observation.
Define a new protocol callback set_affinity. This function is used
during listener_accept() to notify about a rebind on a new thread just
before pushing the connection on the selected thread queue. If the
callback fails, accept is done locally.
This change will be useful for protocols with state allocated before
accept is done. For the moment, only QUIC protocol is concerned. This
will allow to rebind the quic_conn to a new thread depending on its
load.
This should be backported up to 2.7 after a period of observation.
CIDs were moved from a per-thread list to a global list instance. The
TID-encoded is thus non needed anymore.
This should be backported up to 2.7 after a period of observation.
Previously, quic_connection_id were stored in a per-thread tree list.
Datagram were first dispatched to the correct thread using the encoded
TID before a tree lookup was done.
Remove these trees and replace it with a global trees list of 256
entries. A CID is using the list index corresponding to its first byte.
On datagram dispatch, CID is lookup on its tree and TID is retrieved
using new member quic_connection_id.tid. As such, a read-write lock
protects each list instances. With 256 entries, it is expected that
contention should be reduced.
A new structure quic_cid_tree served as a tree container associated with
its read-write lock. An API is implemented to ensure lock safety for
insert/lookup/delete operation.
This patch is a step forward to be able to break the affinity between a
CID and a TID encoded thread. This is required to be able to migrate a
quic_conn after accept to select thread based on their load.
This should be backported up to 2.7 after a period of observation.
Remove <tid> member in quic_conn. This is moved to quic_connection_id
instance.
For the moment, this change has no impact. Indeed, qc.tid reference
could easily be replaced by tid as all of this work was already done on
the connection thread. However, it is planified to support quic_conn
thread migration in the future, so removal of qc.tid will simplify this.
This should be backported up to 2.7.
ODCID are never stored in the CID tree. Instead, we store our generated
CID which is directly derived from the CID using a hash function. This
operation is done via quic_derive_cid().
Previously, generated CID was returned as a 64-bits integer. However,
this is cumbersome to convert as an array of bytes which is the most
common CID representation. Adjust this by modifying return type to a
quic_cid struct.
This should be backported up to 2.7.
qc_parse_hd_form() is the function used to parse the first byte of a
packet and return its type and version. Its API has been simplified with
the following changes :
* extra out paremeters are removed (long_header and version). All infos
are now stored directly in quic_rx_packet instance
* a new dummy version is declared in quic_versions array with a 0 number
code. This can be used to match Version negotiation packets.
* a new default packet type is defined QUIC_PACKET_TYPE_UNKNOWN to be
used as an initial value.
Also, the function has been exported to an include file. This will be
useful to be able to reuse on quic-sock to parse the first packet of a
datagram.
This should be backported up to 2.7.
Two different structs exists for QUIC connection ID :
* quic_connection_id which represents a full CID with its sequence
number
* quic_cid which is just a buffer with a length. It is contained in the
above structure.
To better differentiate them, rename all quic_connection_id variable
instances to "conn_id" by contrast to "cid" which is used for quic_cid.
This should be backported up to 2.7.
QUIC_LOCK label is never used. Indeed, lock usage is minimal on QUIC as
every connection is pinned to its owned thread.
This should be backported up to 2.7.
SC_FL_EOS flag is added to report the end-of-stream at the SC level. It will
be used to distinguish end of stream reported by the endoint, via the
SE_FL_EOS flag, and the abort triggered by the stream, via the
SC_FL_ABRT_DONE flag.
In this patch, the flag is defined and is systematically tested everywhere
SC_FL_ABRT_DONE is tested. It should be safe because it is never set.
The flag SC_FL_ERROR is added to ack errors on the endpoint. When
SE_FL_ERROR flag is detected on the SE descriptor, the corresponding is set
on the SC. Idea is to avoid, as far as possible, to manipulated the SE
descriptor in upper layers and know when an error in the endpoint is handled
by the SC.
For now, this flag is only set and cleared but never tested.
Here again, it is just a flag renaming. In SC flags, there is no longer
shutdown for reads but aborts. For now this flag is set when a read0 is
detected. It is of couse not accurate. This will be changed later.
After the flag renaming, it is now the turn for the channel function to be
renamed and moved in the SC scope. channel_shutw_now() is replaced by
sc_schedule_shutdown(). The request channel is replaced by the front SC and
the response is replace by the back SC.
Because shutowns for reads are now considered as aborts, the shudowns for
writes can now be considered as shutdowns. Here it is just a flag
renaming. SC_FL_SHUTW_NOW is renamed SC_FL_SHUT_WANTED.
After the flag renaming, it is now the turn for the channel function to be
renamed and moved in the SC scope. channel_shutr_now() is replaced by
sc_schedule_abort(). The request channel is replaced by the front SC and the
response is replace by the back SC.
Most of calls to channel_abort() are associated to a call to
channel_auto_close(). Others are in areas where the auto close is the
default. So, it is now systematically enabled when an abort is performed on
a channel, as part of channel_abort() function.
Add the number of packet losts and the maximum congestion control window computed
by the algorithms to "show quic".
Same thing for the traces of existent congestion control algorithms.
Must be backported to 2.7 and 2.6.
Instead of artificially setting the shards count to MAX_THREAD when
"by-thread" is used, let's reserve special values for symbolic names
so that we can add more in the future. For now we use value -1 for
"by-thread", which requires to turn the type to signed int but it was
already used as such everywhere anyway.
fd_migrate_on() can be used to migrate an existing FD to any thread, even
one belonging to a different group from the current one and from the
caller's. All that is needed is to make sure the FD is still valid when
the operation is performed (which is the case when such operations happen).
This is potentially slightly expensive since it locks the tgid during the
delicate operation, but it is normally performed only from an owning
thread to offer the FD to another one (e.g. reassign a better thread upon
accept()).
In order to permit to migrate FDs from one thread group to another,
we'll need to be able to set a TGID that is compatible with no other
thread group. Either we use a special value or we dedicate a special
bit. Given that we already have way more bits than needed, let's just
sacrifice the topmost one to serve as a lock bit, indicating the tgid
is not valid anymore. This will make all fd_grab_tgid() fail to grab
it.
The new fd_lock_tgid() function now tries to assign a locked tgid to
an idle FD, and fd_unlock_tgid() simply drops the lock bit, revealing
the target tgid.
For now it's still unused so it must not have any effect.
fd_claim_tgid() uses a CAS to set the desired TGID on the FD. It's only
called from fd_insert() where in the vast majority of cases, the tgid
and refcount are zero before the call. However the loop was optimized
for the case where it was equal to the desired TGID, systematically
causing one extra round in the loop there. Better start assuming a
zero value.
We're only checking for 0, 1, or >1 groups enabled there, and we'll soon
need to be more precise and know quickly which groups are non-empty.
Let's just replace the count with a mask of enabled groups. This will
allow to quickly spot the presence of any such group in a set.
Alert when the len argument of a stick table type contains incorrect
characters.
Replace atol by strtol.
Could be backported in every maintained versions.
Once in a while we introduce an sprintf() or strncat() function by
accident. These ones are particularly dangerous and must never ever
be used because the only way to use them safely is at least as
complicated if not more, than their safe counterparts. By redefining
a few of these functions with an attribute_warning() we can deliver a
message to the developer who is tempted to use them. This commit does
it for strcat(), strcpy(), strncat(), sprintf(), vsprintf(). More could
come later if needed, such as strtok() and maybe a few others, but these
are less common.
__attribute__((deprecated)) is convenient to discourage from using
something deprecated, but gcc >= 4.3 provides __attribute__((warning(x)))
that allows to display a specific warning if something is used. This is
particularly convenient to give indications when some API parts need to
be adapted. Let's just define it as a macro that falls back to the older
deprecated attribute when not available.
It's supported on clang 14 as well but works differently and errors
out when redefined (while the main purpose precisely is to add such a
redefinition). Thus instead on clang we use deprecated(msg) which is
OK. See https://github.com/llvm/llvm-project/issues/56519
It appeared that __has_attribute() doesn't work on gcc 4.4 and older
because the concatenation of __has_attribute##x isn't resolved as a one
before being passed to __equals_1() which immediately concatenates it to
comma_for_one. We first need to pass it through an extra layer to resolve
this name to a value. The new version was tested with gcc 4.2 to 11.3.
This may be backported though it's pretty minor.
Add code so that compression can be used for requests as well.
New compression keywords are introduced :
"direction" that specifies what we want to compress. Valid values are
"request", "response", or "both".
"type-req" and "type-res" define content-type to be compressed for
requests and responses, respectively. "type" is kept as an alias for
"type-res" for backward compatibilty.
"algo-req" specifies the compression algorithm to be used for requests.
Only one algorithm can be provided.
"algo-res" provides the list of algorithm that can be used to compress
responses. "algo" is kept as an alias for "algo-res" for backward
compatibility.
Make provision for being able to store both compression algorithms and
content-types to compress for both requests and responses. For now only
the responses one are used.
Make provision for storing the compression algorithm and the compression
context twice, one for requests, and the other for responses. Only the
response ones are used for now.
In the commit 2954bcc1e (BUG/MINOR: http-ana: Don't switch message to DATA
when waiting for payload), the HTTP message flags were extended and don't
fit anymore in an unsigned char. So, we must use an unsigned integer now. It
is not a big deal because there was already a 6-bytes hole in the structure,
just after the flags. Now, there are a 3-bytes hold before.
This patch should fix the issue #2105. It is 2.8-specific, no backport
needed.
Previously, ODCID were concatenated with the client address. This was
done to prevent a collision between two endpoints which used the same
ODCID.
Thanks to the two previous patches, first connection generated CID is
now directly derived from the client ODCID using a hash function which
uses the client source address from the same purpose. Thus, it is now
unneeded to concatenate client address to <odcid> quic-conn member.
This change allows to simplify the quic_cid structure management and
reduce its size which is important as it is embedded several times in
various structures such as quic_conn and quic_rx_packet.
This should be backported up to 2.7.
First connection CID generation has been altered. It is now directly
derived from client ODCID since previous commit :
commit 162baaff7a
MINOR: quic: derive first DCID from client ODCID
This patch removes the ODCID tree which is now unneeded. On connection
lookup via CID, if a DCID is not found the hash derivation is performed
for an INITIAL/0-RTT packet only. In case a client has used multiple
times an ODCID, this will allow to retrieve our generated DCID in the
CID tree without storing the ODCID node.
The impact of this two combined patch is that it may improve slightly
haproxy memory footprint by removing a tree node from quic_conn
structure. The cpu calculation induced by hash derivation should only be
performed only a few times per connection as the client will start to
use our generated CID as soon as it received it.
This should be backported up to 2.7.
HTTP_MSGF_EXPECT_CHECKED is now set on the request message to know the
"Expect: " header was already handled, if any. The flag is set from the
moment we try to handle the header to send a "100-continue" response,
whether it was found or not.
This way, when we are waiting for the request payload, thanks to this flag,
we only try to handle "Expect: " header only once. Before it was performed
by changing the message state from BODY to DATA. But this has some side
effects and it is no accurate. So, it is better to rely on a flag to do so.
Now that the event handler API is pretty mature, we can expose it in
the lua API.
Introducing the core.event_sub(<event_types>, <cb>) lua function that
takes an array of event types <event_types> as well as a callback
function <cb> as argument.
The function returns a subscription <sub> on success.
Subscription <sub> allows you to manage the subscription from anywhere
in the script.
To this day only the sub->unsub method is implemented.
The following event types are currently supported:
- "SERVER_ADD": when a server is added
- "SERVER_DEL": when a server is removed from haproxy
- "SERVER_DOWN": server states goes from up to down
- "SERVER_UP": server states goes from down to up
As for the <cb> function: it will be called when one of the registered
event types occur. The function will be called with 3 arguments:
cb(<event>,<data>,<sub>)
<event>: event type (string) that triggered the function.
(could be any of the types used in <event_types> when registering
the subscription)
<data>: data associated with the event (specific to each event family).
For "SERVER_" family events, server details such as server name/id/proxy
will be provided.
If the server still exists (not yet deleted), a reference to the live
server is provided to spare you from an additionnal lookup if you need
to have direct access to the server from lua.
<sub> refers to the subscription. In case you need to manage it from
within an event handler.
(It refers to the same subscription that the one returned from
core.event_sub())
Subscriptions are per-thread: the thread that will be handling the
event is the one who performed the subscription using
core.event_sub() function.
Each thread treats events sequentially, it means that if you have,
let's say SERVER_UP, then SERVER_DOWN in a short timelapse, then your
cb function will first be called with SERVER_UP, and once you're done
handling the event, your function will be called again with SERVER_DOWN.
This is to ensure event consitency when it comes to logging / triggering
logic from lua.
Your lua cb function may yield if needed, but you're pleased to process
the event as fast as possible to prevent the event queue from growing up
To prevent abuses, if the event queue for the current subscription goes
over 100 unconsumed events, the subscription will pause itself
automatically for as long as it takes for your handler to catch up.
This would lead to events being missed, so a warning will be emitted in
the logs to inform you about that. This is not something you want to let
happen too often, it may indicate that you subscribed to an event that
is occurring too frequently or/and that your callback function is too
slow to keep up the pace and you should review it.
If you want to do some parallel processing because your callback
functions are slow: you might want to create subtasks from lua using
core.register_task() from within your callback function to perform the
heavy job in a dedicated task and allow remaining events to be processed
more quickly.
Please check the lua documentation for more information.
Adding alternative findserver() functions to be able to perform an
unique match based on name or puid and by leveraging revision id (rid)
to make sure the function won't match with a new server reusing the
same name or puid of the "potentially deleted" server we were initially
looking for.
For example, if you were in the position of finding a server based on
a given name provided to you by a different context:
Since dynamic servers were implemented, between the time the name was
picked and the time you will perform the findserver() call some dynamic
server deletion/additions could've been performed in the mean time.
In such cases, findserver() could return a new server that re-uses the
name of a previously deleted server. Depending on your needs, it could
be perfectly fine, but there are some cases where you want to lookup
the original server that was provided to you (if it still exists).
While working on event handling from lua, the need for a pause/resume
function to temporarily disable a subscription was raised.
We solve this by introducing the EHDL_SUB_F_PAUSED flag for
subscriptions.
The flag is set via _pause() and cleared via _resume(), and it is
checked prior to notifying the subscription in publish function.
Pause and Resume functions are also available for via lookups for
identified subscriptions.
If 68e692da0 ("MINOR: event_hdl: add event handler base api")
is being backported, then this commit should be backported with it.
Use event_hdl_async_equeue_size() in advanced async task handler to
get the near real-time event queue size.
By near real-time, you should understand that the queue size is not
updated during element insertion/removal, but shortly before insertion
and shortly after removal, so the size should reflect the approximate
queue size at a given time but should definitely not be used as a
unique source of truth.
If 68e692da0 ("MINOR: event_hdl: add event handler base api")
is being backported, then this commit should be backported with it.
Add event_hdl_async_equeue_isempty() to check is the event queue is
empty from an advanced async task handler.
If 68e692da0 ("MINOR: event_hdl: add event handler base api")
is being backported, then this commit should be backported with it.
advanced async mode (EVENT_HDL_ASYNC_TASK) provided full support for
custom tasklets registration.
Due to the similarities between tasks and tasklets, it may be useful
to use the advanced mode with an existing task (not a tasklet).
While the API did not explicitly disallow this usage, things would
get bad if we try to wakeup a task using tasklet_wakeup() for notifying
the task about new events.
To make the API support both custom tasks and tasklets, we use the
TASK_IS_TASKLET() macro to call the proper waking function depending
on the task's type:
- For tasklets: we use tasklet_wakeup()
- For tasks: we use task_wakeup()
If 68e692da0 ("MINOR: event_hdl: add event handler base api")
is being backported, then this commit should be backported with it.
soft-stop was not explicitly handled in event_hdl API.
Because of this, event_hdl was causing some leaks on deinit paths.
Moreover, a task responsible for handling events could require some
additional cleanups (ie: advanced async task), and as the task was not
protected against abort when soft-stopping, such cleanup could not be
performed unless the task itself implements the required protections,
which is not optimal.
Consider this new approach:
'jobs' global variable is incremented whenever an async subscription is
created to prevent the related task from being aborted before the task
acknowledges the final END event.
Once the END event is acknowledged and freed by the task, the 'jobs'
variable is decremented, and the deinit process may continue (including
the abortion of remaining tasks not guarded by the 'jobs' variable).
To do this, a new global mt_list is required: known_event_hdl_sub_list
This list tracks the known (initialized) subscription lists within the
process.
sub_lists are automatically added to the "known" list when calling
event_hdl_sub_list_init(), and are removed from the list with
event_hdl_sub_list_destroy().
This allows us to implement a global thread-safe event_hdl deinit()
function that is automatically called on soft-stop thanks to signal(0).
When event_hdl deinit() is initiated, we simply iterate against the known
subscription lists to destroy them.
event_hdl_subscribe_ptr() was slightly modified to make sure that a sub_list
may not accept new subscriptions once it is destroyed (removed from the
known list)
This can occur between the time the soft-stop is initiated (signal(0)) and
haproxy actually enters in the deinit() function (once tasks are either
finished or aborted and other threads already joined).
It is safe to destroy() the subscription list multiple times as long
as the pointer is still valid (ie: first on soft-stop when handling
the '0' signal, then from regular deinit() path): the function does
nothing if the subscription list is already removed.
We partially reverted "BUG/MINOR: event_hdl: make event_hdl_subscribe thread-safe"
since we can use parent mt_list locking instead of a dedicated lock to make
the check gainst duplicate subscription ID.
(insert_lock is not useful anymore)
The check in itself is not changed, only the locking method.
sizeof(event_hdl_sub_list) slightly increases: from 24 bits to 32bits due
to the additional mt_list struct within it.
With that said, having thread-safe list to store known subscription lists
is a good thing: it could help to implement additional management
logic for subcription lists and could be useful to add some stats or
debugging tools in the future.
If 68e692da0 ("MINOR: event_hdl: add event handler base api")
is being backported, then this commit should be backported with it.
event_hdl_sub_list_init() and event_hdl_sub_list_destroy() don't expect
to be called with a NULL argument (to use global subscription list
implicitly), simply because the global subscription list init and
destroy is internally managed.
Adding BUG_ON() to detect such invalid usages, and updating some comments
to prevent confusion around these functions.
If 68e692da0 ("MINOR: event_hdl: add event handler base api")
is being backported, then this commit should be backported with it.
List insertion in event_hdl_subscribe() was not thread-safe when dealing
with unique identifiers. Indeed, in this case the list insertion is
conditional (we check for a duplicate, then we insert). And while we're
using mt lists for this, the whole operation is not atomic: there is a
race between the check and the insertion.
This could lead to the same ID being registered multiple times with
concurrent calls to event_hdl_subscribe() on the same ID.
To fix this, we add 'insert_lock' dedicated lock in the subscription
list struct. The lock's cost is nearly 0 since it is only used when
registering identified subscriptions and the lock window is very short:
we only guard the duplicate check and the list insertion to make the
conditional insertion "atomic" within a given subscription list.
This is the only place where we need the lock: as soon as the item is
properly inserted we're out of trouble because all other operations on
the list are already thread-safe thanks to mt lists.
A new lock hint is introduced: LOCK_EHDL which is dedicated to event_hdl
The patch may seem quite large since we had to rework the logic around
the subscribe function and switch from simple mt_list to a dedicated
struct wrapping both the mt_list and the insert_lock for the
event_hdl_sub_list type.
(sizeof(event_hdl_sub_list) is now 24 instead of 16)
However, all the changes are internal: we don't break the API.
If 68e692da0 ("MINOR: event_hdl: add event handler base api")
is being backported, then this commit should be backported with it.
rid is stored as a uint32_t within struct server, but it was stored as
a signed int within the server event data struct.
Switching from signed int to uint32_t in event_hdl_cb_data_server struct
to make sure it won't overflow.
If 129ecf441 ("MINOR: server/event_hdl: add support for SERVER_ADD and SERVER_DEL events")
is being backported, then this commit should be backported with it.
This patch proposes to enumerate servers using internal HAProxy list.
Also, remove the flag SRV_F_NON_PURGEABLE which makes the server non
purgeable each time Lua uses the server.
Removing reg-tests/cli_delete_server_lua.vtc since this test is no
longer relevant (we don't set the SRV_F_NON_PURGEABLE flag anymore)
and we already have a more generic test:
reg-tests/server/cli_delete_server.vtc
Co-authored-by: Aurelien DARRAGON <adarragon@haproxy.com>
When HAproxy is loaded with a lot of frontends/backends (tested with 300k),
it is slow to start and it uses a lot of memory just for indexing backends
in the lua tables.
This patch uses the internal frontend/backend index of HAProxy in place of
lua table.
HAProxy startup is now quicker as each frontend/backend object is created
on demand and not at init.
This has to come with some cost: the execution of Lua will be a little bit
slower.
Two lua init function seems to return something useful, but it
is not the case. The function "hlua_concat_init" seems to return
a failure status, but the function never fails. The function
"hlua_fcn_reg_core_fcn" seems to return a number of elements in
the stack, but it is not the case.
We recently discovered a bug which affects dynamic server deletion:
When a server is deleted, it is removed from the "visible" server list.
But as we've seen in previous commit
("MINOR: server: add SRV_F_DELETED flag"), it can still be accessed by
someone who keeps a reference on it (waiting for the final srv_drop()).
Throughout this transient state, server ptr is still valid (may be
dereferenced) and the flag SRV_F_DELETED is set.
However, as the server is not part of server list anymore, we have
an issue: srv->next pointer won't be updated anymore as the only place
where we perform such update is in cli_parse_delete_server() by
iterating over the "visible" server list.
Because of this, we cannot guarantee that a server with the
SRV_F_DELETED flag has a valid 'next' ptr: 'next' could be pointing
to a fully removed (already freed) server.
This problem can be easily demonstrated with server dumping in
the stats:
server list dumping is performed in stats_dump_proxy_to_buffer()
The function can be interrupted and resumed later by design.
ie: output buffer is full: partial dump and finish the dump after
the flush
This is implemented by calling srv_take() on the server being dumped,
and only releasing it when we're done with it using srv_drop().
(drop can be delayed after function resume if buffer is full)
While the function design seems OK, it works with the assumption that
srv->next will still be valid after the function resumes, which is
not true. (especially if multiple servers are being removed in between
the 2 dumping attempts)
In practice, this did not cause any crash yet (at least this was not
reported so far), because server dumping is so fast that it is very
unlikely that multiple server deletions make their way between 2
dumping attempts in most setups. But still, this is a problem that we
need to address because some upcoming work might depend on this
assumption as well and for the moment it is not safe at all.
========================================================================
Here is a quick reproducer:
With this patch, we're creating a large deletion window of 3s as soon
as we reach a server named "t2" while iterating over the list.
This will give us plenty of time to perform multiple deletions before
the function is resumed.
| diff --git a/src/stats.c b/src/stats.c
| index 84a4f9b6e..15e49b4cd 100644
| --- a/src/stats.c
| +++ b/src/stats.c
| @@ -3189,11 +3189,24 @@ int stats_dump_proxy_to_buffer(struct stconn *sc, struct htx *htx,
| * Temporarily increment its refcount to prevent its
| * anticipated cleaning. Call free_server to release it.
| */
| + struct server *orig = ctx->obj2;
| for (; ctx->obj2 != NULL;
| ctx->obj2 = srv_drop(sv)) {
|
| sv = ctx->obj2;
| + printf("sv = %s\n", sv->id);
| srv_take(sv);
| + if (!strcmp("t2", sv->id) && orig == px->srv) {
| + printf("deletion window: 3s\n");
| + thread_idle_now();
| + thread_harmless_now();
| + sleep(3);
| + thread_harmless_end();
| +
| + thread_idle_end();
| +
| + goto full; /* simulate full buffer */
| + }
|
| if (htx) {
| if (htx_almost_full(htx))
| @@ -4353,6 +4366,7 @@ static void http_stats_io_handler(struct appctx *appctx)
| struct channel *res = sc_ic(sc);
| struct htx *req_htx, *res_htx;
|
| + printf("http dump\n");
| /* only proxy stats are available via http */
| ctx->domain = STATS_DOMAIN_PROXY;
|
Ok, we're ready, now we start haproxy with the following conf:
global
stats socket /tmp/ha.sock mode 660 level admin expose-fd listeners thread 1-1
nbthread 2
frontend stats
mode http
bind *:8081 thread 2-2
stats enable
stats uri /
backend farm
server t1 127.0.0.1:1899 disabled
server t2 127.0.0.1:18999 disabled
server t3 127.0.0.1:18998 disabled
server t4 127.0.0.1:18997 disabled
And finally, we execute the following script:
curl localhost:8081/stats&
sleep .2
echo "del server farm/t2" | nc -U /tmp/ha.sock
echo "del server farm/t3" | nc -U /tmp/ha.sock
This should be enough to reveal the issue, I easily manage to
consistently crash haproxy with the following reproducer:
http dump
sv = t1
http dump
sv = t1
sv = t2
deletion window = 3s
[NOTICE] (2940566) : Server deleted.
[NOTICE] (2940566) : Server deleted.
http dump
sv = t2
sv = �����U
[1] 2940566 segmentation fault (core dumped) ./haproxy -f ttt.conf
========================================================================
To fix this, we add prev_deleted mt_list in server struct.
For a given "visible" server, this list will contain the pending
"deleted" servers references that point to it using their 'next' ptr.
This way, whenever this "visible" server is going to be deleted via
cli_parse_delete_server() it will check for servers in its
'prev_deleted' list and update their 'next' pointer so that they no
longer point to it, and then it will push them in its
'next->prev_deleted' list to transfer the update responsibility to the
next 'visible' server (if next != NULL).
Then, following the same logic, the server about to be removed in
cli_parse_delete_server() will push itself as well into its
'next->prev_deleted' list (if next != NULL) so that it may still use its
'next' ptr for the time it is in transient removal state.
In srv_drop(), right before the server is finally freed, we make sure
to remove it from the 'next->prev_deleted' list so that 'next' won't
try to perform the pointers update for this server anymore.
This has to be done atomically to prevent 'next' srv from accessing a
purged server.
As a result:
for a valid server, either deleted or not, 'next' ptr will always
point to a non deleted (ie: visible) server.
With the proposed fix, and several removal combinations (including
unordered cli_parse_delete_server() and srv_drop() calls), I cannot
reproduce the crash anymore.
Example tricky removal sequence that is now properly handled:
sv list: t1,t2,t3,t4,t5,t6
ops:
take(t2)
del(t4)
del(t3)
del(t5)
drop(t3)
drop(t4)
drop(t5)
drop(t2)
Set the SRV_F_DELETED flag when server is removed from the cli.
When removing a server from the cli (in cli_parse_delete_server()),
we update the "visible" server list so that the removed server is no
longer part of the list.
However, despite the server being removed from "visible" server list,
one could still access the server data from a valid ptr (ie: srv_take())
Deleted flag helps detecting when a server is in transient removal
state: that is, removed from the list, thus not visible but not yet
purged from memory.
The purpose of this patch is only a one-to-one replacement, as far as
possible.
CF_SHUTR(_NOW) and CF_SHUTW(_NOW) flags are now carried by the
stream-connecter. CF_ prefix is replaced by SC_FL_ one. Of course, it is not
so simple because at many places, we were testing if a channel was shut for
reads and writes in same time. To do the same, shut for reads must be tested
on one side on the SC and shut for writes on the other side on the opposite
SC. A special care was taken with process_stream(). flags of SCs must be
saved to be able to detect changes, just like for the channels.
If OPENSSL_NO_DEPRECATED is set, we get a 'error: ‘RSA_PKCS1_PADDING’
undeclared' when building jwt.c. The symbol is not deprecated, we are
just missing an include.
This was raised in GitHub issue #2098.
It does not need to be backported.
This very old bug is there since the first implementation of newreno congestion
algorithm implementation. This was a very bad idea to put a state variable
into quic_cc_algo struct which only defines the congestion control algorithm used
by a QUIC listener, typically its type and its callbacks.
This bug could lead to crashes since BUG_ON() calls have been added to each algorithm
implementation. This was revealed by interop test, but not very often as there was
not very often several connections run at the time during these tests.
Hopefully this was also reported by Tristan in GH #2095.
Move the congestion algorithm state to the correct structures which are private
to a connection (see cubic and nr structs).
Must be backported to 2.7 and 2.6.
This algorithm does nothing except initializing the congestion control window
to a fixed value. Very smart!
Modify the QUIC congestion control configuration parser to support this new
algorithm. The congestion control algorithm must be set as follows:
quic-cc-algo nocc-<cc window size(KB))
For instance if "nocc-15" is provided as quic-cc-algo keyword value, this
will set a fixed window of 15KB.
Reuse the idle timeout task to delay the acknowledgments. The time of the
idle timer expiration is for now on stored in ->idle_expire. The one to
trigger the acknowledgements is stored in ->ack_expire.
Add QUIC_FL_CONN_ACK_TIMER_FIRED new connection flag to mark a connection
as having its acknowledgement timer been triggered.
Modify qc_may_build_pkt() to prevent the sending of "ack only" packets and
allows the connection to send packet when the ack timer has fired.
It is possible that acks are sent before the ack timer has triggered. In
this case it is cancelled only if ACK frames are really sent.
The idle timer expiration must be set again when the ack timer has been
triggered or when it is cancelled.
Must be backported to 2.7.
Dump variables displayed by TRACE_ENTER() or TRACE_LEAVE() by calls to TRACE_PROTO().
No more variables are displayed by the two former macros. For now on, these information
are accessible from proto level.
Add new calls to TRACE_PROTO() at important locations in relation whith QUIC transport
protocol.
When relevant, try to prefix such traces with TX or RX keyword to identify the
concerned subpart (transmission or reception) of the protocol.
Must be backported to 2.7.
As now_ms may wrap, one must use the ticks API to protect the cubic congestion
control algorithm implementation from side effects due to this.
Furthermore to make the cubic congestion control algorithm more readable and easy
to maintain, adding a new state ("in recovery period" QUIC_CC_ST_RP new enum) helps
in reaching this goal. Implement quic_cc_cubic_rp_cb() which is the callback for
this new state.
Must be backported to 2.7 and 2.6.
When a proxy enters the STOPPED state, it will no longer accept new
connections.
However, it doesn't mean that it's completely inactive yet: it will
still be able to handle already pending / keep-alive connections,
thus finishing ongoing work before effectively stopping.
be_usable_srv(), which is used by nbsrv converter and sample fetch,
will return 0 if the proxy is either stopped or disabled.
nbsrv behaves this way since it was originally implemented in b7e7c4720
("MINOR: Add nbsrv sample converter").
(Since then, multiple refactors were performed around this area, but
the current implementation still follows the same logic)
It was found that if nbsrv is used in a proxy section to perform routing
logic, unexpected decisions are being made when nbsrv is used on a proxy
with STOPPED state, since in-flight requests will suffer from nbsrv
returning 0 instead of the current number of usable servers which may
still process existing connections.
For instance, this can happen during process soft-stop, or even when
stopping the proxy from the cli / lua.
To fix this: we now make sure be_usable_srv() always returns the
current number of usable servers, unless the proxy is explicitly
disabled (from the config, not at runtime)
This could be backported up to 2.6.
For older versions, the need for a backport should be evaluated first.
--
Note for 2.4: proxy flags did not exist, it was implemented with fd10ab5e
("MINOR: proxy: Introduce proxy flags to replace disabled bitfield")
For 2.2: STOPPED and DISABLED states were not separated, so we have no
easy way to apply the fix anyway.
This commit adds a new argument to smp_fetch_url_param
that makes the parameter key comparison case-insensitive.
Several levels of callers were modified to pass this info.
Since eb77824 ("MEDIUM: proxy: remove the deprecated "grace" keyword"),
stop_time is never set, so the related code in manage_proxy() is not
relevant anymore.
Removing code that refers to p->stop_time, since it was probably
overlooked.
This patch follows this commit which was not sufficient:
BUG/MINOR: quic: Missing STREAM frame data pointer updates
Indeed, after updating the ->offset field, the bit which informs the
frame builder of its presence must be systematically set.
This bug was revealed by the following BUG_ON() from
quic_build_stream_frame() :
bug condition "!!(frm->type & 0x04) != !!stream->offset.key" matched at src/quic_frame.c:515
This should fix the last crash occured on github issue #2074.
Must be backported to 2.6 and 2.7.
Instead of reporting the inaccurate "malloc_trim() support" on -vv, let's
report the case where the memory allocator was actively replaced from the
one used at build time, as this is the corner case we want to be cautious
about. We also put a tainted bit when this happens so that it's possible
to detect it at run time (e.g. the user might have inherited it from an
environment variable during a reload operation).
The now unused is_trim_enabled() function was finally dropped.
As reported by Miroslav in commit d8a97d8f6 ("BUG/MINOR: illegal use of
the malloc_trim() function if jemalloc is used") there are still occasional
cases where it's discovered that malloc_trim() is being used without its
suitability being checked first. This is a problem when using another
incompatible allocator. But there's a class of use cases we'll never be
able to cover, it's dynamic libraries loaded from Lua. In order to address
this more reliably, we now define our own malloc_trim() that calls the
previous one after checking that the feature is supported and that the
allocator is the expected one. This way child libraries that would call
it will also be safe.
The function is intentionally left defined all the time so that it will
be possible to clean up some code that uses it by removing ifdefs.
In the event that HAProxy is linked with the jemalloc library, it is still
shown that malloc_trim() is enabled when executing "haproxy -vv":
..
Support for malloc_trim() is enabled.
..
It's not so much a problem as it is that malloc_trim() is called in the
pat_ref_purge_range() function without any checking.
This was solved by setting the using_default_allocator variable to the
correct value in the detect_allocator() function and before calling
malloc_trim() it is checked whether the function should be called.
Building without thread support was broken in 2.8-dev2 with commit
7e70bfc8c ("MINOR: threads: add a thread_harmless_end() version that
doesn't wait") that forgot to define the function for the threadless
cases. No backport is needed.
b_alloc() is used to allocate a buffer. We can provoke fault injection
based on forced memory allocation failures using -dMfail on the command
line, but we know that the buffer_wait list is a bit weak and doesn't
always recover well. As such, submitting buffer allocation to such a
treatment seriously limits the usefulness of -dMfail which cannot really
be used for other purposes. Let's just disable it for buffers for now.