peers-t.h uses "struct stktable" as well as STKTABLE_DATA_TYPES which
are defined in stick-table-t.h. It works by accident because
stick-table-t.h was always included before. But could provoke build
issue with EXTRA code.
To be backported as far as 2.2.
Add a new value in stats ctx: field.
Implement field support in line dumping parent functions
stats_print_proxy_field_json() and stats_dump_proxy_to_buffer().
This will allow child dumping functions to support partial line dumping
when needed. ie: when dumping buffer is exhausted: do a partial send and
wait for a new buffer to finish the dump. Thanks to field ctx, the function
can start dumping where it left off on previous (unterminated) invokation.
Extract function h2_parse_cont_len_header() in the generic HTTP module.
This allows to reuse it for all HTTP/x parsers. The function is now
available as http_parse_cont_len_header().
Most notably, this will be reused in the next bugfix for the H3 parser.
This is necessary to check that content-length header match the length
of DATA frames.
Thus, it must be backported to 2.6.
qc_new_conn() is used to allocate a quic_conn instance and its various
internal members. If one allocation fails, quic_conn_release() is used
to cleanup things.
For the moment, pool_zalloc() is used which ensures that all content is
null. However, some members must be initialized to a special values
to be able to use quic_conn_release() safely. This is the case for
quic_conn lists and its tasklet.
Also, some quic_conn internal allocation functions were doing their own
cleanup on failure without reset to NULL. This caused an issue with
quic_conn_release() which also frees this members. To fix this, these
functions now only return an error without cleanup. It is the caller
responsibility to free the allocated content, which is done via
quic_conn_release().
Without this patch, allocation failure in qc_new_conn() would often
result in segfault. This was reproduced easily using fail-alloc at 10%.
This should be backported up to 2.6.
This patch introduces haproxy_backend_agg_check_status metric
as we wanted in 42d7c402d but with the right data source.
This patch could be backported as far as 2.4.
haproxy_backend_agg_server_check_status currently aggregates
haproxy_server_status instead of haproxy_server_check_status.
We deprecate this and create a new one,
haproxy_backend_agg_server_status to clarify what it really
does.
This patch could be backported as far as 2.4.
Since the massive pools cleanup that happened in 2.6, the pools
architecture was made quite more hierarchical and many alternate code
blocks could be moved to runtime flags set by -dM. One of them had not
been converted by then, DEBUG_UAF. It's not much more difficult actually,
since it only acts on a pair of functions indirection on the slow path
(OS-level allocator) and a default setting for the cache activation.
This patch adds the "uaf" setting to the options permitted in -dM so
that it now becomes possible to set or unset UAF at boot time without
recompiling. This is particularly convenient, because every 3 months on
average, developers ask a user to recompile haproxy with DEBUG_UAF to
understand a bug. Now it will not be needed anymore, instead the user
will only have to disable pools and enable uaf using -dMuaf. Note that
-dMuaf only disables previously enabled pools, but it remains possible
to re-enable caching by specifying the cache after, like -dMuaf,cache.
A few tests with this mode show that it can be an interesting combination
which catches significantly less UAF but will do so with much less
overhead, so it might be compatible with some high-traffic deployments.
The change is very small and isolated. It could be helpful to backport
this at least to 2.7 once confirmed not to cause build issues on exotic
systems, and even to 2.6 a bit later as this has proven to be useful
over time, and could be even more if it did not require a rebuild. If
a backport is desired, the following patches are needed as well:
CLEANUP: pools: move the write before free to the uaf-only function
CLEANUP: pool: only include pool-os from pool.c not pool.h
REORG: pool: move all the OS specific code to pool-os.h
CLEANUP: pools: get rid of CONFIG_HAP_POOLS
DEBUG: pool: show a few examples in -dMhelp
This one was set in defaults.h only when neither DEBUG_NO_POOLS nor
DEBUG_UAF were set. This was not the most convenient location to look
for it, and it was only used in pool.c to decide on the initial value
of POOL_DBG_NO_CACHE.
Let's just use DEBUG_NO_POOLS || DEBUG_UAF directly on this flag and
get rid of the intermediary condition. This also has the benefit of
removing a double inversion, which is always nice for understanding.
Till now pool-os used to contain a mapping from pool_{alloc,free}_area()
to pool_{alloc,free}_area_uaf() in case of DEBUG_UAF, or the regular
malloc-based function. And the *_uaf() functions were in pool.c. But
since 2.4 with the first cleanup of the pools, there has been no more
calls to pool_{alloc,free}_area() from anywhere but pool.c, from exactly
one place each. As such, there's no more need to keep *_uaf() apart in
pool.c, we can inline it into pool-os.h and leave all the OS stuff there,
with pool.c calling either based on DEBUG_UAF. This is cleaner with less
round trips between both files and easier to find.
There's no need for the low-level pool functions to be known from all
callers anymore, they're only used by pool.c. Let's reduce the amount
of header files processed.
We get a build error in ncbuf.c when building for ARMv8.2-a because ncbuf
has minimal includes and among them bug.h which includes atomic.h. Atomic.h
may use "forceinline" without including compiler.h, hence the build error.
It was verified that adding it doesn't inflate the total headers.
Since all other C files include api.h which already covers this, there's
no real need to bapkport this. The issue was already there in 2.3 though.
With previous commit, 9e080bf ("BUG/MINOR: checks: make sure fastinter is used
even on forced transitions"), on-error mark-down|sudden-death|fail-check are
now working as expected.
However, on-error fastinter remains broken because srv_getinter(), used in
the above commit to check the expiration date, won't return fastinter interval
if server health is maxed out (which is the case with on-error fastinter mode).
To fix this, we introduce a check flag named CHK_ST_FASTINTER.
This flag is set when on-error is triggered. This way we can force
srv_getinter() to return fastinter interval whenever the flag is set.
The flag is automatically cleared as soon as the new check task expiry is
recalculated in process_chk_conn().
This restores original behavior prior to d114f4a ("MEDIUM: checks: spread the
checks load over random threads").
It must be backported to 2.7 along with the aforementioned commits.
We're using srv_update_status() as the only event source or UP/DOWN server
events in an attempt to simplify the support for these 2 events.
It seems srv_update_status() is the common path for server state changes anyway
Tested with server state updated from various sources:
- the cli
- server-state file (maybe we could disable this or at least don't publish
in global event queue in the future if it ends in slower startup for setups
relying on huge server state files)
- dns records (ie: srv template)
(again, could be fined tuned to only publish in server specific subscriber
list and no longer in global subscription list if mass dns update tend to
slow down srv_update_status())
- normal checks and observe checks (HCHK_STATUS_HANA)
(same as above, if checks related state update storms are expected)
- lua scripts
- html stats page (admin mode)
Basic support for ADD and DEL server events are added through this commit:
SERVER_ADD is published on dynamic server addition through cli.
SERVER_DEL is published on dynamic server deletion through cli.
This work depends on:
"MINOR: event_hdl: add event handler base api"
"MINOR: server: add srv->rid (revision id) value"
Make use of the new srv->rid value in stats.
Stat is referred as ST_F_SRID, it is now used in stats_fill_sv_stats
function in order to be included in csv and json stats dumps.
Moreover, "rid: $value" will be displayed next to server puid
in html stats page if "stats show-legend" is specified in the stats frontend.
(mouse hovering tooltip)
Depends on the following commit:
"MINOR: server: add srv->rid (revision id) value"
With current design, we could not distinguish between
previously existing deleted server and a new server reusing
the deleted server name/id.
This can cause some confusion when auditing stats/events/logs,
because the new server will look similar to the old
one.
To address this, we're adding a new value in server structure: rid
rid (revision id) value is an unsigned 32bits value that is set upon
server creation. Value is derived from a global counter that starts
at 0 and is incremented each time one or multiple server deletions are
followed by a server addition (meaning that old name/id reuse could occur).
Thanks to this revision id, it is now easy to tell whether the server
we're looking at is the same as before or if it has been deleted and
re-added in the meantime.
(combining server name/id + server revision id yields a process-wide unique
identifier)
UDP addresses may change over time for a QUIC connection. When using
quic-conn owned socket, we have to detect address change to break the
bind/connect association on the socket.
For the moment, on change detected, QUIC connection socket is closed and
a new one is opened. In the future, we may improve this by trying to
keep the original socket and reexecute only bind/connect syscalls.
This change is part of quic-conn owned socket implementation.
It may be backported to 2.7 after a period of observation.
This change is the second part for reception on QUIC connection socket.
All operations inside the FD handler has been delayed to quic-conn
tasklet via the new function qc_rcv_buf().
With this change, buffer management on reception has been simplified. It
is now possible to use a local buffer inside qc_rcv_buf() instead of
quic_receiver_buf().
This change is part of quic-conn owned socket implementation.
It may be backported to 2.7 after a period of observation.
Try to use the quic-conn socket for reception if it is allocated. For
this, the socket is inserted in the fdtab. This will call the new
handler quic_conn_io_cb() which is responsible to process the recv()
system call. It will reuse datagram dispatch for simplicity. However,
this is guaranteed to be called on the quic-conn thread, so it will be
more efficient to use a dedicated buffer. This will be implemented in
another commit.
This patch should improve performance by reducing contention on the
receiver socket. However, more gain can be obtained when the datagram
dispatch operation will be skipped.
Older quic_sock_fd_iocb() is renamed to quic_lstnr_sock_fd_iocb() to
emphasize its usage for the receiver socket.
This change is part of quic-conn owned socket implementation.
It may be backported to 2.7 after a period of observation.
Allocate quic-conn owned socket if possible. This requires that this is
activated in haproxy configuration. Also, this is done only if local
address is known so it depends on the support of IP_PKTINFO.
For the moment this socket is not used. This causes QUIC support to be
broken as received datagram are not read. This commit will be completed
by a following patch to support recv operation on the newly allocated
socket.
This change is part of quic-conn owned socket implementation.
It may be backported to 2.7 after a period of observation.
To be able to use individual sockets for QUIC connections, we rely on
the OS network stack which must support UDP sockets binding on the same
local address.
Add a detection code for this feature executed on startup. When the
first QUIC listener socket is binded, a test socket is created and
binded on the same address. If the bind call fails, we consider that
it's impossible to use individual socket for QUIC connections.
A new global option GTUNE_QUIC_SOCK_PER_CONN is defined. If startup
detect fails, this value is resetted from global options. For the
moment, there is no code to activate the option : this will be in a
follow-up patch with the introduction of a new configuration option.
This change is part of quic-conn owned socket implementation.
It may be backported to 2.7 after a period of observation.
Detect connection migration attempted by the client. This is done by
comparing addresses stored in quic-conn with src/dest addresses of the
UDP datagram.
A new function qc_handle_conn_migration() has been added. For the
moment, no operation is conducted and the function will be completed
during connection migration implementation. The only notable things is
the increment of a new counter "quic_conn_migration_done".
This should be backported up to 2.7.
Complete ipcmp() function with a new argument <check_port>. If this
argument is true, the function will compare port values besides IP
addresses and return true only if both are identical.
This commit will simplify QUIC connection migration detection. As such,
it should be backported to 2.7.
Extract individual datagram parsing code outside of datagrams list loop
in quic_lstnr_dghdlr(). This is moved in a new function named
quic_dgram_parse().
To complete this change, quic_lstnr_dghdlr() has been moved into
quic_sock source file : it belongs to QUIC socket lower layer and is
directly called by quic_sock_fd_iocb().
This commit will ease implementation of quic-conn owned socket.
New function quic_dgram_parse() will be easily usable after a receive
operation done on quic-conn IO-cb.
This should be backported up to 2.7.
quic_rx_packet struct had a reference to the quic_conn instance. This is
useless as qc instance is always passed through function argument. In
fact, pkt.qc is used only in qc_pkt_decrypt() on key update, even though
qc is also passed as argument.
Simplify this by removing qc field from quic_rx_packet structure
definition. Also clean up qc_pkt_decrypt() documentation and interface
to align it with other quic-conn related functions.
This should be backported up to 2.7.
Rename the structure "cert_key_and_chain" to "ckch_data" in order to
avoid confusion with the store whcih often called "ckchs".
The "cert_key_and_chain *ckch" were renamed "ckch_data *data", so we now
have store->data instead of ckchs->ckch.
Marked medium because it changes the API.
Adding base code to provide subscribe/publish API for internal
events processing.
event_hdl provides two complementary APIs, both are implemented
in src/event_hdl.c and include/haproxy/event_hdl{-t.h,.h}:
One API targeting developers that want to register event handlers
that will be notified on specific events.
(SUBSCRIBE)
One API targeting developers that want to notify registered handlers
about an event.
(PUBLISH)
This feature is being considered to address the following scenarios:
- mailers code refactoring (getting rid of deprecated
tcp-check ruleset implementation)
- server events from lua code (registering user defined
lua function that is executed with relevant data when a
server is dynamically added/removed or on server state change)
- providing a stable and easy to use API for upcoming
developments that rely on specific events to perform actions.
(e.g: ressource cleanup when a server is deleted from haproxy)
At this time though, we don't have much use cases in mind in addition to
server events handling, but the API is aimed at being multipurpose
so that new event families, with their own particularities, can be
easily implemented afterwards (and hopefully) without requiring breaking
changes to the API.
Moreover, you should know that the API was not designed to cope well
with high rate event publishing.
Mostly because publishing means iterating over unsorted subscriber list.
So it won't scale well as subscriber list increases, but it is intended in
order to keep the code simple and versatile.
Instead, it is assumed that events implemented using this API
should be periodic events, and that events related to critical
io/networking processing should be handled using
dedicated facilities anyway.
(After all, this is meant to be a general purpose event API)
Apart from being easily extensible, one of the main goals of this API is
to make subscriber code as simple and safe as possible.
This is done by offering multiple event handling modes:
- SYNC mode:
publishing code directly
leverages handler code (callback function)
and handler code has a direct access to "live" event data
(pointers mostly, alongside with lock hints/context
so that accessing data pointers can be done properly)
- normal ASYNC mode:
handler is executed in a backward compatible way with sync mode,
so that it is easy to switch from and to SYNC/ASYNC mode.
Only here the handler has access to "offline" event data, and
not "live" data (ptrs) so that data consistency is guaranteed.
By offline, you should understand "snapshot" of relevant data
at the time of the event, so that the handler can consume it
later (even if associated ressource is not valid anymore)
- advanced ASYNC mode
same as normal ASYNC mode, but here handler is not a function
that is executed with event data passed as argument: handler is a
user defined tasklet that is notified when event occurs.
The tasklet may consume pending events and associated data
through its own message queue.
ASYNC mode should be considered first if you don't rely on live event
data and you wan't to make sure that your code has the lowest impact
possible on publisher code. (ie: you don't want to break stuff)
Internal API documentation will follow:
You will find more details about the notions we roughly approached here.
WolfSSL does not implement the TLS1_3_CK_AES_128_CCM_SHA256 cipher as
well as the SSL_ERROR_WANT_ASYNC, SSL_ERROR_WANT_ASYNC_JOB and
SSL_ERROR_WANT_CLIENT_HELLO_CB error codes.
This patch disables them for WolfSSL.
Signed-off-by: William Lallemand <wlallemand@haproxy.org>
The function used to calculate the shard number currently requires a
stktable_key on input for this. Unfortunately, it happens that peers
currently miss this calculation and they do not provide stktable_key
at all, instead they're open-coding all the low-level stick-table work
(hence why it's missing). Thus we'll need to be able to calculate the
shard number in keys coming from peers as well but the current API does
not make it possible.
This commit addresses this by inverting the order where the length and
the shard number are used. Now the low-level function is independent on
stksess and stktable_key, it takes a table, pointer and length and does
all the job. The upper function takes care of the type and key to get
the its length, and is for use only from stick-table code.
This doesn't change anything except that the low-level one will be usable
from outside (hence why it's exported now).
ncbuf API relies on lot of small functions. Mark these functions as
inline to reduce call invocations and facilitate compiler optimizations
to reduce code size.
This should be backported up to 2.6.
Instead of using memcpy() to concatenate the table's name to the key when
allocating an stksess, let's compute once for all a per-table seed at boot
time and use it to calculate the key's hash. This saves two memcpy() and
the usage of a chunk, it's always nice in a fast path.
When tested under extreme conditions with a 80-byte long table name, it
showed a 1% performance increase.
With an OpenSSL library which use the wrong OPENSSLDIR, HAProxy tries to
load the OPENSSLDIR/certs/ into @system-ca, but emits a warning when it
can't.
This patch fixes the issue by allowing to shut the error when the SSL
configuration for the httpclient is not explicit.
Must be backported in 2.6.
This adds a USE_OPENSSL_WOLFSSL option, wolfSSL must be used with the
OpenSSL compatibility layer. This must be used with USE_OPENSSL=1.
WolfSSL build options:
./configure --prefix=/opt/wolfssl --enable-haproxy
HAProxy build options:
USE_OPENSSL=1 USE_OPENSSL_WOLFSSL=1 WOLFSSL_INC=/opt/wolfssl/include/ WOLFSSL_LIB=/opt/wolfssl/lib/ ADDLIB='-Wl,-rpath=/opt/wolfssl/lib'
Using at least the commit 54466b6 ("Merge pull request #5810 from
Uriah-wolfSSL/haproxy-integration") from WolfSSL. (2022-11-23).
This is still to be improved, reg-tests are not supported yet, and more
tests are to be done.
Signed-off-by: William Lallemand <wlallemand@haproxy.org>
A number of internal flags started to be exposed to external programs
at the location of their definition since commit 77acaf5af ("MINOR:
flags: add a new file to host flag dumping macros"). This allowed the
"flags" utility to decode many more of them and always correctly. The
condition to expose them was to rely on the preliminary definition of
EOF that indicates that stdio is already included. But this was a
wrong approach. It only guarantees that snprintf() can safely be used
but still causes large functions to be built. But stdio is often
included before some of these includes, so these heavy inline functions
actually have to be compiled in many cases. The result is that the
build time significantly increased, especially with fast compilers
like gcc -O0 which took +50% or TCC which took +100%!
This patch addresses the problem by instead relying on an explicit
macro HA_EXPOSE_FLAGS that the calling program must explicitly define
before including these files. flags.c does this and that's all. The
previous build time is now restored with a speed up of 20 to 50%
depending on the build options.
These includes brought by commit 9c76637ff ("MINOR: anon: add new macros
and functions to anonymize contents") resulted in an increase of exactly
20% of the number of lines to build. These include are not needed there,
only tools.c needs xxhash.h.
Building with TCC caused a warning on __attribute__() being redefined,
because we do define it on compilers that don't have it, but we didn't
include the compiler's definitions first to leave it a chance to expose
its definitions. The correct way to do this would be to include
sys/cdefs.h but we currently don't include it explicitly and a few
reports on the net mention some platforms where it could be missing
by default. Let's use inttypes.h instead, it always causes it (or its
equivalent) to be included and we know it's present on supported
platforms since we already depend on it.
No backport is needed.
The issue addressed by commit fbb934da9 ("BUG/MEDIUM: stick-table: fix
a race condition when updating the expiration task") is still present
when thread groups are enabled, but this time it lies in the scheduler.
What happens is that a task configured to run anywhere might already
have been queued into one group's wait queue. When updating a stick
table entry, sometimes the task will have to be dequeued and requeued.
For this a lock is taken on the current thread group's wait queue lock,
but while this is necessary for the queuing, it's not sufficient for
dequeuing since another thread might be in the process of expiring this
task under its own group's lock which is different. This is easy to test
using 3 stick tables with 1ms expiration, 3 track-sc rules and 4 thread
groups. The process crashes almost instantly under heavy traffic.
One approach could consist in storing the group number the task was
queued under in its descriptor (we don't need 32 bits to store the
thread id, it's possible to use one short for the tid and another
one for the tgrp). Sadly, no safe way to do this was figured, because
the race remains at the moment the thread group number is checked, as
it might be in the process of being changed by another thread. It seems
that a working approach could consist in always having it associated
with one group, and only allowing to change it under this group's lock,
so that any code trying to change it would have to iterately read it
and lock its group until the value matches, confirming it really holds
the correct lock. But this seems a bit complicated, particularly with
wait_expired_tasks() which already uses upgradable locks to switch from
read state to a write state.
Given that the shared tasks are not that common (stick-table expirations,
rate-limited listeners, maybe resolvers), it doesn't seem worth the extra
complexity for now. This patch takes a simpler and safer approach
consisting in switching back to a single wq_lock, but still keeping
separate wait queues. Given that shared wait queues are almost always
empty and that otherwise they're scanned under a read lock, the
contention remains manageable and most of the time the lock doesn't
even need to be taken since such tasks are not present in a group's
queue. In essence, this patch reverts half of the aforementionned
patch. This was tested and confirmed to work fine, without observing
any performance degradation under any workload. The performance with
8 groups on an EPYC 74F3 and 3 tables remains twice the one of a
single group, with the contention remaining on the table's lock first.
No backport is needed.
In order to evenly pick idle connections from other threads, there is
a "next_takeover" index in the server, that is incremented each time
a connection is picked from another thread, and indicates which one to
start from next time.
With thread groups this doesn't work well because the index is the same
regardless of the group, and if a group has more threads than another,
there's even a risk to reintroduce an imbalance.
This patch introduces a new per-tgroup storage in servers which, for now,
only contains an instance of this next_takeover index. This way each
thread will now only manipulate the index specific to its own group, and
the takeover will become fair again. More entries may come soon.
In 2.2, some idle conns usage metrics were added by commit cf612a045
("MINOR: servers: Add a counter for the number of currently used
connections."), which mentioned that the operation doesn't need to be
atomic since we're not seeking exact values. This is true but at least
we should use atomic stores to make sure not to cause invalid values
to appear on archs that wouldn't guarantee atomicity when writing an
int, such as writing two 16-bit words. This is pretty unlikely on our
targets but better keep the code safe against this.
This may be backported as far as 2.2.
The "show pools" command is used a lot for debugging but didn't get much
love over the years. This patch brings new capabilities:
- sorting the output by pool names to ese their finding ("byname").
- sorting the output by reverse item size to spot the biggest ones("bysize")
- sorting the output by reverse number of allocated bytes ("byusage")
The last one (byusage) also omits displaying the ones with zero allocation.
In addition, an optional max number of output entries may be passed so as
to dump only the N most relevant ones.
This previous patch was not sufficient to prevent haproxy from
crashing when some Handshake packets had to be inspected before being possibly
retransmitted:
"BUG/MAJOR: quic: Crash upon retransmission of dgrams with several packets"
This patch introduced another issue: access to packets which have been
released because still attached to others (in the same datagram). This was
the case for instance when discarding the Initial packet number space before
inspecting an Handshake packet in the same datagram through its ->prev or
member in our case.
This patch implements quic_tx_packet_dgram_detach() which detaches a packet
from the adjacent ones in the same datagram to be called when ackwowledging
a packet (as done in the previous commit) and when releasing its memory. This
was, we are sure the released packets will not be accessed during retransmissions.
Thank you to @gabrieltz for having reported this issue in GH #1903.
Must be backported to 2.6.
As revealed by some traces provided by @gabrieltz in GH #1903 issue,
there are clients (chrome I guess) which acknowledge only one packet among others
in the same datagram. This is the case for the first datagram sent by a QUIC haproxy
listener made an Initial packet followed by an Handshake one. In this identified
case, this is the Handshake packet only which is acknowledged. But if the
client is able to respond with an Handshake packet (ACK frame) this is because
it has successfully parsed the Initial packet. So, why not also acknowledging it?
AFAIK, this is mandatory. On our side, when restransmitting this datagram, the
Handshake packet was accessed from the Initial packet after having being released.
Anyway. There is an issue on our side. Obviously, we must not expect an
implementation to respect the RFC especially when it want to build an attack ;)
With this simple patch for each TX packet we send, we also set the previous one
in addition to the next one. When a packet is acknowledged, we detach the next one
and the next one in the same datagram from this packet, so that it cannot be
resent when resending these packets (the previous one, in our case).
Thank you to @gabrieltz for having reported this issue.
Must be backported to 2.6.