Do not use an extra DCID parameter on new_quic_cid to be able to
associated a new generated CID to a thread ID. Simply do the computation
inside the function. The API is cleaner this way.
This also has the effects to improve the apparent randomness of CIDs.
With the previous version the first byte of all CIDs are identical for a
connection which could lead to privacy issue. This version may not be
totally perfect on this aspect but it improves the situation.
The producer must know where is the tailing hole in the RX buffer
when it purges it from consumed datagram. This is done allocating
a fake datagram with the remaining number of bytes which cannot be produced
at the tail of the RX buffer as length.
The DeviceAtlas addon can optionally interacts with the new service
without change of configuration in the HAProxy part.
Note that however it requires the DeviceAtlas Identification C API
2.4.0 minimum from this point.
Signed-off-by: David Carlier <dcarlier@deviceatlas.com>
Mentions of the new database update runtime mode and update of
the legit module and the dummy part too.
Note the DeviceAtlas C API version 2.4.0 minimum required
alongside with libCURL, libzip and libgz.
New specialized service to daily handle the update of download file without
interruption of service and to be preemptively started before HAProxy.
It consists on a standalone utility which shared a memory block with
the DeviceAtlas module which handles the JSON data file update on
a daily basis.
Signed-off-by: David Carlier <dcarlier@deviceatlas.com>
The CID trees are no more attached to the listener receiver but to the
underlying datagram handlers (one by thread) which run always on the same thread.
So, any operation on these trees do not require any locking.
We copy the first octet of the original destination connection ID to any CID for
the connection calling new_quic_cid(). So this patch modifies only this function
to take a dcid as passed parameter.
As the RX buffer is not consumed by the sock i/o handler as soon as a datagram
is produced, when full an RX buffer must not be reset. The remaining room is
consumed without modifying it. The consumer has a represention of its contents:
a list of datagrams.
Rename quic_lstnr_dgram_read() to quic_lstnr_dgram_dispatch() to reflect its new role.
After calling this latter, the sock i/o handler must consume the buffer only if
the datagram it received is detected as wrong by quic_lstnr_dgram_dispatch().
The datagram handler task mark the datagram as consumed atomically setting ->buf
to NULL value. The sock i/o handler is responsible of flushing its RX buffer
before using it. It also keeps a datagram among the consumed ones so that
to pass it to quic_lstnr_dgram_dispatch() and prevent it from allocating a new one.
As mentionned in the comment, the tx_qrings and rxbufs members of
receiver struct must be pointers to pointers!
Modify the functions responsible of their allocations consequently.
Note that this code could work because sizeof rxbuf and sizeof tx_qrings
are greater than the size of pointer!
The quic_dgram_ctx struct has been replaced by quic_dgram struct.
There is no need to keek a typedef for a pointer to function since we
converted the UDP datagram parser (quic_dgram_read()) into a task.
quic_dgram_read() parses all the QUIC packets from a UDP datagram. It is the best
candidate to be converted into a task, because is processing data unit is the UDP
datagram received by the QUIC sock i/o handler. If correct, this datagram is
added to the context of a task, quic_lstnr_dghdlr(), a conversion of quic_dgram_read()
into a task. This task pop a datagram from an mt_list and passes it among to
the packet handler (quic_lstnr_pkt_rcv()).
Modify the quic_dgram struct to play the role of the old quic_dgram_ctx struct when
passed to quic_lstnr_pkt_rcv().
Modify the datagram handlers allocation to set their tasks to quic_lstnr_dghdlr().
Add quic_dghdlr new struct do define datagram handler tasks, one by thread.
Allocate them and attach them to the listener receiver part calling
quic_alloc_dghdlrs_listener() newly implemented function.
Add quic_dgram new structure to store information about datagrams received
by the sock I/O handler (quic_sock_fd_iocb) and its associated pool.
Implement quic_get_dgram_dcid() to retrieve the datagram DCID which must
be the same for all the packets in the datagram.
Modify quic_lstnr_dgram_read() called by the sock I/O handler to allocate
a quic_dgram each time a correct datagram is found and add it to the sock I/O
handler rxbuf dgram list.
Define the offsets of the DCIDs from the beginning of a QUIC packets.
Note that they must always be present. As QUIC servers, QUIC haproxy listeners
always use a CID, source CID on the haproxy side, which is a destination ID on the
peer side.
This function is no more used anymore, broken and uses code shared with the
listener packet parser. This is becoming anoying to continue to modify
it without testing each time we modify the code it shares with the
listener packet parser.
This is to be sure xprt functions do not manipulate the buffer struct
passed as parameter to quic_lstnr_dgram_read() from low level datagram
I/O callback in quic_sock.c (quic_sock_fd_iocb()).
Mention that the token is sent only by servers in both server and listener
packet parsers.
Remove a "TO DO" section in listener packet parser because there is nothing
more to do in this function about the token
This quic_dgram_ctx struct member is used to denote if we are parsing a new
datagram (null value), or a coalesced packet into the current datagram (non null
value). But it was never set.
There's a risk that fdtab is not 64-byte aligned. The first effect is that
it may cause false sharing between cache lines resulting in contention
when adjacent FDs are used by different threads. The second is related
to what is explained in commit "BUG/MAJOR: compiler: relax alignment
constraints on certain structures", i.e. that modern compilers might
make use of aligned vector operations to zero some entries, and would
crash. We do not use any memset() or so on fdtab, so the risk is almost
inexistent, but that's not a reason for violating some valid assumptions.
This patch addresses this by allocating 64 extra bytes and aligning the
structure manually (this is an extremely cheap solution for this specific
case). The original address is stored in a new variable "fdtab_addr" and
is the one that gets freed. This remains extremely simple and should be
easily backportable. A dedicated aligned allocator later would help, of
course.
This needs to be backported as far as 2.2. No issue related to this was
reported yet, but it could very well happen as compilers evolve. In
addition this should preserve high performance across restarts (i.e.
no more dependency on allocator's alignment).
In github bug #1517, Mike Lothian reported instant crashes on startup
on RHEL8 + gcc-11 that appeared with 2.4 when allocating a proxy.
The analysis brought us down to the THREAD_ALIGN() entries that were
placed inside the "server" struct to avoid false sharing of cache lines.
It turns out that some modern gcc make use of aligned vector operations
to manipulate some fields (e.g. memset() etc) and that these structures
allocated using malloc() are not necessarily aligned, hence the crash.
The compiler is allowed to do that because the structure claims to be
aligned. The problem is in fact that the alignment propagates to other
structures that embed it. While most of these structures are used as
statically allocated variables, some are dynamic and cannot use that.
A deeper analysis showed that struct server does this, propagates to
struct proxy, which propagates to struct spoe_config, all of which
are allocated using malloc/calloc.
A better approach would consist in usins posix_memalign(), but this one
is not available everywhere and will either need to be reimplemented
less efficiently (by always wasting 64 bytes before the area), or a
few functions will have to be specifically written to deal with the
few structures that are dynamically allocated.
But the deeper problem that remains is that it is difficult to track
structure alignment, as there's no available warning to check this.
For the long term we'll probably have to create a macro such as
"struct_malloc()" etc which takes a type and enforces an alignment
based on the one of this type. This also means propagating that to
pools as well, and it's not a tiny task.
For now, let's get rid of the forced alignment in struct server, and
replace it with extra padding. By punching 63-byte holes, we can keep
areas on separate cache lines. Doing so moderately increases the size
of the "server" structure (~+6%) but that's the best short-term option
and it's easily backportable.
This will have to be backported as far as 2.4.
Thanks to Mike for the detailed report.
The standalone testing code used to rely on rand(), but switching to a
xorshift generator speeds up the test by 7% which is important to
accurately measure the real impact of the LRU code itself.
Do not proceed to direct accept when creating a new quic_conn. Wait for
the QUIC handshake to succeeds to insert the quic_conn in the accept
queue. A tasklet is then woken up to call listener_accept to accept the
quic_conn.
The most important effect is that the connection/mux layers are not
instantiated at the same time as the quic_conn. This forces to delay
some process to be sure that the mux is allocated :
* initialization of mux transport parameters
* installation of the app-ops
Also, the mux instance is not checked now to wake up the quic_conn
tasklet. This is safe because the xprt-quic code is now ready to handle
the absence of the connection/mux layers.
Note that this commit has a deep impact as it changes significantly the
lower QUIC architecture. Most notably, it breaks the 0-RTT feature.
Create a new structure li_per_thread. This is uses as an array in the
listener structure, with an entry allocated per thread. The new function
li_init_per_thr is responsible of the allocation.
For now, li_per_thread contains fields only useful for QUIC listeners.
As such, it is only allocated for QUIC listeners.
Create a new type quic_accept_queue to handle QUIC connections accept.
A queue will be allocated for each thread. It contains a list of
listeners which contains at least one quic_conn ready to be accepted and
the tasklet to run listener_accept for these listeners.
Mark QUIC listeners with the flag LI_F_QUIC_LISTENER. It is set by the
proto-quic layer on the add listener callback. This allows to override
more clearly the accept callback on quic_session_accept.
Define a new field in listener structure named flags.
For the moment, no flag is defined. This will be notably useful to
differentiate QUIC listeners with the implementation of a QUIC conn
accept queue.
The connection is allocated after finishing the QUIC handshake. Remove
handshake/L6 flags when initializing the connection as handshake is
finished with success at this stage.
Remove usage of connection in quic_conn_from_buf. As connection and
quic_conn are decorrelated, it is not logical to check connection flags
when using sendto.
This require to store the L4 peer address in quic_conn to be able to use
sendto.
This change is required to delay allocation of connection.
QUIC connections are distributed accross threads by xprt-quic according
to their CIDs. As such disable the thread selection in listener_accept
for QUIC listeners.
This prevents connection from migrating to another threads after its
allocation which can results in unexpected side-effects.
This flag is named RX_F_LOCAL_ACCEPT. It will be activated for special
receivers where connection balancing to threads is already handle
outside of listener_accept, such as with QUIC listeners.
Add a new function in mux-quic to install app-ops. For now this
functions is called during the ALPN negotiation of the QUIC handshake.
This change will be useful when the connection accept queue will be
implemented. It will be thus required to delay the app-ops
initialization because the mux won't be allocated anymore during the
QUIC handshake.
Define a new enum to represent the status of the mux/connection layer
above a quic_conn. This is important to know if it's possible to handle
application data, or if it should be buffered or dropped.
Adjust the function to check if header protection can be removed. It can
now be used both for a single packet in qc_lstnr_pkt_rcv and in the
quic_conn handler to handle buffered packets for a specific encryption
level.
When squashing commit add43fa43 ("DEBUG: pools: add new build option
DEBUG_POOL_TRACING") I managed to break the build and to fail to detect
it even after the rebase and a full rebuild :-(
David Carlier reported a build breakage on Haiku since commit
5be7c198e ("DEBUG: cli: add a new "debug dev fd" expert command")
due to O_ASYNC not being defined. Ilya also reported it broke the
build on Cygwin. It's not that portable and sometimes defined as
O_NONBLOCK for portability. But here we don't even need that, as
we already condition other flags, let's just ignore it if it does
not exist.
The poller's pipe was only registered on the read side since we don't
need to poll to write on it. But this leaves some known FDs so it's
better to also register the write side with no event. This will allow
to show them in "show fd" and to avoid dumping them as unhandled FDs.
Note that the only other type of unhandled FDs left are:
- stdin/stdout/stderr
- epoll FDs
The later can be registered upon startup though but at least a dummy
handler would be needed to keep the fdtab clean.
This command will scan the whole file descriptors space to look for
existing FDs that are unknown to haproxy's fdtab, and will try to dump
a maximum number of information about them (including type, mode, device,
size, uid/gid, cloexec, O_* flags, socket types and addresses when
relevant). The goal is to help detecting inherited FDs from parent
processes as well as potential leaks.
Some of those listed are actually known but handled so deep into some
systems that they're not in the fdtab (such as epoll FDs or inter-
thread pipes). This might be refined in the future so that these ones
become known and do not appear.
Example of output:
$ socat - /tmp/sock1 <<< "expert-mode on;debug dev fd"
0 type=tty. mod=0620 dev=0x8803 siz=0 uid=1000 gid=5 fs=0x16 ino=0x6 getfd=+0 getfl=O_RDONLY,O_APPEND
1 type=tty. mod=0620 dev=0x8803 siz=0 uid=1000 gid=5 fs=0x16 ino=0x6 getfd=+0 getfl=O_RDONLY,O_APPEND
2 type=tty. mod=0620 dev=0x8803 siz=0 uid=1000 gid=5 fs=0x16 ino=0x6 getfd=+0 getfl=O_RDONLY,O_APPEND
3 type=pipe mod=0600 dev=0 siz=0 uid=1000 gid=100 fs=0xc ino=0x18112348 getfd=+0
4 type=epol mod=0600 dev=0 siz=0 uid=0 gid=0 fs=0xd ino=0x3674 getfd=+0 getfl=O_RDONLY
33 type=pipe mod=0600 dev=0 siz=0 uid=1000 gid=100 fs=0xc ino=0x24af8251 getfd=+0 getfl=O_RDONLY
34 type=epol mod=0600 dev=0 siz=0 uid=0 gid=0 fs=0xd ino=0x3674 getfd=+0 getfl=O_RDONLY
36 type=pipe mod=0600 dev=0 siz=0 uid=1000 gid=100 fs=0xc ino=0x24af8d1b getfd=+0 getfl=O_RDONLY
37 type=epol mod=0600 dev=0 siz=0 uid=0 gid=0 fs=0xd ino=0x3674 getfd=+0 getfl=O_RDONLY
39 type=pipe mod=0600 dev=0 siz=0 uid=1000 gid=100 fs=0xc ino=0x24afa04f getfd=+0 getfl=O_RDONLY
41 type=pipe mod=0600 dev=0 siz=0 uid=1000 gid=100 fs=0xc ino=0x24af8252 getfd=+0 getfl=O_RDONLY
42 type=epol mod=0600 dev=0 siz=0 uid=0 gid=0 fs=0xd ino=0x3674 getfd=+0 getfl=O_RDONLY
This new option, when set, will cause the callers of pool_alloc() and
pool_free() to be recorded into an extra area in the pool that is expected
to be helpful for later inspection (e.g. in core dumps). For example it
may help figure that an object was released to a pool with some sub-fields
not yet released or that a use-after-free happened after releasing it,
with an immediate indication about the exact line of code that released
it (possibly an error path).
This only works with the per-thread cache, and even objects refilled from
the shared pool directly into the thread-local cache will have a NULL
there. That's not an issue since these objects have not yet been freed.
It's worth noting that pool_alloc_nocache() continues not to set any
caller pointer (e.g. when the cache is empty) because that would require
a possibly undesirable API change.
The extra cost is minimal (one pointer per object) and this completes
well with DEBUG_POOL_INTEGRITY.
This adds a caller to pool_put_to_cache() and pool_get_from_cache()
which will optionally be used to pass a pointer to their callers. For
now it's not used, only the API is extended to support this pointer.
Here the idea is to calculate the POOL_EXTRA size that is appended at
the end of a pool object based on the sum of enabled optional fields
so that we can more easily compute offsets and sizes depending on build
options.
For this, POOL_EXTRA is replaced with POOL_EXTRA_MARK which itself is
set either to sizeof(void*) or zero depending on whether we enable
marking the origin pool or not upon allocation.
The pool_alloc() function was already a wrapper to __pool_alloc() which
was also inlined but took a set of flags. This latter was uninlined and
moved to pool.c, and pool_alloc()/pool_zalloc() turned to macros so that
they can more easily evolve to support debugging options.
The number of call places made this code grow over time and doing only
this change saved ~1% of the whole executable's size.
The pool_free() function has become a bit big over time due to the
extra consistency checks. It used to remain inline only to deal
cleanly with the NULL pointer free that's quite present on some
structures (e.g. in stream_free()).
Here we're splitting the function in two:
- __pool_free() does the inner block without the pointer test and
becomes a function ;
- pool_free() is now a macro that only checks the pointer and calls
__pool_free() if needed.
The use of a macro versus an inline function is only motivated by an
easier intrumentation of the code later.
With this change, the code size reduces by ~1%, which means that at
this point all pool_free() call places used to represent more than
1% of the total code size.