Implement a way to test if some options are enabled at run-time. For now,
the following options may be detected:
POLL, EPOLL, KQUEUE, EVPORTS, SPLICE, GETADDRINFO, REUSEPORT,
FAST-FORWARD, SERVER-SSL-VERIFY-NONE
These options are those that can be disabled on the command line. This way
it is possible, from a reg-test for instance, to know if a feature is
supported or not :
feature cmd "$HAPROXY_PROGRAM -cc '!(global.tune & GTUNE_NO_FAST_FWD)'"
This global option is documented but it is not in the list of supported
options for the global section. So let's add it.
This patch could be backported to all stable versions.
The option was renamed so that it can only disable the fast-forward. First,
there is no reason to enable it because it is the default behavior. Second,
it introduced a bug because there was no way to be sure the command line had
precedence over the configuration this way. So, the option is now named
"tune.disable-fast-forward" and does not support any argument. And of
course, the command line option "-dF" now has precedence over the
configuration.
No backport needed.
For server connections, both the frontend and backend were considered to
enable the httpclose option. However, this is ambiguous because on the client
side only the frontend is considered. In addition, with 2 frontends, one with
the option enabled and the other without, the HTTP connection mode may differ
while it is a backend setting.
Thus, now, for the server side, only the backend is considered. Of course,
if the option is set for a listener, the option will be enabled if the
listener is the backend's connection.
Since the HTX, the description of options about HTTP connection modes is
wrong. In fact, it is worse: all the documentation about HTTP connection
modes is wrong. But only the options will be updated for now, to be
backported.
So, the documentation of "option httpclose", "option http-keep-alive",
"option http-server-close" and "option http-pretend-keepalive" was reviewed.
First, it is now specified that these options only concern HTTP/1.x
connections. Then, the descriptions were updated to reflect the HTX
implementation.
The main change concerns the fact that server connections are no longer
attached to client connections. The connection mode on one side does not
affect the connection mode on the other side. It is especially true for
"option httpclose". For client connections, only the frontend option is
considered and for server ones, both frontend and backend options are
considered.
This patch should be backported as far as 2.2.
We must never exit from the stream processing function with an expired
task. Otherwise, we are pretty sure this will end with a spinning loop. It
is really better to abort as early as possible and with the original buggy
state. This will ease the debug sessions.
If the TX buffer (->tx.buf) attached to the connection is not drained, there
are chances that this will be detected by qc_txb_release() which triggers
a BUG_ON_HOT() when this is the case, as follows:
[00|quic|2|c_conn.c:3477] UDP port unreachable : qc@0x5584f18d6d50 pto_count=0 cwnd=6816 ppif=1046 pif=1046
[00|quic|5|ic_conn.c:749] qc_kill_conn(): entering : qc@0x5584f18d6d50
[00|quic|5|ic_conn.c:752] qc_kill_conn(): leaving : qc@0x5584f18d6d50
[00|quic|5|c_conn.c:3532] qc_send_ppkts(): leaving : qc@0x5584f18d6d50 pto_count=0 cwnd=6816 ppif=1046 pif=1046
FATAL: bug condition "buf && b_data(buf)" matched at src/quic_conn.c:3098
Consume the remaining data in the TX buffer by calling b_del().
This bug arrived with this commit:
a2c62c314 MINOR: quic: Kill the connections on ICMP (port unreachable) packet receipt
Also take the opportunity of this patch to modify the comments for qc_send_ppkts()
which should have arrived with the a2c62c314 commit.
Must be backported to 2.7 where this latter commit is supposed to be backported.
Traces from this function would miss a TRACE_LEAVE() on the success path,
which had for consequences, 1) that it was difficult to figure where the
function was left, and 2) that we never had the allocated stream ID
clearly visible (actually the one returned by h2c_frt_stream_new() is
the right one but it's not obvious).
This can be backported to 2.7 and 2.6.
Functions which are called with dummy streams pass it down the traces
and that leads to somewhat confusing "h2s=0x1234568(0,IDL)" for example
while the nature of the called function makes this stream useless at that
place. Better not report a random pointer, especially since it always
requires to look at the code before remembering how this should be
interpreted.
Now what we're doing is that the idle stream only prints "h2s=IDL" which
is shorter and doesn't report a pointer, closed streams do not report
anything since the stream ID 0 already implies it, and other ones are
reported normally.
This could be backported to 2.7 and 2.6 as it improves traces legibility.
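The reporting rule above can be sketched as follows. This is only an
illustration: fmt_h2s() and the strm_kind enum are hypothetical names, and
the "(id,state)" layout for regular streams is simply borrowed from the
example quoted above, not taken from the real trace code.

```c
#include <stdio.h>

/* Hypothetical, simplified stream kinds; not the real H2 stream states. */
enum strm_kind { STRM_IDLE_DUMMY, STRM_CLOSED_DUMMY, STRM_REGULAR };

/* Writes the trace fragment for a stream into <out> (at most <size> bytes)
 * and returns the snprintf()-style length of the fragment.
 */
static int fmt_h2s(char *out, size_t size, enum strm_kind kind,
                   const void *ptr, int id, const char *state)
{
	if (!size)
		return 0;

	switch (kind) {
	case STRM_IDLE_DUMMY:
		/* idle dummy stream: short form, no pointer */
		return snprintf(out, size, "h2s=IDL");
	case STRM_CLOSED_DUMMY:
		/* closed dummy stream: nothing, stream ID 0 implies it */
		out[0] = '\0';
		return 0;
	default:
		/* any other stream: reported normally */
		return snprintf(out, size, "h2s=%p(%d,%s)",
		                (void *)ptr, id, state);
	}
}
```

Only the two dummy cases are special; every other stream keeps its usual
pointer-based output.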
With the previous commit, quic-conns are now handled as jobs to prevent the
termination of the haproxy process. This ensures that QUIC connections are
closed when all data are acknowledged by the client and there are no more
active streams.
The quic-conn layer emits a CONNECTION_CLOSE once the MUX has been
released and all streams are acknowledged. Then, the timer is scheduled
to definitely free the connection after the idle timeout period. This
allows late-arriving packets to be handled.
Adjust this procedure to deactivate this timer when process stopping is
in progress. In this case, the quic-conn timer is set to expire immediately
to free the quic-conn instance as soon as possible. This allows the
haproxy process to close quickly.
This should be backported up to 2.7.
To prevent data loss for QUIC connections, the haproxy global variable
"jobs" is incremented each time a quic-conn socket is allocated. This allows
the QUIC connection to terminate all its transfer operations during proxy
soft-stop. Without this patch, the process will be terminated without
waiting for QUIC connections.
Note that this is done in qc_alloc_fd(). This means only QUIC connections
with their own socket will properly support soft-stop. In the other
case, the connection will be interrupted abruptly as before. Similarly,
the jobs decrement is done in qc_release_fd().
This should be backported up to 2.7.
Properly implement support for haproxy soft-stop on QUIC MUX. This code
is similar to H2 MUX :
* on timeout refresh, if soft-stop is in progress, schedule the timeout to
expire with regards to the close-spread-end window.
* after input/output processing, if soft-stop is in progress, shut down the
connection. This is randomly spread by close-spread-end window. In the
case of H3 connection, a GOAWAY is emitted and the connection is kept
until all data are sent for opened streams. If the client tries to use
new streams, they are rejected in conformance with the GOAWAY
specification.
This ensures that MUX is able to forward all content properly before
closing the connection. The lower quic-conn layer is then responsible
for retransmission and should be closed when all data are acknowledged.
This will be implemented in the next commit to fully support soft-stop
for QUIC connections.
This should be backported up to 2.7.
Implement client-fin timeout for MUX quic. This timeout is used once an
applicative layer shutdown has been called. In HTTP/3, this corresponds
to the emission of a GOAWAY.
This should be backported up to 2.7.
Define a new function qc_process(). This function will regroup several
internal operations which should be called both on the I/O tasklet and the
wake() callback. For the moment, only the streams purge is conducted there.
This patch is useful to support haproxy soft stop. This should be
backported up to 2.7.
Factorize the shutdown operation in a dedicated function qc_shutdown(). This
will allow it to be called from multiple places. A new flag QC_CF_APP_SHUT
is also defined to ensure it will only be executed once even if called
multiple times per connection.
This commit will be useful to properly support haproxy soft stop.
This should be backported up to 2.7.
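The once-only guard described above can be sketched as follows. This is a
minimal illustration, not the real qc_shutdown(): the flag value, the
conn_ctx structure and shutdown_once() are assumptions made for the example.

```c
#include <stdint.h>

/* Assumed flag value for illustration; the real definition may differ. */
#define QC_CF_APP_SHUT 0x00000001u

/* Minimal stand-in for the MUX connection context. */
struct conn_ctx {
	uint32_t flags;
	int shut_count; /* number of times the shutdown work actually ran */
};

/* Runs the application shutdown at most once per connection. */
static void shutdown_once(struct conn_ctx *ctx)
{
	if (ctx->flags & QC_CF_APP_SHUT)
		return; /* already done by an earlier caller */

	ctx->shut_count++; /* placeholder for the real shutdown work */
	ctx->flags |= QC_CF_APP_SHUT;
}
```

Callers do not need to know whether the shutdown already happened, which is
what makes the function safe to call from several places.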
When a GOAWAY has been emitted, an ID is announced to represent handled
streams. The H3 RFC suggests that higher streams should be reset with the
error code H3_REQUEST_CANCELLED. This allows the peer to replay requests
on another connection.
For the moment, the impact of this change is limited as GOAWAY is only
used on connection shutdown just before the MUX is freed. However, for
soft-stop support, a GOAWAY can be emitted in anticipation while keeping
the MUX to finish the active streams. In this case, new streams opened
by the client are reset.
As a consequence of this change, the app_ops.attach() operation has been
delayed to the very end of qcs_new(). This ensures that all qcs members
are initialized to support RESET_STREAM sending.
This should be backported up to 2.7.
h3s stores the current demux frame type and length as a state info. It
should be big enough to store a QUIC variable-length integer which is
the maximum H3 frame type and size.
Without this patch, there is a risk of integer overflow if the H3 frame size
is bigger than INT_MAX. This can typically cause a demux state mismatch
and a demux frame error. However, no occurrence of this bug has been found
yet with the current implementation.
This should be backported up to 2.6.
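To illustrate the size issue described above: a QUIC variable-length integer
can carry values up to 2^62 - 1, far beyond INT_MAX. The sketch below uses a
hypothetical demux_state structure and fits_in_int() helper; it is not the
real h3s layout.

```c
#include <limits.h>
#include <stdint.h>

/* Largest value a QUIC variable-length integer can carry: 2^62 - 1. */
#define QUIC_VARINT_MAX ((UINT64_C(1) << 62) - 1)

/* Hypothetical demux state, wide enough for any frame type or length. */
struct demux_state {
	uint64_t frame_type;
	uint64_t frame_len;
};

/* Returns 1 if <v> can be stored in an int without loss, otherwise 0. */
static int fits_in_int(uint64_t v)
{
	return v <= (uint64_t)INT_MAX;
}
```

Any frame length for which fits_in_int() returns 0 would be mangled by an
int field, while the 64-bit fields keep it intact.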
When the MUX is freed, the quic-conn layer may stay active until all
stream acknowledgments are processed. In this interval, if a new stream
is opened by the client, the quic-conn is thus now responsible for handling
it. This is done by the emission of a STOP_SENDING + RESET_STREAM.
Prior to this patch, the received packet was not acknowledged. This is
undesirable if the quic-conn is able to properly reject the request as
this can lead to unneeded retransmission from the client.
This must be backported up to 2.6.
When the MUX is freed, the quic-conn layer may stay active until all
stream acknowledgments are processed. In this interval, if a new stream
is opened by the client, the quic-conn is thus now responsible for handling
it. This is done by the emission of a STOP_SENDING.
This process has been extended to also emit a RESET_STREAM with the
same error code H3_REQUEST_REJECTED. This is done to conform with the H3
specification to invite the client to retry its request on a new
connection.
This should be backported up to 2.6.
When the MUX is freed, the quic-conn layer may stay active until all
stream acknowledgments are processed. In this interval, if a new stream
is opened by the client, the quic-conn is thus now responsible for handling
it. This is done by the emission of a STOP_SENDING.
This process is closely related to HTTP/3 protocol despite being handled
by the quic-conn layer. This highlights a flaw in our QUIC architecture
which should be adjusted. To reflect this situation, the function
qc_stop_sending_frm_enqueue() is renamed qc_h3_request_reject(). Also,
internal H3 treatment such as uni-directional bypass has been moved
inside the function.
This commit is only a refactor. However, bug fixes in the next patches will
rely on it so it should be backported up to 2.6.
This was revealed by Amaury when setting tune.quic.frontend.max-streams-bidi to 8
and asking a client to open 12 streams. haproxy has to send short packets
with small MAX_STREAMS frames encoded with 2 bytes, in addition to a packet number
encoded with only one byte. In this case, <len_frms> is the length of the encoded
frames to be added to the packet plus the length of the packet number.
Ensure the length of the packet is at least QUIC_PACKET_PN_MAXLEN by adding a PADDING
frame with (QUIC_PACKET_PN_MAXLEN - <len_frms>) as size. For instance with
a two-byte MAX_STREAMS frame and a one-byte packet number length, this adds
one byte of padding.
See https://datatracker.ietf.org/doc/html/rfc9001#name-header-protection-sample.
Must be backported to 2.7 and 2.6.
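The padding rule for short packets described above can be sketched as
follows. QUIC_PACKET_PN_MAXLEN is assumed to be 4 here (the longest packet
number encoding in QUIC), and padding_needed() is a hypothetical helper,
not the function modified by the patch.

```c
#include <stddef.h>

/* Assumed to be 4, the longest packet number encoding in QUIC. */
#define QUIC_PACKET_PN_MAXLEN 4

/* <len_frms> is the length of the encoded frames plus the length of the
 * packet number. Returns the size of the PADDING frame to add so that the
 * total reaches at least QUIC_PACKET_PN_MAXLEN bytes.
 */
static size_t padding_needed(size_t len_frms)
{
	if (len_frms >= QUIC_PACKET_PN_MAXLEN)
		return 0; /* long enough already */
	return QUIC_PACKET_PN_MAXLEN - len_frms;
}
```

With a two-byte MAX_STREAMS frame and a one-byte packet number,
padding_needed(2 + 1) gives the single byte of padding mentioned above.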
When receiving an Initial packet, a peer must drop it if the datagram is smaller
than 1200 bytes. Before this patch, the entire datagram was dropped.
In such a case, drop only the packet, after having parsed its length.
Must be backported to 2.6 and 2.7.
This bug arrived with this commit:
982896961 MINOR: quic: split and rename qc_lstnr_pkt_rcv()
The first block of code consists in possibly setting this variable to true.
But it was already initialized to true before entering this code section.
Should be initialized to false.
Also take the opportunity to remove an unused "err" label.
Must be backported to 2.6 and 2.7.
Before probing the Initial packet number space, verify that we can send at
least 1200 bytes per datagram. This may not be the case due to the amplification limit.
Must be backported to 2.6 and 2.7.
This should help in diagnosing issues revealed by the interop runner which counts
the number of handshakes from the number of Initial packets sent by the server.
Must be backported to 2.7.
The aim of this function is to rearm the idle timer. The ->expire
field of the timer task was updated without the task being requeued.
As a result, some connections could be unexpectedly terminated.
Must be backported to 2.6 and 2.7.
This is very helpful during retransmission when receiving ICMP port unreachable
errors after the peer has left. This is the only case at present where
qc_send_hdshk_pkts() or qc_send_app_probing() may fail (when they call
qc_send_ppkts() which fails with ECONNREFUSED as errno).
Also make the callers of qc_dgrams_retransmit() stop their packet processing. This
is the case of quic_conn_app_io_cb() and quic_conn_io_cb().
This modification definitively stops any packet processing when receiving
ICMP port unreachable errors.
Must be backported to 2.7.
The send*() syscalls which cause such ICMP packets to be received
fail with ECONNREFUSED as errno.
man(7) udp
ECONNREFUSED
No receiver was associated with the destination address.
This might be caused by a previous packet sent over the socket.
We must kill the underlying connection asap.
Must be backported to 2.7.
This code was there because the timer task was not running on the same thread
as the one which parses the QUIC packets. Now that this is no longer the case,
we can wake up this task directly.
Must be backported to 2.7.
Move quic_rx_pkts_del() out of quic_conn.h to make it benefit from the TRACE API.
Add traces which already helped in diagnosing an issue encountered with
ngtcp2 which sent too many 1RTT packets before the handshake completion. This
has been fixed here after having discussed with Tatsuhiro on the QUIC dev slack:
https://github.com/ngtcp2/ngtcp2/pull/663
Must be backported to 2.7.
Some counters could potentially be incremented even if send*() syscall returned
no error when ret >= 0 and ret != sz. This could be the case for instance if
a first call to send*() returned -1 with errno set to EINTR (or any previous syscall
which set errno to a non-null value) and if the next call to send*() returned
something positive and smaller than <sz>.
Must be backported to 2.7 and 2.6.
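The underlying rule is that errno is only meaningful when the syscall
reports a failure. A minimal sketch of counter accounting that follows this
rule is shown below; the tx_counters structure and account_send() are
hypothetical and do not reflect the real counters.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical counter, for illustration only. */
struct tx_counters {
	unsigned long send_err; /* send*() calls which really failed */
};

/* <ret> is the value returned by send*(), <err> the errno value read
 * right after the call, <sz> the number of bytes requested. Returns 1 if
 * the whole buffer was sent, otherwise 0.
 */
static int account_send(struct tx_counters *c, ssize_t ret, int err, size_t sz)
{
	if (ret < 0) {
		/* only here does <err> describe this call */
		if (err != 0)
			c->send_err++;
		return 0;
	}
	/* success or short send: <err> may be stale and must be ignored */
	return (size_t)ret == sz;
}
```

A short send following an earlier EINTR therefore leaves the error counter
untouched.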
Add traces inside h3_decode_qcs(). Every error path now has its
dedicated trace which should simplify debugging. Each early return has
been converted to a goto invocation.
To complete the demux tracing, demux frame type and length are now
printed using the h3s instance whenever it is possible on trace
invocation. A new internal value H3_FT_UNINIT is used as a frame type to
mark demuxing as inactive.
This should be backported up to 2.7.
Since the recent changes on the clocks, now.tv_sec is not to be used
between processes because it's a clock which is local to the process and
does not contain a real unix timestamp. This patch fixes the issue by
using "date.tv_sec" which is the wall clock instead of "now.tv_sec".
It prevents having incoherent timestamps.
It also introduces some checks on negative values in order to never
display a negative value if it was computed from a wrong value set by a
previous haproxy version.
It must be backported as far as 2.0.
Implement support for clients that emit the stream FIN with an empty
STREAM frame. For that, qcc_recv() offset comparison has been adjusted.
If offset has already been received but the FIN bit is now transmitted,
do not skip the rest of the function and call application layer
decode_qcs() callback.
Without this, streams will be kept open forever as the HTX EOM is never
transferred to the upper stream layer.
This behavior was observed with mvfst client prior to its patch
38c955a024aba753be8bf50fdeb45fba3ac23cfd
Fix hq-interop (HTTP 0.9 over QUIC)
This notably caused the interop multiplexing test to fail as unclosed
streams on haproxy side prevented the emission of new MAX_STREAMS frame
to the client.
This should be backported up to 2.6. It also relies on previous commit :
381d8137e3
MINOR: h3/hq-interop: handle no data in decode_qcs() with FIN set
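The adjusted offset comparison described above can be sketched as follows.
This is only an illustration under assumptions: frame_is_redundant() and
its parameters are hypothetical and do not mirror the real qcc_recv() code.

```c
#include <stdint.h>

/* Returns 1 if a STREAM frame brings nothing new and may be skipped.
 *   rx_offset : amount of contiguous data already received on the stream
 *   fin_known : non-zero if the FIN was already received for the stream
 *   offset/len: range covered by the new frame
 *   fin       : non-zero if the new frame carries the FIN bit
 */
static int frame_is_redundant(uint64_t rx_offset, int fin_known,
                              uint64_t offset, uint64_t len, int fin)
{
	uint64_t end = offset + len;

	if (end > rx_offset)
		return 0; /* the frame carries new data */
	if (end == rx_offset && fin && !fin_known)
		return 0; /* no new data, but the FIN itself is new */
	return 1;
}
```

An empty STREAM frame with the FIN bit at the current offset is thus no
longer skipped, so the application layer decode callback can still run.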
Properly handle a STREAM frame with no data but the FIN bit set at the
application layer. H3 and hq-interop decode_qcs() callback have been
adjusted to not return early in this case.
If the FIN bit is accepted, a HTX EOM must be inserted for the upper
stream layer. If the FIN is rejected because the stream cannot be
closed, a proper CONNECTION_CLOSE error will be triggered.
A new utility function qcs_http_handle_standalone_fin() has been
implemented in the qmux_http module. It simply adds the HTX EOM to the
qcs HTX buffer. If the HTX buffer is empty, an EOT is first added
to ensure it will be transmitted above.
This commit will allow to properly handle FIN notify through an empty
STREAM frame. However, it is not sufficient as currently qcc_recv() skips
the decode_qcs() invocation when the offset is already received. This
will be fixed in the next commit.
This should be backported up to 2.6 along with the next patch.
Several times during debugging it has been difficult to find a way to
reliably indicate if a thread had been started and if it was still
running. It's really not easy because the elements we look at are not
necessarily reliable (e.g. harmless bit or idle bit might not reflect
what we think during a signal). And such notions can be subjective
anyway.
Here we define two thread flags, TH_FL_STARTED which is set as soon as
a thread enters run_thread_poll_loop() and drops the idle bit, and
another one, TH_FL_IN_LOOP, which is set when entering run_poll_loop()
and cleared when leaving it. This should help init/deinit code know
whether it's called from a non-initialized thread (i.e. tid must not
be trusted), or shared functions know if they're being called from a
running thread or from init/deinit code outside of the polling loop.
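The life cycle of the two flags can be sketched as follows. This is a
simplified, single-threaded illustration: the bit values and the
thread_state structure are assumptions, the functions are stand-ins for
run_thread_poll_loop() and run_poll_loop(), and plain (non-atomic)
operations are used for brevity.

```c
#include <stdint.h>

/* Assumed bit values for illustration; the real ones may differ. */
#define TH_FL_STARTED 0x00000001u
#define TH_FL_IN_LOOP 0x00000002u

/* Minimal stand-in for the per-thread context. */
struct thread_state {
	uint32_t flags;
};

/* Stand-in for the polling loop: flag set on entry, cleared on exit.
 * <seen_inside> receives the flags as seen by code running in the loop.
 */
static void poll_loop(struct thread_state *th, uint32_t *seen_inside)
{
	th->flags |= TH_FL_IN_LOOP;
	*seen_inside = th->flags;
	th->flags &= ~TH_FL_IN_LOOP;
}

/* Stand-in for the thread entry point: marks the thread as started. */
static void thread_main(struct thread_state *th, uint32_t *seen_inside)
{
	th->flags |= TH_FL_STARTED;
	poll_loop(th, seen_inside);
}

/* Tells shared code whether it runs from inside the polling loop. */
static int in_polling_loop(const struct thread_state *th)
{
	return (th->flags & TH_FL_IN_LOOP) != 0;
}
```

TH_FL_STARTED is never cleared in this sketch, so deinit code can still
tell that the thread did start even after it left the loop.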
As reported in github issue #1881, there are situations where an excess
of TLS handshakes can cause a livelock. What's happening is that normally
we process at most one TLS handshake per loop iteration to maintain the
latency low. This is done by tagging them with TASK_HEAVY, queuing these
tasklets in the TL_HEAVY queue. But if something slows down the loop, such
as a connect() call when no more ports are available, we could end up
processing no more than a few hundred or thousand handshakes per second.
If the limit becomes lower than the rate of incoming handshakes, we will
accumulate them and at some point users will get impatient and give up or
retry. Then a new problem happens: the queue fills up with even more
handshake attempts, only one of which will be handled per iteration, so
we can end up processing only outdated handshakes at a low rate, with
basically nothing else in the queue. This can for example happen in
parallel with health checks that don't require incoming handshakes to
succeed to continue to cause some activity that could maintain the high
latency stuff active.
Here we're taking a slightly different approach. First, instead of always
allowing only one handshake per loop (and usually it's critical for
latency), we take the current situation into account:
- if configured with tune.sched.low-latency, the limit remains 1
- if there are other non-heavy tasks, we set the limit to 1 + one
per 1024 tasks, so that a heavily loaded queue of 4k handshakes
per thread will be able to drain them at ~4 per loop with a
limited impact on latency
- if there are no other tasks, the limit grows to 1 + one per 128
tasks, so that a heavily loaded queue of 4k handshakes per thread
will be able to drain them at ~32 per loop with still a very
limited impact on latency since only I/O will get delayed.
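The three rules above can be sketched as a small function. This is a
hypothetical helper, not the scheduler code: heavy_limit() and its
parameters are assumptions, and the ratio is applied here to the number of
queued handshakes.

```c
/* Returns how many heavy tasks (TLS handshakes) may be processed in one
 * loop iteration.
 *   low_latency : non-zero when tune.sched.low-latency is configured
 *   other_tasks : number of runnable non-heavy tasks
 *   queued      : number of tasks used for the ratio (assumed here to be
 *                 the queued handshakes)
 */
static unsigned int heavy_limit(int low_latency, unsigned int other_tasks,
                                unsigned int queued)
{
	if (low_latency)
		return 1;                 /* strict one per loop */
	if (other_tasks)
		return 1 + queued / 1024; /* keep latency low for others */
	return 1 + queued / 128;          /* only I/O can be delayed */
}
```

With 4096 queued handshakes this gives 5 and 33 per loop, which matches the
~4 and ~32 orders of magnitude quoted above.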
It was verified on a 56-core Xeon-8480 that this did not degrade the
latency; all requests remained below 1ms end-to-end in full close+
handshake, and even 500us under low-lat + busy-polling.
This must be backported to 2.4.
There's a per-thread "long_rq" counter that is used to indicate how
often we leave the scheduler with tasks still present in the run queue.
The purpose is to know when tune.runqueue-depth served to limit latency,
due to a large number of tasks being runnable at once.
However there's a bug there, it's not always set: if after the first
run, one heavy task was processed and later only heavy tasks remain,
we'll loop back to not_done_yet where we try to pick more tasks, but
none are eligible (since heavy ones have already run) so we directly
return without incrementing the counter. This is what causes ultra-low
values on long_rq during massive SSL handshakes, that are confusing
because they make one believe that tl_class_mask doesn't have the HEAVY
flag anymore. Let's just fix that by not returning from the middle of
the function.
This can be backported as far as 2.4.
In 2.7, the method used to check for a sleeping thread changed with
commit e7475c8e7 ("MEDIUM: tasks/fd: replace sleeping_thread_mask with
a TH_FL_SLEEPING flag"). Previously there was a global sleeping mask
and now there is a flag per thread. The commit above partially broke
the watchdog by looking at the current thread's flags via th_ctx
instead of the reported thread's flags, and using an AND condition
instead of an OR to update and leave. This can cause a wrong thread
to be killed when the load is uneven. For example, when enabling
busy polling and sending traffic over a single connection, all
threads have their run time grow, and if the one receiving the
signal is also processing some traffic, it will not match the
sleeping/harmless condition and will set the stuck flag, then die
upon next invocation. While it's reproducible in tests, it's unlikely
to be met in the field.
This fix should be backported to 2.7.
A feature command was added to detect if infinite forward is disabled to be
able to skip the script. Unfortunately, it is not supported to evaluate such
an expression. Thus remove it. For now, reg-tests must not be executed with
the "-dF" option.
The -dF option can now be used to disable data fast-forward. It does the
same as the global option "tune.fast-forward off". Some reg-tests may rely
on this optimization. To detect the feature and skip such scripts, the
following vtest command must be used:
feature cmd "$HAPROXY_PROGRAM -cc '!(global.tune & GTUNE_NO_FAST_FWD)'"
The new global option "tune.fast-forward" can be set to "off" to disable the
data fast-forward. It is a debug option, thus it is internally marked as
experimental. The directive "expose-experimental-directives" must be set
first to use this one. By default, the data fast-forward is enabled.
It could be useful to force the stream to wake up when data are
received, to be sure everything works fine in this case. The data
fast-forward is an optimization. Everything must work without it. But some
code may rely on the fact the stream will not be woken up. With this option,
it is possible to spot some hidden bugs.
At the stream level, the read expiration date is unset if a shutr was
received but not if the end of input was reached. If we know no more data
are expected, there is no reason to let the read expiration date armed,
except to respect clientfin/serverfin timeout in some circumstances.
This patch could slowly be backported as far as 2.2.
During the payload forwarding, since the commit f2b02cfd9 ("MAJOR: http-ana:
Review error handling during HTTP payload forwarding"), when an error
occurred on one side, we don't rely anymore on a specific HTTP message state
to detect it on the other side. However, nothing was added to detect the
error. Thus, when this happens, a spinning loop may be experienced, followed
by an abort triggered by the watchdog.
To fix the bug, we must detect the opposite side is closed by checking the
opposite SC state. Concretely, in http_end_request() and http_end_response(),
we wait for the other side iff the HTTP message state is lower than
HTTP_MSG_DONE (the message is not finished) and the SC state is not
SC_ST_CLO (the opposite side is not closed). In these functions, we don't
care if there was an error on the opposite side. We only take care to detect
when we must stop waiting for the other side.
This patch should fix the issue #2042. No backport needed.
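The wait condition described above can be sketched as follows. The enums
are reduced stand-ins with illustrative values (only the ordering of the
message states matters here), and must_wait_other_side() is a hypothetical
helper, not code taken from http_end_request() or http_end_response().

```c
/* Reduced stand-ins for the real enums; values are illustrative only. */
enum http_msg_state {
	HTTP_MSG_DATA = 1, /* message still being forwarded */
	HTTP_MSG_DONE = 2, /* message fully processed */
};

enum sc_state {
	SC_ST_EST = 1, /* stream connector established */
	SC_ST_CLO = 2, /* stream connector closed */
};

/* Returns 1 if we must keep waiting for the other side, otherwise 0.
 *   msg_state   : state of the HTTP message we are waiting for
 *   opposite_sc : state of the opposite stream connector
 */
static int must_wait_other_side(enum http_msg_state msg_state,
                                enum sc_state opposite_sc)
{
	return msg_state < HTTP_MSG_DONE && opposite_sc != SC_ST_CLO;
}
```

Once the opposite SC is closed the function returns 0 whatever the message
state, which is what breaks the spinning loop.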