Extend extra pacing support for newreno and nocc congestion algorithms,
as with cubic.
For better extensibility of cc algo definition, define a new flags field
in quic_cc_algo structure. For now, the only value is
QUIC_CC_ALGO_FL_OPT_PACING which is set if pacing support can be
optionally activated. Both cubic, newreno and nocc now supports this.
This new flag is then reused by QUIC config parser. If set, extra
quic-cc-algo burst parameter is taken into account. If positive, this
will activate pacing support on top of the congestion algorithm. As with
cubic previously, pacing is only supported if running under experimental
mode.
Only BBR is not flagged with this new value as pacing is directly
builtin in the algorithm and cannot be turn off. Furthermore, BBR
calculates automatically its value for maximum burst. As such, any
quic-cc-algo burst argument used with BBR is still ignored with a
warning.
This only concerns functions emitting warnings about misplaced tcp-request
rules. The direction is now specified in the functions name. For instance
"warnif_misplaced_tcp_conn" is replaced by "warnif_misplaced_tcp_req_conn".
In warnings about misplaced rules, only the first keyword is mentionned. It
works well for http-request or quic-initial rules for instance. But it is a
bit confusing for tcp-request rules, because the layer is missing (session
or content).
To make it a bit systematic (and genric), the second argument can now be
provided. It can be set to NULL if there is no layer or scope. But
otherwise, it may be specified and it will be reported in the warning.
So the following snippet:
tcp-request content reject if FALSE
tcp-request session reject if FALSE
tcp-request connection reject if FALSE
Will now emit the following warnings:
a 'tcp-request session' rule placed after a 'tcp-request content' rule will still be processed before.
a 'tcp-request connection' rule placed after a 'tcp-request session' rule will still be processed before.
This patch should fix the issue #2596.
qc_packet_loss_lookup() aim is to detect the packet losses. This is this function
which must called ->on_pkt_lost() BBR specific callback. It also set
<bytes_lost> passed parameter to the total number of bytes detected as lost upon
an ACK frame receipt for its caller.
Modify qc_release_lost_pkts() to call ->congestion_event() with the send time
from the newest packet detected as lost.
Modify qc_release_lost_pkts() to call ->slow_start() callback only if define
by the congestion control algorithm. This is not the case for BBR.
Add several callbacks to quic_cc_algo struct which are only called by BBR.
->get_drs() may be used to retrieve the delivery rate sampling information
from an congestion algorithm struct (quic_cc).
->on_transmit() must be called before sending any packet a QUIC sender.
->on_ack_rcvd() must be called after having received an ACK.
->on_pkt_lost() must be called after having detected a packet loss.
->congestion_event() must be called after any congestion event detection
Modify quic_cc.c to call ->event only if defined. This is not the case
for BBR.
Implement the version 3 of BBR for QUIC specified by the IETF in this draft:
https://datatracker.ietf.org/doc/draft-ietf-ccwg-bbr/
Here is an extract from the Abstract part to sum up the the capabilities of BBR:
BBR ("Bottleneck Bandwidth and Round-trip propagation time") uses recent
measurements of a transport connection's delivery rate, round-trip time, and
packet loss rate to build an explicit model of the network path. BBR then uses
this model to control both how fast it sends data and the maximum volume of data
it allows in flight in the network at any time. Relative to loss-based congestion
control algorithms such as Reno [RFC5681] or CUBIC [RFC9438], BBR offers
substantially higher throughput for bottlenecks with shallow buffers or random
losses, and substantially lower queueing delays for bottlenecks with deep buffers
(avoiding "bufferbloat"). BBR can be implemented in any transport protocol that
supports packet-delivery acknowledgment. Thus far, open source implementations
are available for TCP [RFC9293] and QUIC [RFC9000].
In haproxy, this implementation is considered as still experimental. It depends
on the newly implemented pacing feature.
BBR was asked in GH #2516 by @KazuyaKanemura, @osevan and @kennyZ96.
Implement the Kathleen Nichols' algorithm used by several congestion control
algorithm implementation (TCP/BBR in Linux kernel, QUIC/BBR in quiche) to track
the maximum value of a data type during a fixe time interval.
In this implementation, counters which are periodically reset are used in place
of timestamps.
Only the max part has been implemented.
(see lib/minmax.c implemenation for Linux kernel).
Add ->initial_wnd new member to quic_cc_path struct to keep the initial value
of the congestion window. This member is initialized as soon as a QUIC connection
is allocated. This modification is required for BBR congestion control algorithm.
Once in a while we find some makefiles ignoring some outdated arguments
and just emit a warning. What's annoying is that if users (say, distro
packagers), have purposely added ERR=1 to their build scripts to make
sure to fail on any warning, these ones will be ignored and the build
can continue with invalid or missing options.
William rightfully suggested that ERR=1 should also catch make's warnings
so this patch implements this, by creating a new "complain" variable that
points either to "error" or "warning" depending on $(ERR), and that is
used to send the messages using $(call $(complain),...). This does the
job right at little effort (tested from GNU make 3.82 to 4.3).
Note that for this purpose the ERR declaration was upped in the makefile
so that it appears before the new errors.mk file is included.
tasklet_wakeup_on() and its derivates (tasklet_wakeup_after() and
tasklet_wakeup()) do not support passing a wakeup cause like
task_wakeup(). This is essentially due to an API limitation cause by
the fact that for a very long time the only reason for waking up was
to process pending I/O. But with the growing complexity of mux tasks,
it is becoming important to be able to skip certain heavy processing
when not strictly needed.
One possibility is to permit the caller of tasklet_wakeup() to pass
flags like task_wakeup(). Instead of going with a complex naming scheme,
let's simply make the flags optional and be zero when not specified. This
means that tasklet_wakeup_on() now takes either 2 or 3 args, and that the
third one is the optional flags to be passed to the callee. Eligible flags
are essentially the non-persistent ones (TASK_F_UEVT* and TASK_WOKEN_*)
which are cleared when the tasklet is executed. This way the handler
will find them in its <state> argument and will be able to distinguish
various causes for the call.
Everything in the tasklet layer supports flags, except that they are
just not implemented in the wakeup functions, while they are in the
task_wakeup functions. Initially it was not considered useful to pass
wakeup causes because these were essentially I/O, but with the growing
number of I/O handlers having to deal with various types of operations
(typically cheap I/O notifications on subscribe vs heavy parsing on
application-level wakeups), it would be nice to start to make this
distinction possible.
This commit extends _tasklet_wakeup_on() and _tasklet_wakeup_after()
to pass a set of flags that continues to be set as zero. For now this
changes nothing, but new functions will come.
Currently tasks being profiled have th_ctx->sched_call_date set to the
current nanosecond in monotonic time. But there's no other way to have
this, despite the scheduler being capable of it. Let's just declare a
new task flag, TASK_F_WANTS_TIME, that makes the scheduler take the time
just before calling the handler. This way, a task that needs nanosecond
resolution on the call date will be able to be called with an up-to-date
date without having to abuse now_mono_time() if not needed. In addition,
if CLOCK_MONOTONIC is not supported (now_mono_time() always returns 0),
the date is set to the most recently known now_ns, which is guaranteed
to be atomic and is only updated once per poll loop.
This date can be more conveniently retrieved using task_mono_time().
This can be useful, e.g. for pacing. The code was slightly adjusted so
as to merge the common parts between the profiling case and this one.
We used to store it in 32-bits since we'd only use it for latency and CPU
usage calculation but usages will evolve so let's not truncate the value
anymore. Now we store the full 64 bits. Note that this doesn't even
increase the storage size due to alignment. The 3 usage places were
verified to still be valid (most were already cast to 32 bits anyway).
To improve debugging, extend "show quic" output to report if pacing is
activated on a connection. Two values will be displayed for pacing :
* a new counter paced_sent_ctr is defined in QCC structure. It will be
incremented each time an emission is interrupted due to pacing.
* pacing engine now saves the number of datagrams sent in the last paced
emission. This will be helpful to ensure burst parameter is valid.
Define a new QUIC congestion algorithm token 'cubic-pacing' for
quic-cc-algo bind keyword. This is identical to default cubic
implementation, except that pacing is used for STREAM frames emission.
This algorithm supports an extra argument to specify a burst size. This
is stored into a new bind_conf member named quic_pacing_burst which can
be reuse to initialize quic path.
Pacing support is still considered experimental. As such, 'cubic-pacing'
can only be used with expose-experimental-directives set.
Support pacing emission for STREAM frames at the QUIC MUX layer. This is
implemented by adding a quic_pacer engine into QCC structure.
The main changes have been written into qcc_io_send(). It now
differentiates cases when some frames have been rejected by transport
layer. This can occur as previously due to congestion or FD buffer full,
which requires subscribing on transport layer. The new case is when
emission has been interrupted due to pacing timing. In this case, QUIC
MUX I/O tasklet is rescheduled to run with the flag TASK_F_USR1.
On tasklet execution, if TASK_F_USR1 is set, all standard processing for
emission and reception is skipped. Instead, a new function
qcc_purge_sending() is called. Its purpose is to retry emission with the
saved STREAM frames list. Either all remaining frames can now be send,
subscribe is done on transport error or tasklet must be rescheduled for
pacing purging.
In the meantime, if tasklet is rescheduled due to other conditions,
TASK_F_USR1 is reset. This will trigger a full regeneration of STREAM
frames. In this case, pacing expiration must be check before calling
qcc_send_frames() to ensure emission is now allowed.
QUIC MUX will be responsible to drive emission with pacing. This will be
implemented via setting TASK_F_USR1 before I/O tasklet wakeup. To
prepare this, encapsulate each I/O tasklet wakeup into a new function
qcc_wakeup().
This commit is purely refactoring prior to pacing implementation into
QUIC MUX.
For STREAM emission, MUX QUIC previously used a local list defined under
qcc_io_send(). This was suitable as either all frames were sent, or
emission must be interrupted due to transport congestion or fatal error.
In the latter case, the list was emptied anyway and a new frame list was
built on future qcc_io_send() invokation.
For pacing, MUX QUIC may have to save the frame list if pacing should be
applied across emission. This is necessary to avoid to unnecessarily
rebuilt stream frame list between each paced emission. To support this,
STREAM list is now stored as a member of QCC structure.
Ensure frame list is always deleted, even on QCC release, using newly
defined utility function qcc_tx_frms_free().
qc_send_mux() has been extended previously to support pacing emission.
This will ensure that no more than one datagram will be emitted during
each invokation. However, to achieve better performance, it may be
necessary to emit a batch of several datagrams one one turn.
A so-called burst value can be specified by the user in the
configuration. However, some congestion control algos may defined their
owned dynamic value. As such, a new CC callback pacing_burst is defined.
quic_cc_default_pacing_burst() can be used for algo without pacing
interaction, such as cubic. It will returns a static value based on user
selected configuration.
Pacing will be implemented for STREAM frames emission. As such,
qc_send_mux() API has been extended to add an argument to a quic_pacer
engine.
If non NULL, engine will be used to pace emission. In short, no more
than one datagram will be emitted for each qc_send_mux() invokation.
Pacer is then notified about the emission and a timer for a future
emission is calculated. qc_send_mux() will return PACING error value, to
inform QUIC MUX layer that it will be responsible to retry emission
after some delay.
Extend quic_pacer engine to support pacing emission. Several functions
are defined.
* quic_pacing_sent_done() to notify engine about an emission of one or
several datagrams
* quic_pacing_expired() to check if emission should be delayed or can be
conducted immediately
This commit is part of a adjustment on QUIC transport send API to
support pacing. Here, qc_send_mux() return type has been changed to use
a new enum quic_tx_err.
This is useful to explain different failure causes of emission. For now,
only two values have been defined : NONE and FATAL. When pacing will be
implemented, a new value would be added to specify that emission was
interrupted on pacing. This won't be a fatal error as this allows to
retry emission but not immediately.
Extend QUIC transport emission function to support a maximum datagram
argument. The purpose is to ensure that qc_send() won't emit more than
the specified value, unless it is 0 which is considered as unlimited.
In qc_prep_pkts(), a counter of built datagram has been added to support
this. The packet building loop is interrupted if it reaches a specified
maximum value. Also, its return value has been changed to the number of
prepared datagrams. This is reused by qc_send() to interrupt its work if
a specified max datagram argument value is reached over one or several
iteration of prepared/sent datagrams.
This change is necessary to support pacing emission. Note that ideally,
the total length in bytes of emitted datagrams should be taken into
account instead of the raw number of datagrams. However, for a first
implementation, it was deemed easier to implement it with the latter.
Add QCC QC_CF_WAIT_FOR_HS and QCS QC_SF_TXBUB_OOB flags to their
respective show_flags to be able to decipher them via dev flags utility.
These values have been added in the current dev version, thus no need to
backport this patch.
When the request is too large to fit in a buffer a 414 or a 431 error
message is returned depending on the error state of the request parser. A
414 is returned if the URI is too long, otherwise a 431 is returned.
This patch should fix the issue #1309.
414-Uri-Too-Long and 431-Request-Header-Fields-Too-Large are now part of
supported status codes that can be define as error files. The hash table
defined in http_get_status_idx() was updated accordingly.
It is now possible to use a log-format string to define the "Set-Cookie"
header value of a response generated by a redirect rule. There is no special
check on the result format and it is not possible during the configuration
parsing. It is proably not a big deal because already existing "set-cookie"
and "clear-cookie" options don't perform any check.
Here is an example:
http-request redirect location https://someurl.com/ set-cookie haproxy="%[var(txn.var)]"
This patch should fix the issue #1784.
On prefix-based redirect, there is an option to drop the query-string of the
location. Here it is the opposite. an option is added to preserve the
query-string of the original URI for a localtion-based redirect.
By setting "keep-query" option, for a location-based redirect only, the
query-string of the original URI is appended to the location. If there is no
query-string, nothing is added (no empty '?'). If there is already a
non-empty query-string on the localtion, the original one is appended with
'&' separator.
This patch should fix issue #2728.
parse_size_err() currently is a function working only on an uint. It's
not convenient for certain elements such as rings on large machines.
This commit addresses this by having one function for uints and one
for ullong, and making parse_size_err() a macro that automatically
calls one or the other. It also has the benefit of automatically
supporting compatible types (long, size_t etc).
From time to time we face a configuration with very small timeouts which
look accidental because there could be expectations that they're expressed
in seconds and not milliseconds.
This commit adds a check for non-nul unitless values smaller than 100
and emits a warning suggesting to append an explicit unit if that was
the intent.
Only the common timeouts, the server check intervals and the resolvers
hold and timeout values were covered for now. All the code needs to be
manually reviewed to verify if it supports emitting warnings.
This may break some configs using "zero-warning", but greps in existing
configs indicate that these are extremely rare and solely intentionally
done during tests. At least even if a user leaves that after a test, it
will be more obvious when reading 10ms that something's probably not
correct.
Till now this value was parsed as raw integer using atol() and would
silently ignore any trailing suffix, causing unexpected behaviors when
set, e.g. to "4k". Let's make use of parse_size_err() on it so that
units are supported. This requires to turn it to uint as well, which
was verified to be OK.
Till now this value was parsed as raw integer using atol() and would
silently ignore any trailing suffix, preventing from starting when set
e.g. to "64k". Let's make use of parse_size_err() on it so that units are
supported. This requires to turn it to uint as well, and to explicitly
limit its range to INT_MAX - 2*sizeof(void*), which was previously
partially handled as part of the sign check.
Till now this value was parsed as raw integer using atol() and would
silently ignore any trailing suffix, causing unexpected behaviors when
set, e.g. to "512k". Let's make use of parse_size_err() on it so that
units are supported. This requires to turn it to uint as well, and
since it's sometimes compared to an int, we limit its range to
0..INT_MAX.
Till now this value was parsed as raw integer using atol() and would
silently ignore any trailing suffix, causing unexpected behaviors when
set, e.g. to "512k". Let's make use of parse_size_err() on it so that
units are supported. This requires to turn it to uint as well, which
was verified to be OK.
Till now these values were parsed as raw integer using atol() and would
silently ignore any trailing suffix, causing unexpected behaviors when
set, e.g. to "512k". Let's make use of parse_size_err() on them so that
units are supported. This requires to turn them to uint as well, which
is OK.
Till now these values were parsed as raw integer using atol() and would
silently ignore any trailing suffix, causing unexpected behaviors when
set, e.g. to "512k". Let's make use of parse_size_err() on them so that
units are supported. This requires to turn them to uint as well, which
is OK.
The qcc_report_glitch() function is now replaced with a macro to support
enumerating counters for each individual glitch line. For now this adds
36 such counters. The macro supports an optional description, though that
is not being used for now.
As a reminder, this requires to build with -DDEBUG_GLITCHES=1.
proxy auth_uri struct was manually cleaned up during deinit, but the logic
behind was kind of akward because it was required to find out which ones
were shared or not. Instead, let's switch to a proper refcount mechanism
and free the auth_uri struct directly in proxy_free_common().
COUNT_GLITCH() will implement an unconditional counter on its declaration
line when DEBUG_GLITCHES is set, and do nothing otherwise. The output will
be reported as "GLT" and can be filtered as "glt" on the CLI. The purpose
is to help figure what's happening if some glitches counters start going
through the roof. The macro supports an optional string argument to
describe the cause of the glitch (e.g. "truncated header"), which is then
reported in the dump.
For now this is conditioned by DEBUG_GLITCHES but if it turns out to be
light enough, maybe we'll keep it enabled full time. In this case it
might have to be moved away from debug dev, or at least documented (or
done as debug counters maybe so that dev can remain undocumented and
updatable within a branch?).
In order to count new event types, we'll need to support empty conditions
so that we don't have to fake if (1) that would pollute the output. This
change checks if #cond is an empty string before concatenating it with
the optional var args, and avoids dumping the colon on the dump if the
whole description is empty.
After master-worker refactoring, master performs re-exec only once up to
receiving "reload" command or USR2 signal. There is no more the second
master's re-exec to free unused memory. Thus, there is no longer need to export
environment variable HAPROXY_LOAD_SUCCESS with worker process load status. This
status can be simply saved in a global variable load_status.
Since 3.0, it is possible to assign a GUID to proxies, listeners and
servers. These objects are stored in a global tree guid_tree.
Proxies and listeners are static. However, servers may be added or
deleted at runtime, which imply that guid_tree must be protected. Fix
this by declaring a read-write lock to protect tree access.
For now, only guid_insert() and guid_remove() are protected using a
write lock. Outside of these, GUID tree is not accessed at runtime. If
server CLI commands are extended to support GUID as server identifier,
lookup operation should be extended with a read lock protection.
Note that during stat-file preloading, GUID tree is accessed for lookup.
However, as it is performed on startup which is single threaded, there
is no need for lock here. A BUG_ON() has been added to ensure this
precondition remains true.
This bug could caused a segfault when using dynamic servers with GUID.
However, it was never reproduced for now.
This must be backported up to 3.0. To avoid a conflict issue, the
previous cleanup patch can be merged before it.