Instead of speaking of an initialisation stage for each data
fast-forwarding, we now use the negociate term. Thus init_ff/init_fastfwd
functions were renamed nego_ff/nego_fastfwd.
Data fast-forwarding does not build without the kernel splicing support
because counters about splicing don't exist. To make the code more readable,
all code about splicing is disabled if kernel splicing is not supported.
The zero-copy forwarding or the mux-to-mux forwarding is a way to
fast-forward data without using the channels buffers. Data are transferred
from a mux to the other one. The kernel splicing is an optimization of the
zero-copy forwarding. But it can also use normal buffers (but not channels
ones). This way, it could be possible to fast-forward data with muxes not
supporting the kernel splicing (H2 and H3 muxes) but also with applets.
However, this mode can introduce regressions or bugs in future (just like
the kernel splicing). Thus, It could be usefull to disable this optim. To do
so, in configuration, the global tune settting
'tune.disable-zero-copy-forwarding' may be set in a global section or the
'-dZ' command line parameter may be used to start HAProxy. Of course, this
also disables the kernel splicing.
The PT multiplexer now implements callbacks function to produce and consume
fast-forwarded data. Only splicing is support because the mux-pt does not
use its own buffers.
Because channel_is_empty() function does now only check the channel's
buffer, we can remove it and rely on co_data() instead. Of course, all tests
must be inverted.
channel_is_empty() is thus removed.
It is important to split channels and I/O buffers. When data are pushed in
an I/O buffer, we consider them as forwarded. The channel never sees
them. Fast-forwarded data are now handled in the SE only.
The H2 multiplexer now implements callbacks to consume fast-forwarded
data. It is the most usful case: A H2 client getting data from a H1
server. It is also the easiest case to implement. The producer side is
trickier because of multiplexing. It is not obvious this case would be
improved with data fast-forwarding.
When message headers are parsed and an HTX start-line is created, if we
detect the response must not have any payload, a specific flag must be set
on the HTX start-line. It happens for instance for response to HEAD
requests. This flag is useb by the multiplexers to know response payload, if
any, must be silently skipped.
This was not performed when h2 HEADERS frames were decoded. This HTX flag
was specifically added to fix a bug when the splicing is inuse. Thus the H2
multiplexer was not concerned. Because the mux-to-mux fast-forwarding will
be introduced, it is important handle this flag in the H2 multiplexer too.
Just like for the zero-copy, this patch tries to simplify the code
responsible to format the message payload before sending it. But here, we
take care to simplify the loop on the HTX blocks. The result should be
less errorrpone.
In h1_make_data(), the function responsible to format the message payload
before sending it, the code dealing with zero-copy was slighly simplified
(at least for me :).
There is no real change but there is a better split between messages with a
content-length and cunked messages.
This function should be used to send the chunk size, before appending the
chunk payload. It also takes care to add a CRLF to finish a previous chunk,
if necessary. This function will be used to fix the splicing for re-chunk
responses with an unknown length.
When data were sent using the kernel splicing, we tried to send all data
with no restriction. Most of time it is valid. However, because the payload
representation may differ between the producer and the consumer, it is
important to be able to specify how must data to send via the splicing.
Of course, for performance reason, it is important to maximize amount of
data send via splicing at each call. However, on edge-cases, this now can be
limited.
On the sending path, there are 3 states for chunked payload in H1:
* H1_MSG_CHUNK_SIZE: the chunk size must be emitted
* H1_MSH_CHUNK_CRLF: The end of the chunk must be emitted
* H1_MSG_DATA: Chunked data must be emitted
However, some shortcuts were used on the sending path to avoid some
transitions. Especially, outgoing messages were never switched in
H1_MSG_CHUNK_SIZE state.
However, it will be necessary to properly handle all transitions on the payload
to implement mux-to-mux forwarding, to be sure to always known when the chunk
size or the end of the chunk must be emitted.
For now, it is not an issue, but it is safer to explicitly ignore HTX extra
field for responses with unknown length. This will be mandatory to future
fixes, to be able to re-chunk responses with an unknown length..
Now the kernel splicing support was removed, we can add mux-to-mux
fast-forward support. Of course, the splicing support will be reintroduced
in the muxes themselves but this will be transparent.
Changes are mainly located into sc_conn_recv() and sc_conn_send().
The kernel splicing support was totally remove waiting for the mux-to-mux
fast-forward implementation. So corresponding mux callbacks can be removed
now.
Because the kernel splicing support was removed from the stconn, it is
useless to keep it in muxes. In this patch, we remove the kernel splicing
support from the H1 multiplexer. It will be replaced by the mux-to-mux data
fast-forwarding.
Because the kernel splicing support was removed from the stconn, it is
useless to keep it in muxes. In this patch, we remove the kernel splicing
support from the passthough multiplexer. It will be replaced by the
mux-to-mux data fast-forwarding.
mux-to-mux fast-forwarding will be added. To avoid mix with the splicing and
simplify the commits, the kernel splicing support is removed from the
stconn. CF_KERN_SPLICING flag is removed and the support is no longer tested
in process_stream().
In the stconn part, rcv_pipe() callback function is no longer called.
Reg-tests scripts testing the kernel splicing are temporarly marked as
broken.
To perform the mux-to-mux data fast-forwarding, 4 new callbacks were added
into the mux_ops structure. 2 callbacks will be used from the stconn for
fast-forward data. The 2 other callbacks will be used by the endpoint to
request an iobuf to the opposite endpoint.
* fastfwd() callback function is used by a producer to forward data
* resume_fastfwd() callback function is used by a consumer if some data are
blocked in the iobuf, to resume the data forwarding.
* init_fastfwd() must be used by an endpoint (the producer one), inside the
fastfwd() callback to request an iobuf to the opposite side (the consumer
one).
* done_fastfwd() must be used by an endpoint (the producer one) at the end
of fastfwd() to notify the opposite endpoint (the consumer one) if data
were forwarded or not.
This API is still under development, so it may evolved. Especially when the
fast-forward will be extended to applets.
2 helper functions were also added into the SE api to wrap init_fastfwd()
and done_fastfwd() callback function of the underlying endpoint.
For now, this API is unsed and not implemented at all in muxes.
It is unused for now, but the iobuf structure now owns a pointer to a
buffer. This buffer will be used to perform mux-to-mux fast-forwarding when
splicing is not supported or unusable. This pointer should be filled by an
endpoint to let the opposite one forward data.
Extra fields, in addition to the buffer, are mandatory because the buffer
may already contains some data. the ".offset" field may be used may be used
as the position to start to copy data. Finally, the amount of data copied in
this buffer must be saved in ".data" field.
Some flags are also added to prepare next changes. And helper stconn
fnuctions are updated to also count data in the buffer. For a first
implementation, it is not planned to handle data in the buffer and in the
pipe in same time. But it will be possible to do so.
Instead of talking about kernel splicing at stconn/sedesc level, we now try
to talk about mux-to-mux fast-forwarding. To do so, 2 functions were added
to know if there are fast-forwarded data and to retrieve this amount of
data. Of course, for now, there is only data in a pipe.
In addition, some flags were renamed to reflect this notion. Note the
channel's documentation was not updated yet.
The pipes used to put data when the kernel splicing is in used are moved in
the SE descriptors. For now, it is just a simple remplacement but there is a
major difference with the pipes in the channel. The data are pushed in the
consumer's pipe while it was pushed in the producer's pipe. So it means the
request data are now pushed in the pipe of the backend SE descriptor and
response data are pushed in the pipe of the frontend SE descriptor.
The idea is to hide the pipe from the channel/SC side and to be able to
handle fast-forwading in pipe but also in buffer. To do so, the pipe is
inside a new entity, called iobuf. This entity will be extended.
If a shutw is blocked because the mux is full or busy, we must defer the
shutr. In this case, the H2 stream is not in H2_SS_CLOSED state because the
shutw is also deferred. If the shutr is performed, this will lead to a
error.
Concretly, when the mux is unblocked, a RST_STREAM is sent while in some
cases, an empty DATA frame with ES flag set could be sent.
This patch should be backported to all stable versions.
Redirect responses sent during the HTTP analysis have no payload. However
there is still a "Content-Length" header. It is important to set the
corresponding flag on the HTX start-line to be sure to preserve this header
when the reponse is sent to the client. The same is true with the stats
applet, when it returns a redirect responses.
It is especially important because we no ignore in-fly modifications of
"Content-Length" or "Transfer-Encoding" headers without updating the HTX
start-line flags.
This patch may be backported to all stable versions but it is probably
useless because only the 2.9-dev is affected by the bug.
Since commit 723c73f8a ("MEDIUM: mux-h1: Split h1_process_mux() to make code
more readable"), outgoing H1 chunked messages with no data at all get
delayed by 200ms. It is due to the fact that we end processing too early and
we don't have the opportunity to process trailers in this case.
This fix addresses it by verifying if it's required to emit EOT or trailers,
if any, when retruning from h1_make_data()
No backport is needed, this was in 2.9-dev.
Since last fixes about the lua cosocket, the appctx is no longer initialized
in hlua_socket_new(). The code to deal with error at this stage can be
removed.
This patch should fix the issue #2308.
The two timer handlers qc_process_timer() and qc_idle_timer_task() would
inadvertently return NULL when they don't want to be requeued, instead
of just returning the task itself. The effect of returning NULL for the
scheduler is that it considers the task as freed, so it must not touch
it anymore. As such, the TASK_F_RUNNING flag is never removed from these
tasks, and when quic_conn_release() later tries to release these tasks
using task_destroy(), the latter sees the RUNNING flag and just sets
->process to NULL, hoping that the scheduler will kill them on return,
but there's no longer being executed so this never happens and they are
leaked.
Interestingly, this doesn't seem to happen as much when multi-queue is
set to off, but it's likely because the tasks are being replaced and the
first ones have already been woken up and leaked, while the latter might
only trigger on a timeout or timer renewal.
This should address github issue #2310. Thanks to @hpn0t0ad for the
numerous traces that helped understand this sequence.
This must be backported to 2.7 at least, and adapted for 2.6
(qc_idle_timer_task must return t there).
When looking at "show pools", it's often difficult to know which alloc()
corresponds to which free() since it's not often 1:1. But sometimes we
have all elements available to maintain a link between alloc and free.
Indeed, when the caller is recorded in the allocated area, we can store
the pointer to the just created bin instead of the caller address itself,
since the caller address is already in the memprof bin. By doing so, we
permit the pool_free() call to locate the allocator bin and update its
free count when caller tracing is enabled. This for example allows to
produce outputs like this on "show profiling" and a process started with
-dMcaller:
1391967 1391968 22805987328 22806003712| 0x59f72f process_stream+0x19f/0x3a7a p_alloc(0) [delta=-16384] [pool=buffer]
1391936 1391937 22805479424 22805495808| 0x6e1476 task_run_applet+0x426/0xea2 p_alloc(0) [delta=-16384] [pool=buffer]
1391925 1391925 22805299200 22805299200| 0x58435a main+0xdf07a p_alloc(0) [delta=0] [pool=buffer]
0 2087930 0 34208645120| 0x59b519 stream_release_buffers+0xf9/0x110 p_free(-16384) [pool=buffer]
695993 695992 11403149312 11403132928| 0x66018f main+0x1baeaf p_alloc(0) [delta=16384] [pool=buffer]
0 1391957 0 22805823488| 0x59b47c stream_release_buffers+0x5c/0x110 p_free(-16384) [pool=buffer]
695968 695970 11402739712 11402772480| 0x587b85 h1_io_cb+0x9a5/0xe7c p_alloc(0) [delta=-32768] [pool=buffer]
0 1391923 0 22805266432| 0x57f388 main+0xda0a8 p_free(-16384) [pool=buffer]
695959 695960 11402592256 11402608640| 0x586add main+0xe17fd p_alloc(0) [delta=-16384] [pool=buffer]
0 695978 0 11402903552| 0x59cc58 stream_free+0x178/0x9ea p_free(-16384) [pool=buffer]
(...)
Here it's quickly visible that all of them got properly released.
An interesting issue was met when testing the mux-to-mux forwarding code.
In order to preserve fairness, in h2_snd_buf() if other streams are waiting
in send_list or fctl_list, the stream that is attempting to send also goes
to its list, and will be woken up by h2_process_mux() or h2_send() when
some space is released. But on rare occasions, there are only a few (or
even a single) streams waiting in this list, and these streams are just
quickly removed because of a timeout or a quick h2_detach() that calls
h2s_destroy(). In this case there's no even to wake up the other waiting
stream in its list, and this will possibly resume processing after some
client WINDOW_UPDATE frames or even new streams, so usually it doesn't
last too long and it not much noticeable, reason why it was left that
long. In addition, measures have shown that in heavy network-bound
benchmark, this exact situation happens on less than 1% of the streams
(reached 4% with mux-mux).
The fix here consists in replacing these LIST_DEL_INIT() calls on
h2s->list with a function call that checks if other streams were queued
to the send_list recently, and if so, which also tries to resume them
by calling h2_resume_each_sending_h2s(). The detection of late additions
is made via a new flag on the connection, H2_CF_WAIT_INLIST, which is set
when a stream is queued due to other streams being present, and which is
cleared when this is function is called.
It is particularly difficult to reproduce this case which is particularly
timing-dependent, but in a constrained environment, a test involving 32
conns of 20 streams each, all downloading a 10 MB object previously
showed a limitation of 17 Gbps with lots of idle CPU time, and now
filled the cable at 25 Gbps.
This should be backported to all versions where it applies.
Except if we must silently ignore empty connections by enabling
http-ignore-probes or dontlognull options, when a client connection is
closed before the first request, a 400-bad-request response must be sent
with the corresponding log message. However, that is broken since the commit
fc473a6453 ("MEDIUM: mux-h1: Rely on the H1C to deal with shutdown for
reads").
The bug is subtle. Parsing errors are no longer reported on connection errors
before the first request while it should be.
This patch must be backported where the above commit is (as far as 2.7).
In the same way than for stream-connectors (see "BUG/MEDIUM: stconn: Report
a send activity everytime data were sent" for details), we now report a send
activity everytime something was consumed by an applet, even if some output
data remains blocked into the channel's buffer.
This patch must be backported to 2.8.
When read/write timeouts were refactored in 2.8, we decided to change when a
send activity had to be reported. Before, everytime some data were sent a
send activity were reported. At this time, the channel's wex timer were
updated. During the refactoring, we decided to limit send activity to sends
that ampty te channel's buffer, consuming all outgoing data. Idea behind
this change was to protect haproxy against clients consumming data very
slowly.
However, it is too strict. Some congested muxes but still active can hit the
client or the server timeout. It seems a bit unfair. It is especially
visible with QUIC/H3 but it is probably also possible with H2 if the window
size is small.
The better is to restore the old behavior.
This patch must be backported to 2.8.
"log-bufsize" may now be used for a log server (in a log backend) to
configure the bufsize of implicit ring associated to the server (which
defaults to BUFSIZE).
This regtest declares and uses 3 log backends, one of which has TCP syslog
servers declared in it and other ones UDP syslog servers.
Some tests aims at testing log distribution reliability by leveraging the
log-balance hash algorithm with a key extracted from the request URL, and
the dummy vtest syslog servers ensure that messages are sent to the
correct endpoint. Overall this regtest covers essential parts of the log
message distribution and log-balancing logic involved with log backends.
It also leverages the log-forward section to perform the TCP->UDP
translation required to test UDP endpoints since vtest syslog servers
work in UDP mode.
Finally, we have some tests to ensure that the server queuing/dequeuing
and failover (backup) logics work properly.
hash lb algorithm can be configured with the "log-balance hash <cnv_list>"
directive. With this algorithm, the user specifies a converter list with
<cnv_list>.
The produced log message will be passed as-is to the provided converter
list, and the resulting hash will be used to select the log server that
will receive the log message.
split sample_process() in 2 parts in order to be able to only process
the converter part of a sample expression from an existing input sample
struct passed as parameter.
Instead of systematically computing the avalanche hash right after the
gen_hash() call, do it inside the gen_hash() function directly to ensure
avalanche setting is always considered.
Allow the use of the "none" hash-type function so that the key resulting
from the sample expression is directly used as the hash.
This can be useful to do the hashing manually using available hashing
converters, or even custom ones, and then inform haproxy that it can
directly rely on the sample expression result which is explictly handled
as an integer in this case.
In this patch we add basic support for the random algorithm:
random algorithm picks a random server using the result of the
statistical_prng() function as if it was a hash key to then compute the
related server ID.
There is no support for the <draw> parameter (which is implemented for
tcp/http load-balancing), because we don't have the required metrics to
evaluate server's load in log backends for the moment. Plus it would add
more complexity to the __do_send_log_backend() function so we'll keep it
this way for now but this might be needed in the future.
sticky algorithm always tries to send log messages to the first server in
the farm. The server will stay in front during queue and dequeue
operations (no other server can steal its place), unless it becomes
unavailable, in which case it will be replaced by another server from
the tree.
Using "mode log" in a backend section turns the proxy in a log backend
which can be used to log-balance logs between multiple log targets
(udp or tcp servers)
log backends can be used as regular log targets using the log directive
with "backend@be_name" prefix, like so:
| log backend@mybackend local0
A log backend will distribute log messages to servers according to the
log load-balancing algorithm that can be set using the "log-balance"
option from the log backend section. For now, only the roundrobin
algorithm is supported and set by default.
This helper function can be used to create a new sink from an existing
server struct (and thus existing proxy as well), in order to spare some
resources when possible.
implicit rings were automatically forced to the parent logger format, but
this was done upon ring creation.
This is quite restrictive because we might want to choose the desired
format right before generating the log header (ie: when producing the
log message), depending on the logger (log directive) that is
responsible for the log message, and with current logic this is not
possible. (To this day, we still have dedicated implicit ring per log
directive, but this might change)
In ring_write(), we check if the sink->fmt is specified:
- defined: we use it since it is the most precise format
(ie: for named rings)
- undefined: then we fallback to the format from the logger
With this change, implicit rings' format is now set to UNSPEC upon
creation. This is safe because the log header building function
automatically enforces the "raw" format when UNSPEC is set. And since
logger->format also defaults to "raw", no change of default behavior
should be expected.
Introduce log_header struct to easily pass log header data between
functions and use that to simplify the logic around log header
handling.
While at it, some outdated comments were updated as well.
No change in behavior should be expected.