Some parts from the previous doc about logging have been merged and
updated. Most of those parts have been reworked and completed. The
examples are now accurate and reflect recent versions.
As subject when i try to compile haproxy with -DDEBUG_FULL it stop at
stream_sock.c file with:
gcc -Iinclude -Wall -O2 -g -DDEBUG_FULL -DTPROXY -DENABLE_POLL
-DENABLE_EPOLL -DENABLE_SEPOLL -DNETFILTER -DUSE_GETSOCKNAME
-DCONFIG_HAPROXY_VERSION=\"1.3.15\"
-DCONFIG_HAPROXY_DATE=\"2008/04/19\" -c -o src/stream_sock.o
src/stream_sock.c
src/stream_sock.c: In function 'stream_sock_chk_rcv':
src/stream_sock.c:905: error: 'fd' undeclared (first use in this function)
src/stream_sock.c:905: error: (Each undeclared identifier is reported only once
src/stream_sock.c:905: error: for each function it appears in.)
src/stream_sock.c:905: error: 'ob' undeclared (first use in this function)
src/stream_sock.c: In function 'stream_sock_chk_snd':
src/stream_sock.c:940: error: 'fd' undeclared (first use in this function)
src/stream_sock.c:940: error: 'ib' undeclared (first use in this function)
make: *** [src/stream_sock.o] Error 1
With this patch all build fine:
Using the wrong operator (&& instead of &) causes DOWN->UP
transition to take longer than it should and to produce a lot of
redundant logs. With typical "track" usage (1-6 tracking servers) it
shouldn't make a big difference but for heavily tracked servers
this bug leads to hang with 100% CPU usage and extremely big
log spam.
The "bind-process" keyword lets the admin select which instances may
run on which process (in multi-process mode). It makes it easier to
more evenly distribute the load across multiple processes by avoiding
having too many listen to the same IP:ports.
Specifying "interface <name>" after the "source" statement allows
one to bind to a specific interface for proxy<->server traffic.
This makes it possible to use multiple links to reach multiple
servers, and to force traffic to pass via an interface different
from the one the system would have chosen based on the routing
table.
By appending "interface <name>" to a "bind" line, it is now possible
to specifically bind to a physical interface name. Note that this
currently only works on Linux and requires root privileges.
This will provide high performance data forwarding between sockets,
but it is broken on many kernels and will sometimes forward corrupted
data without some kernel patches. Consider this experimental for now.
Setting "nosplice" in the global section will disable the use of TCP
splicing (both tcpsplice and linux 2.6 splice). The same will be
achieved using the "-dS" parameter on the command line.
The global tuning options right now only concern the polling mechanisms,
and they are not in the global struct itself. It's not very practical to
add other options so let's move them to the global struct and remove
types/polling.h which was not used for anything else.
global.maxconn/4 seems to be a good hint for global.maxpipes when that
one must be guessed. If the limit is reached, it's still possible to
set it manually in the configuration.
Using pipe pools makes pipe management a lot easier. It also allows to
remove quite a bunch of #ifdefs in areas which depended on the presence
or not of support for kernel splicing.
The buffer now holds a pointer to a pipe structure which is always NULL
except if there are still data in the pipe. When it needs to use that
pipe, it dynamically allocates it from the pipe pool. When the data is
consumed, the pipe is immediately released.
That way, there is no need anymore to care about pipe closure upon
session termination, nor about pipe creation when trying to use
splice().
Another immediate advantage of this method is that it considerably
reduces the number of pipes needed to use splice(). Tests have shown
that even with 0.2 pipe per connection, almost all sessions can use
splice(), because the same pipe may be used by several consecutive
calls to splice().
A new data type has been added : pipes. Some pre-allocated empty pipes
are maintained in a pool for users such as splice which use them a lot
for very short times.
Pipes are allocated using get_pipe() and released using put_pipe().
Pipes which are released with pending data are immediately killed.
The struct pipe is small (16 to 20 bytes) and may even be further
reduced by unifying ->data and ->next.
It would be nice to have a dedicated cleanup task which would watch
for the pipes usage and destroy a few of them from time to time.
Did a full compile of the 1.3.15.7 - 20081208 snapshot on Freebsd-7.x
recently, and noted that there needs to be a quick patch done on the
Makefile for bsd machines.
This was due to the stream_interface replacing the send data commands
in the rewrite Willy did a while ago.
Simple fix, and it compiled cleanly otherwise. Thanks for the work
Willy!
Cheers,
Ross.
-=
Kernels before 2.6.27.13 would have splice() return EAGAIN on shutdown.
By adding a few tricks, we can deal with the situation. If splice()
returns EAGAIN and the pipe is empty, then fallback to recv() which
will be able to check if it's an end of connection or not.
The advantage of this method is that it remains transparent for good
kernels since there is no reason that epoll() will return EPOLLIN
without anything to read, and even if it would happen, the recv()
overhead on this check is minimal.
If splicing is enabled in a backend, we need to guess how many
pipes will be needed. We used to rely on fullconn, but this leads
to non-working splicing when fullconn is not specified. So we now
fallback to global.maxconn.
This code provides support for linux 2.6 kernel splicing. This feature
appeared in kernel 2.6.25, but initial implementations were awkward and
buggy. A kernel >= 2.6.29-rc1 is recommended, as well as some optimization
patches.
Using pipes, this code is able to pass network data directly between
sockets. The pipes are a bit annoying to manage (fd creation, release,
...) but finally work quite well.
Preliminary tests show that on high bandwidths, there's a substantial
gain (approx +50%, only +20% with kernel workarounds for corruption
bugs). With 2000 concurrent connections, with Myricom NICs, haproxy
now more easily achieves 4.5 Gbps for 1 process and 6 Gbps for two
processes buffers. 8-9 Gbps are easily reached with smaller numbers
of connections.
We also try to splice out immediately after a splice in by making
profit from the new ability for a data producer to notify the
consumer that data are available. Doing this ensures that the
data are immediately transferred between sockets without latency,
and without having to re-poll. Performance on small packets has
considerably increased due to this method.
Earlier kernels return only one TCP segment at a time in non-blocking
splice-in mode, while newer return as many segments as may fit in the
pipe. To work around this limitation without hurting more recent kernels,
we try to collect as much data as possible, but we stop when we believe
we have read 16 segments, then we forward everything at once. It also
ensures that even upon shutdown or EAGAIN the data will be forwarded.
Some tricks were necessary because the splice() syscall does not make
a difference between missing data and a pipe full, it always returns
EAGAIN. The trick consists in stop polling in case of EAGAIN and a non
empty pipe.
The receiver waits for the buffer to be empty before using the pipe.
This is in order to avoid confusion between buffer data and pipe data.
The BF_EMPTY flag now covers the pipe too.
Right now the code is disabled by default. It needs to be built with
CONFIG_HAP_LINUX_SPLICE, and the instances intented to use splice()
must have "option splice-response" (or option splice-request) enabled.
It is probably desirable to keep a pool of pre-allocated pipes to
avoid having to create them for every session. This will be worked
on later.
Preliminary tests show very good results, even with the kernel
workaround causing one memcpy(). At 3000 connections, performance
has moved from 3.2 Gbps to 4.7 Gbps.
Some older libc don't define the splice() syscall, and some even
define a wrong one. For this reason, we try our best to declare
it correctly. These definitions still work with recent glibc.
When CONFIG_HAP_LINUX_SPLICE is defined, the buffer structure will be
slightly enlarged to support information needed for kernel splicing
on Linux.
A first attempt consisted in putting this information into the stream
interface, but in the long term, it appeared really awkward. This
version puts the information into the buffer. The platform-dependant
part is conditionally added and will only enlarge the buffers when
compiled in.
One new flag has also been added to the buffers: BF_KERN_SPLICING.
It indicates that the application considers it is appropriate to
use splicing to forward remaining data.
Three new options have been added when CONFIG_HAP_LINUX_SPLICE is
set :
- splice-request
- splice-response
- splice-auto
They are used to enable splicing per frontend/backend. They are also
supported in defaults sections. The "splice-auto" option is meant to
automatically turn splice on for buffers marked as fast streamers.
This should save quite a bunch of file descriptors.
It was required to add a new "options2" field to the proxy structure
because the original "options" is full.
When global.maxpipes is not set, it is automatically adjusted to
the max of the sums of all frontend's and backend's maxconns for
those which have at least one splice option enabled.
When the producer calls stream_sock_chk_snd(), we now try to send
all pending data asynchronously. If it succeeds, we don't have to
enable polling on the FD which saves about half of the calls to
epoll_wait().
In stream_sock_read(), we finally set the WAIT_ROOM flag as soon as
possible, in preparation of the splice code. We reset it when we
detect that some room has been released either in the buffer or in
the splice.
The condition to cakk ->chk_snd() in stream_sock_read() was suboptimal
because we did not call it when the socket was shut down nor when there
was an error after data were added.
Now we ensure to call is whenever there are data pending.
Also, the "full" condition was handled before calling chk_snd(), which
could cause deadlock issues if chk_snd() did consume some data.
stream_sock_write() has been split in two parts :
- the poll callback, intented to be called when an I/O event has
been detected
- the write() core function, which ought to be usable from various
other places, possibly not meant to wake the task up.
The code has also been slightly cleaned up in the process. It's more
readable now.
Some tricks to handle situations where we write nothing were in the
middle of the main loop in stream_sock_write(). This cleanup provides
better source and object code, and slightly shrinks the output code.
This construct collapses into ((flags & (X|Y)) == X) when X is a
single-bit flag. This provides a noticeable code shrink and the
output code results in less conditional jumps.
In the buffers, the read limit used to leave some place for header
rewriting was set by a pointer to the end of the buffer. Not only
this required subtracts at every place in the code, but this will
also soon not be usable anymore when we want to support keepalive.
Let's replace this with a length limit, comparable to the buffer's
length. This has also sightly reduced the code size.
It is not always wise to return 0 in stream_sock_read() upon EAGAIN,
because if we have read enough data, we should consider that enough
and try again later without polling in between.
We still make a difference between small reads and large reads though.
Small reads still lead to polling because we're sure that there's
nothing left in the system's buffers if we read less than one MSS.
The way the buffers and stream interfaces handled ->to_forward was
really not handy for multiple reasons. Now we've moved its control
to the receive-side of the buffer, which is also responsible for
keeping send_max up to date. This makes more sense as it now becomes
possible to send some pre-formatted data followed by forwarded data.
The following explanation has also been added to buffer.h to clarify
the situation. Right now, tests show that the I/O is behaving extremely
well. Some work will have to be done to adapt existing splice code
though.
/* Note about the buffer structure
The buffer contains two length indicators, one to_forward counter and one
send_max limit. First, it must be understood that the buffer is in fact
split in two parts :
- the visible data (->data, for ->l bytes)
- the invisible data, typically in kernel buffers forwarded directly from
the source stream sock to the destination stream sock (->splice_len
bytes). Those are used only during forward.
In order not to mix data streams, the producer may only feed the invisible
data with data to forward, and only when the visible buffer is empty. The
consumer may not always be able to feed the invisible buffer due to platform
limitations (lack of kernel support).
Conversely, the consumer must always take data from the invisible data first
before ever considering visible data. There is no limit to the size of data
to consume from the invisible buffer, as platform-specific implementations
will rarely leave enough control on this. So any byte fed into the invisible
buffer is expected to reach the destination file descriptor, by any means.
However, it's the consumer's responsibility to ensure that the invisible
data has been entirely consumed before consuming visible data. This must be
reflected by ->splice_len. This is very important as this and only this can
ensure strict ordering of data between buffers.
The producer is responsible for decreasing ->to_forward and increasing
->send_max. The ->to_forward parameter indicates how many bytes may be fed
into either data buffer without waking the parent up. The ->send_max
parameter says how many bytes may be read from the visible buffer. Thus it
may never exceed ->l. This parameter is updated by any buffer_write() as
well as any data forwarded through the visible buffer.
The consumer is responsible for decreasing ->send_max when it sends data
from the visible buffer, and ->splice_len when it sends data from the
invisible buffer.
A real-world example consists in part in an HTTP response waiting in a
buffer to be forwarded. We know the header length (300) and the amount of
data to forward (content-length=9000). The buffer already contains 1000
bytes of data after the 300 bytes of headers. Thus the caller will set
->send_max to 300 indicating that it explicitly wants to send those data,
and set ->to_forward to 9000 (content-length). This value must be normalised
immediately after updating ->to_forward : since there are already 1300 bytes
in the buffer, 300 of which are already counted in ->send_max, and that size
is smaller than ->to_forward, we must update ->send_max to 1300 to flush the
whole buffer, and reduce ->to_forward to 8000. After that, the producer may
try to feed the additional data through the invisible buffer using a
platform-specific method such as splice().
*/
Previously, we wrote nothing only if the buffer was empty. Now with
send_max, we can also write nothing because we are not allowed to send
anything due to send_max.
The code starts to look like spaghetti. It needs to be rearranged a
lot before merging the splice patches.
In preparation of splice support, let's add the splice_len member
to the buffer struct. An earlier implementation made it conditional,
which made the whole logics very complex due to a large number of
ifdefs.
Now BF_EMPTY is only set once both buf->l and buf->splice_len are
null. Splice_len is initialized to zero during buffer creation and
is currently not changed, so the whole logics remains unaffected.
When splice gets merged, splice_len will reflect the number of bytes
in flight out of the buffer but not yet sent, typically in a pipe for
the Linux case.
If an analyser sets buf->to_forward to a given value, that many
data will be forwarded between the two stream interfaces attached
to a buffer without waking the task up. The same applies once all
analysers have been released. This saves a large amount of calls
to process_session() and a number of task_dequeue/queue.
By letting the producer tell the consumer there is data to check,
and the consumer tell the producer there is some space left again,
we can cut in half the number of session wakeups.
This is also an important starting point for future splicing support.
Sometimes we don't care about a read timeout, for instance, from the
client when waiting for the server, but we still want the client to
be able to read.
Till now it was done by articially forcing the read timeout to ETERNITY.
But this will cause trouble when we want the low level stream sock to
communicate without waking the session up. So we add a BF_READ_NOEXP
flag to indicate that when the read timeout is to be set, it might
have to be set to ETERNITY.
Since BF_READ_ENA was not used, we replaced this flag.
We don't want to report a buffer timeout if there was I/O activity
for the same events. That way we'll not have to always re-arm timeouts
on I/O, without the fear of a timeout triggering too fast.
For keep-alive, line-mode protocols and splicing, we will need to
limit the sender to process a certain amount of bytes. The limit
is automatically set to the buffer size when analysers are detached
from the buffer.
"option transparent" was set and checked on frontends only while it
is purely a backend thing as it replaces the "balance" mode. For this
reason, it did only work in "listen" sections. This change will then
not affect the rare users of this option.