Three new options have been added when CONFIG_HAP_LINUX_SPLICE is
set :
- splice-request
- splice-response
- splice-auto
They are used to enable splicing per frontend/backend. They are also
supported in defaults sections. The "splice-auto" option is meant to
automatically turn splice on for buffers marked as fast streamers.
This should save quite a bunch of file descriptors.
It was required to add a new "options2" field to the proxy structure
because the original "options" is full.
When global.maxpipes is not set, it is automatically adjusted to
the max of the sums of all frontend's and backend's maxconns for
those which have at least one splice option enabled.
In the buffers, the read limit used to leave some place for header
rewriting was set by a pointer to the end of the buffer. Not only
this required subtracts at every place in the code, but this will
also soon not be usable anymore when we want to support keepalive.
Let's replace this with a length limit, comparable to the buffer's
length. This has also sightly reduced the code size.
The way the buffers and stream interfaces handled ->to_forward was
really not handy for multiple reasons. Now we've moved its control
to the receive-side of the buffer, which is also responsible for
keeping send_max up to date. This makes more sense as it now becomes
possible to send some pre-formatted data followed by forwarded data.
The following explanation has also been added to buffer.h to clarify
the situation. Right now, tests show that the I/O is behaving extremely
well. Some work will have to be done to adapt existing splice code
though.
/* Note about the buffer structure
The buffer contains two length indicators, one to_forward counter and one
send_max limit. First, it must be understood that the buffer is in fact
split in two parts :
- the visible data (->data, for ->l bytes)
- the invisible data, typically in kernel buffers forwarded directly from
the source stream sock to the destination stream sock (->splice_len
bytes). Those are used only during forward.
In order not to mix data streams, the producer may only feed the invisible
data with data to forward, and only when the visible buffer is empty. The
consumer may not always be able to feed the invisible buffer due to platform
limitations (lack of kernel support).
Conversely, the consumer must always take data from the invisible data first
before ever considering visible data. There is no limit to the size of data
to consume from the invisible buffer, as platform-specific implementations
will rarely leave enough control on this. So any byte fed into the invisible
buffer is expected to reach the destination file descriptor, by any means.
However, it's the consumer's responsibility to ensure that the invisible
data has been entirely consumed before consuming visible data. This must be
reflected by ->splice_len. This is very important as this and only this can
ensure strict ordering of data between buffers.
The producer is responsible for decreasing ->to_forward and increasing
->send_max. The ->to_forward parameter indicates how many bytes may be fed
into either data buffer without waking the parent up. The ->send_max
parameter says how many bytes may be read from the visible buffer. Thus it
may never exceed ->l. This parameter is updated by any buffer_write() as
well as any data forwarded through the visible buffer.
The consumer is responsible for decreasing ->send_max when it sends data
from the visible buffer, and ->splice_len when it sends data from the
invisible buffer.
A real-world example consists in part in an HTTP response waiting in a
buffer to be forwarded. We know the header length (300) and the amount of
data to forward (content-length=9000). The buffer already contains 1000
bytes of data after the 300 bytes of headers. Thus the caller will set
->send_max to 300 indicating that it explicitly wants to send those data,
and set ->to_forward to 9000 (content-length). This value must be normalised
immediately after updating ->to_forward : since there are already 1300 bytes
in the buffer, 300 of which are already counted in ->send_max, and that size
is smaller than ->to_forward, we must update ->send_max to 1300 to flush the
whole buffer, and reduce ->to_forward to 8000. After that, the producer may
try to feed the additional data through the invisible buffer using a
platform-specific method such as splice().
*/
In preparation of splice support, let's add the splice_len member
to the buffer struct. An earlier implementation made it conditional,
which made the whole logics very complex due to a large number of
ifdefs.
Now BF_EMPTY is only set once both buf->l and buf->splice_len are
null. Splice_len is initialized to zero during buffer creation and
is currently not changed, so the whole logics remains unaffected.
When splice gets merged, splice_len will reflect the number of bytes
in flight out of the buffer but not yet sent, typically in a pipe for
the Linux case.
If an analyser sets buf->to_forward to a given value, that many
data will be forwarded between the two stream interfaces attached
to a buffer without waking the task up. The same applies once all
analysers have been released. This saves a large amount of calls
to process_session() and a number of task_dequeue/queue.
By letting the producer tell the consumer there is data to check,
and the consumer tell the producer there is some space left again,
we can cut in half the number of session wakeups.
This is also an important starting point for future splicing support.
Sometimes we don't care about a read timeout, for instance, from the
client when waiting for the server, but we still want the client to
be able to read.
Till now it was done by articially forcing the read timeout to ETERNITY.
But this will cause trouble when we want the low level stream sock to
communicate without waking the session up. So we add a BF_READ_NOEXP
flag to indicate that when the read timeout is to be set, it might
have to be set to ETERNITY.
Since BF_READ_ENA was not used, we replaced this flag.
For keep-alive, line-mode protocols and splicing, we will need to
limit the sender to process a certain amount of bytes. The limit
is automatically set to the buffer size when analysers are detached
from the buffer.
It is now possible to set or clear a cookie during a redirection. This
is useful for logout pages, or for protecting against some DoSes. Check
the documentation for the options supported by the "redirect" keyword.
(cherry-picked from commit 4af993822e880d8c932f4ad6920db4c9242b0981)
If "drop-query" is present on a "redirect" line using the "prefix" mode,
then the returned Location header will be the request URI without the
query-string. This may be used on some login/logout pages, or when it
must be decided to redirect the user to a non-secure server.
(cherry-picked from commit f2d361ccd73aa16538ce767c766362dd8f0a88fd)
It is now possible to list all known sessions by issuing "show sess"
on the unix stats socket. The format is not much evolved but it is
very useful for debugging.
The doc has been updated to reflect the new keyword.
This is the first step in implementing a session dump tool.
A session dump will need restart points. It will be necessary for
it to get references to sessions which can be moved when the session
dies.
The principle is not that complex : when a session ends, it looks for
any potential back-references. If it finds any, then it moves them to
the next session in the list. The dump function will of course have
to restart from that new point.
Instead of calling a hard-coded function to produce data, let's
reference this function into the buffer and call it from there
when BF_HIJACK is set. This goes in the direction of more generic
session management code.
The listener referenced in the fd was only used to check the
listener state upon session termination. There was no guarantee
that the FD had not been reassigned by the moment it was processed,
so this was a bit racy. Having it in the session is more robust.
It will be very convenient to have an analyser state in the session.
It will always be initialized to zero. The analysers can make use of
it, but must reset it to zero when they leave.
In order to achieve more generic accept() code, we can set the request
analysers at the listener registration time. It's better than doing it
during accept(), and allows more code reuse.
It was a bit awkward to have session.c call return_srv_error() for
HTTP error messages related to servers. The function has been adapted
to be passed a pointer to the faulty stream interface, and is now a
pointer in the session. It is possible that in the future, it will
become a callback in the stream interface itself.
In order to avoid having to call per-protocol logging function directly
from session.c, it's better to assign the logging function when the session
is created. This also eliminates a test when the function is needed, and
opens the way to more complete logging functions.
proto_http.c was not suitable for session-related processing, it was
just convenient for the tranformation.
Some more splitting must occur: process_request/response in proto_http.c
must be split again per protocol, and the caller must run a list.
Some functions should be directly attached to the session or the buffer
(eg: perform_http_redirect, return_srv_error, http_sess_log).
All the processing has now completely been split in layers. As of
now, everything is still in process_session() which is not the right
place, but the code sequence works. Timeouts, retries, errors, all
work.
The shutdown sequence has been strictly applied: BF_SHUTR/BF_SHUTW
are only assigned by lower layers. Upper layers can only indicate
their wish to close using BF_SHUTR_NOW and BF_SHUTW_NOW.
When a shutdown is performed on a stream interface, the buffer flags
are updated accordingly and re-checked by upper layers. A lot of care
has been taken to ensure that aborts during intermediate connection
setups are correctly handled and shutdowns correctly propagated to
both buffers.
A future evolution would consist in ensuring that BF_SHUT?_NOW may
be set at any time, and applies only when the buffer is empty. This
might help with error messages, but might complicate the processing
of data remaining in buffers.
Some useless buffer flag combinations have been removed.
Stat counters are still broken (eg: per-server total number of sessions).
Error messages should be delayed to the close instant and be produced by
protocol.
Many functions must now move to proper locations.
Now the global variable 'sessions' will be a dual-linked list of all
known sessions. The list element is set at the beginning of the session
so that it's easier to follow them all with gdb.
Two new functions are used instead : buffer_check_{shutr,shutw}.
It is indeed more adequate to check for new closures only when the
buffer reports them.
Several remaining unclosed connections were detected after a test,
even before this patch, so a bug remains. To reproduce, try the
following during 30 seconds :
inject30l4 -n 20000 -l -t 1000 -P 10 -o 4 -u 100 -s 100 -G 127.0.0.1:8000/
There were rare situations where it was not easy to detect that a failed
session attempt had occurred and needed some server cleanup. In particular,
client aborts sometimes lead to session leaks on the server side.
A new state "SI_ST_DIS" (disconnected) has been introduced for this. When
a session has been closed at a stream interface but the server cleanup has
not occurred, this state is entered instead of CLO. The cleanup is then
performed there and the state goes to CLO.
A new diagram has been added to show possible stream_interface state
transitions that can occur in a stream-sock. It makes debugging easier.
It is quite hard to track when the current session has already been counted
or discounted from the server's total number of established sessions. For
this reason, we introduce a new session flag, SN_CURR_SESS, which indicates
if the current session is one of those reported by the server or not. It
simplifies session accounting and makes it far more robust. It also makes
it possible to perform a last-minute cleanup during session_free().
Right now, with this fix and a few more buffer transitions fixes, no session
were found to remain after a test.
Tracking connection status changes was hard, and some code was
redundant. A new SI_ST_CER state was added to the stream interface
to indicate a past connection error, and an SI_FL_ERR flag was
added to report past I/O error. The stream_sock code does not set
the connection to SI_ST_CLO anymore in case of I/O error, it's
the upper layer which does it. This makes it possible to know
exactly when the file descriptors are allocated.
The new SI_ST_CER state permitted to split tcp_connection_status()
in two parts, one processing SI_ST_CON and the other one SI_ST_CER.
Synchronous connection errors now make use of this last state, hence
eliminating duplicate code.
Some ib<->ob copy paste errors were found and fixed, and all entities
setting SI_ST_CLO also shut the buffers down.
Some of these stream_interface specific functions and structures
have migrated to a new stream_interface.c file.
Some types of errors are still not detected by the buffers. For
instance, let's assume the following scenario in one single pass
of process_session: a connection sits in SI_ST_TAR state during
a retry. At TAR expiration, a new connection attempt is made, the
connection is obtained and srv->cur_sess is increased. Then the
buffer timeout is fires and everything is cleared, the new state
becomes SI_ST_CLO. The cleaning code checks that previous state
was either SI_ST_CON or SI_ST_EST to release the connection. But
that's wrong because last state is still SI_ST_TAR. So the
server's connection count does not get decreased.
This means that prev_state must not be used, and must be replaced
by some transition detection instead of level detection.
The following debugging line was useful to track state changes :
fprintf(stderr, "%s:%d: cs=%d ss=%d(%d) rqf=0x%08x rpf=0x%08x\n", __FUNCTION__, __LINE__,
s->si[0].state, s->si[1].state, s->si[1].err_type, s->req->flags, s-> rep->flags);
The connection setup code has been refactored in order to
make it run only on low level (stream interface). Several
complicated functions have been removed from backend.c,
and we now have sess_update_stream_int() to manage
an assigned connection, sess_prepare_conn_req() to assign a
server to a connection request, perform_http_redirect() to
redirect instead of connecting to server, and return_srv_error()
to return connection error status messages.
The stream_interface status changes are checked before adjusting
buffer flags, so that the buffers can be informed about this lower
level update.
A new connection is initiated by changing si->state from SI_ST_INI
to SI_ST_REQ.
The code seems to work but is awfully dirty. Some functions need
to be moved, and the layering is not yet quite clear.
A lot of dead old code has simply been removed.
It was not practical to have QUEUE and TAR timers in buffers, as they caused
triggering of the timeout flags. Move them to the stream interface where they
belong.
Now we have almost two distinct parts between tcp and http.
Only the connection establishment code still requires some
resynchronization, the rest does not.
Those entries were really needed for cleaner and better code. Using them
has permitted to automatically close a file descriptor during a shut write,
reducing by 20% the number of calls to process_session() and derived
functions.
Process_session() does not need to know the file descriptor anymore, though
it still remains very complicated due to the special case for the connect
mode.
As of now, a stream socket does not directly wake up the task
but it does contact the stream interface which itself knows the
task. This allows us to perform a few cleanups upon errors and
shutdowns, which reduces the number of calls to data_update()
from 8 per session to 2 per session, and make all the functions
called in the process_session() loop completely swappable.
Some improvements are required. We need to provide a shutw()
function on stream interfaces so that one side which closes
its read part on an empty buffer can propagate the close to
the remote side.
The owner of an fd was initially a task but this was sometimes
casted to a (struct listener *). We'll soon need more types,
so void* is more appropriate.
It's very frequent to require some information about the
reason why a task is running. Some flags have been added
so that a task now knows if it got woken up due to I/O
completion, timeout, etc...
The buffer flags became a big bazaar. Re-arrange them
so that their names are more explicit and so that they
are more easily readable in hex form. Some aggregates
have also been adjusted.
It was a waste to constantly update the file descriptor's status
and timeouts during a flags update. So stream_sock_process_data
has been slit in two parts :
stream_sock_data_update() => computes updated flags
stream_sock_data_finish() => computes timeouts
Only the first one is called during flag updates. The second one
is only called upon completion. The number of calls to fd_set/fd_clr
has now significantly dropped.
Also, it's useless to check for errors and timeouts in the
process_session() loop, it's enough to check for them at the
beginning.
The client side now relies on stream_sock_process_data(). One
part has not yet been re-implemented, it concerns the calls
to produce_content().
process_session() has been adjusted to correctly check for
changing bits in order not to call useless functions too many
times.
It already appears that stream_sock_process_data() should be
split so that the timeout computations are only performed at
the exit of process_session().
srv_state has been removed from HTTP state machines, and states
have been split in either TCP states or analyzers. For instance,
the TARPIT state has just become a simple analyzer.
New flags have been added to the struct buffer to compensate this.
The high-level stream processors sometimes need to force a disconnection
without touching a file-descriptor (eg: report an error). But if
they touched BF_SHUTW or BF_SHUTR, the file descriptor would not
be closed. Thus, the two SHUT?_NOW flags have been added so that
an application can request a forced close which the stream interface
will be forced to obey.
During this change, a new BF_HIJACK flag was added. It will
be used for data generation, eg during a stats dump. It
prevents the producer on a buffer from sending data into it.
BF_SHUTR_NOW /* the producer must shut down for reads ASAP */
BF_SHUTW_NOW /* the consumer must shut down for writes ASAP */
BF_HIJACK /* the producer is temporarily replaced */
BF_SHUTW_NOW has precedence over BF_HIJACK. BF_HIJACK has
precedence over BF_MAY_FORWARD (so that it does not need it).
New functions buffer_shutr_now(), buffer_shutw_now(), buffer_abort()
are provided to manipulate BF_SHUT* flags.
A new type "stream_interface" has been added to describe both
sides of a buffer. A stream interface has states and error
reporting. The session now has two stream interfaces (one per
side). Each buffer has stream_interface pointers to both
consumer and producer sides.
The server-side file descriptor has moved to its stream interface,
so that even the buffer has access to it.
process_srv() has been split into three parts :
- tcp_get_connection() obtains a connection to the server
- tcp_connection_failed() tests if a previously attempted
connection has succeeded or not.
- process_srv_data() only manages the data phase, and in
this sense should be roughly equivalent to process_cli.
Little code has been removed, and a lot of old code has been
left in comments for now.
It's a shame not to use buffer->wex for connection timeouts since by
definition it cannot be used till the connection is not established.
Using it instead of ->cex also makes the buffer processing more
symmetric.
It is not always convenient to run checks on req->l in functions to
check if a buffer is empty or full. Now the stream_sock functions
set flags BF_EMPTY and BF_FULL according to the buffer contents. Of
course, functions which touch the buffer contents adjust the flags
too.
BF_SHUTR_PENDING and BF_SHUTW_PENDING were poor ideas because
BF_SHUTR is the pending of BF_SHUTW_DONE and BF_SHUTW is the
pending of BF_SHUTR_DONE. Remove those two useless and confusing
"pending" versions and rename buffer_shut{r,w}_* functions.