The TCP analyser has moved to proto_tcp.c. Breaking the function
has required finer use of the return value and adding some tests
to process_session().
It was a bit awkward to have session.c call return_srv_error() for
HTTP error messages related to servers. The function has been adapted
to be passed a pointer to the faulty stream interface, and is now a
pointer in the session. It is possible that in the future, it will
become a callback in the stream interface itself.
The new function looks like the previous one except that it operates
at the stream interface level and assumes an already closed SI.
Also remove some old unused occurrences of srv_close_with_err().
proto_http.c was not suitable for session-related processing, it was
just convenient for the tranformation.
Some more splitting must occur: process_request/response in proto_http.c
must be split again per protocol, and the caller must run a list.
Some functions should be directly attached to the session or the buffer
(eg: perform_http_redirect, return_srv_error, http_sess_log).
All the processing has now completely been split in layers. As of
now, everything is still in process_session() which is not the right
place, but the code sequence works. Timeouts, retries, errors, all
work.
The shutdown sequence has been strictly applied: BF_SHUTR/BF_SHUTW
are only assigned by lower layers. Upper layers can only indicate
their wish to close using BF_SHUTR_NOW and BF_SHUTW_NOW.
When a shutdown is performed on a stream interface, the buffer flags
are updated accordingly and re-checked by upper layers. A lot of care
has been taken to ensure that aborts during intermediate connection
setups are correctly handled and shutdowns correctly propagated to
both buffers.
A future evolution would consist in ensuring that BF_SHUT?_NOW may
be set at any time, and applies only when the buffer is empty. This
might help with error messages, but might complicate the processing
of data remaining in buffers.
Some useless buffer flag combinations have been removed.
Stat counters are still broken (eg: per-server total number of sessions).
Error messages should be delayed to the close instant and be produced by
protocol.
Many functions must now move to proper locations.
Now the global variable 'sessions' will be a dual-linked list of all
known sessions. The list element is set at the beginning of the session
so that it's easier to follow them all with gdb.
Two new functions are used instead : buffer_check_{shutr,shutw}.
It is indeed more adequate to check for new closures only when the
buffer reports them.
Several remaining unclosed connections were detected after a test,
even before this patch, so a bug remains. To reproduce, try the
following during 30 seconds :
inject30l4 -n 20000 -l -t 1000 -P 10 -o 4 -u 100 -s 100 -G 127.0.0.1:8000/
Tracking connection status changes was hard, and some code was
redundant. A new SI_ST_CER state was added to the stream interface
to indicate a past connection error, and an SI_FL_ERR flag was
added to report past I/O error. The stream_sock code does not set
the connection to SI_ST_CLO anymore in case of I/O error, it's
the upper layer which does it. This makes it possible to know
exactly when the file descriptors are allocated.
The new SI_ST_CER state permitted to split tcp_connection_status()
in two parts, one processing SI_ST_CON and the other one SI_ST_CER.
Synchronous connection errors now make use of this last state, hence
eliminating duplicate code.
Some ib<->ob copy paste errors were found and fixed, and all entities
setting SI_ST_CLO also shut the buffers down.
Some of these stream_interface specific functions and structures
have migrated to a new stream_interface.c file.
Some types of errors are still not detected by the buffers. For
instance, let's assume the following scenario in one single pass
of process_session: a connection sits in SI_ST_TAR state during
a retry. At TAR expiration, a new connection attempt is made, the
connection is obtained and srv->cur_sess is increased. Then the
buffer timeout is fires and everything is cleared, the new state
becomes SI_ST_CLO. The cleaning code checks that previous state
was either SI_ST_CON or SI_ST_EST to release the connection. But
that's wrong because last state is still SI_ST_TAR. So the
server's connection count does not get decreased.
This means that prev_state must not be used, and must be replaced
by some transition detection instead of level detection.
The following debugging line was useful to track state changes :
fprintf(stderr, "%s:%d: cs=%d ss=%d(%d) rqf=0x%08x rpf=0x%08x\n", __FUNCTION__, __LINE__,
s->si[0].state, s->si[1].state, s->si[1].err_type, s->req->flags, s-> rep->flags);
The connection setup code has been refactored in order to
make it run only on low level (stream interface). Several
complicated functions have been removed from backend.c,
and we now have sess_update_stream_int() to manage
an assigned connection, sess_prepare_conn_req() to assign a
server to a connection request, perform_http_redirect() to
redirect instead of connecting to server, and return_srv_error()
to return connection error status messages.
The stream_interface status changes are checked before adjusting
buffer flags, so that the buffers can be informed about this lower
level update.
A new connection is initiated by changing si->state from SI_ST_INI
to SI_ST_REQ.
The code seems to work but is awfully dirty. Some functions need
to be moved, and the layering is not yet quite clear.
A lot of dead old code has simply been removed.
Those entries were really needed for cleaner and better code. Using them
has permitted to automatically close a file descriptor during a shut write,
reducing by 20% the number of calls to process_session() and derived
functions.
Process_session() does not need to know the file descriptor anymore, though
it still remains very complicated due to the special case for the connect
mode.
As of now, a stream socket does not directly wake up the task
but it does contact the stream interface which itself knows the
task. This allows us to perform a few cleanups upon errors and
shutdowns, which reduces the number of calls to data_update()
from 8 per session to 2 per session, and make all the functions
called in the process_session() loop completely swappable.
Some improvements are required. We need to provide a shutw()
function on stream interfaces so that one side which closes
its read part on an empty buffer can propagate the close to
the remote side.
It's very frequent to require some information about the
reason why a task is running. Some flags have been added
so that a task now knows if it got woken up due to I/O
completion, timeout, etc...
A test has shown that more than 16% of the calls to task_wakeup()
could be avoided because the task is already woken up. So make it
inline and move the test to the inline part.
The buffer flags became a big bazaar. Re-arrange them
so that their names are more explicit and so that they
are more easily readable in hex form. Some aggregates
have also been adjusted.
It was a waste to constantly update the file descriptor's status
and timeouts during a flags update. So stream_sock_process_data
has been slit in two parts :
stream_sock_data_update() => computes updated flags
stream_sock_data_finish() => computes timeouts
Only the first one is called during flag updates. The second one
is only called upon completion. The number of calls to fd_set/fd_clr
has now significantly dropped.
Also, it's useless to check for errors and timeouts in the
process_session() loop, it's enough to check for them at the
beginning.
srv_state has been removed from HTTP state machines, and states
have been split in either TCP states or analyzers. For instance,
the TARPIT state has just become a simple analyzer.
New flags have been added to the struct buffer to compensate this.
The high-level stream processors sometimes need to force a disconnection
without touching a file-descriptor (eg: report an error). But if
they touched BF_SHUTW or BF_SHUTR, the file descriptor would not
be closed. Thus, the two SHUT?_NOW flags have been added so that
an application can request a forced close which the stream interface
will be forced to obey.
During this change, a new BF_HIJACK flag was added. It will
be used for data generation, eg during a stats dump. It
prevents the producer on a buffer from sending data into it.
BF_SHUTR_NOW /* the producer must shut down for reads ASAP */
BF_SHUTW_NOW /* the consumer must shut down for writes ASAP */
BF_HIJACK /* the producer is temporarily replaced */
BF_SHUTW_NOW has precedence over BF_HIJACK. BF_HIJACK has
precedence over BF_MAY_FORWARD (so that it does not need it).
New functions buffer_shutr_now(), buffer_shutw_now(), buffer_abort()
are provided to manipulate BF_SHUT* flags.
A new type "stream_interface" has been added to describe both
sides of a buffer. A stream interface has states and error
reporting. The session now has two stream interfaces (one per
side). Each buffer has stream_interface pointers to both
consumer and producer sides.
The server-side file descriptor has moved to its stream interface,
so that even the buffer has access to it.
process_srv() has been split into three parts :
- tcp_get_connection() obtains a connection to the server
- tcp_connection_failed() tests if a previously attempted
connection has succeeded or not.
- process_srv_data() only manages the data phase, and in
this sense should be roughly equivalent to process_cli.
Little code has been removed, and a lot of old code has been
left in comments for now.
It is not always convenient to run checks on req->l in functions to
check if a buffer is empty or full. Now the stream_sock functions
set flags BF_EMPTY and BF_FULL according to the buffer contents. Of
course, functions which touch the buffer contents adjust the flags
too.
BF_SHUTR_PENDING and BF_SHUTW_PENDING were poor ideas because
BF_SHUTR is the pending of BF_SHUTW_DONE and BF_SHUTW is the
pending of BF_SHUTR_DONE. Remove those two useless and confusing
"pending" versions and rename buffer_shut{r,w}_* functions.
A new member has been added to the struct session. It keeps a trace
of what block of code performs a close or a shutdown on a socket, and
in what sequence. This is extremely convenient for post-mortem analysis
where flag combinations and states seem impossible. A new ABORT_NOW()
macro has also been added to make the code immediately segfault where
called.
The HTTP response code has been moved to a specific function
called "process_response" and the SV_STHEADERS state has been
removed and replaced with the flag AN_RTR_HTTP_HDR.
For the first time, HTTP and TCP are not merged anymore. All request
processing has moved to process_request while the TCP processing of
the frontend remains in process_cli. The code is a lot cleaner,
simpler, smaller (1%) and slightly faster (1% too).
Right now, the HTTP state machine cannot easily command the TCP
state machine, but it does not cause that many difficulties.
The response processing has not yet been extracted, and the unix-stream
state machines have to be broken down that way too.
The CL_STDATA, CL_STSHUTR and CL_STSHUTW states still exist and are
exactly the sames. They will have to be all merged into CL_STDATA
once the work has stabilized. It is also possible that this single
state will disappear in favor of just buffer flags.
When an ACL is referenced at a wrong place (eg: response during request, layer7
during layer4), try to indicate precisely the name and requirements of this ACL.
Only the first faulty ACL is returned. A small change consisting in iterating
that way may improve reports :
cap = ACL_USE_any_unexpected
while ((acl=cond_find_require(cond, cap))) {
warning()
cap &= ~acl->requires;
}
This will report the first ACL of each unsupported type. But doing so will
mangle the error reporting a lot, so we need to rework error reports first.
It should be stated as a rule that a C file should never
include types/xxx.h when proto/xxx.h exists, as it gives
less exposure to declaration conflicts (one of which was
caught and fixed here) and it complicates the file headers
for nothing.
Only types/global.h, types/capture.h and types/polling.h
have been found to be valid includes from C files.
This new function supports one major and one minor and makes an int of them.
It is very convenient to compare versions (eg: SSL) just as if they were plain
integers, as the comparison functions will still be based on integers.
Some people need to inspect contents of TCP requests before
deciding to forward a connection or not. A future extension
of this demand might consist in selecting a server farm
depending on the protocol detected in the request.
For this reason, a new state CL_STINSPECT has been added on
the client side. It is immediately entered upon accept() if
the statement "tcp-request inspect-delay <xxx>" is found in
the frontend configuration. Haproxy will then wait up to
this amount of time trying to find a matching ACL, and will
either accept or reject the connection depending on the
"tcp-request content <action> {if|unless}" rules, where
<action> is either "accept" or "reject".
Note that it only waits that long if no definitive verdict
can be found earlier. That generally implies calling a fetch()
function which does not have enough information to decode
some contents, or a match() function which only finds the
beginning of what it's looking for.
It is only at the ACL level that partial data may be processed
as such, because we need to distinguish between MISS and FAIL
*before* applying the term negation.
Thus it is enough to add "| ACL_PARTIAL" to the last argument
when calling acl_exec_cond() to indicate that we expect
ACL_PAT_MISS to be returned if some data is missing (for
fetch() or match()). This is the only case we may return
this value. For this reason, the ACL check in process_cli()
has become a lot simpler.
A new ACL "req_len" of type "int" has been added. Right now
it is already possible to drop requests which talk too early
(eg: for SMTP) or which don't talk at all (eg: HTTP/SSL).
Also, the acl fetch() functions have been extended in order
to permit reporting of missing data in case of fetch failure,
using the ACL_TEST_F_MAY_CHANGE flag.
The default behaviour is unchanged, and if no rule matches,
the request is accepted.
As a side effect, all layer 7 fetching functions have been
cleaned up so that they now check for the validity of the
layer 7 pointer before dereferencing it.
This is the first attempt at moving all internal parts from
using struct timeval to integer ticks. Those provides simpler
and faster code due to simplified operations, and this change
also saved about 64 bytes per session.
A new header file has been added : include/common/ticks.h.
It is possible that some functions should finally not be inlined
because they're used quite a lot (eg: tick_first, tick_add_ifset
and tick_is_expired). More measurements are required in order to
decide whether this is interesting or not.
Some function and variable names are still subject to change for
a better overall logics.
When queuing a timer, it's very likely that an expiration date is
equal to that of the previously queued timer, due to time rounding
to the millisecond. Optimizing for this case provides a noticeable
1% performance boost.
The run queue scheduler now considers task->nice to queue a task and
to pick a task out of the queue. This makes it possible to boost the
access to statistics (both via HTTP and UNIX socket). The UNIX socket
receives twice as much a boost as the HTTP socket because it is more
sensible.
We now insert tasks in a certain sequence in the run queue.
The sorting key currently is the arrival order. It will now
be possible to apply a "nice" value to any task so that it
goes forwards or backwards in the run queue.
The calls to wake_expired_tasks() and maintain_proxies()
have been moved to the main run_poll_loop(), because they
had nothing to do in process_runnable_tasks().
The task_wakeup() function is not inlined anymore, as it was
only used at one place.
The qlist member of the task structure has been removed now.
The run_queue list has been replaced for an integer indicating
the number of tasks in the run queue.
The ultree code has been removed in favor of a simpler and
cleaner ebtree implementation. The eternity queue does not
need to exist anymore, and the pool_tree64 has been removed.
The ebtree node is stored in the task itself. The qlist list
header is still used by the run-queue, but will be able to
disappear once the run-queue uses ebtree too.
The dequeuing logic was completely wrong. First, a task was assigned
to all servers to process the queue, but this task was never scheduled
and was only woken up on session free. Second, there was no reservation
of server entries when a task was assigned a server. This means that
as long as the task was not connected to the server, its presence was
not accounted for. This was causing trouble when detecting whether or
not a server had reached maxconn. Third, during a redispatch, a session
could lose its place at the server's and get blocked because another
session at the same moment would have stolen the entry. Fourth, the
redispatch option did not work when maxqueue was reached for a server,
and it was not possible to do so without indefinitely hanging a session.
The root cause of all those problems was the lack of pre-reservation of
connections at the server's, and the lack of tracking of servers during
a redispatch. Everything relied on combinations of flags which could
appear similarly in quite distinct situations.
This patch is a major rework but there was no other solution, as the
internal logic was deeply flawed. The resulting code is cleaner, more
understandable, uses less magics and is overall more robust.
As an added bonus, "option redispatch" now works when maxqueue has
been reached on a server.