Using the last rate counters, we now compute the queue, connect, response
and total times per server and per backend with a 95% accuracy over the last
1024 samples. The operation is cheap so we don't need to condition it.
As discussed with Dmitry Sivachenko, is a server farm has more than one
active server, uses a guaranteed non-determinist algorithm (round robin),
and a connection was initiated from a non-persistent connection, there's
no point insisting to reconnect to the same server after a connect failure,
better redispatch upon the very first retry instead of insisting on the same
server multiple times.
The retry delay is only useful when sticking to a same server. During
a redispatch, it's useless and counter-productive if we're sure to
switch to another server, which is almost guaranteed when there's
more than one server and the balancing algorithm is round robin, so
better not pass via the turn-around state in this case. It could be
done as well for leastconn, but there's a risk of always killing the
delay after the recovery of a server in a farm where it's almost
guaranteed to take most incoming traffic. So better only kill the
delay when using round robin.
As discussed with Dmitry Sivachenko, the default 1-second connect retry
delay can be large for situations where the connect timeout is much smaller,
because it means that an active connection reject will take more time to be
retried than a silent drop, and that does not make sense.
This patch changes this so that the retry delay is the minimum of 1 second
and the connect timeout. That way people running with sub-second connect
timeout will benefit from the shorter reconnect.
Servers used to have 3 flags to store a state, now they have 4 states
instead. This avoids lots of confusion for the 4 remaining undefined
states.
The encoding from the previous to the new states can be represented
this way :
SRV_STF_RUNNING
| SRV_STF_GOINGDOWN
| | SRV_STF_WARMINGUP
| | |
0 x x SRV_ST_STOPPED
1 0 0 SRV_ST_RUNNING
1 0 1 SRV_ST_STARTING
1 1 x SRV_ST_STOPPING
Note that the case where all bits were set used to exist and was randomly
dealt with. For example, the task was not stopped, the throttle value was
still updated and reported in the stats and in the http_server_state header.
It was the same if the server was stopped by the agent or for maintenance.
It's worth noting that the internal function names are still quite confusing.
Till now, the server's state and flags were all saved as a single bit
field. It causes some difficulties because we'd like to have an enum
for the state and separate flags.
This commit starts by splitting them in two distinct fields. The first
one is srv->state (with its counter-part srv->prev_state) which are now
enums, but which still contain bits (SRV_STF_*).
The flags now lie in their own field (srv->flags).
The function srv_is_usable() was updated to use the enum as input, since
it already used to deal only with the state.
Note that currently, the maintenance mode is still in the state for
simplicity, but it must move as well.
As more or less suspected, commit b1982e2 ("BUG/MEDIUM: http/session:
disable client-side expiration only after body") was hazardous. It
introduced a regression causing client side timeout to expire during
connection retries if it's lower than the time needed to cover the
amount of retries, so clients get a 408 when the connection to the
server fails to establish fast enough.
The reason is that the CF_READ_NOEXP flag is set after the MSG_DONE state
is reached, which protects the timeout from being re-armed, then during
the retries, process_session() clears the flag without calling the analyser
(since there's no activity for it), so the timeouts are rearmed.
Ideally, these one-shot flags should be per-analyser, and the analyser
which sets them would be responsible for clearing them, or they would
automatically be cleared when switching to another analyser. Unfortunately
this is not really possible currently.
What can be done however is to only clear them in the following situations :
- we're going to call analysers
- analysers have all been unsubscribed
This method seems reliable enough and approaches the ideal case well enough.
No backport is needed, this bug was introduced in 1.5-dev25.
Long-lived sessions are often subject to half-closed sessions resulting in
a lot of sessions appearing in FIN_WAIT state in the system tables, and no
way for haproxy to get rid of them. This typically happens because clients
suddenly disconnect without sending any packet (eg: FIN or RST was lost in
the path), and while the server detects this using an applicative heart
beat, haproxy does not close the connection.
This patch adds two new timeouts : "timeout client-fin" and
"timeout server-fin". The former allows one to override the client-facing
timeout when a FIN has been received or sent. The latter does the same for
server-facing connections, which is less useful.
John-Paul Bader reported a nasty segv which happens after a few hours
when SSL is enabled under a high load. Fortunately he could catch a
stack trace, systematically looking like this one :
(gdb) bt full
level = 6
conn = (struct connection *) 0x0
err_msg = <value optimized out>
s = (struct session *) 0x80337f800
conn = <value optimized out>
flags = 41997063
new_updt = <value optimized out>
old_updt = 1
e = <value optimized out>
status = 0
fd = 53999616
nbfd = 279
wait_time = <value optimized out>
updt_idx = <value optimized out>
en = <value optimized out>
eo = <value optimized out>
count = 78
sr = <value optimized out>
sw = <value optimized out>
rn = <value optimized out>
wn = <value optimized out>
The variable "flags" in conn_fd_handler() holds a copy of connection->flags
when entering the function. These flags indicate 41997063 = 0x0280d307 :
- {SOCK,DATA,CURR}_RD_ENA=1 => it's a handshake, waiting for reading
- {SOCK,DATA,CURR}_WR_ENA=0 => no need for writing
- CTRL_READY=1 => FD is still allocated
- XPRT_READY=1 => transport layer is initialized
- ADDR_FROM_SET=1, ADDR_TO_SET=0 => clearly it's a frontend connection
- INIT_DATA=1, WAKE_DATA=1 => processing a handshake (ssl I guess)
- {DATA,SOCK}_{RD,WR}_SH=0 => no shutdown
- ERROR=0, CONNECTED=0 => handshake not completed yet
- WAIT_L4_CONN=0 => normal
- WAIT_L6_CONN=1 => waiting for an L6 handshake to complete
- SSL_WAIT_HS=1 => the pending handshake is an SSL handshake
So this is a handshake is in progress. And the only way to reach line 88
is for the handshake to complete without error. So we know for sure that
ssl_sock_handshake() was called and completed the handshake then removed
the CO_FL_SSL_WAIT_HS flag from the connection. With these flags,
ssl_sock_handshake() does only call SSL_do_handshake() and retruns. So
that means that the problem is necessarily in data->init().
The fd is wrong as reported but is simply mis-decoded as it's the lower
half of the last function pointer.
What happens in practice is that there's an issue with the way we deal
with embryonic sessions during their conversion to regular sessions.
Since they have no stream interface at the beginning, the pointer to
the connection is temporarily stored into s->target. Then during their
conversion, the first stream interface is properly initialized and the
connection is attached to it, then s->target is set to NULL.
The problem is that if anything fails in session_complete(), the
session is left in this intermediate state where s->target is NULL,
and kill_mini_session() is called afterwards to perform the cleanup.
It needs the connection, that it finds in s->target which is NULL,
dereferences it and dies. The only reasons for dying here are a problem
on the TCP connection when doing the setsockopt(TCP_NODELAY) or a
memory allocation issue.
This patch implements a solution consisting in restoring s->target in
session_complete() on the error path. That way embryonic sessions that
were valid before calling it are still valid after.
The bug was introduced in 1.5-dev20 by commit f8a49ea ("MEDIUM: session:
attach incoming connection to target on embryonic sessions"). No backport
is needed.
Special thanks to John for his numerous tests and traces.
For a very long time, back in the v1.3 days, we used to rely on a trick
to avoid expiring the client side while transferring a payload to the
server. The problem was that if a client was able to quickly fill the
buffers, and these buffers took some time to reach the server, the
client should not expire while not sending anything.
In order to cover this situation, the client-side timeout was disabled
once the connection to the server was OK, since it implied that we would
at least expire on the server if required.
But there is a drawback to this : if a client stops uploading data before
the end, its timeout is not enforced and we only expire on the server's
timeout, so the logs report a 504.
Since 1.4, we have message body analysers which ensure that we know whether
all the expected data was received or not (HTTP_MSG_DATA or HTTP_MSG_DONE).
So we can fix this problem by disabling the client-side or server-side
timeout at the end of the transfer for the respective side instead of
having it unconditionally in session.c during all the transfer.
With this, the logs now report the correct side for the timeout. Note that
this patch is not enough, because another issue remains : the HTTP body
forwarders do not abort upon timeout, they simply rely on the generic
handling from session.c. So for now, the session is still aborted when
reaching the server timeout, but the culprit is properly reported. A
subsequent patch will address this specific point.
This bug was tagged MEDIUM because of the changes performed. The issue
it fixes is minor however. After some cooling down, it may be backported
to 1.4.
It was reported by and discussed with Rachel Chavez and Patrick Hemmer
on the mailing list.
On the mailing list, seri0528@naver.com reported an issue when
using balance url_param or balance uri. The request would sometimes
stall forever.
Cyril Bonté managed to reproduce it with the configuration below :
listen test :80
mode http
balance url_param q
hash-type consistent
server s demo.1wt.eu:80
and found it appeared with this commit : 80a92c0 ("BUG/MEDIUM: http:
don't start to forward request data before the connect").
The bug is subtle but real. The problem is that the HTTP request
forwarding analyzer refrains from starting to parse the request
body when some LB algorithms might need the body contents, in order
to preserve the data pointer and avoid moving things around during
analysis in case a redispatch is later needed. And in order to detect
that the connection establishes, it watches the response channel's
CF_READ_ATTACHED flag.
The problem is that a request analyzer is not subscribed to a response
channel, so it will only see changes when woken for other (generally
correlated) reasons, such as the fact that part of the request could
be sent. And since the CF_READ_ATTACHED flag is cleared once leaving
process_session(), it is important not to miss it. It simply happens
that sometimes the server starts to respond in a sequence that validates
the connection in the middle of process_session(), that it is detected
after the analysers, and that the newly assigned CF_READ_ATTACHED is
not used to detect that the request analysers need to be called again,
then the flag is lost.
The CF_WAKE_WRITE flag doesn't work either because it's cleared upon
entry into process_session(), ie if we spend more than one call not
connecting.
Thus we need a new flag to tell the connection initiator that we are
specifically interested in being notified about connection establishment.
This new flag is CF_WAKE_CONNECT. It is set by the requester, and is
cleared once the connection succeeds, where CF_WAKE_ONCE is set instead,
causing the request analysers to be scanned again.
For future versions, some better options will have to be considered :
- let all analysers subscribe to both request and response events ;
- let analysers subscribe to stream interface events (reduces number
of useless calls)
- change CF_WAKE_WRITE's semantics to persist across calls to
process_session(), but that is different from validating a
connection establishment (eg: no data sent, or no data to send)
The bug was introduced in 1.5-dev23, no backport is needed.
Since it became possible to use log-format expressions in use_backend,
having a mandatory condition becomes annoying because configurations
are full of "if TRUE". Let's relax the check to accept no condition
like many other keywords (eg: redirect).
Cyril Bonté reported that the "lastsess" field of a stats-only backend
was never updated. In fact the same is true for any applet and anything
not a server. Also, lastsess was not updated for a server reusing its
connection for a new request.
Since the goal of this field is to report recent activity, it's better
to ensure that all accesses are reported. The call has been moved to
the code validating the session establishment instead, since everything
passes there.
Since commit 4d4149c ("MEDIUM: counters: support passing the counter
number as a fetch argument"), the sample fetch sc_tracked(num) became
equivalent to sc[0-9]_tracked, by using the same smp_fetch_sc_tracked()
function.
This was theorically made possible after the series of changes starting
with commit a65536ca ("MINOR: counters: provide a generic function to
retrieve a stkctr for sc* and src."). Unfortunately, while all other
functions were changed to use the generic primitive smp_fetch_sc_stkctr(),
smp_fetch_sc_tracked() was forgotten and is not able to differentiate
between sc_tracked, src_tracked and sc[0-9]_tracked. The resulting mess is
that if sc_tracked is used, the counter number is assumed to be 47 because
that's what remains after subtracting "0" from char "_".
Fix this by simply relying on the generic function as should have been
done. The bug was introduced in 1.5-dev20. No backport is needed.
The function addr_to_stktable_key doesn't consider the expected
type of key. If the stick table key is based on IPv6 addresses
and the input is IPv4, the returned key is IPv4 adddress and his
length is 4 bytes, while is expected 16 bytes key.
This patch considers the expected key and try to convert IPv4 to
IPv6 and IPv6 to IPv4 according with the expected key.
This fixes the bug reported by Apollon Oikonomopoulos.
This bug was introduced somewhere in the 1.5-dev process.
Constructions such as sc0_get_gpc0(foo) allow to look up the same key as
the current key but in an alternate table. A check was missing to ensure
we already have a key, resulting in a crash if this lookup is performed
before the associated track-sc rule.
This bug was reported on the mailing list by Neil@iamafreeman and
narrowed down further by Lukas Tribus and Thierry Fournier.
This bug was introduced in 1.5-dev20 by commit "0f791d4 MEDIUM: counters:
support looking up a key in an alternate table".
We have a use case where we look up a customer ID in an HTTP header
and direct it to the corresponding server. This can easily be done
using ACLs and use_backend rules, but the configuration becomes
painful to maintain when the number of customers grows to a few
tens or even a several hundreds.
We realized it would be nice if we could make the use_backend
resolve its name at run time instead of config parsing time, and
use a similar expression as http-request add-header to decide on
the proper backend to use. This permits the use of prefixes or
even complex names in backend expressions. If no name matches,
then the default backend is used. Doing so allowed us to get rid
of all the use_backend rules.
Since there are some config checks on the use_backend rules to see
if the referenced backend exists, we want to keep them to detect
config errors in normal config. So this patch does not modify the
default behaviour and proceeds this way :
- if the backend name in the use_backend directive parses as a log
format rule, it's used as-is and is resolved at run time ;
- otherwise it's a static name which must be valid at config time.
There was the possibility of doing this with the use-server directive
instead of use_backend, but it seems like use_backend is more suited
to this task, as it can be used for other purposes. For example, it
becomes easy to serve a customer-specific proxy.pac file based on the
customer ID by abusing the errorfile primitive :
use_backend bk_cust_%[hdr(X-Cust-Id)] if { hdr(X-Cust-Id) -m found }
default_backend bk_err_404
backend bk_cust_1
errorfile 200 /etc/haproxy/static/proxy.pac.cust1
Signed-off-by: Bertrand Jacquin <bjacquin@exosec.fr>
For outgoing connections initiated from an applet, there might not be
any listener. It's the case with peers, which resort to a hack consisting
in making the session's listener point to the peer. This listener is only
used for statistics now so it's much easier to check for its presence now.
Currently there are two places where the compression context is released,
one in session_free() and another one in http_end_txn_clean_session().
Both of them call http_end_txn(), either directly or via http_reset_txn(),
and this function is made for this exact purpose. So let's centralize the
call there instead.
Summary:
Track and report last session time on the stats page for each server
in every backend, as well as the backend.
This attempts to address the requirement in the ROADMAP
- add a last activity date for each server (req/resp) that will be
displayed in the stats. It will be useful with soft stop.
The stats page reports this as time elapsed since last session. This
change does not adequately address the requirement for long running
session (websocket, RDP... etc).
On ARM, glibc does not implement accept4() and simply returns ENOSYS
which was not caught as a reason to fall back to accept(), resulting
in a spinning process since poll() would call again.
Let's change the error detection mechanism to save the broken status
of the syscall into a local variable that is used to fall back to the
legacy accept().
In addition to this, since the code was becoming a bit messy, the
accept4() was removed, so now the fallback code and the legacy code
are the same. This will also increase bug report accuracy if needed.
This is 1.5-specific, no backport is needed.
Till now, we had one flag per stick counter to indicate if it was
tracked in a backend or in a frontend. We just had to add another
flag per stick-counter to indicate if it relies on contents or just
connection. These flags are quite painful to maintain and tend to
easily conflict with other flags if their number is changed.
The correct solution consists in moving the flags to the stkctr struct
itself, but currently this struct is made of 2 pointers, so adding a
new entry there to store only two bits will cause at least 16 more bytes
to be eaten per counter due to alignment issues, and we definitely don't
want to waste tens to hundreds of bytes per session just for things that
most users don't use.
Since we only need to store two bits per counter, an intermediate
solution consists in replacing the entry pointer with a composite
value made of the original entry pointer and the two flags in the
2 unused lower bits. If later a need for other flags arises, we'll
have to store them in the struct.
A few inline functions have been added to abstract the retrieval
and assignment of the pointers and flags, resulting in very few
changes. That way there is no more dependence on the number of
stick-counters and their position in the session flags.
In 1.5-dev19, commit e25c917 ("MEDIUM: counters: add support for tracking
a third counter") introduced the third track counter. However, there was
a hard-coded test in the accept() error path to release only sc0 and sc1.
So it seems that if tracking sc2 at the connection level and deciding to
reject once the track-sc2 has been done, there could be some leaking of
stick-table entries which remain marked used forever, thus which can never
be purged nor expired. There's no memory leak though, it's just that
entries are unexpirable forever.
The simple solution consists in removing the test and always calling
the inline function which iterates over all entries.
Patrick Hemmer reported that using unique_id_format and logs did not
report the same unique ID counter since commit 9f09521 ("BUG/MEDIUM:
unique_id: HTTP request counter must be unique!"). This is because
the increment was done while producing the log message, so it was
performed twice.
A better solution consists in fetching a new value once per request
and saving it in the request or session context for all of this
request's life.
It happens that sessions already have a unique ID field which is used
for debugging and reporting errors, and which differs from the one
sent in logs and unique_id header.
So let's change this to reuse this field to have coherent IDs everywhere.
As of now, a session gets a new unique ID once it is instanciated. This
means that TCP sessions will also benefit from a unique ID that can be
logged. And this ID is renewed for each extra HTTP request received on
an existing session. Thus, all TCP sessions and HTTP requests will have
distinct IDs that will be stable along all their life, and coherent
between all places where they're used (logs, unique_id header,
"show sess", "show errors").
This feature is 1.5-specific, no backport to 1.4 is needed.
The code which enables tunnel mode or TCP transfers is rarely used
and at most once per session. Putting it in an unlikely() clause
reduces the length of the hot path of process_session() which is
already quite long, and also slightly reduces its overall size.
Some measurements show a steady gain of about 0.2% thanks to this.
si_connect() used to only return SI_ST_CON. But it already detect the
connection reuse and is the function which avoids calling connect().
So it already knows the connection is valid and reuse. Thus we make it
return SI_ST_EST when a connection is reused. This means that
connect_server() can return this state and sess_update_stream_int()
as well.
Thanks to this change, we don't need to leave process_session() in
SI_ST_CON state to immediately enter it again to switch to SI_ST_EST.
Implementing this removes one call to process_session() per request
in keep-alive mode. We're now at 2 calls per request, which is the
minimum (one for the request and another one for the response). The
number of calls to http_wait_for_response() has also dropped from 2
to one.
Tests indicate a performance gain of about 2.6% in request rate in
keep-alive mode. There should be no gain in http-server-close() since
we don't use this faster path.
At the moment it is possible in sess_prepare_conn_req() to switch to the
established state when the target is an applet. But sess_update_stream_int()
will soon also have the ability to set the established state via
connect_server() when a connection is reused, leading to a synchronous
connect.
So prepare the code to handle this SI_ST_ASS -> SI_ST_EST transition, which
really matches what's done in the lower layers.
Currently there are 3 places in the code where t_connect is set after
switching to state SI_ST_EST, and a fourth one will soon come. Since
all these places lead to an immediate call to sess_establish() to
complete the session establishment, better move that measurement
there.
As soon as we connect to the server, we want to limit the number of
recvfrom() on the response path because most of the time a single
call will retrieve enough information.
At the moment this is only done in the HTTP response parser, after
some reads have already failed, which is too late. We need to do
that at the earliest possible instant. It was already done for the
request side by frontend_accept() for the first request, and by
http_reset_txn() for the next requests.
Thanks to this change, there are no more failed recvfrom() calls in
keep-alive mode.
Since commit 6b66f3e ([MAJOR] implement autonomous inter-socket forwarding)
introduced in 1.3.16-rc1, we've been relying on a stupid mechanism to wake
up the task after a write, which was an exact copy-paste of the reader side.
The principle was that if we empty a buffer and there's no forwarding
scheduled or if the *producer* is not in a connected state, then we wake
the task up.
That does not make any sense. It happens to wake up too late sometimes (eg,
when the request analyser waits for some room in the buffer to start to
work), and leads to unneeded wakeups in client-side keep-alive, because
the task is woken up when the response is sent, while the analysers are
simply waiting for a new request.
In order to fix this, we introduce a new channel flag : CF_WAKE_WRITE. It
is designed so that an analyser can explicitly request being notified when
some data were written. It is used only when the HTTP request or response
analysers need to wait for more room in the buffers. It is automatically
cleared upon wake up.
The flag is also automatically set by the functions which try to write into
a buffer from an applet when they fail (bi_putblk() etc...).
That allows us to remove the stupid condition above and avoid some wakeups.
In http-server-close and in http-keep-alive modes, this reduces from 4 to 3
the average number of wakeups per request, and increases the overall
performance by about 1.5%.
In HTTP keep-alive mode, if we receive a 401, we still have a chance
of being able to send the visitor again to the same server over the
same connection. This is required by some broken protocols such as
NTLM, and anyway whenever there is an opportunity for sending the
challenge to the proper place, it's better to do it (at least it
helps with debugging).
The memset() was put here to corrupt memory for a debugging test,
it's not needed anymore and was unfortunately committed. It does
not harm anyway, it probably just slightly affects performance.
Since recent commit f79c817 (MAJOR: connection: add two new flags to
indicate readiness of control/transport) and the surrounding commits,
the session initialization has been slightly delayed and the control
layer of the connection is not yet initialized when processing the
rules.
We need to move that minimal initialization a bit above.
The bug was introduced with latest changes, no backport is needed.
It's becoming increasingly difficult to ignore unwanted function returns in
debug code with gcc. Now even when you try to work around it, it suggests a
way to write your code differently. For example :
src/frontend.c:187:65: warning: if statement has empty body [-Wempty-body]
if (write(1, trash.str, trash.len) < 0) /* shut gcc warning */;
^
src/frontend.c:187:65: note: put the semicolon on a separate line to silence this warning
1 warning generated.
This is totally unacceptable, this code already had to be written this way
to shut it up in earlier versions. And now it comments the form ? What's the
purpose of the C language if you can't write anymore the code that does what
you want ?
Emeric proposed to just keep a global variable to drain such useless results
so that gcc stops complaining all the time it believes people who write code
are monkeys. The solution is acceptable because the useless assignment is done
only in debug code so it will not impact performance. This patch implements
this, until gcc becomes even "smarter" to detect that we tried to cheat.
SSL and keep-alive will need to be able to fail on allocation errors,
and the stream interface did not allow to report such a cause. The flag
will then be "RC" as already documented.
Some applet users don't need to initialize their applet, they just want
to route the traffic there just as if it were a server. Since applets
are now connected to from session.c, let's simply ensure that when
connecting, the applet in si->end matches the target, and allocate
one there if it's not already done. In case of error, we force the
status code to resource and connection so that it's clear that it
happens because of a memory shortage.
The outgoing connection is now allocated dynamically upon the first attempt
to touch the connection's source or destination address. If this allocation
fails, we fail on SN_ERR_RESOURCE.
As we didn't use si->conn anymore, it was removed. The endpoints are released
upon session_free(), on the error path, and upon a new transaction. That way
we are able to carry the existing server's address across retries.
The stream interfaces are not initialized anymore before session_complete(),
so we could even think about allocating them dynamically as well, though
that would not provide much savings.
The session initialization now makes use of conn_new()/conn_free(). This
slightly simplifies the code and makes it more logical. The connection
initialization code is now shorter by about 120 bytes because it's done
at once, allowing the compiler to remove all redundant initializations.
The si_attach_applet() function now takes care of first detaching the
existing endpoint, and it is called from stream_int_register_handler(),
so we can safely remove the calls to si_release_endpoint() in the
application code around this call.
A call to si_detach() was made upon stream_int_unregister_handler() to
ensure we always free the allocated connection if one was allocated in
parallel to setting an applet (eg: detect HTTP proxy while proceeding
with stats maybe).
si_prepare_conn() is not appropriate in our case as it both initializes and
attaches the connection to the stream interface. Due to the asymmetry between
accept() and connect(), it causes some fields such as the control and transport
layers to be reinitialized.
Now that we can separately initialize these fields using conn_prepare(), let's
break this function to only attach the connection to the stream interface.
Also, by analogy, si_prepare_none() was renamed si_detach(), and
si_prepare_applet() was renamed si_attach_applet().
We don't want to assign the control nor transport layers anymore
at the same time as the data layer, because it prevents one from
keeping existing settings when reattaching a connection to an
existing stream interface.
Let's have conn_attach() replace conn_assign() for this purpose.
Thus, conn_prepare() + conn_attach() do exactly the same as the
previous conn_assign().
Now that we can assign conn->xprt regardless of the initialization state,
we can reintroduce conn_prepare() to set only the protocol, the transport
layer and initialize the transport layer's state.
The first function is used to (re)initialize a stream interface and
the second to force it into a known state. These are intended for
cleaning up the stream interface initialization code in session.c
and peers.c and avoiding future issues with missing initializations.
Currently the control and transport layers of a connection are supposed
to be initialized when their respective pointers are not NULL. This will
not work anymore when we plan to reuse connections, because there is an
asymmetry between the accept() side and the connect() side :
- on accept() side, the fd is set first, then the ctrl layer then the
transport layer ; upon error, they must be undone in the reverse order,
then the FD must be closed. The FD must not be deleted if the control
layer was not yet initialized ;
- on the connect() side, the fd is set last and there is no reliable way
to know if it has been initialized or not. In practice it's initialized
to -1 first but this is hackish and supposes that local FDs only will
be used forever. Also, there are even less solutions for keeping trace
of the transport layer's state.
Also it is possible to support delayed close() when something (eg: logs)
tracks some information requiring the transport and/or control layers,
making it even more difficult to clean them.
So the proposed solution is to add two flags to the connection :
- CO_FL_CTRL_READY is set when the control layer is initialized (fd_insert)
and cleared after it's released (fd_delete).
- CO_FL_XPRT_READY is set when the control layer is initialized (xprt->init)
and cleared after it's released (xprt->close).
The functions have been adapted to rely on this and not on the pointers
anymore. conn_xprt_close() was unused and dangerous : it did not close
the control layer (eg: the socket itself) but still marks the transport
layer as closed, preventing any future call to conn_full_close() from
finishing the job.
The problem comes from conn_full_close() in fact. It needs to close the
xprt and ctrl layers independantly. After that we're still having an issue :
we don't know based on ->ctrl alone whether the fd was registered or not.
For this we use the two new flags CO_FL_XPRT_READY and CO_FL_CTRL_READY. We
now rely on this and not on conn->xprt nor conn->ctrl anymore to decide what
remains to be done on the connection.
In order not to miss some flag assignments, we introduce conn_ctrl_init()
to initialize the control layer, register the fd using fd_insert() and set
the flag, and conn_ctrl_close() which unregisters the fd and removes the
flag, but only if the transport layer was closed.
Similarly, at the transport layer, conn_xprt_init() calls ->init and sets
the flag, while conn_xprt_close() checks the flag, calls ->close and clears
the flag, regardless xprt_ctx or xprt_st. This also ensures that the ->init
and the ->close functions are called only once each and in the correct order.
Note that conn_xprt_close() does nothing if the transport layer is still
tracked.
conn_full_close() now simply calls conn_xprt_close() then conn_full_close()
in turn, which do nothing if CO_FL_XPRT_TRACKED is set.
In order to handle the error path, we also provide conn_force_close() which
ignores CO_FL_XPRT_TRACKED and closes the transport and the control layers
in turns. All relevant instances of fd_delete() have been replaced with
conn_force_close(). Now we always know what state the connection is in and
we can expect to split its initialization.
Everywhere conn_prepare() is used, the call to conn_init() has already
been done. We can now safely replace all instances of conn_prepare()
with conn_assign() which does not reset the transport layer, and remove
conn_prepare().
In order to reduce the dependency over stream-interfaces, we now
attach the incoming connection to the embryonic session's target
instead of the stream-interface's connection. This means we won't
need to initialize stream interfaces anymore after we implement
dynamic connection allocation. The session's target is reset to
NULL after the session has been converted to a complete session.
The connection will only remain there as a pre-allocated entity whose
goal is to be placed in ->end when establishing an outgoing connection.
All connection initialization can be made on this connection, but all
information retrieved should be applied to the end point only.
This change is huge because there were many users of si->conn. Now the
only users are those who initialize the new connection. The difficulty
appears in a few places such as backend.c, proto_http.c, peers.c where
si->conn is used to hold the connection's target address before assigning
the connection to the stream interface. This is why we have to keep
si->conn for now. A future improvement might consist in dynamically
allocating the connection when it is needed.