I was testing haproxy-1.5-dev22 on SmartOS (an illumos-based system)
and ran into a problem. There's a small window after non-blocking
connect() is called, but before the TCP connection is established,
where recv() may return ENOTCONN. On Linux, the behaviour here seems
to always be to return EAGAIN. The fix is relatively trivial, and
appears to make haproxy work reliably on current SmartOS (see patch
below). It's possible that other UNIX platforms exhibit this
behaviour as well.
Note: the equivalent was already done for send() in commit 0ea0cf6
("BUG: raw_sock: also consider ENOTCONN in addition to EAGAIN").
Both patches should be backported to 1.4.
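For illustration only (this is not the patch referenced above, and the
helper below is hypothetical), the idea boils down to treating ENOTCONN
exactly like EAGAIN on the receive path:

#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: read up to <count> bytes from a non-blocking
 * socket. Returns the byte count on success, 0 on a clean shutdown,
 * -1 when the caller should poll and retry (EAGAIN, EINTR, or ENOTCONN
 * while the connect() is still pending, as seen on SmartOS), and -2 on
 * a hard error.
 */
static ssize_t nb_recv(int fd, void *buf, size_t count)
{
    ssize_t ret = recv(fd, buf, count, 0);

    if (ret >= 0)
        return ret;

    if (errno == EAGAIN || errno == EINTR || errno == ENOTCONN)
        return -1;    /* transient: wait for the poller */

    return -2;        /* real error: report it upstream */
}
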
Passing raw MSG_* flags to snd_buf() prevents us from passing other
useful info and requires the upper levels to know these socket-level
flags. Let's use a new flags category
instead : CO_SFL_*. For now, only MSG_MORE has been remapped.
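As a sketch (the flag value and function name below are illustrative),
the remapping at the transport layer looks like this:

#include <sys/socket.h>

#define CO_SFL_MSG_MORE 0x0001  /* upper layer knows more data will follow */

/* Illustrative: translate connection-level send flags (CO_SFL_*) into
 * the socket-level MSG_* flags used by the raw socket transport.
 */
static int co_sfl_to_msg_flags(unsigned int sfl)
{
    int flags = MSG_DONTWAIT | MSG_NOSIGNAL;

    if (sfl & CO_SFL_MSG_MORE)
        flags |= MSG_MORE;
    return flags;
}
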
Due to a typo, the MSG_MORE flag used to replace MSG_NOSIGNAL and
MSG_DONTWAIT. Fortunately, sockets are always marked non-blocking,
so the loss of MSG_DONTWAIT is harmless, and the loss of MSG_NOSIGNAL
is covered by the interception of SIGPIPE. So no issue could have been
caused by this bug.
This is the reimplementation of the "done" action : when we experience
a short read, we're almost certain that we've exhausted the system's
buffers and that we'll meet an EAGAIN if we attempt to read again. If
the FD is not yet polled, the stream interface already takes care of
stopping the speculative read. When the FD is already being polled, we
have two options :
- either we're running from a level-triggered poller, in which case
we'd rather report that we've reached the end so that we don't
speculate over the poller and let it report next time data are
available ;
- or we're running from an edge-triggered poller in which case we
have no choice and have to see the EAGAIN to re-enable events.
At the moment we don't have any edge-triggered poller, so it's desirable
to avoid speculative I/O that we know will fail.
Note that this must not be ported to SSL since SSL hides the real
readiness of the file descriptor.
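The heuristic may be sketched as follows; the helpers are hypothetical
placeholders for the fd/polling layer described above:

#include <stddef.h>
#include <sys/types.h>

/* Hypothetical helpers standing in for the fd/polling layer */
int  fd_is_polled(int fd);            /* fd registered in the poller ? */
int  poller_is_level_triggered(void); /* property of the active poller */
void stop_speculative_recv(int fd);   /* report "done", wait for poll  */

/* Called after recv() returned <ret> bytes out of <count> requested.
 * On a short read we assume the kernel buffers are empty : rather than
 * speculating and hitting EAGAIN, we let the level-triggered poller
 * tell us when more data arrive.
 */
static void handle_short_read(int fd, ssize_t ret, size_t count)
{
    if (ret <= 0 || (size_t)ret >= count)
        return;                      /* not a short read */

    if (!fd_is_polled(fd) || poller_is_level_triggered())
        stop_speculative_recv(fd);
    /* else: edge-triggered poller, keep reading until EAGAIN */
}
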
Thanks to this change, we observe no EAGAIN anymore during keep-alive
transfers, and failed recvfrom() calls are reduced by half in
http-server-close mode (the client-facing side is always being polled
and the second recv
can be avoided). Doing so results in about 5% performance increase in
keep-alive mode. Similarly, we used to have up to about 1.6% of EAGAIN
on accept() (1/maxaccept), and these have completely disappeared under
high loads.
The recv/send callbacks must check for readiness themselves instead of
having their callers do it. This will strengthen the test and will also
ensure we never refrain from calling a handshake handler because a
direction is being polled while the other one is ready.
We simply remove these functions and replace their calls with the
appropriate ones :
- if we're in the data phase, we can simply report that we want to
wait on the FD
- if we're in the socket phase, we may also have to signal the
desire to read/write on the socket because it might not be
active yet.
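A sketch of the replacement, using the __conn_{sock,data}_* helper
naming quoted later in this series (the exact prototypes here are
assumptions):

struct connection;   /* haproxy connection object */

/* Assumed prototypes following the __conn_{sock,data}_{stop,poll,want}_*
 * naming used by this series.
 */
void __conn_data_poll_recv(struct connection *conn);
void __conn_sock_want_recv(struct connection *conn);

/* Illustrative: what replaces the removed function on the receive side */
static void wait_for_more_data(struct connection *conn, int in_handshake)
{
    if (!in_handshake) {
        /* data phase : simply report that we wait on the FD */
        __conn_data_poll_recv(conn);
    }
    else {
        /* socket phase : the read direction might not be active yet on
         * the socket, so signal the desire to read there as well.
         */
        __conn_sock_want_recv(conn);
    }
}
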
Steve Ruiz reported some reproducible crashes with HTTP health checks
on a certain page returning a huge length. The traces he provided
clearly showed that the recv() call was performed twice for a total
size exceeding the buffer's length.
Cyril Bonté tracked down the problem to be caused by the full buffer
size being passed to rcv_buf() in event_srv_chk_r() instead of passing
just the remaining amount of space. Indeed, this change happened during
the connection rework in 1.5-dev13 with the following commit :
f150317 MAJOR: checks: completely use the connection transport layer
But one of the problems is also that the comments at the top of the
rcv_buf() functions suggest that the caller only has to ensure the
requested size doesn't overflow the buffer's size.
Also, these functions already have to care about the buffer's size to
handle wrapping free space when there are pending data in the buffer.
So let's change the API instead to more closely match what could be
expected from these functions :
- the caller asks for the maximum amount of bytes it wants to read ;
This means that only the caller is responsible for enforcing the
reserve if it wants to (eg: checks don't).
- the rcv_buf() functions fix their computations to always consider
this size as a max, and always perform validity checks based on
the buffer's free space.
As a result, the code is simplified and reduced, and made more robust
for callers which now just have to care about whether they want the
buffer to be filled or not.
Since the bug was introduced in 1.5-dev13, no backport to stable versions
is needed.
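The new contract can be illustrated with a simplified circular buffer
(the structure and names below are illustrative, not haproxy's exact
ones):

#include <stddef.h>

/* Simplified circular buffer : <size> total bytes, <data> bytes
 * pending, <head> is the index of the first pending byte.
 */
struct sbuf {
    char   *area;
    size_t  size;
    size_t  head;
    size_t  data;
};

/* rcv_buf-style contract : <count> is only the maximum the caller
 * wants ; the function clamps it to the contiguous free space, taking
 * wrapping into account, so the caller can never overflow the buffer.
 */
static size_t sbuf_recv_room(const struct sbuf *b, size_t count)
{
    size_t tail   = (b->head + b->data) % b->size;
    size_t room   = b->size - b->data;    /* total free space */
    size_t contig = b->size - tail;       /* free space up to the end */

    if (contig > room)
        contig = room;                    /* pending data wraps around */
    return count < contig ? count : contig;
}
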
Currently the control and transport layers of a connection are supposed
to be initialized when their respective pointers are not NULL. This will
not work anymore when we plan to reuse connections, because there is an
asymmetry between the accept() side and the connect() side :
- on accept() side, the fd is set first, then the ctrl layer then the
transport layer ; upon error, they must be undone in the reverse order,
then the FD must be closed. The FD must not be deleted if the control
layer was not yet initialized ;
- on the connect() side, the fd is set last and there is no reliable way
to know if it has been initialized or not. In practice it's initialized
to -1 first but this is hackish and supposes that local FDs only will
be used forever. Also, there are even fewer solutions for keeping track
of the transport layer's state.
Also, we want to be able to support a delayed close() when something
(eg: logs) tracks some information requiring the transport and/or
control layers, which makes it even more difficult to clean them up.
So the proposed solution is to add two flags to the connection :
- CO_FL_CTRL_READY is set when the control layer is initialized (fd_insert)
and cleared after it's released (fd_delete).
- CO_FL_XPRT_READY is set when the transport layer is initialized (xprt->init)
and cleared after it's released (xprt->close).
The functions have been adapted to rely on this and not on the pointers
anymore. conn_xprt_close() was unused and dangerous : it did not close
the control layer (eg: the socket itself) but still marked the transport
layer as closed, preventing any future call to conn_full_close() from
finishing the job.
The problem comes from conn_full_close() in fact. It needs to close the
xprt and ctrl layers independently. After that we still have an issue :
we don't know based on ->ctrl alone whether the fd was registered or not.
For this we use the two new flags CO_FL_XPRT_READY and CO_FL_CTRL_READY. We
now rely on this and not on conn->xprt nor conn->ctrl anymore to decide what
remains to be done on the connection.
In order not to miss some flag assignments, we introduce conn_ctrl_init()
to initialize the control layer, register the fd using fd_insert() and set
the flag, and conn_ctrl_close() which unregisters the fd and removes the
flag, but only if the transport layer was closed.
Similarly, at the transport layer, conn_xprt_init() calls ->init and sets
the flag, while conn_xprt_close() checks the flag, calls ->close and clears
the flag, regardless of xprt_ctx or xprt_st. This also ensures that the ->init
and the ->close functions are called only once each and in the correct order.
Note that conn_xprt_close() does nothing if the transport layer is still
tracked.
conn_full_close() now simply calls conn_xprt_close() then conn_ctrl_close()
in turn, which do nothing if CO_FL_XPRT_TRACKED is set.
In order to handle the error path, we also provide conn_force_close() which
ignores CO_FL_XPRT_TRACKED and closes the transport and the control layers
in turn. All relevant instances of fd_delete() have been replaced with
conn_force_close(). Now we always know what state the connection is in and
we can expect to split its initialization.
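The readiness tracking can be condensed as follows; the function names
come from this commit, while the structures, flag values and fd
prototypes are simplified assumptions:

/* Connection flags (values are illustrative) */
#define CO_FL_CTRL_READY   0x00000001  /* fd registered via fd_insert() */
#define CO_FL_XPRT_READY   0x00000002  /* xprt->init() has been called  */
#define CO_FL_XPRT_TRACKED 0x00000004  /* xprt must not be released yet */

struct connection;

struct xprt_ops {
    int  (*init)(struct connection *conn);
    void (*close)(struct connection *conn);
};

struct connection {
    unsigned int           flags;
    const struct xprt_ops *xprt;
    int                    fd;
};

void fd_insert(int fd);   /* register the fd (assumed prototype)     */
void fd_delete(int fd);   /* unregister and close it (assumed proto) */

static void conn_ctrl_init(struct connection *conn)
{
    if (!(conn->flags & CO_FL_CTRL_READY)) {
        fd_insert(conn->fd);
        conn->flags |= CO_FL_CTRL_READY;
    }
}

static void conn_xprt_close(struct connection *conn)
{
    if ((conn->flags & CO_FL_XPRT_READY) &&
        !(conn->flags & CO_FL_XPRT_TRACKED)) {
        if (conn->xprt->close)
            conn->xprt->close(conn);
        conn->flags &= ~CO_FL_XPRT_READY;
    }
}

static void conn_ctrl_close(struct connection *conn)
{
    /* the fd may only go away once the transport layer is gone */
    if ((conn->flags & CO_FL_CTRL_READY) &&
        !(conn->flags & CO_FL_XPRT_READY)) {
        fd_delete(conn->fd);
        conn->flags &= ~CO_FL_CTRL_READY;
    }
}
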
In some places, we report an error by just detecting FD_POLL_ERR.
The problem is that the caller never knows if it must use errno or
call getsockopt(SO_ERROR). And since this last one clears the
pending error from the queue, it cannot be used unconditionally.
An elegant solution consists in clearing errno prior to inspecting
FD_POLL_ERR. The caller then knows that if it gets CO_FL_ERROR and
errno == 0, it must call getsockopt().
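In code, the convention amounts to something like this (the CO_FL_ERROR
plumbing is omitted and the helper names are illustrative):

#include <errno.h>
#include <sys/socket.h>

/* Reporting side : clear errno before reporting a purely polled error,
 * so the caller can tell "flag only" apart from an error already
 * carried by errno.
 */
static int report_polled_error(int fd_poll_err)
{
    if (fd_poll_err) {
        errno = 0;          /* no syscall error is pending  */
        return 1;           /* caller will set CO_FL_ERROR  */
    }
    return 0;
}

/* Caller side : fetch the real error only when errno gave nothing,
 * since getsockopt(SO_ERROR) clears the socket's pending error.
 */
static int fetch_conn_error(int fd)
{
    int       err = errno;
    socklen_t len = sizeof(err);

    if (!err)
        getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
    return err;
}
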
When we get a hard error from a syscall indicating the socket is dead,
it makes sense to set the CO_FL_SOCK_WR_SH and CO_FL_SOCK_RD_SH flags
to indicate that the socket may not be used anymore. It will ease the
error processing in health checks where the state of the socket is very
important. We'll also be able to avoid some setsockopt(nolinger) after
an error.
For now, the rest of the code is not impacted because CO_FL_ERROR is
always tested prior to these flags.
Mark Janssen reported an issue in 1.5-dev19 which was introduced
in 1.5-dev12 by commit 96199b10. From time to time, randomly, the
CPU usage spikes to 100% for seconds to minutes.
A deep analysis of the traces provided shows that it happens when
waiting for the response to a second pipelined HTTP request, or
when trying to handle the received shutdown advertised by epoll()
after the last block of data. Each time, splice() was involved with
data pending in the pipe.
The cause of this was that such events could not be taken into account
by either splice() or recv() and were left pending :
- the transfer of the last block of data, optionally with a shutdown
was not handled by splice() because of the validation that to_forward
is higher than MIN_SPLICE_FORWARD ;
- the next recv() call was inhibited because of the test on presence
of data in the pipe. This is also what prevented the recv() call
from handling a response to a pipelined request until the client
had ACKed the previous response.
No less than 4 different methods were experimented to fix this, and the
current one was finally chosen. The principle is that if an event is not
caught by splice(), then it MUST be caught by recv(). So we remove the
condition on the pipe's emptiness to perform a recv(), and in order to
prevent recv() from being used in the middle of a transfer, we mark
supposedly full pipes with CO_FL_WAIT_ROOM, which makes sense because
the reason for stopping a splice()-based receive is that the pipe is
supposed to be full.
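The principle can be sketched like this (CO_FL_WAIT_ROOM comes from the
commit, the threshold and helpers are illustrative):

#define CO_FL_WAIT_ROOM  0x0001   /* pipe supposedly full : stop receiving */
#define PIPE_FULL_HINT   65536    /* illustrative "pipe looks full" level  */

/* After a splice()-based receive : if we stopped because the pipe looks
 * full, say so explicitly instead of leaving the state ambiguous.
 */
static void mark_pipe_full(unsigned int *co_flags, long pipe_data)
{
    if (pipe_data >= PIPE_FULL_HINT)
        *co_flags |= CO_FL_WAIT_ROOM;
}

/* Receive gate : the old code refused to recv() whenever the pipe held
 * any data ; now we only skip recv() when the pipe was flagged as full,
 * so any event missed by splice() is necessarily caught by recv().
 */
static int may_recv_to_buffer(unsigned int co_flags)
{
    return !(co_flags & CO_FL_WAIT_ROOM);
}
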
The net effect is that we don't wake up and sleep in loops during these
transient states. This happened much more often than expected, sometimes
for a few cycles at end of transfers, but rarely long enough to be
noticed, unless a client timed out with data pending in the pipe. The
effect on CPU usage is visible even when transferring 1MB objects in
pipeline, where the CPU usage drops from 10% to 6% on a small machine at
medium bandwidth.
Some further improvements are needed :
- the last chunk of a splice() transfer is never done using splice due
to the test on to_forward. This is wrong and should be performed with
splice if the pipe has not yet been emptied ;
- si_chk_snd() should not be called when the write event is already being
polled, otherwise we're almost certain to get EAGAIN.
Many thanks to Mark for all the traces he cared to provide, they were
essential for understanding this issue which was not reproducible
without.
Only 1.5-dev is affected, no backport is needed.
Commit 96199b10 reintroduced the splice() mechanism in the new connection
system. However, it failed to account for the number of transferred bytes,
allowing more bytes than scheduled to be transferred to the client. This
can cause an issue with small-chunked responses, where each packet from
the server may contain several chunks, because a first splice() call may
succeed, then we try to splice() a second time since the pipe is not
full, thus consuming the next chunk's size.
This patch also reverts commit baf2a5 ("OPTIM: splice: detect shutdowns...")
because it introduced a related regression. The issue is that splice() may
also return less data than available when the pipe is full, so seeing EPOLLRDHUP
after splice() returns less than expected is not a sufficient indication that
the input is empty.
In both cases, the issue may be detected by the presence of "SD" termination
flags in the logs, and worked around by disabling splicing (using "-dS").
This problem was reported by Sander Klein, and no backport is needed.
Linux kernels between 2.6.25 and 2.6.27.12 have a bogus splice()
implementation which returns EAGAIN on incoming shutdowns. On these
versions, we have to call recv() after such a return in order to find
out whether splice is OK or not. Since 2.6.27.13 we don't need to do
this anymore, saving one useless recv() call after each splice()
returning EAGAIN, and we can avoid this logic by defining
ASSUME_SPLICE_WORKS.
Building with the linux2628 target automatically enables splice and the
flag above since the kernel is safe. People enabling splice for custom
kernels will be able to disable this logic by hand too.
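The build-time switch boils down to a test of this kind (a sketch; the
surrounding retry logic is simplified):

#include <errno.h>

/* Decide whether an EAGAIN returned by splice() may actually hide an
 * incoming shutdown. On kernels known to be safe (the linux2628 target,
 * or a custom build defining ASSUME_SPLICE_WORKS), the confirming
 * recv() call is skipped entirely.
 */
static int splice_eagain_may_hide_shutdown(void)
{
#ifdef ASSUME_SPLICE_WORKS
    return 0;                 /* 2.6.27.13+ : EAGAIN really means EAGAIN */
#else
    return errno == EAGAIN;   /* old kernels : recv() once to check      */
#endif
}
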
Since last commit introducing EPOLLRDHUP, the splicing code is able to
detect an incoming shutdown without calling splice() == 0. This avoids
one useless syscall.
In raw_sock, we already check for FD_POLL_HUP after a short recv()
to avoid a useless syscall and detect the end of stream. However,
we fail to check for FD_POLL_ERR here, which causes major issues
as some errors might be ignored when they are delivered at the same
time as a HUP, and there is no data to send that could reveal them
in the other direction.
Since the connection's flags do not include CO_FL_ERROR, the
polling is not disabled on the socket and the pollers immediately
call the conn_fd_handler() again, resulting in CPU spikes for as
long as the timeouts allow them.
Note that this patch alone fixes the issue but a few patches will
follow to strengthen this fragile area.
Big thanks to Bryan Berry who reported the issue with significant
amounts of detailed traces that helped rule out many other initially
suspected causes and to finally reproduce the issue in the lab.
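The strengthened test looks roughly like this (the flag names exist in
haproxy's fd and connection layers, but their values and the plumbing
here are illustrative):

#define FD_POLL_HUP       0x0001   /* hangup reported by the poller */
#define FD_POLL_ERR       0x0002   /* error reported by the poller  */

#define CO_FL_SOCK_RD_SH  0x0001   /* read shutdown on the socket   */
#define CO_FL_ERROR       0x0002   /* fatal error on the connection */

/* After a short recv(), inspect the events the poller reported for the
 * fd. An error must be honoured even when it arrives together with a
 * HUP, otherwise CO_FL_ERROR is never set, polling stays enabled and
 * the fd handler is woken up in a tight loop.
 */
static void check_events_after_short_read(unsigned int fd_ev,
                                          unsigned int *co_flags)
{
    if (fd_ev & FD_POLL_ERR)
        *co_flags |= CO_FL_ERROR;
    else if (fd_ev & FD_POLL_HUP)
        *co_flags |= CO_FL_SOCK_RD_SH;   /* clean end of stream */
}
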
At least on a heavily patched 2.6.35.9, we can see splice() fail
with EBADF :
recv(6, "789.123456789.123456789.12345678"..., 1049, 0) = 1049
send(5, "HTTP/1.1 200\r\nContent-length: 10"..., 8030, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_MORE) = 8030
gettimeofday({1352717854, 515601}, NULL) = 0
epoll_wait(0x3, 0x40221008, 0x7, 0) = 0
gettimeofday({1352717854, 515793}, NULL) = 0
pipe([7, 8]) = 0
splice(0x6, 0, 0x8, 0, 0xfe12c, 0x3) = -1 EBADF (Bad file descriptor)
close(6) = 0
This clearly is a kernel issue since all FDs are valid here, so let's
simply disable splice() on the connection when this happens so that
the session correctly recovers from that issue using recv().
A failed send() may return ENOTCONN when the connection is not yet established.
On Linux, we generally see EAGAIN but on OpenBSD we clearly have ENOTCONN, so
let's ensure we poll for write when we encounter this error.
Till now we used to perform the L4_CONN check in the data layer
(eg: stream interface) but that does not make sense, because some transport
layers will imply that the connection is opened (eg: SSL), and also because
the complexity to check for this is higher in the data layer than in the
transport layer. This is so much true that some read0 cases did not validate
the connection.
So as of now, the transport layer is responsible for clearing L4_CONN when
it detects activity, and the data layer may safely rely on this flag. This
only requires a minor change in raw_sock and stream_interface for now.
While working on the changes required to make the health checks use the
new connections, it started to become obvious that some naming was not
logical at all in the connections. Specifically, it is not logical to
call the "data layer" the layer which is in charge for all the handshake
and which does not yet provide a data layer once established until a
session has allocated all the required buffers.
In fact, it's more a transport layer, which makes much more sense. The
transport layer offers a medium on which data can transit, and it offers
the functions to move these data when the upper layer requests this. And
it is the upper layer, which iterates over the transport layer's functions
to move data, that should be called the data layer.
The use case where it's obvious is with embryonic sessions : an incoming
SSL connection is accepted. Only the connection is allocated, not the
buffers nor stream interface, etc... The connection handles the SSL
handshake by itself. Once this handshake is complete, we can't use the
data functions because the buffers and stream interface are not there
yet. Hence we have to first call a specific function to complete the
session initialization, after which we'll be able to use the data
functions. This clearly proves that SSL here is only a transport layer
and that the stream interface constitutes the data layer.
A similar change will be performed to rename app_cb => data, but the
two could not be in the same commit for obvious reasons.
When a connection setup is pending and we receive an error without a
POLL_IN flag, we're certain there will be nothing to read from it and
we can safely report an error without attempting a recv() call. This
will be significantly better for health checks which will avoid a useless
recv() on all failed checks.
Depending on the pollers used, a connection error may be notified
with POLLOUT|POLLERR|POLLHUP. POLLHUP by itself is enough for the
connection handler to call the read actor, which would only consider
this flag as a good indication of a hangup, without considering the
POLLERR flag.
In order to address this, we directly jump to the read0 label if
POLLERR was not set.
This will be important with health checks as we don't want to believe
a connection was properly established when it's not the case !
I/O handlers now all use __conn_{sock,data}_{stop,poll,want}_* instead
of returning dummy flags. The code has become slightly simpler because
some tricks such as the MIN_RET_FOR_READ_LOOP are not needed anymore,
and the data handlers which switch to a handshake handler do not need
to disable themselves anymore.
Some parts of the sock_ops structure were only used by the stream
interface and have been moved into si_ops. Some of them were callbacks
to the stream interface from the connection and have been moved into
app_cb as they're the application seen from the connection (later,
health-checks will need to use them). The rest has moved to data_ops.
Normally at this point the connection could live without knowing about
stream interfaces at all.
The splicing is now provided by the data-layer rcv_pipe/snd_pipe functions
which in turn are called by the stream interface's recv and send callbacks.
The presence of the rcv_pipe/snd_pipe functions is used to attest to
splicing support at the data layer. It looks like the stream-interface's
SI_FL_CAP_SPLICE flag does not make sense anymore as it's used as a proxy
for the pointers above.
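Pictured as a structure, the presence test is simply this (a simplified
layout, not haproxy's exact one):

#include <stddef.h>

struct connection;

/* Simplified data-layer I/O operations as seen from the stream
 * interface. rcv_pipe/snd_pipe are optional : their mere presence
 * advertises splicing support, which is what makes SI_FL_CAP_SPLICE
 * look redundant.
 */
struct data_io_ops {
    int (*rcv_buf) (struct connection *conn, void *buf, int count);
    int (*snd_buf) (struct connection *conn, const void *buf, int count);
    int (*rcv_pipe)(struct connection *conn, int pipe_fd, int count);
    int (*snd_pipe)(struct connection *conn, int pipe_fd);
};

static int data_layer_can_splice(const struct data_io_ops *ops)
{
    return ops->rcv_pipe != NULL && ops->snd_pipe != NULL;
}
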
It also appears that we call chk_snd() from the recv callback and then
try to call it again in update_conn(). It is very likely that this last
function will progressively slip into the recv/send callbacks in order
to avoid duplicate check code.
The code works right now with and without splicing. Only raw_sock provides
support for it and it is automatically selected when the various splice
options are set. However it looks like splice-auto doesn't enable it, which
possibly means that the streamer detection code does not work anymore, or
that it's only called at a time where it's too late to enable splicing (in
process_session).
Similar to what was done on the receive path, the data layer now provides
only an snd_buf() callback that is iterated over by the stream interface's
si_conn_send_loop() function.
The data layer now has no knowledge about channels nor stream interfaces.
The splice() code still needs to be ported as it is currently disabled.
The recv function is now generic and can be used to iterate over any
connection-to-buffer reading function from a stream interface. So let's
move it to stream-interface.
This is the start of the stream connection iterator which calls the
data-layer reader. This still looks a bit tricky but is OK. Splicing
is not handled at all at the moment.
The "raw_sock" prefix will be more convenient for naming functions as
it will be prefixed with the data layer and suffixed with the data
direction. So let's rename the files now to avoid any further confusion.
The #include directive was also removed from a number of files which do
not need it anymore.