I was testing haproxy-1.5-dev22 on SmartOS (an illumos-based system)
and ran into a problem. There's a small window after non-blocking
connect() is called, but before the TCP connection is established,
where recv() may return ENOTCONN. On Linux, the behaviour here seems
to always be to return EAGAIN. The fix is relatively trivial, and
appears to make haproxy work reliably on current SmartOS (see patch
below). It's possible that other UNIX platforms exhibit this
behaviour as well.
Note: the equivalent was already done for send() in commit 0ea0cf6
("BUG: raw_sock: also consider ENOTCONN in addition to EAGAIN").
Both patches should be backported to 1.4.
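For illustration only (this is not the patch referenced above, and the
helper below is hypothetical), the idea boils down to treating ENOTCONN
exactly like EAGAIN on the receive path:

#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: read up to <count> bytes from a non-blocking
 * socket. Returns the byte count on success, 0 on a clean shutdown,
 * -1 when the caller should poll and retry (EAGAIN, EINTR, or ENOTCONN
 * while the connect() is still pending, as seen on SmartOS), and -2 on
 * a hard error.
 */
static ssize_t nb_recv(int fd, void *buf, size_t count)
{
    ssize_t ret = recv(fd, buf, count, 0);

    if (ret >= 0)
        return ret;

    if (errno == EAGAIN || errno == EINTR || errno == ENOTCONN)
        return -1;    /* transient: wait for the poller */

    return -2;        /* real error: report it upstream */
}
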
Passing raw MSG_* flags to snd_buf() prevents us from passing other
useful info and requires the upper levels to know these socket-level
flags. Let's use a new flags category
instead : CO_SFL_*. For now, only MSG_MORE has been remapped.
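As a sketch (the flag value and function name below are illustrative),
the remapping at the transport layer looks like this:

#include <sys/socket.h>

#define CO_SFL_MSG_MORE 0x0001  /* upper layer knows more data will follow */

/* Illustrative: translate connection-level send flags (CO_SFL_*) into
 * the socket-level MSG_* flags used by the raw socket transport.
 */
static int co_sfl_to_msg_flags(unsigned int sfl)
{
    int flags = MSG_DONTWAIT | MSG_NOSIGNAL;

    if (sfl & CO_SFL_MSG_MORE)
        flags |= MSG_MORE;
    return flags;
}
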
Due to a typo, the MSG_MORE flag used to replace MSG_NOSIGNAL and
MSG_DONTWAIT. Fortunately, sockets are always marked non-blocking,
so the loss of MSG_DONTWAIT is harmless, and the loss of MSG_NOSIGNAL
is covered by the interception of SIGPIPE. So no issue could have been
caused by this bug.
This is the reimplementation of the "done" action : when we experience
a short read, we're almost certain that we've exhausted the system's
buffers and that we'll meet an EAGAIN if we attempt to read again. If
the FD is not yet polled, the stream interface already takes care of
stopping the speculative read. When the FD is already being polled, we
have two options :
- either we're running from a level-triggered poller, in which case
we'd rather report that we've reached the end so that we don't
speculate over the poller and let it report next time data are
available ;
- or we're running from an edge-triggered poller in which case we
have no choice and have to see the EAGAIN to re-enable events.
At the moment we don't have any edge-triggered poller, so it's desirable
to avoid speculative I/O that we know will fail.
Note that this must not be ported to SSL since SSL hides the real
readiness of the file descriptor.
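The heuristic may be sketched as follows; the helpers are hypothetical
placeholders for the fd/polling layer described above:

#include <stddef.h>
#include <sys/types.h>

/* Hypothetical helpers standing in for the fd/polling layer */
int  fd_is_polled(int fd);            /* fd registered in the poller ? */
int  poller_is_level_triggered(void); /* property of the active poller */
void stop_speculative_recv(int fd);   /* report "done", wait for poll  */

/* Called after recv() returned <ret> bytes out of <count> requested.
 * On a short read we assume the kernel buffers are empty : rather than
 * speculating and hitting EAGAIN, we let the level-triggered poller
 * tell us when more data arrive.
 */
static void handle_short_read(int fd, ssize_t ret, size_t count)
{
    if (ret <= 0 || (size_t)ret >= count)
        return;                      /* not a short read */

    if (!fd_is_polled(fd) || poller_is_level_triggered())
        stop_speculative_recv(fd);
    /* else: edge-triggered poller, keep reading until EAGAIN */
}
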
Thanks to this change, we observe no EAGAIN anymore during keep-alive
transfers, and failed recvfrom() calls are reduced by half in
http-server-close mode (the client-facing side is always being polled
and the second recv
can be avoided). Doing so results in about 5% performance increase in
keep-alive mode. Similarly, we used to have up to about 1.6% of EAGAIN
on accept() (1/maxaccept), and these have completely disappeared under
high loads.
The recv/send callbacks must check for readiness themselves instead of
having their callers do it. This will strengthen the test and will also
ensure we never refrain from calling a handshake handler because a
direction is being polled while the other one is ready.
We simply remove these functions and replace their calls with the
appropriate ones :
- if we're in the data phase, we can simply report that we want to
wait on the FD
- if we're in the socket phase, we may also have to signal the
desire to read/write on the socket because it might not be
active yet.
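A sketch of the replacement, using the __conn_{sock,data}_* helper
naming quoted later in this series (the exact prototypes here are
assumptions):

struct connection;   /* haproxy connection object */

/* Assumed prototypes following the __conn_{sock,data}_{stop,poll,want}_*
 * naming used by this series.
 */
void __conn_data_poll_recv(struct connection *conn);
void __conn_sock_want_recv(struct connection *conn);

/* Illustrative: what replaces the removed function on the receive side */
static void wait_for_more_data(struct connection *conn, int in_handshake)
{
    if (!in_handshake) {
        /* data phase : simply report that we wait on the FD */
        __conn_data_poll_recv(conn);
    }
    else {
        /* socket phase : the read direction might not be active yet on
         * the socket, so signal the desire to read there as well.
         */
        __conn_sock_want_recv(conn);
    }
}
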
Steve Ruiz reported some reproducible crashes with HTTP health checks
on a certain page returning a huge length. The traces he provided
clearly showed that the recv() call was performed twice for a total
size exceeding the buffer's length.
Cyril Bonté tracked down the problem to be caused by the full buffer
size being passed to rcv_buf() in event_srv_chk_r() instead of passing
just the remaining amount of space. Indeed, this change happened during
the connection rework in 1.5-dev13 with the following commit :
f150317 MAJOR: checks: completely use the connection transport layer
But one of the problems is also that the comments at the top of the
rcv_buf() functions suggest that the caller only has to ensure the
requested size doesn't overflow the buffer's size.
Also, these functions already have to care about the buffer's size to
handle wrapping free space when there are pending data in the buffer.
So let's change the API instead to more closely match what could be
expected from these functions :
- the caller asks for the maximum amount of bytes it wants to read ;
This means that only the caller is responsible for enforcing the
reserve if it wants to (eg: checks don't).
- the rcv_buf() functions fix their computations to always consider
this size as a max, and always perform validity checks based on
the buffer's free space.
As a result, the code is simplified and reduced, and made more robust
for callers which now just have to care about whether they want the
buffer to be filled or not.
Since the bug was introduced in 1.5-dev13, no backport to stable versions
is needed.
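The new contract can be illustrated with a simplified circular buffer
(the structure and names below are illustrative, not haproxy's exact
ones):

#include <stddef.h>

/* Simplified circular buffer : <size> total bytes, <data> bytes
 * pending, <head> is the index of the first pending byte.
 */
struct sbuf {
    char   *area;
    size_t  size;
    size_t  head;
    size_t  data;
};

/* rcv_buf-style contract : <count> is only the maximum the caller
 * wants ; the function clamps it to the contiguous free space, taking
 * wrapping into account, so the caller can never overflow the buffer.
 */
static size_t sbuf_recv_room(const struct sbuf *b, size_t count)
{
    size_t tail   = (b->head + b->data) % b->size;
    size_t room   = b->size - b->data;    /* total free space */
    size_t contig = b->size - tail;       /* free space up to the end */

    if (contig > room)
        contig = room;                    /* pending data wraps around */
    return count < contig ? count : contig;
}
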
Currently the control and transport layers of a connection are supposed
to be initialized when their respective pointers are not NULL. This will
not work anymore when we plan to reuse connections, because there is an
asymmetry between the accept() side and the connect() side :
- on accept() side, the fd is set first, then the ctrl layer then the
transport layer ; upon error, they must be undone in the reverse order,
then the FD must be closed. The FD must not be deleted if the control
layer was not yet initialized ;
- on the connect() side, the fd is set last and there is no reliable way
to know if it has been initialized or not. In practice it's initialized
to -1 first but this is hackish and supposes that local FDs only will
be used forever. Also, there are even fewer solutions for keeping track
of the transport layer's state.
Also, we want to be able to support a delayed close() when something
(eg: logs) tracks some information requiring the transport and/or
control layers, which makes it even more difficult to clean them up.
So the proposed solution is to add two flags to the connection :
- CO_FL_CTRL_READY is set when the control layer is initialized (fd_insert)
and cleared after it's released (fd_delete).
- CO_FL_XPRT_READY is set when the transport layer is initialized (xprt->init)
and cleared after it's released (xprt->close).
The functions have been adapted to rely on this and not on the pointers
anymore. conn_xprt_close() was unused and dangerous : it did not close
the control layer (eg: the socket itself) but still marked the transport
layer as closed, preventing any future call to conn_full_close() from
finishing the job.
The problem comes from conn_full_close() in fact. It needs to close the
xprt and ctrl layers independently. After that we still have an issue :
we don't know based on ->ctrl alone whether the fd was registered or not.
For this we use the two new flags CO_FL_XPRT_READY and CO_FL_CTRL_READY. We
now rely on this and not on conn->xprt nor conn->ctrl anymore to decide what
remains to be done on the connection.
In order not to miss some flag assignments, we introduce conn_ctrl_init()
to initialize the control layer, register the fd using fd_insert() and set
the flag, and conn_ctrl_close() which unregisters the fd and removes the
flag, but only if the transport layer was closed.
Similarly, at the transport layer, conn_xprt_init() calls ->init and sets
the flag, while conn_xprt_close() checks the flag, calls ->close and clears
the flag, regardless of xprt_ctx or xprt_st. This also ensures that the ->init
and the ->close functions are called only once each and in the correct order.
Note that conn_xprt_close() does nothing if the transport layer is still
tracked.
conn_full_close() now simply calls conn_xprt_close() then conn_ctrl_close()
in turn, which do nothing if CO_FL_XPRT_TRACKED is set.
In order to handle the error path, we also provide conn_force_close() which
ignores CO_FL_XPRT_TRACKED and closes the transport and the control layers
in turn. All relevant instances of fd_delete() have been replaced with
conn_force_close(). Now we always know what state the connection is in and
we can expect to split its initialization.
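The readiness tracking can be condensed as follows; the function names
come from this commit, while the structures, flag values and fd
prototypes are simplified assumptions:

/* Connection flags (values are illustrative) */
#define CO_FL_CTRL_READY   0x00000001  /* fd registered via fd_insert() */
#define CO_FL_XPRT_READY   0x00000002  /* xprt->init() has been called  */
#define CO_FL_XPRT_TRACKED 0x00000004  /* xprt must not be released yet */

struct connection;

struct xprt_ops {
    int  (*init)(struct connection *conn);
    void (*close)(struct connection *conn);
};

struct connection {
    unsigned int           flags;
    const struct xprt_ops *xprt;
    int                    fd;
};

void fd_insert(int fd);   /* register the fd (assumed prototype)     */
void fd_delete(int fd);   /* unregister and close it (assumed proto) */

static void conn_ctrl_init(struct connection *conn)
{
    if (!(conn->flags & CO_FL_CTRL_READY)) {
        fd_insert(conn->fd);
        conn->flags |= CO_FL_CTRL_READY;
    }
}

static void conn_xprt_close(struct connection *conn)
{
    if ((conn->flags & CO_FL_XPRT_READY) &&
        !(conn->flags & CO_FL_XPRT_TRACKED)) {
        if (conn->xprt->close)
            conn->xprt->close(conn);
        conn->flags &= ~CO_FL_XPRT_READY;
    }
}

static void conn_ctrl_close(struct connection *conn)
{
    /* the fd may only go away once the transport layer is gone */
    if ((conn->flags & CO_FL_CTRL_READY) &&
        !(conn->flags & CO_FL_XPRT_READY)) {
        fd_delete(conn->fd);
        conn->flags &= ~CO_FL_CTRL_READY;
    }
}
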
In some places, we report an error by just detecting FD_POLL_ERR.
The problem is that the caller never knows if it must use errno or
call getsockopt(SO_ERROR). And since this last one clears the
pending error from the queue, it cannot be used unconditionally.
An elegant solution consists in clearing errno prior to inspecting
FD_POLL_ERR. The caller then knows that if it gets CO_FL_ERROR and
errno == 0, it must call getsockopt().
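In code, the convention amounts to something like this (the CO_FL_ERROR
plumbing is omitted and the helper names are illustrative):

#include <errno.h>
#include <sys/socket.h>

/* Reporting side : clear errno before reporting a purely polled error,
 * so the caller can tell "flag only" apart from an error already
 * carried by errno.
 */
static int report_polled_error(int fd_poll_err)
{
    if (fd_poll_err) {
        errno = 0;          /* no syscall error is pending  */
        return 1;           /* caller will set CO_FL_ERROR  */
    }
    return 0;
}

/* Caller side : fetch the real error only when errno gave nothing,
 * since getsockopt(SO_ERROR) clears the socket's pending error.
 */
static int fetch_conn_error(int fd)
{
    int       err = errno;
    socklen_t len = sizeof(err);

    if (!err)
        getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
    return err;
}
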
When we get a hard error from a syscall indicating the socket is dead,
it makes sense to set the CO_FL_SOCK_WR_SH and CO_FL_SOCK_RD_SH flags
to indicate that the socket may not be used anymore. It will ease the
error processing in health checks where the state of the socket is very
important. We'll also be able to avoid some setsockopt(nolinger) after
an error.
For now, the rest of the code is not impacted because CO_FL_ERROR is
always tested prior to these flags.
Mark Janssen reported an issue in 1.5-dev19 which was introduced
in 1.5-dev12 by commit 96199b10. From time to time, randomly, the
CPU usage spikes to 100% for seconds to minutes.
A deep analysis of the traces provided shows that it happens when
waiting for the response to a second pipelined HTTP request, or
when trying to handle the received shutdown advertised by epoll()
after the last block of data. Each time, splice() was involved with
data pending in the pipe.
The cause of this was that such events could not be taken into account
by either splice() or recv() and were left pending :
- the transfer of the last block of data, optionally with a shutdown
was not handled by splice() because of the validation that to_forward
is higher than MIN_SPLICE_FORWARD ;
- the next recv() call was inhibited because of the test on presence
of data in the pipe. This is also what prevented the recv() call
from handling a response to a pipelined request until the client
had ACKed the previous response.
No less than 4 different methods were experimented to fix this, and the
current one was finally chosen. The principle is that if an event is not
caught by splice(), then it MUST be caught by recv(). So we remove the
condition on the pipe's emptiness to perform a recv(), and in order to
prevent recv() from being used in the middle of a transfer, we mark
supposedly full pipes with CO_FL_WAIT_ROOM, which makes sense because
the reason for stopping a splice()-based receive is that the pipe is
supposed to be full.
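The principle can be sketched like this (CO_FL_WAIT_ROOM comes from the
commit, the threshold and helpers are illustrative):

#define CO_FL_WAIT_ROOM  0x0001   /* pipe supposedly full : stop receiving */
#define PIPE_FULL_HINT   65536    /* illustrative "pipe looks full" level  */

/* After a splice()-based receive : if we stopped because the pipe looks
 * full, say so explicitly instead of leaving the state ambiguous.
 */
static void mark_pipe_full(unsigned int *co_flags, long pipe_data)
{
    if (pipe_data >= PIPE_FULL_HINT)
        *co_flags |= CO_FL_WAIT_ROOM;
}

/* Receive gate : the old code refused to recv() whenever the pipe held
 * any data ; now we only skip recv() when the pipe was flagged as full,
 * so any event missed by splice() is necessarily caught by recv().
 */
static int may_recv_to_buffer(unsigned int co_flags)
{
    return !(co_flags & CO_FL_WAIT_ROOM);
}
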
The net effect is that we don't wake up and sleep in loops during these
transient states. This happened much more often than expected, sometimes
for a few cycles at end of transfers, but rarely long enough to be
noticed, unless a client timed out with data pending in the pipe. The
effect on CPU usage is visible even when transferring 1MB objects in
pipeline, where the CPU usage drops from 10% to 6% on a small machine at
medium bandwidth.
Some further improvements are needed :
- the last chunk of a splice() transfer is never done using splice due
to the test on to_forward. This is wrong and should be performed with
splice if the pipe has not yet been emptied ;
- si_chk_snd() should not be called when the write event is already being
polled, otherwise we're almost certain to get EAGAIN.
Many thanks to Mark for all the traces he cared to provide, they were
essential for understanding this issue which was not reproducible
without.
Only 1.5-dev is affected, no backport is needed.
Commit 96199b10 reintroduced the splice() mechanism in the new connection
system. However, it failed to account for the number of transferred bytes,
allowing more bytes than scheduled to be transferred to the client. This
can cause an issue with small-chunked responses, where each packet from
the server may contain several chunks, because a first splice() call may
succeed, then we try to splice() a second time since the pipe is not
full, thus consuming the next chunk's size.
This patch also reverts commit baf2a5 ("OPTIM: splice: detect shutdowns...")
because it introduced a related regression. The issue is that splice() may
also return less data than available when the pipe is full, so seeing EPOLLRDHUP
after splice() returns less than expected is not a sufficient indication that
the input is empty.
In both cases, the issue may be detected by the presence of "SD" termination
flags in the logs, and worked around by disabling splicing (using "-dS").
This problem was reported by Sander Klein, and no backport is needed.
Linux kernels between 2.6.25 and 2.6.27.12 have a bogus splice()
implementation which returns EAGAIN on incoming shutdowns. On these
versions, we have to call recv() after such a return in order to find
out whether splice is OK or not. Since 2.6.27.13 we don't need to do
this anymore, saving one useless recv() call after each splice()
returning EAGAIN, and we can avoid this logic by defining
ASSUME_SPLICE_WORKS.
Building with the linux2628 target automatically enables splice and the
flag above since the kernel is safe. People enabling splice for custom
kernels will be able to disable this logic by hand too.
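The build-time switch boils down to a test of this kind (a sketch; the
surrounding retry logic is simplified):

#include <errno.h>

/* Decide whether an EAGAIN returned by splice() may actually hide an
 * incoming shutdown. On kernels known to be safe (the linux2628 target,
 * or a custom build defining ASSUME_SPLICE_WORKS), the confirming
 * recv() call is skipped entirely.
 */
static int splice_eagain_may_hide_shutdown(void)
{
#ifdef ASSUME_SPLICE_WORKS
    return 0;                 /* 2.6.27.13+ : EAGAIN really means EAGAIN */
#else
    return errno == EAGAIN;   /* old kernels : recv() once to check      */
#endif
}
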
Since last commit introducing EPOLLRDHUP, the splicing code is able to
detect an incoming shutdown without calling splice() == 0. This avoids
one useless syscall.
In raw_sock, we already check for FD_POLL_HUP after a short recv()
to avoid a useless syscall and detect the end of stream. However,
we fail to check for FD_POLL_ERR here, which causes major issues
as some errors might be ignored when they are delivered at the same
time as a HUP, and there is no data to send that could reveal them
in the other direction.
Since the connection's flags do not include CO_FL_ERROR, the
polling is not disabled on the socket and the pollers immediately
call the conn_fd_handler() again, resulting in CPU spikes for as
long as the timeouts allow them.
Note that this patch alone fixes the issue but a few patches will
follow to strengthen this fragile area.
Big thanks to Bryan Berry who reported the issue with significant
amounts of detailed traces that helped rule out many other initially
suspected causes and to finally reproduce the issue in the lab.
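The strengthened test looks roughly like this (the flag names exist in
haproxy's fd and connection layers, but their values and the plumbing
here are illustrative):

#define FD_POLL_HUP       0x0001   /* hangup reported by the poller */
#define FD_POLL_ERR       0x0002   /* error reported by the poller  */

#define CO_FL_SOCK_RD_SH  0x0001   /* read shutdown on the socket   */
#define CO_FL_ERROR       0x0002   /* fatal error on the connection */

/* After a short recv(), inspect the events the poller reported for the
 * fd. An error must be honoured even when it arrives together with a
 * HUP, otherwise CO_FL_ERROR is never set, polling stays enabled and
 * the fd handler is woken up in a tight loop.
 */
static void check_events_after_short_read(unsigned int fd_ev,
                                          unsigned int *co_flags)
{
    if (fd_ev & FD_POLL_ERR)
        *co_flags |= CO_FL_ERROR;
    else if (fd_ev & FD_POLL_HUP)
        *co_flags |= CO_FL_SOCK_RD_SH;   /* clean end of stream */
}
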
At least on a heavily patched 2.6.35.9, we can see splice() fail
with EBADF :
recv(6, "789.123456789.123456789.12345678"..., 1049, 0) = 1049
send(5, "HTTP/1.1 200\r\nContent-length: 10"..., 8030, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_MORE) = 8030
gettimeofday({1352717854, 515601}, NULL) = 0
epoll_wait(0x3, 0x40221008, 0x7, 0) = 0
gettimeofday({1352717854, 515793}, NULL) = 0
pipe([7, 8]) = 0
splice(0x6, 0, 0x8, 0, 0xfe12c, 0x3) = -1 EBADF (Bad file descriptor)
close(6) = 0
This clearly is a kernel issue since all FDs are valid here, so let's
simply disable splice() on the connection when this happens so that
the session correctly recovers from that issue using recv().
A failed send() may return ENOTCONN when the connection is not yet established.
On Linux, we generally see EAGAIN but on OpenBSD we clearly have ENOTCONN, so
let's ensure we poll for write when we encounter this error.
Till now we used to perform the L4_CONN check in the data layer
(eg: stream interface) but that does not make sense, because some transport
layers will imply that the connection is opened (eg: SSL), and also because
the complexity to check for this is higher in the data layer than in the
transport layer. This is so much true that some read0 cases did not validate
the connection.
So as of now, the transport layer is responsible for clearing L4_CONN when
it detects activity, and the data layer may safely rely on this flag. This
only requires a minor change in raw_sock and stream_interface for now.
While working on the changes required to make the health checks use the
new connections, it started to become obvious that some naming was not
logical at all in the connections. Specifically, it is not logical to
call the "data layer" the layer which is in charge for all the handshake
and which does not yet provide a data layer once established until a
session has allocated all the required buffers.
In fact, it's more a transport layer, which makes much more sense. The
transport layer offers a medium on which data can transit, and it offers
the functions to move these data when the upper layer requests this. And
it is the upper layer, which iterates over the transport layer's functions
to move data, that should be called the data layer.
The use case where it's obvious is with embryonic sessions : an incoming
SSL connection is accepted. Only the connection is allocated, not the
buffers nor stream interface, etc... The connection handles the SSL
handshake by itself. Once this handshake is complete, we can't use the
data functions because the buffers and stream interface are not there
yet. Hence we have to first call a specific function to complete the
session initialization, after which we'll be able to use the data
functions. This clearly proves that SSL here is only a transport layer
and that the stream interface constitutes the data layer.
A similar change will be performed to rename app_cb => data, but the
two could not be in the same commit for obvious reasons.
When a connection setup is pending and we receive an error without a
POLL_IN flag, we're certain there will be nothing to read from it and
we can safely report an error without attempting a recv() call. This
will be significantly better for health checks which will avoid a useless
recv() on all failed checks.
Depending on the pollers used, a connection error may be notified
with POLLOUT|POLLERR|POLLHUP. POLLHUP by itself is enough for the
connection handler to call the read actor, which would only consider
this flag as a good indication of a hangup, without considering the
POLLERR flag.
In order to address this, we directly jump to the read0 label if
POLLERR was not set.
This will be important with health checks as we don't want to believe
a connection was properly established when it's not the case !
I/O handlers now all use __conn_{sock,data}_{stop,poll,want}_* instead
of returning dummy flags. The code has become slightly simpler because
some tricks such as the MIN_RET_FOR_READ_LOOP are not needed anymore,
and the data handlers which switch to a handshake handler do not need
to disable themselves anymore.
Some parts of the sock_ops structure were only used by the stream
interface and have been moved into si_ops. Some of them were callbacks
to the stream interface from the connection and have been moved into
app_cb as they're the application seen from the connection (later,
health-checks will need to use them). The rest has moved to data_ops.
Normally at this point the connection could live without knowing about
stream interfaces at all.
The splicing is now provided by the data-layer rcv_pipe/snd_pipe functions
which in turn are called by the stream interface's recv and send callbacks.
The presence of the rcv_pipe/snd_pipe functions is used to attest to
splicing support at the data layer. It looks like the stream-interface's
SI_FL_CAP_SPLICE flag does not make sense anymore as it's used as a proxy
for the pointers above.
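Pictured as a structure, the presence test is simply this (a simplified
layout, not haproxy's exact one):

#include <stddef.h>

struct connection;

/* Simplified data-layer I/O operations as seen from the stream
 * interface. rcv_pipe/snd_pipe are optional : their mere presence
 * advertises splicing support, which is what makes SI_FL_CAP_SPLICE
 * look redundant.
 */
struct data_io_ops {
    int (*rcv_buf) (struct connection *conn, void *buf, int count);
    int (*snd_buf) (struct connection *conn, const void *buf, int count);
    int (*rcv_pipe)(struct connection *conn, int pipe_fd, int count);
    int (*snd_pipe)(struct connection *conn, int pipe_fd);
};

static int data_layer_can_splice(const struct data_io_ops *ops)
{
    return ops->rcv_pipe != NULL && ops->snd_pipe != NULL;
}
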
It also appears that we call chk_snd() from the recv callback and then
try to call it again in update_conn(). It is very likely that this last
function will progressively slip into the recv/send callbacks in order
to avoid duplicate check code.
The code works right now with and without splicing. Only raw_sock provides
support for it and it is automatically selected when the various splice
options are set. However it looks like splice-auto doesn't enable it, which
possibly means that the streamer detection code does not work anymore, or
that it's only called at a time where it's too late to enable splicing (in
process_session).
Similar to what was done on the receive path, the data layer now provides
only an snd_buf() callback that is iterated over by the stream interface's
si_conn_send_loop() function.
The data layer now has no knowledge about channels nor stream interfaces.
The splice() code still needs to be ported as it is currently disabled.
The recv function is now generic and can be used to iterate over any
connection-to-buffer reading function from a stream interface. So let's
move it to stream-interface.
This is the start of the stream connection iterator which calls the
data-layer reader. This still looks a bit tricky but is OK. Splicing
is not handled at all at the moment.
The "raw_sock" prefix will be more convenient for naming functions as
it will be prefixed with the data layer and suffixed with the data
direction. So let's rename the files now to avoid any further confusion.
The #include directive was also removed from a number of files which do
not need it anymore.