haproxy public development tree
Go to file
Willy Tarreau 0630038e77 BUG/MEDIUM: ssl: check a connection's status before computing a handshake
As spotted in issue #822, we're having a problem with error detection in
the SSL layer. The problem is that on an overwhelmed machine, accepted
connections can start to pile up, each of them requiring a slow handshake,
and during all this time if the client aborts, the handshake will still be
calculated.

The error controls are properly placed, it's just that the SSL layer
reads records exactly of the advertised size, without having the ability
to encounter a pending connection error. As such if injecting many TLS
connections to a listener with a huge backlog, it's fairly possible to
meet this situation:

  12:50:48.236056 accept4(8, {sa_family=AF_INET, sin_port=htons(62794), sin_addr=inet_addr("127.0.0.1")}, [128->16], SOCK_NONBLOCK) = 1109
  12:50:48.236071 setsockopt(1109, SOL_TCP, TCP_NODELAY, [1], 4) = 0
  (process other connections' handshakes)

  12:50:48.257270 getsockopt(1109, SOL_SOCKET, SO_ERROR, [ECONNRESET], [4]) = 0
  (proof that error was detectable there but this code was added for the PoC)

  12:50:48.257297 recvfrom(1109, "\26\3\1\2\0", 5, 0, NULL, NULL) = 5
  12:50:48.257310 recvfrom(1109, "\1\0\1\3"..., 512, 0, NULL, NULL) = 512

  (handshake calculation taking 700us)

  12:50:48.258004 sendto(1109, "\26\3\3\0z"..., 1421, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = -1 EPIPE (Broken pipe)
  12:50:48.258036 close(1109)             = 0

The situation was amplified by the multi-queue accept code, as it resulted
in many incoming connections to be accepted long before they could be
handled. Prior to this they would have been accepted and the handshake
immediately started, which would have resulted in most of the connections
waiting in the the system's accept queue, and dying there when the client
aborted, thus the error would have been detected before even trying to
pass them to the handshake code.

As a result, with a listener running on a very large backlog, it's possible
to quickly accept tens of thousands of connections and waste time slowly
running their handshakes while they get replaced by other ones.

This patch adds an SO_ERROR check on the connection's FD before starting
the handshake. This is not pretty as it requires to access the FD, but it
does the job.

Some improvements should be made over the long term so that the transport
layers can report extra information with their ->rcv_buf() call, or at the
very least, implement a ->get_conn_status() function to report various
flags such as shutr, shutw, error at various stages, allowing an upper
layer to inquire for the relevance of engaging into a long operation if
it's known the connection is not usable anymore. An even simpler step
could probably consist in implementing this in the control layer.

This patch is simple enough to be backported as far as 2.0.

Many thanks to @ngaugler for his numerous tests with detailed feedback.
2021-02-02 15:55:53 +01:00
.github CI: Fix the coverity builds 2021-01-28 20:40:14 +01:00
contrib BUG/MINOR: contrib/prometheus-exporter: Restart labels dump at the right pos 2021-02-01 15:21:55 +01:00
doc MINOR: activity: add a new "show tasks" command to list currently active tasks 2021-01-29 12:12:28 +01:00
examples
include MINOR: stats: improve pending connections description 2021-02-01 15:16:33 +01:00
reg-tests BUG/MEDIUM: ssl/cli: abort ssl cert is freeing the old store 2021-02-01 17:58:21 +01:00
scripts BUG/MINOR: reg-tests: fix service dependency script 2021-01-11 14:16:06 +01:00
src BUG/MEDIUM: ssl: check a connection's status before computing a handshake 2021-02-02 15:55:53 +01:00
tests MEDIUM: config: remove the deprecated and dangerous global "debug" directive 2020-10-09 19:18:45 +02:00
.cirrus.yml CI: cirrus: drop CentOS 6 builds 2020-12-16 09:21:51 +01:00
.gitattributes
.gitignore CLEANUP: Update .gitignore 2020-09-12 13:11:24 +02:00
.travis.yml CI: travis-ci: drop coverity scan builds 2020-12-22 19:39:23 +01:00
BRANCHES DOC: fix some spelling issues over multiple files 2021-01-08 14:53:47 +01:00
CHANGELOG [RELEASE] Released version 2.4-dev6 2021-01-22 16:19:46 +01:00
CONTRIBUTING DOC: fix some spelling issues over multiple files 2021-01-08 14:53:47 +01:00
INSTALL DOC: fix some spelling issues over multiple files 2021-01-08 14:53:47 +01:00
LICENSE
MAINTAINERS DOC: Add maintainers for the Prometheus exporter 2021-01-08 15:14:15 +01:00
Makefile MINOR: build: discard echoing in help target 2021-01-18 08:58:33 +01:00
README
ROADMAP
SUBVERS
VERDATE [RELEASE] Released version 2.4-dev6 2021-01-22 16:19:46 +01:00
VERSION [RELEASE] Released version 2.4-dev6 2021-01-22 16:19:46 +01:00

The HAProxy documentation has been split into a number of different files for
ease of use.

Please refer to the following files depending on what you're looking for :

  - INSTALL for instructions on how to build and install HAProxy
  - BRANCHES to understand the project's life cycle and what version to use
  - LICENSE for the project's license
  - CONTRIBUTING for the process to follow to submit contributions

The more detailed documentation is located into the doc/ directory :

  - doc/intro.txt for a quick introduction on HAProxy
  - doc/configuration.txt for the configuration's reference manual
  - doc/lua.txt for the Lua's reference manual
  - doc/SPOE.txt for how to use the SPOE engine
  - doc/network-namespaces.txt for how to use network namespaces under Linux
  - doc/management.txt for the management guide
  - doc/regression-testing.txt for how to use the regression testing suite
  - doc/peers.txt for the peers protocol reference
  - doc/coding-style.txt for how to adopt HAProxy's coding style
  - doc/internals for developer-specific documentation (not all up to date)