From 2212e6a9e22ff7e0ebf7acef552e7e61394b9c25 Mon Sep 17 00:00:00 2001 From: Willy Tarreau Date: Tue, 13 Oct 2015 14:40:55 +0200 Subject: [PATCH] DOC: add the "management" documentation This doc explains how to start/stop haproxy, what signals are used and a few debugging tricks. It's far from being complete but should already help a number of users. The stats part will be taken from the config doc. --- doc/management.txt | 1196 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 1196 insertions(+) create mode 100644 doc/management.txt diff --git a/doc/management.txt b/doc/management.txt new file mode 100644 index 0000000000..93f2270b3e --- /dev/null +++ b/doc/management.txt @@ -0,0 +1,1196 @@ + ------------------------ + HAProxy Management Guide + ------------------------ + version 1.6 + + +This document describes how to start, stop, manage, and troubleshoot HAProxy, +as well as some known limitations and traps to avoid. It does not describe how +to configure it (for this please read configuration.txt). + +Note to documentation contributors : + This document is formatted with 80 columns per line, with even number of + spaces for indentation and without tabs. Please follow these rules strictly + so that it remains easily printable everywhere. If you add sections, please + update the summary below for easier searching. + + +Summary +------- + +1. Prerequisites +2. Quick reminder about HAProxy's architecture +3. Starting HAProxy +4. Stopping and restarting HAProxy +5. File-descriptor limitations +6. Memory management +7. CPU usage +8. Logging +9. Statistics and monitoring +10. Tricks for easier configuration management +11. Well-known traps to avoid +12. Debugging and performance issues +13. Security considerations + + +1. 
Prerequisites +---------------- + +In this document it is assumed that the reader has sufficient administration +skills on a UNIX-like operating system, uses the shell on a daily basis and is +familiar with troubleshooting utilities such as strace and tcpdump. + + +2. Quick reminder about HAProxy's architecture +---------------------------------------------- + +HAProxy is a single-threaded, event-driven, non-blocking daemon. This means it +uses event multiplexing to schedule all of its activities instead of relying on +the system to schedule between multiple activities. Most of the time it runs as +a single process, so the output of "ps aux" on a system will report only one +"haproxy" process, unless a soft reload is in progress and an older process is +finishing its job in parallel to the new one. It is thus always easy to trace +its activity using the strace utility. + +HAProxy is designed to isolate itself into a chroot jail during startup, where +it cannot perform any file-system access at all. This is also true for the +libraries it depends on (eg: libc, libssl, etc). The immediate effect is that +a running process will not be able to reload a configuration file to apply +changes; instead a new process will be started using the updated configuration +file. Some other less obvious effects are that some timezone files or resolver +files the libc might attempt to access at run time will not be found, though +this should generally not happen as they're not needed after startup. A nice +consequence of this principle is that the HAProxy process is totally stateless, +and no cleanup is needed after it's killed, so any killing method that works +will do the right thing. + +HAProxy doesn't write log files, but it relies on the standard syslog protocol +to send logs to a remote server (which is often located on the same system). + +HAProxy uses its internal clock to enforce timeouts. This clock is derived from +the system's time but unexpected drift is corrected. 
This is done by limiting +the time spent waiting in poll() for an event, and measuring the time it really +took. In practice it never waits more than one second. This explains why, when +running strace over a completely idle process, periodic calls to poll() (or any +of its variants) surrounded by two gettimeofday() calls are noticed. They are +normal, completely harmless and so cheap that the load they imply is totally +undetectable at the system scale, so there's nothing abnormal there. Example : + + 16:35:40.002320 gettimeofday({1442759740, 2605}, NULL) = 0 + 16:35:40.002942 epoll_wait(0, {}, 200, 1000) = 0 + 16:35:41.007542 gettimeofday({1442759741, 7641}, NULL) = 0 + 16:35:41.007998 gettimeofday({1442759741, 8114}, NULL) = 0 + 16:35:41.008391 epoll_wait(0, {}, 200, 1000) = 0 + 16:35:42.011313 gettimeofday({1442759742, 11411}, NULL) = 0 + +HAProxy is a TCP proxy, not a router. It deals with established connections that +have been validated by the kernel, and not with packets of any form nor with +sockets in other states (eg: no SYN_RECV nor TIME_WAIT), though their existence +may prevent it from binding a port. It relies on the system to accept incoming +connections and to initiate outgoing connections. An immediate effect of this is +that there is no relation between packets observed on the two sides of a +forwarded connection, which can be of different size, numbers and even family. +Since a connection may only be accepted from a socket in LISTEN state, all the +sockets it is listening to are necessarily visible using the "netstat" utility +to show listening sockets. Example : + + # netstat -ltnp + Active Internet connections (only servers) + Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name + tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 1629/sshd + tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN 2847/haproxy + tcp 0 0 0.0.0.0:443 0.0.0.0:* LISTEN 2847/haproxy + + +3. 
Starting HAProxy +------------------- + +HAProxy is started by invoking the "haproxy" program with a number of arguments +passed on the command line. The actual syntax is : + + $ haproxy []* + +where []* is any number of options. An option always starts with '-' +followed by one or more letters, and possibly followed by one or multiple extra +arguments. Without any option, HAProxy displays the help page with a reminder +about supported options. Available options may vary slightly based on the +operating system. A fair number of these options overlap with an equivalent one +in the "global" section. In this case, the command line always has precedence +over the configuration file, so that the command line can be used to quickly +enforce some settings without touching the configuration files. The current +list of options is : + + -- * : all the arguments following "--" are paths to configuration + files to be loaded and processed in the declaration order. It is mostly + useful when relying on the shell to load many files that are numerically + ordered. See also "-f". The difference between "--" and "-f" is that one + "-f" must be placed before each file name, while a single "--" is needed + before all file names. Both options can be used together, the command line + ordering still applies. When more than one file is specified, each file + must start on a section boundary, so the first keyword of each file must be + one of "global", "defaults", "peers", "listen", "frontend", "backend", and + so on. A file cannot contain just a server list for example. + + -f : adds to the list of configuration files to be + loaded. Configuration files are loaded and processed in their declaration + order. This option may be specified multiple times to load multiple files. + See also "--". The difference between "--" and "-f" is that one "-f" must + be placed before each file name, while a single "--" is needed before all + file names. 
Both options can be used together, the command line ordering + still applies. When more than one file is specified, each file must start + on a section boundary, so the first keyword of each file must be one of + "global", "defaults", "peers", "listen", "frontend", "backend", and so + on. A file cannot contain just a server list for example. + + -C : changes to directory before loading configuration + files. This is useful when using relative paths. Warning when using + wildcards after "--" which are in fact replaced by the shell before + starting haproxy. + + -D : start as a daemon. The process detaches from the current terminal after + forking, and errors are not reported anymore in the terminal. It is + equivalent to the "daemon" keyword in the "global" section of the + configuration. It is recommended to always force it in any init script so + that a faulty configuration doesn't prevent the system from booting. + + -Ds : work in systemd mode. Only used by the systemd wrapper. + + -L : change the local peer name to , which defaults to the local + hostname. This is used only with peers replication. + + -N : sets the default per-proxy maxconn to instead of the + builtin default value (usually 2000). Only useful for debugging. + + -V : enable verbose mode (disables quiet mode). Reverts the effect of "-q" or + "quiet". + + -c : only performs a check of the configuration files and exits before trying + to bind. The exit status is zero if everything is OK, or non-zero if an + error is encountered. + + -d : enable debug mode. This disables daemon mode, forces the process to stay + in foreground and to show incoming and outgoing events. It is equivalent to + the "global" section's "debug" keyword. It must never be used in an init + script. + + -dG : disable use of getaddrinfo() to resolve host names into addresses. It + can be used when suspecting that getaddrinfo() doesn't work as expected. 
+ This option was made available because many bogus implementations of + getaddrinfo() exist on various systems and cause anomalies that are + difficult to troubleshoot. + + -dM[] : forces memory poisoning, which means that each and every + memory region allocated with malloc() or pool_alloc2() will be filled with + before being passed to the caller. When is not specified, it + defaults to 0x50 ('P'). While this slightly slows down operations, it is + useful to reliably trigger issues resulting from missing initializations in + the code that cause random crashes. Note that -dM0 has the effect of + turning any malloc() into a calloc(). In any case if a bug appears or + disappears when using this option it means there is a bug in haproxy, so + please report it. + + -dS : disable use of the splice() system call. It is equivalent to the + "global" section's "nosplice" keyword. This may be used when splice() is + suspected to behave improperly or to cause performance issues, or when + using strace to see the forwarded data (which do not appear when using + splice()). + + -dV : disable SSL verify on the server side. It is equivalent to having + "ssl-server-verify none" in the "global" section. This is useful when + trying to reproduce production issues out of the production + environment. Never use this in an init script as it degrades SSL security + to the servers. + + -db : disable background mode and multi-process mode. The process remains in + foreground. It is mainly used during development or during small tests, as + Ctrl-C is enough to stop the process. Never use it in an init script. + + -de : disable the use of the "epoll" poller. It is equivalent to the "global" + section's keyword "noepoll". It is mostly useful when suspecting a bug + related to this poller. On systems supporting epoll, the fallback will + generally be the "poll" poller. + + -dk : disable the use of the "kqueue" poller. It is equivalent to the + "global" section's keyword "nokqueue". 
It is mostly useful when suspecting + a bug related to this poller. On systems supporting kqueue, the fallback + will generally be the "poll" poller. + + -dp : disable the use of the "poll" poller. It is equivalent to the "global" + section's keyword "nopoll". It is mostly useful when suspecting a bug + related to this poller. On systems supporting poll, the fallback will + generally be the "select" poller, which cannot be disabled and is limited + to 1024 file descriptors. + + -m : limit the total allocatable memory to megabytes per + process. This may cause some connection refusals or some slowdowns + depending on the amount of memory needed for normal operations. This is + mostly used to force the process to work in a constrained resource usage + scenario. + + -n : limits the per-process connection limit to . This is + equivalent to the global section's keyword "maxconn". It has precedence + over this keyword. This may be used to quickly force lower limits to avoid + a service outage on systems where resource limits are too low. + + -p : write all processes' pids into during startup. This is + equivalent to the "global" section's keyword "pidfile". The file is opened + before entering the chroot jail, and after doing the chdir() implied by + "-C". Each pid appears on its own line. + + -q : set "quiet" mode. This disables some messages during the configuration + parsing and during startup. It can be used in combination with "-c" to + just check if a configuration file is valid or not. + + -sf * : send the "finish" signal (SIGUSR1) to older processes after boot + completion to ask them to finish what they are doing and to leave. + is a list of pids to signal (one per argument). The list ends on any + option starting with a "-". It is not a problem if the list of pids is + empty, so that it can be built on the fly based on the result of a command + like "pidof" or "pgrep". 
+ + -st * : send the "terminate" signal (SIGTERM) to older processes after + boot completion to terminate them immediately without finishing what they + were doing. is a list of pids to signal (one per argument). The list + ends on any option starting with a "-". It is not a problem if the list + of pids is empty, so that it can be built on the fly based on the result of + a command like "pidof" or "pgrep". + + -v : report the version and build date. + + -vv : display the version, build options, libraries versions and usable + pollers. This output is systematically requested when filing a bug report. + +A safe way to start HAProxy from an init file consists in forcing the daemon +mode, storing existing pids to a pid file and using this pid file to notify +older processes to finish before leaving : + + haproxy -f /etc/haproxy.cfg \ + -D -p /var/run/haproxy.pid -sf $(cat /var/run/haproxy.pid) + +When the configuration is split into a few specific files (eg: tcp vs http), +it is recommended to use the "-f" option : + + haproxy -f /etc/haproxy/global.cfg -f /etc/haproxy/stats.cfg \ + -f /etc/haproxy/default-tcp.cfg -f /etc/haproxy/tcp.cfg \ + -f /etc/haproxy/default-http.cfg -f /etc/haproxy/http.cfg \ + -D -p /var/run/haproxy.pid -sf $(cat /var/run/haproxy.pid) + +When an unknown number of files is expected, such as customer-specific files, +it is recommended to assign them a name starting with a fixed-size sequence +number and to use "--" to load them, possibly after loading some defaults : + + haproxy -f /etc/haproxy/global.cfg -f /etc/haproxy/stats.cfg \ + -f /etc/haproxy/default-tcp.cfg -f /etc/haproxy/tcp.cfg \ + -f /etc/haproxy/default-http.cfg -f /etc/haproxy/http.cfg \ + -D -p /var/run/haproxy.pid -sf $(cat /var/run/haproxy.pid) \ + -f /etc/haproxy/default-customers.cfg -- /etc/haproxy/customers/* + +Sometimes a failure to start may happen for whatever reason. 
Then it is +important to verify if the version of HAProxy you are invoking is the expected +version and if it supports the features you are expecting (eg: SSL, PCRE, +compression, Lua, etc). This can be verified using "haproxy -vv". Some +important information such as certain build options, the target system and +the versions of the libraries being used are reported there. It is also what +you will systematically be asked for when posting a bug report : + + $ haproxy -vv + HA-Proxy version 1.6-dev7-a088d3-4 2015/10/08 + Copyright 2000-2015 Willy Tarreau + + Build options : + TARGET = linux2628 + CPU = generic + CC = gcc + CFLAGS = -pg -O0 -g -fno-strict-aliasing -Wdeclaration-after-statement \ + -DBUFSIZE=8030 -DMAXREWRITE=1030 -DSO_MARK=36 -DTCP_REPAIR=19 + OPTIONS = USE_ZLIB=1 USE_DLMALLOC=1 USE_OPENSSL=1 USE_LUA=1 USE_PCRE=1 + + Default settings : + maxconn = 2000, bufsize = 8030, maxrewrite = 1030, maxpollevents = 200 + + Encrypted password support via crypt(3): yes + Built with zlib version : 1.2.6 + Compression algorithms supported : identity("identity"), deflate("deflate"), \ + raw-deflate("deflate"), gzip("gzip") + Built with OpenSSL version : OpenSSL 1.0.1o 12 Jun 2015 + Running on OpenSSL version : OpenSSL 1.0.1o 12 Jun 2015 + OpenSSL library supports TLS extensions : yes + OpenSSL library supports SNI : yes + OpenSSL library supports prefer-server-ciphers : yes + Built with PCRE version : 8.12 2011-01-15 + PCRE library supports JIT : no (USE_PCRE_JIT not set) + Built with Lua version : Lua 5.3.1 + Built with transparent proxy support using: IP_TRANSPARENT IP_FREEBIND + + Available polling systems : + epoll : pref=300, test result OK + poll : pref=200, test result OK + select : pref=150, test result OK + Total: 3 (3 usable), will use epoll. 
+ +The relevant information that many non-developer users can verify here is : + - the version : 1.6-dev7-a088d3-4 above means the code is currently at commit + ID "a088d3" which is the 4th one after official version "1.6-dev7". + Version 1.6-dev7 would show as "1.6-dev7-8c1ad7". What matters here is in + fact "1.6-dev7". This is the 7th development version of what will become + version 1.6 in the future. A development version is not suitable for use in + production (unless you know exactly what you are doing). A stable version + will show as a 3-numbers version, such as "1.5.14-16f863", indicating the + 14th level of fix on top of version 1.5. This is a production-ready version. + + - the release date : 2015/10/08. It is represented in the universal + year/month/day format. Here this means October 8th, 2015. Given that stable + releases are issued every few months (1-2 months at the beginning, sometimes + 6 months once the product becomes very stable), if you're seeing an old date + here, it means you're probably affected by a number of bugs or security + issues that have since been fixed and that it might be worth checking on the + official site. + + - build options : they are relevant to people who build their packages + themselves, they can explain why things are not behaving as expected. For + example the development version above was built for Linux 2.6.28 or later, + targeting a generic CPU (no CPU-specific optimizations), and lacks any + code optimization (-O0) so it will perform poorly. + + - libraries versions : zlib version is reported as found in the library + itself. In general zlib is considered a very stable product and upgrades + are almost never needed. OpenSSL reports two versions, the version used at + build time and the one being used, as found on the system. These ones may + differ by the last letter but never by the numbers. 
The build date is also + reported because most OpenSSL bugs are security issues and need to be taken + seriously, so this library absolutely needs to be kept up to date. Seeing a + 4-months old version here is highly suspicious and indeed an update was + missed. PCRE provides very fast regular expressions and is highly + recommended. Certain of its extensions such as JIT are not present in all + versions and still young so some people prefer not to build with them, + which is why the build status is reported as well. Regarding the Lua + scripting language, HAProxy expects version 5.3 which is very young since + it was released a little time before HAProxy 1.6. It is important to check + on the Lua web site if some fixes are proposed for this branch. + + - Available polling systems will affect the process's scalability when + dealing with more than about one thousand concurrent connections. These + ones are only available when the correct system was indicated in the TARGET + variable during the build. The "epoll" mechanism is highly recommended on + Linux, and the kqueue mechanism is highly recommended on BSD. Lacking them + will result in poll() or even select() being used, causing a high CPU usage + when dealing with a lot of connections. + + +4. Stopping and restarting HAProxy +---------------------------------- + +HAProxy supports a graceful and a hard stop. The hard stop is simple, when the +SIGTERM signal is sent to the haproxy process, it immediately quits and all +established connections are closed. The graceful stop is triggered when the +SIGUSR1 signal is sent to the haproxy process. It consists in only unbinding +from listening ports, while continuing to process existing connections until +they close. Once the last connection is closed, the process leaves. + +The hard stop method is used for the "stop" or "restart" actions of the service +management script. 
The graceful stop is used for the "reload" action which +tries to seamlessly reload a new configuration in a new process. + +Both of these signals may be sent by the new haproxy process itself during a +reload or restart, so that they are sent at the latest possible moment and only +if absolutely required. This is what is performed by the "-st" (hard) and "-sf" +(graceful) options respectively. + +To understand better how these signals are used, it is important to understand +the whole restart mechanism. + +First, an existing haproxy process is running. The administrator uses a system +specific command such as "/etc/init.d/haproxy reload" to indicate he wants to +take the new configuration file into effect. What happens then is the following. +First, the service script (/etc/init.d/haproxy or equivalent) will verify that +the configuration file parses correctly using "haproxy -c". After that it will +try to start haproxy with this configuration file, using "-st" or "-sf". + +Then HAProxy tries to bind to all listening ports. If some fatal errors happen +(eg: address not present on the system, permission denied), the process quits +with an error. If a socket binding fails because a port is already in use, then +the process will first send a SIGTTOU signal to all the pids specified in the +"-st" or "-sf" pid list. This is what is called the "pause" signal. It instructs +all existing haproxy processes to temporarily stop listening to their ports so +that the new process can try to bind again. During this time, the old process +continues to process existing connections. If the binding still fails (because +for example a port is shared with another daemon), then the new process sends a +SIGTTIN signal to the old processes to instruct them to resume operations just +as if nothing happened. The old processes will then restart listening to the +ports and continue to accept connections. 
Note that this mechanism is system +dependent and some operating systems may not support it in multi-process mode. + +If the new process manages to bind correctly to all ports, then it sends either +the SIGTERM (hard stop in case of "-st") or the SIGUSR1 (graceful stop in case +of "-sf") to all processes to notify them that it is now in charge of operations +and that the old processes will have to leave, either immediately or once they +have finished their job. + +It is important to note that during this timeframe, there are two small windows +of a few milliseconds each where it is possible that a few connection failures +will be noticed during high loads. Typically observed failure rates are around +1 failure during a reload operation every 10000 new connections per second, +which means that a heavily loaded site running at 30000 new connections per +second may see about 3 failed connections upon every reload. The two situations +where this happens are : + + - if the new process fails to bind due to the presence of the old process, + it will first have to go through the SIGTTOU+SIGTTIN sequence, which + typically lasts about one millisecond for a few tens of frontends, and + during which some ports will not be bound to the old process and not yet + bound to the new one. HAProxy works around this on systems that support the + SO_REUSEPORT socket option, as it allows the new process to bind without + first asking the old one to unbind. Most BSD systems have been supporting + this almost forever. Linux supported this in version 2.0 and + dropped it around 2.2, but some patches were floating around by then. It + was reintroduced in kernel 3.9, so if you are observing a connection + failure rate above the one mentioned above, please ensure that your kernel + is 3.9 or newer, or that relevant patches were backported to your kernel + (less likely). 
+ + - when the old processes close the listening ports, the kernel may not always + redistribute any pending connection that was remaining in the socket's + backlog. Under high loads, a SYN packet may happen just before the socket + is closed, and will lead to an RST packet being sent to the client. In some + critical environments where even one drop is not acceptable, these ones are + sometimes dealt with using firewall rules to block SYN packets during the + reload, forcing the client to retransmit. This is totally system-dependent, + as some systems might be able to visit other listening queues and avoid + this RST. A second case concerns the ACK from the client on a local socket + that was in SYN_RECV state just before the close. This ACK will lead to an + RST packet while the haproxy process is still not aware of it. This one is + harder to get rid of, though the firewall filtering rules mentioned above + will work well if applied one second or so before restarting the process. + +For the vast majority of users, such drops will never ever happen since they +don't have enough load to trigger the race conditions. And for most high traffic +users, the failure rate is still fairly within the noise margin provided that at +least SO_REUSEPORT is properly supported on their systems. + + +5. File-descriptor limitations +------------------------------ + +In order to ensure that all incoming connections will successfully be served, +HAProxy computes at load time the total number of file descriptors that will be +needed during the process's life. A regular Unix process is generally granted +1024 file descriptors by default, and a privileged process can raise this limit +itself. This is one reason for starting HAProxy as root and letting it adjust +the limit. The default limit of 1024 file descriptors roughly allows about 500 +concurrent connections to be processed. 
The computation is based on the global +maxconn parameter which limits the total number of connections per process, the +number of listeners, the number of servers which have a health check enabled, +the agent checks, the peers, the loggers and possibly a few other technical +requirements. A simple rough estimate of this number consists in simply +doubling the maxconn value and adding a few tens to get the approximate number +of file descriptors needed. + +Originally HAProxy did not know how to compute this value, and it was necessary +to pass the value using the "ulimit-n" setting in the global section. This +explains why even today a lot of configurations are seen with this setting +present. Unfortunately it was often miscalculated resulting in connection +failures when approaching maxconn instead of throttling incoming connections +while waiting for the needed resources. For this reason it is important to +remove any vestigial "ulimit-n" setting that can remain from very old versions. + +Raising the number of file descriptors to accept even moderate loads is +mandatory but comes with some OS-specific adjustments. First, the select() +polling system is limited to 1024 file descriptors. In fact on Linux it used +to be capable of handling more but since certain OS ship with excessively +restrictive SELinux policies forbidding the use of select() with more than +1024 file descriptors, HAProxy now refuses to start in this case in order to +avoid any issue at run time. On all supported operating systems, poll() is +available and will not suffer from this limitation. It is automatically picked +so there is nothing to do to get a working configuration. But poll() becomes +very slow when the number of file descriptors increases. 
While HAProxy does its +best to limit this performance impact (eg: via the use of the internal file +descriptor cache and batched processing), a good rule of thumb is that using +poll() with more than a thousand concurrent connections will use a lot of CPU. + +For Linux systems based on kernels 2.6 and above, the epoll() system call will +be used. It's a much more scalable mechanism relying on callbacks in the kernel +that guarantee a constant wake up time regardless of the number of registered +monitored file descriptors. It is automatically used where detected, provided +that HAProxy has been built for one of the Linux flavors. Its presence and +support can be verified using "haproxy -vv". + +For BSD systems which support it, kqueue() is available as an alternative. It +is much faster than poll() and even slightly faster than epoll() thanks to its +batched handling of changes. At least FreeBSD and OpenBSD support it. Just like +with Linux's epoll(), its support and availability are reported in the output +of "haproxy -vv". + +Having a good poller is one thing, but it is mandatory that the process can +reach the limits. When HAProxy starts, it immediately sets the new process's +file descriptor limits and verifies if it succeeds. In case of failure, it +reports it before forking so that the administrator can see the problem. As +long as the process is started as root, there should be no reason for this +setting to fail. However, it can fail if the process is started by an +unprivileged user. If there is a compelling reason for *not* starting haproxy +as root (eg: started by end users, or by a per-application account), then the +file descriptor limit can be raised by the system administrator for this +specific user. The effectiveness of the setting can be verified by issuing +"ulimit -n" from the user's command line. It should reflect the new limit. 
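The rough estimate given earlier in this section (twice the maxconn value plus
a few tens) can be compared with the output of "ulimit -n" before starting
haproxy as an unprivileged user. The following is only a minimal sketch: the
function names are hypothetical, and the margin of 50 is an assumption standing
in for "a few tens". HAProxy's own computation also accounts for listeners,
checks, peers and loggers, so the real requirement may be higher.

```shell
# Rough estimate from the text: twice maxconn plus a few tens.
# The default margin of 50 is an assumption, not a value used by HAProxy.
estimate_fds() {
    maxconn=$1
    margin=${2:-50}
    echo $(( 2 * maxconn + margin ))
}

# Compare the estimate with a limit. The second argument defaults to the
# current shell's soft limit as reported by "ulimit -n".
check_fd_limit() {
    needed=$(estimate_fds "$1")
    limit=${2:-$(ulimit -n)}
    if [ "$limit" = "unlimited" ] || [ "$limit" -ge "$needed" ]; then
        echo "ok: limit $limit covers about $needed descriptors"
    else
        echo "too low: limit $limit, about $needed descriptors needed"
        return 1
    fi
}
```

For example, "check_fd_limit 500 1024" reports that the limit is too low, which
is in line with the figure of about 500 concurrent connections for the default
limit of 1024 file descriptors.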
+ +Warning: when an unprivileged user's limits are changed in this user's account, +it is fairly common that these values are only considered when the user logs in +and not at all in some scripts run at system boot time nor in crontabs. This is +totally dependent on the operating system, keep in mind to check "ulimit -n" +before starting haproxy when running this way. The general advice is never to +start haproxy as an unprivileged user for production purposes. Another good +reason is that it prevents haproxy from enabling some security protections. + +Once it is certain that the system will allow the haproxy process to use the +requested number of file descriptors, two new system-specific limits may be +encountered. The first one is the system-wide file descriptor limit, which is +the total number of file descriptors opened on the system, covering all +processes. When this limit is reached, accept() or socket() will typically +return ENFILE. The second one is the per-process hard limit on the number of +file descriptors, it prevents setrlimit() from being set higher. Both are very +dependent on the operating system. On Linux, the system limit is set at boot +based on the amount of memory. It can be changed with the "fs.file-max" sysctl. +And the per-process hard limit is set to 1048576 by default, but it can be +changed using the "fs.nr_open" sysctl. + +File descriptor limitations may be observed on a running process when they are +set too low. The strace utility will report that accept() and socket() return +"-1 EMFILE" when the process's limits have been reached. In this case, simply +raising the "ulimit-n" value (or removing it) will solve the problem. If these +system calls return "-1 ENFILE" then it means that the kernel's limits have +been reached and that something must be done on a system-wide parameter. 
These +troubles must absolutely be addressed, as they result in high CPU usage (when +accept() fails) and failed connections that are generally visible to the user. +One solution also consists in lowering the global maxconn value to enforce +serialization, and possibly in disabling HTTP keep-alive to force connections +to be released and reused faster. + + +6. Memory management +-------------------- + +HAProxy uses a simple and fast pool-based memory management. Since it relies on +a small number of different object types, it's much more efficient to pick new +objects from a pool which already contains objects of the appropriate size than +to call malloc() for each different size. The pools are organized as a stack or +LIFO, so that newly allocated objects are taken from recently released objects +still hot in the CPU caches. Pools of similar sizes are merged together, in +order to limit memory fragmentation. + +By default, since the focus is set on performance, each released object is put +back into the pool it came from, and allocated objects are never freed since +they are expected to be reused very soon. + +On the CLI, it is possible to check how memory is being used in pools thanks to +the "show pools" command : + + > show pools + Dumping pools usage. Use SIGQUIT to flush them.
+ - Pool pipe (32 bytes) : 5 allocated (160 bytes), 5 used, 3 users [SHARED] + - Pool hlua_com (48 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED] + - Pool vars (64 bytes) : 0 allocated (0 bytes), 0 used, 2 users [SHARED] + - Pool task (112 bytes) : 5 allocated (560 bytes), 5 used, 1 users [SHARED] + - Pool session (128 bytes) : 1 allocated (128 bytes), 1 used, 2 users [SHARED] + - Pool http_txn (272 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED] + - Pool connection (352 bytes) : 2 allocated (704 bytes), 2 used, 1 users [SHARED] + - Pool hdr_idx (416 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED] + - Pool stream (864 bytes) : 1 allocated (864 bytes), 1 used, 1 users [SHARED] + - Pool requri (1024 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED] + - Pool buffer (8064 bytes) : 3 allocated (24192 bytes), 2 used, 1 users [SHARED] + Total: 11 pools, 26608 bytes allocated, 18544 used. + +The pool name is only indicative, it's the name of the first object type using +this pool. The size in parenthesis is the object size for objects in this pool. +Object sizes are always rounded up to the closest multiple of 16 bytes. The +number of objects currently allocated and the equivalent number of bytes is +reported so that it is easy to know which pool is responsible for the highest +memory usage. The number of objects currently in use is reported as well in the +"used" field. The difference between "allocated" and "used" corresponds to the +objects that have been freed and are available for immediate use. + +It is possible to limit the amount of memory allocated per process using the +"-m" command line option, followed by a number of megabytes. It covers all of +the process's addressable space, so that includes memory used by some libraries +as well as the stack, but it is a reliable limit when building a resource +constrained system. It works the same way as "ulimit -v" on systems which have +it, or "ulimit -d" for the other ones. 
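
The fields of the "show pools" output described above are easy to post-process.
The following sketch is an assumption of how this could be done in a monitoring
script : the function name and the regular expression are hypothetical and only
match the line format shown in the example above. It computes, for each pool,
the number of objects that have been freed and are kept for immediate reuse.

```python
import re

# Matches one pool line of the "show pools" output shown in the example.
POOL_RE = re.compile(
    r"Pool\s+(\S+)\s+\((\d+) bytes\)\s*:\s*(\d+) allocated "
    r"\((\d+) bytes\),\s*(\d+) used"
)


def parse_pools(text):
    """Return a list of dicts, one per pool line found in 'text'."""
    pools = []
    for match in POOL_RE.finditer(text):
        name = match.group(1)
        size = int(match.group(2))
        allocated = int(match.group(3))
        used = int(match.group(5))
        pools.append({
            "name": name,
            "size": size,
            "allocated": allocated,
            "used": used,
            # objects released to the pool and available for immediate use
            "free": allocated - used,
            "free_bytes": (allocated - used) * size,
        })
    return pools


SAMPLE = """\
  - Pool task (112 bytes) : 5 allocated (560 bytes), 5 used, 1 users [SHARED]
  - Pool buffer (8064 bytes) : 3 allocated (24192 bytes), 2 used, 1 users [SHARED]
"""

if __name__ == "__main__":
    for pool in parse_pools(SAMPLE):
        print(pool["name"], pool["free"], pool["free_bytes"])
```

Sorting the result by "free_bytes" shows which pools would release the most
memory if they were flushed.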
+ +If a memory allocation fails due to the memory limit being reached or because +the system doesn't have enough memory, then haproxy will first start to +free all available objects from all pools before attempting to allocate memory +again. This mechanism of releasing unused memory can be triggered by sending +the signal SIGQUIT to the haproxy process. When doing so, the pools state prior +to the flush will also be reported to stderr when the process runs in +foreground. + +During a reload operation, the process switched to the graceful stop state also +automatically performs some flushes after releasing any connection so that all +possible memory is released to save it for the new process. + + +7. CPU usage +------------ + +HAProxy normally spends most of its time in the system and a smaller part in +userland. A finely tuned 3.5 GHz CPU can sustain a rate of about 80000 end-to-end +connection setups and closes per second at 100% CPU on a single core. When one +core is saturated, typical figures are : + - 95% system, 5% user for long TCP connections or large HTTP objects + - 85% system and 15% user for short TCP connections or small HTTP objects in + close mode + - 70% system and 30% user for small HTTP objects in keep-alive mode + +The amount of rules processing and regular expressions will increase the user +land part. The presence of firewall rules, connection tracking, complex routing +tables in the system will instead increase the system part. + +On most systems, the CPU time observed during network transfers can be cut in 4 +parts : + - the interrupt part, which concerns all the processing performed upon I/O + receipt, before the target process is even known. Typically Rx packets are + accounted for in interrupt. On some systems such as Linux where interrupt + processing may be deferred to a dedicated thread, it can appear as softirq, + and the thread is called ksoftirqd/0 (for CPU 0).
The CPU taking care of + this load is generally defined by the hardware settings, though in the case + of softirq it is often possible to remap the processing to another CPU. + This interrupt part will often be perceived as parasitic since it's not + associated with any process, but it actually is some processing being done + to prepare the work for the process. + + - the system part, which concerns all the processing done using kernel code + called from userland. System calls are accounted as system for example. All + synchronously delivered Tx packets will be accounted for as system time. If + some packets have to be deferred due to queues filling up, they may then be + processed in interrupt context later (eg: upon receipt of an ACK opening a + TCP window). + + - the user part, which exclusively runs application code in userland. HAProxy + runs exclusively in this part, though it makes heavy use of system calls. + Rules processing, regular expressions, compression, encryption all add to + the user portion of CPU consumption. + + - the idle part, which is what the CPU does when there is nothing to do. For + example HAProxy waits for an incoming connection, or waits for some data to + leave, meaning the system is waiting for an ACK from the client to push + these data. + +In practice regarding HAProxy's activity, it is in general reasonably accurate +(but totally inexact) to consider that interrupt/softirq are caused by Rx +processing in kernel drivers, that user-land is caused by layer 7 processing +in HAProxy, and that system time is caused by network processing on the Tx +path. + +Since HAProxy runs around an event loop, it waits for new events using poll() +(or any alternative) and processes all these events as fast as possible before +going back to poll() waiting for new events. It measures the time spent waiting +in poll() compared to the time spent processing events.
The ratio of +polling time vs total time is called the "idle" time, it's the amount of time +spent waiting for something to happen. This ratio is reported in the stats page +on the "idle" line, or "Idle_pct" on the CLI. When it's close to 100%, it means +the load is extremely low. When it's close to 0%, it means that there is +constantly some activity. While it cannot be very accurate on an overloaded +system due to other processes possibly preempting the CPU from the haproxy +process, it still provides a good estimate about how HAProxy considers it is +working : if the load is low and the idle ratio is low as well, it may indicate +that HAProxy has a lot of work to do, possibly due to very expensive rules that +have to be processed. Conversely, if HAProxy indicates the idle is close to +100% while things are slow, it means that it cannot do anything to speed things +up because it is already waiting for incoming data to process. In the example +below, haproxy is completely idle : + + $ echo "show info" | socat - /var/run/haproxy.sock | grep ^Idle + Idle_pct: 100 + +When the idle ratio starts to become very low, it is important to tune the +system and place processes and interrupts correctly to save the most possible +CPU resources for all tasks. If a firewall is present, it may be worth trying +to disable it or to tune it to ensure it is not responsible for a large part +of the performance limitation. It's worth noting that unloading a stateful +firewall generally reduces both the amount of interrupt/softirq and of system +usage since such firewalls act both on the Rx and the Tx paths. On Linux, +unloading the nf_conntrack and ip_conntrack modules will show whether there is +anything to gain. If so, then the module runs with default settings and you'll +have to figure how to tune it for better performance. In general this consists +in considerably increasing the hash table size. 
On FreeBSD, "pfctl -d" will +disable the "pf" firewall and its stateful engine at the same time. + +If it is observed that a lot of time is spent in interrupt/softirq, it is +important to ensure that they don't run on the same CPU. Most systems tend to +pin the tasks on the CPU where they receive the network traffic because for +certain workloads it improves things. But with heavily network-bound workloads +it is the opposite as the haproxy process will have to fight against its kernel +counterpart. Pinning haproxy to one CPU core and the interrupts to another one, +all sharing the same L3 cache tends to sensibly increase network performance +because in practice the amount of work for haproxy and the network stack are +quite close, so they can almost fill an entire CPU each. On Linux this is done +using taskset (for haproxy) or using cpu-map (from the haproxy config), and the +interrupts are assigned under /proc/irq. Many network interfaces support +multiple queues and multiple interrupts. In general it helps to spread them +across a small number of CPU cores provided they all share the same L3 cache. +Please always stop irq_balance which always does the worst possible thing on +such workloads. + +For CPU-bound workloads consisting in a lot of SSL traffic or a lot of +compression, it may be worth using multiple processes dedicated to certain +tasks, though there is no universal rule here and experimentation will have to +be performed. + +In order to increase the CPU capacity, it is possible to make HAProxy run as +several processes, using the "nbproc" directive in the global section. 
There +are some limitations though : + - health checks are run per process, so the target servers will get as many + checks as there are running processes ; + - maxconn values and queues are per-process so the correct value must be set + to avoid overloading the servers ; + - outgoing connections should avoid using port ranges to avoid conflicts ; + - stick-tables are per process and are not shared between processes ; + - each peers section may only run on a single process at a time ; + - the CLI operations will only act on a single process at a time. + +With this in mind, it appears that the easiest setup often consists in having +one first layer running on multiple processes and in charge of the heavy +processing, passing the traffic to a second layer running in a single process. +This mechanism is suited to SSL and compression which are the two CPU-heavy +features. Instances can easily be chained over UNIX sockets (which are cheaper +than TCP sockets and which do not waste ports), and the proxy protocol which is +useful to pass client information to the next stage. When doing so, it is +generally a good idea to bind all the single-process tasks to process number 1 +and extra tasks to next processes, as this will make it easier to generate +similar configurations for different machines. + +On Linux versions 3.9 and above, running HAProxy in multi-process mode is much +more efficient when each process uses a distinct listening socket on the same +IP:port ; this will make the kernel evenly distribute the load across all +processes instead of waking them all up. Please check the "process" option of +the "bind" keyword lines in the configuration manual for more information. + + +8. Logging +---------- + +For logging, HAProxy always relies on a syslog server since it does not perform +any file-system access. The standard way of using it is to send logs over UDP +to the log server (by default on port 514).
Very commonly this is configured to +127.0.0.1 where the local syslog daemon is running, but it's also used over the +network to log to a central server. The central server provides additional +benefits especially in active-active scenarios where it is desirable to keep +the logs merged in arrival order. HAProxy may also make use of a UNIX socket to +send its logs to the local syslog daemon, but it is not recommended at all, +because if the syslog server is restarted while haproxy runs, the socket will +be replaced and new logs will be lost. Since HAProxy will be isolated inside a +chroot jail, it will not have the ability to reconnect to the new socket. It +has also been observed in the field that the log buffers in use on UNIX sockets +are very small and lead to lost messages even at very light loads. This can be +fine for testing, however. + +It is recommended to add the following directive to the "global" section to +make HAProxy log to the local daemon using facility "local0" : + + log 127.0.0.1:514 local0 + +and then to add the following one to each "defaults" section or to each frontend +and backend section : + + log global + +This way, all logs will be centralized through the global definition of where +the log server is.
+ +Some syslog daemons do not listen to UDP traffic by default, so depending on +the daemon being used, the syntax to enable this will vary : + + - on sysklogd, you need to pass argument "-r" on the daemon's command line + so that it listens to a UDP socket for "remote" logs ; note that there is + no way to limit it to address 127.0.0.1 so it will also receive logs from + remote systems ; + + - on rsyslogd, the following lines must be added to the configuration file : + + $ModLoad imudp + $UDPServerAddress * + $UDPServerRun 514 + + - on syslog-ng, a new source can be created the following way, it then needs + to be added as a valid source in one of the "log" directives : + + source s_udp { + udp(ip(127.0.0.1) port(514)); + }; + +Please consult your syslog daemon's manual for more information. If no logs are +seen in the system's log files, please consider the following tests : + + - restart haproxy. Each frontend and backend logs one line indicating it's + starting. If these logs are received, it means logs are working. + + - run "strace -tt -s100 -etrace=sendmsg -p <pid>" and perform some + activity that you expect to be logged. You should see the log messages + being sent using sendmsg() there. If they don't appear, restart using + strace on top of haproxy. If you still see no logs, it definitely means + that something is wrong in your configuration. + + - run tcpdump to watch for port 514, for example on the loopback interface if + the traffic is being sent locally : "tcpdump -As0 -ni lo port 514". If the + packets are seen there, it proves that they are sent, and it is then the + syslog daemon which needs to be troubleshot. + +While traffic logs are sent from the frontends (where the incoming connections +are accepted), backends also need to be able to send logs in order to report a +server state change consecutive to a health check. Please consult HAProxy's +configuration manual for more information regarding all possible log settings.
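
When it is unclear whether the syslog daemon itself receives UDP traffic, it can
be tested independently of HAProxy by sending it a hand-made message. The sketch
below is only an illustration under stated assumptions : the function names and
the "test" tag are hypothetical, the daemon is assumed to listen on
127.0.0.1:514, and the message uses the traditional BSD syslog framing where the
priority is the facility code multiplied by 8 plus the severity code.

```python
import socket

LOCAL0 = 16  # syslog facility code for "local0"
INFO = 6     # syslog severity code for "informational"


def build_syslog_message(text, facility=LOCAL0, severity=INFO, tag="test"):
    """Build a minimal BSD-style syslog datagram: <PRI>tag: text."""
    priority = facility * 8 + severity
    message = "<%d>%s: %s" % (priority, tag, text)
    return message.encode("ascii", "replace")


def send_test_log(text, host="127.0.0.1", port=514):
    """Send one test datagram to the syslog daemon over UDP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(build_syslog_message(text), (host, port))
    finally:
        sock.close()


if __name__ == "__main__":
    send_test_log("hello from the logging test")
```

If this message shows up in the log files while HAProxy's logs do not, the
daemon is fine and the problem is more likely on the HAProxy side.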
+ +It is convenient to choose a facility that is not used by other daemons. HAProxy +examples often suggest "local0" for traffic logs and "local1" for admin logs +because they're never seen in the field. A single facility would be enough as +well. Having separate logs is convenient for log analysis, but it's also +important to remember that logs may sometimes convey confidential information, +and as such they must not be mixed with other logs that may accidentally be +handed out to unauthorized people. + +For in-field troubleshooting without impacting the server's capacity too much, +it is recommended to make use of the "halog" utility provided with HAProxy. +This is sort of a grep-like utility designed to process HAProxy log files at +a very fast data rate. Typical figures range between 1 and 2 GB of logs per +second. It is capable of extracting only certain logs (eg: search for some +classes of HTTP status codes, connection termination status, search by response +time ranges, look for errors only), count lines, limit the output to a number +of lines, and perform some more advanced statistics such as sorting servers +by response time or error counts, sorting URLs by time or count, sorting client +addresses by access count, and so on. It is pretty convenient to quickly spot +anomalies such as a bot looping on the site, and block them. + + +9. Statistics and monitoring +---------------------------- + + +10. Tricks for easier configuration management +---------------------------------------------- + +It is very common that two HAProxy nodes constituting a cluster share exactly +the same configuration modulo a few addresses. Instead of having to maintain a +duplicate configuration for each node, which will inevitably diverge, it is +possible to include environment variables in the configuration. Thus multiple +configurations may share the exact same file with only a few different system +wide environment variables.
This started in version 1.5 where only addresses +were allowed to include environment variables, and 1.6 goes further by +supporting environment variables everywhere. The syntax is the same as in the +UNIX shell, a variable starts with a dollar sign ('$'), followed by an opening +curly brace ('{'), then the variable name followed by the closing brace ('}'). +Except for addresses, environment variables are only interpreted in arguments +surrounded with double quotes (this was necessary not to break existing setups +using regular expressions involving the dollar symbol). + +Environment variables also make it convenient to write configurations which are +expected to work on various sites where only the address changes. They also +make it possible to remove passwords from some configs. Example below where the +"site1.env" file is sourced by the init script upon startup : + + $ cat site1.env + LISTEN=192.168.1.1 + CACHE_PFX=192.168.11 + SERVER_PFX=192.168.22 + LOGGER=192.168.33.1 + STATSLP=admin:pa$$w0rd + ABUSERS=/etc/haproxy/abuse.lst + TIMEOUT=10s + + $ cat haproxy.cfg + global + log "${LOGGER}:514" local0 + + defaults + mode http + timeout client "${TIMEOUT}" + timeout server "${TIMEOUT}" + timeout connect 5s + + frontend public + bind "${LISTEN}:80" + http-request reject if { src -f "${ABUSERS}" } + stats uri /stats + stats auth "${STATSLP}" + use_backend cache if { path_end .jpg .css .ico } + default_backend server + + backend cache + server cache1 "${CACHE_PFX}.1:18080" check + server cache2 "${CACHE_PFX}.2:18080" check + + backend server + server cache1 "${SERVER_PFX}.1:8080" check + server cache2 "${SERVER_PFX}.2:8080" check + + +11. Well-known traps to avoid +----------------------------- + +Once in a while, someone reports that after a system reboot, the haproxy +service wasn't started, and that once they start it by hand it works.
Most +often, these people are running a clustered IP address mechanism such as +keepalived, to assign the service IP address to the master node only, and while +it used to work when they used to bind haproxy to address 0.0.0.0, it stopped +working after they bound it to the virtual IP address. What happens here is +that when the service starts, the virtual IP address is not yet owned by the +local node, so when HAProxy wants to bind to it, the system rejects this +because it is not a local IP address. The fix doesn't consist in delaying the +haproxy service startup (since it wouldn't stand a restart), but instead in +properly configuring the system to allow binding to non-local addresses. This is +easily done on Linux by setting the net.ipv4.ip_nonlocal_bind sysctl to 1. This +is also needed in order to transparently intercept the IP traffic that passes +through HAProxy for a specific target address. + +Multi-process configurations involving source port ranges may apparently seem +to work but they will cause some random failures under high loads because more +than one process may try to use the same source port to connect to the same +server, which is not possible. The system will report an error and a retry will +happen, picking another port. A high value in the "retries" parameter may hide +the effect to a certain extent but this also comes with increased CPU usage and +processing time. Logs will also report a certain number of retries. For this +reason, port ranges should be avoided in multi-process configurations. + +Since HAProxy uses SO_REUSEPORT and supports having multiple independent +processes bound to the same IP:port, during troubleshooting it can happen that +an old process was not stopped before a new one was started. This provides +absurd test results which tend to indicate that any change to the configuration +is ignored.
The reason is that in fact even if the new process is restarted with a +new configuration, the old one also gets some incoming connections and +processes them, returning unexpected results. When in doubt, just stop the new +process and try again. If it still works, it very likely means that an old +process remains alive and has to be stopped. Linux's "netstat -lntp" is of good +help here. + +When adding entries to an ACL from the command line (eg: when blacklisting a +source address), it is important to keep in mind that these entries are not +synchronized to the file and that if someone reloads the configuration, these +updates will be lost. While this is often the desired effect (for blacklisting) +it may not necessarily match expectations when the change was made as a fix for +a problem. See the "add acl" action of the CLI. + + +12. Debugging and performance issues +------------------------------------ + +When HAProxy is started with the "-d" option, it will stay in the foreground +and will print one line per event, such as an incoming connection, the end of a +connection, and for each request or response header line seen. This debug +output is emitted before the contents are processed, so it does not reflect the +local modifications. The main use is to show the request and response without +having to run a network sniffer. The output is less readable when multiple +connections are handled in parallel, though the "debug2ansi" and "debug2html" +scripts found in the examples/ directory definitely help here by coloring the +output. + +If a request or response is rejected because HAProxy finds it is malformed, the +best thing to do is to connect to the CLI and issue "show errors", which will +report the last captured faulty request and response for each frontend and +backend, with all the necessary information to indicate precisely the first +character of the input stream that was rejected.
This is sometimes needed to +prove to customers or to developers that a bug is present in their code. In +this case it is often possible to relax the checks (but still keep the +captures) using "option accept-invalid-http-request" or its equivalent for +responses coming from the server "option accept-invalid-http-response". Please +see the configuration manual for more details. + +Example : + + > show errors + Total events captured on [13/Oct/2015:13:43:47.169] : 1 + + [13/Oct/2015:13:43:40.918] frontend HAProxyLocalStats (#2): invalid request + backend (#-1), server (#-1), event #0 + src 127.0.0.1:51981, session #0, session flags 0x00000080 + HTTP msg state 26, msg flags 0x00000000, tx flags 0x00000000 + HTTP chunk len 0 bytes, HTTP body len 0 bytes + buffer flags 0x00808002, out 0 bytes, total 31 bytes + pending 31 bytes, wrapping at 8040, error at position 13: + + 00000 GET /invalid request HTTP/1.1\r\n + + +The output of "show info" on the CLI provides a number of useful pieces of +information regarding the maximum connection rate ever reached, maximum SSL key +rate ever reached, and in general all information which can help to explain +temporary issues regarding CPU or memory usage.
Example : + + > show info + Name: HAProxy + Version: 1.6-dev7-e32d18-17 + Release_date: 2015/10/12 + Nbproc: 1 + Process_num: 1 + Pid: 7949 + Uptime: 0d 0h02m39s + Uptime_sec: 159 + Memmax_MB: 0 + Ulimit-n: 120032 + Maxsock: 120032 + Maxconn: 60000 + Hard_maxconn: 60000 + CurrConns: 0 + CumConns: 3 + CumReq: 3 + MaxSslConns: 0 + CurrSslConns: 0 + CumSslConns: 0 + Maxpipes: 0 + PipesUsed: 0 + PipesFree: 0 + ConnRate: 0 + ConnRateLimit: 0 + MaxConnRate: 1 + SessRate: 0 + SessRateLimit: 0 + MaxSessRate: 1 + SslRate: 0 + SslRateLimit: 0 + MaxSslRate: 0 + SslFrontendKeyRate: 0 + SslFrontendMaxKeyRate: 0 + SslFrontendSessionReuse_pct: 0 + SslBackendKeyRate: 0 + SslBackendMaxKeyRate: 0 + SslCacheLookups: 0 + SslCacheMisses: 0 + CompressBpsIn: 0 + CompressBpsOut: 0 + CompressBpsRateLim: 0 + ZlibMemUsage: 0 + MaxZlibMemUsage: 0 + Tasks: 5 + Run_queue: 1 + Idle_pct: 100 + node: wtap + description: + +When an issue seems to randomly appear on a new version of HAProxy (eg: every +second request is aborted, occasional crash, etc), it is worth trying to enable +memory poisoning so that each call to malloc() is immediately followed by the +filling of the memory area with a configurable byte. By default this byte is +0x50 (ASCII for 'P'), but any other byte can be used, including zero (which +will have the same effect as a calloc() and which may make issues disappear). +Memory poisoning is enabled on the command line using the "-dM" option. It +slightly hurts performance and is not recommended for use in production. If +an issue happens all the time with it or never happens when poisoning uses +byte zero, it clearly means you've found a bug and you definitely need to +report it. Otherwise if there's no clear change, the problem is not related. + +When debugging some latency issues, it is important to use both strace and +tcpdump on the local machine, and another tcpdump on the remote system.
The +reason for this is that there are delays everywhere in the processing chain and +it is important to know which one is causing latency to know where to act. In +practice, the local tcpdump will indicate when the input data come in. Strace +will indicate when haproxy receives these data (using recv/recvfrom). Warning, +openssl uses read()/write() syscalls instead of recv()/send(). Strace will also +show when haproxy sends the data, and tcpdump will show when the system sends +these data to the interface. Then the external tcpdump will show when the data +sent are really received (since the local one only shows when the packets are +queued). The benefit of sniffing on the local system is that strace and tcpdump +will use the same reference clock. Strace should be used with "-tts200" to get +complete timestamps and report large enough chunks of data to read them. +Tcpdump should be used with "-nvvttSs0" to report full packets, real sequence +numbers and complete timestamps. + +In practice, received data are almost always immediately received by haproxy +(unless the machine has a saturated CPU or these data are invalid and not +delivered). If these data are received but not sent, it generally is because +the output buffer is saturated (ie: recipient doesn't consume the data fast +enough). This can be confirmed by seeing that the polling doesn't notify of +the ability to write on the output file descriptor for some time (it's often +easier to spot in the strace output when the data finally leave and then roll +back to see when the write event was notified). It generally matches an ACK +received from the recipient, and detected by tcpdump. Once the data are sent, +they may spend some time in the system doing nothing. Here again, the TCP +congestion window may be limited and not allow these data to leave, waiting for +an ACK to open the window. 
If the traffic is idle and the data take 40 ms or +200 ms to leave, it's a different issue (which is not an issue), it's the fact +that the Nagle algorithm prevents empty packets from leaving immediately, in +hope that they will be merged with subsequent data. HAProxy automatically +disables Nagle in pure TCP mode and in tunnels. However it definitely remains +enabled when forwarding an HTTP body (and this contributes to the performance +improvement there by reducing the number of packets). Some HTTP non-compliant +applications may be sensitive to the latency when delivering incomplete HTTP +response messages. In this case you will have to enable "option http-no-delay" +to disable Nagle in order to work around their design, keeping in mind that any +other proxy in the chain may similarly be impacted. If tcpdump reports that data +leave immediately but the other end doesn't see them quickly, it can mean there +is a congested WAN link, a congested LAN with flow control enabled and +preventing the data from leaving, or more commonly that HAProxy is in fact +running in a virtual machine and that for whatever reason the hypervisor has +decided that the data didn't need to be sent immediately. In virtualized +environments, latency issues are almost always caused by the virtualization +layer, so in order to save time, it's worth first comparing tcpdump in the VM +and on the external components. Any difference has to be credited to the +hypervisor and its accompanying drivers. + +When some TCP SACK segments are seen in tcpdump traces (using -vv), it always +means that the side sending them has got the proof of a lost packet. While not +seeing them doesn't mean there are no losses, seeing them definitely means the +network is lossy. Losses are normal on a network, but at a rate where SACKs are +not noticeable to the naked eye. If they appear a lot in the traces, it is +worth investigating exactly what happens and where the packets are lost.
HTTP +doesn't cope well with TCP losses, which introduce huge latencies. + +The "netstat -i" command will report statistics per interface. An interface +where the Rx-Ovr counter grows indicates that the system doesn't have enough +resources to receive all incoming packets and that they're lost before being +processed by the network driver. Rx-Drp indicates that some received packets +were lost in the network stack because the application doesn't process them +fast enough. This can happen during some attacks as well. Tx-Drp means that +the output queues were full and packets had to be dropped. When using TCP it +should be very rare, but will possibly indicate a saturated outgoing link. + + +13. Security considerations +--------------------------- + +HAProxy is designed to run with very limited privileges. The standard way to +use it is to isolate it into a chroot jail and to drop its privileges to a +non-root user without any permissions inside this jail so that if any future +vulnerability were to be discovered, its compromise would not affect the rest +of the system. + +In order to perform a chroot, it first needs to be started as a root user. It is +pointless to build hand-made chroots to start the process there, these are +painful to build, are never properly maintained and always contain way more +bugs than the main file-system. And in case of compromise, the intruder can use +the purposely built file-system. Unfortunately many administrators confuse +"start as root" and "run as root", resulting in the uid change being done prior +to starting haproxy, and reducing the effective security restrictions.
+ +HAProxy will need to be started as root in order to : + - adjust the file descriptor limits + - bind to privileged port numbers + - bind to a specific network interface + - transparently listen to a foreign address + - isolate itself inside the chroot jail + - drop to another non-privileged UID + +HAProxy may need to be run as root in order to : + - bind to an interface for outgoing connections + - bind to privileged source ports for outgoing connections + - transparently bind to a foreign address for outgoing connections + +Most users will never need the "run as root" case. But the "start as root" +covers most usages. + +A safe configuration will have : + + - a chroot statement pointing to an empty location without any access + permissions. This can be prepared this way on the UNIX command line : + + # mkdir /var/empty && chmod 0 /var/empty || echo "Failed" + + and referenced like this in the HAProxy configuration's global section : + + chroot /var/empty + + - both a uid/user and gid/group statements in the global section : + + user haproxy + group haproxy + + - a stats socket whose mode, uid and gid are set to match the user and/or + group allowed to access the CLI so that nobody else may access it : + + stats socket /var/run/haproxy.stat uid hatop gid hatop mode 600 +
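
The first two points of the checklist above can be verified automatically before
deploying a configuration. The sketch below is a deliberately naive illustration
and not a real configuration parser : the function name is hypothetical, the
list of section keywords is an assumption limited to the most common ones, and
quoting and environment variables are ignored.

```python
# Keywords assumed to start a new section (not an exhaustive list).
SECTION_KEYWORDS = ("global", "defaults", "frontend", "backend", "listen",
                    "peers", "userlist")


def missing_global_settings(config_text):
    """Return the recommended settings absent from the global section."""
    found = set()
    in_global = False
    for raw_line in config_text.splitlines():
        # Naive comment stripping: everything after '#' is dropped.
        words = raw_line.split("#", 1)[0].split()
        if not words:
            continue
        if words[0] in SECTION_KEYWORDS:
            in_global = (words[0] == "global")
            continue
        if in_global:
            found.add(words[0])
    missing = []
    if "chroot" not in found:
        missing.append("chroot")
    if not found & {"user", "uid"}:
        missing.append("user/uid")
    if not found & {"group", "gid"}:
        missing.append("group/gid")
    return missing


SAFE = """
global
    chroot /var/empty
    user haproxy
    group haproxy
defaults
    mode http
"""

if __name__ == "__main__":
    print(missing_global_settings(SAFE) or "all recommended settings found")
```

Such a check only tells that the statements are present. It does not replace
verifying that the chroot directory is really empty and unreadable.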