We now insert tasks in a certain sequence in the run queue.
The sorting key currently is the arrival order. It will now
be possible to apply a "nice" value to any task so that it
goes forwards or backwards in the run queue.
The calls to wake_expired_tasks() and maintain_proxies()
have been moved to the main run_poll_loop(), because they
had nothing to do in process_runnable_tasks().
The task_wakeup() function is not inlined anymore, as it was
only used at one place.
The qlist member of the task structure has been removed now.
The run_queue list has been replaced for an integer indicating
the number of tasks in the run queue.
The following config makes haproxy segfault on exit :
defaults
mode http
balance roundrobin
listen no-stats
bind :8001
listen stats
bind :8002
stats uri /stats
The simple fix is to ensure that p->uri_auth is not NULL
before dereferencing it.
The ultree code has been removed in favor of a simpler and
cleaner ebtree implementation. The eternity queue does not
need to exist anymore, and the pool_tree64 has been removed.
The ebtree node is stored in the task itself. The qlist list
header is still used by the run-queue, but will be able to
disappear once the run-queue uses ebtree too.
The first implementation of the monotonic clock did not verify
forward jumps. The consequence is that a fast changing time may
expire a lot of tasks. While it does seem minor, in fact it is
problematic because most machines which boot with a wrong date
are in the past and suddenly see their time jump by several
years in the future.
The solution is to check if we spent more apparent time in
a poller than allowed (with a margin applied). The margin
is currently set to 1000 ms. It should be large enough for
any poll() to complete.
Tests with randomly jumping clock show that the result is quite
accurate (error less than 1 second at every change of more than
one second).
If the system date is set backwards while haproxy is running,
some scheduled events are delayed by the amount of time the
clock went backwards. This is particularly problematic on
systems where the date is set at boot, because it seldom
happens that health-checks do not get sent for a few hours.
Before switching to use clock_gettime() on systems which
provide it, we can at least ensure that the clock is not
going backwards and maintain two clocks : the "date" which
represents what the user wants to see (mostly for logs),
and an internal date stored in "now", used for scheduled
events.
The dequeuing logic was completely wrong. First, a task was assigned
to all servers to process the queue, but this task was never scheduled
and was only woken up on session free. Second, there was no reservation
of server entries when a task was assigned a server. This means that
as long as the task was not connected to the server, its presence was
not accounted for. This was causing trouble when detecting whether or
not a server had reached maxconn. Third, during a redispatch, a session
could lose its place at the server's and get blocked because another
session at the same moment would have stolen the entry. Fourth, the
redispatch option did not work when maxqueue was reached for a server,
and it was not possible to do so without indefinitely hanging a session.
The root cause of all those problems was the lack of pre-reservation of
connections at the server's, and the lack of tracking of servers during
a redispatch. Everything relied on combinations of flags which could
appear similarly in quite distinct situations.
This patch is a major rework but there was no other solution, as the
internal logic was deeply flawed. The resulting code is cleaner, more
understandable, uses less magics and is overall more robust.
As an added bonus, "option redispatch" now works when maxqueue has
been reached on a server.
A new "redirect" keyword adds the ability to send an HTTP 301/302/303
redirection to either an absolute location or to a prefix followed by
the original URI. The redirection is conditionned by ACL rules, so it
becomes very easy to move parts of a site to another site using this.
This work was almost entirely done at Exceliance by Emeric Brun.
A test-case has been added in the tests/ directory.
- free oldpids
- call free(exp->preg), not only regfree(exp->preg): req_exp, rsp_exp
- build a list of unique uri_auths and eventually free it
- prune_acl_cond/free for switching_rules
- add a callback pointer to free ptr from acl_pattern (used for regexs) and execute it
==1180== malloc/free: in use at exit: 0 bytes in 0 blocks.
==1180== malloc/free: 5,599 allocs, 5,599 frees, 4,220,556 bytes allocated.
==1180== All heap blocks were freed -- no leaks are possible.
New functions implemented:
- deinit_pollers: called at the end of deinit())
- prune_acl: called via list_for_each_entry_safe
Add missing pool_destroy2 calls:
- p->hdr_idx_pool
- pool2_tree64
Implement all task stopping:
- health-check: needs new "struct task" in the struct server
- queue processing: queue_mgt
- appsess_refresh: appsession_refresh
before (idle system):
==6079== LEAK SUMMARY:
==6079== definitely lost: 1,112 bytes in 75 blocks.
==6079== indirectly lost: 53,356 bytes in 2,090 blocks.
==6079== possibly lost: 52 bytes in 1 blocks.
==6079== still reachable: 150,996 bytes in 504 blocks.
==6079== suppressed: 0 bytes in 0 blocks.
after (idle system):
==6945== LEAK SUMMARY:
==6945== definitely lost: 7,644 bytes in 137 blocks.
==6945== indirectly lost: 9,913 bytes in 587 blocks.
==6945== possibly lost: 0 bytes in 0 blocks.
==6945== still reachable: 0 bytes in 0 blocks.
==6945== suppressed: 0 bytes in 0 blocks.
before (running system for ~2m):
==9343== LEAK SUMMARY:
==9343== definitely lost: 1,112 bytes in 75 blocks.
==9343== indirectly lost: 54,199 bytes in 2,122 blocks.
==9343== possibly lost: 52 bytes in 1 blocks.
==9343== still reachable: 151,128 bytes in 509 blocks.
==9343== suppressed: 0 bytes in 0 blocks.
after (running system for ~2m):
==11616== LEAK SUMMARY:
==11616== definitely lost: 7,644 bytes in 137 blocks.
==11616== indirectly lost: 9,981 bytes in 591 blocks.
==11616== possibly lost: 0 bytes in 0 blocks.
==11616== still reachable: 4 bytes in 1 blocks.
==11616== suppressed: 0 bytes in 0 blocks.
Still not perfect but significant improvement.
Released version 1.3.15 with the following main changes :
- [BUILD] Added support for 'make install'
- [BUILD] Added 'install-man' make target for installing the man page
- [BUILD] Added 'install-bin' make target
- [BUILD] Added 'install-doc' make target
- [BUILD] Removed "/" after '$(DESTDIR)' in install targets
- [BUILD] Changed 'install' target to install the binaries first
- [BUILD] Replace hardcoded 'LD = gcc' with 'LD = $(CC)'
- [MEDIUM]: Inversion for options
- [MEDIUM]: Count retries and redispatches also for servers, fix redistribute_pending, extend logs, %d->%u cleanup
- [BUG]: Restore clearing t->logs.bytes
- [MEDIUM]: rework checks handling
- [DOC] Update a "contrib" file with a hint about a scheme used for formathing subjects
- [MEDIUM] Implement "track [<backend>/]<server>"
- [MINOR] Implement persistent id for proxies and servers
- [BUG] Don't increment server connections too much + fix retries
- [MEDIUM]: Prevent redispatcher from selecting the same server, version #3
- [MAJOR] proto_uxst rework -> SNMP support
- [BUG] appsession lookup in URL does not work
- [BUG] transparent proxy address was ignored in backend
- [BUG] hot reconfiguration failed because of a wrong error check
- [DOC] big update to the configuration manual
- [DOC] large update to the configuration manual
- [DOC] document more options
- [BUILD] major rework of the GNU Makefile
- [STATS] add support for "show info" on the unix socket
- [DOC] document options forwardfor to logasap
- [MINOR] add support for the "backlog" parameter
- [OPTIM] introduce global parameter "tune.maxaccept"
- [MEDIUM] introduce "timeout http-request" in frontends
- [MINOR] tarpit timeout is also allowed in backends
- [BUG] increment server connections for each connect()
- [MEDIUM] add a turn-around state of one second after a connection failure
- [BUG] fix typo in redispatched connection
- [DOC] document options nolinger to ssl-hello-chk
- [DOC] added documentation for "option tcplog" to "use_backend"
- [BUG] connect_server: server might not exist when sending error report
- [MEDIUM] support fully transparent proxy on Linux (USE_LINUX_TPROXY)
- [MEDIUM] add non-local bind to connect() on Linux
- [MINOR] add transparent proxy support for balabit's Tproxy v4
- [BUG] use backend's source and not server's source with tproxy
- [BUG] fix overlapping server flags
- [MEDIUM] fix server health checks source address selection
- [BUG] build failed on CONFIG_HAP_LINUX_TPROXY without CONFIG_HAP_CTTPROXY
- [DOC] added "server", "source" and "stats" keywords
- [DOC] all server parameters have been documented
- [DOC] document all req* and rsp* keywords.
- [DOC] added documentation about HTTP header manipulations
- [BUG] log response byte count, not request
- [BUILD] code did not build in full debug mode
- [BUG] fix truncated responses with sepoll
- [MINOR] use s->frt_addr as the server's address in transparent proxy
- [MINOR] fix configuration hint about timeouts
- [DOC] minor cleanup of the doc and notice to contributors
- [MINOR] report correct section type for unknown keywords.
- [BUILD] update MacOS Makefile to build on newer versions
- [DOC] fix erroneous "useallbackups" option in the doc
- [DOC] applied small fixes from early readers
- [MINOR] add configuration support for "redir" server keyword
- [MEDIUM] completely implement the server redirection method
- [TESTS] add a test case for the server redirection mechanism
- [DOC] add a configuration entry for "server ... redir <prefix>"
- [BUILD] backend.c and checks.c did not build without tproxy !
- Revert "[BUILD] backend.c and checks.c did not build without tproxy !"
- [BUILD] backend.c and checks.c did not build without tproxy !
- [OPTIM] used unsigned ints for HTTP state and message offsets
- [OPTIM] GCC4's builtin_expect() is suboptimal
- [BUG] failed conns were sometimes incremented in the frontend!
- [BUG] timeout.check was not pre-set to eternity
- [TESTS] add test-pollers.cfg to easily report pollers in use
- [BUG] do not apply timeout.connect in checks if unset
- [BUILD] ensure that makefile understands USE_DLMALLOC=1
- [MINOR] silent gcc for a wrong warning
- [CLEANUP] update .gitignore to ignore more temporary files
- [CLEANUP] report dlmalloc's source path only if explictly specified
- [BUG] str2sun could leak a small buffer in case of error during parsing
- [BUG] option allbackups was not working anymore in roundrobin mode
- [MAJOR] implementation of the "leastconn" load balancing algorithm
- [BUILD] ensure that users don't build without setting the target anymore.
- [DOC] document the leastconn LB algo
- [MEDIUM] fix stats socket limitation to 16 kB
- [DOC] fix unescaped space in httpchk example.
- [BUG] fix double-decrement of server connections
- [TESTS] add a test case for port mapping
- [TESTS] add a benchmark for integer hashing
- [TESTS] add new methods in ip-hash test file
- [MAJOR] implement parameter hashing for POST requests
This new parameter makes it possible to override the default
number of consecutive incoming connections which can be
accepted on a socket. By default it is not limited on single
process mode, and limited to 8 in multi-process mode.
The build process was getting annoying under some conditions,
especially on platforms which are used to set CFLAGS, as well
as those which set a lot of complex defines. The new Makefile
takes care of this situation by not mixing TARGET, CPU and user
values, and by making privileging the pre-setting of common
variables with the ability to override them.
Now CFLAGS and LDFLAGS are set by default and may be overridden
without the risk of breaking useful defines. Options are better
dealt with, and as a bonus, it was possible to merge the FreeBSD
and OpenBSD targets into the common GNU Makefile.
The report of build options by "haproxy -vv" has been slightly
adapted to the new mode. Options implied by architecture are not
reported, only user-specified options are. It is also possible to
add options which will not be reported in order not to mangle the
output when specifying dirty informations such as URLs...
The Makefile was copiously documented and it should be easier to
build for any target now. Backwards compatibility with older
build processes was kept, and warnings are emitted for deprecated
build options.
The error check in return of start_proxies checked for exact ERR_RETRYABLE
but did not consider the return as a bit field. The function returned both
ERR_RETRYABLE and ERR_ALERT, hence the problem.
Sometimes it is useful to find out how a given binary version was
built. The build compiler and options are now provided for this,
and it's possible to get them with the -vv option.
Under certain circumstances, it is very useful to be able to fail some
monitor requests. One specific case is when the number of servers in
the backend falls below a certain level. The new "monitor fail" construct
followed by either "if"/"unless" <condition> makes it possible to specify
ACL-based conditions which will make the monitor return 503 instead of
200. Any number of conditions can be passed. Another use may be to limit
the requests to local networks only.
It is very convenient for SNMP monitoring to have unique process ID,
proxy ID and server ID. Those have been added to the CSV outputs.
The numbers start at 1. 0 is reserved. For servers, 0 means that the
reported name is not a server name but half a proxy (FRONTEND/BACKEND).
A remaining hidden "-" in the CSV output has been eliminated too.
Some applications do not have a strict persistence requirement, yet
it is still desirable for performance considerations, due to local
caches on the servers. For some reasons, there are some applications
which cannot rely on cookies, and for which the last resort is to use
a parameter passed in the URL.
The new 'url_param' balance method is there to solve this issue. It
accepts a parameter name which is looked up from the URL and which
is then hashed to select a server. If the parameter is not found,
then the round robin algorithm is used in order to provide a normal
load balancing across the servers for the first requests. It would
have been possible to use a source IP hash instead, but since such
applications are generally buried behind multiple levels of
reverse-proxies, it would not provide a good balance.
The doc has been updated, and two regression testing configurations
have been added.
localtime() was called with pointers to tv_sec, which is time_t on
some platforms and long on others. A problem was encountered on
Sparc64 under OpenBSD where tv_sec is long (64 bits) and time_t is
32 bits. Since this architecture is big-endian, it exhibited the
bug because localtime() always worked with the high part of the
value which is always zero. This problem was identified and debugged
by Thierry Fournier.
The correct solution is to pass the date by value and not by pointer,
through an intermediate function. The use of localtime_r() instead of
localtime() also made it possible to get rid of the first call to
localtime() since it does not need to allocate memory anymore.
Removed old unused MODE_LOG and MODE_STATS, and replaced the "stats"
keyword in the global section. The new "stats" keyword in the global
section is used to create a UNIX socket on which the statistics will
be accessed. The client must issue a "show stat\n" command in order
to get a CSV-formated output similar to the output on the HTTP socket
in CSV mode.
A new generic protocol mechanism has been added. It provides
an easy method to implement new protocols with different
listeners (eg: unix sockets).
The listeners are automatically started at the right moment
and enabled after the possible fork().
This must come from a copy-paste typo: in the unlikely event that
fork() would fail, the parent process would only exit(1) if there
were old pids. That's non-sense.
When one server appears at the same position in multiple backends, it
receives all the checks from all the backends exactly at the same time
because the health-checks are only spread within a backend but not
globally.
Attached patch implements per-server start delay in a different way.
Checks are now spread globally - not locally to one backend. It also makes
them start faster - IMHO there is no need to add a 'server->inter' when
calculating first execution. Calculation were moved from cfgparse.c to
checks.c. There is a new function start_checks() and now it is not called
when haproxy is started in MODE_CHECK.
With this patch it is also possible to set a global 'spread-checks'
parameter. It takes a percentage value (1..50, probably something near
5..10 is a good idea) so haproxy adds or removes that many percent to the
original interval after each check. My test shows that with 18 backends,
54 servers total and 10000ms/5% it takes about 45m to mix them completely.
I decided to use rand/srand pseudo-random number generator. I am aware it
is not recommend for a good randomness but a) we do not need a good random
generator here b) it is probably the most portable one.
The following patch will give the ability to tweak socket linger mode.
You can use this option with "option nolinger" inside fronted or backend
configuration declaration.
This will help in environments where lots of FIN_WAIT sockets are
encountered.
This patch fixes a nasty bug raported by both glibc and valgrind, which
leads into a problem that haproxy does not exit when a new instace
starts ap (-sf/-st).
==9299== Invalid free() / delete / delete[]
==9299== at 0x401D095: free (in
/usr/lib/valgrind/x86-linux/vgpreload_memcheck.so)
==9299== by 0x804A377: deinit (haproxy.c:721)
==9299== by 0x804A883: main (haproxy.c:1014)
==9299== Address 0x41859E0 is 0 bytes inside a block of size 21 free'd
==9299== at 0x401D095: free (in
/usr/lib/valgrind/x86-linux/vgpreload_memcheck.so)
==9299== by 0x804A84B: main (haproxy.c:985)
==9299==
6542 open("/dev/tty", O_RDWR|O_NONBLOCK|O_NOCTTY) = -1 ENOENT (No such file
or directory)
6542 writev(2, [{"*** glibc detected *** ", 23}, {"corrupted double-linked
list", 28}, {": 0x", 4}, {"6ff91878", 8}, {" ***\n", 5}], 5) = -1 EBADF (Bad
file descriptor)
I found this bug trying to find why, after one week with many restarts, I
finished with >100 haproxy process running. ;)
The deinit() function is specialized in memory area freeing.
There were a ton of information that were not released at the
exit time, which made valgrind complain. Now, most of the entries
are freed. However, it seems like regfree() does not completely
free a regex (12 bytes lost per regex).
By default, epoll/kqueue used to return as many events as possible.
This could sometimes cause huge latencies (latencies of up to 400 ms
have been observed with many thousands of fds at once). Limiting the
number of events returned also reduces the latency by avoiding too
many blind processing. The value is set to 200 by default and can be
changed in the global section using the tune.maxpollevents parameter.
When we're interrupted by another instance, it is very likely
that the other one will need some memory. Now we know how to
free what is not used, so let's do it.
Also only free non-null pointers. Previously, pool_destroy()
did implicitly check for this case which was incidentely
needed.
Also during this process, a bug was found in appsession_refresh().
It would not automatically requeue the task in the queue, so the
old sessions would not vanish.
The timeout functions were difficult to manipulate because they were
rounding results to the millisecond. Thus, it was difficult to compare
and to check what expired and what did not. Also, the comparison
functions were heavy with multiplies and divides by 1000. Now, all
timeouts are stored in timevals, reducing the number of operations
for updates and leading to cleaner and more efficient code.
The rbtree-based wait queue consumes a lot of CPU. Use the ul2tree
instead. Lots of cleanups and code reorganizations made it possible
to reduce the task struct and simplify the code a bit.
The principle behind speculative I/O is to speculatively try to
perform I/O before registering the events in the system. This
considerably reduces the number of calls to epoll_ctl() and
sometimes even epoll_wait(), and manages to increase overall
performance by about 10%.
The new poller has been called "sepoll". It is used by default
on Linux when it works. A corresponding option "nosepoll" and
the command line argument "-ds" allow to disable it.
Gcc provides __attribute__((constructor)) which is very convenient
to execute functions at startup right before main(). All the pollers
have been converted to have their register() function declared like
this, so that it is not necessary anymore to call them from a centralized
file.
Some pollers such as kqueue lose their FD across fork(), meaning that
the registered file descriptors are lost too. Now when the proxies are
started by start_proxies(), the file descriptors are not registered yet,
leaving enough time for the fork() to take place and to get a new pollfd.
It will be the first call to maintain_proxies that will register them.
select, poll and epoll now have their dedicated functions and have
been split into distinct files. Several FD manipulation primitives
have been provided with each poller.
The rest of the code needs to be cleaned to remove traces of
StaticReadEvent/StaticWriteEvent. A trick involving a macro has
temporarily been used right now. Some work needs to be done to
factorize tests and sets everywhere.
logs are handled better with dedicated functions. The HTTP implementation
moved to proto_http.c. It has been cleaned up a bit. Now a frontend with
option httplog and no log will not call the function anymore.
Previously, use of the "usesrc" keyword could silently fail if
either the module was not loaded, or the user did not have enough
permissions. Now the errors are better diagnosed and more appropriate
advices are given.
- stats now support the HEAD method too
- extracted http request from the session
- huge rework of the HTTP parser which is now a 28-state FSM.
- linux-style likely/unlikely macros for optimization hints
- do not create a server socket when there's no server
This patch from Sin Yu makes use of an rbtree for the wait queue,
which will solve the slowdown problem encountered when timeouts
are heterogenous in the configuration. The next step will be to
turn maintain_proxies() into a per-proxy task so that we won't
have to scan them all after each poll() loop.
The tcp-splicing code has been merged, and a doc has been written.
A configuration example has been derived from the previous content
switching sample.
It is now possible to define an errorloc in the backend as well as
in the frontend. The backend's will be used first, and if undefined,
then the frontend's will be used instead. If none is used, then the
original error messages will be used.
The nbconn attribute in the proxies was not relevant anymore because
a frontend A may use backend B and both of them must account for their
respective connections. For this reason, there now are two separate
counters for frontend and backend connections.
The stats page has been updated to reflect the backend, but a separate
line entry for the frontend with error counts would be good.
Note that as of now, beconn may be higher than maxconn, because maxconn
applies to the frontend, while beconn may be increased due to sessions
passed from another frontend.
The files are now stored under :
- include/haproxy for the generic includes
- include/types.h for the structures needed within prototypes
- include/proto.h for function prototypes and inline functions
- src/*.c for the C files
Most include files are now covered by LGPL. A last move still needs
to be done to put inline functions under GPL and not LGPL.
Version has been set to 1.3.0 in the code but some control still
needs to be done before releasing.