Jarno Huuskonen reported that ip6range doesn't build anymore on
CentOS 7 (and possibly other distros) due to "in6_u" not being known.
Using s6_addr32 instead of in6_u.u6_addr32 apparently works fine, and
it's also what the Lua code uses so it should be OK.
This patch may be backported to 1.6.
This one jumps back to the oldest post-fork and post-accept action,
so it makes it possible to recv(), pause() and send() in loops after a fork()
and an accept() for example. This is handy for bugs that reproduce
once in a while or to keep idle connections working.
By passing "S:<string>" instead of S<size> it's possible to send
a pre-defined string, which is convenient to write HTTP requests or
responses.
Example : produce two responses for ab, one in keep-alive, one not :
./tcploop 8001 L W N2 A R S:"HTTP/1.0 200 OK\r\nConnection: keep-alive\r\nContent-length: 50\r\n\r\n0123456789.123456789.123456789.123456789.123456789" R S:"HTTP/1.0 200 OK\r\nContent-length: 50\r\n\r\n0123456789.123456789.123456789.123456789.123456789"
With 20 such keep-alive responses and 10 parallel processes, ab achieves
350kreq/s, so it should be possible to get precise timings.
This is helpful to show what state we're dealing with. The pid is
written, optionally followed by the time in 3 different formats
(relative/absolute) depending on the command line option (-t, -tt, -ttt).
Fork is a very convenient way to deal with independent yet properly
timed connections. It's particularly useful here for accept(), and
ensures that any accepted FD will automatically be released. The
principle is that when we hit a fork command, the parent restarts
evaluating the actions from the beginning and the child continues
to evaluate the next actions. Listen and connect are skipped if the
connection is already established. Fork() is amazingly cheap on
Linux, 21k forked connections per second are handled on a single
core, and 38k on two cores.
For now it's not possible to have two different code paths so in order
to have both a listener and a connector, two distinct commands are
still needed.
netcat, nc6 and socat are only partially convenient as reproducers for
state machine bugs, but when it comes to adding delays, forcing resets,
waiting for data to be acked, they become useless.
The purpose of this utility is to be able to easily script some TCP
operations such as connect, accept, send, receive, shutdown and of
course pauses.
A new "option spop-check" statement has been added to enable server health
checks based on SPOP HELLO handshake. SPOP is the protocol used by SPOE filters
to talk to servers.
This is a very simple service that implements a "random" IP reputation
service. It will return random scores for all checked IP addresses. It only
shows you how to implement an IP reputation service or similar services
using the SPOE.
Users can set the location of haproxy.cfg and pidfile files by providing
a systemd overwrite file /etc/systemd/system/haproxy.service.d/overwrite.conf
with the following content:
[Service]
Environment=CONFIG=/etc/foobar/haproxy.cfg
The goal is to have a collection of quick-n-dirty utilities that make
debugging easier and that can easily be modified when needed. The first
utility in this series is called "flags". For a given numeric argument,
it reports the various known combinations of flags for channels, streams
and so on. This way it's easy to copy-paste values from the CLI or from
gdb and immediately know what state a stream-interface or connection is
in.
By default systemd will send SIGTERM to all processes in the service's
control group. In our case, this includes the wrapper, the master
process and all worker processes.
Since commit c54bdd2a the wrapper actually catches SIGTERM and survives
to see the master process getting killed by systemd and regard this as
an error, placing the unit in a failed state during "systemctl stop".
Since the wrapper now handles SIGTERM by itself, we switch the kill mode
to 'mixed', which means that systemd will deliver the initial SIGTERM to
the wrapper only, and if the actual haproxy processes don't exit after a
given amount of time (default: 90s), a SIGKILL is sent to all remaining
processes in the control group. See systemd.kill(5) for more
information.
This should also be backported to 1.5.
Dmitry Sivachenko reported that "uint" doesn't build on FreeBSD 10.
On Linux it's defined in sys/types.h and indicated as "old". Just
get rid of the very few occurrences.
The last commit provides time-based filtering. Unfortunately, it wastes
90% of the time calling the expensive time()/localtime()/mktime()
functions.
This patch does 3 things :
  - call time()/localtime() only once to initialize the correct
    struct timeinfo ;
  - call mktime() only when the time has changed regardless of
    the current second ;
  - manually add the current second to the cached result.
Doing just this is enough to multiply the parsing speed by 8.
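The three steps above can be sketched as follows. This is only an
illustration and not halog's actual code: "struct time_cache",
time_cache_init() and cached_timestamp() are hypothetical names, the date
fields are assumed to be already parsed into integers, and the "calls"
counter only exists to show how rarely mktime() is reached.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical cache holding the last converted minute. */
struct time_cache {
	struct tm tm;         /* broken-down time of the cached minute */
	time_t minute_ts;     /* timestamp of second 0 of that minute */
	int valid;            /* non-zero once minute_ts is usable */
	unsigned long calls;  /* number of mktime() calls, for illustration */
};

/* Step 1: call time()/localtime() only once to get a fully initialized
 * struct tm to work on.
 */
static void time_cache_init(struct time_cache *c)
{
	time_t now = time(NULL);
	struct tm *lt = localtime(&now);

	assert(lt != NULL);
	c->tm = *lt;
	c->minute_ts = 0;
	c->valid = 0;
	c->calls = 0;
}

/* Steps 2 and 3: call mktime() only when something other than the second
 * has changed, then add the second to the cached result. <mon> is 1..12.
 */
static time_t cached_timestamp(struct time_cache *c, int year, int mon,
                               int mday, int hour, int min, int sec)
{
	if (!c->valid ||
	    c->tm.tm_min != min || c->tm.tm_hour != hour ||
	    c->tm.tm_mday != mday || c->tm.tm_mon != mon - 1 ||
	    c->tm.tm_year != year - 1900) {
		c->tm.tm_year = year - 1900;
		c->tm.tm_mon = mon - 1;
		c->tm.tm_mday = mday;
		c->tm.tm_hour = hour;
		c->tm.tm_min = min;
		c->tm.tm_sec = 0;
		c->tm.tm_isdst = -1; /* assumption: let mktime() decide on DST */
		c->minute_ts = mktime(&c->tm);
		c->valid = 1;
		c->calls++;
	}
	return c->minute_ts + sec;
}
```

Since log lines are mostly sorted by time, consecutive lines usually share
the same minute, so with such a cache most lines only cost a few integer
comparisons and one addition.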
I wanted to make a graph with average answer time in nagios that takes only
the last 5 mn of the log. Filtering the log before using halog was too
slow, so I added that filter to halog.
The patch attached to this mail is a proposal to add a new option : -time
[min][:max]
The values are min timestamp and/or max timestamp of the lines to be used
for stats. The date and time of the log lines between '[' and ']' are
converted to timestamp and compared to these values.
Here is an example of usage :
cat /var/log/haproxy.log | ./halog -srv -H -q -time $(date --date '-5 min' +%s)
This is the same as -uc except that instead of counting URLs, it
counts source addresses. The reported times are request times and
not response times.
The code is becoming really ugly, the url struct is being abused to
store an address, and there are no more bit fields available. The
code needs a major revamp.
The iprange tool is handy for transforming network range formats, but
it's common to need a tool for running quick checks on the output.
The tool now supports a list of addresses on the command line, and it
will only output those which match. It's absolutely inefficient but is
handy for debugging.
Commit 667c905f introduced parameter -m to halog which limits the size
of the output. Unfortunately it is completely broken in that it doesn't
check whether the limit was actually set, and also prevents a
simple counting operation from returning anything if a limit is not set.
Note that the -gt and -pct outputs behave differently in the face of this
limit, since they count the valid output lines BEFORE actually producing
the data, so the limit really applies to valid input lines.
Sometimes it's useful to limit the output to a number of lines, for
example when output is already sorted (eg: 10 slowest URLs, ...). Now
we can use -m for this.
There was a lines_out++ left from earlier code, causing each input
line to be counted as an output line.
This fix also affects 1.4 and should be backported.
Using posix_fadvise() it is possible to tell the system that we're
going to read a whole file at once. The kernel then doubles the
read-ahead size for this file. On Linux with an SSD, this has improved
cold-cache performance by around 20%. Hot-cache is not affected at all.
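A minimal sketch of such a hint, assuming POSIX_FADV_SEQUENTIAL is the
advice being used and that the file is then read from start to end. The
helper names advise_sequential() and count_lines() are hypothetical and
only serve the example.

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>

/* Tells the kernel that <fd> will be read sequentially. A zero offset and
 * a zero length cover the whole file. Returns 0 on success or an error
 * number, since posix_fadvise() does not set errno.
 */
static int advise_sequential(int fd)
{
	return posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}

/* Counts the lines of <f> after emitting the hint. */
static long count_lines(FILE *f)
{
	long lines = 0;
	int c;

	assert(f != NULL);
	/* the hint is only an optimization, so failures (eg: when the
	 * input is a pipe) are deliberately ignored.
	 */
	(void)advise_sequential(fileno(f));
	while ((c = getc(f)) != EOF)
		if (c == '\n')
			lines++;
	return lines;
}
```

The hint only helps when the input is a regular file. When logs are piped
in, the call fails and the read proceeds as before.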
glibc-2.11 on x86_64 provides a machine-specific memchr() which is faster
than the generic C implementation by around 40%, so let's make it possible
to use it instead of the hand-coded version.
This version implements both 32 and 64 bit versions at once, it
avoids the need to have two separate output files. It also improves
efficiency on i386 platforms by adding a little bit of assembly where
gcc isn't efficient.
This feature relies on GCC's ability to call helpers at function entry/exit
points. We define these helpers to quickly dump the minimum info into a trace
file that can be converted to a human readable format using a script in the
contrib/trace directory. This has only been implemented in the GNU makefile
for now, as it is unclear whether it's supported on all OSes.
The feature is enabled by building with "TRACE=1". The performance impact is
huge, so this feature should only be used when debugging. To limit the loss
of performance, fprintf() has been disabled and the output is hand-crafted
and emitted using fwrite(), resulting in doubling the performance. Using the
TSC instead of gettimeofday() also doubles the performance. Around 1200 conns/s
may be achieved on a Pentium-M 1.7 GHz which leads to around 50 MB/s of traces.
The entry and exits of all functions will be dumped into a file designated
by the HAPROXY_TRACE environment variable, or by default "trace.out". If the
trace file name is empty or "/dev/null", then traces are disabled. If
opening the trace file fails, then stderr is used. If HAPROXY_TRACE_FAST is
used, then the time is taken from the global <now> variable. Last, if
HAPROXY_TRACE_TSC is used, then the machine's TSC is used instead of the
real time (almost twice as fast).
The output format is :
<sec.usec> <level> <caller_ptr> <dir> <callee_ptr>
or :
<tsc> <level> <caller_ptr> <dir> <callee_ptr>
where <dir> is '>' when entering a function and '<' when leaving.
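For illustration, here is a minimal sketch of what such helpers may look
like. It is not the actual trace code: it relies on the hooks GCC calls
when building with -finstrument-functions, it only covers the
gettimeofday() variant, and it uses snprintf() for brevity where the real
code hand-crafts its output. The trace_open(), format_trace() and
trace_emit() names, and the way the level is counted, are assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

/* the helpers themselves must not be instrumented or they would recurse */
#define NO_INSTR __attribute__((no_instrument_function))

static FILE *trace_file;   /* NULL while traces are disabled */
static int trace_ready;    /* set once the output has been chosen */
static int trace_level;    /* current call depth */

/* Picks the output for the HAPROXY_TRACE value <name> (NULL when unset):
 * "trace.out" by default, nothing for "" or "/dev/null", and stderr when
 * the file cannot be opened.
 */
NO_INSTR static FILE *trace_open(const char *name)
{
	FILE *f;

	if (!name)
		name = "trace.out";
	if (!*name || strcmp(name, "/dev/null") == 0)
		return NULL;
	f = fopen(name, "w");
	return f ? f : stderr;
}

/* Builds "<sec.usec> <level> <caller_ptr> <dir> <callee_ptr>\n" in <buf> */
NO_INSTR static int format_trace(char *buf, size_t size, long sec, long usec,
                                 int level, void *caller, char dir, void *callee)
{
	assert(buf != NULL && size > 0);
	return snprintf(buf, size, "%ld.%06ld %d 0x%lx %c 0x%lx\n",
	                sec, usec, level, (unsigned long)caller, dir,
	                (unsigned long)callee);
}

NO_INSTR static void trace_emit(int level, void *caller, char dir, void *callee)
{
	struct timeval tv;
	char buf[128];
	int len;

	if (!trace_ready) {
		trace_file = trace_open(getenv("HAPROXY_TRACE"));
		trace_ready = 1;
	}
	if (!trace_file)
		return;
	gettimeofday(&tv, NULL);
	len = format_trace(buf, sizeof(buf), (long)tv.tv_sec, (long)tv.tv_usec,
	                   level, caller, dir, callee);
	if (len > 0 && (size_t)len < sizeof(buf))
		fwrite(buf, 1, (size_t)len, trace_file);
}

/* called by GCC-instrumented code when entering a function */
NO_INSTR void __cyg_profile_func_enter(void *this_fn, void *call_site)
{
	trace_emit(++trace_level, call_site, '>', this_fn);
}

/* called by GCC-instrumented code when leaving a function */
NO_INSTR void __cyg_profile_func_exit(void *this_fn, void *call_site)
{
	trace_emit(trace_level--, call_site, '<', this_fn);
}
```

Linking such a file into a program built with -finstrument-functions is
enough to get one line per function entry and exit.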
The awk script in contrib/trace provides a nicer indented output :
6f74989e6f8 ->->-> run_poll_loop > signal_process_queue [src/haproxy.c:1097:0x804bd69] > [include/proto/signal.h:32:0x8049cd0]
6f74989eb00 run_poll_loop < signal_process_queue [src/haproxy.c:1097:0x804bd69] < [include/proto/signal.h:32:0x8049cd0]
6f74989ef44 ->->-> run_poll_loop > wake_expired_tasks [src/haproxy.c:1100:0x804bd72] > [src/task.c:123:0x8055060]
6f74989f3a6 ->->->-> wake_expired_tasks > eb32_lookup_ge [src/task.c:128:0x8055091] > [ebtree/eb32tree.c:138:0x80a8c70]
6f74989f7e9 wake_expired_tasks < eb32_lookup_ge [src/task.c:128:0x8055091] < [ebtree/eb32tree.c:138:0x80a8c70]
6f74989fc0d ->->->-> wake_expired_tasks > eb32_first [src/task.c:134:0x80550d5] > [ebtree/eb32tree.h:55:0x8054ad0]
6f7498a003d ->->->->-> eb32_first > eb_first [ebtree/eb32tree.h:56:0x8054af1] > [ebtree/ebtree.h:520:0x8054a10]
6f7498a0436 ->->->->->-> eb_first > eb_walk_down [ebtree/ebtree.h:521:0x8054a33] > [ebtree/ebtree.h:442:0x80549a0]
6f7498a0843 ->->->->->->-> eb_walk_down > eb_gettag [ebtree/ebtree.h:445:0x80549d6] > [ebtree/ebtree.h:418:0x80548e0]
6f7498a0c2b eb_walk_down < eb_gettag [ebtree/ebtree.h:445:0x80549d6] < [ebtree/ebtree.h:418:0x80548e0]
6f7498a1042 ->->->->->->-> eb_walk_down > eb_untag [ebtree/ebtree.h:447:0x80549e2] > [ebtree/ebtree.h:412:0x80548a0]
6f7498a1498 eb_walk_down < eb_untag [ebtree/ebtree.h:447:0x80549e2] < [ebtree/ebtree.h:412:0x80548a0]
6f7498a18c6 ->->->->->->-> eb_walk_down > eb_root_to_node [ebtree/ebtree.h:448:0x80549e7] > [ebtree/ebtree.h:432:0x8054960]
6f7498a1cd4 eb_walk_down < eb_root_to_node [ebtree/ebtree.h:448:0x80549e7] < [ebtree/ebtree.h:432:0x8054960]
6f7498a20c4 eb_first < eb_walk_down [ebtree/ebtree.h:521:0x8054a33] < [ebtree/ebtree.h:442:0x80549a0]
6f7498a24b4 eb32_first < eb_first [ebtree/eb32tree.h:56:0x8054af1] < [ebtree/ebtree.h:520:0x8054a10]
6f7498a289c wake_expired_tasks < eb32_first [src/task.c:134:0x80550d5] < [ebtree/eb32tree.h:55:0x8054ad0]
6f7498a2c8c run_poll_loop < wake_expired_tasks [src/haproxy.c:1100:0x804bd72] < [src/task.c:123:0x8055060]
6f7498a3095 ->->-> run_poll_loop > process_runnable_tasks [src/haproxy.c:1103:0x804bd7a] > [src/task.c:190:0x8055150]
A nice improvement would possibly consist in trying to get the function's
arguments in the stack and to dump a bit more info for some well-known
functions (eg: the session's status for process_session).
This tool has remained uncommitted in my development tree for almost a year.
Just minor polish and commit.
It can be used to convert some geolocation IP lists to ACLs.
Using "halog -c" is still something quite common to perform on logs,
but unfortunately since the recently added controls, it was noticeably
slowed down due to the parsing of the accept date field.
Now we use a specific loop for the case where nothing is needed from
the input, and this sped up the line counting by 2.5x. A 2.4 GHz Xeon
now counts lines at a rate of 2 GB of logs per second.
Gcc tries to be a bit too smart in these small loops and the result is
that on i386 we waste a lot of time there. By recoding these loops in
assembly, we save up to 23% total processing time on i386! The savings
on x86_64 are much lower, probably because there are more registers and
gcc has to do fewer tricks. However, those savings vary a lot between gcc
versions and even cause harm on some of them (eg: 4.4) because gcc does
not know how to optimize the code once inlined.
However, by recoding field_start() in C to try to match the assembly
code as much as possible, we can significantly reduce its execution
time without risking the negative impacts. Thus, the assembly version
is less interesting there but still worth being used on some compilers.
By adding a "landing area" at the end of the buffer, it becomes safe to
parse more bytes at once. On 32-bit this makes fgets run about 4% faster
but it does not save anything on 64-bit.
A bug in the algorithm used to find an LF in multiple bytes at once
made byte 0x80 trigger detection of byte 0x00, thus 0x8A matches byte
0x0A. In practice, this issue never happens since byte 0x8A won't be
displayed in logs (or it will be encoded). This could still possibly
happen in mixed logs.
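This class of bug can be reproduced with a generic "find a zero byte in a
32-bit word" test. The sketch below is not the code from halog, just a
well-known formulation of the technique: the buggy variant only looks at
the low 7 bits of each byte, so it cannot tell 0x80 from 0x00, while the
fixed one also takes bit 7 into account. All function names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy: bit 7 of each result byte is set when the low 7 bits of the
 * matching input byte are all zero, which is true for 0x00 but also
 * for 0x80.
 */
static uint32_t has_zero_buggy(uint32_t x)
{
	return ~((x & 0x7F7F7F7FU) + 0x7F7F7F7FU) & 0x80808080U;
}

/* Fixed: OR-ing <x> back in clears the result for bytes which have
 * their bit 7 set, so only 0x00 matches.
 */
static uint32_t has_zero(uint32_t x)
{
	return ~(((x & 0x7F7F7F7FU) + 0x7F7F7F7FU) | x) & 0x80808080U;
}

/* Looking for an LF is done by XORing each byte with 0x0A first, so
 * that an LF becomes 0x00. With the buggy test, 0x8A becomes 0x80 and
 * matches as well.
 */
static uint32_t has_lf_buggy(uint32_t x)
{
	return has_zero_buggy(x ^ 0x0A0A0A0AU);
}

static uint32_t has_lf(uint32_t x)
{
	return has_zero(x ^ 0x0A0A0A0AU);
}
```

For example the word 0x41428A43 contains no LF, yet has_lf_buggy() returns
non-zero for it while has_lf() returns zero.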
Some syslog servers escape quotes, which makes the resulting logs unusable
for URL processing since the parser looks for the first field beginning
with a quote. It now also supports fields starting with backslash and
quote in order to address this. No performance impact was measured.
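The check can be pictured like this. It is a simplified sketch which
assumes the parser already points to the beginning of a field, and
skip_opening_quote() is a hypothetical helper, not a function from halog.

```c
#include <assert.h>
#include <stddef.h>

/* Returns a pointer to the first character following the opening quote of
 * the field starting at <p>, accepting both '"' and '\' followed by '"',
 * or NULL if the field does not start with a quote.
 */
static const char *skip_opening_quote(const char *p)
{
	assert(p != NULL);
	if (*p == '\\')
		p++;
	if (*p != '"')
		return NULL;
	return p + 1;
}
```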
The code was merged with the error code checking which is very similar and
which shares the same information. The new test adds about 1% slowdown to
error checking but makes it more reliable when facing wrongly formatted
status codes.
It is now possible to filter by termination code with -tcn <termcode>, to be
able to track one kind of errors, for example after counting it with -tc.
Using -TCN <termcode> gives you the opposite.
There were too many filters, we were losing time in all the "if" statements.
By moving all the filters to independent functions, we made the code cleaner
and slightly faster (3%).
One minor bug was found, the -tc and -st options did not report the number
of output lines, but always zero.
Almost all filters first check the line format, which takes a lot of code
and requires parsing back and forth. By centralizing this test, we can
save about 15-20 more percent of performance for all filters.
Also, the test was wrong: it checked that the source IP address
started with a digit, which is not always true with local IPv6 addresses.
Instead, we now check that the next field (accept field) starts with an
opening bracket and is followed by a digit between 0 and 3 (day of the
month). Doing this has contributed a 2% speedup because all other field
calculations were relative to a closer field.
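As an illustration, the new test boils down to something like the function
below. The name is hypothetical and it assumes <p> already points to the
beginning of the accept field in a NUL-terminated line.

```c
#include <assert.h>
#include <stddef.h>

/* Returns non-zero if the field at <p> looks like an accept date, that is
 * an opening bracket followed by the first digit of the day of the month
 * (0 to 3).
 */
static int is_accept_field(const char *p)
{
	assert(p != NULL);
	return p[0] == '[' && p[1] >= '0' && p[1] <= '3';
}
```

A bracketed IPv6 address beginning with a colon such as "[::1]" is rejected
by the second character, as is any field that does not start with a bracket.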