We now measure the work and idle times in order to report the idle
time in the stats. It's expected that we'll be able to use it at
other places later.
Several algorithms will need to know the millisecond value within
the current second. Instead of doing a divide every time it is needed,
it's better to compute it when it changes, which is when now and now_ms
are recomputed.
curr_sec_ms_scaled is the same multiplied by 2^32/1000, which will be
useful to compute some ratios based on the position within last second.
This new time value will be used to compute timeouts and wait queue
positions. The operation is made once for all when time is retrieved.
A future improvement might consist in having it in ticks of 1/1024
second and to convert all timeouts into ticks.
The first implementation of the monotonic clock did not verify
forward jumps. The consequence is that a fast changing time may
expire a lot of tasks. While it does seem minor, in fact it is
problematic because most machines which boot with a wrong date
are in the past and suddenly see their time jump by several
years in the future.
The solution is to check if we spent more apparent time in
a poller than allowed (with a margin applied). The margin
is currently set to 1000 ms. It should be large enough for
any poll() to complete.
Tests with randomly jumping clock show that the result is quite
accurate (error less than 1 second at every change of more than
one second).
If the system date is set backwards while haproxy is running,
some scheduled events are delayed by the amount of time the
clock went backwards. This is particularly problematic on
systems where the date is set at boot, because it seldom
happens that health-checks do not get sent for a few hours.
Before switching to use clock_gettime() on systems which
provide it, we can at least ensure that the clock is not
going backwards and maintain two clocks : the "date" which
represents what the user wants to see (mostly for logs),
and an internal date stored in "now", used for scheduled
events.
AIX does not know about MSG_DONTWAIT. Fortunately, nearly all sockets
are already set to O_NONBLOCK, so it's not even required to change the
code. It was only necessary to add this fcntl to the log socket which
lacked it. The MSG_DONTWAIT value has been defined to zero when unset
in order to make the code cleaner and more portable.
Also, on AIX, "hz" is defined, which causes a problem with one function
parameter in time.c. It's enough to rename the parameter there. Last,
fix a missing #include <string.h> in proxy.c.
Hello,
This patch implements new statistics for SLA calculation by adding new
field 'Dwntime' with total down time since restart (both HTTP/CSV) and
extending status field (HTTP) or inserting a new one (CSV) with time
showing how long each server/backend is in a current state. Additionaly,
down transations are also calculated and displayed for backends, so it is
possible to know how many times selected backend was down, generating "No
server is available to handle this request." error.
New information are presentetd in two different ways:
- for HTTP: a "human redable form", one of "100000d 23h", "23h 59m" or
"59m 59s"
- for CSV: seconds
I believe that seconds resolution is enough.
As there are more columns in the status page I decided to shrink some
names to make more space:
- Weight -> Wght
- Check -> Chk
- Down -> Dwn
Making described changes I also made some improvements and fixed some
small bugs:
- don't increment s->health above 's->rise + s->fall - 1'. Previously it
was incremented an then (re)set to 's->rise + s->fall - 1'.
- do not set server down if it is down already
- do not set server up if it is up already
- fix colspan in multiple places (mostly introduced by my previous patch)
- add missing "status" header to CSV
- fix order of retries/redispatches in server (CSV)
- s/Tthen/Then/
- s/server/backend/ in DATA_ST_PX_BE (dumpstats.c)
Changes from previous version:
- deal with negative time intervales
- don't relay on s->state (SRV_RUNNING)
- little reworked human_time + compacted format (no spaces). If needed it
can be used in the future for other purposes by optionally making "cnt"
as an argument
- leave set_server_down mostly unchanged
- only little reworked "process_chk: 9"
- additional fields in CSV are appended to the rigth
- fix "SEC" macro
- named arguments (human_time, be_downtime, srv_downtime)
Hope it is OK. If there are only cosmetic changes needed please fill free
to correct it, however if there are some bigger changes required I would
like to discuss it first or at last to know what exactly was changed
especially since I already put this patch into my production server. :)
Thank you,
Best regards,
Krzysztof Oledzki
tv_cmp2_ms handles multiple combinations of tv1 and tv2, but only
one form is used: (tv1 <= tv2). So it is overkill to use it everywhere.
A new function designed to do exactly this has been written for that
purpose: tv_cmp2_le. Also, removed old unused tv_* functions.
The fact that TV_ETERNITY was 0 was very awkward because it
required that comparison functions handled the special case.
Now it is ~0 and all comparisons are performed on unsigned
values, so that it is naturally greater than any other value.
A performance gain of about 2-5% has been noticed.
The new rbtree-based scheduler makes heavy use of tv_cmp2(), and
this function becomes a huge CPU eater. Refine it a little bit in
order to slightly reduce CPU usage.
Some of the tv_* functions are called very often. Passing their
arguments as registers is quite faster. This can be disabled
by setting CONFIG_HAP_DISABLE_REGPARM.
As suggested by Markus Elfring, a few "const char *" have replaced
some "char *" declarations where a function is not expected to
modify a value. It does not change the code but it helps detecting
coding errors.