We often need ISO time + microseconds in traces and ring buffers, thus
function does this by calling gettimeofday() and keeping a cached value
of the part representing the tv_sec value, and only rewrites the microsecond
part. The cache is per-thread so it's lockless and safe to use as-is.
Some tests already show that it's easy to see 3-4 events in a single
microsecond, thus it's likely that the nanosecond version will have to
be implemented as well. But certain comments on the net suggest that
some parsers are having trouble beyond microsecond, thus for now let's
stick to the microsecond only.
All threads call tv_update_date(-1) at boot to set their own local time
offset. While doing so they also overwrite global_now, which is not that
much of a problem except that it's not done using an atomic write and
that it will be overwritten by every there in parallel. We only need the
first thread to set it anyway, so let's simply set it if not set and do
it using a CAS. This should fix GH issue #111.
This may be backported to 1.9.
I found on an (old) AIX 5.1 machine that stdint.h didn't exist while
inttypes.h which is expected to include it does exist and provides the
desired functionalities.
As explained here, stdint being just a subset of inttypes for use in
freestanding environments, it's probably always OK to switch to inttypes
instead:
https://pubs.opengroup.org/onlinepubs/009696799/basedefs/stdint.h.html
Also it's even clearer here in the autoconf doc :
https://www.gnu.org/software/autoconf/manual/autoconf-2.61/html_node/Header-Portability.html
"The C99 standard says that inttypes.h includes stdint.h, so there's
no need to include stdint.h separately in a standard environment.
Some implementations have inttypes.h but not stdint.h (e.g., Solaris
7), but we don't know of any implementation that has stdint.h but not
inttypes.h"
At the moment the situation with activity measurement is quite tricky
because the struct activity is defined in global.h and declared in
haproxy.c, with operations made in time.h and relying on freq_ctr
which are defined in freq_ctr.h which itself includes time.h. It's
barely possible to touch any of these files without breaking all the
circular dependency.
Let's move all this stuff to activity.{c,h} and be done with it. The
measurement of active and stolen time is now done in a dedicated
function called just after tv_before_poll() instead of mixing the two,
which used to be a lazy (but convenient) decision.
No code was changed, stuff was just moved around.
The purpose is to detect if threads or processes are competing for the
same CPU. This can happen when threads are incorrectly bound, or after a
reload if the previous process still has an important activity. With
threads this situation is problematic because a preempted thread holding
a lock will block other ones waiting for this lock to be released.
A first attempt consisted in measuring the cumulated lost time more
precisely but the system's scheduler is smart enough to try to limit the
thread preemption rate by mostly context switching during poll()'s blank
periods, so most of the time lost is not seen. In essence this is good
because it means a thread is not preempted with a lock held, and even
regarding the rendez-vous point it cannot prevent the other ones from
making progress. But still it happens tens to hundreds of times per
second that a thread might be preempted, so it's still possible to detect
that the situation is happening, thus it's interesting to measure and
report its frequency.
Each time we enter the poller, we check the CPU time spent working and
see if we've lost time doing something else. To limit false positives,
we're only interested in losses of 500 microseconds or more (i.e. half
a clock tick on a 1 kHz system). If so, it indicates that some time was
stolen by another thread or process. Note that we purposely store some
sub-millisecond counters so that under heavy traffic with a 1 kHz clock,
it's still possible to measure something without being subject to the
risk of rounding errors (i.e. if exactly 1 ms is stolen it's possible
that the time difference could often be slightly lower).
This counter of lost CPU time slots time is reported in "show activity"
in numbers of milliseconds of CPU lost per second, per 15s, and total
over the process' life. By definition, the per-second counter cannot
report values larger than 1000 per thread per second and the 15s one
will be limited to 15000/s in the worst case, but it's possible that
peak values exceed such thresholds after long pauses.
In the time offset calculation loop, we ensure we only commit the new
date once it's futher in the future than the current one. However there
is a small issue here on 32-bit platforms : if global_now is written in
two cycles by another thread, starting with the tv_sec part, and the
current thread reads it in the middle of a change, it may compute a
wrong "adjusted" value on the first round, with the new (larger) tv_sec
and the old (large) tv_usec. This will be detected as the CAS will fail,
and another attempt will be made, but this time possibly with too large
an adusted value, pushing the date further than needed (at worst almost
one second).
This patch addresses this by using a temporary adjusted time in the loop
that always restarts from the last known one, and by assigning the result
to the final value only once the CAS succeeds.
The impact is very limited, it may cause the time to advance in small
jumps on 32 bit platforms and in the worst case some timeouts might
expire 1 second too early.
This fix should be backported to 1.8.
During high loads it becomes visible that the time drifts between threads,
sometimes showing tens of seconds after several minutes. The root cause is
the per-thread correction which is performed based on a local offset and
local time. But we can't use a unique global time either as we need the
thread-local time to be stable between two poll() calls.
This commit takes a stab at this problem by proceeding this way :
- a global "global_now" date is monotonous and common between all threads.
- each thread has its own local <now> which is resynced with <global_now>
on each invocation of tv_update_date()
- each thread detects its own drift based on its poll() timeout and its
local <now>, and recalculates its adjusted local time
- each thread then ensures its new local time is no older than the current
global time, otherwise it readjusts its local time to match this one
- finally threads do atomically update the global time to match its own
local one
This guarantees a monotonous global time and a monotonous+stable local time.
It is still possible by definition for two threads to report a minor time
variation on subsequent events but that variation will only be caused by
the moment they watched the time and are very small. When a common global
time is needed between all threads, global_now could be used as a reference
(with care). The wallclock time used in logs is still <date> anyway.
With threads, it became mandatory to implement a thread-local time with
its own correction. However, it was noticed that during high thread
contention, the time correction could occasionally be wrong, reporting
huge negative or positive timers in logs. This was caused by the
conversion between struct timeval and a single 64-bit offset, due to
an erroneous shift and due to a loss of sign during the conversion.
Given that time_t is not always signed, and that timeval is not really
needed here, better avoid playing dangerous games with these operations
and use a single 64-bit offset representing a signed 32-bit offset, for
the seconds part and an unsigned offset for the microsecond part.
It still supports atomic updates and doesn't cause issues anymore.
In function tv_update_date, we keep an offset reprenting the time deviation to
adjust the system time. At every call, we check if this offset must be updated
or not. Of course, It must be shared by all threads. It was store in a
timeval. But it cannot be atomically updated. So now, instead, we store it in a
64-bits integer. And in tv_update_date, we convert this integer in a
timeval. Once updated, it is converted back in an integer to be atomically
stored.
To store a tv_offset into an integer, we use 32 bits from tv_sec and 32 bits
tv_usec to avoid shift operations.
When a frontend is rate-limited to 1000 connections per second, the
effective rate measured from the client is 999/s, and connections
experience an average response time of 99.5 ms with a standard
deviation of 2 ms.
The reason for this inaccuracy is that when computing frequency
counters, we use one part of the previous value proportional to the
number of milliseconds remaining in the current second. But even the
last millisecond still uses a part of the past value, which is wrong :
since we have a 1ms resolution, the last millisecond must be dedicated
only to filling the current second.
So we slightly adjust the algorithm to use 999/1000 of the past value
during the first millisecond, and 0/1000 of the past value during the
last millisecond. We also slightly improve the computation by computing
the remaining time instead of the current time in tv_update_date(), so
that we don't have to negate the value in each frequency counter.
Now with the fix, the connection rate measured by both the client and
haproxy is a steady 1000/s, the average response time measured is 99.2ms
and more importantly, the standard deviation has been divided by 3 to
0.6 millisecond.
This fix should also be backported to 1.4 which has the same issue.
We now measure the work and idle times in order to report the idle
time in the stats. It's expected that we'll be able to use it at
other places later.
Several algorithms will need to know the millisecond value within
the current second. Instead of doing a divide every time it is needed,
it's better to compute it when it changes, which is when now and now_ms
are recomputed.
curr_sec_ms_scaled is the same multiplied by 2^32/1000, which will be
useful to compute some ratios based on the position within last second.
This new time value will be used to compute timeouts and wait queue
positions. The operation is made once for all when time is retrieved.
A future improvement might consist in having it in ticks of 1/1024
second and to convert all timeouts into ticks.
The first implementation of the monotonic clock did not verify
forward jumps. The consequence is that a fast changing time may
expire a lot of tasks. While it does seem minor, in fact it is
problematic because most machines which boot with a wrong date
are in the past and suddenly see their time jump by several
years in the future.
The solution is to check if we spent more apparent time in
a poller than allowed (with a margin applied). The margin
is currently set to 1000 ms. It should be large enough for
any poll() to complete.
Tests with randomly jumping clock show that the result is quite
accurate (error less than 1 second at every change of more than
one second).
If the system date is set backwards while haproxy is running,
some scheduled events are delayed by the amount of time the
clock went backwards. This is particularly problematic on
systems where the date is set at boot, because it seldom
happens that health-checks do not get sent for a few hours.
Before switching to use clock_gettime() on systems which
provide it, we can at least ensure that the clock is not
going backwards and maintain two clocks : the "date" which
represents what the user wants to see (mostly for logs),
and an internal date stored in "now", used for scheduled
events.
AIX does not know about MSG_DONTWAIT. Fortunately, nearly all sockets
are already set to O_NONBLOCK, so it's not even required to change the
code. It was only necessary to add this fcntl to the log socket which
lacked it. The MSG_DONTWAIT value has been defined to zero when unset
in order to make the code cleaner and more portable.
Also, on AIX, "hz" is defined, which causes a problem with one function
parameter in time.c. It's enough to rename the parameter there. Last,
fix a missing #include <string.h> in proxy.c.
Hello,
This patch implements new statistics for SLA calculation by adding new
field 'Dwntime' with total down time since restart (both HTTP/CSV) and
extending status field (HTTP) or inserting a new one (CSV) with time
showing how long each server/backend is in a current state. Additionaly,
down transations are also calculated and displayed for backends, so it is
possible to know how many times selected backend was down, generating "No
server is available to handle this request." error.
New information are presentetd in two different ways:
- for HTTP: a "human redable form", one of "100000d 23h", "23h 59m" or
"59m 59s"
- for CSV: seconds
I believe that seconds resolution is enough.
As there are more columns in the status page I decided to shrink some
names to make more space:
- Weight -> Wght
- Check -> Chk
- Down -> Dwn
Making described changes I also made some improvements and fixed some
small bugs:
- don't increment s->health above 's->rise + s->fall - 1'. Previously it
was incremented an then (re)set to 's->rise + s->fall - 1'.
- do not set server down if it is down already
- do not set server up if it is up already
- fix colspan in multiple places (mostly introduced by my previous patch)
- add missing "status" header to CSV
- fix order of retries/redispatches in server (CSV)
- s/Tthen/Then/
- s/server/backend/ in DATA_ST_PX_BE (dumpstats.c)
Changes from previous version:
- deal with negative time intervales
- don't relay on s->state (SRV_RUNNING)
- little reworked human_time + compacted format (no spaces). If needed it
can be used in the future for other purposes by optionally making "cnt"
as an argument
- leave set_server_down mostly unchanged
- only little reworked "process_chk: 9"
- additional fields in CSV are appended to the rigth
- fix "SEC" macro
- named arguments (human_time, be_downtime, srv_downtime)
Hope it is OK. If there are only cosmetic changes needed please fill free
to correct it, however if there are some bigger changes required I would
like to discuss it first or at last to know what exactly was changed
especially since I already put this patch into my production server. :)
Thank you,
Best regards,
Krzysztof Oledzki
The timeout functions were difficult to manipulate because they were
rounding results to the millisecond. Thus, it was difficult to compare
and to check what expired and what did not. Also, the comparison
functions were heavy with multiplies and divides by 1000. Now, all
timeouts are stored in timevals, reducing the number of operations
for updates and leading to cleaner and more efficient code.
tv_cmp2_ms handles multiple combinations of tv1 and tv2, but only
one form is used: (tv1 <= tv2). So it is overkill to use it everywhere.
A new function designed to do exactly this has been written for that
purpose: tv_cmp2_le. Also, removed old unused tv_* functions.
The fact that TV_ETERNITY was 0 was very awkward because it
required that comparison functions handled the special case.
Now it is ~0 and all comparisons are performed on unsigned
values, so that it is naturally greater than any other value.
A performance gain of about 2-5% has been noticed.
The new rbtree-based scheduler makes heavy use of tv_cmp2(), and
this function becomes a huge CPU eater. Refine it a little bit in
order to slightly reduce CPU usage.
Some of the tv_* functions are called very often. Passing their
arguments as registers is quite faster. This can be disabled
by setting CONFIG_HAP_DISABLE_REGPARM.
As suggested by Markus Elfring, a few "const char *" have replaced
some "char *" declarations where a function is not expected to
modify a value. It does not change the code but it helps detecting
coding errors.
The files are now stored under :
- include/haproxy for the generic includes
- include/types.h for the structures needed within prototypes
- include/proto.h for function prototypes and inline functions
- src/*.c for the C files
Most include files are now covered by LGPL. A last move still needs
to be done to put inline functions under GPL and not LGPL.
Version has been set to 1.3.0 in the code but some control still
needs to be done before releasing.