When DEBUG_FD is set at build time, we'll keep a counter of per-FD events
in the fdtab. This counter is reported in "show fd" even for closed FDs if
not zero. The purpose is to help spot situations where an apparently closed
FD continues to be reported in loops, or where some events are dismissed.
We have poll_drop, poll_dead and poll_skip which are confusingly named
like their poll_io and poll_exp counterparts except that they are not
per poll() call but per-fd. This patch renames them to poll_drop_fd(),
poll_dead_fd() and poll_skip_fd() for this reason.
The "show activity" output mentions a number of indicators to explain
wake up reasons but doesn't have the number of times poll() sees some
I/O. And given that multiple events can happen simultaneously, it's
not always possible to deduce this metric by subtracting.
This patch adds a new "poll_io" counter that allows one to see how
often poll() returns with at least one active FD. This should help
detect stuck events and measure various ratios of poll sub-metrics.
This patch fixes all the leftovers from the include cleanup campaign. There
were not that many (~400 entries in ~150 files) but it was definitely worth
doing it as it revealed a few duplicates.
Since these are used as type attributes or conditional clauses, they
are used about everywhere and should not require a dependency on
thread.h. Moving them to compiler.h along with other similar statements
like ALIGN() etc looks more logical; this way they become part of the
base API. This allowed to remove thread-t.h from ~12 files, one was
found to only require thread-t and not thread and dict.c was found to
require thread.h.
global.h was one of the messiest files, it has accumulated tons of
implicit dependencies and declares many globals that make almost all
other file include it. It managed to silence a dependency loop between
server.h and proxy.h by being well placed to pre-define the required
structs, forcing struct proxy and struct server to be forward-declared
in a significant number of files.
It was split in to, one which is the global struct definition and the
few macros and flags, and the rest containing the functions prototypes.
The UNIX_MAX_PATH definition was moved to compat.h.
A few includes were missing in each file. A definition of
struct polled_mask was moved to fd-t.h. The MAX_POLLERS macro was
moved to defaults.h
Stdio used to be silently inherited from whatever path but it's needed
for list_pollers() which takes a FILE* and which can thus not be
forward-declared.
This moves types/activity.h to haproxy/activity-t.h and
proto/activity.h to haproxy/activity.h.
The macros defining the bit field values for the profiling variable
were moved to the type file to be more future-proof.
This one is included almost everywhere and used to rely on a few other
.h that are not needed (unistd, stdlib, standard.h). It could possibly
make sense to split it into multiple parts to distinguish operations
performed on timers and the internal time accounting, but at this point
it does not appear much important.
This splits the hathreads.h file into types+macros and functions. Given
that most users of this file used to include it only to get the definition
of THREAD_LOCAL and MAXTHREADS, the bare minimum was placed into thread-t.h
(i.e. types and macros).
All the thread management was left to haproxy/thread.h. It's worth noting
the drop of the trailing "s" in the name, to remove the permanent confusion
that arises between this one and the system implementation (no "s") and the
makefile's option (no "s").
For consistency, src/hathreads.c was also renamed thread.c.
A number of files were updated to only include thread-t which is the one
they really needed.
Some future improvements are possible like replacing empty inlined
functions with macros for the thread-less case, as building at -O0 disables
inlining and causes these ones to be emitted. But this really is cosmetic.
All files that were including one of the following include files have
been updated to only include haproxy/api.h or haproxy/api-t.h once instead:
- common/config.h
- common/compat.h
- common/compiler.h
- common/defaults.h
- common/initcall.h
- common/tools.h
The choice is simple: if the file only requires type definitions, it includes
api-t.h, otherwise it includes the full api.h.
In addition, in these files, explicit includes for inttypes.h and limits.h
were dropped since these are now covered by api.h and api-t.h.
No other change was performed, given that this patch is large and
affects 201 files. At least one (tools.h) was already freestanding and
didn't get the new one added.
This used to be a minor optimization on ix86 where registers are scarce
and the calling convention not very efficient, but this platform is not
relevant enough anymore to warrant all this dirt in the code for the sake
of saving 1 or 2% of performance. Modern platforms don't use this at all
since their calling convention already defaults to using several registers
so better get rid of this once for all.
In practice it's all pollers except select(). It turns out that we're
keeping some legacy code only for select and enforcing it on all
pollers, let's offer the pollers the ability to declare that they
do not need that.
When we have a EVFILT_READ event, an optimization was made, and the FD was
not reported as ready to receive if there were no data available. That way,
if the socket was closed by our peer (the EV8EOF flag was set), and there were
no remaining data to read, we would just close(), and avoid doing a recv().
However, it may be fine for TCP socket, but it is not for UDP.
If we send data via UDP, and we receive an error, the only way to detect it
is to attempt a recv(). However, in this case, kevent() will report a read
event, but with no data, so we'd just ignore that read event, nothing would be
done about it, and the poller would be woken up by it over and over.
To fix this, report read events if either we have data, or the EV_EOF flag
is not set.
This should be backported to 2.1, 2.0, 1.9 and 1.8.
As mentioned in previous commit, these flags do not map well to
modern poller capabilities. Let's use the FD_EV_*_{R,W} flags instead.
This first patch only performs a 1-to-1 mapping making sure that the
previously reported flags are still reported identically while using
the closest possible semantics in the pollers.
It's worth noting that kqueue will now support improvements such as
returning distinctions between shut and errors on each direction,
though this is not exploited for now.
Since commit 7ac0e35f2 in 1.9-dev1 ("MAJOR: fd: compute the new fd polling
state out of the fd lock") we've started to update the FD POLLED bit a
bit more aggressively. Lately with the removal of the FD cache, this bit
is always equal to the ACTIVE bit. There's no point continuing to watch
it and update it anymore, all it does is create confusion and complicate
the code. One interesting side effect is that it now becomes visible that
all fd_*_{send,recv}() operations systematically call updt_fd_polling(),
except fd_cant_recv()/fd_cant_send() which never saw it change.
In the poller code, instead of just remembering if we're currently polling
a fd or not, remember if we're polling it for writing and/or for reading, that
way, we can avoid to modify the polling if it's already polled as needed.
Now that the architecture was changed so that attempts to receive/send data
always come from the upper layers, instead of them only trying to do so when
the lower layer let them know they could try, we can finally get rid of the
fd cache. We don't really need it anymore, and removing it gives us a small
performance boost.
We have been abusing the do_poll()'s timeout for a while, making it zero
whenever there is some known activity. The problem this poses is that it
complicates activity diagnostic by incrementing the poll_exp field for
each known activity. It also requires extra computations that could be
avoided.
This change passes a "wake" argument to say that the poller must not
sleep. This simplifies the operations and allows one to differenciate
expirations from activity.
In some situations, especially when dealing with low latency on processors
supporting a variable frequency or when running inside virtual machines,
each time the process waits for an I/O using the poller, the processor
goes back to sleep or is offered to another VM for a long time, and it
causes excessively high latencies.
A solution to this provided by this patch is to enable busy polling using
a global option. When busy polling is enabled, the pollers never sleep and
loop over themselves waiting for an I/O event to happen or for a timeout
to occur. On multi-processor machines it can significantly overheat the
processor but it usually results in much lower latencies.
A typical test consisting in injecting traffic over a single connection at
a time over the loopback shows a bump from 4640 to 8540 connections per
second on forwarded connections, indicating a latency reduction of 98
microseconds for each connection, and a bump from 12500 to 21250 for
locally terminated connections (redirects), indicating a reduction of
33 microseconds.
It is only usable with epoll and kqueue because select() and poll()'s
API is not convenient for such usages, and the level of performance they
are used in doesn't benefit from this anyway.
The option, which obviously remains disabled by default, can be turned
on using "busy-polling" in the global section, and turned off later
using "no busy-polling". Its status is reported in "show info" to help
troubleshooting suspicious CPU spikes.
At the moment the situation with activity measurement is quite tricky
because the struct activity is defined in global.h and declared in
haproxy.c, with operations made in time.h and relying on freq_ctr
which are defined in freq_ctr.h which itself includes time.h. It's
barely possible to touch any of these files without breaking all the
circular dependency.
Let's move all this stuff to activity.{c,h} and be done with it. The
measurement of active and stolen time is now done in a dedicated
function called just after tv_before_poll() instead of mixing the two,
which used to be a lazy (but convenient) decision.
No code was changed, stuff was just moved around.
By placing this code into time.h (tv_entering_poll() and tv_leaving_poll())
we can remove the logic from the pollers and prepare for extending this to
offer more accurate time measurements.
The 4 pollers all contain the same code used to compute the poll timeout.
This is pointless, let's centralize this into fd.h. This also gets rid of
the useless SCHEDULER_RESOLUTION macro which used to work arond a very old
linux 2.2 bug causing select() to wake up slightly before the timeout.
In _update_fd(), if the fd wasn't polled, and we don't want it to be polled,
we just returned 0, however, we should return changes instead, or all previous
changes will be lost.
This should be backported to 1.8.
The current synchronization point enforces certain restrictions which
are hard to workaround in certain areas of the code. The fact that the
critical code can only be called from the sync point itself is a problem
for some callback-driven parts. The "show fd" command for example is
fragile regarding this.
Also it is expensive in terms of CPU usage because it wakes every other
thread just to be sure all of them join to the rendez-vous point. It's a
problem because the sleeping threads would not need to be woken up just
to know they're doing nothing.
Here we implement a different approach. We keep track of harmless threads,
which are defined as those either doing nothing, or doing harmless things.
The rendez-vous is used "for others" as a way for a thread to isolate itself.
A thread then requests to be alone using thread_isolate() when approaching
the dangerous area, and then waits until all other threads are either doing
the same or are doing something harmless (typically polling). The function
only returns once the thread is guaranteed to be alone, and the critical
section is terminated using thread_release().
When composing the event list for subscribe to kqueue events, the index
where the new event is added must be after the previous events, as such
the changes counter should continue counting.
This caused haproxy to accept connections but not try read and process
the incoming data.
This patch is for 1.9 only
The polled_mask is only used in the pollers, and removing it from the
struct fdtab makes it fit in one 64B cacheline again, on a 64bits machine,
so make it a separate array.
With the old model, any fd shared by multiple threads, such as listeners
or dns sockets, would only be updated on one threads, so that could lead
to missed event, or spurious wakeups.
To avoid this, add a global list for fd that are shared, using the same
implementation as the fd cache, and only remove entries from this list
when every thread as updated its poller.
[wt: this will need to be backported to 1.8 but differently so this patch
must not be backported as-is]
When adding new events using kevent(), if there's an error, because we're
trying to delete an event that wasn't there, or because the fd has already
been closed, kevent() will either add an event in the eventlist array if
there's enough room for it, and keep on handling other events, or stop and
return -1.
We want it to process all the events, so give it a large-enough array to
store any error.
Special thanks to PiBa-NL for diagnosing the root cause of this bug.
This should be backported to 1.8.
Clearing the update_mask bit in fd_insert may lead to duplicate insertion
of fd in fd_updt, that could lead to a write past the end of the array.
Instead, make sure the update_mask bit is cleared by the pollers no matter
what.
This should be backported to 1.8.
[wt: warning: 1.8 doesn't have the lockless fdcache changes and will
require some careful changes in the pollers]
Maxfd is really only useful to poll() and select(), yet epoll and
kqueue reference it almost by mistake :
- cloning of the initial FDs (maxsock should be used here)
- max polled events, it's maxpollevents which should be used here.
Let's fix these places.
This is the same patch than the previous one ("BUILD: epoll/threads: Add test on
MAX_THREADS to avoid warnings when complied without threads ").
It should be backported in 1.8 with the commit 7a2364d4 ("BUG/MEDIUM:
kqueue/threads: use one kqueue_fd per thread").
in deinit_kqueue_per_thread, kqueue_fd[tid] must be closed, except for the main
thread (the first one, tid==0).
This patch must be backported in 1.8 with commit 7a2364d4.
This is the same principle as the previous patch (BUG/MEDIUM:
epoll/threads: use one epoll_fd per thread) except that this time it's
for kqueue. We don't want all threads to wake up because of activity on
a single other thread that the other ones are not interested in.
Just like with previous patch, this one shows that the polling state
doesn't need to be changed here and that some simplifications are now
possible. This patch only implements the minimum required for a stable
backport.
This should be backported to 1.8.
Since the fd update tables are per-thread, we need to have a bit per
thread to indicate whether an update exists, otherwise this can lead
to lost update events every time multiple threads want to update the
same FD. In practice *for now*, it only happens at start time when
listeners are enabled and ask for polling after facing their first
EAGAIN. But since the pollers are still shared, a lost event is still
recovered by a neighbor thread. This will not reliably work anymore
with per-thread pollers, where it has been observed a few times on
startup that a single-threaded listener would not always accept
incoming connections upon startup.
It's worth noting that during this code review it appeared that the
"new" flag in the fdtab isn't used anymore.
This fix should be backported to 1.8.
A number of counters have been added at special places helping better
understanding certain bug reports. These counters are maintained per
thread and are shown using "show activity" on the CLI. The "clear
counters" commands also reset these counters. The output is sent as a
single write(), which currently produces up to about 7 kB of data for
64 threads. If more counters are added, it may be necessary to write
into multiple buffers, or to reset the counters.
To backport to 1.8 to help collect more detailed bug reports.
kqueue fd's are not shared with children after fork(), so the children
don't have to close them, and it may in fact be dangerous, because we may
end up closing a totally unrelated fd.
[wt: to be backported to 1.8 where master-worker broke on this, and
likely to older versions for completeness]
It was a leftover from the last cleaning session; this mask applies
to threads and calling it process_mask is a bit confusing. It's the
same in fd, task and applets.
There was a flaw in the way the threads was created. the main one was just used
to create all the others and just wait to exit. Now, it is used to run a poll
loop. So we only create nbthread-1 threads.
This also fixes a bug about the compression filter when there is only 1 thread
(nbthread == 1 or no threads support). The bug was in the way thread-local
resources was initialized. per-thread init/deinit callbacks were never called
for the main process. So, with nthread set to 1, some buffers remained
uninitialized.
Many changes have been made to do so. First, the fd_updt array, where all
pending FDs for polling are stored, is now a thread-local array. Then 3 locks
have been added to protect, respectively, the fdtab array, the fd_cache array
and poll information. In addition, a lock for each entry in the fdtab array has
been added to protect all accesses to a specific FD or its information.
For pollers, according to the poller, the way to manage the concurrency is
different. There is a poller loop on each thread. So the set of monitored FDs
may need to be protected. epoll and kqueue are thread-safe per-se, so there few
things to do to protect these pollers. This is not possible with select and
poll, so there is no sharing between the threads. The poller on each thread is
independant from others.
Finally, per-thread init/deinit functions are used for each pollers and for FD
part for manage thread-local ressources.
Now, you must be carefull when a FD is created during the HAProxy startup. All
update on the FD state must be made in the threads context and never before
their creation. This is mandatory because fd_updt array is thread-local and
initialized only for threads. Because there is no pollers for the main one, this
array remains uninitialized in this context. For this reason, listeners are now
enabled in run_thread_poll_loop function, just like the worker pipe.