It looks like we'll need:
- one shared timers queue for the rare tasks that need to wake up
  anywhere
- one private timers queue per thread
- one global queue per group
- one local queue per thread
And maybe we can simply get rid of any global/shared run queue as
we don't seem to have any task bound to a subset of threads.
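
The layout listed above could be sketched roughly as follows. This is
only an illustration under assumptions: the type names, array sizes and
the two helper functions are hypothetical and do not come from the
existing code, and the queue itself is reduced to a placeholder.

```c
#include <stddef.h>

#define MAX_TGROUPS   4   /* hypothetical limits, for the sketch only */
#define MAX_THREADS  64

/* placeholder: a real queue would be a tree or a list of tasks */
struct queue {
    size_t nb_tasks;
};

struct thread_queues {
    struct queue timers;   /* private timers queue of this thread */
    struct queue rq;       /* local run queue of this thread */
};

static struct queue shared_timers;               /* tasks that may wake up anywhere */
static struct queue group_rq[MAX_TGROUPS];       /* one global run queue per group */
static struct thread_queues thr_q[MAX_THREADS];  /* per-thread queues */

/* tid < 0 means "may wake up on any thread" (assumed convention) */
static struct queue *timers_queue_for(int tid)
{
    return (tid < 0) ? &shared_timers : &thr_q[tid].timers;
}

/* tid < 0 means "any thread of group <grp>" (assumed convention) */
static struct queue *run_queue_for(int grp, int tid)
{
    return (tid < 0) ? &group_rq[grp] : &thr_q[tid].rq;
}
```

If no task ends up bound to a subset of threads, the group_rq[] array
in this sketch is the part that could be dropped.
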
This macro was used both for binding and for lookups. When binding tasks
or FDs, using all_threads_mask instead is better as it will later be per
group. For lookups, ~0UL always does the job. Thus in practice the macro
was already almost unused, since the rest of the code could run fine
with a constant of all ones there.
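
To illustrate the difference between the two uses, here is a minimal
sketch. It assumes a 4-thread process and a simplified task structure;
all_threads_mask is redefined locally as a stand-in for the real
variable, and both helper functions are hypothetical.

```c
/* local stand-in for the real variable: 4 threads assumed */
static unsigned long all_threads_mask = 0x0fUL;

/* simplified task: only the binding mask is kept */
struct task {
    unsigned long thread_mask;
};

/* binding: use the mask of threads that really exist */
static void task_bind_all(struct task *t)
{
    t->thread_mask = all_threads_mask;
}

/* lookup: and-ing with a mask of all ones leaves the binding mask
 * unchanged, so ~0UL is enough on the caller's side. tid must be
 * lower than the number of bits in an unsigned long.
 */
static int task_runs_on(const struct task *t, unsigned long lookup_mask, int tid)
{
    return (t->thread_mask & lookup_mask & (1UL << tid)) != 0;
}
```

Here task_runs_on(&t, ~0UL, tid) returns the same result as it would
with any wider-than-needed mask, which is why a constant suffices for
lookups while binding needs the real set of threads.
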
Instead of having a global mask of all the profiled threads, let's have
one flag per thread in each thread's flags. They are never accessed more
than one at a time and are better located inside the threads' contexts for
both performance and scalability.
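
A minimal sketch of the per-thread flag approach, assuming C11 atomics.
The flag name TH_FL_PROFILING, the thread_ctx layout and the helper
names are hypothetical and only serve the illustration.

```c
#include <stdatomic.h>

#define MAX_THREADS      64           /* hypothetical limit */
#define TH_FL_PROFILING  0x00000001u  /* hypothetical flag name */

/* per-thread context: the profiling bit lives with the other flags */
struct thread_ctx {
    atomic_uint flags;
};

static struct thread_ctx thr_ctx[MAX_THREADS];

/* may be called from another thread, hence the atomic update */
static void profiling_set(int tid, int on)
{
    if (on)
        atomic_fetch_or(&thr_ctx[tid].flags, TH_FL_PROFILING);
    else
        atomic_fetch_and(&thr_ctx[tid].flags, ~TH_FL_PROFILING);
}

/* each thread only needs to look at its own context */
static int profiling_enabled(int tid)
{
    return (atomic_load(&thr_ctx[tid].flags) & TH_FL_PROFILING) != 0;
}
```

With this layout a thread checking its own state only touches its own
context, instead of reading a word that is shared by all threads.
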