On process startup, the CPU topology of the machine is inspected. If a
multi-socket CPU machine is detected, automatically define the process
affinity on the first node with active cpus. This is done to prevent an
impact on the overall performance of the process in case the topology of
the machine is unknown to the user.
This step is not executed in the following condition :
- a non-null nbthread statement is present
- a restrictive 'cpu-map' statement is present
- the process affinity is already restricted, for example via a taskset
call
For the record, benchmarks were executed on a machine with 2 CPUs
Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz. In both clear and ssl
scenario, the performance were sub-optimal without the automatic
rebinding on a single node.
Allow to specify multiple cpu ids/ranges in parse_cpu_set separated by a
comma. This is optional and must be activated by a parameter.
The comma support is disabled for the parsing of the 'cpu-map' config
statement. However, it will be useful to parse files in sysfs when
inspecting the cpus topology for NUMA automatic process binding.
Create a function thread_cpu_mask_forced. Its purpose is to report if a
restrictive cpu mask is active for the current proces, for example due
to a taskset invocation. It is only implemented for the linux platform
currently.
Use the platform independent type hap_cpuset for the cpu-map statement
parsing. This allow to address CPU index greater than LONGBITS.
Update the documentation to reflect the removal of this limit except for
platforms without cpu_set_t type or equivalent.
Replace the unsigned long parameter by a hap_cpuset. This allows to
address CPU with index greater than LONGBITS.
This function is used to parse the 'cpu-map' statement. However at the
moment, the result is casted back to a long to store it in the global
structure. The next step is to replace ulong in in cpu_map in the
global structure with hap_cpuset.
This module can be used to manipulate a cpu sets in a platform agnostic
way. Use the type cpu_set_t/cpuset_t if available on the platform, or
fallback to unsigned long, which limits de facto the maximum cpu index
to LONGBITS.
As we now embed the library we don't need to support the older 1.0 API
any more, so we can remove the explicit calls to slz_make_crc_table()
and slz_prepare_dist_table().
Now that SLZ is merged, let's update the makefile and compression
files to use it. As a result, SLZ_INC and SLZ_LIB are neither defined
nor used anymore.
USE_SLZ is enabled by default ("USE_SLZ=default") and can be disabled
by passing "USE_SLZ=" or by enabling USE_ZLIB=1.
The doc was updated to reflect the changes.
SLZ is rarely packaged by distros and there have been complaints about
the CPU and memory usage of ZLIB, leading to some suggestions to better
address the issue by simply integrating SLZ into the tree (just 3 files).
See discussions below:
https://www.mail-archive.com/haproxy@formilux.org/msg38037.htmlhttps://www.mail-archive.com/haproxy@formilux.org/msg40079.htmlhttps://www.mail-archive.com/haproxy@formilux.org/msg40365.html
This patch does just this, after minor adjustments to these files:
- tables.h was renamed to slz-tables.h
- tables.h had the precomputed tables removed since not used here
- slz.c uses includes <import/slz*> instead of "slz*.h"
The slz commit imported here was b06c172 ("slz: avoid a build warning
with -Wimplicit-fallthrough"). No other change was performed either to
SLZ nor to haproxy at this point so that this operation may be replicated
if needed for a future version.
This normalizer removes "/./" segments from the path component.
Usually the dot refers to the current directory which renders those segments redundant.
See GitHub Issue #714.
The current "ADD" vs "ADDQ" is confusing because when thinking in terms
of appending at the end of a list, "ADD" naturally comes to mind, but
here it does the opposite, it inserts. Several times already it's been
incorrectly used where ADDQ was expected, the latest of which was a
fortunate accident explained in 6fa922562 ("CLEANUP: stream: explain
why we queue the stream at the head of the server list").
Let's use more explicit (but slightly longer) names now:
LIST_ADD -> LIST_INSERT
LIST_ADDQ -> LIST_APPEND
LIST_ADDED -> LIST_INLIST
LIST_DEL -> LIST_DELETE
The same is true for MT_LISTs, including their "TRY" variant.
LIST_DEL_INIT keeps its short name to encourage to use it instead of the
lazier LIST_DELETE which is often less safe.
The change is large (~674 non-comment entries) but is mechanical enough
to remain safe. No permutation was performed, so any out-of-tree code
can easily map older names to new ones.
The list doc was updated.
Amaury noticed that I managed to break the build of DEBUG_FAIL_ALLOC
for the second time with 207c09509 ("MINOR: pools: move the fault
injector to __pool_alloc()"). The joy of endlessly reworking patch
sets... No backport is needed, that was in the just merged cleanup
series.
By having a pair of dummy pool_get_from_cache() and pool_put_to_cache()
we can remove some ugly ifdefs, so let's do this. We've already done it
for the shared cache.
This function has become too big (251 bytes) and is now hurting
performance a lot, with up to 4% request rate being lost over the last
pool changes. Let's move it to pool.c as a regular function. Other
attempts were made to cut it in half but it's still inefficient. Doing
this results in saving ~90kB of object code, and even 112kB since the
pool changes, with code that is even slightly faster!
Conversely, pool_get_from_cache(), which remains half of this size, is
still faster inlined, likely in part due to the immediate use of the
returned pointer afterwards.
Since pool_get_from_cache() and pool_put_to_cache() were now only wrappers
to the local cache versions which do all the job, let's merge them together
so that there is no more local-cache specific function.
Now when pool_get_from_local_cache() fails, it automatically falls back
to pool_get_from_shared_cache(), which used to always be done in
pool_get_from_cache(). Thus now the API is simpler as we always allocate
and free from/to the local caches.
Till now it used to call it only if there were not too many objects into
the local cache otherwise would send the latest one directly into the
shared cache. Now it always sends to the local cache and it's up to the
local cache to free its oldest objects. From a cache freshness perspective
it's better this way since we always evict cold objects instead of hot
ones. From an API perspective it's better because it will help make the
shared cache invisible to the public API.
These two functions are now responsible for allocating directly from
the cache and releasing to the cache.
Now the pool_alloc() function simply does this:
if cache enabled
return pool_alloc_from_cache() if no NULL
return pool_alloc_nocache() otherwise
and the pool_free() function does this:
if cache enabled
pool_put_to_cache()
else
pool_free_nocache()
For now this only introduces these two functions without changing anything
else, but the goal is to soon allow to make them implementation-specific.
Continuing the unification of local and shared pools, now the usage of
pools is governed by CONFIG_HAP_POOLS without which allocations and
releases are performed directly from the OS using pool_alloc_nocache()
and pool_free_nocache().
There are two levels of freeing to the OS:
- code that wants to keep the pool's usage counters updated uses
pool_free_area() and handles the counters itself. That's what
pool_put_to_shared_cache() does in the no-global-pools case.
- code that does not want to update the counters because they were
already updated only calls pool_free_area().
Let's extract these calls to establish the symmetry with pool_get_from_os()
and pool_alloc_nocache(), resulting in pool_put_to_os() (which only updates
the allocated counter) and pool_free_nocache() (which also updates the used
counter). This will later allow to simplify the generic code.
Calling pool_free_area() inside a lock in pool_put_to_shared_cache() is
a very bad idea. Fortunately this only happens on the lowest end platforms
which almost never use threads or in very small counts.
This change consists in zeroing the pointer once already released to the
cache in the first test so that the second stage knows if it needs to
pass it to the OS or not. This has slightly reduced the length of the
A part of the code cannot be factored out because it still uses non-atomic
inc/dec for pool->used and pool->allocated as these are located under the
pool's lock. While it can make sense in terms of bus cycles, it does not
make sense in terms of code normalization. Further, some operations were
still performed under a lock that could be totally removed via the use of
atomic ops.
There is still one occurrence in pool_put_to_shared_cache() in the locked
code where pool_free_area() is called under the lock, which must absolutely
be fixed.
Now there's one part dealing with the allocation itself and keeping
counters up to date, and another one on top of it to return such an
allocated pointer to the user and update the use count and stats.
This is in anticipation for being able to group cache-related parts.
The release code is still done at once.
Till now it was limited to objects allocated from the OS which means
it had little use as soon as pools were enabled. Let's move it upper
in the layers so that any code can benefit from fault injection. In
addition this allows to pass a new flag POOL_F_NO_FAIL to disable it
if some callers prefer a no-failure approach.
Now the multi-level cache becomes more visible:
pool_get_from_local_cache()
pool_put_to_local_cache()
pool_get_from_shared_cache()
pool_put_to_shared_cache()
The functions were rightfully called from/to_cache when the thread-local
cache was considered as the only cache, but this is getting terribly
confusing. Let's call them from/to local_cache to make it clear that
it is not related with the shared cache.
As a side note, since pool_evict_from_cache() used not to work for a
particular pool but for all of them at once, it was renamed to
pool_evict_from_local_caches() (plural form).
This is exactly what it is, the entry is retrieved from the shared
cache when it is defined. The implementation that is enabled with
CONFIG_HAP_NO_GLOBAL_POOLS continues to return NULL.
Now that __pool_alloc() only surrounds __pool_get_first() with the lock,
let's move it to the only variant that requires it and remove the ugly
ifdefs from the function. This is safe because nobody else calls this
function.
In __pool_alloc(), historically we used to use factor out the
pool's lock between __pool_get_first() and __pool_refill_alloc(),
resulting in real malloc() or mmap() calls being performed under
the pool lock (for platforms using the locked shared pools).
As this is not needed anymore, let's move the call out of the
lock, it may improve allocation patterns on some platforms. This
also makes __pool_alloc() cleaner as we see a first attempt to
allocate from the local cache, then a second from the shared
cache then a reall allocation.
They were strictly equivalent, let's remerge them and rename them to
pool_alloc_nocache() as it's the call which performs a real allocation
which does not check nor update the cache. The only difference in the
past was the former taking the lock and not the second but now the lock
is not needed anymore at this stage since the pool's list is not touched.
In addition, given that the "avail" argument is no longer used by the
function nor by its callers, let's drop it.
This is a first step towards unifying all the fallback code. Right now
these two functions are the only ones which do not update the needed_avg
rate counter since there's currently no shared pool kept when using them.
But their code is similar to what could be used everywhere except for
this one, so let's make them capable of maintaining usage statistics.
As a side effect the needed field in "show pools" will now be populated.
The mem_should_fail() call enabled by DEBUG_FAIL_ALLOC used to be placed
only in the no-cache version of the allocator. Now we can generalize it
to all modes and remove the exclusive test on CONFIG_HAP_NO_GLOBAL_POOLS.
We're going to make the local pool always present unless pools are
completely disabled. This means that pools are always enabled by
default, regardless of the use of threads. Let's drop this notion
of "local" pools and make it just "pool". The equivalent debug
option becomes DEBUG_NO_POOLS instead of DEBUG_NO_LOCAL_POOLS.
For now this changes nothing except the option and dropping the
dependency on USE_THREAD.
Initially per-thread pool caches were stored into a fixed-size array.
But this was a bit ugly because the last allocated pools were not able
to benefit from the cache at all. As a work around to preserve
performance, a size of 64 cacheable pools was set by default (there
are 51 pools at the moment, excluding any addon and debugging code),
so all in-tree pools were covered, at the expense of higher memory
usage.
In addition an index had to be calculated for each pool, and was used
to acces the pool cache head into that array. The pool index was not
even stored into the pools so it was required to determine it to access
the cache when the pool was already known.
This patch changes this by moving the pool cache head into the pool
head itself. This way it is certain that each pool will have its own
cache. This removes the need for index calculation.
The pool cache head is 32 bytes long so it was aligned to 64B to avoid
false sharing between threads. The extra cost is not huge (~2kB more
per pool than before), and we'll make better use of that space soon.
The pool cache head contains the size, which should probably be removed
since it's already in the pool's head.
In commit fb117e6a8 ("MEDIUM: memory: don't let pool_put_to_cache() free
the objects itself") pool_evict_from_cache() was introduced with no
argument, yet the only call place passes it the pool, the pointer and
the index number!
Let's remove these as they even let the reader think that the function
does something specific to the current pool while it's not the case.
This patch renames all existing uri-normalizers into a more consistent naming
scheme:
1. The part of the URI that is being touched.
2. The modification being performed as an explicit verb.
This normalizer merges `../` path segments with the predecing segment, removing
both the preceding segment and the `../`.
Empty segments do not receive special treatment. The `merge-slashes` normalizer
should be executed first.
See GitHub Issue #714.
When a thread ends its harmeless period, we must only consider running
threads when testing threads_want_rdv_mask mask. To do so, we reintroduce
all_threads_mask mask in the bitwise operation (It was removed to fix a
deadlock).
Note that for now it is useless because there is no way to stop threads or
to have threads reserved for another task. But it is safer this way to avoid
bugs in the future.
A previous patch was pushed to fix a deadlock when an isolated thread ends
its harmless period (a9a9e9aac ["BUG/MEDIUM: thread: Fix a deadlock if an
isolated thread is marked as harmless"]). But, unfortunately, the fix is
incomplete. The same must be done in the outer loop, in
thread_harmless_end() function. The current thread must be ignored when
threads_want_rdv_mask mask is tested.
This patch must also be backported as far as 2.0.
This library is required for the subsequent patch which adds
the JSON query possibility.
It is necessary to change the include statement in "src/mjson.c"
because the imported includes in haproxy are in "include/import"
orig: #include "mjson.h"
new: #include <import/mjson.h>