This benchmark exercizes multi-threaded central free list operations,
and this is where we're losing to a bunch of competing malloc (i.e. which
shard heap).
We now support a set of command line flags similar to "abseil"
benchmark thingy. I.e. to let people specify a subset of benchmarks or
run them longer/shorter as needed.
This commit also includes small, portable and very simple regexp
facility. It isn't good enough for some production use, but it is
plenty good for some testing uses or benchmark selection.
Instead of relying on gperftools-specific tc_XYZ functions for sized
deallocation and memalign we use standard C++ facilities. There are
also other minor improvements like mallocing larger buffers rather
than statically allocating them.
We got bitten again by compiler optimizing out free(malloc(sz))
combination. We replace calls to malloc/free with calls to global
function operator new/delete. Those are apparently forbidden by
standard to be optimized out.
If for some reason benchmark function is too fast (like when it got
optimized out to nothing), we'd previously hang in infinite loop. Now
we'll catch this condition due to iterations count overflowing.
Previously we bumped size by 16 between iterations, but for many size
classess that gave is subsequent iteration into same size
class. Multiplying by prime number randomizes sizes more so speeds up
this benchmark on at least modern x86.
This also makes them output nicer results. I.e. every benchmark is run 3
times and iteration duration is printed for every run.
While this is still very synthetic and unrepresentave of malloc performance
as a whole, it is exercising more situations in tcmalloc fastpath. So it a
step forward.
While this is not good representation of real-world production malloc
behavior, it is representative of length (instruction-wise and well as
cycle-wise) of fast-path. So this is better than nothing.