We got bitten again by compiler optimizing out free(malloc(sz))
combination. We replace calls to malloc/free with calls to global
function operator new/delete. Those are apparently forbidden by
standard to be optimized out.
Previously we bumped size by 16 between iterations, but for many size
classess that gave is subsequent iteration into same size
class. Multiplying by prime number randomizes sizes more so speeds up
this benchmark on at least modern x86.
This also makes them output nicer results. I.e. every benchmark is run 3
times and iteration duration is printed for every run.
While this is still very synthetic and unrepresentave of malloc performance
as a whole, it is exercising more situations in tcmalloc fastpath. So it a
step forward.
While this is not good representation of real-world production malloc
behavior, it is representative of length (instruction-wise and well as
cycle-wise) of fast-path. So this is better than nothing.