This is part of effort to get rid of perl pprof dependency. We're
replacing forking to pprof --symbols with carefully crafted
libbacktrace integration which has enough support for symbolizing
backtraces.
Part of this change is better diagnostic for when/if it fails. And
most important part is compensating for the delay between sampling
parameter is set for the test and when it is actually taken into
account by thread cache Sampler logic.
As a result about 1% of flaking probability has been fixed as we're
getting mean estimate for the allocated size actually same (or about
same) as allocated size.
Update github ticket #1557.
We now support a set of command line flags similar to "abseil"
benchmark thingy. I.e. to let people specify a subset of benchmarks or
run them longer/shorter as needed.
This commit also includes small, portable and very simple regexp
facility. It isn't good enough for some production use, but it is
plenty good for some testing uses or benchmark selection.
Instead of relying on __sbrk (on subset of Linux systems) or invoking
sbrk syscall directly (on subset of FreeBSD systems), we have malloc
invoke special tcmalloc_hooked_sbrk function. Which handles hooking
and then invokes regular system's sbrk. Yes, we loose theoretical
ability to hook into non-tcmalloc uses of sbrk, but we gain portable
simplicity.
The case of mmap and sbrk hooks is simple enough that we can do
simpler "skip right number of frames" approach. Instead of relying on
less portable and brittle attribute section trick.
The comments in this file stated that it has to be linked in specific
order to get initialized early enough. But our initialization is
de-facto via initial malloc hook. And to deal with latest-possible
destruction, we use more convenient destructor function
attribute, and make things simpler.
References github issue #1503.
This significantly reworks both early thread cache access and related
emergency malloc mode checking integration. As a result, we're able to
to rely on emergency malloc even on systems without "good"
TLS (e.g. QNX which does emutls).
One big change here is we're undoing early change to have single
"global" thread cache early during process lifetime. It was nice and
somewhat simpler approach. But because of inherent locking during
early process lifetime, we couldn't be sure of certain lock ordering
aspects w.r.t. backtracing/exception-stack-unwinding. So I choose to
keep it safe. So the new idea is we use SlowTLS facility to find
threads' caches when normal tls isn't ready yet. It avoids holding
locks around potentially recursion-ful things (like
ThreadCache::ModuleInit or growing heap). But we then have to be
careful to "re-attach" those early thread cache instances to regular
TLS. There will nearly always be just one of such thread caches. For
initial thread. But we cannot entirely rule out more general case
where someone creates threads before process initializers ran and
main() is reached. Another notable thing is free-ing memory in this
early mode will always using slow-path deletion directly into central
free list.
SlowTLS facility is "simply" a generalization of previous
CreateBadTLSCache code. I.e. we have a small fixed-size cache that
holds "exceptional" instances of thread-identitity to
thread-cache+emergency-mode-flag mappings.
We also take advantage of tcmalloc::kInvalidTLSKey we introduced
earlier and remove potentially raceful memory ordering between reading
tls_ready_ and tls_key_.
For emergency malloc detection we previously used thread_local
flag. Which we cannot use on !kHaveGoodTLS systems. So we instead
_remove_ thread's cache from it's normal tls storage and place it
"into" SlowTLS instead for the duration of WithStacktraceScope
call (which is how emergency mode is enabled now).
While it still needs more work to be up to highest standards of
cleanness and clarity, we made it better.
One big part of this change is we're now liberated from
tcmalloc_unittest.sh. We now arrange testing of various environment
variable tweaks in C++ by self-execing with updated environment at the
end of tests.
Biggest part of it is removal of SetTestResourceLimit. The reasoning
behind it is, we don't do it on windows. Tests pass just fine. So,
there is no reason to bother.
It is still somewhat broken and somewhat of a mess, but it is much
lesser mess now.
As part of this change I also amputated a number of unnecessary or too
complicated bits. Like attempt to build both static and shared
versions of tcmalloc libraries at the same time. We stick to cmake's
"common" (seemingly) behavior of defaulting to something (in our case
shared library) and letting users override BUILD_SHARED_LIBS.
All the "install" bits are amputated as well, they're not ready.
We now wrap StackTraceScope thingy in tcmalloc-specific parts, instead
of automagically inside every stacktrace.cc function. TCMalloc bits
all need to grab stacktraces via newly introduced
tcmalloc::GrabBacktrace (which handles emergency malloc wrapping).
New approach eliminates the need for doing fake stacktrace scope. CPU
profiler, being distinct .so library couldn't take advantage of
emergency malloc anyways.
This simplifies the build further and eliminates another potential
point of runtime divergence when stacktrace is linked to both
libprofiler and libtcmalloc.
Logic was removed from thread_cache.{h,cc} into
thread_cache_ptr.{h,cc}.
Separation will help possible future evolution, and we already changed
the logic quite a bit:
* early access (when TLS isn't initialized yet) now uses global
ThreadCache instance. We therefore have ThreadCachePtr instances
managing required locking. This eliminates unnecessary complication of
PTHREADS_CRASHES_IF_RUN_TOO_EARLY logic, and any other danger of
touching TLS too early. BTW previous implementation actually leaked
initial early-initialized ThreadCache instance(!)
* old configure-time HAVE_TLS logic is amputated. Config-time part of
it made little sense as C++ 17 guarantees availability of
thread_local, but we have manually curated deny-list of "bad" OSes,
that we tested (via compile checks!) at configure time. Now this
is all compile time. There is now compile-time kHaveGoodTLS variable
and we're using it mostly via if constexpr.
* kHaveGoodTLS case of creating thread cache is simplified and made
more straightforward (no need to have in_setspecific logic).
* !kHaveGoodTLS case if fixed and improved too. We avoid
std:🧵:get_id, as it deadlocks on mingw. We use errno address as
a portable and (usually) async-signal safe 'my thread' identifier. We
also eliminate linear searching of thread's cache and replace it with
straightforward hash table lookup.