References github issue #1503.
This significantly reworks both early thread cache access and related
emergency malloc mode checking integration. As a result, we're able to
to rely on emergency malloc even on systems without "good"
TLS (e.g. QNX which does emutls).
One big change here is we're undoing early change to have single
"global" thread cache early during process lifetime. It was nice and
somewhat simpler approach. But because of inherent locking during
early process lifetime, we couldn't be sure of certain lock ordering
aspects w.r.t. backtracing/exception-stack-unwinding. So I choose to
keep it safe. So the new idea is we use SlowTLS facility to find
threads' caches when normal tls isn't ready yet. It avoids holding
locks around potentially recursion-ful things (like
ThreadCache::ModuleInit or growing heap). But we then have to be
careful to "re-attach" those early thread cache instances to regular
TLS. There will nearly always be just one of such thread caches. For
initial thread. But we cannot entirely rule out more general case
where someone creates threads before process initializers ran and
main() is reached. Another notable thing is free-ing memory in this
early mode will always using slow-path deletion directly into central
free list.
SlowTLS facility is "simply" a generalization of previous
CreateBadTLSCache code. I.e. we have a small fixed-size cache that
holds "exceptional" instances of thread-identitity to
thread-cache+emergency-mode-flag mappings.
We also take advantage of tcmalloc::kInvalidTLSKey we introduced
earlier and remove potentially raceful memory ordering between reading
tls_ready_ and tls_key_.
For emergency malloc detection we previously used thread_local
flag. Which we cannot use on !kHaveGoodTLS systems. So we instead
_remove_ thread's cache from it's normal tls storage and place it
"into" SlowTLS instead for the duration of WithStacktraceScope
call (which is how emergency mode is enabled now).
While it still needs more work to be up to highest standards of
cleanness and clarity, we made it better.
One big part of this change is we're now liberated from
tcmalloc_unittest.sh. We now arrange testing of various environment
variable tweaks in C++ by self-execing with updated environment at the
end of tests.
In 57d6ecc4ae I removed obsolete
src/windows/TODO but failed to remove it from Makefile.am
I also previously failed to arrange distribution of vendored
googletest. And we also failed to distribute headers in src/tests/
This is all fixed now.
Turns out at least on FreeBSD tmpfile will fail if TMPDIR points to
non-existant directory. This also has nice property of leaving it down
to users to set up TMPDIR the way they want.
There were far too many intermediate libraries, references to obsolete
libtool bugs and whatnot.
We now have libcommon.la as a convenience archive that contains
spinlock, logging and misc stuff. A number of unit tests that need
those facilities are being linked to this.
Then we have libstacktrace.la convenience archive, since it is used by
libtcmalloc.la and libprofiler.la.
And after that we're simply directly creating our main library
products: libtcmalloc_minimal{,_debug}.la, libtcmalloc_{,debug}.la,
libprofiler.la etc.
We now wrap StackTraceScope thingy in tcmalloc-specific parts, instead
of automagically inside every stacktrace.cc function. TCMalloc bits
all need to grab stacktraces via newly introduced
tcmalloc::GrabBacktrace (which handles emergency malloc wrapping).
New approach eliminates the need for doing fake stacktrace scope. CPU
profiler, being distinct .so library couldn't take advantage of
emergency malloc anyways.
This simplifies the build further and eliminates another potential
point of runtime divergence when stacktrace is linked to both
libprofiler and libtcmalloc.
For compiling things automake never needs to be given a full set of
headers. Usually headers are specified so that make dist includes them
into archive, but we can achieve this goal easier.
This reduces size and complexity of our Makefile.am stuff.
Logic was removed from thread_cache.{h,cc} into
thread_cache_ptr.{h,cc}.
Separation will help possible future evolution, and we already changed
the logic quite a bit:
* early access (when TLS isn't initialized yet) now uses global
ThreadCache instance. We therefore have ThreadCachePtr instances
managing required locking. This eliminates unnecessary complication of
PTHREADS_CRASHES_IF_RUN_TOO_EARLY logic, and any other danger of
touching TLS too early. BTW previous implementation actually leaked
initial early-initialized ThreadCache instance(!)
* old configure-time HAVE_TLS logic is amputated. Config-time part of
it made little sense as C++ 17 guarantees availability of
thread_local, but we have manually curated deny-list of "bad" OSes,
that we tested (via compile checks!) at configure time. Now this
is all compile time. There is now compile-time kHaveGoodTLS variable
and we're using it mostly via if constexpr.
* kHaveGoodTLS case of creating thread cache is simplified and made
more straightforward (no need to have in_setspecific logic).
* !kHaveGoodTLS case if fixed and improved too. We avoid
std:🧵:get_id, as it deadlocks on mingw. We use errno address as
a portable and (usually) async-signal safe 'my thread' identifier. We
also eliminate linear searching of thread's cache and replace it with
straightforward hash table lookup.
This allows us, later, to avoid building this stuff in configurations
that don't use it. I have also reduced API and ABI surface to enable
further refactorings.