Commit Graph

601 Commits

Aliaksey Kandratsenka
eeb7b84c20 Register tcmalloc atfork handler as early as possible
This is what other mallocs do (glibc malloc and jemalloc). The idea is
that malloc is usually initialized very early, so if we register the
atfork handler at that time, we're likely to be first. That makes our
atfork handler a bit safer, since there is much less chance of some
other library installing its "take all locks" handler first and having
fork take the malloc lock before the library's lock and deadlocking.

This should address issue #904.
2017-07-08 16:08:29 -07:00
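
A minimal sketch of the pattern described in the commit above, assuming
a single malloc lock (names here are illustrative, not gperftools'
actual code):

    #include <pthread.h>

    static pthread_mutex_t malloc_lock = PTHREAD_MUTEX_INITIALIZER;

    static void PrepareForFork()  { pthread_mutex_lock(&malloc_lock); }
    static void ParentAfterFork() { pthread_mutex_unlock(&malloc_lock); }
    static void ChildAfterFork()  { pthread_mutex_unlock(&malloc_lock); }

    // Registered from the allocator's earliest initialization path, so
    // it runs before handlers installed later by other libraries.
    static void SetupAtForkHandler() {
      pthread_atfork(PrepareForFork, ParentAfterFork, ChildAfterFork);
    }
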
Aliaksey Kandratsenka
208c26caef Add initial syscall support for mips64 32-bit ABI
This applies a patch by Adhemerval Zanella from
https://github.com/gperftools/gperftools/issues/845.

So far only malloc (i.e. tcmalloc_minimal) has been tested to work.
2017-07-08 13:34:41 -07:00
Francis Ricci
a3bf61ca81 Ensure that lsan flags are appended on all necessary targets 2017-07-08 13:33:30 -07:00
Aliaksey Kandratsenka
97646a1932 Add missing NEWS entry for recent 2.6 release
Somehow I managed to miss this last commit in the 2.6 release. So let's
add it now, even if it is a bit late.
2017-07-04 21:02:34 -07:00
Aliaksey Kandratsenka
4be05e43a1 bumped version up to 2.6 2017-07-04 20:35:25 -07:00
Francis Ricci
70a35422b5 Ignore current_instance heap allocation when leak sanitizer is enabled
Without this patch, any user program that enables LeakSanitizer will
see a leak from tcmalloc. Add a weak hook to __lsan_ignore_object,
so that if LeakSanitizer is enabled, the allocation can be ignored.
2017-07-04 20:24:47 -07:00
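
A sketch of the weak-hook pattern described above (illustrative, not
the exact patch): on ELF targets a weak declaration resolves to null
when LeakSanitizer is not linked in, making the call a no-op.

    // Real LSan entry point; declared weak so the symbol may be absent.
    extern "C" __attribute__((weak)) void __lsan_ignore_object(const void* p);

    static void IgnoreForLeakCheck(const void* ptr) {
      if (__lsan_ignore_object)   // non-null only when LSan is present
        __lsan_ignore_object(ptr);
    }
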
Aliaksey Kandratsenka
6eca6c64fa Revert "issue-654: [pprof] handle split text segments"
This reverts commit 8c3dc52fcf.

People have reported issues with this, so let's stay safe and use the
older, even if less powerful, code.
2017-07-01 18:48:58 -07:00
KernelMaker
a495969cb6 update prev_class_size in each loop; otherwise the min_object_size of tcmalloc.thread will always be 1 when calling GetFreeListSizes 2017-05-29 15:05:55 -07:00
Kim Gräsman
163224d8af Document HEAPPROFILESIGNAL environment variable 2017-05-29 15:04:00 -07:00
Aliaksey Kandratsenka
5ac82ec5b9 added stacktrace capturing benchmark 2017-05-29 14:57:13 -07:00
Aliaksey Kandratsenka
c571ae2fc9 2.6rc4 2017-05-22 19:04:20 -07:00
Aliaksey Kandratsenka
f2bae51e7e Revert "Revert "disable dynamic sized delete support by default""
This reverts commit b82d89cb7c.

Dynamic sized delete support relies on the ifunc handler being able to
look up an environment variable. The issue is that when a binary is
linked with the -z now linker flag, all relocations are performed
early, and sadly ifunc relocations are not treated specially. So when
the ifunc handler runs, it cannot rely on any dynamic relocations at
all, otherwise a crash is a real possibility. So we cannot afford doing
this until (and unless) ifunc is fixed.

This was brought to my attention by Fedora people at
https://bugzilla.redhat.com/show_bug.cgi?id=1452813
2017-05-22 18:58:15 -07:00
Aliaksey Kandratsenka
6426c0cc80 2.6rc3 2017-05-22 03:08:30 -07:00
Aliaksey Kandratsenka
0c0e2fe43b enable 48-bit page map on msvc as well 2017-05-22 03:08:30 -07:00
Aliaksey Kandratsenka
83d6818295 speed up 3-level page map access
There is no need for pointer indirection for the root node. This also
helps the case of an early free of a garbage pointer, because we didn't
check the root_ pointer for NULL.
2017-05-22 03:08:15 -07:00
Aliaksey Kandratsenka
f7ff175b92 add configure-time warning on unsupported backtrace capturing
Neither libgcc's nor libc's backtrace() is really an option for
capturing stack traces from inside the profiling signal handler. So
let's warn people.
2017-05-22 01:55:50 -07:00
Aliaksey Kandratsenka
cef582350c align fast-path functions only if compiler supports that
Apparently gcc supports __attribute__((aligned(N))) on functions only
since version 4.3. So let's test for it in the configure script and
only use it when possible. We now use the CACHELINE_ALIGNED_FN macro
for aligning functions.
2017-05-22 01:55:50 -07:00
Aliaksey Kandratsenka
bddf862b18 actually support very early freeing of NULL
This was caught by unit tests on CentOS 5. Apparently some early
startup code tries to do vprintf, which calls free(0). That used to
crash: before the size class cache is initialized, it reports a
hit (with size class 0) for the NULL pointer, so we'd miss the
NULL-pointer check in free and crash.

The fix is to check IsInited in the case when the thread cache is null,
and if tcmalloc is not yet initialized, escalate to
free_null_or_invalid.
2017-05-22 01:54:56 -07:00
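
A sketch of that control flow, with illustrative stubs rather than the
actual tcmalloc code:

    #include <cstdlib>

    static thread_local void* tls_cache = nullptr;  // null until cache is ready
    static bool inited = false;                     // stands in for IsInited()

    static void free_null_or_invalid(void* ptr) {
      if (ptr == nullptr) return;  // tolerate free(NULL) before init
      // ... report an invalid/foreign pointer ...
    }

    void my_free(void* ptr) {
      if (tls_cache == nullptr) {
        if (!inited) {             // too early: size-class cache is all zeroes
          free_null_or_invalid(ptr);
          return;
        }
        // ... initialized, but this thread has no cache yet ...
      }
      std::free(ptr);              // placeholder for the normal path
    }
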
Aliaksey Kandratsenka
07a124d8c1 don't use arg-ful constructor attribute for early nallocx test
101 is not very early anyway, and the arg-ful constructor attribute is
only supported since gcc 4.3 (e.g. RHEL 5's compiler fails to compile
it). So there seems to be very little value in asking for priority 101.
2017-05-21 22:49:54 -07:00
Aliaksey Kandratsenka
5346b8a4de don't depend on SIZE_MAX definition in sampler.cc
It was reported that SIZE_MAX isn't defined in C++ mode when the C++
standard is less than C++11. Because we still want to support
non-C++11 systems (for now), let's keep it simple and not depend on
SIZE_MAX (the original Google-internal code used
std::numeric_limits<ssize_t>::max, but that failed to compile on
msvc).

Fixes issue #887 and issue #889.
2017-05-21 22:49:20 -07:00
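
For illustration, one portable way to express that limit without
SIZE_MAX (the actual replacement in sampler.cc may differ):

    #include <cstddef>

    // Maximum value of size_t, computed without SIZE_MAX or <limits>.
    static const size_t kMaxSizeT = static_cast<size_t>(0) - 1;
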
Aliaksey Kandratsenka
50125d8f70 2.6rc2 2017-05-15 00:02:43 -07:00
Aliaksey Kandratsenka
a5e8e42a47 don't link-in libunwind if libunwind.h is missing
I got a report that some build environments for
https://github.com/lyft/envoy are having a link-time issue due to
linking libunwind. It was happening despite libunwind.h being missing,
which is a clear bug, since without the header we won't actually use
libunwind.
2017-05-14 23:45:08 -07:00
Rajalakshmi Srinivasaraghavan
e92acdf98d Fix compilation error for powerpc32
Fix the following compilation error on the powerpc32 platform when
using the latest glibc:
error: ‘siginfo_t’ was not declared in this scope
2017-05-14 23:08:13 -07:00
Aliaksey Kandratsenka
b48403a4b0 2.6rc 2017-05-14 22:00:28 -07:00
Aliaksey Kandratsenka
53f15325d9 fix compilation of tcmalloc_unittest.cc on older llvm-gcc 2017-05-14 20:35:22 -07:00
Aliaksey Kandratsenka
b1d88662cb change size class to be represented by 32 bit int
This moves the code closer to the Google-internal version and provides
slightly tighter code encoding on amd64.
2017-05-14 19:04:56 -07:00
Aliaksey Kandratsenka
991f47a159 change default transfer batch back to 32
Some tensorflow benchmarks are seeing a large regression with elevated
values. So let's stick to the old, safe default until we understand how
to make larger values work for all workloads.
2017-05-14 19:04:56 -07:00
Aliaksey Kandratsenka
7bc34ad1f6 support different number of size classes at runtime
With the TCMALLOC_TRANSFER_NUM_OBJ environment variable we can change
the transfer batch size. And with that comes a slightly different
number of size classes, depending on the value of the transfer batch
size.

We used to have a hardcoded number of size classes, so we couldn't
really support any batch size setting.

This commit adds support for a dynamic number of size classes (a
runtime value returned by Static::num_size_classes()).
2017-05-14 19:04:56 -07:00
Aliaksey Kandratsenka
4585b78c8d massage allocation and deallocation fast-path for performance
This is a significant speedup of the malloc fast path. A large part
comes from avoiding the expensive function prologue/epilogue, which is
achieved by making sure that tc_{malloc,new,free} etc. are small
functions that do only tail calls. We keep only the critical path in
those functions and tail-call to slower "full" versions when we need to
deal with a less common case. This helps the compiler generate much
tidier code.

The fast-path readiness check is now different too. We used to have a
"min size for slow path" variable, which was set to a non-zero value
once we knew the thread cache was present and ready. We now use the
thread-cache pointer being non-NULL as the readiness check.

There is a special ThreadCache::threadlocal_data_.fast_path_heap copy
of that pointer that can be temporarily nulled to disable the malloc
fast path. This is used to enable emergency malloc.

There is also a slight change to tracking the thread cache size.
Instead of tracking the total size of the free list, it now tracks size
headroom. This allows a slightly faster deallocation fast-path check,
where we just check that the headroom stays above zero. That check is a
bit faster than comparing against max_size_.
2017-05-14 19:04:56 -07:00
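
A compressed sketch of that shape, with illustrative names and stubs
rather than the actual tcmalloc code:

    #include <cstdlib>

    struct ThreadCache {
      long headroom = 0;  // bytes of room left; below zero => take slow path
    };

    static thread_local ThreadCache* tls_cache = nullptr;  // null until ready

    static void* alloc_full(size_t size) { return std::malloc(size); }
    static void free_full(void* ptr) { std::free(ptr); }

    void* my_malloc(size_t size) {
      ThreadCache* c = tls_cache;             // readiness check in one load
      if (__builtin_expect(c == nullptr, 0))
        return alloc_full(size);              // tail call to the "full" version
      // ... fast path would pop from a per-class free list here ...
      return alloc_full(size);
    }

    void my_free(void* ptr, size_t size) {
      ThreadCache* c = tls_cache;
      if (__builtin_expect(c == nullptr, 0)) return free_full(ptr);
      // Headroom tracking: one subtract-and-test, cheaper than
      // comparing an accumulated total against max_size_.
      if (__builtin_expect((c->headroom -= (long)size) < 0, 0))
        return free_full(ptr);
      // ... push onto the free list ...
    }
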
Aliaksey Kandratsenka
5964a1d9c9 always inline a number of hot functions 2017-05-14 19:04:55 -07:00
Aliaksey Kandratsenka
e419b7b9a6 introduce ATTRIBUTE_ALWAYS_INLINE 2017-05-14 19:04:55 -07:00
Aliaksey Kandratsenka
7d588da7ec synchronized Sampler implementation with Google-internal version
This mostly means dropping FastLog2, which was never necessary for
performance, and making the sampler be called always, even if sampling
is disabled (this mostly benefits the always-sampling case of the
Google fork).

We also gain TryRecordAllocationFast, which is not used yet, but will
be used as part of a subsequent fast-path speedup commit.
2017-05-14 19:04:55 -07:00
Aliaksey Kandratsenka
27da4ade70 reduce size of class_to_size_ array
A 32-bit int is enough, and accessing a smaller array uses a bit less
cache.
2017-05-14 19:04:55 -07:00
Aliaksey Kandratsenka
335f09d4e4 use static location for pageheap
Makes it a bit faster to access, since we're dropping a pointer
indirection.
2017-05-14 19:04:55 -07:00
Aliaksey Kandratsenka
6ff332fb51 move size classes map earlier in SizeMap
Since we access them more often, having at least one of them at offset
0 makes PIC/PIE code a bit smaller.
2017-05-14 19:04:55 -07:00
Aliaksey Kandratsenka
121b1cb32e slightly faster size class cache
The lower bits of the page index are still used as the index into the
hash table. Those lower bits are zeroed, or-ed with the size class, and
the result is placed into the hash table. Checking is then just loading
the value from the hash table, xoring it with the higher bits of the
address, and checking whether the result is lower than 128. Notably,
size class 0 is no longer considered "invalid".
2017-05-14 19:04:55 -07:00
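
A sketch of that check; the constants below are assumptions for
illustration, not the real tcmalloc parameters:

    #include <cstdint>

    constexpr int      kHashBits   = 16;                  // 64K-entry cache
    constexpr uint64_t kIndexMask  = (1 << kHashBits) - 1;
    constexpr uint64_t kNumClasses = 128;                 // classes fit in 7 bits

    static uint64_t cache_[1 << kHashBits];

    inline void CacheSet(uint64_t page, uint64_t size_class) {
      uint64_t idx = page & kIndexMask;
      cache_[idx] = (page ^ idx) | size_class;  // zeroed low bits | class
    }

    // True on a hit; note that size class 0 is a valid result.
    inline bool CacheTryGet(uint64_t page, uint64_t* size_class) {
      uint64_t idx = page & kIndexMask;
      uint64_t v = cache_[idx] ^ (page ^ idx);  // xor away the high bits
      if (v < kNumClasses) { *size_class = v; return true; }
      return false;                             // high bits differ => miss
    }
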
Aliaksey Kandratsenka
b57c0bad41 init tcmalloc prior to replacing system alloc
Currently on Windows, we depend on uninitialized tcmalloc variables to
detect frees of a foreign malloc's chunks. This works somewhat by
chance, because the 0-initialized size class cache acts as a cache with
no values. But this is about to change, so let's do explicit
initialization.
2017-05-14 19:04:55 -07:00
Aliaksey Kandratsenka
71fa9f8730 use 2-level page map for 48-bit addresses
48 bits is the size of the x86-64 and arm64 address spaces, so using a
2-level map for them is slightly faster. We keep 3 levels for the
small-but-slow configuration, since 2 levels consume a bit more memory.

This is a partial port of a Google-internal commit by Sanjay
Ghemawat (same idea, different implementation).
2017-05-14 19:04:55 -07:00
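
A rough sketch of such a 2-level radix map; the bit split below is an
assumption for illustration:

    #include <cstdint>

    constexpr int kPageShift = 13;               // 8 KiB pages
    constexpr int kPageBits  = 48 - kPageShift;  // 35 page-index bits
    constexpr int kLeafBits  = 15;               // low bits handled by leaves
    constexpr int kRootBits  = kPageBits - kLeafBits;

    struct Leaf { void* values[1 << kLeafBits]; };

    struct PageMap2 {
      Leaf* root_[1 << kRootBits];               // static array, no extra load
      void* Get(uintptr_t page) const {
        Leaf* leaf = root_[page >> kLeafBits];
        return leaf ? leaf->values[page & ((1 << kLeafBits) - 1)] : nullptr;
      }
    };
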
Aliaksey Kandratsenka
bad70249dd use 48-bit addresses on 64-bit arms too 2017-05-14 19:04:55 -07:00
Aliaksey Kandratsenka
5f12147c6d use hidden visibility for some key global variables
So that our -fPIC code is faster.
2017-05-14 19:04:55 -07:00
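
For illustration (not the exact macro gperftools uses): with hidden
visibility, position-independent code can address a global PC-relative
instead of loading its address from the GOT on every access.

    #define ATTR_VISIBILITY_HIDDEN __attribute__((visibility("hidden")))

    // Addressable within the shared object without a GOT load.
    ATTR_VISIBILITY_HIDDEN extern int key_global;
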
Aliaksey Kandratsenka
dfd53da578 set ENOMEM in handle_oom 2017-05-14 19:04:55 -07:00
Aliaksey Kandratsenka
14fd551072 avoid O(N²) in thread cache creation code 2017-05-14 19:04:55 -07:00
Aliaksey Kandratsenka
507a105e84 pass original size to DoSampledAllocation
It makes heap profiles more accurate. Google's internal malloc does it
as well.
2017-05-14 19:04:55 -07:00
Aliaksey Kandratsenka
bb77979dea don't declare throw() on malloc functions since it is faster
Apparently throw() on functions actually asks the compiler to generate
code to detect unexpected exceptions, which prevents tail-call
optimization.

So in order to re-enable this optimization, we simply don't tell the
compiler about throw() at all. C++11 noexcept would be even better, but
it is not universally available yet.

So we change to no exception specification. At least for gcc & clang on
Linux (and likely for all ELF platforms, if not all platforms), this
really eliminates all overhead of exceptions.
2017-05-14 19:04:55 -07:00
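
A minimal illustration of the effect, assuming gcc or clang on an ELF
target:

    #include <stddef.h>

    void* slow_path(size_t size);

    // The empty throw() spec makes the compiler guard against
    // unexpected exceptions escaping, which can block the tail call.
    void* with_spec(size_t size) throw() { return slow_path(size); }

    // With no exception specification this compiles to a plain jmp.
    void* without_spec(size_t size) { return slow_path(size); }
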
Aliaksey Kandratsenka
89c74cb79c handle duplicate google_malloc frames in malloc hook stack trace
A subsequent optimization may cause multiple malloc functions from the
google_malloc section to appear in the call stack, particularly when a
fast-path malloc function calls the slow path and the compiler chooses
to implement that call as a regular call instead of a tail call.

Because we need the stack trace only up to the first such function,
once we find the innermost such frame, we simply check whether the next
outer frame is also in google_malloc and, if so, consider it instead.
2017-05-14 19:04:55 -07:00
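
A sketch of the section-bounds test: on ELF, the linker defines
__start_<section>/__stop_<section> symbols for sections whose names are
valid C identifiers (the declarations below are illustrative):

    extern "C" char __start_google_malloc[] __attribute__((weak));
    extern "C" char __stop_google_malloc[] __attribute__((weak));

    // True if the return address lies inside the google_malloc section.
    inline bool InGoogleMalloc(const char* pc) {
      return pc >= __start_google_malloc && pc < __stop_google_malloc;
    }
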
Aliaksey Kandratsenka
0feb1109ac fix stack trace capturing in debug malloc
In particular, the hardcoded skip count relied on certain compiler
behavior, namely that tail calls inside the DebugDeallocate path are
not actually implemented as tail calls.

The new implementation uses the google_malloc section as a marker of
the malloc boundary. But in order for this to work, we have to prevent
tail calls in debugallocation's tc_XXX functions, which is achieved by
doing a volatile read of a static variable at the end of such
functions.
2017-05-14 19:04:55 -07:00
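
A sketch of that tail-call blocker (names are illustrative):

    static volatile int tail_call_blocker;

    void DebugDeallocate(void* ptr);  // the real work, normally tail-callable

    void tc_free_like(void* ptr) {
      DebugDeallocate(ptr);
      (void)tail_call_blocker;  // volatile load keeps this frame on the stack
    }
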
Aliaksey Kandratsenka
0506e965ee replace LIKELY/UNLIKELY with PREDICT_{TRUE,FALSE}
Google-internal code uses PREDICT_TRUE/FALSE, so we should too.
2017-05-14 19:04:55 -07:00
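
For reference, the usual gcc/clang definitions of such macros
(illustrative):

    #define PREDICT_TRUE(x)  __builtin_expect(!!(x), 1)
    #define PREDICT_FALSE(x) __builtin_expect(!!(x), 0)
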
Aliaksey Kandratsenka
59a4987054 prevent inlining ATTRIBUTE_SECTION functions
So that their code always executes in the prescribed section.
2017-05-14 19:04:55 -07:00
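
A sketch of the idea (not gperftools' exact macro): combine the section
attribute with noinline, so the body can never be duplicated into a
caller outside the section.

    #include <stddef.h>

    #define ATTRIBUTE_SECTION(name) \
      __attribute__((noinline, section(#name)))

    ATTRIBUTE_SECTION(google_malloc)
    void* example_alloc(size_t size) { return nullptr; }
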
Aliaksey Kandratsenka
ebb575b8a0 Revert "enabled aggressive decommit by default"
This reverts commit 7da5bd014d.

Some tensorflow benchmarks are getting slower with aggressive
decommit.
2017-05-14 19:04:55 -07:00
Aliaksey Kandratsenka
b82d89cb7c Revert "disable dynamic sized delete support by default"
This reverts commit 06811b3ae4.
2017-05-14 19:04:55 -07:00