Fix the "BUS ERROR" issue below:
a0 holds the start address of the memory block allocated by DebugAllocate in debugallocation.cc
(gdb) info registers
         zero       at       v0       v1       a0       a1       a2       a3
R0   00000000 10008700 772f62a0 00084d40 766dcfef 7fb5f420 00000000 004b4dd8
           t0       t1       t2       t3       t4       t5       t6       t7
R8   7713c1a0 7712dbc0 ffffffff 777bc000 f0000000 00000001 00000000 00403d10
           s0       s1       s2       s3       s4       s5       s6       s7
R16  7fb5ff1c 00401b9c 77050020 7fb5fb18 00000000 004cb008 004ca748 ffffffff
           t8       t9       k0       k1       gp       sp       s8       ra
R24  0000002f 771adcd4 00000000 00000000 771f4140 7fb5f408 7fb5f430 771add6c
           sr       lo       hi      bad    cause       pc
     00008713 0000e9fe 00000334 766dcff7 00800010 771adcfc
          fsr      fir
     00000004 00000000
(gdb) disassemble
Dump of assembler code for function _ZNSs4_Rep10_M_disposeERKSaIcE:
0x771adcd4 <+0>: lui gp,0x4
0x771adcd8 <+4>: addiu gp,gp,25708
0x771adcdc <+8>: addu gp,gp,t9
0x771adce0 <+12>: lw v0,-28696(gp)
0x771adce4 <+16>: beq a0,v0,0x771add38 <_ZNSs4_Rep10_M_disposeERKSaIcE+100>
0x771adce8 <+20>: nop
0x771adcec <+24>: lw v0,-30356(gp)
0x771adcf0 <+28>: beqzl v0,0x771add1c <_ZNSs4_Rep10_M_disposeERKSaIcE+72>
0x771adcf4 <+32>: lw v0,8(a0)
0x771adcf8 <+36>: sync
=> 0x771adcfc <+40>: ll v0,8(a0)
0x771add00 <+44>: addiu at,v0,-1
0x771add04 <+48>: sc at,8(a0)
0x771add08 <+52>: beqz at,0x771adcfc <_ZNSs4_Rep10_M_disposeERKSaIcE+40>
0x771add0c <+56>: nop
0x771add10 <+60>: sync
0x771add14 <+64>: b 0x771add24 <_ZNSs4_Rep10_M_disposeERKSaIcE+80>
0x771add18 <+68>: nop
0x771add1c <+72>: addiu v1,v0,-1
0x771add20 <+76>: sw v1,8(a0)
0x771add24 <+80>: bgtz v0,0x771add38 <_ZNSs4_Rep10_M_disposeERKSaIcE+100>
0x771add28 <+84>: nop
0x771add2c <+88>: lw t9,-27072(gp)
0x771add30 <+92>: jr t9
0x771add34 <+96>: nop
0x771add38 <+100>: jr ra
0x771add3c <+104>: nop
End of assembler dump.
ll instruction manual:
Load Linked:
Loads the destination register with the contents of the word
that is at the memory location. This instruction implicitly performs
a SYNC operation; all loads and stores to shared memory fetched prior
to the ll must access memory before the ll, and loads and stores to
shared memory fetched subsequent to the ll must access memory after ll.
Load Linked and Store Conditional can be used to atomically update
memory locations. *This instruction is not valid in the mips1 architectures.
The machine signals an address exception when the effective address is not
divisible by four.
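The register dump agrees with that last sentence: a0 is 0x766dcfef, so "ll v0,8(a0)" targets 0x766dcff7 (the value in the bad register), which is not divisible by four. A minimal sketch of that arithmetic; IsWordAligned, LlTarget and CheckRegisterDump are hypothetical helper names, not part of the patch:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true when addr is a valid target for a 4-byte
// ll/sc access, i.e. divisible by four.
constexpr bool IsWordAligned(std::uintptr_t addr) {
  return (addr & 0x3u) == 0;
}

// Effective address of "ll v0,8(a0)" for a given value of a0.
constexpr std::uintptr_t LlTarget(std::uintptr_t a0) {
  return a0 + 8;
}

// Replays the values from the register dump above.
inline void CheckRegisterDump() {
  const std::uintptr_t target = LlTarget(0x766dcfef);  // a0 from the dump
  assert(target == 0x766dcff7);    // same value as the "bad" register
  assert(!IsWordAligned(target));  // misaligned, so ll raises an address error
  (void)target;
}
```

Any address handed to the atomic reference count code therefore has to be 4-byte aligned on MIPS.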
Signed-off-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Aliaksey Kandratsenka <alk@tut.by>
[alk@tut.by: removed addition of unused #include]
GCC 4.7 and later provide atomic builtins[1]. Use these instead of adding
yet another assembly port for AArch64 (64-bit ARM). This patch enables
successfully building and running the atomicops unittest on AArch64.
This patch enables using the GCC builtins only when no assembly
implementation is provided. But as a quick check, atomicops_unittest
and the rest of the testsuite also pass with atomicops-internals-gcc on
ARMv7 and x86_64 if the ifdef in atomicops is adjusted to prefer
the generic implementation.
[1] http://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
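For illustration, a minimal sketch of what builtin-based atomic operations can look like. The function names and the choice of __ATOMIC_SEQ_CST everywhere are assumptions made for this example and are not copied from atomicops-internals-gcc:

```cpp
#include <cassert>
#include <cstdint>

typedef std::int32_t Atomic32;

// Atomically adds increment to *ptr and returns the new value.
inline Atomic32 AtomicIncrement(volatile Atomic32* ptr, Atomic32 increment) {
  return __atomic_add_fetch(ptr, increment, __ATOMIC_SEQ_CST);
}

// Stores new_value only if *ptr == old_value. Returns the value that was
// observed in *ptr before the operation.
inline Atomic32 CompareAndSwap(volatile Atomic32* ptr, Atomic32 old_value,
                               Atomic32 new_value) {
  Atomic32 expected = old_value;
  __atomic_compare_exchange_n(ptr, &expected, new_value, /*weak=*/false,
                              __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
  return expected;  // overwritten with the observed value on failure
}

inline void DemoAtomics() {
  volatile Atomic32 value = 5;
  const Atomic32 after_add = AtomicIncrement(&value, 2);
  const Atomic32 seen_ok = CompareAndSwap(&value, 7, 10);    // succeeds
  const Atomic32 seen_fail = CompareAndSwap(&value, 7, 99);  // fails
  assert(after_add == 7);
  assert(seen_ok == 7);
  assert(seen_fail == 10);
  assert(value == 10);
  (void)after_add; (void)seen_ok; (void)seen_fail;
}
```

The compiler picks the right instruction sequence for the target, which is why no per-architecture assembly is needed.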
Instead, those functions that were originally taken from Google's "base"
code now have the prefix TCMalloc_, so that they don't conflict with other
Google libraries that have the same functions.
In issue-66 (and the readme) it is pointed out that there are sometimes
issues grabbing a backtrace across a signal handler boundary.
This code attempts to fix that by grabbing the backtrace from the signal's
ucontext, which clearly does not include the signal handler boundary.
We're relying on a "feature" of libunwind: on some important platforms
libunwind's context is the same as libc's ucontext_t, which is given to us
as part of calling the signal handler.
Because otherwise destructor might be invoked well before other places
that might touch malloc extension instance.
We're using placement new to initialize it and pass the pointer to
MallocExtension::Register, which ensures that its destructor is
never run.
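A minimal sketch of the placement-new idiom, assuming a stand-in Extension type and a hypothetical GetExtension accessor (the real code registers the pointer with MallocExtension::Register instead):

```cpp
#include <cassert>
#include <new>

// Hypothetical stand-in for the malloc extension implementation.
struct Extension {
  int id;
  explicit Extension(int i) : id(i) {}
  ~Extension() { id = -1; }  // would invalidate the object if it ever ran
};

// Builds the instance with placement new into static storage. Only a raw
// pointer is kept, so no destructor is registered to run at exit and the
// object stays usable during static destruction.
inline Extension* GetExtension() {
  alignas(Extension) static unsigned char storage[sizeof(Extension)];
  static Extension* instance = new (storage) Extension(42);
  return instance;
}

inline void DemoExtension() {
  Extension* first = GetExtension();
  Extension* second = GetExtension();
  assert(first == second);  // constructed once
  assert(first->id == 42);
  (void)first; (void)second;
}
```

A plain static Extension object would instead get its destructor queued for exit, which is the ordering problem described above.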
Based on idea suggested by Andrew C. Morrow.
Because clang doesn't understand -fno-builtin-malloc and friends. And
otherwise new/delete pairs get optimized away causing our tests that
expect hooks to be called to fail.
Somehow its C++ headers (like string) define pthread symbols without
us even asking for them. That breaks the old assumption that pthread symbols
are not available on Windows.
In order to fix that we detect this condition in configure.ac and
avoid defining windows versions of pthread symbols.
* some variables defined as "char *" should be changed to "const char *"
* For uclibc, glibc's "void malloc_stats(void)" should be "void malloc_stats(FILE *)"; it is commented out for now.
* For uclibc, __sbrk has the "hidden" attribute, so we use the mmap allocator for uclibc.
The previous logic for detecting main program addresses assumed that the
main executable is at the lowest addresses. With PIE (active by default on
Ubuntu) that doesn't work.
To deal with that, we attempt to find the main executable's
mapping in /proc/[pid]/maps. The old logic is preserved too, just in
case.
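A rough sketch of how one line of /proc/[pid]/maps can be matched against the executable path. Mapping, ParseMapsLine and IsInExecutable are hypothetical names, the sample line is made up, and paths containing spaces are not handled; the actual patch may do this differently:

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

struct Mapping {
  unsigned long long start;
  unsigned long long end;
  char path[256];
};

// Parses "start-end perms offset dev inode path". Returns false for lines
// that do not match or that have no path (anonymous mappings).
inline bool ParseMapsLine(const char* line, Mapping* out) {
  out->path[0] = '\0';
  const int fields =
      std::sscanf(line, "%llx-%llx %*4s %*llx %*x:%*x %*llu %255s",
                  &out->start, &out->end, out->path);
  return fields == 3;
}

// True when addr lies inside a mapping that is backed by exe_path.
inline bool IsInExecutable(const Mapping& m, const char* exe_path,
                           unsigned long long addr) {
  return std::strcmp(m.path, exe_path) == 0 && addr >= m.start && addr < m.end;
}

inline void DemoParseMapsLine() {
  // Made-up sample in the /proc/[pid]/maps format (a PIE load address).
  const char* line =
      "55d0c0a00000-55d0c0a21000 r-xp 00001000 08:01 1311 /usr/bin/demo";
  Mapping m;
  const bool ok = ParseMapsLine(line, &m);
  assert(ok);
  assert(m.start == 0x55d0c0a00000ULL && m.end == 0x55d0c0a21000ULL);
  assert(IsInExecutable(m, "/usr/bin/demo", 0x55d0c0a00010ULL));
  assert(!IsInExecutable(m, "/usr/bin/demo", 0x400000ULL));
  (void)ok;
}
```

With PIE the executable is mapped at a high, randomized address like the one in the sample, so a "lowest addresses" heuristic cannot find it.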
Some people might want to check the unpacked result of make dist into
git. But because git doesn't preserve timestamps, that would cause the
automatic "auto-retool" rules to trigger, sometimes even causing build
breakage if the system's autotools version doesn't match the autotools
version used for make dist.
The easiest way around this problem is to simply disable those unnecessary
"maintainer" rebuild rules. Especially given that the source is always
freely available via git, there should be no reason to
regenerate any of the autotools products in 'make dist'-produced sources.
Previously a call to CheckAddressBits was made, but nothing was done with
its result.
I've also made sure that the actual size is used in the checks and when
bumping up TCMalloc_SystemTaken.
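A small sketch of the intended behaviour, under stated assumptions: kAddressBits is taken to be 48 here, and SystemStats and AccountBlock are hypothetical stand-ins for the allocator code and TCMalloc_SystemTaken:

```cpp
#include <cassert>
#include <cstdint>

// Assumption for this sketch: 48 usable address bits.
const int kAddressBits = 48;

// True when addr fits into kAddressBits bits.
inline bool CheckAddressBits(std::uint64_t addr) {
  return (addr >> kAddressBits) == 0;
}

// Hypothetical stand-in for the TCMalloc_SystemTaken counter.
struct SystemStats {
  std::uint64_t taken;
};

// Accepts a block only if its last byte is still addressable, and accounts
// for the size that was actually obtained from the system.
inline bool AccountBlock(std::uint64_t start, std::uint64_t actual_size,
                         SystemStats* stats) {
  if (actual_size == 0 || !CheckAddressBits(start + actual_size - 1)) {
    return false;  // the check result is acted upon instead of ignored
  }
  stats->taken += actual_size;
  return true;
}

inline void DemoAccountBlock() {
  SystemStats stats = {0};
  const bool ok = AccountBlock(0x1000, 4096, &stats);
  const bool too_high = AccountBlock((1ULL << 48) - 100, 4096, &stats);
  assert(ok && !too_high);
  assert(stats.taken == 4096);  // the rejected block is not counted
  (void)ok; (void)too_high;
}
```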
As suggested by Hannes Weisbach.
Call heap-profiler_unittest with the arguments 1 -2 (one iteration, 2
fork()ed children).
Instead of running the test, the program crashes with a std::bad_alloc
exception. This is caused by unconditionally passing the
number-of-threads-argument (0 or positive for threads, negative for
fork()s) in RunManyThreads(), thus allocating an array of pthread_t of
size -2. Depending on the sign of the thread number argument either
RunManyThreads or fork() should be called.
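The dispatch on the sign can be sketched as follows. WorkerPlan and PlanWorkers are hypothetical names for this example; the test itself calls RunManyThreads or fork() directly:

```cpp
#include <cassert>

struct WorkerPlan {
  bool use_fork;  // true: fork() children; false: threads
  int count;      // always non-negative, safe to use as an array size
};

// Splits the command-line argument into a mode and a non-negative count.
inline WorkerPlan PlanWorkers(int arg) {
  WorkerPlan plan;
  plan.use_fork = arg < 0;
  plan.count = arg < 0 ? -arg : arg;
  return plan;
}

inline void DemoPlanWorkers() {
  const WorkerPlan forks = PlanWorkers(-2);  // "1 -2" from the example above
  assert(forks.use_fork && forks.count == 2);
  const WorkerPlan threads = PlanWorkers(3);
  assert(!threads.use_fork && threads.count == 3);
  (void)forks; (void)threads;
}
```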
As proposed by Hannes Weisbach.
The argument will be garbled because of a misplaced brace, for example
(heap-checker_unittest.sh):
HEAP_CHECKER="${1:-$BINDIR}/heap-checker_unittest"
which should be:
HEAP_CHECKER="${1:-$BINDIR/heap-checker_unittest}"
This unit test is used to check the binaries heap-checker_unittest and
heap-checker_debug_unittest. With the typo, the executable
heap-checker_debug_unittest is never actually run.
When we fetch objects from the span for the thread cache, we build a list
in reverse order relative to the original list on the span and supply this list
to the thread cache. This algorithm has trouble with a newly created span.
A newly created span has its object list in ascending order. Since the thread
cache gets the reverse of that list, the user gets objects in descending order.
The following example shows what occurs in this algorithm.
new span: object list: 1 -> 2 -> 3 -> 4 -> 5 -> ...
fetch N items: N -> N-1 -> N-2 -> ... -> 2 -> 1 -> NULL
thread cache: N -> N-1 -> N-2 -> ... -> 2 -> 1 -> NULL
user's 1st malloc: N
user's 2nd malloc: N-1
...
user's Nth malloc: 1
In general, accessing memory in ascending order performs better than
accessing it in descending order, so this patch fixes this situation.
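To make the difference concrete, here is a small sketch of both ways of moving N objects off a span's list. Object, FetchReversed and FetchInOrder are hypothetical names, and appending at the tail is just one way to keep span order, not necessarily how the patch does it:

```cpp
#include <cassert>
#include <cstddef>

struct Object {
  Object* next;
  int id;
};

// Links objs[0..count) into an ascending list 1 -> 2 -> ... -> count.
inline Object* BuildSpan(Object* objs, int count) {
  for (int i = 0; i < count; ++i) {
    objs[i].id = i + 1;
    objs[i].next = (i + 1 < count) ? &objs[i + 1] : NULL;
  }
  return count > 0 ? &objs[0] : NULL;
}

// Old behaviour: each fetched object is pushed at the head, which reverses
// the order (N -> ... -> 2 -> 1).
inline Object* FetchReversed(Object** span_list, int n) {
  Object* head = NULL;
  while (n-- > 0 && *span_list != NULL) {
    Object* obj = *span_list;
    *span_list = obj->next;
    obj->next = head;
    head = obj;
  }
  return head;
}

// Order-preserving variant: each fetched object is appended at the tail
// (1 -> 2 -> ... -> N).
inline Object* FetchInOrder(Object** span_list, int n) {
  Object* head = NULL;
  Object** tail = &head;
  while (n-- > 0 && *span_list != NULL) {
    Object* obj = *span_list;
    *span_list = obj->next;
    obj->next = NULL;
    *tail = obj;
    tail = &obj->next;
  }
  return head;
}

inline void DemoFetchOrder() {
  Object a[5], b[5];
  Object* span_a = BuildSpan(a, 5);
  Object* span_b = BuildSpan(b, 5);
  Object* reversed = FetchReversed(&span_a, 3);
  Object* in_order = FetchInOrder(&span_b, 3);
  assert(reversed->id == 3 && reversed->next->id == 2);
  assert(in_order->id == 1 && in_order->next->id == 2);
  assert(span_a->id == 4 && span_b->id == 4);  // the rest stays on the span
  (void)reversed; (void)in_order;
}
```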
I ran the program below to measure the performance effect.
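#include <stdlib.h>

#define MALLOC_SIZE (512)
#define CACHE_SIZE (64)
#define TOUCH_SIZE (512 / CACHE_SIZE)

int main(int argc, char **argv)
{
    long i, j, k, repeat;
    /* Number of objects, taken from the command line as in the run below. */
    long count = argc > 1 ? atol(argv[1]) : 1000000;
    char *x;
    char **array = malloc(sizeof(void *) * count);

    /* Allocate all objects up front. */
    for (i = 0; i < 1; i++) {
        for (j = 0; j < count; j++) {
            x = malloc(MALLOC_SIZE);
            array[j] = x;
        }
    }

    /* Touch one byte per cache line of every object, in allocation order. */
    repeat = 10;
    for (i = 0; i < repeat; i++) {
        for (j = 0; j < count; j++) {
            x = array[j];
            for (k = 0; k < TOUCH_SIZE; k++) {
                *(x + (k * CACHE_SIZE)) = '1';
            }
        }
    }
    return 0;
}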
LD_PRELOAD=libtcmalloc_minimal.so perf stat -r 10 ./a.out 1000000
**** Before ****
Performance counter stats for './a.out 1000000' (10 runs):
2.715161299 seconds time elapsed ( +- 0.07% )
**** After ****
Performance counter stats for './a.out 1000000' (10 runs):
2.259366428 seconds time elapsed ( +- 0.08% )
It is better to reduce the number of function calls where possible. If we
fetch as many objects as possible from one span during each function call,
the number of function calls is reduced, which helps performance.
During initialization, tcmalloc double-checks SizeClass integrity for
all possible size values, 0 to kMaxSize. This causes tremendous overhead
for short-lived applications.
For example, consider the following command.
'find -exec grep something {} \;'
The actual work of each grep is really small, but the double-check requires
more work. To reduce this overhead, it would be best to remove the double-check
entirely. But we cannot be sure of the integrity without double-checking,
so an alternative is needed.
This patch doesn't remove the double-check; instead, it tries to skip unnecessary
checks based on the ClassIndex() implementation. This removes much of the
overhead, and the code has the same coverage as the previous double-check.
The following is the result of this patch.
time LD_PRELOAD=libtcmalloc_minimal.so find ./ -exec grep "SOMETHING" {} \;
* Before
real 0m3.675s
user 0m1.000s
sys 0m0.640s
* This patch
real 0m2.833s
user 0m0.056s
sys 0m0.220s
* Remove double-check entirely
real 0m2.675s
user 0m0.072s
sys 0m0.184s
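The idea of skipping checks based on ClassIndex() can be sketched roughly like this. The constants and the ClassIndex() formula below are assumptions modeled on tcmalloc's size map (8-byte buckets up to kMaxSmallSize, 128-byte buckets above), and NextCheckedSize is a hypothetical helper; the actual patch may differ in detail:

```cpp
#include <cassert>
#include <cstddef>

// Assumed values for this sketch.
const std::size_t kMaxSmallSize = 1024;
const std::size_t kMaxSize = 256 * 1024;

// Sizes that share an index share a size class lookup, so checking one
// size per index is enough.
inline std::size_t ClassIndex(std::size_t s) {
  if (s <= kMaxSmallSize) return (s + 7) >> 3;
  return (s + 127 + (120 << 7)) >> 7;
}

// Next size worth checking after s: the largest size of the next bucket.
inline std::size_t NextCheckedSize(std::size_t s) {
  return s < kMaxSmallSize ? s + 8 : s + 128;
}

// Number of sizes visited when stepping bucket by bucket.
inline std::size_t CountCheckedSizes() {
  std::size_t n = 0;
  for (std::size_t s = 0; s <= kMaxSize; s = NextCheckedSize(s)) ++n;
  return n;
}

inline void DemoSkippedChecks() {
  // Every index from 0 to ClassIndex(kMaxSize) is visited exactly once.
  std::size_t expected_index = 0;
  for (std::size_t s = 0; s <= kMaxSize; s = NextCheckedSize(s)) {
    assert(ClassIndex(s) == expected_index);
    ++expected_index;
  }
  assert(expected_index == ClassIndex(kMaxSize) + 1);
  assert(CountCheckedSizes() < kMaxSize / 100);  // far fewer than kMaxSize + 1
}
```

Under these assumptions the loop visits 2169 sizes instead of 262145, while still touching every index that ClassIndex() can return.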
I.e. to prevent a possible deadlock when these locks are taken by
different threads in a different order.
This particular problem was also reported as part of issue 66.
This applies patch from Jean Lee.
I've reformatted it to match the surrounding code style and changed the
validation logic a bit. I.e. we no longer check the signal number for range,
given that we're not sure what different platforms support; instead
we check the return value of signal() for SIG_ERR.
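A minimal sketch of that validation style. InstallHandler and NoopHandler are hypothetical names for this example, and the SIGKILL case assumes a POSIX system, where that signal cannot be caught:

```cpp
#include <cassert>
#include <signal.h>

// Placeholder handler for the example.
static void NoopHandler(int) {}

// Installs handler for signum. Instead of range-checking signum ourselves,
// we let the platform decide and look at signal()'s return value.
inline bool InstallHandler(int signum, void (*handler)(int)) {
  return signal(signum, handler) != SIG_ERR;
}

inline void DemoInstallHandler() {
  const bool usr1_ok = InstallHandler(SIGUSR1, NoopHandler);
  const bool kill_ok = InstallHandler(SIGKILL, NoopHandler);
  assert(usr1_ok);   // a catchable signal is accepted
  assert(!kill_ok);  // SIGKILL cannot be caught, so signal() reports SIG_ERR
  (void)usr1_ok; (void)kill_ok;
}
```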
When we detect running under valgrind we do not initialize our own
malloc. So trying to print malloc stats when asked via MALLOCSTATS
cannot work.
This applies the fix proposed by Philippe Waroquiers, in which we detect
running under valgrind prior to checking the MALLOCSTATS environment
variable and refuse to print stats if we detect valgrind.
This merges patch contributed by Jovan Zelincevic.
With that patch, the tcmalloc build with --enable-minimal (just the malloc
replacement) appears to work (passes unit tests).
Because it was found that access to __thread variables is compiled into
calls to tlv_get_addr, which was found to call malloc. Because we
actually use thread-local storage from inside malloc, this leads to stack
overflow. So we'll continue using the pthreads API for that, which is known
to work on OSX.
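For reference, a minimal sketch of per-thread storage through the pthreads API rather than __thread. The names heap_key, SetThreadHeap and GetThreadHeap are hypothetical and do not come from the tcmalloc sources:

```cpp
#include <cassert>
#include <pthread.h>

static pthread_key_t heap_key;
static pthread_once_t heap_key_once = PTHREAD_ONCE_INIT;

// Creates the key exactly once; no destructor is registered in this sketch.
static void CreateHeapKey() {
  const int rc = pthread_key_create(&heap_key, NULL);
  assert(rc == 0);
  (void)rc;
}

// Stores a per-thread pointer without touching any __thread variable.
inline void SetThreadHeap(void* heap) {
  pthread_once(&heap_key_once, CreateHeapKey);
  pthread_setspecific(heap_key, heap);
}

// Returns the calling thread's pointer, or NULL if none was set.
inline void* GetThreadHeap() {
  pthread_once(&heap_key_once, CreateHeapKey);
  return pthread_getspecific(heap_key);
}

inline void DemoThreadHeap() {
  int dummy_heap = 0;
  SetThreadHeap(&dummy_heap);
  assert(GetThreadHeap() == &dummy_heap);
  SetThreadHeap(NULL);
  assert(GetThreadHeap() == NULL);
}
```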