<!-- Import of HTML documentation from SourceForge (docs/html/cpu_profiler.html) -->
<html><head><title>Google CPU Profiler</title></head><body>

<p>This is the CPU profiler we use at Google.  There are three parts to
using it: linking the library into an application, running the code,
and analyzing the output.</p>

<h1>Linking in the Library</h1>

<p>To install the CPU profiler into your executable, add -lprofiler to
the link-time step for your executable.  (It's also probably possible
to add the profiler at run-time using LD_PRELOAD, but this isn't
necessarily recommended.)</p>

<p>This does <i>not</i> turn on CPU profiling; it just inserts the code.
For that reason, it's practical to just always link -lprofiler into a
binary while developing; that's what we do at Google.  (However, since
any user can turn on the profiler by setting an environment variable,
it's not necessarily recommended to install profiler-linked binaries
into a production, running system.)</p>

<h1>Running the Code</h1>

<p>There are two ways to turn on CPU profiling for a given run of an
executable:</p>

<ol>
<li> Define the environment variable CPUPROFILE to the filename to dump the
     profile to.  For instance, to profile /usr/local/netscape:
<pre>    $ CPUPROFILE=/tmp/profile /usr/local/netscape           # sh
    % setenv CPUPROFILE /tmp/profile; /usr/local/netscape   # csh
</pre>
     OR

</li><li> In your code, bracket the code you want profiled in calls to
     ProfilerStart() and ProfilerStop().  ProfilerStart() takes the
     profile filename as an argument.
</li></ol>

<p>In Linux 2.6 and above, profiling works correctly with threads,
automatically profiling all threads.  In Linux 2.4, profiling only
profiles the main thread (due to a kernel bug involving itimers and
threads).  Profiling works correctly with sub-processes: each child
process gets its own profile with its own name (generated by combining
CPUPROFILE with the child's process id).</p>
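The per-child naming rule can be sketched as a tiny helper.  Note this is an illustration, not the profiler's actual implementation: the exact separator used when combining CPUPROFILE with the pid is an internal detail, and an underscore is assumed here.

```python
# Sketch of the naming scheme described above: each forked child writes
# to its own file, derived from CPUPROFILE plus its process id.
# The underscore separator is an assumption for illustration only.
def child_profile_name(cpuprofile, pid):
    """Combine the CPUPROFILE prefix with a child's process id."""
    return "%s_%d" % (cpuprofile, pid)

print(child_profile_name("/tmp/profile", 1234))  # /tmp/profile_1234
```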

<p>For security reasons, CPU profiling will not write to a file -- and
is thus not usable -- for setuid programs.</p>

<h2>Controlling Behavior via the Environment</h2>

<p>In addition to the environment variable <code>CPUPROFILE</code>,
which determines where profiles are written, there are several
environment variables which control the behavior of the CPU
profiler.</p>

<table cellpadding="5" frame="box" rules="sides" width="100%">
<tbody><tr>
  <td><code>PROFILEFREQUENCY=<i>x</i></code></td>
  <td>How many interrupts/second the cpu-profiler samples.
  </td>
</tr>
</tbody></table>

<h1>Analyzing the Output</h1>

<p>pprof is the script used to analyze a profile.  It has many output
modes, both textual and graphical.  Some give just raw numbers, much
like the -pg output of gcc, and others show the data in the form of a
dependency graph.</p>

<p>pprof <b>requires</b> perl5 to be installed to run.  It also
requires dot to be installed for any of the graphical output routines,
and gv to be installed for --gv mode (described below).</p>

<p>Here are some ways to call pprof.  These are described in more
detail below.</p>

<pre>% pprof "program" "profile"
                       Generates one line per procedure

% pprof --gv "program" "profile"
                       Generates annotated call-graph and displays via "gv"

% pprof --gv --focus=Mutex "program" "profile"
                       Restrict to code paths that involve an entry that matches "Mutex"

% pprof --gv --focus=Mutex --ignore=string "program" "profile"
                       Restrict to code paths that involve an entry that matches "Mutex"
                       and does not match "string"

% pprof --list=IBF_CheckDocid "program" "profile"
                       Generates a source-code listing of all routines that match
                       the --list=&lt;regexp&gt; pattern and have at least one sample.
                       The listing is annotated with the flat and cumulative sample
                       counts at each line.

% pprof --disasm=IBF_CheckDocid "program" "profile"
                       Generates a disassembly listing of all routines that match
                       the --disasm=&lt;regexp&gt; pattern and have at least one sample.
                       The listing is annotated with the flat and cumulative sample
                       counts at each PC value.
</pre>

<h3>Node Information</h3>

<p>In the various graphical modes of pprof, the output is a call graph
annotated with timing information, like so:</p>

<a href="http://goog-perftools.sourceforge.net/doc/pprof-test-big.gif">
<center><table><tbody><tr><td>
<img src="../images/pprof-test.gif">
</td></tr></tbody></table></center>
</a>

<p>Each node represents a procedure.
The directed edges indicate caller to callee relations.  Each node is
formatted as follows:</p>

<center><pre>Class Name
Method Name
local (percentage)
<b>of</b> cumulative (percentage)
</pre></center>

<p>The last one or two lines contain the timing information.  (The
profiling is done via a sampling method, where by default we take 100
samples a second.  Therefore one unit of time in the output corresponds
to about 10 milliseconds of execution time.)  The "local" time is the
time spent executing the instructions directly contained in the
procedure (and in any other procedures that were inlined into the
procedure).  The "cumulative" time is the sum of the "local" time and
the time spent in any callees.  If the cumulative time is the same as
the local time, it is not printed.</p>

<p>For instance, the timing information for test_main_thread()
indicates that 155 units (about 1.55 seconds) were spent executing the
code in test_main_thread() and 200 units were spent while executing
test_main_thread() and its callees such as snprintf().</p>
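The unit arithmetic above is easy to reproduce: at the default rate of 100 samples/second, one sample unit corresponds to about 10 ms of execution time.

```python
# Convert profile sample counts ("units") to approximate seconds,
# using the default sampling frequency of 100 interrupts/second.
def samples_to_seconds(units, frequency=100):
    return units / float(frequency)

print(samples_to_seconds(155))  # 1.55  (local time of test_main_thread)
print(samples_to_seconds(200))  # 2.0   (cumulative time)
```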

<p>The size of the node is proportional to the local count.  The
percentage displayed in the node corresponds to the count divided by
the total run time of the program (that is, the cumulative count for
main()).</p>

<h3>Edge Information</h3>

<p>An edge from one node to another indicates a caller to callee
relationship.  Each edge is labelled with the time spent by the callee
on behalf of the caller.  E.g., the edge from test_main_thread() to
snprintf() indicates that of the 200 samples in
test_main_thread(), 37 are because of calls to snprintf().</p>

<p>Note that test_main_thread() has an edge to vsnprintf(), even
though test_main_thread() doesn't call that function directly.  This
is because the code was compiled with -O2; the profile reflects the
optimized control flow.</p>

<h3>Meta Information</h3>

The top of the display should contain some meta information like:
<pre>      /tmp/profiler2_unittest
      Total samples: 202
      Focusing on: 202
      Dropped nodes with &lt;= 1 abs(samples)
      Dropped edges with &lt;= 0 samples
</pre>

This section contains the name of the program, and the total samples
collected during the profiling run.  If the --focus option is on (see
the <a href="#focus">Focus</a> section below), the legend also
contains the number of samples being shown in the focused display.
Furthermore, some unimportant nodes and edges are dropped to reduce
clutter.  The characteristics of the dropped nodes and edges are also
displayed in the legend.

<h3><a name="focus">Focus and Ignore</a></h3>

<p>You can ask pprof to generate a display focused on a particular
piece of the program.  You specify a regular expression.  Any portion
of the call-graph that is on a path which contains at least one node
matching the regular expression is preserved.  The rest of the
call-graph is dropped on the floor.  For example, you can focus on the
vsnprintf() libc call in profiler2_unittest as follows:</p>

<pre>% pprof --gv --focus=vsnprintf /tmp/profiler2_unittest test.prof
</pre>
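The path-matching rule behind --focus (and --ignore, described below) can be sketched as follows.  This is an illustration of the semantics, not pprof's implementation (pprof itself is a perl script):

```python
import re

# A call path is kept if some node matches the focus pattern and no
# node matches the ignore pattern, per the rules described above.
def keep_path(path, focus=None, ignore=None):
    if focus and not any(re.search(focus, node) for node in path):
        return False
    if ignore and any(re.search(ignore, node) for node in path):
        return False
    return True

path = ["main", "test_main_thread", "snprintf", "vsnprintf"]
print(keep_path(path, focus="vsnprintf"))  # True: the path touches vsnprintf
print(keep_path(path, ignore="snprintf"))  # False: snprintf is on the path
```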

<a href="http://goog-perftools.sourceforge.net/doc/pprof-vsnprintf-big.gif">
<center><table><tbody><tr><td>
<img src="../images/pprof-vsnprintf.gif">
</td></tr></tbody></table></center>
</a>

<p>
Similarly, you can supply the --ignore option to ignore
samples that match a specified regular expression.  E.g.,
if you are interested in everything except calls to snprintf(),
you can say:
</p>
<pre>% pprof --gv --ignore=snprintf /tmp/profiler2_unittest test.prof
</pre>

<h3><a name="options">pprof Options</a></h3>

<h4>Output Type</h4>

<center>
<table cellpadding="5" frame="box" rules="sides" width="100%">
<tbody><tr valign="top">
  <td><code>--text</code></td>
  <td>
    Produces a textual listing.  This is currently the default
    since it does not need access to an X display, or
    dot or gv.  However, if you
    have these programs installed, you will probably be
    happier with the --gv output.
  </td>
</tr>
<tr valign="top">
  <td><code>--gv</code></td>
  <td>
    Generates an annotated call-graph, converts it to postscript, and
    displays it via gv.
  </td>
</tr>
<tr valign="top">
  <td><code>--dot</code></td>
  <td>
    Generates the annotated call-graph in dot format and
    emits it to stdout.
  </td>
</tr>
<tr valign="top">
  <td><code>--ps</code></td>
  <td>
    Generates the annotated call-graph in Postscript format and
    emits it to stdout.
  </td>
</tr>
<tr valign="top">
  <td><code>--gif</code></td>
  <td>
    Generates the annotated call-graph in GIF format and
    emits it to stdout.
  </td>
</tr>
<tr valign="top">
  <td><code>--list=&lt;<i>regexp</i>&gt;</code></td>
  <td>
    <p>Outputs a source-code listing of routines whose
    name matches &lt;regexp&gt;.  Each line
    in the listing is annotated with flat and cumulative
    sample counts.</p>

    <p>In the presence of inlined calls, the samples
    associated with inlined code tend to get assigned
    to a line that follows the location of the
    inlined call.  A more precise accounting can be
    obtained by disassembling the routine using the
    --disasm flag.</p>
  </td>
</tr>
<tr valign="top">
  <td><code>--disasm=&lt;<i>regexp</i>&gt;</code></td>
  <td>
    Generates a disassembly of routines that match
    &lt;regexp&gt;, annotated with flat and
    cumulative sample counts, and emits it to stdout.
  </td>
</tr>
</tbody></table>
</center>

<h4>Reporting Granularity</h4>

<p>By default, pprof produces one entry per procedure.  However, you can
use one of the following options to change the granularity of the
output.  The --files option seems to be particularly useless, and may
be removed eventually.</p>

<center>
<table cellpadding="5" frame="box" rules="sides" width="100%">
<tbody><tr valign="top">
  <td><code>--addresses</code></td>
  <td>
    Produce one node per program address.
  </td>
</tr>
<tr><td><code>--lines</code></td>
  <td>
    Produce one node per source line.
  </td>
</tr>
<tr><td><code>--functions</code></td>
  <td>
    Produce one node per function (this is the default).
  </td>
</tr>
<tr><td><code>--files</code></td>
  <td>
    Produce one node per source file.
  </td>
</tr>
</tbody></table>
</center>

<h4>Controlling the Call Graph Display</h4>

<p>Some nodes and edges are dropped to reduce clutter in the output
display.  The following options control this effect:</p>

<center>
<table cellpadding="5" frame="box" rules="sides" width="100%">
<tbody><tr valign="top">
  <td><code>--nodecount=&lt;n&gt;</code></td>
  <td>
    This option controls the number of displayed nodes.  The nodes
    are first sorted by decreasing cumulative count, and then only
    the top N nodes are kept.  The default value is 80.
  </td>
</tr>
<tr valign="top">
  <td><code>--nodefraction=&lt;f&gt;</code></td>
  <td>
    This option provides another mechanism for discarding nodes
    from the display.  If the cumulative count for a node is
    less than this option's value multiplied by the total count
    for the profile, the node is dropped.  The default value
    is 0.005; i.e. nodes that account for less than
    half a percent of the total time are dropped.  A node
    is dropped if either this condition is satisfied, or the
    --nodecount condition is satisfied.
  </td>
</tr>
<tr valign="top">
  <td><code>--edgefraction=&lt;f&gt;</code></td>
  <td>
    This option controls the number of displayed edges.  First of all,
    an edge is dropped if either its source or destination node is
    dropped.  Otherwise, the edge is dropped if the sample
    count along the edge is less than this option's value multiplied
    by the total count for the profile.  The default value is
    0.001; i.e., edges that account for less than
    0.1% of the total time are dropped.
  </td>
</tr>
<tr valign="top">
  <td><code>--focus=&lt;re&gt;</code></td>
  <td>
    This option controls what region of the graph is displayed
    based on the regular expression supplied with the option.
    For any path in the callgraph, we check all nodes in the path
    against the supplied regular expression.  If none of the nodes
    match, the path is dropped from the output.
  </td>
</tr>
<tr valign="top">
  <td><code>--ignore=&lt;re&gt;</code></td>
  <td>
    This option controls what region of the graph is displayed
    based on the regular expression supplied with the option.
    For any path in the callgraph, we check all nodes in the path
    against the supplied regular expression.  If any of the nodes
    match, the path is dropped from the output.
  </td>
</tr>
</tbody></table>
</center>
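The node-dropping rules above can be sketched in a few lines.  This is an illustration of the documented semantics, not pprof's code; the function and node names are made up for the example:

```python
# A node survives only if it is among the top --nodecount nodes by
# cumulative count AND accounts for at least --nodefraction of the
# total count (defaults: 80 and 0.005, as described above).
def surviving_nodes(cum_counts, total, nodecount=80, nodefraction=0.005):
    ranked = sorted(cum_counts, key=cum_counts.get, reverse=True)[:nodecount]
    return {n for n in ranked if cum_counts[n] >= nodefraction * total}

# "tiny_helper" is a hypothetical node: 1 sample < 0.005 * 202, so it
# is dropped from the display.
counts = {"main": 202, "snprintf": 41, "tiny_helper": 1}
print(sorted(surviving_nodes(counts, total=202)))  # ['main', 'snprintf']
```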

<p>The dropped edges and nodes account for some count mismatches in
the display.  For example, the cumulative count for
snprintf() in the first diagram above was 41.  However, the local
count (1) and the count along the outgoing edges (12+1+20+6) add up to
only 40.</p>

<h1>Caveats</h1>

<ul>
<li> If the program exits because of a signal, the generated profile
     will be <font color="red">incomplete, and may perhaps be
     completely empty.</font>
</li><li> The displayed graph may have disconnected regions because
     of the edge-dropping heuristics described above.
</li><li> If the program linked in a library that was not compiled
     with enough symbolic information, all samples associated
     with the library may be charged to the last symbol found
     in the program before the library.  This will artificially
     inflate the count for that symbol.
</li><li> If you run the program on one machine, and profile it on another,
     and the shared libraries are different on the two machines, the
     profiling output may be confusing: samples that fall within
     the shared libraries may be assigned to arbitrary procedures.
</li><li> If your program forks, the children will also be profiled (since
     they inherit the same CPUPROFILE setting).  Each process is
     profiled separately; to distinguish the child profiles from the
     parent profile and from each other, all children will have their
     process-id appended to the CPUPROFILE name.
</li><li> Due to a hack we make to work around a possible gcc bug, your
     profiles may end up named strangely if the first character of
     your CPUPROFILE variable has an ASCII value greater than 127.  This
     should be exceedingly rare, but if you need to use such a name,
     just prepend <code>./</code> to your filename:
     <code>CPUPROFILE=./Ägypten</code>.
</li></ul>

<hr>
Last modified: Wed Apr 20 04:54:23 PDT 2005

</body></html>

<!-- docs/html/heap_checker.html -->
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html><head><title>Google Heap Checker</title></head><body>
<h1>Automatic Leak-Checking Support</h1>

This document describes how to check the heap usage of a C++
program.  This facility can be useful for automatically detecting
memory leaks.

<h2>Linking in the Heap Checker</h2>

<p>
You can heap-check any program that has the tcmalloc library linked
in.  No recompilation is necessary to use the heap checker.
</p>

<p>
In order to catch all heap leaks, tcmalloc must be linked <i>last</i> into
your executable.  The heap checker may mischaracterize some memory
accesses in libraries listed after it on the link line.  For instance,
it may report these libraries as leaking memory when they're not.
(See the source code for more details.)
</p>

<p>
It's safe to link in tcmalloc even if you don't expect to
heap-check your program.  Your programs will not run any slower
as long as you don't use any of the heap-checker features.
</p>

<p>
You can run the heap checker on applications you didn't compile
yourself, by using LD_PRELOAD:
</p>
<pre>   $ LD_PRELOAD="/usr/lib/libtcmalloc.so" HEAPCHECK=normal &lt;binary&gt;
</pre>
<p>
We don't necessarily recommend this mode of usage.
</p>

<h2>Turning On Heap Checking</h2>

<p>There are two ways to turn on heap checking for a given run of an
executable.</p>

<ul>
<li> For whole-program heap-checking, define the environment variable
     HEAPCHECK to the type of heap
     checking you want: normal, strict, or draconian.  For instance,
     to heap-check <code>/bin/ls</code>:
<pre>   $ HEAPCHECK=normal /bin/ls
   % setenv HEAPCHECK normal; /bin/ls    # csh
</pre>
     OR

</li><li> For partial-code heap-checking, you need to modify your code.
     For each piece of code you want heap-checked, bracket the code
     by creating a <code>HeapLeakChecker</code> object
     (which takes a descriptive label as an argument), and calling
     <code>check.NoLeaks()</code> at the end of the code you want
     checked.  This will verify that no more memory is allocated at the
     end of the code segment than was allocated at the beginning.  To
     actually turn on the heap-checking, set the environment variable
     HEAPCHECK to <code>local</code>.


<p>
Here is an example of the second usage.  The following code will
die if <code>Foo()</code> leaks any memory
(i.e. it allocates memory that is not freed by the time it returns):
</p>
<pre>   HeapProfileLeakChecker checker("foo");
   Foo();
   assert(checker.NoLeaks());
</pre>

<p>
When the <code>checker</code> object is allocated, it creates
one heap profile.  When <code>checker.NoLeaks()</code> is invoked,
it creates another heap profile and compares it to the previously
created profile.  If the new profile indicates memory growth
(or any memory allocation change if one
uses <code>checker.SameHeap()</code> instead), <code>NoLeaks()</code>
will return false and the program will abort.  An error message will
also be printed out saying how the <code>pprof</code> command can be run
to get a detailed analysis of the actual leaks.
</p>
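The before/after comparison described above can be sketched as follows.  This is a toy model of the check's logic (comparing live bytes at two snapshots), not the heap checker's actual implementation:

```python
# Toy model of the two-snapshot comparison: one "profile" is taken at
# construction, another at check time.  NoLeaks() fails on growth;
# SameHeap() fails on any change.
class LeakCheckerSketch:
    def __init__(self, live_bytes):
        self.start = live_bytes        # snapshot at construction

    def no_leaks(self, live_bytes):
        return live_bytes <= self.start

    def same_heap(self, live_bytes):
        return live_bytes == self.start

checker = LeakCheckerSketch(live_bytes=1000)
print(checker.no_leaks(1000))   # True: no growth
print(checker.no_leaks(1024))   # False: 24 bytes leaked
print(checker.same_heap(900))   # False: the heap changed
```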

<p>
See the comments for the <code>HeapProfileLeakChecker</code> class in
<code>heap-checker.h</code> and the code in
<code>heap-checker_unittest.cc</code> for more information and
examples.  (TODO: document it all here instead!)
</p>

<p>
<b>IMPORTANT NOTE</b>: pthreads handling is currently incomplete.
Heap leak checks will fail with bogus leaks if there are pthreads live
at construction or leak checking time.  One solution, for global
heap-checking, is to make sure all threads but the main thread have
exited at program-end time.  We hope (as of March 2005) to have a fix
soon.
</p>

<h2>Disabling Heap-checking of Known Leaks</h2>

<p>
Sometimes your code has leaks that you know about and are willing to
accept.  You would like the heap checker to ignore them when checking
your program.  You can do this by bracketing the code in question with
an appropriate heap-checking object:
</p>
<pre>   #include &lt;google/heap-checker.h&gt;
   ...
   void *mark = HeapLeakChecker::GetDisableChecksStart();
   &lt;leaky code&gt;
   HeapLeakChecker::DisableChecksToHereFrom(mark);
</pre>

<p>
Some libc routines allocate memory, and may need to be 'disabled' in
this way.  As time goes on, we hope to encode proper handling of
these routines into the heap-checker library code, so applications
needn't worry about them, but that process is not yet complete.
</p>

<hr>
<address><a href="mailto:opensource@google.com">Maxim Lifantsev</a></address>
<!-- Created: Tue Dec 19 10:43:14 PST 2000 -->
<!-- hhmts start -->
Last modified: Thu Mar 3 05:51:40 PST 2005
<!-- hhmts end -->

</li></ul></body></html>

<!-- docs/html/heap_profiler.html -->
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html><head><title>Google Heap Profiler</title></head><body>
<h1>Profiling heap usage</h1>

This document describes how to profile the heap usage of a C++
program.  This facility can be useful for
<ul>
<li> Figuring out what is in the program heap at any given time
</li><li> Locating memory leaks
</li><li> Finding places that do a lot of allocation
</li></ul>

<h2>Linking in the Heap Profiler</h2>

<p>
You can profile any program that has the tcmalloc library linked
in.  No recompilation is necessary to use the heap profiler.
</p>

<p>
It's safe to link in tcmalloc even if you don't expect to
heap-profile your program.  Your programs will not run any slower
as long as you don't use any of the heap-profiler features.
</p>

<p>
You can run the heap profiler on applications you didn't compile
yourself, by using LD_PRELOAD:
</p>
<pre>   $ LD_PRELOAD="/usr/lib/libtcmalloc.so" HEAPPROFILE=... &lt;binary&gt;
</pre>
<p>
We don't necessarily recommend this mode of usage.
</p>

<h2>Turning On Heap Profiling</h2>

<p>
Define the environment variable HEAPPROFILE to the filename to dump the
profile to.  For instance, to profile /usr/local/netscape:
</p>
<pre>   $ HEAPPROFILE=/tmp/profile /usr/local/netscape            # sh
   % setenv HEAPPROFILE /tmp/profile; /usr/local/netscape    # csh
</pre>

<p>Profiling also works correctly with sub-processes: each child
process gets its own profile with its own name (generated by combining
HEAPPROFILE with the child's process id).</p>

<p>For security reasons, heap profiling will not write to a file --
and is thus not usable -- for setuid programs.</p>

<h2>Extracting a profile</h2>

<p>
If heap-profiling is turned on in a program, the program will periodically
write profiles to the filesystem.  The sequence of profiles will be named:
</p>
<pre>           &lt;prefix&gt;.0000.heap
           &lt;prefix&gt;.0001.heap
           &lt;prefix&gt;.0002.heap
           ...
</pre>
<p>
where <code>&lt;prefix&gt;</code> is the value supplied in
<code>HEAPPROFILE</code>.  Note that if the supplied prefix
does not start with a <code>/</code>, the profile files will be
written to the program's working directory.
</p>

<p>
By default, a new profile file is written after every 1GB of
allocation.  The profile-writing interval can be adjusted by calling
HeapProfilerSetAllocationInterval() from your program.  This takes one
argument: a numeric value that indicates the number of bytes of allocation
between each profile dump.
</p>
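The dump cadence can be illustrated with a short sketch: a profile is written each time the running total of allocated bytes crosses another multiple of the interval.  This models the documented behavior only; it is not the profiler's code:

```python
# How many profile dumps have been triggered after a given amount of
# cumulative allocation, with the default 1GB interval.
def dumps_after(allocated_bytes, interval=1 << 30):
    return allocated_bytes // interval

print(dumps_after(3 * (1 << 30)))  # 3 dumps after 3GB of allocation
print(dumps_after(100))            # 0 dumps: interval not yet reached
```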

<p>
You can also generate profiles from specific points in the program
by inserting a call to <code>HeapProfile()</code>.  Example:
</p>
<pre>    extern const char* HeapProfile();
    const char* profile = HeapProfile();
    fputs(profile, stdout);
    free(const_cast&lt;char*&gt;(profile));
</pre>

<h2>What is profiled</h2>

The profiling system instruments all allocations and frees.  It keeps
track of various pieces of information per allocation site.  An
allocation site is defined as the active stack trace at the call to
<code>malloc</code>, <code>calloc</code>, <code>realloc</code>, or
<code>new</code>.
<h2>Interpreting the profile</h2>
|
||||||
|
|
||||||
|
The profile output can be viewed by passing it to the
|
||||||
|
<code>pprof</code> tool. The <code>pprof</code> tool can print both
|
||||||
|
CPU usage and heap usage information. It is documented in detail
|
||||||
|
on the <a href="http://goog-perftools.sourceforge.net/doc/cpu_profiler.html">CPU Profiling</a> page.
|
||||||
|
Heap-profile-specific flags and usage are explained below.
|
||||||
|
|
||||||
|
<p>
|
||||||
|
Here are some examples. These examples assume the binary is named
|
||||||
|
<code>gfs_master</code>, and a sequence of heap profile files can be
|
||||||
|
found in files named:
|
||||||
|
</p>
<pre>  profile.0001.heap
  profile.0002.heap
  ...
  profile.0100.heap
</pre>

<h3>Why is a process so big</h3>

<pre>    % pprof --gv gfs_master profile.0100.heap
</pre>

This command will pop up a <code>gv</code> window that displays
the profile information as a directed graph.  Here is a portion
of the resulting output:

<center>
<img src="../images/heap-example1.png">
</center>

A few explanations:
<ul>
<li> <code>GFS_MasterChunk::AddServer</code> accounts for 255.6 MB
     of the live memory, which is 25% of the total live memory.
</li><li> <code>GFS_MasterChunkTable::UpdateState</code> is directly
     accountable for 176.2 MB of the live memory (i.e., it directly
     allocated 176.2 MB that has not been freed yet).  Furthermore,
     it and its callees are responsible for 729.9 MB.  The
     labels on the outgoing edges give a good indication of the
     amount allocated by each callee.
</li></ul>

<h3>Comparing Profiles</h3>

<p>
You often want to skip allocations during the initialization phase of
a program so you can find gradual memory leaks.  One simple way to do
this is to compare two profiles -- both collected after the program
has been running for a while.  Specify the name of the first profile
using the <code>--base</code> option.  Example:
</p>
<pre>   % pprof --base=profile.0004.heap gfs_master profile.0100.heap
</pre>

<p>
The memory-usage in <code>profile.0004.heap</code> will be subtracted from
the memory-usage in <code>profile.0100.heap</code> and the result will
be displayed.
</p>
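The subtraction can be sketched per allocation site.  The site names and numbers below are illustrative (borrowed loosely from the example output above), and this models only the idea of --base, not pprof's implementation:

```python
# Per-site usage in the base profile is subtracted from the newer
# profile, so only growth since the base snapshot remains visible.
def subtract_profiles(current, base):
    return {site: current[site] - base.get(site, 0.0) for site in current}

base    = {"GFS_MasterChunk::AddServer": 100.0, "Init": 50.0}
current = {"GFS_MasterChunk::AddServer": 255.6, "Init": 50.0}
diff = subtract_profiles(current, base)
print(diff)  # AddServer grew; Init is unchanged (0.0)
```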

<h3>Text display</h3>

<pre>% pprof gfs_master profile.0100.heap
   255.6  24.7%  24.7%    255.6  24.7% GFS_MasterChunk::AddServer
   184.6  17.8%  42.5%    298.8  28.8% GFS_MasterChunkTable::Create
   176.2  17.0%  59.5%    729.9  70.5% GFS_MasterChunkTable::UpdateState
   169.8  16.4%  75.9%    169.8  16.4% PendingClone::PendingClone
    76.3   7.4%  83.3%     76.3   7.4% __default_alloc_template::_S_chunk_alloc
    49.5   4.8%  88.0%     49.5   4.8% hashtable::resize
   ...
</pre>
|
||||||
|
|
||||||
|
<p>
|
||||||
|
</p><ul>
|
||||||
|
<li> The first column contains the direct memory use in MB.
|
||||||
|
</li><li> The fourth column contains memory use by the procedure
|
||||||
|
and all of its callees.
|
||||||
|
</li><li> The second and fifth columns are just percentage representations
|
||||||
|
of the numbers in the first and fifth columns.
|
||||||
|
</li><li> The third column is a cumulative sum of the second column
|
||||||
|
(i.e., the <code>k</code>th entry in the third column is the
|
||||||
|
sum of the first <code>k</code> entries in the second column.)
|
||||||
|
</li></ul>
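The column relationships can be checked with a short sketch. The numbers come from the listing above; the ~1035 MB total is inferred from the printed percentages, not taken from an actual profile:

```python
# Each row: direct MB as printed in the first column of the pprof output.
direct_mb = [255.6, 184.6, 176.2]
total_mb = 1035.0  # inferred: 255.6 MB is 24.7% of the total

# Column 2 is the direct percentage; column 3 is its running sum.
pct = [round(100 * d / total_mb, 1) for d in direct_mb]
cum = [round(sum(pct[: i + 1]), 1) for i in range(len(pct))]
print(pct)  # direct percentages (second column)
print(cum)  # cumulative percentages (third column)
```

With these inputs the sketch reproduces the 24.7% / 42.5% / 59.5% progression shown in the listing.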

<h3>Ignoring or focusing on specific regions</h3>

The following command will give a graphical display of a subset of
the call-graph. Only paths in the call-graph that match the
regular expression <code>DataBuffer</code> are included:
<pre>% pprof --gv --focus=DataBuffer gfs_master profile.0100.heap
</pre>

Similarly, the following command will omit a subset of the
call-graph. All paths in the call-graph that match the regular
expression <code>DataBuffer</code> are discarded:
<pre>% pprof --gv --ignore=DataBuffer gfs_master profile.0100.heap
</pre>
<h3>Total allocations + object-level information</h3>

<p>
All of the previous examples have displayed the amount of in-use
space, i.e., the number of bytes that have been allocated but not
freed. You can also get other types of information by supplying
a flag to <code>pprof</code>:
</p>

<center>
<table cellpadding="5" frame="box" rules="sides" width="100%">

<tbody><tr valign="top">
<td><code>--inuse_space</code></td>
<td>
Display the number of in-use megabytes (i.e. space that has
been allocated but not freed). This is the default.
</td>
</tr>

<tr valign="top">
<td><code>--inuse_objects</code></td>
<td>
Display the number of in-use objects (i.e. number of
objects that have been allocated but not freed).
</td>
</tr>

<tr valign="top">
<td><code>--alloc_space</code></td>
<td>
Display the number of allocated megabytes. This includes
the space that has since been de-allocated. Use this
if you want to find the main allocation sites in the
program.
</td>
</tr>

<tr valign="top">
<td><code>--alloc_objects</code></td>
<td>
Display the number of allocated objects. This includes
the objects that have since been de-allocated. Use this
if you want to find the main allocation sites in the
program.
</td>
</tr></tbody></table>
</center>
<h2>Caveats</h2>

<ul>
<li> <p>
Heap profiling requires the use of libtcmalloc. This requirement
may be removed in a future version of the heap profiler, and the
heap profiler separated out into its own library.
</p>

</li><li> <p>
If the program linked in a library that was not compiled
with enough symbolic information, all samples associated
with the library may be charged to the last symbol found
in the program before the library. This will artificially
inflate the count for that symbol.
</p>

</li><li> <p>
If you run the program on one machine, and profile it on another,
and the shared libraries are different on the two machines, the
profiling output may be confusing: samples that fall within
the shared libraries may be assigned to arbitrary procedures.
</p>

</li><li> <p>
Several libraries, such as some STL implementations, do their own
memory management. This may cause strange profiling results. We
have code in libtcmalloc to cause STL to use tcmalloc for memory
management (which in our tests is better than STL's internal
management), though it only works for some STL implementations.
</p>

</li><li> <p>
If your program forks, the children will also be profiled (since
they inherit the same HEAPPROFILE setting). Each process is
profiled separately; to distinguish the child profiles from the
parent profile and from each other, all children will have their
process-id attached to the HEAPPROFILE name.
</p>

</li><li> <p>
Due to a hack we make to work around a possible gcc bug, your
profiles may end up named strangely if the first character of
your HEAPPROFILE variable has ASCII value greater than 127. This
should be exceedingly rare, but if you need to use such a name,
just prepend <code>./</code> to your filename:
<code>HEAPPROFILE=./Ägypten</code>.
</p>

</li></ul>
<hr>
<address><a href="mailto:opensource@google.com">Sanjay Ghemawat</a></address>
<!-- Created: Tue Dec 19 10:43:14 PST 2000 -->
<!-- hhmts start -->
Last modified: Wed Apr 20 05:46:16 PDT 2005
<!-- hhmts end -->

</body></html>
373
docs/html/tcmalloc.html
Normal file
@ -0,0 +1,373 @@
<!DOCTYPE html PUBLIC "-//w3c//dtd html 4.01 transitional//en">
<html><head><!-- $Id: $ --><title>TCMalloc : Thread-Caching Malloc</title>

<style type="text/css">
em {
  color: red;
  font-style: normal;
}
</style></head><body>

<h1>TCMalloc : Thread-Caching Malloc</h1>

<address>Sanjay Ghemawat, Paul Menage &lt;opensource@google.com&gt;</address>
<h2>Motivation</h2>

TCMalloc is faster than the glibc 2.3 malloc (available as a separate
library called ptmalloc2) and other mallocs
that I have tested. ptmalloc2 takes approximately 300 nanoseconds to
execute a malloc/free pair on a 2.8 GHz P4 (for small objects). The
TCMalloc implementation takes approximately 50 nanoseconds for the
same operation pair. Speed is important for a malloc implementation
because if malloc is not fast enough, application writers are inclined
to write their own custom free lists on top of malloc. This can lead
to extra complexity and more memory usage, unless the application
writer is very careful to appropriately size the free lists and
scavenge idle objects out of the free lists.

<p>
TCMalloc also reduces lock contention for multi-threaded programs.
For small objects, there is virtually zero contention. For large
objects, TCMalloc tries to use fine-grained and efficient spinlocks.
ptmalloc2 also reduces lock contention by using per-thread arenas, but
there is a big problem with ptmalloc2's use of per-thread arenas: in
ptmalloc2, memory can never move from one arena to another. This can
lead to huge amounts of wasted space. For example, in one Google
application, the first phase would allocate approximately 300MB of
memory for its data structures. When the first phase finished, a
second phase would be started in the same address space. If this
second phase was assigned a different arena than the one used by the
first phase, this phase would not reuse any of the memory left after
the first phase and would add another 300MB to the address space.
Similar memory blowup problems were also noticed in other
applications.

</p><p>
Another benefit of TCMalloc is space-efficient representation of small
objects. For example, N 8-byte objects can be allocated while using
space approximately <code>8N * 1.01</code> bytes, i.e., a one-percent
space overhead. ptmalloc2 uses a four-byte header for each object and
(I think) rounds up the size to a multiple of 8 bytes and ends up
using <code>16N</code> bytes.

</p><h2>Usage</h2>

<p>To use TCMalloc, just link tcmalloc into your application via the
"-ltcmalloc" linker flag.</p>

<p>
You can use tcmalloc in applications you didn't compile yourself, by
using LD_PRELOAD:
</p>
<pre>   $ LD_PRELOAD="/usr/lib/libtcmalloc.so" &lt;binary&gt;
</pre>
<p>
LD_PRELOAD is tricky, and we don't necessarily recommend this mode of
usage.
</p>

<p>TCMalloc includes a <a href="http://goog-perftools.sourceforge.net/doc/heap_checker.html">heap checker</a>
and <a href="http://goog-perftools.sourceforge.net/doc/heap_profiler.html">heap profiler</a> as well.</p>

<p>If you'd rather link in a version of TCMalloc that does not include
the heap profiler and checker (perhaps to reduce binary size for a
static binary), you can link in <code>libtcmalloc_minimal</code>
instead.</p>
<h2>Overview</h2>

TCMalloc assigns each thread a thread-local cache. Small allocations
are satisfied from the thread-local cache. Objects are moved from
central data structures into a thread-local cache as needed, and
periodic garbage collections are used to migrate memory back from a
thread-local cache into the central data structures.
<center><img src="../images/overview.gif"></center>

<p>
TCMalloc treats objects with size &lt;= 32K ("small" objects)
differently from larger objects. Large objects are allocated
directly from the central heap using a page-level allocator
(a page is a 4K aligned region of memory). I.e., a large object
is always page-aligned and occupies an integral number of pages.

</p><p>
A run of pages can be carved up into a sequence of small objects, each
equally sized. For example, a run of one page (4K) can be carved up
into 32 objects of size 128 bytes each.

</p><h2>Small Object Allocation</h2>

Each small object size maps to one of approximately 170 allocatable
size-classes. For example, all allocations in the range 961 to 1024
bytes are rounded up to 1024. The size-classes are spaced so that
small sizes are separated by 8 bytes, larger sizes by 16 bytes, even
larger sizes by 32 bytes, and so forth. The maximal spacing (for sizes
&gt;= ~2K) is 256 bytes.
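The rounding scheme can be sketched as follows. This is a simplified illustration of the spacing idea, with made-up breakpoints; the real tcmalloc size-class table is tuned differently:

```python
def size_class(size):
    """Round a request up to an illustrative size-class.

    The step widths below are hypothetical stand-ins for the spacing
    described above: tighter steps for small sizes, coarser steps
    (up to 256 bytes) for larger ones.
    """
    if size <= 64:
        step = 8
    elif size <= 256:
        step = 16
    elif size <= 1024:
        step = 64
    else:
        step = 256
    # Round up to the next multiple of step (ceiling division).
    return -(-size // step) * step

print(size_class(961))   # -> 1024, matching the 961-1024 example above
```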
<p>
A thread cache contains a singly linked list of free objects per size-class.
</p><center><img src="../images/threadheap.gif"></center>

When allocating a small object: (1) We map its size to the
corresponding size-class. (2) Look in the corresponding free list in
the thread cache for the current thread. (3) If the free list is not
empty, we remove the first object from the list and return it. When
following this fast path, TCMalloc acquires no locks at all. This
helps speed up allocation significantly, because a lock/unlock pair
takes approximately 100 nanoseconds on a 2.8 GHz Xeon.
<p>
If the free list is empty: (1) We fetch a bunch of objects from a
central free list for this size-class (the central free list is shared
by all threads). (2) Place them in the thread-local free list. (3)
Return one of the newly fetched objects to the application.

</p><p>
If the central free list is also empty: (1) We allocate a run of pages
from the central page allocator. (2) Split the run into a set of
objects of this size-class. (3) Place the new objects on the central
free list. (4) As before, move some of these objects to the
thread-local free list.
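The three-level fallback above can be sketched in a few lines. Everything here is a toy stand-in (the names, the batch size of 4, and the fake span carving are all invented for illustration); real tcmalloc does this in C++ with per-thread storage and locking on the central structures only:

```python
from collections import defaultdict

thread_cache = defaultdict(list)   # size-class -> free objects (per thread)
central_list = defaultdict(list)   # size-class -> free objects (shared)
BATCH = 4                          # objects fetched per refill (illustrative)

def carve_new_span(cl):
    # Stand-in for the page allocator: mint fresh objects for this class.
    return [f"obj-{cl}-{i}" for i in range(BATCH * 2)]

def allocate(cl):
    if not thread_cache[cl]:               # fast path missed
        if not central_list[cl]:           # central list empty too
            central_list[cl].extend(carve_new_span(cl))
        # Move a batch of objects from the central list to the thread cache.
        batch = central_list[cl][:BATCH]
        del central_list[cl][:BATCH]
        thread_cache[cl].extend(batch)
    return thread_cache[cl].pop()          # the lock-free fast path

x = allocate(3)
```

After one allocation the thread cache holds the rest of the fetched batch, so subsequent allocations of the same size-class hit the fast path.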

</p><h2>Large Object Allocation</h2>

A large object size (&gt; 32K) is rounded up to a page size (4K) and
is handled by a central page heap. The central page heap is again an
array of free lists. For <code>k &lt; 256</code>, the
<code>k</code>th entry is a free list of runs that consist of
<code>k</code> pages. The <code>256</code>th entry is a free list of
runs that have length <code>&gt;= 256</code> pages:
<center><img src="../images/pageheap.gif"></center>

<p>
An allocation for <code>k</code> pages is satisfied by looking in the
<code>k</code>th free list. If that free list is empty, we look in
the next free list, and so forth. Eventually, we look in the last
free list if necessary. If that fails, we fetch memory from the
system (using sbrk, mmap, or by mapping in portions of /dev/mem).

</p><p>
If an allocation for <code>k</code> pages is satisfied by a run
of pages of length &gt; <code>k</code>, the remainder of the
run is re-inserted back into the appropriate free list in the
page heap.
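A toy version of this lookup-then-split search (runs are hypothetical `(start_page, length)` pairs, and the system fallback is just an error here):

```python
# free_lists[k] holds runs of exactly k pages for k < MAX;
# free_lists[MAX] holds all runs of >= MAX pages.
MAX = 256
free_lists = {k: [] for k in range(1, MAX + 1)}
free_lists[MAX].append((0, 300))   # one big run of 300 pages at page 0

def alloc_pages(k):
    # Look in list k, then k+1, ..., finally the big-run list.
    for n in range(k, MAX + 1):
        if free_lists[n]:
            start, length = free_lists[n].pop()
            if length > k:                       # split; reinsert remainder
                rem = length - k
                free_lists[min(rem, MAX)].append((start + k, rem))
            return start
    raise MemoryError("would fall back to sbrk/mmap here")

p = alloc_pages(5)    # carves 5 pages off the 300-page run
```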

</p><h2>Spans</h2>

The heap managed by TCMalloc consists of a set of pages. A run of
contiguous pages is represented by a <code>Span</code> object. A span
can either be <em>allocated</em> or <em>free</em>. If free, the span
is one of the entries in a page heap linked-list. If allocated, it is
either a large object that has been handed off to the application, or
a run of pages that have been split up into a sequence of small
objects. If split into small objects, the size-class of the objects
is recorded in the span.

<p>
A central array indexed by page number can be used to find the span to
which a page belongs. For example, span <em>a</em> below occupies 2
pages, span <em>b</em> occupies 1 page, span <em>c</em> occupies 5
pages, and span <em>d</em> occupies 3 pages.
</p><center><img src="../images/spanmap.gif"></center>

A 32-bit address space can fit 2^20 4K pages, so this central array
takes 4MB of space, which seems acceptable. On 64-bit machines, we
use a 3-level radix tree instead of an array to map from a page number
to the corresponding span pointer.
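The page-to-span lookup can be sketched with a plain dict standing in for the central array (or radix tree); the span names and lengths follow the figure:

```python
PAGE_SHIFT = 12  # 4K pages: page number = address >> 12

# Span layout from the figure: a = 2 pages, b = 1, c = 5, d = 3.
span_lengths = {"a": 2, "b": 1, "c": 5, "d": 3}
pagemap = {}
page = 0
for name, npages in span_lengths.items():
    for _ in range(npages):
        pagemap[page] = name   # central array: page number -> span
        page += 1

def span_of(addr):
    # Find the span containing an address by indexing with its page number.
    return pagemap[addr >> PAGE_SHIFT]
```

Any address inside pages 3 through 7 maps to span <em>c</em>, regardless of its offset within the page.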

<h2>Deallocation</h2>

When an object is deallocated, we compute its page number and look it up
in the central array to find the corresponding span object. The span tells
us whether or not the object is small, and its size-class if it is
small. If the object is small, we insert it into the appropriate free
list in the current thread's thread cache. If the thread cache now
exceeds a predetermined size (2MB by default), we run a garbage
collector that moves unused objects from the thread cache into central
free lists.

<p>
If the object is large, the span tells us the range of pages covered
by the object. Suppose this range is <code>[p,q]</code>. We also
look up the spans for pages <code>p-1</code> and <code>q+1</code>. If
either of these neighboring spans is free, we coalesce it with the
<code>[p,q]</code> span. The resulting span is inserted into the
appropriate free list in the page heap.
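The neighbor coalescing can be sketched like this. The data structures are toy stand-ins (spans keyed by start page, a set of free span starts), not tcmalloc's actual representation:

```python
spans = {}     # start_page -> length in pages
free = set()   # start pages of free spans

def add_free(start, npages):
    spans[start] = npages
    free.add(start)

def free_span(start, npages):
    # Merge with a free left neighbor whose run ends exactly at `start`.
    for s in list(free):
        if s + spans[s] == start:
            start, npages = s, spans[s] + npages
            free.discard(s)
            del spans[s]
    # Merge with a free right neighbor beginning right after this run.
    nxt = start + npages
    if nxt in free:
        npages += spans.pop(nxt)
        free.discard(nxt)
    add_free(start, npages)

add_free(0, 4)     # a free span covering pages 0-3
free_span(4, 2)    # freeing pages 4-5 coalesces into one 6-page span
```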

</p><h2>Central Free Lists for Small Objects</h2>

As mentioned before, we keep a central free list for each size-class.
Each central free list is organized as a two-level data structure:
a set of spans, and a linked list of free objects per span.

<p>
An object is allocated from a central free list by removing the
first entry from the linked list of some span. (If all spans
have empty linked lists, a suitably sized span is first allocated
from the central page heap.)

</p><p>
An object is returned to a central free list by adding it to the
linked list of its containing span. If the linked list length now
equals the total number of small objects in the span, this span is now
completely free and is returned to the page heap.
</p><h2>Garbage Collection of Thread Caches</h2>

A thread cache is garbage collected when the combined size of all
objects in the cache exceeds 2MB. The garbage collection threshold
is automatically decreased as the number of threads increases so that
we don't waste an inordinate amount of memory in a program with lots
of threads.

<p>
We walk over all free lists in the cache and move some number of
objects from the free list to the corresponding central list.

</p><p>
The number of objects to be moved from a free list is determined using
a per-list low-water-mark <code>L</code>. <code>L</code> records the
minimum length of the list since the last garbage collection. Note
that we could have shortened the list by <code>L</code> objects at the
last garbage collection without requiring any extra accesses to the
central list. We use this past history as a predictor of future
accesses and move <code>L/2</code> objects from the thread cache free
list to the corresponding central free list. This algorithm has the
nice property that if a thread stops using a particular size, all
objects of that size will quickly move from the thread cache to the
central free list, where they can be used by other threads.

</p><h2>Performance Notes</h2>

<h3>PTMalloc2 unittest</h3>
The PTMalloc2 package (now part of glibc) contains a unittest program
t-test1.c. This forks a number of threads and performs a series of
allocations and deallocations in each thread; the threads do not
communicate other than by synchronization in the memory allocator.

<p> t-test1 (included in google-perftools/tests/tcmalloc, and compiled
as ptmalloc_unittest1) was run with varying numbers of threads
(1-20) and maximum allocation sizes (64 bytes - 32 Kbytes). These tests
were run on a 2.4GHz dual Xeon system with hyper-threading enabled,
using Linux glibc-2.3.2 from RedHat 9, with one million operations per
thread in each test. In each case, the test was run once normally, and
once with LD_PRELOAD=libtcmalloc.so.

</p><p>The graphs below show the performance of TCMalloc vs PTMalloc2 for
several different metrics. Firstly, total operations (millions) per elapsed
second vs max allocation size, for varying numbers of threads. The raw
data used to generate these graphs (the output of the "time" utility)
is available in t-test1.times.txt.

</p><p>
<table>
<tbody><tr>
<td><img src="../images/tcmalloc-opspersec_004.png"></td>
<td><img src="../images/tcmalloc-opspersec_009.png"></td>
<td><img src="../images/tcmalloc-opspersec_005.png"></td>
</tr>
<tr>
<td><img src="../images/tcmalloc-opspersec.png"></td>
<td><img src="../images/tcmalloc-opspersec_006.png"></td>
<td><img src="../images/tcmalloc-opspersec_008.png"></td>
</tr>
<tr>
<td><img src="../images/tcmalloc-opspersec_003.png"></td>
<td><img src="../images/tcmalloc-opspersec_002.png"></td>
<td><img src="../images/tcmalloc-opspersec_007.png"></td>
</tr>
</tbody></table>

</p><ul>

<li> TCMalloc is much more consistently scalable than PTMalloc2 - for
all thread counts &gt;1 it achieves ~7-9 million ops/sec for small
allocations, falling to ~2 million ops/sec for larger allocations. The
single-thread case is an obvious outlier, since it is only able to
keep a single processor busy and hence can achieve fewer
ops/sec. PTMalloc2 has a much higher variance on operations/sec -
peaking somewhere around 4 million ops/sec for small allocations and
falling to &lt;1 million ops/sec for larger allocations.

</li><li> TCMalloc is faster than PTMalloc2 in the vast majority of cases,
and particularly for small allocations. Contention between threads is
less of a problem in TCMalloc.

</li><li> TCMalloc's performance drops off as the allocation size
increases. This is because the per-thread cache is garbage-collected
when it hits a threshold (defaulting to 2MB). With larger allocation
sizes, fewer objects can be stored in the cache before it is
garbage-collected.

</li><li> There is a noticeable drop in the TCMalloc performance at ~32K
maximum allocation size; at larger sizes performance drops less
quickly. This is due to the 32K maximum size of objects in the
per-thread caches; for objects larger than this tcmalloc allocates
from the central page heap.

</li></ul>
<p> Next, operations (millions) per second of CPU time vs number of threads, for
max allocation size 64 bytes - 128 Kbytes.

</p><p>
<table>
<tbody><tr>
<td><img src="../images/tcmalloc-opspercpusec_005.png"></td>
<td><img src="../images/tcmalloc-opspercpusec_006.png"></td>
<td><img src="../images/tcmalloc-opspercpusec_009.png"></td>
</tr>
<tr>
<td><img src="../images/tcmalloc-opspercpusec_003.png"></td>
<td><img src="../images/tcmalloc-opspercpusec_002.png"></td>
<td><img src="../images/tcmalloc-opspercpusec_008.png"></td>
</tr>
<tr>
<td><img src="../images/tcmalloc-opspercpusec.png"></td>
<td><img src="../images/tcmalloc-opspercpusec_007.png"></td>
<td><img src="../images/tcmalloc-opspercpusec_004.png"></td>
</tr>
</tbody></table>

</p><p> Here we see again that TCMalloc is both more consistent and more
efficient than PTMalloc2. For max allocation sizes &lt;32K, TCMalloc
typically achieves ~2-2.5 million ops per second of CPU time with a
large number of threads, whereas PTMalloc achieves generally 0.5-1
million ops per second of CPU time, with a lot of cases achieving much
less than this figure. Above 32K max allocation size, TCMalloc drops
to 1-1.5 million ops per second of CPU time, and PTMalloc drops almost
to zero for large numbers of threads (i.e. with PTMalloc, lots of CPU
time is being burned spinning waiting for locks in the heavily
multi-threaded case).

</p><h2>Caveats</h2>

<p>For some systems, TCMalloc may not work correctly with
applications that aren't linked against libpthread.so (or the
equivalent on your OS). It should work on Linux using glibc 2.3, but
other OS/libc combinations have not been tested.

</p><p>TCMalloc may be somewhat more memory hungry than other mallocs,
though it tends not to have the huge blowups that can happen with
other mallocs. In particular, at startup TCMalloc allocates
approximately 6 MB of memory. It would be easy to roll a specialized
version that trades a little bit of speed for more space efficiency.

</p><p>
TCMalloc currently does not return any memory to the system.

</p><p>
Don't try to load TCMalloc into a running binary (e.g., using
JNI in Java programs). The binary will have allocated some
objects using the system malloc, and may try to pass them
to TCMalloc for deallocation. TCMalloc will not be able
to handle such objects.

</p></body></html>
BIN docs/images/heap-example1.png (Normal file; 37 KiB)
BIN docs/images/overview.gif (Normal file; 6.3 KiB)
BIN docs/images/pageheap.gif (Normal file; 15 KiB)
BIN docs/images/pprof-test.gif (Normal file; 56 KiB)
BIN docs/images/pprof-vsnprintf.gif (Normal file; 30 KiB)
BIN docs/images/spanmap.gif (Normal file; 8.3 KiB)
BIN docs/images/tcmalloc-opspercpusec.png (Normal file; 1.5 KiB)
BIN docs/images/tcmalloc-opspercpusec_002.png (Normal file; 1.9 KiB)
BIN docs/images/tcmalloc-opspercpusec_003.png (Normal file; 2.0 KiB)
BIN docs/images/tcmalloc-opspercpusec_004.png (Normal file; 1.3 KiB)
BIN docs/images/tcmalloc-opspercpusec_005.png (Normal file; 1.6 KiB)
BIN docs/images/tcmalloc-opspercpusec_006.png (Normal file; 1.8 KiB)
BIN docs/images/tcmalloc-opspercpusec_007.png (Normal file; 1.5 KiB)
BIN docs/images/tcmalloc-opspercpusec_008.png (Normal file; 1.8 KiB)
BIN docs/images/tcmalloc-opspercpusec_009.png (Normal file; 1.8 KiB)
BIN docs/images/tcmalloc-opspersec.png (Normal file; 2.1 KiB)
BIN docs/images/tcmalloc-opspersec_002.png (Normal file; 2.0 KiB)
BIN docs/images/tcmalloc-opspersec_003.png (Normal file; 2.2 KiB)
BIN docs/images/tcmalloc-opspersec_004.png (Normal file; 1.6 KiB)
BIN docs/images/tcmalloc-opspersec_005.png (Normal file; 2.2 KiB)
BIN docs/images/tcmalloc-opspersec_006.png (Normal file; 1.9 KiB)
BIN docs/images/tcmalloc-opspersec_007.png (Normal file; 2.1 KiB)
BIN docs/images/tcmalloc-opspersec_008.png (Normal file; 2.1 KiB)
BIN docs/images/tcmalloc-opspersec_009.png (Normal file; 2.1 KiB)
BIN docs/images/threadheap.gif (Normal file; 7.4 KiB)