396 lines
13 KiB
HTML
396 lines
13 KiB
HTML
|
<HTML>
|
||
|
|
||
|
<HEAD>
|
||
|
<title>Google CPU Profiler</title>
|
||
|
</HEAD>
|
||
|
|
||
|
<BODY>
|
||
|
|
||
|
This is the CPU profiler we use at Google. There are three parts to
|
||
|
using it: linking the library into an application, running the code,
|
||
|
and analyzing the output.
|
||
|
|
||
|
|
||
|
<H1>Linking in the Library</H1>
|
||
|
|
||
|
<p>To install the CPU profiler into your executable, add -lprofiler to
|
||
|
the link-time step for your executable. (It's also probably possible
|
||
|
to add in the profiler at run-time using LD_PRELOAD, but this isn't
|
||
|
necessarily recommended.)</p>
|
||
|
|
||
|
<p>This does <i>not</i> turn on CPU profiling; it just inserts the code.
|
||
|
For that reason, it's practical to just always link -lprofiler into a
|
||
|
binary while developing; that's what we do at Google. (However, since
|
||
|
any user can turn on the profiler by setting an environment variable,
|
||
|
it's not necessarily recommended to install profiler-linked binaries
|
||
|
into a production, running system.)</p>
|
||
|
|
||
|
|
||
|
<H1>Running the Code</H1>
|
||
|
|
||
|
<p>There are two alternatives to actually turn on CPU profiling for a
|
||
|
given run of an executable:</p>
|
||
|
|
||
|
<ol>
|
||
|
<li> Define the environment variable CPUPROFILE to the filename to dump the
|
||
|
profile to. For instance, to profile /usr/local/netscape:
|
||
|
<pre>
|
||
|
$ CPUPROFILE=/tmp/profile /usr/local/netscape # sh
|
||
|
% setenv CPUPROFILE /tmp/profile; /usr/local/netscape # csh
|
||
|
</pre>
|
||
|
OR
|
||
|
|
||
|
<li> In your code, bracket the code you want profiled in calls to
|
||
|
ProfilerStart() and ProfilerStop(). ProfilerStart() will take the
|
||
|
profile-filename as an argument.
|
||
|
</ol>
|
||
|
|
||
|
<p>Profiling works correctly with threads. To use, just call
|
||
|
ProfilerRegisterThread() at the beginning of the routine the thread
|
||
|
runs.</p>
|
||
|
|
||
|
<p>You can also turn profiling on and off throughout the code, and do
|
||
|
other tweaks. This functionality will not frequently be needed. See
|
||
|
/usr/local/include/google/profiler.h (or src/google/profiler.h in this
|
||
|
directory) for more details.</p>
|
||
|
|
||
|
|
||
|
<H1>Analyzing the Output</H1>
|
||
|
|
||
|
<p>pprof is the script used to analyze a profile. It has many output
|
||
|
modes, both textual and graphical. Some give just raw numbers, much
|
||
|
like the -pg output of gcc, and others show the data in the form of a
|
||
|
dependency graph.</p>
|
||
|
|
||
|
<p>pprof <b>requires</b> perl5 to be installed to run. It also
|
||
|
requires dot to be installed for any of the graphical output routines,
|
||
|
and gv to be installed for --gv mode (described below).</p>
|
||
|
|
||
|
<p>Here are some ways to call pprof. These are described in more
|
||
|
detail below.</p>
|
||
|
|
||
|
<pre>
|
||
|
% pprof "program" "profile"
|
||
|
Generates one line per procedure
|
||
|
|
||
|
% pprof --gv "program" "profile"
|
||
|
Generates annotated call-graph and displays via "gv"
|
||
|
|
||
|
% pprof --gv --focus=Mutex "program" "profile"
|
||
|
Restrict to code paths that involve an entry that matches "Mutex"
|
||
|
|
||
|
% pprof --gv --focus=Mutex --ignore=string "program" "profile"
|
||
|
Restrict to code paths that involve an entry that matches "Mutex"
|
||
|
and does not match "string"
|
||
|
|
||
|
% pprof --list=IBF_CheckDocid "program" "profile"
|
||
|
Generates disassembly listing of all routines with at least one
|
||
|
sample that match the --list=<regexp> pattern. The listing is
|
||
|
annotated with the flat and cumulative sample counts at each line.
|
||
|
|
||
|
% pprof --disasm=IBF_CheckDocid "program" "profile"
|
||
|
Generates disassembly listing of all routines with at least one
|
||
|
sample that match the --disasm=<regexp> pattern. The listing is
|
||
|
annotated with the flat and cumulative sample counts at each PC value.
|
||
|
</pre>
|
||
|
|
||
|
<h3>Node Information</h3>
|
||
|
|
||
|
<p>In the various graphical modes of pprof, the output is a call graph
|
||
|
annotated with timing information, like so:</p>
|
||
|
|
||
|
<A HREF="pprof-test-big.gif">
|
||
|
<center><table><tr><td>
|
||
|
<img src="pprof-test.gif">
|
||
|
</td></tr></table></center>
|
||
|
</A>
|
||
|
|
||
|
<p>Each node represents a procedure.
|
||
|
The directed edges indicate caller to callee relations. Each node is
|
||
|
formatted as follows:</p>
|
||
|
|
||
|
<center><pre>
|
||
|
Class Name
|
||
|
Method Name
|
||
|
local (percentage)
|
||
|
<b>of</b> cumulative (percentage)
|
||
|
</pre></center>
|
||
|
|
||
|
<p>The last one or two lines contains the timing information. (The
|
||
|
profiling is done via a sampling method, where by default we take 100
|
||
|
samples a second. Therefor one unit of time in the output corresponds
|
||
|
to about 10 milliseconds of execution time.) The "local" time is the
|
||
|
time spent executing the instructions directly contained in the
|
||
|
procedure (and in any other procedures that were inlined into the
|
||
|
procedure). The "cumulative" time is the sum of the "local" time and
|
||
|
the time spent in any callees. If the cumulative time is the same as
|
||
|
the local time, it is not printed.
|
||
|
|
||
|
<p>For instance, the timing information for test_main_thread()
|
||
|
indicates that 155 units (about 1.55 seconds) were spent executing the
|
||
|
code in test_main_thread() and 200 units were spent while executing
|
||
|
test_main_thread() and its callees such as snprintf().</p>
|
||
|
|
||
|
<p>The size of the node is proportional to the local count. The
|
||
|
percentage displayed in the node corresponds to the count divided by
|
||
|
the total run time of the program (that is, the cumulative count for
|
||
|
main()).</p>
|
||
|
|
||
|
<h3>Edge Information</h3>
|
||
|
|
||
|
<p>An edge from one node to another indicates a caller to callee
|
||
|
relationship. Each edge is labelled with the time spent by the callee
|
||
|
on behalf of the caller. E.g, the edge from test_main_thread() to
|
||
|
snprintf() indicates that of the 200 samples in
|
||
|
test_main_thread(), 37 are because of calls to snprintf().</p>
|
||
|
|
||
|
<p>Note that test_main_thread() has an edge to vsnprintf(), even
|
||
|
though test_main_thread() doesn't call that function directly. This
|
||
|
is because the code was compiled with -O2; the profile reflects the
|
||
|
optimized control flow.</p>
|
||
|
|
||
|
<h3>Meta Information</h3>
|
||
|
|
||
|
The top of the display should contain some meta information like:
|
||
|
<pre>
|
||
|
/tmp/profiler2_unittest
|
||
|
Total samples: 202
|
||
|
Focusing on: 202
|
||
|
Dropped nodes with <= 1 abs(samples)
|
||
|
Dropped edges with <= 0 samples
|
||
|
</pre>
|
||
|
|
||
|
This section contains the name of the program, and the total samples
|
||
|
collected during the profiling run. If the --focus option is on (see
|
||
|
the <a href="#focus">Focus</a> section below), the legend also
|
||
|
contains the number of samples being shown in the focused display.
|
||
|
Furthermore, some unimportant nodes and edges are dropped to reduce
|
||
|
clutter. The characteristics of the dropped nodes and edges are also
|
||
|
displayed in the legend.
|
||
|
|
||
|
<h3><a name=focus>Focus and Ignore</a></h3>
|
||
|
|
||
|
<p>You can ask pprof to generate a display focused on a particular
|
||
|
piece of the program. You specify a regular expression. Any portion
|
||
|
of the call-graph that is on a path which contains at least one node
|
||
|
matching the regular expression is preserved. The rest of the
|
||
|
call-graph is dropped on the floor. For example, you can focus on the
|
||
|
vsnprintf() libc call in profiler2_unittest as follows:</p>
|
||
|
|
||
|
<pre>
|
||
|
% pprof --gv --focus=vsnprintf /tmp/profiler2_unittest test.prof
|
||
|
</pre>
|
||
|
<A HREF="pprof-vsnprintf-big.gif">
|
||
|
<center><table><tr><td>
|
||
|
<img src="pprof-vsnprintf.gif">
|
||
|
</td></tr></table></center>
|
||
|
</A>
|
||
|
|
||
|
<p>
|
||
|
Similarly, you can supply the --ignore option to ignore
|
||
|
samples that match a specified regular expression. E.g.,
|
||
|
if you are interested in everything except calls to snprintf(),
|
||
|
you can say:
|
||
|
<pre>
|
||
|
% pprof --gv --ignore=snprintf /tmp/profiler2_unittest test.prof
|
||
|
</pre>
|
||
|
|
||
|
<h3><a name=options>pprof Options</a></h3>
|
||
|
|
||
|
<h4>Output Type</h4>
|
||
|
|
||
|
<p>
|
||
|
<center>
|
||
|
<table frame=box rules=sides cellpadding=5 width=100%>
|
||
|
<tr valign=top>
|
||
|
<td><code>--text</code></td>
|
||
|
<td>
|
||
|
Produces a textual listing. This is currently the default
|
||
|
since it does not need to access to an X display, or
|
||
|
dot or gv. However if you
|
||
|
have these programs installed, you will probably be
|
||
|
happier with the --gv output.
|
||
|
</td>
|
||
|
</tr>
|
||
|
<tr valign=top>
|
||
|
<td><code>--gv</code></td>
|
||
|
<td>
|
||
|
Generates annotated call-graph, converts to postscript, and
|
||
|
displays via gv.
|
||
|
</td>
|
||
|
</tr>
|
||
|
<tr valign=top>
|
||
|
<td><code>--dot</code></td>
|
||
|
<td>
|
||
|
Generates the annotated call-graph in dot format and
|
||
|
emits to stdout.
|
||
|
</td>
|
||
|
</tr>
|
||
|
<tr valign=top>
|
||
|
<td><code>--ps</code></td>
|
||
|
<td>
|
||
|
Generates the annotated call-graph in Postscript format and
|
||
|
emits to stdout.
|
||
|
</td>
|
||
|
</tr>
|
||
|
<tr valign=top>
|
||
|
<td><code>--gif</code></td>
|
||
|
<td>
|
||
|
Generates the annotated call-graph in GIF format and
|
||
|
emits to stdout.
|
||
|
</td>
|
||
|
</tr>
|
||
|
<tr valign=top>
|
||
|
<td><code>--list=<<i>regexp</i>></code></td>
|
||
|
<td>
|
||
|
<p>Outputs source-code listing of routines whose
|
||
|
name matches <regexp>. Each line
|
||
|
in the listing is annotated with flat and cumulative
|
||
|
sample counts.</p>
|
||
|
|
||
|
<p>In the presence of inlined calls, the samples
|
||
|
associated with inlined code tend to get assigned
|
||
|
to a line that follows the location of the
|
||
|
inlined call. A more precise accounting can be
|
||
|
obtained by disassembling the routine using the
|
||
|
--disasm flag.</p>
|
||
|
</td>
|
||
|
</tr>
|
||
|
<tr valign=top>
|
||
|
<td><code>--disasm=<<i>regexp</i>></code></td>
|
||
|
<td>
|
||
|
Generates disassembly of routines that match
|
||
|
<regexp>, annotated with flat and
|
||
|
cumulative sample counts and emits to stdout.
|
||
|
</td>
|
||
|
</tr>
|
||
|
</table>
|
||
|
</center>
|
||
|
|
||
|
<h4>Reporting Granularity</h4>
|
||
|
|
||
|
<p>By default, pprof produces one entry per procedure. However you can
|
||
|
use one of the following options to change the granularity of the
|
||
|
output. The --files option seems to be particularly useless, and may
|
||
|
be removed eventually.</p>
|
||
|
|
||
|
<center>
|
||
|
<table frame=box rules=sides cellpadding=5 width=100%>
|
||
|
<tr valign=top>
|
||
|
<td><code>--addresses</code></td>
|
||
|
<td>
|
||
|
Produce one node per program address.
|
||
|
</td>
|
||
|
</tr>
|
||
|
<td><code>--lines</code></td>
|
||
|
<td>
|
||
|
Produce one node per source line.
|
||
|
</td>
|
||
|
</tr>
|
||
|
<td><code>--functions</code></td>
|
||
|
<td>
|
||
|
Produce one node per function (this is the default).
|
||
|
</td>
|
||
|
</tr>
|
||
|
<td><code>--files</code></td>
|
||
|
<td>
|
||
|
Produce one node per source file.
|
||
|
</td>
|
||
|
</tr>
|
||
|
</table>
|
||
|
</center>
|
||
|
|
||
|
<h4>Controlling the Call Graph Display</h4>
|
||
|
|
||
|
<p>Some nodes and edges are dropped to reduce clutter in the output
|
||
|
display. The following options control this effect:</p>
|
||
|
|
||
|
<center>
|
||
|
<table frame=box rules=sides cellpadding=5 width=100%>
|
||
|
<tr valign=top>
|
||
|
<td><code>--nodecount=<n></code></td>
|
||
|
<td>
|
||
|
This option controls the number of displayed nodes. The nodes
|
||
|
are first sorted by decreasing cumulative count, and then only
|
||
|
the top N nodes are kept. The default value is 80.
|
||
|
</td>
|
||
|
</tr>
|
||
|
<tr valign=top>
|
||
|
<td><code>--nodefraction=<f></code></td>
|
||
|
<td>
|
||
|
This option provides another mechanism for discarding nodes
|
||
|
from the display. If the cumulative count for a node is
|
||
|
less than this option's value multiplied by the total count
|
||
|
for the profile, the node is dropped. The default value
|
||
|
is 0.005; i.e. nodes that account for less than
|
||
|
half a percent of the total time are dropped. A node
|
||
|
is dropped if either this condition is satisfied, or the
|
||
|
--nodecount condition is satisfied.
|
||
|
</td>
|
||
|
</tr>
|
||
|
<tr valign=top>
|
||
|
<td><code>--edgefraction=<f></code></td>
|
||
|
<td>
|
||
|
This option controls the number of displayed edges. First of all,
|
||
|
an edge is dropped if either its source or destination node is
|
||
|
dropped. Otherwise, the edge is dropped if the sample
|
||
|
count along the edge is less than this option's value multiplied
|
||
|
by the total count for the profile. The default value is
|
||
|
0.001; i.e., edges that account for less than
|
||
|
0.1% of the total time are dropped.
|
||
|
</td>
|
||
|
</tr>
|
||
|
<tr valign=top>
|
||
|
<td><code>--focus=<re></code></td>
|
||
|
<td>
|
||
|
This option controls what region of the graph is displayed
|
||
|
based on the regular expression supplied with the option.
|
||
|
For any path in the callgraph, we check all nodes in the path
|
||
|
against the supplied regular expression. If none of the nodes
|
||
|
match, the path is dropped from the output.
|
||
|
</td>
|
||
|
</tr>
|
||
|
<tr valign=top>
|
||
|
<td><code>--ignore=<re></code></td>
|
||
|
<td>
|
||
|
This option controls what region of the graph is displayed
|
||
|
based on the regular expression supplied with the option.
|
||
|
For any path in the callgraph, we check all nodes in the path
|
||
|
against the supplied regular expression. If any of the nodes
|
||
|
match, the path is dropped from the output.
|
||
|
</td>
|
||
|
</tr>
|
||
|
</table>
|
||
|
</center>
|
||
|
|
||
|
<p>The dropped edges and nodes account for some count mismatches in
|
||
|
the display. For example, the cumulative count for
|
||
|
snprintf() in the first diagram above was 41. However the local
|
||
|
count (1) and the count along the outgoing edges (12+1+20+6) add up to
|
||
|
only 40.</p>
|
||
|
|
||
|
|
||
|
<h1>Caveats</h1>
|
||
|
|
||
|
<ul>
|
||
|
<li> If the program exits because of a signal, the generated profile
|
||
|
will be <font color=red>incomplete, and may perhaps be
|
||
|
completely empty.</font>
|
||
|
<li> The displayed graph may have disconnected regions because
|
||
|
of the edge-dropping heuristics described above.
|
||
|
<li> If the program linked in a library that was not compiled
|
||
|
with enough symbolic information, all samples associated
|
||
|
with the library may be charged to the last symbol found
|
||
|
in the program before the libary. This will artificially
|
||
|
inflate the count for that symbol.
|
||
|
<li> If you run the program on one machine, and profile it on another,
|
||
|
and the shared libraries are different on the two machines, the
|
||
|
profiling output may be confusing: samples that fall within
|
||
|
the shared libaries may be assigned to arbitrary procedures.
|
||
|
</ul>
|
||
|
|
||
|
<hr>
|
||
|
Last modified: Tue Jan 25 18:06:38 PST 2005
|
||
|
</body>
|
||
|
</html>
|