diff --git a/docs/html/cpu_profiler.html b/docs/html/cpu_profiler.html new file mode 100644 index 0000000..f05d5ec --- /dev/null +++ b/docs/html/cpu_profiler.html @@ -0,0 +1,409 @@ +Google CPU Profiler + +This is the CPU profiler we use at Google. There are three parts to +using it: linking the library into an application, running the code, +and analyzing the output. + + +

Linking in the Library

+ +

To install the CPU profiler into your executable, add -lprofiler to +the link-time step for your executable. (It's also probably possible +to add in the profiler at run-time using LD_PRELOAD, but this isn't +necessarily recommended.)

+ +

This does not turn on CPU profiling; it just inserts the code. +For that reason, it's practical to just always link -lprofiler into a +binary while developing; that's what we do at Google. (However, since +any user can turn on the profiler by setting an environment variable, +it's not necessarily recommended to install profiler-linked binaries +into a production, running system.)

+ + +

Running the Code

+ +

There are two ways to turn on CPU profiling for a given run of an executable:

+ +
    +
  1. Define the environment variable CPUPROFILE to the filename to dump the + profile to. For instance, to profile /usr/local/netscape: +
          $ CPUPROFILE=/tmp/profile /usr/local/netscape           # sh
    +      % setenv CPUPROFILE /tmp/profile; /usr/local/netscape   # csh
    +     
    + OR + +
  2. In your code, bracket the code you want profiled in calls to + ProfilerStart() and ProfilerStop(). ProfilerStart() will take the + profile-filename as an argument. +
+ +

In Linux 2.6 and above, profiling works correctly with threads, +automatically profiling all threads. In Linux 2.4, profiling only +profiles the main thread (due to a kernel bug involving itimers and +threads). Profiling works correctly with sub-processes: each child +process gets its own profile with its own name (generated by combining +CPUPROFILE with the child's process id).

+ +

For security reasons, CPU profiling will not write to a file -- and +is thus not usable -- for setuid programs.

+ +

Controlling Behavior via the Environment

+ +

In addition to the environment variable CPUPROFILE, +which determines where profiles are written, there are several +environment variables which control the performance of the CPU +profile.

+ + + + + + +
PROFILEFREQUENCY=x   How many interrupts per second the CPU profiler samples (100 by default).
+ +

Analyzing the Output

+ +

pprof is the script used to analyze a profile. It has many output +modes, both textual and graphical. Some give just raw numbers, much +like the -pg output of gcc, and others show the data in the form of a +dependency graph.

+ +

pprof requires perl5 to be installed to run. It also +requires dot to be installed for any of the graphical output routines, +and gv to be installed for --gv mode (described below).

+ +

Here are some ways to call pprof. These are described in more +detail below.

+ +
% pprof "program" "profile"
+  Generates one line per procedure
+
+% pprof --gv "program" "profile"
+  Generates annotated call-graph and displays via "gv"
+
+% pprof --gv --focus=Mutex "program" "profile"
+  Restrict to code paths that involve an entry that matches "Mutex"
+
+% pprof --gv --focus=Mutex --ignore=string "program" "profile"
+  Restrict to code paths that involve an entry that matches "Mutex"
+  and does not match "string"
+
+% pprof --list=IBF_CheckDocid "program" "profile"
+  Generates source-code listing of all routines with at least one
+  sample that matches the --list= pattern.  The listing is
+  annotated with the flat and cumulative sample counts at each line.
+
+% pprof --disasm=IBF_CheckDocid "program" "profile"
+  Generates disassembly listing of all routines with at least one
+  sample that matches the --disasm= pattern.  The listing is
+  annotated with the flat and cumulative sample counts at each PC value.
+
+ +

Node Information

+ +

In the various graphical modes of pprof, the output is a call graph +annotated with timing information, like so:

+ + +
+ +
+
+ +

Each node represents a procedure. +The directed edges indicate caller to callee relations. Each node is +formatted as follows:

+ +
Class Name
+Method Name
+local (percentage)
+of cumulative (percentage)
+
+ +

The last one or two lines contain the timing information. (The profiling is done via a sampling method, where by default we take 100 samples a second. Therefore one unit of time in the output corresponds to about 10 milliseconds of execution time.) The "local" time is the time spent executing the instructions directly contained in the procedure (and in any other procedures that were inlined into the procedure). The "cumulative" time is the sum of the "local" time and the time spent in any callees. If the cumulative time is the same as the local time, it is not printed.

For instance, the timing information for test_main_thread() +indicates that 155 units (about 1.55 seconds) were spent executing the +code in test_main_thread() and 200 units were spent while executing +test_main_thread() and its callees such as snprintf().

+ +

The size of the node is proportional to the local count. The +percentage displayed in the node corresponds to the count divided by +the total run time of the program (that is, the cumulative count for +main()).

+ +

Edge Information

+ +

An edge from one node to another indicates a caller to callee relationship. Each edge is labelled with the time spent by the callee on behalf of the caller. E.g., the edge from test_main_thread() to snprintf() indicates that of the 200 samples in test_main_thread(), 37 are because of calls to snprintf().

+ +

Note that test_main_thread() has an edge to vsnprintf(), even +though test_main_thread() doesn't call that function directly. This +is because the code was compiled with -O2; the profile reflects the +optimized control flow.

+ +

Meta Information

+ +The top of the display should contain some meta information like: +
      /tmp/profiler2_unittest
+      Total samples: 202
+      Focusing on: 202
+      Dropped nodes with <= 1 abs(samples)
+      Dropped edges with <= 0 samples
+
+ +This section contains the name of the program, and the total samples +collected during the profiling run. If the --focus option is on (see +the Focus section below), the legend also +contains the number of samples being shown in the focused display. +Furthermore, some unimportant nodes and edges are dropped to reduce +clutter. The characteristics of the dropped nodes and edges are also +displayed in the legend. + +

Focus and Ignore

+ +

You can ask pprof to generate a display focused on a particular +piece of the program. You specify a regular expression. Any portion +of the call-graph that is on a path which contains at least one node +matching the regular expression is preserved. The rest of the +call-graph is dropped on the floor. For example, you can focus on the +vsnprintf() libc call in profiler2_unittest as follows:

+ +
% pprof --gv --focus=vsnprintf /tmp/profiler2_unittest test.prof
+
+ +
+ +
+
+ +

+Similarly, you can supply the --ignore option to ignore +samples that match a specified regular expression. E.g., +if you are interested in everything except calls to snprintf(), +you can say: +

% pprof --gv --ignore=snprintf /tmp/profiler2_unittest test.prof
+
+ +

pprof Options

+ +

Output Type

+ +

+

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
--text Produces a textual listing. This is currently the default since it does not need access to an X display, or to dot or gv. However, if you have these programs installed, you will probably be happier with the --gv output.
--gv + Generates annotated call-graph, converts to postscript, and + displays via gv. +
--dot + Generates the annotated call-graph in dot format and + emits to stdout. +
--ps + Generates the annotated call-graph in Postscript format and + emits to stdout. +
--gif + Generates the annotated call-graph in GIF format and + emits to stdout. +
--list=<regexp> +

Outputs source-code listing of routines whose + name matches <regexp>. Each line + in the listing is annotated with flat and cumulative + sample counts.

+ +

In the presence of inlined calls, the samples + associated with inlined code tend to get assigned + to a line that follows the location of the + inlined call. A more precise accounting can be + obtained by disassembling the routine using the + --disasm flag.

+
--disasm=<regexp> + Generates disassembly of routines that match + <regexp>, annotated with flat and + cumulative sample counts and emits to stdout. +
+
+ +

Reporting Granularity

+ +

By default, pprof produces one entry per procedure. However you can +use one of the following options to change the granularity of the +output. The --files option seems to be particularly useless, and may +be removed eventually.

+ +
+ + + + + + + + + + + + + + +
--addresses + Produce one node per program address. +
--lines + Produce one node per source line. +
--functions + Produce one node per function (this is the default). +
--files + Produce one node per source file. +
+
+ +

Controlling the Call Graph Display

+ +

Some nodes and edges are dropped to reduce clutter in the output +display. The following options control this effect:

+ +
+ + + + + + + + + + + + + + + + + + + + + +
--nodecount=<n> + This option controls the number of displayed nodes. The nodes + are first sorted by decreasing cumulative count, and then only + the top N nodes are kept. The default value is 80. +
--nodefraction=<f> + This option provides another mechanism for discarding nodes + from the display. If the cumulative count for a node is + less than this option's value multiplied by the total count + for the profile, the node is dropped. The default value + is 0.005; i.e. nodes that account for less than + half a percent of the total time are dropped. A node + is dropped if either this condition is satisfied, or the + --nodecount condition is satisfied. +
--edgefraction=<f> + This option controls the number of displayed edges. First of all, + an edge is dropped if either its source or destination node is + dropped. Otherwise, the edge is dropped if the sample + count along the edge is less than this option's value multiplied + by the total count for the profile. The default value is + 0.001; i.e., edges that account for less than + 0.1% of the total time are dropped. +
--focus=<re> + This option controls what region of the graph is displayed + based on the regular expression supplied with the option. + For any path in the callgraph, we check all nodes in the path + against the supplied regular expression. If none of the nodes + match, the path is dropped from the output. +
--ignore=<re> + This option controls what region of the graph is displayed + based on the regular expression supplied with the option. + For any path in the callgraph, we check all nodes in the path + against the supplied regular expression. If any of the nodes + match, the path is dropped from the output. +
+
+ +

The dropped edges and nodes account for some count mismatches in +the display. For example, the cumulative count for +snprintf() in the first diagram above was 41. However the local +count (1) and the count along the outgoing edges (12+1+20+6) add up to +only 40.

+ + +

Caveats

+ + + +
+Last modified: Wed Apr 20 04:54:23 PDT 2005 + + diff --git a/docs/html/heap_checker.html b/docs/html/heap_checker.html new file mode 100644 index 0000000..fd82f37 --- /dev/null +++ b/docs/html/heap_checker.html @@ -0,0 +1,133 @@ + +Google Heap Checker +

Automatic Leaks Checking Support

+ +This document describes how to check the heap usage of a C++ +program. This facility can be useful for automatically detecting +memory leaks. + +

Linking in the Heap Checker

+ +

+You can heap-check any program that has the tcmalloc library linked +in. No recompilation is necessary to use the heap checker. +

+ +

+In order to catch all heap leaks, tcmalloc must be linked last into +your executable. The heap checker may mischaracterize some memory +accesses in libraries listed after it on the link line. For instance, +it may report these libraries as leaking memory when they're not. +(See the source code for more details.) +

+ +

+It's safe to link in tcmalloc even if you don't expect to +heap-check your program. Your programs will not run any slower +as long as you don't use any of the heap-checker features. +

+ +

+You can run the heap checker on applications you didn't compile +yourself, by using LD_PRELOAD: +

+
   $ LD_PRELOAD="/usr/lib/libtcmalloc.so" HEAPCHECK=normal <binary>
+
+

+We don't necessarily recommend this mode of usage. +

+ +

Turning On Heap Checking

+ +

There are two ways to turn on heap checking for a given run of an executable.

+ + \ No newline at end of file diff --git a/docs/html/heap_profiler.html b/docs/html/heap_profiler.html new file mode 100644 index 0000000..6936901 --- /dev/null +++ b/docs/html/heap_profiler.html @@ -0,0 +1,310 @@ + +Google Heap Profiler +

Profiling heap usage

+ +This document describes how to profile the heap usage of a C++ program. This facility can be useful for figuring out what is in the program heap at any given time, locating memory leaks, and finding places that do a lot of allocation.

Linking in the Heap Profiler

+ +

+You can profile any program that has the tcmalloc library linked +in. No recompilation is necessary to use the heap profiler. +

+ +

+It's safe to link in tcmalloc even if you don't expect to heap-profile your program. Your programs will not run any slower as long as you don't use any of the heap-profiler features.

+ +

+You can run the heap profiler on applications you didn't compile +yourself, by using LD_PRELOAD: +

+
   $ LD_PRELOAD="/usr/lib/libtcmalloc.so" HEAPPROFILE=... 
+
+

+We don't necessarily recommend this mode of usage. +

+ + +

Turning On Heap Profiling

+ +

+Define the environment variable HEAPPROFILE to the filename to dump the +profile to. For instance, to profile /usr/local/netscape: +

+
 $ HEAPPROFILE=/tmp/profile /usr/local/netscape           # sh
+ % setenv HEAPPROFILE /tmp/profile; /usr/local/netscape   # csh
+
+ +

Profiling also works correctly with sub-processes: each child +process gets its own profile with its own name (generated by combining +HEAPPROFILE with the child's process id).

+ +

For security reasons, heap profiling will not write to a file -- and is thus not usable -- for setuid programs.

+ + + +

Extracting a profile

+ +

+If heap-profiling is turned on in a program, the program will periodically +write profiles to the filesystem. The sequence of profiles will be named: +

+
           <prefix>.0000.heap
+           <prefix>.0001.heap
+           <prefix>.0002.heap
+           ...
+
+

+where <prefix> is the value supplied in +HEAPPROFILE. Note that if the supplied prefix +does not start with a /, the profile files will be +written to the program's working directory. +

+ +

+By default, a new profile file is written after every 1GB of +allocation. The profile-writing interval can be adjusted by calling +HeapProfilerSetAllocationInterval() from your program. This takes one +argument: a numeric value that indicates the number of bytes of allocation +between each profile dump. +

+ +

+You can also generate profiles from specific points in the program +by inserting a call to HeapProfile(). Example: +

+
    extern const char* HeapProfile();
+    const char* profile = HeapProfile();
+    fputs(profile, stdout);
+    free(const_cast<char*>(profile));
+
+ +

What is profiled

+ +The profiling system instruments all allocations and frees. It keeps track of various pieces of information per allocation site. An allocation site is defined as the active stack trace at the call to malloc, calloc, realloc, or new.

Interpreting the profile

+ +The profile output can be viewed by passing it to the +pprof tool. The pprof tool can print both +CPU usage and heap usage information. It is documented in detail +on the CPU Profiling page. +Heap-profile-specific flags and usage are explained below. + +

+Here are some examples. These examples assume the binary is named +gfs_master, and a sequence of heap profile files can be +found in files named: +

+
  profile.0001.heap
+  profile.0002.heap
+  ...
+  profile.0100.heap
+
+ +

Why is a process so big

+ +
    % pprof --gv gfs_master profile.0100.heap
+
+ +This command will pop up a gv window that displays the profile information as a directed graph. Here is a portion of the resulting output:

+

+ +
+

+ +A few explanations: + + +

Comparing Profiles

+ +

+You often want to skip allocations during the initialization phase of +a program so you can find gradual memory leaks. One simple way to do +this is to compare two profiles -- both collected after the program +has been running for a while. Specify the name of the first profile +using the --base option. Example: +

+
   % pprof --base=profile.0004.heap gfs_master profile.0100.heap
+
+ +

+The memory-usage in profile.0004.heap will be subtracted from +the memory-usage in profile.0100.heap and the result will +be displayed. +

+ +

Text display

+ +
% pprof gfs_master profile.0100.heap
+   255.6  24.7%  24.7%    255.6  24.7% GFS_MasterChunk::AddServer
+   184.6  17.8%  42.5%    298.8  28.8% GFS_MasterChunkTable::Create
+   176.2  17.0%  59.5%    729.9  70.5% GFS_MasterChunkTable::UpdateState
+   169.8  16.4%  75.9%    169.8  16.4% PendingClone::PendingClone
+    76.3   7.4%  83.3%     76.3   7.4% __default_alloc_template::_S_chunk_alloc
+    49.5   4.8%  88.0%     49.5   4.8% hashtable::resize
+   ...
+
+ +

+

+ +

Ignoring or focusing on specific regions

+ +The following command will give a graphical display of a subset of +the call-graph. Only paths in the call-graph that match the +regular expression DataBuffer are included: +
% pprof --gv --focus=DataBuffer gfs_master profile.0100.heap
+
+ +Similarly, the following command will omit a subset of the call-graph: all paths in the call-graph that match the regular expression DataBuffer are discarded:
% pprof --gv --ignore=DataBuffer gfs_master profile.0100.heap
+
+ +

Total allocations + object-level information

+ +

+All of the previous examples have displayed the amount of in-use +space. I.e., the number of bytes that have been allocated but not +freed. You can also get other types of information by supplying +a flag to pprof: +

+ +
+ + + + + + + + + + + + + + + + + + + + + +
--inuse_space + Display the number of in-use megabytes (i.e. space that has + been allocated but not freed). This is the default. +
--inuse_objects + Display the number of in-use objects (i.e. number of + objects that have been allocated but not freed). +
--alloc_space + Display the number of allocated megabytes. This includes + the space that has since been de-allocated. Use this + if you want to find the main allocation sites in the + program. +
--alloc_objects + Display the number of allocated objects. This includes + the objects that have since been de-allocated. Use this + if you want to find the main allocation sites in the + program. +
+
+ +

Caveats

+ + + +
+
Sanjay Ghemawat
+ + +Last modified: Wed Apr 20 05:46:16 PDT 2005 + + + diff --git a/docs/html/tcmalloc.html b/docs/html/tcmalloc.html new file mode 100644 index 0000000..aa2d3ee --- /dev/null +++ b/docs/html/tcmalloc.html @@ -0,0 +1,373 @@ + +TCMalloc : Thread-Caching Malloc + + + + + + +

TCMalloc : Thread-Caching Malloc

+ +
Sanjay Ghemawat, Paul Menage <opensource@google.com>
+ +

Motivation

+ +TCMalloc is faster than the glibc 2.3 malloc (available as a separate library called ptmalloc2) and other mallocs that I have tested. ptmalloc2 takes approximately 300 nanoseconds to execute a malloc/free pair on a 2.8 GHz P4 (for small objects). The TCMalloc implementation takes approximately 50 nanoseconds for the same operation pair. Speed is important for a malloc implementation because if malloc is not fast enough, application writers are inclined to write their own custom free lists on top of malloc. This can lead to extra complexity and more memory usage unless the application writer is very careful to appropriately size the free lists and scavenge idle objects out of the free list.

+TCMalloc also reduces lock contention for multi-threaded programs. +For small objects, there is virtually zero contention. For large +objects, TCMalloc tries to use fine grained and efficient spinlocks. +ptmalloc2 also reduces lock contention by using per-thread arenas but +there is a big problem with ptmalloc2's use of per-thread arenas. In +ptmalloc2 memory can never move from one arena to another. This can +lead to huge amounts of wasted space. For example, in one Google application, the first phase would +allocate approximately 300MB of memory for its data +structures. When the first phase finished, a second phase would be +started in the same address space. If this second phase was assigned a +different arena than the one used by the first phase, this phase would +not reuse any of the memory left after the first phase and would add +another 300MB to the address space. Similar memory blowup problems +were also noticed in other applications. + +

+Another benefit of TCMalloc is space-efficient representation of small +objects. For example, N 8-byte objects can be allocated while using +space approximately 8N * 1.01 bytes. I.e., a one-percent +space overhead. ptmalloc2 uses a four-byte header for each object and +(I think) rounds up the size to a multiple of 8 bytes and ends up +using 16N bytes. + + +

Usage

+ +

To use TCMalloc, just link tcmalloc into your application via the "-ltcmalloc" linker flag.

+ +

+You can use tcmalloc in applications you didn't compile yourself, by +using LD_PRELOAD: +

+
   $ LD_PRELOAD="/usr/lib/libtcmalloc.so" <binary>
+
+

+LD_PRELOAD is tricky, and we don't necessarily recommend this mode of +usage. +

+ +

TCMalloc includes a heap checker +and heap profiler as well.

+ +

If you'd rather link in a version of TCMalloc that does not include +the heap profiler and checker (perhaps to reduce binary size for a +static binary), you can link in libtcmalloc_minimal +instead.

+ + +

Overview

+ +TCMalloc assigns each thread a thread-local cache. Small allocations +are satisfied from the thread-local cache. Objects are moved from +central data structures into a thread-local cache as needed, and +periodic garbage collections are used to migrate memory back from a +thread-local cache into the central data structures. +
+ +

+ +TCMalloc treats objects with size <= 32K ("small" objects) differently from larger objects. Large objects are allocated directly from the central heap using a page-level allocator (a page is a 4K aligned region of memory). I.e., a large object is always page-aligned and occupies an integral number of pages.

+A run of pages can be carved up into a sequence of small objects, each +equally sized. For example a run of one page (4K) can be carved up +into 32 objects of size 128 bytes each. + +

Small Object Allocation

+ +Each small object size maps to one of approximately 170 allocatable +size-classes. For example, all allocations in the range 961 to 1024 +bytes are rounded up to 1024. The size-classes are spaced so that +small sizes are separated by 8 bytes, larger sizes by 16 bytes, even +larger sizes by 32 bytes, and so forth. The maximal spacing (for sizes +>= ~2K) is 256 bytes. + +

+A thread cache contains a singly linked list of free objects per size-class. +

+ +When allocating a small object: (1) We map its size to the corresponding size-class. (2) Look in the corresponding free list in the thread cache for the current thread. (3) If the free list is not empty, we remove the first object from the list and return it. When following this fast path, TCMalloc acquires no locks at all. This helps speed up allocation significantly because a lock/unlock pair takes approximately 100 nanoseconds on a 2.8 GHz Xeon.

+ +If the free list is empty: (1) We fetch a bunch of objects from a central free list for this size-class (the central free list is shared by all threads). (2) Place them in the thread-local free list. (3) Return one of the newly fetched objects to the application.

+If the central free list is also empty: (1) We allocate a run of pages +from the central page allocator. (2) Split the run into a set of +objects of this size-class. (3) Place the new objects on the central +free list. (4) As before, move some of these objects to the +thread-local free list. + +

Large Object Allocation

+ +A large object size (> 32K) is rounded up to a page size (4K) and is handled by a central page heap. The central page heap is again an array of free lists. For k < 256, the kth entry is a free list of runs that consist of k pages. The 256th entry is a free list of runs that have length >= 256 pages:
+ +

+An allocation for k pages is satisfied by looking in the +kth free list. If that free list is empty, we look in +the next free list, and so forth. Eventually, we look in the last +free list if necessary. If that fails, we fetch memory from the +system (using sbrk, mmap, or by mapping in portions of /dev/mem). + +

+If an allocation for k pages is satisfied by a run +of pages of length > k, the remainder of the +run is re-inserted back into the appropriate free list in the +page heap. + +

Spans

+ +The heap managed by TCMalloc consists of a set of pages. A run of +contiguous pages is represented by a Span object. A span +can either be allocated, or free. If free, the span +is one of the entries in a page heap linked-list. If allocated, it is +either a large object that has been handed off to the application, or +a run of pages that have been split up into a sequence of small +objects. If split into small objects, the size-class of the objects +is recorded in the span. + +

+A central array indexed by page number can be used to find the span to +which a page belongs. For example, span a below occupies 2 +pages, span b occupies 1 page, span c occupies 5 +pages and span d occupies 3 pages. +

+A 32-bit address space can fit 2^20 4K pages, so this central array +takes 4MB of space, which seems acceptable. On 64-bit machines, we +use a 3-level radix tree instead of an array to map from a page number +to the corresponding span pointer. + +

Deallocation

+ +When an object is deallocated, we compute its page number and look it up +in the central array to find the corresponding span object. The span tells +us whether or not the object is small, and its size-class if it is +small. If the object is small, we insert it into the appropriate free +list in the current thread's thread cache. If the thread cache now +exceeds a predetermined size (2MB by default), we run a garbage +collector that moves unused objects from the thread cache into central +free lists. + +

+ +If the object is large, the span tells us the range of pages covered by the object. Suppose this range is [p,q]. We also look up the spans for pages p-1 and q+1. If either of these neighboring spans is free, we coalesce it with the [p,q] span. The resulting span is inserted into the appropriate free list in the page heap.

Central Free Lists for Small Objects

+ +As mentioned before, we keep a central free list for each size-class. +Each central free list is organized as a two-level data structure: +a set of spans, and a linked list of free objects per span. + +

+An object is allocated from a central free list by removing the +first entry from the linked list of some span. (If all spans +have empty linked lists, a suitably sized span is first allocated +from the central page heap.) + +

+An object is returned to a central free list by adding it to the +linked list of its containing span. If the linked list length now +equals the total number of small objects in the span, this span is now +completely free and is returned to the page heap. + +

Garbage Collection of Thread Caches

+ +A thread cache is garbage collected when the combined size of all +objects in the cache exceeds 2MB. The garbage collection threshold +is automatically decreased as the number of threads increases so that +we don't waste an inordinate amount of memory in a program with lots +of threads. + +

+We walk over all free lists in the cache and move some number of +objects from the free list to the corresponding central list. + +

+The number of objects to be moved from a free list is determined using +a per-list low-water-mark L. L records the +minimum length of the list since the last garbage collection. Note +that we could have shortened the list by L objects at the +last garbage collection without requiring any extra accesses to the +central list. We use this past history as a predictor of future +accesses and move L/2 objects from the thread cache free +list to the corresponding central free list. This algorithm has the +nice property that if a thread stops using a particular size, all +objects of that size will quickly move from the thread cache to the +central free list where they can be used by other threads. + +

Performance Notes

+ +

PTMalloc2 unittest

+The PTMalloc2 package (now part of glibc) contains a unittest program +t-test1.c. This forks a number of threads and performs a series of +allocations and deallocations in each thread; the threads do not +communicate other than by synchronization in the memory allocator. + +

t-test1 (included in google-perftools/tests/tcmalloc, and compiled as ptmalloc_unittest1) was run with varying numbers of threads (1-20) and maximum allocation sizes (64 bytes - 32Kbytes). These tests were run on a 2.4GHz dual Xeon system with hyper-threading enabled, using Linux glibc-2.3.2 from RedHat 9, with one million operations per thread in each test. In each case, the test was run once normally, and once with LD_PRELOAD=libtcmalloc.so.

The graphs below show the performance of TCMalloc vs PTMalloc2 for +several different metrics. Firstly, total operations (millions) per elapsed +second vs max allocation size, for varying numbers of threads. The raw +data used to generate these graphs (the output of the "time" utility) +is available in t-test1.times.txt. + +

+ + + + + + + + + + + + + + + + +
+ + +

+ +

Next, operations (millions) per second of CPU time vs number of threads, for +max allocation size 64 bytes - 128 Kbytes. + +

+ + + + + + + + + + + + + + + + +
+ +

Here we see again that TCMalloc is both more consistent and more +efficient than PTMalloc2. For max allocation sizes <32K, TCMalloc +typically achieves ~2-2.5 million ops per second of CPU time with a +large number of threads, whereas PTMalloc achieves generally 0.5-1 +million ops per second of CPU time, with a lot of cases achieving much +less than this figure. Above 32K max allocation size, TCMalloc drops +to 1-1.5 million ops per second of CPU time, and PTMalloc drops almost +to zero for large numbers of threads (i.e. with PTMalloc, lots of CPU +time is being burned spinning waiting for locks in the heavily +multi-threaded case). + +

Caveats

+ +

For some systems, TCMalloc may not work correctly with applications that aren't linked against libpthread.so (or the equivalent on your OS). It should work on Linux using glibc 2.3, but other OS/libc combinations have not been tested.

TCMalloc may be somewhat more memory hungry than other mallocs, +though it tends not to have the huge blowups that can happen with +other mallocs. In particular, at startup TCMalloc allocates +approximately 6 MB of memory. It would be easy to roll a specialized +version that trades a little bit of speed for more space efficiency. + +

+TCMalloc currently does not return any memory to the system. + +

+Don't try to load TCMalloc into a running binary (e.g., using +JNI in Java programs). The binary will have allocated some +objects using the system malloc, and may try to pass them +to TCMalloc for deallocation. TCMalloc will not be able +to handle such objects. + + + +
