diff --git a/docs/html/cpu_profiler.html b/docs/html/cpu_profiler.html new file mode 100644 index 0000000..f05d5ec --- /dev/null +++ b/docs/html/cpu_profiler.html @@ -0,0 +1,409 @@ +Google CPU Profiler + +This is the CPU profiler we use at Google. There are three parts to +using it: linking the library into an application, running the code, +and analyzing the output. + + +

Linking in the Library

+ +

To install the CPU profiler into your executable, add -lprofiler to +the link-time step for your executable. (It's also probably possible +to add in the profiler at run-time using LD_PRELOAD, but this isn't +necessarily recommended.)

+ +

This does not turn on CPU profiling; it just inserts the code. +For that reason, it's practical to just always link -lprofiler into a +binary while developing; that's what we do at Google. (However, since +any user can turn on the profiler by setting an environment variable, +it's not necessarily recommended to install profiler-linked binaries +into a production, running system.)

+ + +

Running the Code

+ +

There are two ways to turn on CPU profiling for a given run of an executable:

+ +
    +
  1. Define the environment variable CPUPROFILE to the filename to dump the + profile to. For instance, to profile /usr/local/netscape: +
          $ CPUPROFILE=/tmp/profile /usr/local/netscape           # sh
    +      % setenv CPUPROFILE /tmp/profile; /usr/local/netscape   # csh
    +     
    + OR + +
  2. In your code, bracket the code you want profiled in calls to + ProfilerStart() and ProfilerStop(). ProfilerStart() will take the + profile-filename as an argument. +
+ +

In Linux 2.6 and above, profiling works correctly with threads, +automatically profiling all threads. In Linux 2.4, profiling only +profiles the main thread (due to a kernel bug involving itimers and +threads). Profiling works correctly with sub-processes: each child +process gets its own profile with its own name (generated by combining +CPUPROFILE with the child's process id).

+ +

For security reasons, CPU profiling will not write to a file -- and +is thus not usable -- for setuid programs.

+ +

Controlling Behavior via the Environment

+ +

In addition to the environment variable CPUPROFILE, +which determines where profiles are written, there are several +environment variables which control the performance of the CPU +profile.

+ + + + + + +
PROFILEFREQUENCY=x   How many interrupts per second the CPU profiler samples (100 by default).
+ +

Analyzing the Output

+ +

pprof is the script used to analyze a profile. It has many output +modes, both textual and graphical. Some give just raw numbers, much +like the -pg output of gcc, and others show the data in the form of a +dependency graph.

+ +

pprof requires perl5 to be installed to run. It also +requires dot to be installed for any of the graphical output routines, +and gv to be installed for --gv mode (described below).

+ +

Here are some ways to call pprof. These are described in more +detail below.

+ +
% pprof "program" "profile"
+  Generates one line per procedure
+
+% pprof --gv "program" "profile"
+  Generates annotated call-graph and displays via "gv"
+
+% pprof --gv --focus=Mutex "program" "profile"
+  Restrict to code paths that involve an entry that matches "Mutex"
+
+% pprof --gv --focus=Mutex --ignore=string "program" "profile"
+  Restrict to code paths that involve an entry that matches "Mutex"
+  and does not match "string"
+
+% pprof --list=IBF_CheckDocid "program" "profile"
+  Generates source-code listing of all routines with at least one
+  sample that matches the --list= pattern.  The listing is
+  annotated with the flat and cumulative sample counts at each line.
+
+% pprof --disasm=IBF_CheckDocid "program" "profile"
+  Generates disassembly listing of all routines with at least one
+  sample that matches the --disasm= pattern.  The listing is
+  annotated with the flat and cumulative sample counts at each PC value.
+
+ +

Node Information

+ +

In the various graphical modes of pprof, the output is a call graph +annotated with timing information, like so:

+ + +
+ +
+
+ +

Each node represents a procedure. +The directed edges indicate caller to callee relations. Each node is +formatted as follows:

+ +
Class Name
+Method Name
+local (percentage)
+of cumulative (percentage)
+
+ +

The last one or two lines contain the timing information. (The profiling is done via a sampling method, where by default we take 100 samples a second. Therefore one unit of time in the output corresponds to about 10 milliseconds of execution time.) The "local" time is the time spent executing the instructions directly contained in the procedure (and in any other procedures that were inlined into the procedure). The "cumulative" time is the sum of the "local" time and the time spent in any callees. If the cumulative time is the same as the local time, it is not printed.

For instance, the timing information for test_main_thread() +indicates that 155 units (about 1.55 seconds) were spent executing the +code in test_main_thread() and 200 units were spent while executing +test_main_thread() and its callees such as snprintf().

+ +

The size of the node is proportional to the local count. The +percentage displayed in the node corresponds to the count divided by +the total run time of the program (that is, the cumulative count for +main()).

+ +

Edge Information

+ +

An edge from one node to another indicates a caller to callee relationship. Each edge is labelled with the time spent by the callee on behalf of the caller. E.g., the edge from test_main_thread() to snprintf() indicates that of the 200 samples in test_main_thread(), 37 are because of calls to snprintf().

+ +

Note that test_main_thread() has an edge to vsnprintf(), even +though test_main_thread() doesn't call that function directly. This +is because the code was compiled with -O2; the profile reflects the +optimized control flow.

+ +

Meta Information

+ +The top of the display should contain some meta information like: +
      /tmp/profiler2_unittest
+      Total samples: 202
+      Focusing on: 202
+      Dropped nodes with <= 1 abs(samples)
+      Dropped edges with <= 0 samples
+
+ +This section contains the name of the program, and the total samples +collected during the profiling run. If the --focus option is on (see +the Focus section below), the legend also +contains the number of samples being shown in the focused display. +Furthermore, some unimportant nodes and edges are dropped to reduce +clutter. The characteristics of the dropped nodes and edges are also +displayed in the legend. + +

Focus and Ignore

+ +

You can ask pprof to generate a display focused on a particular +piece of the program. You specify a regular expression. Any portion +of the call-graph that is on a path which contains at least one node +matching the regular expression is preserved. The rest of the +call-graph is dropped on the floor. For example, you can focus on the +vsnprintf() libc call in profiler2_unittest as follows:

+ +
% pprof --gv --focus=vsnprintf /tmp/profiler2_unittest test.prof
+
+ +
+ +
+
+ +

+Similarly, you can supply the --ignore option to ignore +samples that match a specified regular expression. E.g., +if you are interested in everything except calls to snprintf(), +you can say: +

% pprof --gv --ignore=snprintf /tmp/profiler2_unittest test.prof
+
+ +

pprof Options

+ +

Output Type

+ +

+

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
--text Produces a textual listing. This is currently the default since it does not need access to an X display, or to dot or gv. However, if you have these programs installed, you will probably be happier with the --gv output.
--gv + Generates annotated call-graph, converts to postscript, and + displays via gv. +
--dot + Generates the annotated call-graph in dot format and + emits to stdout. +
--ps + Generates the annotated call-graph in Postscript format and + emits to stdout. +
--gif + Generates the annotated call-graph in GIF format and + emits to stdout. +
--list=<regexp> +

Outputs source-code listing of routines whose + name matches <regexp>. Each line + in the listing is annotated with flat and cumulative + sample counts.

+ +

In the presence of inlined calls, the samples + associated with inlined code tend to get assigned + to a line that follows the location of the + inlined call. A more precise accounting can be + obtained by disassembling the routine using the + --disasm flag.

+
--disasm=<regexp> + Generates disassembly of routines that match + <regexp>, annotated with flat and + cumulative sample counts and emits to stdout. +
+
+ +

Reporting Granularity

+ +

By default, pprof produces one entry per procedure. However you can +use one of the following options to change the granularity of the +output. The --files option seems to be particularly useless, and may +be removed eventually.

+ +
+ + + + + + + + + + + + + + +
--addresses + Produce one node per program address. +
--lines + Produce one node per source line. +
--functions + Produce one node per function (this is the default). +
--files + Produce one node per source file. +
+
+ +

Controlling the Call Graph Display

+ +

Some nodes and edges are dropped to reduce clutter in the output +display. The following options control this effect:

+ +
+ + + + + + + + + + + + + + + + + + + + + +
--nodecount=<n> + This option controls the number of displayed nodes. The nodes + are first sorted by decreasing cumulative count, and then only + the top N nodes are kept. The default value is 80. +
--nodefraction=<f> + This option provides another mechanism for discarding nodes + from the display. If the cumulative count for a node is + less than this option's value multiplied by the total count + for the profile, the node is dropped. The default value + is 0.005; i.e. nodes that account for less than + half a percent of the total time are dropped. A node + is dropped if either this condition is satisfied, or the + --nodecount condition is satisfied. +
--edgefraction=<f> + This option controls the number of displayed edges. First of all, + an edge is dropped if either its source or destination node is + dropped. Otherwise, the edge is dropped if the sample + count along the edge is less than this option's value multiplied + by the total count for the profile. The default value is + 0.001; i.e., edges that account for less than + 0.1% of the total time are dropped. +
--focus=<re> + This option controls what region of the graph is displayed + based on the regular expression supplied with the option. + For any path in the callgraph, we check all nodes in the path + against the supplied regular expression. If none of the nodes + match, the path is dropped from the output. +
--ignore=<re> + This option controls what region of the graph is displayed + based on the regular expression supplied with the option. + For any path in the callgraph, we check all nodes in the path + against the supplied regular expression. If any of the nodes + match, the path is dropped from the output. +
+
+ +

The dropped edges and nodes account for some count mismatches in +the display. For example, the cumulative count for +snprintf() in the first diagram above was 41. However the local +count (1) and the count along the outgoing edges (12+1+20+6) add up to +only 40.

+ + +

Caveats

+ + + +
+Last modified: Wed Apr 20 04:54:23 PDT 2005 + + diff --git a/docs/html/heap_checker.html b/docs/html/heap_checker.html new file mode 100644 index 0000000..fd82f37 --- /dev/null +++ b/docs/html/heap_checker.html @@ -0,0 +1,133 @@ + +Google Heap Checker +

Automatic Leaks Checking Support

+ +This document describes how to check the heap usage of a C++ +program. This facility can be useful for automatically detecting +memory leaks. + +

Linking in the Heap Checker

+ +

+You can heap-check any program that has the tcmalloc library linked +in. No recompilation is necessary to use the heap checker. +

+ +

+In order to catch all heap leaks, tcmalloc must be linked last into +your executable. The heap checker may mischaracterize some memory +accesses in libraries listed after it on the link line. For instance, +it may report these libraries as leaking memory when they're not. +(See the source code for more details.) +

+ +

+It's safe to link in tcmalloc even if you don't expect to +heap-check your program. Your programs will not run any slower +as long as you don't use any of the heap-checker features. +

+ +

+You can run the heap checker on applications you didn't compile +yourself, by using LD_PRELOAD: +

+
   $ LD_PRELOAD="/usr/lib/libtcmalloc.so" HEAPCHECK=normal <binary>
+
+

+We don't necessarily recommend this mode of usage. +

+ +

Turning On Heap Checking

+ +

There are two ways to turn on heap checking for a given run of an executable.

+ + \ No newline at end of file diff --git a/docs/html/heap_profiler.html b/docs/html/heap_profiler.html new file mode 100644 index 0000000..6936901 --- /dev/null +++ b/docs/html/heap_profiler.html @@ -0,0 +1,310 @@ + +Google Heap Profiler +

Profiling heap usage

+ +This document describes how to profile the heap usage of a C++ program. This facility can be useful for figuring out what is in the program heap at any given time, locating memory leaks, and finding places that do a lot of allocation.

Linking in the Heap Profiler

+ +

+You can profile any program that has the tcmalloc library linked +in. No recompilation is necessary to use the heap profiler. +

+ +

+It's safe to link in tcmalloc even if you don't expect to heap-profile your program. Your programs will not run any slower as long as you don't use any of the heap-profiler features.

+ +

+You can run the heap profiler on applications you didn't compile +yourself, by using LD_PRELOAD: +

+
   $ LD_PRELOAD="/usr/lib/libtcmalloc.so" HEAPPROFILE=... 
+
+

+We don't necessarily recommend this mode of usage. +

+ + +

Turning On Heap Profiling

+ +

+Define the environment variable HEAPPROFILE to the filename to dump the +profile to. For instance, to profile /usr/local/netscape: +

+
 $ HEAPPROFILE=/tmp/profile /usr/local/netscape           # sh
+ % setenv HEAPPROFILE /tmp/profile; /usr/local/netscape   # csh
+
+ +

Profiling also works correctly with sub-processes: each child +process gets its own profile with its own name (generated by combining +HEAPPROFILE with the child's process id).

+ +

For security reasons, heap profiling will not write to a file -- and is thus not usable -- for setuid programs.

+ + + +

Extracting a profile

+ +

+If heap-profiling is turned on in a program, the program will periodically +write profiles to the filesystem. The sequence of profiles will be named: +

+
           <prefix>.0000.heap
+           <prefix>.0001.heap
+           <prefix>.0002.heap
+           ...
+
+

+where <prefix> is the value supplied in +HEAPPROFILE. Note that if the supplied prefix +does not start with a /, the profile files will be +written to the program's working directory. +

+ +

+By default, a new profile file is written after every 1GB of +allocation. The profile-writing interval can be adjusted by calling +HeapProfilerSetAllocationInterval() from your program. This takes one +argument: a numeric value that indicates the number of bytes of allocation +between each profile dump. +

+ +

+You can also generate profiles from specific points in the program +by inserting a call to HeapProfile(). Example: +

+
    extern const char* HeapProfile();
+    const char* profile = HeapProfile();
+    fputs(profile, stdout);
+    free(const_cast<char*>(profile));
+
+ +

What is profiled

+ +The profiling system instruments all allocations and frees. It keeps track of various pieces of information per allocation site. An allocation site is defined as the active stack trace at the call to malloc, calloc, realloc, or new.

Interpreting the profile

+ +The profile output can be viewed by passing it to the +pprof tool. The pprof tool can print both +CPU usage and heap usage information. It is documented in detail +on the CPU Profiling page. +Heap-profile-specific flags and usage are explained below. + +

+Here are some examples. These examples assume the binary is named +gfs_master, and a sequence of heap profile files can be +found in files named: +

+
  profile.0001.heap
+  profile.0002.heap
+  ...
+  profile.0100.heap
+
+ +

Why is a process so big

+ +
    % pprof --gv gfs_master profile.0100.heap
+
+ +This command will pop up a gv window that displays the profile information as a directed graph. Here is a portion of the resulting output:

+

+ +
+

+ +A few explanations: + + +

Comparing Profiles

+ +

+You often want to skip allocations during the initialization phase of +a program so you can find gradual memory leaks. One simple way to do +this is to compare two profiles -- both collected after the program +has been running for a while. Specify the name of the first profile +using the --base option. Example: +

+
   % pprof --base=profile.0004.heap gfs_master profile.0100.heap
+
+ +

+The memory-usage in profile.0004.heap will be subtracted from +the memory-usage in profile.0100.heap and the result will +be displayed. +

+ +

Text display

+ +
% pprof gfs_master profile.0100.heap
+   255.6  24.7%  24.7%    255.6  24.7% GFS_MasterChunk::AddServer
+   184.6  17.8%  42.5%    298.8  28.8% GFS_MasterChunkTable::Create
+   176.2  17.0%  59.5%    729.9  70.5% GFS_MasterChunkTable::UpdateState
+   169.8  16.4%  75.9%    169.8  16.4% PendingClone::PendingClone
+    76.3   7.4%  83.3%     76.3   7.4% __default_alloc_template::_S_chunk_alloc
+    49.5   4.8%  88.0%     49.5   4.8% hashtable::resize
+   ...
+
+ +

+

+ +

Ignoring or focusing on specific regions

+ +The following command will give a graphical display of a subset of +the call-graph. Only paths in the call-graph that match the +regular expression DataBuffer are included: +
% pprof --gv --focus=DataBuffer gfs_master profile.0100.heap
+
+ +Similarly, the following command will omit a subset of the call-graph: all paths in the call-graph that match the regular expression DataBuffer are discarded:
% pprof --gv --ignore=DataBuffer gfs_master profile.0100.heap
+
+ +

Total allocations + object-level information

+ +

+All of the previous examples have displayed the amount of in-use +space. I.e., the number of bytes that have been allocated but not +freed. You can also get other types of information by supplying +a flag to pprof: +

+ +
+ + + + + + + + + + + + + + + + + + + + + +
--inuse_space + Display the number of in-use megabytes (i.e. space that has + been allocated but not freed). This is the default. +
--inuse_objects + Display the number of in-use objects (i.e. number of + objects that have been allocated but not freed). +
--alloc_space + Display the number of allocated megabytes. This includes + the space that has since been de-allocated. Use this + if you want to find the main allocation sites in the + program. +
--alloc_objects + Display the number of allocated objects. This includes + the objects that have since been de-allocated. Use this + if you want to find the main allocation sites in the + program. +
+
+ +

Caveats

+ + + +
+
Sanjay Ghemawat
+ + +Last modified: Wed Apr 20 05:46:16 PDT 2005 + + + diff --git a/docs/html/tcmalloc.html b/docs/html/tcmalloc.html new file mode 100644 index 0000000..aa2d3ee --- /dev/null +++ b/docs/html/tcmalloc.html @@ -0,0 +1,373 @@ + +TCMalloc : Thread-Caching Malloc + + + + + + +

TCMalloc : Thread-Caching Malloc

+ +
Sanjay Ghemawat, Paul Menage <opensource@google.com>
+ +

Motivation

+ +TCMalloc is faster than the glibc 2.3 malloc (available as a separate library called ptmalloc2) and other mallocs that I have tested. ptmalloc2 takes approximately 300 nanoseconds to execute a malloc/free pair on a 2.8 GHz P4 (for small objects). The TCMalloc implementation takes approximately 50 nanoseconds for the same operation pair. Speed is important for a malloc implementation because if malloc is not fast enough, application writers are inclined to write their own custom free lists on top of malloc. This can lead to extra complexity and more memory usage unless the application writer is very careful to appropriately size the free lists and scavenge idle objects out of the free list.

+TCMalloc also reduces lock contention for multi-threaded programs. +For small objects, there is virtually zero contention. For large +objects, TCMalloc tries to use fine grained and efficient spinlocks. +ptmalloc2 also reduces lock contention by using per-thread arenas but +there is a big problem with ptmalloc2's use of per-thread arenas. In +ptmalloc2 memory can never move from one arena to another. This can +lead to huge amounts of wasted space. For example, in one Google application, the first phase would +allocate approximately 300MB of memory for its data +structures. When the first phase finished, a second phase would be +started in the same address space. If this second phase was assigned a +different arena than the one used by the first phase, this phase would +not reuse any of the memory left after the first phase and would add +another 300MB to the address space. Similar memory blowup problems +were also noticed in other applications. + +

+Another benefit of TCMalloc is space-efficient representation of small +objects. For example, N 8-byte objects can be allocated while using +space approximately 8N * 1.01 bytes. I.e., a one-percent +space overhead. ptmalloc2 uses a four-byte header for each object and +(I think) rounds up the size to a multiple of 8 bytes and ends up +using 16N bytes. + + +

Usage

+ +

To use TCMalloc, just link tcmalloc into your application via the "-ltcmalloc" linker flag.

+ +

+You can use tcmalloc in applications you didn't compile yourself, by +using LD_PRELOAD: +

+
   $ LD_PRELOAD="/usr/lib/libtcmalloc.so" <binary>
+
+

+LD_PRELOAD is tricky, and we don't necessarily recommend this mode of +usage. +

+ +

TCMalloc includes a heap checker +and heap profiler as well.

+ +

If you'd rather link in a version of TCMalloc that does not include +the heap profiler and checker (perhaps to reduce binary size for a +static binary), you can link in libtcmalloc_minimal +instead.

+ + +

Overview

+ +TCMalloc assigns each thread a thread-local cache. Small allocations +are satisfied from the thread-local cache. Objects are moved from +central data structures into a thread-local cache as needed, and +periodic garbage collections are used to migrate memory back from a +thread-local cache into the central data structures. +
+ +

+ +TCMalloc treats objects with size <= 32K ("small" objects) differently from larger objects. Large objects are allocated directly from the central heap using a page-level allocator (a page is a 4K aligned region of memory). I.e., a large object is always page-aligned and occupies an integral number of pages.

+A run of pages can be carved up into a sequence of small objects, each +equally sized. For example a run of one page (4K) can be carved up +into 32 objects of size 128 bytes each. + +

Small Object Allocation

+ +Each small object size maps to one of approximately 170 allocatable +size-classes. For example, all allocations in the range 961 to 1024 +bytes are rounded up to 1024. The size-classes are spaced so that +small sizes are separated by 8 bytes, larger sizes by 16 bytes, even +larger sizes by 32 bytes, and so forth. The maximal spacing (for sizes +>= ~2K) is 256 bytes. + +

+A thread cache contains a singly linked list of free objects per size-class. +

+ +When allocating a small object: (1) We map its size to the corresponding size-class. (2) Look in the corresponding free list in the thread cache for the current thread. (3) If the free list is not empty, we remove the first object from the list and return it. When following this fast path, TCMalloc acquires no locks at all. This helps speed up allocation significantly because a lock/unlock pair takes approximately 100 nanoseconds on a 2.8 GHz Xeon.

+ +If the free list is empty: (1) We fetch a bunch of objects from a central free list for this size-class (the central free list is shared by all threads). (2) Place them in the thread-local free list. (3) Return one of the newly fetched objects to the application.

+If the central free list is also empty: (1) We allocate a run of pages +from the central page allocator. (2) Split the run into a set of +objects of this size-class. (3) Place the new objects on the central +free list. (4) As before, move some of these objects to the +thread-local free list. + +

Large Object Allocation

+ +A large object size (> 32K) is rounded up to a page size (4K) and is handled by a central page heap. The central page heap is again an array of free lists. For k < 256, the kth entry is a free list of runs that consist of k pages. The 256th entry is a free list of runs that have length >= 256 pages:
+ +

+An allocation for k pages is satisfied by looking in the +kth free list. If that free list is empty, we look in +the next free list, and so forth. Eventually, we look in the last +free list if necessary. If that fails, we fetch memory from the +system (using sbrk, mmap, or by mapping in portions of /dev/mem). + +

+If an allocation for k pages is satisfied by a run +of pages of length > k, the remainder of the +run is re-inserted back into the appropriate free list in the +page heap. + +

Spans

+ +The heap managed by TCMalloc consists of a set of pages. A run of +contiguous pages is represented by a Span object. A span +can either be allocated, or free. If free, the span +is one of the entries in a page heap linked-list. If allocated, it is +either a large object that has been handed off to the application, or +a run of pages that have been split up into a sequence of small +objects. If split into small objects, the size-class of the objects +is recorded in the span. + +

+A central array indexed by page number can be used to find the span to +which a page belongs. For example, span a below occupies 2 +pages, span b occupies 1 page, span c occupies 5 +pages and span d occupies 3 pages. +

+A 32-bit address space can fit 2^20 4K pages, so this central array +takes 4MB of space, which seems acceptable. On 64-bit machines, we +use a 3-level radix tree instead of an array to map from a page number +to the corresponding span pointer. + +

Deallocation

+ +When an object is deallocated, we compute its page number and look it up +in the central array to find the corresponding span object. The span tells +us whether or not the object is small, and its size-class if it is +small. If the object is small, we insert it into the appropriate free +list in the current thread's thread cache. If the thread cache now +exceeds a predetermined size (2MB by default), we run a garbage +collector that moves unused objects from the thread cache into central +free lists. + +

+ +If the object is large, the span tells us the range of pages covered by the object. Suppose this range is [p,q]. We also look up the spans for pages p-1 and q+1. If either of these neighboring spans is free, we coalesce it with the [p,q] span. The resulting span is inserted into the appropriate free list in the page heap.

Central Free Lists for Small Objects

+ +As mentioned before, we keep a central free list for each size-class. +Each central free list is organized as a two-level data structure: +a set of spans, and a linked list of free objects per span. + +

+An object is allocated from a central free list by removing the +first entry from the linked list of some span. (If all spans +have empty linked lists, a suitably sized span is first allocated +from the central page heap.) + +

+An object is returned to a central free list by adding it to the +linked list of its containing span. If the linked list length now +equals the total number of small objects in the span, this span is now +completely free and is returned to the page heap. + +

Garbage Collection of Thread Caches

+ +A thread cache is garbage collected when the combined size of all +objects in the cache exceeds 2MB. The garbage collection threshold +is automatically decreased as the number of threads increases so that +we don't waste an inordinate amount of memory in a program with lots +of threads. + +

+We walk over all free lists in the cache and move some number of +objects from the free list to the corresponding central list. + +

+The number of objects to be moved from a free list is determined using +a per-list low-water-mark L. L records the +minimum length of the list since the last garbage collection. Note +that we could have shortened the list by L objects at the +last garbage collection without requiring any extra accesses to the +central list. We use this past history as a predictor of future +accesses and move L/2 objects from the thread cache free +list to the corresponding central free list. This algorithm has the +nice property that if a thread stops using a particular size, all +objects of that size will quickly move from the thread cache to the +central free list where they can be used by other threads. + +

Performance Notes

+ +

PTMalloc2 unittest

+The PTMalloc2 package (now part of glibc) contains a unittest program +t-test1.c. This forks a number of threads and performs a series of +allocations and deallocations in each thread; the threads do not +communicate other than by synchronization in the memory allocator. + +

t-test1 (included in google-perftools/tests/tcmalloc, and compiled as ptmalloc_unittest1) was run with varying numbers of threads (1-20) and maximum allocation sizes (64 bytes - 32Kbytes). These tests were run on a 2.4GHz dual Xeon system with hyper-threading enabled, using Linux glibc-2.3.2 from RedHat 9, with one million operations per thread in each test. In each case, the test was run once normally, and once with LD_PRELOAD=libtcmalloc.so.

The graphs below show the performance of TCMalloc vs PTMalloc2 for +several different metrics. Firstly, total operations (millions) per elapsed +second vs max allocation size, for varying numbers of threads. The raw +data used to generate these graphs (the output of the "time" utility) +is available in t-test1.times.txt. + +

+ + + + + + + + + + + + + + + + +
+ + +

+ +

Next, operations (millions) per second of CPU time vs number of threads, for +max allocation size 64 bytes - 128 Kbytes. + +

+ + + + + + + + + + + + + + + + +
+ +

Here we see again that TCMalloc is both more consistent and more +efficient than PTMalloc2. For max allocation sizes <32K, TCMalloc +typically achieves ~2-2.5 million ops per second of CPU time with a +large number of threads, whereas PTMalloc achieves generally 0.5-1 +million ops per second of CPU time, with a lot of cases achieving much +less than this figure. Above 32K max allocation size, TCMalloc drops +to 1-1.5 million ops per second of CPU time, and PTMalloc drops almost +to zero for large numbers of threads (i.e. with PTMalloc, lots of CPU +time is being burned spinning waiting for locks in the heavily +multi-threaded case). + +

Caveats

+ +

For some systems, TCMalloc may not work correctly with applications that aren't linked against libpthread.so (or the equivalent on your OS). It should work on Linux using glibc 2.3, but other OS/libc combinations have not been tested.

TCMalloc may be somewhat more memory hungry than other mallocs, +though it tends not to have the huge blowups that can happen with +other mallocs. In particular, at startup TCMalloc allocates +approximately 6 MB of memory. It would be easy to roll a specialized +version that trades a little bit of speed for more space efficiency. + +

+TCMalloc currently does not return any memory to the system. + +

+Don't try to load TCMalloc into a running binary (e.g., using +JNI in Java programs). The binary will have allocated some +objects using the system malloc, and may try to pass them +to TCMalloc for deallocation. TCMalloc will not be able +to handle such objects. + + + +
