4.4BSD/usr/src/contrib/mprof/mprof.tex

% Master File: mprof.tex
% Document type: LaTeX
\documentstyle[11pt,captions,doublespace,twoside]{article}

\input{defs}
\setstretch{1}

\fullpagestyle

\begin{document}

\title{A Memory Allocation Profiler for C and Lisp Programs
\footnotetext{This research
was funded by DARPA contract number N00039-85-C-0269 as part of the
SPUR research project.}}


\author{Benjamin Zorn \\
	Department of Computer Science \\
	University of Colorado at Boulder \\
	\\
	Paul Hilfinger \\
	Computer Science Division, Dept. of EECS \\
	University of California at Berkeley \\
	}

\date{}

\maketitle

\begin{singlespace}
\begin{abstract}        
This paper describes {\tt mprof}, a tool used to study the dynamic
memory allocation behavior of programs.  {\tt Mprof} records the
amount of memory that a function allocates, breaks down allocation
information by type and size, and displays a program's dynamic call
graph so that functions indirectly responsible for memory allocation
are easy to identify.  {\tt Mprof} is a two-phase tool.  The monitor
phase is linked into executing programs and records information each
time memory is allocated.  The display phase reduces the data
generated by the monitor and presents the information to a user in
several tables.  {\tt Mprof} has been implemented for C and Kyoto
Common Lisp.  Measurements of these implementations are presented.
\end{abstract}
\end{singlespace}


\section{Introduction}
\setlength{\parskip}{6pt plus 2pt}

Dynamic memory allocation, hereafter referred to simply as memory
allocation, is an important part of many programs.  By dynamic
allocation, we mean the memory allocated from the heap.  Unnecessary
allocation can decrease program locality and increase execution time
for the allocation itself and for possible memory reclamation.  If
reclamation is not performed, or if some objects are accidently not
reclaimed (a ``memory leak''), programs can fail when they reach the
memory size limit.  Programmers often write their own versions of
memory allocation routines to measure and reduce allocation overhead.
It is estimated that Mesa programmers spent 40\% of their time solving
storage-management related problems before automatic storage
reclamation techniques were introduced in Cedar~\cite{rovner85:gc}.
Even with automatic storage management, in which reclamation occurs
transparently, memory allocation has a strong effect on the
performance of programs~\cite{moon84:gc}.  Although memory allocation
is important, few software tools exist to help programmers understand
the memory allocation behavior of their programs.

{\tt Mprof} is a tool that allows programmers to identify where and
why dynamic memory is allocated in a program.  It records which
functions are directly responsible for memory allocation and also
records the dynamic call chain at an allocation to show which
functions were indirectly responsible for the allocation.  The design
of {\tt mprof} was inspired by the tool {\tt gprof}, a dynamic
execution profiler~\cite{graham83:gprof}.  {\tt gprof} is a very
useful tool for understanding the execution behavior of programs; {\tt
mprof} extends the ideas of {\tt gprof} to give programmers
information about the dynamic memory allocation behavior of their
programs. {\tt Mprof} is a two-phase tool.  A monitor phase is linked
into an executing application and collects data as the application
executes.  A reduction and display phase takes the data collected by
the monitor and presents it to the user in concise tables.

A profiling program such as {\tt mprof} should satisfy several
criteria.  First, a program monitor should not significantly alter the
behavior of the program being monitored.  In particular, a monitor
should not impose so much overhead on the program being monitored that
large programs cannot be profiled.  Second, a monitor should be easy
to integrate into existing applications.  To use {\tt mprof},
programmers simply have to relink their applications with a special
version of the system library.  No source code modifications are
required.  Finally, a monitor should provide a programmer with
information that he can understand and use to reduce the memory
allocation overhead of his programs.  We will present an example that
illustrates such a use of {\tt mprof}.

In Section \ref{using mprof}, we present a simple program and
describe the use of {\tt mprof} with respect to the example. In
Section \ref{implementing mprof} we discuss techniques for the
effective implementation of {\tt mprof} .  Section \ref{measurements}
presents some measurements of {\tt mprof}.  Section \ref{related work}
describes other memory profiling tools and previous work on which {\tt
mprof} is based, while Section \ref{conclusions} contains our
conclusions.

\section{Using {\tt mprof}}
\label{using mprof}

\subsection{A Producer/Consumer Example}

To illustrate how {\tt mprof} helps a programmer understand the memory
allocation in his program, consider the C program in Figure~\ref{C
example figure}.  In this program, a simplified producer/consumer
simulation, objects are randomly allocated by two producers and freed
by the consumer.  The function {\tt random\_flip}, which is not shown,
randomly returns 1 or 0 with equal probability.  The function {\tt
consume\_widget}, which is responsible for freeing the memory
allocated, contains a bug and does not free red widgets.  If the
simulation ran for a long time, memory would eventually be exhausted,
and the program would fail.

\begin{figure}[htbp]
\begin{singlespace}
{\footnotesize
\include{expr}
}
\end{singlespace}
\caption{A Simple Producer/Consumer Simulation Program}
\label{C example figure}
\end{figure}

In the example, {\tt make\_widget} is the only function that allocates
memory directly.  To fully understand the allocation behavior of the
program, we must know which functions called {\tt make\_widget} and
hence were indirectly responsible for memory allocation.  {\tt Mprof}
provides this information.

To use {\tt mprof}, programmers link in special versions of the system
functions {\tt malloc} and {\tt free}, which are called each time
memory is allocated and freed, respectively.  The application is then
run normally.  The {\tt mprof} monitor function, linked in with {\tt
malloc}, gathers statistics as the program runs and writes this
information to a file when the application exits.  The programmer then
runs a display program over the data file, and four tables are
printed: a list of memory leaks, an allocation bin table, a direct
allocation table, and a dynamic call graph.  Each table presents the
allocation behavior of the program from a different perspective.  The
rest of this section describes each of the tables for the C program in
Figure~\ref{C example figure}.

% Fields in the tables are described in detail in the appendix.

\subsection{The Memory Leak Table}

C programmers must explicitly free memory objects when they are done
using them.  Memory leaks arise when programmers accidently forget to
release memory.  Because Lisp reclaims memory automatically, the
memory leak table is not necessary in the Lisp version of {\tt mprof}.

The memory leak table tells a programmer which functions allocated the
memory associated with memory leaks.  The table contains a list of
partial call paths that resulted in memory that was allocated and not
subsequently freed.  The paths are partial because complete path
information is not recorded; only the last five callers on the call
stack are listed in the memory leak table.  In our simple example,
there is only one such path, and it tells us immediately what objects
are not freed.  Figure~\ref{memory leak figure} contains the memory
leak table for our example.

\begin{figure}[htbp]
\begin{singlespace}
{
\include{extable0a}
}
\end{singlespace}
\caption{Memory Leak Table for Producer/Consumer Example}
\label{memory leak figure}
\end{figure}

In larger examples, more than one path through a particular function
is possible.  We provide an option that distinguishes individual call
sites within the same function in the memory leak table if such a
distinction is needed.

\subsection{The Allocation Bin Table}

A major part of understanding the memory allocation behavior of a
program is knowing what objects were allocated.  In C, memory
allocation is done by object size; the type of object being allocated
is not known at allocation time.  The allocation bin table provides
information about what sizes of objects were allocated and what
program types correspond to the sizes listed.  This knowledge helps a
programmer recognize which data structures consume the most memory and
allows him to concentrate any space optimizations on them.

The allocation bin table breaks down object allocation by the size, in
bytes, of allocated objects.  Figure~\ref{allocation bin figure} shows
the allocation bin table for the program in Figure~\ref{C example
figure}.

\begin{figure}[htbp]
\begin{singlespace}
{
\include{extable0}
}
\end{singlespace}
\caption{Allocation Bin Table for Producer/Consumer Example}
\label{allocation bin figure}
\end{figure}

The allocation bin table contains information about objects of each
byte size from 0 to 1024 bytes and groups objects larger than 1024
bytes into a single bin.  For each byte size in which memory was
allocated, the allocation bin table shows the number of allocations of
that size ({\tt allocs}), the total number of bytes allocated to
objects of that size ({\tt bytes}), the number of frees of objects of
that size ({\tt frees}), the number of bytes not freed that were
allocated to objects of that size ({\tt kept}\footnote{The label {\tt
kept} is used throughout the paper to refer to objects that were never
freed.}), and user types whose size is the same as the bin size ({\tt
types}).  From the example, we can see that 10,000 widgets were
allocated by the program, but only 4,981 of these were eventually
freed, resulting in 1,023,876 bytes of memory lost to the memory leak.
The percentages show what fraction of all bins a particular bin
contributed.  This information is provided to allow a user to rapidly
determine which bins are of interest (i.e., contribute a substantial
percentage).  99\% is the largest percentage possible because we chose
to use a 2 character field width.

\subsection{The Direct Allocation Table}

Another facet of understanding memory allocation is knowing which
functions allocated memory and how much they allocated.  In C, memory
allocation is performed explicitly by calling {\tt malloc}, and so
programmers are often aware of the functions that allocate memory.
Even in C, however, knowing how much memory was allocated can point
out functions that do unnecessary allocation and guide the programmer
when he attempts to optimize the space consumption of his program.  In
Lisp, memory allocation happens implicitly in many primitive routines
such as {\tt mapcar}, {\tt *}, and {\tt intern}. The direct allocation
table can reveal unsuspected sources of allocation to Lisp
programmers.

Figure~\ref{direct allocation figure} contains the direct allocation
table for our example.  The direct allocation table corresponds to the
flat profile generated by {\tt gprof}.

\begin{figure}[htbp]
\begin{singlespace}
{\scriptsize
\include{extable1}
}
\end{singlespace}

\caption{Direct Allocation Table for Producer/Consumer Example}
\label{direct allocation figure}
\end{figure}

The first line of the direct allocation table contains the totals for
all functions allocating memory.  In this example, only one function,
{\tt make\_widget}, allocates memory.  The direct allocation table
prints percent of total allocation that took place in each function
({\tt \%~mem}), the number of bytes allocated by each function ({\tt
bytes}), the number of bytes allocated by the function and never freed
({\tt bytes kept}), and the number of calls made to the function that
resulted in allocation ({\tt calls}).  The {\tt \%~mem(size)} fields
contain a size breakdown\footnote{ Both the direct allocation table
and the dynamic call graph break down object allocation into four
categories of object size: small (s), from 0--32 bytes; medium (m),
from 33--256 bytes; large (l), from 257--2048 bytes; and extra large
(x), larger than 2048 bytes.  For Lisp, categorization is by type
rather than size: cons cell (c), floating point number (f), structure
or vector (s), and other (o).} of the memory allocated by each
function as a fraction of the memory allocated by all functions.  In
this example, 99\% of the memory allocated by the program was
allocated in {\tt make\_widget} for medium-sized objects.  Blank
columns indicate values less than 1\%.  The other size breakdown given
in the direct allocation table is for the memory that was allocated
and never freed.  The {\tt \%~all~kept} field contains a size
breakdown of the memory not freed by a particular function as a
fraction of all the memory not freed.  In the example, 99\% of the
unfreed memory was of medium-sized objects allocated by {\tt
make\_widget}.

\subsection{The Allocation Call Graph}

Understanding the memory allocation behavior of a programs sometimes
requires more information than just knowing the functions that are
directly responsible for memory allocation.  Sometimes, as happens in
Figure~\ref{C example figure}, the same allocation function is called
by several different functions for different purposes.  The allocation
call graph shows all the functions that were indirect callers of
functions that allocated memory.

Because the dynamic caller/callee relations of a program are numerous,
the dynamic call graph is a complex table with many entries.  Often,
the information provided by the first three tables is enough to allow
programmers to understand the memory allocation of their program.
Nevertheless, for a full understanding of the allocation behavior of
programs the allocation call graph is useful.  Figure \ref{call graph
figure} contains the allocation call graph for the producer/consumer
example and corresponds to the call graph profile generated by {\tt
gprof}. 	

\begin{figure}[ht]
\begin{singlespace}
{\scriptsize
\include{extable2}
}
\end{singlespace}

\caption{Allocation Call Graph for Producer/Consumer Example}
\label{call graph figure}
\end{figure}

The allocation call graph is a large table with an entry for each
function that was on a call chain when memory was allocated.  Each
table entry is divided into three parts.  The line for the function
itself (called the {\em entry function\/}); lines above that line,
each of which represents a caller of the entry function (the
ancestors), and lines below that line, each of which represents a
function called by the entry function (the descendents).  The entry
function is easy to identify in each table entry because a large rule
appears in the {\tt frac} column on that row.  In the first entry of
Figure \ref{call graph figure}, {\tt main} is the entry function;
there are no ancestors and two descendents.

The entry function line of the allocation call graph contains
information about the function itself.  The {\tt index} field provides
a unique index to help users navigate through the call graph.  The
{\tt self~$+$~desc} field contains the percent of total memory
allocated that was allocated in this function and its descendents.
The call graph is sorted by decreasing values in this field.  The {\tt
self} field contains the number of bytes that were allocated directly
in the entry function.  The {\tt size-func} fields contain a size
breakdown of the memory allocated in the function itself.  Some
functions, like {\tt main} (index 0) allocated no memory directly, so
the {\tt size-func} fields are all blank.  The {\tt called} field
shows the number of times this function was called during a memory
allocation, with the number of recursive calls recorded in the
adjacent field.

Each caller of the entry function is listed on a separate line above
it.  A summary of all callers is given on the top line of the entry if
there is more than one ancestor.  The {\tt self} field of ancestors
lists the number of bytes that the entry function and its descendents
allocated on behalf of the ancestor.  The {\tt size-ances} field
breaks down those bytes into size categories, while the {\tt
frac-ances} field shows the size breakdown of the bytes requested by
this ancestor as a fraction of bytes allocated at the request of all
ancestors.  For example, in the entry for function {\tt make\_widget}
(index 1), the ancestor {\tt make\_red\_widget} can be seen to have
requested 1,023,876 bytes of data from {\tt make\_widget}, 99\% of which
was of medium-sized objects.  Furthermore, calls from {\tt
make\_red\_widget} accounted for 50\% of the total memory allocated by
{\tt make\_widget} and its descendents.  Other fields show how many
calls the ancestor made to the entry function and how many calls the
ancestor made in total.  In a similar fashion, information about the
function's descendents appears below the entry function.

Had the memory leak table not already told us what objects were not
being freed, we could use the allocation call graph for the same
purpose.  The direct allocation table told us that {\tt make\_widget}
allocated 1,023,876 bytes of unfreed memory, all for medium-sized
objects.  From the allocation call graph, we can see that the function
{\tt make\_red\_widget} was the function calling {\tt make\_widget}
that requested 1,023,876 bytes of medium-sized objects.  

Cycles in the call graph are not illustrated in Figure~\ref{call graph
figure}.  As described in the next section, cycles obscure allocation
information among functions that are members of a cycle.  When the
parent/child relationships that appear in the graph are between
members of the same cycle, most of the fields in the graph must be
omitted.


\section{Implementation}
\label{implementing mprof}

We have implemented {\tt mprof} for use with C and Common Lisp
programs.  Since the implementations are quite similar, the C
implementation will be described in detail, and the minor differences
in the Lisp implementation will be noted at the end of the section.

\subsection{The Monitor} 

The first phase of {\tt mprof} is a monitor that is linked into the
executing application.  The monitor includes modified versions of {\tt
malloc} and {\tt free} that record information each time they are
invoked.  Along with {\tt malloc} and {\tt free}, {\tt mprof} provides
its own {\tt exit} function, so that when the application program
exits, the data collected by the monitor is written to a file.  The
monitor maintains several data structures needed to construct the
tables.

To construct the leak table, the monitor associates a list of the last
five callers in the call chain, the {\em partial call chain\/}, with
the object allocated.  {\tt mprof} augments every object allocated
with two items: an integer which is the object size as requested by
the user (since the allocator may allocate an object of a different
size for convenience), and a pointer to a structure that contains the
object's partial call chain and a count of allocations and frees of
objects with that call chain.  A hash table is used to map a partial
call chain to the structure containing the counters.  When an object
is allocated, its partial call chain is used as a hash key to retrieve
the structure containing the counters.  A pointer to the structure is
placed in the allocated object and the allocation counter is
incremented.  When the object is later freed, the pointer is followed
and the counter of frees is incremented.  Any partial call chain in
which the number of allocations does not match the number of frees
indicates a memory leak and is printed in the leak table.

To construct the allocation bin table, the monitor has a 1026-element
array of integers to count allocations and another 1026-element array
to count frees.  When objects of a particular size from 0--1024 bytes
are allocated or freed, the appropriated bin is incremented.  Objects
larger than 1024 bytes are grouped into the same bin.

The construction of the direct allocation table falls out directly
from maintaining the allocation call graph information, which is
described in the next section.

\subsection{Constructing the Allocation Call Graph}

To construct the allocation call graph, the monitor must associate the
number of bytes allocated with every function on the current dynamic
call chain, each time {\tt malloc} is called.  Consider the sample
call chain in Figure~\ref{cchain:fig}, which we abbreviate:
{\tt main}-$>${\tt foo}-$>${\tt bar}(24).

\begin{figure}[htbp]
\begin{verbatim}
     CALL STACK:                     MPROF RECORDS:
     main calls foo                       24 bytes over main -> foo
           foo calls bar                  24 bytes over foo -> bar
              bar calls malloc(24)        24 bytes allocated in bar
\end{verbatim}
\caption{Example of a Dynamic Call Chain}
\label{cchain:fig}
\end{figure}

In {\tt mprof}, the monitor traverses the entire call chain by
following return addresses.  This differs from {\tt gprof}, where only
the immediate caller of the current function is recorded.  {\tt gprof}
makes the assumption that each call takes an equal amount of time and
uses this assumption to reconstruct the complete dynamic call graph
from information only about the immediate callers.  In {\tt mprof}, we
actually traverse the entire dynamic call chain and need to make no
assumptions. 

In choosing to traverse the entire call chain, we have elected to
perform an operation that is potentially expensive both in time and
space.  One implementation would simply record every function in every
chain and write the information to a file (i.e., in the example we
would output [{\tt main}-$>${\tt foo}-$>${\tt bar}, 24]).  Considering that
many programs execute millions of calls to {\tt malloc} and that the
depth of a call chain can be hundreds of functions, the amount of
information could be prohibitive.

An alternative to recording the entire chain of callers is to break
the call chain into a set of caller/callee pairs, and associate the
bytes allocated with each pair in the chain.  For the call in the
example, we could maintain the pairs [{\tt main}, {\tt foo}] and [{\tt foo},
{\tt bar}], and associate 24 bytes with each pair.  Conceptually, the
data structure our monitor maintains is an association between
caller/callee pairs and the cumulative bytes allocated over the pair,
which we denote ([{\tt main}, {\tt foo}], 24).  To continue with the
example, if the next allocation was: {\tt main}-$>${\tt foo}-$>${\tt
otherbar}(10), where this is the first call to {\tt otherbar}, we
would update the byte count associated with the [{\tt main}, {\tt foo}] pair
to 34 from 24.  Furthermore, we would create a new association between
[{\tt foo}, {\tt otherbar}] and the byte count, 10.  A disadvantage
with this implementation is that the exact call chains are no longer
available.  However, from the pairs we can construct the correct
dynamic call graph of the program, which is the information that we
need for the allocation call graph.

For the overhead imposed by the monitor to be reasonable, we have to
make the association between caller/callee pairs and cumulative byte
counts fast.  We use a hash table in which the hash function is a
simple byte-swap XOR of the callee address.  Each callee has a list of
its callers and the number of allocated bytes associated with each
pair.  In an effort to decrease the number of hash lookups, we noted
that from allocation to allocation, most of the call chain remains the
same.  Our measurements show that on the average, 60--75\% of the call
chain remains the same between allocations.  This observation allows
us to cache the pairs associated with the current caller chain and to
use most of these pairs the next time a caller chain is recorded.
Thus, on any particular allocation, only a few addresses need to be
hashed.  Here are the events that take place when a call to {\tt
malloc} is monitored:

\begin{enumerate}
  \item The chain of return addresses is stored in a vector.
  \item The new chain is compared with the previous chain, and the point
at which they differ is noted.
  \item For the addresses in the chain that have not changed, the
caller/callee byte count for each pair is already available and is
incremented. 
  \item For new addresses in the chain, each caller/callee byte count is
looked up and updated.
  \item For the tail of the chain (i.e., the function that called {\tt
malloc} directly), the direct allocation information is recorded.
\end{enumerate}

Maintaining allocation call graph information requires a byte count
for every distinct caller/callee pair in every call chain that
allocates memory.  Our experience is that there are a limited number
of such pairs, even in very large C programs, so that the memory
requirements of the {\tt mprof} monitor are not large (see
section~\ref{mmeas:sec}).

\subsection{Reduction and Display}

The second phase of {\tt mprof} reads the output of the monitor,
reduces the data to create a dynamic call graph, and displays the data
in four tables.  The first part of the data reduction is to map the
caller/callee address pairs to actual function names.  A program {\tt
mpfilt} reads the executable file that created the monitor trace
(compiled so that symbol table information is retained), and outputs a
new set of function caller/callee relations.  These relations are then
used to construct the subset of the program's dynamic call graph that
involved memory allocation.

The call graph initially can contain cycles due to recursion in the
program's execution.  Cycles in the call graph introduce spurious
allocation relations, as is illustrated in Figure~\ref{recursive call
figure}.  In this example, {\tt main} is credited as being indirectly
responsible for 10 bytes, but because we only keep track of
caller/callee pairs, {\tt F} appears to have requested 20 bytes from
{\tt G}, even though only 10 bytes were allocated.

\begin{figure}[htbp]
\begin{verbatim}
        CALL STACK:                     MPROF RECORDS:
        main calls F                    (10 bytes over main -> F)
             F calls G                  (10 bytes over F -> G)
               G calls F                (10 bytes over G -> F)
                 F calls G              (10 MORE bytes over F -> G)
                   G calls malloc(10)   (10 bytes allocated in G)
\end{verbatim}
\caption{Problems Caused by Recursive Calls}
\label{recursive call figure}
\end{figure}

We considered several solutions to the problems caused by cycles and
adopted the most conservative solution.  One way to avoid recording
spurious allocation caused by recursion is for the monitor to identify
the cycles before recording the allocation.  For example, in
Figure~\ref{recursive call figure}, the monitor could realize that it
had already credited {\tt F} with the 10 bytes when it encountered
{\tt F} calling {\tt G} the second time.  This solution adds overhead
to the monitor and conflicts with our goal to make the monitor as
unobtrusive as possible.

The solution that we adopted was to merge functions that are in a
cycle into a single node in the reduction phase.  Thus, each strongly
connected component in the dynamic call graph is merged into a single
node.  The result is a call graph with no cycles.  This process is
also used by {\tt gprof}, and described carefully
elsewhere~\cite{graham83:gprof}.  Such an approach works well in {\tt
gprof} because C programs, for which {\tt gprof} was primarily
intended, tend to have limited amounts of recursion.  Lisp programs,
for which {\tt mprof} is also intended, intuitively contain much more
recursion.  We have experience profiling a number of large Common Lisp
programs.  We observe several recursive cycles in most programs, but
the cycles generally contain a small percentage of the total functions
and {\tt mprof} is quite effective.

\subsection{Lisp Implementation}

So far, we have described the implementation of {\tt mprof} for C.
The Lisp implementation is quite similar, and here we describe the
major differences.  C has a single function, {\tt malloc}, that is
called to allocate memory explicitly.  Lisp has a large number of
primitives that allocate memory implicitly (i.e., {\tt cons}, {\tt *},
{\tt intern}, etc.).  To make {\tt mprof} work, these primitives must
be modified so that every allocation is recorded.  Fortunately, at the
Lisp implementation level, all memory allocations may be channeled
through a single routine.  We worked with KCL (Kyoto Common Lisp),
which is implemented in C.  In KCL, all Lisp memory allocations are
handled by a single function, {\tt alloc\_object}.  Just as we had
modified {\tt malloc} in C, we were able to simply patch {\tt
alloc\_object} to monitor memory allocation in KCL.

The other major difference in monitoring Lisp is that the addresses
recorded by the monitor must be translated into Lisp function names.
Again, KCL makes this quite easy because Lisp functions are defined in
a central place in KCL and the names of the functions are known when
they are defined.  Many other Lisp systems are designed to allow
return addresses to be mapped to symbolic function names so that the
call stack can be printed at a breakpoint.  In this case, the monitor
can use the same mechanism to map return addresses to function names.
Therefore, in Lisp systems in which addresses can be quickly mapped to
function names, memory profiling in the style of {\tt mprof} is not a
difficult problem.  In systems in which symbolic names are not
available in compiled code, profiling is more difficult.  Furthermore,
many systems open-code important allocation functions, like {\tt
cons}.  Because open-coded allocation functions will not necessarily
call a central allocation function (like {\tt alloc\_object}), such
allocations will not be observed by {\tt mprof}.  To avoid such a loss
of information, {\tt mprof} should be used in conjunction with program
declarations that will force allocation functions such as {\tt cons}
to be coded out-of-line.

\section{Measurements}
\label{measurements}

We have measured the C implementation of {\tt mprof} by instrumenting
four programs using {\tt mprof}.  The first program, {\tt example}, is
our example program with the number of widgets allocated increased to
100,000 to increase program execution time.  The second program, {\tt
fidilrt}, is the runtime library of FIDIL, a programming language for
finite difference computations~\cite{hilfinger88:fidil}.  The third
program, {\tt epoxy}, is an electrical and physical layout optimizer
written by Fred Obermeier~\cite{fwo87:epoxy}.  The fourth program,
{\tt crystal}, is a VLSI timing analysis
program~\cite{ouster85:crystal}.  These tests represent a small
program ({\tt example}, 100 lines); a medium-sized program ({\tt
fidilrt}, 7,100 lines); and two large programs ({\tt epoxy}, 11,000
lines and {\tt crystal}, 10,500 lines).  In the remainder of this
section, we will look at the resource consumption of {\tt mprof} from
two perspectives: execution time overhead and space consumption.

\subsection{Execution Time Overhead}

There are two sources of execution time overhead associated with {\tt
mprof}: additional time spent monitoring an application and the time
to reduce and print the data produced by the monitor.  The largest
source of monitor overhead is the time required to traverse the
complete call chain and associate allocations with caller/callee
pairs.  We implemented a version of {\tt mprof}, called {\tt mprof-},
which does not create the allocation call graph.  With this version,
we can see the relative cost of the allocation call graph.  The ratio
of the time spent with profiling to the time spent without profiling
is called the {\em slowdown factor\/}.  Table \ref{cpu:tab} summarizes
the execution time overheads for our four applications.  Measurements
were gathered running the test programs on a VAX 8800 with 80
megabytes of physical memory.

\begin{table}[htbp]
\begin{singlespace}
\begin{center}
\begin{tabular}{|l|r|r|r|r|} \hline
\multicolumn{1}{|c|}{} & \multicolumn{4}{|c|}{Cost} \\ \cline{2-5}
\multicolumn{1}{|c|}{Resource Description} & 
\multicolumn{1}{|c|}{\tt example} & 
\multicolumn{1}{|c|}{\tt fidilrt} & 
\multicolumn{1}{|c|}{\tt epoxy} & 
\multicolumn{1}{|c|}{\tt crystal} \\ \hline
Number of allocations & 100000 & 77376 & 306295 & 31158 \\
Execution time with {\tt mprof} (seconds) & 62.7 & 132.7 & 188.8 & 134.1 \\
Execution time with {\tt mprof-} (seconds) & 44.1 & 116.0 & 149.7 & 25.5 \\
Execution time without {\tt mprof} (seconds) & 17.9 & 107.1 & 52.1 & 13.2 \\
\hline
Slowdown using {\tt mprof} & 3.5 & 1.2 & 3.6 & 10.1 \\
Slowdown using {\tt mprof-} & 2.5 & 1.1 & 2.9 & 1.9 \\
\hline 
Reduction and display time (seconds) & 10.3 & 28.8 & 28.3 & 36.8 \\
\hline 
\end{tabular}
\caption{Execution Time Overhead of {\tt mprof}}
\label{cpu:tab}
\end{center}
\end{singlespace}
\end{table}

The slowdown associated with {\tt mprof} varies widely, from 1.5 to
10.  {\tt crystal} suffered the worst degradation from profiling
because {\tt crystal} uses a depth-first algorithm that results in
long call chains.  Programs without long call chains appear to slow
down by a factor of 2--4.  If the allocation call graph is not
generated and long call chains are not traversed, the slowdown is
significantly less, especially in the extreme cases.  Since {\tt
mprof} is a prototype and has not been carefully optimized, this
overhead seems acceptable.  From the table, we see the reduction and
display time is typically less than a minute.

\subsection{Storage Consumption}
\label{mmeas:sec}

The storage consumption of {\tt mprof} is divided into the additional
memory needed by the monitor as an application executes, and the
external storage required by the profile data file.  The most
significant source of memory used by the monitor is the data stored
with each object allocated: an object size and a pointer needed to
construct the memory leak table.  The monitor also uses memory to
record the memory bins and caller/callee byte counts and must write
this information to a file when the application is finished.  We
measured how many bytes of memory and disk space are needed to store
this information.  Table~\ref{mem:tab} summarizes the measurements of
storage consumption associated with {\tt mprof}.

\begin{table}[htbp]
\begin{singlespace}
\begin{center}
\begin{tabular}{|l|r|r|r|r|} \hline
\multicolumn{1}{|c|}{} & \multicolumn{4}{|c|}{Cost} \\ \cline{2-5}
\multicolumn{1}{|c|}{Resource Description} & 
\multicolumn{1}{|c|}{\tt example} & 
\multicolumn{1}{|c|}{\tt fidilrt} & 
\multicolumn{1}{|c|}{\tt epoxy} & 
\multicolumn{1}{|c|}{\tt crystal} \\ \hline
Number of allocations & 100000 & 61163 & 306295 & 31158 \\
User memory allocated (Kbytes) & 20000 & 2425 & 6418  & 21464 \\
\hline
Per object memory (Kbytes) & 781 & 477 & 2393 & 168 \\
Other monitor memory (Kbytes) & 8.7 & 23.3 & 52.3 & 17.5 \\
Total monitor memory (Kbytes) & 790 & 500 & 2445 & 186 \\
Monitor fraction (\% memory allocated) & 4 & 17 & 28 & 1 \\
\hline
Data file size (Kbytes) & 4.5 & 8.1 & 28.6 & 9.6 \\
\hline 
\end{tabular}
\caption{Storage Consumption of {\tt mprof}}
\label{mem:tab}
\end{center}
\end{singlespace}
\end{table}

The memory overhead of {\tt mprof} is small, except that an additional
8 bytes of storage are allocated with every object.  In programs in
which many small objects are allocated, like {\tt epoxy}, {\tt mprof}
can contribute significantly to the total memory allocated.
Nevertheless, in the worst case, {\tt mprof} increases application
size by 1/3, and since {\tt mprof} is a development tool, this
overhead seems acceptable.  From the table we also see that the data
files created by {\tt mprof} are quite small ($<$ 30 Kbytes).

\section{Related Work}
\label{related work}

{\tt Mprof} is similar to the tool {\tt gprof}~\cite{graham83:gprof},
a dynamic execution profiler.  Because some of the problems of
interpreting the dynamic call graph are the same, we have borrowed
these ideas from {\tt gprof}.  Also, we have used ideas from the user
interface of {\tt gprof} for two reasons: because the information
being displayed is quite similar and because users familiar with {\tt
gprof} would probably also be interested in {\tt mprof} and would
benefit from a similar presentation.

Barach, Taenzer, and Wells developed a tool for finding storage
allocation errors in C programs~\cite{barach82:memprof}.  Their
approach concentrated on finding two specific storage allocation
errors: memory leaks and duplicate frees.  They modified {\tt malloc}
and {\tt free} so that every time that these functions were called,
information about the memory block being manipulated was recorded in a
file.  A program that examines this file, {\tt prleak}, prints out
which memory blocks were never freed or were freed twice.  This
approach differs from {\tt mprof} in two ways.  First, {\tt mprof}
provides more information about the memory allocation of programs than
{\tt prleak}, which just reports on storage errors.  Second, {\tt
prleak} generates extremely large intermediate files that are
comparable in size to the total amount of memory allocated by the
program (often megabytes of data).  Although {\tt mprof} records more
useful information, the data files it generates are of modest size
(see the table above).

\section{Conclusions}
\label{conclusions}

We have implemented a memory allocation profiling program for both C
and Common Lisp.  Our example has shown that {\tt mprof} can be
effective in elucidating the allocation behavior of a program so that
a programmer can detect memory leaks and identify major sources of
allocation.

Unlike {\tt gprof}, {\tt mprof} records every caller in the call chain
every time an object is allocated.  The overhead for this recording is
large but not impractically so, because we take advantage of the fact
that a call chain changes little between allocations.  Moreover,
recording this information does not require large amounts of memory
because there are relatively few unique caller/callee address pairs on
call chains in which allocation takes place, even in very large
programs.  We have measured the overhead of {\tt mprof}, and find that
typically it slows applications by a factor of 2--4 times, and adds up
to 33\% to the memory allocated by the application.  Because {\tt
mprof} is intended as a development tool, these costs are acceptable.

Because {\tt mprof} merges cycles caused by recursive function calls,
{\tt mprof} may be ineffective for programs with large cycles in their
call graph.  Only with more data will we be able to decide if many
programs (especially those written in Lisp) contain so many recursive
calls that cycle merging makes {\tt mprof} ineffective.  Nevertheless,
{\tt mprof} has already been effective in detecting KCL system
functions that allocate memory extraneously.\footnote{Using {\tt
mprof}, we noted that for a large object-oriented program written in
KCL, the system function {\tt every} accounted for 13\% of the memory
allocated.  We rewrote {\tt every} so it would not allocate any
memory, and decreased the memory consumption of the program by 13\%.}

As a final note, we have received feedback from C application
programmers who have used {\tt mprof}.  They report that the memory
leak table and the allocation bin table are both extremely useful,
while the direct allocation table and the allocation call graph are
harder to understand and also less useful.  Considering the execution
overhead associated with the allocation call graph and the complexity
of the table, it is questionable whether the allocation call graph
will ever be as helpful C programmers as the memory leak table.  On
the other hand, with automatic storage reclamation, the memory leak
table becomes unnecessary.  Yet for memory intensive languages, such
as Lisp, the need for effective use of the memory is more important,
and tools such as the allocation call graph might prove very useful.
Because we have limited feedback from Lisp programmers using {\tt
mprof}, we cannot report their response to this tool.

\bibliography{/spur2/larus/bib/abbrev,lispm,gc,/spur2/larus/bib/optim,spur,misc,lisp}
\bibliographystyle{plain}

\end{document}
%