.so ../ADM/mac
.XX grap 109 "Grap \(em A Language for Typesetting Graphs"
.EQ
delim $$
.EN
.so macros
.ds g \f2grap\fP
.ds G \f2Grap\fP
.TL
Grap \(em A Language for Typesetting Graphs
.br
Tutorial and User Manual
.AU
Jon L. Bentley
Brian W. Kernighan
.AI
.MH
.AB
\*G
is a language for describing plots of data.
This graph of the 1984
age distribution in the United States
.grap agepop1.g
is produced by the
\*g
commands
.P1
.get agepop1.g
.P2
(Each line in the data file
.UL agepop.d
contains an age and the number of Americans of that
age alive in 1984; the file is sorted by age.)
.PP
The
\*g
preprocessor works with
.I pic |reference(latest pic)
and
.I troff |reference(latest troff reference).
Most of its input is passed
through untouched, but statements between
.UL .G1
and
.UL .G2
are translated into
.I pic
commands that draw graphs.
.AE
.NH
Introduction
.PP
\*G
is a language for describing graphical
displays of data.
It provides such services as automatic scaling and
labeling of axes, and
.UL for
statements,
.UL if
statements, and macros to facilitate user
programmability.
\*G
is intended primarily for including graphs in
documents prepared on the
.UX
operating system, and is only marginally
useful for elementary tasks in data analysis.
.PP
Section 2 of this document is a tutorial introduction to
\*g;
readers who find it slow going may wish to skim ahead.
The examples in Section 3 illustrate
the various kinds of graphs that
\*g
can produce and some common
\*g
idioms.
Mundane matters about using
\*g
are discussed in Section 4,
and Section 5 contains a brief reference manual.
.PP
We have tried to illustrate good principles of
statistics and graphical design in the
graphs we present.
In several places, though, good taste has lost to
the necessity of illustrating
\*g
capabilities.
Readers interested in statistical
integrity and taste should
consult the literature, for example |reference(chambers graphs)
|reference(tufte graphs) |reference(cleveland elements).
.NH
Tutorial
.PP
The following is a simple
\*g
program\(dg
.FS
\(dg Throughout
this document we will show only the first five
lines and the last line of data files;
omitted lines are indicated by ``...''.
.FE
.P1
\&.G1
.d 400mtimes.d
\&.G2
.P2
The single number on each line
is the winning time in seconds for the
men's 400 meter run,
from the first modern Olympic Games (1896)
to the twenty-first (1988).
If the file
.UL olymp.g
contains the text above,
then typing the command
.P1
grap olymp.g | pic | troff > junk
.P2
creates a
.I troff
output file
.UL junk
that contains the
picture
.grap 4001.g
The graph shows the decrease
in winning times from 54.2
seconds to 43.87 seconds.
If the times are
contained in the file
.UL 400mtimes.d ,
we could
produce the same graph with the
shorter program
.P1
.get 4001.g
.P2
Writing
.UL copy
.UL \&"fname"
in a
\*g
program is equivalent to including the
contents of file
.UL fname
at that point in the file.
(In the interests of compatibility with other programs,
.UL include
is a synonym for
.UL copy .)
.PP
Each line in the file
.UL 400mpairs.d
contains two numbers, the
year of the Olympics and the winning time:
.P1
.d 400mpairs.d
.P2
If we plot this data with the program
.P1
.get 4002.g
.P2
the bottom ($x$) axis represents the year of the Olympics.
.grap 4002.g
The ``holes'' in $x$-values reflect the fact
that the 1916, 1940, and 1944 Olympics
were cancelled due to war.
Because the previous data
(in
.UL 400mtimes.d )
had just one number per
line,
\*g
viewed it as a ``time series'' and
supplied $x$-values of $1, ~ 2, ~ 3, ...$
before plotting
the data as $y$-values.
The input to the
second program has two values per line,
so they are interpreted as $( x , y )$ pairs.
.PP
Rather than a scatter plot of points, we might prefer to
see the winning times connected by a solid
line.
The program
.P1
.get 4003.g
.P2
produces the graph
.grap 4003.g
Eric Liddell of Great Britain
won his gold medal
in Paris in 1924 with a time of 47.6 seconds.
(Remember ``Chariots
of Fire''?)
.PP
We can make the graph more attractive
by modifying its frame
and adding labels.
.P1
.get 4004.g
.P2
The
.UL frame
command describes
the graph's bounding box:
the overall frame (which has four sides)
is invisible, it is 2 inches high and 3 inches
wide (which happen to be the
default height and width),
and the left and bottom
sides are solid (they could have been
dashed or dotted instead).
The labels appear on the left and bottom, as requested.
.grap 4004.g
.PP
To set the range of each axis,
\*g
examines the data and pads both
dimensions
by seven percent at each end.
The
.UL coord
(``coordinates'') command
allows you to specify the range of one or both axes explicitly;
it also turns off automatic padding.
.P1
.get 4005.g
.P2
The $y$-axis now ranges from 42 to 56 seconds
(a little more than before),
and the $x$-axis from 1894 to 1990
(a little less).
.grap 4005.g
.PP
The ticks in the preceding graphs were generated
by
\*g
guessing at reasonable values.
If you would rather provide your own,
you may
use the
.UL ticks
command,
which comes in the flavors illustrated below.
.P1
.get 4006.g
.P2
The first
.UL ticks
command deals with the left axis:
it puts the ticks facing out at
the numbers in the list.
\*G
puts labels only at values
with strings,
except that when no labels at all are
given, each number serves as its own label,
as in the second
.UL ticks
command.
That command
is for the bottom axis:
it puts the ticks facing in at steps of 20
from 1900 to 1980.
The command
.UL "ticks off"
turns off all ticks.
\*G
does its best to place labels appropriately, but
it sometimes needs your help:
the
.UL "left .2"
clause moves the left label 0.2 inches further left to
avoid the new ticks.
.grap 4006.g
.PP
The file
.UL 400wpairs.d
contains the times for
the women's 400 meter race, which has been run
only since 1964.
.P1
.d 400wpairs.d
.P2
To add these times to the graph,
we use
.P1
.get 4007.g
.P2
The
.UL new
command tells
\*g
to end
the old curve and to start a new curve
(which in this case will be drawn
with a dotted line).
Text is placed on the graph by
commands of the form
.P1
"string" at xvalue, yvalue
.P2
The
.UL size
clauses following the quoted strings tell
\*g
to shrink the characters by three points (absolute point sizes
may also be specified).
Strings are usually centered at the specified position,
but can be adjusted by clauses to be illustrated shortly.
.grap 4007.g
.PP
The file
.UL phone.d
records the number of telephones in the United States from
1900 to 1970.
.P1
.d phone.d
.P2
Each line gives a year and the number of telephones
present in that year
(in millions, truncated to one decimal place).
The simple
\*g
program
.P1
.get phone1.g
.P2
produces the simple graph
.grap phone1.g
.PP
The number of telephones appears to
grow exponentially;
to study that we will plot the data with
a logarithmic $y$-axis by adding
.UL log
.UL y
to the
.UL coord
command.
We will also add cosmetic changes of labels, more ticks,
and a solid line to replace the unconnected dots.
.P1
.get phone2.g
.P2
The third
.UL ticks
command provides a string that is used to print the tick
labels.
.UC C
programmers will recognize it as a
.UL printf
format string; others may view the
.CW %g
as the place to put
the number and anything else (in this case just an apostrophe) as
literal text to appear in the labels.
To suppress
labels, use the empty format string ("").
The program produces
.grap phone2.g
The number of telephones grew rapidly
in the first decade of this century,
and then settled down to an exponential growth rate upset only
by a decrease in the Great Depression and a post-war growth
spurt
to return the curve to its pre-Depression line.
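As a sketch of how format strings behave
(the ranges here are hypothetical, chosen only for illustration),
the statements
.P1
ticks bot out from 10 to 60 by 10 "'%g"
ticks top out from 10 to 60 by 10 ""
.P2
would label the bottom ticks with an apostrophe followed by
10, 20, and so on up to 60,
and would leave the top ticks unlabeled.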
.PP
Our presentation so far has been to
start with a simple
\*g
program that illustrates the data, and then refine it.
Later in this document we will ignore the design
phase, and present rather complex graphs in
their final form.
Beware.
.PP
All the examples so far have placed data on the
graph implicitly by
.UL copy ing
a file of numbers
(either a time series with one number per line or
pairs of numbers).
It is also possible to draw points and lines explicitly.
The
\*g
commands to draw on a graph
are illustrated in the following
fragment.
.P1
.get geom.g
.P2
.PP
The
.UL grid
command is similar to the
.UL ticks
command, except that grid lines extend
across the frame.
The next few commands plot text at specified positions.
The plotting characters (such as
.UL bullet )
are implemented as predefined
macros \(em more on that shortly.
Unlike arbitrary characters,
the visual centers of the markers
are near their plotting centers.
The
.UL circle
command draws a circle centered at the specified location.
A radius in inches may be specified;
if no radius is given, then the circle will be the
small circle shown at the center of the graph.
The
.UL line
and
.UL arrow
commands draw the obvious objects shown at the upper left.
.grap geom.g
.PP
This figure also illustrates the combined use of the
.UL draw
and
.UL next
commands.
Saying
.UL draw
.UL A
.UL solid
defines the style
for a connected sequence of line fragments to be called
.UL A .
Subsequent commands of
.UL next
.UL A
.UL at
.I point
add
.I point
to the end of
.UL A .
There are two such sequences active in the above
example
.UL A "" (
and
.UL B );
note that their
.UL next
commands are intermixed.
Because the predefined string
.UL delta
follows the specification of
.UL B ,
that string is plotted at each point in the sequence.
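As a minimal sketch of that interleaving
(the coordinates are invented for illustration and are not those of the figure),
the fragment
.P1
draw A solid
draw B dashed delta
next A at 1,1
next B at 1,3
next A at 2,2
next B at 2,2
next A at 3,3
next B at 3,1
.P2
builds a solid rising line
.UL A
and a dashed falling line
.UL B
whose points are marked with
.UL delta ,
even though the two sets of
.UL next
commands alternate.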
.PP
\*G
has numeric variables (implemented as double-precision
floating point numbers) and
the usual collection of arithmetic operators and
mathematical functions; see the reference section
for details.
.PP
\*G
provides the same rudimentary macro facility that
.I pic
does:
.P1
define \f2name\fP { \f2replacement text\fP }
.P2
defines
.IT name
to be the
.IT "replacement text" .
The replacement may be any text that contains balanced open and closing braces
.UL "{ }" .
(Alternatively, the
.IT "replacement text"
may be quoted by
any single character that does not appear in the replacement;
the string is terminated by the next occurrence of that character.)
Any subsequent occurrence of
.IT name
will be replaced by
.IT "replacement text" .
.EQ
delim %%
.EN
.PP
The replacement text of a macro definition may
contain occurrences of
.UL $1 ,
.UL $2 ,
etc.;
these will be replaced by the corresponding actual
arguments when the macro is invoked.
The invocation for a macro with arguments is
.P1
name(arg1, arg2, ...)
.P2
Non-existent arguments are replaced by null
strings.
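As a small sketch (the macro name
.UL cube
is our own invention for illustration), the statements
.P1
define cube { ($1)*($1)*($1) }
bullet at 2, cube(2)
bullet at 3, cube(3)
.P2
should plot bullets at (2,8) and (3,27).
The parentheses around
.UL $1
are a precaution: replacement is purely textual, so without them
an argument such as
.UL 1+1
would be expanded without grouping.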
.EQ
delim $$
.EN
.PP
The following
\*g
program uses macros and arithmetic to plot
crude approximations to
the square and square root functions.
.P1
.get macarith.g
.P2
The macro
.UL root
uses the
.UL ^
exponentiation operator.
(Because
\*g
has the square root function
.UL sqrt ,
that macro is in fact superfluous.)
The program produces
.grap macarith.g
.PP
The
.UL copy
command has a
.UL thru
parameter that allows each line of a file to
be treated as though it were a macro call, with
the first field serving as
the first argument,
and so on.
This is the typical
\*g
mechanism for plotting files that are not stored as
time series or as $(x,y)$ pairs.
We will illustrate its use on the file
.UL states.d ,
which contains data on the fifty states.
.P1
.d states.d
.P2
The first field is the postal abbreviation of the state's
name (Alaska, Wyoming, Vermont, ...), the second field
is the number of Representatives to Congress from the state
after the 1981 reapportionment, and the third field is
the population of the state as measured in the 1980 Census.
The states appear in increasing order of
population.
.PP
We will first plot this data as
(population, representatives) pairs.
(In the
.UL coord
statement,
.UL "log log"
is a synonym for
.UL "log x log y" .)
.P1
.get states1.g
.P2
Although the population is given in persons,
the
.UL PlotState
macro
plots the population in millions by dividing
the third input field
by one million (written in exponential notation
as
.UL 1e6 ,
for $1 times 10 sup 6$).
.grap states1.g
Using
.UL circle
as a plotting symbol displays
overlapping points that are obscured when
the data is plotted with bullets.
The representation of a state is roughly proportional
to its population, except in the very small states.
.PP
Our next plot will use the state's rank
in population as the $x$-coordinate and two
different $y$-coordinates: population and number of
representatives.
We will use two
.UL coord
commands to define the two coordinate systems
.UL pop
and
.UL rep .
We then explicitly give the coordinate system
whenever we refer to a point,
both in constructing axes and plotting data.
.P1
.get states2.g
.P2
The
.UL copy
statement in the program uses an
.I "immediate macro"
enclosed in curly brackets and thus avoids having to
name a macro for this task.
Because the program assumes that the states are
sorted in increasing order of population, it
generates
.UL thisrank
internally as a
\*g
variable.
The program produces
.grap states2.g
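The idiom can be reduced to a short sketch with a single coordinate system;
this is an illustration under our own assumptions, not the program above.
It assumes only what was said earlier about
.UL states.d :
fifty lines, smallest state first, with the population in the third field.
.EQ
delim off
.EN
.P1
thisrank = 50
copy "states.d" thru {
	bullet at thisrank, $3/1e6
	thisrank = thisrank - 1
}
.P2
.EQ
delim $$
.EN
The immediate macro is applied to each input line in turn,
so the first line read is plotted at rank 50 and the last at rank 1.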
.PP
The plotting symbols were chosen for contrast in
both shape and shading.
This graph also indicates that representation is proportional
to population.
Once we see this graph, though, we should realize that we don't
really need two coordinate systems: we can relate the two by
dividing the population of the U.S. \(em about 226,000,000 \(em by
the number of representatives \(em 435 \(em to see that each
representative should count as 520,000 people.
If the purpose of this graph were to tell a story about
American politics rather than to illustrate
multiple coordinate systems,
it should be redrawn with a single coordinate
system.
.PP
Many graphs plot both observed data and a function
that (theoretically) describes the data.
There are many ways to draw a function
in \*g:
a series of
.UL next
commands is tedious but works, as does writing a
simple program to write a data file that is subsequently
read and plotted by \*g.
The
.UL for
statement often provides a better solution.
This
\*g
program
.P1
.get sin1.g
.P2
produces
.grap sin1.g
.a
The
.UL for
statement uses the same syntax as the
.UL ticks
statement, but the
.UL from
keyword can be replaced by
.UL = '', ``
which will look more familiar to programmers.
It varies the index variable over the specified range
and for each value executes all statements inside the delimiter
characters, which use the same rules as macro
delimiters.
It is, of course, useful for many tasks beyond plotting functions.
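A minimal sketch, with a loop body of our own choosing
rather than one taken from the programs in this document:
.P1
\&.G1
draw solid
for i = 0 to 10 by 1 do {
	next at i, i*i
}
\&.G2
.P2
The index
.UL i
takes the values 0 through 10, and the
.UL next
commands join the eleven resulting points of a parabola with a solid line.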
.EQ
delim %%
.EN
.PP
The
.UL if
statement provides a simple mechanism for conditional execution.
If a file contains data on both cities and states (and lines
describing states have ``S'' in the first field), it could be plotted
by statements like
.P1
if "$1" == "S" then {
PlotState($2,$3,$4)
} else {
PlotCity($2,$3,$4,$5,$6)
}
.P2
The
.UL else
clause
is optional; delimiters use the same rules as macros and
.UL for
statements.
.EQ
delim $$
.EN
.NH
A Collection of Examples
.PP
The previous section covered the
\*g
commands that are used in common graphs.
In this section we'll spend less time on
language features, and survey a wider variety of
graphs.
These examples are intended more for browsing and
reference than for straight-through reading.
Be prepared to refer to the manual in Section 5 when you stumble over a new
\*g
feature.
.PP
The file
.UL cars.d
contains the mileage (miles per gallon) and the weight
(pounds) for 74 models of automobiles sold in the United States
in the 1979 model year.
.P1
.d cars.d
.P2
The trivial
\*g
program
.P1
.get cars1.g
.P2
produces
.grap cars1.g
This graph shows that weights bottom out somewhat
below 2000
pounds and that heavier cars get worse mileage;
it is hard to say much more about the relationship
between weight and mileage.
.PP
The next graph provides labels, uses circles
to expose data hidden in the clouds of bullets,
and re-expresses the $x$-axis in gallons per mile.
It also changes the point size and vertical spacing
to a size appropriate for camera-ready journal articles
and books; the size changes should be made outside the
\*g
program.
The
.UL \&.ft
command changes to a Helvetica font, which
some people prefer for graphs.
.P1
.get cars2.g
.P2
\*G
supports logarithmic re-expression of data with the
.UL log
clause in the
.UL coord
statement; any other re-expression of data must be done
with
\*g
arithmetic, as above.
.br
.grap cars2.g
This graph shows that
gallons per mile is roughly proportional to weight.
(The two outliers near 4000 pounds are the Cadillac
Seville and the Oldsmobile 98.)
.PP
In
.I "Visual Display of Quantitative Information" ,
Tufte proposes the ``dot-dash-plot'' as a means for maximizing
data ink (showing the two-dimensional distribution and
the two one-dimensional marginal distributions) while minimizing
what he calls ``chart junk'' \(em ink wasted on borders
and non-data labels.
His preference is easy to express in \*g:
.P1
.get cars3.g
.P2
Although the resulting graph is visually attractive,
we do not find it as useful for interpreting the data.
.grap cars3.g
Tufte's graph does point out two facts that are
not obvious in the previous graphs:
there is a gap in car weights near 3000 pounds (exhibited
by the hole in the $y$-axis ticks), and the gallons per
mile axis is regularly structured (the ticks
are the reciprocals of an almost dense sequence of integers).
The reader may decide whether those insights are worth
the decrease in clarity.
.PP
Throughout the twentieth century, horses, cars and people
have gotten faster;
let's study those improvements.
For horses, we'll consider the winning times
of the Kentucky Derby from 1909 to 1988, in
the file
.UL speedhorse.d :
.P1
.d speedhorse.d
.P2
The program
.P1
.get speedhorse1.g
.P2
produces the graph
.grap speedhorse1.g
Each race is recorded with a bullet and
record times are marked by horizontal lines.
Secretariat is the only horse to have run the
one-and-a-quarter-mile
race in under two minutes; he won in 1973 in
1:59.4.
.PP
For automobiles we will study the
world land speed record (even though those vehicles
are by now just low-flying airplanes).
The file
.UL speedcar.d
lists years in which speed records were set and the record
set in that year, in miles per hour averaged over a one-mile
course.
.P1
.d speedcar.d
.P2
We will plot the data with the following
\*g
program, which uses nested braces in the
.UL copy
and
.UL if
statements.
.P1
.get speedcar1.g
.P2
.PP
Each record line is drawn after the
.I next
record is read, because
the program must know when the record was broken to draw
its line.
The
.UL if
statement handles the first record, and the extra
.UL line
command extends the last record out to the current date.
.grap speedcar1.g
The horizontal lines reflect the nature of world records: they
last until they are broken.
The records could also have been plotted by a scatterplot
in which each point represents the setting of a record,
but it would be misleading to connect adjacent
points with line segments
(which we inappropriately did in the graphs
of the Olympic 400 meter run).
.PP
The following graph shows the world record times for the
one mile run;
because its
\*g
program is so similar to its automotive counterpart,
we won't show the program or data.
.grap speedman1.g
The three graphs show three different kinds of
changes.
Although horses are getting faster, they appear to
be approaching a barrier near two minutes.
Cars show great jumps as new technologies are introduced
followed by a plateau as limits of the
technology are reached.
Milers have shown a fairly consistent
linear improvement
over this century, but there must be an
asymptote down there somewhere.
.PP
The next file gives the median heights of boys
in the United States aged 2 to 18, together with
the fifth and ninety-fifth percentiles.
.P1
.d boyhts.d
.P2
The heights are given in centimeters (1 foot = 30.48 centimeters).
The trivial program
.P1
.get boyhts1.g
.P2
displays the data as
.grap boyhts1.g
Because there are four numbers on each input line, the first is
taken as an $x$-value and the remaining three are plotted
as $y$-values.
.PP
The three curves appear to be roughly straight
(at least up to age 16),
so it makes sense to fit a line
through them.
We will use the standard least squares regression
in which
.EQ
slope ~=~ {
{n SIGMA x y ~ - ~ SIGMA x SIGMA y }
over
{n SIGMA x sup 2 ~ - ~ ( SIGMA x ) sup 2 }
}
.EN
(where the summations range over all $n$ $x$ and $y$ values
in the data set) and the $y$-intercept is
.EQ
{SIGMA y ~ - ~ slope times SIGMA x} over n
.EN
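As a small check of these formulas
(using toy numbers of our own, not taken from the data file),
the three points $(1,1)$, $(2,3)$ and $(3,5)$ give
$n = 3$, $SIGMA x = 6$, $SIGMA y = 9$, $SIGMA x y = 22$ and $SIGMA x sup 2 = 14$,
so that
.EQ
slope ~=~ {
{ 3 times 22 ~ - ~ 6 times 9 }
over
{ 3 times 14 ~ - ~ 6 sup 2 }
} ~=~ { 12 over 6 } ~=~ 2
.EN
and the $y$-intercept is
.EQ
{ 9 ~ - ~ 2 times 6 } over 3 ~=~ -1
.EN
which is the line $y = 2 x - 1$ that passes through all three points.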
The following
\*g
program boldly (and rather foolishly) implements that formula.
.P1
.get boyhts3.g
.P2
It plots the fifth and ninety-fifth percentiles as the ends of a bar through
the median, which is plotted as a bullet.
All heights are converted to feet before plotting and calculating
the regression line.
.grap boyhts3.g
.PP
\*G
.UL print
statements write on
.UL stderr
as they are processed by \*g;
their single argument can be either an expression or a string.
The
.UL print
statements (which are commented out in
the above
\*g
program) at one time
showed that the regression line is
.EQ
Height ~ in ~ Feet ~ = ~ 2.61 ~ + ~ .19 times Age
.EN
Thus for most American
boys between 3 and 16, you may safely assume
that they started out life at 2 feet 7 inches and grew at the
rate of two and a quarter inches per year.
.PP
This program probably misapplies \*g;
if you really want to perform least squares regressions on
data, you should usually use a simple
.I awk
program like
.P1
.get regress.awk
.P2
(Be warned, though, that this program is not numerically
robust.)
.PP
While we're on the subject of fitting straight lines to data,
we'll redraw three graphs from J. W. Tukey's
.I "Exploratory Data Analysis" .
The file
.UL usapop.d
records the population of the United States
in millions at ten-year intervals.
.P1
.d usapop.d
.P2
Tukey's first two graphs indicate that the later population
growth was linear while the early growth was exponential.
The following
\*g
program plots them as a pair, using
.UL graph
commands to place internally unrelated graphs adjacent to
one another.
.P1
.get usapop1.g
.P2
The statements defining each graph are indented for clarity.
The second graph has the northern point of its frame 0.05
inch below the southern point of the frame of the first graph;
the
.UL with
clause is passed directly through to
.I pic
without being evaluated for macros or expressions.
The names of both graphs begin with capital letters to
conform to
.I pic
syntax for labels.
.grap usapop1.g
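The skeleton of such a program can be sketched as follows.
The graph names
.UL Upper
and
.UL Lower
and the data files are hypothetical, and the sketch assumes that
.I pic
sees the frame of each graph under the name
.UL Frame ;
consult the reference section before relying on it.
.P1
\&.G1
graph Upper
	frame ht 1 wid 3
	copy "upper.d"
graph Lower with .Frame.n at Upper.Frame.s - (0, .05)
	frame ht 1 wid 3
	copy "lower.d"
\&.G2
.P2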
.PP
Polynomial functions lie between the linear and exponential
functions; Tukey shows how a seventh-degree polynomial provides
a better (and longer) fit to the early population growth.
.P1
.get usapop2.g
.P2
This program re-expresses the $x$-axis with
\*g
arithmetic and uses an
.UL if
statement to graph only part of the data file.
It produces
.grap usapop2.g
.nr k \n%
The
.I eqn
.UL "space 0"
clause is necessary to keep
.I eqn
from adding extra space that would interfere
with positions computed by \*g;
see Section 4.
.PP
The file
.UL army.d
contains four related time series
describing the United States Army.
.P1
.d army.d
.P2
The first field is the year; the next four fields give
the number of male officers, female officers, enlisted males
and enlisted females, each in thousands.
(Actually, there were no female enlisted personnel in the
Army until 1943; the value 1 in 1940 and 1942 is just
a placeholder, since
\*g
has no mechanism for handling missing data.)
The following
\*g
program draws the four series with four different sets of
.UL draw
and
.UL next
commands.
.P1
.get army1.g
.P2
The program labels the lines by
.UL copy ing
immediate data;
the program is therefore shorter to write and easier to change.
The delimiter string
.UL XXX
in the
.UL until
clause could be deleted in this graph: the
.UL \&.G2
line also denotes the end of data.
Even though that string is enclosed in quotes,
it may not contain spaces.
The $y$-positions of the labels are the
result of several iterations.
.grap army1.g
.PP
This data can tell many stories: the buildup during the
Second World War is obvious, as is the exodus after the
war; increases during Korea and Vietnam are
also apparent.
We will consider a different story: the ratio of
enlisted men to the three other classes of personnel.
There are several ways to plot this data
(the most obvious graph uses three time series showing how
the ratios change over time, and is
left as an exercise for the reader).
.PP
We will instead construct a graph that gives little insight into this
data, but illustrates a general method that is quite useful
in conjunction with \*g.
The graph is a ``scatterplot vector'' that shows how one
variable (the number of enlisted men) varies as a function of
the other three.
Breaking with tradition, we first show the final graphs, all
of which have logarithmic scales.
.grap army2.g
The number of enlisted men is almost linearly
related to the number of male officers, it is somewhat related to the number
of female officers, and it varies widely as a function of the number
of enlisted women.
.PP
Much more interesting than the graph itself is the method we used to
produce it.
We wrote a miniature ``compiler'' that accepts as
its ``source language'' a description of a scatterplot vector and
produces as ``object code'' a
\*g
program to draw the graph.
The source program for the above example is
.P1
.get army2.v
.P2
The program lists several
global attributes of the graph, the
$y$-variable to be plotted, and as many $x$-variables as
are desired; with each variable is its field in the file
and a descriptive string.
The language is ``compiled'' by the following
.I awk
program.
.P1
.get scatvec.awk
.P2
Running this program on the above description produces the following
output, which is typically piped directly to \*g.
.P1
.get army2.g
.P2
The generated program uses the
.I pic
trick of re-using the same name
.UL A ) (
for several objects.
.PP
Although the program above is merely a toy,
``minicompilers'' can produce useful preprocessors
for \*g.
The
.UL scatmat
program, for instance, is a 90-line
.I awk
program that reads a simple input language and produces as
output a
\*g
program to produce a ``scatterplot matrix'', which
is a handy graphical device for spotting pairwise interactions
among several variables.
If
\*g
lacks a feature you desire, consider building
a simple preprocessor to provide it.
An alternative is to define
macros for the task; which approach is best depends
strongly on the job you wish to accomplish.
.PP
The next graph uses iterators to make a graph without
reading data from a file.
Rather, its ``data'' is a
function of two variables
that describes a
derivative field and a function of one variable
that describes one solution to the differential
equation.
.P1
.get ode1.g
.P2
The left label uses
.I eqn
text between the $font CW "$$"$ delimiters.
The variable
.UL scale
ensures that all lines in the direction field are the same
length.
The
.UL in
clauses in the
.UL ticks
statements give the ticks a length of zero inches
to avoid overprinting.
The variables
.UL tx
and
.UL ty
are so named because
.UL x
and
.UL y
are reserved words for the
.UL coord
statement.
.grap ode1.g
.PP
Programmers familiar with floating point arithmetic may be
surprised that the above graph is correct.
Because of roundoff error, iteration
.UL "from 0 to 1 by .05" '' ``
usually produces the values
$0, ~ .05, ~ .10, ~ ..., ~ .95$
and stops short of the final value 1.
\*G
uses a ``fuzzy test''
in the
.UL for
statement to avoid that problem, which may in turn introduce
other problems.
Such problems may be avoided by iterating over an integer range
and incrementing a non-integer value within the loop.
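A sketch of that approach follows; the variable name
.UL tx
and the function plotted are our own choices for illustration.
.P1
tx = 0
for i = 0 to 20 by 1 do {
	bullet at tx, tx*tx
	tx = tx + .05
}
.P2
The loop count is exact because
.UL i
takes only integer values;
any roundoff in
.UL tx
affects the positions of the points very slightly, not their number.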
.PP
Most of the data we have seen so far is inherently
two (or more) dimensional.
As an example of one-dimensional data, we will return to
the populations of the fifty states, which
form the third field in the file
.UL states.d
introduced earlier;
the file is sorted in increasing order of population.
Our first graph takes the most space, but
it also gives the most information.
.P1
.get states8.g
.P2
The
.UL L
macro (for Label)
with input parameter $X$ evaluates to the number
$2 sup X / 1,000,000$ followed by the string "$X$"
(the
.UL ticks
command expects a number followed by a string label).
.grap states8.g
The dotted line is the least squares regression
.EQ
log sub 10 ~ Population ~ = ~ 7.214 ~ - ~ .03 times Rank
.EN
which gives 15.3 million as the population of the
largest state and .515 million as the population
of the smallest state.
It says that
population drops by a factor of two every ten states
(compare the top and left scales).
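As a quick check of that factor of two
(our arithmetic, using the rounded slope above),
ten ranks change the predicted population by a factor of
.EQ
10 sup { -.03 times 10 } ~=~ 10 sup { -.3 } ~approx~ .501
.EN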
As sloppy as the exponential fit is, though, it is a much better
fit to this data
than a Zipf's Law curve is (drawing that curve is left as
an exercise for the reader).
.PP
The next graph is a more standard representation of
one-dimensional data.
.P1
.get states3.g
.P2
The markers were chosen to be
.UL vticks
because they denote only an $x$-value.
.grap states3.g
.PP
The next one-dimensional graph uses the state's name as
its marker; to reduce overprinting the graph is ``jittered''
by using a random number as a $y$-value.
.P1
.get states4.g
.P2
The function
.UL rand()
returns a pseudo-random real number chosen uniformly over the interval [0,1).
.grap states4.g
This graph is too cluttered; circles would have been
a better choice as a plotting symbol (bullets, once again, would
hide data).
.PP
Histograms are a standard way of presenting one-dimensional
data in two-dimensional form.
Our first step in building a histogram of the population
data is the following
.I awk
program, which counts how many states are in each ``bin''
of a million people.
.P1
.get states5.awk
.P2
The variable
.UL bzs
tells where bin zero starts; although it is zero in this
graph, it might be 95 in a histogram
of human body temperatures in degrees Fahrenheit.
The program produces the following output in
.UL states2.d :
.P1
.d states2.d
.P2
There are 12 states with population between 0 and 999,999,
5 states with population between 1,000,000 and 1,999,999,
and so on.
.PP
This
\*g
program uses three
.UL line
commands to plot each rectangle in the histogram.
.P1
.get states5.g
.P2
It produces
.grap states5.g
.PP
The same file can be plotted in a
more attractive (and more useful) form by
.P1
.get states6.g
.P2
which produces
one of Bill Cleveland's ``dot charts'' or ``lolliplots'':
.grap states6.g
(We use
.UL \e(bu ,
the
.I troff
character for a bullet, rather than the built-in string to
get a larger size.)
.PP
Other histograms are possible.
The following
.I awk
program
.P1
.get states7.awk
.P2
produces the file
.UL states3.d
.P1
.d states3.d
.P2
which lists the state's abbreviation, bin number, and
height within the bin.
The
\*g
program
.P1
.get states7.g
.P2
reads that file to make the following histogram, in which
the state names are used to display the heights of the bins.
In each bin, the states occur in increasing order of
population from bottom to top.
.grap states7.g
.PP
The next data set is a run-time profile of an early version of \*g,
created by compiling the program with the
.UL -p
option and running
.UL prof
after the program executed.
.P1
.d prof1.d
.P2
Although there were more than fifty procedures in the program, the
top four time-hogs accounted for more than half of the run time.
This file is difficult for
\*g
to deal with:
even though
.UL if
statements would allow us to extract lines 2 through 11
of the file, we could not remove the leading
.CW _
from a routine name or access the last field in a record.
We will therefore process it with
the following
.I awk
program.
.P1
.get prof1.awk
.P2
The program produces
.P1
.d prof2.d
.P2
We could even use the
.I sh
statement to execute the
.I awk
program from within \*g, which would make the latter entirely
self-contained (see the reference manual for details).
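As a sketch only, and assuming hypothetically that the
.I awk
program is stored in
.UL prof1.awk
and the raw profile in
.UL prof1.d ,
a statement of the form
.P1
sh { awk -f prof1.awk prof1.d >prof2.d }
.P2
placed before the statement that reads
.UL prof2.d
would regenerate that file each time the graph is processed.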
.PP
We will display the data with this program.
.P1
.get prof1.g
.P2
Observe that the program knows nothing about the range of the data.
It uses default ticks and a
.UL frame
statement with a computed height to achieve
total data independence.
.grap prof1.g
This bar chart highlights the fact that most of the time spent by
\*g
is devoted to input and output.
.PP
J. W. Tukey's box and whisker plots
represent the median, quartiles, and extremes of a
one-dimensional distribution.
The following
\*g
program defines a macro to draw a box plot, and then
uses that shape to compare the distribution of heights of
volcanoes with the distribution of heights of States of the Union.
.P1
.get box1.g
.P2
Boxes are one of many shapes used for the graphical
representation of several quantities.
If you use such shapes frequently then you should
make a library file of their macros to
.UL copy
into your
\*g
programs.
The above program produces
.grap box1.g
Even though the extreme heights are the same, state heights
have a lower median and a greater spread.
.PP
Someday you may use
\*g
to prepare overhead transparencies, only to find that
everything comes out too small.
The following program illustrates some ways to get larger
graphs.
.P1
.zzz slide1.g
.P2
The
.UL ps
and
.UL vs
commands preceding the graph set the text size to 14 points and
the vertical spacing to 18 points; the two quantities are
reset by the commands following the
.UL .G2 .
Such size changes should be made outside the
\*g
program, as mentioned earlier.
The
.UL 4
following the
.UL .G1
stretches the graph (including
\*g's
estimate of the accompanying text) to be four inches wide;
it is an alternative to altering the
.UL frame
command.
The macro
.UL blob
is a plotting symbol that is much larger than
.UL bullet ;
the different name ensures that later references to
.UL bullet
are unaffected.
The
.I troff
commands within the
.UL blob
string move the character down one-tenth of an em
to center its plotting position (determined experimentally)
and then reset the vertical position.
The program produces this trivial (but large) graph.
.br
.grap slide1.g
.NH
Using Grap
.PP
Following are a few day-to-day matters about using \*g.
.NH 2
Errors
.PP
\*G
attempts to pinpoint input errors; for example,
the input
.P1
\&.G1
i = i + 1
.P2
results in this message on
.UL stderr :
.P1
grap: syntax error near line 1, file -
context is
i = i >>> + <<< 1
.P2
The error was noticed
at the
.UL + .
Unfortunately, pinpointing is not the same as explaining:
the real error is that the variable
.UL i
was not initialized.
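.PP
As a minimal sketch of the fix, giving
.UL i
a value before it is used lets the same statement parse:
.P1
\&.G1
i = 0        # initialize first
i = i + 1
\&.G2
.P2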
.PP
The ``words''
.UL x
and
.UL y
are reserved (for the
.UL coord
statement);
you will get an equally inexplicable syntax error message if you use them
as variable names.
(This design is bad, but not nearly so bad as
having the
.UL log
and
.UL exp
functions use base 10.)
.PP
\*G
tries to load a file of standard macro definitions
.UL /usr/lib/grap.defines ) (
for terms like
.UL bullet ,
.UL plus ,
etc.
It doesn't complain if that file isn't found,
but if you later use one of these words,
you'll get a syntax error message.
.PP
Certain constructs suggested by analogy to
.I pic
do not work.
For example,
.UL .GS
and
.UL .GE
would have been nicer than
.UL .G1
and
.UL .G2 ,
but they were already taken.
The
.I pic
construct
.P1
\&.PS <file
.P2
has been superseded by
\*g's
.UL copy
command (which in turn has been retrofitted into
.I pic ).
.NH 2
\fITroff\fP issues
.PP
You may use
.I troff
commands like
.UL .ps
or
.UL .ft
to change text sizes and fonts within a graph,
or use balanced
.UL \es
and
.UL \ef
commands within a string.
Do not, however,
add space
.UL .sp ) (
or change the line spacing
.UL .vs , (
.UL .ls )
within a graph.
Some defined terms like
.UL bullet
contain embedded size changes;
further qualifying them with
\*g
.UL size
commands may not always work.
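.PP
For instance, a string can carry its own balanced size and font changes.
The label text in this sketch is only illustrative:
.P1
label bot "\es-2Age\es+2 \ef2(years)\efP"
.P2
Each
.UL \es
and
.UL \ef
change is undone within the same string, so the text that follows is unaffected.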
.PP
Because
\*g
is built on top of
.I pic ,
the following quote from the
.I pic
manual is relevant:
``There is a subtle problem with complicated equations inside
.I pic
pictures \(em they come out wrong if
.I eqn
has to leave extra vertical space for the equation.
If your equation involves more than subscripts and superscripts,
you must add to the beginning of each such equation the extra information
.UL "space 0" ''.
This feature was illustrated in the graph of the
United States population in Section 3.
.NH 2
Alternatives
.PP
Besides
\*g
and your local draftsperson, what other choices are there?
.PP
The S system |reference(slanguage chambers) provides
a host of tools for statistical analysis,
but somewhat fewer tools than
\*g
for producing document-quality graphs.
S produces graphs on the screen of a DMD 5620 terminal much more quickly than
\*g
(often in seconds rather than minutes), but it
takes somewhat longer to learn (at least for us).
If you expect to do a lot of interactive data analysis, then
S is probably the right tool for you.
S may be used to generate
.I pic
commands.
.PP
The standard UNIX program
.I graph
provides many of the basic features of
\*g,
though with quite a bit less control over details, particularly
text.
It produces output only in the
.UX
.I plot (5)
language,
which may be processed by a variety of filters
for a variety of output devices.
.PP
The original
.UX
typesetter graphics programs are
.I pic
and
.I ideal ;
you may be able to do as well without using
\*g
as an intermediary.
In particular,
.I ideal
provides shading and clipping,
which are useful
in presentation-quality bar charts and the like, but are
well beyond the capabilities of
.I pic .
.EQ
delim $$
.EN
.NH
References
.LP
|reference_placement
.NH
Reference Manual
.PP
In the following,
.I italic
terms are syntactic categories,
.UL typewriter
terms are literals,
parenthesized constructs are optional, and ... indicates repetition.
In most cases, the order of statements,
constructs and attributes is immaterial.
.P1
.IT "grap program" :
.G1 \f2(width in inches)\fP
\f2grap statement\fP
...
.G2
.P2
A width on the
.UL .G1
line overrides the computed width, as in
.I pic .
.P1
.IT "grap statement" :
.I
frame \(or label \(or coord \(or ticks \(or grid \(or plot \(or line \(or circle \(or draw \(or new \(or next
\(or graph \(or numberlist \(or copy \(or for \(or if \(or sh \(or pic \(or assignment \(or print
.ft
.P2
.PP
The
.UL frame
statement defines the frame that surrounds the graph:
.P1
.IT frame :
frame \f2(\fPht \f2expr)\fP \f2(\fPwid \f2expr)\fP \f2((side) linedesc)\fP \f2...\fP
.IT side :
top \(or bot \(or left \(or right
.IT linedesc :
solid \(or invis \(or dotted \f2(expr)\fP \(or dashed \f2(expr)\fP
.P2
Height and width default to 2 and 3 inches;
sides default to solid.
If
.I side
is omitted, the
.I linedesc
applies to the entire frame.
The optional expressions after
.UL dotted
and
.UL dashed
change the spacing exactly as in
.I pic .
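.PP
As an illustrative sketch (the dimensions are arbitrary), the following
statement makes a frame 1.5 inches high and 2.5 inches wide
in which only the left and bottom sides are drawn:
.P1
frame invis ht 1.5 wid 2.5 left solid bot solid
.P2
A
.I linedesc
with no side, as in
.P1
frame dashed
.P2
applies to all four sides.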
.PP
The
.UL label
statement places a label on a specified side:
.P1
.IT label :
label \f2side\fP \f2strlist\fP \f2...\fP \f2shift\fP
.IT shift :
left\f2 \(or \fPright\f2 \(or \fPup\f2 \(or \fPdown \f2expr ...\fP
.IT strlist :
\f2str ... (\fPrjust\f2 \(or \fPljust\f2 \(or \fPabove\f2 \(or \fPbelow\f2) ... (\fPsize \f2(\fP\(+-\f2) expr) ...\fP
.IT str :
"\f2...\fP"
.P2
Lists of text strings are stacked vertically.
In any context, string lists may contain clauses
to adjust the position or change the point size.
Each clause applies to the string preceding it
and all following strings.
Labels may also have a
.UL width
attribute, to override
\*g's
default computation.
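.PP
A short sketch, with made-up label text:
the first statement stacks two strings on the left side and shifts
them 0.2 inch further left;
the second sets the bottom label two points smaller than the current size.
.P1
label left "Population" "(millions)" left 0.2
label bot "Year" size -2
.P2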
.PP
Normally the coordinate system is defined by the data,
with 7 percent extra on each side.
(To change that to 5 percent, assign 0.05 to the
\*g
variable
.UL margin ,
which is reset to 0.07 at each
.UL .G1
statement.)
The
.UL coord
statement defines an overriding system:
.P1
.IT coord :
coord \f2(name)\fP \f2(\fPx \f2expr,expr)\fP \f2(\fPy \f2expr,expr)\fP \f2(\fPlog x \(or log y \(or log log\f2) \fP
.P2
Coordinate systems can be named;
ranges, logarithmic scaling, etc., are done separately for each.
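.PP
The following sketch uses arbitrary ranges and a hypothetical
coordinate system name,
.UL rate .
The first statement fixes the ranges of the default system;
the second defines a separate system with logarithmic scaling on both axes;
the third plots a point in the named system.
.P1
coord x 1900,2000 y 0,250
coord rate x 1,1000 y 0.01,100 log log   # "rate" is a hypothetical name
bullet at rate 10,1
.P2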
.PP
The
.UL ticks
statement places tick marks on one side of the frame:
.P1
.IT ticks :
ticks \f2side\fP \f2(\fPin \(or out \f2(expr))\fP \f2(shift) (tick-locations)\fP
.IT tick-locations :
at \f2(name) expr (str)\fP, \f2expr (str)\fP, \f2...\fP
\(or from \f2(name) expr\fP to \f2expr\fP \f2(\fPby \f2(op) expr)\fP \f2str\fP
.P2
If no ticks are specified, they will be provided automatically;
.UL ticks
.UL off
suppresses automatic ticks.
The optional expression after
.UL in
or
.UL out
specifies the length of the ticks in inches.
The optional name refers to a coordinate system.
If
.IT str
contains
format specifiers like
.UL %f
or
.UL %g ,
they are interpreted as by
.UL printf .
If no
.IT str
is supplied, the tick labels will be the values of the
expressions.
.PP
If the
.UL by
clause is omitted, steps are of size 1.
If the
.UL by
expression is preceded by one of
.UL + ,
.UL - ,
.UL *
or
.UL / ,
the step is scaled by that operator,
e.g.,
.UL *10
means that each step is 10 times the previous one.
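.PP
The lines below are independent illustrations with arbitrary values,
not a single program.
The first labels explicit positions, one of them with its own string;
the second puts inward ticks 0.05 inch long at 0, 25, 50, 75 and 100,
formatted by
.UL %g ;
the third steps multiplicatively through 1, 10, 100 and 1000.
.P1
ticks bot out at 1900, 1950 "mid", 2000
ticks left in 0.05 from 0 to 100 by 25 "%g"
ticks top out from 1 to 1000 by *10
.P2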
.PP
The
.UL grid
statement produces grid lines along (i.e., perpendicular to)
the named side.
.P1
.IT grid :
grid \f2side (linedesc) (shift) (tick-locations)\fP
.P2
Grids are labeled by the same mechanism as
.UL ticks .
It is possible to draw grids without ticks by placing the phrase
.UL ticks
.UL off
after the side name and before the iterator.
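.PP
As a sketch with arbitrary ranges, the first statement draws dotted
horizontal grid lines from the left side, with labels;
the second draws vertical grid lines from the bottom but suppresses
the accompanying ticks.
.P1
grid left dotted from 0 to 100 by 25
grid bot ticks off from 1900 to 2000 by 50
.P2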
.PP
Plot
statements place text at a point:
.P1
.IT plot :
\f2strlist\fP at \f2point\fP
plot \f2expr (str)\fP at \f2point\fP
.IT point :
\f2(name) expr,expr\fP
.P2
As in the
.UL label
statement, the string list may contain
position and size modifiers.
The
.UL plot
statement uses the optional format string as in C's
.UL printf
function; it may contain a
.UL %f
or
.UL %g .
The optional name refers to a coordinate system.
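.PP
Two illustrative statements, with arbitrary coordinates:
the first places a string just above a point,
and the second formats a number to two decimal places before
placing it, so the text
.UL 3.14
appears at the point (2,3).
.P1
"Peak" above at 1968,42
plot 3.14159 "%.2f" at 2,3
.P2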
.PP
The
.UL line
statement draws a line or arrow from here to there:
.P1
.IT line :
\f2(\fPline \(or arrow\f2)\fP from \f2point\fP to \f2point (linedesc)\fP
.P2
The
.UL circle
statement draws a circle:
.P1
.IT circle :
circle at \f2point (\fPradius \f2expr)\fP
.P2
The radius is in inches; the default size is small.
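.PP
A brief sketch with arbitrary coordinates:
a dashed line, an arrow, and a circle with an explicit
radius of one-tenth of an inch.
.P1
line from 0,0 to 10,10 dashed
arrow from 2,8 to 5,5
circle at 5,5 radius 0.1
.P2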
.PP
The
.UL draw
statement defines a sequence of lines:
.P1
.IT draw :
draw \f2(name) linedesc (str)\fP
.P2
Subsequent data for the named sequence
will be plotted as a line of the specified style,
with the optional
.IT str
plotted at each point.
The
.UL next
statement continues a sequence:
.P1
.IT next :
next \f2(name)\fP at \f2point (linedesc)\fP
.P2
If a line description is specified, it overrides the default
display mode for the line segment ending at
.I point .
The
.UL new
statement starts a new sequence; it has the same format as the
.UL draw
statement.
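.PP
The sketch below interleaves two named sequences;
the names
.UL men
and
.UL women
and the data values are hypothetical.
Each
.UL next
extends only the sequence it names, so the result is one solid line
and one dashed line.
.P1
draw men solid       # hypothetical sequence names
draw women dashed
next men at 1,20
next women at 1,25
next men at 2,30
next women at 2,28
.P2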
.PP
A line consisting of a set of numbers
is treated as a family of points
$x$, $y sub 1$, $y sub 2$, etc.,
to be plotted at the single
$x$ value.
.P1
.IT numberlist :
\f2number\fP ...
.P2
If there is only one number it is treated as
a $y$ value, and $x$ values of 1, 2, 3, ...
are supplied automatically.
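.PP
For example, with made-up data, each line of this program
contributes two points that share an $x$ value:
(1,10) and (1,20), then (2,15) and (2,18), then (3,22) and (3,12).
.P1
\&.G1
1 10 20
2 15 18
3 22 12
\&.G2
.P2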
.PP
\*G
provides arithmetic with the operators
.UL + ,
.UL - ,
.UL * ,
.UL / ,
and
.UL ^ .
Variables may be assigned to;
assignments are expressions.
Built-in functions include
.UL log ,
.UL exp
(both base 10 \(em beware!),
.UL int
(truncates towards zero),
.UL sin ,
.UL cos
(both use radians),
.UL atan2(dy,dx) ,
.UL sqrt ,
.UL min
(two arguments only),
.UL max
(ditto),
and
.UL rand()
(returns a random real number in [0,1)).
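.PP
A few assignments, as a sketch with arbitrary variable names,
show the base 10 behavior of
.UL log
and
.UL exp
and the truncation done by
.UL int :
.P1
n = 1000
lg = log(n)             # 3, because log is base 10
p = exp(2)              # 100, for the same reason
k = int(2.7)            # 2
m = max(3, min(7, 10))  # 7
.P2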
.PP
The
.UL for
statement provides a modest looping facility:
.P1
.IT for :
for \f2var\fP from \f2expr\fP to \f2expr (\fPby \f2(op) expr)\fP do { \f2anything\fP }
.P2
The braced string (\f2anything\fP) may contain internally balanced braces.
Alternatively, any other character may appear immediately after the word
.UL do ,
and the string is terminated by the next occurrence of that character.
The text
.IT anything
(which may contain newlines) is repeated as
.IT var
takes on values from
.IT expr1
to
.IT expr2 .
As with tick iterators, the
.UL by
clause is optional, and may proceed arithmetically or multiplicatively.
In a
.UL for
statement,
the
.UL from
may be replaced by
.UL = ''. ``
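.PP
Two small loops, as a sketch:
the first plots bullets at the points (1,1), (2,4), ..., (5,25);
the second uses
.UL =
in place of
.UL from
and a multiplicative step, printing 1, 10, 100 and 1000.
.P1
for i from 1 to 5 do {
	bullet at i,i*i
}
for v = 1 to 1000 by *10 do { print v }
.P2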
.PP
The
.UL if-then-else
statement provides conditional evaluation:
.P1
.IT if :
if \f2expr\fP then { \f2anything\fP } else { \f2anything\fP }
.P2
The
.UL else
clause
is optional.
Relational and logical operators include
.UL == ,
.UL != ,
.UL > ,
.UL >= ,
.UL < ,
.UL <= ,
.UL ! ,
.UL || ,
and
.UL && .
Strings may be compared with the operators
.UL ==
and
.UL != .
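.PP
As a sketch with an arbitrary variable and limits,
the following test combines two comparisons;
the parentheses are there only to make the grouping explicit.
With
.UL v
set to 150, the first branch is taken.
.P1
v = 150
if (v > 100) && (v <= 200) then {
	print "in range"
} else {
	print "out of range"
}
.P2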
.PP
It is possible to convert numeric expressions to formatted strings:
.P1
sprintf("\f2format\fP", \f2expr\fP, \f2expr\fP, ...)
.P2
is equivalent to a quoted string in any context.
Variants of
.UL %f
and
.UL %g
are the only sensible format conversions.
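.PP
For instance, in this sketch (the variable and label text are made up),
the first
.UL sprintf
builds the label ``Mean = 12.5'' and the second plots the
formatted value at a point:
.P1
v = 12.5
label top sprintf("Mean = %g", v)
sprintf("%.1f", v) at 3,v
.P2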
.PP
\*G
provides the same macro processor that
.I pic
does:
.P1
define \f2macro-name\fP { \f2anything\fP }
.P2
.EQ
delim %%
.EN
Subsequent occurrences of the macro name will be replaced
by the string, with arguments of the form \f(CW$\fIn\fR
replaced by corresponding actual arguments.
Macro definitions persist across
.UL .G2
boundaries, as do values of variables.
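.PP
As a small sketch (the macro name
.UL mark
is hypothetical), a two-argument macro that plots a bullet can be
defined once and then invoked with actual arguments:
.P1
define mark { bullet at $1,$2 }   # "mark" is a hypothetical name
mark(3,9)
mark(4,16)
.P2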
.EQ
delim $$
.EN
.PP
The
.UL copy
statement is somewhat overloaded:
.P1
copy "\f2filename\fP"
.P2
includes the contents of the named file at that point;
.P1
copy "\f2filename\fP" thru \f2macro-name\fP
.P2
copies the file through the macro; and
.P1
copy thru \f2macro-name\fP
.P2
copies subsequent lines through the macro;
each number or quoted string is treated as an argument.
In each case, copying continues until end of file or the next
.UL .G2 .
The optional clause
.UL until
.IT str
causes copying to terminate when a line whose
first field is
.IT str
occurs.
In all cases, the macro can be specified inline rather than by name:
.P1
copy thru { \f2macro body\fP }
.P2
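.EQ
delim %%
.EN
.PP
As a sketch, assuming a hypothetical file
.UL pairs.d
with two numbers on each line, the first statement below plots each pair
as a bullet;
the second does the same for data given inline, and stops at the line
whose first field is
.UL END .
.P1
copy "pairs.d" thru { bullet at $1,$2 }   # pairs.d is hypothetical
copy thru { bullet at $1,$2 } until "END"
1 1
2 4
3 9
END
.P2
.EQ
delim $$
.EN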
.PP
The
.UL sh
command passes text through to the UNIX shell.
.P1
.IT sh :
sh { \f2anything\fP }
.P2
The body of the command is scanned for macros.
The built-in macro
.UL pid
is a string consisting of the process identification number;
it can be used to generate unique file names.
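.PP
A minimal sketch, with hypothetical file names
.UL raw.d
and
.UL sorted.d :
the shell sorts a data file numerically, and the result is then
included by
.UL copy .
.P1
sh { sort -n raw.d >sorted.d }   # file names are hypothetical
copy "sorted.d"
.P2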
.PP
The
.UL pic
command passes text through to
.I pic
with the
.UL pic '' ``
removed; variables and macros are not evaluated.
Lines beginning with a period (that are not numbers)
are passed through literally, under the assumption that they
are
.I troff
commands.
.PP
The
.UL graph
statement
.P1
.IT graph :
graph \f2Picname (pic-text)\fP
.P2
defines a new graph named
.I Picname ,
resetting all coordinate systems.
If any
.UL graph
commands are used in a
\*g
program, then the statement after the
.UL \&.G1
must be a
.UL graph
command.
The
.I pic-text
can be used to position this graph relative
to previous graphs by referring to their
.UL Frame s,
as in
.P1
graph First
...
graph Second with .Frame.w at First.Frame.e + (0.1,0)
.P2
Macros and expressions in
.I pic-text
are not evaluated.
.I Picname s
must begin with a capital letter to satisfy
.I pic
syntax.
.PP
The
.UL print
statement
.P1
.IT print :
print \f2(expr\fP \(or \f2str)\fP
.P2
writes on
.UL stderr
as
\*g
processes its input; it is sometimes useful for debugging.
.PP
Many reserved words have synonyms, such as
.UL thru
for
.UL through ,
.UL tick
for
.UL ticks ,
and
.UL bot
for
.UL bottom .
.PP
The
.UL #
introduces a comment, which ends at the end of the line.
Statements may be continued over several lines by preceding each
newline with a
backslash character.
Multiple statements may appear on a single line separated
by semicolons.
\*G
ignores any line that is entirely blank, including those
processed by
.UL "copy thru"
commands.
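.PP
These conventions can be combined, as in this sketch with arbitrary values:
a comment, two statements on one line, and one statement continued
over two lines.
.P1
# frame and coordinates on one line
frame ht 1 wid 2; coord x 0,10 y 0,10
line from 0,0 \e
	to 10,10
.P2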
.PP
When
\*g
is first executed it reads standard macro definitions
from the file
.UL /usr/lib/grap.defines .
The definitions include
.UL bullet ,
.UL plus ,
.UL box ,
.UL star ,
.UL dot ,
.UL times ,
.UL htick ,
.UL vtick ,
.UL square ,
and
.UL delta .