
.TL
Comments on the performance of
.UX
on the
\s-2VAX\s0\(dd
.FS
\(dd \s-2VAX\s0 and \s-2VMS\s0 are trademarks of Digital Equipment Corporation.
.FE
.AU
William Joy
.AI
Computer Science Division
Electrical Engineering and Computer Science Department
University of California, Berkeley
Berkeley, Ca.  94720
.SH
Introduction
.PP
A recent paper written by David Kashtan of SRI discussed some
measurements he made comparing the performance of the two available operating
systems for the \s-2VAX\s0: \s-2UNIX\s0 and \s-2VMS\s0.
The \s-2UNIX\s0 system measured by Kashtan was \s-2UNIX/32V\s0 as extended
by U.C. Berkeley to support virtual memory.
The measurements were made on version 2.1 of the Berkeley system and
version 1.6 of \s-2VMS\s0.
This note seeks to help interpret these
results and more clearly point out the differences in approach and current
state of the two systems.
.PP
We will show that the differences that Kashtan measured have simple
explanations, and that the comparison of performance on such benchmarks
can guide short-term tuning.  Long-term decisions as to which system to
run have been made on the basis of portability and flexibility considerations.
The results reported here support the choice
of \s-2UNIX\s0 on these grounds by demonstrating its flexibility.
The slow spots detected in \s-2UNIX\s0
are neither fundamental to the way \s-2UNIX\s0 does business, nor difficult to
mitigate if it is felt important to work on the areas in question.
We will discuss simple changes to the \s-2UNIX\s0 system that have
been made to improve performance in the areas mentioned.
The improvements described here were incorporated in the production system
at Berkeley during the first three weeks of March, 1980.
.SH
System call overhead
.PP
Let us first consider system call overhead, which underlies
the measurements that were made of ``Context
Switching Performance.''
From the measurements Kashtan gives for context switching overheads,
we can first estimate some fundamental
times for the \s-2VMS\s0 system: a simple
system call (e.g. an event flag call) on \s-2VMS\s0 appears to take about 114\(*msec,
and a context switch about 178\(*msec.  To get an idea of scale here, the \s-2VAX\s0
memory cycle time (for cache write-through) is about 1.2\(*msec,
the simplest procedure
call \fBjsb\fR and matching return \fBrsb\fR
takes about 4\(*msec, and the high level
language call instruction \fBcalls\fR and matching return
\fBret\fR takes a minimum of 16\(*msec,
more if some registers are to be saved.
.PP
On the version of \s-2UNIX\s0 Kashtan had, a trivial system call
took about 350\(*msec.  This is almost three times as much time as \s-2VMS\s0 used.
Where was this time being spent?  The following is a rough breakdown:
.TS
center box;
l n.
save registers, call \fItrap\fR	40\(*msec
setup in \fItrap\fR for syscall	30\(*msec
fetching arguments for syscall	60\(*msec
saving context in case interrupted	100\(*msec
system call proper	30\(*msec
checking to see if signals waiting	30\(*msec
recomputing process priority	60\(*msec
restore registers, \fBrei\fR	30\(*msec
.TE
.PP
This code is all modularly and straightforwardly written, using a
small set of primitives (for instance the ``non-local-goto'' at
the point of an interrupted system call is effected by using part of
the primitives performing context switching).  There are many
calls to small routines here.  Each system call argument is fetched
by calling a primitive \fIfuword\fR which fetches a word from user-space
after performing access checks.  The saving of the context involves a procedure
call, as does checking for signals and recomputing priority.
Altogether, when a system call has two
arguments, seven \fBcalls\fR instructions are performed here to various
subroutines.  These alone take roughly 105\(*msec, assuming no registers
are specified to be saved in the entry mask.
Since the saving and restoring of the registers for the
call to \fItrap\fR takes an additional 20\(*msec, this means that 125\(*msec
is accounted for without any of the work for the system call itself.
Clearly, if this is to be sped up, some of these routine calls must
be eliminated.
.PP
With minor changes to the code for trap handling we
have sped up the system call time so that the basic overhead
is 140\(*msec, instead of 350\(*msec.  This was done as follows:
.IP 1.
The \fItrap\fR routine's ``entry mask'' as generated by the C compiler
is modified at boot time to save all possible registers, to avoid
saving some registers twice.\(dg
.FS
\(dg Previously,
the assembly language for the system saved all the registers, and
made no assumptions about what \fItrap\fR itself did, resulting in registers
which were used for register variables within \fItrap\fR being saved twice.
.FE
.IP 2.
The routine to handle system calls was broken out from the handler
for all other traps and called \fIsyscall\fR.  This minimizes the changes
to the code for handling other traps in making the
following improvements, and allows some small simplifications.
.IP 3.
Fetching of all arguments can be done by a single routine \fIcopyin\fR,
instead of calling \fIfuword\fR for each argument.\(dd
.FS
\(dd The copy of arguments into system space
is important to avoid severe system security problems with system calls
that self-modify their arguments after they have been checked for consistency.
.FE
Since the number of argument words to system calls is very small,
it is easy to expand the \fIcopyin\fR primitive in line in this critical path.
It can be implemented, in this special case, by two basic \s-2VAX\s0 instructions:
a \fBprober\fR to determine accessibility, and a \fBmovc3\fR to move the
arguments into system space.*
.FS
* The \fIcopyin\fR primitive is complicated in the general case by
the rather strange semantics of \fBprober\fR, which only checks accessibility
of the first and last pages in the address range it is given.  This forces
software to loop over each page involved in the \fIcopyin\fR.  The looping
checks are not needed in the code which fetches arguments for system calls,
because there are at most 16 bytes of direct arguments to a system call.
.FE
.IP 4.
To restore after an interrupted system call, it suffices to be able to
locate the frame pointer and stack pointer of the original \fIsyscall\fR
procedure.  We can effect this context save
using just three \fBmovl\fR instructions.
.IP 5.
A subroutine call to check if signals are pending to this process
can be avoided in almost all cases by first checking the mask of pending
signals.  This partial inline expansion of the \fIissig\fR routine reduces
the overhead here by at least 16\(*msec.
.IP 6.
Recomputation of the process priority at each system call can be avoided
by ``unloading'' the \fBp_pri\fR per-process information field.
The system used a single field to encode both the process's scheduling
priority, which depends on its user-CPU usage, and priorities related to process blocking
during system calls.  By adding a \fBp_usrpri\fR information field reflecting
user-CPU usage, the code in \fItrap\fR reduces to an assignment of this
priority to the \fBp_pri\fR field, instead of recomputing the user
priority after each system call.  The space overhead is one byte per
process, the way in which the system works is unaffected, and the code
is then somewhat easier to understand.
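.PP
Schematically, the reworked system call path looks as follows.
This is only a sketch in the flavor of the kernel code of the period;
the names (\fIsysent\fR, \fBu_arg\fR, \fBu_qsave\fR and so on) follow the
\s-2UNIX\s0 sources, but the details are illustrative rather than quoted,
and the declarations of the user structure \fBu\fR and the primitives
are elided.  The numbers in the comments refer to the changes above:
.DS
struct sysent { int sy_narg; int (*sy_call)(); };
struct proc { int p_sig; char p_pri, p_usrpri; };

syscall(code, params)
    int code; char *params;
{
    register struct sysent *callp = &sysent[code];
    register struct proc *p = u.u_procp;

    /* (3) one copyin fetches all the argument words at once */
    (void) copyin(params, (char *)u.u_arg, callp->sy_narg * sizeof (int));

    /* (4) minimal context save: record frame and stack pointers
       so an interrupted call can do a non-local goto back here */
    if (setjmp(u.u_qsave) == 0)
        (*callp->sy_call)();

    /* (5) look at the pending-signal mask before paying for a call */
    if (p->p_sig && issig())
        psig();

    /* (6) the user-CPU priority is precomputed; just assign it */
    p->p_pri = p->p_usrpri;
}
.DE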
.PP
These changes reflect only local optimization of the code;
the substance of system call handling has not been changed.
It is simply the case that the \fBcalls\fR sequence used by the C compiler
for procedure calls is relatively expensive.
Expansion of small routines and machine dependent primitives
in critical paths is an important
technique that can be used to quickly and easily mitigate routine
call slowness when profiling or other measurements show this to be
necessary.
.PP
It should be noted that the use of \fBcalls\fR and passing of
routine parameters on the stack in the current \s-2VAX\s0 C compiler is
different from the way \s-2VMS\s0 is coded.\(dg
.FS
\(dg VMS is written in assembly language.
.FE
\s-2VMS\s0 makes almost exclusive use of the \fBjsb\fR and
\fBrsb\fR instructions to
avoid the overhead of \fBcalls\fR, and has stylized conventions
for assignment of registers across assembly language routines so that
pushing and popping of data on the stack can be avoided.  This basic
mechanism is a good deal faster than \fBcalls\fR if register usage can be
optimized carefully, but tends to make the code harder to change.
.LP
.B Perspective.
Our departmental \s-2VAX\s0 running 20 users
in the afternoon does an average of about 100 system calls a second.
Under the old system the basic overhead for these 100 system calls was
about 3.5% of the available processor time.
The faster system call interface reduces this
to about 1.4%.
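.PP
Overheads of this kind are easy to measure from user code.
A minimal benchmark in the spirit of the figures above, written
here in modern portable C (\fItimes\fR(2) and \fIgetpid\fR(2) are
assumed; note that on some modern systems the C library may cache
the result of \fIgetpid\fR, defeating the measurement):
.DS
#include <stdio.h>
#include <unistd.h>
#include <sys/times.h>

#define N 100000

int
main()
{
    struct tms t0, t1;
    long hz = sysconf(_SC_CLK_TCK);
    int i;

    (void) times(&t0);
    for (i = 0; i < N; i++)
        (void) getpid();        /* a trivial system call */
    (void) times(&t1);
    printf("%.1f usec per call\en",
        ((t1.tms_utime - t0.tms_utime) + (t1.tms_stime - t0.tms_stime))
        * 1.0e6 / hz / N);
    return 0;
}
.DE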
.PP
The fact that the primitives for critical sections (the \fIspl\fR
set priority level routines) were implemented by \fBcalls\fR to the two
instructions implementing them accounted for nearly 1/3 of the time in
a \fIread\fR or \fIwrite\fR system call.
Since over half of all system calls are reads or writes,
a simple inline expansion of these primitives accounted for more
improvement for reads and writes than the changes to the basic system
call routines.
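.PP
The flavor of that change is shown below; this is a sketch, and the
\fBmtpr\fR operands are illustrative rather than copied from the source:
.DS
/* before: entering a critical section cost a calls/ret pair
   around just two instructions */
spl6()
{
    asm("mfpr $18,r0");     /* old priority level, returned in r0 */
    asm("mtpr $0x18,$18");  /* raise processor priority */
}

/* after: the instructions are expanded in line at each use */
#define spl6()  asm("mtpr $0x18,$18")
.DE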
.SH
Context switching
.PP
The context switching tests attempted to measure how fast the
systems could pass control from one process to another.  Kashtan measured
that \s-2VMS\s0 could switch between two processes at a net rate of 2000 switches
per second using the ``event flag'' mechanism to signal process exchange.
On \s-2UNIX\s0, using the \fIkill\fR system call as a signalling mechanism, Kashtan
found that the maximum switching rate was 210 per second.  He estimated that
the basic switching rate of the two systems was 5600 switches per
second on \s-2VMS\s0 and 425 per second on \s-2UNIX\s0.  He concluded:
.QS
``\s-2UNIX\s0, as currently implemented, has to do considerably more
work when scheduling a process.  In addition, \s-2UNIX\s0 must do a
context switch to process number 0 in order to make the decision
as to which process to be run next.  Even this cannot
explain the greater than 10 to 1 difference in the performance
of the two systems.''
.QE
We will see that this difference involves no great mystery, and that, in
fact, \s-2UNIX\s0 can be made to context switch nearly
as fast as \s-2VMS\s0 does simply
by changing the (assembly-language) primitive support of context
switching to match the hardware.  Remaining differences in timings are
not the result of ``inefficiencies'' in \s-2UNIX\s0,
but of
the close fit of the \s-2VMS\s0 strategy with
the \s-2VAX\s0 architecture.\(dd
.FS
\(dd Specifically,
blocked jobs are queued on linked lists with \fBinsque\fR (insert in queue
instruction) for removal with \fBremque\fR (remove from queue instruction)
while \s-2UNIX\s0 uses subroutines which hash sleeping jobs.  The
overhead of calling the hashed \fIsleep\fR and \fIwakeup\fR routines and searching
the hash chains will account for the remaining difference in time.
.FE
.PP
The \s-2VAX\s0 instruction set caters to a certain regime of context switching.
To fit \s-2UNIX\s0 to its regime of context switching,
it sufficed to include the C library routines \fIsetjmp\fR
and \fIlongjmp\fR in the system to handle internal non-local goto's, and to
provide a new context switch primitive \fIresume\fR that corresponds
to a \fBsvpctx\fR followed by a \fBldpctx\fR.
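.PP
The same non-local goto mechanism is visible at user level; a small,
self-contained illustration:
.DS
#include <setjmp.h>
#include <stdio.h>

static jmp_buf env;

static void
deep()
{
    longjmp(env, 1);        /* unwind directly back to the setjmp */
}

int
main()
{
    if (setjmp(env) == 0) { /* saves enough context to resume here */
        deep();
        printf("not reached\en");
    } else
        printf("resumed at the setjmp\en");
    return 0;
}
.DE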
.PP
There were two further problems with the context switching primitives:
as on the PDP-11, the per-process system stack and control information were
kept in kernel address space.  It is much more efficient on the \s-2VAX\s0 to
keep this information in the P0 or P1 region where it will be remapped
when \fBldpctx\fR is used.  This change was made by moving this information
to the base of the stack.
The final problem was the one Kashtan noted; that the system switched to
process 0 each time before switching to the eventual target for the switch.
This was done only to allow the system to idle in process 0.  This is cleaner
than idling on an arbitrary process because, e.g., the process that
called \fIswtch\fR might have just swapped itself out,
and we would then be running
on system control information in memory that had been released.
There is actually no problem in not switching to process 0,
since the system cannot run anything but interrupt code until
we come out of the idle loop in \fIswtch\fR.
.PP
After these changes, the times for context switching improved dramatically,
dropping by more than a factor of 5 to about 400\(*msec per switch.
Examination of the remaining overhead revealed that there were several
small routines being called in critical paths in the \fIsleep\fR routine.
These calls were eliminated by expanding the code inline.
.PP
After the above changes had been made, we measured the system
context switch time and broke it down as follows:
.TS
center box;
l n.
blocking time in \fIsleep\fR	50 \(*msec
rescheduling time in \fIswtch\fR	60\(*msec
\fIresume\fR primitive	110\(*msec 
unblocking time in \fIwakeup\fR	50\(*msec
.TE
Thus the context switching time had been reduced to 270\(*msec,
a factor of seven improvement.
.PP
In practice, there is a further efficiency
issue here that is not pointed out by the benchmarks.  The \fIswtch\fR routine
used a linear search over a linked list to find the highest priority
job on the list.  The \s-2VMS\s0 system uses the \fBffs\fR (find first set bit),
\fBinsque\fR and \fBremque\fR
instructions and an array of run-queues ordered by priority to make this
selection as rapid as possible.  We coded this new \fIswtch\fR routine (it is
about 10 lines of assembly language), and changed the system so that only
truly runnable jobs were on the run queue (the old system left runnable jobs
on the run queue even when they were swapped out).  This allows the \fIswtch\fR
primitive to run in time independent of the length of the run queue.
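.PP
A sketch of the idea follows; the names \fBqs\fR and \fBwhichqs\fR
reflect the arrangement we used, but the linked list manipulation
is elided:
.DS
#include <strings.h>        /* ffs() */

#define NQS 32              /* one run queue per priority band */

struct proc *qs[NQS];       /* heads of the run queues */
int whichqs;                /* bit i set when qs[i] is non-empty */

/* pick the best runnable job in time independent of queue length */
struct proc *
selectjob()
{
    int q = ffs(whichqs);

    if (q == 0)
        return 0;           /* nothing runnable: idle */
    return qs[q - 1];       /* head of best non-empty queue */
}
.DE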
.LP
.B Perspective.
Let us try to get some perspective for timesharing \s-2UNIX\s0 systems
(such as the machine where this paper is being prepared).  The system
is currently supporting about 20 users, and doing roughly 50 context switches
per second.  The original code, which took about 2 milliseconds per context switch,
would have cost 10% of the machine in context switching.
A version of the system changed to not switch to process 0
in \fIswtch\fR ran at about 800\(*msec per context switch, resulting in an
average context switching overhead of about 4% of the machine.  The current
system, with all the changes mentioned above, uses roughly 1.3% of the
machine in context switching.
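.PP
The arithmetic behind these percentages is simply the switch rate
times the cost per switch:
.DS
50 switches/sec \(mu 2000 \(*msec/switch = 10% of the processor
50 switches/sec \(mu  800 \(*msec/switch =  4%
50 switches/sec \(mu  270 \(*msec/switch = 1.35%
.DE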
.PP
If applications need the absolute fastest possible context switching time,
then \s-2UNIX\s0 would have to be changed so that the calls to
\fIsleep\fR and \fIwakeup\fR
were less expensive.  Roughly half of the 100\(*msec spent in these routines
could be saved by writing them in assembly language and calling them
with \fBjsb\fR instead of \fBcalls\fR.  They would then not have the 16\(*msec
overhead of \fBcalls\fR and \fBret\fR and could use registers 0-5 for scratch
work instead of saving and restoring registers 6-11, the
registers normally used by the C compiler for register variables.  Measurements
of the incidence of \fIsleep\fR and \fIwakeup\fR in our environment do not
justify such a change.
.SH
IPC Mechanisms
.PP
The measurements here were of the time for a process to send either
4 or 512 byte packets to another process.  The system call overhead and
context switch overhead reductions improved performance here, as did
the inline expansion of the priority level
routines.  Finally, a few primitives (file lock and unlock, and user/kernel
data movement) were defined as macros or partially expanded inline to
save time.
.PP
The improvements measured in 4 byte packet transmission are reported in
the following table.  The three experiments measured the rate at which a
process could send 4 bytes of data to itself,
send 4 bytes to another process, and send 4 bytes to another process
and receive a 4 byte reply.
The measurements give the number of 4 byte packets transferred each second.
.TS
center box;
l l l l
a n n n.
Mechanism	To self	One-way	Two-way
_
\s-2VMS\s0 Mailboxes	440	297	363
\s-2UNIX\s0 Before	370	294	281
\s-2UNIX\s0 After	819	714	600
.TE
.LP
.B Perspective.
The overhead in using pipes on \s-2UNIX\s0 was reduced greatly by simply
eliminating unnecessary \fBcalls\fR of primitives.  The ultimate speed of
message passing through pipes should be between 600 and 1000 packets per
second, when no buffering is taking place.
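.PP
Rates of this kind can be reproduced with a simple ping-pong test
over a pair of pipes; a minimal version in modern portable C:
.DS
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/times.h>

#define N 10000

int
main()
{
    int ab[2], ba[2], i;
    char buf[4] = "hi!";
    struct tms tms;
    long hz = sysconf(_SC_CLK_TCK);
    clock_t t0, t1;

    if (pipe(ab) < 0 || pipe(ba) < 0)
        exit(1);
    if (fork() == 0) {              /* child: echo each packet back */
        for (i = 0; i < N; i++) {
            read(ab[0], buf, 4);
            write(ba[1], buf, 4);
        }
        _exit(0);
    }
    t0 = times(&tms);
    for (i = 0; i < N; i++) {       /* parent: the two-way test */
        write(ab[1], buf, 4);
        read(ba[0], buf, 4);
    }
    t1 = times(&tms);
    printf("%.0f exchanges/sec\en", (double)N * hz / (t1 - t0));
    return 0;
}
.DE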
.PP
We see no reason that the \fImpx\fR multiplexed file mechanism cannot
be made to operate at speeds greater than that of the pipe mechanism.
Most of the overhead in the pipe mechanism is in the block input/output system,
which must be called to access and release the file system blocks of
a pipe on each \fIread\fR or \fIwrite\fR operation.  With \fImpx\fR,
messages will be buffered in memory, and this expensive indexing
can be avoided.  By placing the \fImpx\fR buffers in paged memory,
it should be simple to provide fast response when activity is heavy without
tying up large amounts of core if buffered data accumulates.
.SH
Paging
.PP
Kashtan measured performance of the paging system on two kinds of
paging activity: sequential and random.  He also tried varying,
in the random case,
the standard deviation of successive references.  His
paging experiments never involved any ``memory'' in the behavior of the processes,
and they modified every page they referenced.
To help cope with jobs which reference or modify
large numbers of pages rapidly,
we have changed the system to read and write
clusters of pages from the paging device.
This helps to minimize the incidence of pageins
and pageouts.  Write clustering
alone makes nearly a factor of two improvement in the time taken to run
most of Kashtan's benchmarks, since he modifies every page.  Read clustering
improves the performance of jobs that sequentially access virtual memory,
and has a good, but less significant impact on other jobs, provided it
is used in moderation.*
.FS
* An excessive amount of prepaging tends to overload
the page replacement algorithm by creating an artificially high demand
for memory.  We can mitigate this somewhat by not validating the pre-paged
pages, forcing a ``reclaim'' to occur quickly if the page is not to
be discarded.  Observations show, however, that prepaging more than 2048 bytes
(2 of the basic 1024 byte pages that \s-2UNIX\s0 uses) at a time tends to
degrade general system performance.
.FE
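.PP
A load of Kashtan's general kind is easy to construct.  The sketch
below touches (and dirties) one byte per page, taking uniformly
distributed random strides as a stand-in for his normally
distributed ones; the constants match the description of his tests:
.DS
#include <stdlib.h>

#define NPG   8192          /* pages of process virtual space */
#define PGSZ  1024          /* the basic UNIX page size */
#define STEPS 100000

int
main(int argc, char **argv)
{
    char *mem = malloc((long)NPG * PGSZ);
    long pg = NPG / 2;
    int dev = argc > 1 ? atoi(argv[1]) : 10;    /* spread of strides */
    long i;

    if (mem == 0)
        return 1;
    for (i = 0; i < STEPS; i++) {
        pg += rand() % (2 * dev + 1) - dev;     /* random stride */
        if (pg < 0)
            pg = 0;
        else if (pg >= NPG)
            pg = NPG - 1;
        mem[pg * PGSZ] = 1;     /* modify the page: forces a dirty pageout */
    }
    return 0;
}
.DE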
.PP
\s-2UNIX\s0 attempts to determine the set of pages that have been
recently referenced using
software simulation of the page reference bits that are not provided by the
hardware.  While experiments show that this works well on jobs that
are typical of our working environment, jobs such as those run by Kashtan
are not well observed by this method.  A simple circular or random page
replacement algorithm is nearly optimal for the experiments that he made.
Disabling the gathering and use of page reference information
for ``memoryless'' jobs, and instead relying more on simple circular or random
replacement, more akin to the algorithm used by \s-2VMS\s0, also improves
system performance.
We have added an advisory
system call that informs the system that the process will not be well observed
by the normal reference information gathering algorithm.  This is used
by \s-2LISP\s0 during garbage collection.  One can easily imagine processes
informing the system of strongly sequential or random paging behavior.
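.PP
In the later Berkeley releases this advisory call took the form
\fIvadvise\fR(2); assuming that interface, a garbage collector
might use it as follows (a hypothetical fragment, not the actual
\s-2LISP\s0 code):
.DS
#include <sys/vadvise.h>

gc()
{
    vadvise(VA_ANOM);   /* reference pattern will defeat the normal
                           reference-bit aging; prefer near-random
                           replacement for this process */
    /* ... sweep and compact the heap ... */
    vadvise(VA_NORM);   /* resume the normal replacement policy */
}
.DE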
.PP
The following table gives the times, in minutes and seconds, for the sequential
and random benchmarks on
Kashtan's machine and ours.  His benchmark used 8192 pages of process
virtual space while ours used only 7500.  He had 2 megabytes of real
memory while we had only 1.75.
These numbers are unfortunately not exactly comparable, and
we hope to run the experiment on a 2 megabyte machine in the near future.
.TS
center box;
l l l l
a n n n.
Experiment	VMS	Old UNIX	New UNIX
Sequential access	4:32	20:16	6:45
Random access	6:00	17:24	10:37
.TE
.PP
Kashtan also measured a program with random paging behavior (each successive
reference was a random distance from the previous), with varying standard
deviations of references.  We report here the times (minutes:seconds) Kashtan measured
for \s-2VMS\s0 and \s-2UNIX\s0 when he made the experiment, and also
measurements on our system before and after the changes mentioned above
were introduced.  Kashtan altered the basic paging strategy in \s-2VMS\s0
for these tests; the strategy used in \s-2UNIX\s0 here was always the one we
use during general timesharing.
.TS
center box;
l | l l | l l
n | n n | n n.
Deviation	VMS	SRI UNIX	Old Berk UNIX	New Berk UNIX
1	0:16	0:16	0:23	0:23
10	0:18	0:16	0:24	0:24
30	0:25	0:18	0:66	0:44
40	0:47	1:08		1:34
50	1:21	3:54	4:37	2:38
60	1:53	5:46	6:21	3:13
80	3:04	6:38	8:47	4:53
100	3:27	8:26	10:28	6:12
.TE
.PP
These measurements are somewhat less than satisfactory for two reasons:
the machines had different configurations, making the results more difficult
to interpret, and the experiments were too short, so the
start-up time for the system to reach a stable state has not been factored
out of the results.\(dg
.FS
\(dg For example, the \s-2UNIX\s0 simulation of software reference bits
is disabled when there is a large amount of free memory and a certain
amount of time (15 to 20 seconds) is required after free memory drops
below a threshold before the reference information begins to become
available.  This interval is of the same length as the experiments,
clouding the results.
.FE
.LP
.B Perspective.
The changes to the paging system to introduce clustering and for special
treatment of processes that are not well handled by the normal techniques
markedly improve the performance of \s-2UNIX\s0 on Kashtan's benchmarks.
For the jobs we normally see on our system the improvement
was not as noticeable.  The effect of these changes on a standard synthetic
workload that we had previously run to evaluate system performance was
negligible.
.PP
Of far more importance to system performance was a revision of the swap
scheduling algorithm.  The previous version of the system used an oldest
job first strategy for swapping out runnable jobs.  Changing the system
to instead swap out the oldest of the \fIn\fR largest jobs, and then slowing
down the rate of swapping, had dramatic effects under the heavy
loads we have been encountering.  This change is of much more importance
to the typical \s-2UNIX\s0 installation than the other changes mentioned above.
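.PP
A sketch of the revised choice of swap victim is given below; the
structure and field names here are hypothetical, chosen only to show
the policy:
.DS
#define NBIG 4          /* consider only the NBIG largest jobs */

struct job { long j_size, j_restime; }; /* size, time resident */

/* pick the oldest of the NBIG largest runnable resident jobs */
struct job *
swapvictim(struct job *jp, int n)
{
    struct job *big[NBIG], *victim = 0, *t;
    int i, j, nbig = 0;

    for (i = 0; i < n; i++) {
        if (nbig < NBIG)
            big[nbig++] = &jp[i];           /* room: keep it */
        else if (jp[i].j_size > big[NBIG-1]->j_size)
            big[NBIG-1] = &jp[i];           /* displace the smallest */
        else
            continue;
        /* keep big[] ordered by decreasing size */
        for (j = nbig - 1; j > 0 && big[j]->j_size > big[j-1]->j_size; j--) {
            t = big[j]; big[j] = big[j-1]; big[j-1] = t;
        }
    }
    for (i = 0; i < nbig; i++)      /* oldest of the largest */
        if (victim == 0 || big[i]->j_restime > victim->j_restime)
            victim = big[i];
    return victim;
}
.DE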
.SH
Conclusions
.PP
In selecting between \s-2UNIX\s0 and \s-2VMS\s0 as an operating system
for use in the \s-2ARPA\s0 Image Understanding and VLSI research communities,
the \s-2UNIX\s0 system was chosen primarily out of concern for portability.
For our purposes within the Computer Science Department at Berkeley, we
have chosen \s-2UNIX\s0 because of its pliability to meet our needs.
.PP
In the short term, there are areas of the \s-2UNIX\s0 system that are still
suffering growing pains from the porting of the system to larger machines.
We believe that the simplicity and modularity of the system, which are
the keys to its portability, are also the reasons why \s-2UNIX\s0 is easy
to tune, as this paper has demonstrated.
.PP
We have by no means made all the changes to the system that will be needed
by the \s-2ARPA\s0 community.  The throughput of the \s-2UNIX\s0 file system
is more than adequate for our current time-sharing applications,
but will not be great enough
for large image processing applications, nor will it support page
clustering to the file system, which will be essential if large files
are to be shared among processes and paged from the file system.
.PP
Changes to support such facilities have been designed, and implementation
of these facilities is not difficult, because of the simplicity of the
current system.  We believe that we can, with minor changes
to the file system structure, increase the throughput sufficiently to
support most of these applications.\(dg
.FS
\(dg We plan to initially implement
a file system, based on the current \fIinode\fR structure, but which maintains
a pool of 8192 byte buffers as well as a pool of 1024 byte buffers.  By
judiciously allocating these 8192 byte contiguous chunks to large files,
and allowing programs which need access to large amounts of data to read
such large blocks directly into their address space (without copying them
in and out of the system buffer cache), we should achieve a significant
improvement in file system throughput on applications
which (for example) sequentially access large amounts of data.  Such
a clustering scheme will also allow the paging system to cluster the
write-back of pages which are being shared by processes after being
``mapped'' from files.
.FE
In the long term, a different basic file system organization may be needed.
We are examining other file system schemes, such as extent based file systems,
confident that we can easily change the system to incorporate
whatever scheme we decide upon.  This flexibility, essential for long-term
viability of the system, is the reason we chose to use \s-2UNIX\s0.