.TL Comments on the performance of .UX on the \s-2VAX\s0\(dd .FS \(dd \s-2VAX\s0 and \s-2VMS\s0 are trademarks of Digital Equipment Corporation. .FE .AU William Joy .AI Computer Science Division Electrical Engineering and Computer Science Department University of California, Berkeley Berkeley, Ca. 94720 .SH Introduction .PP A recent paper written by David Kashtan of SRI discussed some measurements he made comparing the performance of the two available operating systems for the \s-2VAX\s0: \s-2UNIX\s0 and \s-2VMS\s0. The \s-2UNIX\s0 system measured by Kashtan was \s-2UNIX/32V\s0 as extended by U.C. Berkeley to support virtual memory. The measurements were made on version 2.1 of the Berkeley system and version 1.6 of \s-2VMS\s0. This note seeks to help interpret these results and more clearly point out the differences in approach and current state of the two systems. .PP We will show that the differences that Kashtan measured have simple explanations, and that the comparison of performance on such benchmarks can guide short-term tuning. Long-term decisions as to which system to run have been made on the basis of portability and flexibility considerations. The results reported here confirm the correctness of the choice of \s-2UNIX\s0 on these grounds by confirming its flexibility. The detected slownesses in \s-2UNIX\s0 are neither fundamental to the way \s-2UNIX\s0 does business, nor difficult to mitigate if it is felt important to work on the areas in question. We will discuss simple changes to the \s-2UNIX\s0 system that have been made to improve performance in the areas mentioned. The improvements described here were incorporated in the production system at Berkeley during the first three weeks of March, 1980. 
.AE .SH System call overhead .PP Let us first consider system call overhead, which underlies the measurements that were made of ``Context Switching Performance.'' From the measurements Kashtan gives for context switching overheads, we can first estimate some fundamental times for the \s-2VMS\s0 system: a simple system call (e.g. an event flag call) on \s-2VMS\s0 appears to take about 114\(*msec, and a context switch about 178\(*msec. To get an idea of scale here, the \s-2VAX\s0 memory cycle time (for cache write-through) is about 1.2\(*msec, the simplest procedure call \fBjsb\fR and matching return \fBrsb\fR takes about 4\(*msec, and the high level language call instruction \fBcalls\fR and matching return \fBret\fR takes a minimum of 16\(*msec, more if some registers are to be saved. .PP On the version of \s-2UNIX\s0 Kashtan had, a trivial system call took about 350\(*msec. This is almost three times as much time as \s-2VMS\s0 used. Where was this time being spent? The following is a rough breakdown: .TS center box; l n. save registers, call \fItrap\fR 40\(*msec setup in \fItrap\fR for syscall 30\(*msec fetching arguments for syscall 60\(*msec saving context in case interrupted 100\(*msec system call proper 30\(*msec checking to see if signals waiting 30\(*msec recomputing process priority 60\(*msec restore registers, \fBrei\fR 30\(*msec .TE .PP This code is all modularly and straightforwardly written, using a small set of primitives (for instance the ``non-local-goto'' at the point of an interrupted system call is effected by using part of the primitives performing context switching.) There are many calls to small routines here. Each system call argument is fetched by calling a primitive \fIfuword\fR which fetches a word from user-space after performing access checks. The saving of the context involves a procedure call, as does checking for signals and recomputing priority. 
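The per-argument fetch just described, and the batched alternative the system later adopts, can be contrasted in a sketch. This is illustrative only: \fIfuword\fR and \fIcopyin\fR are named in the text, but the stand-ins below operate on ordinary memory and omit the user-space access checks the real primitives perform.

```c
#include <string.h>

/* Stand-in for fuword: fetch one word from "user space". */
long fuword(const int *uaddr) {
    return *uaddr;
}

/* Stand-in for copyin: one access check, then one block move. */
int copyin(const int *uaddr, int *kaddr, unsigned nbytes) {
    memcpy(kaddr, uaddr, nbytes);
    return 0;
}

/* Old path: one full procedure call per argument word. */
void fetch_args_percall(const int *usp, int *args, int nargs) {
    for (int i = 0; i < nargs; i++)
        args[i] = (int)fuword(&usp[i]);
}

/* Revised path: a single batched copy of all argument words. */
void fetch_args_batched(const int *usp, int *args, int nargs) {
    copyin(usp, args, (unsigned)nargs * sizeof(int));
}
```

Both paths deliver the same argument words; the difference is that the batched form pays the procedure call overhead once rather than once per argument.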
Altogether, when a system call has two arguments, seven \fBcalls\fR instructions are performed here to various subroutines. These alone take roughly 105\(*msec, assuming no registers are specified to be saved in the entry mask. Since the saving and restoring of the registers for the call to \fItrap\fR takes an additional 20\(*msec, this means that 125\(*msec is accounted for without any of the work for the system call itself. Clearly, if this is to be sped up, some of these routine calls must be eliminated. .PP With minor changes to the code for trap handling we have sped up the system call time so that the basic overhead is 140\(*msec, instead of 350\(*msec. This was done as follows: .IP 1. The \fItrap\fR routine's ``entry mask'' as generated by the C compiler is modified at boot time to save all possible registers, to avoid saving some registers twice.\(dg .FS \(dg Previously, the assembly language for the system saved all the registers, and made no assumptions about what \fItrap\fR itself did, resulting in registers which were used for register variables within \fItrap\fR being saved twice. .FE .IP 2. The routine to handle system calls was broken out from the handler for all other traps and called \fIsyscall\fR. This minimizes the changes to the code for handling other traps in making the following improvements, and allows some small simplifications. .IP 3. Fetching of all arguments can be done by a single routine \fIcopyin\fR, instead of calling \fIfuword\fR for each argument.\(dd .FS \(dd The copy of arguments into system space is important to avoid severe system security problems with system calls that self-modify their arguments after they have been checked for consistency. .FE Since the number of argument words to system calls is very small it is easy to expand the \fIcopyin\fR primitive in line in this critical path. 
It can be implemented, in this special case, by two basic \s-2VAX\s0 instructions: a \fBprober\fR to determine accessibility, and a \fBmovc3\fR to move the arguments into system space.* .FS * The \fIcopyin\fR primitive is complicated in the general case by the rather strange semantics of \fBprober\fR, which only checks accessibility of the first and last pages in the address range it is given. This forces software to loop over each page involved in the \fIcopyin\fR. The looping checks are not needed in the code which fetches arguments for system calls, because there are at most 16 bytes of direct arguments to a system call. .FE .IP 4. To restore after an interrupted system call it suffices to be able to locate the frame pointer and stack pointer of the original \fIsyscall\fR procedure. We can effect this context save using just three \fBmovl\fR instructions. .IP 5. A subroutine call to check if signals are pending to this process can be avoided in almost all cases by first checking the mask of pending signals. This partial inline expansion of the \fIissig\fR routine reduces the overhead here by at least 16\(*msec. .IP 6. Recomputation of the process priority at each system call can be avoided by ``unloading'' the \fBp_pri\fR per-process information field. The system used a single field to encode both the process's user-CPU-usage dependent scheduling priority and priorities related to process blocking during system calls. By adding a \fBp_usrpri\fR information field reflecting user-CPU usage, the code in \fItrap\fR reduces to an assignment of this priority to the \fBp_pri\fR field, instead of recomputing the user priority after each system call. The space overhead is one byte per process, the way in which the system works is unaffected, and the code is then somewhat easier to understand. .PP These changes reflect only local optimization of the code; the substance of system call handling has not been changed. 
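The \fBp_usrpri\fR change of item 6 can be sketched as follows. The field names follow the text, but the priority formula, the constant, and the routine names here are placeholders, not the system's actual code.

```c
#define PUSER 50                /* assumed base user priority (placeholder) */

struct proc {
    unsigned char p_pri;        /* priority the scheduler acts on */
    unsigned char p_usrpri;     /* cached user-CPU-usage-dependent priority */
    int p_cpu;                  /* recent CPU usage estimate */
};

/* Recomputed periodically (e.g. from the clock interrupt),
 * not on every system call; the formula is illustrative. */
void setpri(struct proc *p) {
    p->p_usrpri = (unsigned char)(PUSER + p->p_cpu / 4);
}

/* On return from a system call, trap() now reduces to one assignment
 * instead of recomputing the user priority. */
void userret(struct proc *p) {
    p->p_pri = p->p_usrpri;
}
```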
It is simply the case that the \fBcalls\fR sequence used by the C compiler for procedure calls is relatively expensive. Expansion of small routines and machine dependent primitives in critical paths is an important technique that can be used to quickly and easily mitigate routine call slowness when profiling or other measurements show this to be necessary. .PP It should be noted that the use of \fBcalls\fR and passing of routine parameters on the stack in the current \s-2VAX\s0 C compiler is different from the way \s-2VMS\s0 is coded.\(dg .FS \(dg VMS is written in assembly language. .FE \s-2VMS\s0 makes almost exclusive use of the \fBjsb\fR and \fBrsb\fR instructions to avoid the overhead of \fBcalls\fR, and has stylized conventions for assignment of registers across assembly language routines so that pushing and popping of data on the stack can be avoided. This basic mechanism is a good deal faster than \fBcalls\fR if register usage can be optimized carefully, but tends to make the code harder to change. .LP .LP .B Perspective. Our departmental \s-2VAX\s0 running 20 users in the afternoon does an average of about 100 system calls a second. Under the old system the basic overhead for these 100 system calls was about 3.5% of the available processor time. The faster system call interface reduces this to about 1.4%. .PP The fact that the primitives for critical sections (the \fIspl\fR set priority level routines) were implemented by \fBcalls\fR to the two instructions implementing them accounted for nearly 1/3 of the time in a \fIread\fR or \fIwrite\fR system call. Since over half of all system calls are reads or writes, a simple inline expansion of these primitives accounted for more improvement for reads and writes than the changes to the basic system call routines. .SH Context switching .PP The context switching tests attempted to measure how fast the systems could pass control from one process to another. 
Kashtan measured that \s-2VMS\s0 could switch between two processes at a net rate of 2000 switches per second using the ``event flag'' mechanism to signal process exchange. On \s-2UNIX\s0, using the \fIkill\fR system call as a signalling mechanism, Kashtan found that the maximum switching rate was 210 per second. He estimated that the basic switching rate of the two systems was 5600 switches per second on \s-2VMS\s0 and 425 per second on \s-2UNIX\s0. He concluded: .QS ``\s-2UNIX\s0, as currently implemented, has to do considerably more work when scheduling a process. In addition, \s-2UNIX\s0 must do a context switch to process number 0 in order to make the decision as to which process to be run next. Even this cannot explain the greater than 10 to 1 difference in the performance of the two systems.'' .QE We will see that this difference involves no great mystery, and that, in fact, \s-2UNIX\s0 can be made to context switch nearly as fast as \s-2VMS\s0 does simply by changing the (assembly-language) primitive support of context switching to match the hardware. Remaining differences in timings are not the result of ``inefficiencies'' in \s-2UNIX\s0, but of the close fit of the \s-2VMS\s0 strategy with the \s-2VAX\s0 architecture.\(dd .FS \(dd Specifically, blocked jobs are queued on linked lists with \fBinsque\fR (insert in queue instruction) for removal with \fBremque\fR (remove from queue instruction) while \s-2UNIX\s0 uses subroutines which hash sleeping jobs. The overhead of calling the hashed \fIsleep\fR and \fIwakeup\fR routines and searching the hash chains will account for the remaining difference in time. .FE .PP The \s-2VAX\s0 instruction set caters to a certain regime of context switching. 
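The hashed sleep queues mentioned in the footnote can be sketched as follows. The structure and names (\fIslpque\fR, \fBwchan\fR) are illustrative stand-ins, not the system's actual declarations; the point is that \fIwakeup\fR need only search the one chain its channel hashes to.

```c
#include <stddef.h>

#define SQSIZE 8
#define HASH(chan) (((unsigned long)(chan) >> 3) & (SQSIZE - 1))

struct sproc {
    void *wchan;                /* event the process is sleeping on */
    struct sproc *link;         /* next process on the same hash chain */
};

struct sproc *slpque[SQSIZE];   /* hash chains of sleeping processes */

/* Block p on channel chan: push it onto the chain chan hashes to. */
void sleep_on(struct sproc *p, void *chan) {
    unsigned h = HASH(chan);
    p->wchan = chan;
    p->link = slpque[h];
    slpque[h] = p;
}

/* Wake everything sleeping on chan, searching only one chain.
 * Returns the number of processes awakened. */
int wakeup(void *chan) {
    struct sproc **pp = &slpque[HASH(chan)];
    int n = 0;
    while (*pp != NULL) {
        if ((*pp)->wchan == chan) {
            (*pp)->wchan = NULL;
            *pp = (*pp)->link;  /* unlink and keep scanning */
            n++;
        } else {
            pp = &(*pp)->link;
        }
    }
    return n;
}
```

The per-call procedure overhead and the chain search in these routines are exactly the residual costs the footnote attributes to \s-2UNIX\s0 relative to the \fBinsque\fR/\fBremque\fR queues of \s-2VMS\s0.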
To make \s-2UNIX\s0 amenable to this idea of context switching, it sufficed to include the C library routines \fIsetjmp\fR and \fIlongjmp\fR in the system to handle internal non-local goto's, and to provide a new context switch primitive \fIresume\fR that corresponds to a \fBsvpctx\fR followed by a \fBldpctx\fR. .PP There were two further problems with the context switching primitives: as on the PDP-11, the per-process system stack and control information were kept in kernel address space. It is much more efficient on the \s-2VAX\s0 to keep this information in the P0 or P1 region where it will be remapped when \fBldpctx\fR is used. This change was made by moving this information to the base of the stack. The final problem was the one Kashtan noted: that the system switched to process 0 each time before switching to the eventual target for the switch. This was done only to allow the system to idle in process 0. This is cleaner than idling on an arbitrary process because, e.g., the process that called \fIswtch\fR might have just swapped itself out, and we would then be running on system control information in memory that had been released. There is actually no problem in not switching to process 0, since the system cannot run anything but interrupt code until we come out of the idle loop in \fIswtch\fR. .PP After these changes, the times for context switching improved dramatically, dropping by more than a factor of 5 to about 400\(*msec per switch. Examination of the remaining overhead revealed that there were several small routines being called in critical paths in the \fIsleep\fR routine. These calls were simply eliminated by inline expansion of the code. .PP After the above changes had been made, we measured the system context switch time and broke it down as follows: .TS center box; l n. 
blocking time in \fIsleep\fR 50\(*msec rescheduling time in \fIswtch\fR 60\(*msec \fIresume\fR primitive 110\(*msec unblocking time in \fIwakeup\fR 50\(*msec .TE Thus the context switching time had been reduced to 270\(*msec, a factor of seven improvement. .PP In practice, there is a further efficiency issue here that is not pointed out by the benchmarks. The \fIswtch\fR routine used a linear search over a linked list to find the highest priority job on the list. The \s-2VMS\s0 system uses the \fBffs\fR (find first set bit), \fBinsque\fR and \fBremque\fR instructions and an array of run-queues ordered by priority to make this selection as rapid as possible. We coded this new \fIswtch\fR routine (it is about 10 lines of assembly language), and changed the system so that only truly runnable jobs were on the run queue (the old system left runnable jobs on the run queue even when they were swapped out). This allows the \fIswtch\fR primitive to run in time independent of the length of the run queue. .LP .LP .B Perspective. Let us try to get some perspective for timesharing \s-2UNIX\s0 systems (such as the machine where this paper is being prepared). The system is currently supporting about 20 users, and doing roughly 50 context switches per second. The original code, which ran in about 2 milliseconds per context switch, would have cost 10% of the machine in context switching. A version of the system changed to not switch to process 0 in \fIswtch\fR ran in about 800\(*msec per context switch, resulting in an average context switching overhead of about 4% of the machine. The current system, with all the changes mentioned above, uses roughly 1.3% of the machine in context switching. .PP If applications need the fastest possible context switching time, then \s-2UNIX\s0 would have to be changed so that the calls to \fIsleep\fR and \fIwakeup\fR were less expensive. 
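The run-queue selection described above can be sketched in C. The array-of-queues arrangement follows the text, but the queues are reduced to counters and the names are illustrative; the kernel version is about ten lines of assembly language using the \fBffs\fR instruction directly.

```c
#include <strings.h>            /* ffs() */

#define NQS 32                  /* one queue per priority level */

int whichqs;                    /* bit q set <=> queue q is non-empty */
int qs[NQS];                    /* stand-in for the per-priority queues */

/* Make a job at priority q runnable. */
void setrq(int q) {
    qs[q]++;
    whichqs |= 1 << q;
}

/* Pick the most eligible queue (lowest index = best priority) with a
 * single find-first-set; returns -1 when nothing is runnable (idle). */
int swtch_pick(void) {
    int q = ffs(whichqs);
    if (q == 0)
        return -1;
    q--;
    if (--qs[q] == 0)
        whichqs &= ~(1 << q);   /* queue drained: clear its bit */
    return q;
}
```

Because selection is one bit-scan rather than a list search, the cost is independent of how many jobs are runnable, which is the property claimed in the text.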
Roughly half of the 100\(*msec spent in these routines could be saved by writing them in assembly language and calling them with \fBjsb\fR instead of \fBcalls\fR. They would then not have the 16\(*msec overhead of \fBcalls\fR and \fBret\fR and could use registers 0-5 for scratch work instead of saving and restoring registers 6-11, the registers normally used by the C compiler for register variables. Measurements of the incidence of \fIsleep\fR and \fIwakeup\fR in our environment do not justify such a change. .SH IPC Mechanisms .PP The measurements here were of the time for a process to send either 4 or 512 byte packets to another process. The system call overhead and context switch overhead reductions improved performance here, as did the inline expansion of the priority level routines. Finally, a few primitives (file lock and unlock, and user/kernel data movement) were defined as macros or partially expanded inline to save time. .PP The improvements measured in 4 byte packet transmission are reported in the following table. The three experiments measured the rate at which a process could send 4 bytes of data to itself, send 4 bytes to another process, and send 4 bytes to another process and receive a 4 byte reply. The measurements give the number of 4 byte packets transferred each second. .TS center box; l l l l a n n n. Mechanism To self One-way Two-way _ \s-2VMS\s0 Mailboxes 440 297 363 \s-2UNIX\s0 Before 370 294 281 \s-2UNIX\s0 After 819 714 600 .TE .LP .LP .B Perspective. The overhead in using pipes on \s-2UNIX\s0 was reduced greatly by simply eliminating unnecessary \fBcalls\fR of primitives. The ultimate speed of message passing through pipes should be between 600 and 1000 packets per second, when no buffering is taking place. .PP We see no reason that the \fImpx\fR multiplexed file mechanism cannot be made to operate at speeds greater than that of the pipe mechanism. 
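The two-way experiment above corresponds to the following userland sketch: one 4 byte packet to a child through one pipe, and its 4 byte reply back through another. The benchmark proper would loop and time this; error handling and the timing loop are omitted here.

```c
#include <unistd.h>

/* One two-way exchange of 4 byte packets between parent and child.
 * Returns 1 if the echoed reply matches, 0 otherwise. */
int round_trip(void) {
    int to_child[2], to_parent[2];
    char buf[4];

    if (pipe(to_child) < 0 || pipe(to_parent) < 0)
        return 0;
    if (fork() == 0) {                  /* child: echo one packet back */
        read(to_child[0], buf, 4);
        write(to_parent[1], buf, 4);
        _exit(0);
    }
    write(to_child[1], "ping", 4);      /* parent: send, then await reply */
    read(to_parent[0], buf, 4);
    return buf[0] == 'p' && buf[3] == 'g';
}
```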
Most of the overhead in the pipe mechanism is in the block input/output system, which must be called to access and release the file system blocks of a pipe on each \fIread\fR or \fIwrite\fR operation. With \fImpx\fR, messages will be buffered in memory, and this expensive indexing can be avoided. By placing the \fImpx\fR buffers in paged memory, it should be simple to provide fast response when activity is heavy without tying up large amounts of core if buffered data accumulates. .SH Paging .PP Kashtan measured performance of the paging system on two kinds of paging activity: sequential and random. He also tried varying, in the random case, the standard deviation of successive references. Never did his paging experiments involve any ``memory'' in the behavior of the processes, and they modified every referenced page. To help cope with jobs which reference or modify large numbers of pages rapidly, we have changed the system to read and write clusters of pages from the paging device. This helps to minimize the incidence of pageins and pageouts. Write clustering alone makes nearly a factor of two improvement in the time taken to run most of Kashtan's benchmarks, since he modifies every page. Read clustering improves the performance of jobs that sequentially access virtual memory, and has a good, but less significant, impact on other jobs, provided it is used in moderation.* .FS * An excessive amount of prepaging tends to overload the page replacement algorithm by creating an artificially high demand for memory. We can mitigate this somewhat by not validating the pre-paged pages, forcing a ``reclaim'' to occur quickly if the page is not to be discarded. Observations show, however, that prepaging more than 2048 bytes (2 of the basic 1024 byte pages that \s-2UNIX\s0 uses) at a time tends to degrade general system performance. 
.FE .PP \s-2UNIX\s0 attempts to determine the set of pages that have been recently referenced using software simulation of the page reference bits that are not provided by the hardware. While experiments show that this works well on jobs that are typical of our working environment, jobs such as those run by Kashtan are not well observed by this method. A simple circular or random page replacement algorithm is nearly optimal for the experiments that he made. Disabling the gathering and use of page reference information for ``memoryless'' jobs, and instead relying more on simple circular or random replacement, more akin to the algorithm used by \s-2VMS\s0, improves system performance also. We have added an advisory system call that informs the system that the process will not be well observed by the normal reference information gathering algorithm. This is used by \s-2LISP\s0 during garbage collection. One can easily imagine processes informing the system of strongly sequential or random paging behavior. .PP The following table gives the relative times for the sequential and random benchmarks on Kashtan's machine and ours. His benchmark used 8192 pages of process virtual space while ours used only 7500. He had 2 megabytes of real memory while we had only 1.75. These numbers are unfortunately not exactly comparable, and we hope to run the experiment on a 2 megabyte machine in the near future. .TS center box; l l l l a n n n. Experiment VMS Old UNIX New UNIX Sequential access 4:32 20:16 6:45 Random access 6:00 17:24 10:37 .TE .PP Kashtan also measured a program with random paging behavior (each successive reference was a random distance from the previous), with varying standard deviations of references. We report here the times Kashtan measured for \s-2VMS\s0 and \s-2UNIX\s0 when he made the experiment, and also measurements on our system before and after the changes mentioned above were introduced. 
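The effect of the write clustering introduced above can be sketched as follows: given the dirty bits for a range of pages, the system issues one pageout per maximal run of consecutive dirty pages rather than one per page. The interface is illustrative, not the kernel's.

```c
/* Count the pageout operations needed for a range of pages when
 * consecutive dirty pages are written as one clustered I/O. */
int count_pageouts(const int *dirty, int npages) {
    int ios = 0;
    for (int i = 0; i < npages; i++) {
        if (dirty[i]) {
            ios++;                       /* start one clustered write */
            while (i + 1 < npages && dirty[i + 1])
                i++;                     /* fold the next page into it */
        }
    }
    return ios;
}
```

For a benchmark like Kashtan's, which dirties every page, every cluster is maximal, which is why write clustering alone approaches a factor of two improvement.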
Kashtan altered the basic paging strategy in \s-2VMS\s0 for these tests; the strategy used in \s-2UNIX\s0 here was always the one we use during general timesharing. .TS center box; l | l l | l l n | n n | n n. Deviation VMS SRI UNIX Old Berk UNIX New Berk UNIX 1 0:16 0:16 0:23 0:23 10 0:18 0:16 0:24 0:24 30 0:25 0:18 0:66 0:44 40 0:47 1:08 1:34 50 1:21 3:54 4:37 2:38 60 1:53 5:46 6:21 3:13 80 3:04 6:38 8:47 4:53 100 3:27 8:26 10:28 6:12 .TE .PP These measurements are somewhat less than satisfactory for two reasons: the machines had different configurations making the results more difficult to interpret. Also, the experiments were too short, and therefore the start-up time for the system to reach a stable state has not been factored out of the results.\(dg .FS \(dg For example, the \s-2UNIX\s0 simulation of software reference bits is disabled when there is a large amount of free memory and a certain amount of time (15 to 20 seconds) is required after free memory drops below a threshold before the reference information begins to become available. This interval is of the same length as the experiments, clouding the results. .FE .LP .LP .B Perspective. The changes to the paging system to introduce clustering and for special treatment of processes that are not well handled by the normal techniques markedly improve the performance of \s-2UNIX\s0 on Kashtan's benchmarks. For the jobs we normally see on our system the improvement was not as noticeable. The effect of these changes on a standard synthetic workload that we had previously run to evaluate system performance was negligible. .PP Of far more importance to system performance was a revision of the swap scheduling algorithm. The previous version of the system used an oldest job first strategy for swapping out runnable jobs. Changing the system to instead swap out the oldest of the \fIn\fR largest jobs and then slowing down the rate of swapping had dramatic effects under the heavy loads we have been encountering. 
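The revised swap selection described above, the oldest of the \fIn\fR largest jobs rather than simply the oldest, can be sketched as follows. The fields, the simple selection loop, and the limit of 64 jobs are all illustrative assumptions, not the system's actual code.

```c
struct job {
    int size;                   /* core occupied by the job */
    int age;                    /* how long it has been resident */
};

/* Return the index of the job to swap out: the oldest (largest age)
 * of the n largest jobs. Assumes njobs <= 64; O(n * njobs) selection
 * is fine at these sizes. */
int choose_swapout(const struct job *j, int njobs, int n) {
    int picked[64] = {0};       /* jobs already counted among the largest */
    int best = -1;

    for (int k = 0; k < n && k < njobs; k++) {
        int big = -1;
        for (int i = 0; i < njobs; i++)     /* next-largest unpicked job */
            if (!picked[i] && (big < 0 || j[i].size > j[big].size))
                big = i;
        picked[big] = 1;
        if (best < 0 || j[big].age > j[best].age)
            best = big;         /* oldest seen among the largest so far */
    }
    return best;
}
```

With n of 1 this degenerates to swapping the largest job; larger n moves the policy toward the old oldest-first rule while still favoring jobs whose eviction frees the most memory.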
This change is of much more importance to the typical \s-2UNIX\s0 installation than the other changes mentioned above. .SH Conclusions .PP In selecting between \s-2UNIX\s0 and \s-2VMS\s0 as an operating system for use in the \s-2ARPA\s0 Image Understanding and VLSI research communities, the \s-2UNIX\s0 system was chosen primarily out of concern for portability. For our purposes within the Computer Science Department at Berkeley, we have chosen \s-2UNIX\s0 because of its pliability to meet our needs. .PP In the short term, there are areas of the \s-2UNIX\s0 system that are still suffering growing pains from the porting of the system to larger machines. We believe that the simplicity and modularity of the system, which are the keys to its portability, are also the reasons why \s-2UNIX\s0 is easy to tune, as this paper has demonstrated. .PP We have by no means made all the changes to the system that will be needed by the \s-2ARPA\s0 community. The throughput of the \s-2UNIX\s0 file system is more than adequate for our current time-sharing applications, but will not be great enough for large image processing applications, nor will it support page clustering to the file system, which will be essential if large files are to be shared among processes and paged from the file system. .PP Changes to support such facilities have been designed, and implementation of these facilities is not difficult, because of the simplicity of the current system. We believe that we can, with minor changes to the file system structure, increase the throughput sufficiently to support most of these applications.\(dg .FS \(dg We plan to initially implement a file system, based on the current \fIinode\fR structure, but which maintains a pool of 8192 byte buffers as well as a pool of 1024 byte buffers. 
By judiciously allocating these 8192 byte contiguous chunks to large files, and allowing programs which need access to large amounts of data to read such large blocks directly into their address space (without copying them in and out of the system buffer cache), we should achieve a significant improvement in file system throughput on applications which (for example) sequentially access large amounts of data. Such a clustering scheme will also allow the paging system to cluster the write-back of pages which are being shared by processes after being ``mapped'' from files. .FE In the long term, a different basic file system organization may be needed. We are examining other file system schemes, such as extent-based file systems, confident that we can easily change the system to incorporate whatever scheme we decide upon. This flexibility, essential for long-term viability of the system, is the reason we chose to use \s-2UNIX\s0.