Report No. 3970 Bolt Beranek and Newman Inc. Performance Improvements in UNIX Pipes and Ports November 1978 Morris Kranc Prepared for: Defense Advanced Research Projects Agency Report No. 3970 Bolt Beranek and Newman Inc. TABLE OF CONTENTS 1. INTRODUCTION ............................................. 1 2. PERFORMANCE OF PIPES .................................... 4 3. IMPROVEMENTS TO PORTS ................................... 10 4. PERFORMANCE ON THE PDP-11/70 ............................ 15 5. REAL-WORLD PERFORMANCE WVALUATION ....................... 19 6. REFERENCES .............................................. 22 Report No. 3970 Bolt Beranek and Newman Inc. 1. INTRODUCTION This report discusses improvements made to the pipe and port routines as implemented in the UNIX operating system [1]. All tests, unless otherwise noted, were conducted on a PDP-11/40 with dual RK05 disks and 128K of memory. An understanding of UNIX I/O is essential to this discus- sion. Before describing specific improvements, we will therefore examine general I/O strategies of the standard UNIX system. When a user process `writes' some data, the system copies that data to a buffer taken from the buffer freepool. At this point, the system can proceed in one of three ways: (1) It can perform a synchronous write. _ __ ___ _______ _ ___________ _____ A routine is initiated which writes the contents of the buffer onto the disk. The operating system initiates this write and waits until its conclusion before freeing the buffer and return- ing control to the user. - 1 - Report No. 3970 Bolt Beranek and Newman Inc. (2) It can perform an asynchronous write. _ __ ___ _______ __ ____________ _____ The routine to write the contents of the buffer onto the disk is initiated, but the operating system does not wait for its conclu- sion. Rather, control is returned to the user process. Later, when the write is finished, an interrupt is generated, and the system regains control in order to return the buffer to the freepool. (3) It can perform a delayed write. _ __ ___ _______ _ _______ _____ Nothing is actually written to the device. Rather, the buffer is returned to the freepool with a special delayed write flag (B_DELWRI) turned on. Additionally, the buffer contains a (dev- ice number,block number) pair that identifies its ultimate disk location. This buffer will eventually reach the front of the freepool queue and will be requested for some other purpose. At this point, before the buffer is actually given away, the buffer allocation routine will initiate a write to the disk. At the conclusion of the write, the buffer will be allocated to its new purpose. Thus, between the time it enters the freepool and the time it reaches the front of the freepool, a delayed write buffer represents the only up-to-date copy of the disk block (device number,block number). The purpose of a delayed write is as follows. Both the read - 2 - Report No. 3970 Bolt Beranek and Newman Inc. and write routines are smart enough to check the freepool for delayed write copies of a particular block before they actually initiate any disk I/O. Let us suppose that a user is writing ten times into the same disk block. If each write is done asynchro- nously, this will require 10 disk reads and 10 disk writes. If, however, a delayed write is performed each time, all updates can probably be done within the buffer, and only one disk write is required. The operating system uses the following rule to choose between a delayed write and an asynchronous write: if the final character of a block has been changed, i.e. the block is full, then it is assumed that no further updates will occur for that block, and it is written asynchronously. Otherwise, there is still hope that later updates will occur, and a delayed write is effected. - 3 - Report No. 3970 Bolt Beranek and Newman Inc. 2. PERFORMANCE OF PIPES A program named t.pipe1.c was written to exercise UNIX pipe I/O [2]. t.pipe1.c forks off into a parent and a child process. The parent writes a variable number of variable-sized blocks into the pipe (maximum blocksize = 4096). The child reads the amount of data the parent was supposed to write. The UNIX `time' com- mand was used to ascertain the real time and the system time required for each run. Additionally, the `CPU usage' was calcu- lated. By this we mean the percentage of real time taken by actual CPU activity, as opposed to idle time, I/O waits, etc. A high `CPU usage' value will mean the system is running efficient- ly and is always finding things to do. Running on an otherwise unoccupied system, the following results were obtained: - 4 - Report No. 3970 Bolt Beranek and Newman Inc. SIZE OF BLOCK EXECUTION TIME CPU USAGE real time system time 2 10 8.0 80% 4 9 7.4 80% 8 9 7.5 82% 16 9 7.7 82% 32 10 8.1 81% 64 12 9.5 78% 128 15 10.2 66% 256 32 11.0 34% 512 1:02 15.2 24% 1024 2:03 26.4 22% 2048 4:06 49.1 19% 4096 8:08 1:34 19% TABLE 1 One Thousand Blocks Note that, for suitably large block size, execution time is almost linear. Note also that at the maximum block size of 4096 bytes, throughput is about 65 Kbps. Of the over 8 minutes re- quired for this operation, only 1:34 is actually consumed in sys- tem time - the rest is non-specific system overhead. What is the system doing during the other 6 minutes? - 5 - Report No. 3970 Bolt Beranek and Newman Inc. To visualize the problem, let us construct an `ideal' algo- rithm for pipe reads and writes. We would like to avoid disk activity as much as possible. The writer, then, should be per- forming all its writes as delayed writes. The information will thus reside in a freepool buffer rather than on a disk. If we are lucky, the reader will activate before that buffer reaches the front of the freepool queue and is actually written. The reader can then snatch the information from the buffer, thus avoiding an entire disk write disk read cycle. Recall now the actual algorithm that the system uses in determining whether to perform a delayed or an asynchronous write. In the case of blocksize = 4096, it performs four asyn- chronous writes, of 512 bytes each, since the last character of each 512-byte block has been changed. The reader, when activat- ed, must then perform four disk reads. Clearly, different cri- teria are in order for pipe files. Since a reader and a writer are active simultaneously, it will always make sense for the writer to perform delayed writes, in the hope that the reader will activate in time to get the buffer before it is actually written. To implement this, the following changes were made: (1) The standard write routine checks if the file it is dealing with is a pipe file, and, if it is, always performs a delayed - 6 - Report No. 3970 Bolt Beranek and Newman Inc. write. (2) The pipe reader, after it has read a block of data from a buffer in the freepool, turns off the delayed write flag associ- ated with that buffer. Thus, the block will never be written to disk. We no longer care whether the disk file is up to date, as long as the reader has received the data. (3) The `getbuffer' routine avoids the buffers with delayed write flags on for as long as possible, preferring to retrieve `unat- tached' buffers if it can. The hope is that by giving a delayed write buffer a slightly longer life in core, we increase the chance that some reader will activate in time to use it. With these changes, the following statistics were obtained: - 7 - Report No. 3970 Bolt Beranek and Newman Inc. STANDARD SYSTEM IMPROVED SYSTEM BLOCK SIZE EXECUTION TIME CPU EXECUTION TIME CPU USAGE USAGE real system real system 2 10 8.0 80% 9 7.5 80% 4 9 7.4 80% 9 7.7 81% 8 9 7.5 82% 10 8.1 81% 16 9 7.7 82% 10 8.3 83% 32 10 8.1 81% 10 8.7 87% 64 12 9.5 78% 11 9.0 81% 128 15 10.2 68% 15 10.0 66% 256 32 11.0 34% 26 12.2 46% 512 1:02 15.2 24% 45 16.0 35% 1024 2:03 26.4 22% 49 26.3 53% 2048 4:06 49.1 19% 1:23 46.9 56% 4096 8:08 1:34 19% 1:31 1:25.2 93% TABLE 2 (standard system statistics reproduced from TABLE 1) - 8 - Report No. 3970 Bolt Beranek and Newman Inc. For blocksizes of 16 and smaller, performance is identical to that of the standard system. For blocksizes between 32 and 512, real time is equivalent, although system time is slightly higher for the improved system. For large blocks (1024 bytes and up) the savings in real time become dramatic, while system time drops only slightly. At blocksize = 4096, the throughput is 360 Kbps, representing an improvement by more than a factor of 5 over the standard system. Note also that CPU usage is over 90% - the system is almost always finding useful tnings to do rather than blocking on disk I/O. - 9 - Report No. 3970 Bolt Beranek and Newman Inc. 3. IMPROVEMENTS TO PORTS Ports [3] are similar in conception to pipes, and they share a significant amount of code. Again, we have a reader and a writer interacting, and we should expect improvements similar to those seen in pipes. A program called t.port1.c was written to exercise port I/O [2]. Again, results were obtained via the 'time' command on a PDP-11/40 with dual RK05 disks. The `before' and `after' statis- tics appear below. - 10 - Report No. 3970 Bolt Beranek and Newman Inc. STANDARD SYSTEM IMPROVED SYSTEM BLOCK SIZE EXECUTION TIME CPU EXECUTION TIME CPU USAGE USAGE real user system real user system 2 16 .8 13.7 87% 17 .9 15.2 94% 4 16 .7 13. 9 92% 17 .9 15.4 95% 8 16 .7 14.1 93% 17 .9 15.5 95% 16 18 1.3 14.9 83% 19 1.3 16.4 94% 32 18 1.7 14.6 88% 20 1.5 17.2 94% 64 20 2.9 15.2 90% 22 2.8 18.2 95% 128 24 4.8 17.0 90% 26 4.5 19.5 95% 256 42 9.0 19.5 69% 32 8.8 21.5 93% 512 1:27 17.2 25.9 49% 47 17.1 27.9 95% 1024 2:49 34.3 38.1 42% 1:15 33.8 40.0 98% 2048 5:26 1:07.0 1:03.4 39% 2:14 1:06.6 1:04.6 98% 4096 10:28 2:13.5 1:53.2 39% 4:09 2:12.7 1:51.9 98% TABLE 3 - 11 - Report No. 3970 Bolt Beranek and Newman Inc. For smaller blocksizes, the improved system performed slightly worse than the standard one. For blocksize > 128, how- ever, the improvement is noticeable both in system and real time. At the maximum blocksize of 4096, the improved system performs more than twice as well as the standard system. Why are the times so much worse for ports than for pipes on the new system? For one thing, t.port1.c does much more extensive checking than t.pipe1.c. At 4096 bytes/block, over 2 minutes are consumed in this checking. Also, t.port1.c writes an 8-byte header before it writes each block. Thus, for small blocksizes, time consumed writing the header is comparable to that consumed in writing the block itself. For the purpose of comparison, t.port1.c was modified so that it did no checking and wrote no headers. The results appear below. - 12 - Report No. 3970 Bolt Beranek and Newman Inc. PIPE I/O PORT I/O BLOCK SIZE EXECUTION TIME CPU EXECUTION TIME CPU USAGE USAGE real system real system 2 9 7.5 80% 9 7.8 87% 4 9 7.7 81% 10 8.1 81% 8 9 8.1 81% 11 8.9 81% 16 10 8.3 83% 11 9.0 81% 32 10 8.7 87% 12 9.5 80% 64 11 9.0 81% 12 10.1 83% 128 15 10.0 66% 12 11.0 91% 256 26 12.2 46% 28 12.5 83% 512 45 16.0 35% 48 18.4 90% 1024 49 26.3 53% 53 30.1 93% 2048 1:23 46.9 56% 1:30 52.3 57% 4096 1:31 1:25.2 93% 1:40 1:36.4 96% TABLE 4 - 13 - Report No. 3970 Bolt Beranek and Newman Inc. As can be seen, port I/O involves slightly more overhead than pipe I/O. At 4096 bytes/block, throughput is about 325 Kbps, still a pretty reasonable rate. - 14 - Report No. 3970 Bolt Beranek and Newman Inc. 4. PERFORMANCE ON THE PDP-11/70 For the sake of comparison, t.pipe1.c and t.port1.c were run on an otherwise unoccupied PDP-11/70 system with dual RP06 disks. The results for pipes and ports appear below. - 15 - Report No. 3970 Bolt Beranek and Newman Inc. PIPE PERFORMANCE STANDARD SYSTEM IMPROVED SYSTEM BLOCK SIZE EXECUTION TIME CPU EXECUTION TIME CPU USAGE USAGE real system real system 2 3 3.0 100% 5 3.7 74% 4 5 3.6 72% 4 3.8 95% 8 4 3.2 80% 6 4.4 73% 16 4 3.3 82% 5 4.2 84% 32 5 4.0 80% 4 4.0 100% 64 5 4.1 82% 5 4.3 86% 128 6 5.0 83% 6 5.4 90% 256 7 5.4 77% 6 5.5 91% 512 15 6.6 44% 8 6.6 82% 1024 30 11.6 38% 11 10.6 96% 2048 1:00 20.4 34% 20 18.3 91% 4096 1:58 38.1 32% 36 34.6 96% TABLE 5 - 16 - Report No. 3970 Bolt Beranek and Newman Inc. PORT PERFORMANCE STANDARD SYSTEM IMPROVED SYSTEM BLOCK SIZE EXECUTION TIME CPU EXECUTION TIME CPU USAGE USAGE real user system real user system 2 7 .3 6.4 95% 9 .6 7.2 86% 4 7 .5 6.5 100% 8 .5 7.4 98% 8 7 .8 6.1 98% 8 .6 7.4 100% 16 8 .5 6.5 87% 9 .8 7.2 88% 32 8 .6 6.8 92% 10 .6 7.9 85% 64 9 1.0 6.9 87% 10 1.1 8.2 93% 128 10 1.5 7.9 94% 11 1.4 9.1 95% 256 12 2.9 8.9 98% 13 2.2 10.1 94% 512 20 5.3 11.7 85% 17 5.2 11.8 100% 1024 35 9.7 17.4 77% 27 9.9 16.2 96% 2048 1:08 19.0 27.7 68% 46 19.0 26.1 98% 4096 2:25 36.5 47.9 13% 1:23 38.1 43.9 98% TABLE 6 - 17 - Report No. 3970 Bolt Beranek and Newman Inc. The improvements observed on the PDP 11/70 reflect those seen on the 11/40. At 4096 bytes/block, throughput for pipes is 910 Kbps, an improvement of a factor of 4 over the standard system. - 18 - Report No. 3970 Bolt Beranek and Newman Inc. 5. REAL-WORLD PERFORMANCE EVALUATION All the statistics presented above were obtained on an oth- erwise inactive system. If other processes were sharing the sys- tem, this would clearly degrade performance. The improved system depends very heavily on being able to keep a `working set' of buffers in core. If other processes are competing for buffers, we would expect disk activity for a port or pipe file to in- crease. To measure this increase, the following test was conducted. Two processes were run in background, each of which executed t.pipe1.c. Additionally, a third process was run which copied a file continuously (thus guaranteeing continuous disk activity). The test was run twice. In the first experiment, the two pipe routines each transferred 1000 blocks of 128 bytes each. The times for these two routines were then recorded. Results for two such runs were: - 19 - Report No. 3970 Bolt Beranek and Newman Inc. STANDARD SYSTEM IMPROVED SYSTEM EXECUTION TIME EXECUTION TIME real system real system RUN 1 ___ _ PROCESS 1: 2:05 9.3 1:40 11.4 PROCESS 2: 2:07 9.3 1:40 11.5 RUN 2 ___ _ PROCESS 1: 2:15 9.4 1:46 11.2 PROCESS 2: 2:24 9.2 1.46 11.6 TABLE 7 At this small blocksize, the improved system shows a small but noticeable improvement in real time. System time, however, has gone up - the extra checking required to prevent unnescessary disk activity has its cost. In the second experiment, the two pipe routines each transferred 100 blocks of 4096 bytes each. The times for two such runs was: - 20 - Report No. 3970 Bolt Beranek and Newman Inc. STANDARD SYSTEM IMPROVED SYSTEM EXECUTION TIME EXECUTION TIME real system real system RUN 1 ___ _ PROCESS 1: 1:36 9.6 29 9.5 PROCESS 2: 1:47 10.3 46 9.8 RUN 2 ___ _ PROCESS 1: 1:52 10.0 37 10.1 PROCESS 2: 1:53 9.8 39 9.5 TABLE 8 The improvement is now much more noticeable - the improved system is more than three times as fast as the standard version. - 21 - Report No. 3970 Bolt Beranek and Newman Inc. 6. REFERENCES [1] J.F. Haverty and R.D. Rettburg, "Interprocess Communications for a Server in UNIX", Proceedings of Compcon 78: Com- puter Communications Networks, September 1978, pp. 312-315. [2] G. Fostel, "A Note on UNIX 11/40 Communication Rates", BBN Departmental Memo, July, 1978. [3] "Interprocess Communication Extensions for the UNIX Operating System: II. Implementation", Rand Corporation Report No. R-2064/2-PR, Rand Corporation, Santa Monica, CA, 22, April 1977. - 22 -