BBN-V6/doc/ipc/port.runout





Report No. 3970                      Bolt Beranek and Newman Inc.














Performance Improvements in UNIX Pipes and Ports










November 1978


Morris Kranc






Prepared for:
Defense Advanced Research Projects Agency




























Report No. 3970                      Bolt Beranek and Newman Inc.



                        TABLE OF CONTENTS






1.   INTRODUCTION .............................................  1



2.   PERFORMANCE OF PIPES ....................................  4



3.   IMPROVEMENTS TO PORTS ................................... 10



4.   PERFORMANCE ON THE PDP-11/70 ............................ 15



5.   REAL-WORLD PERFORMANCE WVALUATION ....................... 19



6.   REFERENCES .............................................. 22


































Report No. 3970                      Bolt Beranek and Newman Inc.



1. INTRODUCTION



     This report discusses improvements made to the pipe and port

routines as implemented in the UNIX operating  system  [1].   All

tests, unless otherwise noted, were conducted on a PDP-11/40 with

dual RK05 disks and 128K of memory.



     An  understanding  of  UNIX I/O is essential to this discus-

sion.  Before describing specific improvements, we will therefore

examine general I/O strategies of the standard UNIX system.



     When a user process `writes' some data,  the  system  copies

that  data  to  a buffer taken from the buffer freepool.  At this

point, the system can proceed in one of three ways:





(1) It can perform a synchronous write.
 _  __ ___ _______ _ ___________ _____


A routine is initiated which writes the contents  of  the  buffer

onto  the  disk.   The  operating system initiates this write and

waits until its conclusion before freeing the buffer and  return-

ing control to the user.












                              - 1 -







Report No. 3970                      Bolt Beranek and Newman Inc.



(2) It can perform an asynchronous write.
 _  __ ___ _______ __ ____________ _____


The  routine to write the contents of the buffer onto the disk is

initiated, but the operating system does not wait for its conclu-

sion.   Rather,  control is returned to the user process.  Later,

when the write is finished, an interrupt is  generated,  and  the

system  regains  control  in  order  to  return the buffer to the

freepool.





(3) It can perform a delayed write.
 _  __ ___ _______ _ _______ _____


Nothing is actually written to the device.  Rather, the buffer is

returned  to  the  freepool  with  a  special  delayed write flag

(B_DELWRI) turned on.  Additionally, the buffer contains a  (dev-

ice  number,block  number) pair that identifies its ultimate disk

location.  This buffer will eventually reach  the  front  of  the

freepool  queue and will be requested for some other purpose.  At

this point, before the buffer is actually given away, the  buffer

allocation  routine  will  initiate  a write to the disk.  At the

conclusion of the write, the buffer will be allocated to its  new

purpose.   Thus,  between the time it enters the freepool and the

time it reaches the front of the freepool, a delayed write buffer

represents  the  only  up-to-date  copy of the disk block (device

number,block number).



     The purpose of a delayed write is as follows.  Both the read



                              - 2 -







Report No. 3970                      Bolt Beranek and Newman Inc.



and write routines are smart enough to  check  the  freepool  for

delayed  write  copies of a particular block before they actually

initiate any disk I/O.  Let us suppose that a user is writing ten

times  into the same disk block.  If each write is done asynchro-

nously, this will require 10 disk reads and 10 disk writes.   If,

however,  a delayed write is performed each time, all updates can

probably be done within the buffer, and only one  disk  write  is

required.



     The operating system  uses  the  following  rule  to  choose

between  a  delayed write and an asynchronous write: if the final

character of a block has been changed, i.e. the  block  is  full,

then  it  is  assumed that no further updates will occur for that

block, and it is written  asynchronously.   Otherwise,  there  is

still  hope that later updates will occur, and a delayed write is

effected.























                              - 3 -







Report No. 3970                      Bolt Beranek and Newman Inc.



2. PERFORMANCE OF PIPES



     A program named t.pipe1.c was written to exercise UNIX  pipe

I/O  [2].  t.pipe1.c forks off into a parent and a child process.

The parent writes a variable number of variable-sized blocks into

the  pipe (maximum blocksize = 4096).  The child reads the amount

of data the parent was supposed to write.  The UNIX  `time'  com-

mand  was  used  to  ascertain  the real time and the system time

required for each run.  Additionally, the `CPU usage' was  calcu-

lated.   By  this  we  mean  the percentage of real time taken by

actual CPU activity, as opposed to idle time, I/O waits, etc.   A

high `CPU usage' value will mean the system is running efficient-

ly and is always finding things to do.  Running on  an  otherwise

unoccupied system, the following results were obtained:



























                              - 4 -







Report No. 3970                      Bolt Beranek and Newman Inc.



SIZE OF BLOCK                EXECUTION TIME             CPU USAGE

                       real time    system time




2                       10              8.0                 80%

4                       9               7.4                 80%

8                       9               7.5                 82%

16                      9               7.7                 82%

32                      10              8.1                 81%

64                      12              9.5                 78%

128                     15              10.2                66%

256                     32              11.0                34%

512                     1:02            15.2                24%

1024                    2:03            26.4                22%

2048                    4:06            49.1                19%

4096                    8:08            1:34                19%



                        TABLE 1  One Thousand Blocks







     Note  that, for suitably large block size, execution time is

almost linear.  Note also that at the maximum block size of  4096

bytes,  throughput  is  about 65 Kbps.  Of the over 8 minutes re-

quired for this operation, only 1:34 is actually consumed in sys-

tem time - the rest is non-specific system overhead.  What is the

system doing during the other 6 minutes?


                              - 5 -







Report No. 3970                      Bolt Beranek and Newman Inc.



     To  visualize the problem, let us construct an `ideal' algo-

rithm for pipe reads and writes.  We would  like  to  avoid  disk

activity  as  much as possible.  The writer, then, should be per-

forming all its writes as delayed writes.  The  information  will

thus  reside  in  a freepool buffer rather than on a disk.  If we

are lucky, the reader will activate before  that  buffer  reaches

the  front  of  the  freepool queue and is actually written.  The

reader can then snatch the  information  from  the  buffer,  thus

avoiding an entire disk write disk read cycle.



     Recall now the actual algorithm  that  the  system  uses  in

determining  whether  to  perform  a  delayed  or an asynchronous

write.  In the case of blocksize = 4096, it performs  four  asyn-

chronous  writes,  of 512 bytes each, since the last character of

each 512-byte block has been changed.  The reader, when  activat-

ed,  must  then perform four disk reads.  Clearly, different cri-

teria are in order for pipe files.  Since a reader and  a  writer

are  active  simultaneously,  it  will  always make sense for the

writer to perform delayed writes, in the  hope  that  the  reader

will  activate  in  time  to get the buffer before it is actually

written.



     To implement this, the following changes were made:




(1) The standard write routine checks if the file it  is  dealing

with  is  a  pipe  file, and, if it is, always performs a delayed


                              - 6 -







Report No. 3970                      Bolt Beranek and Newman Inc.



write.




(2)  The  pipe  reader,  after it has read a block of data from a

buffer in the freepool, turns off the delayed write flag  associ-

ated  with that buffer.  Thus, the block will never be written to

disk.  We no longer care whether the disk file is up to date,  as

long as the reader has received the data.




(3) The `getbuffer' routine avoids the buffers with delayed write

flags  on  for as long as possible, preferring to retrieve `unat-

tached' buffers if it can.  The hope is that by giving a  delayed

write  buffer  a  slightly  longer  life in core, we increase the

chance that some reader will activate in time to use it.



     With these changes, the following statistics were obtained:























                              - 7 -







Report No. 3970                      Bolt Beranek and Newman Inc.



                   STANDARD SYSTEM                 IMPROVED SYSTEM


BLOCK SIZE      EXECUTION TIME   CPU            EXECUTION TIME   CPU
                                USAGE                           USAGE

               real    system                   real   system




2               10      8.0       80%           9       7.5       80%

4               9       7.4       80%           9       7.7       81%

8               9       7.5       82%           10      8.1       81%

16              9       7.7       82%           10      8.3       83%

32              10      8.1       81%           10      8.7       87%

64              12      9.5       78%           11      9.0       81%

128             15      10.2      68%           15      10.0      66%

256             32      11.0      34%           26      12.2      46%

512             1:02    15.2      24%           45      16.0      35%

1024            2:03    26.4      22%           49      26.3      53%

2048            4:06    49.1      19%           1:23    46.9      56%

4096            8:08    1:34      19%           1:31    1:25.2    93%



        TABLE 2 (standard system statistics reproduced from TABLE 1)
















                              - 8 -







Report No. 3970                      Bolt Beranek and Newman Inc.



     For  blocksizes  of 16 and smaller, performance is identical

to that of the standard system.  For blocksizes  between  32  and

512,  real  time  is equivalent, although system time is slightly

higher for the improved system.  For large blocks (1024 bytes and

up)  the  savings in real time become dramatic, while system time

drops only slightly.  At blocksize = 4096, the throughput is  360

Kbps, representing an improvement by more than a factor of 5 over

the standard system.  Note also that CPU usage is over 90% -  the

system  is  almost always finding useful tnings to do rather than

blocking on disk I/O.



































                              - 9 -







Report No. 3970                      Bolt Beranek and Newman Inc.



3. IMPROVEMENTS TO PORTS



     Ports [3] are similar in conception to pipes, and they share

a  significant  amount  of  code.   Again, we have a reader and a

writer interacting, and we should expect improvements similar  to

those seen in pipes.



     A program called t.port1.c was written to exercise port  I/O

[2].   Again,  results  were obtained via the 'time' command on a

PDP-11/40 with dual RK05 disks.  The `before' and `after' statis-

tics appear below.

































                             - 10 -







Report No. 3970                      Bolt Beranek and Newman Inc.



                STANDARD SYSTEM                 IMPROVED SYSTEM


BLOCK SIZE    EXECUTION TIME   CPU            EXECUTION TIME    CPU
                              USAGE                            USAGE


           real  user  system               real  user  system




2         16      .8    13.7    87%         17      .9     15.2   94%

4         16      .7    13. 9   92%         17      .9     15.4   95%

8         16      .7    14.1    93%         17      .9     15.5   95%

16        18     1.3    14.9    83%         19     1.3     16.4   94%

32        18     1.7    14.6    88%         20     1.5     17.2   94%

64        20     2.9    15.2    90%         22     2.8     18.2   95%

128       24     4.8    17.0    90%         26     4.5     19.5   95%

256       42     9.0    19.5    69%         32     8.8     21.5   93%

512      1:27   17.2    25.9    49%         47    17.1     27.9   95%

1024     2:49   34.3    38.1    42%        1:15   33.8     40.0   98%

2048     5:26 1:07.0  1:03.4    39%        2:14  1:06.6   1:04.6  98%

4096    10:28 2:13.5  1:53.2    39%        4:09  2:12.7   1:51.9  98%




                        TABLE 3














                             - 11 -







Report No. 3970                      Bolt Beranek and Newman Inc.



     For   smaller  blocksizes,  the  improved  system  performed

slightly worse than the standard one.  For blocksize > 128,  how-

ever, the improvement is noticeable both in system and real time.

At the maximum blocksize of 4096, the  improved  system  performs

more than twice as well as the standard system.



     Why are the times so much worse for ports than for pipes  on

the new system? For one thing, t.port1.c does much more extensive

checking than t.pipe1.c.  At 4096 bytes/block, over 2 minutes are

consumed  in  this  checking.   Also,  t.port1.c writes an 8-byte

header before it writes each block.  Thus, for small  blocksizes,

time  consumed  writing the header is comparable to that consumed

in writing the block itself.



     For  the  purpose  of  comparison, t.port1.c was modified so

that it did no checking and wrote no headers.  The results appear

below.





















                             - 12 -







Report No. 3970                      Bolt Beranek and Newman Inc.



                     PIPE I/O                        PORT I/O




BLOCK SIZE   EXECUTION TIME    CPU          EXECUTION TIME    CPU
                              USAGE                          USAGE

             real    system                   real    system





2             9       7.5       80%           9       7.8      87%

4             9       7.7       81%          10       8.1      81%

8             9       8.1       81%          11       8.9      81%

16           10       8.3       83%          11       9.0      81%

32           10       8.7       87%          12       9.5      80%

64           11       9.0       81%          12      10.1      83%

128          15      10.0       66%          12      11.0      91%

256          26      12.2       46%          28      12.5      83%

512          45      16.0       35%          48      18.4      90%

1024         49      26.3       53%          53      30.1      93%

2048       1:23      46.9       56%        1:30      52.3      57%

4096       1:31     1:25.2      93%        1:40    1:36.4      96%



                        TABLE 4













                             - 13 -







Report No. 3970                      Bolt Beranek and Newman Inc.



     As  can  be  seen,  port I/O involves slightly more overhead

than pipe I/O.  At 4096  bytes/block,  throughput  is  about  325

Kbps, still a pretty reasonable rate.

















































                             - 14 -







Report No. 3970                      Bolt Beranek and Newman Inc.



4. PERFORMANCE ON THE PDP-11/70







     For the sake of comparison, t.pipe1.c and t.port1.c were run

on an otherwise unoccupied PDP-11/70 system with dual RP06 disks.

The results for pipes and ports appear below.









































                             - 15 -







Report No. 3970                      Bolt Beranek and Newman Inc.



                        PIPE PERFORMANCE





                     STANDARD SYSTEM                IMPROVED SYSTEM

BLOCK SIZE      EXECUTION TIME    CPU           EXECUTION TIME    CPU
                                 USAGE                           USAGE

                real   system                   real    system




2               3       3.0       100%          5       3.7        74%

4               5       3.6        72%          4       3.8        95%

8               4       3.2        80%          6       4.4        73%

16              4       3.3        82%          5       4.2        84%

32              5       4.0        80%          4       4.0       100%

64              5       4.1        82%          5       4.3        86%

128             6       5.0        83%          6       5.4        90%

256             7       5.4        77%          6       5.5        91%

512             15      6.6        44%          8       6.6        82%

1024            30      11.6       38%          11      10.6       96%

2048          1:00      20.4       34%          20      18.3       91%

4096          1:58      38.1       32%          36      34.6       96%



                        TABLE 5











                             - 16 -







Report No. 3970                      Bolt Beranek and Newman Inc.



                        PORT PERFORMANCE



                STANDARD SYSTEM                 IMPROVED SYSTEM




BLOCK SIZE      EXECUTION TIME    CPU           EXECUTION TIME    CPU
                                 USAGE                           USAGE

             real  user  system               real  user  system




2            7       .3    6.4     95%         9     .6     7.2    86%

4            7       .5    6.5    100%         8     .5     7.4    98%

8            7       .8    6.1     98%         8     .6     7.4   100%

16           8       .5    6.5     87%         9     .8     7.2    88%

32           8       .6    6.8     92%        10     .6     7.9    85%

64           9      1.0    6.9     87%        10     1.1    8.2    93%

128         10      1.5    7.9     94%        11     1.4    9.1    95%

256         12      2.9    8.9     98%        13     2.2   10.1    94%

512         20      5.3   11.7     85%        17     5.2   11.8   100%

1024        35      9.7   17.4     77%        27     9.9   16.2    96%

2048      1:08     19.0   27.7     68%        46    19.0   26.1    98%

4096      2:25     36.5   47.9     13%      1:23    38.1   43.9    98%



                        TABLE 6










                             - 17 -







Report No. 3970                      Bolt Beranek and Newman Inc.



The improvements observed on the PDP 11/70 reflect those seen  on

the  11/40.   At  4096  bytes/block,  throughput for pipes is 910

Kbps, an improvement of a factor of 4 over the standard system.

















































                             - 18 -







Report No. 3970                      Bolt Beranek and Newman Inc.



5. REAL-WORLD PERFORMANCE EVALUATION



     All the statistics presented above were obtained on an  oth-

erwise inactive system.  If other processes were sharing the sys-

tem, this would clearly degrade performance.  The improved system

depends  very  heavily  on  being able to keep a `working set' of

buffers in core.  If other processes are competing  for  buffers,

we  would  expect  disk  activity  for a port or pipe file to in-

crease.



     To  measure this increase, the following test was conducted.

Two processes were run in  background,  each  of  which  executed

t.pipe1.c.   Additionally, a third process was run which copied a

file continuously (thus guaranteeing continuous  disk  activity).

The  test  was  run twice.  In the first experiment, the two pipe

routines each transferred 1000 blocks  of  128  bytes  each.  The

times  for these two routines were then recorded. Results for two

such runs were:



















                             - 19 -







Report No. 3970                      Bolt Beranek and Newman Inc.



                STANDARD SYSTEM                 IMPROVED SYSTEM


                EXECUTION TIME                  EXECUTION TIME

                real    system                  real    system




RUN 1
___ _



PROCESS 1:      2:05     9.3                    1:40     11.4

PROCESS 2:      2:07     9.3                    1:40     11.5




RUN 2
___ _



PROCESS 1:      2:15     9.4                    1:46    11.2

PROCESS 2:      2:24     9.2                    1.46    11.6



                        TABLE 7






At  this  small  blocksize, the improved system shows a small but

noticeable improvement in real time.  System time,  however,  has

gone  up  -  the  extra checking required to prevent unnescessary

disk activity has its cost.



In  the second experiment, the two pipe routines each transferred

100 blocks of 4096 bytes each.  The times for two such runs was:



                             - 20 -







Report No. 3970                      Bolt Beranek and Newman Inc.



                STANDARD SYSTEM         IMPROVED SYSTEM


                EXECUTION TIME          EXECUTION TIME

                real     system         real      system




RUN 1
___ _



PROCESS 1:      1:36    9.6             29      9.5

PROCESS 2:      1:47    10.3            46      9.8




RUN 2
___ _



PROCESS 1:      1:52    10.0            37      10.1

PROCESS 2:      1:53    9.8             39      9.5



                        TABLE 8






The improvement is now much more noticeable - the improved system

is more than three times as fast as the standard version.













                             - 21 -







Report No. 3970                      Bolt Beranek and Newman Inc.



6. REFERENCES





[1]  J.F. Haverty and R.D. Rettburg, "Interprocess Communications

     for a Server in UNIX",  Proceedings of Compcon 78:  Com-

     puter Communications Networks, September 1978, pp. 312-315.





[2]  G. Fostel, "A Note on UNIX 11/40 Communication Rates", BBN

     Departmental Memo,  July, 1978.





[3]  "Interprocess Communication Extensions for the UNIX Operating

     System: II. Implementation", Rand  Corporation Report  No.

     R-2064/2-PR, Rand Corporation, Santa Monica, CA, 22, April

     1977.























                             - 22 -