[TUHS] Pipes and PRISM

Wed Mar 2 02:56:37 AEST 2022

Last week there was a bit of discussion about the different shells
that would eventually lead to srb writing the shell that took his name
and the command syntax and semantics that most modern shells use
today.   Some of you may remember, VMS had a command interpreter
called DCL (Digital Command Language), part of an attempt to make
command syntax uniform across DEC's different operating systems
(TOPS-20 also used DCL).  As DEC started to recognize the value of the
Unix marketplace, a project was born in DEC's Commercial Languages and
Tools group to bring the Unix Bourne shell to VMS and to sell it as a
product they called DEC Shell.

I was part of that effort, and one of the issues we had to solve
was providing formal UNIX pipe semantics.  We of course needed to
somehow implement UNIX-style process pipelines.  VMS from the
beginning has had an interprocess communications pseudo-device called
the mailbox that can be written to and read from via the usual I/O
mechanism (the QIO system service).  A large problem with them is that
it is not possible to detect the "broken pipe" condition with a
mailbox and that feature deficiency made them unsuitable for use with
DEC Shell.  So the team had me write a new device driver, based
closely on the mailbox driver, but one that could detect broken pipes
UNIX-style.

Shortly after I finished the VMS pipe driver, the team at DECwest had
started work on the MICA project, which was Dave Cutler's proposed OS
unification.  Dave's team had developed a machine architecture called
PRISM (Proposed RISC Machine) to be the VAX follow-on.  For forward
compatibility purposes, PRISM would have to support both Ultrix and
VMS.  Dave and team had already written a microkernel-based,
lightweight OS for VAX called VAXeln that was intended for real-time
applications.  His new idea was to have a MACH-like microkernel OS
which he called MICA and then to put three user mode personality
modules on top of that:

    P.VMS, implementing the VMS system services and ABI
    P.Ultrix, implementing the Unix system calls and ABI
    P.TBD, a new OS API and ABI intended to supersede VMS

So I wrote the attached "why pipes" memo to explain to Cutler's team
why it was important to implement pipes natively in P.TBD if they
wanted that OS to be a viable follow-on to VMS and Ultrix.

In the end, Dick Sites's 64-bit RISC machine architecture proposal,
which was called Alpha, won out over PRISM. Cutler and a bunch of his
DECwest engineering team went off to Microsoft.  Dave's idea of a
microkernel-based OS with multiple personalities of course saw the
light of day originally as NT OS/2, but because of the idea of
multiple personalities, when Microsoft and IBM divorced Dave was able
to quickly pivot to the now infamous Win32 personality, as what would
be called Windows NT.  It was also easy for Softway Systems to later
complete the NT POSIX layer for their Interix product, which now a few
generations later is called WSL by Microsoft.

-Paul W.
-------------- next part --------------
The prime functional characteristics of pipes on Unix are:

	1) they are a pseudo-device that can be created on demand by programs

	2) communication is one-way, from one or more write channels to one or
	   more read channels

	3) if either all read channels or all write channels are deassigned,
	   the other communicating partners are notified via an error condition
	   (Unix calls this "broken pipe").

The last characteristic, detection of broken pipes, is the key.  The restriction
of pipes to one-way communication is required for broken pipe detection.
If there were exactly one reader and one writer, then it would be easy to
detect a broken pipe.  If the channel count drops to 1, then the pipe is broken.
This is how DECnet-VAX detects "network partner exited" conditions.  However,
because fork(2) can cause the cloning of open I/O channels, this isn't
sufficient in the Unix pipe case.  One can have multiple readers and multiple
writers.  It therefore is necessary to restrict each individual I/O channel to
be either read-only or write-only.  With that restriction in place, the I/O
system can detect "broken pipe"--the condition exists when there are readers
but no writers, or writers but no readers.
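
The two broken-pipe cases can be observed directly on a Unix system.  The
following is a minimal sketch in Python, assuming a POSIX host; the function
name demo_broken_pipe is made up for this illustration.  It uses os.dup to
stand in for the channel cloning that fork(2) causes.

```python
import os

def demo_broken_pipe():
    """Show both broken-pipe cases on a POSIX pipe (illustrative only)."""
    # Case 1: writers but no readers.
    r, w = os.pipe()
    os.close(r)                      # the only read channel goes away
    try:
        os.write(w, b"hello")
        write_status = "ok"
    except BrokenPipeError:          # EPIPE; CPython ignores SIGPIPE by default
        write_status = "EPIPE"
    os.close(w)

    # Case 2: readers but no writers, with a cloned write channel.
    r, w = os.pipe()
    w2 = os.dup(w)                   # second write channel, as fork(2) would leave behind
    os.close(w)                      # one writer gone: the pipe is not broken yet
    os.write(w2, b"tail")
    os.close(w2)                     # last writer gone
    first = os.read(r, 16)           # buffered data is still delivered
    second = os.read(r, 16)          # an empty result signals end-of-file
    os.close(r)
    return write_status, first, second
```

Note that closing one of two write channels does not break the pipe; only
the last close does, which is why a simple "channel count drops to 1" test
is not enough.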

VMS mailboxes have most of the characteristics required for pipes.  There is
a service to create the pseudo-devices from user-mode code.  You can have
multiple read and write channels.  The only problem is that you cannot detect
the broken pipe condition.  One cannot do this because channels assigned to
a mailbox can be used for both reading and writing.  There is no way to tell
which is which, so one cannot tell when all readers have gone away or when all
writers have gone away.

This is why I had to write a new driver to support pipes on VMS.  The driver
was written from scratch, but follows the design of the mailbox driver very
closely.  Here is a functional summary of the VMS pipe driver.

- You create a pipe by assigning a channel to the template device PIPE0:.
  You get back a single I/O channel to a newly-cloned pipe pseudo-device.
  You can use $GETDVI to find out the name of the new device so that you can
  assign other channels to it.  The device is created with the characteristics
  bits MBX, REC, SHR, IDV, ODV, device class DC$_MAILBOX, device type
  DC$_PIPE.  RMS treats it as a mailbox.

  I opted for dynamic UCB cloning rather than building a $CREPIPE system
  service to do the cloning.  Since I was not in the VMS group, I could not
  add a system service easily.  Also, using dynamic cloning is more flexible.
  It makes it easy to create pipes from DCL level, for example, whereas it is
  nearly impossible to create and use mailboxes from DCL level because there
  is no way to get at $CREMBX from command level.

  This does mean, though, that one cannot control buffer quota, max. message
  size, device protection, or assignment of a logical name as part of the
  device creation call.  Max. message size (the UCB device buffer size) and
  device protection can be set by IO$_SETMODE calls.  The pipe driver does not
  allow for user control of buffer quota (more about that later, though).
  The device protection set upon cloning is (S,O:RWLP,G,W).  This is what you
  want for communication among child processes in the same job tree, which is
  the usual application of pipes.

- When one initially assigns a channel to a pipe, it is "untyped"--the driver
  does not know if it is a read or a write channel.  The first I/O operation
  to the channel determines which type of channel it is.  If the first
  operation is IO$_READxBLK (virtual, logical, and physical are all the same
  for pipes) or IO$_SETMODE!IO$M_WRTATTN, then the channel is a read-only
  channel.  If the first operation is IO$_WRITExBLK or IO$_SETMODE!IO$M_READATTN
  then the channel becomes a write-only channel.  Once the type of a channel
  has been set, attempts to do the opposite operation (a write to a read
  channel or a read to a write channel) generate SS$_ILLIOFUNC errors.

  Sometimes it is desirable to declare the type of a channel before you
  actually do any I/O to it.  The I/O functions IO$_SETMODE!IO$M_READCHAN and
  IO$_SETMODE!IO$M_WRITECHAN exist for this purpose.

- I/O to a pipe is done via the usual IO$_READxBLK, IO$_WRITExBLK, and
  IO$_WRITEOF function codes.  These behave the same way that they do with
  mailboxes, except that there is no IO$M_NOW modifier (see the next item).
  The IO$M_READATTN and IO$M_WRTATTN modifiers to IO$_SETMODE are available
  and function identically to those of the mailbox driver.  There is an IO$M_STREAM
  modifier to IO$_READxBLK and IO$_WRITExBLK that implements true Unix-style
  stream-mode I/O (this is the only device in VMS to do this on the $QIO
  level, in fact).  The default is record mode.  More about stream mode later.

- BUFFERING:  Both mailboxes and pipes buffer all in-transit records in non-
  paged pool.  The drivers differ in the quota bookkeeping, however.

  The mailbox driver grabs a fixed amount of pool quota from the creator of
  the device at $CREMBX time.  The buffer quota is set in the
  $CREMBX call or defaulted from a SYSGEN parameter.  When a program writes to
  a mailbox, the data are copied into non-paged pool and the buffer quota is
  decremented accordingly.  If the quota reaches zero, then the writing process
  is put into RWMBX state (if system service resource wait mode is on), or
  the write fails with SS$_MBFULL status (if system service resource wait mode
  is off).  RWMBX state is quite nasty because it prevents the process from
  being deleted until and unless somebody empties the mailbox enough to let
  the write operation complete.  However, it does mean that writes to a
  mailbox never use up process BYTLM quota (the quota having been previously
  deducted from the creator of the mailbox).

  I asked several old-time VMS developers and was never able to get a
  satisfactory explanation or rationalization for the existence of RWMBX state
  in the system.  It seems to cause more harm than good, so I left it out of
  the pipe driver design.  Pipes have a fixed 4096-byte lien on non-paged
  pool.  When a process writes to a pipe, the driver allocates a buffer of the
  appropriate size from non-paged pool and moves the data into it.  If the
  4096-byte lien has not been fully used up, then the driver does not deduct
  any process BYTLM quota from the writer.  If the 4096-byte lien has been
  used up, then the pipe driver deducts the difference from process BYTLM
  quota--in other words, it is like any run-of-the-mill buffered I/O
  operation.  There is no use of RWMBX state for buffer quota control.

- I/O COMPLETION:  The normal operation of the mailbox driver is for write
  operations not to complete until the record is read from the mailbox.
  One must specify IO$M_NOW if one requires the operation to complete as soon
  as the data are moved into the mailbox buffer in non-paged pool.  Likewise,
  the mailbox driver provides an IO$M_NOW function for reads to allow a reader
  to poll the mailbox for the presence of messages.

  The pipe driver does not provide an IO$M_NOW function.  A write operation
  always completes immediately if the message is covered by the 4096-byte
  lien on non-paged pool.  If user process BYTLM was required to cover all or
  part of the message, then the write operation does not complete until all
  of the message is covered by the lien, or until the message is read from
  the pipe.  Thus, a write of a message longer than 4096 bytes will not
  complete until all but 4096 bytes of it have been read (if mixed record
  and stream mode I/O is being done, it is possible for messages to be
  partially read--more about this below).  Reads never complete until
  something has been read.

  I could have implemented IO$M_NOW, but I chose not to do so.  The chosen
  implementation offers proper quota management (without RWMBX state), and
  allows for asynchronous operation of the readers and writers in the default
  I/O case (e.g., RMS I/O), something that doesn't happen with mailboxes.
  A writer doing RMS I/O to a mailbox stalls until the message is read.  With
  pipes, he does not stall until he gets over 4096 bytes ahead of the reader.
  The pipe driver design allows efficient processing overlap in pipelines
  without the need for any special programming.

- MAILBOX MODE VS. ULTRIX MODE:  In a Unix-style pipeline, several images may
  be run in succession with their output going down the pipe.  RMS thinks the
  pipe is a mailbox, and so it writes an EOF record to the pipe every time one
  of the images closes its file on the pipe.  In this case, we want to ignore
  the EOF records and treat the breakage of the pipe at the end of the whole
  I/O sequence as the EOF condition.  However, in "normal" VMS usage, one
  wants to pass EOF records just as the mailbox driver does.

  I invented the concept of "Ultrix mode" versus "mailbox mode" to handle this.
  Pipes are created in mailbox mode.  The $QIO call IO$_SETMODE!IO$M_ULTRIX places
  the pipe device in Ultrix mode.  IO$_SETMODE!IO$M_MAILBOX will put it back.

  There are two differences between the modes:

	a) In mailbox mode, an IO$_WRITEOF operation puts an EOF record in
	   the pipe.  In Ultrix mode, IO$_WRITEOF completes successfully but
	   is a no-op.

	b) In mailbox mode, reads to a broken pipe terminate with SS$_LINKDISCON
	   status.  In Ultrix mode, reads to a broken pipe terminate with
	   SS$_ENDOFFILE status.

- BROKEN PIPE NOTIFICATION:  The pipe driver keeps counts of the numbers of
  read and write channels assigned, and two additional state bits:  readers-
  have-existed and writers-have-existed.  The readers-have-existed bit is
  set when the first read channel is assigned.  The writers-have-existed bit is
  set when the first write channel is assigned.  A broken pipe condition
  exists whenever:

	a) a write operation is pending on the pipe, readers-have-existed is
	   set, but the current count of read channels is zero

	b) a read operation is pending on the pipe, the pipe is empty,
	   writers-have-existed is set, but the current count of write
	   channels is zero

  The two "have-existed" bits exist to coordinate startup of the pipe
  communication.  Without those bits, there would be a race condition between
  the first write to the pipe and the first read to the pipe.  It is not an
  error to be writing to a pipe that has no readers and has never had readers,
  or to be reading from a pipe that has no writers and has never had writers.
  There is the potential for a hang condition here if, for example, the
  reader process dies before it ever gets a chance to open its channel to the
  pipe.  The same potential exists on Unix.  In practice, it is not a
  problem, especially since writes to a pipe cannot put you in a resource-wait
  state from which there is no exit (RWMBX state).

  If all writers have exited, readers can continue to read from the pipe
  without error until they have emptied it.  This is necessary so that writers
  don't have to wait around for all of their data to be read.

  Reads to a broken pipe complete with SS$_ENDOFFILE status if the pipe has
  been set in Ultrix mode, or with SS$_LINKDISCON (network partner
  disconnected logical link) status if the pipe is in mailbox mode (the
  default).

  Writes to a broken pipe complete with SS$_LINKDISCON status regardless of
  mode.

- STREAM MODE:  The pipe driver provides a modifier, IO$M_STREAM, for both
  IO$_READxBLK and IO$_WRITExBLK.  The presence of the modifier indicates
  that the I/O operation is to be performed in stream mode rather than in
  record mode.

  A stream mode read operation always reads the requested number of bytes
  from the pipe.  It ignores record boundaries.  For example, if there are
  three 10-byte records in the pipe, and a $QIO specifies a 15-byte stream
  mode read, the first record and the first 5 bytes of the second record
  will be read and put in the user's buffer as one chunk of data.  That
  will leave the remaining 5 bytes of record 2 and all of record 3 in the
  pipe.  A subsequent $QIO read in record mode will read the 5 bytes of
  record 2.

  There is one case where a stream mode read doesn't read exactly the number
  of bytes that the user specified.  That is if end-of-pipe is detected.
  End-of-pipe is either an EOF record in the pipe, or a broken pipe.  In both
  of these cases, the read in progress terminates with a short byte count.
  The next read issued picks up the EOF or LINKDISCON condition.  This is
  exactly the Unix semantics for reads from pipes.

  A stream mode write operation is the same as a record mode write operation
  except that the write doesn't imply a record boundary.  For example, suppose
  there are three $QIOs specifying stream mode write, each for 5 bytes,
  followed by two record mode writes, each for 10 bytes.  A record mode reader
  will see two records.  The first is 25 bytes long, the second is 10 bytes
  long.
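
The two stream-mode examples above can be checked against a toy model.  What
follows is a sketch in Python, not the driver's real data structures: the
class PipeBuffer and its method names are invented for illustration, and it
leaves out stalling, quota, and the short read at end-of-pipe.

```python
from collections import deque

class PipeBuffer:
    """Toy model of record and stream buffering in a pipe (hypothetical)."""

    def __init__(self):
        self.records = deque()    # each entry is one bytearray record
        self.tail_open = False    # True if the last record has no boundary yet

    def write(self, data, stream=False):
        # A write extends the last record if a stream write left it open.
        if self.tail_open and self.records:
            self.records[-1] += data
        else:
            self.records.append(bytearray(data))
        # A record-mode write closes the record; a stream-mode write does not.
        self.tail_open = stream

    def read_record(self):
        # Record mode: return whatever is left of the record at the head.
        return bytes(self.records.popleft())

    def read_stream(self, count):
        # Stream mode: take bytes across record boundaries until satisfied.
        out = bytearray()
        while self.records and len(out) < count:
            head = self.records[0]
            take = count - len(out)
            out += head[:take]
            del head[:take]
            if not head:
                self.records.popleft()
        return bytes(out)

# Three 10-byte records, then a 15-byte stream read and two record reads.
p = PipeBuffer()
for _ in range(3):
    p.write(b"x" * 10)
read_sizes = [len(p.read_stream(15)), len(p.read_record()), len(p.read_record())]

# Three 5-byte stream writes followed by two 10-byte record writes.
q = PipeBuffer()
for _ in range(3):
    q.write(b"s" * 5, stream=True)
for _ in range(2):
    q.write(b"r" * 10)
record_sizes = [len(q.read_record()), len(q.read_record())]
```

In this model read_sizes comes out as 15, 5, 10 and record_sizes as 25, 10,
matching the two examples in the text.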

I will send you a copy of the complete pipe driver specification so you can
see how this looks in toto.
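
As a rough illustration of the channel-typing and broken-pipe rules
summarized above, here is a toy Python model.  It is a sketch under stated
assumptions, not the driver: the class and method names are invented, a
stalled read is represented by returning None, and I assume a channel counts
as a reader or writer from the moment it is typed.  Only the status names
(SS$_ILLIOFUNC, SS$_LINKDISCON, SS$_ENDOFFILE) come from the description.

```python
class PipeStatus(Exception):
    """Raised with a VMS-style status name, e.g. 'SS$_ILLIOFUNC'."""

class Pipe:
    def __init__(self):
        self.messages = []                              # buffered records
        self.count = {"read": 0, "write": 0}            # typed channels now assigned
        self.existed = {"read": False, "write": False}  # the "have-existed" bits
        self.ultrix_mode = False                        # mailbox mode is the default

class Channel:
    def __init__(self, pipe):
        self.pipe = pipe
        self.kind = None          # untyped until first I/O or explicit declaration

    def declare(self, kind):
        # The first operation fixes the type; the opposite one is illegal later.
        if self.kind is None:
            self.kind = kind
            self.pipe.count[kind] += 1
            self.pipe.existed[kind] = True
        elif self.kind != kind:
            raise PipeStatus("SS$_ILLIOFUNC")

    def write(self, data):
        self.declare("write")
        p = self.pipe
        # Broken pipe: readers have existed, but none remain.
        if p.existed["read"] and p.count["read"] == 0:
            raise PipeStatus("SS$_LINKDISCON")
        p.messages.append(data)

    def read(self):
        self.declare("read")
        p = self.pipe
        if p.messages:            # readers may drain the pipe after writers exit
            return p.messages.pop(0)
        # Broken pipe: empty, writers have existed, but none remain.
        if p.existed["write"] and p.count["write"] == 0:
            raise PipeStatus("SS$_ENDOFFILE" if p.ultrix_mode else "SS$_LINKDISCON")
        return None               # a real read would stall here

    def deassign(self):
        if self.pipe is not None and self.kind is not None:
            self.pipe.count[self.kind] -= 1
        self.pipe = None          # the channel is unusable afterwards
```

Writing before any reader has ever existed succeeds in this model, which is
the startup case the two "have-existed" bits are there to permit.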

IMPLICATIONS FOR P.TBD INTERPROCESS COMMUNICATION

Clearly P.ULTRIX requires Unix-compatible pipes.  The driver design outlined
above accomplishes this in a way that is compatible with simultaneous use by
a record-oriented access method such as RMS.  IO$M_STREAM probably isn't
necessary:  the Ultrix read(2) and write(2) facilities, which are what present
a stream mode interface to programs, have to deal with block-oriented devices
such as disks and tapes anyway, so they are capable of doing the necessary
record blocking and deblocking to make anything appear to be stream-oriented
regardless of its underlying record-oriented characteristics.  One thing
that you get for free with the VMS pipe driver design is that pipes have
names.  "Named pipes" are a relatively recent innovation in the Unix world
and are all the rage these days.
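
To make the deblocking point concrete, here is a minimal sketch in Python of
a layer that presents a record-oriented source as a byte stream.  The class
StreamReader and the read_record callable are hypothetical; this is not how
Ultrix implements read(2), only an illustration that stream semantics can be
layered above a record-mode pipe.

```python
class StreamReader:
    """Presents a record source as a byte stream (illustrative sketch)."""

    def __init__(self, read_record):
        self.read_record = read_record   # callable: returns bytes, or None at end
        self.pending = b""               # bytes deblocked but not yet delivered
        self.at_end = False

    def read(self, count):
        # Pull whole records until the request can be met or the source ends.
        while len(self.pending) < count and not self.at_end:
            record = self.read_record()
            if record is None:
                self.at_end = True
            else:
                self.pending += record
        # Hand back up to `count` bytes and keep the remainder for next time.
        out, self.pending = self.pending[:count], self.pending[count:]
        return out

# Three 10-byte records read through the stream layer.
source = iter([b"0123456789"] * 3)
reader = StreamReader(lambda: next(source, None))
chunk_a = reader.read(15)    # spans the first record boundary
chunk_b = reader.read(100)   # short read: only 15 bytes remain
chunk_c = reader.read(1)     # empty result at end of data
```

The caller never sees the record boundaries; the leftover half of a record
simply waits in the pending buffer for the next read.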

It is not as clear that P.VMS needs VAX/VMS-compatible mailboxes.  In the
vast majority of cases, channels assigned to mailboxes are used either
exclusively for reading or exclusively for writing, and therefore pipes would
suffice.  In fact, use of pipes in place of mailboxes would relieve
implementors of all the defensive programming you need with mailboxes to get
around the fact that with a mailbox there's no way to tell that the other
end of the communications link has gone away.  To be conservative, though,
it's probably a good idea to provide a mailbox-compatible facility.

I think that all of the needs in this space could be addressed by a single
I/O facility for interprocess communication.  The desirable characteristics
are:

1)  The pseudo-device object is created by the P.TBD equivalent of dynamic
    UCB cloning on VAX/VMS.  That is, the object is created when an I/O
    channel is assigned to a template object.  It should be possible to get
    one of these devices without doing an explicit system service call such
    as SYS$CREMBX or pipe(2).  Using dynamic UCB cloning allows the
    pseudo-devices to be created from command level without any special
    support (such as a lexical function).

2)  The object has two major modes of operation:  mailbox mode and pipe mode.
    Upon creation, the object is in pipe mode.  The P.TBD equivalent of
    an IO$_SETMODE $QIO operation switches the device between mailbox mode
    and pipe mode.

    a)  In mailbox mode, the device acts like a VAX/VMS mailbox.  Channels can
        be used for either reading or writing.  You get no "broken pipe"
        notification.  Operations stall unless an IO$M_NOW modifier is present.

    b)  In pipe mode, the device acts like a Unix pipe.  Channels can be
        used only for reading or writing, but not both.  The first operation
        to a channel determines its type (read-only or write-only).  Type
        can be explicitly declared via an IO$_SETMODE-like call.  Broken pipe
        notification semantics are as for the VAX/VMS pipe driver when
        operating in Ultrix mode.  IO$_WRITEOF is a no-op.  IO$M_NOW is
        ignored--the device always behaves like the VAX/VMS pipe driver
        as regards stalling.

    c)  The equivalents of the IO$M_READATTN and IO$M_WRTATTN operations should
        operate the way they do for VAX/VMS mailboxes regardless of mode.
        These two operations set the type of the I/O channel.

3)  The VMS compatibility library would supply a SYS$CREMBX call that would
    assign a channel to the pseudo-device template object, put the object in
    mailbox mode, do IO$_SETMODE-equivalent calls to set the protection and
    buffer size characteristics the way the user wanted them, then return
    the assigned channel to the caller.

4)  The P.Ultrix library would supply a pipe(2) call that would assign a
    channel to the pseudo-device template object, assign another channel to
    the cloned pipe object, then do an IO$_SETMODE!IO$M_READCHAN-equivalent
    I/O call to set the first channel read-only, and an IO$M_WRITECHAN-
    equivalent call to set the second channel write-only.  Then it would
    return both channels to the caller.

5)  No equivalent of the RWMBX resource wait state should be provided.
    If a pipe or mailbox fills, the current I/O operation merely should be
    stalled.

I hope all this stuff is helpful.  If there are any specific questions about
pipes or mailboxes that I can answer, just ask.

--PSW