BSD File Systems and MSCP

Tue Nov 22 17:01:37 AEST 1988

In article <652 at hscfvax.harvard.edu> pavlov at hscfvax.harvard.edu (G.Pavlov)
writes:
>The several discussions about the BSD file system and one reference to MSCP
>brings up something that we have played with but don't really understand: the
>interaction between tunefs and DEC's MSCP.  A recent message from Chris Torek
>implied that manipulating tunefs parameters may impact performance in the DEC
>Ultrix environment.

tunefs *may* affect performance anywhere; it need not actually do so.
I have never been able to get any decent results from RA81s (all my
timings were always the same, or close enough not to matter).

>Actually, I always wondered what really happens "between" tunefs and
>MSCP.  To me, on paper, they seem to be in direct conflict with each
>other, purpose-wise.  Anyone know?

Not really.

MSCP stands for `Mass Storage Control Protocol'.  It defines a (rather
overly large) set of commands and responses that make up transactions
between host computers and disk drives.  The command set is comparable
to the SCSI disk command set.  Opcodes include `get unit status', `set
controller characteristics', `set unit on line', `read', and so forth.
There are numerous modifiers (`express', `compare', `suppress
shadowing', etc.) and return status values (`ok', `offline', `write
protected', etc.), and even sub-modifiers (`offline because in
diagnostics', etc.).  Transactions are sent as `datagrams' (message
blocks), which come in different flavours according to the opcode.
Fortunately, all datagrams are the same size.

The most interesting commands are `read' and `write', and `get unit
status' and `on line'.  Read and write use only generic fields,
including a byte count, a `buffer descriptor' (really a bus address), a
page map address (for 610/8200/8300/8500/8800, and probably 6200), a
logical block number, and a `command reference number'.  The last is
simply a number you hand to the controller that the controller will
hand back in its response; the BSD and Ultrix drivers stuff into it
a pointer to the `struct buf' that caused the read or write.  [Aside:
early Emulex SC41/MS ROMs have a bug that causes them occasionally to
zero out the low word of this field.  The 4.3BSD-tahoe driver has a
hack to work around this.]  Get Unit Status and On Line commands
return more information: GUS includes the tuple <s, t, g>, where
s is sectors/track, t is tracks/group, and g is groups/cylinder;
ONLINE includes the drive size in sectors.  From these one can compute
<s, t, g, c>, where c is data cylinders.  (`g' exists mainly because
switching heads is slow on RA81s.  Normally one of t and g will be
1, and the other will be tracks/cylinder.)
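
A minimal sketch of that computation, assuming the three GUS values and the ONLINE unit size have already been pulled out of the response packets.  The struct and function names here are hypothetical, not the ones the BSD or Ultrix drivers use:

```c
#include <stdint.h>

/* Hypothetical holder for the tuple <s, t, g, c> described above. */
struct mscp_geom {
    uint32_t s;   /* sectors/track    (from Get Unit Status) */
    uint32_t t;   /* tracks/group     (from Get Unit Status) */
    uint32_t g;   /* groups/cylinder  (from Get Unit Status) */
    uint32_t c;   /* data cylinders   (derived) */
};

/*
 * Derive c from the GUS tuple and the unit size in sectors reported
 * by ONLINE.  Returns 0 on success, -1 if any GUS field is zero.
 * Any partial cylinder at the end of the unit is simply dropped.
 */
int mscp_compute_geom(uint32_t s, uint32_t t, uint32_t g,
                      uint32_t unit_size, struct mscp_geom *out)
{
    uint64_t sectors_per_cyl;

    if (s == 0 || t == 0 || g == 0)
        return -1;

    /* 64-bit product so odd controller values cannot overflow. */
    sectors_per_cyl = (uint64_t)s * t * g;

    out->s = s;
    out->t = t;
    out->g = g;
    out->c = (uint32_t)(unit_size / sectors_per_cyl);
    return 0;
}
```

Since one of t and g is normally 1, the product t*g is just tracks/cylinder, and c is the unit size divided by sectors/cylinder.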

The important thing to note here is that the drive and controller will
not tell you what sort of mapping exists between `logical block number'
and actual physical location on the disk.  (The same is true of SCSI
commands.)  One can, however, guess at the relationship, knowing that
the only sane layout is for LBNs 0..s-1 to refer to cylinder 0, track
0; s..2s-1 to cyl 0, track 1; 2s..3s-1 to cyl 0, track 2; and so
forth.  [2nd aside: on the RA60, head switch is *slower* than
adjacent track seek.  The best mapping is to interlace t and c
divisions.  The BSD FFS is not particularly well equipped to deal with
this.]  Note also that `s' here is not necessarily the number of
physical sectors per track.  RA81s reserve one sector on each track as
a replacement for a bad sector.  When a bad sector is replaced with the
spare on the same track, this is considered a `primary' replacement,
and---I am guessing here---the controller rewrites the sector headers
so that the replacement *is* the original sector.  That is, if the skew
factor were 1, the old order might be 0,1,2,3,4,5,6,<spare>; the new
order might then be 0,1,2,3,<bad>,5,6,4.  This is a reasonable
approach, although one can argue that the new order should be
0,1,2,3,<bad>,4,5,6, entailing moving sectors 5 and 6.  Non-primary (or
`secondary', although the fifth replacement of a bad sector is still
`secondary') replacements are `different', but still involve rewriting
the sector header.  If the sector headers cannot be written, this is a
`tertiary' replacement.
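
The primary-replacement guess above can be modelled as a small relabelling of sector headers.  This is only an illustration of the guess, not of what any controller actually does; the array representation, the marker values, and the function name are all made up for the example:

```c
#include <stddef.h>

#define HDR_SPARE (-1)   /* slot holds the track's unused spare */
#define HDR_BAD   (-2)   /* slot has been marked bad */

/*
 * hdr[i] is the sector number written in the header of physical
 * slot i on one track.  Primary replacement, as guessed above:
 * mark the bad slot and give its sector number to the spare.
 * Returns 0 on success, -1 if the sector or a free spare is missing
 * (which would force a non-primary replacement).
 */
int primary_replace(int *hdr, size_t nslots, int bad_sector)
{
    size_t i, bad = nslots, spare = nslots;

    if (bad_sector < 0)
        return -1;

    for (i = 0; i < nslots; i++) {
        if (hdr[i] == bad_sector)
            bad = i;
        else if (hdr[i] == HDR_SPARE && spare == nslots)
            spare = i;
    }
    if (bad == nslots || spare == nslots)
        return -1;

    hdr[bad] = HDR_BAD;
    hdr[spare] = bad_sector;
    return 0;
}
```

Starting from 0,1,2,3,4,5,6,<spare> and replacing sector 4, this yields 0,1,2,3,<bad>,5,6,4, the first of the two orders discussed above.  The alternative order would also have to move sectors 5 and 6.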

So what does all this mean with respect to the Fast File System?

The FFS does its job by arranging to put the data blocks of each file
into locations that produce the fastest possible access, assuming
sequential or near-sequential reading.  (Even near-random access is
improved in most cases, since speeding sequential accesses involves
clustering.)  To do this, it expects:

 - predictable (simple function) seek and rotational delays;
 - zero head switch time (nonzero times are a planned extension);
 - predictable (constant) sectors per track and tracks per cylinder;
 - probably others I cannot think of offhand.

MSCP does not guarantee any of these, and in particular, is careful
not to promise that rotational and seek delay functions, which depend
on sector and track layout respectively, are simple functions.  Since
logical-block to physical-block mapping is done by the controller,
it is allowed to make arbitrarily bad choices.

The fact is, however, that it does not do so.  LBN to PBN mapping is
done the obvious way, except in a few cases (replaced bad sectors).
This is equivalent to older formats (e.g., DEC STD 144 on Massbus
disks).  The main difference is that the available MSCP controllers
all seem to be slower than their Massbus counterparts, not to mention
being forced to do more, including the LBN->PBN mapping.  [Funny
thing: the integer divide in a $500k VAX is faster than the integer
divide in a $10k peripheral board... :-) ]
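
For concreteness, here is a sketch of the `obvious' mapping guessed at earlier (LBNs 0..s-1 on cylinder 0, track 0; s..2s-1 on cylinder 0, track 1; and so on).  It ignores replaced sectors and any track skew, and the names are hypothetical.  `tpc' stands for tracks/cylinder, i.e. t*g:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical physical address: cylinder, track, sector in track. */
struct chs {
    uint32_t cyl;
    uint32_t track;
    uint32_t sector;
};

/*
 * Map a logical block number to <cyl, track, sector>, assuming the
 * simple linear layout.  s is sectors/track as reported by the
 * controller; tpc is tracks/cylinder.  Both must be nonzero.
 */
struct chs lbn_to_chs(uint32_t lbn, uint32_t s, uint32_t tpc)
{
    struct chs p;
    uint64_t sectors_per_cyl;

    assert(s != 0 && tpc != 0);
    sectors_per_cyl = (uint64_t)s * tpc;

    p.sector = lbn % s;
    p.track  = (lbn / s) % tpc;
    p.cyl    = (uint32_t)(lbn / sectors_per_cyl);
    return p;
}
```

These few divides and remainders are the whole of the work in the common case; only replaced sectors need anything more.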
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris at mimsy.umd.edu	Path:	uunet!mimsy!chris