I always used the design principle "Write locally, read over NFS".  This obviated locking issues and fit in with the idea of fate-sharing: a write would always succeed, even if reading would have to wait until R (the machine doing the reading) was up.  The only additional thing I needed was a way for W (the machine doing the writing) to notify R that something had changed, which I did by having R run a process listening on a port to which W would open and then immediately close a connection: no data flowed over this connection.  If the connection could not be made, the process on the W side would loop with bounded exponential backoff.
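Something like this rough C sketch of the W-side notifier (the names and the 64-second cap are illustrative, not the original code):

    /* W side: "ping" R by opening and closing a connection; no data flows.
     * If R is down, retry with bounded exponential backoff. */
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>

    #define MAX_BACKOFF 64            /* cap the sleep, in seconds */

    int
    notify(const char *rhost, const char *port)
    {
        struct addrinfo hints, *res, *ai;
        int s, delay = 1;

        memset(&hints, 0, sizeof hints);
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo(rhost, port, &hints, &res) != 0)
            return -1;
        for (;;) {
            for (ai = res; ai != NULL; ai = ai->ai_next) {
                s = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
                if (s < 0)
                    continue;
                if (connect(s, ai->ai_addr, ai->ai_addrlen) == 0) {
                    close(s);         /* the connection itself is the message */
                    freeaddrinfo(res);
                    return 0;
                }
                close(s);
            }
            sleep(delay);             /* R not up yet; wait and try again */
            if (delay < MAX_BACKOFF)
                delay *= 2;           /* bounded exponential backoff */
        }
    }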

On Sun, Jul 5, 2020 at 4:09 PM Clem Cole <clemc@ccc.com> wrote:


On Sun, Jul 5, 2020 at 10:43 AM Larry McVoy <lm@mcvoy.com> wrote:
My guess is that other people didn't understand the "rules" and did
things that created problems.  Sun's clients did understand and did
not push NFS in ways that would break it.
I >>believe<< that a difference was that file I/O was based on mmap on SunOS and not on other systems (I don't know about Solaris).   Write errors were handled by the OS memory system.  You tell me how SGI handled I/O.  Tru64 used mmap, and I think macOS does too from its Mach heritage.   RTU/Ultrix was traditional BSD.  Stellix was SVR3.  Both had a file system cache with write-behind.

I never knew for sure, but I always suspected that was the crux of the difference in how and where write failures were handled.  But as you pointed out, many production NFS sites not running Suns had huge problems with holes in files that were not discovered until it was too late to fix them.  SCCS/RCS repositories were particularly suspect, and because people tried to use them for shared development areas, it could be a big issue.
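For what it's worth, that failure mode is easy to see in miniature.  With write-behind, write(2) returns success as soon as the data is in the local cache; the server's error, if any, only comes back at fsync(2) or close(2), and a program that checks neither never learns about the hole.  (On the mmap-based systems the error instead surfaced through the VM layer, e.g. at msync() or as a SIGBUS.)  The sketch below is illustrative only; the path is made up:

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
        /* hypothetical NFS-mounted path, for illustration only */
        int fd = open("/n/server/project/file", O_WRONLY | O_CREAT, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        /* "Succeeds" even if the server will later reject the data:
         * the kernel has only copied it into the local cache. */
        if (write(fd, "data\n", 5) != 5)
            perror("write");
        /* A deferred server error (ENOSPC, EDQUOT, ...) shows up here... */
        if (fsync(fd) < 0)
            perror("fsync");
        /* ...or here.  Code that checked neither shipped silent holes. */
        if (close(fd) < 0)
            perror("close");
        return 0;
    }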