All,
I can't believe it's been 9 years since I wrote up my original notes on
getting Research Unix v7 running in SIMH. Crazy how time flies. Well,
this past week Clem found a bug in my scripts that create tape images.
It seems like they were missing a tape mark at the end. Not a showstopper
by any means, but we like to keep a clean house. So, I applied his fixes
and updated the scripts along with the resultant tape image and Warren
has updated them in the archive:
https://www.tuhs.org/Archive/Distributions/Research/Keith_Bostic_v7/
I've also updated the note to address the fixes, to use the latest
version of Open-SIMH on Linux Mint 21.3 "Virginia" (my host of choice
these days), and to bring the transcripts up to date:
https://decuser.github.io/unix/research-unix/v7/2024/05/23/research-unix-v7…
Later,
Will
Well, this is obviously a hot-button topic. AFAIK I was nearby when fuzz-testing for software was invented. I was the main advocate for hiring Andy Payne into the Digital Cambridge Research Lab. One of his little projects was a thing that generated random but correct C programs and fed them to different compilers, or to the same compiler with different switches, to see if they crashed or generated incorrect results. Overnight, his tester filed 300 or so bug reports against the Digital C compiler. This was met with substantial pushback, mostly because many of the reports traced to the same underlying bugs.
Bill McKeeman expanded the technique and published "Differential Testing for Software": https://www.cs.swarthmore.edu/~bylvisa1/cs97/f13/Papers/DifferentialTesting…
Andy had encountered the underlying idea while working as an intern on the Alpha processor development team. Among many other testers, they used an architectural tester called REX to generate more or less random sequences of instructions, which were then run through different simulation chains (functional, RTL, cycle-accurate) to see if they did the same thing. Finding user-accessible bugs in hardware seems like a good thing.
The point of generating correct programs (mentioned under the term LangSec here) goes a long way toward avoiding irritated maintainers. Making the test cases short is also maintainer-friendly, and the test generator is in a position to annotate the source with exactly what it is supposed to do, which helps as well.
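For anyone who wants to play with the idea, the loop itself is tiny. Here is a rough Python sketch of the differential scheme, not Andy's tool: it assumes gcc and clang are on the PATH, and gen_program() is only a stand-in for a real generator of correct C programs such as Csmith.

    # diff_test.py: rough sketch of differential compiler testing.
    # Assumptions: gcc and clang are installed; gen_program() stands in for a
    # real generator of random but well-defined C programs (e.g. Csmith).
    import os, subprocess, sys, tempfile

    def gen_program(seed):
        # Placeholder generator: trivially correct, parameterized by seed.
        return ('#include <stdio.h>\n'
                'int main(void) { printf("%d\\n", (' + str(seed) + ' * 7) % 13); return 0; }\n')

    def build_and_run(cc, flags, src, workdir):
        exe = os.path.join(workdir, cc + ''.join(flags))
        subprocess.run([cc, *flags, '-o', exe, src], check=True)
        r = subprocess.run([exe], capture_output=True, text=True, timeout=10)
        return r.returncode, r.stdout

    for seed in range(100):
        with tempfile.TemporaryDirectory() as d:
            src = os.path.join(d, 't.c')
            with open(src, 'w') as f:
                f.write(gen_program(seed))
            # Same program through different compilers and different switches.
            results = {(cc, tuple(flags)): build_and_run(cc, flags, src, d)
                       for cc in ('gcc', 'clang') for flags in (['-O0'], ['-O2'])}
            if len(set(results.values())) > 1:
                print('seed', seed, 'compilers disagree:', results, file=sys.stderr)

Any disagreement among the builds is a candidate report; folding the reports that trace to the same underlying bug is the part that keeps the maintainers friendly.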
-L
I'm surprised by nonchalance about bad inputs evoking bad program behavior.
That attitude may have been excusable 50 years ago. By now, though, we have
seen so much malicious exploitation of open avenues of "undefined behavior"
that we can no longer ignore bugs that "can't happen when using the tool
correctly". Mature software should not brook incorrect usage.
"Bailing out near line 1" is a sign of defensive precautions. Crashes and
unjustified output betray their absence.
I commend attention to the LangSec movement, which advocates for rigorously
enforced separation between legal and illegal inputs.
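A toy illustration of that separation in Python (the name=value input language here is invented for the example; no real tool is implied): state what is legal, recognize the entire input against it, and refuse to act on anything else.

    import re, sys

    # The input language, stated explicitly: one name=value pair per line,
    # alphabetic name, decimal integer value.  Everything else is illegal.
    LEGAL = re.compile(r'^([A-Za-z]+)=([0-9]+)$')

    def recognize(lines):
        pairs = []
        for n, line in enumerate(lines, 1):
            m = LEGAL.match(line.rstrip('\n'))
            if not m:
                sys.exit('illegal input, line %d' % n)   # refuse before acting
            pairs.append((m.group(1), int(m.group(2))))
        return pairs

    for name, value in recognize(sys.stdin):   # downstream code sees only legal input
        print(name, value)

The point is not the particular pattern but the order of operations: nothing downstream ever touches input that has not been proved legal.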
Doug
>> Another non-descriptive style of error message that I admired was that
>> of Berkeley Pascal's syntax diagnostics. When the LR parser could not
>> proceed, it reported where, and automatically provided a sample token
>> that would allow the parsing to progress. I found this uniform
>> convention to be at least as informative as distinct hand-crafted
>> messages, which almost by definition can't foresee every contingency.
>> Alas, this elegant scheme seems not to have inspired imitators.
> The hazard with this approach is that the suggested syntactic correction
> might simply lead the user farther into the weeds
I don't think there's enough experience to justify this claim. Before I
experienced the Berkeley compiler, I would have thought such bad outcomes
were inevitable in any language. Although the compiler's suggestions often
bore little or no relationship to the real correction, I always found them
informative. In particular, the utterly consistent style assured there was
never an issue of ambiguity or of technical jargon.
The compiler taught me Pascal in an evening. I had scanned the Pascal
Report a couple of years before but had never written a Pascal program.
With no manual at hand, I looked at one program to find out what
mumbo-jumbo had to come first and how to print integers, then wrote the
rest by trial and error. Within a couple of hours I had a working program
good enough to pass muster in an ACM journal.
An example arose that one might think would lead "into the weeds". The
parser balked before 'or' in a compound Boolean expression like 'a=b and
c=d or x=y'. It couldn't suggest a right paren because no left paren had
been seen. Whatever suggestion it did make (perhaps 'then') was enough to
lead me to insert a remote left paren and teach me that parens are required
around Boolean-valued subexpressions. (I will agree that this lesson might
be less clear to a programming novice, but so might be many conventional
diagnostics, e.g. "no effect".)
Doug
I just revisited this ironic echo of Mies van der Rohe's aphorism, "Less is
more".
% less --help | wc -l
298
Last time I looked, the line count was about 220. Bloat is self-catalyzing.
What prompted me to look was another disheartening discovery. The "small
special tool" Gnu diff has a 95-page manual! And it doesn't cover the
option I was looking up (-h). To be fair, the manual includes related
programs like diff3(1), sdiff(1) and patch(1), but the original manual for
each fit on one page.
Doug
> was ‘usage: ...’ adopted from an earlier system?
"Usage" was one of those lovely ideas, one exposure to which flips its
status from unknown to eternal truth. I am sure my first exposure was on
Unix, but I don't remember when. Perhaps because it radically departs from
Ken's "?" in qed/ed, I have subconsciously attributed it to Dennis.
The genius of "usage" and "?" is that they don't attempt to tell one what's
wrong. Most diagnostics cite a rule or hidden limit that's been violated or
describe the mistake (e.g. "missing semicolon"), sometimes raising more
questions than they answer.
Another non-descriptive style of error message that I admired was that of
Berkeley Pascal's syntax diagnostics. When the LR parser could not proceed,
it reported where, and automatically provided a sample token that would
allow the parsing to progress. I found this uniform convention to be at
least as informative as distinct hand-crafted messages, which almost by
definition can't foresee every contingency. Alas, this elegant scheme seems
not to have inspired imitators.
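A toy Python analogue of that behavior, to make the idea concrete (the grammar and wording are invented, and this is recursive descent rather than the Berkeley LR machinery): when the parser cannot proceed it reports where, names one token that would let the parse continue, pretends that token was inserted, and carries on.

    # Grammar: expr -> term (('+'|'-') term)* ;  term -> NUMBER | '(' expr ')'
    import re

    TOKEN = re.compile(r'\s*(\d+|[()+-])')

    def tokenize(s):
        toks, pos = [], 0
        while pos < len(s):
            m = TOKEN.match(s, pos)
            if not m:
                break
            toks.append((m.group(1), m.start(1)))
            pos = m.end()
        return toks + [('<eof>', len(s))]

    class Parser:
        def __init__(self, toks):
            self.toks, self.i = toks, 0

        def peek(self):
            return self.toks[self.i][0]

        def expect(self, want):
            tok, col = self.toks[self.i]
            if tok == want:
                self.i += 1
            else:
                # Report where, and offer one token that allows progress.
                print("col %d: inserted '%s' before '%s'" % (col, want, tok))

        def expr(self):
            self.term()
            while self.peek() in '+-':
                self.i += 1
                self.term()

        def term(self):
            tok, col = self.toks[self.i]
            if tok.isdigit():
                self.i += 1
            elif tok == '(':
                self.i += 1
                self.expr()
                self.expect(')')
            else:
                print("col %d: inserted '0' before '%s'" % (col, tok))

    Parser(tokenize('(1 + ) - 2')).expr()   # prints: col 5: inserted '0' before ')'

Every diagnostic has the same shape, which is the property being praised: the reader needs no glossary, only the suggestion itself.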
Doug
So fork() is a significant nuisance. How about the far more ubiquitous
problem of IO buffering?
On Sun, May 12, 2024 at 12:34:20PM -0700, Adam Thornton wrote:
> But it does come down to the same argument as
>
https://www.microsoft.com/en-us/research/uploads/prod/2019/04/fork-hotos19.…
The Microsoft manifesto says that fork() is an evil hack. One of the cited
evils is that one must remember to flush output buffers before forking, for
fear it will be emitted twice. But buffering is the culprit, not the
victim. Output buffers must be flushed for many other reasons: to avoid
deadlock; to force prompt delivery of urgent output; to keep output from
being lost in case of a subsequent failure. Input buffers can also steal
data by reading ahead into stuff that should go to another consumer. In all
these cases buffering can break compositionality. Yet the manifesto blames
an instance of the hazard on fork()!
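The hazard takes three lines to reproduce. A minimal Python sketch (Python's stdout buffers like stdio here): run it with output redirected to a file or pipe, where the stream is fully buffered, and the line comes out twice, once from each process's copy of the unflushed buffer.

    import os, sys

    sys.stdout.write('hello\n')   # sits in the user-space buffer (stdout is not a tty)
    # sys.stdout.flush()          # the precaution the manifesto pins on fork()
    pid = os.fork()               # the child inherits a copy of the unflushed buffer
    if pid:
        os.waitpid(pid, 0)
    # both processes flush their copies at exit, so the line is written twice

Uncomment the flush and the line appears once; the cure lies in the buffering, not in fork().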
To assure compositionality, one must flush output buffers at every possible
point where an unknown downstream consumer might correctly act on the
received data with observable results. And input buffering must never
ingest data that the program will not eventually use. These are tough
criteria to meet in general without sacrificing buffering.
The advent of pipes vividly exposed the non-compositionality of output
buffering. Interactive pipelines froze when the user could not provide the
input that would force buffered output to be flushed, because that input
depended on seeing the very output still held in the buffer. This
phenomenon motivated cat -u, and stdio's convention of
line buffering for stdout. The premier example of input buffering eating
other programs' data was mitigated by "here documents" in the Bourne shell.
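The freeze is just as easy to stage. A hypothetical interactive filter like the sketch below behaves fine at a terminal, where stdout is line buffered, but placed in a pipeline its prompt stays in the buffer, the user never learns that a reply is wanted, and the pipeline hangs; cat -u and line buffering at the terminal paper over exactly this case.

    import sys

    sys.stdout.write('ready? ')    # fully buffered when stdout is a pipe
    # sys.stdout.flush()           # without this the prompt never reaches the user
    reply = sys.stdin.readline()   # blocks forever waiting for an answer
    sys.stdout.write('got: ' + reply)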
These precautions are mere fig leaves that conceal important special cases.
The underlying evil of buffered IO still lurks. The justification is that
it's necessary to match the characteristics of IO devices and to minimize
system-call overhead. The former necessity requires the attention of
hardware designers, but the latter is in the hands of programmers. What can
be done to mitigate the pain of border-crossing into the kernel? L4 and its
ilk have taken a whack. An even more radical approach might flow from the
"whitepaper" at www.codevalley.com.
In any event, the abolition of buffering is a grand challenge.
Doug
On Sat, May 11, 2024 at 2:35 PM Theodore Ts'o <tytso@mit.edu> wrote:
>
> I bet most of the young'uns would not be trying to do this as a shell
> script, but using the Cloud SDK with perl or python or Go, which is
> *way* more bloaty than using /bin/sh.
>
> So while some of us old farts might be bemoaning the death of the Unix
> philosophy, perhaps part of the reality is that the Unix philosophy
> were ideal for a simpler time, but might not be as good of a fit
> today
I'm finding myself in agreement. I might well do this with jq, but as you
point out, you're using the jq DSL pretty extensively to pull out the
fields. On the other hand, I don't think that's very different than piping
stuff through awk, and I don't think anyone feels like _that_ would be
cheating. And jq -L is pretty much equivalent to awk -F, which is how I
would do this in practice, rather than trying to inline the whole jq bit.
But it does come down to the same argument as
https://www.microsoft.com/en-us/research/uploads/prod/2019/04/fork-hotos19.…
And it is true that while fork() is a great model for single-threaded
pipeline-looking tasks, it's not really what you want for an interactive
multithreaded application on your phone's GUI.
Oddly, I'd have a slightly different reason for reaching for Python (which
is probably how I'd do this anyway), and that's the batteries-included
bit. If I write in Python, I've got the gcloud api available as a Python
module, and I've got a JSON parser also available as a Python module (but I
bet all the JSON unmarshalling is already handled in the gcloud library),
and I don't have to context-switch to the same degree that I would if I
were stringing it together in the shell. Instead of "make an HTTP request
to get JSON text back, then parse that with repeated calls to jq", I'd just
get an object back from the instance fetch request, pick out the fields I
wanted, and I'd be done.
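Roughly this sort of thing; a sketch only, which shells out to gcloud --format=json instead of asserting exact client-library calls, and the field names at the end are guesses at what the instance JSON carries:

    import json, subprocess

    # One fetch, one parse ...
    raw = subprocess.run(
        ['gcloud', 'compute', 'instances', 'list', '--format=json'],
        capture_output=True, text=True, check=True).stdout
    instances = json.loads(raw)

    # ... then pick out fields with ordinary Python, no per-field jq calls.
    for inst in instances:
        print(inst.get('name'), inst.get('status'))

With the client library even the subprocess step goes away: the call hands back objects and you pick attributes off them directly.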
I'm afraid only old farts write anything in Perl anymore. The kids just
mutter "OK, Boomer" when you try to tell them how much better CPAN was than
PyPI. And it sure feels like all the cool kids have abandoned Go for Rust,
although Go would be a perfectly reasonable choice for this task as well
(and would look a lot like Python: get an object back, pick off the useful
fields).
Adam