I can never emphasize enough how much damage it does to just get people in on a contract and let them scribble all over the company's valuable IP. Soon they are gone, but the damage lives on, making the real programmers' lives very difficult. Luckily the thing that took the contract programmer months to do can usually be redone by the real programmer in a day or two, but if it's been released it takes careful planning and opportunism to get all the breakage out of the system.

This was my life when I worked on cash registers. Software was basically a pimple on the side of the hardware; my bosses didn't understand software (or worse, thought they understood it and hence trivialized it), so this happened a lot. The software was seen as a vehicle to get the hardware where it needed to go rather than as an end in itself. Working there, dealing with customer complaints, and overseeing the expansion of what was originally a few BASIC programs to print a report of the day's transactions on the receipt printer into thousands and thousands of lines of code (so the cash registers could participate in multiple networks, download software and price updates, report on takings and the performance of different categories, order stock, track till balances, etc. etc.) really taught me a MASSIVE respect for code quality. I almost never meet anyone who cares as much as I do about code quality and careful analysis of a system's assumptions and invariants (like the assumption about drivers modifying a process's signal mask in your example).

I can remember a conversation I had with a new hire in my research group later on (not cash registers this time)... this dude had a PhD and had written a hugely successful open source package that is still standard today for a lot of courses etc. in our field... and was hired to rewrite some similar stuff created by our research group that was a bit of a dog's breakfast but nevertheless was in daily use and publicly disseminated. Well, I had hacked on this dude's code a lot and I hated it: way overcomplicated, built on a very awkward structure of millions of interdependent C++ templates and what-have-you. He showed me his progress after some months, and I showed him a competing implementation (very immature) that I had put together over my summer holiday using Python. So I tried to sell him the idea of doing it in Python and structuring it all for simplicity and maintainability... he was not having it. I could see his code would rapidly descend into a dog's breakfast as soon as it was used to solve any real-world problem, because he was repeating all the same mistakes as in his open source package.

So fast forward 5 years or so and he has a usable system; it is in daily use and is being publicly disseminated. It is not too bad, until one looks under the hood. I used it as a module in one of my major research tools, and it is great that it's available, BUT it falls over miserably when you stray from the normal standard use cases, the ones his group have tested and made to work by extensive layers of band-aid fixes, leaving the code in an incomprehensible state. I would spend days debugging and then send him a comprehensive report on my investigation, including several proposals for short- and long-term fixes. He was initially enthusiastic about this, but lately my reports get labelled "won't fix" with weak excuses about it being outside the common use cases. "Can't fix" would be more accurate. In the process of all this I looked at the changelogs for the releases: in the past 3 years there were a couple of feature releases and about 30 bugfix releases, each accompanied by a release note which just casually passes this off as no big deal and implies the code is approaching a reliable state. Ha!!

By contrast, in May/June this year I decided to enter my tool in a competition run by my research group and open to outside entrants; I think about 20 groups entered, including 3 or 4 internal entries like mine. Well, my tool was far from perfect, since I had embarked on a major rewrite of the frontend some months earlier and it was hard to produce anything working at all, let alone competition quality. Luckily I had help from the competition organizer: since internal entries are not eligible for prizes, he was happy to alert me to any problems he found and to let me submit fixes if it did not mess up his schedule. Well, he found quite a few issues, I fixed them, and I ended up having the fastest and best tool in the competition even though it was not eligible for prizes.

But now to the point of the story: the CHARACTER of the problems he found. I care so much about code quality that most of the problems amounted to a simple oversight: a misplaced comma that was hard to see, a pointer violation that occurred because a realloc had moved some data during the evaluation of an expression (there is a sketch of that pattern below), that sort of thing. The fix never required any significant restructuring of the code, except in one case that was basically caused by my using that other broken software as a module, where I had to work around it. I am so happy that my basic assumptions and algorithms turned out to be robust, because this means that after some period of getting all the typos and minor oversights out, I will have a tool that is close to perfect despite its complexity and the things I still plan to refactor and rewrite. The guys who do not understand code quality will never experience this.
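
(If anyone is curious, here is a minimal C sketch of that realloc pitfall.  The struct and function names are hypothetical, not from my tool; it just shows how an array element's address can be taken before a call in the same expression reallocates and moves the array.)

    #include <stdlib.h>

    struct vec { double *data; size_t len, cap; };

    /* Append x, growing the array as needed; realloc() may move the
     * whole block.  (Error checking omitted to keep the sketch short.) */
    static double push(struct vec *v, double x)
    {
        if (v->len == v->cap) {
            v->cap = v->cap ? 2 * v->cap : 8;
            v->data = realloc(v->data, v->cap * sizeof *v->data);
        }
        v->data[v->len++] = x;
        return x;
    }

    void buggy(struct vec *v)          /* v assumed non-empty */
    {
        /* BUG: the address of v->data[0] may be computed before push()
         * runs; if push() reallocates, the read and the store go
         * through the old, freed block. */
        v->data[0] += push(v, 1.0);
    }

    void fixed(struct vec *v)          /* v assumed non-empty */
    {
        /* Fix: let the call (and any realloc) finish before touching
         * the array. */
        double x = push(v, 1.0);
        v->data[0] += x;
    }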

cheers, Nick

On 01/01/2017 4:00 PM, "Larry McVoy" <lm@mcvoy.com> wrote:
Inspired by:

> Stephen Bourne after some time wrote a cron job that checked whether an
> update in a binary also resulted in an updated man page and otherwise
> removed the binary. This is why these programs have man pages.
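
(That kind of check is easy to sketch.  This is not Bourne's actual cron job; the bin and man page paths, the man section, and printing a warning instead of removing the binary are all just assumptions for illustration.)

    #include <stdio.h>
    #include <dirent.h>
    #include <sys/stat.h>

    int main(void)
    {
        const char *bindir = "/usr/bin";        /* assumed locations */
        const char *mandir = "/usr/man/man1";
        struct dirent *e;
        DIR *d = opendir(bindir);

        if (d == NULL) {
            perror(bindir);
            return 1;
        }
        while ((e = readdir(d)) != NULL) {
            char bin[1024], man[1024];
            struct stat sb, sm;

            if (e->d_name[0] == '.')
                continue;
            snprintf(bin, sizeof bin, "%s/%s", bindir, e->d_name);
            snprintf(man, sizeof man, "%s/%s.1", mandir, e->d_name);

            /* Complain if the man page is missing or older than the
             * binary; Bourne's version removed the binary instead. */
            if (stat(bin, &sb) == 0 &&
                (stat(man, &sm) != 0 || sm.st_mtime < sb.st_mtime))
                printf("man page missing or stale: %s\n", e->d_name);
        }
        closedir(d);
        return 0;
    }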

I want to tell a story about working at Sun.  I feel like I've sent this
but I can't find it in my outbox.  If it's a repeat chalk it up to old
age.

I wanted to work there, they were the Bell Labs of the day, or as close
as you could get.

I got hired as a contractor through Lachman (anyone remember them?) to do
POSIX conformance in SunOS (the 4.x stuff, not that Solaris crap that I
hate).

As such, I was frequently the last guy to touch any file in the kernel,
my fingerprints were everywhere.  So when there was a panic, it was
frequently laid at my doorstep.

So here is how I got a pager and learned about source management.

Sun had two guys, who will remain nameless, but they were known as
"the SCSI twins".  These guys decided, based on feedback that "people
can interrupt sun install", to go into the SCSI tape driver and disable
SIGINT, in the driver.  The kernel model doesn't allow for drivers messing
with your signal mask, so on exit, sometimes, we would get a "panic: psig".

Somehow, I'm sure because of the POSIX stuff, I ended up debugging this
panic.  It had nothing to do with me, I'm not a driver person (I've written
a few but I pretty much suck at them), but it landed in my lap.

Once I figured it out (which was not easy, you had to hit ^C to trigger
it, so unless you did that... and who does that during an install?) I
tracked the code down to the SCSI twins.

No problem, everyone makes mistakes.  Oh, wait.  Over the next few months
I'm tracking down more problems that were blamed on me, since I'm all
over the kernel, but that came from the twins.

Sun's integration machines were argon, radon, and krypton.  I wrote
scripts, awk I think, that watched every update to the tree on all
of those machines and if anything came from the SCSI twins the script
paged me.

That way I could go build and test that kernel and get ahead of the bugs.
If I could fix up their bugs before the rest of the team saw it then I
wouldn't get blamed for them.

I wish I could have figured out something like what Steve did, something
that would have made them not screw up so much, but this was the next
best thing.  I actually
got bad reviews because of their crap.  My boss at the time, Eli Lamb, just
said "you are in kadb too much".

--lm