[TUHS] ATC/OSDI'21 joint keynote: It's Time for Operating Systems to Rediscover Hardware (Timothy Roscoe)

Fri Sep 17 11:33:02 AEST 2021

On Thu, Sep 16, 2021 at 8:34 PM Theodore Ts'o <tytso at mit.edu> wrote:

> On Thu, Sep 16, 2021 at 03:27:17PM -0400, Dan Cross wrote:
> > >
> > > I'm really not convinced trying to build distributed computing into
> > > the OS ala Plan 9 is viable.
> >
> > It seems like plan9 itself is an existence proof that this is possible.
> > What it did not present was an existence proof of its scalability and it
> > wasn't successful commercially. It probably bears mentioning that that
> > wasn't really the point of plan9, though; it was a research system.
>
> I should have been more clear.  I'm not really convinced that
> building distributed computing into the OS ala Plan 9 is viable from
> the perspective of commercial success.  Of course, Plan 9 did it; but
> it did it as a research system.
>
> The problem is that if a particular company is convinced that they
> want to use Yellow Pages as their directory service --- or maybe X.509
> certificates as their authentication system, or maybe Apollo RPC is
> the only RPC system for a particularly opinionated site administrator
> --- and these prior biases disagree with the choices made by a
> particular OS that had distributed computing services built in as a
> core part of its functionality, that might be a reason for a
> particular customer *not* to deploy a particular distributed OS.
>

Ah, I take your meaning. Yes, I can see that being a problem. But we've had
similar problems before: "we only buy IBM", or, "does it integrate into our
VAXcluster?" Put another way, _every_ system has opinions about how to do
things. I suppose the distinction you're making is that we can paper over
so many of those by building abstractions on top of the "node" OS. But the
node OS is already forcing a shape onto our solutions. Folks working on the
Go runtime have told me painful stories about detection of blocking system
calls using timers and signals: wouldn't it be easier if the system
provided real asynchronous abstractions? But the system call model in
Unix/Linux/plan9 etc is highly synchronous. If `open` takes a while for
whatever reason (say, blocking on reading directory entries looking up a
name?) there's no async IO interface for that, hence shenanigans. But
that's what the local node gives me; c'est la vie.

> Of course, this doesn't matter if you don't care if anyone uses it
> after the paper(s) about said OS has been published.
>

I suspect most researchers don't expect the actual research artifacts to
make it directly into products, but that the ideas will hopefully have some
impact. Interestingly, Unix seems to have been an exception to this in that
Unix itself did make it into industry.

> > Plan 9, as just one example, asked a lot of questions about the issues you
> > mentioned above 30 years ago. They came up with _a_ set of answers; that
> > set did evolve over time as things progressed. That doesn't mean that those
> > questions were resolved definitively, just that there was a group of
> > researchers who came up with an approach to them that worked for that group.
>
> There's nothing stopping researchers from creating other research OS's
> that try to answer that question.

True, but they aren't. I suspect there are a number of confounding factors
at play here; certainly, the breadth and size of the standards they have to
implement is an issue, but so is lack of documentation. No one is seriously
looking at new system architectures, though.

> However, creating an entire new
> local node OS from scratch is challenging[1], and if you then
> have to recreate new versions of Kerberos, an LDAP directory server,
> etc., so that all of these functions can be tightly integrated into a
> single distributed OS ala Plan 9, that seems to be a huge amount of
> work, requiring a lot of graduate students to pull off.
>
> [1] http://doc.cat-v.org/bell_labs/utah2000/   (Page 14, Standards)
>

Yup. That is the presentation I had in mind when, in the previous message
and earlier in the thread, I mentioned Rob Pike lamenting the situation 20
years ago.

An interesting thing here is that we assume that we have to redo _all_ of
that, though. A lot of the software out there is just code that does
something interesting, but actually touches the system in a pretty small
way. gvisor is an interesting example of this; it provides something that
looks an awful lot like Linux to an application, and a lot of stuff can run
under it. But the number of system calls _it_ in turn makes to the
underlying system is much smaller.

> > What's changed is that we now take for granted that Linux is there, and
> > we've stopped asking questions about anything outside of that model.
>
> It's unclear to me that Linux is blamed as the reason why researchers
> have stopped asking questions outside of that model.  Why should Linux
> have this effect when the presence of Unix didn't?
>

a) There's a lot more Linux in the world than there ever was Unix. b) There
are more computers now than there were when Unix was popular. c) Computers
are significantly more complex now than they were when Unix was written.

But to be clear, I don't think this trend started with Linux; I get the
impression that by the 1980s, a lot of research focused on a Unix-like
model to the exclusion of other architectures. The PDP-10 was basically
dead by 1981, and we haven't seen a system like TOPS-20 since the 70s.

> Or is the argument that it's Linux's fault that Plan 9 has apparently
> failed to compete with it in the marketplace of ideas?

It's hard to make that argument when Linux borrowed so many of plan9's
ideas: /proc, per-process namespaces, etc.

> And arguably,
> Plan 9 failed to make headway against Unix (and OSF/DCE, and Sun NFS,
> etc.) in the early to mid 90's, which is well before Linux became
> popular, so that argument doesn't really make sense, either.
>

That wasn't the argument. There are a number of reasons why plan9 failed to
achieve commercial success relative to Unix; most of them have little to do
with technology. In many ways, AT&T strangled the baby by holding it too
tightly to its chest, fearful of losing control the way they "lost" control
of Unix (ironically, something that allowed Unix to flourish and become
wildly successful). Incompatibility with the rest of the world was likely
an issue, but inaccessibility and overly restrictive licensing in the early
90s practically made it a foregone conclusion.

Also, it's a little bit of an aside, but I think we often undercount the
impact of individual preference on systems. In so many ways, Linux
succeeded because, simply put, people liked working on Linux more than they
liked working on other systems. You've mentioned yourself that it was more
fun to hack on Linux without having to appease some of the big
personalities in the BSD world.

        - Dan C.