[TUHS] SunOS code?

Sun Sep 2 15:05:06 AEST 2018

On Sat, Sep 1, 2018 at 3:19 PM, Theodore Y. Ts'o <tytso at mit.edu> wrote:
> On Sat, Sep 01, 2018 at 01:17:59PM -0400, Arthur Krewat wrote:
>> On 9/1/2018 12:27 PM, Kevin Bowling wrote:
>> > > I find it's about equal, and even exceeds Linux in terms of it's NUMA
>> > > support and multi-processor support. I need to move some systems away from
>> > > Solaris and off to Linux, and I find it's NUMA support lacking in certain
>> > > ways.
>> > This is pure fantasy.  To understand Linux performance on high core
>> > count and multi-socket machines is to have at least passing knowledge
>> > of Paul McKenney's genius work on RCU [1] and NUMA [2] at Sequent [3]
>> > and on Linux.  IBM bought Sequent, made a favorable patent grant of
>> > RCU for Linux, and the rest history.
>>
>> Thanks :) - I'm basing this on Oracle database performance, for the most
>> part, and it's weird way of supporting NUMA on Linux in a bass-ackwards sort
>> of way. Nothing I see in the latest RedHat/CentOS tells me it even cares
>> about NUMA, but maybe that's more of their "we know better than you"
>> mentality and it's all hidden under the covers somewhere.
>
> It wouldn't surprise me if Linux's NUMA performance is pretty weak
> compared to Solaris.  There was an attempt to try to make NUMA work
> well on Linux, with a lot of the effort coming from IBM and SGI, but
> that effort was overtaken by events.  Back in Sequent's day, the
> remote to local memory latency was ten to one, so making the system
> NUMA aware was critical.  But by early 2000's, the remote to local
> ratio was under 3:1 (or 2:1) for 4 socket systems, and with AMD's
> "Sufficiently Uniform Memory Organization" (SUMO), the ratio was under
> 1.5:1 or less.

Sorry this is just bogus about being weak compared to Solaris.  Are
you looking back with rosy glasses or have you scanned the code in the
past couple years?  I have and there is nothing particularly special
about Solaris internals here or elsewhere.  In fact, there are a lot
of pessimization all over the place.  As Larry said, a lot of folks in
the Linux community clearly cared about performance.  Although the
Solaris code is fairly clean It's not clear Sun valued performance at
all.  A stroll through arch/*/include/asm/ was enough to convince me
of Larry's claims.  I'm not a Linux fanboy but credit goes where it's
due.

Solaris has lgroups, which are a clean design but that is the extent
of its NUMA support, one shot at placement and scheduling.  Linux has
a NUMA allocator, aware scheduler, NUMA-optimized spinlocks and
mutexes, various subsystems correctly use the primitives, and can use
cgroups to contain or gang things.  There is a userland policy tool
called numad that tries to add some additional runtime affinity and
movement policy decisions.

I agree that architecturally Linux NUMA is nowhere near where Sequent
and especially SGI was.  And the reasons you cite are valid, Linux
implementation is good for maybe 8-16 sockets of modern core count
with a much tighter off chip network than the big dogs were building.

Keep in mind IBM wants to sell RockHoppers and E980s (4 drawers, 16
sockets, 768 threads) for dedicated Linux use which have similar
north/south and east/west off chip networks.  They have a lot of very
talented people on the firmware, kernel, compilers to make these
things work fast, including Paul.

> The main reason for this was that Windows was (and as far as I know,
> still is) NUMA oblivious.  So x86 chip and motherboard designers
> solved the problem, by brute foruce, in hardware.  So by 2003 or 2004,
> the Linux Scalability Effort had more or less petered out.  (You can
> see the leftover remnants at http://lse.sourceforge.net)

Windows' NUMA support is on par with Solaris insofar as there is
domain aware memory allocator and scheduler hierarchy that takes the
domain (and SMT etc) into account.  What Windows lacks is the finely
tuned concurrency primitives and everything else Linux has done..
which Solaris lacks as well.  I'm not even talking about RCU (or epoch
based reclamation or proxy collection or hazard pointers, at least one
of which is not patent encumbered), I'm just talking about the quality
of primitives like spinlocks and mutex and rwlock.  There are big
tradeoffs to the implementations of these in terms of fairness,
progress guarantees, and thread scalability.  Linux leads the pack by
a long shot in this department.

Where you start going beyond Linux-like NUMA IMO is when you get
Irix-like features of page copying, migration, and multiple advanced
placement policies.

> Fundamentally, the economics of 4 socket and higher machines was such
> that for many workloads, scale out was much cheaper than scale up.  So
> why buy super-expensive IBM X440, x450, and x460 servers, which were
> huge cabinets connected by one or more "scalability cables" (sometimes
> referred to as the "scalability bottleneck"), when most of the time,
> you could just buy a rack of 2U x86 servers which would be much, much
> cheaper?

Agreed, this is why x86 has dominated the server market for a long time.

> There were certainly workloads this wasn't applicable, of course.  But
> when Sun was selling Sun 10k's to web startups during the dot com
> boom, and they were using it to serve web traffic, they probably had
> too much VC money to burn, because that was *not* the most cost
> effective way to do things.

Agreed.  Those big margins must have caused them to take their eye off
what mattered right at the time Linux was getting some momentum from
the big HW vendors.

> Don't get me wrong; the Read Copy Update (RCU) technique was certainly
> very important, and is responsible for much of Linux's SMP scalability
> today.  But these days, when you can get up to 28 cores (56 threads)
> on a single socket, the need for more than 2 socket systems is already
> somewhat niche, and by the time you get to more than 4 sockets, it's
> positively microscopic.  As a result, NUMA support on Linux is
> certainly not as strong as it could be, and it wouldn't surprise me
> that Solaris has developed much better ways of handling the behemoths
> such as Sun Enterprise 10k.

The E10k was only a 64-core machine on a tight backplane compared to
other large systems.  It didn't have any of the pressing needs that
Sequent and SGI did with multi-drawer interconnects to drive
excellence in NUMA.

These are strange times.  Intel's been putting out some real doozy
chips.  The mesh in Skylake is a partial improvement over the dual
rings of Haswell (though they did some goofy things to increase
latency in undesirable ways), and they aren't going to continue to
brute force it like IBM did with their 17 metal layer process node..
many SKUs in Cascade Lake will be a dual die design and cost a
hilarious amount of money.  AMD's EPYC is really bad in this
department too, one EPYC behaves identically to a 4 socket system with
extremely poor inter-die latency [1].  I think POWER9 is universally
better and the high bin chips (22 core, 88 thread, mega cache) are
only around $2500 compared to Skylake's absurd $12,000.  POWER9 is a
single on chip NUMA domain for 24 cores.  Google publicly stated they
are using it for GPU servers, and that all their monorepo is built for
multiple ISAs.  Through the grapevine I've heard gmail is running on
POWER9 as well now.  That is pretty competent, the reason Intel is
sucking so bad is because people allowed themselves such lock in.  A
hyperscaler should be able to change between a couple ISAs as needed
between purchasing cycles.

>                                         - Ted
>
> P.S.  IBM made the RCU patent available for any GPL code, well before
> Sun decided on the CDDL for Solaris.  So if Sun management had chosen
> GPL, they could have used RCU...

True.  There is also at least one unencumbered strategy such as epoch
based reclamation which was known about around that time [2]

[1] https://www.servethehome.com/amd-epyc-infinity-fabric-latency-ddr4-2400-v-2666-a-snapshot/
[2] https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-579.pdf

Regards,
Kevin