Interesting analysis of linux kernel threading by IBM
Larry McVoy
lm at bitmover.com
Fri Jan 21 22:07:22 CST 2000
: First of all, you're right.
Ahhh, I love it when people suck up to me :-)
Actually, smart move, it set my mind to actually reading the rest of your
message. Everybody? Please start out all mail to me with "You're right".
What? No? You don't think so? Ahh, well, I can hope :-)
: Larry, you mentioned that you had thoughts on what Linux has to do to work on big
: servers. I'm imagining, given your background and the content of this email, that
: they are also ideas that won't harm the performance on small systems (e.g.
: desktops). I must have missed them somehow - could you recap?
OK, but they are pretty radical. Linus and I have talked them over and he
has always been of the opinion "sounds good, sounds like it might be right,
where's the code?". And I'm sidetracked onto BitKeeper.
Whatever, can the people who are really interested in high performance
take a look at
http://www.bitmover.com/llnl/smp.{ps,pdf}
and then
http://www.bitmover.com/llnl/labs.{ps,pdf}
I'll briefly summarize here. No justification for these statements is
offered here; there is some in the papers.
Premise 1: SMP scaling is a bad idea beyond a very small number of processors.
The reasoning for this is that when you start out threading a kernel,
it's just a few locks. That quickly evolves into more locks, and
for a short time, there is a 1:1 mapping between each sort of object
in the system (file, file system, device, process, etc) and a lock.
So there can be a lot of locks, but there is only one reader/writer
lock per object instance. This is a pretty nice place to be - it's
understandable, explainable, and maintainable.
Then people want more performance. So they thread some more and now
the locks aren't 1:1 to the objects. What a lock covers starts to
become fuzzy. Things break down quickly after this because what
happens is that it becomes unclear if you are covered or not and
it's too much work to figure it out, so each time a thing is added
to the kernel, it comes with a lock. Before long, your 10 or 20
locks are 3000 or more, like what Solaris has. This is really bad,
it hurts performance in far reaching ways and it is impossible to
undo.
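To make the nice stage concrete, here's a minimal userspace sketch of one
reader/writer lock per object instance, with POSIX rwlocks standing in for
kernel locks (the struct and its fields are made up for illustration, not
real kernel code):

    #include <pthread.h>

    /* One rwlock guards exactly one instance, so the coverage rule is
     * trivial to state and audit: "hold f->lock before touching any
     * field of *f". */
    struct file_obj {
        pthread_rwlock_t lock;
        long             offset;
        int              refcount;
    };

    static void file_obj_init(struct file_obj *f)
    {
        pthread_rwlock_init(&f->lock, NULL);
        f->offset = 0;
        f->refcount = 1;
    }

    static long read_offset(struct file_obj *f)
    {
        long off;
        pthread_rwlock_rdlock(&f->lock);   /* readers can share */
        off = f->offset;
        pthread_rwlock_unlock(&f->lock);
        return off;
    }

    static void set_offset(struct file_obj *f, long off)
    {
        pthread_rwlock_wrlock(&f->lock);   /* writers are exclusive */
        f->offset = off;
        pthread_rwlock_unlock(&f->lock);
    }

The rot described above sets in when fields of *f start migrating under
other locks and that one-line coverage rule quietly stops being true.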
Premise 2: most/all locking follows a canonical form of "take a global
data structure, split it up into N, where N is a function of the
number of CPUs, and give each CPU or group of CPUs its own data
structure." Classic example: global/local run queues.
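Here's a sketch of that canonical split (all the names are made up; the
point is just the shape - one global queue with one global lock becomes N
per-CPU queues, each with its own rarely contended lock):

    #include <pthread.h>

    #define NR_CPUS 8                   /* illustrative */

    struct task {
        struct task *next;
    };

    /* Was: one global queue + one global lock every CPU fights over.
     * Now: one queue per CPU; each CPU mostly takes only its own lock. */
    struct run_queue {
        pthread_mutex_t lock;
        struct task    *head;
    };

    static struct run_queue rq[NR_CPUS];

    static void rq_init(void)
    {
        int i;
        for (i = 0; i < NR_CPUS; i++)
            pthread_mutex_init(&rq[i].lock, NULL);
    }

    static void enqueue(int cpu, struct task *t)
    {
        pthread_mutex_lock(&rq[cpu].lock);
        t->next = rq[cpu].head;
        rq[cpu].head = t;
        pthread_mutex_unlock(&rq[cpu].lock);
    }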
Premise 3: it is far easier to take a bunch of operating system images
and make them share the parts they need to share (i.e., the page
cache), than to take a single image and pry it apart so that it
runs well on N processors.
All of this leads us to an interesting twist on the clustering idea,
one that I credit mostly to DEC (they have some of this
implemented already). Suppose you were to take a single big machine
and run another instance of the OS every N processors where N is chosen
so that it is well under the knee of the locking curve, i.e., around 4,
maybe 8, but certainly no more than 8.
Multiple OS's on a single box? Wacky, huh? But if you think about it,
you've just instantly taken *EVERY* data structure in the kernel and
multithreaded it. Cool, no? And it cost you nothing but some boot
code.
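To hand-wave at what "some boot code" means, here's a hypothetical
per-instance descriptor (invented for illustration, not any real boot
protocol) - each instance gets told which CPUs and which slice of physical
memory are its, plus a window everybody can see:

    /* Hypothetical boot-time partition descriptor. On a 32-CPU box with
     * N=4, instance 2 might get CPUs 8-11 and its own memory window. */
    struct instance_desc {
        int           id;           /* which OS instance this is */
        int           first_cpu;    /* e.g. 8 */
        int           ncpus;        /* e.g. 4, well under the knee */
        unsigned long mem_base;     /* private physical memory */
        unsigned long mem_size;
        unsigned long shared_base;  /* region all instances can map */
        unsigned long shared_size;
    };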
That's kinda cute but not very useful because what you really want is to
be able to have all processors working together on the same data with only
one copy of the data. In other words, I don't care if I have one OS or
1000, I want all processors to be able to mmap /space/damn_big_file and
poke at it. And I don't want any stinkin' DSM - I want real, hardware
based coherency. Well, bucky, I'm here to tell ya, praise the lord,
you can have it :-) You need to make an SMPFS which lets other OS's
put reference counts on your inodes. The operation is extremely
similar to what you have to do when you invalidate a page - you shoot
down the other processors' TLB entries. So we need something like
that, in reverse.
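Here's one hypothetical shape for it (every name below is invented;
nothing here is a real kernel API): a remote OS instance takes a counted
grant on the owning instance's inode before mapping its pages, and before
the owner changes the object it revokes those grants, just like a TLB
shootdown, only between kernels instead of between CPUs:

    #define MAX_OS_INSTANCES 16     /* illustrative */

    /* Hypothetical SMPFS bookkeeping on the instance that owns the
     * inode: one grant count per remote OS instance. */
    struct smpfs_inode {
        unsigned long grants[MAX_OS_INSTANCES];
    };

    /* Stand-ins for a cross-instance message channel (think IPI, but
     * between kernels); a real design might use shared-memory mailboxes. */
    static void send_revoke(int os_id, struct smpfs_inode *ip)
    {
        (void)os_id; (void)ip;
    }
    static void wait_for_ack(int os_id) { (void)os_id; }

    /* The reverse shootdown: before the owner truncates or rewrites the
     * shared object, every instance holding a grant must drop it. */
    static void revoke_all_grants(struct smpfs_inode *ip)
    {
        int i;
        for (i = 0; i < MAX_OS_INSTANCES; i++)
            if (ip->grants[i]) {
                send_revoke(i, ip);    /* like a TLB shootdown IPI */
                wait_for_ack(i);       /* remote drops its refcount */
            }
    }

And the payoff is that an application on any of the instances sees nothing
special at all:

    #include <stddef.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>

    /* Ordinary open+mmap on ANY instance; the mapping is hardware-
     * coherent, no DSM involved. Error handling omitted for brevity. */
    static char *map_big_file(const char *path, size_t *lenp)
    {
        int fd = open(path, O_RDWR);
        struct stat st;

        fstat(fd, &st);
        *lenp = (size_t)st.st_size;
        return mmap(NULL, *lenp, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
    }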
If you think I'm waving my hands wildly, I am. But this is definitely
doable, and as hard as it seems, it is easily an order of magnitude easier
than threading the kernel to get to even 32 processors. I've lived
through that twice; it's about a 7-year process (in hindsight; beforehand,
everyone said it would be maybe 18 months).
Read the papers. Think. Think again. Let's talk. I can set up a
perf at bitmover alias if this becomes too off topic.
--lm
P.S. I call these SMP clusters, to distinguish them from HA or HPC clusters.
SMP clusters are for the enterprise - these are the clusters that will get
Linux on big iron running Oracle and kicking serious butt. Fast.