Interesting analysis of linux kernel threading by IBM

Alex Khripin kolex at erols.com
Sat Jan 22 18:20:59 CST 2000


Larry,
    I like these ideas quite a bit, and they seem very sensible. However, there are some clarifications I would like. First of
all, Premise 3 may be good for very large SMP systems, but it may be unnecessary or even harmful on dual or quad systems. How
would this approach measure up on a small scale? Premise 2 seems very sensible, though adding another fs should be avoided, I
think. Perhaps there is some way to squeeze it into ext2 systems? This is of special concern when moving an old hard drive with
data from a UP to an SMP system (I did this a little while ago). One shouldn't be forced to reformat the hard drive to an smpfs
format in order to use the improved capabilities. Premise 1 is excellent, no debate. Too many locks become sloppy, hard to
debug, and slow.
    However, having said that, I must say that such grandiose improvements are not grounds for excluding smaller-scale fixes
(Davide's, for example, which seems very reasonable and cleaner than what we have now, and the struct rearrangement from IBM).
    The IBM patch is simple in nature, and should be no problem. It's simply a cleanup and cache optimization, nothing that
will cause residual bugs that will haunt us for the next few years.
    Davide's patch is a little riskier, sure, but it could be added as an experimental option in 2.3-2.4 and finally integrated
in 2.5. The foundation of it seems reasonable, and I'm sure that if a few people take stabs at it, its performance can be
improved even further (giving no penalty at 1-2 tasks in the queue). I must state again that there are times when the queue
does get very large. I am currently working on a clustered simulated neural network. However, having only one machine, I will
have to simulate the running systems inside one system, running several servers on individual ports. There will not be much
I/O blocking, and each of the servers will be running several threads to communicate with linked servers. I plan to do
intensive testing, and expect run-queue averages of over 30. A little scheduling improvement would go a long way.
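For a sense of why queue length matters: the scheduler as it stands rescans every runnable task on each scheduling decision, so
its cost grows linearly with the run queue. A minimal sketch of that shape follows; the types and the weight() heuristic are
stand-ins, not the kernel's actual goodness() code.

    #include <stddef.h>

    /* Stand-in task; the real task_struct carries much more state. */
    struct task {
            struct task *next;       /* run-queue link */
            long         counter;    /* remaining time slice */
            long         priority;
    };

    /* Toy stand-in for the kernel's goodness() heuristic. */
    static long weight(const struct task *t)
    {
            return t->counter + t->priority;
    }

    /* Pick the best task: one full pass over the run queue every time. */
    static struct task *pick_next(struct task *runqueue)
    {
            struct task *t, *best = NULL;
            long best_weight = 0;

            for (t = runqueue; t != NULL; t = t->next) {
                    long w = weight(t);
                    if (best == NULL || w > best_weight) {
                            best_weight = w;
                            best = t;
                    }
            }
            return best;             /* NULL when the queue is empty */
    }

At an average of 30+ runnable tasks, that loop does 30+ iterations per reschedule, which is exactly where a smarter queue pays
off.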

> From: Larry McVoy <lm@bitmover.com>
> Date: Thu, 20 Jan 2000 20:01:34 -0800
> Subject: Re: Interesting analysis of linux kernel threading by IBM
>
> : First of all, you're right.
>
> Ahhh, I love it when people suck up to me :-)
> Actually, smart move, it set my mind to actually reading the rest of your
> message.  Everybody?  Please start out all mail to me with "You're right".
>
> What?  No?  You don't think so?  Ahh, well, I can hope :-)
>
> : Larry, you mentioned that you had thoughts on what Linux has to do to work on big
> : servers.  I'm imagining, given your background and the content of this email, that
> : they are also ideas that won't harm the performance on small systems (e.g.
> : desktops).  I must have missed them somehow - could you recap?
>
> OK, but they are pretty radical.  Linus and I have talked them over and he
> has always been of the opinion "sounds good, sounds like it might be right,
> where's the code?".  And I'm sidetracked onto BitKeeper.
>
> Whatever, can the people who are really interested in high performance
> take a look at
>
>         http://www.bitmover.com/llnl/smp.{ps,pdf}
> and then
>         http://www.bitmover.com/llnl/labs.{ps,pdf}
>
> I'll briefly summarize here.  No justification for these statements is
> given here; there is some in the papers.
>
> Premise 1: SMP scaling is a bad idea beyond a very small number of processors.
>     The reasoning for this is that when you start out threading a kernel,
>     it's just a few locks.  That quickly evolves into more locks, and
>     for a short time, there is a 1:1 mapping between each sort of object
>     in the system (file, file system, device, process, etc) and a lock.
>     So there can be a lot of locks, but there is only one reader/writer
>     lock per object instance.  This is a pretty nice place to be - it's
>     understandable, explainable, and maintainable.
>
>     Then people want more performance.  So they thread some more and now
>     the locks aren't 1:1 to the objects.  What a lock covers starts to
>     become fuzzy.  Things break down quickly after this because what
>     happens is that it becomes unclear if you are covered or not and
>     it's too much work to figure it out, so each time a thing is added
>     to the kernel, it comes with a lock.  Before long, your 10 or 20
>     locks are 3000 or more, like what Solaris has.  This is really bad,
>     it hurts performance in far reaching ways and it is impossible to
>     undo.
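To make the "one reader/writer lock per object instance" stage concrete, here is a minimal kernel-flavoured sketch (illustrative
only: the object type is invented; the locking primitives are the stock kernel ones):

    #include <linux/spinlock.h>

    /* Illustration: every instance carries exactly one lock, and that
     * lock covers every field in the instance -- easy to reason about.
     * Assume the lock is initialised when the object is created. */
    struct foo_object {
            rwlock_t      lock;
            unsigned long state;
            void         *data;
    };

    static unsigned long foo_read_state(struct foo_object *o)
    {
            unsigned long s;

            read_lock(&o->lock);     /* readers may share the lock */
            s = o->state;
            read_unlock(&o->lock);
            return s;
    }

    static void foo_set_data(struct foo_object *o, void *data)
    {
            write_lock(&o->lock);    /* writers are exclusive */
            o->data = data;
            write_unlock(&o->lock);
    }

The trouble described above starts once the mapping stops being this clean and nobody can say exactly what a given lock covers.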
>
> Premise 2: most/all locking follows a canonical form of "take a global
>     data structure, split it up into N, where N is a function of the
>     number of CPUs, and give each CPU or group of CPUs its own data
>     structure."  Classic example: global/local run queues.
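Sketched out, that canonical form looks roughly like this (illustrative only; the type names and the CPU count are assumptions,
not kernel code):

    #include <linux/spinlock.h>

    #define EX_NR_CPUS 8                   /* assumed small per-image CPU count */

    struct ex_task {
            struct ex_task *next;
    };

    /* Before: one global structure, one lock, every CPU contends on it. */
    struct ex_global_rq {
            spinlock_t      lock;
            struct ex_task *head;
    };

    /* After: one structure per CPU; each CPU mostly touches its own.
     * Assume each lock is initialised with spin_lock_init() at boot. */
    struct ex_local_rq {
            spinlock_t      lock;          /* still taken for cross-CPU wakeups */
            struct ex_task *head;
    };

    static struct ex_local_rq ex_runqueues[EX_NR_CPUS];

    static void ex_enqueue(struct ex_task *t, int cpu)
    {
            struct ex_local_rq *rq = &ex_runqueues[cpu];

            spin_lock(&rq->lock);          /* contention is now per-CPU */
            t->next = rq->head;
            rq->head = t;
            spin_unlock(&rq->lock);
    }

Each CPU mostly takes only its own lock, so the single global point of contention disappears; the price is the extra cross-CPU
work (load balancing, remote wakeups) hinted at in the comment.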
>
> Premise 3: it is far easier to take a bunch of operating system images
>     and make them share the parts they need to share (i.e., the page
>     cache), than to take a single image and pry it apart so that it
>     runs well on N processors.
>
> All of this leads us to an interesting twist on the clustering idea,
> one that I give credit for mostly to DEC (they have some of this
> implemented already).  Suppose you were to take a single big machine
> and run another instance of the OS every N processors where N is chosen
> so that it is well under the knee of the locking curve, i.e., around 4,
> maybe 8, but certainly no more than 8.
>
> Multiple OS's on a single box?  Wacky, huh?  But if you think about it,
> you've just instantly taken *EVERY* data structure in the kernel and
> multithreaded it.  Cool, no?  And it cost you nothing but some boot
> code.
>
> That's kinda cute but not very useful because what you really want is to
> be able to have all processors working together on the same data with only
> one copy of the data.  In other words, I don't care if I have one OS or
> 1000, I want all processors to be able to mmap /space/damn_big_file and
> poke at it.  And I don't want any stinkin' DSM - I want real, hardware
> based coherency.  Well, bucky, I'm here to tell ya, praise the lord,
> you can have it :-)  You need to make an SMPFS which lets other OS's
> put reference counts on your inodes.  The operation is extremely
> similar to what you have to do when you invalidate a page - you shoot
> down the other processor's TLB entries.  So we need something like
> that in the reverse.
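One way to picture the cross-OS reference counts and the reverse shootdown (everything below is invented for illustration: the
structure, the instance count, and the notification call are assumptions, not a real interface):

    #include <linux/spinlock.h>

    #define EX_MAX_INSTANCES 8             /* OS images sharing the box (assumed) */

    /* Assumed cross-kernel notification, e.g. an IPI or mailbox message. */
    extern void ex_notify_instance_invalidate(int instance);

    /* Hypothetical per-inode sharing state an "SMPFS" might keep.
     * Assume the lock is initialised when the inode is first shared. */
    struct ex_inode_share {
            spinlock_t   lock;
            unsigned int refs[EX_MAX_INSTANCES];   /* per-instance refcounts */
    };

    /* Another OS instance registers a reference on our inode. */
    static void ex_get_remote(struct ex_inode_share *s, int instance)
    {
            spin_lock(&s->lock);
            s->refs[instance]++;
            spin_unlock(&s->lock);
    }

    /*
     * Before freeing or truncating the inode, tell every instance that
     * still holds a reference to drop its cached state -- the reverse
     * analogue of shooting down another CPU's TLB entries.
     */
    static void ex_shootdown(struct ex_inode_share *s)
    {
            int i;

            spin_lock(&s->lock);
            for (i = 0; i < EX_MAX_INSTANCES; i++)
                    if (s->refs[i])
                            ex_notify_instance_invalidate(i);
            spin_unlock(&s->lock);
    }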
>
> If you think I'm waving my hands wildly, I am.  But this is definitely
> doable, and as hard as it seems, it is easily an order of magnitude easier
> than threading the kernel to get to even 32 processors.  I've lived
> through that twice; it's about a 7-year process (in hindsight; beforehand,
> everyone said it would be maybe 18 months).
>
> Read the papers.  Think.  Think again.  Let's talk.  I can set up a
> perf@bitmover alias if this becomes too off topic.
>
> --lm
>
> P.S.  I call these SMP clusters, to distinguish them from HA or HPC clusters.
> SMP clusters are for the enterprise - these are the clusters that will get
> Linux on big iron running Oracle and kicking serious butt.  Fast.
>


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/


