Interesting analysis of Linux kernel threading by IBM

Sat Jan 22 18:01:21 CST 2000

On Fri, 21 Jan 2000, Larry McVoy wrote:
> Rendering is what we call an embarrassingly parallel application.
> In other words, very, very coarse grained parallelism works great for
> this, in fact, it works orders of magnitude better than what you described.
> Talk to Disney, Pixar, ILM, RFX - all of whom are heavily into this space,
> all of whom I've visited personally to talk about their computing needs,
> and all of whom use farms of uniprocessors for rendering.  There are 
> a bunch of other ones too, Digital Design, Pacific something (used to be
> Walnut Creek now are in Palo Alto), etc.  All the production and post
> production digital houses know that farms of machines that share nothing
> but a network are the highest performance and least cost way to do 
> rendering.

Are you saying that N processes running on N uniprocessor systems and
exchanging data over a network perform better than a single N-way SMP system
exchanging data in memory, because of cache effects (given the same
software architecture)?

> If you suggested a multithreaded application to do that to any of those
> guys in a job interview, and stuck to your opinion that it was a good
> idea, my prediction is that you would be standing on the street wondering
> what happened in less than 5 minutes.  Those people are doing hard work 
> on short schedules and really don't have time to waste.

I haven't had your luck in meeting such interesting people, so I can't guess
what they would say to me.

> I am starting to wonder if you've ever coded up an application both ways
> and tested it. If you had tried the rendering model that you suggested
> and then tried the same thing all in one process, I believe that your
> way would show dramatically lower performance.  It's been shown that
> while the model of fine grained parallelism, especially in data parallel
> applications like what you are talking about, while that model can be
> supported, the cache effects of doing so on an SMP dramatically _REDUCE_
> the performance.  It's always been seen that you are better off to divide
> up the data, do all the different transformations to a chunk of data by
> one process on one processor in one cache, rather than by spreading the
> same data over a bunch of caches.  In fact, all the research in parallel
> applications boils down to ``how much can you divide up the data''.
> If there is so much focus on that, all of it performance related, why
> is it that you believe something that certainly seems to fly in the face
> of both theory and practice?

The rendering pipeline (as the name states) is a highly parallel
environment in which each subsystem takes one type of data, transforms it into
a new kind of data, and passes the result to the next subsystem. This is true
for a scanline renderer (using shadow maps and environment mapping), not for a
raytracer.  In this environment I would expect (you're right, I've only coded
single-threaded renderers) that if I decompose the pipeline into N steps and
I have an N-way SMP system, I'll get good performance. Here "good" does not
mean TotalTime / N, but a time T such that:

(TotalTime / N) < T << TotalTime 
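
As a rough illustration of that bound, here is a minimal sketch of an
idealized pipeline timing model. It assumes one stage per CPU, fixed per-stage
times, and no communication or cache cost at all; none of this is measured,
and the function names and sample stage times are made up for illustration.

```python
def serial_time(num_items, stage_times):
    """Time for one CPU to push every item through every stage in turn."""
    return num_items * sum(stage_times)


def pipeline_time(num_items, stage_times):
    """Ideal time for a linear pipeline with one CPU per stage.

    Assumption: hand-off between stages is free (no cache or bus cost).
    The first item must traverse every stage; after that, one item
    completes every time the slowest stage finishes.
    """
    if num_items <= 0:
        return 0.0
    fill = sum(stage_times)        # latency of the first item
    bottleneck = max(stage_times)  # slowest stage sets the steady rate
    return fill + (num_items - 1) * bottleneck


if __name__ == "__main__":
    # Hypothetical workload: 100 items, 4 stages of arbitrary time units.
    for stages in ([1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 5.0]):
        total = serial_time(100, stages)
        t = pipeline_time(100, stages)
        print(stages, "TotalTime/N =", total / len(stages),
              "T =", t, "TotalTime =", total)
```

With balanced stages T stays close to TotalTime / N; with one slow stage it
drifts toward TotalTime. Any real cache or hand-off cost would only push T
higher than this model predicts.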

If even a highly parallel job like a renderer cannot be coded well for SMP,
what do we keep it for?

OK, probably the solution you are pushing is clusters of SMPs.

But going back to what I asked at the top of this message: given a cluster of
N computers, each an M-way SMP system, exchanging data over Ethernet, have
you measured that (cost aside) a single M x N-way SMP system will perform
(scale) worse than the cluster?
I can't believe that cache effects are bigger than the Ethernet bottleneck.
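
A back-of-envelope way to put a number on the network side is sketched below.
It is only a model: the 100 Mbit/s line rate is an assumption (the link speed
is not stated anywhere in this thread), protocol overhead is ignored, and the
helper name and buffer size are made up for illustration.

```python
# Assumed nominal Fast Ethernet line rate, converted to bytes per second.
FAST_ETHERNET_BYTES_PER_S = 100e6 / 8


def transfer_seconds(num_bytes, bandwidth_bytes_per_s, latency_s=0.0):
    """Ideal time to move num_bytes over a link: latency plus size/bandwidth."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return latency_s + num_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    buffer_bytes = 1_000_000  # hypothetical intermediate buffer size
    net = transfer_seconds(buffer_bytes, FAST_ETHERNET_BYTES_PER_S)
    print("network transfer, seconds:", net)
```

Calling the same function with a memory-copy bandwidth measured on the SMP
box gives the in-memory hand-off cost; the cache penalty would have to exceed
the difference between the two for the cluster to come out ahead.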

Unfortunately I have neither a Beowulf system nor a 32-way SMP system to
test my ideas on (only a poor 2-way).

Davide.
