A different metric for scheduler optimisation
Steve Underwood
steveu at coppice.org
Sun Jan 30 07:14:00 CST 2000
Davide Libenzi wrote:
> On Sun, 30 Jan 2000, Steve Underwood wrote:
> > Davide Libenzi wrote:
> >
> > > On Sat, 29 Jan 2000, Horst von Brand wrote:
> > > > In case your application switches rapidly, it is thrashing the cache, which
> > > > is crucial for performance with current CPUs. You simply don't want to do
> > > > that, ever. You get best performance by _never_ switching unless forced to
> > > > do so, but that isn't realistic.
> > >
> > > If you switch fast you have more cache reloads, but probably of fewer cache
> > > lines (or pages), since a task that runs for a short time has a lower
> > > probability of "touching" RAM locations.
> >
> > On what type of application mix do you base that hypothesis?
> >
> > Programs which reschedule at a high rate seem usually to be those shunting large
> > blocks of data around, without doing an awful lot of processing on them. They
> > can touch a great deal of data in a very short time.
> >
> > Programs which reschedule less often generally fall into one of two categories.
> > One class of program rapidly shifts large blocks of data, with lightweight
> > processing, using select (or some other event handling scheme, like SIGIO). The
> > other class is doing some heavyweight number crunching, and progresses through
> > the data comparatively slowly.
> >
> > I draw almost the opposite conclusion from this mix - fast rescheduling programs
> > flush the cache rapidly. "Ah!", you may say, "but they only flush the cache when
> > they have finished with the data, because sending out the results from
> > processing that data is the very reason for the reschedule". This is true. If
> > it's a genuinely lightweight activity, such as "read some stuff from disk and
> > throw a section of it at a network connection", one could not do better (unless
> > you can use bus-mastering to completely avoid the processor ever seeing the data,
> > which would be extremely lightweight processing). It certainly doesn't point to
> > the pattern of cache behaviour which you hypothesise, though.
>
> We could sit here discussing my hypothesis for a while, with you showing me an
> app that confirms your hypothesis and me showing one that confirms mine.
> I included the word "probability" to mean that app types can exist that fall
> outside my hypothesis.
> Anyway, consider the fastest-switching app that I know of (this could be an FTP
> server, for example):
>
> char buffer[N];
>
> for (;;)
> {
>     read(infd, buffer, N);
>     write(outfd, buffer, N);
> }
>
> For N = 1 this app "can switch very fast" and touches very few bytes, and this
> seems to confirm my hypothesis.
This is a very artificial, and seriously inefficient, example. When you said fast
switching, I assumed your intention was fast _real_ application switching. Can we
forget the silly benchmark-like examples, and talk about real applications? The
design of some of those is pretty bad (well, that may often be unfair: some
applications have been around, with minimal maintenance, for 30 years, and their
design is simply no longer appropriate for a modern machine).
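
For comparison, the kind of "real" pattern I have in mind is the event-driven,
select()-based style I mentioned above. Here is a minimal sketch, and only a
sketch: the descriptors, the buffer size and the handle_readable() helper are
purely illustrative, and error handling is cut to the bone.

/* Sketch only: an event-driven copier in the select() style described above.
 * A real server would manage many connections and deal properly with partial
 * reads and writes. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/select.h>

#define BUF_SIZE 8192

static void handle_readable(int infd, int outfd)
{
    char buffer[BUF_SIZE];
    ssize_t n = read(infd, buffer, sizeof(buffer));

    if (n == 0)
        exit(0);                          /* EOF: nothing left to copy */
    if (n > 0)
        write(outfd, buffer, (size_t)n);  /* ignores short writes in this sketch */
}

int main(void)
{
    int infd = 0;    /* illustrative: stdin as the data source */
    int outfd = 1;   /* illustrative: stdout as the data sink  */

    for (;;) {
        fd_set rfds;

        FD_ZERO(&rfds);
        FD_SET(infd, &rfds);

        /* Block until there is data; the task stays off the run queue and
         * only reschedules when there is real work to do. */
        if (select(infd + 1, &rfds, NULL, NULL, NULL) < 0) {
            perror("select");
            exit(1);
        }

        if (FD_ISSET(infd, &rfds))
            handle_readable(infd, outfd);
    }
}

The point is that such a program only goes back onto the run queue when
select() says there is real work, so its reschedule rate is set by the I/O,
not by the shape of the loop.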
> What if N = 1000000 ?
> If N = 1000000 it flushes out every single bit of cache, but ... can this app
> switch very fast?
> What is the "time footprint" of this app?
> The time footprint of this app can be defined as the average time taken by the
> instruction pointer (EIP for Intel) to reach the same value again.
> The pattern of this app is:
> RCW
> R = read
> C = compute (0)
> W = write
> and the entire pattern must be used to define the time footprint.
> A necessary condition for an app to switch fast, not once but continuously,
> is a low time footprint, which this app does not have (for that N).
> And the time footprint of this app is inversely proportional to the average
> speed of the devices used to transfer the data.
> Moreover, if you spawn M copies of this app, the switching rate is bounded
> above by the lowest throughput of the devices used.
What types of device are we talking about here? Disk, networking, etc. will behave
differently. You won't get 1MB when you read 1MB from a TCP source, for example. I
think for most real world I/O what I said still holds. Again, you are talking about
something artificial.
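
To make the TCP point concrete: read() on a socket hands back whatever happens
to be buffered, so asking for 1MB rarely gets you 1MB in one call. A real
application has to loop, something like the sketch below; the function name and
its arguments are illustrative only.

/* Sketch only: loop until 'want' bytes have arrived, or EOF/error. */
#include <unistd.h>
#include <sys/types.h>

ssize_t read_exactly(int fd, char *buf, size_t want)
{
    size_t got = 0;

    while (got < want) {
        ssize_t n = read(fd, buf + got, want - got);

        if (n <= 0)            /* EOF or error: give up with what we have */
            return (ssize_t)got;
        got += (size_t)n;      /* typically much less than 'want' per call */
    }
    return (ssize_t)got;
}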
What you seem to miss when we criticise your postings is that we aren't a
stuck-in-the-mud bunch of old farts who won't listen to new ideas. We have just seen
enough artificial tests give artificial results that we don't trust any examples or
tests that can't be shown to be representative of what people have to contend with in
the real world on a large scale. In many cases, even when it can be shown to be like
real life, the response will be that the application design is dumb - fix that and
not the OS. If you try to use examples and tests which are closer to real life you
will get a far warmer reception here. This is a "no bench-marketers" zone!
> This app example highlights something that I'd really like in a processor:
>
> THE CONTROL OF CACHE LINES ( PAGES )
>
> If N = sizeof(cache) and infd and outfd are very fast devices, your hypothesis
> becomes true: cache thrashing + fast switching.
> But we don't touch the data at all, so what if we could say (and the IRQ
> drivers too):
>
> disable_cache_map(buffer, N);
>
> We don't need this data loaded into the cache, since we don't need to do any
> computation on it (and DMA operations read from main memory, not from the CPU cache)!
The Pentium III can do this, more or less, with its media streaming facility. This
allows the L2 cache (I think it's just L2, but it might be L1 as well - I've never
actually worked with it) to be bypassed in a controlled manner, primarily for
streaming multi-media DSP activities, where the data is read, processed, written, and
never read again. I don't think Linux currently permits use of this functionality,
but for certain key applications it has substantial potential for cache optimisation.
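
(For the curious, and purely as an illustrative sketch rather than anything I
have tested: later compilers expose these streaming stores as intrinsics in
<xmmintrin.h>, so a copy that largely keeps the destination data out of the
cache looks roughly like the following. The routine name and the alignment and
size assumptions are all mine.)

/* Sketch only: copy a buffer using PIII non-temporal stores via compiler
 * intrinsics.  Assumes 16-byte-aligned buffers, a length that is a multiple
 * of 4 floats, and a compiler built with SSE support. */
#include <stddef.h>
#include <xmmintrin.h>

void copy_nontemporal(float *dst, const float *src, size_t nfloats)
{
    size_t i;

    for (i = 0; i + 4 <= nfloats; i += 4) {
        __m128 v = _mm_load_ps(src + i);   /* normal (cached) load        */
        _mm_stream_ps(dst + i, v);         /* store that bypasses the cache */
    }
    _mm_sfence();                          /* make the streamed stores visible */
}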
However, for what you want here it isn't necessary anyway. If the data passes through
the machine with bus-mastered (or DMA) read and write, the processor and its cache
need never see it on most machines. Of course, an exception would be if the relevant
main memory locations are already in cache, when bus snooping would update the cache.
Whether this can be achieved in practice depends on the I/O device types.
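
(As an aside, and only a sketch under my own assumptions rather than anything
that was asked for: on Linux the disk-to-network case can at least keep the
copying inside the kernel with sendfile(2); on 2.2/2.4 the output descriptor
must be a socket and the input a regular file, and with suitable NIC support,
scatter-gather DMA and checksum offload, newer kernels can avoid the CPU
copying the payload at all. The wrapper below and its descriptor names are
illustrative only.)

/* Sketch only: push 'count' bytes of a file out of a socket without the data
 * ever being mapped into this process. */
#include <sys/sendfile.h>
#include <sys/types.h>

ssize_t send_file_chunk(int sock_fd, int file_fd, off_t *offset, size_t count)
{
    return sendfile(sock_fd, file_fd, offset, count);
}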
Steve
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo at vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/