A different metric for scheduler optimisation

Davide Libenzi dlibenzi at maticad.it
Sun Jan 30 14:19:38 CST 2000


On Sun, 30 Jan 2000, Steve Underwood wrote:
> > We could keep debating my hypothesis for a while, with you showing me an
> > app that confirms your hypothesis and me showing one that confirms mine.
> > I included the word "probability" to mean that there can exist app types
> > that fall outside my hypothesis.
> > Anyway, consider the fastest-switching app that I know ( this could be an
> > FTP server, for example ):
> >
> > char buffer[N];
> > ssize_t n;
> >
> > for(;;)
> > {
> >         n = read(infd, buffer, N);
> >         if (n <= 0)
> >                 break;
> >         write(outfd, buffer, n);
> > }
> >
> > For N = 1 this app "can switch very fast" and touches very few bytes, and
> > this seems to confirm my hypothesis.
> 
> This is a very artificial, and seriously inefficient, example. When you said fast
> switching, I assumed your intention was fast _real_ application switching. Can we
> forget the silly benchmark-like examples, and talk about real applications? The
> design of some of those is pretty bad (Well, that may often be unfair. Some
> applications have been around, with minimal maintenance, for 30 years, and their
> design is just no longer appropriate for a modern machine).

This is only an example to show that if an app switches fast, there is a
higher probability that it touches little data; and when I say "touch" I don't
mean only "for(i=0;i<N;i++) touch(data[i]);" but I also include I/O operations.
Fast switching implies a small "time footprint", but to have a small time
footprint you must either load / store few bytes or have very fast devices on
both sides.
This does not mean it is always that way, but, as I specified in the previous
message, there is a strong probability.
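
To put a number on the "time footprint" idea, here is a throwaway sketch of
the bound I have in mind ( the additive model, the function name and the
10 MB/s device figures are assumptions for illustration only, and syscall
overhead is ignored, as in the discussion ) :

#include <stdio.h>

/* A crude additive model of the "time footprint" of one pass of the
 * loop above : time to move the data in, compute on it, move it out.
 * Device speeds are in bytes/sec, compute time in seconds. */
static double time_footprint(double bytes, double in_Bps,
                             double out_Bps, double compute_s)
{
        return bytes / in_Bps + compute_s + bytes / out_Bps;
}

int main(void)
{
        /* N = 1 : tiny footprint, a huge possible switch rate, and only
         * one byte touched per pass. */
        printf("N=1      : max %.0f switches / sec\n",
               1.0 / time_footprint(1.0, 10e6, 10e6, 0.0));

        /* N = 1000000 : the same loop, now limited by device speed. */
        printf("N=1000000: max %.2f switches / sec\n",
               1.0 / time_footprint(1e6, 10e6, 10e6, 0.0));
        return 0;
}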

> > What if N = 1000000 ?
> > If N = 1000000 it flushes out every single bit of cache, but ... can this
> > app switch very fast ?
> > What is the "time footprint" of this app ?
> > The time footprint of this app can be defined as the average time taken by
> > the instruction pointer ( EIP on Intel ) to return to the same value.
> > The pattern of this app is :
> > RCW
> > R = read
> > C = compute ( here zero )
> > W = write
> > and the entire pattern must be used to define the time footprint.
> > A necessary condition for an app to switch fast, not once but continuously,
> > is a low time footprint, which this app does not have ( for that N ).
> > And the time footprint of this app is inversely proportional to the average
> > speed of the devices used to transfer the data.
> > Moreover, if you spawn M copies of this app, the switching rate is bounded
> > above by the lowest throughput of the devices used.
> 
> What types of device are we talking about here? Disk, networking, etc. will behave
> differently. You won't get 1MB when you read 1MB from a TCP source, for example. I
> think for most real world I/O what I said still holds. Again, you are talking about
> something artificial.

The TCP/IP layer will split the 1 MB into M MTU-sized chunks, but you must
still wait to receive the whole 1 MB of data. The same holds for the write.
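
In code, the point is simply that a single read() on a socket returns whatever
has arrived, so collecting the whole buffer is a loop, and the time footprint
covers the whole wait ( a minimal sketch; readn() is a common helper name, not
a standard call ) :

#include <sys/types.h>
#include <unistd.h>

/* Read exactly len bytes from fd, looping over the short reads that a
 * TCP socket gives you.  Returns the count read ( < len only at EOF ),
 * or -1 on error. */
static ssize_t readn(int fd, char *buf, size_t len)
{
        size_t done = 0;

        while (done < len) {
                ssize_t n = read(fd, buf + done, len - done);
                if (n < 0)
                        return -1;      /* error */
                if (n == 0)
                        break;          /* EOF */
                done += (size_t) n;
        }
        return (ssize_t) done;
}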


> What you seem to miss when we criticise your postings, is that we aren't a
> stuck-in-the-mud bunch of old farts who won't listen to new ideas. We have just seen
> enough artificial tests give artificial results, that we don't trust any examples or
> tests that can't be shown to be representative of what people have to contend with in
> the real world on a large scale. In many cases, even when it can be shown to be like
> real life, the response will be that the application design is dumb - fix that and
> not the OS. If you try to use examples and tests which are closer to real life you
> will get a far warmer reception here. This is a "no bench-marketers" zone!

The only thing that I want to show is that there is only a "little" probability
that :

fast switching app <==> great cache thrashing

The app shown as an example is an extreme case of a typical I/O bound process.
Now, to show that "fast switching app" implies "great cache thrashing", you
must have big data ( great thrashing ) together with a low "time footprint"
( high switch rate ).
To have a low time footprint you need little computing time between the I/Os.
To have big data together with a low time footprint, both sides of the I/O
channel must be very fast.
I want to give you some data.
Suppose that the loop I showed you is the inner loop of an FTP server,
built upon a fast SCSI disk and a 100 Mbit network card.
Suppose 80 Mbit/s throughput for the SCSI disk and 100 Mbit/s for the card.
Suppose that we move 32 KB at a time.
This app can cause great cache thrashing, but what will its switches / sec be ?
Without counting the I/O overhead we have ( very simple math ) :

Time SCSI = ( 32 * 1024 * 8 ) / ( 80 * 1024 * 1024 )  s ~= 3.1 ms
Time Net  = ( 32 * 1024 * 8 ) / ( 100 * 1024 * 1024 ) s  = 2.5 ms

So, taking the overhead and the computing as zero, the switch rate of this app
can't be more than ( and that with the devices fully saturated ! ) :

Switch Rate Max = 1.0 / ( 0.0031 + 0.0025 ) ~= 178 switches / sec

This is what formed my opinion that the "dangerous" processes ( as far as
cache thrashing goes ) live in the middle range of switching rates.
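
A throwaway program with the same math, if you want to play with the device
figures ( which are the assumed ones above ) :

#include <stdio.h>

int main(void)
{
        double chunk  = 32.0 * 1024 * 8;        /* bits per pass    */
        double scsi   = 80.0 * 1024 * 1024;     /* SCSI, bits/sec   */
        double net    = 100.0 * 1024 * 1024;    /* card, bits/sec   */
        double t_scsi = chunk / scsi;           /* ~ 3.1 ms         */
        double t_net  = chunk / net;            /*   2.5 ms         */

        /* Zero overhead and zero compute : the best case for switching. */
        printf("max switch rate = %.0f / sec\n", 1.0 / (t_scsi + t_net));
        return 0;
}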

> > This app example highlights a thing that I'd really like in a processor :
> >
> > THE CONTROL OF CACHE LINES ( PAGES )
> >
> > If N = sizeof(cache) and infd and outfd are very fast devices, your
> > hypothesis becomes true : cache thrashing + fast switching.
> > But we don't touch the data at all, so what if we could say ( and the IRQ
> > drivers too ) :
> >
> > disable_cache_map(buffer, N);
> >
> > We don't need this data loaded into the cache, since we don't need to do
> > any computing on it ( and DMA operations read from main memory, not from
> > the CPU cache ) !
> 
> The Pentium III can do this, more or less, with its media streaming facility. This
> allows the L2 cache (I think it's just L2, but it might be L1 as well - I've never
> actually worked with it) to be bypassed in a controlled manner, primarily for
> streaming multi-media DSP activities, where the data is read, processed, written, and
> never read again. I don't think Linux currently permits use of this functionality,
> but for certain key applications it has substantial potential for cache optimisation.
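
For reference, the facility Steve describes is exposed as the SSE non-temporal
store instructions; a minimal user-space sketch, assuming an SSE-capable
compiler, a 16-byte aligned dst and n a multiple of 4 floats :

#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics ( PIII and later ) */

/* Copy n floats without polluting the cache on the store side :
 * _mm_stream_ps ( MOVNTPS ) writes around the cache hierarchy
 * through the write-combining buffers. */
static void copy_nocache(float *dst, const float *src, size_t n)
{
        size_t i;

        for (i = 0; i < n; i += 4)
                _mm_stream_ps(dst + i, _mm_loadu_ps(src + i));
        _mm_sfence();   /* make the streamed stores globally visible */
}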

Those streaming facilities, though, are not what I mean; what I mean is real
cache line control : the ability to lock the mapping of memory zones that we
want covered, and to unmap the zones where we know caching will not help us.
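
Applied to the copy loop from earlier in the thread, the ( purely
hypothetical, it exists nowhere ) call would be used like this :

char buffer[N];

/* Hypothetical : tell the CPU never to allocate cache lines for this
 * range, since we only stream data through it and never compute on it. */
disable_cache_map(buffer, N);

for(;;)
{
        ssize_t n = read(infd, buffer, N);
        if (n <= 0)
                break;
        write(outfd, buffer, n);  /* no hot cache lines get displaced */
}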

> However, for what you want here it isn't necessary anyway. If the data passes
> through the machine with bus-mastered (or DMA) read and write, the processor
> and its cache need never see it on most machines. Of course, an exception
> would be if the relevant main memory locations are already in cache, when bus
> snooping would update the cache.
> Whether this can be achieved in practice depends on the I/O device types.


Remember that the "buffer" address is contiguous only in virtual memory, and
the DMA hardware knows nothing about virtual addresses, only physical ones
( with the 16 MB ISA limit on top of that ), so the driver must do its own
memory copies, and in doing so it touches RAM and fills the cache.
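
A sketch of the kind of copy I mean, in 2.2-era driver style ( error handling
omitted; setup_dma_transfer() is a made-up stand-in for the device-specific
DMA programming ) :

#include <linux/slab.h>      /* kmalloc, GFP_DMA */
#include <linux/string.h>    /* memcpy           */
#include <asm/io.h>          /* virt_to_bus      */

/* The source buffer can be physically scattered and above 16 MB, so
 * the driver bounces it through DMA-reachable memory.  The memcpy()
 * is the CPU touching every byte - and filling the cache with it. */
static void dma_write_bounced(const char *buf, size_t len)
{
        void *bounce = kmalloc(len, GFP_DMA); /* contiguous, below 16 MB */

        memcpy(bounce, buf, len);
        setup_dma_transfer(virt_to_bus(bounce), len); /* made-up helper */
}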


Davide.

-- 
All this stuff is IMVHO


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo en vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/


