Auto-Adaptive scheduler - Final chapter ( the numbers ) ...
Alex Khripin
kolex at erols.com
Fri Jan 28 01:12:23 CST 2000
From: Steve Underwood <steveu en coppice.org>
Date: Fri, 28 Jan 2000 09:25:46 +0800
Subject: Re: Auto-Adaptive scheduler - Final chapter ( the numbers ) ...
Davide Libenzi wrote:
> Thursday, January 27, 2000 6:10 PM
> > Horst von Brand <vonbrand at pincoya.inf.utfsm.cl> wrote:
> > As was said here time and again: This is ridiculously unrealistic. Your
> > benchmark plus whatever scheduler you use fits in the smallest cache. Use
> > some code that is some 2 or 3Kb at least between schedules, and run
> > _different_ code each time. I'd bet you see quite different numbers that
> way.
>
> These are my words from my answer to Larry:
>
> <DONTSPLITTHIS>
> This is a pure switching-time test.
> It is a _real_ switching test that can be used to measure switching times
> under _certain_ RQ loads:
>
> vmstat during a "lat_ctx -s 0 20" run shows no more than 5-6 tasks in the RQ
>
> vmstat during a "threads 20 xx" run always shows an RQ of 20 (with no need
> of cost compensation)
>
> Now, as even my sister (who works in business) can see, this code is
> stored entirely in the cache, and the write to the counter goes onto the
> process stack.
> I _want_ this behaviour because I _want_ a measure of clean switching
> times, keeping cache issues out of the test.
> </DONTSPLITTHIS>
>
> Anyway, adding a cache footprint to both cases will add a constant term
> that shrinks the percentage difference even further.
I really do urge you to go to a bookstore, find a recent book on computer
architecture, and try to find out how a modern machine works. Larry
recommended a book. Maybe you should try that one.
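To keep the discussion concrete: a test of the kind you describe boils down
to a pipe ping-pong between two processes, something like the sketch below.
This is my own illustration, not your actual program, and the iteration
count is arbitrary.

/* Sketch of a "pure switching time" test in the lat_ctx style: two
 * processes bounce a one-byte token through a pair of pipes, so each
 * round trip costs two context switches, the working set is tiny, and
 * the only data written is a loop counter on each process's stack. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>

#define ITERS 100000    /* arbitrary; large enough to average over */

int main(void)
{
    int ping[2], pong[2];
    char token = 'x';
    long i;
    struct timeval start, end;
    double usec;

    if (pipe(ping) < 0 || pipe(pong) < 0) {
        perror("pipe");
        exit(1);
    }

    if (fork() == 0) {
        /* Child: echo the token back, ITERS times. */
        for (i = 0; i < ITERS; i++) {
            read(ping[0], &token, 1);
            write(pong[1], &token, 1);
        }
        _exit(0);
    }

    gettimeofday(&start, NULL);
    for (i = 0; i < ITERS; i++) {
        write(ping[1], &token, 1);
        read(pong[0], &token, 1);
    }
    gettimeofday(&end, NULL);

    usec = (end.tv_sec - start.tv_sec) * 1e6
         + (end.tv_usec - start.tv_usec);
    /* One round trip = two switches (plus pipe overhead). */
    printf("%.2f usec per switch\n", usec / (2.0 * ITERS));
    return 0;
}

Note how small that is: the code, the pipes, and the counters all fit in
the L1 caches with room to spare, which is exactly the problem.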
Almost any modern CPU has at least two levels of cache. What people are
referring to when criticising the validity of your tests is the need to
overflow the bounds of the small level 1 I and D caches and force loading
from at least the level 2 cache. Your test is too small in scale to do that,
which makes it totally meaningless.
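Concretely, the fix people keep suggesting is to give each process a working
set bigger than the L1 D-cache, so that every reschedule has to refill from
L2 or from memory. Dropped into the loop of a test like the sketch above, a
few lines are enough (the 64KB footprint and the 32-byte line size are my
assumptions, not measured values):

#define FOOTPRINT (64 * 1024)  /* assumed: bigger than any L1 D-cache here */

static char buf[FOOTPRINT];

/* Dirty one byte per cache line so every line has to be fetched and
 * written back; writing (rather than just reading) also gives each
 * forked process its own copy-on-write pages instead of shared ones. */
static void touch_working_set(void)
{
    int i;

    for (i = 0; i < FOOTPRINT; i += 32)    /* assumed 32-byte lines */
        buf[i]++;
}

Call touch_working_set() once per iteration in both processes and the
warm-L1 numbers go away.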
If you don't understand why a modern CPU has multiple levels of cache (even on
a single chip design like the Celeron); if you don't understand their relative
performance, latencies, loading and storing behaviour, and why there are
separate I and D caches; if you don't understand how SDRAM and DDR SDRAM work,
why RDRAM doesn't work well, and the latency patterns of these parts; if you
don't understand the access and speed patterns of I/O busses, like PCI; and if
you don't understand the trends in the relative performance of these parts and
the CPU core over the next 2 to 5 years, you really cannot contribute much to
OS design.
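You can see those latencies for yourself with a few lines of C: a dependent
pointer chase makes each cache level show up as a step in the numbers. A
rough sketch follows; the array sizes and iteration count are illustrative,
not from any particular benchmark suite.

/* Chase a randomised pointer cycle through arrays of growing size.
 * Each load depends on the previous one, so the loop runs at the
 * latency of whichever cache level the array fits in. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define NLOADS 1000000L

static double ns_per_load(size_t nptrs)
{
    void **area = malloc(nptrs * sizeof(void *));
    size_t *order = malloc(nptrs * sizeof(size_t));
    void **p;
    size_t i, j, tmp;
    long n;
    struct timeval t0, t1;

    /* Build a random permutation, then link it into a single cycle so
     * the access pattern defeats simple prefetching. */
    for (i = 0; i < nptrs; i++)
        order[i] = i;
    for (i = nptrs - 1; i > 0; i--) {
        j = (size_t)rand() % (i + 1);
        tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
    for (i = 0; i < nptrs; i++)
        area[order[i]] = &area[order[(i + 1) % nptrs]];
    free(order);

    gettimeofday(&t0, NULL);
    for (n = 0, p = area; n < NLOADS; n++)
        p = (void **)*p;                /* dependent load chain */
    gettimeofday(&t1, NULL);

    if (p == NULL)                      /* keep 'p' live */
        puts("unreachable");
    free(area);
    return ((t1.tv_sec - t0.tv_sec) * 1e6
          + (t1.tv_usec - t0.tv_usec)) * 1000.0 / NLOADS;
}

int main(void)
{
    size_t kb;

    for (kb = 4; kb <= 4096; kb *= 4)
        printf("%5lu KB: %6.1f ns/load\n", (unsigned long)kb,
               ns_per_load(kb * 1024 / sizeof(void *)));
    return 0;
}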
"Ah!", you might say, "but I'm not an electronics guy". It really doesn't
matter. Computer architecture design has almost nothing to do with
electronics, and almost everything to do with simple topology. You can learn
about the key constraints that electronic and mechanical design impose in a
couple of hours. The rest is _all_ about topology. You only need to understand
electronic design at the level of "A PCB can realistically have 6 to 8 layers.
Two are taken by power, so you have 4 to 6 layers to work with in connecting
everything together. The refractive index of epoxy-glass boards means signals
travel at about 200mm/ns across the board (so a signal crossing a 300mm board
takes about 1.5ns, on the order of a full CPU clock cycle). Current IC
packaging constraints are that you can have 40 pins cheaply, 400-500 pins
aren't too expensive, and you can go up to 1000-1500 on a really high value
part.". Of course, the list
is rather longer than this, involving the performance of core logic, and
memory cells, but you only need to understand what the elements can achieve.
After that, computer architecture is about chaining these elements in the most
effective way to form a system - it's all topology.
An OS is a hardware manager. If you have ever worked on an engineering project
where the manager was not well versed in the technologies of the project, you
will understand what happens when people with nothing more than software
knowledge design an OS! In fact, the need to understand the hardware is a
growing one. As machines progress to more and more complex designs, with SMP,
several layers of cache, more complex DRAM architectures, and so on, the
machines are becoming less homogeneous. Even application designers now need a
much greater understanding of the machines they use, to grasp why their
program behaves the way it does.
Steve
Thanks, Steve, for showing your great grasp of the big picture and hardware technology. I am very envious of your knowledge, and I am awed by your presence.
However, I think that Davide knows something about caches as well. Cache problems are important, but so are effective code and good algorithms. Of course, yours is a pretty
secure stance - you can never add anything new, because it will increase cache misses. But an extra kilobyte (probably less) isn't going to matter. That number is dwarfed by
the process itself (i.e. the 'better' test program you've been asking for) and by the task data that is needed to calculate goodness values. In fact, since Davide's patch
doesn't need to access every task in the RQ, the chance of a cache miss is actually lower: some task_structs, though out of the cache, will not be touched anyway, because
they belong to low-priority groups. Granted, the task structures don't matter in the I-cache, but even there the added code is very small compared to the process code.
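For reference, here is the selection loop we are talking about, paraphrased
from kernel/sched.c as a self-contained mockup rather than a verbatim copy:
the stock scheduler walks every runnable task and computes goodness() for
each one, so one task_struct is pulled through the D-cache per runnable
task, linearly in the RQ length.

#include <stdio.h>

struct task_struct {
    int counter;        /* remaining timeslice */
    int priority;
    int has_same_mm;    /* stand-in for the mm-affinity bonus */
    struct task_struct *next;   /* run-queue link */
};

/* Simplified goodness(): timeslice left, plus small bonuses. */
static int goodness(struct task_struct *p)
{
    if (p->counter == 0)
        return 0;
    return p->counter + p->priority + (p->has_same_mm ? 1 : 0);
}

static struct task_struct *pick_next(struct task_struct *runqueue)
{
    struct task_struct *p, *next = NULL;
    int weight, best = -1000;

    /* The O(n) scan under discussion: touches every task_struct. */
    for (p = runqueue; p; p = p->next) {
        weight = goodness(p);
        if (weight > best) {
            best = weight;
            next = p;
        }
    }
    return next;
}

int main(void)
{
    struct task_struct t2 = { 4, 20, 0, NULL };
    struct task_struct t1 = { 6, 15, 1, &t2 };
    printf("picked task with priority %d\n", pick_next(&t1)->priority);
    return 0;
}

With priority groups, the tasks in low-priority groups never enter this
loop at all, which is where the lower miss count comes from.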
Furthermore, the continual assertions that "RUN LOADS OVER 4 DO NOT EXIST" are quite nonsensical. People have brought up a number of examples where the RQ gets over 10
elements, and at that point you're already getting performance gains from the patch (I discussed the cache issues above). Personally, I /like/ the idea of optimizing for
more than just the one case that doesn't even need optimization (RQ < 4).
Regards,
Alex K.