Auto-Adaptive scheduler - Final chapter ( the numbers ) ...

Marco Colombo marco at esi.it
Mon Jan 31 06:50:31 CST 2000


On Sat, 29 Jan 2000, Davide Libenzi wrote:

> On Fri, 28 Jan 2000, Marco Colombo wrote:
> [..]
> 
> I'll skip all the humor and related ...
> 
> > Please point me to the code you've written, instead of the books you've
> > read. This way I can get a better picture of your ideas on OS design.
> 
> If you'd read the patch, we could probably stop talking about the cache
> misses of the patch itself and move on to the real question that can
> make the patch die:
> can we admit such RQ loads in Linux, or can we say that in the rare cases
> that show this behaviour we can tolerate a non-optimal scheduler response ?

Those cases are not "rare". They happen *only* by mistake. That's different.

> Even if I've received mail from guys who have servers with RQ > 30, and as I've
> stated in the opening message of this thread ( four question marks ), I'm
> anything but sure about the convenience of merging the patch itself.
> 
> "Kernel code bloat" vs. "Uman loads gain"  :  1 - 0
> 
> This is a good reject point.
> 
> Not by speaking of "great cache thrashing", etc., which the patch itself
> doesn't have.

It's not a problem with your patch. It's a problem that is very likely to
appear when RQ > 30 (no matter what scheduler you're using). And I do
think that in such a condition the (standard) scheduler is not your major
problem. In other words, if you really find yourself with RQ > 30, you
should fix the cache problem *first*, then the scheduler. But the action
you take to reduce the cache thrashing problem is *very* likely to reduce
the RQ length as well... you see how this leaves very little room for
your patch to be useful in real-world cases.

But don't get me wrong. I like your patch, and I'd like it to be included,
but as a compile (config) option. I'd call it "Optimize scheduler for
more than 8(16) CPUs". The exact number should be decided from real-world
cases (not your benchmark). If you have 16 CPUs, an RQ of 20 is not
silly at all. So even if your patch gives only 2% less scheduler overhead
(measured) with an RQ of 20, it's welcome. I think it makes sense for
a heavily loaded 16-CPU SMP system to have an optimized version of the scheduler.

> 
> > 
> > 		 long RQ <=> high loads			(1)
> > 
> 
> By high loads I've always meant a high number of tasks in the RQ ( since I'm
> measuring scheduling times versus RQ length ), so as far as that goes, it's true.

Ok. Maybe I misunderstood your words. I'll discuss this later on.

> > > How many switch / sec will do a system with RQ << 1 ?
> > > Stay big say 500 ( keep this number in mind ).

Here you (seem to) imply a kind of close relationship between RQ length and
switch rate. 500 is far too small a number. Just yesterday I was setting
up an NFS server, and it showed around 4000 switches and an RQ << 1 (CPU idle
time was >75%, BTW), with just one client.

After that, I put some stress on the server for a while, managing to
raise the load average up to 134 (by running many processes like 'sum /dev/mem &'
or 'find / > /dev/null &', plus netscape and gimp with a local X server). And
under that (silly) high load, there were only 600 switches.

> > Let me state this: switch rate has *nothing* to do with load, nor does
> > the RQ. [ And RQ is not the same as context switching rate, BTW ].
> 
> 100 % agree.
> 

Ok. So please avoid phrasing like this:
<quoting you>
It'll be better to keep it on fast path since we need it faster under
high workloads not with RQ < 2.
</quoting you>

Here it seems that 'high workloads' and 'RQ < 2' are mutually exclusive
for you (thus making 'high workloads' and 'RQ > 2' the same thing).

> > That's 1 
> > That's 2
> > That's 3
> > That's 4
> 
> When Larry, Rik and Jamie state that long RQs are not usual, I agree
> with them ( see question marks above ).

Ok. 

> When they speak about cache misses, I agree with them if it's a general
> discussion about the need to keep the cache footprint of the kernel code low,
> but I can't agree if we speak about a "great cache thrashing" caused by the patch.

I don't think your patch causes a "great cache thrashing". It *may* cause
a few more cache misses in the normal case (small RQ). And when it becomes
effective (long RQ), other factors will cause "great cache thrashing".

> > 
> > 		long RQ <=> broken design (somewhere)	(2)
> > 
> 
> As I said in a private mail, I can agree with you here, but I think you
> can agree with me that such "broken designs" exist.

I agree they exist. But the kernel should not "support" them. By no means
should you lose 0.1% in the normal case to get better performance in a case
which is a mistake in itself.
If people keep using the wrong Web server (a multi-process one) for their
fast gigabit intranet, they'll always have problems. The scheduler is just
one of them, and not the biggest.

> > 
> > 		long RQ <=> something very bad happening	(3)
> > 
> 
> I prefer to reserve the phrase "very bad" for other things ;) , but I can agree
> here.

OK.

> > [analysis skipped]
> > 
> > > The data cache cost of the patch with RQ < "Cluster scheduler boot point" is
> > > a variable on the stack ( 4 bytes ), which, thinking about it, I can even remove.
> > > The code cache cost of the patch with RQ < "Cluster scheduler boot point" is 0,
> > > due to the fact that the code is the same as the previous scheduler.
> > > There is the block of code of the cluster scheduler that can be moved out of
> > > the fast path, but I think that It'll be better to keep it on fast path since
> > > we need it faster under high workloads not with RQ < 2.
> > 
> > Again, you can have high workloads with RQ ~= 1!
> 
> As you can see from my previous emails, by "high workloads" I've always meant
> "long RQ".

See above.

> > > But all of these topics would be clear if the guys who shoot at my patch's
> > > cache misses had actually read it.
> > 
> > You should be reading, in the first place...
> 
> I insist: if you want to speak about cache misses, read the patch.

I've never said your patch causes many cache misses.

But I do think that even a single cache miss is important, too.

> > The cache issue mainly comes into play when the system is LOADED... and:
> > - this can happen with RQ being small. The scheduler should be optimized
> >   for that (and actually is), since this is the NORMAL condition under
> >   high loads;
> 
> And my patch ? Please point out where it has more cache misses than the current
> scheduler and weigh that difference against the current scheduler's.

Again, even a single cache miss is important. By your own admission,
your patch slightly increases the i-cache footprint.

Your argument against this is that when the RQ is small, the switch rate is
small (below 500, you said), so scheduler performance has little impact.

While I do agree that when the switch rate is small scheduler performance
has little overall impact, I do not agree when you say that this happens when
the RQ is small. I think there is no relationship between the two.

Switch rate is a property of the kind of load you place on the system.
You can have a kind of load that causes a lot of switches but keeps the
RQ small, and one that causes only a few switches but shows a long RQ.

Just try the following:

(idle Gnome WS)
$ vmstat 5
procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
[...]
 1  0  0  16120   1664   1816  14200   0   0     0     0  113   264   3   1  96

that's basically me typing and a CPU monitor applet.

Now do:

$ dd if=/dev/zero bs=1 | dd of=/dev/null &
$ sleep 60
$ vmstat 5
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 2  0  0  16472   1664   1544  14536   0   0     0     0  101 59826  28  72   0
$ uptime
 11:45am  up 3 days, 21:15, 13 users,  load average: 1.32, 0.47, 0.20

(the vmstat line shows one "average" figure. Of course I ran vmstat
for a while)

As you see, the RQ is still small (<2), but the switch rate is ~60000. Here,
your patch (assuming it does make things worse in the normal case) has a
measurable negative impact.

Then, kill the dd, and run (as root):

$ for i in 0 1 2 3 4 5 6 7 8 9 ; do
> sum /dev/mem &
> done 

$ sleep 60; uptime; vmstat 5
 11:50am  up 3 days, 21:20, 13 users,  load average: 7.16, 2.48, 0.97
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
10  0  0  16736   1600   1296  13736   0   0     0     0  105   164  98   2   0

As you can see, here the RQ is quite long (~10), but the switch rate is small.
Your patch will have no effect here, since, as you've already pointed out,
switch time has little importance under a low switch rate.

These are just quick-and-dirty, shell-based examples. I'm sure someone else
could code some C programs to extend my results even further. The whole point
here is that these are NOT real-world cases. Just like your program to measure
context switch time. What I mean is that when running a real application,
if you see a switch rate of ~60000 that's simply too much. You have to
either redesign your application or, if it's already optimized, buy new
hardware. The same is true for a long RQ.
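As a rough sketch of the kind of C program meant above (the name pingpong.c and
the code are purely illustrative, nothing that was posted in this thread): two
processes bounce a single byte back and forth over a pair of pipes. Every round
trip forces two context switches, so the cs column of vmstat goes way up while
r stays at 1 or 2, because at most one of the two processes is runnable at any
moment.

/*
 * pingpong.c -- illustrative sketch only: two processes bounce one
 * byte back and forth over a pair of pipes.  Each round trip forces
 * two context switches, while at most one process is runnable at a
 * time, so the RQ stays below 2.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	int ab[2], ba[2];		/* parent->child and child->parent pipes */
	char c = 'x';

	if (pipe(ab) < 0 || pipe(ba) < 0) {
		perror("pipe");
		exit(1);
	}
	switch (fork()) {
	case -1:
		perror("fork");
		exit(1);
	case 0:				/* child: echo every byte back */
		for (;;) {
			if (read(ab[0], &c, 1) != 1)
				exit(0);
			write(ba[1], &c, 1);
		}
	default:			/* parent: send a byte, wait for the echo */
		for (;;) {
			write(ab[1], &c, 1);
			if (read(ba[0], &c, 1) != 1)
				exit(0);
		}
	}
	return 0;
}

Compile it with something like 'gcc -o pingpong pingpong.c', run it in the
background and watch 'vmstat 5': you should get roughly the same picture as
the dd pipeline above.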

> 
> > - by the time you get a long RQ, the system is so overloaded (in real cases)
> >   that the cache issue is a MAJOR one anyway, no matter the scheduler...
> 
> I can agree here: the more a process runs, the higher the probability that it
> touches new RAM locations.

Given that cache misses are a major issue under that kind of condition,
and that they do have a BIG performance impact, what's the whole
point of optimizing the scheduler? Your patch makes sense only if
you manage to show a (real) case where you have:
- a long RQ (for your patch to be effective);
- a high switch rate (for the scheduler time to be important);
- a low cache miss rate (so that cache misses do NOT slow the system down
  to 1/10 of its performance).

My point is that if a real application under a real-world load meets
the first two conditions, which are necessary for your patch to
make sense, it's very likely to break the third one. At that point,
with a thrashing cache, you have a much bigger problem than a sub-optimal
scheduler.
You should really do your best to solve the cache problem, which means
redesigning the application, buying new hardware, or whatever.
Once you have solved the cache problem, it is very likely that your
RQ has shortened and your switch rate has dropped as a side effect.

> > And, yes, I can think of a way to enlarge the RQ without other big
> > system impacts. Just spawn N identical processes (or threads, whatever),
> > doing just an endless for(;;); and you'll see your RQ go to N.
> > This is NOT a RL workload. And anyway your patch will have little
> > effect on this. The scheduler gets invoked only 1/timeslice times a second.
> > No matter how long the RQ is.
> 
> IRQs apart, this is evident.
> 
> 
> > The only thing you can do to get the scheduler to play a major role is
> > to have the threads do *nothing*, just release the CPU as soon as they
> > get scheduled. I can't think of any real world application that will
> > come close to that behaviour, unless it's very badly designed (or very
> > badly used).
> 
> This is what threads.c does, but it's an instrument for switch-time measurements.

No. It does not measure switch time in general. It measures switch time *under
the long RQ condition*. Running a (real) system under that condition is a
mistake 99.9% of the time (unless you have *many* CPUs).

> To summarize ( and conclude ) with all the guys who have taken part in the discussion :
> 
> 1) RQ > 30 exists in "nature", as proven by the vmstat output ( from real loads,
> not benchmarks ) that other guys have sent me

Right.

> 2) We can agree to consider these cases as rare

Not "rare". Mark them as a result of a mistake.

> 3) We agree on keeping the scheduler's cache ( and icache ) footprint as low as
> possible ( even if I think that for a single cache miss to make a percent of
> difference you must push the system to a high switch / sec rate )

Correct.
[Let me stress this again: you can have a high switch/sec rate no matter how long the RQ is].

> 4) We agree on the fact that by improving scheduler switch times you don't get
> a great overall gain. As you can see in the message that opened this thread,
> since the gain percentage you get on the scheduler must be multiplied by the
> percentage of time spent scheduling, you get about 5% with 40 threads.
> Higher _scheduler_ gains would come with a higher RQ, but I agree that
> these are exceptional cases.

It's not a matter of how common such cases are.
1) 99.9% of the time they are caused by a mistake. Fix that, instead of the kernel.
2) 99.9% of the time you see a long (>30) RQ, you have problems other than
   a sub-optimal scheduler. Fixing the scheduler won't solve a cache-thrashing
   situation, for example. While fixing those "other problems" you will probably
   fix the long RQ problem at the same time. No need to patch the scheduler.
  
> 5) I don't agree when you say that the patch causes "great cache thrashing".
> That means you haven't read the patch.

I've read the patch, and I've never said it causes "great cache thrashing".
I repeat: if you see a long RQ, you also see "great cache thrashing"
in most of the real cases. And that's a bigger problem than a sub-optimal
scheduler.

> 6) I can finally agree when you say to reject the patch due to the fact that,
> as outlined above :
> 
> "Kernel code bloat" vs. "Human loads gain"  :  1 - 0
> 

Yes and no. We should measure the patch's gain on a system with 16 or more CPUs,
under a real-world load (not your test program). I do think
it may give some improvement, but it should be measured. Even a 1% or 2%
gain is enough to start considering it for inclusion (as a compile
option), IMHO.

> Cheers,
> Davide.
> 

TM.
-- 
      ____/  ____/   /
     /      /       /			Marco Colombo
    ___/  ___  /   /		      Technical Manager
   /          /   /			 ESI s.r.l.
 _____/ _____/  _/		       Colombo at ESI.it





