buggy GFP_KERNEL allocators

Sab Ene 29 13:54:35 CST 2000

On Sat, 29 Jan 2000, Rik van Riel wrote:

>This won't work. I've seen TASK_UNINTERRUPTIBLE and TASK_ZOMBIE
>tasks trying to allocate memory. You don't want to change the

If the TASK_ZOMBIE thing would be true, that would be a do_exit bug. I
can't find the wrong allocation by reding the code though. The other
UNINTERRUPTIBLE thing may be just fine.

GFP is just like wait_on_page() or everything else that can block while
waiting for an event to happen. Since the caller said it can afford to
sleep into GFP, it also know it will schedule and so it also know that GFP
is going to clobber the current->state returning unconditionally in
TASK_RUNNING and that GFP won't care at all about the previous state of
current->state. Like all other blocking calls (wait_on_buffer() or
get_request_wait() etc..etc..) we know they will clobber the task->state
information because they are schedule-aware functions that needs to play
with the scheduling.

IMNSHO if some caller assumes GFP doesn't clobber the task state it should
be the caller that should be fixed. This looks cleaner and will obviously
also improve performance compared to another possible approch that is to
teach GFP and the other wait_event paths to not clobber the state of the
current task (saving/restoring the current->state on/from the stack
before/after dealing with the scheduler).

Threating the allocation as atomic as in the 2.2.15pre5 if the state is
not RUNNING is _definitly_ wrong if the caller claimed that it can wait so
the pre-kernel could go oom even if not necessary. Not beilieving in the
__GFP_WAIT make no sense otherwise __GFP_WAIT can be dropped completly if
we don't believe in it.

The 2.2.15pre5 pre-patch is also calling schedule as soon as it hits the
low watermark:

+               if (nr_free_pages <= freepages.low) {
+                       wake_up_interruptible(&kswapd_wait);
+                       /* a bit of defensive programming */
+                       if (gfp_mask & __GFP_WAIT)
+                               schedule();
                }

That make no sense. It will generate overscheduling for no good reason.
The only point for doing the pre-wakeup also in the non-atomic case (as I
extensively explained in my previous email on the GFP_ATOMIC failed
allocation problems) is only to try to scale the freeing load during
sleep-time or in a parallel CPU before we hit the min watermark. The
pre-kernel instead in the worst case is rescheduling kswapd into the
current CPU and so it will only for sure waste time in schedule and it
will also sync-wait for kswapd to complete and so it's obviously way
faster waiting the min watermark to trigger and then to free the pages
from the process context instead of having to reschedule kswapd for doing
the plain same thing. There's no either point in tring to never hit the
"min" watermark because it's too low for you, then the it's the min
waternarm that should be increased instead of wasting CPU (and TLB in
2.2.x) in doing such a thing. And this waste will happen in the
non-swapout case too thus it will make difference on numbers with memory
pressure going on.

Andrea

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo en vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/