Code optimization <LEA Instruction>

Sab Ene 29 04:30:16 CST 2000

Richard B. Johnson wrote:
> This is from GCC version 2.7.2.3, the exact utility that
> ../linux/Documentation/Changes requires.

>From another recent thread, we know that Linux 2.3.recent spits out a
few warnings about cache-aligned structures with GCC 2.7.2.3.

So I guess none of the people that write that code use GCC 2.7.2.3 :-)

> In most cases LEA is used where an offset is already known at compile-
> time. Note the stack operations using LEA and that the "pop" instruction
> uses the stack-pointer that has just been changed in value using LEA. This
> is a 3-clock penality.
> 
>      333:	8d 65 e0             	lea    0xffffffe0(%ebp),%esp
>      336:	5b                   	pop    %ebx

The AGI occurs when address calculation is dependent on the prior
instruction, in the pipeline.  There is no AGI in this example.

In fact AGIs are precisely because address calculation is done earlier
than ALU arithmetic, so in this example the use of "lea" actually helps!

(Aside: %esp is treated specially anyway, so additional rules apply
where "push" and "pop" are concerned).

Quite apart from all that, the alternative is three instructions long and
two bytes longer:

   0:	  89 e5			mov    %esp,%ebp
   2:	  83 c5 e0		add    $0xffffffe0,%ebp
   5:	  5b			pop    %ebx

Everything tells me "lea" is better than "add" here...

>      667:	8d 65 e0             	lea    0xffffffe0(%ebp),%esp
>      66a:	5b                   	pop    %ebx

etc., etc.  all these are smaller and faster with "lea".  We win :-)

> 00000a08 <schedule>:
>      d78:	8d 1c 9d 0d 00 00 00 	lea    0xd(,%ebx,4),%ebx
>      d7f:	8d 0c dd 00 00 00 00 	lea    0x0(,%ebx,8),%ecx

These are multiply-accumulate operations.  GCC decided that's the
fastest way to say %ecx = 32 * %ebx + 104.  There is an AGI stall as
there so it would be nice if appropriate scheduling would insert
something in the middle, but it's still faster than using adds and shifts.

>      dad:	8d 76 00             	lea    0x0(%esi),%esi ! NOP

It's as fast as a nop too, provided nobody wrote to %esi in the
preceding cycle.  It's done that way because it's a faster way to say
"nop 3 bytes long" for jump target alignment.

>      fda:	8d 5c 1e 01          	lea    0x1(%esi,%ebx,1),%ebx
>      fde:	89 5d cc             	mov    %ebx,0xffffffcc(%ebp)

This means %ebx = %esi + %ebx + 1.
You'd have to use two instructions to get the same result with "add".

There is no AGI stall between the two instructions because the
instruction's address calculation is not dependent on the first.  There
may be an AGI due to stalls between dependent pairs on a Pentium, but
you haven't shown enough context for me to say.

You show lots of these.  An AGI due to "lea" occurs when you have the
lea _after_ the instruction that sets a register used in the address.
On a Pentium, due to pairing, there can be two instructions in between.

>     1115:	8d 76 00             	lea    0x0(%esi),%esi
>     1118:	8b 71 38             	mov    0x38(%ecx),%esi

That's ok.  There's no AGI between these two because they're not dependent.

>     2e98:	8d 87 1a 03 00 00    	lea    0x31a(%edi),%eax
>     2e9e:	50                   	push   %eax

The first instruction's equivalent using "add" is two instructions long,
involving a move and an add.  It would be larger code and in most cases
slower.  That's because %edi and %eax are not the same register.  "lea"
is the x86 way of saying "add A to CONSTANT plus B times POWEROF2 and
store in C".  "add" can only do "add A to B".

>     2f24:	8d 90 08 fb ff ff    	lea    0xfffffb08(%eax),%edx
>     2f2a:	29 fa                	sub    %edi,%edx

There's no AGI here.  Max speed if the pairing state is working nicely.

>     2fe5:	8d 75 ec             	lea    0xffffffec(%ebp),%esi
>     2fe8:	56                   	push   %esi
>     2fe9:	8d 87 ac 04 00 00    	lea    0x4ac(%edi),%eax
>     2fef:	50                   	push   %eax
> 
>     2ff5:	8d 5d d8             	lea    0xffffffd8(%ebp),%ebx
>     2ff8:	53                   	push   %ebx
> 
>     2ff9:	8d 87 b4 04 00 00    	lea    0x4b4(%edi),%eax
>     2fff:	50                   	push   %eax
> 
>     304c:	8d 65 cc             	lea    0xffffffcc(%ebp),%esp
>     304f:	5b                   	pop    %ebx

No stalls in any of these either.  We still win :-)

All this, and I believe GCC 2.7.2.3 doesn't even know about AGIs.  From
your examples, it seems to work out nicely without considering them.

have a radiant day,
-- Jamie

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo en vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/