Because lwzu or stwu moves the pointer, I can remove an addi
instruction from the loop, so the loop is slightly faster.
I wrote a benchmark in Modula-2 that exercises some of these loops. I
measured its time on my old PowerPC Mac. Its user time decreases from
8.401s to 8.217s with the tighter loops.
The new features are the hi16/lo16 and ha16/lo16 syntax for
relocations, and the extended mnemonics like "blr".
Use ha16/lo16 to load some double floats with 2 instructions (lis/lfd)
instead of 3 (lis/ori/lfd).
Use the extended names for branches, comparisons, and bit rotations,
so I can more easily read the code. The new names often encode the
same machine instructions as the old names, except in a few places
where I changed the instructions.
Stop using andi. when we don't need to set cr0. In inn.s, I change
andi. to extrwi to extract the same bits. In los.s and sts.s, I
change "andi. r3, r3, ~3" to "clrrwi r3, r3, 2". This avoids setting
cr0 and also stops clearing the high 16 bits of r3.
In csa.s, los.s, sts.s, I change some comparisons and right shifts
from signed to unsigned (cmplw, cmplwi, srwi), because the sizes are
unsigned. In inn.s, the right shift can be signed (sraw) or unsigned
(srw), but I use srw because we don't need the carry bit.
In fef8.s, I save an instruction by using rlwinm instead of addis/andc
to rlwinm to clear a field. The code no longer kills r7. In both
fef8.s and fif8.s, I remove the list of killed registers.
Also remove some whitespace from ends of lines.
This provides and, ior, xor, com, zer, set, cms when defined($1) and
ior, set when !defined($1). I don't provide the other operations
!defined($1) because our Modula-2 compiler hasn't used them.
I wrote a Modula-2 example in
https://gist.github.com/kernigh/add79662bb3c63ffb7c46d01dc8ae788
Put a dummy comment in mach/powerpc/libem/build.lua so git checkout
will touch that file. Without the touch, the build system doesn't see
the new *.s files.