Switch some conversions from libem calls to inline code. The
conversions from integers to floats are now too slow, because each
conversion allocates 4 or 5 registers, and the register allocator is
too slow. I might use these slow conversions to experiment with the
register allocator.
I add the missing conversions between 4-byte single floats and
integers, simply by going through 8-byte double floats. (These
replace the calls to nonexistent functions in libem.)
I remove the placeholder for fef 4, because it doesn't exist in libem,
and our language runtimes only use fef 8.
This replaces a call to memmove() in libc. That was working for me,
but it can fail because EM programs don't always link to libc.
blm and bls only need to copy aligned words. They don't need to copy
bytes, and they don't need to copy between overlapping buffers, as
memmove() does. So the new loop is simpler than memmove().
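A minimal sketch in Python of why the loop can be simpler than
memmove(), assuming a byte-addressed memory, 4-byte words, and
non-overlapping buffers. The name copy_words and the bytearray model
are mine for illustration; the real loop is PowerPC assembly.

```python
def copy_words(memory, dst, src, size, word=4):
    """Copy size bytes from src to dst, one aligned word at a time.

    Unlike memmove(), this does not handle odd byte counts or
    overlapping buffers; blm and bls never need either.
    """
    if size % word or dst % word or src % word:
        raise ValueError("only whole, aligned words are supported")
    index = 0  # a single index serves both the source and the destination
    while index < size:
        memory[dst + index:dst + index + word] = \
            memory[src + index:src + index + word]
        index += word
```

For example, copy_words(mem, 8, 0, 8) copies the first two words of
mem to offset 8.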
Remove one addi instruction from some loops. These loops used to
increment 2 pointers; they now increment 1 index. I must initialize
the index, so I add "li r6, 0" before each loop.
Change .zer to use subf instead of neg, add.
Change .xor to take the size on the real stack, as .and and .or have
done since 81c677d.
Use extended "mr" instead of basic "or" to move registers. Both "mr"
and "or" encode the same machine instruction. With "mr", I can more
easily search the assembly output for register moves.
Fold several stacking rules into a single rule ANY_BHW-REG to STACK.
Remove the EM patterns for loc mlu $2==2 and loc slu. The first
pattern had the wrong size (should be $2==4, not $2==2). Both
patterns were redundant. They rewrote loc mlu as loc mli and loc slu
as loc sli, but this table doesn't have patterns for loc mli or loc
sli, so it is enough to rewrite mlu as mli and slu as sli.
Add the tokens IND_RL_B, IND_RL_H, IND_RL_H_S, IND_RL_D, along with
the rules to use them. These rules emit shorter code. For example,
loading a byte becomes lis, lbz instead of lis, addi, lbz.
While making this, I wrongly set IND_RL_D to size 4. Then ncg
recursed without end in codegen() and stackupto(), until it crashed
with a stack overflow. Setting IND_RL_D to its correct size of 8
prevents the crash.
Remove coercion from LABEL to REG. The coercion never happens because
I have stopped putting LABEL on the stack. Also remove LABEL from set
ANY_BHW. Retain the move from LABEL to REG because pat gto uses it.
Remove li32 instruction, unused after the switch to the hi16, ha16,
lo16 syntax.
Remove COMMENT(...) lines from most moves. In my opinion, they took
too much space, both in the table and in the assembly output. The
stacking rules and coercions keep their COMMENT(...) lines.
In test GPR, don't write to RSCRATCH.
Fold several coercions into a single coercion from ANY_BHW uses REG.
Use REG instead of GPR in stack patterns. REG and GPR act the same,
because every GPR on the stack is a REG, but I want to be clear that I
expect a REG, not r0.
In code rules, sort SUM_RC before SUM_RR, so I can add SUM_RL later.
Remove rules to optimize loc loc cii loc loc cii. If $2==$4, the
peephole optimizer can optimize it. If $2!=$4, then the EM program is
missing a conversion from size $2 to size $4.
Remove rules to store a SEX_B with sti 1 or a SEX_H with sti 2. These
rules would never get used, unless the EM program is missing a
conversion from size 4 to size 1 or 2.
Use it to generate code like
    lis r12,ha16[__II0]
    lis r11,ha16[_f]
    lfs f1,lo16[_f](r11)
    lfs f2,lo16[__II0](r12)
    fadds f13,f2,f1
    stfs f13,lo16[_f](r11)
Here ncg has allocated r11 for ha16[_f]. We use r11 in lfs and again
in stfs. Before this change, we needed an extra lis before stfs,
because ncg did not remember that ha16[_f] was in a register.
This example has a gap between ha16[__II0] and lo16[__II0], because
the lo16 is not in the next instruction. This requires my previous
commit 1bf58cf for RELOLIS. There is a gap because ncg emits the lis
as soon as I allocate it. The "lfs f2,lo16[__II0](r12)" happens in a
coercion from IND_RL_W to FSREG. The coercion allocates one FSREG but
may not allocate any other registers. So I must allocate r12 earlier.
I allocate r12 in pat lae, but this causes a gap.
A 4-byte load from a label yields a token IND_RL_W. This token emits
either lis/lwz or lis/lfs, depending on whether we want a
general-purpose register or a floating-point register.
Remove the GPRINDIRECT token, and use the IND_RC_* tokens as operands
to instructions. We no longer need to unpack an IND_RC_* token and
repack it as a GPRINDIRECT to use it in an instruction.
Allow storing IND_ALL_B and IND_ALL_H in register variables. Create a
set ANY_BHW for anything that we can store in a regvar.
Push register variables on the stack without using GPRE, by changing
stwu to accept LOCAL. Then ncg will replace the string ">>> BUG IN
LOCAL" with the register name. (I copied ">>> BUG IN LOCAL" from
mach/arm/ncg/table.)
Fix the rule for "pat lil inreg($1)>0" to yield an IND_RC_W token, not
a register. We might need to kill the token with "kills MEMORY".
Rename CONST_ALL to CONST_STACK, because it only includes constants on
the stack, and excludes CONST tokens. Instructions still don't allow
CONST_STACK operands, so we still need to repack each CONST_STACK as a
CONST to use it in an instruction.
Rename LABEL_OFFSET_HI to just LABEL_HI, and same for LABEL_HA and
LABEL_HO.
r0 is a special case and can't be used when adding a register to a
constant. The few remaining users of the scratch register don't do
that. I removed other usages of the scratch register in 7c64dab,
5b5f774, 19f0eb8, f64b7d8.
The rewritten code rules bring 3 new features:
1. The new rules compare a small constant with a register by
reversing the comparison and using `cmpwi` or `cmplwi`. The old
rules put the constant in a register.
2. The new rules emit shorter code to yield the test results,
without referencing the tables in mach/powerpc/ncg/tge.s.
3. The new rules use the extended `beq` and relatives, not the
basic `bc`, in the assembly output.
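A minimal sketch in Python of feature 1, assuming the usual 16-bit
immediate ranges of cmpwi (signed) and cmplwi (unsigned). The names
SWAPPED and compare_const_with_reg are hypothetical; the real rules
live in the ncg table.

```python
# Hypothetical names; only cmpwi and cmplwi come from the text above.
# Comparing "const OP reg" gives the same answer as "reg SWAPPED[OP] const".
SWAPPED = {"lt": "gt", "le": "ge", "gt": "lt", "ge": "le",
           "eq": "eq", "ne": "ne"}

def compare_const_with_reg(op, const, signed=True):
    """Plan 'const OP reg' as an immediate compare of reg with const.

    Returns (mnemonic, immediate, swapped_op), or None when the constant
    does not fit the 16-bit immediate and must go in a register instead.
    """
    if signed and -0x8000 <= const <= 0x7FFF:
        return ("cmpwi", const, SWAPPED[op])
    if not signed and 0 <= const <= 0xFFFF:
        return ("cmplwi", const, SWAPPED[op])
    return None
```

For example, "5 < r" becomes a cmpwi of r with 5, followed by a test
for "greater than".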
I delete the old tristate tokens and the old moves, because they
confused me. Some of the old moves weren't really moves. For
example, `move R3, C0` and then `move C0, R0` did not move r3 to r0.
I rename C0 to CR0.
This fixes the coercion from IND_ALL_D to FREG. The coercion had
never happened, because IND_ALL_D had 8 bytes but FREG had 4 bytes.
Instead, ncg always stacked the IND_ALL_D and unstacked a FREG. The
stacking rule uses f0, so the code loaded f0 with the indirect value,
pushed f0 to the stack, loaded f1 from the stack, and moved the stack
pointer. Now that FREG has 8 bytes, ncg does the coercion, and the
code just loads f1 with the indirect value.
Always use 'kills ALL' when reaching a label, because our registers
and tokens have the wrong values if the program jumps to this label
from somewhere else.
When falling through a label, if the top element is in r3, then
require that the rest of the stack is in the real STACK, not in
registers or tokens.
I'm doing this to be certain that the missing constraints are not
causing bugs. I did not find any such bug, perhaps because the labels
are usually near other instructions (like conditional branches and
function calls) that stack or kill tokens.
This is for fef 8 and fif 8. I changed .fef8 so it no longer kills
r7, but I don't want to update the list. We already use "kills ALL"
for most other calls to libem.
possible values. Add the PowerPC ncg and mcg backend support to let the test
actually run, including modifying a bunch of PowerPC libem functions so that
they can be called from both ncg and mcg.
assembler directives, ha16() and has16(), for the upper half; has16() applies
the sign adjustment. .powerpcfixup is now gone, as we generate the relocation
in ha*() instead. Add special logic to the linker for undoing and redoing the
sign adjustment when reading/writing fixups. Tests still pass.
This provides and, ior, xor, com, zer, set, cms when defined($1) and
ior, set when !defined($1). I don't provide the other operations
when !defined($1) because our Modula-2 compiler hasn't used them.
I wrote a Modula-2 example in
https://gist.github.com/kernigh/add79662bb3c63ffb7c46d01dc8ae788
Put a dummy comment in mach/powerpc/libem/build.lua so git checkout
will touch that file. Without the touch, the build system doesn't see
the new *.s files.
We implement 'los', 'sts', 'cmi', 'cmu' only for size 4, not for
other sizes. Add the clause $1==4.
We only implement inn when defined($1).
The rule for aar needs 'kills ALL' because it kills many registers,
like other rules that call libem.
This allows 'move {CONST, $1}, R3' with a small enough $1 to emit one
instruction (addi) instead of two instructions (addis, ori). The
CONST token confusingly isn't in the CONST_ALL set.
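A minimal sketch in Python of the choice between one and two
instructions, assuming a 32-bit register and the standard meanings of
addi, addis, and ori with a zero base. The functions load_const and
run are hypothetical; run is a tiny emulator that checks the result.

```python
def load_const(value):
    """Return (mnemonic, 16-bit immediate) pairs that load value."""
    value &= 0xFFFFFFFF
    signed = value - (1 << 32) if value & 0x80000000 else value
    if -0x8000 <= signed <= 0x7FFF:
        return [("addi", signed)]            # small enough: one instruction
    return [("addis", value >> 16),          # upper half first
            ("ori", value & 0xFFFF)]         # then the lower half

def run(instructions):
    """Emulate the sequence on one register that starts at 0."""
    reg = 0
    for name, imm in instructions:
        if name == "addi":
            reg = imm & 0xFFFFFFFF           # sign-extended immediate
        elif name == "addis":
            reg = (imm << 16) & 0xFFFFFFFF   # immediate shifted left 16
        elif name == "ori":
            reg |= imm                       # zero-extended immediate
    return reg
```

Here load_const(5) yields one addi, while load_const(0x12345678)
yields an addis, ori pair.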
The spec says, "ASS w: Adjust the stack pointer by w-byte integer".
The w argument "can either be given as argument or on top of the
stack." Therefore, 'ass 4' would pop the 4-byte integer from the
stack, but 'ass' would pop the size w from the stack, then pop the
w-byte integer.
PowerPC ncg wrongly implemented 'ass' as if it were 'ass 4'. Fix it
to accept only 'ass 4'.
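A minimal sketch in Python of the two forms, assuming a toy stack of
integers with the top at the end of a list. The function ass here is
a hypothetical model of the spec, not code from the table.

```python
def ass(stack, w=None):
    """Model EM 'ass w' (w given) or plain 'ass' (w popped from the stack).

    Returns (w, n), where n is the w-byte integer by which the stack
    pointer would be adjusted.
    """
    if w is None:
        w = stack.pop()   # plain 'ass': the size w is on top of the stack
    n = stack.pop()       # then the w-byte adjustment
    return w, n

# Correct reading of plain 'ass': pops the size 4, then the integer 8.
stack = [99, 8, 4]
assert ass(stack) == (4, 8) and stack == [99]

# The old bug: treating plain 'ass' as 'ass 4' pops only the size,
# uses it as the adjustment, and leaves the 8 behind.
stack = [99, 8, 4]
assert ass(stack, 4) == (4, 4) and stack == [99, 8]
```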
These instructions would load or store the EM heap pointer. They
don't work. Programs must use brk() or sbrk() in libsys.
The last file to use 'lor 2' and 'str 2' was lang/pc/libpc/sav.e in
the Pascal library. Commit c084f9f deleted the file, so we no longer
need rules 'lor 2' or 'str 2' to build the ACK.
corresponding invocation in the ncg table so the same helpers can be used for
both mcg and ncg. Add a new IR opcode, FARJUMP, which jumps to a helper
function but saves volatile registers.
This would have happened later, if f14 to f31 became regvar (like r13
to r31 are now). I am doing it now because ncg is too slow for rules
"with FREG FREG uses FREG". We use such rules for adf 8 and other EM
instructions that operate on 2 floats. Like my last commit cfbc537,
this commit speeds ncg by removing choices for register allocation.
ncg is too slow with this many registers. A stack pattern "with GPR
GPR GPR" or "with REG REG REG" takes too long to pick registers,
causing ncg to take about 2 seconds on each sti 8. I introduce
REG_PAIR; there are only 4 such pairs, so ncg has fewer choices.
For programs that use sti 8 (including C programs that copy 8-byte
structs), this speed hack improves the ncg run from several seconds to
almost instantaneous.
Also add a few COMMENT(...) lines in stacking rules.
This fixes the SIGILL (illegal instruction) in startrek when firing
phasers. The 32-bit processors in my PowerPC Mac and in QEMU don't
have fctid, a 64-bit instruction.
I got the idea from mach/proto/fp/fif8.c to extract the exponent,
clear some bits to get an integer, then subtract the integer from
the original value to get the fraction.
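A minimal sketch in Python of that idea, assuming IEEE 754 doubles
with a 52-bit mantissa field and an exponent bias of 1023. The
function split_double is hypothetical and shows only the split into
integer and fraction, not the rest of fif.

```python
import struct

def split_double(x):
    """Split x into (integer part, fraction part) by clearing mantissa bits."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    exponent = ((bits >> 52) & 0x7FF) - 1023   # unbiased exponent
    if exponent < 0:
        kept = bits & (1 << 63)                # |x| < 1: keep only the sign
    elif exponent >= 52:
        kept = bits                            # no fraction bits to clear
    else:
        fraction_bits = 52 - exponent          # low bits below the binary point
        kept = bits & ~((1 << fraction_bits) - 1)
    integer = struct.unpack(">d", struct.pack(">Q", kept))[0]
    return integer, x - integer                # subtract to get the fraction
```

For example, split_double(3.75) gives (3.0, 0.75). This sketch makes
no attempt to treat infinities or NaNs specially.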
Adjust some of the loi rules (and associated moves) so we can identify
the tokens that must be in MEMORY.
With this commit, I can navigate the Enterprise even if I comment out
my work-around from e22c888.
Because li32 always loads a label into a GPR, it is sufficient to
coerce LABEL to REG, then use IND_RC_W or IND_RC_D for indirection
through the label.
Now that SUM_RC always has a signed 16-bit constant, it happens that
the various IND_RC_* tokens also have a signed 16-bit constant, so
we no longer need to touch the scratch register.
When loc (load constant) pushes a constant, it now checks the value of
the constant and pushes any of 7 tokens. These tokens allow stack
patterns to recognize 16-bit signed integers (CONST2), 16-bit unsigned
integers (UCONST2), multiples of 0x10000 (CONST_HZ), and other
interesting forms of constants.
Use the new constant tokens in the rules for adi, sbi, and, ior, xor.
Adjust a few other rules to understand the new tokens.
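A minimal sketch in Python of the three properties named above,
assuming 32-bit constants. The function constant_forms and its labels
are hypothetical; the table splits constants into 7 tokens, and this
sketch does not reproduce that split.

```python
def constant_forms(value):
    """Return the set of named properties that a 32-bit constant has."""
    value &= 0xFFFFFFFF
    signed = value - (1 << 32) if value & 0x80000000 else value
    forms = set()
    if -0x8000 <= signed <= 0x7FFF:
        forms.add("16-bit signed")         # the property behind CONST2
    if value <= 0xFFFF:
        forms.add("16-bit unsigned")       # the property behind UCONST2
    if value & 0xFFFF == 0:
        forms.add("multiple of 0x10000")   # the property behind CONST_HZ
    return forms
```

A constant like 0x8000 is 16-bit unsigned but not 16-bit signed, so it
can be an operand of ori but not of addi.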
Require that SUM_RC has a signed 16-bit constant, and OR_RC and XOR_RC
each have an unsigned 16-bit constant. The moves from SUM_RC, OR_RC,
XOR_RC to GPR no longer touch the scratch register, because the
constant is not too big.
Change the operator in his() from a - minus to a + plus. When los(n)
becomes negative, then his(n) needs to add 0x10000, not subtract it.
Also change los(n) to do the sign extension, because smalls(los(n))
should be true, not false.
Also change hi(n) and lo(n) to wrap n in parentheses, as (n), because
these are macros and n might still contain operators.
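A minimal sketch in Python of the corrected arithmetic, assuming
32-bit values. The real his(), los(), and smalls() are macros in the
table; these functions only mirror the description above.

```python
def smalls(x):
    """True if x fits in a signed 16-bit immediate."""
    return -0x8000 <= x <= 0x7FFF

def los(n):
    """Low 16 bits of n, sign-extended."""
    low = n & 0xFFFF
    return low - 0x10000 if low & 0x8000 else low

def his(n):
    """High 16 bits of n, adjusted so (his(n) << 16) + los(n) gives n."""
    if los(n) < 0:
        n += 0x10000      # add, not subtract, to cancel the negative low half
    return (n >> 16) & 0xFFFF

# The two halves recombine to the original 32-bit value.
for n in (0, 0x7FFF, 0x8000, 0x12348000, 0x1234FFFF, 0xFFFFFFFF):
    assert smalls(los(n))
    assert ((his(n) << 16) + los(n)) & 0xFFFFFFFF == n
```

With n = 0x8000, los(n) is -0x8000 and his(n) is 1, so an addis of 1
followed by an addi of -0x8000 rebuilds 0x8000.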
We only need GPRE in a few places where we write {GPRE, regvar(...)}
because ncgg can't parse plain regvar(...). In all other places, a
plain GPR works.
Also remove gpr_gpr_gpr and a few other unused and fake instructions
from the list of instructions.
Rename the scratch gpr (currently r11) from SCRATCH to RSCRATCH so I
can search for RSCRATCH without finding FSCRATCH. I also want to
avoid confusion with the SCRATCH keyword of the old code generator
(cg, which came before ncg).
Change the stacking rules to prevent stacking of RSCRATCH or FSCRATCH
or any other GPR or FPR that isn't an allocatable REG or FREG. Then
ncgg rejects any rule that tries to stack a GPR or FPR, so change such
rules to stack a REG or FREG.
In our powerpc table, sdl fails to kill the old value of the local.
This is a bug, because a later ldl can load the old value instead of
the newly stored value. By rewriting "sdl 0" "ldl 0" as "dup 8" "sdl
0", the newly added rule works around the bug, but only when the ldl
is immediately after the sdl.
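A minimal sketch in Python of the rewrite, assuming EM code as a list
of (mnemonic, argument) pairs and assuming the rule applies when both
offsets are equal. The function rewrite_sdl_ldl is hypothetical; the
real rule is a pattern in the ncg table.

```python
def rewrite_sdl_ldl(code):
    """Rewrite 'sdl x' 'ldl x' as 'dup 8' 'sdl x' when they are adjacent."""
    out = []
    i = 0
    while i < len(code):
        adjacent = (i + 1 < len(code)
                    and code[i][0] == "sdl"
                    and code[i + 1][0] == "ldl"
                    and code[i][1] == code[i + 1][1])
        if adjacent:
            out.append(("dup", 8))   # keep a copy of the 8-byte value
            out.append(code[i])      # then store it; no reload is needed
            i += 2
        else:
            out.append(code[i])
            i += 1
    return out
```

Any instruction between the sdl and the ldl defeats the match, which
is why this works around the bug only in the adjacent case.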
This rule improves code that uses double-precision floating point.
The output of printf("%f", 6.0) in C changes from all zero digits to
"6000000" but still doesn't print the decimal point. The result of
atof("-123.456") becomes correct. In startrek, I can now move the
Enterprise, but I still can't fire phasers without crashing the game.
We already have a rule for stl lol $1==$2. We had two copies of the
rule, so I am deleting the second copy.
In EM, fef splits a float into exponent and fraction. The old C code,
given an infinite float, got stuck in an infinite loop. The new
assembly code doesn't loop; it extracts the IEEE exponent.
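A minimal sketch in Python of extracting the IEEE exponent without a
loop, assuming IEEE 754 doubles and the same convention as C frexp()
(fraction in [0.5, 1)). The function fef is hypothetical, and how it
treats zeros, denormals, infinities, and NaNs is my assumption, not
taken from the assembly code.

```python
import struct

def fef(x):
    """Split x into (fraction, exponent) by reading the exponent field."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    biased = (bits >> 52) & 0x7FF
    if biased == 0 or biased == 0x7FF:
        return x, 0        # assumption: return special values unchanged
    exponent = biased - 1022                       # so that 0.5 <= |fraction| < 1
    frac_bits = (bits & ~(0x7FF << 52)) | (1022 << 52)
    fraction = struct.unpack(">d", struct.pack(">Q", frac_bits))[0]
    return fraction, exponent
```

Because there is no loop, an infinite input returns at once:
fef(float("inf")) gives (inf, 0) in this sketch.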
This fixes code that tried to "addi SP, SP, 4" to drop a value that
was in a register, not on the real stack.
Add a rule to optimize "asp 4" (which becomes "loc 4" "ass") when
the value being dropped is already in a GPR.
When ncg fell back on this rule, it emitted the string "invalid" in
the assembly code and caused a syntax error in the assembler.
Adjust the stacking rules so we can stack LOCAL, CONST, and LABEL
without falling back on the "invalid" rule, and so we can stack them
when we have no free register except the scratch register.
GNU as has "la %r4,8(%r3)" as an alias for "addi %r4,%r3,8", meaning
to load the address of the thing at 8(%r3). Our 'la', now 'li32',
makes an addis/ori pair to load an immediate 32-bit value. For
example, "li32 r4,23456789" loads a big number.