Switch .cms to pass inputs and outputs on the real stack, not in
registers; like we do with .and, .or (81c677d) and .xor (c578c49).
At this point, nearly all functions in libem use the real stack, not
registers, for passing inputs and outputs. This simplifies the ncg
table (which needs fewer lists of specific registers) but slows calls
to libem.
For example, after ba9b021, each call to .aar4 is about 10
instructions slower. I moved 3 inputs and 1 output from registers to
the real stack. A program would take 4 instructions to move registers
to stack, 4 to move stack to registers, and perhaps 2 to adjust the
stack pointer.