.In
.hw data-structures
.nr H1 3
.NH
SOLUTIONS
.NH 2
Maintaining SPARC speed
.PP
In chapter 3 we wrote:
.sp 0.3
.nf
>If we want to generate efficient code, we should at least try to reduce the number of
>memory references and use registers wherever we can.
.fi
.sp 0.3
In this chapter we will devise a strategy to swiftly generate acceptable
code by using push-pop optimization.
Note that this is not the push-pop
optimization already available in the EM-kit, since that is only present
in the assembler-to-binary part which we do not use
.[ [
The Code Expander Generator
.]].
Our push-pop optimization
works more like the fake-stack described in
.[ [
The table driven code generator
.]].
.NH 3
Ad-hoc optimization
.PP
Before getting involved in any optimization let's have a look at some
code generated with a straightforward EM to SPARC conversion of the
C statement \*(Sif(a[i]);\*(So. Note that \*(Si%SP\*(So is an alias
for a general purpose
register and acts as the EM stack pointer. It has nothing to do with
\*(Si%sp\*(So \(em the SPARC stack pointer.
Analogously, \*(Si%LB\*(So is EM's local base pointer.
.br
.IP
.HS
.TS
;
l s l s l
l1f6 lf6 l2f6 lf6 l.
EM code	SPARC code	Comment

lae	_a	set	_a, %g1	! load address of external _a
		dec	4, %SP
		st	%g1, [%SP]

lol	-4	set	-4, %g1	! load local -4 (i)
		ld	[%g1+%LB], %g2
		dec	4, %SP
		st	%g2, [%SP]

loc	2	set	2, %g1	! load constant 2
		dec	4, %SP
		st	%g1, [%SP]

sli	4	ld	[%SP], %g1	! pop shift count
		ld	[%SP+4], %g2	! pop shiftee
		sll	%g2, %g1, %g3
		inc	4, %SP
		st	%g3, [%SP]	! push 4 * i

ads	4	ld	[%SP], %g1	! add pointer and offset
		ld	[%SP+4], %g2
		add	%g1, %g2, %g3
		inc	4, %SP
		st	%g3, [%SP]	! push address of _a + (4 * i)

loi	4	ld	[%SP], %g1	! load indirect 4 bytes
		ld	[%g1], %g2
		st	%g2, [%SP]	! push a[i]
cal	_f
		...
.TE
.HS
.LP
Although the code is easy to understand, it clearly is far from optimal.
The above code uses approximately 60 clock-cycles\(dg
.FS
\(dg In general each instruction only takes one cycle,
except for \*(Sild\*(So and
\*(Sist\*(So which may both require additional clock cycles. The exact amount
of extra cycles needed depends on the SPARC implementation and memory access
time. Furthermore, the
\*(Siset\*(So pseudo-instruction is a bit tricky. It takes one cycle when
its argument lies between -4096 and 4095, and two cycles otherwise.
.FE
to push an array-element on the stack,
something which a 68020 can do in a single instruction. The SPARC
processor may be fast, but not fast enough to justify the above code.
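.PP
The cycle rules given in the footnote above can be written down as a small C
sketch. It is an illustration only: the names \*(Siset_cycles\*(So and
\*(Simem_cycles\*(So and the parameter \*(Siextra\*(So are our own assumptions,
and the real penalty for \*(Sild\*(So and \*(Sist\*(So depends on the SPARC
implementation and the memory access time.
.DS
```c
#include <assert.h>

/* Cycles for the set pseudo-instruction: one when the argument lies
 * between -4096 and 4095, two otherwise. */
int set_cycles(long value)
{
    return (value >= -4096 && value <= 4095) ? 1 : 2;
}

/* Cycles for ld or st: one, plus an implementation-dependent number
 * of extra cycles (an assumed parameter, not a measured value). */
int mem_cycles(int extra)
{
    assert(extra >= 0);
    return 1 + extra;
}
```
.DE
.LP
Under these rules \*(Siset -4, %g1\*(So costs one cycle, while a \*(Siset\*(So
of an address such as \*(Si_a\*(So will typically cost two.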
.PP
The same statement can be translated much more efficiently:
.DS
.TS
;
l2f6 lf6 l.
sll	%i0, 2, %g2	! multiply index by 4
set	_a, %g3
ld	[%g2+%g3], %g1	! get contents of a[i]
dec	4, %SP
st	%g1, [%SP]	! push a[i] onto the stack
.TE
.DE
which, instead of 60, uses only 5 clock cycles to retrieve the element
from memory and 5 additional cycles when the result has to be pushed
on the stack. Note that when the result is not a parameter it does not
have to be pushed on the stack. By making efficient use of the SPARC
registers we can fetch \*(Sia[i]\*(So in only 5 cycles!
.NH 3
Analyzing optimization
.PP
Instead of ad-hoc optimization we will need something more solid.
When one tries to optimize the above code in an ad-hoc manner one will
probably notice the large overhead due to stack access. Almost every EM
instruction requires at least three SPARC instructions: one to carry out
the EM instruction and two to pop its operand from and push its result onto
the stack. This happens for every instruction, even though the data being pushed
will probably be needed by the next instruction. To optimize this extensive
pushing and popping of data we will use the appropriately named push-pop
optimization.
.PP
The idea behind push-pop optimization is to delay the push operation until
it is almost certain that the data actually has to be pushed.
Often the data does not have to be pushed at all,
but will be used as input to another EM instruction.
If we can decide at compile time that this will indeed be
the case we can save the time of first pushing the data and then popping it
back again by temporarily storing the data (possibly only during compilation!)
and using it no sooner than it is actually needed.
.PP
The \*(Sisli 4\*(So instruction, for instance, expects two inputs on top of the
stack: on top a counter and right below that the shiftee (the number
to be shifted). As a result \*(Sisli\*(So
pushes 'shiftee << counter' back to the stack. Now consider the following
sequence, which could be the result of the expression \*(Si4 * i\*(So:
.DS
.TS
;
l1f6 lf6 l.
lol	-4
loc	2
sli	4
.TE
.DE
In the non-optimized situation the \*(Silol\*(So would push
a local variable (whose offset is -4) on the stack.
Then the \*(Siloc\*(So pushes a 2 on the stack and finally \*(Sisli\*(So
retrieves both these numbers to replace them with the result.
On most machines it is not necessary to
push the 2 on the stack, since it can be used in the shift instruction
as an immediate operand. On a SPARC, for instance, one can write
.DS
.TS
;
l2f6 lf6 l.
ld	[%LB-4], %g1	! load local variable into register g1
sll	%g1, 2, %g2	! perform the shift-left-by-2
.TE
.DE
where both the output of the \*(Silol\*(So and the immediate operand 2 are used
in the shift instruction. As suggested before, all of this can be
achieved with push-pop optimization.
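.PP
The decision made for \*(Sisli\*(So above can be sketched in C as follows.
This is a minimal illustration, not the code of our backend: the type
\*(Sistruct delayed\*(So, the routine \*(Siemit_sli\*(So and the use of
\*(Si%g\*(So registers only are assumptions made for the example.
A value whose push has been delayed is either known to be in a register
or known to be a constant, and the shift is generated accordingly.
.DS
```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record of a value whose push has been delayed. */
enum kind { IN_REG, IS_CONST };

struct delayed {
    enum kind kind;
    int reg;     /* number n of %gn, valid when kind == IN_REG */
    long value;  /* valid when kind == IS_CONST */
};

/*
 * Write the SPARC shift for "sli 4" into buf.  The shiftee must
 * already be in a register.  A count that is a known constant is
 * used as an immediate operand, so it is never pushed or popped.
 * Returns the number of characters written, as snprintf does.
 */
int emit_sli(char *buf, size_t size, struct delayed shiftee,
             struct delayed count, int dst)
{
    assert(shiftee.kind == IN_REG);
    if (count.kind == IS_CONST)
        return snprintf(buf, size, "sll %%g%d, %ld, %%g%d",
                        shiftee.reg, count.value, dst);
    return snprintf(buf, size, "sll %%g%d, %%g%d, %%g%d",
                    shiftee.reg, count.reg, dst);
}
```
.DE
.LP
Called with a shiftee in \*(Si%g1\*(So, the constant 2 as count and
destination \*(Si%g2\*(So, the sketch produces the \*(Sisll\*(So line shown
earlier; no stack access is generated for the constant.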
.NH 3
A mechanism for push-pop optimization
.PP
To implement the above optimization we need some mechanism to
temporarily store information during compilation.
We need to be able to store, compare and retrieve information from the
temporary storage (cache) without any
loss of information. Before describing all the routines used
to implement our cache we will first describe how the cache works.
.PP
Items in the cache are structures containing an external (\*(Sichar *\*(So),
two registers (\*(Sireg_t\*(So) and a constant (\*(Siarith\*(So),
any of which may be 0.
The value of such a structure is the sum of (the values of)
its elements. Before a register can be put in the cache it has to be allocated,
either by calling \*(Sialloc_reg\*(So, which returns a free register, by
\*(Siforced_alloc_reg\*(So, which allocates a specific register, or by any
of the other routines available to allocate a register. To keep things
simple, we will not discuss all of the available primitives here.
When the register
is then put in the cache by the \*(Sipush_reg\*(So routine, the ownership will
be transferred from the user to the cache. Ownership is important, because
only the owner of a register may (and must!) deallocate it. Registers can be
owned by either an (imaginary) register manager, the cache or the user.
When the user retrieves a register from the cache with \*(Sipop_reg\*(So, for
instance, ownership passes back to the user.
The user should then call \*(Sifree_reg\*(So
to transfer ownership to the register manager or call \*(Sipush_reg\*(So
to give it back to the cache.
Since the cache behaves like a stack, we will use the terms pop and push
for getting items from and putting items in the cache, respectively.
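.PP
The ownership rules can be made concrete with a small C model.
It is only a sketch under assumptions of our own: the \*(Sisk_\*(So names,
the sizes \*(SiNREGS\*(So and \*(SiDEPTH\*(So, and the representation of
\*(Sireg_t\*(So and \*(Siarith\*(So as integers are all hypothetical, and the
model handles plain register items only.
.DS
```c
#include <assert.h>
#include <stddef.h>

typedef int reg_t;   /* assumption: register number, 0 means "none" */
typedef long arith;  /* assumption: wide enough for an EM constant */

/* One cache item; its value is ext + reg + reg2 + cst. */
struct item {
    const char *ext;  /* external symbol, or NULL */
    reg_t reg, reg2;  /* registers, or 0 */
    arith cst;        /* constant, or 0 */
};

enum owner { MANAGER, CACHE, USER };

#define NREGS 8
#define DEPTH 16

enum owner owner_of[NREGS];  /* zero-initialised: all MANAGER */
struct item cache[DEPTH];
int tos;                     /* number of items in the cache */

/* manager -> user; register 0 is never handed out */
reg_t sk_alloc_reg(void)
{
    for (reg_t r = 1; r < NREGS; r++)
        if (owner_of[r] == MANAGER) {
            owner_of[r] = USER;
            return r;
        }
    return 0;  /* a real allocator would flush part of the cache here */
}

/* user -> cache */
void sk_push_reg(reg_t r)
{
    assert(owner_of[r] == USER && tos < DEPTH);
    cache[tos++] = (struct item){ NULL, r, 0, 0 };
    owner_of[r] = CACHE;
}

/* cache -> user; this sketch handles plain register items only */
reg_t sk_pop_reg(void)
{
    assert(tos > 0);
    struct item it = cache[--tos];
    assert(it.ext == NULL && it.reg2 == 0 && it.cst == 0);
    owner_of[it.reg] = USER;
    return it.reg;
}

/* user -> manager */
void sk_free_reg(reg_t r)
{
    assert(owner_of[r] == USER);
    owner_of[r] = MANAGER;
}
```
.DE
.LP
The assertions encode the rule that only the current owner may hand a
register on: pushing or freeing a register the user does not own is
treated as a programming error.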
.PP
We shall now present the sets of routines that implement the cache.
.IP \(bu
The routines
.DS
\*(Si
reg_t alloc_reg(void)
reg_t alloc_reg_var(void)
reg_t alloc_float(void)
reg_t alloc_float_var(void)
reg_t alloc_double(void)
reg_t alloc_double_var(void)

void forced_alloc_reg(reg_t)
void soft_alloc_reg(reg_t)

void free_reg(reg_t)
void free_double_reg(reg_t)
\*(So
.DE
allocate and deallocate registers. If there are no more registers left,
i.e. they are all owned by the cache,
one or more registers will be freed by flushing part of the cache
onto the real stack.
The \*(Sialloc_xxx_var\*(So primitives try to allocate a register that
can be used to store local variables. (In the current implementation
only the input and local registers.) If none can be found \*(SiNULL\*(So
is returned. \*(Siforced_alloc_reg\*(So forces the allocation of a certain
register. If it was already in use, its contents are moved to another
register. Finally \*(Sisoft_alloc_reg\*(So provides the possibility to
push a register onto the cache and still keep a copy for later use.
(Used to implement the \*(Sidup 4\*(So for example.)
.IP \(bu
The routines
.DS
\*(Si
void push_const(arith)
arith pop_const(void)
\*(So
.DE
push or pop a constant onto or from the stack. A distinction between
constants and other types is made so as not to lose any information; constants
may be used later on as immediate operands, which is not the case
for other types. If \*(Sipop_const\*(So is called, but the element on top of
the cache has a non-zero external or register field, a
fatal error will be reported.
.IP \(bu
The routines
.DS
\*(Si
reg_t pop_reg(void)
reg_t pop_float(void)
reg_t pop_double(void)
reg_t pop_reg_c13(char *n)

void pop_reg_as(reg_t)

void push_reg(reg_t)
\*(So
.DE
push or pop a register. These will be used most often since results from one
EM instruction, which are computed in a register, are often used in the next.
When the element on top of the cache is more
than just a register the cache manager
will generate code to compute the sum of its fields and put the result in a
register. This register will then be given to the user.
If the user wants the result in a specific register, he should use the
\*(Sipop_reg_as\*(So routine.
The \*(Sipop_reg_c13\*(So routine gives an optional number (as a character string) whose
value can be represented in 13 bits. The constant can then be used as an
offset for the SPARC \*(Sild\*(So and \*(Sist\*(So instructions.
.IP \(bu
The routine
.DS
\*(Si
void push_ext(char *)
\*(So
.DE
pushes an external onto the stack. There is no pop-variant of this one since
there is no use in popping an external.
.IP \(bu
The routines
.DS
\*(Si
void inc_tos(arith n)
void inc_tos_reg(reg_t r)
\*(So
.DE
increment the element on top of the cache by either the constant \*(Sin\*(So
or by a register. The latter is useful for pointer addition when referencing
external memory.
.KS
.IP \(bu
The routine
.DS
\*(Si
int type_of_tos(void)
\*(So
.DE
.KE
returns the type of the element on top of the cache. This is a combination
(binary OR) of \*(SiT_ext\*(So, \*(SiT_reg\*(So or \*(SiT_float\*(So,
\*(SiT_reg2\*(So or \*(SiT_float2\*(So, and \*(SiT_cst\*(So,
and tells the
user which of the fields are non-zero. When a register field
represents \*(Si%g0\*(So, it is considered zero.
.IP \(bu
Miscellaneous routines:
.DS
\*(Si
void init_cache(void)
void cache_need(int)
void change_reg(void)
void flush_cache(void)
\*(So
.DE
\*(Siinit_cache\*(So should be called before any
other cache routines, to initialize some internal data structures.
\*(Sicache_need\*(So is used to tell the cache that a certain number
of registers are needed for the next operation. This way the cache can
load them efficiently in one fell swoop. \*(Sichange_reg\*(So is to be
called when the user changes a register of which the cache (possibly) has
co-ownership. Because the contents of registers in the cache are
not allowed to change the user should call \*(Sichange_reg\*(So to
instruct the cache to copy the contents to some other register.
\*(Siflush_cache\*(So writes the cache to the stack and invalidates
the cache. It should be used before branches,
before labels and in other places where the stack has to be valid (i.e. where
every item on the EM-stack should be stored on the real stack, not in some
virtual cache).
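.LP
How the type of the top element follows from its fields can be sketched in C.
Only the names \*(SiT_cst\*(So, \*(SiT_ext\*(So, \*(SiT_reg\*(So and
\*(SiT_reg2\*(So come from the description above; the bit values, the
\*(Sisk_\*(So routines and the item layout are assumptions made for this
example, and the floating point types are left out.
The sketch also shows the 13-bit test that decides whether a constant can
serve as an offset in \*(Sild\*(So and \*(Sist\*(So.
.DS
```c
#include <assert.h>
#include <stddef.h>

typedef int reg_t;   /* assumption: register number, 0 is %g0 */
typedef long arith;

struct item { const char *ext; reg_t reg, reg2; arith cst; };

/* Assumed bit values; only the names come from the text. */
enum { T_cst = 1, T_ext = 2, T_reg = 4, T_reg2 = 8 };

/* Binary OR of the bits of all non-zero fields. */
int sk_type_of(const struct item *it)
{
    int t = 0;
    if (it->cst != 0)    t |= T_cst;
    if (it->ext != NULL) t |= T_ext;
    if (it->reg != 0)    t |= T_reg;   /* %g0 counts as zero */
    if (it->reg2 != 0)   t |= T_reg2;
    return t;
}

/* Popping a constant is only legal when no other field is set. */
int sk_is_pure_const(const struct item *it)
{
    return (sk_type_of(it) & ~T_cst) == 0;
}

/* Does n fit in a signed 13-bit immediate field? */
int sk_fits_c13(arith n)
{
    return n >= -4096 && n <= 4095;
}
```
.DE
.LP
An item consisting of the external \*(Si_a\*(So, one register and the
constant -4 thus has type \*(SiT_ext | T_reg | T_cst\*(So in this model, and
popping it as a constant would be an error.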
.NH 3
Implementing push-pop optimization in the EM_table
.PP
As indicated above, there is no regular way to represent the described
optimization in the EM_table. The only possible escapes from the EM_table
are function calls, but that is clearly not enough to implement a good
push-pop optimizer. Therefore we will use a modified version of the EM_table
format, where the description of, say, the \*(Silol\*(So instruction might look
like this\(dg:
.FS
\(dg This is not the way the \*(Silol\*(So actually looks in the EM_table;
it only shows how it \fImight\fR look using the aforementioned push/pop
primitives.
.FE
.DS
\*(Si
reg_t A, B;
const_str_t n;

A = alloc_reg();
push_reg(LB);
inc_tos($1);
B = pop_reg_c13(n);
"ld  [$B+$n], $A";
push_reg(A);
free_reg(B);
\*(So
.DE
For more details about the exact implementation consult
appendix B which contains some characteristic excerpts from the EM_table.
.NH 2
Stack management
.PP
When converting EM code to some executable code there is the problem of
maintaining multiple stacks. The usual way to do this is described in
.[ [
Description of a Machine Architecture
.]]
and is shown in figure \*(SN1.
.KF
.PS
copy "pics/EM_stack.orig"
.PE
.ce 1
\fIFigure \*(SN1: usual stack management.\fR
.KE
.sp
.LP
This means that the EM stack and the hardware stack (used
for subroutine calls, etc.) are interleaved in memory. On the SPARC, however,
this causes a serious problem: the usual model assumes that the
resolution of the stack pointer is a word, but this is not the case on the
SPARC processor. On the SPARC processor the stack-pointer as well as the
frame-pointer have to be aligned on 8-byte boundaries, so one can not simply
push a word on the stack and then lower the stack-pointer by 4 bytes!
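.PP
The effect of the alignment rule can be illustrated with two small C helpers.
They are a sketch only; the names \*(Sisk_align8_down\*(So and
\*(Sisk_aligned_size\*(So are hypothetical and do not occur in our backend.
.DS
```c
#include <assert.h>
#include <stdint.h>

/* Round a downward-growing stack pointer down to an 8-byte boundary. */
uintptr_t sk_align8_down(uintptr_t sp)
{
    return sp & ~(uintptr_t)7;
}

/* Number of bytes the stack pointer has to drop to make room for
 * n bytes while staying 8-byte aligned. */
uintptr_t sk_aligned_size(uintptr_t n)
{
    assert(n <= UINTPTR_MAX - 7);
    return (n + 7) & ~(uintptr_t)7;
}
```
.DE
.LP
For a single 4-byte word \*(Sisk_aligned_size\*(So yields 8, so every pushed
word would cost twice the space it needs if the hardware stack pointer were
used directly as the EM stack pointer.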
.NH 3
Possible solutions
.PP
A simple idea might be to use a swiss-cheese stack; we could
push a 4-byte word onto the stack and then lower the stack by 8.
Unfortunately, this is not a very solid solution, because
pointer-arithmetic involving pointers to objects on the stack would cause
hard-to-predict anomalies.
.PP
Another try would be not to use the hardware stack at all. As long as we
do not generate subroutine-calls everything will be all right. This
approach, however, also has some disadvantages: first we would not be able
to use any of the existing debuggers such as \fIadb\fR, because they all
assume a regular stack format. Secondly, we would not be able to make use
of the SPARC's register windows to keep local variables. Finally, doing all the
administrative work necessary for subroutine calls ourselves instead of
letting the hardware handle it for us,
causes unnecessary procedure-call overhead.
.PP
Yet another alternative would be to emulate the EM-part of the stack,
and to let the hardware handle the subroutine call. Since we will
emulate our own stack, there are no alignment restrictions and because
we will use the hardware procedure call we can still make use of
the register windows.
.NH 3
Our implementation
.PP
To implement the hybrid stack we need two extra registers: one for
the EM stack pointer (the aforementioned \*(Si%SP\*(So) and one for the
EM local base pointer (\*(Si%LB\*(So). The most elegant solution would be to
put both stacks in different segments, so they would not influence
each other. Unfortunately
.UX
lacks the ability to add segments and
since we will implement our backend under
.UX,
we will have to put
both stacks in the same segment. Exactly how this can be done is shown
in figure \*(SN2.
.DS
.PS
copy "pics/mem_config"
.PE
.ce 1
\fIFigure \*(SN2: our stack management.\fR
.DE
.sp
During normal procedure execution, the SPARC stack pointer has to point to
a memory location where the operating system can dump the active part of
the register window. The rest of the
register window will be dumped in the (stack) space pre-allocated for that purpose
by following the frame
pointer. When a signal occurs things get even more complicated and
result in figure \*(SN3.
.DS
.PS
copy "pics/signal_stack"
.PE
.ce 1
\fIFigure \*(SN3: our signal stack.\fR
.DE
.PP
The exact implementation of the stack is shown in figure \*(SN4.
.KF
.PS
copy "pics/EM_stack.ours"
.PE
.ce 1
\fIFigure \*(SN4: stack overview.\fR
.KE
.NH 2
Miscellaneous
.PP
As mentioned in the previous chapter, the generated \fI.o\fR-files are
not compatible with Sun's own object format. The primary reason for
this is that Sun usually passes the first six parameters of a procedure call
through registers. If we were to do that too, we would always have
to fetch the top six words from the stack into registers, even when
the procedure would not have any parameters at all. Apart from this,
structure-passing is another exception in Sun's object format which
makes it impossible to generate object-compatible code.\(dg
.FS
\(dg Exactly how Sun passes structures as parameters is described in
Appendix D of the SPARC Architecture Manual (Software Considerations).
.FE
.bp