.so init
.hw data-structures
.nr H1 3
.NH
SOLUTIONS
.NH 2
Maintaining SPARC speed
.PP
In chapter 3 we wrote:
.sp 0.3
.nf
>If we want to generate efficient code, we should at least try to reduce the number of
>memory references and use registers wherever we can.
.fi
.sp 0.3
In this chapter we will devise a strategy to swiftly generate acceptable
code by using push-pop optimization.
Note that this is not the push-pop
optimization already available in the EM-kit, since that is only present
in the assembler-to-binary part which we do not use
.[ [
The Code Expander Generator
.]].
Our push-pop optimization
works more like the fake-stack described in
.[ [
The table driven code generator
.]].
.NH 3
Ad-hoc optimization
.PP
Before getting involved in any optimization let's have a look at some
code generated with a straightforward EM to SPARC conversion of the
C statement: \*(Sif(a[i]);\*(So Note that \*(Si%SP\*(So is an alias
for a general purpose
register and acts as the EM stack pointer. It has nothing to do with
\*(Si%sp\*(So \(em the SPARC stack pointer.
Analogously, \*(Si%LB\*(So is EM's local base pointer.
.br
.IP
.HS
.TS
;
l s l s l
l1f6 lf6 l2f6 lf6 l.
EM code	SPARC code	Comment

lae	_a	set	_a, %g1	! load address of external _a
		dec	4, %SP
		st	%g1, [%SP]

lol	-4	set	-4, %g1	! load local -4 (i)
		ld	[%g1+%LB], %g2
		dec	4, %SP
		st	%g2, [%SP]

loc	2	set	2, %g1	! load constant 2
		dec	4, %SP
		st	%g1, [%SP]

sli	4	ld	[%SP], %g1	! pop shift count
		ld	[%SP+4], %g2	! pop shiftee
		sll	%g2, %g1, %g3
		inc	4, %SP
		st	%g3, [%SP]	! push 4 * i

ads	4	ld	[%SP], %g1	! add pointer and offset
		ld	[%SP+4], %g2
		add	%g1, %g2, %g3
		inc	4, %SP
		st	%g3, [%SP]	! push address of _a + (4 * i)

loi	4	ld	[%SP], %g1	! load indirect 4 bytes
		ld	[%g1], %g2
		st	%g2, [%SP]	! push a[i]
cal	_f
...
.TE
.HS
.LP
Although the code is easy to understand, it clearly is far from optimal.
The above code uses approximately 60 clock-cycles\(dg
.FS
\(dg In general each instruction only takes one cycle,
except for \*(Sild\*(So and
\*(Sist\*(So which may both require additional clock cycles. The exact number
of extra cycles needed depends on the SPARC implementation and memory access
time. Furthermore, the
\*(Siset\*(So pseudo-instruction is a bit tricky. It takes one cycle when
its argument lies between -4096 and 4095, and two cycles otherwise.
.FE
to push an array-element on the stack,
something which a 68020 can do in a single instruction. The SPARC
processor may be fast, but not fast enough to justify the above code.
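.PP
The cost rule from the footnote can be written down as a small C helper.
This is only a sketch for estimating cycle counts by hand; the names
\*(Siset_cycles\*(So and \*(Sipush_cycles_min\*(So are hypothetical and
the extra memory cycles of \*(Sild\*(So and \*(Sist\*(So are deliberately
left out, since they depend on the SPARC implementation.
.DS
```c
#include <assert.h>

/* Cycles for the set pseudo-instruction, following the footnote:
 * one cycle when the value lies in -4096..4095, two otherwise. */
static int set_cycles(long value)
{
    return (value >= -4096 && value <= 4095) ? 1 : 2;
}

/* Lower bound for pushing n words on the EM stack: one dec and one st
 * per word, not counting the extra cycles a st may need. */
static int push_cycles_min(int n)
{
    assert(n >= 0);
    return 2 * n;
}
```
.DE
With this rule a small constant such as 2 or -4 costs one cycle to
materialize, whereas a value outside the range costs two.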
.PP
The same statement can be translated much more efficiently:
.DS
.TS
;
l2f6 lf6 l.
sll	%i0, 2, %g2	! multiply index by 4
set	_a, %g3
ld	[%g2+%g3], %g1	! get contents of a[i]
dec	4, %SP
st	%g1, [%SP]	! push a[i] onto the stack
.TE
.DE
which, instead of 60, uses only 5 clock cycles to retrieve the element
from memory and 5 additional cycles when the result has to be pushed
on the stack. Note that when the result is not a parameter it does not
have to be pushed on the stack. By making efficient use of the SPARC
registers we can fetch \*(Sia[i]\*(So in only 5 cycles!
.NH 3
Analyzing optimization
.PP
Instead of ad-hoc optimization we will need something more solid.
When one tries to optimize the above code in an ad-hoc manner one will
probably notice the large overhead due to stack access. Almost every EM
instruction requires at least three SPARC instructions: one to carry out
the EM instruction and two to pop its operands from and push its result onto the
stack. This happens for every instruction, even though the data being pushed
will probably be needed by the next instruction. To optimize this extensive
pushing and popping of data we will use the appropriately named push-pop
optimization.
.PP
The idea behind push-pop optimization is to delay the push operation until
it is almost certain that the data actually has to be pushed.
Often the data does not have to be pushed at all,
but will be used as input to another EM instruction.
If we can decide at compile time that this will indeed be
the case we can save the time of first pushing the data and then popping it
back again by temporarily storing the data (possibly only during compilation!)
and using it no sooner than it is actually needed.
.PP
The \*(Sisli 4\*(So instruction, for instance, expects two inputs on top of the
stack: on top a counter and right below that the shiftee (the number
to be shifted). As a result \*(Sisli\*(So
pushes 'shiftee << counter' back to the stack. Now consider the following
sequence, which could be the result of the expression \*(Si4 * i\*(So:
.DS
.TS
;
l1f6 lf6 l.
lol	-4
loc	2
sli	4
.TE
.DE
In the non-optimized situation the \*(Silol\*(So would push
a local variable (whose offset is -4) on the stack.
Then the \*(Siloc\*(So pushes a 2 on the stack and finally \*(Sisli\*(So
retrieves both these numbers to replace them with the result.
On most machines it is not necessary to
push the 2 on the stack, since it can be used in the shift instruction
as an immediate operand. On a SPARC, for instance, one can write
.DS
.TS
;
l2f6 lf6 l.
ld	[%LB-4], %g1	! load local variable into register g1
sll	%g1, 2, %g2	! perform the shift-left-by-2
.TE
.DE
where the output of the \*(Silol\*(So, as well as the immediate operand 2 are used
in the shift instruction. As suggested before, all of this can be
achieved with push-pop optimization.
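.PP
The deferral described above can be modelled in a few lines of C.
The following is a minimal sketch under stated assumptions, not the code
of the actual back end: the names \*(Siem_lol\*(So, \*(Siem_loc\*(So,
\*(Siem_sli\*(So and \*(Siemitted_is\*(So are hypothetical, registers are
handed out naively as %g1, %g2 and so on, and only the case of a constant
shift count is handled.
.DS
```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum kind { K_CONST, K_REG };

struct item {
    enum kind kind;
    long value;              /* constant value when kind == K_CONST */
    int reg;                 /* %g register number when kind == K_REG */
};

static struct item cache[8]; /* compile-time stack of delayed pushes */
static int depth;
static char out[8][32];      /* emitted SPARC instructions, one per entry */
static int nout;
static int next_reg = 1;     /* naive allocator: %g1, %g2, ... */

static void push_item(struct item it)
{
    assert(depth < 8);
    cache[depth++] = it;
}

static struct item pop_item(void)
{
    assert(depth > 0);
    return cache[--depth];
}

/* loc: nothing is emitted, the constant is only remembered. */
static void em_loc(long c)
{
    struct item it = { K_CONST, c, 0 };
    push_item(it);
}

/* lol: load the local into a fresh register and remember that register. */
static void em_lol(long offset)
{
    struct item it = { K_REG, 0, next_reg++ };
    assert(nout < 8);
    snprintf(out[nout++], sizeof out[0], "ld [%%LB%+ld], %%g%d",
             offset, it.reg);
    push_item(it);
}

/* sli: a cached constant count becomes an immediate operand. */
static void em_sli(void)
{
    struct item count = pop_item();
    struct item shiftee = pop_item();
    struct item res = { K_REG, 0, next_reg++ };
    assert(count.kind == K_CONST && shiftee.kind == K_REG);
    assert(nout < 8);
    snprintf(out[nout++], sizeof out[0], "sll %%g%d, %ld, %%g%d",
             shiftee.reg, count.value, res.reg);
    push_item(res);
}

/* Helper for inspecting what has been emitted so far. */
static int emitted_is(int i, const char *text)
{
    return i < nout && strcmp(out[i], text) == 0;
}
```
.DE
Calling \*(Siem_lol(-4)\*(So, \*(Siem_loc(2)\*(So and \*(Siem_sli()\*(So
in that order emits exactly the two instructions shown above and leaves
the result register as the only delayed item, without a single access to
the EM stack.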
.NH 3
A mechanism for push-pop optimization
.PP
To implement the above optimization we need some mechanism to
temporarily store information during compilation.
We need to be able to store, compare and retrieve information from the
temporary storage (cache) without any
loss of information. Before describing all the routines used
to implement our cache we will first describe how the cache works.
.PP
Items in the cache are structures containing an external (\*(Sichar *\*(So),
two registers (\*(Sireg_t\*(So) and a constant (\*(Siarith\*(So),
any of which may be 0.
The value of such a structure is the sum of (the values of)
its elements. To put a register in the cache, one has to be allocated either
by calling \*(Sialloc_reg\*(So which returns a free register, by
\*(Siforced_alloc_reg\*(So which allocates a specific register or by any
of the other routines available to allocate a register. To keep things
simple, we will not discuss all of the available primitives here.
When the register
is then put in the cache by the \*(Sipush_reg\*(So routine, the ownership will
be transferred from the user to the cache. Ownership is important, because
only the owner of a register may (and must!) deallocate it. Registers can be
owned by either an (imaginary) register manager, the cache or the user.
When the user retrieves a register from the cache with \*(Sipop_reg\*(So for
instance, ownership returns to the user.
The user should then call \*(Sifree_reg\*(So
to transfer ownership to the register manager or call \*(Sipush_reg\*(So
to give it back to the cache.
Since the cache behaves like a stack we will use the terms pop and push
for getting items from and putting items in the cache, respectively.
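.PP
The ownership rules can be illustrated with a toy model in C.
This is a sketch only and not the implementation of the real routines:
it assumes that \*(Sireg_t\*(So is a small integer and \*(Siarith\*(So a
signed integer type, the \*(Siowner_of\*(So table and the
\*(Siowner\*(So enumeration are hypothetical, and \*(Sipop_reg\*(So only
handles items that consist of a single register.
.DS
```c
#include <assert.h>
#include <stddef.h>

typedef int reg_t;     /* assumption: registers are small integers */
typedef long arith;    /* assumption: arith is a signed integer type */

enum owner { OWN_MANAGER, OWN_USER, OWN_CACHE };

#define NREGS 8
static enum owner owner_of[NREGS];  /* all start out as OWN_MANAGER */

struct cache_item {
    const char *ext;   /* external symbol, or NULL */
    reg_t reg, reg2;   /* registers, 0 meaning none */
    arith cst;         /* constant part */
};

static struct cache_item cache[16];
static int tos;        /* number of items in the cache */

static reg_t alloc_reg(void)
{
    reg_t r;
    for (r = 1; r < NREGS; r++) {
        if (owner_of[r] == OWN_MANAGER) {
            owner_of[r] = OWN_USER;
            return r;
        }
    }
    return 0;  /* the real routine would flush part of the cache instead */
}

static void push_reg(reg_t r)
{
    struct cache_item it = { NULL, 0, 0, 0 };
    assert(owner_of[r] == OWN_USER && tos < 16);
    it.reg = r;
    cache[tos++] = it;
    owner_of[r] = OWN_CACHE;       /* user -> cache */
}

static reg_t pop_reg(void)
{
    struct cache_item it;
    assert(tos > 0);
    it = cache[--tos];
    /* simplification: only plain single-register items are handled */
    assert(it.ext == NULL && it.reg2 == 0 && it.cst == 0);
    owner_of[it.reg] = OWN_USER;   /* cache -> user */
    return it.reg;
}

static void free_reg(reg_t r)
{
    assert(owner_of[r] == OWN_USER);
    owner_of[r] = OWN_MANAGER;     /* user -> register manager */
}
```
.DE
The assertions encode the rule that only the current owner may hand a
register on: pushing or freeing a register that the user does not own
aborts the model.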
.PP
We shall now present the sets of routines that implement the cache.
.IP \(bu
The routines
.DS
\*(Si
reg_t alloc_reg(void)
reg_t alloc_reg_var(void)
reg_t alloc_float(void)
reg_t alloc_float_var(void)
reg_t alloc_double(void)
reg_t alloc_double_var(void)

void forced_alloc_reg(reg_t)
void soft_alloc_reg(reg_t)

void free_reg(reg_t)
void free_double_reg(reg_t)
\*(So
.DE
allocate and deallocate registers. If there are no more registers left,
i.e. they are all owned by the cache,
one or more registers will be freed by flushing part of the cache
onto the real stack.
The \*(Sialloc_xxx_var\*(So primitives try to allocate a register that
can be used to store local variables. (In the current implementation
only the input and local registers.) If none can be found \*(SiNULL\*(So
is returned. \*(Siforced_alloc_reg\*(So forces the allocation of a certain
register. If it was already in use, its contents are moved to another
register. Finally \*(Sisoft_alloc_reg\*(So provides the possibility to
push a register onto the cache and still keep a copy for later use.
(Used to implement the \*(Sidup 4\*(So for example.)
.IP \(bu
The routines
.DS
\*(Si
void push_const(arith)
arith pop_const(void)
\*(So
.DE
push or pop a constant onto or from the cache. A distinction between
constants and other types is made so as not to lose any information; constants
may be used later on as immediate operands, which is not the case
for other types. If \*(Sipop_const\*(So is called, but the element on top of
the cache has the external field or one of the register fields non-zero, a
fatal error will be reported.
.IP \(bu
The routines
.DS
\*(Si
reg_t pop_reg(void)
reg_t pop_float(void)
reg_t pop_double(void)
reg_t pop_reg_c13(char *n)

void pop_reg_as(reg_t)

void push_reg(reg_t)
\*(So
.DE
push or pop a register. These will be used most often since results from one
EM instruction, which are computed in a register, are often used in the next.
When the element on top of the cache is more
than just a register the cache manager
will generate code to compute the sum of its fields and put the result in a
register. This register will then be given to the user.
If the user wants the result in a specific register, the
\*(Sipop_reg_as\*(So routine should be used.
The \*(Sipop_reg_c13\*(So routine also yields an optional number (as a character string) whose
value can be represented in 13 bits. The constant can then be used as an
offset for the SPARC \*(Sild\*(So and \*(Sist\*(So instructions.
.IP \(bu
The routine
.DS
\*(Si
void push_ext(char *)
\*(So
.DE
pushes an external onto the cache. There is no pop-variant of this one since
there is no use in popping an external.
.IP \(bu
The routines
.DS
\*(Si
void inc_tos(arith n)
void inc_tos_reg(reg_t r)
\*(So
.DE
increment the element on top of the cache by either the constant \*(Sin\*(So
or by the register \*(Sir\*(So. The latter is useful for pointer addition when referencing
external memory.
.KS
.IP \(bu
The routine
.DS
\*(Si
int type_of_tos(void)
\*(So
.DE
.KE
returns the type of the element on top of the cache. This is a combination
(binary OR) of \*(SiT_ext\*(So, \*(SiT_reg\*(So or \*(SiT_float\*(So,
\*(SiT_reg2\*(So or \*(SiT_float2\*(So, and \*(SiT_cst\*(So,
and tells the
user which of the fields are non-zero. When a register field
represents \*(Si%g0\*(So, it is considered zero.
.IP \(bu
Miscellaneous routines:
.DS
\*(Si
void init_cache(void)
void cache_need(int)
void change_reg(void)
void flush_cache(void)
\*(So
.DE
\*(Siinit_cache\*(So should be called before any
other cache routines, to initialize some internal data structures.
\*(Sicache_need\*(So is used to tell the cache that a certain number
of registers are needed for the next operation. This way the cache can
load them efficiently in one fell swoop. \*(Sichange_reg\*(So is to be
called when the user changes a register of which the cache (possibly) has
co-ownership. Because the contents of registers in the cache are
not allowed to change the user should call \*(Sichange_reg\*(So to
instruct the cache to copy the contents to some other register.
\*(Siflush_cache\*(So writes the cache to the stack and invalidates
the cache. It should be used before branches,
before labels and at other places where the stack has to be valid (i.e. where
every item on the EM-stack should be stored on the real stack, not in some
virtual cache).
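.LP
How \*(Sitype_of_tos\*(So and the \*(Siinc_tos\*(So routines could work
on a single cache item is sketched below in C.
The flag values, the helper names \*(Sitype_of\*(So, \*(Siinc_item\*(So
and \*(Siinc_item_reg\*(So, and the plain integer register type are all
assumptions made for this illustration; the floating point flags are left
out and the real routines operate on the top of the cache instead of on
an explicit pointer.
.DS
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values; the real ones belong to the back end. */
enum { T_cst = 1, T_reg = 2, T_reg2 = 4, T_ext = 8 };

struct cache_item {
    const char *ext;   /* external symbol, or NULL */
    int reg, reg2;     /* 0 stands for %g0, which counts as zero */
    long cst;          /* constant part */
};

/* Binary OR of the flags of all non-zero fields. */
static int type_of(const struct cache_item *it)
{
    int t = 0;
    if (it->ext != NULL) t |= T_ext;
    if (it->reg != 0)    t |= T_reg;
    if (it->reg2 != 0)   t |= T_reg2;
    if (it->cst != 0)    t |= T_cst;
    return t;
}

/* Adding a constant generates no code: the constant field absorbs it. */
static void inc_item(struct cache_item *it, long n)
{
    it->cst += n;
}

/* Adding a register fills a free register slot. Returns 0 when both
 * slots are taken, in which case code would have to be generated. */
static int inc_item_reg(struct cache_item *it, int r)
{
    assert(r != 0);
    if (it->reg == 0)  { it->reg = r;  return 1; }
    if (it->reg2 == 0) { it->reg2 = r; return 1; }
    return 0;
}
```
.DE
Because the value of an item is the sum of its fields, an address such as
an external plus a constant plus an index register stays in one item
until some routine really needs it in a register.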
.NH 3
Implementing push-pop optimization in the EM_table
.PP
As indicated above, there is no regular way to represent the described
optimization in the EM_table. The only possible escapes from the EM_table
are function calls, but that is clearly not enough to implement a good
push-pop optimizer. Therefore we will use a modified version of the EM_table
format, where the description of, say, the \*(Silol\*(So instruction might look
like this\(dg:
.FS
\(dg This is not the way the \*(Silol\*(So actually looks in the EM_table;
it only shows how it \fImight\fR look using the aforementioned push/pop
primitives.
.FE
.DS
\*(Si
reg_t A, B;
const_str_t n;

A = alloc_reg();
push_reg(LB);
inc_tos($1);
B = pop_reg_c13(n);
"ld [$B+$n], $A";
push_reg(A);
free_reg(B);
\*(So
.DE
For more details about the exact implementation consult
appendix B which contains some characteristic excerpts from the EM_table.
.NH 2
Stack management
.PP
When converting EM code to some executable code there is the problem of
maintaining multiple stacks. The usual way to do this is described in
.[ [
Description of a Machine Architecture
.]]
and is shown in figure \*(SN1.
.KS
.PS
copy "pics/EM_stack.orig"
.PE
.ce 1
\fIFigure \*(SN1: usual stack management.\fR
.KE
.sp
.LP
This means that the EM stack and the hardware stack (used
for subroutine calls, etc.) are interleaved in memory. On the SPARC, however,
this brings up a large problem: in the former model it is assumed that the
resolution of the stack pointer is a word, but this is not the case on the
SPARC processor. On the SPARC processor the stack-pointer as well as the
frame-pointer have to be aligned on 8-byte boundaries, so one cannot simply
push a word on the stack and then lower the stack-pointer by 4 bytes!
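.PP
The alignment problem is easy to see in a few lines of C.
The sketch below only does address arithmetic on plain integers; the names
\*(Siis_aligned\*(So, \*(Sipush_word\*(So and \*(Sialign_down\*(So are
hypothetical and the starting address in the example is arbitrary.
.DS
```c
#include <assert.h>
#include <stdint.h>

#define SPARC_STACK_ALIGN 8u  /* %sp and %fp must be multiples of 8 */
#define EM_WORD_SIZE 4u       /* an EM word is 4 bytes */

static int is_aligned(uintptr_t addr)
{
    return (addr % SPARC_STACK_ALIGN) == 0;
}

/* What a naive word push would do to a downward-growing stack pointer. */
static uintptr_t push_word(uintptr_t sp)
{
    assert(sp >= EM_WORD_SIZE);
    return sp - EM_WORD_SIZE;
}

/* Nearest aligned address at or below addr. */
static uintptr_t align_down(uintptr_t addr)
{
    return addr - (addr % SPARC_STACK_ALIGN);
}
```
.DE
Starting from the aligned address 0x1000, a single word push yields 0xffc,
which is no longer a legal value for the hardware stack pointer; only after
rounding down to 0xff8 is it aligned again, at the price of a 4-byte hole.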
.NH 3
Possible solutions
.PP
A simple idea might be to use a Swiss-cheese stack; we could
push a 4-byte word onto the stack and then lower the stack pointer by 8.
Unfortunately, this is not a very solid solution, because
pointer-arithmetic involving pointers to objects on the stack would cause
hard-to-predict anomalies.
.PP
Another try would be not to use the hardware stack at all. As long as we
do not generate subroutine-calls everything will be all right. This
approach, however, also has some disadvantages: first we would not be able
to use any of the existing debuggers such as \fIadb\fR, because they all
assume a regular stack format. Secondly, we would not be able to make use
of the SPARC's register windows to keep local variables. Finally, doing all the
administrative work necessary for subroutine calls ourselves instead of
letting the hardware handle it for us,
causes unnecessary procedure-call overhead.
.PP
Yet another alternative would be to emulate the EM-part of the stack,
and to let the hardware handle the subroutine call. Since we will
emulate our own stack, there are no alignment restrictions and because
we will use the hardware procedure call we can still make use of
the register windows.
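.PP
An emulated EM stack of this kind can be pictured as an ordinary array
with its own stack pointer, completely separate from the hardware stack.
The C sketch below is an illustration only: the array size and the names
\*(Siem_stack\*(So, \*(Siem_sp\*(So, \*(Siem_push\*(So, \*(Siem_pop\*(So
and \*(Siem_depth\*(So are hypothetical, and in generated code the role
of \*(Siem_sp\*(So is played by the register \*(Si%SP\*(So.
.DS
```c
#include <assert.h>
#include <stdint.h>

#define EM_STACK_WORDS 64

static int32_t em_stack[EM_STACK_WORDS];
/* The emulated stack grows downward from the end of the array. */
static int32_t *em_sp = em_stack + EM_STACK_WORDS;

/* Push one 4-byte word; no 8-byte alignment is required here. */
static void em_push(int32_t w)
{
    assert(em_sp > em_stack);
    *--em_sp = w;
}

static int32_t em_pop(void)
{
    assert(em_sp < em_stack + EM_STACK_WORDS);
    return *em_sp++;
}

/* Number of words currently on the emulated stack. */
static int em_depth(void)
{
    return (int)(em_stack + EM_STACK_WORDS - em_sp);
}
```
.DE
Since the hardware stack pointer is never touched by these operations,
it can stay aligned on 8 bytes while the emulated stack moves in steps
of one word.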
.NH 3
Our implementation
.PP
To implement the hybrid stack we need two extra registers: one for the
EM stack pointer (the aforementioned \*(Si%SP\*(So) and one for the
EM local base pointer (\*(Si%LB\*(So). The most elegant solution would be to
put both stacks in different segments, so they would not influence
each other. Unfortunately
.UX
lacks the ability to add segments and
since we will implement our backend under
.UX ,
we will have to put
both stacks in the same segment. Exactly how this can be done is shown
in figure \*(SN2.
.DS
.PS
copy "pics/mem_config"
.PE
.ce 1
\fIFigure \*(SN2: our stack management.\fR
.DE
.sp
During normal procedure execution, the SPARC stack pointer has to point to
a memory location where the operating system can dump the active part of
the register window. The rest of the
register window will be dumped in the (stack) space pre-allocated for that purpose
by following the frame
pointer. When a signal occurs things get even more complicated and
result in figure \*(SN3.
.DS
.PS
copy "pics/signal_stack"
.PE
.ce 1
\fIFigure \*(SN3: our signal stack.\fR
.DE
.PP
The exact implementation of the stack is shown in figure \*(SN4.
.KF
.PS
copy "pics/EM_stack.ours"
.PE
.ce 1
\fIFigure \*(SN4: stack overview.\fR
.KE
.NH 2
Miscellaneous
.PP
As mentioned in the previous chapter, the generated \fI.o\fR-files are
not compatible with Sun's own object format. The primary reason for
this is that Sun usually passes the first six parameters of a procedure call
through registers. If we were to do that too, we would always have
to fetch the top six words from the stack into registers, even when
the procedure would not have any parameters at all. Apart from this,
structure-passing is another exception in Sun's object format which
makes it impossible to generate object-compatible code.\(dg
.FS
\(dg Exactly how Sun passes structures as parameters is described in
Appendix D of the SPARC Architecture Manual (Software Considerations).
.FE
.bp