.In
.hw data-structures
.nr H1 3
.NH
SOLUTIONS
.NH 2
Maintaining SPARC speed
.PP
In chapter 3 we wrote:
.sp 0.3
.nf
>If we want to generate efficient code, we should at least try to reduce the number of
>memory references and use registers wherever we can.
.fi
.sp 0.3
In this chapter we will devise a strategy to swiftly generate acceptable
code by using push-pop optimization.
Note that this is not the push-pop
optimization already available in the EM-kit, since that is only present
in the assembler-to-binary part which we do not use
.[ [
The Code Expander Generator
.]].
Our push-pop optimization
works more like the fake-stack described in
.[ [
The table driven code generator
.]].
.NH 3
Ad-hoc optimization
.PP
Before getting involved in any optimization let's have a look at some
code generated with a straightforward EM to SPARC conversion of the
C statement: \*(Sif(a[i]);\*(So Note that \*(Si%SP\*(So is an alias
for a general purpose
register and acts as the EM stack pointer. It has nothing to do with
\*(Si%sp\*(So \(em the SPARC stack pointer.
Analogously, \*(Si%LB\*(So is EM's local base pointer.
.br
.IP
.HS
.TS
;
l s l s l
l1f6 lf6 l2f6 lf6 l.
EM code	SPARC code	Comment

lae	_a	set	_a, %g1	! load address of external _a
		dec	4, %SP
		st	%g1, [%SP]

lol	-4	set	-4, %g1	! load local -4 (i)
		ld	[%g1+%LB], %g2
		dec	4, %SP
		st	%g2, [%SP]

loc	2	set	2, %g1	! load constant 2
		dec	4, %SP
		st	%g1, [%SP]

sli	4	ld	[%SP], %g1	! pop shift count
		ld	[%SP+4], %g2	! pop shiftee
		sll	%g2, %g1, %g3
		inc	4, %SP
		st	%g3, [%SP]	! push 4 * i

ads	4	ld	[%SP], %g1	! add pointer and offset
		ld	[%SP+4], %g2
		add	%g1, %g2, %g3
		inc	4, %SP
		st	%g3, [%SP]	! push address of _a + (4 * i)

loi	4	ld	[%SP], %g1	! load indirect 4 bytes
		ld	[%g1], %g2
		st	%g2, [%SP]	! push a[i]
cal	_f
		...
.TE
.HS
.LP
Although the code is easy to understand, it clearly is far from optimal.
The above code uses approximately 60 clock-cycles\(dg
.FS
\(dg In general each instruction only takes one cycle,
except for \*(Sild\*(So and
\*(Sist\*(So which may both require additional clock cycles. The exact number
of extra cycles needed depends on the SPARC implementation and memory access
time. Furthermore, the
\*(Siset\*(So pseudo-instruction is a bit tricky. It takes one cycle when
its argument lies between -4096 and 4095, and two cycles otherwise.
.FE
to push an array-element on the stack,
something which a 68020 can do in a single instruction. The SPARC
processor may be fast, but not fast enough to justify the above code.
.PP
The same statement can be translated much more efficiently:
.DS
.TS
;
l2f6 lf6 l.
sll	%i0, 2, %g2	! multiply index by 4
set	_a, %g3
ld	[%g2+%g3], %g1	! get contents of a[i]
dec	4, %SP
st	%g1, [%SP]	! push a[i] onto the stack
.TE
.DE
which, instead of 60, uses only 5 clock cycles to retrieve the element
from memory and 5 additional cycles when the result has to be pushed
on the stack. Note that when the result is not a parameter it does not
have to be pushed on the stack. By making efficient use of the SPARC
registers we can fetch \*(Sia[i]\*(So in only 5 cycles!
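.PP
The rule of thumb from the footnote can be written down as a rough cost
model. The following is only a sketch: the function names and the
\*(Simem_extra\*(So parameter are ours and not part of the backend, and the
real cost of a memory access depends on the SPARC implementation.

```c
#include <assert.h>

/* Cycles for the `set` pseudo-instruction, following the footnote:
 * one cycle if the value lies in -4096..4095, two otherwise. */
int set_cycles(long value)
{
    return (value >= -4096 && value <= 4095) ? 1 : 2;
}

/* Rough estimate for a straight-line instruction sequence.
 *   n_simple:  number of instructions that take one cycle each
 *   n_mem:     number of ld/st instructions
 *   mem_extra: ASSUMED extra cycles per ld/st (implementation dependent) */
int estimate_cycles(int n_simple, int n_mem, int mem_extra)
{
    assert(n_simple >= 0 && n_mem >= 0 && mem_extra >= 0);
    return n_simple + n_mem * (1 + mem_extra);
}
```

Such a model only says that every \*(Sild\*(So or \*(Sist\*(So removed from the
generated code saves more than a single cycle, which is why the stack traffic
is the first thing to attack.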
.NH 3
Analyzing optimization
.PP
Instead of ad-hoc optimization we will need something more solid.
When one tries to optimize the above code in an ad-hoc manner one will
probably notice the large overhead due to stack access. Almost every EM
instruction requires at least three SPARC instructions: one to carry out
the EM instruction and two to pop and push the result from and onto the
stack. This happens for every instruction, even though the data being pushed
will probably be needed by the next instruction. To optimize this extensive
pushing and popping of data we will use the appropriately named push-pop
optimization.
.PP
The idea behind push-pop optimization is to delay the push operation until
it is almost certain that the data actually has to be pushed.
Often the data does not have to be pushed at all,
but will be used as input to another EM instruction.
If we can decide at compile time that this will indeed be
the case we can save the time of first pushing the data and then popping it
back again by temporarily storing the data (possibly only during compilation!)
and using it no sooner than it is actually needed.
.PP
The \*(Sisli 4\*(So instruction, for instance, expects two inputs on top of the
stack: on top a counter and right below that the shiftee (the number
to be shifted). As a result \*(Sisli\*(So
pushes 'shiftee << counter' back onto the stack. Now consider the following
sequence, which could be the result of the expression \*(Si4 * i\*(So:
.DS
.TS
;
l1f6 lf6 l.
lol	-4
loc	2
sli	4
.TE
.DE
In the non-optimized situation the \*(Silol\*(So would push
a local variable (whose offset is -4) on the stack.
Then the \*(Siloc\*(So pushes a 2 on the stack and finally \*(Sisli\*(So
retrieves both these numbers to replace them with the result.
On most machines it is not necessary to
push the 2 on the stack, since it can be used in the shift instruction
as an immediate operand. On a SPARC, for instance, one can write
.DS
.TS
;
l2f6 lf6 l.
ld	[%LB-4], %g1	! load local variable into register g1
sll	%g1, 2, %g2	! perform the shift-left-by-2
.TE
.DE
where the output of the \*(Silol\*(So as well as the immediate operand 2 are used
in the shift instruction. As suggested before, all of this can be
achieved with push-pop optimization.
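.PP
To make the idea concrete, here is a minimal sketch in C of a compile-time
stack that defers pushes for exactly this three-instruction sequence.
It is not the backend's code: the names \*(Siem_lol\*(So, \*(Siem_loc\*(So and
\*(Siem_sli\*(So, the endless supply of \*(Si%g\*(So registers and the text buffer
are assumptions made for the illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* A compile-time stack entry: a known constant, or a value in a register. */
enum kind { K_CONST, K_REG };
struct entry { enum kind kind; long value; };  /* constant or register number */

#define MAX_DEPTH 16
struct entry fake[MAX_DEPTH];   /* the compile-time ("fake") stack */
int depth = 0;
int next_reg = 1;               /* ASSUMPTION: endless supply of %g registers */
char emitted[256];              /* generated SPARC text accumulates here */

static void emit(const char *line)
{
    size_t used = strlen(emitted);
    snprintf(emitted + used, sizeof emitted - used, "%s\n", line);
}

static void push(enum kind kind, long value)
{
    assert(depth < MAX_DEPTH);
    fake[depth].kind = kind;
    fake[depth].value = value;
    depth++;
}

/* lol: the load itself is needed, but the push of its result is deferred. */
void em_lol(long offset)
{
    char buf[64];
    int r = next_reg++;
    snprintf(buf, sizeof buf, "ld [%%LB%+ld], %%g%d", offset, r);
    emit(buf);
    push(K_REG, r);
}

/* loc: emit nothing, just remember the constant. */
void em_loc(long c) { push(K_CONST, c); }

/* Make sure an entry lives in a register; a constant needs a `set`. */
static int to_reg(struct entry e)
{
    char buf[64];
    if (e.kind == K_REG)
        return (int)e.value;
    int r = next_reg++;
    snprintf(buf, sizeof buf, "set %ld, %%g%d", e.value, r);
    emit(buf);
    return r;
}

/* sli: a constant shift count is used as an immediate operand. */
void em_sli(void)
{
    char buf[64];
    assert(depth >= 2);
    struct entry count = fake[--depth];
    struct entry shiftee = fake[--depth];
    int s = to_reg(shiftee);
    int r = next_reg++;
    if (count.kind == K_CONST)
        snprintf(buf, sizeof buf, "sll %%g%d, %ld, %%g%d", s, count.value, r);
    else
        snprintf(buf, sizeof buf, "sll %%g%d, %%g%ld, %%g%d", s, count.value, r);
    emit(buf);
    push(K_REG, r);             /* the result stays in a register, too */
}
```

Calling \*(Siem_lol(-4)\*(So, \*(Siem_loc(2)\*(So and \*(Siem_sli()\*(So in that order
leaves only the \*(Sild\*(So and the \*(Sisll\*(So in \*(Siemitted\*(So; no
\*(Sist\*(So to the EM stack is generated at all.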
.NH 3
A mechanism for push-pop optimization
.PP
To implement the above optimization we need some mechanism to
temporarily store information during compilation.
We need to be able to store, compare and retrieve information from the
temporary storage (cache) without any
loss of information. Before describing all the routines used
to implement our cache we will first describe how the cache works.
.PP
Items in the cache are structures containing an external (\*(Sichar *\*(So),
two registers (\*(Sireg_t\*(So) and a constant (\*(Siarith\*(So),
any of which may be 0.
The value of such a structure is the sum of (the values of)
its elements. To put a register in the cache, one has to be allocated either
by calling \*(Sialloc_reg\*(So which returns a free register, by
\*(Siforced_alloc_reg\*(So which allocates a specific register, or by any
of the other routines available to allocate a register. To keep things
simple, we will not discuss all of the available primitives here.
When the register
is then put in the cache by the \*(Sipush_reg\*(So routine, the ownership will
be transferred from the user to the cache. Ownership is important, because
only the owner of a register may (and must!) deallocate it. Registers can be
owned by either an (imaginary) register manager, the cache or the user.
When the user retrieves a register from the cache with \*(Sipop_reg\*(So, for
instance, ownership returns to the user.
The user should then call \*(Sifree_reg\*(So
to transfer ownership to the register manager or call \*(Sipush_reg\*(So
to give it back to the cache.
Since the cache behaves as a stack, we will use the terms pop and push
for getting items from and putting items in the cache, respectively.
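.PP
The ownership rules can be pictured with a small model. The sketch below is
an illustration only: the \*(Siown_\*(So names, the fixed number of registers
and the representation of a register as a small integer are assumptions, and
items holding more than a single register are left out.

```c
#include <assert.h>

typedef int reg_t;              /* ASSUMPTION: register 0 stands for %g0 */

enum owner { OWN_MANAGER, OWN_USER, OWN_CACHE };

#define NREGS 8
enum owner owner_of[NREGS];     /* zero-initialized: all OWN_MANAGER */
static reg_t cached[NREGS];     /* registers currently in the cache */
static int cached_depth;

/* manager -> user; returns 0 when nothing is free (the real allocator
 * would flush part of the cache instead of giving up) */
reg_t own_alloc_reg(void)
{
    for (reg_t r = 1; r < NREGS; r++) {
        if (owner_of[r] == OWN_MANAGER) {
            owner_of[r] = OWN_USER;
            return r;
        }
    }
    return 0;
}

/* user -> cache */
void own_push_reg(reg_t r)
{
    assert(owner_of[r] == OWN_USER);
    assert(cached_depth < NREGS);
    owner_of[r] = OWN_CACHE;
    cached[cached_depth++] = r;
}

/* cache -> user */
reg_t own_pop_reg(void)
{
    assert(cached_depth > 0);
    reg_t r = cached[--cached_depth];
    owner_of[r] = OWN_USER;
    return r;
}

/* user -> manager; only the owner may (and must) do this */
void own_free_reg(reg_t r)
{
    assert(owner_of[r] == OWN_USER);
    owner_of[r] = OWN_MANAGER;
}
```

The assertions encode the rule that a register changes hands only through its
current owner; freeing a register that sits in the cache trips an assertion
in this model.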
.PP
We shall now present the sets of routines that implement the cache.
.IP \(bu
The routines
.DS
\*(Si
reg_t alloc_reg(void)
reg_t alloc_reg_var(void)
reg_t alloc_float(void)
reg_t alloc_float_var(void)
reg_t alloc_double(void)
reg_t alloc_double_var(void)

void forced_alloc_reg(reg_t)
void soft_alloc_reg(reg_t)

void free_reg(reg_t)
void free_double_reg(reg_t)
\*(So
.DE
allocate and deallocate registers. If there are no more registers left,
i.e. they are all owned by the cache,
one or more registers will be freed by flushing part of the cache
onto the real stack.
The \*(Sialloc_xxx_var\*(So primitives try to allocate a register that
can be used to store local variables. (In the current implementation
only the input and local registers.) If none can be found, \*(SiNULL\*(So
is returned. \*(Siforced_alloc_reg\*(So forces the allocation of a certain
register. If it was already in use, its contents are moved to another
register. Finally, \*(Sisoft_alloc_reg\*(So provides the possibility to
push a register onto the cache and still keep a copy for later use.
(Used to implement the \*(Sidup 4\*(So for example.)
.IP \(bu
The routines
.DS
\*(Si
void push_const(arith)
arith pop_const(void)
\*(So
.DE
push or pop a constant onto or from the stack. Distinction between
constants and other types is made so as not to lose any information; constants
may be used later on as immediate operands, which is not the case
for other types. If \*(Sipop_const\*(So is called, but the element on top of
the cache has either the external or one of the register fields non-zero, a
fatal error will be reported.
.IP \(bu
The routines
.DS
\*(Si
reg_t pop_reg(void)
reg_t pop_float(void)
reg_t pop_double(void)
reg_t pop_reg_c13(char *n)

void pop_reg_as(reg_t)

void push_reg(reg_t)
\*(So
.DE
push or pop a register. These will be used most often since results from one
EM instruction, which are computed in a register, are often used in the next.
When the element on top of the cache is more
than just a register the cache manager
will generate code to compute the sum of its fields and put the result in a
register. This register will then be given to the user.
If the user wants the result in a specific register, the
\*(Sipop_reg_as\*(So routine should be used.
The \*(Sipop_reg_c13\*(So routine additionally gives an optional number
(as a character string) whose
value can be represented in 13 bits. The constant can then be used as an
offset for the SPARC \*(Sild\*(So and \*(Sist\*(So instructions.
.IP \(bu
The routine
.DS
\*(Si
void push_ext(char *)
\*(So
.DE
pushes an external onto the stack. There is no pop-variant of this one since
there is no use in popping an external.
.IP \(bu
The routines
.DS
\*(Si
void inc_tos(arith n)
void inc_tos_reg(reg_t r)
\*(So
.DE
increment the element on top of the cache by either the constant \*(Sin\*(So
or by a register. The latter is useful for pointer addition when referencing
external memory.
.KS
.IP \(bu
The routine
.DS
\*(Si
int type_of_tos(void)
\*(So
.DE
.KE
returns the type of the element on top of the cache. This is a combination
(binary OR) of \*(SiT_ext\*(So, \*(SiT_reg\*(So or \*(SiT_float\*(So,
\*(SiT_reg2\*(So or \*(SiT_float2\*(So, and \*(SiT_cst\*(So,
and tells the
user which of the fields are non-zero. When a register field
holds \*(Si%g0\*(So, it is considered zero.
.IP \(bu
Miscellaneous routines:
.DS
\*(Si
void init_cache(void)
void cache_need(int)
void change_reg(void)
void flush_cache(void)
\*(So
.DE
\*(Siinit_cache\*(So should be called before any
other cache routines, to initialize some internal data structures.
\*(Sicache_need\*(So is used to tell the cache that a certain number
of registers are needed for the next operation. This way the cache can
load them efficiently in one fell swoop. \*(Sichange_reg\*(So is to be
called when the user changes a register of which the cache (possibly) has
co-ownership. Because the contents of registers in the cache are
not allowed to change, the user should call \*(Sichange_reg\*(So to
instruct the cache to copy the contents to some other register.
\*(Siflush_cache\*(So writes the cache to the stack and invalidates
the cache. It should be used before branches,
before labels and at other places where the stack has to be valid (i.e. where
every item on the EM-stack should be stored on the real stack, not in some
virtual cache).
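.PP
The behaviour of the item structure under these routines can be summarized in
a small model. This is a sketch under assumptions, not the backend's
implementation: the \*(Sisk_\*(So prefix marks our own names, the numeric
values of the \*(SiT_\*(So flags are invented, floating point registers are left
out, and an assertion stands in for the fatal error of \*(Sipop_const\*(So.

```c
#include <assert.h>
#include <stddef.h>

typedef int reg_t;              /* ASSUMPTION: 0 stands for %g0 */
typedef long arith;             /* ASSUMPTION */

enum { T_cst = 1, T_reg = 2, T_reg2 = 4, T_ext = 8 };   /* ASSUMED values */

/* The value of an item is ext + reg + reg2 + cst. */
struct item { const char *ext; reg_t reg, reg2; arith cst; };

#define CACHE_MAX 16
static struct item cache[CACHE_MAX];
static int tos = -1;            /* index of the top item, -1 when empty */

static struct item *new_item(void)
{
    assert(tos + 1 < CACHE_MAX);
    struct item *it = &cache[++tos];
    it->ext = NULL;
    it->reg = it->reg2 = 0;
    it->cst = 0;
    return it;
}

void sk_push_const(arith c)       { new_item()->cst = c; }
void sk_push_reg(reg_t r)         { new_item()->reg = r; }
void sk_push_ext(const char *ext) { new_item()->ext = ext; }

/* Adding a constant never needs code: only the cst field changes. */
void sk_inc_tos(arith n)
{
    assert(tos >= 0);
    cache[tos].cst += n;
}

/* Adding a register uses a free register field; when both are taken the
 * real cache would have to generate an add, which this sketch does not do. */
void sk_inc_tos_reg(reg_t r)
{
    assert(tos >= 0);
    if (cache[tos].reg == 0) {
        cache[tos].reg = r;
    } else {
        assert(cache[tos].reg2 == 0);
        cache[tos].reg2 = r;
    }
}

int sk_type_of_tos(void)
{
    assert(tos >= 0);
    const struct item *it = &cache[tos];
    int t = 0;
    if (it->ext != NULL) t |= T_ext;
    if (it->reg != 0)    t |= T_reg;
    if (it->reg2 != 0)   t |= T_reg2;
    if (it->cst != 0)    t |= T_cst;
    return t;
}

arith sk_pop_const(void)
{
    /* only a pure constant may be popped as a constant */
    assert((sk_type_of_tos() & ~T_cst) == 0);
    return cache[tos--].cst;
}
```

Pushing the external \*(Si_a\*(So, incrementing it by a constant and then by a
register yields a single item of type \*(SiT_ext | T_reg | T_cst\*(So, and no
SPARC code has been generated up to that point.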
.NH 3
Implementing push-pop optimization in the EM_table
.PP
As indicated above, there is no regular way to represent the described
optimization in the EM_table. The only possible escapes from the EM_table
are function calls, but that is clearly not enough to implement a good
push-pop optimizer. Therefore we will use a modified version of the EM_table
format, where the description of, say, the \*(Silol\*(So instruction might look
like this\(dg:
.FS
\(dg This is not the way the \*(Silol\*(So actually looks in the EM_table;
it only shows how it \fImight\fR look using the aforementioned push/pop
primitives.
.FE
.DS
\*(Si
reg_t A, B;
const_str_t n;

A = alloc_reg();
push_reg(LB);
inc_tos($1);
B = pop_reg_c13(n);
"ld  [$B+$n], $A";
push_reg(A);
free_reg(B);
\*(So
.DE
For more details about the exact implementation consult
appendix B which contains some characteristic excerpts from the EM_table.
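.PP
The role of the 13-bit constant in the \*(Silol\*(So description can be sketched
as follows. The helper names are hypothetical and registers are passed as
plain strings for brevity; the only facts taken from the text are the 13-bit
limit (-4096 to 4095) and the \*(Sild  [$B+$n], $A\*(So template.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef long arith;             /* ASSUMPTION */

/* Can the constant be encoded as the 13-bit signed offset of ld/st? */
bool fits_c13(arith v)
{
    return v >= -4096 && v <= 4095;
}

/* Fills `n` with the offset text and returns true when `cst` can be used
 * directly; returns false when code would first have to be generated to
 * add the constant to the base register. */
bool split_c13(arith cst, char *n, size_t n_size)
{
    assert(n_size > 0);
    if (!fits_c13(cst)) {
        n[0] = '\0';
        return false;
    }
    snprintf(n, n_size, "%ld", cst);
    return true;
}

/* Mimics the template  "ld [$B+$n], $A"  with %LB as the base register. */
bool sketch_lol(arith offset, const char *dst, char *out, size_t out_size)
{
    char n[16];
    if (!split_c13(offset, n, sizeof n))
        return false;           /* a real backend would emit extra code here */
    snprintf(out, out_size, "ld [%%LB+%s], %s", n, dst);
    return true;
}
```

For the local at offset -4 the sketch fills in the template as
\*(Sild [%LB+-4], %g1\*(So, whereas an offset of 5000 is rejected because it
does not fit in 13 bits.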
.NH 2
Stack management
.PP
When converting EM code to some executable code there is the problem of
maintaining multiple stacks. The usual way to do this is described in
.[ [
Description of a Machine Architecture
.]]
and is shown in figure \*(SN1.
.KS
.PS
copy "pics/EM_stack.orig"
.PE
.ce 1
\fIFigure \*(SN1: usual stack management.\fR
.KE
.sp
.LP
This means that the EM stack and the hardware stack (used
for subroutine calls, etc.) are interleaved in memory. On the SPARC, however,
this brings up a large problem: in the former model it is assumed that the
resolution of the stack pointer is a word, but this is not the case on the
SPARC processor. On the SPARC processor the stack-pointer as well as the
frame-pointer have to be aligned on 8-byte boundaries, so one cannot simply
push a word on the stack and then lower the stack-pointer by 4 bytes!
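.PP
The alignment restriction is easy to state in code. The sketch below uses
names of our own choosing and models the stack pointer as an unsigned
integer; it shows why a 4-byte push cannot be followed by a 4-byte
adjustment of the hardware stack pointer.

```c
#include <assert.h>
#include <stdint.h>

#define SPARC_STACK_ALIGN 8u    /* %sp and %fp must be 8-byte aligned */

/* Is this address a legal value for the hardware stack pointer? */
int is_aligned8(uintptr_t addr)
{
    return (addr % SPARC_STACK_ALIGN) == 0;
}

/* The stack grows downward: reserving `size` bytes has to move the
 * pointer down by `size` rounded up to a multiple of 8. */
uintptr_t reserve_aligned(uintptr_t sp, uintptr_t size)
{
    assert(is_aligned8(sp));
    uintptr_t rounded = (size + SPARC_STACK_ALIGN - 1)
                        & ~(uintptr_t)(SPARC_STACK_ALIGN - 1);
    return sp - rounded;
}
```

Starting from an aligned pointer, lowering it by 4 gives an address that
\*(Siis_aligned8\*(So rejects, while \*(Sireserve_aligned\*(So lowers it by 8 and
leaves a 4-byte hole.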
.NH 3
Possible solutions
.PP
A simple idea might be to use a swiss-cheese stack; we could
push a 4-byte word onto the stack and then lower the stack pointer by 8.
Unfortunately, this is not a very solid solution, because
pointer-arithmetic involving pointers to objects on the stack would cause
hard-to-predict anomalies.
.PP
Another try would be not to use the hardware stack at all. As long as we
do not generate subroutine-calls everything will be all right. This
approach, however, also has some disadvantages: first, we would not be able
to use any of the existing debuggers such as \fIadb\fR, because they all
assume a regular stack format. Secondly, we would not be able to make use
of the SPARC's register windows to keep local variables. Finally, doing all the
administrative work necessary for subroutine calls ourselves instead of
letting the hardware handle it for us
causes unnecessary procedure-call overhead.
.PP
Yet another alternative would be to emulate the EM-part of the stack,
and to let the hardware handle the subroutine call. Since we will
emulate our own stack, there are no alignment restrictions and because
we will use the hardware procedure call we can still make use of
the register windows.
.NH 3
Our implementation
.PP
To implement the hybrid stack we need two extra registers: one for
the EM stack pointer (the aforementioned \*(Si%SP\*(So) and one for the
EM local base pointer (\*(Si%LB\*(So). The most elegant solution would be to
put both stacks in different segments, so they would not influence
each other. Unfortunately
.UX
lacks the ability to add segments and
since we will implement our backend under
.UX,
we will have to put
both stacks in the same segment. Exactly how this can be done is shown
in figure \*(SN2.
.DS
.PS
copy "pics/mem_config"
.PE
.ce 1
\fIFigure \*(SN2: our stack management.\fR
.DE
.sp
During normal procedure execution, the SPARC stack pointer has to point to
a memory location where the operating system can dump the active part of
the register window. The rest of the
register window will be dumped in the (stack) space pre-allocated for that
purpose by following the frame
pointer. When a signal occurs things get even more complicated and
result in figure \*(SN3.
.DS
.PS
copy "pics/signal_stack"
.PE
.ce 1
\fIFigure \*(SN3: our signal stack.\fR
.DE
.PP
The exact implementation of the stack is shown in figure \*(SN4.
.KF
.PS
copy "pics/EM_stack.ours"
.PE
.ce 1
\fIFigure \*(SN4: stack overview.\fR
.KE
.NH 2
Miscellaneous
.PP
As mentioned in the previous chapter, the generated \fI.o\fR-files are
not compatible with Sun's own object format. The primary reason for
this is that Sun usually passes the first six parameters of a procedure call
through registers. If we were to do that too, we would always have
to fetch the top six words from the stack into registers, even when
the procedure would not have any parameters at all. Apart from this,
structure-passing is another exception in Sun's object format which
makes it impossible to generate object-compatible code.\(dg
.FS
\(dg Exactly how Sun passes structures as parameters is described in
Appendix D of the SPARC Architecture Manual (Software Considerations).
.FE
.bp