.so init .hw data-structures .nr H1 3 .NH SOLUTIONS .NH 2 Maintaining SPARC speed .PP In chapter 3 we wrote: .sp 0.3 .nf >If we want to generate efficient code, we should at least try to reduce the number of >memory references and use registers wherever we can. .fi .sp 0.3 In this chapter we will devise a strategy to swiftly generate acceptable code by using push-pop optimization. Note that this is not the push-pop optimization already available in the EM-kit, since that is only present in the assembler-to-binary part, which we do not use .[ [ The Code Expander Generator .]]. Our push-pop optimization works more like the fake-stack described in .[ [ The table driven code generator .]]. .NH 3 Ad-hoc optimization .PP Before getting involved in any optimization, let's have a look at some code generated with a straightforward EM to SPARC conversion of the C statement: \*(Sif(a[i]);\*(So Note that \*(Si%SP\*(So is an alias for a general purpose register and acts as the EM stack pointer. It has nothing to do with \*(Si%sp\*(So \(em the SPARC stack pointer. Analogously, \*(Si%LB\*(So is EM's local base pointer. .br .IP .HS .TS ; l s l s l l1f6 lf6 l2f6 lf6 l. EM code SPARC code Comment lae _a set _a, %g1 ! load address of external _a dec 4, %SP st %g1, [%SP] lol -4 set -4, %g1 ! load local -4 (i) ld [%g1+%LB], %g2 dec 4, %SP st %g2, [%SP] loc 2 set 2, %g1 ! load constant 2 dec 4, %SP st %g1, [%SP] sli 4 ld [%SP], %g1 ! pop shift count ld [%SP+4], %g2 ! pop shiftee sll %g2, %g1, %g3 inc 4, %SP st %g3, [%SP] ! push 4 * i ads 4 ld [%SP], %g1 ! add pointer and offset ld [%SP+4], %g2 add %g1, %g2, %g3 inc 4, %SP st %g3, [%SP] ! push address of _a + (4 * i) loi 4 ld [%SP], %g1 ! load indirect 4 bytes ld [%g1], %g2 st %g2, [%SP] ! push a[i] cal _f ... .TE .HS .LP Although the code is easy to understand, it clearly is far from optimal. The above code uses approximately 60 clock-cycles\(dg .FS \(dg In general each instruction takes only one cycle, except for \*(Sild\*(So and \*(Sist\*(So which may both require additional clock cycles. The exact number of extra cycles needed depends on the SPARC implementation and memory access time. Furthermore, the \*(Siset\*(So pseudo-instruction is a bit tricky. It takes one cycle when its argument lies between -4096 and 4095, and two cycles otherwise. .FE to push an array-element on the stack, something which a 68020 can do in a single instruction. The SPARC processor may be fast, but not fast enough to justify the above code. .PP The same statement can be translated much more efficiently: .DS .TS ; l2f6 lf6 l. sll %i0, 2, %g2 ! multiply index by 4 set _a, %g3 ld [%g2+%g3], %g1 ! get contents of a[i] dec 4, %SP st %g1, [%SP] ! push a[i] onto the stack .TE .DE which, instead of 60, uses only 5 clock cycles to retrieve the element from memory and 5 additional cycles when the result has to be pushed on the stack. Note that when the result is not a parameter it does not have to be pushed on the stack. By making efficient use of the SPARC registers we can fetch \*(Sia[i]\*(So in only 5 cycles! .NH 3 Analyzing optimization .PP Instead of ad-hoc optimization we will need something more solid. When one tries to optimize the above code in an ad-hoc manner, one will probably notice the large overhead due to stack access. Almost every EM instruction requires at least three SPARC instructions: one to carry out the EM instruction and two to pop and push the result from and onto the stack.
This happens for every instruction, even though the data being pushed will probably be needed by the very next instruction. To eliminate this extensive pushing and popping of data we will use the appropriately named push-pop optimization. .PP The idea behind push-pop optimization is to delay the push operation until it is almost certain that the data actually has to be pushed. Often the data does not have to be pushed at all, but will be used as input to another EM instruction. If we can decide at compile time that this will indeed be the case, we can save the time of first pushing the data and then popping it back again by temporarily storing the data (possibly only during compilation!) and using it only when it is actually needed. .PP The \*(Sisli 4\*(So instruction, for instance, expects two inputs on top of the stack: on top the shift count and right below that the shiftee (the number to be shifted). As a result \*(Sisli\*(So pushes 'shiftee << count' back onto the stack. Now consider the following sequence, which could be the result of the expression \*(Si4 * i\*(So: .DS .TS ; l1f6 lf6 l. lol -4 loc 2 sli 4 .TE .DE In the non-optimized situation the \*(Silol\*(So would push a local variable (whose offset is -4) on the stack. Then the \*(Siloc\*(So pushes a 2 on the stack and finally \*(Sisli\*(So retrieves both these numbers and replaces them with the result. On most machines it is not necessary to push the 2 on the stack, since it can be used in the shift instruction as an immediate operand. On a SPARC, for instance, one can write .DS .TS ; l2f6 lf6 l. ld [%LB-4], %g1 ! load local variable into register g1 sll %g1, 2, %g2 ! perform the shift-left-by-2 .TE .DE where both the output of the \*(Silol\*(So and the immediate operand 2 are used directly in the shift instruction. As suggested before, all of this can be achieved with push-pop optimization. .NH 3 A mechanism for push-pop optimization .PP To implement the above optimization we need some mechanism to temporarily store information during compilation. We need to be able to store, compare and retrieve information from this temporary storage (the cache) without any loss of information. Before describing all the routines used to implement our cache, we will first describe how the cache works. .PP Items in the cache are structures containing an external (\*(Sichar *\*(So), two registers (\*(Sireg_t\*(So) and a constant (\*(Siarith\*(So), any of which may be 0. The value of such a structure is the sum of (the values of) its elements. To put a register in the cache, one first has to be allocated, either by calling \*(Sialloc_reg\*(So, which returns a free register, by \*(Siforced_alloc_reg\*(So, which allocates a specific register, or by any of the other routines available to allocate a register. To keep things simple, we will not discuss all of the available primitives here. When the register is subsequently put in the cache by the \*(Sipush_reg\*(So routine, ownership is transferred from the user to the cache. Ownership is important, because only the owner of a register may (and must!) deallocate it. Registers can be owned by either an (imaginary) register manager, the cache or the user. When the user retrieves a register from the cache with \*(Sipop_reg\*(So, for instance, ownership returns to the user. The user should then call \*(Sifree_reg\*(So to transfer ownership to the register manager or call \*(Sipush_reg\*(So to give it back to the cache. Since the cache itself behaves as a stack, we will use the terms pop and push for getting items from and putting items into the cache, respectively.
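.PP As an illustration, an item in the cache could be represented roughly as follows. This is only a sketch: the structure and field names are made up here and need not match the actual implementation. .DS \*(Si struct cache_item { char *ci_ext; /* external name, or 0 */ reg_t ci_reg, ci_reg2; /* register fields, %g0 when unused */ arith ci_cst; /* constant, possibly 0 */ }; /* value represented = ci_ext + ci_reg + ci_reg2 + ci_cst */ \*(So .DE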
.PP We shall now present the sets of routines that implement the cache; a short sketch of their use follows the list. .IP \(bu The routines .DS \*(Si reg_t alloc_reg(void) reg_t alloc_reg_var(void) reg_t alloc_float(void) reg_t alloc_float_var(void) reg_t alloc_double(void) reg_t alloc_double_var(void) void forced_alloc_reg(reg_t) void soft_alloc_reg(reg_t) void free_reg(reg_t) void free_double_reg(reg_t) \*(So .DE allocate and deallocate registers. If there are no more registers left, i.e. they are all owned by the cache, one or more registers will be freed by flushing part of the cache onto the real stack. The \*(Sialloc_xxx_var\*(So primitives try to allocate a register that can be used to store local variables. (In the current implementation only the input and local registers.) If none can be found \*(SiNULL\*(So is returned. \*(Siforced_alloc_reg\*(So forces the allocation of a certain register. If it was already in use, its contents are moved to another register. Finally \*(Sisoft_alloc_reg\*(So provides the possibility to push a register onto the cache and still keep a copy for later use. (Used to implement \*(Sidup 4\*(So, for example.) .IP \(bu The routines .DS \*(Si void push_const(arith) arith pop_const(void) \*(So .DE push or pop a constant onto or from the cache. The distinction between constants and other types is made so as not to lose any information; constants may be used later on as immediate operands, which is not the case for other types. If \*(Sipop_const\*(So is called while the element on top of the cache has a non-zero external or register field, a fatal error is reported. .IP \(bu The routines .DS \*(Si reg_t pop_reg(void) reg_t pop_float(void) reg_t pop_double(void) reg_t pop_reg_c13(char *n) void pop_reg_as(reg_t) void push_reg(reg_t) \*(So .DE push or pop a register. These will be used most often, since results from one EM instruction, which are computed in a register, are often used in the next. When the element on top of the cache is more than just a register, the cache manager will generate code to compute the sum of its fields and put the result in a register. This register will then be given to the user. If the user wants the result in a specific register, he should use the \*(Sipop_reg_as\*(So routine. \*(Sipop_reg_c13\*(So additionally yields, through its argument, an optional constant (as a character string) whose value can be represented in 13 bits. This constant can then be used as an offset in the SPARC \*(Sild\*(So and \*(Sist\*(So instructions. .IP \(bu The routine .DS \*(Si void push_ext(char *) \*(So .DE pushes an external onto the cache. There is no pop-variant of this one, since there is no use in popping an external. .IP \(bu The routines .DS \*(Si void inc_tos(arith n) void inc_tos_reg(reg_t r) \*(So .DE increment the element on top of the cache by either the constant \*(Sin\*(So or the register \*(Sir\*(So. The latter is useful for pointer addition when referencing external memory. .KS .IP \(bu The routine .DS \*(Si int type_of_tos(void) \*(So .DE .KE returns the type of the element on top of the cache. This is a combination (binary OR) of \*(SiT_ext\*(So, \*(SiT_reg\*(So or \*(SiT_float\*(So, \*(SiT_reg2\*(So or \*(SiT_float2\*(So, and \*(SiT_cst\*(So, and tells the user which of the fields are non-zero. A register field holding \*(Si%g0\*(So is considered zero. .IP \(bu Miscellaneous routines: .DS \*(Si void init_cache(void) void cache_need(int) void change_reg(void) void flush_cache(void) \*(So .DE \*(Siinit_cache\*(So should be called before any other cache routine, to initialize some internal data structures. \*(Sicache_need\*(So is used to tell the cache that a certain number of registers is needed for the next operation. This way the cache can load them efficiently in one fell swoop. \*(Sichange_reg\*(So is to be called when the user changes a register of which the cache (possibly) has co-ownership. Because the contents of registers in the cache are not allowed to change, the user should call \*(Sichange_reg\*(So to instruct the cache to copy the contents to some other register. \*(Siflush_cache\*(So writes the cache to the stack and invalidates the cache. It should be used before branches, before labels and in other places where the stack has to be valid (i.e. where every item on the EM-stack should be stored on the real stack, not in some virtual cache).
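.LP To illustrate how these primitives fit together, here is a sketch of how the \*(Sisli 4\*(So instruction from the earlier example might be handled when \*(Sitype_of_tos\*(So reports that only the constant field of the top of the cache is non-zero. This fragment is not taken from the actual EM_table; it merely illustrates the intended use of the primitives, written in the same notation as the EM_table fragment shown in the next subsection. .DS \*(Si reg_t A, B; arith n; n = pop_const(); /* the shift count, e.g. the 2 pushed by "loc 2" */ B = pop_reg(); /* the shiftee, e.g. the local loaded by "lol -4" */ A = alloc_reg(); "sll $B, $n, $A"; /* shift, with the count as immediate operand */ push_reg(A); free_reg(B); \*(So .DE Note that neither the shift count nor the result ever reaches the real stack: the count is used as an immediate operand and the result stays in the cache until it is actually needed.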
.NH 3 Implementing push-pop optimization in the EM_table .PP As indicated above, there is no regular way to represent the described optimization in the EM_table. The only possible escapes from the EM_table are function calls, but that is clearly not enough to implement a good push-pop optimizer. Therefore we will use a modified version of the EM_table format, where the description of, say, the \*(Silol\*(So instruction might look like this\(dg: .FS \(dg This is not the way the \*(Silol\*(So actually looks in the EM_table; it only shows how it \fImight\fR look using the aforementioned push/pop primitives. .FE .DS \*(Si reg_t A, B; const_str_t n; A = alloc_reg(); push_reg(LB); inc_tos($1); B = pop_reg_c13(n); "ld [$B+$n], $A"; push_reg(A); free_reg(B); \*(So .DE For more details about the exact implementation consult appendix B, which contains some characteristic excerpts from the EM_table. .NH 2 Stack management .PP When converting EM code to some executable code there is the problem of maintaining multiple stacks. The usual way to do this is described in .[ [ Description of a Machine Architecture .]] and is shown in figure \*(SN1. .KF .PS copy "pics/EM_stack.orig" .PE .ce 1 \fIFigure \*(SN1: usual stack management.\fR .KE .sp .LP This means that the EM stack and the hardware stack (used for subroutine calls, etc.) are interleaved in memory. On the SPARC, however, this poses a serious problem: in that model it is assumed that the resolution of the stack pointer is a word, but this is not the case on the SPARC processor. On the SPARC both the stack-pointer and the frame-pointer have to be aligned on 8-byte boundaries, so one cannot simply push a word on the stack and then lower the stack-pointer by 4 bytes! .NH 3 Possible solutions .PP A simple idea might be to use a swiss-cheese stack; we could push a 4-byte word onto the stack and then lower the stack-pointer by 8. Unfortunately, this is not a very solid solution, because pointer-arithmetic involving pointers to objects on the stack would cause hard-to-predict anomalies. .PP Another option would be not to use the hardware stack at all. As long as we do not generate subroutine-calls everything will be all right. This approach, however, also has some disadvantages: first, we would not be able to use any of the existing debuggers such as \fIadb\fR, because they all assume a regular stack format. Secondly, we would not be able to make use of the SPARC's register windows to keep local variables.
Finally, doing all the administrative work necessary for subroutine calls ourselves, instead of letting the hardware handle it for us, causes unnecessary procedure-call overhead. .PP Yet another alternative would be to emulate the EM-part of the stack, and to let the hardware handle the subroutine calls. Since we emulate our own stack there are no alignment restrictions, and because we use the hardware procedure call we can still make use of the register windows. .NH 3 Our implementation .PP To implement this hybrid stack we need two extra registers: one for the EM stack pointer (the aforementioned \*(Si%SP\*(So) and one for the EM local base pointer (\*(Si%LB\*(So). The most elegant solution would be to put both stacks in different segments, so they would not influence each other. Unfortunately .UX lacks the ability to add segments, and since we will implement our backend under .UX, we will have to put both stacks in the same segment. Exactly how this can be done is shown in figure \*(SN2. .DS .PS copy "pics/mem_config" .PE .ce 1 \fIFigure \*(SN2: our stack management.\fR .DE .sp During normal procedure execution, the SPARC stack pointer has to point to a memory location where the operating system can dump the active part of the register window. The rest of the register window will be dumped into the (stack) space pre-allocated for this purpose, by following the frame pointer. When a signal occurs things get even more complicated, resulting in the situation shown in figure \*(SN3. .DS .PS copy "pics/signal_stack" .PE .ce 1 \fIFigure \*(SN3: our signal stack.\fR .DE .PP The exact implementation of the stack is shown in figure \*(SN4. .KF .PS copy "pics/EM_stack.ours" .PE .ce 1 \fIFigure \*(SN4: stack overview.\fR .KE .NH 2 Miscellaneous .PP As mentioned in the previous chapter, the generated \fI.o\fR-files are not compatible with Sun's own object format. The primary reason for this is that Sun usually passes the first six parameters of a procedure call through registers. If we were to do that too, we would always have to fetch the top six words from the stack into registers, even when the procedure has no parameters at all. Apart from this, structure-passing is another exception in Sun's object format which makes it impossible to generate object-compatible code.\(dg .FS \(dg Exactly how Sun passes structures as parameters is described in Appendix D of the SPARC Architecture Manual (Software Considerations). .FE .bp