Added

1991-09-27 16:19:24 +00:00 · 1991-09-27 16:19:24 +00:00 · fb51183da2
commit fb51183da2
parent 63c9fea5c2
29 changed files with 2085 additions and 0 deletions
--- a/doc/sparc/.distr
+++ b/doc/sparc/.distr
@ -0,0 +1,14 @@
 1
 2
 3
 4
 5
 A
 B
 init
 intro
 note_on_reg_wins
 refs
 timing
 title
 Makefile
--- a/doc/sparc/1
+++ b/doc/sparc/1
@ -0,0 +1,53 @@
 .so init
 .NH
 INTRODUCTION
 .NH 2
 Why an EM backend for SPARC processors?
 .PP
 With the introduction of SPARC-based computers like the Sun-4, a
 whole new range of fast computers became readily available to the general
 public. The power of large mainframes had been captured into a small
 desk-top computer at only a fraction of the cost.
 .PP
 In the older days, a new computer used to be very hard to integrate into
 the existing environment, but due to standardization in the software world
 incompatibility in hardware no longer means incompatibility in software.
 Programs that are written for computer A can often be run on computer B
 without major modifications. Unfortunately this is not true for all software.
 .PP
 There will always be programs that rely on the specific
 hardware of a certain computer for many different reasons. They
 can be categorized as:
 .IP -
 poorly written programs
 .IP -
 programs to directly control hardware (device drivers)
 .IP -
 code that requires efficiency (time-critical I/O drivers)
 .IP -
 programs to generate code to run on the hardware (compilers)
 .LP
 This project for instance, the design and implementation of an EM backend
 for SPARC processors, comes in the last category.
 .PP
 We have designed and implemented an algorithm to convert EM programs to code
 that will run directly on the SPARC hardware. Henceforth, both the algorithm
 and the implementation will be referred to as the EM-to-SPARC backend,
 or simply: the backend.
 .NH 2
 Why has nobody done this before?
 .PP
 Since EM was designed around 1981 and even SPARC has been around for some
 years now, one may wonder why nobody has ever written an EM to SPARC backend
 before. The reason is twofold. In the first place, there are some
 non-trivial problems to be solved in the design phase, and secondly,
 the SPARC-design combined with the lack of documentation, would surely
 cost a lot of blood, sweat and tears. The absence of
 clues to any of the design problems, combined with the \(em at first
 glance \(em inhuman
 SPARC instruction set did not make this a very attractive project.
 .PP
 On the other hand, these were exactly the reasons which made us take on
 this particular project: it would require design skills, as well as some
 hard work; a golden combination for a successful project.
 .bp
--- a/doc/sparc/2
+++ b/doc/sparc/2
@ -0,0 +1,109 @@
 .so init
 .nr H1 1
 .NH
 CLOSE-UP LOOK
 .NH 2 
 What is EM?
 .PP
 As the abstract of the IR-81 rapport on EM
 .[ [
 description of a machine architecture
 .]]
 says: \*(OQEM is a family
 of intermediate languages designed for producing portable compilers.\*(CQ
 Because EM is to be used on a wide range of languages and processors,
 the instruction set is kept simple enough to allow easy translation to,
 or interpretation on, almost any processor. Yet it is also powerful enough
 to accommodate easy translation from almost any block-structured language.
 .PP
 Even though EM was designed in the early 1980s, it
 is based on
 .\" already shows strong signs of being influenced by
 the (then innovative) RISC architecture. All instructions
 have 0 or 1 operands, there are no fancy addressing modes as in the
 68020's\*(Si move.w a3(_array,d3.w*2), -(sp)\*(So, no explicit registers,
 although instructions for higher languages
 such as array-operations, multiway branches (case) and
 floating point operations are provided.
 .PP
 To fully understand the discussion in the following chapters,
 the reader should at least have some knowledge of EM.
 .NH 2
 What is SPARC?
 .PP
 According to Sun's RISC tutorial: \*(OQSun Microsystems has designed a RISC
 architecture, called SPARC, and has implemented that architecture with
 the Sun-4 family of supercomputing workstations and servers. SPARC stands
 for Scalable Processor ARChitecture, emphasizing its applicability to
 large as well as small machines.\*(CQ
 .PP
 In sharp contrast to EM, SPARC does have
 explicit registers (31 integer and 32 floating point, all of which
 are 32 bits wide) and
 does not support any high level language operations: it does not even have
 multiplication or division instructions. Because the SPARC design is
 very straightforward, all instructions could be hard-coded (no microcode
 involved) to
 provided extremely high performance. All register-to-register operations
 require exactly one clock cycle, and all register-to-memory and
 memory-to-register operations require two clock cycles, one to retrieve
 the instruction and one to access external memory. At a clock speed of
 over 20 MHz this means that you can achieve well over 10 VAX MIPS:
 more than 4 times the speed of a 15 MHz 68020 used in the Sun3/50.
 .PP
 As above, the reader should also have some general knowledge about
 the SPARC processer to be able to understand the following chapters.
 .NH 2
 What exactly is a (fast) backend?
 .PP
 To put in the simplest of ways: a (fast) backend is a set of routines to
 translate EM code to code that will run 'on the metal' (for example the SPARC
 processor). The distinction between full-fledged backends (code generators)
 .[ [
 The table driven code generator
 .]]
 and fast backends (code expanders)
 .[ [
 The Code Expander Generator
 .]]
 is related to
 the compilation-time vs. run-time trade off. Code generators generate
 efficient code and code expanders generate code very efficient.
 For details about code expanders see also
 .[ [
 The design of very fast portable compilers
 .]].
 .PP
 The reasons for us to implement a code expander are numerous: Our first reason to
 implement a code expander, rather than a code generator was that implementing a
 code expander would be hard enough already. Code generators only give
 more problems and there were already enough problems to be solved. Secondly,
 we knew we would never be able to compete with original SPARC compilers due
 to loss of information in the frontends (see also chapter 5). By implementing
 a code expander we might be able to outrun the existing compilers on a
 completely different terrain: compile speed.
 .PP
 The third 'reason' to implement a code expander lies a little deeper and was
 not discovered until we had actually started the implementation... It was only
 then that we found out that for certain architectures, such as the SPARC,
 the idea behind the code-expander is not necessarily inferior to that
 behind a code-generator. It seems that for highly orthogonal instruction
 sets it is possible to generate near optimal code without using the
 code-expander. We have to say, however, that this is only true for our
 optimized version of the code-expander. With the original code-expander
 it would not have been possible to generate near-optimal code for the
 SPARC processor.
 .NH 2
 So, what are the main differences between EM and SPARC?
 .PP
 The main
 difference between EM and SPARC is the stack versus register orientation.
 The other differences, such as the presence of high level language
 operations in EM, can easily be overcome by subroutines,
 or small pieces of in-line SPARC code.
 The design-part of this project mostly concentrates on
 building a bridge between EM's stack and SPARC's registers.
 .PP
 In the next chapter we will make a list of all our design problems which
 will then be discussed in chapter 4.
 .bp
--- a/doc/sparc/3
+++ b/doc/sparc/3
@ -0,0 +1,82 @@
 .so init
 .nr H1 2
 .NH
 PROBLEMS
 .NH 2
 Maintain SPARC speed
 .PP
 If we want to generate SPARC code, we should try to generate efficient code
 as fast as possible. It would be quite embarrassing to find out that the
 same program would run faster on a Motorola 68020 than on a SPARC processor,
 when both operate at the same clock frequency.
 Looking at some code generated by Sun's C-compiler and optimizing assembler,
 we can spot a few remarkable characteristics of the generated SPARC code:
 .IP -
 There are almost no memory references
 .IP -
 Parameters to functions are passed through registers.
 .IP -
 Almost all delay slots\(dg
 .FS
 \(dg For details about delay slots see the SPARC Architecture Manual, chapter 4, pp. 42-48
 .FE
 are filled in by the assembler
 .LP
 If we want to generate efficient code, we should at least try to
 reduce the number of memory references and use registers wherever we can.
 Since EM is stack-oriented it references its stack for every operation so
 this will not be an easy task; a suitable solution will however be given in
 the next chapter.
 .NH 2
 Increase compilation speed
 .PP
 Because we will implement a code expander (fast backend) we should keep
 a close eye on efficiency; if we cannot beat regular compilers on producing
 efficient code we will try to beat them on fast code generation.
 The usual trick to achieve fast compilation is to pack the frontend,
 optimizer, code-generator and
 assembler all into a single large binary to reduce the overhead of
 reading and writing temporary files. Unfortunately, due to the
 SPARC instruction set, its relocation information is slightly bizarre
 and cannot be represented with the present primitives.
 This means that it will not be possible to generate the required output
 format directly from our backend.
 .PP
 There are three solutions here: generate assembler code, and let an
 existing assembler generate the required object (\fI.o\fR) files,
 create our own primitives than can handle the SPARC relocation format, or
 do not use any of the addressing modes that require the bizarre relocation.
 Because we have enough on our hands already we will
 let the existing assembler deal with generating object files.
 .NH 2
 Convert stack to register operations
 .PP
 As we wrote in the previous chapter, for RISC machines a code expander can
 produce almost as efficient code as a code generator. The fact that this is
 true for stack-oriented RISC processors is rather obvious. The problem we
 face, however, is that the SPARC processor is register, instead of 
 stack oriented. In the next chapter we will give a suitable solution to
 convert most stack accesses to register accesses.
 .NH 2
 Miscellaneous
 .PP
 Besides performance and \fI.o\fR-compatibility there are some other
 peculiarities of the SPARC processor and Sun's C-compiler (henceforth
 simply called \fIcc\fR).
 .PP
 For some reason, the SPARC stack pointer requires alignment
 on 8 bytes, so you cannot push a 4-byte integer on the stack
 and then \*(Sisub 4, %sp\*(So\(dd.
 .FS
 \(dd For more information about SPARC assembler see the Sun-4 Assembly
 Language Reference Manual
 .FE
 This too will be discussed in the next chapter, where we will take a
 more in-depth look into this problem and also discuss a couple of
 possible solutions.
 .PP
 Another thing is that \fIcc\fR usually passes the first six parameters of a
 function-call through registers. To be \fI.o\fR-compatible we would have to
 pass the first six parameters of each function call through registers as well.
 Exactly why this is not feasible will also be discussed in the next chapter.
 .bp
--- a/doc/sparc/4
+++ b/doc/sparc/4
@ -0,0 +1,468 @@
 .so init
 .hw data-structures
 .nr H1 3
 .NH
 SOLUTIONS
 .NH 2
 Maintaining SPARC speed
 .PP
 In chapter 3 we wrote:
 .sp 0.3
 .nf
 >If we want to generate efficient code, we should at least try to reduce the number of
 >memory references and use registers wherever we can.
 .fi
 .sp 0.3
 In this chapter we will device a strategy to swiftly generate acceptable
 code by using push-pop optimization.
 Note that this is not the push-pop
 optimization already available in the EM-kit, since that is only present
 in the assembler-to-binary part which we do not use
 .[ [
 The Code Expander Generator
 .]].
 Our push-pop optimization
 works more like the fake-stack described in
 .[ [
 The table driven code generator
 .]].
 .NH 3
 Ad-hoc optimization
 .PP
 Before getting involved in any optimization let's have a look at some
 code generated with a straightforward EM to SPARC conversion of the
 C statement: \*(Sif(a[i]);\*(So Note that \*(Si%SP\*(So is an alias
 for a general purpose
 register and acts as the EM stack pointer. It has nothing to do with
 \*(Si%sp\*(So \(em the SPARC stack pointer.
 Analogous \*(Si%LB\*(So is EMs local base pointer.
 .br
 .IP
 .HS
 .TS
 ;
 l s l s l
 l1f6 lf6 l2f6 lf6 l.
 EM code	SPARC code	Comment
 lae	_a	set	_a, %g1	! load address of external _a
 		dec	4, %SP
 		st	%g1, [%SP]
 lol	-4	set	-4, %g1	! load local -4 (i)
 		ld	[%g1+%LB], %g2
 		dec	4, %SP
 		st	%g2, [%SP]
 loc	2	set	2, %g1	! load constant 2
 		dec	4, %SP
 		st	%g1, [%SP]
 sli	4	ld	[%SP], %g1	! pop shift count
 		ld	[%SP+4], %g2	! pop shiftee
 		sll	%g2, %g1, %g3
 		inc	4, %SP
 		st	%g3, [%SP]	! push 4 * i
 ads	4	ld	[%SP], %g1	! add pointer and offset
 		ld	[%SP+4], %g2
 		add	%g1, %g2, %g3
 		inc	4, %SP
 		st	%g3, [%SP]	! push address of _a + (4 * i)
 loi	4	ld	[%SP], %g1	! load indirect 4 bytes
 		ld	[%g1], %g2
 		st	%g2, [%SP]	! push a[i]
 cal	_f
 		...
 .TE
 .HS
 .LP
 Although the code is easy understand, it clearly is far from optimal.
 The above code uses approximately 60 clock-cycles\(dg
 .FS
 \(dg In general each instruction only takes one cycle,
 except for \*(Sild\*(So and
 \*(Sist\*(So which may both require additional clock cycles. The exact amount
 of extra cycles needed depends on the SPARC implementation and memory access
 time. Furthermore, the
 \*(Siset\*(So pseudo-instruction is a bit tricky. It takes one cycle when
 its argument lies between -4096 and 4095, and two cycles otherwise.
 .FE
 to push an array-element on the stack,
 something which a 68020 can do in a single instruction. The SPARC
 processor may be fast, but not fast enough to justify the above code.
 .PP
 The same statement can be translated much more efficiently:
 .DS
 .TS
 ;
 l2f6 lf6 l.
 sll	%i0, 2, %g2	! multiply index by 4
 set	_a, g3
 ld	[%g2+%g3], %g1	! get contents of a[i]
 dec	4, SP
 st	%g2, [SP]	! push a[i] onto the stack
 .TE
 .DE
 which, instead of 60, uses only 5 clock cycles to retrieve the element
 from memory and 5 additional cycles when the result has to be pushed
 on the stack. Note that when the result is not a parameter it does not
 have to be pushed on the stack. By making efficient use of the SPARC
 registers we can fetch \*(Sia[i]\*(So in only 5 cycles!
 .NH 3
 Analyzing optimization
 .PP
 Instead of ad-hoc optimization we will need something more solid.
 When one tries to optimize the above code in an ad-hoc manner one will
 probably notice the large overhead due to stack access. Almost every EM
 instruction requires at least three SPARC instructions: one to carry out
 the EM instruction and two to pop and push the result from and onto the
 stack. This happens for every instruction, even though the data being pushed
 will probably be needed by the next instruction. To optimize this extensive
 pushing and popping of data we will use the appropriately named push-pop
 optimization.
 .PP
 The idea behind push-pop optimization is to delay the push operation until
 it is almost certain that the data actually has to be pushed.
 As is often the case, the data does not have to be pushed,
 but will be used as input to another EM instruction.
 If we can decide at compile time that this will indeed be
 the case we can save the time of first pushing the data and then popping it
 back again by temporarily storing the data (possibly only during compilation!)
 and using it no sooner than it is actually needed.
 .PP
 The \*(Sisli 4\*(So instruction, for instance, expects two inputs on top of the
 stack: on top a counter and right below that the shiftee (the number
 to be shifted). As a result \*(Sisli\*(So
 pushes 'shiftee << counter' back to the stack. Now consider the following
 sequence, which could be the result of the expression \*(Si4 * i\*(So
 .DS
 .TS
 ;
 l1f6 lf6 l.
 lol	-4
 loc	2
 sli	4
 .TE
 .DE
 In the non-optimized situation the \*(Silol\*(So would push
 a local variable (whose offset is -4) on the stack.
 Then the \*(Siloc\*(So pushes a 2 on the stack and finally \*(Sisli\*(So
 retrieves both these numbers to replace then with the result.
 On most machines it is not necessary to
 push the 2 on the stack, since it can be used in the shift instruction
 as an immediately operand. On a SPARC, for instance, one can write
 .DS
 .TS
 ;
 l2f6 lf6 l.
 ld	[%LB-4], %g1	! load local variable into register g1
 sll	%g1, 2, %g2	! perform the shift-left-by-2
 .TE
 .DE
 where the output of the \*(Silol\*(So, as well as the immediate operand 2 are used
 in the shift instruction. As suggested before, all of this can be
 achieved with push-pop optimization.
 .NH 3
 A mechanism for push-pop optimization
 .PP
 To implement the above optimization we need some mechanism to
 temporarily store information during compilation.
 We need to be able to store, compare and retrieve information from the
 temporary storage (cache) without any
 loss of information. Before describing all the routines used
 to implement our cache we will first describe how the cache works.
 .PP
 Items in the cache are structures containing an external (\*(Sichar *\*(So),
 two registers (\*(Sireg_t\*(So) and a constant (\*(Siarith\*(So),
 any of which may be 0.
 The value of such a structure is the sum of (the values of)
 its elements. To put a register in the cache, one has to be allocated either
 by calling \*(Sialloc_reg\*(So which returns a free register, by
 \*(Siforced_alloc_reg\*(So which allocates a specific register or any
 of the other routines available to allocate a register. The keep things
 simple, we will not discuss all of the available primitives here.
 When the register
 is then put in the cache by the \*(Sipush_reg\*(So routine, the ownership will
 be transferred from the user to the cache. Ownership is important, because
 only the owner of a register may (and must!) deallocate it. Registers can be
 owned by either an (imaginary) register manager, the cache or the user.
 When the user retrieves a register from the stack with \*(Sipop_reg\*(So for
 instance, ownership is back to the user.
 The user should then call \*(Sifree_reg\*(So
 to transfer ownership to the register manager or call \*(Sipush_reg\*(So
 to give it back to the cache.
 Since the cache behaves itself as a stack we will use the term pop resp. push
 to get items from, resp. put items in the cache.
 .PP
 We shall now present the sets of routines that implement the cache.
 .IP \(bu
 The routines
 .DS
 \*(Si
 reg_t alloc_reg(void)
 reg_t alloc_reg_var(void)
 reg_t alloc_float(void)
 reg_t alloc_float_var(void)
 reg_t alloc_double(void)
 reg_t alloc_double_var(void)
 void forced_alloc_reg(reg_t)
 void soft_alloc_reg(reg_t)
 void free_reg(reg_t)
 void free_double_reg(reg_t)
 \*(So
 .DE
 allocate and deallocate registers. If there are no more register left,
 i.e. they are owned by the cache,
 one or more registers will be freed by flushing part of the cache
 onto the real stack.
 The \*(Sialloc_xxx_var\*(So primitives try to allocate a register that
 can be used to store local variables. (In the current implementation
 only the input and local registers.) If none can be found \*(SiNULL\*(So
 is returned. \*(Siforced_alloc_reg\*(So forces the allocation of a certain
 register. If it was already in use, its contents are moved to another
 register. Finally \*(Sisoft_alloc_reg\*(So provides the possibility to
 push a register onto the cache and still keep a copy for later use.
 (Used to implement the \*(Sidup 4\*(So for example.)
 .IP \(bu
 The routines
 .DS
 \*(Si
 void push_const(arith)
 arith pop_const(void)
 \*(So
 .DE
 push or pop a constant onto or from the stack. Distinction between
 constants and other types is made so as not to loose any information; constants
 may be used later on as immediate operators, which is not the case
 for other types. If \*(Sipop_const\*(So is called, but the element on top of
 the cache has either one of the external or register fields non-zero a
 fatal error will be reported.
 .IP \(bu
 The routines
 .DS
 \*(Si
 reg_t pop_reg(void)
 reg_t pop_float(void)
 reg_t pop_double(void)
 reg_t pop_reg_c13(char *n)
 void pop_reg_as(reg_t)
 void push_reg(reg_t)
 \*(So
 .DE
 push or pop a register. These will be used most often since results from one
 EM instruction, which are computed in a register, are often used in the next.
 When the element on top of the cache is more
 than just a register the cache manager
 will generate code to compute the sum of its fields and put the result in a
 register. This register will then be given to the user.
 If the user wants the result is a special register, he should use the
 \*(Sipop_reg_as\*(So routine.
 The \*(Sipop_reg_c13\*(So gives an optional number (as character string) whose
 value can be represented in 13 bits. The constant can then be used as an
 offset for the SPARC \*(Sild\*(So and \*(Sist\*(So instructions.
 .IP \(bu
 The routine
 .DS
 \*(Si
 void push_ext(char *)
 \*(So
 .DE
 pushes an external onto the stack. There is no pop-variant of this one since
 there is no use in popping an external.
 .IP \(bu
 The routines
 .DS
 \*(Si
 void inc_tos(arith n)
 void inc_tos_reg(reg_t r)
 \*(So
 .DE
 increment the element on top of the cache by either the constant \*(Sin\*(So
 or by a register. The latter is useful for pointer addition when referencing
 external memory.
 .KS
 .IP \(bu
 The routine
 .DS
 \*(Si
 int type_of_tos(void)
 \*(So
 .DE
 .KE
 returns the type of the element on top of the cache. This is a combination
 (binary OR) of \*(SiT_ext\*(So, \*(SiT_reg\*(So or \*(SiT_float\*(So,
 \*(SiT_reg2\*(So or \*(SiT_float2\*(So, and \*(SiT_cst\*(So,
 and tells the
 user which of the three fields are non-zero. When the register-fields
 represent \*(Si%g0\*(So, it is considered zero.
 .IP \(bu
 Miscellaneous routines:
 .DS
 \*(Si
 void init_cache(void)
 void cache_need(int)
 void change_reg(void)
 void flush_cache(void)
 \*(So
 .DE
 \*(Siinit_cache\*(So should be called before any
 other cache routines, to initialize some internal datastructures.
 \*(Sicache_need\*(So is used to tell the cache that a certain number
 of register are needed for the next operation. This way the cache can
 load them efficiently in one fell swoop. \*(Sichange_reg\*(So is to be
 called when the user changes a register of which the cache (possibly) has
 co-ownership. Because the contents of registers in the cache are
 not allowed to change the user should call \*(Sichange_reg\*(So to
 instruct the cache to copy the contents to some other register.
 \*(Siflush_cache\*(So writes the cache to the stack and invalidates
 the cache. It should be used before branches,
 before labels and on other places where the stack has to be valid (i.e. where
 every item on the EM-stack should be stored on the real stack, not in some
 virtual cache).
 .NH 3
 Implementing push-pop optimization in the EM_table
 .PP
 As indicated above, there is no regular way to represent the described
 optimization in the EM_table. The only possible escapes from the EM_table
 are function calls, but that is clearly not enough to implement a good
 push-pop optimizer. Therefore we will use a modified version of the EM_table
 format, where the description of, say, the \*(Silol\*(So instruction might look
 like this\(dg:
 .FS
 \(dg This is not the way the \*(Silol\*(So actually looks in the EM_table;
 it only shows how it \fImight\fR look using the forementioned push/pop
 primitives.
 .FE
 .DS
 \*(Si
 reg_t A, B;
 const_str_t n;
 alloc_reg(A);
 push_reg(LB);
 inc_tos($1);
 B = pop_reg_c13(n);
 "ld  [$B+$n], $A";
 push_reg(A);
 free_reg(B);
 \*(So
 .DE
 For more details about the exact implementation consult
 appendix B which contains some characteristic excerpts from the EM_table.
 .NH 2
 Stack management
 .PP
 When converting EM code to some executable code there is the problem of
 maintaining multiple stacks. The usual way to do this is described in
 .[ [
 Description of a Machine Architecture
 .]]
 and is shown in figure \*(SN1.
 .KE
 .PS
 copy "pics/EM_stack.orig"
 .PE
 .ce 1
 \fIFigure \*(SN1: usual stack management.
 .KE
 .sp
 .LP
 This means that the EM stack and the hardware stack (used
 for subroutine calls, etc.) are interleaved in memory. On the SPARC, however,
 this brings up a large problem: in the former model it is assumed that the
 resolution of the stack pointer is a word, but this is not the case on the
 SPARC processor. On the SPARC processor the stack-pointer as well as the
 frame-pointer have to be aligned on 8-byte boundaries, so one can not simply
 push a word on the stack and then lower the stack-pointer by 4 bytes!
 .NH 3
 Possible solutions
 .PP
 A simple idea might be to use a swiss-cheese stack; we could
 push a 4-byte word onto the stack and then lower the stack by 8.
 Unfortunately, this is not a very solid solution, because
 pointer-arithmetic involving pointers to objects on the stack would cause
 hard-to-predict anomalies.
 .PP
 Another try would be not to use the hardware stack at all. As long as we
 do not generate subroutine-calls everything will be all right. This
 approach, however, also has some disadvantages: first we would not be able
 to use any of the existing debuggers such as \fIadb\fR, because they all
 assume a regular stack format. Secondly, we would not be able to make use
 of the SPARC's register windows to keep local variables. Finally, doing all the
 administrative work necessary for subroutine calls ourselves instead of
 letting the hardware handle it for us,
 causes unnecessary procedure-call overhead.
 .PP
 Yet another alternative would be to emulate the EM-part of the stack,
 and to let the hardware handle the subroutine call. Since we will
 emulate our own stack, there are no alignment restrictions and because
 we will use the hardware procedure call we can still make use of
 the register windows.
 .NH 3
 Our implementation
 .PP
 To implement the hybrid stack we need two extra registers: one for the
 the EM stack pointer (the forementioned \*(Si%SP\*(So) and one for the
 EM local base pointer (\*(Si%LB\*(So). The most elegant solution would be to
 put both stacks in different segments, so they would not influence
 each other. Unfortunately
 .UX
 lacks the ability to add segments and
 since we will implement our backend under
 .UX,
 we will have to put
 both stacks in the same segment. Exactly how this can be done is shown
 in figure \*(SN2.
 .DS
 .PS
 copy "pics/mem_config"
 .PE
 .ce 1
 \fIFigure \*(SN2: our stack management.\fR
 .DE
 .sp
 During normal procedure execution, the SPARC stack pointer has to point to
 a memory location where the operating system can dump the active part of
 the register window. The rest of the
 register window will be dumped in the therefor pre-allocated (stack) space
 by following the frame
 pointer. When a signal occurs things get even more complicated and
 result in figure \*(SN3.
 .DS
 .PS
 copy "pics/signal_stack"
 .PE
 .ce 1
 \fIFigure \*(SN3: our signal stack.\fR
 .DE
 .PP
 The exact implementation of the stack is shown in figure \*(SN4.
 .KF
 .PS
 copy "pics/EM_stack.ours"
 .PE
 .ce 1
 \fIFigure \*(SN4: stack overview.\fR
 .KE
 .NH 2
 Miscellaneous
 .PP
 As mentioned in the previous chapter, the generated \fI.o\fR-files are
 not compatible with Sun's own object format. The primary reason for
 this is that Sun usually passes the first six parameters of a procedure call
 through registers. If we were to do that too, we would always have
 to fetch the top six words from the stack into registers, even when
 the procedure would not have any parameters at all. Apart from this,
 structure-passing is another exception in Sun's object format which
 makes is impossible to generate object-compatible code.\(dg
 .FS
 \(dg Exactly how Sun passes structures as parameters is described in
 Appendix D of the SPARC Architecture Manual (Software Considerations)
 .FE
 .bp
--- a/doc/sparc/5
+++ b/doc/sparc/5
@ -0,0 +1,153 @@
 .so init
 .nr H1 4
 .NH
 FUTURE WORK
 .NH 2
 A critique of EM
 .PP
 In general, EM fits its purpose quite well. Numerous compilers have been
 written using EM as their intermediate language and it has even become a
 commercial product. A great deal of its success is probably due to its
 simplicity. There are no extravagant instructions but it does have all the
 necessary functions to write a decent compiler.
 .PP
 There are, however, a few functions that come rather close to being
 extravagant. The \*(Silar\*(So function for example \(em used
 to fetch an element from an array \(em does not make it much easier
 to write a frontend, but does make it unnecessary hard to write an
 efficient backend. Other instructions for which it is difficult
 to generate efficient code for are those that permit
 dynamic operators, such as the \*(Silos\*(So. Dynamic operators, however, provide
 significant extra possibilities and can therefore not be disposed of.
 Note that even though the array operations \*(Silar\*(So and \*(Sisar\*(So
 provide dynamic operators, they do not add additional power, since
 they can easily be replaced with a sequence using the \*(Silos\*(So or
 \*(Sists\*(So instructions.
 .PP
 EM code to reference arrays generated by the C frontend can be translated
 very efficiently for almost any processor. However the same operation
 generated by the Modula-2 frontend (which uses the \*(Silar\*(So),
 is much less efficient, although the only difference is that the
 latter performs range checking whereas the former does not.\(dg
 .FS
 \(dg Actually this depends on whether or not explicit range checking in enabled.
 This clearly shows that the current code generators are not optimal and
 often depend on ad-hoc decisions.
 .FE
 Since range checking can also be expressed explicitly in
 EM (\*(Sirck\*(So) there is no need for any of the array operations
 (\*(Siaar\*(So, \*(Silar\*(So and \*(Sisar\*(So).
 .PP
 Besides efficiency of the array-operations themselves, there still is another
 major disadvantage of using these array-operations. In sharp contrast to
 all other EM instructions except the \*(Silos\*(So and the \*(Sists\*(So,
 they allow dynamic operators, so their effect on the stack-pointer can not
 always be
 determined at compile-time. This means that efficient caching of the
 top-of-stack in registers is almost impossible,
 so using these array-operations also effects the
 efficiency of the surrounding code. Now that processors are produced with
 more and more registers it could be very beneficiary to cache the
 top-of-stack, so that the memory/register reference ratio decreases
 to the benefit of the overall performance.
 .PP
 As a final critique, we would also like to discuss the semantics of some of
 the EM instructions. In
 .[ [
 Description of a Machine Architecture
 .]]
 it is said that
 all signed instructions such as the \*(Siadi\*(So, should cause an exception
 on overflow. The unsigned operations such as \*(Siadu\*(So, however,
 should act as modulo operations and therefor not perform overflow checking.
 Since it is very expensive to perform overflow checking in EM,
 we would suggest that the backend takes care of this. For languages which
 do not require overflow checking, a simple message could be generated to
 disable overflow checking in backends. This way all backends could be
 written to fully comply to the official EM definition without any reduction in
 efficiency.\(dd 
 .FS
 \(dd Currently many backends do not implement error checks because they
 are too expensive and almost never needed. Some frontends even have
 facilities build in to generate EM-code to force these checks. If this
 trend continues we will end up with a de-facto and a de-jure standard
 both developed by the same people but nonetheless incompatible.
 .FE
 When such messages will be added we would like to suggest
 that they can enforce overflow checks on unsigned, as well as signed arithmetic.
 .PP
 As a conclusion we would like to suggest removal of the array operations from
 EM, or at least discontinuation of there usage in frontends.
 .NH 2
 \*(OQWanted: Procedure call information\*(CQ
 .PP
 The advantage of an intermediate language such as EM is that the backend
 no longer has to know about any 'quirks' of the 'input'-language. The major
 disadvantage, however, is that the backend no longer knows about any 'quirks'
 of the 'input'-language... If the SPARC backend ever has to compete
 with Sun's own C-compiler for example, removal of the array-operations
 will not be enough. The amount of information that is lost during
 the translation to EM is too large to ever generate truly efficient SPARC code.
 .PP
 To write such an efficient backend one needs to know, for example, whether,
 when and what type of parameter is being computed, so the result can be stored
 in the proper place and scratch registers can be reused.
 (On the SPARC processor, for example, it is very beneficiary
 to pass the first six parameters of a procedure call through
 registers instead of using the stack.)
 One way to express such things in EM is to insert extra messages in
 the EM-code. The C statement \*(Sia = f(4, a + b);\*(So for example,
 could be translated to the following EM-code:
 .DS
 .TS
 ;
 l1f6 lf6 l.
 lol	-4	! a
 lol	-8	! b
 mes	x, 2	! next instruction will compute 2nd parameter
 adi	4
 mes	x, 1	! next instruction will compute 1st parameter
 loc	4
 cal	_f	! call function f
 lfr	4
 stl	-4	! store result in a
 .TE
 .DE
 For a code expander it is important that the \*(Simes\*(So pseudo
 instructions appear \fIbefore\fR
 the EM instruction that computes the parameter, because that way the final
 computation (the \*(Siadi\*(So and \*(Siloc\*(So in the previous example)
 can be translated to machine code that performs the required computation
 and also puts the result in the required place. If it is found to be
 too difficult for the frontend to insert these \*(Simes\*(So instructions
 at the right place the peep-hole optimizer might swap the \*(Simes\*(So and
 the instruction that computes the parameter.
 .PP
 For some architectures, it is also
 possible to generate more efficient code for a procedure when it is a
 so-called leaf-procedure: a procedure that doesn't call other procedures.
 On the SPARC, for example, it is not necessary to rotate the register
 window for a call to a leaf procedure and it is also possible to use
 the global registers for register variables in leaf procedures.
 It will be a little harder to insert useful messages about leaf procedures,
 because just as with register messages, they are only useful to the
 backend when they appear immediately
 after or before the \*(Sipro\*(So pseudo instruction. The frontend,
 however, only knows whether a certain procedure is a leaf-procedure or not
 when it has already generated the entire procedure in EM. Just as with the
 \*(Sipro ? / end n\*(So-dilemma the peep-hole optimizer
 .[ [
 Using Peephole Optimization
 .]]
 might be able to lend a hand
 and help us out by delaying EM-code generation until it has reached the
 end of the procedure.
 .PP
 As with most optimizations, the main problem is that they have to be
 implemented with the \*(Simes\*(So pseudo instruction.
 Because the \*(Simes\*(So instruction can have many different meanings
 depending on its argument, 
 it is important that all optimizers recognize and respect them. Addition
 of even a single message will require careful inspection of, and maybe even
 incorporate small changes to each of the optimizers.
 .bp
--- a/doc/sparc/A
+++ b/doc/sparc/A
@ -0,0 +1,184 @@
 .so init
 .SH
 A. MEASUREMENTS
 .SH
 A.1. \*(OQThe bottom line\*(CQ
 .PP
 Although examples often are most illustrative, the cruel world out there is
 usually more interested in everyday performance figures. To satisfy those
 people too, we will present a series of measurements on our code expander
 taken from (close to) real life situations. These include measurements
 of compile and run times of different programs,
 compiled with different compilers.
 .SH
 A.2. Compile time measurements
 .PP
 Figure A.2.1 shows compile-time measurements for typical C code:
 the dhrystone benchmark\(dg
 .[ [
 dhrystone
 .]].
 .FS
 \(dg To be certain that we only tested the compiler and not the quality of
 the code in the library, we have added our own version of
 \fIstrcmp\fR and \fIstrcpy\fR and have not used the ones present in the
 library.
 .FE
 The numbers represent the duration of each separate pass of the compiler.
 The numbers at the end of each bar represent the total duration of the
 compilation process. As with all measurements in this chapter, the
 quoted time or duration is the sum of user and system time in seconds.
 .PS
 copy "pics/compile_bars"
 .PE
 .DS
 .IP cem: 6
 C to EM frontend
 .IP opt:
 EM peep-hole optimizer
 .IP be:
 EM to assembler backend
 .IP cpp:
 Sun's C preprocessor
 .IP ccom:
 Sun's C compiler
 .IP iropt:
 Sun's optimizer
 .IP cg:
 Sun's code generator
 .IP as:
 Sun's assembler
 .IP ld:
 Sun's linker
 .ce 1
 \fIFigure A.2.1: compile-time measurements.\fR
 .DE
 .sp
 .PP
 A close examination of the first two bars in fig A.2.1 shows that the maximum
 achievable compile-time
 gain compared to \fIcc\fR is about 50% for medium-sized
 programs.\(dd
 .FS
 \(dd (cpp+ccom+as+ld)/(cem+as+ld) = 1.53
 .FE
 For small programs the gain will be less, due to the almost constant
 start-up time of each pass in the compilation process. Only a
 built-in assembler may increase this number up to
 180% in the ideal case that the optimizer, backend and assembler
 would run in zero time. Speed-ups of 5 to 10 times as mentioned in
 .[ [
 fast portable compilers
 .]]
 are therefore not possible on the Sun-4 family. This is also due to
 Sun's implementation of saving and restoring register windows. With
 the current implementation in which only a single window is saved
 or restored on a register-window overflow, it is very time consuming
 when programs have highly dynamic stack use
 due to procedure calls (as is often the case with compilers).
 .PP
 Although we are currently a little slower than \fIcc\fR, it is hard to
 blame this on our backend. Optimizing the backend so that it would run
 twice as fast would only reduce the total compilation process by
 a mere 14%.
 .PP
 Finally it is nice to see that our push/pop-optimization,
 initially designed to generate faster code, has also increased the
 compilation speed. (see also figures A.4.1 and A.4.2.)
 .SH
 A.3. Run time performance
 .PP
 Figure A.3.1 shows the run-time performance of different compilers.
 All results are normalized, where the best available compiler (Sun's
 compiler with full optimization) is represented by 1.0 on our scale.
 .PS
 copy "pics/run-time_bars"
 .PE
 .ce 1
 \fIFigure A.3.1: run time performance.\fR
 .sp 1
 .PP
 The fact that our compiler behaves rather poorly compared to Sun's
 compiler is due to the fact that the dhrystone benchmark uses
 relatively many subroutine calls; all of which have to be 'emulated'
 by our backend.
 .SH
 A.4. Overall performance
 .LP
 In the next two figures we will show the combined run and compile time
 performance of 'our' compiler (the ACK C frontend and our backend)
 compared to Sun's C compiler. Figure A.4.1 shows the results from
 measurements on the dhrystone benchmark.
 .G1
 frame invis left solid bot solid
 label left "run time" "(in \(*msec/dhrystone)"
 label bot "compile time (in sec)"
 coord x 0,21 y 0,610
 ticks left out from 0 to 600 by 200
 ticks bot out from 0 to 20 by 5
 "\(bu" at 3.5, 1000000/1700
 "ack w/o opt" ljust at 3.5 + 1, 1000000/1700
 "\(bu" at 2.8, 1000000/8770
 "ack with opt" below at 2.8 + 0.1, 1000000/8770
 "\(bu" at 16.0, 1000000/10434
 "ack -O4" above at 16.0, 1000000/10434
 "\(bu" at 2.3, 1000000/7270
 "\fIcc\fR" above at 2.3, 1000000/7270
 "\(bu" at 9.0, 1000000/12500
 "\fIcc -O4\fR" above at 9.0, 1000000/12500
 "\(bu" at 5.9, 1000000/15250
 "\fIcc -O\fR" below at 5.9, 1000000/15250
 .G2
 .ce 1
 \fIFigure A.4.1: overall performance on dhrystones.
 .sp 1
 .LP
 Fortunately for us, dhrystones are not all there is. The following
 figure shows the same measurements as the previous one, except
 this time we took a benchmark that uses no subroutines: an implementation
 of Eratosthenes' sieve:
 .G1
 frame invis left solid bot solid
 label left "run time" "for one run" "(in sec)" left .6
 label bot "compile time (in sec)"
 coord x 0,11 y 0,21
 ticks bot out from 0 to 10 by 5
 ticks left out from 0 to 20 by 5
 "\(bu" at 2.5, 17.28
 "ack w/o opt" above at 2.5, 17.28
 "\(bu" at 1.6, 2.93
 "ack with opt" above at 1.6, 2.93
 "\(bu" at 9.4, 2.26
 "ack -O4" above at 9.4, 2.26
 "\(bu" at 1.5, 7.43
 "\fIcc\fR" above at 1.5, 7.43
 "\(bu" at 2.7, 2.02
 "\fIcc -O4\fR" ljust at 1.9, 1.2
 "\(bu" at 2.6, 2.10
 "\fIcc -O\fR" ljust at 3.1,2.5
 .G2
 .ce 1
 \fIFigure A.4.2: overall performance on Eratosthenes' sieve.
 .sp 1
 .PP
 Although the above figures speak for themselves, a small comment
 may be in place. At first it is clear that our compiler is neither
 faster than \fIcc\fR, nor produces faster code than \fIcc -O4\fR. It should
 also be noted however, that we do produce better code than \fIcc\fR
 at only a very small additional cost.
 It is also worth noticing that push-pop optimization
 increases run-time speed as well as compile speed.
 The first seems rather obvious,
 since optimized code is
 faster code, but the increase in compile speed may come as a surprise.
 The main reason is that the \fIas\fR+\fIld\fR time depends largely on the
 amount of generated code, which in general
 depends on the efficiency of the code.
 Push-pop optimization removes a lot of useless instructions which
 would otherwise
 have found their way through to the assembler and the loader.
 Useless instructions inserted in an early stage in the compilation
 process will slow down every following stage, so elimination of useless
 instructions in an early stage, even when it requires a little computational
 overhead, can often be beneficial to the overall compilation speed.
 .bp
--- a/doc/sparc/B
+++ b/doc/sparc/B
@ -0,0 +1,128 @@
 .so init
 .SH
 B. IMPLEMENTATION
 .SH
 B.1. Excerpts from the non-optimized EM_table
 .PP
 Even though the non-optimized version of the EM_table is relatively
 straight-forward, examples have never hurt anybody.
 One of the simplest instructions is the \*(Siloc\*(So, which appears in
 our EM_table as follows:
 .DS
 \f6
 .TA 8 16 24 32 40 48 56 64
 C_loc	==>	"set	$1, T1";
 		"dec	4, SP";
 		"st	T1, [SP]".
 \f1
 .DE
 Just as \*(SiSP\*(So is an alias for \*(Si%l0\*(So, \*(SiT1\*(So is
 an alias for \*(Si%g1\*(So.
 A little more complex is the \*(Siadi\*(So which performs integer
 addition.
 .DS
 \f6
 C_adi	==>	"ld	[SP], T1";
 		"ld	[SP+4], T2";
 		"add	T1, T2, T3";
 		"st	T3, [SP+4];
 		"inc	4, SP".
 \f1
 .DE
 We could go on with even more complex instructions, but since that would
 not contribute to anything the reader is referred to the implementation
 for more details.
 .SH
 B.2. Excerpts from the optimized EM_table
 .PP
 The optimized EM_table uses the cache primitives mentioned in chapter 4.
 This means that the \*(Siloc\*(So this time appears as
 .DS
 \f6
 C_loc	==>	push_const($1).
 \f1
 .DE
 The \*(Silol\*(So can now be written as
 .DS
 \f6
 C_lol	==>	push_reg(LB);
 		inc_tos($1);
 		push_const(4);
 		C_los(4).
 \f1
 .DE
 Due to the law of conservation of misery somebody has to do the dirty work.
 In this case, it is the \*(Silos\*(So. To show just a small part of
 the implementation of the \*(Silos\*(So:
 .DS
 \f6
 C_los	$1 == 4	==>
 		if (type_of_tos() == T_cst) {
 			arith size;
 			const_str_t n;
 			size= pop_const();
 			if (size <= 4) {
 				reg_t a;
 				reg_t a;
 				char *LD;
 				switch (size) {
 				case 1:	LD = "ldub"; break;
 				case 2:	LD = "lduh"; break;
 				case 4:	LD = "ld"; break;
 				default:	arg_error("C_los", size);
 				}
 				a = pop_reg_c13(n);
 				b = alloc_reg();
 				"$LD	[$a+$n], $b";
 				push_reg(b);
 				free_reg(a);
 			} else ...
 \f1
 .DE
 For the full implementation, the reader is again referred to the actual
 implementation. Just to show how other instructions are affected
 by the optimization we will show that implementation of the \*(Sitge\*(So
 instruction:
 .DS
 \f6
 C_tge	==>	{
 			reg_t a;
 			reg_t b;
 			a = pop_reg();
 			b = alloc_reg();
 			"	tst	$a";
 			"	bge,a	1f";
 			"	mov	1, $b";		/* delay slot */
 			"	set	0, $b";
 			"1:";
 			free_reg(a);
 			push_reg(b);
 		}.
 \f1
 .DE
 .SH
 .bp
 CREDITS
 .PP
 In order of appearance:
 .TS
 center;
 r c l.
 Original idea	-	Dick Grune
 Design & implementation	-	Philip Homburg
 	-	Raymond Michiels
 Tutor	-	Dick Grune
 Assistant Tutor	-	Ceriel Jacobs
 Proofreading	-	Dick Grune
 	-	Hans van Eck
 .TE
 .SH
 REFERENCES
 .PP
 .[
 $LIST$
 .]
--- a/doc/sparc/Makefile
+++ b/doc/sparc/Makefile
@ -0,0 +1,10 @@
 # $Header$
 REFER=refer
 TBL=tbl
 TARGET=-Tlp
 PIC=pic
 GRAP=grap
 ../sparc.doc:	refs title intro 1 2 3 4 5 A B init
 		$(REFER) -sA+T '-l\", ' -p refs title intro 1 2 3 4 5 A B | $(GRAP) | $(PIC) | $(TBL) | soelim > $@
--- a/doc/sparc/init
+++ b/doc/sparc/init
@ -0,0 +1,18 @@
 .nr PS 12
 .nr VS 14
 .\" .fp 6 AM
 .fp 6 CW
 .ds Si \f6\s-1
 .ds So \f1\s+1
 .ds OQ `\h'-1p'`
 .ds CQ '\h'-1p''
 .de UX
 .ie \\n(UX \s-1UNIX\s0\\$1
 .el \{\
 \s-1UNIX\s0\\$1\(dg
 .FS
 \(dg \s-1UNIX\s0 is a registered bell of AT&T Trademark Laboratories.
 .FE
 .nr UX 1
 .\}
 ..
--- a/doc/sparc/intro
+++ b/doc/sparc/intro
@ -0,0 +1,23 @@
 .so init
 .hw de-vised
 .TL
 A fast backend for SPARC processors
 .AU
 Philip Homburg
 Raymond Michiels
 .AI
 Dept. of Mathematics and Computer Science
 Vrije Universiteit
 Amsterdam, The Netherlands
 .AB
 The language EM is an intermediate language for use in compiler
 construction.
 In this paper we describe the construction of a so-called fast backend
 which translates EM code to assembler for SPARC processors.
 .br
 Our construction deviates strongly from the usual procedure. We have
 devised and implemented a virtual stack with which it is possible to
 generate very acceptable code without much loss in compile time.
 .AE
 .PP
 .bp
--- a/doc/sparc/note_on_reg_wins
+++ b/doc/sparc/note_on_reg_wins
@ -0,0 +1,58 @@
 When developing a fast compiler for the Sun-4 series we have encountered
 rather strange behavior of the Sun kernel.
 The problem is that when you have lots of nested procedure calls, (as
 is often the case in compilers and parsers) the registers fill up which
 causes a kernel trap. The kernel will then write out some of the registers
 to memory to make room for another window. When you return from the nested
 procedure call, just the reverse happens: yet another kernel trap so the
 kernel can load the register from memory.
 Unfortunately the kernel only saves or loads a single window (= 16 register)
 on each trap. This means that when you call a procedure recursively it causes
 a kernel trap on almost every invocation (except for the first few).
 To illustrate this consider the following little program:
 --------------- little program -------------
 f(i)	/* calls itself i times */
 int i;
 {
  if (i)
 	f(i-1);
 }
 main(argc, argv)
 int argc;
 char *argv[];
 {
  i = atoi(argv[1]);	/* # loops */
  j = atoi(argv[2]);	/* depth */
  while (i--)
 	f(j);
 }
 ------------ end of little program -----------
 The performance decreases abruptly when the depth (j) becomes larger
 than 5. On a SPARC station we got the following results:
 	depth	run time (in seconds)
 	1	 0.5
 	2	 0.8
 	3	 1.0
 	4	 1.4	<- from here on it's +6 seconds for each
 	5	 7.6		step deeper.
 	6	13.9
 	7	19.9
 	8	26.3
 	9	32.9
 Things would be a lot better when instead of just 1, the kernel would
 save or restore 4 windows (= 64 registers = 50% on our SPARC stations).
 	-Raymond.
--- a/doc/sparc/pics/.distr
+++ b/doc/sparc/pics/.distr
@ -0,0 +1,12 @@
 EM_stack.orig
 EM_stack.ours
 compile_bars
 mem_config
 perf
 perf.comp
 perf.d
 perf.dhry
 reg_layout
 run-time_bars
 run-time_bars.bup
 signal_stack
--- a/doc/sparc/pics/EM_stack.orig
+++ b/doc/sparc/pics/EM_stack.orig
@ -0,0 +1,34 @@
 .PS
 .ps -2
 .vs -2
 boxwid = 1.5;
 boxht = 0.24
 down;
 box "actual parameter n-1";
 box "." "." "." ht 0.6;
 box "actual parameter 0";
 move 0.3
 box "return status block";
 {arrow <- right with .w at last box.e; \
 box invis wid 0.3 "LB" }
 down
 move to 2nd last box.s
 move 0.1
 box "local variables"
 box "compiler temporaries"
 move 0.1
 box "register save block"
 move 0.1
 box "dynamic local generators"
 move 0.1
 box "operand"
 box "operand"
 move 0.1
 box "parameter m-1"
 box "." "." "." ht 0.6;
 box "parameter 0" with .n at last box .s
 { arrow <- right with .w at last box.e; \
 box invis wid 0.3 "SP" }
 .ps +2
 .vs +2
 .PE
--- a/doc/sparc/pics/EM_stack.ours
+++ b/doc/sparc/pics/EM_stack.ours
@ -0,0 +1,106 @@
 .ps 10
 .vs 12
 .PS
 boxwid = 1.3
 boxht = 0.25
 down;
 box "floating point" "register dump area" ht 0.6
 box "tmp float store"
 box "register dump area" ht 0.6
 { arrow <- right with .w at 3/4 <last box.e, last box.se>; \
 box invis wid 0.3 "%fp" }
 move .1
 box dotted "gap"
 { arrow <- right with .w at last box.e; \
 box invis wid 0.3 "%LB" }
 move .1
 box "locals"
 box "actual parameter n-1";
 box "." "." "." ht 0.6;
 box "actual parameter 0";
 { arrow <- right with .w at last box.e; \
 box invis wid 0.3 "%SP" }
 move 0.1
 box "large gap" "(>64kb)" ht 1.0
 box "register dump area" ht 0.6
 { arrow <- right with .w at 3/4 <last box.e, last box.se>; \
 box invis wid 0.3 "%sp" }
 move 0.2
 box invis "\\s+2just before call\\s0"
 move 1
 box dotted "gap"
 box invis "0 or 4 bytes" "for stack alignment" with .w at last box.e
 box invis height .7 "when gap is 0 bytes," "%fp == %LB" with .n at 2nd last box.s
 .PF
 .PS
 down;
 move to 2.4,0
 box "floating point" "register dump area" ht 0.6
 box "tmp float store"
 box "register dump area" ht 0.6
 { arrow <- right with .w at 3/4 <last box.e, last box.se>; \
 box invis wid 0.3 "%fp" }
 move .1
 box dotted "gap"
 { arrow <- right with .w at last box.e; \
 box invis wid 0.3 "%LB" }
 move .1
 box "locals"
 box "actual parameter n-1";
 box "." "." "." ht 0.6;
 box "actual parameter 0";
 { arrow <- right with .w at last box.e; \
 box invis wid 0.3 "%SP" }
 move .1
 box dotted "gap"
 move .4
 box "floating point" "register dump area" ht 0.6
 box "tmp float store"
 box "register dump area" ht 0.6
 { arrow <- right with .w at 3/4 <last box.e, last box.se>; \
 box invis wid 0.3 "%sp" }
 move 0.2
 box invis "\\s+2'during' call\\s0"
 .PF
 .PS
 down;
 move to 4.8,0
 box "floating point" "register dump area" ht 0.6
 box "tmp float store"
 box "register dump area" ht 0.6
 move .1
 box dotted "gap"
 move .1
 box "locals"
 box "actual parameter n-1";
 box "." "." "." ht 0.6;
 box "actual parameter 0";
 move .1
 box dotted "gap"
 move .4
 box "floating point" "register dump area" ht 0.6
 box "tmp float store"
 box "register dump area" ht 0.6
 { arrow <- right with .w at 3/4 <last box.e, last box.se>; \
 box invis wid 0.3 "%fp" }
 move .1
 box dotted "gap"
 { arrow <- right with .w at last box.e; \
 box invis wid 0.3 "%LB" }
 move .1
 box "locals"
 box "actual parameter n-1";
 box "." "." "." ht 0.6;
 box "actual parameter 0";
 { arrow <- right with .w at last box.e; \
 box invis wid 0.3 "%SP" }
 move 0.1
 box "large gap" "(>64kb)" ht 1.0
 box "register dump area" ht 0.6
 { arrow <- right with .w at 3/4 <last box.e, last box.se>; \
 box invis wid 0.3 "%sp" }
 move 0.2
 box invis "\\s+2after call\\s0"
 .PF
 .ps 12
 .vs 14
--- a/doc/sparc/pics/compile_bars
+++ b/doc/sparc/pics/compile_bars
@ -0,0 +1,49 @@
 .PS
 boxht = 0.5
 boxwid = 1
 moveht = 0.65
 down;
 {
 right;
 box invis "ACK" "w/o" "opt"
 box "cem" "0.7" wid 0.7
 box "opt" "0.4" wid 0.4
 box "be" "1.1" wid 1.1
 box "as" "1.4" wid 1.4
 box "ld" "0.4" wid 0.4
 box invis "4.0" wid 0.5
 }
 move
 {
 right;
 box invis "ACK" "with" "opt"
 box "cem" "0.7" wid 0.7
 box "opt" "0.4" wid 0.4
 box "be" "0.6" wid 0.6
 box "as" "0.7" wid 0.7
 box "ld" "0.4" wid 0.4
 box invis "2.8" wid 0.5
 }
 move
 {
 right;
 box invis "\fIcc\fR"
 box "cpp" "0.2" wid 0.2
 box "ccom" "1.0" wid 1.0
 box "as" "0.7" wid 0.7
 box "ld" "0.4" wid 0.4
 box invis "2.3" wid 0.5
 }
 move
 {
 right;
 box invis "\fIcc -O4\fR"
 box "cpp" "0.2" wid 0.2
 box "ccom" "1.0" wid 1.0
 box "iropt" "5.0 (not to scale!)" wid 1.5
 box "cg" "0.7" wid 0.7
 box "as" "1.7" wid 1.7
 box "ld" "0.4" wid 0.4
 box invis "9.0" wid 0.5
 }
 .PE
--- a/doc/sparc/pics/mem_config
+++ b/doc/sparc/pics/mem_config
@ -0,0 +1,34 @@
 .PS
 boxwid = 1.3
 down
 [
 right
 [
 down;
 box "stack" ht .6
 box "free" ht 1
 box "heap" ht .3
 box "text" ht .5
 ]
 move 1
 [
 down;
 box "\s-4SPARC stack\s+4" ht .2
 box "\s-4EM stack\s+4" ht .1
 box "\s-4SPARC stack\s+4" ht .1
 box "\s-4EM stack\s+4" ht .1
 box "\s-4free\s+4" ht .2
 box "\s-4SPARC stack\s+4" ht .1
 box "free" ht .8
 box "heap" ht .3
 box "text" ht .5
 ]
 ]
 move .3
 [
 right
 box invis "regular \(UX memory layout"
 move 1
 box invis "memory layout for EM"
 ]
 .PF
--- a/doc/sparc/pics/perf
+++ b/doc/sparc/pics/perf
@ -0,0 +1,12 @@
 .G1
 frame invis left solid bot solid
 label left "run time" "(log scale)" left .5
 label bot "compile time (log scale)"
 coord x 0.1,10 log x y 1000,20000 log y
 ticks left out at 2000,5000,10000,20000
 ticks bot out at 0.1 0.3 1.0 3.0 10
 copy "perf.d" thru X
  "\(bu" at $1, $2
  "$3" rjust at $1, $2
 X
 .G2
--- a/doc/sparc/pics/perf.comp
+++ b/doc/sparc/pics/perf.comp
@ -0,0 +1,7 @@
 in-line in ../A
 2.5 17.28 ack w/o opt
 1.6 2.93 ack with opt
 9.4 2.26 ack -O4
 1.5 7.43 \fIcc\fR
 2.7 2.02 \fIcc -O4\fR
--- a/doc/sparc/pics/perf.d
+++ b/doc/sparc/pics/perf.d
@ -0,0 +1,4 @@
 1.0 1700 ack w/o opt
 1.9 8000 ack with opt
 1.6 8000 \fIcc\fR
 7 18000 \fIcc -O4\fR
--- a/doc/sparc/pics/perf.dhry
+++ b/doc/sparc/pics/perf.dhry
@ -0,0 +1,7 @@
 in-line in ../A
 3.5 1700 ack w/o opt
 2.8 8770 ack with opt
 16.0 10434 ack -O4
 2.3 7270 \fIcc\fR
 9.0 12500 \fIcc -O4\fR
--- a/doc/sparc/pics/reg_layout
+++ b/doc/sparc/pics/reg_layout
@ -0,0 +1,24 @@
 .nr PS 12
 .nr VS 14
 .PP
 .TS
 allbox;
 l l l l
 l2f6 l l2f6 l.
 g0	0	l0	EM_SP
 g1	temporary 1	l1	EM_LB
 g2	temporary 2	l2
 g3	temporary 3	l3	reserved
 g4	64k..1M	l4	reserved
 g5	temporary 4	l5	reserved
 g6	line number	l6	reserved
 g7	file name	l7	reserved
 o0	param 1	i0
 o1	param 2	i1
 o2	param 3	i2
 o3	param 4	i3
 o4	RETL_LD	i4	RETL_ST
 o5	RETH_LD	i5	RETH_ST
 sp	stack pointer	fp	frame pointer
 o7	xxx 	i7	return address
 .TE
--- a/doc/sparc/pics/run-time_bars
+++ b/doc/sparc/pics/run-time_bars
@ -0,0 +1,101 @@
 .PS
 boxht = 0.5
 boxwid = 1
 moveht = 1
 down;
 {
 right;
 box invis "ACK" "w/o" "opt."
 move
 [
 down;
 boxht = 0.25
 box wid 4.5
 "Sieve" ljust at last box.w + 0.1,-0.02
 "10(!)" ljust at last box.e + 0.1,-0.02
 box wid 4.5 with .nw at last box.sw
 "Dhrystones" ljust at last box.w + 0.1,-0.02
 "10(!)" ljust at last box.e + 0.1,-0.02
 ] with .w at last box.e
 }
 move
 {
 right;
 box invis "ACK" "with" "our" "opt."
 move
 [
 down;
 boxht = 0.25
 box wid 1.4
 "Sieve" ljust at last box.w + 0.1,-0.02
 "1.4" ljust at last box.e + 0.1,-0.02
 box wid 1.9 with .nw at last box.sw
 "Dhrystones" ljust at last box.w + 0.1,-0.02
 "1.9" ljust at last box.e + 0.1,-0.02
 ] with .w at last box.e
 }
 move
 {
 right;
 box invis "ACK" "-O4"
 move
 [
 down;
 boxht = 0.25
 box wid 1.1
 "Sieve" ljust at last box.w + 0.1,-0.02
 "1.1" ljust at last box.e + 0.1,-0.02
 box wid 1.6 with .nw at last box.sw
 "Dhrystones" ljust at last box.w + 0.1,-0.02
 "1.6" ljust at last box.e + 0.1,-0.02
 ] with .w at last box.e
 }
 move
 {
 right;
 box invis "Sun's" "compiler" "w/o opt."
 move
 [
 down;
 boxht = 0.25
 box wid 3.7
 "Sieve" ljust at last box.w + 0.1,-0.02
 "3.7" ljust at last box.e + 0.1,-0.02
 box wid 2.2 with .nw at last box.sw
 "Dhrystones" ljust at last box.w + 0.1,-0.02
 "2.2" ljust at last box.e + 0.1,-0.02
 ] with .w at last box.e
 }
 move
 {
 right;
 box invis "Sun's" "compiler" "-O"
 move
 [
 down;
 boxht = 0.25
 box wid 1.1
 "Sieve" ljust at last box.w + 0.1,-0.02
 "1.1" ljust at last box.e + 0.1,-0.02
 box wid 0.8 with .nw at last box.sw
 "Dhryst." ljust at last box.w + 0.1,-0.02
 "0.8!" ljust at last box.e + 0.1,-0.02
 ] with .w at last box.e
 }
 move
 {
 right;
 box invis "Sun's" "compiler" "-O4"
 move
 [
 down;
 boxht = 0.25
 box wid 1.0
 "Sieve" ljust at last box.w + 0.1,-0.02
 "1.0" ljust at last box.e + 0.1,-0.02
 box wid 1.0 with .nw at last box.sw
 "Dhrystones" ljust at last box.w + 0.1,-0.02
 "1.0" ljust at last box.e + 0.1,-0.02
 ] with .w at last box.e
 }
 .PE
--- a/doc/sparc/pics/run-time_bars.bup
+++ b/doc/sparc/pics/run-time_bars.bup
@ -0,0 +1,100 @@
 .PS
 boxht = 0.5
 boxwid = 1
 moveht = 1
 down;
 {
 right;
 box invis "ACK" "w/o" "opt"
 move
 [
 down;
 boxht = 0.25
 box wid 4.5
 "C (arithmetic)" ljust at last box.w + 0.1,-0.02
 "10(!)" ljust at last box.e + 0.1,-0.02
 box wid 4.5 with .nw at last box.sw
 "C (dhrystones)" ljust at last box.w + 0.1,-0.02
 "10(!)" ljust at last box.e + 0.1,-0.02
 box wid 4.5 with .nw at last box.sw
 "Modula-2" ljust at last box.w + 0.1,-0.02
 "8(!)" ljust at last box.e + 0.1,-0.02
 ] with .w at last box.e
 }
 move
 {
 right;
 box invis "ACK" "with" "peep-hole" "opt"
 move
 [
 down;
 boxht = 0.25
 box wid 1.4
 "C (arithmetic)" ljust at last box.w + 0.1,-0.02
 "1.4" ljust at last box.e + 0.1,-0.02
 box wid 1.9 with .nw at last box.sw
 "C (dhrystones)" ljust at last box.w + 0.1,-0.02
 "1.9" ljust at last box.e + 0.1,-0.02
 box wid 2.5 with .nw at last box.sw
 "Modula-2" ljust at last box.w + 0.1,-0.02
 "2.5" ljust at last box.e + 0.1,-0.02
 ] with .w at last box.e
 }
 move
 {
 right;
 box invis "ACK" "-O4"
 move
 [
 down;
 boxht = 0.25
 box wid 1.1
 "C (arithmetic)" ljust at last box.w + 0.1,-0.02
 "1.1" ljust at last box.e + 0.1,-0.02
 box wid 1.6 with .nw at last box.sw
 "C (dhrystones)" ljust at last box.w + 0.1,-0.02
 "1.6" ljust at last box.e + 0.1,-0.02
 box wid 2.5 with .nw at last box.sw
 "Modula-2" ljust at last box.w + 0.1,-0.02
 "2.5" ljust at last box.e + 0.1,-0.02
 ] with .w at last box.e
 }
 move
 {
 right;
 box invis "Sun's" "compiler" "w/o opt."
 move
 [
 down;
 boxht = 0.25
 box wid 3.7
 "C (arithmetic)" ljust at last box.w + 0.1,-0.02
 "3.7" ljust at last box.e + 0.1,-0.02
 box wid 2.2 with .nw at last box.sw
 "C (dhrystones)" ljust at last box.w + 0.1,-0.02
 "2.2" ljust at last box.e + 0.1,-0.02
 box wid 1.8 with .nw at last box.sw
 "Modula-2" ljust at last box.w + 0.1,-0.02
 "1.8" ljust at last box.e + 0.1,-0.02
 ] with .w at last box.e
 }
 move
 {
 right;
 box invis "Sun's" "compiler" "-O4"
 move
 [
 down;
 boxht = 0.25
 box wid 1.0
 "C (arith.)" ljust at last box.w + 0.1,-0.02
 "1.0" ljust at last box.e + 0.1,-0.02
 box wid 1.0 with .nw at last box.sw
 "C (dhryst.)" ljust at last box.w + 0.1,-0.02
 "1.0" ljust at last box.e + 0.1,-0.02
 box wid 1.0 with .nw at last box.sw
 "Modula-2" ljust at last box.w + 0.1,-0.02
 "1.0" ljust at last box.e + 0.1,-0.02
 ] with .w at last box.e
 }
 .PE
--- a/doc/sparc/pics/signal_stack
+++ b/doc/sparc/pics/signal_stack
@ -0,0 +1,42 @@
 .PS
 boxwid = 1.3
 down
 [
 right
 [
 down;
 box "\s-4SPARC stack\s+4" ht .2
 box "\s-4EM stack\s+4" ht .1
 box "\s-4SPARC stack\s+4" ht .1
 box "\s-4EM stack\s+4" ht .1
 box "\s-4free\s+4" ht .2
 box "\s-4SPARC stack\s+4" ht .1
 box "free" ht .8
 box "heap" ht .3
 box "text" ht .5
 ]
 move 1
 [
 down;
 box "\s-4SPARC stack\s+4" ht .2
 box "\s-4EM stack\s+4" ht .1
 box "\s-4SPARC stack\s+4" ht .1
 box "\s-4EM stack\s+4" ht .1
 box "\s-4free\s+4" ht .2
 box "\s-4SPARC stack\s+4" ht .1
 box "\s-4EM stack\s+4" ht .1
 box "\s-4free\s+4" ht .2
 box "\s-4SPARC stack\s+4" ht .1
 box "free" ht .4
 box "heap" ht .3
 box "text" ht .5
 ]
 ]
 move .3
 [
 right
 box invis "before signal"
 move 1
 box invis "during (1st) signal"
 ]
 .PF
--- a/doc/sparc/printP4P
+++ b/doc/sparc/printP4P
@ -0,0 +1,31 @@
 echo $0
 case $1 in
 1 )
 	CMD="cat"
 ;;
 2 )
 	CMD="cat"
 ;;
 3 )
 	CMD="cat"
 ;;
 4 )
 	CMD="pic | tbl"
 ;;
 5 )
 	CMD="tbl"
 ;;
 A )
 	CMD="grap | pic"
 ;;
 B )
 	CMD="tbl"
 ;;
 esac
 echo $0
 if [ $0 = printP4P ]
 then
  refer -sA+T '-l\", ' -p refs $1 | eval $CMD | troff -ms -Tp4p | dip -Tp4p -Pp4p
 else
  xtroff -full -geom 665x883+566+0 -command "refer -sA+T '-l\", ' -p refs $1 | $CMD | troff -ms -Tp4p"
 fi
--- a/doc/sparc/refs
+++ b/doc/sparc/refs
@ -0,0 +1,185 @@
 %T The design of very fast portable compilers
 %A A.S. Tanenbaum
 %A M.F. Kaashoek
 %A K.G. Langendoen
 %A C.J.H. Jacobs
 %J SIGPLAN Notices
 %V 24
 %N 11
 %P 125-131
 %D November 1989
 %T A Programmer-friendly LL(1) Parser Generator
 %A D. Grune
 %A C.J.H. Jacobs
 %J Software \- Practice and Experience
 %V 18
 %N 1
 %P 29-38
 %D January 1988
 %T The Code Expander Generator
 %A Frans Kaashoek
 %A Koen Langendoen
 %R IM-9
 %I Vrije Universiteit, Amsterdam
 %D November 1987
 %T The ACK Pascal Compiler
 %A Aad Geudeke
 %A Frans Hofmeester
 %R IM-8
 %I Vrije Universiteit, Amsterdam
 %D November 1987
 %T The EM-interpreter
 %A Eddo de Groot
 %A Leo van den Berge
 %R IM-7
 %I Vrije Universiteit, Amsterdam
 %D June 1987
 %T A set of multi\-process primitives for stack based machines
 %A K. Bot
 %A E. Scheffer
 %R IR-122
 %I Vrije Universiteit, Amsterdam
 %D December 1986
 %T An Occam Compiler
 %A K. Bot
 %A E. Scheffer
 %R IM-6
 %I Vrije Universiteit, Amsterdam
 %D December 1986
 %T Language- and Machine-independent Global Optimization on Intermediate Code
 %A H.E. Bal
 %A A.S. Tanenbaum
 %J Computer Languages
 %V 11
 %N 2
 %P 105-121
 %D April 1986
 %T The ACK Target Optimizer
 %A H.E. Bal
 %R IR-107
 %D 1985
 %I Vrije Universiteit, Amsterdam
 %T Some Topics in Parser Generation
 %A C.J.H. Jacobs
 %R IR-105
 %D October 1985
 %I Vrije Universiteit, Amsterdam
 %T The CEM compiler
 %A E.H. Baalbergen
 %A D. Grune
 %A M. Waage
 %R IM-4
 %I Vrije Universiteit, Amsterdam
 %D 1985
 %T The Design and Implementation of the EM Global Optimizer
 %A H.E. Bal
 %I Vrije Universiteit, Amsterdam
 %R IR-99
 %D March 1985
 %T Does anybody out there want to write HALF of a compiler?
 %A A.S. Tanenbaum
 %A E.G. Keizer
 %A H. van Staveren
 %J Sigplan Notices
 %V 19
 %N 8
 %P 106-108
 %D August 1984
 %T Amsterdam Compiler Kit documentation
 %A A.S. Tanenbaum et. al.
 %I Vrije Universiteit, Amsterdam
 %R IR-90
 %D June 1984
 %T A Practical Toolkit for Making Portable Compilers
 %A A. S. Tanenbaum
 %A H. van Staveren
 %A E. G. Keizer
 %A J. W. Stevenson
 %J Communications of the ACM
 %V 26
 %N 9
 %P 654-660
 %D September 1983
 %T Description of a Machine Architecture for use with Block Structured
 Languages
 %A A. S. Tanenbaum
 %A H. van Staveren
 %A E. G. Keizer
 %A J. W. Stevenson
 %R IR-81
 %D August 1983
 %I Vrije Universiteit, Amsterdam
 %T A Unix Toolkit for Making Portable Compilers
 %A A.S. Tanenbaum
 %A H. van Staveren
 %A E.G. Keizer
 %A J.W. Stevenson
 %J Proceedings USENIX conf.
 %C Toronto, Canada
 %V 26
 %D July 1983
 %P 255-261
 %T Using Peephole Optimization on Intermediate Code
 %A A.S. Tanenbaum
 %A J.M. van Staveren
 %A J.W. Stevenson
 %J TOPLAS
 %V 4
 %N 1
 %P 21-36
 %D January 1982
 %T EM-1 Compiler
 %A A.S. Tanenbaum
 %J Pascal News
 %D September 1981
 %P 4-38
 %T A portable compiler for the Proposed ISO Standard Pascal Language
 %A A.S. Tanenbaum
 %A J.W. Stevenson
 %A H. van Staveren
 %J Sigplan Notices
 %V 15
 %N 10
 %D 1980
 %T Implications of Structured Programming for Machine Architecture
 %A A.S. Tanenbaum
 %J CACM
 %V 21
 %N 3
 %P 237-246
 %D March 1978
 %T The table driven code generator from the Amsterdam Compiler Kit (Second
 revised edition)
 %A H. van Staveren
 %I Vrije Universiteit, Amsterdam
 %R on-line internal ACK documentation
 %D early 1985
 %T Dhrystone Benchmark: Rationale for Version 2 and Measurement Rules
 %A R.P. Weicker
 %J Sigplan Notices
 %V 23
 %N 8
 %D august 1988
 %P 49-62
--- a/doc/sparc/timing
+++ b/doc/sparc/timing
@ -0,0 +1,22 @@
 			DHRYSTONES V2.0
 		cc	cc -O4	cc -O	fccO	fccCE	ack	ack -O4
 compile time:
 	real	4.0	12.0		10.0	6.4	8.0	31.0
 	user	1.6	7.3	4.1	1.9	1.8	2.0	9.3
 	sys	0.9	2.1	1.8	2.5	1.5	2.0	7.7
 run time:	7263	16250	15250	4730	3430	8474	10434
 (stones/sec)
 			SIEVE
 		cc	cc -O4	fccO	fccCE	ack	ack -O4
 compile time:
 	real	2.4	4.4	x	3.3	6.4	17.0
 	user	0.8	1.6	x	0.7	0.7	3.2
 	sys	0.7	1.0	x	0.8	1.3	6.2
 run time:	7.43	2.02	x	12.18	2.93	2.26
 All ack-derived compilers are shell script driven
--- a/doc/sparc/title
+++ b/doc/sparc/title
@ -0,0 +1,15 @@
 .so init
 .TL
 .sp 1.2c
 A fast backend for SPARC processors
 .AU
 Philip Homburg
 Raymond Michiels
 .AI
 Dept. of Mathematics and Computer Science
 Vrije Universiteit
 Amsterdam, The Netherlands
 .PP
 .sp 1i
 Afstudeerverslag, 20 augustus 1990
 .bp