Added

1991-09-27 16:19:24 +00:00 · 1991-09-27 16:19:24 +00:00 · fb51183da2
commit fb51183da2
parent 63c9fea5c2
29 changed files with 2085 additions and 0 deletions
--- a/doc/sparc/.distr
+++ b/doc/sparc/.distr
@ -0,0 +1,14 @@
+1
+2
+3
+4
+5
+A
+B
+init
+intro
+note_on_reg_wins
+refs
+timing
+title
+Makefile
--- a/doc/sparc/1
+++ b/doc/sparc/1
@ -0,0 +1,53 @@
+.so init
+.NH
+INTRODUCTION
+.NH 2
+Why an EM backend for SPARC processors?
+.PP
+With the introduction of SPARC-based computers like the Sun-4, a
+whole new range of fast computers became readily available to the general
+public. The power of large mainframes had been captured into a small
+desk-top computer at only a fraction of the cost.
+.PP
+In the older days, a new computer used to be very hard to integrate into
+the existing environment, but due to standardization in the software world
+incompatibility in hardware no longer means incompatibility in software.
+Programs that are written for computer A can often be run on computer B
+without major modifications. Unfortunately this is not true for all software.
+.PP
+There will always be programs that rely on the specific
+hardware of a certain computer for many different reasons. They
+can be categorized as:
+.IP -
+poorly written programs
+.IP -
+programs to directly control hardware (device drivers)
+.IP -
+code that requires efficiency (time-critical I/O drivers)
+.IP -
+programs to generate code to run on the hardware (compilers)
+.LP
+This project for instance, the design and implementation of an EM backend
+for SPARC processors, comes in the last category.
+.PP
+We have designed and implemented an algorithm to convert EM programs to code
+that will run directly on the SPARC hardware. Henceforth, both the algorithm
+and the implementation will be referred to as the EM-to-SPARC backend,
+or simply: the backend.
+.NH 2
+Why has nobody done this before?
+.PP
+Since EM was designed around 1981 and even SPARC has been around for some
+years now, one may wonder why nobody has ever written an EM to SPARC backend
+before. The reason is twofold. In the first place, there are some
+non-trivial problems to be solved in the design phase, and secondly,
+the SPARC-design combined with the lack of documentation, would surely
+cost a lot of blood, sweat and tears. The absence of
+clues to any of the design problems, combined with the \(em at first
+glance \(em inhuman
+SPARC instruction set did not make this a very attractive project.
+.PP
+On the other hand, these were exactly the reasons which made us take on
+this particular project: it would require design skills, as well as some
+hard work; a golden combination for a successful project.
+.bp
--- a/doc/sparc/2
+++ b/doc/sparc/2
@ -0,0 +1,109 @@
+.so init
+.nr H1 1
+.NH
+CLOSE-UP LOOK
+.NH 2 
+What is EM?
+.PP
+As the abstract of the IR-81 report on EM
+.[ [
+description of a machine architecture
+.]]
+says: \*(OQEM is a family
+of intermediate languages designed for producing portable compilers.\*(CQ
+Because EM is to be used on a wide range of languages and processors,
+the instruction set is kept simple enough to allow easy translation to,
+or interpretation on, almost any processor. Yet it is also powerful enough
+to accommodate easy translation from almost any block-structured language.
+.PP
+Even though EM was designed in the early 1980s, it
+is based on
+.\" already shows strong signs of being influenced by
+the (then innovative) RISC architecture. All instructions
+have 0 or 1 operands, there are no fancy addressing modes as in the
+68020's\*(Si move.w a3(_array,d3.w*2), -(sp)\*(So, no explicit registers,
+although instructions for higher languages
+such as array-operations, multiway branches (case) and
+floating point operations are provided.
+.PP
+To fully understand the discussion in the following chapters,
+the reader should at least have some knowledge of EM.
+.NH 2
+What is SPARC?
+.PP
+According to Sun's RISC tutorial: \*(OQSun Microsystems has designed a RISC
+architecture, called SPARC, and has implemented that architecture with
+the Sun-4 family of supercomputing workstations and servers. SPARC stands
+for Scalable Processor ARChitecture, emphasizing its applicability to
+large as well as small machines.\*(CQ
+.PP
+In sharp contrast to EM, SPARC does have
+explicit registers (31 integer and 32 floating point, all of which
+are 32 bits wide) and
+does not support any high level language operations: it does not even have
+multiplication or division instructions. Because the SPARC design is
+very straightforward, all instructions could be hard-coded (no microcode
+involved) to
+provide extremely high performance. All register-to-register operations
+require exactly one clock cycle, and all register-to-memory and
+memory-to-register operations require two clock cycles, one to retrieve
+the instruction and one to access external memory. At a clock speed of
+over 20 MHz this means that you can achieve well over 10 VAX MIPS:
+more than 4 times the speed of a 15 MHz 68020 used in the Sun3/50.
+.PP
+As above, the reader should also have some general knowledge about
+the SPARC processor to be able to understand the following chapters.
+.NH 2
+What exactly is a (fast) backend?
+.PP
+To put it in the simplest of ways: a (fast) backend is a set of routines to
+translate EM code to code that will run 'on the metal' (for example the SPARC
+processor). The distinction between full-fledged backends (code generators)
+.[ [
+The table driven code generator
+.]]
+and fast backends (code expanders)
+.[ [
+The Code Expander Generator
+.]]
+is related to
+the compilation-time vs. run-time trade off. Code generators generate
+efficient code and code expanders generate code very efficiently.
+For details about code expanders see also
+.[ [
+The design of very fast portable compilers
+.]].
+.PP
+We had several reasons to implement a code expander rather than a code
+generator. The first was that implementing a
+code expander would be hard enough already. Code generators only give
+more problems and there were already enough problems to be solved. Secondly,
+we knew we would never be able to compete with original SPARC compilers due
+to loss of information in the frontends (see also chapter 5). By implementing
+a code expander we might be able to outrun the existing compilers on a
+completely different terrain: compile speed.
+.PP
+The third 'reason' to implement a code expander lies a little deeper and was
+not discovered until we had actually started the implementation... It was only
+then that we found out that for certain architectures, such as the SPARC,
+the idea behind the code-expander is not necessarily inferior to that
+behind a code-generator. It seems that for highly orthogonal instruction
+sets it is possible to generate near optimal code without using a
+code-generator. We have to say, however, that this is only true for our
+optimized version of the code-expander. With the original code-expander
+it would not have been possible to generate near-optimal code for the
+SPARC processor.
+.NH 2
+So, what are the main differences between EM and SPARC?
+.PP
+The main
+difference between EM and SPARC is the stack versus register orientation.
+The other differences, such as the presence of high level language
+operations in EM, can easily be overcome by subroutines,
+or small pieces of in-line SPARC code.
+The design-part of this project mostly concentrates on
+building a bridge between EM's stack and SPARC's registers.
+.PP
+In the next chapter we will make a list of all our design problems which
+will then be discussed in chapter 4.
+.bp
--- a/doc/sparc/3
+++ b/doc/sparc/3
@ -0,0 +1,82 @@
+.so init
+.nr H1 2
+.NH
+PROBLEMS
+.NH 2
+Maintain SPARC speed
+.PP
+If we want to generate SPARC code, we should try to generate efficient code
+as fast as possible. It would be quite embarrassing to find out that the
+same program would run faster on a Motorola 68020 than on a SPARC processor,
+when both operate at the same clock frequency.
+Looking at some code generated by Sun's C-compiler and optimizing assembler,
+we can spot a few remarkable characteristics of the generated SPARC code:
+.IP -
+There are almost no memory references
+.IP -
+Parameters to functions are passed through registers.
+.IP -
+Almost all delay slots\(dg
+.FS
+\(dg For details about delay slots see the SPARC Architecture Manual, chapter 4, pp. 42-48
+.FE
+are filled in by the assembler
+.LP
+If we want to generate efficient code, we should at least try to
+reduce the number of memory references and use registers wherever we can.
+Since EM is stack-oriented it references its stack for every operation so
+this will not be an easy task; a suitable solution will however be given in
+the next chapter.
+.NH 2
+Increase compilation speed
+.PP
+Because we will implement a code expander (fast backend) we should keep
+a close eye on efficiency; if we cannot beat regular compilers on producing
+efficient code we will try to beat them on fast code generation.
+The usual trick to achieve fast compilation is to pack the frontend,
+optimizer, code-generator and
+assembler all into a single large binary to reduce the overhead of
+reading and writing temporary files. Unfortunately, due to the
+SPARC instruction set, its relocation information is slightly bizarre
+and cannot be represented with the present primitives.
+This means that it will not be possible to generate the required output
+format directly from our backend.
+.PP
+There are three solutions here: generate assembler code, and let an
+existing assembler generate the required object (\fI.o\fR) files,
+create our own primitives that can handle the SPARC relocation format, or
+do not use any of the addressing modes that require the bizarre relocation.
+Because we have enough on our hands already we will
+let the existing assembler deal with generating object files.
+.NH 2
+Convert stack to register operations
+.PP
+As we wrote in the previous chapter, for RISC machines a code expander can
+produce almost as efficient code as a code generator. The fact that this is
+true for stack-oriented RISC processors is rather obvious. The problem we
+face, however, is that the SPARC processor is register-oriented instead of
+stack-oriented. In the next chapter we will give a suitable solution to
+convert most stack accesses to register accesses.
+.NH 2
+Miscellaneous
+.PP
+Besides performance and \fI.o\fR-compatibility there are some other
+peculiarities of the SPARC processor and Sun's C-compiler (henceforth
+simply called \fIcc\fR).
+.PP
+For some reason, the SPARC stack pointer requires alignment
+on 8 bytes, so you cannot push a 4-byte integer on the stack
+and then \*(Sisub 4, %sp\*(So\(dd.
+.FS
+\(dd For more information about SPARC assembler see the Sun-4 Assembly
+Language Reference Manual
+.FE
+This too will be discussed in the next chapter, where we will take a
+more in-depth look into this problem and also discuss a couple of
+possible solutions.
+.PP
+Another thing is that \fIcc\fR usually passes the first six parameters of a
+function-call through registers. To be \fI.o\fR-compatible we would have to
+pass the first six parameters of each function call through registers as well.
+Exactly why this is not feasible will also be discussed in the next chapter.
+.bp
--- a/doc/sparc/4
+++ b/doc/sparc/4
@ -0,0 +1,468 @@
+.so init
+.hw data-structures
+.nr H1 3
+.NH
+SOLUTIONS
+.NH 2
+Maintaining SPARC speed
+.PP
+In chapter 3 we wrote:
+.sp 0.3
+.nf
+>If we want to generate efficient code, we should at least try to reduce the number of
+>memory references and use registers wherever we can.
+.fi
+.sp 0.3
+In this chapter we will devise a strategy to swiftly generate acceptable
+code by using push-pop optimization.
+Note that this is not the push-pop
+optimization already available in the EM-kit, since that is only present
+in the assembler-to-binary part which we do not use
+.[ [
+The Code Expander Generator
+.]].
+Our push-pop optimization
+works more like the fake-stack described in
+.[ [
+The table driven code generator
+.]].
+.NH 3
+Ad-hoc optimization
+.PP
+Before getting involved in any optimization let's have a look at some
+code generated with a straightforward EM to SPARC conversion of the
+C statement: \*(Sif(a[i]);\*(So Note that \*(Si%SP\*(So is an alias
+for a general purpose
+register and acts as the EM stack pointer. It has nothing to do with
+\*(Si%sp\*(So \(em the SPARC stack pointer.
+Analogously, \*(Si%LB\*(So is EM's local base pointer.
+.br
+.IP
+.HS
+.TS
+;
+l s l s l
+l1f6 lf6 l2f6 lf6 l.
+EM code	SPARC code	Comment
+
+lae	_a	set	_a, %g1	! load address of external _a
+		dec	4, %SP
+		st	%g1, [%SP]
+
+lol	-4	set	-4, %g1	! load local -4 (i)
+		ld	[%g1+%LB], %g2
+		dec	4, %SP
+		st	%g2, [%SP]
+
+loc	2	set	2, %g1	! load constant 2
+		dec	4, %SP
+		st	%g1, [%SP]
+
+sli	4	ld	[%SP], %g1	! pop shift count
+		ld	[%SP+4], %g2	! pop shiftee
+		sll	%g2, %g1, %g3
+		inc	4, %SP
+		st	%g3, [%SP]	! push 4 * i
+
+ads	4	ld	[%SP], %g1	! add pointer and offset
+		ld	[%SP+4], %g2
+		add	%g1, %g2, %g3
+		inc	4, %SP
+		st	%g3, [%SP]	! push address of _a + (4 * i)
+
+loi	4	ld	[%SP], %g1	! load indirect 4 bytes
+		ld	[%g1], %g2
+		st	%g2, [%SP]	! push a[i]
+cal	_f
+		...
+.TE
+.HS
+.LP
+Although the code is easy to understand, it clearly is far from optimal.
+The above code uses approximately 60 clock-cycles\(dg
+.FS
+\(dg In general each instruction only takes one cycle,
+except for \*(Sild\*(So and
+\*(Sist\*(So which may both require additional clock cycles. The exact amount
+of extra cycles needed depends on the SPARC implementation and memory access
+time. Furthermore, the
+\*(Siset\*(So pseudo-instruction is a bit tricky. It takes one cycle when
+its argument lies between -4096 and 4095, and two cycles otherwise.
+.FE
+to push an array-element on the stack,
+something which a 68020 can do in a single instruction. The SPARC
+processor may be fast, but not fast enough to justify the above code.
+.PP
+The same statement can be translated much more efficiently:
+.DS
+.TS
+;
+l2f6 lf6 l.
+sll	%i0, 2, %g2	! multiply index by 4
+set	_a, %g3
+ld	[%g2+%g3], %g1	! get contents of a[i]
+dec	4, %SP
+st	%g1, [%SP]	! push a[i] onto the stack
+.TE
+.DE
+which, instead of 60, uses only 5 clock cycles to retrieve the element
+from memory and 5 additional cycles when the result has to be pushed
+on the stack. Note that when the result is not a parameter it does not
+have to be pushed on the stack. By making efficient use of the SPARC
+registers we can fetch \*(Sia[i]\*(So in only 5 cycles!
+.NH 3
+Analyzing optimization
+.PP
+Instead of ad-hoc optimization we will need something more solid.
+When one tries to optimize the above code in an ad-hoc manner one will
+probably notice the large overhead due to stack access. Almost every EM
+instruction requires at least three SPARC instructions: one to carry out
+the EM instruction and two to pop the operands from and push the result onto the
+stack. This happens for every instruction, even though the data being pushed
+will probably be needed by the next instruction. To optimize this extensive
+pushing and popping of data we will use the appropriately named push-pop
+optimization.
+.PP
+The idea behind push-pop optimization is to delay the push operation until
+it is almost certain that the data actually has to be pushed.
+As is often the case, the data does not have to be pushed,
+but will be used as input to another EM instruction.
+If we can decide at compile time that this will indeed be
+the case we can save the time of first pushing the data and then popping it
+back again by temporarily storing the data (possibly only during compilation!)
+and using it no sooner than it is actually needed.
+.PP
+The \*(Sisli 4\*(So instruction, for instance, expects two inputs on top of the
+stack: on top a counter and right below that the shiftee (the number
+to be shifted). As a result \*(Sisli\*(So
+pushes 'shiftee << counter' back to the stack. Now consider the following
+sequence, which could be the result of the expression \*(Si4 * i\*(So
+.DS
+.TS
+;
+l1f6 lf6 l.
+lol	-4
+loc	2
+sli	4
+.TE
+.DE
+In the non-optimized situation the \*(Silol\*(So would push
+a local variable (whose offset is -4) on the stack.
+Then the \*(Siloc\*(So pushes a 2 on the stack and finally \*(Sisli\*(So
+retrieves both these numbers to replace them with the result.
+On most machines it is not necessary to
+push the 2 on the stack, since it can be used in the shift instruction
+as an immediate operand. On a SPARC, for instance, one can write
+.DS
+.TS
+;
+l2f6 lf6 l.
+ld	[%LB-4], %g1	! load local variable into register g1
+sll	%g1, 2, %g2	! perform the shift-left-by-2
+.TE
+.DE
+where the output of the \*(Silol\*(So, as well as the immediate operand 2 are used
+in the shift instruction. As suggested before, all of this can be
+achieved with push-pop optimization.
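+.PP
+The deferral can be sketched in a few lines of C. This is only an
+illustration under our own assumptions, not code taken from the backend:
+the names \*(Sisk_lol\*(So, \*(Sisk_loc\*(So and \*(Sisk_sli\*(So are hypothetical,
+registers are passed around as strings, and only the case of a constant
+shift count is handled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical compile-time stack entry: a known constant, or a value
 * that already lives in a (symbolically named) register. */
enum sk_kind { SK_CONST, SK_REG };

struct sk_entry {
    enum sk_kind kind;
    long value;      /* valid when kind == SK_CONST */
    char reg[8];     /* valid when kind == SK_REG, e.g. "%g1" */
};

static struct sk_entry sk_stack[16];
static int sk_depth;
static char sk_out[256];   /* generated assembly, kept for inspection */

static void sk_emit(const char *line)
{
    size_t used = strlen(sk_out);
    snprintf(sk_out + used, sizeof sk_out - used, "%s\n", line);
}

/* loc: remember the constant; no code is generated yet. */
void sk_loc(long c)
{
    assert(sk_depth < 16);
    sk_stack[sk_depth].kind = SK_CONST;
    sk_stack[sk_depth].value = c;
    sk_depth++;
}

/* lol: load the local into reg and remember where the value lives. */
void sk_lol(long offset, const char *reg)
{
    char line[64];

    assert(sk_depth < 16);
    snprintf(line, sizeof line, "ld [%%LB%+ld], %s", offset, reg);
    sk_emit(line);
    sk_stack[sk_depth].kind = SK_REG;
    snprintf(sk_stack[sk_depth].reg, sizeof sk_stack[sk_depth].reg, "%s", reg);
    sk_depth++;
}

/* sli: consume both entries at compile time; the constant count
 * becomes an immediate operand instead of a pushed word. */
void sk_sli(const char *dst)
{
    struct sk_entry count, shiftee;
    char line[64];

    assert(sk_depth >= 2);
    count = sk_stack[--sk_depth];
    shiftee = sk_stack[--sk_depth];
    assert(count.kind == SK_CONST && shiftee.kind == SK_REG);
    snprintf(line, sizeof line, "sll %s, %ld, %s",
             shiftee.reg, count.value, dst);
    sk_emit(line);
    sk_stack[sk_depth].kind = SK_REG;
    snprintf(sk_stack[sk_depth].reg, sizeof sk_stack[sk_depth].reg, "%s", dst);
    sk_depth++;
}

const char *sk_generated(void) { return sk_out; }
```

+.LP
+Feeding the sequence \*(Silol -4\*(So, \*(Siloc 2\*(So, \*(Sisli 4\*(So through
+these routines yields the two SPARC instructions shown above; neither the
+local nor the constant ever reaches the stack.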
+.NH 3
+A mechanism for push-pop optimization
+.PP
+To implement the above optimization we need some mechanism to
+temporarily store information during compilation.
+We need to be able to store, compare and retrieve information from the
+temporary storage (cache) without any
+loss of information. Before describing all the routines used
+to implement our cache we will first describe how the cache works.
+.PP
+Items in the cache are structures containing an external (\*(Sichar *\*(So),
+two registers (\*(Sireg_t\*(So) and a constant (\*(Siarith\*(So),
+any of which may be 0.
+The value of such a structure is the sum of (the values of)
+its elements. Before a register can be put in the cache, it has to be
+allocated, either by calling \*(Sialloc_reg\*(So, which returns a free register, by
+\*(Siforced_alloc_reg\*(So, which allocates a specific register, or by any
+of the other routines available to allocate a register. To keep things
+simple, we will not discuss all of the available primitives here.
+When the register
+is then put in the cache by the \*(Sipush_reg\*(So routine, the ownership will
+be transferred from the user to the cache. Ownership is important, because
+only the owner of a register may (and must!) deallocate it. Registers can be
+owned by either an (imaginary) register manager, the cache or the user.
+When the user retrieves a register from the cache with \*(Sipop_reg\*(So for
+instance, ownership passes back to the user.
+The user should then call \*(Sifree_reg\*(So
+to transfer ownership to the register manager or call \*(Sipush_reg\*(So
+to give it back to the cache.
+Since the cache behaves like a stack, we will use the terms pop and push
+for getting items from and putting items in the cache.
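+.PP
+The ownership rules can be made concrete with a small C sketch. It is a
+simplified model resting on our own assumptions, not the backend's
+implementation: all \*(Sisk_\*(So names are hypothetical, registers are small
+integers with 0 standing for \*(Si%g0\*(So, the cache holds plain registers
+only, and running out of registers simply returns 0.

```c
#include <assert.h>

typedef int reg_t;   /* assumption: small integers, 0 plays the role of %g0 */

enum sk_owner { SK_MANAGER, SK_USER, SK_CACHE };

#define SK_NREGS 8
static enum sk_owner sk_owner_of[SK_NREGS];  /* zero-initialised: SK_MANAGER */
static reg_t sk_cache[SK_NREGS];
static int sk_cache_depth;

/* Hand a free register to the user. */
reg_t sk_alloc_reg(void)
{
    reg_t r;

    for (r = 1; r < SK_NREGS; r++) {
        if (sk_owner_of[r] == SK_MANAGER) {
            sk_owner_of[r] = SK_USER;
            return r;
        }
    }
    return 0;   /* sketch only: no cache flushing is modelled here */
}

/* Ownership moves from the user to the cache. */
void sk_push_reg(reg_t r)
{
    assert(r > 0 && r < SK_NREGS && sk_owner_of[r] == SK_USER);
    assert(sk_cache_depth < SK_NREGS);
    sk_owner_of[r] = SK_CACHE;
    sk_cache[sk_cache_depth++] = r;
}

/* Ownership moves from the cache back to the user. */
reg_t sk_pop_reg(void)
{
    reg_t r;

    assert(sk_cache_depth > 0);
    r = sk_cache[--sk_cache_depth];
    sk_owner_of[r] = SK_USER;
    return r;
}

/* Only the current owner (here: the user) may release a register. */
void sk_free_reg(reg_t r)
{
    assert(r > 0 && r < SK_NREGS && sk_owner_of[r] == SK_USER);
    sk_owner_of[r] = SK_MANAGER;
}

enum sk_owner sk_owner(reg_t r) { return sk_owner_of[r]; }
```

+.LP
+The assertions encode the rule that only the owner of a register may
+release it or hand it on.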
+.PP
+We shall now present the sets of routines that implement the cache.
+.IP \(bu
+The routines
+.DS
+\*(Si
+reg_t alloc_reg(void)
+reg_t alloc_reg_var(void)
+reg_t alloc_float(void)
+reg_t alloc_float_var(void)
+reg_t alloc_double(void)
+reg_t alloc_double_var(void)
+
+void forced_alloc_reg(reg_t)
+void soft_alloc_reg(reg_t)
+
+void free_reg(reg_t)
+void free_double_reg(reg_t)
+\*(So
+.DE
+allocate and deallocate registers. If there are no more registers left,
+i.e. they are all owned by the cache,
+one or more registers will be freed by flushing part of the cache
+onto the real stack.
+The \*(Sialloc_xxx_var\*(So primitives try to allocate a register that
+can be used to store local variables. (In the current implementation
+only the input and local registers.) If none can be found \*(SiNULL\*(So
+is returned. \*(Siforced_alloc_reg\*(So forces the allocation of a certain
+register. If it was already in use, its contents are moved to another
+register. Finally \*(Sisoft_alloc_reg\*(So provides the possibility to
+push a register onto the cache and still keep a copy for later use.
+(Used to implement the \*(Sidup 4\*(So for example.)
+.IP \(bu
+The routines
+.DS
+\*(Si
+void push_const(arith)
+arith pop_const(void)
+\*(So
+.DE
+push or pop a constant onto or from the stack. Distinction between
+constants and other types is made so as not to lose any information; constants
+may be used later on as immediate operands, which is not the case
+for other types. If \*(Sipop_const\*(So is called, but the element on top of
+the cache has either one of the external or register fields non-zero a
+fatal error will be reported.
+.IP \(bu
+The routines
+.DS
+\*(Si
+reg_t pop_reg(void)
+reg_t pop_float(void)
+reg_t pop_double(void)
+reg_t pop_reg_c13(char *n)
+
+void pop_reg_as(reg_t)
+
+void push_reg(reg_t)
+\*(So
+.DE
+push or pop a register. These will be used most often since results from one
+EM instruction, which are computed in a register, are often used in the next.
+When the element on top of the cache is more
+than just a register the cache manager
+will generate code to compute the sum of its fields and put the result in a
+register. This register will then be given to the user.
+If the user wants the result in a specific register, he should use the
+\*(Sipop_reg_as\*(So routine.
+The \*(Sipop_reg_c13\*(So gives an optional number (as character string) whose
+value can be represented in 13 bits. The constant can then be used as an
+offset for the SPARC \*(Sild\*(So and \*(Sist\*(So instructions.
+.IP \(bu
+The routine
+.DS
+\*(Si
+void push_ext(char *)
+\*(So
+.DE
+pushes an external onto the stack. There is no pop-variant of this one since
+there is no use in popping an external.
+.IP \(bu
+The routines
+.DS
+\*(Si
+void inc_tos(arith n)
+void inc_tos_reg(reg_t r)
+\*(So
+.DE
+increment the element on top of the cache by either the constant \*(Sin\*(So
+or by a register. The latter is useful for pointer addition when referencing
+external memory.
+.KS
+.IP \(bu
+The routine
+.DS
+\*(Si
+int type_of_tos(void)
+\*(So
+.DE
+.KE
+returns the type of the element on top of the cache. This is a combination
+(binary OR) of \*(SiT_ext\*(So, \*(SiT_reg\*(So or \*(SiT_float\*(So,
+\*(SiT_reg2\*(So or \*(SiT_float2\*(So, and \*(SiT_cst\*(So,
+and tells the
+user which of the fields are non-zero. When a register field
+represents \*(Si%g0\*(So, it is considered zero.
+.IP \(bu
+Miscellaneous routines:
+.DS
+\*(Si
+void init_cache(void)
+void cache_need(int)
+void change_reg(void)
+void flush_cache(void)
+\*(So
+.DE
+\*(Siinit_cache\*(So should be called before any
+other cache routines, to initialize some internal data structures.
+\*(Sicache_need\*(So is used to tell the cache that a certain number
+of registers are needed for the next operation. This way the cache can
+load them efficiently in one fell swoop. \*(Sichange_reg\*(So is to be
+called when the user changes a register of which the cache (possibly) has
+co-ownership. Because the contents of registers in the cache are
+not allowed to change the user should call \*(Sichange_reg\*(So to
+instruct the cache to copy the contents to some other register.
+\*(Siflush_cache\*(So writes the cache to the stack and invalidates
+the cache. It should be used before branches,
+before labels and at other places where the stack has to be valid (i.e. where
+every item on the EM-stack should be stored on the real stack, not in some
+virtual cache).
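+.PP
+How a cache item and its type might be represented can be sketched as
+follows. This is an assumption-laden illustration, not the backend's
+definition: the structure layout, the \*(Sisk_\*(So names and the bit values
+of the \*(SiSK_T_\*(So flags are all hypothetical, and floating point
+registers are left out.

```c
#include <assert.h>
#include <stddef.h>

typedef int reg_t;     /* assumption: 0 stands for %g0 */
typedef long arith;

/* Hypothetical flag values; the real T_ext, T_reg, ... may differ. */
#define SK_T_CST  1
#define SK_T_REG  2
#define SK_T_REG2 4
#define SK_T_EXT  8

/* The value of an item is the sum of its fields. */
struct sk_item {
    const char *ext;   /* external symbol, or NULL */
    reg_t reg, reg2;   /* 0 (%g0) counts as "no register" */
    arith cst;
};

/* Report which fields of the item are non-zero. */
int sk_type_of(const struct sk_item *it)
{
    int t = 0;

    assert(it != NULL);
    if (it->ext != NULL) t |= SK_T_EXT;
    if (it->reg != 0)    t |= SK_T_REG;
    if (it->reg2 != 0)   t |= SK_T_REG2;
    if (it->cst != 0)    t |= SK_T_CST;
    return t;
}

/* Adding a constant only updates the constant field; no code is needed. */
void sk_inc(struct sk_item *it, arith n)
{
    assert(it != NULL);
    it->cst += n;
}

/* A constant can only be popped when no other field is in use. */
int sk_is_pure_const(const struct sk_item *it)
{
    return (sk_type_of(it) & ~SK_T_CST) == 0;
}
```

+.LP
+Keeping the fields apart is what allows a later instruction to pick up
+the constant as an immediate operand instead of finding it already
+folded into a register.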
+.NH 3
+Implementing push-pop optimization in the EM_table
+.PP
+As indicated above, there is no regular way to represent the described
+optimization in the EM_table. The only possible escapes from the EM_table
+are function calls, but that is clearly not enough to implement a good
+push-pop optimizer. Therefore we will use a modified version of the EM_table
+format, where the description of, say, the \*(Silol\*(So instruction might look
+like this\(dg:
+.FS
+\(dg This is not the way the \*(Silol\*(So actually looks in the EM_table;
+it only shows how it \fImight\fR look using the aforementioned push/pop
+primitives.
+.FE
+.DS
+\*(Si
+reg_t A, B;
+const_str_t n;
+
+A = alloc_reg();
+push_reg(LB);
+inc_tos($1);
+B = pop_reg_c13(n);
+"ld  [$B+$n], $A";
+push_reg(A);
+free_reg(B);
+\*(So
+.DE
+For more details about the exact implementation consult
+appendix B which contains some characteristic excerpts from the EM_table.
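+.PP
+The role of the 13-bit constant can be illustrated with a short C sketch.
+It assumes the signed range of -4096 to 4095 quoted earlier for the
+\*(Siset\*(So pseudo-instruction; \*(Sisk_fits_simm13\*(So and
+\*(Sisk_load_local\*(So are hypothetical helpers, not routines of the backend.

```c
#include <assert.h>
#include <stdio.h>

/* Does v fit in a signed 13-bit immediate field? */
int sk_fits_simm13(long v)
{
    return v >= -4096 && v <= 4095;
}

/* Write the code that loads the local at `offset` from %LB into dst.
 * The scratch register is only used when the offset does not fit.
 * Returns the number of assembler lines written. */
int sk_load_local(char *buf, size_t size, long offset,
                  const char *dst, const char *scratch)
{
    assert(buf != NULL && size > 0);
    if (sk_fits_simm13(offset)) {
        /* The offset can be folded into the load itself. */
        snprintf(buf, size, "ld [%%LB%+ld], %s\n", offset, dst);
        return 1;
    }
    /* Otherwise the offset has to be built in a register first. */
    snprintf(buf, size, "set %ld, %s\nld [%%LB+%s], %s\n",
             offset, scratch, scratch, dst);
    return 2;
}
```

+.LP
+For \*(Silol -4\*(So the offset fits, so a single \*(Sild\*(So suffices; a
+local at offset 8192 would need the extra \*(Siset\*(So.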
+.NH 2
+Stack management
+.PP
+When converting EM code to some executable code there is the problem of
+maintaining multiple stacks. The usual way to do this is described in
+.[ [
+Description of a Machine Architecture
+.]]
+and is shown in figure \*(SN1.
+.KS
+.PS
+copy "pics/EM_stack.orig"
+.PE
+.ce 1
+\fIFigure \*(SN1: usual stack management.\fR
+.KE
+.sp
+.LP
+This means that the EM stack and the hardware stack (used
+for subroutine calls, etc.) are interleaved in memory. On the SPARC, however,
+this brings up a large problem: in the former model it is assumed that the
+resolution of the stack pointer is a word, but this is not the case on the
+SPARC processor. On the SPARC processor the stack-pointer as well as the
+frame-pointer have to be aligned on 8-byte boundaries, so one can not simply
+push a word on the stack and then lower the stack-pointer by 4 bytes!
+.NH 3
+Possible solutions
+.PP
+A simple idea might be to use a swiss-cheese stack; we could
+push a 4-byte word onto the stack and then lower the stack by 8.
+Unfortunately, this is not a very solid solution, because
+pointer-arithmetic involving pointers to objects on the stack would cause
+hard-to-predict anomalies.
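+.PP
+The arithmetic behind the swiss-cheese stack can be sketched in C. The
+helpers below are hypothetical and only model the 8-byte alignment rule
+on plain integers; they do not touch a real stack pointer.

```c
#include <assert.h>
#include <stdint.h>

#define SK_ALIGN ((uintptr_t)8)   /* required stack pointer alignment */

int sk_is_aligned(uintptr_t sp)
{
    return sp % SK_ALIGN == 0;
}

/* Round a size up to the next multiple of the alignment. */
uintptr_t sk_round_up(uintptr_t bytes)
{
    return (bytes + SK_ALIGN - 1) & ~(SK_ALIGN - 1);
}

/* Lower an aligned stack pointer far enough to hold `bytes`,
 * keeping it aligned. */
uintptr_t sk_push_space(uintptr_t sp, uintptr_t bytes)
{
    assert(sk_is_aligned(sp));
    return sp - sk_round_up(bytes);
}

/* Unused bytes left behind after such a push. */
uintptr_t sk_hole(uintptr_t bytes)
{
    return sk_round_up(bytes) - bytes;
}
```

+.LP
+Pushing a single 4-byte word lowers the pointer by 8 and leaves a 4-byte
+hole, which is exactly what upsets pointer arithmetic on stack objects.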
+.PP
+Another try would be not to use the hardware stack at all. As long as we
+do not generate subroutine-calls everything will be all right. This
+approach, however, also has some disadvantages: first we would not be able
+to use any of the existing debuggers such as \fIadb\fR, because they all
+assume a regular stack format. Secondly, we would not be able to make use
+of the SPARC's register windows to keep local variables. Finally, doing all the
+administrative work necessary for subroutine calls ourselves instead of
+letting the hardware handle it for us,
+causes unnecessary procedure-call overhead.
+.PP
+Yet another alternative would be to emulate the EM-part of the stack,
+and to let the hardware handle the subroutine call. Since we will
+emulate our own stack, there are no alignment restrictions and because
+we will use the hardware procedure call we can still make use of
+the register windows.
+.NH 3
+Our implementation
+.PP
+To implement the hybrid stack we need two extra registers: one for the
+EM stack pointer (the aforementioned \*(Si%SP\*(So) and one for the
+EM local base pointer (\*(Si%LB\*(So). The most elegant solution would be to
+put both stacks in different segments, so they would not influence
+each other. Unfortunately
+.UX
+lacks the ability to add segments and
+since we will implement our backend under
+.UX,
+we will have to put
+both stacks in the same segment. Exactly how this can be done is shown
+in figure \*(SN2.
+.DS
+.PS
+copy "pics/mem_config"
+.PE
+.ce 1
+\fIFigure \*(SN2: our stack management.\fR
+.DE
+.sp
+During normal procedure execution, the SPARC stack pointer has to point to
+a memory location where the operating system can dump the active part of
+the register window. The rest of the
+register window will be dumped in the (stack) space pre-allocated for that purpose
+by following the frame
+pointer. When a signal occurs things get even more complicated and
+result in figure \*(SN3.
+.DS
+.PS
+copy "pics/signal_stack"
+.PE
+.ce 1
+\fIFigure \*(SN3: our signal stack.\fR
+.DE
+.PP
+The exact implementation of the stack is shown in figure \*(SN4.
+.KF
+.PS
+copy "pics/EM_stack.ours"
+.PE
+.ce 1
+\fIFigure \*(SN4: stack overview.\fR
+.KE
+.NH 2
+Miscellaneous
+.PP
+As mentioned in the previous chapter, the generated \fI.o\fR-files are
+not compatible with Sun's own object format. The primary reason for
+this is that Sun usually passes the first six parameters of a procedure call
+through registers. If we were to do that too, we would always have
+to fetch the top six words from the stack into registers, even when
+the procedure would not have any parameters at all. Apart from this,
+structure-passing is another exception in Sun's object format which
+makes it impossible to generate object-compatible code.\(dg
+.FS
+\(dg Exactly how Sun passes structures as parameters is described in
+Appendix D of the SPARC Architecture Manual (Software Considerations)
+.FE
+.bp
--- a/doc/sparc/5
+++ b/doc/sparc/5
@ -0,0 +1,153 @@
+.so init
+.nr H1 4
+.NH
+FUTURE WORK
+.NH 2
+A critique of EM
+.PP
+In general, EM fits its purpose quite well. Numerous compilers have been
+written using EM as their intermediate language and it has even become a
+commercial product. A great deal of its success is probably due to its
+simplicity. There are no extravagant instructions but it does have all the
+necessary functions to write a decent compiler.
+.PP
+There are, however, a few functions that come rather close to being
+extravagant. The \*(Silar\*(So function for example \(em used
+to fetch an element from an array \(em does not make it much easier
+to write a frontend, but does make it unnecessarily hard to write an
+efficient backend. Other instructions for which it is difficult
+to generate efficient code are those that permit
+dynamic operators, such as the \*(Silos\*(So. Dynamic operators, however, provide
+significant extra possibilities and can therefore not be disposed of.
+Note that even though the array operations \*(Silar\*(So and \*(Sisar\*(So
+provide dynamic operators, they do not add additional power, since
+they can easily be replaced with a sequence using the \*(Silos\*(So or
+\*(Sists\*(So instructions.
+.PP
+EM code to reference arrays generated by the C frontend can be translated
+very efficiently for almost any processor. However the same operation
+generated by the Modula-2 frontend (which uses the \*(Silar\*(So),
+is much less efficient, although the only difference is that the
+latter performs range checking whereas the former does not.\(dg
+.FS
+\(dg Actually this depends on whether or not explicit range checking is enabled.
+This clearly shows that the current code generators are not optimal and
+often depend on ad-hoc decisions.
+.FE
+Since range checking can also be expressed explicitly in
+EM (\*(Sirck\*(So) there is no need for any of the array operations
+(\*(Siaar\*(So, \*(Silar\*(So and \*(Sisar\*(So).
+.PP
+Besides efficiency of the array-operations themselves, there still is another
+major disadvantage of using these array-operations. In sharp contrast to
+all other EM instructions except the \*(Silos\*(So and the \*(Sists\*(So,
+they allow dynamic operators, so their effect on the stack-pointer can not
+always be
+determined at compile-time. This means that efficient caching of the
+top-of-stack in registers is almost impossible,
+so using these array-operations also affects the
+efficiency of the surrounding code. Now that processors are produced with
+more and more registers it could be very beneficial to cache the
+top-of-stack, so that the memory/register reference ratio decreases
+to the benefit of the overall performance.
+.PP
+As a final critique, we would also like to discuss the semantics of some of
+the EM instructions. In
+.[ [
+Description of a Machine Architecture
+.]]
+it is said that
+all signed instructions such as the \*(Siadi\*(So, should cause an exception
+on overflow. The unsigned operations such as \*(Siadu\*(So, however,
+should act as modulo operations and therefore not perform overflow checking.
+Since it is very expensive to perform overflow checking in EM,
+we would suggest that the backend takes care of this. For languages which
+do not require overflow checking, a simple message could be generated to
+disable overflow checking in backends. This way all backends could be
+written to fully comply with the official EM definition without any reduction in
+efficiency.\(dd 
+.FS
+\(dd Currently many backends do not implement error checks because they
+are too expensive and almost never needed. Some frontends even have
+facilities built in to generate EM-code to force these checks. If this
+trend continues we will end up with a de-facto and a de-jure standard
+both developed by the same people but nonetheless incompatible.
+.FE
+When such messages are added, we would like to suggest
+that they can enforce overflow checks on unsigned as well as signed arithmetic.
+.PP
+As a conclusion we would like to suggest removal of the array operations from
+EM, or at least discontinuation of their usage in frontends.
+.NH 2
+\*(OQWanted: Procedure call information\*(CQ
+.PP
+The advantage of an intermediate language such as EM is that the backend
+no longer has to know about any 'quirks' of the 'input'-language. The major
+disadvantage, however, is that the backend no longer knows about any 'quirks'
+of the 'input'-language... If the SPARC backend ever has to compete
+with Sun's own C-compiler for example, removal of the array-operations
+will not be enough. The amount of information that is lost during
+the translation to EM is too large to ever generate truly efficient SPARC code.
+.PP
+To write such an efficient backend one needs to know, for example, whether,
+when and what type of parameter is being computed, so the result can be stored
+in the proper place and scratch registers can be reused.
+(On the SPARC processor, for example, it is very beneficial
+to pass the first six parameters of a procedure call through
+registers instead of using the stack.)
+One way to express such things in EM is to insert extra messages in
+the EM-code. The C statement \*(Sia = f(4, a + b);\*(So for example,
+could be translated to the following EM-code:
+.DS
+.TS
+;
+l1f6 lf6 l.
+lol	-4	! a
+lol	-8	! b
+mes	x, 2	! next instruction will compute 2nd parameter
+adi	4
+mes	x, 1	! next instruction will compute 1st parameter
+loc	4
+cal	_f	! call function f
+lfr	4
+stl	-4	! store result in a
+.TE
+.DE
+For a code expander it is important that the \*(Simes\*(So pseudo
+instructions appear \fIbefore\fR
+the EM instruction that computes the parameter, because that way the final
+computation (the \*(Siadi\*(So and \*(Siloc\*(So in the previous example)
+can be translated to machine code that performs the required computation
+and also puts the result in the required place. If it is found to be
+too difficult for the frontend to insert these \*(Simes\*(So instructions
+at the right place the peep-hole optimizer might swap the \*(Simes\*(So and
+the instruction that computes the parameter.
+.PP
+For some architectures, it is also
+possible to generate more efficient code for a procedure when it is a
+so-called leaf-procedure: a procedure that doesn't call other procedures.
+On the SPARC, for example, it is not necessary to rotate the register
+window for a call to a leaf procedure and it is also possible to use
+the global registers for register variables in leaf procedures.
+It will be a little harder to insert useful messages about leaf procedures,
+because just as with register messages, they are only useful to the
+backend when they appear immediately
+after or before the \*(Sipro\*(So pseudo instruction. The frontend,
+however, only knows whether a certain procedure is a leaf-procedure or not
+when it has already generated the entire procedure in EM. Just as with the
+\*(Sipro ? / end n\*(So-dilemma the peep-hole optimizer
+.[ [
+Using Peephole Optimization
+.]]
+might be able to lend a hand
+and help us out by delaying EM-code generation until it has reached the
+end of the procedure.
+.PP
+As with most optimizations, the main problem is that they have to be
+implemented with the \*(Simes\*(So pseudo instruction.
+Because the \*(Simes\*(So instruction can have many different meanings
+depending on its argument, 
+it is important that all optimizers recognize and respect them. Addition
+of even a single message will require careful inspection of, and maybe even
+small changes to, each of the optimizers.
+.bp
--- a/doc/sparc/A
+++ b/doc/sparc/A
@ -0,0 +1,184 @@
+.so init
+.SH
+A. MEASUREMENTS
+.SH
+A.1. \*(OQThe bottom line\*(CQ
+.PP
+Although examples often are most illustrative, the cruel world out there is
+usually more interested in everyday performance figures. To satisfy those
+people too, we will present a series of measurements on our code expander
+taken from (close to) real life situations. These include measurements
+of compile and run times of different programs,
+compiled with different compilers.
+.SH
+A.2. Compile time measurements
+.PP
+Figure A.2.1 shows compile-time measurements for typical C code:
+the dhrystone benchmark\(dg
+.[ [
+dhrystone
+.]].
+.FS
+\(dg To be certain that we only tested the compiler and not the quality of
+the code in the library, we have added our own version of
+\fIstrcmp\fR and \fIstrcpy\fR and have not used the ones present in the
+library.
+.FE
+The numbers represent the duration of each separate pass of the compiler.
+The numbers at the end of each bar represent the total duration of the
+compilation process. As with all measurements in this chapter, the
+quoted time or duration is the sum of user and system time in seconds.
+.PS
+copy "pics/compile_bars"
+.PE
+.DS
+.IP cem: 6
+C to EM frontend
+.IP opt:
+EM peep-hole optimizer
+.IP be:
+EM to assembler backend
+.IP cpp:
+Sun's C preprocessor
+.IP ccom:
+Sun's C compiler
+.IP iropt:
+Sun's optimizer
+.IP cg:
+Sun's code generator
+.IP as:
+Sun's assembler
+.IP ld:
+Sun's linker
+.ce 1
+\fIFigure A.2.1: compile-time measurements.\fR
+.DE
+.sp
+.PP
+A close examination of the first two bars in figure A.2.1 shows that the maximum
+achievable compile-time
+gain compared to \fIcc\fR is about 50% for medium-sized
+programs.\(dd
+.FS
+\(dd (cpp+ccom+as+ld)/(cem+as+ld) = 1.53
+.FE
+For small programs the gain will be less, due to the almost constant
+start-up time of each pass in the compilation process. Only a
+built-in assembler may increase this number up to
+180% in the ideal case that the optimizer, backend and assembler
+would run in zero time. Speed-ups of 5 to 10 times as mentioned in
+.[ [
+fast portable compilers
+.]]
+are therefore not possible on the Sun-4 family. This is also due to
+Sun's implementation of saving and restoring register windows. With
+the current implementation in which only a single window is saved
+or restored on a register-window overflow, it is very time consuming
+when programs have highly dynamic stack use
+due to procedure calls (as is often the case with compilers).
+.PP
+Although we are currently a little slower than \fIcc\fR, it is hard to
+blame this on our backend. Optimizing the backend so that it would run
+twice as fast would only reduce the total compilation process by
+a mere 14%.
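+.PP
+The arithmetic behind this estimate can be reproduced with a short sketch
+in C. The pass durations are those shown in the \*(OQACK w/o opt\*(CQ bar
+of figure A.2.1; the assumption that the 14% refers to that bar is ours,
+as are the names \*(Sitotal_seconds\*(So and \*(Sisaving_fraction\*(So.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct pass {
    const char *name;
    double seconds;             /* user + system time of this pass */
};

/* Pass durations read from the "ACK w/o opt" bar of figure A.2.1. */
static const struct pass ack_no_opt[] = {
    { "cem", 0.7 },
    { "opt", 0.4 },
    { "be",  1.1 },
    { "as",  1.4 },
    { "ld",  0.4 },
};
#define ACK_NO_OPT_PASSES (sizeof ack_no_opt / sizeof ack_no_opt[0])

/* Total duration of the compilation process. */
double total_seconds(const struct pass *p, size_t n)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += p[i].seconds;
    return sum;
}

/*
 * Fraction of the total compilation time that is saved when the
 * pass called `name` runs `speedup` times faster and all other
 * passes stay the same.
 */
double saving_fraction(const struct pass *p, size_t n,
                       const char *name, double speedup)
{
    double saved = 0.0;
    size_t i;

    assert(speedup > 0.0);
    for (i = 0; i < n; i++)
        if (strcmp(p[i].name, name) == 0)
            saved += p[i].seconds * (1.0 - 1.0 / speedup);
    return saved / total_seconds(p, n);
}
```

+.PP
+With these numbers, a backend that runs twice as fast saves 0.55 seconds
+out of 4.0, which is just under 14% of the total.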
+.PP
+Finally it is nice to see that our push/pop-optimization,
+initially designed to generate faster code, has also increased the
+compilation speed. (See also figures A.4.1 and A.4.2.)
+.SH
+A.3. Run time performance
+.PP
+Figure A.3.1 shows the run-time performance of different compilers.
+All results are normalized, where the best available compiler (Sun's
+compiler with full optimization) is represented by 1.0 on our scale.
+.PS
+copy "pics/run-time_bars"
+.PE
+.ce 1
+\fIFigure A.3.1: run time performance.\fR
+.sp 1
+.PP
+Our compiler behaves rather poorly compared to Sun's
+compiler because the dhrystone benchmark makes
+relatively many subroutine calls, all of which have to be 'emulated'
+by our backend.
+.SH
+A.4. Overall performance
+.LP
+In the next two figures we will show the combined run and compile time
+performance of 'our' compiler (the ACK C frontend and our backend)
+compared to Sun's C compiler. Figure A.4.1 shows the results from
+measurements on the dhrystone benchmark.
+.G1
+frame invis left solid bot solid
+label left "run time" "(in \(*msec/dhrystone)"
+label bot "compile time (in sec)"
+coord x 0,21 y 0,610
+ticks left out from 0 to 600 by 200
+ticks bot out from 0 to 20 by 5
+"\(bu" at 3.5, 1000000/1700
+"ack w/o opt" ljust at 3.5 + 1, 1000000/1700
+"\(bu" at 2.8, 1000000/8770
+"ack with opt" below at 2.8 + 0.1, 1000000/8770
+"\(bu" at 16.0, 1000000/10434
+"ack -O4" above at 16.0, 1000000/10434
+"\(bu" at 2.3, 1000000/7270
+"\fIcc\fR" above at 2.3, 1000000/7270
+"\(bu" at 9.0, 1000000/12500
+"\fIcc -O4\fR" above at 9.0, 1000000/12500
+"\(bu" at 5.9, 1000000/15250
+"\fIcc -O\fR" below at 5.9, 1000000/15250
+.G2
+.ce 1
+\fIFigure A.4.1: overall performance on dhrystones.\fR
+.sp 1
+.LP
+Fortunately for us, dhrystones are not all there is. The following
+figure shows the same measurements as the previous one, except
+this time we took a benchmark that uses no subroutines: an implementation
+of Eratosthenes' sieve:
+.G1
+frame invis left solid bot solid
+label left "run time" "for one run" "(in sec)" left .6
+label bot "compile time (in sec)"
+coord x 0,11 y 0,21
+ticks bot out from 0 to 10 by 5
+ticks left out from 0 to 20 by 5
+"\(bu" at 2.5, 17.28
+"ack w/o opt" above at 2.5, 17.28
+"\(bu" at 1.6, 2.93
+"ack with opt" above at 1.6, 2.93
+"\(bu" at 9.4, 2.26
+"ack -O4" above at 9.4, 2.26
+"\(bu" at 1.5, 7.43
+"\fIcc\fR" above at 1.5, 7.43
+"\(bu" at 2.7, 2.02
+"\fIcc -O4\fR" ljust at 1.9, 1.2
+"\(bu" at 2.6, 2.10
+"\fIcc -O\fR" ljust at 3.1,2.5
+.G2
+.ce 1
+\fIFigure A.4.2: overall performance on Eratosthenes' sieve.\fR
+.sp 1
+.PP
+Although the above figures speak for themselves, a small comment
+may be in order. First, it is clear that our compiler is neither
+faster than \fIcc\fR, nor produces faster code than \fIcc -O4\fR. It should
+also be noted however, that we do produce better code than \fIcc\fR
+at only a very small additional cost.
+It is also worth noticing that push-pop optimization
+increases run-time speed as well as compile speed.
+The first seems rather obvious,
+since optimized code is
+faster code, but the increase in compile speed may come as a surprise.
+The main reason is that the \fIas\fR+\fIld\fR time depends largely on the
+amount of generated code, which in general
+depends on the efficiency of the code.
+Push-pop optimization removes a lot of useless instructions which
+would otherwise
+have found their way through to the assembler and the loader.
+Useless instructions inserted in an early stage in the compilation
+process will slow down every following stage, so elimination of useless
+instructions in an early stage, even when it requires a little computational
+overhead, can often be beneficial to the overall compilation speed.
+.bp
--- a/doc/sparc/B
+++ b/doc/sparc/B
@ -0,0 +1,128 @@
+.so init
+.SH
+B. IMPLEMENTATION
+.SH
+B.1. Excerpts from the non-optimized EM_table
+.PP
+Even though the non-optimized version of the EM_table is relatively
+straight-forward, examples have never hurt anybody.
+One of the simplest instructions is the \*(Siloc\*(So, which appears in
+our EM_table as follows:
+.DS
+\f6
+.TA 8 16 24 32 40 48 56 64
+C_loc	==>	"set	$1, T1";
+		"dec	4, SP";
+		"st	T1, [SP]".
+\f1
+.DE
+Just as \*(SiSP\*(So is an alias for \*(Si%l0\*(So, \*(SiT1\*(So is
+an alias for \*(Si%g1\*(So.
+A little more complex is the \*(Siadi\*(So which performs integer
+addition.
+.DS
+\f6
+C_adi	==>	"ld	[SP], T1";
+		"ld	[SP+4], T2";
+		"add	T1, T2, T3";
+		"st	T3, [SP+4]";
+		"inc	4, SP".
+\f1
+.DE
+We could go on with even more complex instructions, but since that would
+add little, the reader is referred to the implementation
+for more details.
+.SH
+B.2. Excerpts from the optimized EM_table
+.PP
+The optimized EM_table uses the cache primitives mentioned in chapter 4.
+This means that the \*(Siloc\*(So this time appears as
+.DS
+\f6
+C_loc	==>	push_const($1).
+\f1
+.DE
+The \*(Silol\*(So can now be written as
+.DS
+\f6
+C_lol	==>	push_reg(LB);
+		inc_tos($1);
+		push_const(4);
+		C_los(4).
+\f1
+.DE
+Due to the law of conservation of misery somebody has to do the dirty work.
+In this case, it is the \*(Silos\*(So. To show just a small part of
+the implementation of the \*(Silos\*(So:
+.DS
+\f6
+C_los	$1 == 4	==>
+		if (type_of_tos() == T_cst) {
+			arith size;
+			const_str_t n;
+
+			size= pop_const();
+			if (size <= 4) {
+				reg_t a;
+				reg_t b;
+				char *LD;
+
+				switch (size) {
+				case 1:	LD = "ldub"; break;
+				case 2:	LD = "lduh"; break;
+				case 4:	LD = "ld"; break;
+				default:	arg_error("C_los", size);
+				}
+				a = pop_reg_c13(n);
+				b = alloc_reg();
+				"$LD	[$a+$n], $b";
+				push_reg(b);
+				free_reg(a);
+			} else ...
+\f1
+.DE
+For the full details, the reader is again referred to the actual
+implementation. Just to show how other instructions are affected
+by the optimization we will show the implementation of the \*(Sitge\*(So
+instruction:
+.DS
+\f6
+C_tge	==>	{
+			reg_t a;
+			reg_t b;
+
+			a = pop_reg();
+			b = alloc_reg();
+			"	tst	$a";
+			"	bge,a	1f";
+			"	mov	1, $b";		/* delay slot */
+			"	set	0, $b";
+			"1:";
+			free_reg(a);
+			push_reg(b);
+		}.
+
+\f1
+.DE
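+.PP
+To give an impression of why deferring code generation pays off, the
+following minimal sketch in C models a compile-time stack on which constants
+stay symbolic until a register is really needed. It is not the backend's
+actual cache: all names carry a \*(Sisketch_\*(So prefix to mark them as
+hypothetical, the registers R1, R2, ... are virtual, and the emitted text
+is pseudo-assembly.

```c
#include <assert.h>
#include <stdio.h>

enum kind { K_CONST, K_REG };

struct elem {
    enum kind kind;
    long value;                 /* the constant, or a virtual register number */
};

#define SKETCH_MAX 16

static struct elem sketch_stack[SKETCH_MAX];
static int sketch_top;          /* number of cached elements */
static int sketch_count;        /* number of instructions emitted so far */
static int sketch_next_reg;     /* next free virtual register */

void sketch_reset(void)
{
    sketch_top = 0;
    sketch_count = 0;
    sketch_next_reg = 1;
}

int sketch_emitted(void)
{
    return sketch_count;
}

/* Pushing a constant or a register emits no code at all. */
void sketch_push_const(long c)
{
    assert(sketch_top < SKETCH_MAX);
    sketch_stack[sketch_top].kind = K_CONST;
    sketch_stack[sketch_top].value = c;
    sketch_top++;
}

void sketch_push_reg(int r)
{
    assert(sketch_top < SKETCH_MAX);
    sketch_stack[sketch_top].kind = K_REG;
    sketch_stack[sketch_top].value = r;
    sketch_top++;
}

/* Turn an element into a register, loading a constant only now. */
static int sketch_to_reg(struct elem e)
{
    int r;

    if (e.kind == K_REG)
        return (int)e.value;
    r = sketch_next_reg++;
    printf("\tset\t%ld, R%d\n", e.value, r);
    sketch_count++;
    return r;
}

int sketch_pop_reg(void)
{
    assert(sketch_top > 0);
    return sketch_to_reg(sketch_stack[--sketch_top]);
}

/* Integer addition: folded at compile time when both operands are constants. */
void sketch_add(void)
{
    struct elem a, b;
    int ra, rb, dst;

    assert(sketch_top >= 2);
    b = sketch_stack[--sketch_top];
    a = sketch_stack[--sketch_top];
    if (a.kind == K_CONST && b.kind == K_CONST) {
        sketch_push_const(a.value + b.value);
        return;
    }
    ra = sketch_to_reg(a);
    rb = sketch_to_reg(b);
    dst = sketch_next_reg++;
    printf("\tadd\tR%d, R%d, R%d\n", ra, rb, dst);
    sketch_count++;
    sketch_push_reg(dst);
}
```

+.PP
+In this sketch, adding two constants emits no instructions at all; a single
+\*(Siset\*(So appears only when the result is popped into a register.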
+.SH
+.bp
+CREDITS
+.PP
+In order of appearance:
+.TS
+center;
+r c l.
+Original idea	-	Dick Grune
+Design & implementation	-	Philip Homburg
+	-	Raymond Michiels
+Tutor	-	Dick Grune
+Assistant Tutor	-	Ceriel Jacobs
+Proofreading	-	Dick Grune
+	-	Hans van Eck
+.TE
+.SH
+REFERENCES
+.PP
+.[
+$LIST$
+.]
--- a/doc/sparc/Makefile
+++ b/doc/sparc/Makefile
@ -0,0 +1,10 @@
+# $Header$
+
+REFER=refer
+TBL=tbl
+TARGET=-Tlp
+PIC=pic
+GRAP=grap
+
+../sparc.doc:	refs title intro 1 2 3 4 5 A B init
+		$(REFER) -sA+T '-l\", ' -p refs title intro 1 2 3 4 5 A B | $(GRAP) | $(PIC) | $(TBL) | soelim > $@
--- a/doc/sparc/init
+++ b/doc/sparc/init
@ -0,0 +1,18 @@
+.nr PS 12
+.nr VS 14
+.\" .fp 6 AM
+.fp 6 CW
+.ds Si \f6\s-1
+.ds So \f1\s+1
+.ds OQ `\h'-1p'`
+.ds CQ '\h'-1p''
+.de UX
+.ie \\n(UX \s-1UNIX\s0\\$1
+.el \{\
+\s-1UNIX\s0\\$1\(dg
+.FS
+\(dg \s-1UNIX\s0 is a registered bell of AT&T Trademark Laboratories.
+.FE
+.nr UX 1
+.\}
+..
--- a/doc/sparc/intro
+++ b/doc/sparc/intro
@ -0,0 +1,23 @@
+.so init
+.hw de-vised
+.TL
+A fast backend for SPARC processors
+.AU
+Philip Homburg
+Raymond Michiels
+.AI
+Dept. of Mathematics and Computer Science
+Vrije Universiteit
+Amsterdam, The Netherlands
+.AB
+The language EM is an intermediate language for use in compiler
+construction.
+In this paper we describe the construction of a so-called fast backend
+which translates EM code to assembler for SPARC processors.
+.br
+Our construction deviates strongly from the usual procedure. We have
+devised and implemented a virtual stack with which it is possible to
+generate very acceptable code without much loss in compile time.
+.AE
+.PP
+.bp
--- a/doc/sparc/note_on_reg_wins
+++ b/doc/sparc/note_on_reg_wins
@ -0,0 +1,58 @@
+When developing a fast compiler for the Sun-4 series we have encountered
+rather strange behavior of the Sun kernel.
+
+The problem is that when you have lots of nested procedure calls (as
+is often the case in compilers and parsers), the registers fill up, which
+causes a kernel trap. The kernel will then write out some of the registers
+to memory to make room for another window. When you return from the nested
+procedure call, just the reverse happens: yet another kernel trap so the
+kernel can load the register from memory.
+
+Unfortunately the kernel only saves or loads a single window (= 16 registers)
+on each trap. This means that when you call a procedure recursively it causes
+a kernel trap on almost every invocation (except for the first few).
+
+To illustrate this consider the following little program:
+
+--------------- little program -------------
+f(i)	/* calls itself i times */
+int i;
+{
+  if (i)
+	f(i-1);
+}
+
+main(argc, argv)
+int argc;
+char *argv[];
+{
+  int i, j;
+
+  i = atoi(argv[1]);	/* # loops */
+  j = atoi(argv[2]);	/* depth */
+
+  while (i--)
+	f(j);
+}
+------------ end of little program -----------
+
+
+The performance decreases abruptly when the depth (j) becomes larger
+than 5. On a SPARC station we got the following results:
+
+	depth	run time (in seconds)
+
+	1	 0.5
+	2	 0.8
+	3	 1.0
+	4	 1.4	<- from here on it's +6 seconds for each
+	5	 7.6		step deeper.
+	6	13.9
+	7	19.9
+	8	26.3
+	9	32.9
+
+Things would be a lot better if, instead of just 1, the kernel
+saved or restored 4 windows (= 64 registers = 50% on our SPARC stations).
+
+	-Raymond.
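+
+The effect described in this note can be approximated with a small model
+in C. This is only a sketch under stated assumptions: the function name
+count_window_traps, the number of usable windows (capacity) and the number
+of windows moved per trap (per_trap) are all hypothetical parameters, and
+the model counts traps only; it does not predict run times.

```c
#include <assert.h>

/*
 * Counts overflow plus underflow traps for `loops` repetitions of a
 * call chain that goes `depth` calls deep and then returns again.
 *
 * capacity: number of register windows usable by the program (assumed)
 * per_trap: number of windows the kernel saves or restores per trap
 */
long count_window_traps(int capacity, int per_trap, int depth, long loops)
{
    long traps = 0;
    long l;
    int resident = 1;           /* windows held in registers (main's window) */
    int saved = 0;              /* windows written out to memory */
    int d;

    assert(capacity >= 2 && per_trap >= 1 && per_trap < capacity);
    assert(depth >= 0);

    for (l = 0; l < loops; l++) {
        for (d = 0; d < depth; d++) {           /* procedure calls */
            if (resident == capacity) {         /* overflow trap */
                resident -= per_trap;
                saved += per_trap;
                traps++;
            }
            resident++;
        }
        for (d = 0; d < depth; d++) {           /* procedure returns */
            resident--;
            if (resident == 0) {                /* underflow trap */
                int n = saved < per_trap ? saved : per_trap;

                assert(n > 0);                  /* the caller's window was saved */
                resident += n;
                saved -= n;
                traps++;
            }
        }
    }
    return traps;
}
```

+With an assumed capacity of 7 windows and one window moved per trap, a call
+chain of depth 9 costs 6 traps per loop in this model; moving 4 windows per
+trap brings that down to 2.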
--- a/doc/sparc/pics/.distr
+++ b/doc/sparc/pics/.distr
@ -0,0 +1,12 @@
+EM_stack.orig
+EM_stack.ours
+compile_bars
+mem_config
+perf
+perf.comp
+perf.d
+perf.dhry
+reg_layout
+run-time_bars
+run-time_bars.bup
+signal_stack
--- a/doc/sparc/pics/EM_stack.orig
+++ b/doc/sparc/pics/EM_stack.orig
@ -0,0 +1,34 @@
+.PS
+.ps -2
+.vs -2
+boxwid = 1.5;
+boxht = 0.24
+down;
+box "actual parameter n-1";
+box "." "." "." ht 0.6;
+box "actual parameter 0";
+move 0.3
+box "return status block";
+{arrow <- right with .w at last box.e; \
+box invis wid 0.3 "LB" }
+down
+move to 2nd last box.s
+move 0.1
+box "local variables"
+box "compiler temporaries"
+move 0.1
+box "register save block"
+move 0.1
+box "dynamic local generators"
+move 0.1
+box "operand"
+box "operand"
+move 0.1
+box "parameter m-1"
+box "." "." "." ht 0.6;
+box "parameter 0" with .n at last box .s
+{ arrow <- right with .w at last box.e; \
+box invis wid 0.3 "SP" }
+.ps +2
+.vs +2
+.PE
--- a/doc/sparc/pics/EM_stack.ours
+++ b/doc/sparc/pics/EM_stack.ours
@ -0,0 +1,106 @@
+.ps 10
+.vs 12
+.PS
+boxwid = 1.3
+boxht = 0.25
+down;
+box "floating point" "register dump area" ht 0.6
+box "tmp float store"
+box "register dump area" ht 0.6
+{ arrow <- right with .w at 3/4 <last box.e, last box.se>; \
+box invis wid 0.3 "%fp" }
+move .1
+box dotted "gap"
+{ arrow <- right with .w at last box.e; \
+box invis wid 0.3 "%LB" }
+move .1
+box "locals"
+box "actual parameter n-1";
+box "." "." "." ht 0.6;
+box "actual parameter 0";
+{ arrow <- right with .w at last box.e; \
+box invis wid 0.3 "%SP" }
+move 0.1
+box "large gap" "(>64kb)" ht 1.0
+box "register dump area" ht 0.6
+{ arrow <- right with .w at 3/4 <last box.e, last box.se>; \
+box invis wid 0.3 "%sp" }
+move 0.2
+box invis "\\s+2just before call\\s0"
+move 1
+box dotted "gap"
+box invis "0 or 4 bytes" "for stack alignment" with .w at last box.e
+box invis height .7 "when gap is 0 bytes," "%fp == %LB" with .n at 2nd last box.s
+.PF
+.PS
+down;
+move to 2.4,0
+box "floating point" "register dump area" ht 0.6
+box "tmp float store"
+box "register dump area" ht 0.6
+{ arrow <- right with .w at 3/4 <last box.e, last box.se>; \
+box invis wid 0.3 "%fp" }
+move .1
+box dotted "gap"
+{ arrow <- right with .w at last box.e; \
+box invis wid 0.3 "%LB" }
+move .1
+box "locals"
+box "actual parameter n-1";
+box "." "." "." ht 0.6;
+box "actual parameter 0";
+{ arrow <- right with .w at last box.e; \
+box invis wid 0.3 "%SP" }
+move .1
+box dotted "gap"
+move .4
+box "floating point" "register dump area" ht 0.6
+box "tmp float store"
+box "register dump area" ht 0.6
+{ arrow <- right with .w at 3/4 <last box.e, last box.se>; \
+box invis wid 0.3 "%sp" }
+move 0.2
+box invis "\\s+2'during' call\\s0"
+.PF
+.PS
+down;
+move to 4.8,0
+box "floating point" "register dump area" ht 0.6
+box "tmp float store"
+box "register dump area" ht 0.6
+move .1
+box dotted "gap"
+move .1
+box "locals"
+box "actual parameter n-1";
+box "." "." "." ht 0.6;
+box "actual parameter 0";
+move .1
+box dotted "gap"
+move .4
+box "floating point" "register dump area" ht 0.6
+box "tmp float store"
+box "register dump area" ht 0.6
+{ arrow <- right with .w at 3/4 <last box.e, last box.se>; \
+box invis wid 0.3 "%fp" }
+move .1
+box dotted "gap"
+{ arrow <- right with .w at last box.e; \
+box invis wid 0.3 "%LB" }
+move .1
+box "locals"
+box "actual parameter n-1";
+box "." "." "." ht 0.6;
+box "actual parameter 0";
+{ arrow <- right with .w at last box.e; \
+box invis wid 0.3 "%SP" }
+move 0.1
+box "large gap" "(>64kb)" ht 1.0
+box "register dump area" ht 0.6
+{ arrow <- right with .w at 3/4 <last box.e, last box.se>; \
+box invis wid 0.3 "%sp" }
+move 0.2
+box invis "\\s+2after call\\s0"
+.PF
+.ps 12
+.vs 14
--- a/doc/sparc/pics/compile_bars
+++ b/doc/sparc/pics/compile_bars
@ -0,0 +1,49 @@
+.PS
+boxht = 0.5
+boxwid = 1
+moveht = 0.65
+down;
+{
+right;
+box invis "ACK" "w/o" "opt"
+box "cem" "0.7" wid 0.7
+box "opt" "0.4" wid 0.4
+box "be" "1.1" wid 1.1
+box "as" "1.4" wid 1.4
+box "ld" "0.4" wid 0.4
+box invis "4.0" wid 0.5
+}
+move
+{
+right;
+box invis "ACK" "with" "opt"
+box "cem" "0.7" wid 0.7
+box "opt" "0.4" wid 0.4
+box "be" "0.6" wid 0.6
+box "as" "0.7" wid 0.7
+box "ld" "0.4" wid 0.4
+box invis "2.8" wid 0.5
+}
+move
+{
+right;
+box invis "\fIcc\fR"
+box "cpp" "0.2" wid 0.2
+box "ccom" "1.0" wid 1.0
+box "as" "0.7" wid 0.7
+box "ld" "0.4" wid 0.4
+box invis "2.3" wid 0.5
+}
+move
+{
+right;
+box invis "\fIcc -O4\fR"
+box "cpp" "0.2" wid 0.2
+box "ccom" "1.0" wid 1.0
+box "iropt" "5.0 (not to scale!)" wid 1.5
+box "cg" "0.7" wid 0.7
+box "as" "1.7" wid 1.7
+box "ld" "0.4" wid 0.4
+box invis "9.0" wid 0.5
+}
+.PE
--- a/doc/sparc/pics/mem_config
+++ b/doc/sparc/pics/mem_config
@ -0,0 +1,34 @@
+.PS
+boxwid = 1.3
+down
+[
+right
+[
+down;
+box "stack" ht .6
+box "free" ht 1
+box "heap" ht .3
+box "text" ht .5
+]
+move 1
+[
+down;
+box "\s-4SPARC stack\s+4" ht .2
+box "\s-4EM stack\s+4" ht .1
+box "\s-4SPARC stack\s+4" ht .1
+box "\s-4EM stack\s+4" ht .1
+box "\s-4free\s+4" ht .2
+box "\s-4SPARC stack\s+4" ht .1
+box "free" ht .8
+box "heap" ht .3
+box "text" ht .5
+]
+]
+move .3
+[
+right
+box invis "regular \(UX memory layout"
+move 1
+box invis "memory layout for EM"
+]
+.PF
--- a/doc/sparc/pics/perf
+++ b/doc/sparc/pics/perf
@ -0,0 +1,12 @@
+.G1
+frame invis left solid bot solid
+label left "run time" "(log scale)" left .5
+label bot "compile time (log scale)"
+coord x 0.1,10 log x y 1000,20000 log y
+ticks left out at 2000,5000,10000,20000
+ticks bot out at 0.1 0.3 1.0 3.0 10
+copy "perf.d" thru X
+  "\(bu" at $1, $2
+  "$3" rjust at $1, $2
+X
+.G2
--- a/doc/sparc/pics/perf.comp
+++ b/doc/sparc/pics/perf.comp
@ -0,0 +1,7 @@
+in-line in ../A
+
+2.5 17.28 ack w/o opt
+1.6 2.93 ack with opt
+9.4 2.26 ack -O4
+1.5 7.43 \fIcc\fR
+2.7 2.02 \fIcc -O4\fR
--- a/doc/sparc/pics/perf.d
+++ b/doc/sparc/pics/perf.d
@ -0,0 +1,4 @@
+1.0 1700 ack w/o opt
+1.9 8000 ack with opt
+1.6 8000 \fIcc\fR
+7 18000 \fIcc -O4\fR
--- a/doc/sparc/pics/perf.dhry
+++ b/doc/sparc/pics/perf.dhry
@ -0,0 +1,7 @@
+in-line in ../A
+
+3.5 1700 ack w/o opt
+2.8 8770 ack with opt
+16.0 10434 ack -O4
+2.3 7270 \fIcc\fR
+9.0 12500 \fIcc -O4\fR
--- a/doc/sparc/pics/reg_layout
+++ b/doc/sparc/pics/reg_layout
@ -0,0 +1,24 @@
+.nr PS 12
+.nr VS 14
+.PP
+.TS
+allbox;
+l l l l
+l2f6 l l2f6 l.
+g0	0	l0	EM_SP
+g1	temporary 1	l1	EM_LB
+g2	temporary 2	l2
+g3	temporary 3	l3	reserved
+g4	64k..1M	l4	reserved
+g5	temporary 4	l5	reserved
+g6	line number	l6	reserved
+g7	file name	l7	reserved
+o0	param 1	i0
+o1	param 2	i1
+o2	param 3	i2
+o3	param 4	i3
+o4	RETL_LD	i4	RETL_ST
+o5	RETH_LD	i5	RETH_ST
+sp	stack pointer	fp	frame pointer
+o7	xxx 	i7	return address
+.TE
--- a/doc/sparc/pics/run-time_bars
+++ b/doc/sparc/pics/run-time_bars
@ -0,0 +1,101 @@
+.PS
+boxht = 0.5
+boxwid = 1
+moveht = 1
+down;
+{
+right;
+box invis "ACK" "w/o" "opt."
+move
+[
+down;
+boxht = 0.25
+box wid 4.5
+"Sieve" ljust at last box.w + 0.1,-0.02
+"10(!)" ljust at last box.e + 0.1,-0.02
+box wid 4.5 with .nw at last box.sw
+"Dhrystones" ljust at last box.w + 0.1,-0.02
+"10(!)" ljust at last box.e + 0.1,-0.02
+] with .w at last box.e
+}
+move
+{
+right;
+box invis "ACK" "with" "our" "opt."
+move
+[
+down;
+boxht = 0.25
+box wid 1.4
+"Sieve" ljust at last box.w + 0.1,-0.02
+"1.4" ljust at last box.e + 0.1,-0.02
+box wid 1.9 with .nw at last box.sw
+"Dhrystones" ljust at last box.w + 0.1,-0.02
+"1.9" ljust at last box.e + 0.1,-0.02
+] with .w at last box.e
+}
+move
+{
+right;
+box invis "ACK" "-O4"
+move
+[
+down;
+boxht = 0.25
+box wid 1.1
+"Sieve" ljust at last box.w + 0.1,-0.02
+"1.1" ljust at last box.e + 0.1,-0.02
+box wid 1.6 with .nw at last box.sw
+"Dhrystones" ljust at last box.w + 0.1,-0.02
+"1.6" ljust at last box.e + 0.1,-0.02
+] with .w at last box.e
+}
+move
+{
+right;
+box invis "Sun's" "compiler" "w/o opt."
+move
+[
+down;
+boxht = 0.25
+box wid 3.7
+"Sieve" ljust at last box.w + 0.1,-0.02
+"3.7" ljust at last box.e + 0.1,-0.02
+box wid 2.2 with .nw at last box.sw
+"Dhrystones" ljust at last box.w + 0.1,-0.02
+"2.2" ljust at last box.e + 0.1,-0.02
+] with .w at last box.e
+}
+move
+{
+right;
+box invis "Sun's" "compiler" "-O"
+move
+[
+down;
+boxht = 0.25
+box wid 1.1
+"Sieve" ljust at last box.w + 0.1,-0.02
+"1.1" ljust at last box.e + 0.1,-0.02
+box wid 0.8 with .nw at last box.sw
+"Dhryst." ljust at last box.w + 0.1,-0.02
+"0.8!" ljust at last box.e + 0.1,-0.02
+] with .w at last box.e
+}
+move
+{
+right;
+box invis "Sun's" "compiler" "-O4"
+move
+[
+down;
+boxht = 0.25
+box wid 1.0
+"Sieve" ljust at last box.w + 0.1,-0.02
+"1.0" ljust at last box.e + 0.1,-0.02
+box wid 1.0 with .nw at last box.sw
+"Dhrystones" ljust at last box.w + 0.1,-0.02
+"1.0" ljust at last box.e + 0.1,-0.02
+] with .w at last box.e
+}
+.PE
--- a/doc/sparc/pics/run-time_bars.bup
+++ b/doc/sparc/pics/run-time_bars.bup
@ -0,0 +1,100 @@
+.PS
+boxht = 0.5
+boxwid = 1
+moveht = 1
+down;
+{
+right;
+box invis "ACK" "w/o" "opt"
+move
+[
+down;
+boxht = 0.25
+box wid 4.5
+"C (arithmetic)" ljust at last box.w + 0.1,-0.02
+"10(!)" ljust at last box.e + 0.1,-0.02
+box wid 4.5 with .nw at last box.sw
+"C (dhrystones)" ljust at last box.w + 0.1,-0.02
+"10(!)" ljust at last box.e + 0.1,-0.02
+box wid 4.5 with .nw at last box.sw
+"Modula-2" ljust at last box.w + 0.1,-0.02
+"8(!)" ljust at last box.e + 0.1,-0.02
+] with .w at last box.e
+}
+move
+{
+right;
+box invis "ACK" "with" "peep-hole" "opt"
+move
+[
+down;
+boxht = 0.25
+box wid 1.4
+"C (arithmetic)" ljust at last box.w + 0.1,-0.02
+"1.4" ljust at last box.e + 0.1,-0.02
+box wid 1.9 with .nw at last box.sw
+"C (dhrystones)" ljust at last box.w + 0.1,-0.02
+"1.9" ljust at last box.e + 0.1,-0.02
+box wid 2.5 with .nw at last box.sw
+"Modula-2" ljust at last box.w + 0.1,-0.02
+"2.5" ljust at last box.e + 0.1,-0.02
+] with .w at last box.e
+}
+move
+{
+right;
+box invis "ACK" "-O4"
+move
+[
+down;
+boxht = 0.25
+box wid 1.1
+"C (arithmetic)" ljust at last box.w + 0.1,-0.02
+"1.1" ljust at last box.e + 0.1,-0.02
+box wid 1.6 with .nw at last box.sw
+"C (dhrystones)" ljust at last box.w + 0.1,-0.02
+"1.6" ljust at last box.e + 0.1,-0.02
+box wid 2.5 with .nw at last box.sw
+"Modula-2" ljust at last box.w + 0.1,-0.02
+"2.5" ljust at last box.e + 0.1,-0.02
+] with .w at last box.e
+}
+move
+{
+right;
+box invis "Sun's" "compiler" "w/o opt."
+move
+[
+down;
+boxht = 0.25
+box wid 3.7
+"C (arithmetic)" ljust at last box.w + 0.1,-0.02
+"3.7" ljust at last box.e + 0.1,-0.02
+box wid 2.2 with .nw at last box.sw
+"C (dhrystones)" ljust at last box.w + 0.1,-0.02
+"2.2" ljust at last box.e + 0.1,-0.02
+box wid 1.8 with .nw at last box.sw
+"Modula-2" ljust at last box.w + 0.1,-0.02
+"1.8" ljust at last box.e + 0.1,-0.02
+] with .w at last box.e
+}
+move
+{
+right;
+box invis "Sun's" "compiler" "-O4"
+move
+[
+down;
+boxht = 0.25
+box wid 1.0
+"C (arith.)" ljust at last box.w + 0.1,-0.02
+"1.0" ljust at last box.e + 0.1,-0.02
+box wid 1.0 with .nw at last box.sw
+"C (dhryst.)" ljust at last box.w + 0.1,-0.02
+"1.0" ljust at last box.e + 0.1,-0.02
+box wid 1.0 with .nw at last box.sw
+"Modula-2" ljust at last box.w + 0.1,-0.02
+"1.0" ljust at last box.e + 0.1,-0.02
+] with .w at last box.e
+}
+.PE
--- a/doc/sparc/pics/signal_stack
+++ b/doc/sparc/pics/signal_stack
@ -0,0 +1,42 @@
+.PS
+boxwid = 1.3
+down
+[
+right
+[
+down;
+box "\s-4SPARC stack\s+4" ht .2
+box "\s-4EM stack\s+4" ht .1
+box "\s-4SPARC stack\s+4" ht .1
+box "\s-4EM stack\s+4" ht .1
+box "\s-4free\s+4" ht .2
+box "\s-4SPARC stack\s+4" ht .1
+box "free" ht .8
+box "heap" ht .3
+box "text" ht .5
+]
+move 1
+[
+down;
+box "\s-4SPARC stack\s+4" ht .2
+box "\s-4EM stack\s+4" ht .1
+box "\s-4SPARC stack\s+4" ht .1
+box "\s-4EM stack\s+4" ht .1
+box "\s-4free\s+4" ht .2
+box "\s-4SPARC stack\s+4" ht .1
+box "\s-4EM stack\s+4" ht .1
+box "\s-4free\s+4" ht .2
+box "\s-4SPARC stack\s+4" ht .1
+box "free" ht .4
+box "heap" ht .3
+box "text" ht .5
+]
+]
+move .3
+[
+right
+box invis "before signal"
+move 1
+box invis "during (1st) signal"
+]
+.PF
--- a/doc/sparc/printP4P
+++ b/doc/sparc/printP4P
@ -0,0 +1,31 @@
+echo $0
+case $1 in
+1 )
+	CMD="cat"
+;;
+2 )
+	CMD="cat"
+;;
+3 )
+	CMD="cat"
+;;
+4 )
+	CMD="pic | tbl"
+;;
+5 )
+	CMD="tbl"
+;;
+A )
+	CMD="grap | pic"
+;;
+B )
+	CMD="tbl"
+;;
+esac
+echo $0
+if [ $0 = printP4P ]
+then
+  refer -sA+T '-l\", ' -p refs $1 | eval $CMD | troff -ms -Tp4p | dip -Tp4p -Pp4p
+else
+  xtroff -full -geom 665x883+566+0 -command "refer -sA+T '-l\", ' -p refs $1 | $CMD | troff -ms -Tp4p"
+fi
--- a/doc/sparc/refs
+++ b/doc/sparc/refs
@ -0,0 +1,185 @@
+%T The design of very fast portable compilers
+%A A.S. Tanenbaum
+%A M.F. Kaashoek
+%A K.G. Langendoen
+%A C.J.H. Jacobs
+%J SIGPLAN Notices
+%V 24
+%N 11
+%P 125-131
+%D November 1989
+
+%T A Programmer-friendly LL(1) Parser Generator
+%A D. Grune
+%A C.J.H. Jacobs
+%J Software \- Practice and Experience
+%V 18
+%N 1
+%P 29-38
+%D January 1988
+
+%T The Code Expander Generator
+%A Frans Kaashoek
+%A Koen Langendoen
+%R IM-9
+%I Vrije Universiteit, Amsterdam
+%D November 1987
+
+%T The ACK Pascal Compiler
+%A Aad Geudeke
+%A Frans Hofmeester
+%R IM-8
+%I Vrije Universiteit, Amsterdam
+%D November 1987
+
+%T The EM-interpreter
+%A Eddo de Groot
+%A Leo van den Berge
+%R IM-7
+%I Vrije Universiteit, Amsterdam
+%D June 1987
+
+%T A set of multi\-process primitives for stack based machines
+%A K. Bot
+%A E. Scheffer
+%R IR-122
+%I Vrije Universiteit, Amsterdam
+%D December 1986
+
+%T An Occam Compiler
+%A K. Bot
+%A E. Scheffer
+%R IM-6
+%I Vrije Universiteit, Amsterdam
+%D December 1986
+
+%T Language- and Machine-independent Global Optimization on Intermediate Code
+%A H.E. Bal
+%A A.S. Tanenbaum
+%J Computer Languages
+%V 11
+%N 2
+%P 105-121
+%D April 1986
+
+%T The ACK Target Optimizer
+%A H.E. Bal
+%R IR-107
+%D 1985
+%I Vrije Universiteit, Amsterdam
+
+%T Some Topics in Parser Generation
+%A C.J.H. Jacobs
+%R IR-105
+%D October 1985
+%I Vrije Universiteit, Amsterdam
+
+%T The CEM compiler
+%A E.H. Baalbergen
+%A D. Grune
+%A M. Waage
+%R IM-4
+%I Vrije Universiteit, Amsterdam
+%D 1985
+
+%T The Design and Implementation of the EM Global Optimizer
+%A H.E. Bal
+%I Vrije Universiteit, Amsterdam
+%R IR-99
+%D March 1985
+
+%T Does anybody out there want to write HALF of a compiler?
+%A A.S. Tanenbaum
+%A E.G. Keizer
+%A H. van Staveren
+%J SIGPLAN Notices
+%V 19
+%N 8
+%P 106-108
+%D August 1984
+
+%T Amsterdam Compiler Kit documentation
+%A A.S. Tanenbaum et al.
+%I Vrije Universiteit, Amsterdam
+%R IR-90
+%D June 1984
+
+%T A Practical Toolkit for Making Portable Compilers
+%A A. S. Tanenbaum
+%A H. van Staveren
+%A E. G. Keizer
+%A J. W. Stevenson
+%J Communications of the ACM
+%V 26
+%N 9
+%P 654-660
+%D September 1983
+
+%T Description of a Machine Architecture for use with Block Structured
+Languages
+%A A. S. Tanenbaum
+%A H. van Staveren
+%A E. G. Keizer
+%A J. W. Stevenson
+%R IR-81
+%D August 1983
+%I Vrije Universiteit, Amsterdam
+
+%T A Unix Toolkit for Making Portable Compilers
+%A A.S. Tanenbaum
+%A H. van Staveren
+%A E.G. Keizer
+%A J.W. Stevenson
+%J Proceedings USENIX conf.
+%C Toronto, Canada
+%V 26
+%D July 1983
+%P 255-261
+
+%T Using Peephole Optimization on Intermediate Code
+%A A.S. Tanenbaum
+%A J.M. van Staveren
+%A J.W. Stevenson
+%J TOPLAS
+%V 4
+%N 1
+%P 21-36
+%D January 1982
+
+%T EM-1 Compiler
+%A A.S. Tanenbaum
+%J Pascal News
+%D September 1981
+%P 4-38
+
+%T A portable compiler for the Proposed ISO Standard Pascal Language
+%A A.S. Tanenbaum
+%A J.W. Stevenson
+%A H. van Staveren
+%J SIGPLAN Notices
+%V 15
+%N 10
+%D 1980
+
+%T Implications of Structured Programming for Machine Architecture
+%A A.S. Tanenbaum
+%J Communications of the ACM
+%V 21
+%N 3
+%P 237-246
+%D March 1978
+
+%T The table driven code generator from the Amsterdam Compiler Kit (Second
+revised edition)
+%A H. van Staveren
+%I Vrije Universiteit, Amsterdam
+%R on-line internal ACK documentation
+%D early 1985
+
+%T Dhrystone Benchmark: Rationale for Version 2 and Measurement Rules
+%A R.P. Weicker
+%J SIGPLAN Notices
+%V 23
+%N 8
+%D August 1988
+%P 49-62
--- a/doc/sparc/timing
+++ b/doc/sparc/timing
@ -0,0 +1,22 @@
+			DHRYSTONES V2.0
+
+		cc	cc -O4	cc -O	fccO	fccCE	ack	ack -O4
+compile time:
+	real	4.0	12.0		10.0	6.4	8.0	31.0
+	user	1.6	7.3	4.1	1.9	1.8	2.0	9.3
+	sys	0.9	2.1	1.8	2.5	1.5	2.0	7.7
+
+run time:	7263	16250	15250	4730	3430	8474	10434
+(stones/sec)
+
+			SIEVE
+
+		cc	cc -O4	fccO	fccCE	ack	ack -O4
+compile time:
+	real	2.4	4.4	x	3.3	6.4	17.0
+	user	0.8	1.6	x	0.7	0.7	3.2
+	sys	0.7	1.0	x	0.8	1.3	6.2
+
+run time:	7.43	2.02	x	12.18	2.93	2.26
+
+All ack-derived compilers are shell script driven
--- a/doc/sparc/title
+++ b/doc/sparc/title
@ -0,0 +1,15 @@
+.so init
+.TL
+.sp 1.2c
+A fast backend for SPARC processors
+.AU
+Philip Homburg
+Raymond Michiels
+.AI
+Dept. of Mathematics and Computer Science
+Vrije Universiteit
+Amsterdam, The Netherlands
+.PP
+.sp 1i
+Afstudeerverslag, 20 augustus 1990
+.bp