This commit is contained in:
ceriel 1991-09-27 16:19:24 +00:00
parent 63c9fea5c2
commit fb51183da2
29 changed files with 2085 additions and 0 deletions

14
doc/sparc/.distr Normal file

@ -0,0 +1,14 @@
1
2
3
4
5
A
B
init
intro
note_on_reg_wins
refs
timing
title
Makefile

53
doc/sparc/1 Normal file

@ -0,0 +1,53 @@
.so init
.NH
INTRODUCTION
.NH 2
Why an EM backend for SPARC processors?
.PP
With the introduction of SPARC-based computers like the Sun-4, a
whole new range of fast computers became readily available to the general
public. The power of large mainframes had been captured into a small
desk-top computer at only a fraction of the cost.
.PP
In the old days, a new computer used to be very hard to integrate into
an existing environment, but thanks to standardization in the software world,
incompatibility in hardware no longer implies incompatibility in software.
Programs written for computer A can often be run on computer B
without major modifications. Unfortunately, this is not true for all software.
.PP
There will always be programs that rely on the specific
hardware of a certain computer for many different reasons. They
can be categorized as:
.IP -
poorly written programs
.IP -
programs to directly control hardware (device drivers)
.IP -
code that requires efficiency (time-critical I/O drivers)
.IP -
programs to generate code to run on the hardware (compilers)
.LP
This project for instance, the design and implementation of an EM backend
for SPARC processors, comes in the last category.
.PP
We have designed and implemented an algorithm to convert EM programs to code
that will run directly on the SPARC hardware. Henceforth, both the algorithm
and the implementation will be referred to as the EM-to-SPARC backend,
or simply: the backend.
.NH 2
Why has nobody done this before?
.PP
Since EM was designed around 1981 and even SPARC has been around for some
years now, one may wonder why nobody has ever written an EM to SPARC backend
before. The reason is twofold. In the first place, there are some
non-trivial problems to be solved in the design phase, and secondly,
the SPARC-design combined with the lack of documentation, would surely
cost a lot of blood, sweat and tears. The absence of
clues to any of the design problems, combined with the \(em at first
glance \(em inhuman
SPARC instruction set did not make this a very attractive project.
.PP
On the other hand, these were exactly the reasons which made us take on
this particular project: it would require design skills, as well as some
hard work; a golden combination for a successful project.
.bp

109
doc/sparc/2 Normal file

@ -0,0 +1,109 @@
.so init
.nr H1 1
.NH
CLOSE-UP LOOK
.NH 2
What is EM?
.PP
As the abstract of the IR-81 report on EM
.[ [
description of a machine architecture
.]]
says: \*(OQEM is a family
of intermediate languages designed for producing portable compilers.\*(CQ
Because EM is to be used with a wide range of languages and processors,
the instruction set is kept simple enough to allow easy translation to,
or interpretation on, almost any processor. Yet it is also powerful enough
to accommodate easy translation from almost any block-structured language.
.PP
Even though EM was designed in the early 1980s, it
is based on
.\" already shows strong signs of being influenced by
the (then innovative) RISC architecture. All instructions
have 0 or 1 operands, there are no fancy addressing modes as in the
68020's\*(Si move.w a3(_array,d3.w*2), -(sp)\*(So, and no explicit registers,
although instructions for higher-level languages,
such as array operations, multiway branches (case) and
floating point operations, are provided.
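.PP
As a small illustration (a sketch of our own, using only instructions that
appear elsewhere in this document), the C statement \*(Sia = a + 1;\*(So,
with \*(Sia\*(So a local variable at offset -4, could be translated to:
.DS
.TS
;
l1f6 lf6 l.
lol	-4	! push local variable a
loc	1	! push the constant 1
adi	4	! add two 4-byte integers
stl	-4	! pop the sum back into a
.TE
.DE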
.PP
To fully understand the discussion in the following chapters,
the reader should at least have some knowledge of EM.
.NH 2
What is SPARC?
.PP
According to Sun's RISC tutorial: \*(OQSun Microsystems has designed a RISC
architecture, called SPARC, and has implemented that architecture with
the Sun-4 family of supercomputing workstations and servers. SPARC stands
for Scalable Processor ARChitecture, emphasizing its applicability to
large as well as small machines.\*(CQ
.PP
In sharp contrast to EM, SPARC does have
explicit registers (31 integer and 32 floating point, all of which
are 32 bits wide) and
does not support any high level language operations: it does not even have
multiplication or division instructions. Because the SPARC design is
very straightforward, all instructions could be hard-coded (no microcode
involved) to
provide extremely high performance. All register-to-register operations
require exactly one clock cycle, and all register-to-memory and
memory-to-register operations require two clock cycles, one to retrieve
the instruction and one to access external memory. At a clock speed of
over 20 MHz this means that you can achieve well over 10 VAX MIPS:
more than 4 times the speed of the 15 MHz 68020 used in the Sun-3/50.
.PP
As before, the reader should also have some general knowledge about
the SPARC processor to be able to understand the following chapters.
.NH 2
What exactly is a (fast) backend?
.PP
To put it in the simplest of ways: a (fast) backend is a set of routines to
translate EM code to code that will run 'on the metal' (for example the SPARC
processor). The distinction between full-fledged backends (code generators)
.[ [
The table driven code generator
.]]
and fast backends (code expanders)
.[ [
The Code Expander Generator
.]]
is related to
the compilation-time vs. run-time trade-off. Code generators generate
efficient code; code expanders generate code very efficiently.
For details about code expanders see also
.[ [
The design of very fast portable compilers
.]].
.PP
The reasons for us to implement a code expander are numerous. The first reason to
implement a code expander, rather than a code generator, was that implementing a
code expander would be hard enough already; code generators only bring
more problems, and there were already enough problems to be solved. Secondly,
we knew we would never be able to compete with native SPARC compilers due
to loss of information in the frontends (see also chapter 5). By implementing
a code expander we might be able to outrun the existing compilers on a
completely different terrain: compile speed.
.PP
The third 'reason' to implement a code expander lies a little deeper and was
not discovered until we had actually started the implementation... It was only
then that we found out that for certain architectures, such as the SPARC,
the idea behind the code-expander is not necessarily inferior to that
behind a code-generator. It seems that for highly orthogonal instruction
sets it is possible to generate near-optimal code without resorting to a
full code generator. We have to say, however, that this is only true for our
optimized version of the code-expander. With the original code-expander
it would not have been possible to generate near-optimal code for the
SPARC processor.
.NH 2
So, what are the main differences between EM and SPARC?
.PP
The main
difference between EM and SPARC is the stack versus register orientation.
The other differences, such as the presence of high level language
operations in EM, can easily be overcome by subroutines
or small pieces of in-line SPARC code (an example follows below).
The design-part of this project mostly concentrates on
building a bridge between EM's stack and SPARC's registers.
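.PP
To illustrate the subroutine approach: the SPARC has no multiply
instruction, so an EM multiply could be translated into a call to a
run-time support routine (a sketch; \*(Si.mul\*(So is the conventional
software-multiply routine on Sun systems, taking its operands in
\*(Si%o0\*(So and \*(Si%o1\*(So and returning the product in \*(Si%o0\*(So):
.DS
.TS
;
l2f6 lf6 l.
mov %g1, %o0	! first operand
mov %g2, %o1	! second operand
call .mul	! software multiply
nop	! delay slot
mov %o0, %g3	! pick up the product
.TE
.DE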
.PP
In the next chapter we will make a list of all our design problems which
will then be discussed in chapter 4.
.bp

82
doc/sparc/3 Normal file

@ -0,0 +1,82 @@
.so init
.nr H1 2
.NH
PROBLEMS
.NH 2
Maintain SPARC speed
.PP
If we want to generate SPARC code, we should try to generate efficient code
as fast as possible. It would be quite embarrassing to find out that the
same program would run faster on a Motorola 68020 than on a SPARC processor,
when both operate at the same clock frequency.
Looking at some code generated by Sun's C-compiler and optimizing assembler,
we can spot a few remarkable characteristics of the generated SPARC code:
.IP -
There are almost no memory references.
.IP -
Parameters to functions are passed through registers.
.IP -
Almost all delay slots\(dg
.FS
\(dg For details about delay slots see the SPARC Architecture Manual, chapter 4, pp. 42-48
.FE
are filled in by the assembler.
.LP
If we want to generate efficient code, we should at least try to
reduce the number of memory references and use registers wherever we can.
Since EM is stack-oriented it references its stack for every operation so
this will not be an easy task; a suitable solution will however be given in
the next chapter.
.NH 2
Increase compilation speed
.PP
Because we will implement a code expander (fast backend) we should keep
a close eye on efficiency; if we cannot beat regular compilers on producing
efficient code we will try to beat them on fast code generation.
The usual trick to achieve fast compilation is to pack the frontend,
optimizer, code-generator and
assembler all into a single large binary to reduce the overhead of
reading and writing temporary files. Unfortunately, due to the
SPARC instruction set, the required relocation information is slightly bizarre
and cannot be represented with the present primitives.
This means that it will not be possible to generate the required output
format directly from our backend.
.PP
There are three solutions here: generate assembler code and let an
existing assembler generate the required object (\fI.o\fR) files;
create our own primitives that can handle the SPARC relocation format; or
avoid the addressing modes that require the bizarre relocation.
Because we have enough on our hands already we will
let the existing assembler deal with generating object files.
.NH 2
Convert stack to register operations
.PP
As we wrote in the previous chapter, for RISC machines a code expander can
produce almost as efficient code as a code generator. The fact that this is
true for stack-oriented RISC processors is rather obvious. The problem we
face, however, is that the SPARC processor is register-oriented instead of
stack-oriented. In the next chapter we will give a suitable solution to
convert most stack accesses to register accesses.
.NH 2
Miscellaneous
.PP
Besides performance and \fI.o\fR-compatibility there are some other
peculiarities of the SPARC processor and Sun's C-compiler (henceforth
simply called \fIcc\fR).
.PP
For some reason, the SPARC stack pointer requires alignment
on 8 bytes, so you cannot push a 4-byte integer on the stack
and then \*(Sisub 4, %sp\*(So\(dd.
.FS
\(dd For more information about SPARC assembler see the Sun-4 Assembly
Language Reference Manual
.FE
This too will be discussed in the next chapter, where we will take a
more in-depth look into this problem and also discuss a couple of
possible solutions.
.PP
Another thing is that \fIcc\fR usually passes the first six parameters of a
function-call through registers. To be \fI.o\fR-compatible we would have to
pass the first six parameters of each function call through registers as well.
Exactly why this is not feasible will also be discussed in the next chapter.
.bp

468
doc/sparc/4 Normal file

@ -0,0 +1,468 @@
.so init
.hw data-structures
.nr H1 3
.NH
SOLUTIONS
.NH 2
Maintaining SPARC speed
.PP
In chapter 3 we wrote:
.sp 0.3
.nf
>If we want to generate efficient code, we should at least try to reduce the number of
>memory references and use registers wherever we can.
.fi
.sp 0.3
In this chapter we will devise a strategy to swiftly generate acceptable
code by using push-pop optimization.
Note that this is not the push-pop
optimization already available in the EM-kit, since that is only present
in the assembler-to-binary part which we do not use
.[ [
The Code Expander Generator
.]].
Our push-pop optimization
works more like the fake-stack described in
.[ [
The table driven code generator
.]].
.NH 3
Ad-hoc optimization
.PP
Before getting involved in any optimization let's have a look at some
code generated with a straightforward EM to SPARC conversion of the
C statement \*(Sif(a[i]);\*(So. Note that \*(Si%SP\*(So is an alias
for a general purpose
register and acts as the EM stack pointer. It has nothing to do with
\*(Si%sp\*(So \(em the SPARC stack pointer.
Analogously, \*(Si%LB\*(So is EM's local base pointer.
.br
.IP
.HS
.TS
;
l s l s l
l1f6 lf6 l2f6 lf6 l.
EM code SPARC code Comment
lae _a set _a, %g1 ! load address of external _a
dec 4, %SP
st %g1, [%SP]
lol -4 set -4, %g1 ! load local -4 (i)
ld [%g1+%LB], %g2
dec 4, %SP
st %g2, [%SP]
loc 2 set 2, %g1 ! load constant 2
dec 4, %SP
st %g1, [%SP]
sli 4 ld [%SP], %g1 ! pop shift count
ld [%SP+4], %g2 ! pop shiftee
sll %g2, %g1, %g3
inc 4, %SP
st %g3, [%SP] ! push 4 * i
ads 4 ld [%SP], %g1 ! add pointer and offset
ld [%SP+4], %g2
add %g1, %g2, %g3
inc 4, %SP
st %g3, [%SP] ! push address of _a + (4 * i)
loi 4 ld [%SP], %g1 ! load indirect 4 bytes
ld [%g1], %g2
st %g2, [%SP] ! push a[i]
cal _f
...
.TE
.HS
.LP
Although the code is easy to understand, it clearly is far from optimal.
The above code uses approximately 60 clock-cycles\(dg
.FS
\(dg In general each instruction takes only one cycle,
except for \*(Sild\*(So and
\*(Sist\*(So which may both require additional clock cycles. The exact number
of extra cycles needed depends on the SPARC implementation and memory access
time. Furthermore, the
\*(Siset\*(So pseudo-instruction is a bit tricky. It takes one cycle when
its argument lies between -4096 and 4095, and two cycles otherwise.
.FE
to push an array-element on the stack,
something which a 68020 can do in a single instruction. The SPARC
processor may be fast, but not fast enough to justify the above code.
.PP
The same statement can be translated much more efficiently:
.DS
.TS
;
l2f6 lf6 l.
sll %i0, 2, %g2 ! multiply index by 4
set _a, %g3
ld [%g2+%g3], %g1 ! get contents of a[i]
dec 4, %SP
st %g1, [%SP] ! push a[i] onto the stack
.TE
.DE
which, instead of 60, uses only 5 clock cycles to retrieve the element
from memory and 5 additional cycles when the result has to be pushed
on the stack. Note that when the result is not a parameter it does not
have to be pushed on the stack. By making efficient use of the SPARC
registers we can fetch \*(Sia[i]\*(So in only 5 cycles!
.NH 3
Analyzing optimization
.PP
Instead of ad-hoc optimization we will need something more solid.
When one tries to optimize the above code in an ad-hoc manner one will
probably notice the large overhead due to stack access. Almost every EM
instruction requires at least three SPARC instructions: one to carry out
the EM instruction and two to pop and push the result from and onto the
stack. This happens for every instruction, even though the data being pushed
will probably be needed by the next instruction. To optimize this extensive
pushing and popping of data we will use the appropriately named push-pop
optimization.
.PP
The idea behind push-pop optimization is to delay the push operation until
it is almost certain that the data actually has to be pushed.
As is often the case, the data does not have to be pushed,
but will be used as input to another EM instruction.
If we can decide at compile time that this will indeed be
the case, we can save the time of first pushing the data and then popping it
back again, by temporarily storing the data (possibly only during compilation!)
and using it only when it is actually needed.
.PP
The \*(Sisli 4\*(So instruction, for instance, expects two inputs on top of the
stack: on top a counter and right below that the shiftee (the number
to be shifted). As a result \*(Sisli\*(So
pushes 'shiftee << counter' back to the stack. Now consider the following
sequence, which could be the result of the expression \*(Si4 * i\*(So:
.DS
.TS
;
l1f6 lf6 l.
lol -4
loc 2
sli 4
.TE
.DE
In the non-optimized situation the \*(Silol\*(So would push
a local variable (whose offset is -4) on the stack.
Then the \*(Siloc\*(So pushes a 2 on the stack and finally \*(Sisli\*(So
retrieves both these numbers to replace them with the result.
On most machines it is not necessary to
push the 2 on the stack, since it can be used in the shift instruction
as an immediate operand. On a SPARC, for instance, one can write
.DS
.TS
;
l2f6 lf6 l.
ld [%LB-4], %g1 ! load local variable into register g1
sll %g1, 2, %g2 ! perform the shift-left-by-2
.TE
.DE
where the output of the \*(Silol\*(So, as well as the immediate operand 2, are used
in the shift instruction. As suggested before, all of this can be
achieved with push-pop optimization.
.NH 3
A mechanism for push-pop optimization
.PP
To implement the above optimization we need some mechanism to
temporarily store information during compilation.
We need to be able to store, compare and retrieve information from the
temporary storage (cache) without any
loss of information. Before describing all the routines used
to implement our cache we will first describe how the cache works.
.PP
Items in the cache are structures containing an external (\*(Sichar *\*(So),
two registers (\*(Sireg_t\*(So) and a constant (\*(Siarith\*(So),
any of which may be 0.
The value of such a structure is the sum of (the values of)
its elements. To put a register in the cache, one first has to be allocated,
either by calling \*(Sialloc_reg\*(So, which returns a free register, by
\*(Siforced_alloc_reg\*(So, which allocates a specific register, or by any
of the other routines available to allocate a register. To keep things
simple, we will not discuss all of the available primitives here.
When the register
is then put in the cache by the \*(Sipush_reg\*(So routine, the ownership will
be transferred from the user to the cache. Ownership is important, because
only the owner of a register may (and must!) deallocate it. Registers can be
owned by either an (imaginary) register manager, the cache or the user.
When the user retrieves a register from the cache with \*(Sipop_reg\*(So, for
instance, ownership returns to the user.
The user should then call \*(Sifree_reg\*(So
to transfer ownership to the register manager, or call \*(Sipush_reg\*(So
to give it back to the cache.
Since the cache itself behaves as a stack, we will use the terms pop and
push for getting items from and putting items in the cache, respectively.
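.PP
To make this concrete, a cache item could be declared roughly as follows
(a sketch; the field names are ours and need not match the actual
implementation):
.DS
\f6
struct cache_item {
	char	*ci_ext;	/* external name, or 0 */
	reg_t	ci_reg1;	/* first register, or 0 (%g0) */
	reg_t	ci_reg2;	/* second register, or 0 (%g0) */
	arith	ci_cst;		/* constant, or 0 */
};	/* value = ci_ext + ci_reg1 + ci_reg2 + ci_cst */
\f1
.DE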
.PP
We shall now present the sets of routines that implement the cache.
.IP \(bu
The routines
.DS
\*(Si
reg_t alloc_reg(void)
reg_t alloc_reg_var(void)
reg_t alloc_float(void)
reg_t alloc_float_var(void)
reg_t alloc_double(void)
reg_t alloc_double_var(void)
void forced_alloc_reg(reg_t)
void soft_alloc_reg(reg_t)
void free_reg(reg_t)
void free_double_reg(reg_t)
\*(So
.DE
allocate and deallocate registers. If there are no more registers left,
i.e. they are all owned by the cache,
one or more registers will be freed by flushing part of the cache
onto the real stack.
The \*(Sialloc_xxx_var\*(So primitives try to allocate a register that
can be used to store local variables. (In the current implementation
only the input and local registers.) If none can be found \*(SiNULL\*(So
is returned. \*(Siforced_alloc_reg\*(So forces the allocation of a certain
register. If it was already in use, its contents are moved to another
register. Finally \*(Sisoft_alloc_reg\*(So provides the possibility to
push a register onto the cache and still keep a copy for later use.
(Used to implement the \*(Sidup 4\*(So for example.)
.IP \(bu
The routines
.DS
\*(Si
void push_const(arith)
arith pop_const(void)
\*(So
.DE
push or pop a constant onto or from the cache. The distinction between
constants and other types is made so as not to lose any information; constants
may be used later on as immediate operands, which is not the case
for other types. If \*(Sipop_const\*(So is called while the element on top of
the cache has the external field or one of the register fields non-zero, a
fatal error will be reported.
.IP \(bu
The routines
.DS
\*(Si
reg_t pop_reg(void)
reg_t pop_float(void)
reg_t pop_double(void)
reg_t pop_reg_c13(char *n)
void pop_reg_as(reg_t)
void push_reg(reg_t)
\*(So
.DE
push or pop a register. These will be used most often since results from one
EM instruction, which are computed in a register, are often used in the next.
When the element on top of the cache is more
than just a register the cache manager
will generate code to compute the sum of its fields and put the result in a
register. This register will then be given to the user.
If the user wants the result in a specific register, he should use the
\*(Sipop_reg_as\*(So routine.
\*(Sipop_reg_c13\*(So also yields an optional constant (as a character string)
whose value can be represented in 13 bits. The constant can then be used as an
offset for the SPARC \*(Sild\*(So and \*(Sist\*(So instructions.
.IP \(bu
The routine
.DS
\*(Si
void push_ext(char *)
\*(So
.DE
pushes an external onto the cache. There is no pop variant of this one, since
there is no use in popping an external.
.IP \(bu
The routines
.DS
\*(Si
void inc_tos(arith n)
void inc_tos_reg(reg_t r)
\*(So
.DE
increment the element on top of the cache by either the constant \*(Sin\*(So
or by a register. The latter is useful for pointer addition when referencing
external memory.
.KS
.IP \(bu
The routine
.DS
\*(Si
int type_of_tos(void)
\*(So
.DE
.KE
returns the type of the element on top of the cache. This is a combination
(binary OR) of \*(SiT_ext\*(So, \*(SiT_reg\*(So or \*(SiT_float\*(So,
\*(SiT_reg2\*(So or \*(SiT_float2\*(So, and \*(SiT_cst\*(So,
and tells the
user which of the fields are non-zero. A register field
that contains \*(Si%g0\*(So is considered zero.
.IP \(bu
Miscellaneous routines:
.DS
\*(Si
void init_cache(void)
void cache_need(int)
void change_reg(void)
void flush_cache(void)
\*(So
.DE
\*(Siinit_cache\*(So should be called before any
other cache routines, to initialize some internal data structures.
\*(Sicache_need\*(So is used to tell the cache that a certain number
of registers is needed for the next operation. This way the cache can
load them efficiently in one fell swoop. \*(Sichange_reg\*(So is to be
called when the user changes a register of which the cache (possibly) has
co-ownership. Because the contents of registers in the cache are
not allowed to change, the user should call \*(Sichange_reg\*(So to
instruct the cache to copy the contents to some other register.
\*(Siflush_cache\*(So writes the cache to the stack and invalidates
the cache. It should be used before branches,
before labels, and in other places where the stack has to be valid (i.e. where
every item on the EM-stack should be stored on the real stack, not in some
virtual cache).
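.LP
To illustrate how these primitives work together, an optimized
\*(Siadi 4\*(So (4-byte integer addition) could be written roughly as
follows (a sketch in the style of appendix B, not the actual table entry):
.DS
\f6
C_adi ==> {
	reg_t a, b, c;
	cache_need(2);		/* two registers will be popped */
	a = pop_reg();		/* pop both operands, */
	b = pop_reg();
	c = alloc_reg();
	"add $a, $b, $c";	/* add them, */
	free_reg(a);		/* return ownership to the register manager, */
	free_reg(b);
	push_reg(c);		/* and push the sum back into the cache */
}.
\f1
.DE
.LP
Note that no memory reference is generated when both operands are already
in registers; the real stack is only touched when the cache runs out of
registers or when a \*(Siflush_cache\*(So is required.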
.NH 3
Implementing push-pop optimization in the EM_table
.PP
As indicated above, there is no regular way to represent the described
optimization in the EM_table. The only possible escapes from the EM_table
are function calls, but that is clearly not enough to implement a good
push-pop optimizer. Therefore we will use a modified version of the EM_table
format, where the description of, say, the \*(Silol\*(So instruction might look
like this\(dg:
.FS
\(dg This is not the way the \*(Silol\*(So actually looks in the EM_table;
it only shows how it \fImight\fR look using the aforementioned push/pop
primitives.
.FE
.DS
\*(Si
reg_t A, B;
const_str_t n;
A = alloc_reg();
push_reg(LB);
inc_tos($1);
B = pop_reg_c13(n);
"ld [$B+$n], $A";
push_reg(A);
free_reg(B);
\*(So
.DE
For more details about the exact implementation consult
appendix B which contains some characteristic excerpts from the EM_table.
.NH 2
Stack management
.PP
When converting EM code to some executable code there is the problem of
maintaining multiple stacks. The usual way to do this is described in
.[ [
Description of a Machine Architecture
.]]
and is shown in figure \*(SN1.
.KS
.PS
copy "pics/EM_stack.orig"
.PE
.ce 1
\fIFigure \*(SN1: usual stack management.\fR
.KE
.sp
.LP
This means that the EM stack and the hardware stack (used
for subroutine calls, etc.) are interleaved in memory. On the SPARC, however,
this brings up a large problem: in the former model it is assumed that the
resolution of the stack pointer is a word, but this is not the case on the
SPARC processor. On the SPARC processor the stack-pointer as well as the
frame-pointer have to be aligned on 8-byte boundaries, so one cannot simply
push a word on the stack and then lower the stack-pointer by 4 bytes!
.NH 3
Possible solutions
.PP
A simple idea might be to use a Swiss-cheese stack; we could
push a 4-byte word onto the stack and then lower the stack by 8.
Unfortunately, this is not a very solid solution, because
pointer-arithmetic involving pointers to objects on the stack would cause
hard-to-predict anomalies.
.PP
Another try would be not to use the hardware stack at all. As long as we
do not generate subroutine-calls everything will be all right. This
approach, however, also has some disadvantages: first we would not be able
to use any of the existing debuggers such as \fIadb\fR, because they all
assume a regular stack format. Secondly, we would not be able to make use
of the SPARC's register windows to keep local variables. Finally, doing all the
administrative work necessary for subroutine calls ourselves instead of
letting the hardware handle it for us,
causes unnecessary procedure-call overhead.
.PP
Yet another alternative would be to emulate the EM-part of the stack,
and to let the hardware handle the subroutine call. Since we will
emulate our own stack, there are no alignment restrictions and because
we will use the hardware procedure call we can still make use of
the register windows.
.NH 3
Our implementation
.PP
To implement the hybrid stack we need two extra registers: one for
the EM stack pointer (the aforementioned \*(Si%SP\*(So) and one for the
EM local base pointer (\*(Si%LB\*(So). The most elegant solution would be to
put both stacks in different segments, so they would not influence
each other. Unfortunately
.UX
lacks the ability to add segments and
since we will implement our backend under
.UX,
we will have to put
both stacks in the same segment. Exactly how this can be done is shown
in figure \*(SN2.
.DS
.PS
copy "pics/mem_config"
.PE
.ce 1
\fIFigure \*(SN2: our stack management.\fR
.DE
.sp
During normal procedure execution, the SPARC stack pointer has to point to
a memory location where the operating system can dump the active part of
the register window. The rest of the
register window will be dumped, by following the frame pointer, in the
stack space pre-allocated for that purpose.
When a signal occurs, things get even more complicated and
result in figure \*(SN3.
.DS
.PS
copy "pics/signal_stack"
.PE
.ce 1
\fIFigure \*(SN3: our signal stack.\fR
.DE
.PP
The exact implementation of the stack is shown in figure \*(SN4.
.KF
.PS
copy "pics/EM_stack.ours"
.PE
.ce 1
\fIFigure \*(SN4: stack overview.\fR
.KE
.NH 2
Miscellaneous
.PP
As mentioned in the previous chapter, the generated \fI.o\fR-files are
not compatible with Sun's own object format. The primary reason for
this is that Sun usually passes the first six parameters of a procedure call
through registers. If we were to do that too, we would always have
to fetch the top six words from the stack into registers, even when
the procedure does not take any parameters at all. Apart from this,
structure-passing is another exception in Sun's object format which
makes it impossible to generate object-compatible code.\(dg
.FS
\(dg Exactly how Sun passes structures as parameters is described in
Appendix D of the SPARC Architecture Manual (Software Considerations)
.FE
.bp

153
doc/sparc/5 Normal file

@ -0,0 +1,153 @@
.so init
.nr H1 4
.NH
FUTURE WORK
.NH 2
A critique of EM
.PP
In general, EM fits its purpose quite well. Numerous compilers have been
written using EM as their intermediate language and it has even become a
commercial product. A great deal of its success is probably due to its
simplicity. There are no extravagant instructions but it does have all the
necessary functions to write a decent compiler.
.PP
There are, however, a few instructions that come rather close to being
extravagant. The \*(Silar\*(So instruction for example \(em used
to fetch an element from an array \(em does not make it much easier
to write a frontend, but does make it unnecessarily hard to write an
efficient backend. Other instructions for which it is difficult
to generate efficient code are those that permit
dynamic operands, such as the \*(Silos\*(So. Dynamic operands, however, provide
significant extra possibilities and can therefore not be disposed of.
Note that even though the array operations \*(Silar\*(So and \*(Sisar\*(So
permit dynamic operands, they do not add additional power, since
they can easily be replaced with a sequence using the \*(Silos\*(So or
\*(Sists\*(So instructions.
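.PP
To make the replacement concrete: assuming the element size (here 4 bytes)
is known at compile time (a sketch of our own, not code taken from a
frontend), a \*(Silar 4\*(So could be replaced by
.DS
.TS
;
l1f6 lf6 l.
aar	4	! compute the address of the array element
loi	4	! load the 4-byte element through that address
.TE
.DE
with a \*(Silos\*(So taking the place of the \*(Siloi\*(So when the size
is only known at run time.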
.PP
EM code to reference arrays generated by the C frontend can be translated
very efficiently for almost any processor. However, the same operation
generated by the Modula-2 frontend (which uses the \*(Silar\*(So)
is much less efficient, although the only difference is that the
latter performs range checking whereas the former does not.\(dg
.FS
\(dg Actually this depends on whether or not explicit range checking is enabled.
This clearly shows that the current code generators are not optimal and
often depend on ad-hoc decisions.
.FE
Since range checking can also be expressed explicitly in
EM (\*(Sirck\*(So) there is no need for any of the array operations
(\*(Siaar\*(So, \*(Silar\*(So and \*(Sisar\*(So).
.PP
Besides efficiency of the array-operations themselves, there still is another
major disadvantage of using these array-operations. In sharp contrast to
all other EM instructions except the \*(Silos\*(So and the \*(Sists\*(So,
they allow dynamic operands, so their effect on the stack-pointer cannot
always be
determined at compile-time. This means that efficient caching of the
top-of-stack in registers is almost impossible,
so using these array-operations also affects the
efficiency of the surrounding code. Now that processors are produced with
more and more registers it could be very beneficial to cache the
top-of-stack, so that the memory/register reference ratio decreases
to the benefit of the overall performance.
.PP
As a final critique, we would also like to discuss the semantics of some of
the EM instructions. In
.[ [
Description of a Machine Architecture
.]]
it is said that
all signed instructions, such as the \*(Siadi\*(So, should cause an exception
on overflow. The unsigned operations, such as \*(Siadu\*(So, however,
should act as modulo operations and therefore not perform overflow checking.
Since it is very expensive to perform overflow checking in EM itself,
we would suggest that the backend takes care of this. For languages which
do not require overflow checking, a simple message could be generated to
disable overflow checking in backends. This way all backends could be
written to fully comply with the official EM definition without any
reduction in efficiency.\(dd
.FS
\(dd Currently many backends do not implement error checks because they
are too expensive and almost never needed. Some frontends even have
facilities built in to generate EM-code to force these checks. If this
trend continues we will end up with a de-facto and a de-jure standard,
both developed by the same people but nonetheless incompatible.
.FE
If such messages are added, we would like to suggest
that they be able to enforce overflow checks on unsigned as well as signed arithmetic.
.PP
As a conclusion we would like to suggest removal of the array operations from
EM, or at least discontinuation of their usage in frontends.
.NH 2
\*(OQWanted: Procedure call information\*(CQ
.PP
The advantage of an intermediate language such as EM is that the backend
no longer has to know about any 'quirks' of the 'input'-language. The major
disadvantage, however, is that the backend no longer knows about any 'quirks'
of the 'input'-language... If the SPARC backend ever has to compete
with Sun's own C-compiler for example, removal of the array-operations
will not be enough. The amount of information that is lost during
the translation to EM is too large to ever generate truly efficient SPARC code.
.PP
To write such an efficient backend one needs to know, for example, whether,
when and what type of parameter is being computed, so the result can be stored
in the proper place and scratch registers can be reused.
(On the SPARC processor, for example, it is very beneficial
to pass the first six parameters of a procedure call through
registers instead of using the stack.)
One way to express such things in EM is to insert extra messages in
the EM-code. The C statement \*(Sia = f(4, a + b);\*(So for example,
could be translated to the following EM-code:
.DS
.TS
;
l1f6 lf6 l.
lol -4 ! a
lol -8 ! b
mes x, 2 ! next instruction will compute 2nd parameter
adi 4
mes x, 1 ! next instruction will compute 1st parameter
loc 4
cal _f ! call function f
lfr 4
stl -4 ! store result in a
.TE
.DE
For a code expander it is important that the \*(Simes\*(So pseudo
instructions appear \fIbefore\fR
the EM instruction that computes the parameter, because that way the final
computation (the \*(Siadi\*(So and \*(Siloc\*(So in the previous example)
can be translated to machine code that performs the required computation
and also puts the result in the required place. If it is found to be
too difficult for the frontend to insert these \*(Simes\*(So instructions
at the right place the peep-hole optimizer might swap the \*(Simes\*(So and
the instruction that computes the parameter.
.PP
For some architectures, it is also
possible to generate more efficient code for a procedure when it is a
so-called leaf-procedure: a procedure that doesn't call other procedures.
On the SPARC, for example, it is not necessary to rotate the register
window for a call to a leaf procedure and it is also possible to use
the global registers for register variables in leaf procedures.
It will be a little harder to insert useful messages about leaf procedures,
because just as with register messages, they are only useful to the
backend when they appear immediately
after or before the \*(Sipro\*(So pseudo instruction. The frontend,
however, only knows whether a certain procedure is a leaf-procedure or not
when it has already generated the entire procedure in EM. Just as with the
\*(Sipro ? / end n\*(So-dilemma the peep-hole optimizer
.[ [
Using Peephole Optimization
.]]
might be able to lend a hand
and help us out by delaying EM-code generation until it has reached the
end of the procedure.
.PP
As with most optimizations, the main problem is that they have to be
implemented with the \*(Simes\*(So pseudo instruction.
Because the \*(Simes\*(So instruction can have many different meanings
depending on its argument,
it is important that all optimizers recognize and respect them. The addition
of even a single message will require careful inspection of, and maybe even
small changes to, each of the optimizers.
.bp

184
doc/sparc/A Normal file

@ -0,0 +1,184 @@
.so init
.SH
A. MEASUREMENTS
.SH
A.1. \*(OQThe bottom line\*(CQ
.PP
Although examples often are most illustrative, the cruel world out there is
usually more interested in everyday performance figures. To satisfy those
people too, we will present a series of measurements on our code expander
taken from (close to) real life situations. These include measurements
of compile and run times of different programs,
compiled with different compilers.
.SH
A.2. Compile time measurements
.PP
Figure A.2.1 shows compile-time measurements for typical C code:
the dhrystone benchmark\(dg
.[ [
dhrystone
.]].
.FS
\(dg To be certain that we only tested the compiler and not the quality of
the code in the library, we have added our own version of
\fIstrcmp\fR and \fIstrcpy\fR and have not used the ones present in the
library.
.FE
The numbers represent the duration of each separate pass of the compiler.
The numbers at the end of each bar represent the total duration of the
compilation process. As with all measurements in this chapter, the
quoted time or duration is the sum of user and system time in seconds.
.PS
copy "pics/compile_bars"
.PE
.DS
.IP cem: 6
C to EM frontend
.IP opt:
EM peep-hole optimizer
.IP be:
EM to assembler backend
.IP cpp:
Sun's C preprocessor
.IP ccom:
Sun's C compiler
.IP iropt:
Sun's optimizer
.IP cg:
Sun's code generator
.IP as:
Sun's assembler
.IP ld:
Sun's linker
.ce 1
\fIFigure A.2.1: compile-time measurements.\fR
.DE
.sp
.PP
A close examination of the first two bars in fig A.2.1 shows that the maximum
achievable compile-time
gain compared to \fIcc\fR is about 50% for medium-sized
programs.\(dd
.FS
\(dd (cpp+ccom+as+ld)/(cem+as+ld) = 1.53
.FE
For small programs the gain will be less, due to the almost constant
start-up time of each pass in the compilation process. Only a
built-in assembler could increase this number further, up to
180% in the ideal case where the optimizer, backend and assembler
would run in zero time. Speed-ups of 5 to 10 times as mentioned in
.[ [
fast portable compilers
.]]
are therefore not possible on the Sun-4 family. This is also due to
Sun's implementation of saving and restoring register windows. With
the current implementation, in which only a single window is saved
or restored on a register-window overflow, programs with highly dynamic
stack use due to procedure calls (as is often the case with compilers)
spend a great deal of time trapping into the kernel.
.PP
Although we are currently a little slower than \fIcc\fR, it is hard to
blame this on our backend. Optimizing the backend so that it would run
twice as fast would only reduce the total compilation process by
a mere 14%.
.PP
Finally it is nice to see that our push/pop-optimization,
initially designed to generate faster code, has also increased the
compilation speed (see also figures A.4.1 and A.4.2).
.SH
A.3. Run time performance
.PP
Figure A.3.1 shows the run-time performance of different compilers.
All results are normalized, where the best available compiler (Sun's
compiler with full optimization) is represented by 1.0 on our scale.
.PS
copy "pics/run-time_bars"
.PE
.ce 1
\fIFigure A.3.1: run time performance.\fR
.sp 1
.PP
The fact that our compiler behaves rather poorly compared to Sun's
compiler is due to the fact that the dhrystone benchmark uses
relatively many subroutine calls, all of which have to be 'emulated'
by our backend.
.SH
A.4. Overall performance
.LP
In the next two figures we will show the combined run and compile time
performance of 'our' compiler (the ACK C frontend and our backend)
compared to Sun's C compiler. Figure A.4.1 shows the results from
measurements on the dhrystone benchmark.
.G1
frame invis left solid bot solid
label left "run time" "(in \(*msec/dhrystone)"
label bot "compile time (in sec)"
coord x 0,21 y 0,610
ticks left out from 0 to 600 by 200
ticks bot out from 0 to 20 by 5
"\(bu" at 3.5, 1000000/1700
"ack w/o opt" ljust at 3.5 + 1, 1000000/1700
"\(bu" at 2.8, 1000000/8770
"ack with opt" below at 2.8 + 0.1, 1000000/8770
"\(bu" at 16.0, 1000000/10434
"ack -O4" above at 16.0, 1000000/10434
"\(bu" at 2.3, 1000000/7270
"\fIcc\fR" above at 2.3, 1000000/7270
"\(bu" at 9.0, 1000000/12500
"\fIcc -O4\fR" above at 9.0, 1000000/12500
"\(bu" at 5.9, 1000000/15250
"\fIcc -O\fR" below at 5.9, 1000000/15250
.G2
.ce 1
\fIFigure A.4.1: overall performance on dhrystones.\fR
.sp 1
.LP
Fortunately for us, dhrystones are not all there is. The following
figure shows the same measurements as the previous one, except
this time we took a benchmark that uses no subroutines: an implementation
of Eratosthenes' sieve:
.G1
frame invis left solid bot solid
label left "run time" "for one run" "(in sec)" left .6
label bot "compile time (in sec)"
coord x 0,11 y 0,21
ticks bot out from 0 to 10 by 5
ticks left out from 0 to 20 by 5
"\(bu" at 2.5, 17.28
"ack w/o opt" above at 2.5, 17.28
"\(bu" at 1.6, 2.93
"ack with opt" above at 1.6, 2.93
"\(bu" at 9.4, 2.26
"ack -O4" above at 9.4, 2.26
"\(bu" at 1.5, 7.43
"\fIcc\fR" above at 1.5, 7.43
"\(bu" at 2.7, 2.02
"\fIcc -O4\fR" ljust at 1.9, 1.2
"\(bu" at 2.6, 2.10
"\fIcc -O\fR" ljust at 3.1,2.5
.G2
.ce 1
\fIFigure A.4.2: overall performance on Eratosthenes' sieve.\fR
.sp 1
.PP
Although the above figures speak for themselves, a small comment
may be in order. First, it is clear that our compiler is neither
faster than \fIcc\fR, nor produces faster code than \fIcc -O4\fR. It should
be noted, however, that we do produce better code than \fIcc\fR
at only a very small additional cost.
It is also worth noticing that push-pop optimization
increases run-time speed as well as compile speed.
The first seems rather obvious,
since optimized code is
faster code, but the increase in compile speed may come as a surprise.
The main reason is that the \fIas\fR+\fIld\fR time depends largely on the
amount of generated code, which in general
depends on the efficiency of the code.
Push-pop optimization removes a lot of useless instructions which
would otherwise
have found their way through to the assembler and the loader.
Useless instructions inserted in an early stage in the compilation
process will slow down every following stage, so elimination of useless
instructions in an early stage, even when it requires a little computational
overhead, can often be beneficial to the overall compilation speed.
.bp

128
doc/sparc/B Normal file

@ -0,0 +1,128 @@
.so init
.SH
B. IMPLEMENTATION
.SH
B.1. Excerpts from the non-optimized EM_table
.PP
Even though the non-optimized version of the EM_table is relatively
straightforward, examples have never hurt anybody.
One of the simplest instructions is the \*(Siloc\*(So, which appears in
our EM_table as follows:
.DS
\f6
.TA 8 16 24 32 40 48 56 64
C_loc ==> "set $1, T1";
"dec 4, SP";
"st T1, [SP]".
\f1
.DE
Just as \*(SiSP\*(So is an alias for \*(Si%l0\*(So, \*(SiT1\*(So is
an alias for \*(Si%g1\*(So.
A little more complex is the \*(Siadi\*(So which performs integer
addition.
.DS
\f6
C_adi ==> "ld [SP], T1";
"ld [SP+4], T2";
"add T1, T2, T3";
"st T3, [SP+4];
"inc 4, SP".
\f1
.DE
We could go on with even more complex instructions, but since that would
not contribute much, the reader is referred to the implementation
for more details.
.SH
B.2. Excerpts from the optimized EM_table
.PP
The optimized EM_table uses the cache primitives mentioned in chapter 4.
This means that the \*(Siloc\*(So this time appears as
.DS
\f6
C_loc ==> push_const($1).
\f1
.DE
The \*(Silol\*(So can now be written as
.DS
\f6
C_lol ==> push_reg(LB);
inc_tos($1);
push_const(4);
C_los(4).
\f1
.DE
Due to the law of conservation of misery, somebody has to do the dirty work.
In this case, it is the \*(Silos\*(So. To show just a small part of
the implementation of the \*(Silos\*(So:
.DS
\f6
C_los $1 == 4 ==>
	if (type_of_tos() == T_cst) {
		arith size;
		const_str_t n;
		size = pop_const();
		if (size <= 4) {
			reg_t a;
			reg_t b;
			char *LD;
			switch (size) {
			case 1: LD = "ldub"; break;
			case 2: LD = "lduh"; break;
			case 4: LD = "ld"; break;
			default: arg_error("C_los", size);
			}
			a = pop_reg_c13(n);
			b = alloc_reg();
			"$LD [$a+$n], $b";
			push_reg(b);
			free_reg(a);
		} else ...
\f1
.DE
For the full details, the reader is again referred to the actual
implementation. Just to show how other instructions are affected
by the optimization, we will show the implementation of the \*(Sitge\*(So
instruction:
.DS
\f6
C_tge ==> {
reg_t a;
reg_t b;
a = pop_reg();
b = alloc_reg();
" tst $a";
" bge,a 1f";
" mov 1, $b"; /* delay slot */
" set 0, $b";
"1:";
free_reg(a);
push_reg(b);
}.
\f1
.DE
.bp
.SH
CREDITS
.PP
In order of appearance:
.TS
center;
r c l.
Original idea - Dick Grune
Design & implementation - Philip Homburg
- Raymond Michiels
Tutor - Dick Grune
Assistant Tutor - Ceriel Jacobs
Proofreading - Dick Grune
- Hans van Eck
.TE
.SH
REFERENCES
.PP
.[
$LIST$
.]

10
doc/sparc/Makefile Normal file

@ -0,0 +1,10 @@
# $Header$
REFER=refer
TBL=tbl
TARGET=-Tlp
PIC=pic
GRAP=grap
../sparc.doc: refs title intro 1 2 3 4 5 A B init
$(REFER) -sA+T '-l\", ' -p refs title intro 1 2 3 4 5 A B | $(GRAP) | $(PIC) | $(TBL) | soelim > $@

18
doc/sparc/init Normal file

@ -0,0 +1,18 @@
.nr PS 12
.nr VS 14
.\" .fp 6 AM
.fp 6 CW
.ds Si \f6\s-1
.ds So \f1\s+1
.ds OQ `\h'-1p'`
.ds CQ '\h'-1p''
.de UX
.ie \\n(UX \s-1UNIX\s0\\$1
.el \{\
\s-1UNIX\s0\\$1\(dg
.FS
\(dg \s-1UNIX\s0 is a registered bell of AT&T Trademark Laboratories.
.FE
.nr UX 1
.\}
..

23
doc/sparc/intro Normal file

@ -0,0 +1,23 @@
.so init
.hw de-vised
.TL
A fast backend for SPARC processors
.AU
Philip Homburg
Raymond Michiels
.AI
Dept. of Mathematics and Computer Science
Vrije Universiteit
Amsterdam, The Netherlands
.AB
The language EM is an intermediate language for use in compiler
construction.
In this paper we describe the construction of a so-called fast backend
which translates EM code to assembler for SPARC processors.
.br
Our construction deviates strongly from the usual procedure. We have
devised and implemented a virtual stack with which it is possible to
generate very acceptable code without much loss in compile time.
.AE
.PP
.bp

58
doc/sparc/note_on_reg_wins Normal file

@ -0,0 +1,58 @@
When developing a fast compiler for the Sun-4 series we have encountered
rather strange behavior of the Sun kernel.
The problem is that when you have lots of nested procedure calls (as
is often the case in compilers and parsers), the registers fill up, which
causes a kernel trap. The kernel will then write out some of the registers
to memory to make room for another window. When you return from the nested
procedure call, just the reverse happens: yet another kernel trap, so the
kernel can load the registers from memory.
Unfortunately the kernel only saves or loads a single window (= 16 registers)
on each trap. This means that when you call a procedure recursively it causes
a kernel trap on almost every invocation (except for the first few).
To illustrate this consider the following little program:
--------------- little program -------------
f(i) /* calls itself i times */
int i;
{
	if (i)
		f(i-1);
}
main(argc, argv)
int argc;
char *argv[];
{
	int i, j;

	i = atoi(argv[1]); /* # loops */
	j = atoi(argv[2]); /* depth */
	while (i--)
		f(j);
}
------------ end of little program -----------
The performance decreases abruptly when the depth (j) becomes larger
than 5. On a SPARC station we got the following results:
depth	run time (in seconds)
  1	 0.5
  2	 0.8
  3	 1.0
  4	 1.4	<- from here on it's +6 seconds for each step deeper
  5	 7.6
  6	13.9
  7	19.9
  8	26.3
  9	32.9
Things would be a lot better if, instead of just 1, the kernel would
save or restore 4 windows (= 64 registers = 50% on our SPARC stations).
-Raymond.

12
doc/sparc/pics/.distr Normal file

@ -0,0 +1,12 @@
EM_stack.orig
EM_stack.ours
compile_bars
mem_config
perf
perf.comp
perf.d
perf.dhry
reg_layout
run-time_bars
run-time_bars.bup
signal_stack

34
doc/sparc/pics/EM_stack.orig Normal file

@ -0,0 +1,34 @@
.PS
.ps -2
.vs -2
boxwid = 1.5;
boxht = 0.24
down;
box "actual parameter n-1";
box "." "." "." ht 0.6;
box "actual parameter 0";
move 0.3
box "return status block";
{arrow <- right with .w at last box.e; \
box invis wid 0.3 "LB" }
down
move to 2nd last box.s
move 0.1
box "local variables"
box "compiler temporaries"
move 0.1
box "register save block"
move 0.1
box "dynamic local generators"
move 0.1
box "operand"
box "operand"
move 0.1
box "parameter m-1"
box "." "." "." ht 0.6;
box "parameter 0" with .n at last box .s
{ arrow <- right with .w at last box.e; \
box invis wid 0.3 "SP" }
.ps +2
.vs +2
.PE

106
doc/sparc/pics/EM_stack.ours Normal file

@ -0,0 +1,106 @@
.ps 10
.vs 12
.PS
boxwid = 1.3
boxht = 0.25
down;
box "floating point" "register dump area" ht 0.6
box "tmp float store"
box "register dump area" ht 0.6
{ arrow <- right with .w at 3/4 <last box.e, last box.se>; \
box invis wid 0.3 "%fp" }
move .1
box dotted "gap"
{ arrow <- right with .w at last box.e; \
box invis wid 0.3 "%LB" }
move .1
box "locals"
box "actual parameter n-1";
box "." "." "." ht 0.6;
box "actual parameter 0";
{ arrow <- right with .w at last box.e; \
box invis wid 0.3 "%SP" }
move 0.1
box "large gap" "(>64kb)" ht 1.0
box "register dump area" ht 0.6
{ arrow <- right with .w at 3/4 <last box.e, last box.se>; \
box invis wid 0.3 "%sp" }
move 0.2
box invis "\\s+2just before call\\s0"
move 1
box dotted "gap"
box invis "0 or 4 bytes" "for stack alignment" with .w at last box.e
box invis height .7 "when gap is 0 bytes," "%fp == %LB" with .n at 2nd last box.s
.PF
.PS
down;
move to 2.4,0
box "floating point" "register dump area" ht 0.6
box "tmp float store"
box "register dump area" ht 0.6
{ arrow <- right with .w at 3/4 <last box.e, last box.se>; \
box invis wid 0.3 "%fp" }
move .1
box dotted "gap"
{ arrow <- right with .w at last box.e; \
box invis wid 0.3 "%LB" }
move .1
box "locals"
box "actual parameter n-1";
box "." "." "." ht 0.6;
box "actual parameter 0";
{ arrow <- right with .w at last box.e; \
box invis wid 0.3 "%SP" }
move .1
box dotted "gap"
move .4
box "floating point" "register dump area" ht 0.6
box "tmp float store"
box "register dump area" ht 0.6
{ arrow <- right with .w at 3/4 <last box.e, last box.se>; \
box invis wid 0.3 "%sp" }
move 0.2
box invis "\\s+2'during' call\\s0"
.PF
.PS
down;
move to 4.8,0
box "floating point" "register dump area" ht 0.6
box "tmp float store"
box "register dump area" ht 0.6
move .1
box dotted "gap"
move .1
box "locals"
box "actual parameter n-1";
box "." "." "." ht 0.6;
box "actual parameter 0";
move .1
box dotted "gap"
move .4
box "floating point" "register dump area" ht 0.6
box "tmp float store"
box "register dump area" ht 0.6
{ arrow <- right with .w at 3/4 <last box.e, last box.se>; \
box invis wid 0.3 "%fp" }
move .1
box dotted "gap"
{ arrow <- right with .w at last box.e; \
box invis wid 0.3 "%LB" }
move .1
box "locals"
box "actual parameter n-1";
box "." "." "." ht 0.6;
box "actual parameter 0";
{ arrow <- right with .w at last box.e; \
box invis wid 0.3 "%SP" }
move 0.1
box "large gap" "(>64kb)" ht 1.0
box "register dump area" ht 0.6
{ arrow <- right with .w at 3/4 <last box.e, last box.se>; \
box invis wid 0.3 "%sp" }
move 0.2
box invis "\\s+2after call\\s0"
.PF
.ps 12
.vs 14

49
doc/sparc/pics/compile_bars Normal file

@ -0,0 +1,49 @@
.PS
boxht = 0.5
boxwid = 1
moveht = 0.65
down;
{
right;
box invis "ACK" "w/o" "opt"
box "cem" "0.7" wid 0.7
box "opt" "0.4" wid 0.4
box "be" "1.1" wid 1.1
box "as" "1.4" wid 1.4
box "ld" "0.4" wid 0.4
box invis "4.0" wid 0.5
}
move
{
right;
box invis "ACK" "with" "opt"
box "cem" "0.7" wid 0.7
box "opt" "0.4" wid 0.4
box "be" "0.6" wid 0.6
box "as" "0.7" wid 0.7
box "ld" "0.4" wid 0.4
box invis "2.8" wid 0.5
}
move
{
right;
box invis "\fIcc\fR"
box "cpp" "0.2" wid 0.2
box "ccom" "1.0" wid 1.0
box "as" "0.7" wid 0.7
box "ld" "0.4" wid 0.4
box invis "2.3" wid 0.5
}
move
{
right;
box invis "\fIcc -O4\fR"
box "cpp" "0.2" wid 0.2
box "ccom" "1.0" wid 1.0
box "iropt" "5.0 (not to scale!)" wid 1.5
box "cg" "0.7" wid 0.7
box "as" "1.7" wid 1.7
box "ld" "0.4" wid 0.4
box invis "9.0" wid 0.5
}
.PE

34
doc/sparc/pics/mem_config Normal file

@ -0,0 +1,34 @@
.PS
boxwid = 1.3
down
[
right
[
down;
box "stack" ht .6
box "free" ht 1
box "heap" ht .3
box "text" ht .5
]
move 1
[
down;
box "\s-4SPARC stack\s+4" ht .2
box "\s-4EM stack\s+4" ht .1
box "\s-4SPARC stack\s+4" ht .1
box "\s-4EM stack\s+4" ht .1
box "\s-4free\s+4" ht .2
box "\s-4SPARC stack\s+4" ht .1
box "free" ht .8
box "heap" ht .3
box "text" ht .5
]
]
move .3
[
right
box invis "regular \(UX memory layout"
move 1
box invis "memory layout for EM"
]
.PF

12
doc/sparc/pics/perf Normal file

@ -0,0 +1,12 @@
.G1
frame invis left solid bot solid
label left "run time" "(log scale)" left .5
label bot "compile time (log scale)"
coord x 0.1,10 log x y 1000,20000 log y
ticks left out at 2000,5000,10000,20000
ticks bot out at 0.1 0.3 1.0 3.0 10
copy "perf.d" thru X
"\(bu" at $1, $2
"$3" rjust at $1, $2
X
.G2

7
doc/sparc/pics/perf.comp Normal file

@ -0,0 +1,7 @@
in-line in ../A
2.5 17.28 ack w/o opt
1.6 2.93 ack with opt
9.4 2.26 ack -O4
1.5 7.43 \fIcc\fR
2.7 2.02 \fIcc -O4\fR

4
doc/sparc/pics/perf.d Normal file

@ -0,0 +1,4 @@
1.0 1700 ack w/o opt
1.9 8000 ack with opt
1.6 8000 \fIcc\fR
7 18000 \fIcc -O4\fR

7
doc/sparc/pics/perf.dhry Normal file

@ -0,0 +1,7 @@
in-line in ../A
3.5 1700 ack w/o opt
2.8 8770 ack with opt
16.0 10434 ack -O4
2.3 7270 \fIcc\fR
9.0 12500 \fIcc -O4\fR

24
doc/sparc/pics/reg_layout Normal file

@ -0,0 +1,24 @@
.nr PS 12
.nr VS 14
.PP
.TS
allbox;
l l l l
l2f6 l l2f6 l.
g0 0 l0 EM_SP
g1 temporary 1 l1 EM_LB
g2 temporary 2 l2
g3 temporary 3 l3 reserved
g4 64k..1M l4 reserved
g5 temporary 4 l5 reserved
g6 line number l6 reserved
g7 file name l7 reserved
o0 param 1 i0
o1 param 2 i1
o2 param 3 i2
o3 param 4 i3
o4 RETL_LD i4 RETL_ST
o5 RETH_LD i5 RETH_ST
sp stack pointer fp frame pointer
o7 xxx i7 return address
.TE

101
doc/sparc/pics/run-time_bars Normal file

@ -0,0 +1,101 @@
.PS
boxht = 0.5
boxwid = 1
moveht = 1
down;
{
right;
box invis "ACK" "w/o" "opt."
move
[
down;
boxht = 0.25
box wid 4.5
"Sieve" ljust at last box.w + 0.1,-0.02
"10(!)" ljust at last box.e + 0.1,-0.02
box wid 4.5 with .nw at last box.sw
"Dhrystones" ljust at last box.w + 0.1,-0.02
"10(!)" ljust at last box.e + 0.1,-0.02
] with .w at last box.e
}
move
{
right;
box invis "ACK" "with" "our" "opt."
move
[
down;
boxht = 0.25
box wid 1.4
"Sieve" ljust at last box.w + 0.1,-0.02
"1.4" ljust at last box.e + 0.1,-0.02
box wid 1.9 with .nw at last box.sw
"Dhrystones" ljust at last box.w + 0.1,-0.02
"1.9" ljust at last box.e + 0.1,-0.02
] with .w at last box.e
}
move
{
right;
box invis "ACK" "-O4"
move
[
down;
boxht = 0.25
box wid 1.1
"Sieve" ljust at last box.w + 0.1,-0.02
"1.1" ljust at last box.e + 0.1,-0.02
box wid 1.6 with .nw at last box.sw
"Dhrystones" ljust at last box.w + 0.1,-0.02
"1.6" ljust at last box.e + 0.1,-0.02
] with .w at last box.e
}
move
{
right;
box invis "Sun's" "compiler" "w/o opt."
move
[
down;
boxht = 0.25
box wid 3.7
"Sieve" ljust at last box.w + 0.1,-0.02
"3.7" ljust at last box.e + 0.1,-0.02
box wid 2.2 with .nw at last box.sw
"Dhrystones" ljust at last box.w + 0.1,-0.02
"2.2" ljust at last box.e + 0.1,-0.02
] with .w at last box.e
}
move
{
right;
box invis "Sun's" "compiler" "-O"
move
[
down;
boxht = 0.25
box wid 1.1
"Sieve" ljust at last box.w + 0.1,-0.02
"1.1" ljust at last box.e + 0.1,-0.02
box wid 0.8 with .nw at last box.sw
"Dhryst." ljust at last box.w + 0.1,-0.02
"0.8!" ljust at last box.e + 0.1,-0.02
] with .w at last box.e
}
move
{
right;
box invis "Sun's" "compiler" "-O4"
move
[
down;
boxht = 0.25
box wid 1.0
"Sieve" ljust at last box.w + 0.1,-0.02
"1.0" ljust at last box.e + 0.1,-0.02
box wid 1.0 with .nw at last box.sw
"Dhrystones" ljust at last box.w + 0.1,-0.02
"1.0" ljust at last box.e + 0.1,-0.02
] with .w at last box.e
}
.PE

100
doc/sparc/pics/run-time_bars.bup Normal file

@ -0,0 +1,100 @@
.PS
boxht = 0.5
boxwid = 1
moveht = 1
down;
{
right;
box invis "ACK" "w/o" "opt"
move
[
down;
boxht = 0.25
box wid 4.5
"C (arithmetic)" ljust at last box.w + 0.1,-0.02
"10(!)" ljust at last box.e + 0.1,-0.02
box wid 4.5 with .nw at last box.sw
"C (dhrystones)" ljust at last box.w + 0.1,-0.02
"10(!)" ljust at last box.e + 0.1,-0.02
box wid 4.5 with .nw at last box.sw
"Modula-2" ljust at last box.w + 0.1,-0.02
"8(!)" ljust at last box.e + 0.1,-0.02
] with .w at last box.e
}
move
{
right;
box invis "ACK" "with" "peep-hole" "opt"
move
[
down;
boxht = 0.25
box wid 1.4
"C (arithmetic)" ljust at last box.w + 0.1,-0.02
"1.4" ljust at last box.e + 0.1,-0.02
box wid 1.9 with .nw at last box.sw
"C (dhrystones)" ljust at last box.w + 0.1,-0.02
"1.9" ljust at last box.e + 0.1,-0.02
box wid 2.5 with .nw at last box.sw
"Modula-2" ljust at last box.w + 0.1,-0.02
"2.5" ljust at last box.e + 0.1,-0.02
] with .w at last box.e
}
move
{
right;
box invis "ACK" "-O4"
move
[
down;
boxht = 0.25
box wid 1.1
"C (arithmetic)" ljust at last box.w + 0.1,-0.02
"1.1" ljust at last box.e + 0.1,-0.02
box wid 1.6 with .nw at last box.sw
"C (dhrystones)" ljust at last box.w + 0.1,-0.02
"1.6" ljust at last box.e + 0.1,-0.02
box wid 2.5 with .nw at last box.sw
"Modula-2" ljust at last box.w + 0.1,-0.02
"2.5" ljust at last box.e + 0.1,-0.02
] with .w at last box.e
}
move
{
right;
box invis "Sun's" "compiler" "w/o opt."
move
[
down;
boxht = 0.25
box wid 3.7
"C (arithmetic)" ljust at last box.w + 0.1,-0.02
"3.7" ljust at last box.e + 0.1,-0.02
box wid 2.2 with .nw at last box.sw
"C (dhrystones)" ljust at last box.w + 0.1,-0.02
"2.2" ljust at last box.e + 0.1,-0.02
box wid 1.8 with .nw at last box.sw
"Modula-2" ljust at last box.w + 0.1,-0.02
"1.8" ljust at last box.e + 0.1,-0.02
] with .w at last box.e
}
move
{
right;
box invis "Sun's" "compiler" "-O4"
move
[
down;
boxht = 0.25
box wid 1.0
"C (arith.)" ljust at last box.w + 0.1,-0.02
"1.0" ljust at last box.e + 0.1,-0.02
box wid 1.0 with .nw at last box.sw
"C (dhryst.)" ljust at last box.w + 0.1,-0.02
"1.0" ljust at last box.e + 0.1,-0.02
box wid 1.0 with .nw at last box.sw
"Modula-2" ljust at last box.w + 0.1,-0.02
"1.0" ljust at last box.e + 0.1,-0.02
] with .w at last box.e
}
.PE

42
doc/sparc/pics/signal_stack Normal file

@ -0,0 +1,42 @@
.PS
boxwid = 1.3
down
[
right
[
down;
box "\s-4SPARC stack\s+4" ht .2
box "\s-4EM stack\s+4" ht .1
box "\s-4SPARC stack\s+4" ht .1
box "\s-4EM stack\s+4" ht .1
box "\s-4free\s+4" ht .2
box "\s-4SPARC stack\s+4" ht .1
box "free" ht .8
box "heap" ht .3
box "text" ht .5
]
move 1
[
down;
box "\s-4SPARC stack\s+4" ht .2
box "\s-4EM stack\s+4" ht .1
box "\s-4SPARC stack\s+4" ht .1
box "\s-4EM stack\s+4" ht .1
box "\s-4free\s+4" ht .2
box "\s-4SPARC stack\s+4" ht .1
box "\s-4EM stack\s+4" ht .1
box "\s-4free\s+4" ht .2
box "\s-4SPARC stack\s+4" ht .1
box "free" ht .4
box "heap" ht .3
box "text" ht .5
]
]
move .3
[
right
box invis "before signal"
move 1
box invis "during (1st) signal"
]
.PF

31
doc/sparc/printP4P Normal file

@ -0,0 +1,31 @@
echo $0
case $1 in
1 )
CMD="cat"
;;
2 )
CMD="cat"
;;
3 )
CMD="cat"
;;
4 )
CMD="pic | tbl"
;;
5 )
CMD="tbl"
;;
A )
CMD="grap | pic"
;;
B )
CMD="tbl"
;;
esac
echo $0
if [ $0 = printP4P ]
then
refer -sA+T '-l\", ' -p refs $1 | eval $CMD | troff -ms -Tp4p | dip -Tp4p -Pp4p
else
xtroff -full -geom 665x883+566+0 -command "refer -sA+T '-l\", ' -p refs $1 | $CMD | troff -ms -Tp4p"
fi

185
doc/sparc/refs Normal file

@ -0,0 +1,185 @@
%T The design of very fast portable compilers
%A A.S. Tanenbaum
%A M.F. Kaashoek
%A K.G. Langendoen
%A C.J.H. Jacobs
%J SIGPLAN Notices
%V 24
%N 11
%P 125-131
%D November 1989
%T A Programmer-friendly LL(1) Parser Generator
%A D. Grune
%A C.J.H. Jacobs
%J Software \- Practice and Experience
%V 18
%N 1
%P 29-38
%D January 1988
%T The Code Expander Generator
%A Frans Kaashoek
%A Koen Langendoen
%R IM-9
%I Vrije Universiteit, Amsterdam
%D November 1987
%T The ACK Pascal Compiler
%A Aad Geudeke
%A Frans Hofmeester
%R IM-8
%I Vrije Universiteit, Amsterdam
%D November 1987
%T The EM-interpreter
%A Eddo de Groot
%A Leo van den Berge
%R IM-7
%I Vrije Universiteit, Amsterdam
%D June 1987
%T A set of multi\-process primitives for stack based machines
%A K. Bot
%A E. Scheffer
%R IR-122
%I Vrije Universiteit, Amsterdam
%D December 1986
%T An Occam Compiler
%A K. Bot
%A E. Scheffer
%R IM-6
%I Vrije Universiteit, Amsterdam
%D December 1986
%T Language- and Machine-independent Global Optimization on Intermediate Code
%A H.E. Bal
%A A.S. Tanenbaum
%J Computer Languages
%V 11
%N 2
%P 105-121
%D April 1986
%T The ACK Target Optimizer
%A H.E. Bal
%R IR-107
%D 1985
%I Vrije Universiteit, Amsterdam
%T Some Topics in Parser Generation
%A C.J.H. Jacobs
%R IR-105
%D October 1985
%I Vrije Universiteit, Amsterdam
%T The CEM compiler
%A E.H. Baalbergen
%A D. Grune
%A M. Waage
%R IM-4
%I Vrije Universiteit, Amsterdam
%D 1985
%T The Design and Implementation of the EM Global Optimizer
%A H.E. Bal
%I Vrije Universiteit, Amsterdam
%R IR-99
%D March 1985
%T Does anybody out there want to write HALF of a compiler?
%A A.S. Tanenbaum
%A E.G. Keizer
%A H. van Staveren
%J SIGPLAN Notices
%V 19
%N 8
%P 106-108
%D August 1984
%T Amsterdam Compiler Kit documentation
%A A.S. Tanenbaum et al.
%I Vrije Universiteit, Amsterdam
%R IR-90
%D June 1984
%T A Practical Toolkit for Making Portable Compilers
%A A. S. Tanenbaum
%A H. van Staveren
%A E. G. Keizer
%A J. W. Stevenson
%J Communications of the ACM
%V 26
%N 9
%P 654-660
%D September 1983
%T Description of a Machine Architecture for use with Block Structured
Languages
%A A. S. Tanenbaum
%A H. van Staveren
%A E. G. Keizer
%A J. W. Stevenson
%R IR-81
%D August 1983
%I Vrije Universiteit, Amsterdam
%T A Unix Toolkit for Making Portable Compilers
%A A.S. Tanenbaum
%A H. van Staveren
%A E.G. Keizer
%A J.W. Stevenson
%J Proceedings USENIX conf.
%C Toronto, Canada
%V 26
%D July 1983
%P 255-261
%T Using Peephole Optimization on Intermediate Code
%A A.S. Tanenbaum
%A H. van Staveren
%A J.W. Stevenson
%J TOPLAS
%V 4
%N 1
%P 21-36
%D January 1982
%T EM-1 Compiler
%A A.S. Tanenbaum
%J Pascal News
%D September 1981
%P 4-38
%T A portable compiler for the Proposed ISO Standard Pascal Language
%A A.S. Tanenbaum
%A J.W. Stevenson
%A H. van Staveren
%J SIGPLAN Notices
%V 15
%N 10
%D 1980
%T Implications of Structured Programming for Machine Architecture
%A A.S. Tanenbaum
%J Communications of the ACM
%V 21
%N 3
%P 237-246
%D March 1978
%T The table driven code generator from the Amsterdam Compiler Kit (Second
revised edition)
%A H. van Staveren
%I Vrije Universiteit, Amsterdam
%R on-line internal ACK documentation
%D early 1985
%T Dhrystone Benchmark: Rationale for Version 2 and Measurement Rules
%A R.P. Weicker
%J SIGPLAN Notices
%V 23
%N 8
%D August 1988
%P 49-62

22
doc/sparc/timing Normal file

@ -0,0 +1,22 @@
DHRYSTONES V2.0
cc cc -O4 cc -O fccO fccCE ack ack -O4
compile time:
real 4.0 12.0 10.0 6.4 8.0 31.0
user 1.6 7.3 4.1 1.9 1.8 2.0 9.3
sys 0.9 2.1 1.8 2.5 1.5 2.0 7.7
run time: 7263 16250 15250 4730 3430 8474 10434
(stones/sec)
SIEVE
cc cc -O4 fccO fccCE ack ack -O4
compile time:
real 2.4 4.4 x 3.3 6.4 17.0
user 0.8 1.6 x 0.7 0.7 3.2
sys 0.7 1.0 x 0.8 1.3 6.2
run time: 7.43 2.02 x 12.18 2.93 2.26
All ack-derived compilers are shell script driven

15
doc/sparc/title Normal file

@ -0,0 +1,15 @@
.so init
.TL
.sp 1.2c
A fast backend for SPARC processors
.AU
Philip Homburg
Raymond Michiels
.AI
Dept. of Mathematics and Computer Science
Vrije Universiteit
Amsterdam, The Netherlands
.PP
.sp 1i
Graduation report, 20 August 1990
.bp