Added
This commit is contained in:
parent
63c9fea5c2
commit
fb51183da2
29 changed files with 2085 additions and 0 deletions
14
doc/sparc/.distr
Normal file
14
doc/sparc/.distr
Normal file
|
@ -0,0 +1,14 @@
|
|||
1
|
||||
2
|
||||
3
|
||||
4
|
||||
5
|
||||
A
|
||||
B
|
||||
init
|
||||
intro
|
||||
note_on_reg_wins
|
||||
refs
|
||||
timing
|
||||
title
|
||||
Makefile
|
53
doc/sparc/1
Normal file
53
doc/sparc/1
Normal file
|
@ -0,0 +1,53 @@
|
|||
.so init
|
||||
.NH
|
||||
INTRODUCTION
|
||||
.NH 2
|
||||
Why an EM backend for SPARC processors?
|
||||
.PP
|
||||
With the introduction of SPARC-based computers like the Sun-4, a
|
||||
whole new range of fast computers became readily available to the general
|
||||
public. The power of large mainframes had been captured into a small
|
||||
desk-top computer at only a fraction of the cost.
|
||||
.PP
|
||||
In the older days, a new computer used to be very hard to integrate into
|
||||
the existing environment, but due to standardization in the software world
|
||||
incompatibility in hardware no longer means incompatibility in software.
|
||||
Programs that are written for computer A can often be run on computer B
|
||||
without major modifications. Unfortunately this is not true for all software.
|
||||
.PP
|
||||
There will always be programs that rely on the specific
|
||||
hardware of a certain computer for many different reasons. They
|
||||
can be categorized as:
|
||||
.IP -
|
||||
poorly written programs
|
||||
.IP -
|
||||
programs to directly control hardware (device drivers)
|
||||
.IP -
|
||||
code that requires efficiency (time-critical I/O drivers)
|
||||
.IP -
|
||||
programs to generate code to run on the hardware (compilers)
|
||||
.LP
|
||||
This project for instance, the design and implementation of an EM backend
|
||||
for SPARC processors, comes in the last category.
|
||||
.PP
|
||||
We have designed and implemented an algorithm to convert EM programs to code
|
||||
that will run directly on the SPARC hardware. Henceforth, both the algorithm
|
||||
and the implementation will be referred to as the EM-to-SPARC backend,
|
||||
or simply: the backend.
|
||||
.NH 2
|
||||
Why has nobody done this before?
|
||||
.PP
|
||||
Since EM was designed around 1981 and even SPARC has been around for some
|
||||
years now, one may wonder why nobody has ever written an EM to SPARC backend
|
||||
before. The reason is twofold. In the first place, there are some
|
||||
non-trivial problems to be solved in the design phase, and secondly,
|
||||
the SPARC-design combined with the lack of documentation, would surely
|
||||
cost a lot of blood, sweat and tears. The absence of
|
||||
clues to any of the design problems, combined with the \(em at first
|
||||
glance \(em inhuman
|
||||
SPARC instruction set did not make this a very attractive project.
|
||||
.PP
|
||||
On the other hand, these were exactly the reasons which made us take on
|
||||
this particular project: it would require design skills, as well as some
|
||||
hard work; a golden combination for a successful project.
|
||||
.bp
|
109
doc/sparc/2
Normal file
109
doc/sparc/2
Normal file
|
@ -0,0 +1,109 @@
|
|||
.so init
|
||||
.nr H1 1
|
||||
.NH
|
||||
CLOSE-UP LOOK
|
||||
.NH 2
|
||||
What is EM?
|
||||
.PP
|
||||
As the abstract of the IR-81 rapport on EM
|
||||
.[ [
|
||||
description of a machine architecture
|
||||
.]]
|
||||
says: \*(OQEM is a family
|
||||
of intermediate languages designed for producing portable compilers.\*(CQ
|
||||
Because EM is to be used on a wide range of languages and processors,
|
||||
the instruction set is kept simple enough to allow easy translation to,
|
||||
or interpretation on, almost any processor. Yet it is also powerful enough
|
||||
to accommodate easy translation from almost any block-structured language.
|
||||
.PP
|
||||
Even though EM was designed in the early 1980s, it
|
||||
is based on
|
||||
.\" already shows strong signs of being influenced by
|
||||
the (then innovative) RISC architecture. All instructions
|
||||
have 0 or 1 operands, there are no fancy addressing modes as in the
|
||||
68020's\*(Si move.w a3(_array,d3.w*2), -(sp)\*(So, no explicit registers,
|
||||
although instructions for higher languages
|
||||
such as array-operations, multiway branches (case) and
|
||||
floating point operations are provided.
|
||||
.PP
|
||||
To fully understand the discussion in the following chapters,
|
||||
the reader should at least have some knowledge of EM.
|
||||
.NH 2
|
||||
What is SPARC?
|
||||
.PP
|
||||
According to Sun's RISC tutorial: \*(OQSun Microsystems has designed a RISC
|
||||
architecture, called SPARC, and has implemented that architecture with
|
||||
the Sun-4 family of supercomputing workstations and servers. SPARC stands
|
||||
for Scalable Processor ARChitecture, emphasizing its applicability to
|
||||
large as well as small machines.\*(CQ
|
||||
.PP
|
||||
In sharp contrast to EM, SPARC does have
|
||||
explicit registers (31 integer and 32 floating point, all of which
|
||||
are 32 bits wide) and
|
||||
does not support any high level language operations: it does not even have
|
||||
multiplication or division instructions. Because the SPARC design is
|
||||
very straightforward, all instructions could be hard-coded (no microcode
|
||||
involved) to
|
||||
provided extremely high performance. All register-to-register operations
|
||||
require exactly one clock cycle, and all register-to-memory and
|
||||
memory-to-register operations require two clock cycles, one to retrieve
|
||||
the instruction and one to access external memory. At a clock speed of
|
||||
over 20 MHz this means that you can achieve well over 10 VAX MIPS:
|
||||
more than 4 times the speed of a 15 MHz 68020 used in the Sun3/50.
|
||||
.PP
|
||||
As above, the reader should also have some general knowledge about
|
||||
the SPARC processer to be able to understand the following chapters.
|
||||
.NH 2
|
||||
What exactly is a (fast) backend?
|
||||
.PP
|
||||
To put in the simplest of ways: a (fast) backend is a set of routines to
|
||||
translate EM code to code that will run 'on the metal' (for example the SPARC
|
||||
processor). The distinction between full-fledged backends (code generators)
|
||||
.[ [
|
||||
The table driven code generator
|
||||
.]]
|
||||
and fast backends (code expanders)
|
||||
.[ [
|
||||
The Code Expander Generator
|
||||
.]]
|
||||
is related to
|
||||
the compilation-time vs. run-time trade off. Code generators generate
|
||||
efficient code and code expanders generate code very efficient.
|
||||
For details about code expanders see also
|
||||
.[ [
|
||||
The design of very fast portable compilers
|
||||
.]].
|
||||
.PP
|
||||
The reasons for us to implement a code expander are numerous: Our first reason to
|
||||
implement a code expander, rather than a code generator was that implementing a
|
||||
code expander would be hard enough already. Code generators only give
|
||||
more problems and there were already enough problems to be solved. Secondly,
|
||||
we knew we would never be able to compete with original SPARC compilers due
|
||||
to loss of information in the frontends (see also chapter 5). By implementing
|
||||
a code expander we might be able to outrun the existing compilers on a
|
||||
completely different terrain: compile speed.
|
||||
.PP
|
||||
The third 'reason' to implement a code expander lies a little deeper and was
|
||||
not discovered until we had actually started the implementation... It was only
|
||||
then that we found out that for certain architectures, such as the SPARC,
|
||||
the idea behind the code-expander is not necessarily inferior to that
|
||||
behind a code-generator. It seems that for highly orthogonal instruction
|
||||
sets it is possible to generate near optimal code without using the
|
||||
code-expander. We have to say, however, that this is only true for our
|
||||
optimized version of the code-expander. With the original code-expander
|
||||
it would not have been possible to generate near-optimal code for the
|
||||
SPARC processor.
|
||||
.NH 2
|
||||
So, what are the main differences between EM and SPARC?
|
||||
.PP
|
||||
The main
|
||||
difference between EM and SPARC is the stack versus register orientation.
|
||||
The other differences, such as the presence of high level language
|
||||
operations in EM, can easily be overcome by subroutines,
|
||||
or small pieces of in-line SPARC code.
|
||||
The design-part of this project mostly concentrates on
|
||||
building a bridge between EM's stack and SPARC's registers.
|
||||
.PP
|
||||
In the next chapter we will make a list of all our design problems which
|
||||
will then be discussed in chapter 4.
|
||||
.bp
|
82
doc/sparc/3
Normal file
82
doc/sparc/3
Normal file
|
@ -0,0 +1,82 @@
|
|||
.so init
|
||||
.nr H1 2
|
||||
.NH
|
||||
PROBLEMS
|
||||
.NH 2
|
||||
Maintain SPARC speed
|
||||
.PP
|
||||
If we want to generate SPARC code, we should try to generate efficient code
|
||||
as fast as possible. It would be quite embarrassing to find out that the
|
||||
same program would run faster on a Motorola 68020 than on a SPARC processor,
|
||||
when both operate at the same clock frequency.
|
||||
Looking at some code generated by Sun's C-compiler and optimizing assembler,
|
||||
we can spot a few remarkable characteristics of the generated SPARC code:
|
||||
.IP -
|
||||
There are almost no memory references
|
||||
.IP -
|
||||
Parameters to functions are passed through registers.
|
||||
.IP -
|
||||
Almost all delay slots\(dg
|
||||
.FS
|
||||
\(dg For details about delay slots see the SPARC Architecture Manual, chapter 4, pp. 42-48
|
||||
.FE
|
||||
are filled in by the assembler
|
||||
.LP
|
||||
If we want to generate efficient code, we should at least try to
|
||||
reduce the number of memory references and use registers wherever we can.
|
||||
Since EM is stack-oriented it references its stack for every operation so
|
||||
this will not be an easy task; a suitable solution will however be given in
|
||||
the next chapter.
|
||||
.NH 2
|
||||
Increase compilation speed
|
||||
.PP
|
||||
Because we will implement a code expander (fast backend) we should keep
|
||||
a close eye on efficiency; if we cannot beat regular compilers on producing
|
||||
efficient code we will try to beat them on fast code generation.
|
||||
The usual trick to achieve fast compilation is to pack the frontend,
|
||||
optimizer, code-generator and
|
||||
assembler all into a single large binary to reduce the overhead of
|
||||
reading and writing temporary files. Unfortunately, due to the
|
||||
SPARC instruction set, its relocation information is slightly bizarre
|
||||
and cannot be represented with the present primitives.
|
||||
This means that it will not be possible to generate the required output
|
||||
format directly from our backend.
|
||||
.PP
|
||||
There are three solutions here: generate assembler code, and let an
|
||||
existing assembler generate the required object (\fI.o\fR) files,
|
||||
create our own primitives than can handle the SPARC relocation format, or
|
||||
do not use any of the addressing modes that require the bizarre relocation.
|
||||
Because we have enough on our hands already we will
|
||||
let the existing assembler deal with generating object files.
|
||||
.NH 2
|
||||
Convert stack to register operations
|
||||
.PP
|
||||
As we wrote in the previous chapter, for RISC machines a code expander can
|
||||
produce almost as efficient code as a code generator. The fact that this is
|
||||
true for stack-oriented RISC processors is rather obvious. The problem we
|
||||
face, however, is that the SPARC processor is register, instead of
|
||||
stack oriented. In the next chapter we will give a suitable solution to
|
||||
convert most stack accesses to register accesses.
|
||||
.NH 2
|
||||
Miscellaneous
|
||||
.PP
|
||||
Besides performance and \fI.o\fR-compatibility there are some other
|
||||
peculiarities of the SPARC processor and Sun's C-compiler (henceforth
|
||||
simply called \fIcc\fR).
|
||||
.PP
|
||||
For some reason, the SPARC stack pointer requires alignment
|
||||
on 8 bytes, so you cannot push a 4-byte integer on the stack
|
||||
and then \*(Sisub 4, %sp\*(So\(dd.
|
||||
.FS
|
||||
\(dd For more information about SPARC assembler see the Sun-4 Assembly
|
||||
Language Reference Manual
|
||||
.FE
|
||||
This too will be discussed in the next chapter, where we will take a
|
||||
more in-depth look into this problem and also discuss a couple of
|
||||
possible solutions.
|
||||
.PP
|
||||
Another thing is that \fIcc\fR usually passes the first six parameters of a
|
||||
function-call through registers. To be \fI.o\fR-compatible we would have to
|
||||
pass the first six parameters of each function call through registers as well.
|
||||
Exactly why this is not feasible will also be discussed in the next chapter.
|
||||
.bp
|
468
doc/sparc/4
Normal file
468
doc/sparc/4
Normal file
|
@ -0,0 +1,468 @@
|
|||
.so init
|
||||
.hw data-structures
|
||||
.nr H1 3
|
||||
.NH
|
||||
SOLUTIONS
|
||||
.NH 2
|
||||
Maintaining SPARC speed
|
||||
.PP
|
||||
In chapter 3 we wrote:
|
||||
.sp 0.3
|
||||
.nf
|
||||
>If we want to generate efficient code, we should at least try to reduce the number of
|
||||
>memory references and use registers wherever we can.
|
||||
.fi
|
||||
.sp 0.3
|
||||
In this chapter we will device a strategy to swiftly generate acceptable
|
||||
code by using push-pop optimization.
|
||||
Note that this is not the push-pop
|
||||
optimization already available in the EM-kit, since that is only present
|
||||
in the assembler-to-binary part which we do not use
|
||||
.[ [
|
||||
The Code Expander Generator
|
||||
.]].
|
||||
Our push-pop optimization
|
||||
works more like the fake-stack described in
|
||||
.[ [
|
||||
The table driven code generator
|
||||
.]].
|
||||
.NH 3
|
||||
Ad-hoc optimization
|
||||
.PP
|
||||
Before getting involved in any optimization let's have a look at some
|
||||
code generated with a straightforward EM to SPARC conversion of the
|
||||
C statement: \*(Sif(a[i]);\*(So Note that \*(Si%SP\*(So is an alias
|
||||
for a general purpose
|
||||
register and acts as the EM stack pointer. It has nothing to do with
|
||||
\*(Si%sp\*(So \(em the SPARC stack pointer.
|
||||
Analogous \*(Si%LB\*(So is EMs local base pointer.
|
||||
.br
|
||||
.IP
|
||||
.HS
|
||||
.TS
|
||||
;
|
||||
l s l s l
|
||||
l1f6 lf6 l2f6 lf6 l.
|
||||
EM code SPARC code Comment
|
||||
|
||||
lae _a set _a, %g1 ! load address of external _a
|
||||
dec 4, %SP
|
||||
st %g1, [%SP]
|
||||
|
||||
lol -4 set -4, %g1 ! load local -4 (i)
|
||||
ld [%g1+%LB], %g2
|
||||
dec 4, %SP
|
||||
st %g2, [%SP]
|
||||
|
||||
loc 2 set 2, %g1 ! load constant 2
|
||||
dec 4, %SP
|
||||
st %g1, [%SP]
|
||||
|
||||
sli 4 ld [%SP], %g1 ! pop shift count
|
||||
ld [%SP+4], %g2 ! pop shiftee
|
||||
sll %g2, %g1, %g3
|
||||
inc 4, %SP
|
||||
st %g3, [%SP] ! push 4 * i
|
||||
|
||||
ads 4 ld [%SP], %g1 ! add pointer and offset
|
||||
ld [%SP+4], %g2
|
||||
add %g1, %g2, %g3
|
||||
inc 4, %SP
|
||||
st %g3, [%SP] ! push address of _a + (4 * i)
|
||||
|
||||
loi 4 ld [%SP], %g1 ! load indirect 4 bytes
|
||||
ld [%g1], %g2
|
||||
st %g2, [%SP] ! push a[i]
|
||||
cal _f
|
||||
...
|
||||
.TE
|
||||
.HS
|
||||
.LP
|
||||
Although the code is easy understand, it clearly is far from optimal.
|
||||
The above code uses approximately 60 clock-cycles\(dg
|
||||
.FS
|
||||
\(dg In general each instruction only takes one cycle,
|
||||
except for \*(Sild\*(So and
|
||||
\*(Sist\*(So which may both require additional clock cycles. The exact amount
|
||||
of extra cycles needed depends on the SPARC implementation and memory access
|
||||
time. Furthermore, the
|
||||
\*(Siset\*(So pseudo-instruction is a bit tricky. It takes one cycle when
|
||||
its argument lies between -4096 and 4095, and two cycles otherwise.
|
||||
.FE
|
||||
to push an array-element on the stack,
|
||||
something which a 68020 can do in a single instruction. The SPARC
|
||||
processor may be fast, but not fast enough to justify the above code.
|
||||
.PP
|
||||
The same statement can be translated much more efficiently:
|
||||
.DS
|
||||
.TS
|
||||
;
|
||||
l2f6 lf6 l.
|
||||
sll %i0, 2, %g2 ! multiply index by 4
|
||||
set _a, g3
|
||||
ld [%g2+%g3], %g1 ! get contents of a[i]
|
||||
dec 4, SP
|
||||
st %g2, [SP] ! push a[i] onto the stack
|
||||
.TE
|
||||
.DE
|
||||
which, instead of 60, uses only 5 clock cycles to retrieve the element
|
||||
from memory and 5 additional cycles when the result has to be pushed
|
||||
on the stack. Note that when the result is not a parameter it does not
|
||||
have to be pushed on the stack. By making efficient use of the SPARC
|
||||
registers we can fetch \*(Sia[i]\*(So in only 5 cycles!
|
||||
.NH 3
|
||||
Analyzing optimization
|
||||
.PP
|
||||
Instead of ad-hoc optimization we will need something more solid.
|
||||
When one tries to optimize the above code in an ad-hoc manner one will
|
||||
probably notice the large overhead due to stack access. Almost every EM
|
||||
instruction requires at least three SPARC instructions: one to carry out
|
||||
the EM instruction and two to pop and push the result from and onto the
|
||||
stack. This happens for every instruction, even though the data being pushed
|
||||
will probably be needed by the next instruction. To optimize this extensive
|
||||
pushing and popping of data we will use the appropriately named push-pop
|
||||
optimization.
|
||||
.PP
|
||||
The idea behind push-pop optimization is to delay the push operation until
|
||||
it is almost certain that the data actually has to be pushed.
|
||||
As is often the case, the data does not have to be pushed,
|
||||
but will be used as input to another EM instruction.
|
||||
If we can decide at compile time that this will indeed be
|
||||
the case we can save the time of first pushing the data and then popping it
|
||||
back again by temporarily storing the data (possibly only during compilation!)
|
||||
and using it no sooner than it is actually needed.
|
||||
.PP
|
||||
The \*(Sisli 4\*(So instruction, for instance, expects two inputs on top of the
|
||||
stack: on top a counter and right below that the shiftee (the number
|
||||
to be shifted). As a result \*(Sisli\*(So
|
||||
pushes 'shiftee << counter' back to the stack. Now consider the following
|
||||
sequence, which could be the result of the expression \*(Si4 * i\*(So
|
||||
.DS
|
||||
.TS
|
||||
;
|
||||
l1f6 lf6 l.
|
||||
lol -4
|
||||
loc 2
|
||||
sli 4
|
||||
.TE
|
||||
.DE
|
||||
In the non-optimized situation the \*(Silol\*(So would push
|
||||
a local variable (whose offset is -4) on the stack.
|
||||
Then the \*(Siloc\*(So pushes a 2 on the stack and finally \*(Sisli\*(So
|
||||
retrieves both these numbers to replace then with the result.
|
||||
On most machines it is not necessary to
|
||||
push the 2 on the stack, since it can be used in the shift instruction
|
||||
as an immediately operand. On a SPARC, for instance, one can write
|
||||
.DS
|
||||
.TS
|
||||
;
|
||||
l2f6 lf6 l.
|
||||
ld [%LB-4], %g1 ! load local variable into register g1
|
||||
sll %g1, 2, %g2 ! perform the shift-left-by-2
|
||||
.TE
|
||||
.DE
|
||||
where the output of the \*(Silol\*(So, as well as the immediate operand 2 are used
|
||||
in the shift instruction. As suggested before, all of this can be
|
||||
achieved with push-pop optimization.
|
||||
.NH 3
|
||||
A mechanism for push-pop optimization
|
||||
.PP
|
||||
To implement the above optimization we need some mechanism to
|
||||
temporarily store information during compilation.
|
||||
We need to be able to store, compare and retrieve information from the
|
||||
temporary storage (cache) without any
|
||||
loss of information. Before describing all the routines used
|
||||
to implement our cache we will first describe how the cache works.
|
||||
.PP
|
||||
Items in the cache are structures containing an external (\*(Sichar *\*(So),
|
||||
two registers (\*(Sireg_t\*(So) and a constant (\*(Siarith\*(So),
|
||||
any of which may be 0.
|
||||
The value of such a structure is the sum of (the values of)
|
||||
its elements. To put a register in the cache, one has to be allocated either
|
||||
by calling \*(Sialloc_reg\*(So which returns a free register, by
|
||||
\*(Siforced_alloc_reg\*(So which allocates a specific register or any
|
||||
of the other routines available to allocate a register. The keep things
|
||||
simple, we will not discuss all of the available primitives here.
|
||||
When the register
|
||||
is then put in the cache by the \*(Sipush_reg\*(So routine, the ownership will
|
||||
be transferred from the user to the cache. Ownership is important, because
|
||||
only the owner of a register may (and must!) deallocate it. Registers can be
|
||||
owned by either an (imaginary) register manager, the cache or the user.
|
||||
When the user retrieves a register from the stack with \*(Sipop_reg\*(So for
|
||||
instance, ownership is back to the user.
|
||||
The user should then call \*(Sifree_reg\*(So
|
||||
to transfer ownership to the register manager or call \*(Sipush_reg\*(So
|
||||
to give it back to the cache.
|
||||
Since the cache behaves itself as a stack we will use the term pop resp. push
|
||||
to get items from, resp. put items in the cache.
|
||||
.PP
|
||||
We shall now present the sets of routines that implement the cache.
|
||||
.IP \(bu
|
||||
The routines
|
||||
.DS
|
||||
\*(Si
|
||||
reg_t alloc_reg(void)
|
||||
reg_t alloc_reg_var(void)
|
||||
reg_t alloc_float(void)
|
||||
reg_t alloc_float_var(void)
|
||||
reg_t alloc_double(void)
|
||||
reg_t alloc_double_var(void)
|
||||
|
||||
void forced_alloc_reg(reg_t)
|
||||
void soft_alloc_reg(reg_t)
|
||||
|
||||
void free_reg(reg_t)
|
||||
void free_double_reg(reg_t)
|
||||
\*(So
|
||||
.DE
|
||||
allocate and deallocate registers. If there are no more register left,
|
||||
i.e. they are owned by the cache,
|
||||
one or more registers will be freed by flushing part of the cache
|
||||
onto the real stack.
|
||||
The \*(Sialloc_xxx_var\*(So primitives try to allocate a register that
|
||||
can be used to store local variables. (In the current implementation
|
||||
only the input and local registers.) If none can be found \*(SiNULL\*(So
|
||||
is returned. \*(Siforced_alloc_reg\*(So forces the allocation of a certain
|
||||
register. If it was already in use, its contents are moved to another
|
||||
register. Finally \*(Sisoft_alloc_reg\*(So provides the possibility to
|
||||
push a register onto the cache and still keep a copy for later use.
|
||||
(Used to implement the \*(Sidup 4\*(So for example.)
|
||||
.IP \(bu
|
||||
The routines
|
||||
.DS
|
||||
\*(Si
|
||||
void push_const(arith)
|
||||
arith pop_const(void)
|
||||
\*(So
|
||||
.DE
|
||||
push or pop a constant onto or from the stack. Distinction between
|
||||
constants and other types is made so as not to loose any information; constants
|
||||
may be used later on as immediate operators, which is not the case
|
||||
for other types. If \*(Sipop_const\*(So is called, but the element on top of
|
||||
the cache has either one of the external or register fields non-zero a
|
||||
fatal error will be reported.
|
||||
.IP \(bu
|
||||
The routines
|
||||
.DS
|
||||
\*(Si
|
||||
reg_t pop_reg(void)
|
||||
reg_t pop_float(void)
|
||||
reg_t pop_double(void)
|
||||
reg_t pop_reg_c13(char *n)
|
||||
|
||||
void pop_reg_as(reg_t)
|
||||
|
||||
void push_reg(reg_t)
|
||||
\*(So
|
||||
.DE
|
||||
push or pop a register. These will be used most often since results from one
|
||||
EM instruction, which are computed in a register, are often used in the next.
|
||||
When the element on top of the cache is more
|
||||
than just a register the cache manager
|
||||
will generate code to compute the sum of its fields and put the result in a
|
||||
register. This register will then be given to the user.
|
||||
If the user wants the result is a special register, he should use the
|
||||
\*(Sipop_reg_as\*(So routine.
|
||||
The \*(Sipop_reg_c13\*(So gives an optional number (as character string) whose
|
||||
value can be represented in 13 bits. The constant can then be used as an
|
||||
offset for the SPARC \*(Sild\*(So and \*(Sist\*(So instructions.
|
||||
.IP \(bu
|
||||
The routine
|
||||
.DS
|
||||
\*(Si
|
||||
void push_ext(char *)
|
||||
\*(So
|
||||
.DE
|
||||
pushes an external onto the stack. There is no pop-variant of this one since
|
||||
there is no use in popping an external.
|
||||
.IP \(bu
|
||||
The routines
|
||||
.DS
|
||||
\*(Si
|
||||
void inc_tos(arith n)
|
||||
void inc_tos_reg(reg_t r)
|
||||
\*(So
|
||||
.DE
|
||||
increment the element on top of the cache by either the constant \*(Sin\*(So
|
||||
or by a register. The latter is useful for pointer addition when referencing
|
||||
external memory.
|
||||
.KS
|
||||
.IP \(bu
|
||||
The routine
|
||||
.DS
|
||||
\*(Si
|
||||
int type_of_tos(void)
|
||||
\*(So
|
||||
.DE
|
||||
.KE
|
||||
returns the type of the element on top of the cache. This is a combination
|
||||
(binary OR) of \*(SiT_ext\*(So, \*(SiT_reg\*(So or \*(SiT_float\*(So,
|
||||
\*(SiT_reg2\*(So or \*(SiT_float2\*(So, and \*(SiT_cst\*(So,
|
||||
and tells the
|
||||
user which of the three fields are non-zero. When the register-fields
|
||||
represent \*(Si%g0\*(So, it is considered zero.
|
||||
.IP \(bu
|
||||
Miscellaneous routines:
|
||||
.DS
|
||||
\*(Si
|
||||
void init_cache(void)
|
||||
void cache_need(int)
|
||||
void change_reg(void)
|
||||
void flush_cache(void)
|
||||
\*(So
|
||||
.DE
|
||||
\*(Siinit_cache\*(So should be called before any
|
||||
other cache routines, to initialize some internal datastructures.
|
||||
\*(Sicache_need\*(So is used to tell the cache that a certain number
|
||||
of register are needed for the next operation. This way the cache can
|
||||
load them efficiently in one fell swoop. \*(Sichange_reg\*(So is to be
|
||||
called when the user changes a register of which the cache (possibly) has
|
||||
co-ownership. Because the contents of registers in the cache are
|
||||
not allowed to change the user should call \*(Sichange_reg\*(So to
|
||||
instruct the cache to copy the contents to some other register.
|
||||
\*(Siflush_cache\*(So writes the cache to the stack and invalidates
|
||||
the cache. It should be used before branches,
|
||||
before labels and on other places where the stack has to be valid (i.e. where
|
||||
every item on the EM-stack should be stored on the real stack, not in some
|
||||
virtual cache).
|
||||
.NH 3
|
||||
Implementing push-pop optimization in the EM_table
|
||||
.PP
|
||||
As indicated above, there is no regular way to represent the described
|
||||
optimization in the EM_table. The only possible escapes from the EM_table
|
||||
are function calls, but that is clearly not enough to implement a good
|
||||
push-pop optimizer. Therefore we will use a modified version of the EM_table
|
||||
format, where the description of, say, the \*(Silol\*(So instruction might look
|
||||
like this\(dg:
|
||||
.FS
|
||||
\(dg This is not the way the \*(Silol\*(So actually looks in the EM_table;
|
||||
it only shows how it \fImight\fR look using the forementioned push/pop
|
||||
primitives.
|
||||
.FE
|
||||
.DS
|
||||
\*(Si
|
||||
reg_t A, B;
|
||||
const_str_t n;
|
||||
|
||||
alloc_reg(A);
|
||||
push_reg(LB);
|
||||
inc_tos($1);
|
||||
B = pop_reg_c13(n);
|
||||
"ld [$B+$n], $A";
|
||||
push_reg(A);
|
||||
free_reg(B);
|
||||
\*(So
|
||||
.DE
|
||||
For more details about the exact implementation consult
|
||||
appendix B which contains some characteristic excerpts from the EM_table.
|
||||
.NH 2
|
||||
Stack management
|
||||
.PP
|
||||
When converting EM code to some executable code there is the problem of
|
||||
maintaining multiple stacks. The usual way to do this is described in
|
||||
.[ [
|
||||
Description of a Machine Architecture
|
||||
.]]
|
||||
and is shown in figure \*(SN1.
|
||||
.KE
|
||||
.PS
|
||||
copy "pics/EM_stack.orig"
|
||||
.PE
|
||||
.ce 1
|
||||
\fIFigure \*(SN1: usual stack management.
|
||||
.KE
|
||||
.sp
|
||||
.LP
|
||||
This means that the EM stack and the hardware stack (used
|
||||
for subroutine calls, etc.) are interleaved in memory. On the SPARC, however,
|
||||
this brings up a large problem: in the former model it is assumed that the
|
||||
resolution of the stack pointer is a word, but this is not the case on the
|
||||
SPARC processor. On the SPARC processor the stack-pointer as well as the
|
||||
frame-pointer have to be aligned on 8-byte boundaries, so one can not simply
|
||||
push a word on the stack and then lower the stack-pointer by 4 bytes!
|
||||
.NH 3
|
||||
Possible solutions
|
||||
.PP
|
||||
A simple idea might be to use a swiss-cheese stack; we could
|
||||
push a 4-byte word onto the stack and then lower the stack by 8.
|
||||
Unfortunately, this is not a very solid solution, because
|
||||
pointer-arithmetic involving pointers to objects on the stack would cause
|
||||
hard-to-predict anomalies.
|
||||
.PP
|
||||
Another try would be not to use the hardware stack at all. As long as we
|
||||
do not generate subroutine-calls everything will be all right. This
|
||||
approach, however, also has some disadvantages: first we would not be able
|
||||
to use any of the existing debuggers such as \fIadb\fR, because they all
|
||||
assume a regular stack format. Secondly, we would not be able to make use
|
||||
of the SPARC's register windows to keep local variables. Finally, doing all the
|
||||
administrative work necessary for subroutine calls ourselves instead of
|
||||
letting the hardware handle it for us,
|
||||
causes unnecessary procedure-call overhead.
|
||||
.PP
|
||||
Yet another alternative would be to emulate the EM-part of the stack,
|
||||
and to let the hardware handle the subroutine call. Since we will
|
||||
emulate our own stack, there are no alignment restrictions and because
|
||||
we will use the hardware procedure call we can still make use of
|
||||
the register windows.
|
||||
.NH 3
|
||||
Our implementation
|
||||
.PP
|
||||
To implement the hybrid stack we need two extra registers: one for the
|
||||
the EM stack pointer (the forementioned \*(Si%SP\*(So) and one for the
|
||||
EM local base pointer (\*(Si%LB\*(So). The most elegant solution would be to
|
||||
put both stacks in different segments, so they would not influence
|
||||
each other. Unfortunately
|
||||
.UX
|
||||
lacks the ability to add segments and
|
||||
since we will implement our backend under
|
||||
.UX,
|
||||
we will have to put
|
||||
both stacks in the same segment. Exactly how this can be done is shown
|
||||
in figure \*(SN2.
|
||||
.DS
|
||||
.PS
|
||||
copy "pics/mem_config"
|
||||
.PE
|
||||
.ce 1
|
||||
\fIFigure \*(SN2: our stack management.\fR
|
||||
.DE
|
||||
.sp
|
||||
During normal procedure execution, the SPARC stack pointer has to point to
|
||||
a memory location where the operating system can dump the active part of
|
||||
the register window. The rest of the
|
||||
register window will be dumped in the therefor pre-allocated (stack) space
|
||||
by following the frame
|
||||
pointer. When a signal occurs things get even more complicated and
|
||||
result in figure \*(SN3.
|
||||
.DS
|
||||
.PS
|
||||
copy "pics/signal_stack"
|
||||
.PE
|
||||
.ce 1
|
||||
\fIFigure \*(SN3: our signal stack.\fR
|
||||
.DE
|
||||
.PP
|
||||
The exact implementation of the stack is shown in figure \*(SN4.
|
||||
.KF
|
||||
.PS
|
||||
copy "pics/EM_stack.ours"
|
||||
.PE
|
||||
.ce 1
|
||||
\fIFigure \*(SN4: stack overview.\fR
|
||||
.KE
|
||||
.NH 2
|
||||
Miscellaneous
|
||||
.PP
|
||||
As mentioned in the previous chapter, the generated \fI.o\fR-files are
|
||||
not compatible with Sun's own object format. The primary reason for
|
||||
this is that Sun usually passes the first six parameters of a procedure call
|
||||
through registers. If we were to do that too, we would always have
|
||||
to fetch the top six words from the stack into registers, even when
|
||||
the procedure would not have any parameters at all. Apart from this,
|
||||
structure-passing is another exception in Sun's object format which
|
||||
makes is impossible to generate object-compatible code.\(dg
|
||||
.FS
|
||||
\(dg Exactly how Sun passes structures as parameters is described in
|
||||
Appendix D of the SPARC Architecture Manual (Software Considerations)
|
||||
.FE
|
||||
.bp
|
153
doc/sparc/5
Normal file
153
doc/sparc/5
Normal file
|
@ -0,0 +1,153 @@
|
|||
.so init
|
||||
.nr H1 4
|
||||
.NH
|
||||
FUTURE WORK
|
||||
.NH 2
|
||||
A critique of EM
|
||||
.PP
|
||||
In general, EM fits its purpose quite well. Numerous compilers have been
|
||||
written using EM as their intermediate language and it has even become a
|
||||
commercial product. A great deal of its success is probably due to its
|
||||
simplicity. There are no extravagant instructions but it does have all the
|
||||
necessary functions to write a decent compiler.
|
||||
.PP
|
||||
There are, however, a few functions that come rather close to being
|
||||
extravagant. The \*(Silar\*(So function for example \(em used
|
||||
to fetch an element from an array \(em does not make it much easier
|
||||
to write a frontend, but does make it unnecessary hard to write an
|
||||
efficient backend. Other instructions for which it is difficult
|
||||
to generate efficient code for are those that permit
|
||||
dynamic operators, such as the \*(Silos\*(So. Dynamic operators, however, provide
|
||||
significant extra possibilities and can therefore not be disposed of.
|
||||
Note that even though the array operations \*(Silar\*(So and \*(Sisar\*(So
|
||||
provide dynamic operators, they do not add additional power, since
|
||||
they can easily be replaced with a sequence using the \*(Silos\*(So or
|
||||
\*(Sists\*(So instructions.
|
||||
.PP
|
||||
EM code to reference arrays generated by the C frontend can be translated
|
||||
very efficiently for almost any processor. However the same operation
|
||||
generated by the Modula-2 frontend (which uses the \*(Silar\*(So),
|
||||
is much less efficient, although the only difference is that the
|
||||
latter performs range checking whereas the former does not.\(dg
|
||||
.FS
|
||||
\(dg Actually this depends on whether or not explicit range checking in enabled.
|
||||
This clearly shows that the current code generators are not optimal and
|
||||
often depend on ad-hoc decisions.
|
||||
.FE
|
||||
Since range checking can also be expressed explicitly in
|
||||
EM (\*(Sirck\*(So) there is no need for any of the array operations
|
||||
(\*(Siaar\*(So, \*(Silar\*(So and \*(Sisar\*(So).
|
||||
.PP
|
||||
Besides efficiency of the array-operations themselves, there still is another
|
||||
major disadvantage of using these array-operations. In sharp contrast to
|
||||
all other EM instructions except the \*(Silos\*(So and the \*(Sists\*(So,
|
||||
they allow dynamic operators, so their effect on the stack-pointer can not
|
||||
always be
|
||||
determined at compile-time. This means that efficient caching of the
|
||||
top-of-stack in registers is almost impossible,
|
||||
so using these array-operations also effects the
|
||||
efficiency of the surrounding code. Now that processors are produced with
|
||||
more and more registers it could be very beneficiary to cache the
|
||||
top-of-stack, so that the memory/register reference ratio decreases
|
||||
to the benefit of the overall performance.
|
||||
.PP
|
||||
As a final critique, we would also like to discuss the semantics of some of
|
||||
the EM instructions. In
|
||||
.[ [
|
||||
Description of a Machine Architecture
|
||||
.]]
|
||||
it is said that
|
||||
all signed instructions such as the \*(Siadi\*(So, should cause an exception
|
||||
on overflow. The unsigned operations such as \*(Siadu\*(So, however,
|
||||
should act as modulo operations and therefor not perform overflow checking.
|
||||
Since it is very expensive to perform overflow checking in EM,
|
||||
we would suggest that the backend takes care of this. For languages which
|
||||
do not require overflow checking, a simple message could be generated to
|
||||
disable overflow checking in backends. This way all backends could be
|
||||
written to fully comply to the official EM definition without any reduction in
|
||||
efficiency.\(dd
|
||||
.FS
|
||||
\(dd Currently many backends do not implement error checks because they
|
||||
are too expensive and almost never needed. Some frontends even have
|
||||
facilities build in to generate EM-code to force these checks. If this
|
||||
trend continues we will end up with a de-facto and a de-jure standard
|
||||
both developed by the same people but nonetheless incompatible.
|
||||
.FE
|
||||
When such messages will be added we would like to suggest
|
||||
that they can enforce overflow checks on unsigned, as well as signed arithmetic.
|
||||
.PP
|
||||
As a conclusion we would like to suggest removal of the array operations from
|
||||
EM, or at least discontinuation of there usage in frontends.
|
||||
.NH 2
|
||||
\*(OQWanted: Procedure call information\*(CQ
|
||||
.PP
|
||||
The advantage of an intermediate language such as EM is that the backend
|
||||
no longer has to know about any 'quirks' of the 'input'-language. The major
|
||||
disadvantage, however, is that the backend no longer knows about any 'quirks'
|
||||
of the 'input'-language... If the SPARC backend ever has to compete
|
||||
with Sun's own C-compiler for example, removal of the array-operations
|
||||
will not be enough. The amount of information that is lost during
|
||||
the translation to EM is too large to ever generate truly efficient SPARC code.
|
||||
.PP
|
||||
To write such an efficient backend one needs to know, for example, whether,
|
||||
when and what type of parameter is being computed, so the result can be stored
|
||||
in the proper place and scratch registers can be reused.
|
||||
(On the SPARC processor, for example, it is very beneficiary
|
||||
to pass the first six parameters of a procedure call through
|
||||
registers instead of using the stack.)
|
||||
One way to express such things in EM is to insert extra messages in
|
||||
the EM-code. The C statement \*(Sia = f(4, a + b);\*(So for example,
|
||||
could be translated to the following EM-code:
|
||||
.DS
|
||||
.TS
|
||||
;
|
||||
l1f6 lf6 l.
|
||||
lol -4 ! a
|
||||
lol -8 ! b
|
||||
mes x, 2 ! next instruction will compute 2nd parameter
|
||||
adi 4
|
||||
mes x, 1 ! next instruction will compute 1st parameter
|
||||
loc 4
|
||||
cal _f ! call function f
|
||||
lfr 4
|
||||
stl -4 ! store result in a
|
||||
.TE
|
||||
.DE
|
||||
For a code expander it is important that the \*(Simes\*(So pseudo
|
||||
instructions appear \fIbefore\fR
|
||||
the EM instruction that computes the parameter, because that way the final
|
||||
computation (the \*(Siadi\*(So and \*(Siloc\*(So in the previous example)
|
||||
can be translated to machine code that performs the required computation
|
||||
and also puts the result in the required place. If it is found to be
|
||||
too difficult for the frontend to insert these \*(Simes\*(So instructions
|
||||
at the right place the peep-hole optimizer might swap the \*(Simes\*(So and
|
||||
the instruction that computes the parameter.
|
||||
.PP
|
||||
For some architectures, it is also
|
||||
possible to generate more efficient code for a procedure when it is a
|
||||
so-called leaf-procedure: a procedure that doesn't call other procedures.
|
||||
On the SPARC, for example, it is not necessary to rotate the register
|
||||
window for a call to a leaf procedure and it is also possible to use
|
||||
the global registers for register variables in leaf procedures.
|
||||
It will be a little harder to insert useful messages about leaf procedures,
|
||||
because just as with register messages, they are only useful to the
|
||||
backend when they appear immediately
|
||||
after or before the \*(Sipro\*(So pseudo instruction. The frontend,
|
||||
however, only knows whether a certain procedure is a leaf-procedure or not
|
||||
when it has already generated the entire procedure in EM. Just as with the
|
||||
\*(Sipro ? / end n\*(So-dilemma the peep-hole optimizer
|
||||
.[ [
|
||||
Using Peephole Optimization
|
||||
.]]
|
||||
might be able to lend a hand
|
||||
and help us out by delaying EM-code generation until it has reached the
|
||||
end of the procedure.
|
||||
.PP
|
||||
As with most optimizations, the main problem is that they have to be
|
||||
implemented with the \*(Simes\*(So pseudo instruction.
|
||||
Because the \*(Simes\*(So instruction can have many different meanings
|
||||
depending on its argument,
|
||||
it is important that all optimizers recognize and respect them. Addition
|
||||
of even a single message will require careful inspection of, and maybe even
|
||||
incorporate small changes to each of the optimizers.
|
||||
.bp
|
184
doc/sparc/A
Normal file
184
doc/sparc/A
Normal file
|
@ -0,0 +1,184 @@
|
|||
.so init
|
||||
.SH
|
||||
A. MEASUREMENTS
|
||||
.SH
|
||||
A.1. \*(OQThe bottom line\*(CQ
|
||||
.PP
|
||||
Although examples often are most illustrative, the cruel world out there is
|
||||
usually more interested in everyday performance figures. To satisfy those
|
||||
people too, we will present a series of measurements on our code expander
|
||||
taken from (close to) real life situations. These include measurements
|
||||
of compile and run times of different programs,
|
||||
compiled with different compilers.
|
||||
.SH
|
||||
A.2. Compile time measurements
|
||||
.PP
|
||||
Figure A.2.1 shows compile-time measurements for typical C code:
|
||||
the dhrystone benchmark\(dg
|
||||
.[ [
|
||||
dhrystone
|
||||
.]].
|
||||
.FS
|
||||
\(dg To be certain that we only tested the compiler and not the quality of
|
||||
the code in the library, we have added our own version of
|
||||
\fIstrcmp\fR and \fIstrcpy\fR and have not used the ones present in the
|
||||
library.
|
||||
.FE
|
||||
The numbers represent the duration of each separate pass of the compiler.
|
||||
The numbers at the end of each bar represent the total duration of the
|
||||
compilation process. As with all measurements in this chapter, the
|
||||
quoted time or duration is the sum of user and system time in seconds.
|
||||
.PS
|
||||
copy "pics/compile_bars"
|
||||
.PE
|
||||
.DS
|
||||
.IP cem: 6
|
||||
C to EM frontend
|
||||
.IP opt:
|
||||
EM peep-hole optimizer
|
||||
.IP be:
|
||||
EM to assembler backend
|
||||
.IP cpp:
|
||||
Sun's C preprocessor
|
||||
.IP ccom:
|
||||
Sun's C compiler
|
||||
.IP iropt:
|
||||
Sun's optimizer
|
||||
.IP cg:
|
||||
Sun's code generator
|
||||
.IP as:
|
||||
Sun's assembler
|
||||
.IP ld:
|
||||
Sun's linker
|
||||
.ce 1
|
||||
\fIFigure A.2.1: compile-time measurements.\fR
|
||||
.DE
|
||||
.sp
|
||||
.PP
|
||||
A close examination of the first two bars in fig A.2.1 shows that the maximum
|
||||
achievable compile-time
|
||||
gain compared to \fIcc\fR is about 50% for medium-sized
|
||||
programs.\(dd
|
||||
.FS
|
||||
\(dd (cpp+ccom+as+ld)/(cem+as+ld) = 1.53
|
||||
.FE
|
||||
For small programs the gain will be less, due to the almost constant
|
||||
start-up time of each pass in the compilation process. Only a
|
||||
built-in assembler may increase this number up to
|
||||
180% in the ideal case that the optimizer, backend and assembler
|
||||
would run in zero time. Speed-ups of 5 to 10 times as mentioned in
|
||||
.[ [
|
||||
fast portable compilers
|
||||
.]]
|
||||
are therefore not possible on the Sun-4 family. This is also due to
|
||||
Sun's implementation of saving and restoring register windows. With
|
||||
the current implementation in which only a single window is saved
|
||||
or restored on a register-window overflow, it is very time consuming
|
||||
when programs have highly dynamic stack use
|
||||
due to procedure calls (as is often the case with compilers).
|
||||
.PP
|
||||
Although we are currently a little slower than \fIcc\fR, it is hard to
|
||||
blame this on our backend. Optimizing the backend so that it would run
|
||||
twice as fast would only reduce the total compilation process by
|
||||
a mere 14%.
|
||||
.PP
|
||||
Finally it is nice to see that our push/pop-optimization,
|
||||
initially designed to generate faster code, has also increased the
|
||||
compilation speed. (see also figures A.4.1 and A.4.2.)
|
||||
.SH
|
||||
A.3. Run time performance
|
||||
.PP
|
||||
Figure A.3.1 shows the run-time performance of different compilers.
|
||||
All results are normalized, where the best available compiler (Sun's
|
||||
compiler with full optimization) is represented by 1.0 on our scale.
|
||||
.PS
|
||||
copy "pics/run-time_bars"
|
||||
.PE
|
||||
.ce 1
|
||||
\fIFigure A.3.1: run time performance.\fR
|
||||
.sp 1
|
||||
.PP
|
||||
The fact that our compiler behaves rather poorly compared to Sun's
|
||||
compiler is due to the fact that the dhrystone benchmark uses
|
||||
relatively many subroutine calls; all of which have to be 'emulated'
|
||||
by our backend.
|
||||
.SH
|
||||
A.4. Overall performance
|
||||
.LP
|
||||
In the next two figures we will show the combined run and compile time
|
||||
performance of 'our' compiler (the ACK C frontend and our backend)
|
||||
compared to Sun's C compiler. Figure A.4.1 shows the results from
|
||||
measurements on the dhrystone benchmark.
|
||||
.G1
|
||||
frame invis left solid bot solid
|
||||
label left "run time" "(in \(*msec/dhrystone)"
|
||||
label bot "compile time (in sec)"
|
||||
coord x 0,21 y 0,610
|
||||
ticks left out from 0 to 600 by 200
|
||||
ticks bot out from 0 to 20 by 5
|
||||
"\(bu" at 3.5, 1000000/1700
|
||||
"ack w/o opt" ljust at 3.5 + 1, 1000000/1700
|
||||
"\(bu" at 2.8, 1000000/8770
|
||||
"ack with opt" below at 2.8 + 0.1, 1000000/8770
|
||||
"\(bu" at 16.0, 1000000/10434
|
||||
"ack -O4" above at 16.0, 1000000/10434
|
||||
"\(bu" at 2.3, 1000000/7270
|
||||
"\fIcc\fR" above at 2.3, 1000000/7270
|
||||
"\(bu" at 9.0, 1000000/12500
|
||||
"\fIcc -O4\fR" above at 9.0, 1000000/12500
|
||||
"\(bu" at 5.9, 1000000/15250
|
||||
"\fIcc -O\fR" below at 5.9, 1000000/15250
|
||||
.G2
|
||||
.ce 1
|
||||
\fIFigure A.4.1: overall performance on dhrystones.
|
||||
.sp 1
|
||||
.LP
|
||||
Fortunately for us, dhrystones are not all there is. The following
|
||||
figure shows the same measurements as the previous one, except
|
||||
this time we took a benchmark that uses no subroutines: an implementation
|
||||
of Eratosthenes' sieve:
|
||||
.G1
|
||||
frame invis left solid bot solid
|
||||
label left "run time" "for one run" "(in sec)" left .6
|
||||
label bot "compile time (in sec)"
|
||||
coord x 0,11 y 0,21
|
||||
ticks bot out from 0 to 10 by 5
|
||||
ticks left out from 0 to 20 by 5
|
||||
"\(bu" at 2.5, 17.28
|
||||
"ack w/o opt" above at 2.5, 17.28
|
||||
"\(bu" at 1.6, 2.93
|
||||
"ack with opt" above at 1.6, 2.93
|
||||
"\(bu" at 9.4, 2.26
|
||||
"ack -O4" above at 9.4, 2.26
|
||||
"\(bu" at 1.5, 7.43
|
||||
"\fIcc\fR" above at 1.5, 7.43
|
||||
"\(bu" at 2.7, 2.02
|
||||
"\fIcc -O4\fR" ljust at 1.9, 1.2
|
||||
"\(bu" at 2.6, 2.10
|
||||
"\fIcc -O\fR" ljust at 3.1,2.5
|
||||
.G2
|
||||
.ce 1
|
||||
\fIFigure A.4.2: overall performance on Eratosthenes' sieve.
|
||||
.sp 1
|
||||
.PP
|
||||
Although the above figures speak for themselves, a small comment
|
||||
may be in place. At first it is clear that our compiler is neither
|
||||
faster than \fIcc\fR, nor produces faster code than \fIcc -O4\fR. It should
|
||||
also be noted however, that we do produce better code than \fIcc\fR
|
||||
at only a very small additional cost.
|
||||
It is also worth noticing that push-pop optimization
|
||||
increases run-time speed as well as compile speed.
|
||||
The first seems rather obvious,
|
||||
since optimized code is
|
||||
faster code, but the increase in compile speed may come as a surprise.
|
||||
The main reason is that the \fIas\fR+\fIld\fR time depends largely on the
|
||||
amount of generated code, which in general
|
||||
depends on the efficiency of the code.
|
||||
Push-pop optimization removes a lot of useless instructions which
|
||||
would otherwise
|
||||
have found their way through to the assembler and the loader.
|
||||
Useless instructions inserted in an early stage in the compilation
|
||||
process will slow down every following stage, so elimination of useless
|
||||
instructions in an early stage, even when it requires a little computational
|
||||
overhead, can often be beneficial to the overall compilation speed.
|
||||
.bp
|
128
doc/sparc/B
Normal file
128
doc/sparc/B
Normal file
|
@ -0,0 +1,128 @@
|
|||
.so init
|
||||
.SH
|
||||
B. IMPLEMENTATION
|
||||
.SH
|
||||
B.1. Excerpts from the non-optimized EM_table
|
||||
.PP
|
||||
Even though the non-optimized version of the EM_table is relatively
|
||||
straight-forward, examples have never hurt anybody.
|
||||
One of the simplest instructions is the \*(Siloc\*(So, which appears in
|
||||
our EM_table as follows:
|
||||
.DS
|
||||
\f6
|
||||
.TA 8 16 24 32 40 48 56 64
|
||||
C_loc ==> "set $1, T1";
|
||||
"dec 4, SP";
|
||||
"st T1, [SP]".
|
||||
\f1
|
||||
.DE
|
||||
Just as \*(SiSP\*(So is an alias for \*(Si%l0\*(So, \*(SiT1\*(So is
|
||||
an alias for \*(Si%g1\*(So.
|
||||
A little more complex is the \*(Siadi\*(So which performs integer
|
||||
addition.
|
||||
.DS
|
||||
\f6
|
||||
C_adi ==> "ld [SP], T1";
|
||||
"ld [SP+4], T2";
|
||||
"add T1, T2, T3";
|
||||
"st T3, [SP+4];
|
||||
"inc 4, SP".
|
||||
\f1
|
||||
.DE
|
||||
We could go on with even more complex instructions, but since that would
|
||||
not contribute to anything the reader is referred to the implementation
|
||||
for more details.
|
||||
.SH
|
||||
B.2. Excerpts from the optimized EM_table
|
||||
.PP
|
||||
The optimized EM_table uses the cache primitives mentioned in chapter 4.
|
||||
This means that the \*(Siloc\*(So this time appears as
|
||||
.DS
|
||||
\f6
|
||||
C_loc ==> push_const($1).
|
||||
\f1
|
||||
.DE
|
||||
The \*(Silol\*(So can now be written as
|
||||
.DS
|
||||
\f6
|
||||
C_lol ==> push_reg(LB);
|
||||
inc_tos($1);
|
||||
push_const(4);
|
||||
C_los(4).
|
||||
\f1
|
||||
.DE
|
||||
Due to the law of conservation of misery somebody has to do the dirty work.
|
||||
In this case, it is the \*(Silos\*(So. To show just a small part of
|
||||
the implementation of the \*(Silos\*(So:
|
||||
.DS
|
||||
\f6
|
||||
C_los $1 == 4 ==>
|
||||
if (type_of_tos() == T_cst) {
|
||||
arith size;
|
||||
const_str_t n;
|
||||
|
||||
size= pop_const();
|
||||
if (size <= 4) {
|
||||
reg_t a;
|
||||
reg_t a;
|
||||
char *LD;
|
||||
|
||||
switch (size) {
|
||||
case 1: LD = "ldub"; break;
|
||||
case 2: LD = "lduh"; break;
|
||||
case 4: LD = "ld"; break;
|
||||
default: arg_error("C_los", size);
|
||||
}
|
||||
a = pop_reg_c13(n);
|
||||
b = alloc_reg();
|
||||
"$LD [$a+$n], $b";
|
||||
push_reg(b);
|
||||
free_reg(a);
|
||||
} else ...
|
||||
\f1
|
||||
.DE
|
||||
For the full implementation, the reader is again referred to the actual
|
||||
implementation. Just to show how other instructions are affected
|
||||
by the optimization we will show that implementation of the \*(Sitge\*(So
|
||||
instruction:
|
||||
.DS
|
||||
\f6
|
||||
C_tge ==> {
|
||||
reg_t a;
|
||||
reg_t b;
|
||||
|
||||
a = pop_reg();
|
||||
b = alloc_reg();
|
||||
" tst $a";
|
||||
" bge,a 1f";
|
||||
" mov 1, $b"; /* delay slot */
|
||||
" set 0, $b";
|
||||
"1:";
|
||||
free_reg(a);
|
||||
push_reg(b);
|
||||
}.
|
||||
|
||||
\f1
|
||||
.DE
|
||||
.SH
|
||||
.bp
|
||||
CREDITS
|
||||
.PP
|
||||
In order of appearance:
|
||||
.TS
|
||||
center;
|
||||
r c l.
|
||||
Original idea - Dick Grune
|
||||
Design & implementation - Philip Homburg
|
||||
- Raymond Michiels
|
||||
Tutor - Dick Grune
|
||||
Assistant Tutor - Ceriel Jacobs
|
||||
Proofreading - Dick Grune
|
||||
- Hans van Eck
|
||||
.TE
|
||||
.SH
|
||||
REFERENCES
|
||||
.PP
|
||||
.[
|
||||
$LIST$
|
||||
.]
|
10
doc/sparc/Makefile
Normal file
10
doc/sparc/Makefile
Normal file
|
@ -0,0 +1,10 @@
|
|||
# $Header$
|
||||
|
||||
REFER=refer
|
||||
TBL=tbl
|
||||
TARGET=-Tlp
|
||||
PIC=pic
|
||||
GRAP=grap
|
||||
|
||||
../sparc.doc: refs title intro 1 2 3 4 5 A B init
|
||||
$(REFER) -sA+T '-l\", ' -p refs title intro 1 2 3 4 5 A B | $(GRAP) | $(PIC) | $(TBL) | soelim > $@
|
18
doc/sparc/init
Normal file
18
doc/sparc/init
Normal file
|
@ -0,0 +1,18 @@
|
|||
.nr PS 12
|
||||
.nr VS 14
|
||||
.\" .fp 6 AM
|
||||
.fp 6 CW
|
||||
.ds Si \f6\s-1
|
||||
.ds So \f1\s+1
|
||||
.ds OQ `\h'-1p'`
|
||||
.ds CQ '\h'-1p''
|
||||
.de UX
|
||||
.ie \\n(UX \s-1UNIX\s0\\$1
|
||||
.el \{\
|
||||
\s-1UNIX\s0\\$1\(dg
|
||||
.FS
|
||||
\(dg \s-1UNIX\s0 is a registered bell of AT&T Trademark Laboratories.
|
||||
.FE
|
||||
.nr UX 1
|
||||
.\}
|
||||
..
|
23
doc/sparc/intro
Normal file
23
doc/sparc/intro
Normal file
|
@ -0,0 +1,23 @@
|
|||
.so init
|
||||
.hw de-vised
|
||||
.TL
|
||||
A fast backend for SPARC processors
|
||||
.AU
|
||||
Philip Homburg
|
||||
Raymond Michiels
|
||||
.AI
|
||||
Dept. of Mathematics and Computer Science
|
||||
Vrije Universiteit
|
||||
Amsterdam, The Netherlands
|
||||
.AB
|
||||
The language EM is an intermediate language for use in compiler
|
||||
construction.
|
||||
In this paper we describe the construction of a so-called fast backend
|
||||
which translates EM code to assembler for SPARC processors.
|
||||
.br
|
||||
Our construction deviates strongly from the usual procedure. We have
|
||||
devised and implemented a virtual stack with which it is possible to
|
||||
generate very acceptable code without much loss in compile time.
|
||||
.AE
|
||||
.PP
|
||||
.bp
|
58
doc/sparc/note_on_reg_wins
Normal file
58
doc/sparc/note_on_reg_wins
Normal file
|
@ -0,0 +1,58 @@
|
|||
When developing a fast compiler for the Sun-4 series we have encountered
|
||||
rather strange behavior of the Sun kernel.
|
||||
|
||||
The problem is that when you have lots of nested procedure calls, (as
|
||||
is often the case in compilers and parsers) the registers fill up which
|
||||
causes a kernel trap. The kernel will then write out some of the registers
|
||||
to memory to make room for another window. When you return from the nested
|
||||
procedure call, just the reverse happens: yet another kernel trap so the
|
||||
kernel can load the register from memory.
|
||||
|
||||
Unfortunately the kernel only saves or loads a single window (= 16 register)
|
||||
on each trap. This means that when you call a procedure recursively it causes
|
||||
a kernel trap on almost every invocation (except for the first few).
|
||||
|
||||
To illustrate this consider the following little program:
|
||||
|
||||
--------------- little program -------------
|
||||
f(i) /* calls itself i times */
|
||||
int i;
|
||||
{
|
||||
if (i)
|
||||
f(i-1);
|
||||
}
|
||||
|
||||
main(argc, argv)
|
||||
int argc;
|
||||
char *argv[];
|
||||
{
|
||||
|
||||
|
||||
i = atoi(argv[1]); /* # loops */
|
||||
j = atoi(argv[2]); /* depth */
|
||||
|
||||
while (i--)
|
||||
f(j);
|
||||
}
|
||||
------------ end of little program -----------
|
||||
|
||||
|
||||
The performance decreases abruptly when the depth (j) becomes larger
|
||||
than 5. On a SPARC station we got the following results:
|
||||
|
||||
depth run time (in seconds)
|
||||
|
||||
1 0.5
|
||||
2 0.8
|
||||
3 1.0
|
||||
4 1.4 <- from here on it's +6 seconds for each
|
||||
5 7.6 step deeper.
|
||||
6 13.9
|
||||
7 19.9
|
||||
8 26.3
|
||||
9 32.9
|
||||
|
||||
Things would be a lot better when instead of just 1, the kernel would
|
||||
save or restore 4 windows (= 64 registers = 50% on our SPARC stations).
|
||||
|
||||
-Raymond.
|
12
doc/sparc/pics/.distr
Normal file
12
doc/sparc/pics/.distr
Normal file
|
@ -0,0 +1,12 @@
|
|||
EM_stack.orig
|
||||
EM_stack.ours
|
||||
compile_bars
|
||||
mem_config
|
||||
perf
|
||||
perf.comp
|
||||
perf.d
|
||||
perf.dhry
|
||||
reg_layout
|
||||
run-time_bars
|
||||
run-time_bars.bup
|
||||
signal_stack
|
34
doc/sparc/pics/EM_stack.orig
Normal file
34
doc/sparc/pics/EM_stack.orig
Normal file
|
@ -0,0 +1,34 @@
|
|||
.PS
|
||||
.ps -2
|
||||
.vs -2
|
||||
boxwid = 1.5;
|
||||
boxht = 0.24
|
||||
down;
|
||||
box "actual parameter n-1";
|
||||
box "." "." "." ht 0.6;
|
||||
box "actual parameter 0";
|
||||
move 0.3
|
||||
box "return status block";
|
||||
{arrow <- right with .w at last box.e; \
|
||||
box invis wid 0.3 "LB" }
|
||||
down
|
||||
move to 2nd last box.s
|
||||
move 0.1
|
||||
box "local variables"
|
||||
box "compiler temporaries"
|
||||
move 0.1
|
||||
box "register save block"
|
||||
move 0.1
|
||||
box "dynamic local generators"
|
||||
move 0.1
|
||||
box "operand"
|
||||
box "operand"
|
||||
move 0.1
|
||||
box "parameter m-1"
|
||||
box "." "." "." ht 0.6;
|
||||
box "parameter 0" with .n at last box .s
|
||||
{ arrow <- right with .w at last box.e; \
|
||||
box invis wid 0.3 "SP" }
|
||||
.ps +2
|
||||
.vs +2
|
||||
.PE
|
106
doc/sparc/pics/EM_stack.ours
Normal file
106
doc/sparc/pics/EM_stack.ours
Normal file
|
@ -0,0 +1,106 @@
|
|||
.ps 10
|
||||
.vs 12
|
||||
.PS
|
||||
boxwid = 1.3
|
||||
boxht = 0.25
|
||||
down;
|
||||
box "floating point" "register dump area" ht 0.6
|
||||
box "tmp float store"
|
||||
box "register dump area" ht 0.6
|
||||
{ arrow <- right with .w at 3/4 <last box.e, last box.se>; \
|
||||
box invis wid 0.3 "%fp" }
|
||||
move .1
|
||||
box dotted "gap"
|
||||
{ arrow <- right with .w at last box.e; \
|
||||
box invis wid 0.3 "%LB" }
|
||||
move .1
|
||||
box "locals"
|
||||
box "actual parameter n-1";
|
||||
box "." "." "." ht 0.6;
|
||||
box "actual parameter 0";
|
||||
{ arrow <- right with .w at last box.e; \
|
||||
box invis wid 0.3 "%SP" }
|
||||
move 0.1
|
||||
box "large gap" "(>64kb)" ht 1.0
|
||||
box "register dump area" ht 0.6
|
||||
{ arrow <- right with .w at 3/4 <last box.e, last box.se>; \
|
||||
box invis wid 0.3 "%sp" }
|
||||
move 0.2
|
||||
box invis "\\s+2just before call\\s0"
|
||||
move 1
|
||||
box dotted "gap"
|
||||
box invis "0 or 4 bytes" "for stack alignment" with .w at last box.e
|
||||
box invis height .7 "when gap is 0 bytes," "%fp == %LB" with .n at 2nd last box.s
|
||||
.PF
|
||||
.PS
|
||||
down;
|
||||
move to 2.4,0
|
||||
box "floating point" "register dump area" ht 0.6
|
||||
box "tmp float store"
|
||||
box "register dump area" ht 0.6
|
||||
{ arrow <- right with .w at 3/4 <last box.e, last box.se>; \
|
||||
box invis wid 0.3 "%fp" }
|
||||
move .1
|
||||
box dotted "gap"
|
||||
{ arrow <- right with .w at last box.e; \
|
||||
box invis wid 0.3 "%LB" }
|
||||
move .1
|
||||
box "locals"
|
||||
box "actual parameter n-1";
|
||||
box "." "." "." ht 0.6;
|
||||
box "actual parameter 0";
|
||||
{ arrow <- right with .w at last box.e; \
|
||||
box invis wid 0.3 "%SP" }
|
||||
move .1
|
||||
box dotted "gap"
|
||||
move .4
|
||||
box "floating point" "register dump area" ht 0.6
|
||||
box "tmp float store"
|
||||
box "register dump area" ht 0.6
|
||||
{ arrow <- right with .w at 3/4 <last box.e, last box.se>; \
|
||||
box invis wid 0.3 "%sp" }
|
||||
move 0.2
|
||||
box invis "\\s+2'during' call\\s0"
|
||||
.PF
|
||||
.PS
|
||||
down;
|
||||
move to 4.8,0
|
||||
box "floating point" "register dump area" ht 0.6
|
||||
box "tmp float store"
|
||||
box "register dump area" ht 0.6
|
||||
move .1
|
||||
box dotted "gap"
|
||||
move .1
|
||||
box "locals"
|
||||
box "actual parameter n-1";
|
||||
box "." "." "." ht 0.6;
|
||||
box "actual parameter 0";
|
||||
move .1
|
||||
box dotted "gap"
|
||||
move .4
|
||||
box "floating point" "register dump area" ht 0.6
|
||||
box "tmp float store"
|
||||
box "register dump area" ht 0.6
|
||||
{ arrow <- right with .w at 3/4 <last box.e, last box.se>; \
|
||||
box invis wid 0.3 "%fp" }
|
||||
move .1
|
||||
box dotted "gap"
|
||||
{ arrow <- right with .w at last box.e; \
|
||||
box invis wid 0.3 "%LB" }
|
||||
move .1
|
||||
box "locals"
|
||||
box "actual parameter n-1";
|
||||
box "." "." "." ht 0.6;
|
||||
box "actual parameter 0";
|
||||
{ arrow <- right with .w at last box.e; \
|
||||
box invis wid 0.3 "%SP" }
|
||||
move 0.1
|
||||
box "large gap" "(>64kb)" ht 1.0
|
||||
box "register dump area" ht 0.6
|
||||
{ arrow <- right with .w at 3/4 <last box.e, last box.se>; \
|
||||
box invis wid 0.3 "%sp" }
|
||||
move 0.2
|
||||
box invis "\\s+2after call\\s0"
|
||||
.PF
|
||||
.ps 12
|
||||
.vs 14
|
49
doc/sparc/pics/compile_bars
Normal file
49
doc/sparc/pics/compile_bars
Normal file
|
@ -0,0 +1,49 @@
|
|||
.PS
|
||||
boxht = 0.5
|
||||
boxwid = 1
|
||||
moveht = 0.65
|
||||
down;
|
||||
{
|
||||
right;
|
||||
box invis "ACK" "w/o" "opt"
|
||||
box "cem" "0.7" wid 0.7
|
||||
box "opt" "0.4" wid 0.4
|
||||
box "be" "1.1" wid 1.1
|
||||
box "as" "1.4" wid 1.4
|
||||
box "ld" "0.4" wid 0.4
|
||||
box invis "4.0" wid 0.5
|
||||
}
|
||||
move
|
||||
{
|
||||
right;
|
||||
box invis "ACK" "with" "opt"
|
||||
box "cem" "0.7" wid 0.7
|
||||
box "opt" "0.4" wid 0.4
|
||||
box "be" "0.6" wid 0.6
|
||||
box "as" "0.7" wid 0.7
|
||||
box "ld" "0.4" wid 0.4
|
||||
box invis "2.8" wid 0.5
|
||||
}
|
||||
move
|
||||
{
|
||||
right;
|
||||
box invis "\fIcc\fR"
|
||||
box "cpp" "0.2" wid 0.2
|
||||
box "ccom" "1.0" wid 1.0
|
||||
box "as" "0.7" wid 0.7
|
||||
box "ld" "0.4" wid 0.4
|
||||
box invis "2.3" wid 0.5
|
||||
}
|
||||
move
|
||||
{
|
||||
right;
|
||||
box invis "\fIcc -O4\fR"
|
||||
box "cpp" "0.2" wid 0.2
|
||||
box "ccom" "1.0" wid 1.0
|
||||
box "iropt" "5.0 (not to scale!)" wid 1.5
|
||||
box "cg" "0.7" wid 0.7
|
||||
box "as" "1.7" wid 1.7
|
||||
box "ld" "0.4" wid 0.4
|
||||
box invis "9.0" wid 0.5
|
||||
}
|
||||
.PE
|
34
doc/sparc/pics/mem_config
Normal file
34
doc/sparc/pics/mem_config
Normal file
|
@ -0,0 +1,34 @@
|
|||
.PS
|
||||
boxwid = 1.3
|
||||
down
|
||||
[
|
||||
right
|
||||
[
|
||||
down;
|
||||
box "stack" ht .6
|
||||
box "free" ht 1
|
||||
box "heap" ht .3
|
||||
box "text" ht .5
|
||||
]
|
||||
move 1
|
||||
[
|
||||
down;
|
||||
box "\s-4SPARC stack\s+4" ht .2
|
||||
box "\s-4EM stack\s+4" ht .1
|
||||
box "\s-4SPARC stack\s+4" ht .1
|
||||
box "\s-4EM stack\s+4" ht .1
|
||||
box "\s-4free\s+4" ht .2
|
||||
box "\s-4SPARC stack\s+4" ht .1
|
||||
box "free" ht .8
|
||||
box "heap" ht .3
|
||||
box "text" ht .5
|
||||
]
|
||||
]
|
||||
move .3
|
||||
[
|
||||
right
|
||||
box invis "regular \(UX memory layout"
|
||||
move 1
|
||||
box invis "memory layout for EM"
|
||||
]
|
||||
.PF
|
12
doc/sparc/pics/perf
Normal file
12
doc/sparc/pics/perf
Normal file
|
@ -0,0 +1,12 @@
|
|||
.G1
|
||||
frame invis left solid bot solid
|
||||
label left "run time" "(log scale)" left .5
|
||||
label bot "compile time (log scale)"
|
||||
coord x 0.1,10 log x y 1000,20000 log y
|
||||
ticks left out at 2000,5000,10000,20000
|
||||
ticks bot out at 0.1 0.3 1.0 3.0 10
|
||||
copy "perf.d" thru X
|
||||
"\(bu" at $1, $2
|
||||
"$3" rjust at $1, $2
|
||||
X
|
||||
.G2
|
7
doc/sparc/pics/perf.comp
Normal file
7
doc/sparc/pics/perf.comp
Normal file
|
@ -0,0 +1,7 @@
|
|||
in-line in ../A
|
||||
|
||||
2.5 17.28 ack w/o opt
|
||||
1.6 2.93 ack with opt
|
||||
9.4 2.26 ack -O4
|
||||
1.5 7.43 \fIcc\fR
|
||||
2.7 2.02 \fIcc -O4\fR
|
4
doc/sparc/pics/perf.d
Normal file
4
doc/sparc/pics/perf.d
Normal file
|
@ -0,0 +1,4 @@
|
|||
1.0 1700 ack w/o opt
|
||||
1.9 8000 ack with opt
|
||||
1.6 8000 \fIcc\fR
|
||||
7 18000 \fIcc -O4\fR
|
7
doc/sparc/pics/perf.dhry
Normal file
7
doc/sparc/pics/perf.dhry
Normal file
|
@ -0,0 +1,7 @@
|
|||
in-line in ../A
|
||||
|
||||
3.5 1700 ack w/o opt
|
||||
2.8 8770 ack with opt
|
||||
16.0 10434 ack -O4
|
||||
2.3 7270 \fIcc\fR
|
||||
9.0 12500 \fIcc -O4\fR
|
24
doc/sparc/pics/reg_layout
Normal file
24
doc/sparc/pics/reg_layout
Normal file
|
@ -0,0 +1,24 @@
|
|||
.nr PS 12
|
||||
.nr VS 14
|
||||
.PP
|
||||
.TS
|
||||
allbox;
|
||||
l l l l
|
||||
l2f6 l l2f6 l.
|
||||
g0 0 l0 EM_SP
|
||||
g1 temporary 1 l1 EM_LB
|
||||
g2 temporary 2 l2
|
||||
g3 temporary 3 l3 reserved
|
||||
g4 64k..1M l4 reserved
|
||||
g5 temporary 4 l5 reserved
|
||||
g6 line number l6 reserved
|
||||
g7 file name l7 reserved
|
||||
o0 param 1 i0
|
||||
o1 param 2 i1
|
||||
o2 param 3 i2
|
||||
o3 param 4 i3
|
||||
o4 RETL_LD i4 RETL_ST
|
||||
o5 RETH_LD i5 RETH_ST
|
||||
sp stack pointer fp frame pointer
|
||||
o7 xxx i7 return address
|
||||
.TE
|
101
doc/sparc/pics/run-time_bars
Normal file
101
doc/sparc/pics/run-time_bars
Normal file
|
@ -0,0 +1,101 @@
|
|||
.PS
|
||||
boxht = 0.5
|
||||
boxwid = 1
|
||||
moveht = 1
|
||||
down;
|
||||
{
|
||||
right;
|
||||
box invis "ACK" "w/o" "opt."
|
||||
move
|
||||
[
|
||||
down;
|
||||
boxht = 0.25
|
||||
box wid 4.5
|
||||
"Sieve" ljust at last box.w + 0.1,-0.02
|
||||
"10(!)" ljust at last box.e + 0.1,-0.02
|
||||
box wid 4.5 with .nw at last box.sw
|
||||
"Dhrystones" ljust at last box.w + 0.1,-0.02
|
||||
"10(!)" ljust at last box.e + 0.1,-0.02
|
||||
] with .w at last box.e
|
||||
}
|
||||
move
|
||||
{
|
||||
right;
|
||||
box invis "ACK" "with" "our" "opt."
|
||||
move
|
||||
[
|
||||
down;
|
||||
boxht = 0.25
|
||||
box wid 1.4
|
||||
"Sieve" ljust at last box.w + 0.1,-0.02
|
||||
"1.4" ljust at last box.e + 0.1,-0.02
|
||||
box wid 1.9 with .nw at last box.sw
|
||||
"Dhrystones" ljust at last box.w + 0.1,-0.02
|
||||
"1.9" ljust at last box.e + 0.1,-0.02
|
||||
] with .w at last box.e
|
||||
}
|
||||
move
|
||||
{
|
||||
right;
|
||||
box invis "ACK" "-O4"
|
||||
move
|
||||
[
|
||||
down;
|
||||
boxht = 0.25
|
||||
box wid 1.1
|
||||
"Sieve" ljust at last box.w + 0.1,-0.02
|
||||
"1.1" ljust at last box.e + 0.1,-0.02
|
||||
box wid 1.6 with .nw at last box.sw
|
||||
"Dhrystones" ljust at last box.w + 0.1,-0.02
|
||||
"1.6" ljust at last box.e + 0.1,-0.02
|
||||
] with .w at last box.e
|
||||
}
|
||||
move
|
||||
{
|
||||
right;
|
||||
box invis "Sun's" "compiler" "w/o opt."
|
||||
move
|
||||
[
|
||||
down;
|
||||
boxht = 0.25
|
||||
box wid 3.7
|
||||
"Sieve" ljust at last box.w + 0.1,-0.02
|
||||
"3.7" ljust at last box.e + 0.1,-0.02
|
||||
box wid 2.2 with .nw at last box.sw
|
||||
"Dhrystones" ljust at last box.w + 0.1,-0.02
|
||||
"2.2" ljust at last box.e + 0.1,-0.02
|
||||
] with .w at last box.e
|
||||
}
|
||||
move
|
||||
{
|
||||
right;
|
||||
box invis "Sun's" "compiler" "-O"
|
||||
move
|
||||
[
|
||||
down;
|
||||
boxht = 0.25
|
||||
box wid 1.1
|
||||
"Sieve" ljust at last box.w + 0.1,-0.02
|
||||
"1.1" ljust at last box.e + 0.1,-0.02
|
||||
box wid 0.8 with .nw at last box.sw
|
||||
"Dhryst." ljust at last box.w + 0.1,-0.02
|
||||
"0.8!" ljust at last box.e + 0.1,-0.02
|
||||
] with .w at last box.e
|
||||
}
|
||||
move
|
||||
{
|
||||
right;
|
||||
box invis "Sun's" "compiler" "-O4"
|
||||
move
|
||||
[
|
||||
down;
|
||||
boxht = 0.25
|
||||
box wid 1.0
|
||||
"Sieve" ljust at last box.w + 0.1,-0.02
|
||||
"1.0" ljust at last box.e + 0.1,-0.02
|
||||
box wid 1.0 with .nw at last box.sw
|
||||
"Dhrystones" ljust at last box.w + 0.1,-0.02
|
||||
"1.0" ljust at last box.e + 0.1,-0.02
|
||||
] with .w at last box.e
|
||||
}
|
||||
.PE
|
100
doc/sparc/pics/run-time_bars.bup
Normal file
100
doc/sparc/pics/run-time_bars.bup
Normal file
|
@ -0,0 +1,100 @@
|
|||
.PS
|
||||
boxht = 0.5
|
||||
boxwid = 1
|
||||
moveht = 1
|
||||
down;
|
||||
{
|
||||
right;
|
||||
box invis "ACK" "w/o" "opt"
|
||||
move
|
||||
[
|
||||
down;
|
||||
boxht = 0.25
|
||||
box wid 4.5
|
||||
"C (arithmetic)" ljust at last box.w + 0.1,-0.02
|
||||
"10(!)" ljust at last box.e + 0.1,-0.02
|
||||
box wid 4.5 with .nw at last box.sw
|
||||
"C (dhrystones)" ljust at last box.w + 0.1,-0.02
|
||||
"10(!)" ljust at last box.e + 0.1,-0.02
|
||||
box wid 4.5 with .nw at last box.sw
|
||||
"Modula-2" ljust at last box.w + 0.1,-0.02
|
||||
"8(!)" ljust at last box.e + 0.1,-0.02
|
||||
] with .w at last box.e
|
||||
}
|
||||
move
|
||||
{
|
||||
right;
|
||||
box invis "ACK" "with" "peep-hole" "opt"
|
||||
move
|
||||
[
|
||||
down;
|
||||
boxht = 0.25
|
||||
box wid 1.4
|
||||
"C (arithmetic)" ljust at last box.w + 0.1,-0.02
|
||||
"1.4" ljust at last box.e + 0.1,-0.02
|
||||
box wid 1.9 with .nw at last box.sw
|
||||
"C (dhrystones)" ljust at last box.w + 0.1,-0.02
|
||||
"1.9" ljust at last box.e + 0.1,-0.02
|
||||
box wid 2.5 with .nw at last box.sw
|
||||
"Modula-2" ljust at last box.w + 0.1,-0.02
|
||||
"2.5" ljust at last box.e + 0.1,-0.02
|
||||
] with .w at last box.e
|
||||
}
|
||||
move
|
||||
{
|
||||
right;
|
||||
box invis "ACK" "-O4"
|
||||
move
|
||||
[
|
||||
down;
|
||||
boxht = 0.25
|
||||
box wid 1.1
|
||||
"C (arithmetic)" ljust at last box.w + 0.1,-0.02
|
||||
"1.1" ljust at last box.e + 0.1,-0.02
|
||||
box wid 1.6 with .nw at last box.sw
|
||||
"C (dhrystones)" ljust at last box.w + 0.1,-0.02
|
||||
"1.6" ljust at last box.e + 0.1,-0.02
|
||||
box wid 2.5 with .nw at last box.sw
|
||||
"Modula-2" ljust at last box.w + 0.1,-0.02
|
||||
"2.5" ljust at last box.e + 0.1,-0.02
|
||||
] with .w at last box.e
|
||||
}
|
||||
move
|
||||
{
|
||||
right;
|
||||
box invis "Sun's" "compiler" "w/o opt."
|
||||
move
|
||||
[
|
||||
down;
|
||||
boxht = 0.25
|
||||
box wid 3.7
|
||||
"C (arithmetic)" ljust at last box.w + 0.1,-0.02
|
||||
"3.7" ljust at last box.e + 0.1,-0.02
|
||||
box wid 2.2 with .nw at last box.sw
|
||||
"C (dhrystones)" ljust at last box.w + 0.1,-0.02
|
||||
"2.2" ljust at last box.e + 0.1,-0.02
|
||||
box wid 1.8 with .nw at last box.sw
|
||||
"Modula-2" ljust at last box.w + 0.1,-0.02
|
||||
"1.8" ljust at last box.e + 0.1,-0.02
|
||||
] with .w at last box.e
|
||||
}
|
||||
move
|
||||
{
|
||||
right;
|
||||
box invis "Sun's" "compiler" "-O4"
|
||||
move
|
||||
[
|
||||
down;
|
||||
boxht = 0.25
|
||||
box wid 1.0
|
||||
"C (arith.)" ljust at last box.w + 0.1,-0.02
|
||||
"1.0" ljust at last box.e + 0.1,-0.02
|
||||
box wid 1.0 with .nw at last box.sw
|
||||
"C (dhryst.)" ljust at last box.w + 0.1,-0.02
|
||||
"1.0" ljust at last box.e + 0.1,-0.02
|
||||
box wid 1.0 with .nw at last box.sw
|
||||
"Modula-2" ljust at last box.w + 0.1,-0.02
|
||||
"1.0" ljust at last box.e + 0.1,-0.02
|
||||
] with .w at last box.e
|
||||
}
|
||||
.PE
|
42
doc/sparc/pics/signal_stack
Normal file
42
doc/sparc/pics/signal_stack
Normal file
|
@ -0,0 +1,42 @@
|
|||
.PS
|
||||
boxwid = 1.3
|
||||
down
|
||||
[
|
||||
right
|
||||
[
|
||||
down;
|
||||
box "\s-4SPARC stack\s+4" ht .2
|
||||
box "\s-4EM stack\s+4" ht .1
|
||||
box "\s-4SPARC stack\s+4" ht .1
|
||||
box "\s-4EM stack\s+4" ht .1
|
||||
box "\s-4free\s+4" ht .2
|
||||
box "\s-4SPARC stack\s+4" ht .1
|
||||
box "free" ht .8
|
||||
box "heap" ht .3
|
||||
box "text" ht .5
|
||||
]
|
||||
move 1
|
||||
[
|
||||
down;
|
||||
box "\s-4SPARC stack\s+4" ht .2
|
||||
box "\s-4EM stack\s+4" ht .1
|
||||
box "\s-4SPARC stack\s+4" ht .1
|
||||
box "\s-4EM stack\s+4" ht .1
|
||||
box "\s-4free\s+4" ht .2
|
||||
box "\s-4SPARC stack\s+4" ht .1
|
||||
box "\s-4EM stack\s+4" ht .1
|
||||
box "\s-4free\s+4" ht .2
|
||||
box "\s-4SPARC stack\s+4" ht .1
|
||||
box "free" ht .4
|
||||
box "heap" ht .3
|
||||
box "text" ht .5
|
||||
]
|
||||
]
|
||||
move .3
|
||||
[
|
||||
right
|
||||
box invis "before signal"
|
||||
move 1
|
||||
box invis "during (1st) signal"
|
||||
]
|
||||
.PF
|
31
doc/sparc/printP4P
Normal file
31
doc/sparc/printP4P
Normal file
|
@ -0,0 +1,31 @@
|
|||
echo $0
|
||||
case $1 in
|
||||
1 )
|
||||
CMD="cat"
|
||||
;;
|
||||
2 )
|
||||
CMD="cat"
|
||||
;;
|
||||
3 )
|
||||
CMD="cat"
|
||||
;;
|
||||
4 )
|
||||
CMD="pic | tbl"
|
||||
;;
|
||||
5 )
|
||||
CMD="tbl"
|
||||
;;
|
||||
A )
|
||||
CMD="grap | pic"
|
||||
;;
|
||||
B )
|
||||
CMD="tbl"
|
||||
;;
|
||||
esac
|
||||
echo $0
|
||||
if [ $0 = printP4P ]
|
||||
then
|
||||
refer -sA+T '-l\", ' -p refs $1 | eval $CMD | troff -ms -Tp4p | dip -Tp4p -Pp4p
|
||||
else
|
||||
xtroff -full -geom 665x883+566+0 -command "refer -sA+T '-l\", ' -p refs $1 | $CMD | troff -ms -Tp4p"
|
||||
fi
|
185
doc/sparc/refs
Normal file
185
doc/sparc/refs
Normal file
|
@ -0,0 +1,185 @@
|
|||
%T The design of very fast portable compilers
|
||||
%A A.S. Tanenbaum
|
||||
%A M.F. Kaashoek
|
||||
%A K.G. Langendoen
|
||||
%A C.J.H. Jacobs
|
||||
%J SIGPLAN Notices
|
||||
%V 24
|
||||
%N 11
|
||||
%P 125-131
|
||||
%D November 1989
|
||||
|
||||
%T A Programmer-friendly LL(1) Parser Generator
|
||||
%A D. Grune
|
||||
%A C.J.H. Jacobs
|
||||
%J Software \- Practice and Experience
|
||||
%V 18
|
||||
%N 1
|
||||
%P 29-38
|
||||
%D January 1988
|
||||
|
||||
%T The Code Expander Generator
|
||||
%A Frans Kaashoek
|
||||
%A Koen Langendoen
|
||||
%R IM-9
|
||||
%I Vrije Universiteit, Amsterdam
|
||||
%D November 1987
|
||||
|
||||
%T The ACK Pascal Compiler
|
||||
%A Aad Geudeke
|
||||
%A Frans Hofmeester
|
||||
%R IM-8
|
||||
%I Vrije Universiteit, Amsterdam
|
||||
%D November 1987
|
||||
|
||||
%T The EM-interpreter
|
||||
%A Eddo de Groot
|
||||
%A Leo van den Berge
|
||||
%R IM-7
|
||||
%I Vrije Universiteit, Amsterdam
|
||||
%D June 1987
|
||||
|
||||
%T A set of multi\-process primitives for stack based machines
|
||||
%A K. Bot
|
||||
%A E. Scheffer
|
||||
%R IR-122
|
||||
%I Vrije Universiteit, Amsterdam
|
||||
%D December 1986
|
||||
|
||||
%T An Occam Compiler
|
||||
%A K. Bot
|
||||
%A E. Scheffer
|
||||
%R IM-6
|
||||
%I Vrije Universiteit, Amsterdam
|
||||
%D December 1986
|
||||
|
||||
%T Language- and Machine-independent Global Optimization on Intermediate Code
|
||||
%A H.E. Bal
|
||||
%A A.S. Tanenbaum
|
||||
%J Computer Languages
|
||||
%V 11
|
||||
%N 2
|
||||
%P 105-121
|
||||
%D April 1986
|
||||
|
||||
%T The ACK Target Optimizer
|
||||
%A H.E. Bal
|
||||
%R IR-107
|
||||
%D 1985
|
||||
%I Vrije Universiteit, Amsterdam
|
||||
|
||||
%T Some Topics in Parser Generation
|
||||
%A C.J.H. Jacobs
|
||||
%R IR-105
|
||||
%D October 1985
|
||||
%I Vrije Universiteit, Amsterdam
|
||||
|
||||
%T The CEM compiler
|
||||
%A E.H. Baalbergen
|
||||
%A D. Grune
|
||||
%A M. Waage
|
||||
%R IM-4
|
||||
%I Vrije Universiteit, Amsterdam
|
||||
%D 1985
|
||||
|
||||
%T The Design and Implementation of the EM Global Optimizer
|
||||
%A H.E. Bal
|
||||
%I Vrije Universiteit, Amsterdam
|
||||
%R IR-99
|
||||
%D March 1985
|
||||
|
||||
%T Does anybody out there want to write HALF of a compiler?
|
||||
%A A.S. Tanenbaum
|
||||
%A E.G. Keizer
|
||||
%A H. van Staveren
|
||||
%J Sigplan Notices
|
||||
%V 19
|
||||
%N 8
|
||||
%P 106-108
|
||||
%D August 1984
|
||||
|
||||
%T Amsterdam Compiler Kit documentation
|
||||
%A A.S. Tanenbaum et. al.
|
||||
%I Vrije Universiteit, Amsterdam
|
||||
%R IR-90
|
||||
%D June 1984
|
||||
|
||||
%T A Practical Toolkit for Making Portable Compilers
|
||||
%A A. S. Tanenbaum
|
||||
%A H. van Staveren
|
||||
%A E. G. Keizer
|
||||
%A J. W. Stevenson
|
||||
%J Communications of the ACM
|
||||
%V 26
|
||||
%N 9
|
||||
%P 654-660
|
||||
%D September 1983
|
||||
|
||||
%T Description of a Machine Architecture for use with Block Structured
|
||||
Languages
|
||||
%A A. S. Tanenbaum
|
||||
%A H. van Staveren
|
||||
%A E. G. Keizer
|
||||
%A J. W. Stevenson
|
||||
%R IR-81
|
||||
%D August 1983
|
||||
%I Vrije Universiteit, Amsterdam
|
||||
|
||||
%T A Unix Toolkit for Making Portable Compilers
|
||||
%A A.S. Tanenbaum
|
||||
%A H. van Staveren
|
||||
%A E.G. Keizer
|
||||
%A J.W. Stevenson
|
||||
%J Proceedings USENIX conf.
|
||||
%C Toronto, Canada
|
||||
%V 26
|
||||
%D July 1983
|
||||
%P 255-261
|
||||
|
||||
%T Using Peephole Optimization on Intermediate Code
|
||||
%A A.S. Tanenbaum
|
||||
%A J.M. van Staveren
|
||||
%A J.W. Stevenson
|
||||
%J TOPLAS
|
||||
%V 4
|
||||
%N 1
|
||||
%P 21-36
|
||||
%D January 1982
|
||||
|
||||
%T EM-1 Compiler
|
||||
%A A.S. Tanenbaum
|
||||
%J Pascal News
|
||||
%D September 1981
|
||||
%P 4-38
|
||||
|
||||
%T A portable compiler for the Proposed ISO Standard Pascal Language
|
||||
%A A.S. Tanenbaum
|
||||
%A J.W. Stevenson
|
||||
%A H. van Staveren
|
||||
%J Sigplan Notices
|
||||
%V 15
|
||||
%N 10
|
||||
%D 1980
|
||||
|
||||
%T Implications of Structured Programming for Machine Architecture
|
||||
%A A.S. Tanenbaum
|
||||
%J CACM
|
||||
%V 21
|
||||
%N 3
|
||||
%P 237-246
|
||||
%D March 1978
|
||||
|
||||
%T The table driven code generator from the Amsterdam Compiler Kit (Second
|
||||
revised edition)
|
||||
%A H. van Staveren
|
||||
%I Vrije Universiteit, Amsterdam
|
||||
%R on-line internal ACK documentation
|
||||
%D early 1985
|
||||
|
||||
%T Dhrystone Benchmark: Rationale for Version 2 and Measurement Rules
|
||||
%A R.P. Weicker
|
||||
%J Sigplan Notices
|
||||
%V 23
|
||||
%N 8
|
||||
%D august 1988
|
||||
%P 49-62
|
22
doc/sparc/timing
Normal file
22
doc/sparc/timing
Normal file
|
@ -0,0 +1,22 @@
|
|||
DHRYSTONES V2.0
|
||||
|
||||
cc cc -O4 cc -O fccO fccCE ack ack -O4
|
||||
compile time:
|
||||
real 4.0 12.0 10.0 6.4 8.0 31.0
|
||||
user 1.6 7.3 4.1 1.9 1.8 2.0 9.3
|
||||
sys 0.9 2.1 1.8 2.5 1.5 2.0 7.7
|
||||
|
||||
run time: 7263 16250 15250 4730 3430 8474 10434
|
||||
(stones/sec)
|
||||
|
||||
SIEVE
|
||||
|
||||
cc cc -O4 fccO fccCE ack ack -O4
|
||||
compile time:
|
||||
real 2.4 4.4 x 3.3 6.4 17.0
|
||||
user 0.8 1.6 x 0.7 0.7 3.2
|
||||
sys 0.7 1.0 x 0.8 1.3 6.2
|
||||
|
||||
run time: 7.43 2.02 x 12.18 2.93 2.26
|
||||
|
||||
All ack-derived compilers are shell script driven
|
15
doc/sparc/title
Normal file
15
doc/sparc/title
Normal file
|
@ -0,0 +1,15 @@
|
|||
.so init
|
||||
.TL
|
||||
.sp 1.2c
|
||||
A fast backend for SPARC processors
|
||||
.AU
|
||||
Philip Homburg
|
||||
Raymond Michiels
|
||||
.AI
|
||||
Dept. of Mathematics and Computer Science
|
||||
Vrije Universiteit
|
||||
Amsterdam, The Netherlands
|
||||
.PP
|
||||
.sp 1i
|
||||
Afstudeerverslag, 20 augustus 1990
|
||||
.bp
|
Loading…
Reference in a new issue