.so init
.SH
A. MEASUREMENTS
.SH
A.1. \*(OQThe bottom line\*(CQ
.PP
Although examples are often most illustrative, the cruel world out there is
usually more interested in everyday performance figures. To satisfy those
people too, we present a series of measurements on our code expander,
taken from (close to) real-life situations. These include measurements
of the compile and run times of different programs,
compiled with different compilers.
.SH
A.2. Compile-time measurements
.PP
Figure A.2.1 shows compile-time measurements for typical C code:
the dhrystone benchmark\(dg
.[ [
dhrystone
.]].
.FS
\(dg To be certain that we only tested the compiler and not the quality of
the code in the library, we added our own versions of
\fIstrcmp\fR and \fIstrcpy\fR and did not use the ones present in the
library.
.FE
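.PP
The footnote above mentions private versions of \fIstrcmp\fR and \fIstrcpy\fR. A minimal sketch of such library-independent replacements is shown below; the names are hypothetical, and these are not the exact routines used for the measurements.

```c
#include <assert.h>

/* Library-independent string routines, so that a benchmark measures the
 * compiler and not the quality of the C library.  Sketches only; not
 * the exact code used for the measurements in this appendix. */
static int my_strcmp(const char *s, const char *t)
{
        while (*s != '\0' && *s == *t) {
                s++;
                t++;
        }
        return (unsigned char)*s - (unsigned char)*t;
}

static char *my_strcpy(char *dst, const char *src)
{
        char *d = dst;

        while ((*d++ = *src++) != '\0')
                ;
        return dst;
}
```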
The numbers represent the duration of each separate pass of the compiler.
The numbers at the end of each bar represent the total duration of the
compilation process. As with all measurements in this chapter, the
quoted time or duration is the sum of user and system time in seconds.
.PS
copy "pics/compile_bars"
.PE
.DS
.IP cem: 6
C to EM frontend
.IP opt:
EM peep-hole optimizer
.IP be:
EM to assembler backend
.IP cpp:
Sun's C preprocessor
.IP ccom:
Sun's C compiler
.IP iropt:
Sun's optimizer
.IP cg:
Sun's code generator
.IP as:
Sun's assembler
.IP ld:
Sun's linker
.ce 1
\fIFigure A.2.1: compile-time measurements.\fR
.DE
.sp
.PP
A close examination of the first two bars in figure A.2.1 shows that the
maximum achievable compile-time
gain compared to \fIcc\fR is about 50% for medium-sized
programs.\(dd
.FS
\(dd (cpp+ccom+as+ld)/(cem+as+ld) = 1.53
.FE
For small programs the gain will be less, due to the almost constant
start-up time of each pass in the compilation process. Only a
built-in assembler could increase this number, up to
180% in the ideal case where the optimizer, backend and assembler
run in zero time. Speed-ups of 5 to 10 times, as mentioned in
.[ [
fast portable compilers
.]]
are therefore not possible on the Sun-4 family. This is also due to
Sun's implementation of saving and restoring register windows. With
the current implementation, in which only a single window is saved
or restored on a register-window overflow, programs with highly
dynamic stack use due to procedure calls (as is often the case with
compilers) become very time consuming.
.PP
Although we are currently a little slower than \fIcc\fR, it is hard to
blame this on our backend. Optimizing the backend so that it ran
twice as fast would reduce the total compilation time by
a mere 14%.
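.PP
The 14% figure is an instance of Amdahl's law: halving a pass that accounts for a fraction \fIf\fR of the total compilation time saves only \fIf\fR/2 overall. A sketch of the computation follows; the backend share of 0.28 is an illustrative assumption consistent with the 14% quoted above, not an exact measured value.

```c
#include <assert.h>

/* Amdahl-style estimate: if the backend accounts for a fraction f of
 * the total compilation time, making it twice as fast saves f/2
 * overall.  The 0.28 backend share used below is an assumption, not
 * a measured value. */
static double saving_from_halving(double backend_fraction)
{
        return backend_fraction / 2.0;
}
```

With an assumed backend share of 0.28, `saving_from_halving(0.28)` gives 0.14, i.e. the quoted 14%.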
.PP
Finally, it is nice to see that our push/pop optimization,
initially designed to generate faster code, has also increased the
compilation speed (see also figures A.4.1 and A.4.2).
.SH
A.3. Run-time performance
.PP
Figure A.3.1 shows the run-time performance of different compilers.
All results are normalized; the best available compiler (Sun's
compiler with full optimization) is represented by 1.0 on our scale.
.PS
copy "pics/run-time_bars"
.PE
.ce 1
\fIFigure A.3.1: run-time performance.\fR
.sp 1
.PP
Our compiler behaves rather poorly compared to Sun's
compiler because the dhrystone benchmark makes
relatively many subroutine calls, all of which have to be 'emulated'
by our backend.
.SH
A.4. Overall performance
.LP
The next two figures show the combined run- and compile-time
performance of 'our' compiler (the ACK C frontend and our backend)
compared to Sun's C compiler. Figure A.4.1 shows the results of
measurements on the dhrystone benchmark.
.G1
frame invis left solid bot solid
label left "run time" "(in \(*msec/dhrystone)"
label bot "compile time (in sec)"
coord x 0,21 y 0,610
ticks left out from 0 to 600 by 200
ticks bot out from 0 to 20 by 5
"\(bu" at 3.5, 1000000/1700
"ack w/o opt" ljust at 3.5 + 1, 1000000/1700
"\(bu" at 2.8, 1000000/8770
"ack with opt" below at 2.8 + 0.1, 1000000/8770
"\(bu" at 16.0, 1000000/10434
"ack -O4" above at 16.0, 1000000/10434
"\(bu" at 2.3, 1000000/7270
"\fIcc\fR" above at 2.3, 1000000/7270
"\(bu" at 9.0, 1000000/12500
"\fIcc -O4\fR" above at 9.0, 1000000/12500
"\(bu" at 5.9, 1000000/15250
"\fIcc -O\fR" below at 5.9, 1000000/15250
.G2
.ce 1
\fIFigure A.4.1: overall performance on dhrystones.\fR
.sp 1
.LP
Fortunately for us, dhrystones are not all there is. The following
figure shows the same measurements as the previous one, except that
this time we took a benchmark that uses no subroutines: an implementation
of Eratosthenes' sieve.
.G1
frame invis left solid bot solid
label left "run time" "for one run" "(in sec)" left .6
label bot "compile time (in sec)"
coord x 0,11 y 0,21
ticks bot out from 0 to 10 by 5
ticks left out from 0 to 20 by 5
"\(bu" at 2.5, 17.28
"ack w/o opt" above at 2.5, 17.28
"\(bu" at 1.6, 2.93
"ack with opt" above at 1.6, 2.93
"\(bu" at 9.4, 2.26
"ack -O4" above at 9.4, 2.26
"\(bu" at 1.5, 7.43
"\fIcc\fR" above at 1.5, 7.43
"\(bu" at 2.7, 2.02
"\fIcc -O4\fR" ljust at 1.9, 1.2
"\(bu" at 2.6, 2.10
"\fIcc -O\fR" ljust at 3.1,2.5
.G2
.ce 1
\fIFigure A.4.2: overall performance on Eratosthenes' sieve.\fR
.sp 1
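.PP
For reference, the sieve benchmark has the following general shape (a generic sieve of Eratosthenes with a hypothetical size limit, not necessarily the exact program that was timed). Its loops contain no subroutine calls, which is what makes it the counterpart of the call-heavy dhrystone:

```c
#include <assert.h>
#include <string.h>

#define MAXN 8192       /* hypothetical maximum; the measured program may differ */

/* Sieve of Eratosthenes: returns the number of primes <= limit.
 * No subroutine calls occur inside the loops themselves. */
static int sieve(int limit)
{
        static char composite[MAXN + 1];
        int i, j, count = 0;

        memset(composite, 0, (size_t)limit + 1);
        for (i = 2; i <= limit; i++) {
                if (!composite[i]) {
                        count++;
                        for (j = i + i; j <= limit; j += i)
                                composite[j] = 1;
                }
        }
        return count;
}
```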
.PP
Although the above figures speak for themselves, a small comment
may be in order. First, it is clear that our compiler is neither
faster than \fIcc\fR, nor produces faster code than \fIcc -O4\fR. It should
be noted, however, that we do produce better code than \fIcc\fR
at only a very small additional cost.
It is also worth noting that push-pop optimization
increases run-time speed as well as compile speed.
The first seems rather obvious,
since optimized code is
faster code, but the increase in compile speed may come as a surprise.
The main reason is that the \fIas\fR+\fIld\fR time depends largely on the
amount of generated code, which in general
depends on the efficiency of the code.
Push-pop optimization removes a lot of useless instructions that
would otherwise
have found their way through to the assembler and the loader.
Useless instructions inserted at an early stage of the compilation
process slow down every following stage, so eliminating useless
instructions early, even when it requires a little computational
overhead, is often beneficial to the overall compilation speed.
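.PP
The effect described above can be illustrated with a toy peephole pass that deletes an adjacent push/pop pair referring to the same location. This is a hypothetical sketch; the real EM peephole optimizer is table-driven and matches EM instruction patterns, not this simplified form.

```c
#include <assert.h>
#include <string.h>

/* A toy instruction: a push or pop of a named location.  "push x"
 * immediately followed by "pop x" leaves both the stack and x
 * unchanged, so the pair can be deleted. */
enum op { PUSH, POP };

struct insn {
        enum op op;
        const char *loc;
};

/* Delete push/pop pairs in place; returns the new instruction count. */
static int peephole(struct insn *code, int n)
{
        int i, out = 0;

        for (i = 0; i < n; i++) {
                if (out > 0 &&
                    code[out - 1].op == PUSH && code[i].op == POP &&
                    strcmp(code[out - 1].loc, code[i].loc) == 0)
                        out--;          /* drop both halves of the pair */
                else
                        code[out++] = code[i];
        }
        return out;
}
```

Every instruction deleted here is one the assembler and the loader never have to process, which is where the compile-time gain comes from.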
.bp