.In
.SH
A. MEASUREMENTS
.SH
A.1. \*(OQThe bottom line\*(CQ
.PP
Although examples are often the most illustrative, the cruel world out there is
usually more interested in everyday performance figures. To satisfy those
people too, we present a series of measurements on our code expander
taken from (close to) real-life situations. These include measurements
of the compile and run times of different programs,
compiled with different compilers.
.SH
A.2. Compile time measurements
.PP
Figure A.2.1 shows compile-time measurements for typical C code:
the dhrystone benchmark\(dg
.[ [
dhrystone
.]].
.FS
\(dg To be certain that we tested only the compiler and not the quality of
the code in the library, we supplied our own versions of
\fIstrcmp\fR and \fIstrcpy\fR instead of using the ones present in the
library.
.FE
The numbers in each bar represent the duration of each separate pass of the compiler.
The number at the end of each bar represents the total duration of the
compilation process. As with all measurements in this chapter, the
quoted time or duration is the sum of user and system time in seconds.
.PS
copy "pics/compile_bars"
.PE
.DS
.IP cem: 6
C to EM frontend
.IP opt:
EM peep-hole optimizer
.IP be:
EM to assembler backend
.IP cpp:
Sun's C preprocessor
.IP ccom:
Sun's C compiler
.IP iropt:
Sun's optimizer
.IP cg:
Sun's code generator
.IP as:
Sun's assembler
.IP ld:
Sun's linker
.ce 1
\fIFigure A.2.1: compile-time measurements.\fR
.DE
.sp
.PP
A close examination of the first two bars in figure A.2.1 shows that the maximum
achievable compile-time
gain compared to \fIcc\fR is about 50% for medium-sized
programs.\(dd
.FS
\(dd (cpp+ccom+as+ld)/(cem+as+ld) = 1.53
.FE
For small programs the gain will be less, due to the almost constant
start-up time of each pass in the compilation process. Only a
built-in assembler could increase this number, up to
180% in the ideal case that the optimizer, backend and assembler
would run in zero time. Speed-ups of 5 to 10 times as mentioned in
.[ [
fast portable compilers
.]]
are therefore not possible on the Sun-4 family. This is also due to
Sun's implementation of saving and restoring register windows.
In the current implementation only a single window is saved
or restored on a register-window overflow or underflow, which is very time consuming
for programs whose stack use is highly dynamic
because of procedure calls (as is often the case with compilers).
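.PP
The ratio quoted in the footnote is simply the total time of one chain of
passes divided by the total time of the other. The following Python sketch
shows that arithmetic. Only the formula is taken from the footnote; the
per-pass times are invented for illustration and are not the measured values
behind figure A.2.1, and the names \fIspeedup\fR, \fIcc_passes\fR and
\fIideal_passes\fR are ours.
.DS L
```python
def speedup(reference_passes, our_passes):
    """Total time of the reference chain divided by the total time of ours."""
    return sum(reference_passes.values()) / sum(our_passes.values())


# Hypothetical per-pass times in seconds (user + system), NOT measured data.
cc_passes = {"cpp": 0.4, "ccom": 1.0, "as": 0.6, "ld": 0.3}
# Ideal chain from the footnote: frontend, assembler and linker only.
ideal_passes = {"cem": 0.6, "as": 0.6, "ld": 0.3}

ratio = speedup(cc_passes, ideal_passes)
gain_percent = 100.0 * (ratio - 1.0)
print(f"speed-up {ratio:.2f}, gain about {gain_percent:.0f}%")
```
.DE
With these made-up inputs the ratio comes out close to the 1.53 of the
footnote, which corresponds to a gain of roughly 50%.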
.PP
Although we are currently a little slower than \fIcc\fR, it is hard to
blame this on our backend. Making the backend run
twice as fast would reduce the total compilation time by
a mere 14%.
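.PP
The 14% follows from a simple proportion: speeding up one pass only saves
time in proportion to that pass's share of the total. The Python sketch below
illustrates this. The backend share of 28% is not quoted in the text; it is
the value implied by the 14% figure and is used here as an assumption. The
function name \fItotal_time_reduction\fR is ours.
.DS L
```python
def total_time_reduction(pass_share, pass_speedup):
    """Fraction of the total time saved when a pass that takes
    `pass_share` of the total runs `pass_speedup` times faster."""
    return pass_share * (1.0 - 1.0 / pass_speedup)


# Assumed share of the backend in the total compile time (implied by the
# 14% in the text, not a measured value).
backend_share = 0.28

saving = total_time_reduction(backend_share, 2.0)
print(f"twice as fast a backend saves {100 * saving:.0f}% of the total")
```
.DE
Under the same assumption, even a backend that ran in zero time could save
no more than its own share of the total.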
.PP
Finally, it is nice to see that our push-pop optimization,
initially designed to generate faster code, has also increased the
compilation speed (see also figures A.4.1 and A.4.2).
.SH
A.3. Run time performance
.PP
Figure A.3.1 shows the run-time performance of different compilers.
All results are normalized so that the best available compiler (Sun's
compiler with full optimization) is represented by 1.0 on our scale.
.PS
copy "pics/run-time_bars"
.PE
.ce 1
\fIFigure A.3.1: run time performance.\fR
.sp 1
.PP
Our compiler behaves rather poorly compared to Sun's
compiler because the dhrystone benchmark makes
relatively many subroutine calls, all of which have to be \*(OQemulated\*(CQ
by our backend.
.SH
A.4. Overall performance
.LP
In the next two figures we show the combined run-time and compile-time
performance of \*(OQour\*(CQ compiler (the ACK C frontend and our backend)
compared to Sun's C compiler. Figure A.4.1 shows the results from
measurements on the dhrystone benchmark.
.G1
frame invis left solid bot solid
label left "run time" "(in \(*msec/dhrystone)"
label bot "compile time (in sec)"
coord x 0,21 y 0,610
ticks left out from 0 to 600 by 200
ticks bot out from 0 to 20 by 5
"\(bu" at 3.5, 1000000/1700
"ack w/o opt" ljust at 3.5 + 1, 1000000/1700
"\(bu" at 2.8, 1000000/8770
"ack with opt" below at 2.8 + 0.1, 1000000/8770
"\(bu" at 16.0, 1000000/10434
"ack -O4" above at 16.0, 1000000/10434
"\(bu" at 2.3, 1000000/7270
"\fIcc\fR" above at 2.3, 1000000/7270
"\(bu" at 9.0, 1000000/12500
"\fIcc -O4\fR" above at 9.0, 1000000/12500
"\(bu" at 5.9, 1000000/15250
"\fIcc -O\fR" below at 5.9, 1000000/15250
.G2
.ce 1
\fIFigure A.4.1: overall performance on dhrystones.\fR
.sp 1
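.LP
The vertical positions in figure A.4.1 are written as 1000000 divided by
a rate, which turns a rate into a time per dhrystone in microseconds.
The Python sketch below repeats that conversion for the six plotted points.
It assumes that the divisors are dhrystones per second, which the figure
does not state explicitly; the function name \fIusec_per_dhrystone\fR is ours.
.DS L
```python
# Divisors copied from the plot commands of figure A.4.1; reading them as
# dhrystones per second is an assumption.
rates = {
    "ack w/o opt": 1700,
    "ack with opt": 8770,
    "ack -O4": 10434,
    "cc": 7270,
    "cc -O4": 12500,
    "cc -O": 15250,
}


def usec_per_dhrystone(rate):
    """Convert a rate per second into microseconds per dhrystone."""
    return 1_000_000 / rate


for name, rate in rates.items():
    print(f"{name:14s}{usec_per_dhrystone(rate):7.1f}")
```
.DE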
.LP
Fortunately for us, dhrystones are not all there is. The following
figure shows the same measurements as the previous one, except
this time we took a benchmark that uses no subroutines: an implementation
of Eratosthenes' sieve:
.G1
frame invis left solid bot solid
label left "run time" "for one run" "(in sec)" left .6
label bot "compile time (in sec)"
coord x 0,11 y 0,21
ticks bot out from 0 to 10 by 5
ticks left out from 0 to 20 by 5
"\(bu" at 2.5, 17.28
"ack w/o opt" above at 2.5, 17.28
"\(bu" at 1.6, 2.93
"ack with opt" above at 1.6, 2.93
"\(bu" at 9.4, 2.26
"ack -O4" above at 9.4, 2.26
"\(bu" at 1.5, 7.43
"\fIcc\fR" above at 1.5, 7.43
"\(bu" at 2.7, 2.02
"\fIcc -O4\fR" ljust at 1.9, 1.2
"\(bu" at 2.6, 2.10
"\fIcc -O\fR" ljust at 3.1,2.5
.G2
.ce 1
\fIFigure A.4.2: overall performance on Eratosthenes' sieve.\fR
.sp 1
.PP
Although the above figures speak for themselves, a small comment
may be in order. First, it is clear that our compiler is neither
faster than \fIcc\fR, nor produces faster code than \fIcc -O4\fR. It should
also be noted, however, that we do produce better code than \fIcc\fR
at only a very small additional cost.
It is also worth noticing that push-pop optimization
increases run-time speed as well as compile speed.
The first seems rather obvious,
since optimized code is
faster code, but the increase in compile speed may come as a surprise.
The main reason is that the \fIas\fR+\fIld\fR time depends largely on the
amount of generated code, which in general
depends on the efficiency of the code.
Push-pop optimization removes a lot of useless instructions that
would otherwise
have found their way through to the assembler and the loader.
Useless instructions inserted at an early stage of the compilation
process slow down every following stage, so eliminating useless
instructions at an early stage, even when it requires a little computational
overhead, can often benefit the overall compilation speed.
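.PP
To illustrate why removing push/pop pairs shrinks the input of every later
pass, here is a toy peephole pass in Python. It is only a sketch of the
general idea and not the mechanism used in our backend; the tuple-based
instruction format and the name \fIeliminate_push_pop\fR are invented for
this example.
.DS L
```python
def eliminate_push_pop(instructions):
    """Toy peephole pass: cancel a push that is immediately followed by a pop.

    ("push", r), ("pop", r)  is dropped entirely;
    ("push", a), ("pop", b)  becomes ("mov", a, b).
    All other instructions are copied unchanged.
    """
    out = []
    for ins in instructions:
        if ins[0] == "pop" and out and out[-1][0] == "push":
            src = out.pop()[1]      # discard the matching push
            if src != ins[1]:       # different registers: keep the data move
                out.append(("mov", src, ins[1]))
        else:
            out.append(ins)
    return out


# Hypothetical stack-machine style output of a naive code expander.
naive = [
    ("push", "r1"),
    ("pop", "r2"),
    ("add", "r2", "r3"),
    ("push", "r3"),
    ("pop", "r3"),
]
optimized = eliminate_push_pop(naive)
print(len(naive), "instructions before,", len(optimized), "after")
```
.DE
Every instruction removed here is one that the assembler and the loader
no longer have to process.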
.bp