.so init
.SH
A. MEASUREMENTS
.SH
A.1. \*(OQThe bottom line\*(CQ
.PP
Although examples are often the most illustrative, the cruel world out there
is usually more interested in everyday performance figures. To satisfy those
people too, we present a series of measurements on our code expander,
taken from (close to) real-life situations. These include measurements
of the compile and run times of different programs,
compiled with different compilers.
.SH
A.2. Compile time measurements
.PP
Figure A.2.1 shows compile-time measurements for typical C code:
the dhrystone benchmark\(dg
.[ [
dhrystone
.]].
.FS
\(dg To be certain that we only tested the compiler and not the quality of
the code in the library, we have added our own versions of
\fIstrcmp\fR and \fIstrcpy\fR and have not used the ones present in the
library.
.FE
The numbers give the duration of each separate pass of the compiler;
the numbers at the end of each bar give the total duration of the
compilation process. As with all measurements in this chapter, the
quoted time or duration is the sum of user and system time in seconds.
.PS
copy "pics/compile_bars"
.PE
.DS
.IP cem: 6
C to EM frontend
.IP opt:
EM peep-hole optimizer
.IP be:
EM to assembler backend
.IP cpp:
Sun's C preprocessor
.IP ccom:
Sun's C compiler
.IP iropt:
Sun's optimizer
.IP cg:
Sun's code generator
.IP as:
Sun's assembler
.IP ld:
Sun's linker
.ce 1
\fIFigure A.2.1: compile-time measurements.\fR
.DE
.sp
.PP
A close examination of the first two bars in figure A.2.1 shows that the
maximum achievable compile-time
gain compared to \fIcc\fR is about 50% for medium-sized
programs.\(dd
.FS
\(dd (cpp+ccom+as+ld)/(cem+as+ld) = 1.53
.FE
For small programs the gain will be less, due to the almost constant
start-up time of each pass in the compilation process. Only a
built-in assembler could raise this figure, up to about
180% in the ideal case that the optimizer, backend and assembler
run in zero time (leaving only \fIcem\fR and \fIld\fR on our side).
Speed-ups of 5 to 10 times as mentioned in
.[ [
fast portable compilers
.]]
are therefore not possible on the Sun-4 family. This is also due to
Sun's implementation of saving and restoring register windows: with
the current implementation, in which only a single window is saved
or restored on a register-window overflow, programs with highly
dynamic stack use due to procedure calls (as is often the case with
compilers) spend a great deal of time saving and restoring windows.
.PP
Although we are currently a little slower than \fIcc\fR, it is hard to
blame this on our backend: optimizing the backend so that it runs
twice as fast would reduce the total compilation time by
a mere 14%.
.PP
Finally, it is nice to see that our push/pop optimization,
initially designed to generate faster code, has also increased the
compilation speed (see also figures A.4.1 and A.4.2).
.SH
A.3. Run time performance
.PP
Figure A.3.1 shows the run-time performance of different compilers.
All results are normalized; the best available compiler (Sun's
compiler with full optimization) is represented by 1.0 on our scale.
.PS
copy "pics/run-time_bars"
.PE
.ce 1
\fIFigure A.3.1: run time performance.\fR
.sp 1
.PP
Our compiler behaves rather poorly compared to Sun's compiler
because the dhrystone benchmark makes
relatively many subroutine calls, all of which have to be 'emulated'
by our backend.
.SH
A.4. Overall performance
.LP
The next two figures show the combined run-time and compile-time
performance of 'our' compiler (the ACK C frontend and our backend)
compared to Sun's C compiler. Figure A.4.1 shows the results from
measurements on the dhrystone benchmark.
.G1
frame invis left solid bot solid
label left "run time" "(in \(*msec/dhrystone)"
label bot "compile time (in sec)"
coord x 0,21 y 0,610
ticks left out from 0 to 600 by 200
ticks bot out from 0 to 20 by 5
"\(bu" at 3.5, 1000000/1700
"ack w/o opt" ljust at 3.5 + 1, 1000000/1700
"\(bu" at 2.8, 1000000/8770
"ack with opt" below at 2.8 + 0.1, 1000000/8770
"\(bu" at 16.0, 1000000/10434
"ack -O4" above at 16.0, 1000000/10434
"\(bu" at 2.3, 1000000/7270
"\fIcc\fR" above at 2.3, 1000000/7270
"\(bu" at 9.0, 1000000/12500
"\fIcc -O4\fR" above at 9.0, 1000000/12500
"\(bu" at 5.9, 1000000/15250
"\fIcc -O\fR" below at 5.9, 1000000/15250
.G2
.ce 1
\fIFigure A.4.1: overall performance on dhrystones.\fR
.sp 1
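.LP
The vertical coordinates in figure A.4.1 follow directly from the measured
dhrystone rates: a compiler whose output runs at \fIn\fR dhrystones per second
spends 1000000/\fIn\fR \(*msec per dhrystone. The \*(OQack with opt\*(CQ point,
for example, corresponds to 8770 dhrystones per second, or roughly
114 \(*msec per dhrystone.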
.LP
Fortunately for us, dhrystones are not all there is. The following
figure shows the same measurements as the previous one, except that
this time we took a benchmark that uses no subroutines: an implementation
of Eratosthenes' sieve (sketched after figure A.4.2):
.G1
frame invis left solid bot solid
label left "run time" "for one run" "(in sec)" left .6
label bot "compile time (in sec)"
coord x 0,11 y 0,21
ticks bot out from 0 to 10 by 5
ticks left out from 0 to 20 by 5
"\(bu" at 2.5, 17.28
"ack w/o opt" above at 2.5, 17.28
"\(bu" at 1.6, 2.93
"ack with opt" above at 1.6, 2.93
"\(bu" at 9.4, 2.26
"ack -O4" above at 9.4, 2.26
"\(bu" at 1.5, 7.43
"\fIcc\fR" above at 1.5, 7.43
"\(bu" at 2.7, 2.02
"\fIcc -O4\fR" ljust at 1.9, 1.2
"\(bu" at 2.6, 2.10
"\fIcc -O\fR" ljust at 3.1,2.5
.G2
.ce 1
\fIFigure A.4.2: overall performance on Eratosthenes' sieve.\fR
.sp 1
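.LP
For concreteness, the sieve benchmark is essentially of the following
form; the array size and iteration count shown here are illustrative
only, not necessarily the exact values used in our measurements. Note
that, apart from \fImain\fR itself, the program contains no subroutine
calls.
.DS
#define SIZE 8190              /* illustrative array size */
#define ITER 10                /* illustrative iteration count */

char flags[SIZE + 1];

int
main()
{
    int i, k, iter, count;

    for (iter = 0; iter < ITER; iter++) {
        count = 0;
        for (i = 0; i <= SIZE; i++)     /* assume everything is prime */
            flags[i] = 1;
        for (i = 2; i <= SIZE; i++) {
            if (flags[i]) {             /* i is prime */
                for (k = i + i; k <= SIZE; k += i)
                    flags[k] = 0;       /* cross out its multiples */
                count++;
            }
        }
    }
    return count == 0;  /* keep count live; exit status 0 on success */
}
.DE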
.PP
Although the above figures speak for themselves, a small comment
may be in order. First, it is clear that our compiler neither
is faster than \fIcc\fR nor produces faster code than \fIcc -O4\fR.
It should be noted, however, that we do produce better code than \fIcc\fR
at only a very small additional cost.
It is also worth noting that push-pop optimization
increases run-time speed as well as compile speed.
The first seems rather obvious,
since optimized code is
faster code, but the increase in compile speed may come as a surprise.
The main reason is that the \fIas\fR+\fIld\fR time depends largely on the
amount of generated code, which in general
depends on the efficiency of the code.
Push-pop optimization removes a lot of useless instructions that would
otherwise have found their way through to the assembler and the loader.
Useless instructions inserted at an early stage of the compilation
process slow down every following stage, so eliminating useless
instructions early, even when it requires a little computational
overhead, is often beneficial to the overall compilation speed.
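.LP
To illustrate the idea behind this optimization (not its actual
implementation, which operates on the code expander's internal
instruction representation and handles more cases), a push/pop
cancellation pass is in essence a peephole scan over the generated
instruction list. The type and field names below are invented for
the example:
.DS
/* Minimal sketch: cancel adjacent "push r; pop r" pairs in a doubly
 * linked instruction list.  A "push r1; pop r2" pair could likewise
 * be rewritten as a register move, but that case is omitted here. */
enum { PUSH, POP, OTHER };

struct instr {
    int          kind;          /* PUSH, POP or OTHER */
    int          reg;           /* register operand, if any */
    struct instr *prev, *next;
};

void
cancel_push_pop(struct instr **head)
{
    struct instr *ip = *head;

    while (ip != 0 && ip->next != 0) {
        struct instr *nx = ip->next;

        if (ip->kind == PUSH && nx->kind == POP && ip->reg == nx->reg) {
            /* "push r; pop r" has no net effect: unlink both. */
            struct instr *before = ip->prev, *after = nx->next;

            if (before != 0) before->next = after;
            else *head = after;
            if (after != 0) after->prev = before;

            /* Step back one instruction: removing the pair may bring
             * a new push/pop pair next to each other. */
            ip = (before != 0) ? before : *head;
        } else
            ip = ip->next;
    }
}
.DE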
.bp