.so init
.SH
A. MEASUREMENTS
.SH
A.1. \*(OQThe bottom line\*(CQ
.PP
Although examples are often most illustrative, the cruel world out there is
usually more interested in everyday performance figures. To satisfy those
people too, we present a series of measurements on our code expander,
taken from (close to) real-life situations. These include measurements
of the compile and run times of different programs,
compiled with different compilers.
.SH
A.2. Compile-time measurements
.PP
Figure A.2.1 shows compile-time measurements for typical C code:
the dhrystone benchmark\(dg
.[ [
dhrystone
.]].
.FS
\(dg To be certain that we only tested the compiler and not the quality of
the code in the library, we added our own versions of
\fIstrcmp\fR and \fIstrcpy\fR and did not use the ones present in the
library.
.FE
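.PP
The footnote above mentions private versions of \fIstrcmp\fR and \fIstrcpy\fR. A minimal sketch of such library-independent replacements is shown below; the names are hypothetical, and these are not the exact routines used for the measurements.

```c
#include <assert.h>

/* Library-independent string routines, so that a benchmark measures the
 * compiler and not the quality of the C library.  Sketches only; not
 * the exact code used for the measurements in this appendix. */
static int my_strcmp(const char *s, const char *t)
{
        while (*s != '\0' && *s == *t) {
                s++;
                t++;
        }
        return (unsigned char)*s - (unsigned char)*t;
}

static char *my_strcpy(char *dst, const char *src)
{
        char *d = dst;

        while ((*d++ = *src++) != '\0')
                ;
        return dst;
}
```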
The numbers represent the duration of each separate pass of the compiler.
The numbers at the end of each bar represent the total duration of the
compilation process. As with all measurements in this chapter, the
quoted time or duration is the sum of user and system time in seconds.
.PS
copy "pics/compile_bars"
.PE
.DS
.IP cem: 6
C to EM frontend
.IP opt:
EM peep-hole optimizer
.IP be:
EM to assembler backend
.IP cpp:
Sun's C preprocessor
.IP ccom:
Sun's C compiler
.IP iropt:
Sun's optimizer
.IP cg:
Sun's code generator
.IP as:
Sun's assembler
.IP ld:
Sun's linker
.ce 1
\fIFigure A.2.1: compile-time measurements.\fR
.DE
.sp
.PP
A close examination of the first two bars in figure A.2.1 shows that the
maximum achievable compile-time
gain compared to \fIcc\fR is about 50% for medium-sized
programs.\(dd
.FS
\(dd (cpp+ccom+as+ld)/(cem+as+ld) = 1.53
.FE
For small programs the gain will be less, due to the almost constant
start-up time of each pass in the compilation process. Only a
built-in assembler could increase this number, up to
180% in the ideal case where the optimizer, backend and assembler
run in zero time. Speed-ups of 5 to 10 times, as mentioned in
.[ [
fast portable compilers
.]]
are therefore not possible on the Sun-4 family. This is also due to
Sun's implementation of saving and restoring register windows. With
the current implementation, in which only a single window is saved
or restored on a register-window overflow, programs with highly
dynamic stack use due to procedure calls (as is often the case with
compilers) become very time consuming.
.PP
Although we are currently a little slower than \fIcc\fR, it is hard to
blame this on our backend. Optimizing the backend so that it ran
twice as fast would reduce the total compilation time by
a mere 14%.
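.PP
The 14% figure is an instance of Amdahl's law: halving a pass that accounts for a fraction \fIf\fR of the total compilation time saves only \fIf\fR/2 overall. A sketch of the computation follows; the backend share of 0.28 is an illustrative assumption consistent with the 14% quoted above, not an exact measured value.

```c
#include <assert.h>

/* Amdahl-style estimate: if the backend accounts for a fraction f of
 * the total compilation time, making it twice as fast saves f/2
 * overall.  The 0.28 backend share used below is an assumption, not
 * a measured value. */
static double saving_from_halving(double backend_fraction)
{
        return backend_fraction / 2.0;
}
```

With an assumed backend share of 0.28, `saving_from_halving(0.28)` gives 0.14, i.e. the quoted 14%.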
.PP
Finally, it is nice to see that our push/pop optimization,
initially designed to generate faster code, has also increased the
compilation speed (see also figures A.4.1 and A.4.2).
.SH
A.3. Run-time performance
.PP
Figure A.3.1 shows the run-time performance of different compilers.
All results are normalized; the best available compiler (Sun's
compiler with full optimization) is represented by 1.0 on our scale.
.PS
copy "pics/run-time_bars"
.PE
.ce 1
\fIFigure A.3.1: run-time performance.\fR
.sp 1
.PP
Our compiler behaves rather poorly compared to Sun's
compiler because the dhrystone benchmark makes
relatively many subroutine calls, all of which have to be 'emulated'
by our backend.
.SH
A.4. Overall performance
.LP
The next two figures show the combined run- and compile-time
performance of 'our' compiler (the ACK C frontend and our backend)
compared to Sun's C compiler. Figure A.4.1 shows the results of
measurements on the dhrystone benchmark.
.G1
frame invis left solid bot solid
label left "run time" "(in \(*msec/dhrystone)"
label bot "compile time (in sec)"
coord x 0,21 y 0,610
ticks left out from 0 to 600 by 200
ticks bot out from 0 to 20 by 5
"\(bu" at 3.5, 1000000/1700
"ack w/o opt" ljust at 3.5 + 1, 1000000/1700
"\(bu" at 2.8, 1000000/8770
"ack with opt" below at 2.8 + 0.1, 1000000/8770
"\(bu" at 16.0, 1000000/10434
"ack -O4" above at 16.0, 1000000/10434
"\(bu" at 2.3, 1000000/7270
"\fIcc\fR" above at 2.3, 1000000/7270
"\(bu" at 9.0, 1000000/12500
"\fIcc -O4\fR" above at 9.0, 1000000/12500
"\(bu" at 5.9, 1000000/15250
"\fIcc -O\fR" below at 5.9, 1000000/15250
.G2
.ce 1
\fIFigure A.4.1: overall performance on dhrystones.\fR
.sp 1
.LP
Fortunately for us, dhrystones are not all there is. The following
figure shows the same measurements as the previous one, except that
this time we took a benchmark that uses no subroutines: an implementation
of Eratosthenes' sieve.
.G1
frame invis left solid bot solid
label left "run time" "for one run" "(in sec)" left .6
label bot "compile time (in sec)"
coord x 0,11 y 0,21
ticks bot out from 0 to 10 by 5
ticks left out from 0 to 20 by 5
"\(bu" at 2.5, 17.28
"ack w/o opt" above at 2.5, 17.28
"\(bu" at 1.6, 2.93
"ack with opt" above at 1.6, 2.93
"\(bu" at 9.4, 2.26
"ack -O4" above at 9.4, 2.26
"\(bu" at 1.5, 7.43
"\fIcc\fR" above at 1.5, 7.43
"\(bu" at 2.7, 2.02
"\fIcc -O4\fR" ljust at 1.9, 1.2
"\(bu" at 2.6, 2.10
"\fIcc -O\fR" ljust at 3.1,2.5
.G2
.ce 1
\fIFigure A.4.2: overall performance on Eratosthenes' sieve.\fR
.sp 1
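.PP
For reference, the sieve benchmark has the following general shape (a generic sieve of Eratosthenes with a hypothetical size limit, not necessarily the exact program that was timed). Its loops contain no subroutine calls, which is what makes it the counterpart of the call-heavy dhrystone:

```c
#include <assert.h>
#include <string.h>

#define MAXN 8192       /* hypothetical maximum; the measured program may differ */

/* Sieve of Eratosthenes: returns the number of primes <= limit.
 * No subroutine calls occur inside the loops themselves. */
static int sieve(int limit)
{
        static char composite[MAXN + 1];
        int i, j, count = 0;

        memset(composite, 0, (size_t)limit + 1);
        for (i = 2; i <= limit; i++) {
                if (!composite[i]) {
                        count++;
                        for (j = i + i; j <= limit; j += i)
                                composite[j] = 1;
                }
        }
        return count;
}
```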
.PP
Although the above figures speak for themselves, a small comment
may be in order. First, it is clear that our compiler is neither
faster than \fIcc\fR, nor produces faster code than \fIcc -O4\fR. It should
be noted, however, that we do produce better code than \fIcc\fR
at only a very small additional cost.
It is also worth noting that push-pop optimization
increases run-time speed as well as compile speed.
The first seems rather obvious,
since optimized code is
faster code, but the increase in compile speed may come as a surprise.
The main reason is that the \fIas\fR+\fIld\fR time depends largely on the
amount of generated code, which in general
depends on the efficiency of the code.
Push-pop optimization removes a lot of useless instructions that
would otherwise
have found their way through to the assembler and the loader.
Useless instructions inserted at an early stage of the compilation
process slow down every following stage, so eliminating useless
instructions early, even when it requires a little computational
overhead, is often beneficial to the overall compilation speed.
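.PP
The effect described above can be illustrated with a toy peephole pass that deletes an adjacent push/pop pair referring to the same location. This is a hypothetical sketch; the real EM peephole optimizer is table-driven and matches EM instruction patterns, not this simplified form.

```c
#include <assert.h>
#include <string.h>

/* A toy instruction: a push or pop of a named location.  "push x"
 * immediately followed by "pop x" leaves both the stack and x
 * unchanged, so the pair can be deleted. */
enum op { PUSH, POP };

struct insn {
        enum op op;
        const char *loc;
};

/* Delete push/pop pairs in place; returns the new instruction count. */
static int peephole(struct insn *code, int n)
{
        int i, out = 0;

        for (i = 0; i < n; i++) {
                if (out > 0 &&
                    code[out - 1].op == PUSH && code[i].op == POP &&
                    strcmp(code[out - 1].loc, code[i].loc) == 0)
                        out--;          /* drop both halves of the pair */
                else
                        code[out++] = code[i];
        }
        return out;
}
```

Every instruction deleted here is one the assembler and the loader never have to process, which is where the compile-time gain comes from.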
.bp