.so init
.SH
A. MEASUREMENTS
.SH
A.1. \*(OQThe bottom line\*(CQ
.PP
Although examples are often most illustrative, the cruel world out there
is usually more interested in everyday performance figures.
To satisfy those people too, we will present a series of measurements on
our code expander, taken from (close to) real-life situations.
These include measurements of the compile and run times of different
programs, compiled with different compilers.
.SH
A.2. Compile-time measurements
.PP
Figure A.2.1 shows compile-time measurements for typical C code: the
dhrystone benchmark\(dg
.[ [
dhrystone
.]].
.FS
\(dg To be certain that we tested only the compiler and not the quality
of the code in the library, we added our own versions of \fIstrcmp\fR
and \fIstrcpy\fR and did not use the ones present in the library.
.FE
The numbers represent the duration of each separate pass of the
compiler; the numbers at the end of each bar represent the total
duration of the compilation process.
As with all measurements in this chapter, the quoted time or duration is
the sum of user and system time in seconds.
.PS
copy "pics/compile_bars"
.PE
.DS
.IP cem: 6
C to EM frontend
.IP opt:
EM peep-hole optimizer
.IP be:
EM to assembler backend
.IP cpp:
Sun's C preprocessor
.IP ccom:
Sun's C compiler
.IP iropt:
Sun's optimizer
.IP cg:
Sun's code generator
.IP as:
Sun's assembler
.IP ld:
Sun's linker
.ce 1
\fIFigure A.2.1: compile-time measurements.\fR
.DE
.sp
.PP
A close examination of the first two bars in figure A.2.1 shows that the
maximum achievable compile-time gain over \fIcc\fR is about 50% for
medium-sized programs.\(dd
.FS
\(dd (cpp+ccom+as+ld)/(cem+as+ld) = 1.53
.FE
For small programs the gain will be less, due to the almost constant
start-up time of each pass in the compilation process.
Only a built-in assembler could increase this number further, up to 180%
in the ideal case that the optimizer, backend and assembler run in zero
time.
Speed-ups of a factor of 5 to 10, as mentioned in
.[ [
fast portable compilers
.]]
are therefore not possible on the Sun-4 family.
This is also due to Sun's implementation of saving and restoring
register windows: since only a single window is saved or restored on a
register-window overflow, execution becomes very time consuming for
programs with highly dynamic stack use caused by procedure calls (as is
often the case with compilers).
.PP
Although we are currently a little slower than \fIcc\fR, it is hard to
blame this on our backend: making the backend run twice as fast would
reduce the total compilation time by a mere 14%.
.PP
Finally, it is nice to see that our push/pop optimization, initially
designed to generate faster code, has also increased the compilation
speed (see also figures A.4.1 and A.4.2).
.SH
A.3. Run-time performance
.PP
Figure A.3.1 shows the run-time performance of different compilers.
All results are normalized; the best available compiler (Sun's compiler
with full optimization) is represented by 1.0 on our scale.
.PS
copy "pics/run-time_bars"
.PE
.ce 1
\fIFigure A.3.1: run-time performance.\fR
.sp 1
.PP
That our compiler behaves rather poorly compared to Sun's is due to the
dhrystone benchmark using relatively many subroutine calls, all of which
have to be \*(OQemulated\*(CQ by our backend.
.SH
A.4. Overall performance
.LP
The next two figures show the combined run-time and compile-time
performance of \*(OQour\*(CQ compiler (the ACK C frontend and our
backend) compared to Sun's C compiler.
Figure A.4.1 shows the results of measurements on the dhrystone
benchmark.
.G1
frame invis left solid bot solid
label left "run time" "(in \(*msec/dhrystone)"
label bot "compile time (in sec)"
coord x 0,21 y 0,610
ticks left out from 0 to 600 by 200
ticks bot out from 0 to 20 by 5
"\(bu" at 3.5, 1000000/1700
"ack w/o opt" ljust at 3.5 + 1, 1000000/1700
"\(bu" at 2.8, 1000000/8770
"ack with opt" below at 2.8 + 0.1, 1000000/8770
"\(bu" at 16.0, 1000000/10434
"ack -O4" above at 16.0, 1000000/10434
"\(bu" at 2.3, 1000000/7270
"\fIcc\fR" above at 2.3, 1000000/7270
"\(bu" at 9.0, 1000000/12500
"\fIcc -O4\fR" above at 9.0, 1000000/12500
"\(bu" at 5.9, 1000000/15250
"\fIcc -O\fR" below at 5.9, 1000000/15250
.G2
.ce 1
\fIFigure A.4.1: overall performance on dhrystones.\fR
.sp 1
.LP
Fortunately for us, dhrystones are not all there is.
The following figure shows the same measurements as the previous one,
except that this time we took a benchmark that uses no subroutines: an
implementation of Eratosthenes' sieve.
.G1
frame invis left solid bot solid
label left "run time" "for one run" "(in sec)" left .6
label bot "compile time (in sec)"
coord x 0,11 y 0,21
ticks bot out from 0 to 10 by 5
ticks left out from 0 to 20 by 5
"\(bu" at 2.5, 17.28
"ack w/o opt" above at 2.5, 17.28
"\(bu" at 1.6, 2.93
"ack with opt" above at 1.6, 2.93
"\(bu" at 9.4, 2.26
"ack -O4" above at 9.4, 2.26
"\(bu" at 1.5, 7.43
"\fIcc\fR" above at 1.5, 7.43
"\(bu" at 2.7, 2.02
"\fIcc -O4\fR" ljust at 1.9, 1.2
"\(bu" at 2.6, 2.10
"\fIcc -O\fR" ljust at 3.1,2.5
.G2
.ce 1
\fIFigure A.4.2: overall performance on Eratosthenes' sieve.\fR
.sp 1
.PP
Although the above figures speak for themselves, a small comment may be
in order.
First, it is clear that our compiler is neither faster than \fIcc\fR,
nor produces faster code than \fIcc -O4\fR.
Note, however, that we do produce better code than \fIcc\fR at only a
very small additional cost.
It is also worth noting that push/pop optimization increases run-time
speed as well as compile speed.
The first seems rather obvious, since optimized code is faster code, but
the increase in compile speed may come as a surprise.
The main reason is that the \fIas\fR+\fIld\fR time depends largely on
the amount of generated code, which in turn depends on the efficiency of
that code.
Push/pop optimization removes a lot of useless instructions that would
otherwise find their way through to the assembler and the linker.
Useless instructions inserted at an early stage of the compilation
process slow down every following stage, so eliminating them early, even
when this requires a little computational overhead, is often beneficial
to the overall compilation speed.
.bp