diff --git a/doc/ceg/proposal.tr b/doc/ceg/proposal.tr new file mode 100644 index 000000000..0063bb623 --- /dev/null +++ b/doc/ceg/proposal.tr @@ -0,0 +1,284 @@ +.TL + +Code Expander +.br +(proposal) + +.SH +Introduction +.LP +The \fBcode expander\fR, \fBce\fR, is a program that translates EM-code to +objectcode. The main goal is to translate very fast. \fBce\fR is an instance +of the EM_CODE(3L)-interface. During execution, \fBce\fR builds +in core a machine independent objectfile ( NEW A.OUT(5L)). With \fBcv\fR or +with routines supplied by the user the machine independent objectcode is +converted to machine dependent objectcode. \fBce\fR needs +information about the target machine (e.g. the opcodes). We divide the +information into two parts: +.IP +- The description in assembly instructions of EM-code instructions. +.IP +- The description in objectcode of assembly instructions. +.LP +With these two tables we can make a \fBcode expander generator\fR, \fBceg\fR, which +generates a \fBce\fR. It is possible to put the information in one table +but that will probably introduce more bugs in the table. So we +divide and conquer. With this approach it is also possible to generate +assembly code ( rather than objectcode), which is useful for debugging. +There is of course a link between the two tables. The link +consists of a restriction on the assembly format. Every assembly +instruction must have the following format: +.sp + INSTR ::= LABEL : MNEMONIC [ OPERAND ( "," OPERAND)* ] +.sp +.LP +\fBCeg\fR uses the following algorithm: +.IP \0\0a) +The assembly table is converted to a (C-)routine assemble(). +assemble() gets a string, the assembler instruction, as its argument +and uses the MNEMONIC to execute the corresponding action in the +assembly table. +.IP \0\0b) +The routine assemble() can now be used to convert the EM-code table to +a set of C-routines, which together form an instance of the +EM_CODE(3L)-interface. 
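The first thing a routine like assemble() has to do with its string argument is split it according to the INSTR format. The following is only a minimal sketch of that step, not the routine that ceg generates: struct instr, split_instr() and copy_trimmed() are hypothetical names, the field sizes are arbitrary, and it assumes that the first ':' in the string ends the label and that operands contain no commas.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_OPERANDS 4

/* Hypothetical container for one split instruction (not part of ceg). */
struct instr {
    char label[32];
    char mnemonic[16];
    char operand[MAX_OPERANDS][32];
    int  n_operands;
};

/* Copy len bytes of src into dst, stripping surrounding white space. */
static void copy_trimmed(char *dst, size_t size, const char *src, size_t len)
{
    while (len > 0 && isspace((unsigned char)*src)) { src++; len--; }
    while (len > 0 && isspace((unsigned char)src[len - 1])) len--;
    if (len >= size) len = size - 1;      /* truncate overlong fields */
    memcpy(dst, src, len);
    dst[len] = '\0';
}

/*
 * Split "LABEL : MNEMONIC OPERAND, OPERAND" into its parts.
 * Assumption: the first ':' (if any) ends the label.
 * Returns 0 on success, -1 if no mnemonic is found.
 */
static int split_instr(const char *text, struct instr *out)
{
    const char *colon = strchr(text, ':');
    const char *p, *end;

    memset(out, 0, sizeof *out);
    if (colon != NULL) {
        copy_trimmed(out->label, sizeof out->label, text,
                     (size_t)(colon - text));
        text = colon + 1;
    }
    while (isspace((unsigned char)*text)) text++;
    p = text;
    while (*p != '\0' && !isspace((unsigned char)*p)) p++;
    if (p == text) return -1;             /* empty mnemonic */
    copy_trimmed(out->mnemonic, sizeof out->mnemonic, text,
                 (size_t)(p - text));

    /* The rest is a comma-separated operand list. */
    while (*p != '\0' && out->n_operands < MAX_OPERANDS) {
        end = strchr(p, ',');
        if (end == NULL) end = p + strlen(p);
        copy_trimmed(out->operand[out->n_operands],
                     sizeof out->operand[0], p, (size_t)(end - p));
        if (out->operand[out->n_operands][0] != '\0') out->n_operands++;
        p = (*end == ',') ? end + 1 : end;
    }
    return 0;
}
```

With the mnemonic isolated, a generated assemble() could switch on it and hand each operand string to the operand decoder supplied by the table writer.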
+.SH +The EM-instruction table +.LP +We use the following grammar: +.sp +.TS +center box ; +l. +TABLE ::= (ROW)* +ROW ::= C_instr ( SPECIAL | SIMPLE) +SPECIAL ::= ( CONDITION SIMPLE)+ 'default' SIMPLE +SIMPLE ::= '==>' ACTIONLIST | '::=' ACTIONLIST +ACTIONLIST ::= [ ACTION ( ';' ACTION)* ] '.' +ACTION ::= function-call | assembly-instruction +.TE +.LP +An example for the 8086: +.LP +.DS +C_lxl + $arg1 == 0 ==> "push bp". + $arg1 == 1 ==> "push EM_BSIZE(bp)". + default ==> "mov cx, $arg1"; + "mov si, bp"; + "1: mov si, EM_BSIZE(si)"; + "loop 1b"; + "push si". +.DE +.sp +Some remarks: +.sp +* The C_instr is a function identifier in the EM_CODE(3L)-interface. +.LP +* CONDITION is a "boolean" C-expression. +.LP +* The arguments of an EM-instruction can be used in CONDITION and in assembly +instructions. They are referred to as $arg\fIi\fR. \fBceg\fR modifies the +arguments as follows: +.IP \0\0- +For local variables at positive offsets it increases the offset by EM_BSIZE. +.IP \0\0- +It makes names and labels unique. The user must supply the formats (see mach.h). +.LP +* A function-call is allowed, e.g. to implement push/pop optimization. +For example: +.LP +.DS +C_adi + $arg1 == 2 ==> combine( "pop ax"); + combine( "pop bx"); + "add ax, bx"; + save( "push ax"). + default ==> arg_error( "C_adi", $arg1). +.DE +.LP +* The C-functions called in the EM-instruction table have to use the routines +assemble()/gen?(). "assembler-instr" is in fact assemble( "assembler-instr"). +.LP +* \fBceg\fR takes care not only of the conversion of arguments but also +of +changes between segments. There are situations in which one doesn't want +conversion of arguments. This can be done by using ::= instead of ==>. +This is useful when two C_instr are equivalent. For example: +.IP +C_slu ::= C_sli( $arg1) +.LP +* There are EM_CODE instructions which are machine independent (e.g. C_open()). +For these EM_CODE instructions \fBceg\fR will generate \fIdefault\fR- +instructions. 
There is one exception: in the case of C_pro() the tablewriter +has to supply a function prolog(). +.LP +* Also the EM-pseudoinstructions C_bss_\fIcstp\fR(), C_hol_\fIcstp\fR(), +C_con_\fIcstp\fR() and C_rom_\fIcstp\fR() can be translated automatically. +\fBceg\fR only has to know how to interpret string-constants: +.DS +\&..icon $arg2 == 1 ==> gen1( (char) atoi( $arg1)) + $arg2 == 2 ==> gen2( atoi( $arg1)) + $arg2 == 4 ==> gen4( atol( $arg1)) +\&..ucon $arg2 == 1 ==> gen1( (char) atoi( $arg1)) + $arg2 == 2 ==> gen2( atoi( $arg1)) + $arg2 == 4 ==> gen4( atol( $arg1)) +\&..fcon ::= not_implemented( "..fcon") +.DE +.LP +* Still, life can be made easier for the tablewriter. For the routines which +he/she didn't implement \fBceg\fR will generate a default instruction which +generates an error-message. \fBceg\fR in effect generates: +.IP +C_xxx ::= not_implemented( "C_xxx") +.SH +The assembly table +.LP +How to map assembly on objectcode. +.LP +Each row in the table consists of two fields, one field for the assembly +instruction, the other field for the corresponding objectcode. The tablewriter +can use the following primitives to generate code for the machine +instructions: +.IP "\0\0gen1( b)\0\0:" 17 +generates one byte in the machine independent objectfile. +.IP "\0\0gen2( w)\0\0:" 17 +generates one word ( = two bytes). The table writer can change the byte +order by setting the flag BYTES_REVERSED. +.IP "\0\0gen4( l)\0\0:" 17 +generates two words ( = four bytes). The table writer can change the word +order by setting the flag WORDS_REVERSED. +.IP "\0\0reloc( n, o, r)\0\0:" 17 +generates relocation information for a label ( = name + offset + +relocationtype). +.LP +Besides these primitives the table writer may use his own +C-functions. This allows the table writer e.g. to write functions to set +bitfields within a byte. 
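How gen2() and gen4() could be layered on gen1() is sketched below. This is an illustration under stated assumptions, not the proposed implementation: the buffer text_seg, the counter text_len and the run-time variables bytes_reversed and words_reversed (standing in for the flags BYTES_REVERSED and WORDS_REVERSED) are hypothetical, and the sketch assumes that the default order is low byte first and low word first, which the proposal does not specify.

```c
#include <stddef.h>

/* Hypothetical in-core buffer standing in for the text segment. */
static unsigned char text_seg[1024];
static size_t text_len = 0;

/* Assumed run-time stand-ins for BYTES_REVERSED and WORDS_REVERSED. */
static int bytes_reversed = 0;
static int words_reversed = 0;

/* Emit one byte; silently drops bytes once the sketch buffer is full. */
static void gen1(unsigned b)
{
    if (text_len < sizeof text_seg)
        text_seg[text_len++] = (unsigned char)(b & 0xff);
}

/* Emit one word ( = two bytes); assumed default is low byte first. */
static void gen2(unsigned w)
{
    unsigned lo = w & 0xff;
    unsigned hi = (w >> 8) & 0xff;

    if (bytes_reversed) { gen1(hi); gen1(lo); }
    else                { gen1(lo); gen1(hi); }
}

/* Emit two words ( = four bytes); assumed default is low word first. */
static void gen4(unsigned long l)
{
    unsigned lo = (unsigned)(l & 0xffff);
    unsigned hi = (unsigned)((l >> 16) & 0xffff);

    if (words_reversed) { gen2(hi); gen2(lo); }
    else                { gen2(lo); gen2(hi); }
}
```

Because gen4() is expressed in terms of gen2(), the two flags combine independently: setting both gives a fully big-endian layout, setting neither a fully little-endian one.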
+.LP +There are roughly two methods to encode the assembly instructions: +.IP \0\0a) +MNEMONIC and OPERAND(s) are encoded independently of each other. This can be +done when the target machine has an orthogonal instruction set (e.g. pdp-11). +.IP \0\0b) +MNEMONIC and OPERAND(s) together determine the opcode. In this case the +assembler often uses overloading: one MNEMONIC is used for several +different machine-instructions. For example: (8086) +.br + mov ax, bx +.br + mov ax, variable +.br +These instructions have different opcodes. +.LP +As the transformation MNEMONIC-OPCODE is not one-to-one, +the table writer must be allowed to put restrictions on the operands. +This can be done with type declarations. For example: +.LP +.DS + mov dst:REG, src:MEM ==> + gen1( 0x8b); + modRM( op1.reg, op2); +.DE +.DS + mov dst:REG, src:REG ==> + gen1( 0x89); + modRM( op2.reg, op1); +.DE +.LP +modRM() is a function written by the tablewriter and is used to encode +the operands. This frees the table writer from endless typing. +.LP +The table writer has to do the "typechecking" himself. But typechecking +is almost the same as operand decoding, so it's more efficient to do both +in one function. We now have all the tools to describe the function +assemble(). +.IP +assemble() first calls the function +decode_operand() ( written by the table writer), with two arguments: a +string ( the operand) and a +pointer to a struct. The struct is declared by the table writer and must +contain at least a field called type. ( The other fields in the struct can +be used to remember information about the decoded operand.) Now assemble() +fires a row which is selected by matching the MNEMONIC and the types of the +operands. +.br +In the second field of a row there may be references to other +fields in the struct (e.g. op2.reg in the example above). +.LP +We have ignored one problem. When the operands are encoded, it is possible that +not everything is known yet. 
For example, $arg\fIi\fR arguments in the +EM-instruction table get their value at runtime. This problem is solved by +introducing a function eval(). eval() has a string as argument and returns +an arith. The string consists of constants and/or $arg\fIi\fR's and the value +returned by eval() is the value of the string. To encode the $arg\fIi\fR's +in as few bytes as possible the table writer can use the statements %if, +%else and %endif. They can be used in the same manner as #if, #else and +#endif in C and result in a runtime test. An example: +.LP +.DS + -- Some rows of the assembly table + + mov dst:REG, src:DATA ==> + %if sfit( eval( src), 8) /* does the immediate-data fit in 1 byte? */ + R53( 0x16 , op1.reg); + gen1( eval( src)); + %else + R53( 0x17 , op1.reg); + gen2( eval( src)); + %endif +.LD + + mov dst:REG, src:REG ==> + gen1( 0x8b); + modRM( op1.reg, op2); + +.DE +.DS + -- The corresponding part in the function assemble() : + + case MNEM_mov : + decode_operand( arg1, &op1); + decode_operand( arg2, &op2); + if ( REG( op1.type) && DATA( op2.type)) { + printf( "if ( sfit( %s, 8)) {\\\\n", eval( arg2)); + R53( 0x16 , op1.reg); + printf( "gen1( %s);\\\\n", eval( arg2)); + printf( "}\\\\nelse {\\\\n"); + R53( 0x17 , op1.reg); + printf( "gen2( %s);\\\\n", eval( arg2)); + printf( "}\\\\n"); + } + else if ( REG( op1.type) && REG( op2.type)) { + gen1( 0x8b); + modRM( op1.reg, op2); + } + + +.DE +.DS + -- Some rows of the right part of the EM-instruction table are translated + -- into the following C-functions. + + "mov ax, $arg1" ==> + if ( sfit( w, 8)) { /* w is the actual argument of C_xxx( w) */ + gen1( 176); /* R53() */ + gen1( w); + } + else { + gen1( 184); + gen2( w); + } +.LD + + "mov ax, bx" ==> + gen1( 138); + gen1( 99); /* modRM() */ +.DE +.SH +Restrictions +.LP +.IP \0\01) +The EM-instruction C_exc() is not implemented. +.IP \0\03) +All messages are ignored. 
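The runtime test sfit( eval( src), 8) used in the assembly table above asks whether a value fits in a given number of bits as a signed quantity. The proposal does not define sfit(), so the following is only one plausible reading: it assumes two's-complement ranges, uses long where the proposal uses arith, and adds a hypothetical helper imm_size() to show how the %if/%else choice between gen1() and gen2() follows from it.

```c
/*
 * Assumed meaning of sfit(): returns 1 when val is representable as a
 * signed two's-complement number of `bits` bits, 0 otherwise.
 * Precondition: 1 <= bits < number of bits in a long.
 */
static int sfit(long val, int bits)
{
    long half = 1L << (bits - 1);   /* 128 for bits == 8 */

    return val >= -half && val < half;
}

/*
 * Hypothetical helper: number of immediate bytes the %if/%else rows
 * in the table would emit for val (1 via gen1(), 2 via gen2()).
 */
static int imm_size(long val)
{
    return sfit(val, 8) ? 1 : 2;
}
```

Because the value of an $arg\fIi\fR is only known when the generated ce runs, this test has to be emitted into the generated C code rather than evaluated while ceg processes the table.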
diff --git a/doc/ceg/prototype.tr b/doc/ceg/prototype.tr new file mode 100644 index 000000000..c5c5d91bd --- /dev/null +++ b/doc/ceg/prototype.tr @@ -0,0 +1,276 @@ +.TL +A prototype Code expander +.NH +Introduction +.PP +A program to be compiled with ACK is first fed into the preprocessor. +The output of the preprocessor goes into the appropriate front end, +whose job it is to produce EM. The EM code generated is +fed into the peephole optimizer, which scans it with a window of a few +instructions, replacing certain inefficient code sequences by better +ones. The peephole optimizer is followed by a backend which produces +good assembly code. The assembly code goes into the assembler and the objectcode +then goes into the loader/linker, the final component in the pipeline. +.PP +For various applications this scheme is too slow. An example is testing +programs: in this case the program has to be translated fast and the +objectcode is allowed to run more slowly. A solution is to build a code +expander ( \fBce\fR) which translates EM code to objectcode. Of course this +has to +be done automatically by a code expander generator, but to get some feeling +for the problem we started out by building prototypes. +We built two types of ce's: one which translated EM to assembly, and one +which translated EM to objectcode. +.NH +EM to assembly +.PP +We made one for the 8086 and one for the vax4. These ce's are instances of the +EM_CODE(3L)-interface and produce for a single EM instruction a set +of assembly instructions which is semantically equivalent. +In the 8086-ce we implemented push/pop-optimization. +.NH +EM to objectcode +.PP +Instead of producing assembly code we tried to produce vax4-objectcode. +During execution of ce, ce builds in core a machine independent +objectfile ( NEW A.OUT(5L)) and just before dumping the tables this +objectfile is converted to a Berkeley 4.2BSD a.out-file. We built two versions: +one with static memory allocation and one with dynamic memory allocation. 
+If the first one runs out of memory it will give an error message and stop; +the second one will allocate more memory and proceed with producing +objectcode. +.PP +The C-frontend calls the EM_CODE-interface. So after linking the frontend +and the ce we have a pipeline in a single program, which saves a lot of i/o. +It is interesting to compare this C-compiler ( called fcemcom) with "cc -c". +fcemcom1 (the dynamic variant of fcemcom) is tuned in such a way that +alloc() won't be called. +.NH 2 +Compile time +.PP +fac.c is a small program that produces n! ( see below). foo.c is a small program +that loops a lot. +.TS +center, box, tab(:); +c | c | c | c | c | c +c | c | n | n | n | n. +compiler : program : real : user : sys : object size += +fcemcom : sort.c : 31.0 : 17.5 : 1.8 : 23824 +fcemcom1 : : 59.0 : 21.2 : 3.3 : +cc -c : : 50.0 : 38.0 : 3.5 : 6788 +_ +fcemcom : ed.c : 37.0 : 23.6 : 2.3 : 41744 +fcemcom1 : : 1.16.0 : 28.3 : 4.6 : +cc -c : : 1.19.0 : 54.8 : 4.3 : 11108 +_ +fcemcom : cp.c : 4.0 : 2.4 : 0.8 : 4652 +fcemcom1 : : 9.0 : 3.0 : 1.0 : +cc -c : : 8.0 : 5.2 : 1.6 : 1048 +_ +fcemcom : uniq.c : 5.0 : 2.5 : 0.8 : 5568 +fcemcom1 : : 9.0 : 2.9 : 0.8 : +cc -c : : 13.0 : 5.4 : 2.0 : 3008 +_ +fcemcom : btlgrep.c : 24.0 : 7.2 : 1.4 : 12968 +fcemcom1 : : 23.0 : 8.1 : 1.2 : +cc -c : : 1.20.0 : 15.3 : 3.8 : 2392 +_ +fcemcom : fac.c : 1.0 : 0.1 : 0.5 : 216 +fcemcom1 : : 2.0 : 0.2 : 0.5 : +cc -c : : 3.0 : 0.7 : 1.3 : 92 +_ +fcemcom : foo.c : 4.0 : 0.2 : 0.5 : 272 +fcemcom1 : : 11.0 : 0.3 : 0.5 : +cc -c : : 7.0 : 0.8 : 1.6 : 108 +.TE +.NH 2 +Run time +.LP +Is the runtime very bad? +.TS +tab(:), box, center; +c | c | c | c | c +c | c | n | n | n. 
+compiler : program : real : user : system += +fcem : sort.c : 22.0 : 17.5 : 1.5 +cc : : 5.0 : 2.4 : 1.1 +_ +fcem : btlgrep.c : 1.58.0 : 27.2 : 4.2 +cc : : 12.0 : 3.6 : 1.1 +_ +fcem : foo.c : 1.0 : 0.7 : 0.1 +cc : : 1.0 : 0.4 : 0.1 +_ +fcem : uniq.c : 2.0 : 0.5 : 0.3 +cc : : 1.0 : 0.1 : 0.2 +.TE +.NH 2 +Quality of the object code +.LP +The runtime is very bad, so it is interesting to have a look at the code which is +produced by fcemcom and by cc -c. I took a program which computes n! +recursively. +.DS +long fac(); + +main() +{ + int n; + + scanf( "%D", &n); + printf( "fac is %D\\\\n", fac( n)); +} + +long fac( n) +int n; +{ + if ( n == 0) + return( 1); + else + return( n * fac( n-1)); +} +.DE +.br +.br +.br +.br +.LP +"cc -c fac.c" produces : +.DS +fac: tstl 4(ap) + bnequ 7f + movl $1, r0 + ret +7f: subl3 $1, 4(ap), r0 + pushl r0 + call $1, fac + movl r0, -4(fp) + mull3 -4(fp), 4(ap), r0 + ret +.DE +.br +.br +.LP +"fcem fac.c fac.o" produces : +.DS +_fac: 0 +42: jmp be +48: pushl 4(ap) +4e: pushl $0 +54: subl2 (sp)+,(sp) +57: tstl (sp)+ +59: bnequ 61 +5b: jmp 67 +61: jmp 79 +67: pushl $1 +6d: jmp ba +73: jmp b9 +79: pushl 4(ap) +7f: pushl $1 +85: subl2 (sp)+,(sp) +88: calls $0,_fac +8f: addl2 $4,sp +96: pushl r0 +98: pushl 4(ap) +9e: pushl $4 +a4: pushl $4 +aa: jsb .cii +b0: mull2 (sp)+,(sp) +b3: jmp ba +b9: ret +ba: movl (sp)+,r0 +bd: ret +be: jmp 48 +.DE +.NH 1 +Conclusions +.PP +Comparing "cc -c" with "fcemcom" ( each entry is the ratio fcemcom / cc -c): +.LP +.TS +center, box, tab(:); +c | c s | c | c s +^ | c s | ^ | c s +^ | c | c | ^ | c | c +l | n | n | n | n | n. 
+program : compile time : object size : runtime +:_::_ +: user : sys :: user : sys += +sort.c : 0.47 : 0.5 : 3.5 : 7.3 : 1.4 +_ +ed.c : 0.46 : 0.5 : 3.8 : : +_ +cp.c : 0.46 : 0.5 : 4.4 : : +_ +uniq.c : 0.46 : 0.4 : 1.8 : : +_ +btlgrep.c : 0.47 : 0.3 : 5.4 : 7.5 : 3.8 +_ +fac.c : 0.14 : 0.4 : 2.3 : 1.8 : 1.0 +_ +foo.c : 0.25 : 0.3 : 2.5 : 5.0 : 1.5 +.TE +.PP +The results for fcemcom1 are almost identical. The only thing that changes +is that fcemcom1 is about 1.2 times slower than fcemcom ( compile time). This is due +to a different data structure. In the static version we use huge arrays for +the text- and +data-segment, the relocation information, the symboltable and stringarea. +In the dynamic version we use linked lists, which makes it expensive to get +and to put a byte at an arbitrary memory location. So it is probably better +to use realloc(), because in most cases there will be enough memory. +.PP +The quality of the objectcode is very bad. The reason is that the frontend +generates bad code and expects the peephole-optimizer to improve the code. +This is also one of the main reasons that the runtime is very bad. +(e.g. the expensive "cii" with arguments 4 and 4 could be deleted.) +So it seems a good +idea to put a new peephole-optimizer between the frontend and the ce. +.PP +Using the peephole optimizer the ce would produce : +.DS +_fac: 0 + pushl 4(ap) + tstl (sp)+ + beqlu 1f + jmp 3f + 1 : pushl $1 + jmp 2f + 3 : pushl 4(ap) + decl (sp) + calls $0,_fac + addl2 $4,sp + pushl r0 + pushl 4(ap) + mull2 (sp)+,(sp) + movl (sp)+,r0 + 2 : ret +.DE +.PP +Bruce McKenzy already implemented it and made some improvements in the +source code of the ce. Compared with "cc -c", the compile-time is two to two and a half times better +and the +size of the objectcode is two to three times bigger. +Still we could do better. 
+.PP +Using peephole- and push/pop-optimization the ce could produce : +.DS +_fac: 0 + tstl 4(ap) + beqlu 1f + jmp 2f + 1 : pushl $1 + jmp 3f + 2 : decl 4(ap) + calls $0,_fac + addl2 $4,sp + mull3 4(ap), r0, -(sp) + movl (sp)+, r0 + 3 : ret +.DE +.PP +prof doesn't cooperate, so no profile information is available. +.PP