| .pl 12.5i
 | |
| .sp 1.5i
 | |
| .NH
 | |
| The compiler
 | |
| 
 | |
| .nh
 | |
| .LP
 | |
| The compiler can be divided roughly into four modules:
 | |
| 
 | |
| \(bu lexical analysis
 | |
| .br
 | |
| \(bu syntax analysis
 | |
| .br
 | |
| \(bu semantic analysis
 | |
| .br
 | |
| \(bu code generation
 | |
| .br
 | |
| 
 | |
| The four modules are grouped into one pass. The activity of these modules
 | |
| is interleaved during the pass.
 | |
| .br
 | |
| The lexical analyzer, some expression handling routines and various
 | |
| data structures were taken over from the Modula-2 compiler.
 | |
| .sp 2
 | |
| .NH 2
 | |
| Lexical Analysis
 | |
| 
 | |
| .LP
 | |
| The first module of the compiler is the lexical analyzer. In this module, the
 | |
| stream of input characters making up the source program is grouped into
 | |
| \fItokens\fR, as defined in \fBISO 6.1\fR. The analyzer is hand-written,
 | |
| because \fILex\fR [LEX], the lexical analyzer generator at our disposal,
 | |
| produces much slower analyzers. A character table, in the file
 | |
| \fIchar.c\fR, is created using the program \fItab\fR which takes as input
 | |
| the file \fIchar.tab\fR. In this table each character is placed into a
 | |
| particular class. Each class, as defined in the file \fIclass.h\fR,
 | |
| represents a set of tokens. The strategy of the analyzer is as follows: the
 | |
| first character of a new token is used in a multiway branch to eliminate as
 | |
| many candidate tokens as possible. Then the remaining characters of the token
 | |
| are read. The constant INP_NPUSHBACK, defined in the file \fIinput.h\fR,
 | |
| specifies the maximum number of characters the analyzer looks ahead. The
 | |
| value has to be at least 3, to handle input sequences such as:
 | |
| .br
 | |
|               1e+4  (which is a real number)
 | |
| .br
 | |
|               1e+a  (which is the integer 1, followed by the identifier "e", a plus, and the identifier "a")
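The need for three characters of lookahead can be sketched in C as follows. This is a minimal illustration, not the compiler's own scanner: the names NPUSHBACK (standing in for INP_NPUSHBACK), struct input, get_ch, unget and scan_number are hypothetical, and only the digits-plus-exponent form of a real number is handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define NPUSHBACK 3            /* hypothetical stand-in for INP_NPUSHBACK */

enum { TOK_INTEGER, TOK_REAL };

/* Hypothetical input source: a NUL-terminated string and a read position. */
struct input {
    const char *text;
    size_t pos;
};

/* Read one character; the terminating NUL acts as end of input. */
static int get_ch(struct input *in)
{
    return (unsigned char)in->text[in->pos++];
}

/* Push back the last n characters read; n may not exceed the limit. */
static void unget(struct input *in, size_t n)
{
    assert(n <= NPUSHBACK && n <= in->pos);
    in->pos -= n;
}

/* Scan a number whose first digit is at in->pos.  On return, in->pos is
 * just past the token that was recognized. */
static int scan_number(struct input *in)
{
    size_t looked = 1;         /* characters read beyond the digit sequence */
    int c;

    do {
        c = get_ch(in);
    } while (isdigit(c));      /* c is now the first non-digit */

    if (c != 'e' && c != 'E') {
        unget(in, 1);
        return TOK_INTEGER;
    }

    c = get_ch(in);            /* character after the 'e' */
    looked++;
    if (c == '+' || c == '-') {
        c = get_ch(in);        /* character after the sign */
        looked++;
    }

    if (!isdigit(c)) {
        /* "1e+a": give back 'e', '+' and 'a', three characters in all. */
        unget(in, looked);
        return TOK_INTEGER;
    }

    do {
        c = get_ch(in);
    } while (isdigit(c));
    unget(in, 1);              /* give back the character ending the real */
    return TOK_REAL;
}
```

For the input 1e+a the scanner has read "e", "+" and "a" before it discovers that no scale factor follows, so all three characters must be returned to the input.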
 | |
| 
 | |
| Another aspect of this module is the insertion and deletion of tokens
 | |
| required by the parser for recovery from syntactic errors (see also section
 | |
| 2.2). A generic input module [ACK] is used to avoid the burden of I/O.
 | |
| .sp 2
 | |
| .NH 2
 | |
| Syntax Analysis
 | |
| 
 | |
| .LP
 | |
| The second module of the compiler is the parser, which is the central part of
 | |
| the compiler. It invokes the routines of the other modules. The tokens obtained
 | |
| from the lexical analyzer are grouped into grammatical phrases. These phrases
 | |
| are stored as parse trees and handed over to the next part. The parser is
 | |
| generated using \fILLgen\fR [LL], a tool for generating an efficient recursive
 | |
| descent parser without backtracking from an Extended Context Free Syntax.
 | |
| .br
 | |
| An error recovery mechanism is generated almost completely automatically. A
 | |
| routine called \fILLmessage\fR had to be written, which gives the necessary
 | |
| error messages and deals with the insertion and deletion of tokens.
 | |
| The routine \fILLmessage\fR must accept one parameter, whose value is
 | |
| a token number, zero or -1. A zero parameter indicates that the current token
 | |
| (the one in the external variable \fILLsymb\fR) is deleted.
 | |
| A -1 parameter indicates that the parser expected end of file, but did
 | |
| not get it. The parser will then skip tokens until end of file is detected.
 | |
| A parameter that is a token number (a positive parameter) indicates that
 | |
| this token is to be inserted in front of the token currently in \fILLsymb\fR.
 | |
| Care must also be taken that the token currently in \fILLsymb\fR is
 | |
| returned again by the \fBnext\fR call to the lexical analyzer, with the proper
 | |
| attributes. The lexical analyzer must therefore have a facility to push back
 | |
| one token.
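The three cases of the \fILLmessage\fR parameter, together with the one-token pushback, might look roughly like the C sketch below. Only the names LLmessage and LLsymb come from the text; the token source, the helper next_token, the variables pushed_back and saved_token, and the use of 0 as an end-of-input marker are assumptions made for illustration. In a real parser LLsymb is supplied by the LLgen-generated code, and the token attributes would have to be saved along with the token number.

```c
#include <assert.h>
#include <stdio.h>

/* Supplied by LLgen in a real parser; defined here so the sketch
 * compiles on its own. */
int LLsymb;

/* Hypothetical token source: an array of token numbers ending in 0. */
static const int *token_stream;

/* Hypothetical one-token pushback facility of the lexical analyzer. */
static int pushed_back = 0;    /* nonzero while a token is waiting */
static int saved_token;

/* Return the next token, giving out a pushed-back token first. */
static int next_token(void)
{
    int t;

    if (pushed_back) {
        pushed_back = 0;
        return saved_token;
    }
    t = *token_stream;
    if (t != 0)
        token_stream++;        /* stay on the end marker once reached */
    return t;
}

/* Error routine called by the parser during error recovery. */
void LLmessage(int tk)
{
    if (tk > 0) {
        /* Token tk is inserted in front of the current token, so the
         * current token must be returned again by the next call. */
        fprintf(stderr, "token %d inserted\n", tk);
        assert(!pushed_back);  /* only one token can be pushed back */
        saved_token = LLsymb;
        pushed_back = 1;
    } else if (tk == 0) {
        /* The current token is deleted. */
        fprintf(stderr, "token %d deleted\n", LLsymb);
    } else {
        /* tk == -1: end of file expected, the parser skips the rest. */
        fprintf(stderr, "end of file expected\n");
    }
}
```

After an insertion, the next call to the lexical analyzer yields the token that was current when the error was detected, so no input is lost.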
 | |
| .br
 | |
| Calls to the two standard procedures \fIwrite\fR and \fIwriteln\fR can be
 | |
| different from calls to other procedures. The syntax of a write-parameter
 | |
| is different from the syntax of an actual-parameter. We decided to include
 | |
| them, together with \fIread\fR and \fIreadln\fR, in the grammar. An alternative
 | |
| solution would be to make the syntax of an actual-parameter identical to the
 | |
| syntax of a write-parameter. Afterwards the parameter has to be checked to
 | |
| see whether it is used properly or not.
 | |
| .bp
 | |
| As the parser is LL(1), it must always be able to determine what to do,
 | |
| based on the last token read (\fILLsymb\fR). Unfortunately, this was not the
 | |
| case with the grammar as specified in [ISO]. Two kinds of problems
 | |
| appeared, viz. the \fBalternation\fR and \fBrepetition\fR conflicts.
 | |
| The examples given in the following paragraphs are taken from the grammar.
 | |
| 
 | |
| .NH 3
 | |
| Alternation conflict
 | |
| 
 | |
| .LP
 | |
| An alternation conflict arises when the parser cannot decide which
 | |
| production to choose.
 | |
| .br
 | |
| \fBExample:\fR
 | |
| .in +2m
 | |
| .ft 5
 | |
| .nf
 | |
| procedure-declaration    : procedure-heading \fB';'\f5 directive |
 | |
| .br
 | |
| \h'\w'procedure-declaration    : 'u'procedure-identification \fB';'\f5 procedure-block |
 | |
| .br
 | |
| \h'\w'procedure-declaration    : 'u'procedure-heading \fB';'\f5 procedure-block ;
 | |
| .br
 | |
| procedure-heading        : \fBprocedure\f5 identifier [ formal-parameter-list ]? ;
 | |
| .br
 | |
| procedure-identification : \fBprocedure\f5 procedure-identifier ;
 | |
| .fi
 | |
| .ft R
 | |
| .in -2m
 | |
| 
 | |
| A sentence that starts with the terminal \fBprocedure\fR can be derived from
 | |
| any of the three alternative productions. This conflict can be resolved in
 | |
| two ways. The first is to adjust the grammar: usually some rules are replaced
 | |
| by one rule, and more work has to be done in the semantic analysis. The
 | |
| second is to use the LLgen conflict resolver "\fB%if\fR (C-expression)": if
 | |
| the C-expression evaluates to non-zero, the production in question is chosen,
 | |
| otherwise one of the remaining rules is chosen. The grammar rules were
 | |
| rewritten to solve this conflict. The new rules are given below. For more
 | |
| details see the file \fIdeclar.g\fR.
 | |
| 
 | |
| .in +2m
 | |
| .ft 5
 | |
| .nf
 | |
| procedure-declaration : procedure-heading \fB';'\f5 ( directive | procedure-block ) ;
 | |
| .br
 | |
| procedure-heading     : \fBprocedure\f5 identifier [ formal-parameter-list ]? ;
 | |
| .fi
 | |
| .ft R
 | |
| .in -2m
 | |
| 
 | |
| A special case of an alternation conflict, which is common to many block
 | |
| structured languages, is the \fI"dangling-else"\fR ambiguity.
 | |
| 
 | |
| .in +2m
 | |
| .ft 5
 | |
| .nf
 | |
| if-statement : \fBif\f5 boolean-expression \fBthen\f5 statement [ else-part ]? ;
 | |
| .br
 | |
| else-part    : \fBelse\f5 statement ;
 | |
| .fi
 | |
| .ft R
 | |
| .in -2m
 | |
| 
 | |
| The following statement, which can be derived from the rules above, is ambiguous:
 | |
| 
 | |
| .ti +2m
 | |
| \fBif\f5 boolean-expr-1 \fBthen\f5 \fBif\f5 boolean-expr-2 \fBthen\f5 statement-1 \fBelse\f5 statement-2
 | |
| .ft R
 | |
| 
 | |
| 
 | |
| .ps 8
 | |
| .vs 7
 | |
| .PS
 | |
| move right 1.1i
 | |
| S: line down 0.5i
 | |
| "if-statement" at S.start above
 | |
| .ft B
 | |
| "then" at S.end below
 | |
| .ft R
 | |
| move to S.start then down 0.25i
 | |
| L: line left 0.5i then down 0.25i
 | |
| box ht 0.33i wid 0.6i "boolean" "expression-1"
 | |
| move to L.start then left 0.5i
 | |
| L: line left 0.5i then down 0.25i
 | |
| .ft B
 | |
| "if" at L.end below
 | |
| .ft R
 | |
| move to L.start then right 0.5i
 | |
| L: line right 0.5i then down 0.25i
 | |
| "statement" at L.end below
 | |
| move to L.end then down 0.10i
 | |
| L: line down 0.25i dashed
 | |
| "if-statement" at L.end below
 | |
| move to L.end then down 0.10i
 | |
| L: line down 0.5i
 | |
| .ft B
 | |
| "then" at L.end below
 | |
| .ft R
 | |
| move to L.start then down 0.25i
 | |
| L: line left 0.5i then down 0.25i
 | |
| box ht 0.33i wid 0.6i "boolean" "expression-2"
 | |
| move to L.start then left 0.5i
 | |
| L: line left 0.5i then down 0.25i
 | |
| .ft B
 | |
| "if" at L.end below
 | |
| .ft R
 | |
| move to L.start then right 0.5i
 | |
| L: line right 0.5i then down 0.25i
 | |
| box ht 0.33i wid 0.6i "statement-1"
 | |
| move to L.start then right 0.5i
 | |
| L: line right 0.5i then down 0.25i
 | |
| .ft B
 | |
| "else" at L.end below
 | |
| .ft R
 | |
| move to L.start then right 0.5i
 | |
| L: line right 0.5i then down 0.25i
 | |
| box ht 0.33i wid 0.6i "statement-2"
 | |
| move to S.start
 | |
| move right 3.5i
 | |
| L: line down 0.5i
 | |
| "if-statement" at L.start above
 | |
| .ft B
 | |
| "then" at L.end below
 | |
| .ft R
 | |
| move to L.start then down 0.25i
 | |
| L: line left 0.5i then down 0.25i
 | |
| box ht 0.33i wid 0.6i "boolean" "expression-1"
 | |
| move to L.start then left 0.5i
 | |
| L: line left 0.5i then down 0.25i
 | |
| .ft B
 | |
| "if" at L.end below
 | |
| .ft R
 | |
| move to L.start then right 0.5i
 | |
| S: line right 0.5i then down 0.25i
 | |
| "statement" at S.end below
 | |
| move to S.start then right 0.5i
 | |
| L: line right 0.5i then down 0.25i
 | |
| .ft B
 | |
| "else" at L.end below
 | |
| .ft R
 | |
| move to L.start then right 0.5i
 | |
| L: line right 0.5i then down 0.25i
 | |
| box ht 0.33i wid 0.6i "statement-2"
 | |
| move to S.end then down 0.10i
 | |
| L: line down 0.25i dashed
 | |
| "if-statement" at L.end below
 | |
| move to L.end then down 0.10i
 | |
| L: line down 0.5i
 | |
| .ft B
 | |
| "then" at L.end below
 | |
| .ft R
 | |
| move to L.start then down 0.25i
 | |
| L: line left 0.5i then down 0.25i
 | |
| box ht 0.33i wid 0.6i "boolean" "expression-2"
 | |
| move to L.start then left 0.5i
 | |
| L: line left 0.5i then down 0.25i
 | |
| .ft B
 | |
| "if" at L.end below
 | |
| .ft R
 | |
| move to L.start then right 0.5i
 | |
| L: line right 0.5i then down 0.25i
 | |
| box ht 0.33i wid 0.6i "statement-1"
 | |
| .PE
 | |
| .ps
 | |
| .vs
 | |
| \h'615u'(a)\h'1339u'(b)
 | |
| .sp
 | |
| .ce
 | |
| Two parse trees showing the \fIdangling-else\fR ambiguity
 | |
| .sp 2
 | |
| According to the standard, \fBelse\fR is matched with the nearest preceding
 | |
| unmatched \fBthen\fR, i.e. parse tree (a) is valid (\fBISO 6.8.3.4\fR).
 | |
| This conflict is statically resolved in LLgen by using "\fB%prefer\fR",
 | |
| which is equivalent in behaviour to "\fB%if\fR(1)".
 | |
| .bp
 | |
| .NH 3
 | |
| Repetition conflict
 | |
| 
 | |
| .LP
 | |
| A repetition conflict arises when the parser cannot decide whether to choose
 | |
| a production once more, or not.
 | |
| .br
 | |
| \fBExample:\fR
 | |
| .in +2m
 | |
| .ft 5
 | |
| .nf
 | |
| field-list : [ ( fixed-part [ \fB';'\f5 variant-part ]? | variant-part ) [ \fB';'\f5 ]? ]? ;
 | |
| .br
 | |
| fixed-part : record-section [ \fB';'\f5 record-section ]* ;
 | |
| .fi
 | |
| .in -2m
 | |
| .ft R
 | |
| 
 | |
| When the parser sees the semicolon, it cannot decide whether another
 | |
| record-section or a variant-part follows. This conflict can be resolved in
 | |
| two ways: adjusting the grammar or using the conflict resolver,
 | |
| "\fB%while\fR (C-expression)". The grammar rules that deal with this conflict
 | |
| were completely rewritten. For more details, the reader is referred to the
 | |
| file \fIdeclar.g\fR.
 | |
| .sp 2
 | |
| .NH 2
 | |
| Semantic Analysis
 | |
| 
 | |
| .LP
 | |
| The third module of the compiler checks the semantic conventions of
 | |
| ISO-Pascal. To check the program being parsed, actions have been used in
 | |
| LLgen. An action consists of several C-statements, enclosed in braces
 | |
| "{" and "}". In order to facilitate communication between the actions and
 | |
| \fILLparse\fR, the parsing routines can be given C-like parameters and
 | |
| local variables. An important part of the semantic analyzer is the symbol
 | |
| table. This table stores all information concerning identifiers and their
 | |
| definitions. Symbol-table lookup and hashing are done by a generic namelist
 | |
| module [ACK]. The parser turns each program construction into a parse tree,
 | |
| which is the major data structure in the compiler. This parse tree is used
 | |
| to exchange information between various routines.
 | |
| .sp 2
 | |
| .NH 2
 | |
| Code Generation
 | |
| 
 | |
| .LP
 | |
| The final module in the compiler is that of code generation. The information
 | |
| stored in the parse trees is used to generate the EM code [EM]. EM code is
 | |
| generated with the help of a procedural EM-code interface [ACK]. Static code
 | |
| exchanges are not used, since the fast back end cannot cope with them.
 | |
| Hence the EM pseudoinstruction \fBexc\fR is never generated.
 | |
| .br
 | |
| Chapter 3 discusses the code generation in more detail.
 | |
| .sp 2
 | |
| .NH 2
 | |
| Error Handling
 | |
| 
 | |
| .LP
 | |
| The first three modules have in common that they can detect errors in the
 | |
| Pascal program being compiled. If so, an appropriate message is given
 | |
| and some action is performed. If code generation has to be aborted, an error
 | |
| message is given, otherwise a warning is given. The constant MAXERR_LINE,
 | |
| defined in the file \fIerrout.h\fR, specifies the maximum number of messages
 | |
| given per line. This can be used to avoid long lists of error messages caused
 | |
| by, for example, the omission of a ';'. Three kinds of errors can be
 | |
| distinguished: the lexical error, the syntactic error, and the semantic error.
 | |
| Examples of these errors are, respectively, nested comments, an expression with
 | |
| unbalanced parentheses, and the addition of two characters.
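The effect of a per-line message limit can be illustrated with the C sketch below. It is not the compiler's error module: the routine report, its parameters, the bookkeeping variables and the value chosen for MAXERR_LINE are all assumptions made for the example.

```c
#include <assert.h>
#include <stdio.h>

#define MAXERR_LINE 5          /* value assumed for illustration */

static unsigned last_line = 0; /* source line of the previous message */
static int count_on_line = 0;  /* messages already given for that line */

/* Give a message for the given source line unless the limit for that
 * line has been reached.  Returns 1 if the message was printed and 0
 * if it was suppressed. */
static int report(unsigned line, const char *kind, const char *msg)
{
    assert(kind != NULL && msg != NULL);

    if (line != last_line) {   /* first message on a new line */
        last_line = line;
        count_on_line = 0;
    }
    if (count_on_line >= MAXERR_LINE)
        return 0;              /* avoid a long list for one line */

    count_on_line++;
    fprintf(stderr, "line %u: (%s) %s\n", line, kind, msg);
    return 1;
}
```

With this scheme a missing ';' that triggers a cascade of follow-up messages on one line produces at most MAXERR_LINE of them, while messages for later lines are still given.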
 | |
| .sp 2
 | |
| .NH 2
 | |
| Memory Allocation and Garbage Collection
 | |
| 
 | |
| .LP
 | |
| The routines \fIst_alloc\fR and \fIst_free\fR provide a mechanism for
 | |
| maintaining free lists of structures, whose first field is a pointer called
 | |
| \fBnext\fR. This field is used to chain free structures together. Each
 | |
| structure type, say with tag ST, has a free list pointed
 | |
| to by h_ST. Associated with this list are the operations: \fInew_ST()\fR, an
 | |
| allocating mechanism which supplies the space for a new ST struct; and
 | |
| \fIfree_ST()\fR, a garbage collecting mechanism which links the specified
 | |
| structure into the free list.
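A free list of this kind can be sketched in C as follows. The names h_ST, new_ST and free_ST follow the text; the body of struct ST, the use of malloc for a fresh structure and the zeroing of the payload are assumptions, and the actual \fIst_alloc\fR and \fIst_free\fR routines may work differently (for instance by allocating several structures at once).

```c
#include <assert.h>
#include <stdlib.h>

/* Example structure; the first field must be the pointer called next. */
struct ST {
    struct ST *next;
    int value;                 /* hypothetical payload */
};

static struct ST *h_ST = NULL; /* head of the free list for struct ST */

/* Supply space for a new ST: reuse a free structure if there is one,
 * otherwise obtain fresh memory. */
static struct ST *new_ST(void)
{
    struct ST *p = h_ST;

    if (p != NULL) {
        h_ST = p->next;        /* unlink from the free list */
    } else {
        p = malloc(sizeof *p);
        if (p == NULL)
            abort();           /* out of memory */
    }
    p->next = NULL;
    p->value = 0;
    return p;
}

/* Link the specified structure into the free list for later reuse. */
static void free_ST(struct ST *p)
{
    assert(p != NULL);
    p->next = h_ST;
    h_ST = p;
}
```

A structure handed to free_ST() is not returned to the system; it is simply the first one given out again by the next call of new_ST().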
 | |
| .bp
 |