343 lines
12 KiB
Plaintext
343 lines
12 KiB
Plaintext
.pl 12.5i
|
|
.sp 1.5i
|
|
.NH
|
|
The compiler
|
|
|
|
.nh
|
|
.LP
|
|
The compiler can be divided roughly into four modules:
|
|
|
|
\(bu lexical analysis
|
|
.br
|
|
\(bu syntax analysis
|
|
.br
|
|
\(bu semantic analysis
|
|
.br
|
|
\(bu code generation
|
|
.br
|
|
|
|
The four modules are grouped into one pass. The activity of these modules
|
|
is interleaved during the pass.
|
|
.br
|
|
The lexical analyzer, some expression handling routines and various
|
|
datastructures from the Modula-2 compiler contributed to the project.
|
|
.sp 2
|
|
.NH 2
|
|
Lexical Analysis
|
|
|
|
.LP
|
|
The first module of the compiler is the lexical analyzer. In this module, the
|
|
stream of input characters making up the source program is grouped into
|
|
\fItokens\fR, as defined in \fBISO 6.1\fR. The analyzer is hand-written,
|
|
because the lexical analyzer generator, which was at our disposal,
|
|
\fILex\fR [LEX], produces much slower analyzers. A character table, in the file
|
|
\fIchar.c\fR, is created using the program \fItab\fR which takes as input
|
|
the file \fIchar.tab\fR. In this table each character is placed into a
|
|
particular class. The classes, as defined in the file \fIclass.h\fR,
|
|
represent a set of tokens. The strategy of the analyzer is as follows: the
|
|
first character of a new token is used in a multiway branch to eliminate as
|
|
many candidate tokens as possible. Then the remaining characters of the token
|
|
are read. The constant INP_NPUSHBACK, defined in the file \fIinput.h\fR,
|
|
specifies the maximum number of characters the analyzer looks ahead. The
|
|
value has to be at least 3, to handle input sequences such as:
|
|
.br
|
|
1e+4 (which is a real number)
|
|
.br
|
|
1e+a (which is the integer 1, followed by the identifier "e", a plus, and the identifier "a")
|
|
|
|
Another aspect of this module is the insertion and deletion of tokens
|
|
required by the parser for the recovery of syntactic errors (see also section
|
|
2.2). A generic input module [ACK] is used to avoid the burden of I/O.
|
|
.sp 2
|
|
.NH 2
|
|
Syntax Analysis
|
|
|
|
.LP
|
|
The second module of the compiler is the parser, which is the central part of
|
|
the compiler. It invokes the routines of the other modules. The tokens obtained
|
|
from the lexical analyzer are grouped into grammatical phrases. These phrases
|
|
are stored as parse trees and handed over to the next part. The parser is
|
|
generated using \fILLgen\fR[LL], a tool for generating an efficient recursive
|
|
descent parser with no backtrack from an Extended Context Free Syntax.
|
|
.br
|
|
An error recovery mechanism is generated almost completely automatically. A
|
|
routine called \fILLmessage\fR had to be written, which gives the necessary
|
|
error messages and deals with the insertion and deletion of tokens.
|
|
The routine \fILLmessage\fR must accept one parameter, whose value is
|
|
a token number, zero or -1. A zero parameter indicates that the current token
|
|
(the one in the external variable \fILLsymb\fR) is deleted.
|
|
A -1 parameter indicates that the parser expected end of file, but did
|
|
not get it. The parser will then skip tokens until end of file is detected.
|
|
A parameter that is a token number (a positive parameter) indicates that
|
|
this token is to be inserted in front of the token currently in \fILLsymb\fR.
|
|
Also, care must be taken, that the token currently in \fILLsymb\fR is again
|
|
returned by the \fBnext\fR call to the lexical analyzer, with the proper
|
|
attributes. So, the lexical analyzer must have a facility to push back one
|
|
token.
|
|
.br
|
|
Calls to the two standard procedures \fIwrite\fR and \fIwriteln\fR can be
|
|
different from calls to other procedures. The syntax of a write-parameter
|
|
is different from the syntax of an actual-parameter. We decided to include
|
|
them, together with \fIread\fR and \fIreadln\fR, in the grammar. An alternate
|
|
solution would be to make the syntax of an actual-parameter identical to the
|
|
syntax of a write-parameter. Afterwards the parameter has to be checked to
|
|
see whether it is used properly or not.
|
|
.bp
|
|
As the parser is LL(1), it must always be able to determine what to do,
|
|
based on the last token read (\fILLsymb\fR). Unfortunately, this was not the
|
|
case with the grammar as specified in [ISO]. Two kinds of problems
|
|
appeared, viz. the \fBalternation\fR and \fBrepetition\fR conflict.
|
|
The examples given in the following paragraphs are taken from the grammar.
|
|
|
|
.NH 3
|
|
Alternation conflict
|
|
|
|
.LP
|
|
An alternation conflict arises when the parser can not decide which
|
|
production to choose.
|
|
.br
|
|
\fBExample:\fR
|
|
.in +2m
|
|
.ft 5
|
|
.nf
|
|
procedure-declaration : procedure-heading \fB';'\f5 directive |
|
|
.br
|
|
\h'\w'procedure-declaration : 'u'procedure-identification \fB';'\f5 procedure-block |
|
|
.br
|
|
\h'\w'procedure-declaration : 'u'procedure-heading \fB';'\f5 procedure-block ;
|
|
.br
|
|
procedure-heading : \fBprocedure\f5 identifier [ formal-parameter-list ]? ;
|
|
.br
|
|
procedure-identification : \fBprocedure\f5 procedure-identifier ;
|
|
.fi
|
|
.ft R
|
|
.in -2m
|
|
|
|
A sentence that starts with the terminal \fBprocedure\fR is derived from the
|
|
three alternative productions. This conflict can be resolved in two ways:
|
|
adjusting the grammar, usually some rules are replaced by one rule and more
|
|
work has to be done in the semantic analysis; using the LLgen conflict
|
|
resolver, "\fB%if\fR (C-expression)", if the C-expression evaluates to
|
|
non-zero, the production in question is chosen, otherwise one of the
|
|
remaining rules is chosen. The grammar rules were rewritten to solve this
|
|
conflict. The new rules are given below. For more details see the file
|
|
\fIdeclar.g\fR.
|
|
|
|
.in +2m
|
|
.ft 5
|
|
.nf
|
|
procedure-declaration : procedure-heading \fB';'\f5 ( directive | procedure-block ) ;
|
|
.br
|
|
procedure-heading : \fBprocedure\f5 identifier [ formal-parameter-list ]? ;
|
|
.fi
|
|
.ft R
|
|
.in -2m
|
|
|
|
A special case of an alternation conflict, which is common to many block
|
|
structured languages, is the \fI"dangling-else"\fR ambiguity.
|
|
|
|
.in +2m
|
|
.ft 5
|
|
.nf
|
|
if-statement : \fBif\f5 boolean-expression \fBthen\f5 statement [ else-part ]? ;
|
|
.br
|
|
else-part : \fBelse\f5 statement ;
|
|
.fi
|
|
.ft R
|
|
.in -2m
|
|
|
|
The following statement that can be derived from the rules above is ambiguous:
|
|
|
|
.ti +2m
|
|
\fBif\f5 boolean-expr-1 \fBthen\f5 \fBif\f5 boolean-expr-2 \fBthen\f5 statement-1 \fBelse\f5 statement-2
|
|
.ft R
|
|
|
|
|
|
.ps 8
|
|
.vs 7
|
|
.PS
|
|
move right 1.1i
|
|
S: line down 0.5i
|
|
"if-statement" at S.start above
|
|
.ft B
|
|
"then" at S.end below
|
|
.ft R
|
|
move to S.start then down 0.25i
|
|
L: line left 0.5i then down 0.25i
|
|
box ht 0.33i wid 0.6i "boolean" "expression-1"
|
|
move to L.start then left 0.5i
|
|
L: line left 0.5i then down 0.25i
|
|
.ft B
|
|
"if" at L.end below
|
|
.ft R
|
|
move to L.start then right 0.5i
|
|
L: line right 0.5i then down 0.25i
|
|
"statement" at L.end below
|
|
move to L.end then down 0.10i
|
|
L: line down 0.25i dashed
|
|
"if-statement" at L.end below
|
|
move to L.end then down 0.10i
|
|
L: line down 0.5i
|
|
.ft B
|
|
"then" at L.end below
|
|
.ft R
|
|
move to L.start then down 0.25i
|
|
L: line left 0.5i then down 0.25i
|
|
box ht 0.33i wid 0.6i "boolean" "expression-2"
|
|
move to L.start then left 0.5i
|
|
L: line left 0.5i then down 0.25i
|
|
.ft B
|
|
"if" at L.end below
|
|
.ft R
|
|
move to L.start then right 0.5i
|
|
L: line right 0.5i then down 0.25i
|
|
box ht 0.33i wid 0.6i "statement-1"
|
|
move to L.start then right 0.5i
|
|
L: line right 0.5i then down 0.25i
|
|
.ft B
|
|
"else" at L.end below
|
|
.ft R
|
|
move to L.start then right 0.5i
|
|
L: line right 0.5i then down 0.25i
|
|
box ht 0.33i wid 0.6i "statement-2"
|
|
move to S.start
|
|
move right 3.5i
|
|
L: line down 0.5i
|
|
"if-statement" at L.start above
|
|
.ft B
|
|
"then" at L.end below
|
|
.ft R
|
|
move to L.start then down 0.25i
|
|
L: line left 0.5i then down 0.25i
|
|
box ht 0.33i wid 0.6i "boolean" "expression-1"
|
|
move to L.start then left 0.5i
|
|
L: line left 0.5i then down 0.25i
|
|
.ft B
|
|
"if" at L.end below
|
|
.ft R
|
|
move to L.start then right 0.5i
|
|
S: line right 0.5i then down 0.25i
|
|
"statement" at S.end below
|
|
move to S.start then right 0.5i
|
|
L: line right 0.5i then down 0.25i
|
|
.ft B
|
|
"else" at L.end below
|
|
.ft R
|
|
move to L.start then right 0.5i
|
|
L: line right 0.5i then down 0.25i
|
|
box ht 0.33i wid 0.6i "statement-2"
|
|
move to S.end then down 0.10i
|
|
L: line down 0.25i dashed
|
|
"if-statement" at L.end below
|
|
move to L.end then down 0.10i
|
|
L: line down 0.5i
|
|
.ft B
|
|
"then" at L.end below
|
|
.ft R
|
|
move to L.start then down 0.25i
|
|
L: line left 0.5i then down 0.25i
|
|
box ht 0.33i wid 0.6i "boolean" "expression-2"
|
|
move to L.start then left 0.5i
|
|
L: line left 0.5i then down 0.25i
|
|
.ft B
|
|
"if" at L.end below
|
|
.ft R
|
|
move to L.start then right 0.5i
|
|
L: line right 0.5i then down 0.25i
|
|
box ht 0.33i wid 0.6i "statement-1"
|
|
.PE
|
|
.ps
|
|
.vs
|
|
\h'615u'(a)\h'1339u'(b)
|
|
.sp
|
|
.ce
|
|
Two parse trees showing the \fIdangling-else\fR ambiguity
|
|
.sp 2
|
|
According to the standard, \fBelse\fR is matched with the nearest preceding
|
|
unmatched \fBthen\fR, i.e. parse tree (a) is valid (\fBISO 6.8.3.4\fR).
|
|
This conflict is statically resolved in LLgen by using "\fB%prefer\fR",
|
|
which is equivalent in behaviour to "\fB%if\fR(1)".
|
|
.bp
|
|
.NH 3
|
|
Repetition conflict
|
|
|
|
.LP
|
|
A repetition conflict arises when the parser can not decide whether to choose
|
|
a production once more, or not.
|
|
.br
|
|
\fBExample:\fR
|
|
.in +2m
|
|
.ft 5
|
|
.nf
|
|
field-list : [ ( fixed-part [ \fB';'\f5 variant-part ]? | variantpart ) [;]? ]? ;
|
|
.br
|
|
fixed-part : record-section [ \fB';'\f5 record-section ]* ;
|
|
.fi
|
|
.in -2m
|
|
.ft R
|
|
|
|
When the parser sees the semicolon, it can not decide whether another
|
|
record-section or a variant-part follows. This conflict can be resolved in
|
|
two ways: adjusting the grammar or using the conflict resolver,
|
|
"\fB%while\fR (C-expression)". The grammar rules that deal with this conflict
|
|
were completely rewritten. For more details, the reader is referred to the
|
|
file \fIdeclar.g\fR.
|
|
.sp 2
|
|
.NH 2
|
|
Semantic Analysis
|
|
|
|
.LP
|
|
The third module of the compiler is the checking of semantic conventions of
|
|
ISO-Pascal. To check the program being parsed, actions have been used in
|
|
LLgen. An action consists of several C-statements, enclosed in brackets
|
|
"{" and "}". In order to facilitate communication between the actions and
|
|
\fILLparse\fR, the parsing routines can be given C-like parameters and
|
|
local variables. An important part of the semantic analyzer is the symbol
|
|
table. This table stores all information concerning identifiers and their
|
|
definitions. Symbol-table lookup and hashing is done by a generic namelist
|
|
module [ACK]. The parser turns each program construction into a parse tree,
|
|
which is the major datastructure in the compiler. This parse tree is used
|
|
to exchange information between various routines.
|
|
.sp 2
|
|
.NH 2
|
|
Code Generation
|
|
|
|
.LP
|
|
The final module in the compiler is that of code generation. The information
|
|
stored in the parse trees is used to generate the EM code [EM]. EM code is
|
|
generated with the help of a procedural EM-code interface [ACK]. The use of
|
|
static exchanges is not desired, since the fast back end can not cope with
|
|
static code exchanges, hence the EM pseudoinstruction \fBexc\fR is never
|
|
generated.
|
|
.br
|
|
Chapter 3 discusses the code generation in more detail.
|
|
.sp 2
|
|
.NH 2
|
|
Error Handling
|
|
|
|
.LP
|
|
The first three modules have in common that they can detect errors in the
|
|
Pascal program being compiled. If this is the case, a proper message is given
|
|
and some action is performed. If code generation has to be aborted, an error
|
|
message is given, otherwise a warning is given. The constant MAXERR_LINE,
|
|
defined in the file \fIerrout.h\fR, specifies the maximum number of messages
|
|
given per line. This can be used to avoid long lists of error messages caused
|
|
by, for example, the omission of a ';'. Three kinds of errors can be
|
|
distinguished: the lexical error, the syntactic error, and the semantic error.
|
|
Examples of these errors are respectively, nested comments, an expression with
|
|
unbalanced parentheses, and the addition of two characters.
|
|
.sp 2
|
|
.NH 2
|
|
Memory Allocation and Garbage Collection
|
|
|
|
.LP
|
|
The routines \fIst_alloc\fR and \fIst_free\fR provide a mechanism for
|
|
maintaining free lists of structures, whose first field is a pointer called
|
|
\fBnext\fR. This field is used to chain free structures together. Each
|
|
structure, suppose the tag of the structure is ST, has a free list pointed
|
|
by h_ST. Associated with this list are the operations: \fInew_ST()\fR, an
|
|
allocating mechanism which supplies the space for a new ST struct; and
|
|
\fIfree_ST()\fR, a garbage collecting mechanism which links the specified
|
|
structure into the free list.
|
|
.bp
|