ack/doc/pascal/internal.doc
1991-11-01 09:43:36 +00:00


.pl 12.5i
.sp 1.5i
.NH
The compiler
.nh
.LP
The compiler can be divided roughly into four modules:
\(bu lexical analysis
.br
\(bu syntax analysis
.br
\(bu semantic analysis
.br
\(bu code generation
.br
The four modules are grouped into one pass. The activity of these modules
is interleaved during the pass.
.br
The lexical analyzer, some expression handling routines and various
data structures from the Modula-2 compiler were reused in this project.
.sp 2
.NH 2
Lexical Analysis
.LP
The first module of the compiler is the lexical analyzer. In this module, the
stream of input characters making up the source program is grouped into
\fItokens\fR, as defined in \fBISO 6.1\fR. The analyzer is hand-written,
because the lexical analyzer generator, which was at our disposal,
\fILex\fR [LEX], produces much slower analyzers. A character table, in the file
\fIchar.c\fR, is created using the program \fItab\fR which takes as input
the file \fIchar.tab\fR. In this table each character is placed into a
particular class. Each class, as defined in the file \fIclass.h\fR,
represents a set of tokens. The strategy of the analyzer is as follows: the
first character of a new token is used in a multiway branch to eliminate as
many candidate tokens as possible. Then the remaining characters of the token
are read. The constant INP_NPUSHBACK, defined in the file \fIinput.h\fR,
specifies the maximum number of characters the analyzer looks ahead. The
value has to be at least 3, to handle input sequences such as:
.br
1e+4 (which is a real number)
.br
1e+a (which is the integer 1, followed by the identifier "e", a plus, and the identifier "a")
.br
Another aspect of this module is the insertion and deletion of tokens
required by the parser for the recovery from syntactic errors (see also section
2.2). A generic input module [ACK] is used to avoid the burden of I/O.
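.LP
The need for three characters of pushback can be illustrated with a small
C sketch. It is not the code of the compiler: the \fIinput\fR structure and
the routines \fIin_get\fR, \fIin_unget\fR and \fIscan_number\fR are
hypothetical, the value given to INP_NPUSHBACK is assumed, and the fraction
part of a real number is left out for brevity.
.nf
```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define INP_NPUSHBACK 3   /* name from input.h; the value is assumed here */

/* Hypothetical in-memory input, not the ACK input module. */
struct input {
    const char *text;     /* NUL-terminated source text */
    size_t pos;           /* index of the next character to read */
};

/* Read one character; 0 signals end of input and is not consumed. */
static int in_get(struct input *in)
{
    int c = (unsigned char)in->text[in->pos];
    if (c != 0)
        in->pos++;
    return c;
}

/* Undo the last n reads; n may never exceed INP_NPUSHBACK. */
static void in_unget(struct input *in, size_t n)
{
    assert(n <= INP_NPUSHBACK && n <= in->pos);
    in->pos -= n;
}

enum num_kind { NUM_NONE, NUM_INTEGER, NUM_REAL };

/* Scan digits, optionally followed by e [sign] digits. */
enum num_kind scan_number(struct input *in)
{
    size_t ahead = 0;     /* characters consumed after the digit string */
    int c = in_get(in);

    if (!isdigit(c)) {
        if (c != 0)
            in_unget(in, 1);
        return NUM_NONE;
    }
    do {
        c = in_get(in);
    } while (isdigit(c));
    if (c != 0)
        ahead = 1;        /* the first non-digit has been consumed */

    if (c == 'e' || c == 'E') {
        c = in_get(in);
        ahead += (c != 0);
        if (c == '+' || c == '-') {
            c = in_get(in);
            ahead += (c != 0);
        }
        if (isdigit(c)) {
            do {
                c = in_get(in);
            } while (isdigit(c));
            if (c != 0)
                in_unget(in, 1);
            return NUM_REAL;
        }
    }
    /* No exponent after all: give back everything read past the digits. */
    in_unget(in, ahead);
    return NUM_INTEGER;
}
```
.fi
.LP
On the input 1e+a the sketch has read "e", "+" and "a" before it knows that
no exponent follows, so all three characters are pushed back and only the
integer 1 is consumed.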
.sp 2
.NH 2
Syntax Analysis
.LP
The second module of the compiler is the parser, which is the central part of
the compiler. It invokes the routines of the other modules. The tokens obtained
from the lexical analyzer are grouped into grammatical phrases. These phrases
are stored as parse trees and handed over to the next part. The parser is
generated using \fILLgen\fR [LL], a tool for generating an efficient recursive
descent parser without backtracking from an Extended Context Free Syntax.
.br
An error recovery mechanism is generated almost completely automatically. A
routine called \fILLmessage\fR had to be written, which gives the necessary
error messages and deals with the insertion and deletion of tokens.
The routine \fILLmessage\fR must accept one parameter, whose value is
a token number, zero or -1. A zero parameter indicates that the current token
(the one in the external variable \fILLsymb\fR) is deleted.
A -1 parameter indicates that the parser expected end of file, but did
not get it. The parser will then skip tokens until end of file is detected.
A parameter that is a token number (a positive parameter) indicates that
this token is to be inserted in front of the token currently in \fILLsymb\fR.
Care must also be taken that the token currently in \fILLsymb\fR is
returned again by the \fBnext\fR call to the lexical analyzer, with the proper
attributes. The lexical analyzer must therefore have a facility to push back
one token.
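.LP
The three cases of the parameter, and the one-token pushback they require,
can be sketched in C as follows. This is an illustration, not the routine
used in the compiler: the token numbers, the token stream \fIsrc\fR, the
pushback slot \fIpushed\fR and the routine \fIlex\fR are hypothetical, and
\fILLsymb\fR is defined here only to keep the sketch self-contained (in a
real parser it is supplied by the code LLgen generates).
.nf
```c
#include <assert.h>
#include <stdio.h>

enum { EOFILE = 1, IDENT, SEMICOLON };   /* hypothetical token numbers */

int LLsymb;                 /* current token; defined here for the sketch */

static const int *src;      /* hypothetical token stream, ends with EOFILE */
static int pushed = 0;      /* one-token pushback slot; 0 means empty */

/* Lexical analyzer: hand out the pushed-back token first, if any. */
int lex(void)
{
    int t;

    if (pushed != 0) {
        t = pushed;
        pushed = 0;
        return t;
    }
    t = *src;
    if (t != EOFILE)
        src++;
    return t;
}

void LLmessage(int tk)
{
    if (tk > 0) {
        /* Insert tk in front of the current token: the parser goes on
         * with tk, and the old token comes back on the next lex() call. */
        assert(pushed == 0);
        puts("token missing, inserted");
        pushed = LLsymb;
        LLsymb = tk;
    } else if (tk == 0) {
        /* The current token is deleted by the parser. */
        puts("token deleted");
    } else {
        /* tk == -1: the parser skips tokens until end of file. */
        puts("end of file expected");
    }
}
```
.fi
.LP
After an insertion, the saved token is the first one returned by \fIlex\fR,
so the parser sees the inserted token followed by the original input.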
.br
Calls to the two standard procedures \fIwrite\fR and \fIwriteln\fR can
differ from calls to other procedures, because the syntax of a write-parameter
is different from the syntax of an actual-parameter. We decided to include
them, together with \fIread\fR and \fIreadln\fR, in the grammar. An alternative
solution would be to make the syntax of an actual-parameter identical to the
syntax of a write-parameter. Each parameter would then have to be checked
afterwards to see whether it is used properly.
.bp
As the parser is LL(1), it must always be able to determine what to do,
based on the last token read (\fILLsymb\fR). Unfortunately, this was not the
case with the grammar as specified in [ISO]. Two kinds of problems
appeared, viz. the \fBalternation\fR and \fBrepetition\fR conflicts.
The examples given in the following paragraphs are taken from the grammar.
.NH 3
Alternation conflict
.LP
An alternation conflict arises when the parser cannot decide which
production to choose.
.br
\fBExample:\fR
.in +2m
.ft 5
.nf
procedure-declaration : procedure-heading \fB';'\f5 directive |
.br
\h'\w'procedure-declaration : 'u'procedure-identification \fB';'\f5 procedure-block |
.br
\h'\w'procedure-declaration : 'u'procedure-heading \fB';'\f5 procedure-block ;
.br
procedure-heading : \fBprocedure\f5 identifier [ formal-parameter-list ]? ;
.br
procedure-identification : \fBprocedure\f5 procedure-identifier ;
.fi
.ft R
.in -2m
A sentence that starts with the terminal \fBprocedure\fR can be derived from
any of the three alternative productions. This conflict can be resolved in two
ways. The first is to adjust the grammar: usually some rules are replaced by
one rule, and more work has to be done in the semantic analysis. The second is
to use the LLgen conflict resolver, "\fB%if\fR (C-expression)": if the
C-expression evaluates to non-zero, the production in question is chosen,
otherwise one of the remaining rules is chosen. The grammar rules were
rewritten to solve this conflict. The new rules are given below. For more
details see the file \fIdeclar.g\fR.
.in +2m
.ft 5
.nf
procedure-declaration : procedure-heading \fB';'\f5 ( directive | procedure-block ) ;
.br
procedure-heading : \fBprocedure\f5 identifier [ formal-parameter-list ]? ;
.fi
.ft R
.in -2m
A special case of an alternation conflict, which is common to many block
structured languages, is the \fI"dangling-else"\fR ambiguity.
.in +2m
.ft 5
.nf
if-statement : \fBif\f5 boolean-expression \fBthen\f5 statement [ else-part ]? ;
.br
else-part : \fBelse\f5 statement ;
.fi
.ft R
.in -2m
The following statement, which can be derived from the rules above, is ambiguous:
.ti +2m
\fBif\f5 boolean-expr-1 \fBthen\f5 \fBif\f5 boolean-expr-2 \fBthen\f5 statement-1 \fBelse\f5 statement-2
.ft R
.ps 8
.vs 7
.PS
move right 1.1i
S: line down 0.5i
"if-statement" at S.start above
.ft B
"then" at S.end below
.ft R
move to S.start then down 0.25i
L: line left 0.5i then down 0.25i
box ht 0.33i wid 0.6i "boolean" "expression-1"
move to L.start then left 0.5i
L: line left 0.5i then down 0.25i
.ft B
"if" at L.end below
.ft R
move to L.start then right 0.5i
L: line right 0.5i then down 0.25i
"statement" at L.end below
move to L.end then down 0.10i
L: line down 0.25i dashed
"if-statement" at L.end below
move to L.end then down 0.10i
L: line down 0.5i
.ft B
"then" at L.end below
.ft R
move to L.start then down 0.25i
L: line left 0.5i then down 0.25i
box ht 0.33i wid 0.6i "boolean" "expression-2"
move to L.start then left 0.5i
L: line left 0.5i then down 0.25i
.ft B
"if" at L.end below
.ft R
move to L.start then right 0.5i
L: line right 0.5i then down 0.25i
box ht 0.33i wid 0.6i "statement-1"
move to L.start then right 0.5i
L: line right 0.5i then down 0.25i
.ft B
"else" at L.end below
.ft R
move to L.start then right 0.5i
L: line right 0.5i then down 0.25i
box ht 0.33i wid 0.6i "statement-2"
move to S.start
move right 3.5i
L: line down 0.5i
"if-statement" at L.start above
.ft B
"then" at L.end below
.ft R
move to L.start then down 0.25i
L: line left 0.5i then down 0.25i
box ht 0.33i wid 0.6i "boolean" "expression-1"
move to L.start then left 0.5i
L: line left 0.5i then down 0.25i
.ft B
"if" at L.end below
.ft R
move to L.start then right 0.5i
S: line right 0.5i then down 0.25i
"statement" at S.end below
move to S.start then right 0.5i
L: line right 0.5i then down 0.25i
.ft B
"else" at L.end below
.ft R
move to L.start then right 0.5i
L: line right 0.5i then down 0.25i
box ht 0.33i wid 0.6i "statement-2"
move to S.end then down 0.10i
L: line down 0.25i dashed
"if-statement" at L.end below
move to L.end then down 0.10i
L: line down 0.5i
.ft B
"then" at L.end below
.ft R
move to L.start then down 0.25i
L: line left 0.5i then down 0.25i
box ht 0.33i wid 0.6i "boolean" "expression-2"
move to L.start then left 0.5i
L: line left 0.5i then down 0.25i
.ft B
"if" at L.end below
.ft R
move to L.start then right 0.5i
L: line right 0.5i then down 0.25i
box ht 0.33i wid 0.6i "statement-1"
.PE
.ps
.vs
\h'615u'(a)\h'1339u'(b)
.sp
.ce
Two parse trees showing the \fIdangling-else\fR ambiguity
.sp 2
According to the standard, \fBelse\fR is matched with the nearest preceding
unmatched \fBthen\fR, i.e. parse tree (a) is valid (\fBISO 6.8.3.4\fR).
This conflict is statically resolved in LLgen by using "\fB%prefer\fR",
which is equivalent in behaviour to "\fB%if\fR(1)".
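.LP
The effect of always preferring the else-part can be shown with a
hand-written recursive descent routine in C. This is a sketch, not code
generated by LLgen: the token names and the routines \fIexpect\fR,
\fIstatement\fR and \fIparse\fR are hypothetical, and a boolean-expression
and a simple statement are each reduced to a single token.
.nf
```c
#include <assert.h>

/* Hypothetical tokens; T_COND and T_STMT stand for whole phrases. */
enum tok { T_IF, T_COND, T_THEN, T_ELSE, T_STMT, T_END };

static const enum tok *cur;   /* current token, in the role of LLsymb */
static int depth;             /* nesting depth of the if being parsed */
static int else_depth;        /* depth of the if that last took an else */

static void expect(enum tok t)
{
    assert(*cur == t);
    cur++;
}

static void statement(void)
{
    if (*cur != T_IF) {
        expect(T_STMT);
        return;
    }
    depth++;
    expect(T_IF);
    expect(T_COND);
    expect(T_THEN);
    statement();
    /* Take the else-part whenever an else is the next token. The
     * innermost open if-statement reaches this test first, so it
     * claims the else. */
    if (*cur == T_ELSE) {
        expect(T_ELSE);
        else_depth = depth;
        statement();
    }
    depth--;
}

/* Parse one statement; return the depth of the if that took the else. */
int parse(const enum tok *tokens)
{
    cur = tokens;
    depth = 0;
    else_depth = 0;
    statement();
    assert(*cur == T_END);
    return else_depth;
}
```
.fi
.LP
For the ambiguous statement shown earlier, \fIparse\fR reports depth 2: the
else is attached to the inner if-statement, as in parse tree (a).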
.bp
.NH 3
Repetition conflict
.LP
A repetition conflict arises when the parser cannot decide whether or not to
choose a production once more.
.br
\fBExample:\fR
.in +2m
.ft 5
.nf
field-list : [ ( fixed-part [ \fB';'\f5 variant-part ]? | variant-part ) [ \fB';'\f5 ]? ]? ;
.br
fixed-part : record-section [ \fB';'\f5 record-section ]* ;
.fi
.in -2m
.ft R
When the parser sees the semicolon, it cannot decide whether another
record-section or a variant-part follows. This conflict can be resolved in
two ways: adjusting the grammar or using the conflict resolver,
"\fB%while\fR (C-expression)". The grammar rules that deal with this conflict
were completely rewritten. For more details, the reader is referred to the
file \fIdeclar.g\fR.
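.LP
The compiler avoids this conflict by rewriting the grammar, so the
following C fragment is only a sketch of the kind of test a "\fB%while\fR"
resolver could perform, written as a hand-coded loop. The token names and
the routine \fIfixed_part\fR are hypothetical, and a record-section is
reduced to a single identifier token.
.nf
```c
#include <assert.h>

/* Hypothetical tokens; R_IDENT stands for a whole record-section. */
enum rtok { R_IDENT, R_SEMI, R_CASE, R_END };

/* fixed-part : record-section [ ';' record-section ]*
 * Returns the number of record-sections parsed and leaves *pp at the
 * first token that does not belong to the fixed-part. The token
 * stream is assumed to end with R_END. */
int fixed_part(const enum rtok **pp)
{
    const enum rtok *p = *pp;
    int n = 0;

    assert(*p == R_IDENT);
    p++;
    n++;
    /* Repeat only while the ';' is followed by another record-section.
     * Looking at the token after the ';' plays the role of the
     * C-expression in a %while resolver. */
    while (p[0] == R_SEMI && p[1] == R_IDENT) {
        p += 2;
        n++;
    }
    *pp = p;
    return n;
}
```
.fi
.LP
When the semicolon is followed by \fBcase\fR, the loop stops in front of the
semicolon and leaves it for the rule that handles the variant-part.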
.sp 2
.NH 2
Semantic Analysis
.LP
The third module of the compiler checks the semantic conventions of
ISO-Pascal. To check the program being parsed, actions have been used in
LLgen. An action consists of several C-statements, enclosed in braces
"{" and "}". In order to facilitate communication between the actions and
\fILLparse\fR, the parsing routines can be given C-like parameters and
local variables. An important part of the semantic analyzer is the symbol
table. This table stores all information concerning identifiers and their
definitions. Symbol-table lookup and hashing are done by a generic namelist
module [ACK]. The parser turns each program construction into a parse tree,
which is the major data structure in the compiler. This parse tree is used
to exchange information between various routines.
.sp 2
.NH 2
Code Generation
.LP
The final module in the compiler is that of code generation. The information
stored in the parse trees is used to generate the EM code [EM]. EM code is
generated with the help of a procedural EM-code interface [ACK]. Static
code exchanges are avoided, since the fast back end cannot cope with
them; hence the EM pseudoinstruction \fBexc\fR is never
generated.
.br
Chapter 3 discusses the code generation in more detail.
.sp 2
.NH 2
Error Handling
.LP
The first three modules have in common that they can detect errors in the
Pascal program being compiled. If this is the case, a proper message is given
and some action is performed. If code generation has to be aborted, an error
message is given; otherwise a warning is given. The constant MAXERR_LINE,
defined in the file \fIerrout.h\fR, specifies the maximum number of messages
given per line. This can be used to avoid long lists of error messages caused
by, for example, the omission of a ';'. Three kinds of errors can be
distinguished: the lexical error, the syntactic error, and the semantic error.
Examples of these errors are, respectively, nested comments, an expression with
unbalanced parentheses, and the addition of two characters.
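.LP
A limit on the number of messages per line could work along the lines of
the following C sketch. Only the name MAXERR_LINE comes from the compiler;
its value here, the two static variables and the routine \fImay_report\fR
are assumptions made for the illustration.
.nf
```c
#include <assert.h>

#define MAXERR_LINE 5     /* name from errout.h; the value is assumed */

static unsigned last_line = 0;   /* line of the most recent message */
static int on_this_line = 0;     /* messages already given for it */

/* Return 1 if a message for this source line may still be given,
 * 0 if the limit for the line has been reached. */
int may_report(unsigned line)
{
    if (line != last_line) {
        last_line = line;
        on_this_line = 0;
    }
    if (on_this_line >= MAXERR_LINE)
        return 0;
    on_this_line++;
    assert(on_this_line <= MAXERR_LINE);
    return 1;
}
```
.fi
.LP
With this scheme a cascade of errors on one line is cut off after
MAXERR_LINE messages, while the first error on the next line is still
reported.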
.sp 2
.NH 2
Memory Allocation and Garbage Collection
.LP
The routines \fIst_alloc\fR and \fIst_free\fR provide a mechanism for
maintaining free lists of structures whose first field is a pointer called
\fBnext\fR. This field is used to chain free structures together. Each
structure type, say with tag ST, has a free list pointed to
by h_ST. Associated with this list are two operations: \fInew_ST()\fR, an
allocating mechanism which supplies the space for a new ST struct; and
\fIfree_ST()\fR, a garbage collecting mechanism which links the specified
structure into the free list.
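.LP
The free-list scheme can be sketched in C as follows. The names
\fInew_ST\fR, \fIfree_ST\fR and h_ST follow the text above, but the bodies
are an illustration written as plain functions on top of \fImalloc\fR, not
the \fIst_alloc\fR and \fIst_free\fR routines themselves. The field
\fIpayload\fR and the constant ST_CHUNK are hypothetical.
.nf
```c
#include <assert.h>
#include <stdlib.h>

struct ST {
    struct ST *next;      /* first field: chains free structures */
    int payload;          /* hypothetical contents */
};

#define ST_CHUNK 8        /* hypothetical: structures per malloc call */

static struct ST *h_ST = NULL;   /* head of the free list for struct ST */

/* Supply space for a new struct ST, refilling the free list in
 * chunks when it is empty. Returns NULL if malloc fails. */
struct ST *new_ST(void)
{
    struct ST *p;

    if (h_ST == NULL) {
        struct ST *chunk = malloc(ST_CHUNK * sizeof *chunk);
        int i;

        if (chunk == NULL)
            return NULL;
        for (i = 0; i < ST_CHUNK; i++) {
            chunk[i].next = h_ST;
            h_ST = &chunk[i];
        }
    }
    p = h_ST;
    h_ST = p->next;
    p->next = NULL;
    p->payload = 0;
    return p;
}

/* Link the structure back into the free list for later reuse. */
void free_ST(struct ST *p)
{
    assert(p != NULL);
    p->next = h_ST;
    h_ST = p;
}
```
.fi
.LP
A structure released with \fIfree_ST\fR becomes the head of the list, so it
is the first one handed out by the next call of \fInew_ST\fR.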
.bp