151 lines
5.1 KiB
Text
151 lines
5.1 KiB
Text
.NH
|
|
The Compiler
|
|
.PP
|
|
The compiler is written in \fBC\fP using LLgen and Lex and compiles
|
|
Occam programs to EM code, using the procedural interface as defined for EM.
|
|
In the following sub-sections we describe the LLgen parser generator and
|
|
the aspect of indentation.
|
|
.NH 2
|
|
The LLgen Parser Generator
|
|
.PP
|
|
LLgen accepts a Context Free syntax extended with the operators `\f(CW*\fP', `\f(CW?\fP' and `\f(CW+\fP'
|
|
that have effects similar to those in regular expressions.
|
|
The `\f(CW*\fP' is the closure set operator without an upperbound; `\f(CW+\fP' is the positive
|
|
closure operator without an upperbound; `\f(CW?\fP' is the optional operator;
|
|
`\f(CW[\fP' and `\f(CW]\fP' can be used for grouping.
|
|
For example, a comma-separated list of expressions can be described as:
|
|
.DS
|
|
.ft CW
|
|
expression_list:
|
|
expression [ ',' expression ]*
|
|
;
|
|
.ft
|
|
.DE
|
|
.LP
|
|
Alternatives must be separated by `\f(CW|\fP'.
|
|
C code (``actions'') can be inserted at all points between the colon and the
|
|
semicolon.
|
|
Variables global to the complete rule can be declared just in front of the
|
|
colon enclosed in the brackets `\f(CW{\fP' and `\f(CW}\fP'. All other declarations are local to
|
|
their actions.
|
|
Nonterminals can have parameters to pass information.
|
|
A more mature version of the above example would be:
|
|
.DS
|
|
.ft CW
|
|
expression_list(expr *e;) { expr e1, e2; } :
|
|
expression(&e1)
|
|
[ ',' expression(&e2)
|
|
{ e1=append(e1, e2); }
|
|
]*
|
|
{ *e=e1; }
|
|
;
|
|
.ft
|
|
.DE
|
|
As LLgen generates a recursive-descent parser with no backtrack, it must at all
|
|
times be able to determine what to do, based on the current input symbol.
|
|
Unfortunately, this cannot be done for all grammars. Two kinds of conflicts
|
|
are possible, viz. the \fBalternation\fP and \fBrepetition\fP conflict.
|
|
An alternation confict arises if two sides of an alternation can start with the
|
|
same symbol. E.g.
|
|
.DS
|
|
.ft CW
|
|
plus: '+' | '+' ;
|
|
.ft
|
|
.DE
|
|
The parser doesn't know which `\f(CW+\fP' to choose (neither do we).
|
|
Such a conflict can be resolved by putting an \fBif-condition\fP in front of
|
|
the first conflicting production. It consists of a \fB``%if''\fP followed by a
|
|
C-expression between parentheses.
|
|
If a conflict occurs (and only if it does) the C-expression is evaluated and
|
|
parsing continues along this path if non-zero. Example:
|
|
.DS
|
|
.ft CW
|
|
plus:
|
|
%if (some_plusses_are_more_equal_than_others())
|
|
'+'
|
|
|
|
|
'+'
|
|
;
|
|
.ft
|
|
.DE
|
|
A repetition conflict arises when the parser cannot decide whether
|
|
``\f(CWproductionrule\fP'' in e.g. ``\f(CW[ productionrule ]*\fP'' must be chosen
|
|
once more, or that it should continue.
|
|
This kind of conflicts can be resolved by putting a \fBwhile-condition\fP right
|
|
after the opening parentheses. It consists of a \fB``%while''\fP
|
|
followed by a C-expression between parentheses. As an example, we can look at
|
|
the \fBcomma-expression\fP in C. The comma may only be used for the
|
|
comma-expression if the total expression is not part of another comma-separated
|
|
list:
|
|
.DS
|
|
.nf
|
|
.ft CW
|
|
comma_expression:
|
|
sub_expression
|
|
[ %while (not_part_of_comma_separated_list())
|
|
',' sub_expression
|
|
]*
|
|
;
|
|
.ft
|
|
.fi
|
|
.DE
|
|
Again, the \fB``%while''\fP is only used in case of a conflict.
|
|
.LP
|
|
Error recovery is done almost completely automatically. All the LLgen-user has to do
|
|
is write a routine called \fILLmessage\fP to give the necessary error
|
|
messages and supply information about terminals found missing.
|
|
.NH 2
|
|
Indentation
|
|
.PP
|
|
The way conflicts can be resolved are of great use to Occam. The use of
|
|
indentation, to group statements, leads to many conflicts because the spaces
|
|
used for indentation are just token separators to the lexical analyzer, i.e.
|
|
``white space''. The lexical analyzer can be instructed to generate `BEGIN' and
|
|
`END' tokens at each indentation change, but that leads to great difficulties
|
|
as expressions may occupy several lines, thus leading to indentation changes
|
|
at the strangest moments. So we decided to resolve the conflicts by looking
|
|
at the indentation ourselves. The lexical analyzer puts the current indentation
|
|
level in the global variable \fIind\fP for use by the parser. The best example
|
|
is the \fBSEQ\fP construct, which exists in two flavors, one with a replicator
|
|
and one process:
|
|
.DS
|
|
.nf
|
|
.ft CW
|
|
seq i = [ 1 for str[byte 0] ]
|
|
out ! str[byte i]
|
|
.ft
|
|
.fi
|
|
.DE
|
|
and one without a replicator and several processes:
|
|
.DS
|
|
.nf
|
|
.ft CW
|
|
seq
|
|
in ? c
|
|
out ! c
|
|
.ft
|
|
.fi
|
|
.DE
|
|
The LLgen skeleton grammar to handle these two is:
|
|
.DS
|
|
.nf
|
|
.ft CW
|
|
SEQ { line=yylineno; oind=ind; }
|
|
[ %if (line==yylineno)
|
|
replicator
|
|
process
|
|
|
|
|
[ %while (ind>oind) process ]*
|
|
]
|
|
.ft
|
|
.fi
|
|
.DE
|
|
This shows clearly that, a replicator must be on the same line as the \fBSEQ\fP,
|
|
and new processes are collected as long as the indentation level of each process
|
|
is greater than the indentation level of \fBSEQ\fP (with appropriate checks on this
|
|
identation).
|
|
.PP
|
|
Different indentation styles are accepted, as long as the same amount of spaces
|
|
is used for each indentation shift. The ascii tab character sets the indentation
|
|
level to an eight space boundary. The first indentation level found in a file
|
|
is used to compare all other indentation levels to.
|