.NH
The Compiler
.PP
The compiler is written in \fBC\fP using LLgen and Lex and compiles
Occam programs to EM code, using the procedural interface as defined for EM.
In the following sub-sections we describe the LLgen parser generator and
the treatment of indentation.
.NH 2
The LLgen Parser Generator
.PP
LLgen accepts a context-free syntax extended with the operators `\f(CW*\fP', `\f(CW?\fP' and `\f(CW+\fP'
that have effects similar to those in regular expressions.
The `\f(CW*\fP' is the closure operator (zero or more repetitions, without an upper bound);
`\f(CW+\fP' is the positive closure operator (one or more repetitions, without an upper bound);
`\f(CW?\fP' marks an optional part;
`\f(CW[\fP' and `\f(CW]\fP' can be used for grouping.
For example, a comma-separated list of expressions can be described as:
.DS
.ft CW
expression_list:
        expression [ ',' expression ]*
        ;
.ft
.DE
.LP
Alternatives must be separated by `\f(CW|\fP'.
C code (``actions'') can be inserted at all points between the colon and the
semicolon.
Variables global to the complete rule can be declared just in front of the
colon, enclosed in the braces `\f(CW{\fP' and `\f(CW}\fP'. All other declarations are local to
their actions.
Nonterminals can have parameters to pass information.
A more complete version of the above example would be:
.DS
.ft CW
expression_list(expr *e;) { expr e1, e2; } :
        expression(&e1)
        [ ',' expression(&e2)
                { e1=append(e1, e2); }
        ]*
        { *e=e1; }
        ;
.ft
.DE
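.LP
To make the rule above more concrete, a parser for it behaves roughly like the
hand-written C routine below.
This is only a sketch and not the code LLgen emits: the input is a plain
character string, an expression is a single digit, and ``append'' is modelled
as addition.
The names \fIinput\fP and \fIparse_list\fP are assumptions made for the example.
.DS
.ft CW
```c
#include <assert.h>
#include <stddef.h>

static const char *input;   /* current position in the input (assumed) */

/* expression: a single digit; its value is returned through *e.
 * Returns 1 on success, 0 on a syntax error. */
static int expression(int *e)
{
    if (*input < '0' || *input > '9')
        return 0;
    *e = *input++ - '0';
    return 1;
}

/* expression_list: expression [ ',' expression ]* */
static int expression_list(int *e)
{
    int e1, e2;             /* the variables declared before the colon */

    if (!expression(&e1))
        return 0;
    while (*input == ',') { /* the current symbol decides: repeat or stop */
        input++;
        if (!expression(&e2))
            return 0;
        e1 = e1 + e2;       /* stands in for e1 = append(e1, e2) */
    }
    *e = e1;
    return 1;
}

/* Hypothetical driver: parse all of s, return the result or -1 on error. */
int parse_list(const char *s)
{
    int result;

    assert(s != NULL);
    input = s;
    if (!expression_list(&result) || *input != 0)
        return -1;
    return result;
}
```
.ft
.DE
.LP
With these assumptions, parsing the string \f(CW1,2,3\fP yields 6, and a
trailing comma is rejected.
.LP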
As LLgen generates a recursive-descent parser without backtracking, it must at all
times be able to determine what to do, based on the current input symbol.
Unfortunately, this cannot be done for all grammars. Two kinds of conflicts
are possible, viz. the \fBalternation\fP and the \fBrepetition\fP conflict.
An alternation conflict arises if two sides of an alternation can start with the
same symbol. For example:
.DS
.ft CW
plus: '+' | '+' ;
.ft
.DE
The parser doesn't know which `\f(CW+\fP' to choose (neither do we).
Such a conflict can be resolved by putting an \fBif-condition\fP in front of
the first conflicting production. It consists of a \fB``%if''\fP followed by a
C-expression between parentheses.
If a conflict occurs (and only if it does) the C-expression is evaluated and
parsing continues along this path if the result is non-zero. Example:
.DS
.ft CW
plus:
        %if (some_plusses_are_more_equal_than_others())
        '+'
|
        '+'
;
.ft
.DE
A repetition conflict arises when the parser cannot decide whether
``\f(CWproductionrule\fP'' in e.g. ``\f(CW[ productionrule ]*\fP'' must be chosen
once more, or whether parsing should continue after the repetition.
This kind of conflict can be resolved by putting a \fBwhile-condition\fP right
after the opening bracket. It consists of a \fB``%while''\fP
followed by a C-expression between parentheses. As an example, we can look at
the \fBcomma-expression\fP in C. The comma may only be used for the
comma-expression if the total expression is not part of another comma-separated
list:
.DS
.nf
.ft CW
comma_expression:
        sub_expression
        [ %while (not_part_of_comma_separated_list())
                ',' sub_expression
        ]*
        ;
.ft
.fi
.DE
Again, the \fB``%while''\fP is only used in case of a conflict.
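.LP
The effect of the while-condition can be imitated in a hand-written loop.
The sketch below is an illustration under stated assumptions, not generated
code: a sub-expression is a single lower-case letter, and the predicate is
modelled by a hypothetical nesting counter \fIlist_depth\fP.
.DS
.ft CW
```c
#include <assert.h>
#include <stddef.h>

static const char *in;      /* current position in the input (assumed) */
static int list_depth;      /* > 0 while inside a comma-separated list */

/* Hypothetical stand-in for the predicate used in the %while above. */
static int not_part_of_comma_separated_list(void)
{
    return list_depth == 0;
}

/* sub_expression: a single lower-case letter in this sketch. */
static int sub_expression(void)
{
    if (*in < 'a' || *in > 'z')
        return 0;
    in++;
    return 1;
}

/* comma_expression: sub_expression [ %while (...) ',' sub_expression ]*
 * Returns the number of sub-expressions consumed, or 0 on error. */
static int comma_expression(void)
{
    int n = 1;

    if (!sub_expression())
        return 0;
    /* A ',' may continue the repetition or follow it: ask the predicate. */
    while (*in == ',' && not_part_of_comma_separated_list()) {
        in++;
        if (!sub_expression())
            return 0;
        n++;
    }
    return n;
}

/* Hypothetical driver: parse s at the given nesting depth and report how
 * many sub-expressions the comma_expression consumed. */
int first_comma_expression(const char *s, int depth)
{
    assert(s != NULL);
    in = s;
    list_depth = depth;
    return comma_expression();
}
```
.ft
.DE
.LP
At depth 0 the input \f(CWa,b,c\fP is consumed as one comma-expression of three
parts; at depth 1 the loop stops at the first comma and leaves it to the
enclosing list.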
.LP
Error recovery is done almost completely automatically. All the LLgen user has to do
is write a routine called \fILLmessage\fP to give the necessary error
messages and supply information about terminals found missing.
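.LP
A minimal \fILLmessage\fP might look as follows.
The sketch assumes the convention described in the LLgen manual: a positive
argument is the number of a token that the parser inserts, zero means the
current token is deleted, and \-1 means end of file was expected.
The variable \fILLsymb\fP is declared here only as a stand-in for the one the
generated parser provides, and \fIdescribe_error\fP is a hypothetical helper.
.DS
.ft CW
```c
#include <assert.h>
#include <stdio.h>

int LLsymb;     /* stand-in: normally supplied by the generated parser */

/* Hypothetical helper: format a message for error code tk into buf. */
void describe_error(int tk, int current, char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    if (tk > 0)         /* token tk is inserted before the current one */
        snprintf(buf, size, "token %d missing before token %d", tk, current);
    else if (tk == 0)   /* the current token is deleted */
        snprintf(buf, size, "unexpected token %d deleted", current);
    else                /* end of file was expected */
        snprintf(buf, size, "end of file expected, found token %d", current);
}

/* Called by the parser during error recovery (assumed convention). */
void LLmessage(int tk)
{
    char buf[80];

    describe_error(tk, LLsymb, buf, sizeof buf);
    fputs(buf, stderr);
    fputc(10, stderr);  /* 10 is the ASCII line feed */
}
```
.ft
.DE
.LP
A real routine would print token names instead of numbers and, for an inserted
token, fill in whatever attributes the rest of the compiler expects.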
.NH 2
Indentation
.PP
These ways of resolving conflicts are of great use for Occam. The use of
indentation to group statements leads to many conflicts because the spaces
used for indentation are just token separators to the lexical analyzer, i.e.
``white space''. The lexical analyzer can be instructed to generate `BEGIN' and
`END' tokens at each indentation change, but that leads to great difficulties
as expressions may occupy several lines, thus leading to indentation changes
at the strangest moments. So we decided to resolve the conflicts by looking
at the indentation ourselves. The lexical analyzer puts the current indentation
level in the global variable \fIind\fP for use by the parser. The best example
is the \fBSEQ\fP construct, which exists in two flavors, one with a replicator
and a single process:
.DS
.nf
.ft CW
seq i = [ 1 for str[byte 0] ]
  out ! str[byte i]
.ft
.fi
.DE
and one without a replicator and several processes:
.DS
.nf
.ft CW
seq
  in ? c
  out ! c
.ft
.fi
.DE
The LLgen skeleton grammar to handle these two is:
.DS
.nf
.ft CW
SEQ             { line=yylineno; oind=ind; }
[       %if (line==yylineno)
        replicator
        process
|
        [ %while (ind>oind) process ]*
]
.ft
.fi
.DE
This shows clearly that a replicator must be on the same line as the \fBSEQ\fP,
and that new processes are collected as long as the indentation level of each process
is greater than the indentation level of \fBSEQ\fP (with appropriate checks on this
indentation).
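.LP
The decisions taken by this skeleton can be traced with a small C model.
It is a sketch only: the token record \fItoken\fP, with a line number and an
indentation level attached to every token, and the function
\fIseq_processes\fP are assumptions made for the example, and a process is
treated as a single token.
.DS
.ft CW
```c
#include <assert.h>
#include <stddef.h>

enum kind { SEQ, REPLICATOR, PROCESS, END };

/* Hypothetical token record: line number and indentation level of a token. */
struct token { enum kind kind; int line; int ind; };

/* Parse one SEQ starting at t[0]; return the number of processes that
 * belong to it, or -1 on a syntax error. */
int seq_processes(const struct token *t)
{
    int line, oind, n = 0;

    assert(t != NULL);
    if (t->kind != SEQ)
        return -1;
    line = t->line;                 /* { line=yylineno; oind=ind; } */
    oind = t->ind;
    t++;
    if (t->line == line) {          /* %if (line==yylineno) */
        if (t->kind != REPLICATOR)
            return -1;
        t++;
        return t->kind == PROCESS ? 1 : -1;
    }
    while (t->kind == PROCESS && t->ind > oind) {   /* %while (ind>oind) */
        t++;
        n++;
    }
    return n;
}
```
.ft
.DE
.LP
A process whose indentation is not greater than that of the \fBSEQ\fP ends the
loop and is left for the enclosing construct.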
.PP
Different indentation styles are accepted, as long as the same number of spaces
is used for each indentation shift. The ASCII tab character advances the indentation
to the next eight-space boundary. The first indentation found in a file
is the unit against which all other indentation levels are compared.
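.LP
These rules can be sketched in C as follows.
The code illustrates the description above and is not the lexical analyzer
itself; the names \fIindent_column\fP, \fIindent_level\fP and \fIunit\fP are
hypothetical, and returning \-1 for an indentation that does not fit the unit
is an assumption.
.DS
.ft CW
```c
#include <assert.h>
#include <stddef.h>

#define TAB 9           /* ASCII horizontal tab */

static int unit;        /* width of the first indentation found; 0 = none yet */

/* Column reached after the leading white space of line s. */
int indent_column(const char *s)
{
    int col = 0;

    assert(s != NULL);
    for (; *s == ' ' || *s == TAB; s++) {
        if (*s == TAB)
            col = (col / 8 + 1) * 8;    /* next eight-space boundary */
        else
            col++;
    }
    return col;
}

/* Indentation level of line s, or -1 if its column is not a multiple of
 * the unit established by the first indented line. */
int indent_level(const char *s)
{
    int col = indent_column(s);

    if (col == 0)
        return 0;
    if (unit == 0)
        unit = col;     /* the first indentation seen fixes the unit */
    return col % unit == 0 ? col / unit : -1;
}
```
.ft
.DE
.LP
If the first indented line starts with two spaces, a later line starting with
four spaces is at level 2 and a line starting with a tab is at level 4.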