.NH The Compiler .PP The compiler is written in \fBC\fP using LLgen and Lex and compiles Occam programs to EM code, using the procedural interface as defined for EM. In the following sub-sections we describe the LLgen parser generator and the aspect of indentation. .NH 2 The LLgen Parser Generator .PP LLgen accepts a Context Free syntax extended with the operators `\f5*\fP', `\f5?\fP' and `\f5+\fP' that have effects similar to those in regular expressions. The `\f5*\fP' is the closure set operator without an upperbound; `\f5+\fP' is the positive closure operator without an upperbound; `\f5?\fP' is the optional operator; `\f5[\fP' and `\f5]\fP' can be used for grouping. For example, a comma-separated list of expressions can be described as: .DS .ft 5 expression_list: expression [ ',' expression ]* ; .ft .DE .LP Alternatives must be separated by `\f5|\fP'. C code (``actions'') can be inserted at all points between the colon and the semicolon. Variables global to the complete rule can be declared just in front of the colon enclosed in the brackets `\f5{\fP' and `\f5}\fP'. All other declarations are local to their actions. Nonterminals can have parameters to pass information. A more mature version of the above example would be: .DS .ft 5 expression_list(expr *e;) { expr e1, e2; } : expression(&e1) [ ',' expression(&e2) { e1=append(e1, e2); } ]* { *e=e1; } ; .ft .DE As LLgen generates a recursive-descent parser with no backtrack, it must at all times be able to determine what to do, based on the current input symbol. Unfortunately, this cannot be done for all grammars. Two kinds of conflicts are possible, viz. the \fBalternation\fP and \fBrepetition\fP conflict. An alternation confict arises if two sides of an alternation can start with the same symbol. E.g. .DS .ft 5 plus: '+' | '+' ; .ft .DE The parser doesn't know which `\f5+\fP' to choose (neither do we). Such a conflict can be resolved by putting an \fBif-condition\fP in front of the first conflicting production. It consists of a \fB``%if''\fP followed by a C-expression between parentheses. If a conflict occurs (and only if it does) the C-expression is evaluated and parsing continues along this path if non-zero. Example: .DS .ft 5 plus: %if (some_plusses_are_more_equal_than_others()) '+' | '+' ; .ft .DE A repetition conflict arises when the parser cannot decide whether ``\f5productionrule\fP'' in e.g. ``\f5[ productionrule ]*\fP'' must be chosen once more, or that it should continue. This kind of conflicts can be resolved by putting a \fBwhile-condition\fP right after the opening parentheses. It consists of a \fB``%while''\fP followed by a C-expression between parentheses. As an example, we can look at the \fBcomma-expression\fP in C. The comma may only be used for the comma-expression if the total expression is not part of another comma-separated list: .DS .nf .ft 5 comma_expression: sub_expression [ %while (not_part_of_comma_separated_list()) ',' sub_expression ]* ; .ft .fi .DE Again, the \fB``%while''\fP is only used in case of a conflict. .LP Error recovery is done almost completely automatically. All you have to do is to write a routine called \fILLmessage\fP to give the necessary error messages and supply information about terminals found missing. .NH 2 Indentation .PP The way conflicts can be resolved are of great use to Occam. The use of indentation, to group statements, leads to many conflicts because the spaces used for indentation are just token separators to the lexical analyzer, i.e. ``white space''. The lexical analyzer can be instructed to generate `BEGIN' and `END' tokens at each indentation change, but that leads to great difficulties as expressions may occupy several lines, thus leading to indentation changes at the strangest moments. So we decided to resolve the conflicts by looking at the indentation ourselves. The lexical analyzer puts the current indentation level in the global variable \fIind\fP for use by the parser. The best example is the \fBSEQ\fP construct, which exists in two flavors, one with a replicator and one process: .DS .nf .ft 5 seq i = [ 1 for str[byte 0] ] out ! str[byte i] .ft .fi .DE and one without a replicator and several processes: .DS .nf .ft 5 seq in ? c out ! c .ft .fi .DE The LLgen skeleton grammar to handle these two is: .DS .nf .ft 5 SEQ { line=yylineno; oind=ind; } [ %if (line==yylineno) replicator process | [ %while (ind>oind) process ]* ] .ft .fi .DE This shows clearly that, a replicator must be on the same line as the \fBSEQ\fP, and new processes are collected as long as the indentation level of each process is greater than the indentation level of \fBSEQ\fP (with appropriate checks on this identation). .PP Different indentation styles are accepted, as long as the same amount of spaces is used for each indentation shift. The ascii tab character sets the indentation level to an eight space boundary. The first indentation level found in a file is used to compare all other indentation levels to.