ack/doc/occam/p2

.NH
The Compiler
.PP
The compiler is written in \fBC\fP using LLgen and Lex and compiles
Occam programs to EM code, using the procedural interface as defined for EM.
In the following sub-sections we describe the LLgen parser generator and
the aspect of indentation.
.NH 2
The LLgen Parser Generator
.PP
LLgen accepts a Context Free syntax extended with the operators `\f(CW*\fP', `\f(CW?\fP' and `\f(CW+\fP'
that have effects similar to those in regular expressions.
The `\f(CW*\fP' is the closure set operator without an upperbound; `\f(CW+\fP' is the positive
closure operator without an upperbound; `\f(CW?\fP' is the optional operator;
`\f(CW[\fP' and `\f(CW]\fP' can be used for grouping.
For example, a comma-separated list of expressions can be described as:
.DS
.ft CW
	expression_list:
		  expression [ ',' expression ]*
		;
.ft
.DE
.LP
Alternatives must be separated by `\f(CW|\fP'.
C code (``actions'') can be inserted at all points between the colon and the
semicolon.
Variables global to the complete rule can be declared just in front of the
colon enclosed in the brackets `\f(CW{\fP' and `\f(CW}\fP'.  All other declarations are local to
their actions.
Nonterminals can have parameters to pass information.
A more mature version of the above example would be:
.DS
.ft CW
       expression_list(expr *e;)       {	expr e1, e2;	} :
                expression(&e1)
                [ ',' expression(&e2)
                                       {	e1=append(e1, e2);	}
                ]*
                                       {	*e=e1;	}
              ;
.ft
.DE
As LLgen generates a recursive-descent parser with no backtrack, it must at all
times be able to determine what to do, based on the current input symbol. 
Unfortunately, this cannot be done for all grammars. Two kinds of conflicts
are possible, viz. the \fBalternation\fP and \fBrepetition\fP conflict.
An alternation confict arises if two sides of an alternation can start with the
same symbol. E.g.
.DS
.ft CW
	plus:	'+' | '+' ;
.ft
.DE
The parser doesn't know which `\f(CW+\fP' to choose (neither do we).
Such a conflict can be resolved by putting an \fBif-condition\fP in front of
the first conflicting production. It consists of a \fB``%if''\fP followed by a
C-expression between parentheses.
If a conflict occurs (and only if it does) the C-expression is evaluated and
parsing continues along this path if non-zero. Example:
.DS
.ft CW
	plus:
		  %if (some_plusses_are_more_equal_than_others())
		  '+'
		|
		  '+'
		;
.ft
.DE
A repetition conflict arises when the parser cannot decide whether
``\f(CWproductionrule\fP'' in e.g. ``\f(CW[ productionrule ]*\fP'' must be chosen
once more, or that it should continue.
This kind of conflicts can be resolved by putting a \fBwhile-condition\fP right
after the opening parentheses.  It consists of a \fB``%while''\fP
followed by a C-expression between parentheses. As an example, we can look at
the \fBcomma-expression\fP in C.  The comma may only be used for the
comma-expression if the total expression is not part of another comma-separated
list:
.DS
.nf
.ft CW
	comma_expression:
		  sub_expression
		  [ %while (not_part_of_comma_separated_list())
			  ',' sub_expression
		  ]*
		;
.ft
.fi
.DE
Again, the \fB``%while''\fP is only used in case of a conflict.
.LP
Error recovery is done almost completely automatically. All the LLgen-user has to do
is write a routine called \fILLmessage\fP to give the necessary error
messages and supply information about terminals found missing.
.NH 2	
Indentation
.PP
The way conflicts can be resolved are of great use to Occam. The use of
indentation, to group statements, leads to many conflicts because the spaces
used for indentation are just token separators to the lexical analyzer, i.e.
``white space''. The lexical analyzer can be instructed to generate `BEGIN' and
`END' tokens at each indentation change, but that leads to great difficulties
as expressions may occupy several lines, thus leading to indentation changes
at the strangest moments. So we decided to resolve the conflicts by looking
at the indentation ourselves. The lexical analyzer puts the current indentation
level in the global variable \fIind\fP for use by the parser. The best example
is the \fBSEQ\fP construct, which exists in two flavors, one with a replicator
and one process:
.DS
.nf
.ft CW
	seq i = [ 1 for str[byte 0] ]
		out ! str[byte i]
.ft
.fi
.DE
and one without a replicator and several processes:
.DS
.nf
.ft CW
	seq
		in ? c
		out ! c
.ft
.fi
.DE
The LLgen skeleton grammar to handle these two is:
.DS
.nf
.ft CW
	SEQ			{	line=yylineno; oind=ind; }
	[	  %if (line==yylineno)
		  replicator
		  process
		|
		  [ %while (ind>oind) process ]*
	]
.ft
.fi
.DE
This shows clearly that, a replicator must be on the same line as the \fBSEQ\fP,
and new processes are collected as long as the indentation level of each process
is greater than the indentation level of \fBSEQ\fP (with appropriate checks on this
identation).
.PP
Different indentation styles are accepted, as long as the same amount of spaces
is used for each indentation shift. The ascii tab character sets the indentation
level to an eight space boundary. The first indentation level found in a file
is used to compare all other indentation levels to.
Initial revision 1987-02-26 10:26:19 +00:00			`.NH`
			`The Compiler`
			`.PP`
			`The compiler is written in \fBC\fP using LLgen and Lex and compiles`
			`Occam programs to EM code, using the procedural interface as defined for EM.`
			`In the following sub-sections we describe the LLgen parser generator and`
			`the aspect of indentation.`
			`.NH 2`
			`The LLgen Parser Generator`
			`.PP`
changed font 5 references to font CW references 1988-04-18 13:34:29 +00:00			LLgen accepts a Context Free syntax extended with the operators `\f(CW*\fP', `\f(CW?\fP' and `\f(CW+\fP'
Initial revision 1987-02-26 10:26:19 +00:00			`that have effects similar to those in regular expressions.`
changed font 5 references to font CW references 1988-04-18 13:34:29 +00:00			The `\f(CW*\fP' is the closure set operator without an upperbound; `\f(CW+\fP' is the positive
			closure operator without an upperbound; `\f(CW?\fP' is the optional operator;
			`\f(CW[\fP' and `\f(CW]\fP' can be used for grouping.
Initial revision 1987-02-26 10:26:19 +00:00			`For example, a comma-separated list of expressions can be described as:`
			`.DS`
changed font 5 references to font CW references 1988-04-18 13:34:29 +00:00			`.ft CW`
Initial revision 1987-02-26 10:26:19 +00:00			`expression_list:`
			`expression [ ',' expression ]*`
			`;`
			`.ft`
			`.DE`
			`.LP`
changed font 5 references to font CW references 1988-04-18 13:34:29 +00:00			Alternatives must be separated by `\f(CW\|\fP'.
Initial revision 1987-02-26 10:26:19 +00:00			C code (``actions'') can be inserted at all points between the colon and the
			`semicolon.`
			`Variables global to the complete rule can be declared just in front of the`
changed font 5 references to font CW references 1988-04-18 13:34:29 +00:00			colon enclosed in the brackets `\f(CW{\fP' and `\f(CW}\fP'. All other declarations are local to
Initial revision 1987-02-26 10:26:19 +00:00			`their actions.`
			`Nonterminals can have parameters to pass information.`
			`A more mature version of the above example would be:`
			`.DS`
changed font 5 references to font CW references 1988-04-18 13:34:29 +00:00			`.ft CW`
Initial revision 1987-02-26 10:26:19 +00:00			`expression_list(expr *e;) { expr e1, e2; } :`
			`expression(&e1)`
			`[ ',' expression(&e2)`
			`{ e1=append(e1, e2); }`
			`]*`
			`{ *e=e1; }`
			`;`
			`.ft`
			`.DE`
			`As LLgen generates a recursive-descent parser with no backtrack, it must at all`
			`times be able to determine what to do, based on the current input symbol.`
			`Unfortunately, this cannot be done for all grammars. Two kinds of conflicts`
			`are possible, viz. the \fBalternation\fP and \fBrepetition\fP conflict.`
			`An alternation confict arises if two sides of an alternation can start with the`
			`same symbol. E.g.`
			`.DS`
changed font 5 references to font CW references 1988-04-18 13:34:29 +00:00			`.ft CW`
Initial revision 1987-02-26 10:26:19 +00:00			`plus: '+' \| '+' ;`
			`.ft`
			`.DE`
changed font 5 references to font CW references 1988-04-18 13:34:29 +00:00			The parser doesn't know which `\f(CW+\fP' to choose (neither do we).
Initial revision 1987-02-26 10:26:19 +00:00			`Such a conflict can be resolved by putting an \fBif-condition\fP in front of`
			the first conflicting production. It consists of a \fB``%if''\fP followed by a
			`C-expression between parentheses.`
			`If a conflict occurs (and only if it does) the C-expression is evaluated and`
			`parsing continues along this path if non-zero. Example:`
			`.DS`
changed font 5 references to font CW references 1988-04-18 13:34:29 +00:00			`.ft CW`
Initial revision 1987-02-26 10:26:19 +00:00			`plus:`
			`%if (some_plusses_are_more_equal_than_others())`
			`'+'`
			`\|`
			`'+'`
			`;`
			`.ft`
			`.DE`
			`A repetition conflict arises when the parser cannot decide whether`
changed font 5 references to font CW references 1988-04-18 13:34:29 +00:00			``\f(CWproductionrule\fP'' in e.g. ``\f(CW[ productionrule ]*\fP'' must be chosen
Initial revision 1987-02-26 10:26:19 +00:00			`once more, or that it should continue.`
			`This kind of conflicts can be resolved by putting a \fBwhile-condition\fP right`
			after the opening parentheses. It consists of a \fB``%while''\fP
			`followed by a C-expression between parentheses. As an example, we can look at`
			`the \fBcomma-expression\fP in C. The comma may only be used for the`
			`comma-expression if the total expression is not part of another comma-separated`
			`list:`
			`.DS`
			`.nf`
changed font 5 references to font CW references 1988-04-18 13:34:29 +00:00			`.ft CW`
Initial revision 1987-02-26 10:26:19 +00:00			`comma_expression:`
			`sub_expression`
			`[ %while (not_part_of_comma_separated_list())`
			`',' sub_expression`
			`]*`
			`;`
			`.ft`
			`.fi`
			`.DE`
			Again, the \fB``%while''\fP is only used in case of a conflict.
			`.LP`
Avoid informal usage of 'you' 1991-11-19 13:37:20 +00:00			`Error recovery is done almost completely automatically. All the LLgen-user has to do`
			`is write a routine called \fILLmessage\fP to give the necessary error`
Initial revision 1987-02-26 10:26:19 +00:00			`messages and supply information about terminals found missing.`
			`.NH 2`
			`Indentation`
			`.PP`
			`The way conflicts can be resolved are of great use to Occam. The use of`
			`indentation, to group statements, leads to many conflicts because the spaces`
			`used for indentation are just token separators to the lexical analyzer, i.e.`
			``white space''. The lexical analyzer can be instructed to generate `BEGIN' and
			`END' tokens at each indentation change, but that leads to great difficulties
			`as expressions may occupy several lines, thus leading to indentation changes`
			`at the strangest moments. So we decided to resolve the conflicts by looking`
			`at the indentation ourselves. The lexical analyzer puts the current indentation`
			`level in the global variable \fIind\fP for use by the parser. The best example`
			`is the \fBSEQ\fP construct, which exists in two flavors, one with a replicator`
			`and one process:`
			`.DS`
			`.nf`
changed font 5 references to font CW references 1988-04-18 13:34:29 +00:00			`.ft CW`
Initial revision 1987-02-26 10:26:19 +00:00			`seq i = [ 1 for str[byte 0] ]`
			`out ! str[byte i]`
			`.ft`
			`.fi`
			`.DE`
			`and one without a replicator and several processes:`
			`.DS`
			`.nf`
changed font 5 references to font CW references 1988-04-18 13:34:29 +00:00			`.ft CW`
Initial revision 1987-02-26 10:26:19 +00:00			`seq`
			`in ? c`
			`out ! c`
			`.ft`
			`.fi`
			`.DE`
			`The LLgen skeleton grammar to handle these two is:`
			`.DS`
			`.nf`
changed font 5 references to font CW references 1988-04-18 13:34:29 +00:00			`.ft CW`
Initial revision 1987-02-26 10:26:19 +00:00			`SEQ { line=yylineno; oind=ind; }`
			`[ %if (line==yylineno)`
			`replicator`
			`process`
			`\|`
			`[ %while (ind>oind) process ]*`
			`]`
			`.ft`
			`.fi`
			`.DE`
			`This shows clearly that, a replicator must be on the same line as the \fBSEQ\fP,`
			`and new processes are collected as long as the indentation level of each process`
			`is greater than the indentation level of \fBSEQ\fP (with appropriate checks on this`
			`identation).`
			`.PP`
			`Different indentation styles are accepted, as long as the same amount of spaces`
			`is used for each indentation shift. The ascii tab character sets the indentation`
			`level to an eight space boundary. The first indentation level found in a file`
			`is used to compare all other indentation levels to.`