.NH
The Compiler
.PP
The compiler is written in \fBC\fP using LLgen and Lex and compiles
Occam programs to EM code, using the procedural interface as defined for EM.
In the following subsections we describe the LLgen parser generator and
the handling of indentation.
.NH 2
The LLgen Parser Generator
.PP
LLgen accepts a context-free syntax extended with the operators `\f5*\fP', `\f5?\fP' and `\f5+\fP'
that have effects similar to those in regular expressions.
The `\f5*\fP' is the closure operator (zero or more repetitions, no upper bound); `\f5+\fP' is the positive
closure operator (one or more repetitions, no upper bound); `\f5?\fP' is the optional operator;
`\f5[\fP' and `\f5]\fP' can be used for grouping.
For example, a comma-separated list of expressions can be described as:
.DS
.ft 5
	expression_list:
		  expression [ ',' expression ]*
		;
.ft
.DE
.LP
Alternatives must be separated by `\f5|\fP'.
C code (``actions'') can be inserted at all points between the colon and the
semicolon.
Variables global to the complete rule can be declared just in front of the
colon, enclosed in the braces `\f5{\fP' and `\f5}\fP'.  All other declarations are local to
their actions.
Nonterminals can have parameters to pass information.
A more complete version of the above example would be:
.DS
.ft 5
       expression_list(expr *e;)       {	expr e1, e2;	} :
                expression(&e1)
                [ ',' expression(&e2)
                                       {	e1=append(e1, e2);	}
                ]*
                                       {	*e=e1;	}
              ;
.ft
.DE
As LLgen generates a recursive-descent parser without backtracking, it must at all
times be able to determine what to do, based on the current input symbol.
Unfortunately, this cannot be done for all grammars. Two kinds of conflicts
are possible, viz. the \fBalternation\fP and the \fBrepetition\fP conflict.
An alternation conflict arises if two sides of an alternation can start with the
same symbol. E.g.
.DS
.ft 5
	plus:	'+' | '+' ;
.ft
.DE
The parser doesn't know which `\f5+\fP' to choose (neither do we).
Such a conflict can be resolved by putting an \fBif-condition\fP in front of
the first conflicting production. It consists of a \fB``%if''\fP followed by a
C-expression between parentheses.
If a conflict occurs (and only if it does) the C-expression is evaluated, and
if its value is non-zero parsing continues along the production it guards. Example:
.DS
.ft 5
	plus:
		  %if (some_plusses_are_more_equal_than_others())
		  '+'
		|
		  '+'
		;
.ft
.DE
A repetition conflict arises when the parser cannot decide whether
``\f5productionrule\fP'' in e.g. ``\f5[ productionrule ]*\fP'' must be chosen
once more, or whether it should continue past the repetition.
This kind of conflict can be resolved by putting a \fBwhile-condition\fP right
after the opening bracket.  It consists of a \fB``%while''\fP
followed by a C-expression between parentheses. As an example, we can look at
the \fBcomma-expression\fP in C.  The comma may only be used for the
comma-expression if the total expression is not part of another comma-separated
list:
.DS
.nf
.ft 5
	comma_expression:
		  sub_expression
		  [ %while (not_part_of_comma_separated_list())
			  ',' sub_expression
		  ]*
		;
.ft
.fi
.DE
Again, the \fB``%while''\fP is only used in case of a conflict.
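.LP
To make the role of the current input symbol concrete, the following is a
hand-written recursive-descent routine for the comma-separated list rule shown earlier.
It is a minimal sketch and not LLgen output: the one-character token stream,
the single-digit \f5expression\fP and all function names are assumptions made for
illustration only.

```c
#include <ctype.h>

/* Hypothetical one-character token stream; `look' plays the role of
 * the current input symbol on which every parsing decision is based. */
static const char *input;
static int look;

/* Load the next symbol; at end of input `look' stays '\0'. */
static void next_token(void)
{
    look = (unsigned char)*input;
    if (look != '\0')
        input++;
}

/* expression: for this sketch just a single digit.
 * Returns 1 on success, 0 on a syntax error. */
static int expression(void)
{
    if (!isdigit(look))
        return 0;
    next_token();
    return 1;
}

/* expression_list: expression [ ',' expression ]*
 * Returns the number of expressions parsed, or -1 on a syntax error. */
static int expression_list(const char *text)
{
    int count = 0;

    input = text;
    next_token();
    if (!expression())
        return -1;
    count++;
    /* The repetition is entered only when the current symbol is ','.
     * No backtracking is needed because nothing else can follow here. */
    while (look == ',') {
        next_token();
        if (!expression())
            return -1;
        count++;
    }
    return look == '\0' ? count : -1;
}
```

The loop test is the whole decision procedure. A repetition conflict is the
situation in which such a test on the current symbol alone cannot tell the two
cases apart, which is where a while-condition supplies the extra information.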
.LP
Error recovery is done almost completely automatically. All you have to do
is write a routine called \fILLmessage\fP that gives the necessary error
messages and supplies information about terminals found missing.
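.LP
A minimal sketch of such a routine is given below. It assumes the convention
that \fILLmessage\fP receives a token number that is positive when that token was
missing and has been inserted, zero when the current token is deleted, and
negative when end of file was expected; this should be checked against the LLgen
manual. The variable \f5LLsymb\fP is declared here as a stand-in for the parser's
current token, the helper \f5format_recovery\fP is hypothetical, and tokens are
assumed to be plain characters.

```c
#include <stdio.h>

/* Stand-in for the parser's current-token variable (assumed name;
 * in a real parser it would be supplied by the generated code). */
static int LLsymb;

/* Hypothetical helper: describe one recovery step in `buf'.
 * Tokens are assumed to be single printable characters. */
static void format_recovery(int tk, int current, char *buf, size_t size)
{
    if (tk > 0)         /* assumed: token `tk' was missing and is inserted */
        snprintf(buf, size, "'%c' missing, inserted", tk);
    else if (tk == 0)   /* assumed: the current token is deleted */
        snprintf(buf, size, "unexpected '%c' deleted", current);
    else                /* assumed: end of file was expected */
        snprintf(buf, size, "end of file expected, found '%c'", current);
}

/* The routine the parser calls during error recovery. */
void LLmessage(int tk)
{
    char buf[64];

    format_recovery(tk, LLsymb, buf, sizeof buf);
    fprintf(stderr, "syntax error: %s\n", buf);
}
```

Keeping the formatting in a separate function makes the messages easy to check
without running a parser.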
.NH 2
Indentation
.PP
The way conflicts can be resolved is of great use for Occam. The use of
indentation to group statements leads to many conflicts, because the spaces
used for indentation are just token separators to the lexical analyzer, i.e.
``white space''. The lexical analyzer can be instructed to generate `BEGIN' and
`END' tokens at each indentation change, but that leads to great difficulties,
as expressions may occupy several lines, thus leading to indentation changes
at the strangest moments. So we decided to resolve the conflicts by looking
at the indentation ourselves. The lexical analyzer puts the current indentation
level in the global variable \fIind\fP for use by the parser. The best example
is the \fBSEQ\fP construct, which exists in two flavors, one with a replicator
and one process:
.DS
.nf
.ft 5
	seq i = [ 1 for str[byte 0] ]
		out ! str[byte i]
.ft
.fi
.DE
and one without a replicator and several processes:
.DS
.nf
.ft 5
	seq
		in ? c
		out ! c
.ft
.fi
.DE
The LLgen skeleton grammar to handle these two is:
.DS
.nf
.ft 5
	SEQ			{	line=yylineno; oind=ind; }
	[	  %if (line==yylineno)
		  replicator
		  process
		|
		  [ %while (ind>oind) process ]*
	]
.ft
.fi
.DE
This shows clearly that a replicator must be on the same line as the \fBSEQ\fP,
and that new processes are collected as long as the indentation level of each process
is greater than the indentation level of \fBSEQ\fP (with appropriate checks on this
indentation).
.PP
Different indentation styles are accepted, as long as the same number of spaces
is used for each indentation shift. The ASCII tab character advances the indentation
to the next eight-space boundary. The first indentation level found in a file
serves as the unit against which all other indentation levels are compared.
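.LP
The two rules above can be sketched in C as follows. This is an illustration
under stated assumptions and not the compiler's actual lexical analyzer: the names
\f5indent_column\fP, \f5indent_level\fP and \f5indent_unit\fP are hypothetical, and
returning -1 for an indentation that is not a multiple of the first shift is one
possible way to report the error.

```c
/* Width of the indentation in columns: a space advances one column,
 * a tab advances to the next eight-column boundary. */
static int indent_column(const char *line)
{
    int col = 0;

    for (; *line == ' ' || *line == '\t'; line++) {
        if (*line == '\t')
            col = (col / 8 + 1) * 8;
        else
            col++;
    }
    return col;
}

/* Width of the first indentation shift seen in the file; 0 = none yet. */
static int indent_unit = 0;

/* Convert a column to an indentation level, using the first non-zero
 * indentation as the unit.  Returns -1 (an assumed error convention)
 * when the column is not a multiple of that unit. */
static int indent_level(int col)
{
    if (col == 0)
        return 0;
    if (indent_unit == 0)
        indent_unit = col;      /* first shift found fixes the unit */
    if (col % indent_unit != 0)
        return -1;
    return col / indent_unit;
}
```

With this scheme a file indented by two spaces and a file indented by one tab
both yield levels 1, 2, 3 and so on, which is what allows the parser to compare
\fIind\fP values without caring about the style used.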