| .NH
 | |
| The Compiler
 | |
| .PP
 | |
| The compiler is written in \fBC\fP using LLgen and Lex and compiles
 | |
| Occam programs to EM code, using the procedural interface as defined for EM.
 | |
| In the following subsections we describe the LLgen parser generator and
 | |
| the handling of indentation.
 | |
| .NH 2
 | |
| The LLgen Parser Generator
 | |
| .PP
 | |
| LLgen accepts a context-free syntax extended with the operators `\f(CW*\fP', `\f(CW?\fP' and `\f(CW+\fP'
 | |
| that have effects similar to those in regular expressions.
 | |
| The `\f(CW*\fP' is the closure operator (zero or more repetitions, without an upper bound); `\f(CW+\fP' is the positive
 | |
| closure operator (one or more repetitions, without an upper bound); `\f(CW?\fP' is the optional operator;
 | |
| `\f(CW[\fP' and `\f(CW]\fP' can be used for grouping.
 | |
| For example, a comma-separated list of expressions can be described as:
 | |
| .DS
 | |
| .ft CW
 | |
| 	expression_list:
 | |
| 		  expression [ ',' expression ]*
 | |
| 		;
 | |
| .ft
 | |
| .DE
 | |
| .LP
 | |
| Alternatives must be separated by `\f(CW|\fP'.
 | |
| C code (``actions'') can be inserted at all points between the colon and the
 | |
| semicolon.
 | |
| Variables global to the complete rule can be declared just in front of the
 | |
| colon, enclosed in the braces `\f(CW{\fP' and `\f(CW}\fP'.  All other declarations are local to
 | |
| their actions.
 | |
| Nonterminals can have parameters to pass information.
 | |
| A more complete version of the above example would be:
 | |
| .DS
 | |
| .ft CW
 | |
|        expression_list(expr *e;)       {	expr e1, e2;	} :
 | |
|                 expression(&e1)
 | |
|                 [ ',' expression(&e2)
 | |
|                                        {	e1=append(e1, e2);	}
 | |
|                 ]*
 | |
|                                        {	*e=e1;	}
 | |
|               ;
 | |
| .ft
 | |
| .DE
 | |
| As LLgen generates a recursive-descent parser without backtracking, it must at all
 | |
| times be able to determine what to do, based on the current input symbol. 
 | |
| Unfortunately, this cannot be done for all grammars. Two kinds of conflicts
 | |
| are possible, viz. the \fBalternation\fP and \fBrepetition\fP conflicts.
 | |
| An alternation conflict arises if two sides of an alternation can start with the
 | |
| same symbol. For example:
 | |
| .DS
 | |
| .ft CW
 | |
| 	plus:	'+' | '+' ;
 | |
| .ft
 | |
| .DE
 | |
| The parser doesn't know which `\f(CW+\fP' to choose (neither do we).
 | |
| Such a conflict can be resolved by putting an \fBif-condition\fP in front of
 | |
| the first conflicting production. It consists of a \fB``%if''\fP followed by a
 | |
| C-expression between parentheses.
 | |
| If a conflict occurs (and only if it does) the C-expression is evaluated and
 | |
| parsing continues along this alternative if its value is non-zero. Example:
 | |
| .DS
 | |
| .ft CW
 | |
| 	plus:
 | |
| 		  %if (some_plusses_are_more_equal_than_others())
 | |
| 		  '+'
 | |
| 		|
 | |
| 		  '+'
 | |
| 		;
 | |
| .ft
 | |
| .DE
 | |
| A repetition conflict arises when the parser cannot decide whether
 | |
| ``\f(CWproductionrule\fP'' in e.g. ``\f(CW[ productionrule ]*\fP'' must be chosen
 | |
| once more, or whether the parser should continue with what follows.
 | |
| This kind of conflict can be resolved by putting a \fBwhile-condition\fP right
 | |
| after the opening bracket.  It consists of a \fB``%while''\fP
 | |
| followed by a C-expression between parentheses. As an example, we can look at
 | |
| the \fBcomma-expression\fP in C.  The comma may only be used for the
 | |
| comma-expression if the total expression is not part of another comma-separated
 | |
| list:
 | |
| .DS
 | |
| .nf
 | |
| .ft CW
 | |
| 	comma_expression:
 | |
| 		  sub_expression
 | |
| 		  [ %while (not_part_of_comma_separated_list())
 | |
| 			  ',' sub_expression
 | |
| 		  ]*
 | |
| 		;
 | |
| .ft
 | |
| .fi
 | |
| .DE
 | |
| Again, the \fB``%while''\fP is only used in case of a conflict.
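The effect of a while-condition can be illustrated with a small hand-written recognizer in C. This is only a sketch under stated assumptions, not the code LLgen generates: tokens are single characters, a letter stands for \f(CWsub_expression\fP, and the flag \f(CWinside_list\fP and the body of the predicate are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical flag: non-zero while the caller is parsing a
 * comma-separated list (e.g. function arguments). */
static int inside_list;

/* Stand-in for the predicate named in the grammar above. */
static int not_part_of_comma_separated_list(void)
{
    return !inside_list;
}

/* Recognizes  sub_expression [ %while (...) ',' sub_expression ]*
 * on a string of one-character tokens, where a letter plays the role
 * of sub_expression.  Advances *p past what was consumed and returns
 * the number of sub-expressions, or -1 on a syntax error. */
static int comma_expression(const char **p)
{
    int count = 0;

    assert(p != NULL && *p != NULL);
    if (!isalpha((unsigned char)**p))
        return -1;
    (*p)++;
    count++;

    /* A ',' can either continue the repetition or belong to the
     * enclosing list: the repetition conflict.  The predicate decides. */
    while (**p == ',' && not_part_of_comma_separated_list()) {
        (*p)++;                           /* consume ',' */
        if (!isalpha((unsigned char)**p))
            return -1;
        (*p)++;                           /* consume sub_expression */
        count++;
    }
    return count;
}
```

With \f(CWinside_list\fP set, the recognizer stops in front of the first comma and leaves it to the enclosing list; otherwise it consumes the whole sequence.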
 | |
| .LP
 | |
| Error recovery is done almost completely automatically. All the LLgen user has to do
 | |
| is write a routine called \fILLmessage\fP to give the necessary error
 | |
| messages and supply information about terminals found missing.
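A minimal sketch of such a routine is given below. It assumes the parameter convention described in the LLgen manual: a positive argument is the number of a terminal that was found missing and is inserted, zero means that the current token (in \fILLsymb\fP) is deleted, and a negative argument means that end of input was expected. The helper \fItoken_name\fP, the buffer \fIlast_message\fP and the message texts are hypothetical, and \fILLsymb\fP is defined here only so that the fragment compiles on its own.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in so that the sketch compiles on its own; in a real parser
 * LLsymb (the current token) is provided by the generated code. */
int LLsymb;

/* Hypothetical: the last message, kept so that it can be inspected. */
static char last_message[64];

/* Hypothetical helper: printable form of a token number. */
static const char *token_name(int tk)
{
    static char name[32];

    if (tk >= 32 && tk < 127)
        snprintf(name, sizeof name, "'%c'", tk);
    else
        snprintf(name, sizeof name, "token %d", tk);
    return name;
}

void LLmessage(int tk)
{
    if (tk > 0) {
        /* Assumed: terminal tk was missing and has been inserted. */
        snprintf(last_message, sizeof last_message,
                 "%s missing", token_name(tk));
    } else if (tk == 0) {
        /* Assumed: the current token is being deleted. */
        snprintf(last_message, sizeof last_message,
                 "%s deleted", token_name(LLsymb));
    } else {
        /* Assumed: end of input was expected. */
        assert(tk < 0);
        strcpy(last_message, "end of input expected");
    }
    fprintf(stderr, "%s\n", last_message);
}
```

A real compiler would normally add the file name and line number to each message.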
 | |
| .NH 2
 | |
| Indentation
 | |
| .PP
 | |
| The way conflicts can be resolved is of great use for Occam. The use of
 | |
| indentation to group statements leads to many conflicts because the spaces
 | |
| used for indentation are just token separators to the lexical analyzer, i.e.
 | |
| ``white space''. The lexical analyzer can be instructed to generate `BEGIN' and
 | |
| `END' tokens at each indentation change, but that leads to great difficulties
 | |
| as expressions may occupy several lines, thus leading to indentation changes
 | |
| at the strangest moments. So we decided to resolve the conflicts by looking
 | |
| at the indentation ourselves. The lexical analyzer puts the current indentation
 | |
| level in the global variable \fIind\fP for use by the parser. The best example
 | |
| is the \fBSEQ\fP construct, which exists in two flavors, one with a replicator
 | |
| and one process:
 | |
| .DS
 | |
| .nf
 | |
| .ft CW
 | |
| 	seq i = [ 1 for str[byte 0] ]
 | |
| 		out ! str[byte i]
 | |
| .ft
 | |
| .fi
 | |
| .DE
 | |
| and one without a replicator and several processes:
 | |
| .DS
 | |
| .nf
 | |
| .ft CW
 | |
| 	seq
 | |
| 		in ? c
 | |
| 		out ! c
 | |
| .ft
 | |
| .fi
 | |
| .DE
 | |
| The LLgen skeleton grammar to handle these two is:
 | |
| .DS
 | |
| .nf
 | |
| .ft CW
 | |
| 	SEQ			{	line=yylineno; oind=ind; }
 | |
| 	[	  %if (line==yylineno)
 | |
| 		  replicator
 | |
| 		  process
 | |
| 		|
 | |
| 		  [ %while (ind>oind) process ]*
 | |
| 	]
 | |
| .ft
 | |
| .fi
 | |
| .DE
 | |
| This shows clearly that a replicator must be on the same line as the \fBSEQ\fP,
 | |
| and new processes are collected as long as the indentation level of each process
 | |
| is greater than the indentation level of \fBSEQ\fP (with appropriate checks on this
 | |
| indentation).
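The collection of processes under a \fBSEQ\fP can be sketched in plain C as follows. This is an illustration, not the generated parser: it assumes that every process occupies exactly one line, and the function name and the array of per-line indentations are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* ind[i] is the indentation of line i; line `seq` holds the SEQ
 * keyword.  Returns the number of processes that belong to it.
 * Assumption: every process occupies exactly one line. */
static size_t collect_processes(const int *ind, size_t nlines, size_t seq)
{
    int oind;
    size_t next, count = 0;

    assert(seq < nlines);
    oind = ind[seq];            /* corresponds to oind=ind in the action */

    /* corresponds to [ %while (ind>oind) process ]* */
    for (next = seq + 1; next < nlines && ind[next] > oind; next++)
        count++;
    return count;
}
```

The loop stops at the first line that is not indented more deeply than the \fBSEQ\fP itself, which is the line on which the enclosing construct continues.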
 | |
| .PP
 | |
| Different indentation styles are accepted, as long as the same number of spaces
 | |
| is used for each indentation shift. The ASCII tab character sets the indentation
 | |
| level to an eight-space boundary. The first indentation level found in a file
 | |
| serves as the reference against which all other indentation levels are compared.
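How a lexical analyzer might compute the indentation can be sketched as below. Two points are assumptions rather than statements from this document: a tab is taken to advance to the next multiple of eight, and the first indentation found is treated as a unit of which every later indentation must be a multiple. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Column reached after the leading white space of `line`.
 * Assumption: a tab advances to the next multiple of eight. */
static int indent_column(const char *line)
{
    int col = 0;

    assert(line != NULL);
    for (; *line == ' ' || *line == '\t'; line++) {
        if (*line == '\t')
            col = (col / 8 + 1) * 8;
        else
            col++;
    }
    return col;
}

/* Converts a column to an indentation level, given `unit`, the first
 * non-zero indentation found in the file.  Returns -1 if the column
 * is not a whole number of units. */
static int indent_level(int col, int unit)
{
    if (unit <= 0 || col % unit != 0)
        return -1;
    return col / unit;
}
```

For instance, two spaces followed by a tab give column 8, and a tab followed by two spaces gives column 10.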
 |