.\"	$Header$
.\"	Run this paper off with
.\"	refer [options] -p LLgen.refs LLgen.doc | [n]eqn | tbl | (nt)roff -ms
.if '\*(>.'' \{\
.	if '\*(.>'' \{\
.		if n .ds >. .
.		if n .ds >, ,
.		if t .ds .> .
.		if t .ds ,> ,\
\}\
\}
.cs 5 22u
.ND
.EQ
delim @@
.EN
.TL
LLgen, an extended LL(1) parser generator
.AU
Ceriel J. H. Jacobs
.AI
Dept. of Mathematics and Computer Science
Vrije Universiteit
Amsterdam, The Netherlands
.AB
\fILLgen\fR provides a
tool for generating an efficient recursive descent parser
without backtracking from
an Extended Context-Free syntax.
The \fILLgen\fR
user specifies the syntax, together with code
describing actions associated with the parsing process.
\fILLgen\fR
turns this specification into a number of subroutines that handle the
parsing process.
.PP
The grammar may be ambiguous.
\fILLgen\fR contains both static and dynamic facilities
to resolve these ambiguities.
.PP
The specification can be split into several files, for each of
which \fILLgen\fR generates an output file containing the
corresponding part of the parser.
Furthermore, only output files that differ from their previous
version are updated.
Other output files are not affected in any
way.
This allows the user to recompile only those output files that have
changed.
.PP
The subroutine produced by \fILLgen\fR calls a user-supplied routine
that must return the next token. This way, the input to the
parser can be split into single characters or higher-level
tokens.
.PP
An error recovery mechanism is generated almost completely
automatically.
It is based on so-called \fBdefault choices\fR, which are
implicitly or explicitly specified by the user.
.PP
\fILLgen\fR has successfully been used to create recognizers for
Pascal and C.
.AE
.NH
Introduction
.PP
\fILLgen\fR
provides a tool for generating an efficient recursive
descent parser without backtracking from an Extended Context-Free
syntax.
A parser generated by
\fILLgen\fR
will be called
\fILLparse\fR
for the rest of this document.
It is assumed that the reader has some knowledge of LL(1) grammars and
recursive descent parsers.
For a survey on the subject, see .
.[
griffiths
.]
.PP
Extended LL(1) parsers are an extension of LL(1) parsers. They are
derived from an Extended Context-Free (ECF) syntax instead of a Context-Free
(CF) syntax.
ECF syntax is described in section 2.
Section 3 provides an outline of a
specification as accepted by
\fILLgen\fR and also discusses the lexical conventions of
grammar specification files.
Section 4 provides a description of the way the
\fILLgen\fR
user can associate
actions with the syntax. These actions must be written in the programming
language C ,
.[
kernighan ritchie
.]
which is also the target language of \fILLgen\fR.
The error recovery technique is discussed in section 5.
This section also discusses how the user can influence it.
Section 6 discusses
the facilities \fILLgen\fR offers
to resolve ambiguities and conflicts.
\fILLgen\fR offers facilities to resolve them both at parser
generation time and during the execution of \fILLparse\fR.
Section 7 discusses the
\fILLgen\fR
working environment.
It also discusses the lexical analyzer that must be supplied by the
user.
This lexical analyzer must read the input stream and break it
up into basic input items, called \fBtokens\fR for the rest of
this document.
Appendix A gives a summary of the
\fILLgen\fR
input syntax.
Appendix B gives an example.
It is very instructive to compare this example with the one
given in .
.[
yacc
.]
It demonstrates the struggle \fILLparse\fR and other LL(1)
parsers have with expressions.
Appendix C gives an example of the \fILLgen\fR features
allowing the user to recompile only those output files that
have changed, using the \fImake\fR program .
.[
make
.]
.NH
The Extended Context-Free Syntax
.PP
The extensions of an ECF syntax with respect to an ordinary CF syntax are:
.IP 1. 10
An ECF syntax contains the repetition operator: "N" (N represents a positive
integer).
.IP 2. 10
An ECF syntax contains the closure set operator without and with
upper bound: "*" and "*N".
.IP 3. 10
An ECF syntax contains the positive closure set operator without and with
upper bound: "+" and "+N".
.IP 4. 10
An ECF syntax contains the optional operator: "?", which is a
shorthand for "*1".
.IP 5. 10
An ECF syntax contains square brackets "[" and "]" which can be
used for grouping.
.PP
The syntax of an ECF syntax can itself be described with an ECF syntax:
.DS
.ft 5
grammar         : rule +
                ;
.ft R
.DE
This grammar rule states that a grammar consists of one or more
rules.
.DS
.ft 5
rule            : nonterminal ':' productionrule ';'
                ;
.ft R
.DE
A rule consists of a left hand side, the nonterminal,
followed by ":",
the \fBproduce symbol\fR, followed by a production rule, followed by a
";", in\%di\%ca\%ting the end of the rule.
.DS
.ft 5
productionrule  : production [ '|' production ]*
                ;
.ft R
.DE
A production rule consists of one or
more alternative productions separated by "|". This symbol is called the
\fBalternation symbol\fR.
.DS
.ft 5
production      : term *
                ;
.ft R
.DE
A production consists of a possibly empty list of terms.
So, empty productions are allowed.
.DS
.ft 5
term            : element repeats
                ;
.ft R
.DE
A term is an element, possibly with a repeat specification.
.DS
.ft 5
element         : LITERAL
                | IDENTIFIER
                | '[' productionrule ']'
                ;
.ft R
.DE
An element can be a LITERAL, which basically is a single character
between apostrophes; an IDENTIFIER, which is either a
nonterminal or a token; or a production rule
between square brackets.
.DS
.ft 5
repeats         : '?'
                | [ '*' | '+' ] NUMBER ?
                | NUMBER ?
                ;
.ft R
.DE
These are the repeat specifications discussed above. Notice that
this specification may be empty.
.PP
The class of ECF languages
is identical to the class of CF languages. However, in many
cases recursive definitions of language features can now be
replaced by iterative ones. This tends to reduce the number of
nonterminals and gives rise to very efficient recursive descent
parsers.
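.PP
As an illustration of the last point, the following is a minimal hand-written
C sketch, not \fILLgen\fR output, of a recursive descent routine for the
hypothetical ECF rule "list : item [ ',' item ]* ;".
The closure operator becomes a loop, so no auxiliary recursive
nonterminal is needed.
The names "input", "pos", "item" and "list", and the convention of
returning -1 on a syntax error, are assumptions made for this example only.

```c
#include <ctype.h>

/* Hypothetical input buffer and read position for this sketch. */
static const char *input;
static int pos;

/* item : a single decimal digit.  Returns 1 on success, 0 on failure. */
static int item(void)
{
    if (isdigit((unsigned char)input[pos])) {
        pos++;
        return 1;
    }
    return 0;
}

/*
 * list : item [ ',' item ]* ;
 * The "*" repetition is coded as a while loop instead of a recursive call.
 * Returns the number of items recognized, or -1 on a syntax error.
 */
static int list(const char *text)
{
    int count = 0;

    input = text;
    pos = 0;
    if (!item())
        return -1;
    count++;
    while (input[pos] == ',') {     /* decide on the current input symbol */
        pos++;
        if (!item())
            return -1;
        count++;
    }
    return input[pos] == '\0' ? count : -1;
}
```

.PP
With these definitions, list("1,2,3") recognizes three items, while
list("1,") is rejected because an item must follow each ",".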
.NH
Grammar Specifications
.PP
The major part of a
\fILLgen\fR
grammar specification consists of an
ECF syntax specification.
Names in this syntax specification refer to either tokens or nonterminal
symbols.
\fILLgen\fR
requires token names to be declared as such. This prevents
a typing error in a nonterminal name from causing it to
be accepted as a token name. The token declarations will be
discussed later.
A name will be regarded as a nonterminal symbol, unless it is declared
as a token name.
If there is no production rule for a nonterminal symbol, \fILLgen\fR
will complain.
.PP
A grammar specification may also include some C routines,
for instance the lexical analyzer and an error reporting
routine.
Thus, a grammar specification file can contain declarations,
grammar rules and C-code.
.PP
Blanks, tabs and newlines are ignored, but may not appear in names or
keywords.
Comments may appear wherever a name is legal (which is almost
everywhere).
They are enclosed in
/* ... */, as in C. Comments do not nest.
.PP
Names may be of arbitrary length, and can be made up of letters, underscore
"\_" and non-initial digits. Upper and lower case letters are distinct.
Only the first 50 characters are significant.
Notice however, that the names for the tokens will be used by the
C-preprocessor.
The number of significant characters therefore depends on the
underlying C-implementation.
A safe rule is to make the identifiers distinct in the first six
characters, case ignored.
.PP
There are two kinds of tokens:
those that are declared and are denoted by a name,
and literals.
.PP
A literal consists of a character enclosed in apostrophes "'".
The "\e" is an escape character within literals. The following escapes
are recognized:
.TS
center;
l l.
\&'\en'	newline
\&'\er'	return
\&'\e''	apostrophe "'"
\&'\e\e'	backslash "\e"
\&'\et'	tab
\&'\eb'	backspace
\&'\ef'	form feed
\&'\exxx'	"xxx" in octal
.TE
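.PP
The table above can be read as a small decoding function.
The following C sketch is a hand-written illustration, not code taken from
\fILLgen\fR; it maps the text between the apostrophes to a character value.
The function name "literal_value", the -1 error result and the acceptance of
one to three octal digits are assumptions of this sketch.

```c
/*
 * Return the character value of a literal body (the text between the
 * apostrophes), or -1 if it is not one of the forms in the table.
 */
static int literal_value(const char *s)
{
    int value = 0, digits = 0;

    if (s[0] == '\0')
        return -1;
    if (s[0] != '\\')                       /* ordinary single character */
        return s[1] == '\0' ? (unsigned char)s[0] : -1;

    s++;                                    /* skip the escape character */
    if (*s >= '0' && *s <= '7') {
        /* Assumption of this sketch: one to three octal digits. */
        while (digits < 3 && *s >= '0' && *s <= '7') {
            value = 8 * value + (*s++ - '0');
            digits++;
        }
        return *s == '\0' ? value : -1;
    }
    if (s[0] == '\0' || s[1] != '\0')       /* exactly one escape letter */
        return -1;
    switch (s[0]) {
    case 'n':  return '\n';
    case 'r':  return '\r';
    case '\'': return '\'';
    case '\\': return '\\';
    case 't':  return '\t';
    case 'b':  return '\b';
    case 'f':  return '\f';
    default:   return -1;
    }
}
```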
.PP
Names representing tokens must be declared before they are used.
This can be done using the "\fB%token\fR" keyword,
by writing
.nf
.ft 5
.sp 1
%token  name1, name2, . . . ;
.ft R
.fi
.PP
\fILLparse\fR is designed to recognize special nonterminal
symbols called \fBstart symbols\fR.
\fILLgen\fR allows for more than one start symbol.
Thus, grammars with more than one entry point are accepted.
The start symbols must be declared explicitly using the
"\fB%start\fR" keyword. It can be used wherever a declaration is
legal, e.g.:
.nf
.ft 5
.sp 1
%start LLparse, specification ;
.ft R
.fi
.sp 1
declares "specification" as a start symbol and associates the
identifier "LLparse" with it.
"LLparse" will now be the name of the C-function that must be
called to recognize "specification".
.NH
Actions
.PP
\fILLgen\fR
allows arbitrary insertions of actions within the right hand side
of a production rule in the ECF syntax. An action consists of a number of C
statements, enclosed in the brackets "{" and "}".
.PP
\fILLgen\fR
generates a parsing routine for each rule in the grammar. The actions
supplied by the user are just inserted in the proper place.
There may also be declarations before the statements in the
action, as
the "{" and "}" are copied into the target code along with the
action. The scope of these declarations terminates with the
closing bracket "}" of the action.
.PP
In addition to actions, it is also possible to declare local variables
in the parsing routine, which can then be used in the actions.
Such a declaration consists of a number of C variable declarations,
enclosed in the brackets "{" and "}". It must be placed
right in front of the ":" in the grammar rule.
The scope of these local variables consists of the complete
grammar rule.
.PP
In order to facilitate communication between the actions and
\fILLparse\fR,
the parsing routines can be given C-like parameters. So, for example
.nf
.ft 5
.sp 1
expr(int *pval;) { int fact; } :
                /*
                 * Rule with one parameter, a pointer to an int.
                 * Parameter specifications are ordinary C declarations.
                 * One local variable, of type int.
                 */
        factor (&fact)          { *pval = fact; }
                /*
                 * factor is another nonterminal symbol.
                 * One actual parameter is supplied.
                 * Notice that the parameter passing mechanism is that
                 * of C.
                 */
        [ '+' factor (&fact)    { *pval += fact; } ]*
                /*
                 * remember the '*' means zero or more times
                 */
        ;
.sp 1
.ft R
.fi
is a rule to recognize a number of factors, separated by "+", and
to compute their sum.
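.PP
To make the role of the parameter, the local variable and the two actions
concrete, the following is a hand-written C sketch of a routine that behaves
like the rule above.
It is not the code \fILLgen\fR generates: the token source "src", the
single-digit "factor", the wrapper "sum" and the use of return values to
signal a syntax error are all assumptions made to keep the example
self-contained.

```c
/* Hypothetical token source: a string of single-character tokens. */
static const char *src;

/* factor : one decimal digit; its value is stored through pval. */
static int factor(int *pval)
{
    if (*src < '0' || *src > '9')
        return 0;
    *pval = *src++ - '0';
    return 1;
}

/* expr(int *pval;) { int fact; } : factor(&fact) [ '+' factor(&fact) ]* ; */
static int expr(int *pval)
{
    int fact;                       /* the local variable of the rule */

    if (!factor(&fact))
        return 0;
    *pval = fact;                   /* first action */
    while (*src == '+') {           /* the "*" repetition */
        src++;
        if (!factor(&fact))
            return 0;
        *pval += fact;              /* second action */
    }
    return 1;
}

/* Convenience wrapper: returns the sum, or -1 on a syntax error. */
static int sum(const char *text)
{
    int val;

    src = text;
    if (!expr(&val) || *src != '\0')
        return -1;
    return val;
}
```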
.PP
\fILLgen\fR
generates C code, so the parameter passing mechanism is that of
C, as is shown in the example above.
.PP
Actions often manipulate attributes of the token just read.
For instance, when an identifier is read, its name must be
looked up in a symbol table.
Therefore, \fILLgen\fR generates code
such that at a number of places in the grammar rule
it is defined which token has last been read.
After a token, the last token read is this token.
After a "[" or a "|", the last token read is the next token to
be accepted by \fILLparse\fR.
At all other places, it is undefined which token has last been
read.
The last token read is available in the global integer variable
\fILLsymb\fR.
.PP
The user may also specify C-code wherever a \fILLgen\fR-declaration is
legal.
Again, this code must be enclosed in the brackets "{" and "}".
This way, the user can define global declarations and
C-functions.
To avoid name-conflicts with identifiers generated by
\fILLgen\fR, \fILLparse\fR only uses names beginning with
"LL"; the user should avoid such names.
.NH
Error Recovery
.PP
The error recovery technique used by \fILLgen\fR is a
modification of the one presented in .
.[
automatic construction error correcting
.]
It is based on \fBdefault choices\fR:
at every point in the grammar where there is a
choice, one of the possibilities is designated as the default.
Thus, in an alternation, one of the productions is marked as a
default choice, and in a term with a non-fixed repetition
specification there will also be a default choice (between
doing the term (once more) and continuing with the rest of the
production in which the term appears).
.PP
When \fILLparse\fR detects an error after having parsed the
string @s@, the default choices enable it to compute one
syntactically correct continuation,
consisting of the tokens @t sub 1~...~t sub n@,
such that @s~t sub 1~...~t sub n@ is a string of tokens that
is a member of the language defined by the grammar.
Notice that the computation of this continuation must
terminate, which implies that the default choices may not
invoke recursive rules.
.PP
At each point in this continuation, a certain number of other
tokens could also be syntactically correct. For instance, the token
@t@ is syntactically correct at point @t sub i@ in this
continuation, if the string @s~t sub 1~...~t sub i~t~s sub 1@
is a string of the language defined by the grammar for some
string @s sub 1@ and i >= 0.
.PP
The set @T@
containing all these tokens (including @t sub 1 ,~...,~t sub n@) is computed.
Next, \fILLparse\fR discards zero
or more tokens from its input, until a token
@t@ \(mo @T@ is found.
The error is then corrected by inserting i (i >= 0) tokens
@t sub 1~...~t sub i@, such that the string
@s~t sub 1~...~t sub i~t~s sub 1@ is a string of the language
defined by the grammar, for some @s sub 1@.
Then, normal parsing is resumed.
.PP
The above is difficult to implement in a recursive descent
parser, and is not the way \fILLparse\fR does it, but the
effect is the same. In fact, \fILLparse\fR maintains a list
of tokens that may not be discarded, which is adjusted as
\fILLparse\fR proceeds. This list is just a representation
of the set @T@ mentioned
above. When an error occurs, \fILLparse\fR discards tokens until
a token @t@ that is a member of this list is found.
Then, it continues parsing, following the default choices,
inserting tokens along the way, until this token @t@ is legal.
The selection of
the default choices must guarantee that this will always
happen.
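.PP
The discarding step can be pictured with a small C sketch.
It is only an illustration under simplifying assumptions, not the
\fILLparse\fR implementation: tokens are single characters, the set @T@ is
given as a string of acceptable tokens, a NUL character marks the end of
input, and the names "discard_until" and "resume_token" are hypothetical.

```c
#include <string.h>

/*
 * Count how many tokens at the front of input must be discarded
 * before a token that is a member of the set T is found.
 */
static int discard_until(const char *input, const char *T)
{
    int discarded = 0;

    /* Test for end of input first: strchr() would also match the NUL. */
    while (input[discarded] != '\0' && strchr(T, input[discarded]) == NULL)
        discarded++;
    return discarded;
}

/*
 * Return the token on which parsing can resume,
 * or '\0' if the input is exhausted first.
 */
static char resume_token(const char *input, const char *T)
{
    return input[discard_until(input, T)];
}
```

.PP
For the input "xy;z" and the set ";}", two tokens are discarded and
parsing resumes at ";".
Inserting the tokens needed to make ";" legal is the part driven by the
default choices and is not modelled here.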
.PP
The default choices are explicitly or implicitly
specified by the user.
By default, the default choice in an alternation is the
alternative with the shortest possible terminal production.
The user can select one of the other productions in the
alternation as the default choice by putting the keyword
"\fB%default\fR" in front of it.
.PP
By default, for terms with a repetition count containing "*" or
"?" the default choice is to continue with the rest of the rule
in which the term appears, and
.sp 1
.ft 5
.nf
                term+
.fi
.ft R
.sp 1
is treated as
.sp 1
.nf
.ft 5
                term term* .
.ft R
.fi
.PP
It is also clear that it can never be the default choice to do
the term (once more), because this could cause the parser to
loop, inserting tokens forever.
However, when the user does not want the parser to skip
tokens that would not have been skipped if the term
had been the default choice,
the skipping of such a term can be prevented by
using the keyword "\fB%persistent\fR".
For instance, the rule
.sp 1
.ft 5
.nf
commandlist : command* ;
.fi
.ft R
.sp 1
could be changed to
.sp 1
.ft 5
.nf
commandlist : [ %persistent command ]* ;
.fi
.ft R
.sp 1
The effects of this in case of a syntax error are twofold.
First, the set @T@ mentioned above will be extended as if "command" were
in the default production, so that fewer tokens will be
skipped.
Second, if the first token that is not skipped is a member of the
subset of @T@ arising from the grammar rule for "command",
\fILLparse\fR will enter that rule.
So, in fact the default choice
is determined dynamically (by \fILLparse\fR).
Again, \fILLgen\fR checks (statically)
that \fILLparse\fR will always terminate, and if not,
\fILLgen\fR will complain.
.PP
An important property of this error recovery method is that,
once a rule is started, it will be finished.
This means that all actions in the rule will be executed
normally, so that the user can be sure that there will be no
inconsistencies in the user's data structures because of syntax
errors.
Also, as the method is in fact error correcting, the
actions in a rule only have to deal with syntactically correct
input.
.NH
Ambiguities and conflicts
.PP
As \fILLgen\fR generates a recursive descent parser without backtracking,
it must at all times be able to determine what to do,
based on the current input symbol.
Unfortunately, this cannot be done for all grammars.
Two kinds of conflicts can arise:
.IP 1) 10
the grammar rule is of the form "production1 | production2",
and \fILLparse\fR cannot decide which production to choose.
This we call an \fBalternation conflict\fR.
.IP 2) 10
the grammar rule is of the form "[ productionrule ]...",
where ... specifies a non-fixed repetition count,
and \fILLparse\fR cannot decide whether to
choose "productionrule" once more, or to continue.
This we call a \fBrepetition conflict\fR.
.PP
There can be several causes for conflicts: the grammar may be
ambiguous, or the grammar may require a more complex parser
than \fILLgen\fR can construct.
The conflicts can be examined by inspecting the output file
produced by the verbose (-\fBv\fR) option.
The conflicts can be resolved by rewriting the grammar
or by using \fBconflict resolvers\fR.
The mechanism described here is based on the attributed parsing
of .
.[
milton
.]
.PP
An alternation conflict can be resolved by putting an \fBif condition\fR
in front of the first conflicting production.
It consists of a "\fB%if\fR" followed by a
C-expression between parentheses.
\fILLparse\fR will then evaluate this expression whenever a
token is met at this point on which there is a conflict, so
the conflict will be resolved dynamically.
If the expression evaluates to
non-zero, the first conflicting production is chosen,
otherwise one of the remaining ones is chosen.
.PP
An alternation conflict can also be resolved using the keywords
"\fB%prefer\fR" or "\fB%avoid\fR". "\fB%prefer\fR"
is equivalent in behaviour to
"\fB%if\fR (1)". "\fB%avoid\fR" is equivalent to "\fB%if\fR (0)".
In these cases however, "\fB%prefer\fR" and "\fB%avoid\fR" should be used,
as they resolve the conflict statically and thus
give rise to better C-code.
.PP
A repetition conflict can be resolved by putting a \fBwhile condition\fR
right after the opening bracket "[". This while condition
consists of a "\fB%while\fR" followed by a C-expression between
parentheses. Again, \fILLparse\fR will then
evaluate this expression whenever a token is met
at this point on which there is a conflict.
If the expression evaluates to non-zero, the
repeating part is chosen, otherwise the parser continues with
the rest of the rule.
Appendix B will give an example of these features.
.PP
A useful aid in writing conflict resolvers is the "\fB%first\fR" keyword.
It is used to declare a C-macro that forms an expression
returning 1 if the parameter supplied can start a specified
nonterminal, e.g.:
.sp 1
.nf
.ft 5
%first fmac, nonterm ;
.ft R
.sp 1
.fi
declares "fmac" as a macro with one parameter, whose value
is a token number. If the parameter
X can start the nonterminal "nonterm", "fmac(X)" is true,
otherwise it is false.
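.PP
The following C sketch shows how such a macro might be used in a
conflict resolver.
The macro body is a hand-written stand-in, since the expansion that
\fILLgen\fR actually generates is not shown in this document.
The token names "IDENT", "NUMBER" and "STRING", their numbers, the
assumption that "nonterm" can start with IDENT, NUMBER or '(', and the
helper "can_start_nonterm" are all hypothetical.

```c
/* Hypothetical token numbers; in a real parser they come from Lpars.h. */
#define IDENT   257
#define NUMBER  258
#define STRING  259

/*
 * Hand-written stand-in for the macro declared by "%first fmac, nonterm ;",
 * assuming "nonterm" can start with IDENT, NUMBER or '('.
 * Note that this version evaluates its argument more than once.
 */
#define fmac(X) ((X) == IDENT || (X) == NUMBER || (X) == '(')

/*
 * The kind of test that could be written inside "%if ( ... )" or
 * "%while ( ... )", here wrapped in a function for illustration.
 */
static int can_start_nonterm(int token)
{
    return fmac(token) ? 1 : 0;
}
```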
.NH
The LLgen working environment
.PP
\fILLgen\fR generates a number of files: one for each input
file, and two other files: \fILpars.c\fR and \fILpars.h\fR.
\fILpars.h\fR contains "#-define"s for the token names.
\fILpars.c\fR contains the error recovery routines and tables.
Only those output files that differ from their previous version
are updated. See appendix C for a possible application of this
feature.
.PP
The names of the output files are constructed as
follows:
in the input file name, the suffix after the last point is
replaced by a "c". If no point is present in the input file
name, ".c" is appended to it. \fILLgen\fR checks that the
file name constructed this way either represents a previous
version or does not yet exist.
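.PP
The naming rule can be sketched in C as follows.
This is an illustration of the rule as stated above, not code from
\fILLgen\fR; the function names "output_name" and "output_name_of" and the
fixed buffer size are assumptions.
The sketch follows the text literally and looks only for the last point in
the name.

```c
#include <string.h>

/*
 * Build the output file name for an input file name: replace the
 * suffix after the last point by "c", or append ".c" if the name
 * contains no point.
 * Returns 0 on success, -1 if the result does not fit in out.
 */
static int output_name(const char *in, char *out, size_t outsize)
{
    const char *dot = strrchr(in, '.');
    size_t stem = dot ? (size_t)(dot - in) : strlen(in);

    if (stem + 3 > outsize)         /* stem + ".c" + terminating NUL */
        return -1;
    memcpy(out, in, stem);
    strcpy(out + stem, ".c");
    return 0;
}

/* Convenience wrapper using a static buffer; returns NULL on overflow. */
static const char *output_name_of(const char *in)
{
    static char buf[256];

    return output_name(in, buf, sizeof buf) == 0 ? buf : NULL;
}
```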
.PP
The user must provide some environment to obtain a complete
program.
Routines called \fImain\fR and \fILLmessage\fR must be defined.
Also, a lexical analyzer must be provided.
.PP
The routine \fImain\fR must be defined, as it must be in every
C-program. It should eventually call one of the start symbol
routines.
.PP
The routine \fILLmessage\fR must accept one
parameter, whose value is a token number, zero or -1.
.br
A zero parameter indicates that the current token (the one in
the external variable \fILLsymb\fR) is deleted.
.br
A -1 parameter indicates that the parser expected end of file, but did not get
it.
The parser will then skip tokens until end of file is detected.
.br
A parameter that is a token number (a positive parameter)
indicates that this
token is to be inserted in front of the token currently in
\fILLsymb\fR.
The user can give the token the proper attributes.
Also, the user must take care that the token currently in
\fILLsymb\fR is again returned by the \fBnext\fR call to the
lexical analyzer, with the proper attributes.
So, the lexical analyzer must have a facility to push back one
token.
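.PP
A minimal C sketch of an \fILLmessage\fR routine together with a one-token
push-back facility is given below.
It is an illustration under stated assumptions, not a prescribed
implementation: the lexical analyzer "pb_scanner", the variables "pb_input"
and "pushed_back" and the message texts are hypothetical, tokens are single
characters, and \fILLsymb\fR is defined here only so that the sketch is
self-contained (in a real program the generated parser provides it).

```c
#include <stdio.h>

int LLsymb;                         /* normally provided by the parser */

static const char *pb_input = "";   /* hypothetical input text */
static int pushed_back = 0;         /* token to deliver again, 0 if none */

/* Lexical analyzer with a one-token push-back facility. */
int pb_scanner(void)
{
    if (pushed_back) {
        int token = pushed_back;

        pushed_back = 0;
        return token;
    }
    /* Literal tokens: the character value; 0 signals end of input. */
    return *pb_input ? (unsigned char)*pb_input++ : 0;
}

void LLmessage(int tk)
{
    if (tk == 0) {
        fprintf(stderr, "token %d deleted\n", LLsymb);
    } else if (tk == -1) {
        fprintf(stderr, "end of file expected\n");
    } else {
        fprintf(stderr, "token %d inserted\n", tk);
        /* The token in LLsymb must be returned again by the next call. */
        pushed_back = LLsymb;
    }
}
```

.PP
Token attributes are not modelled in this sketch; a complete version would
save and restore them together with the token number.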
.PP
The user must supply a lexical analyzer to read the input stream and
break it up into tokens, which are passed to
.I LLparse .
It should be an integer-valued function, returning the token number.
The name of this function can be declared using the
"\fB%lexical\fR" keyword.
This keyword can be used wherever a declaration is legal and may appear
only once in the grammar specification, e.g.:
.sp 1
.nf
.ft 5
%lexical scanner ;
.ft R
.fi
.sp 1
declares "scanner" as the name of the lexical analyzer.
The default name for the lexical analyzer is "yylex".
The reason for this funny name is that a useful tool for constructing
lexical analyzers is the
.I Lex
program ,
.[
lex
.]
which generates a routine of that name.
.PP
The token numbers are chosen by \fILLgen\fR.
The token number for a literal
is the numerical value of the character in the local character set.
If the tokens have a name,
the "#\ define" mechanism of C is used to give them a value and
to allow the lexical analyzer to return their token numbers symbolically.
These "#\ define"s are collected in the file \fILpars.h\fR which
can be "#\ include"d in any file that needs the token-names.
.PP
The lexical analyzer must signal the end
of input to \fILLparse\fR
by returning a number less than or equal to zero.
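.PP
These conventions can be summarized in a small hand-written lexical
analyzer.
The C sketch below is an assumption-laden illustration, not part of
\fILLgen\fR: it reads from a string rather than an input stream, the token
name "NUMBER" and its value 257 stand in for a definition that would
normally come from \fILpars.h\fR, and "lex_input", "lex_value" and
"lex_start" are hypothetical names.

```c
#include <ctype.h>

#define NUMBER 257                  /* hypothetical; really from Lpars.h */

static const char *lex_input = "";  /* hypothetical input buffer */
static int lex_value;               /* attribute of the last NUMBER */

static void lex_start(const char *text)
{
    lex_input = text;
}

int scanner(void)
{
    while (*lex_input == ' ' || *lex_input == '\t' || *lex_input == '\n')
        lex_input++;                        /* skip white space */
    if (*lex_input == '\0')
        return 0;                           /* end of input: a value <= 0 */
    if (isdigit((unsigned char)*lex_input)) {
        lex_value = 0;
        while (isdigit((unsigned char)*lex_input))
            lex_value = 10 * lex_value + (*lex_input++ - '0');
        return NUMBER;                      /* named token */
    }
    return (unsigned char)*lex_input++;     /* literal: its character value */
}
```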
.bp
.[
$LIST$
.]
.bp
.SH
Appendix A : LLgen Input Syntax
.PP
This appendix has a description of the \fILLgen\fR input syntax,
as an \fILLgen\fR specification. As a matter of fact, the current
version of \fILLgen\fR is written with \fILLgen\fR.
.nf
.ft 5
.sp 2
/*
 * First the declarations of the terminals
 * The order is not important
 */

%token  IDENTIFIER;            /* terminal or nonterminal name */
%token  NUMBER;
%token  LITERAL;

/*
 * Reserved words
 */

%token  TOKEN;         /* %token */
%token  START;         /* %start */
%token  PERSISTENT;    /* %persistent */
%token  IF;            /* %if */
%token  WHILE;         /* %while */
%token  AVOID;         /* %avoid */
%token  PREFER;        /* %prefer */
%token  DEFAULT;       /* %default */
%token  LEXICAL;       /* %lexical */
%token  FIRST;         /* %first */

/*
 * Declare LLparse to be a C-routine that recognizes "specification"
 */

%start  LLparse, specification;

specification
        : declaration*
        ;

declaration
        : START
                IDENTIFIER ',' IDENTIFIER
          ';'
        | '{'
                /* Read C-declaration here */
          '}'
        | TOKEN
                IDENTIFIER
                [ ',' IDENTIFIER ]*
          ';'
        | FIRST
                IDENTIFIER ',' IDENTIFIER
          ';'
        | LEXICAL
                IDENTIFIER
          ';'
        | rule
        ;

rule    : IDENTIFIER parameters? ldecl?
                ':' productions
          ';'
        ;

ldecl   : '{'
                /* Read C-declaration here */
          '}'
        ;

productions
        : simpleproduction
          [ '|' DEFAULT? simpleproduction ]*
        ;

simpleproduction
        : [ IF '(' /* Read C-expression here */ ')'
          | PREFER
          | AVOID
          ]?
          [ element repeats ]*
        ;

element : '{'
                /* Read action here */
          '}'
        | '[' [ WHILE '(' /* Read C-expression here */ ')' ]?
                PERSISTENT?
                productions
          ']'
        | LITERAL
        | IDENTIFIER parameters?
        ;

parameters
        : '(' /* Read C-parameters here */ ')'
        ;

repeats : /* empty */
        | [ '*' | '+' ] NUMBER?
        | NUMBER
        | '?'
        ;

.fi
.ft R
.bp
.SH
Appendix B : An example
.PP
This example gives the complete \fILLgen\fR specification of a simple
desk calculator. It has 26 registers, labeled "a" through "z",
and accepts arithmetic expressions made up of the C operators
+, -, *, /, %, &, and |, with their usual priorities.
The value of the expression is
printed. As in C, an integer that begins with 0 is assumed to
be octal; otherwise it is assumed to be decimal.
.PP
Although the example is short and not very complicated, it
demonstrates the use of if and while conditions. In
the example they are in fact used to reduce the number of
nonterminals, and to reduce the overhead due to the recursion
that would be involved in parsing an expression with an
ordinary recursive descent parser. In an ordinary LL(1)
grammar there would be one nonterminal for each operator
priority. The example shows how we can do it all with one
nonterminal, no matter how many priority levels there are.
.sp 1
.nf
.ft 5
{
#include <stdio.h>
#include <ctype.h>
#define MAXPRIO      5
#define prio(op)     (ptab[op])

struct token {
        int     t_tokno;        /* token number */
        int     t_tval;         /* Its attribute */
} stok = { 0,0 }, tok;

int     nerrors = 0;
int     regs[26];               /* Space for the registers */
int     ptab[128];              /* Attribute table */

struct token
nexttok() {  /* Read next token and return it */
        register        c;
        struct token    new;

        while ((c = getchar()) == ' ' || c == '\et') { /* nothing */ }
        if (isdigit(c)) new.t_tokno = DIGIT;
        else if (islower(c)) new.t_tokno = IDENT;
        else new.t_tokno = c;
        /* Only index ptab with characters that fit in the table */
        new.t_tval = (c >= 0 && c < 128) ? ptab[c] : 0;
        return new;
}   }

%token  DIGIT, IDENT;
%start  parse, list;

list    : stat* ;

stat    {       int     ident, val; } :
        %if (stok = nexttok(),
             stok.t_tokno == '=')
                    /* The conflict is resolved by looking one further
                     * token ahead. The grammar is LL(2)
                     */
          IDENT
                                {       ident = tok.t_tval; }
          '=' expr(1,&val) '\en'
                                {       if (!nerrors) regs[ident] = val; }
        | expr(1,&val) '\en'
                                {       if (!nerrors) printf("%d\en",val); }
        | '\en'
        ;

expr(int level, *val;) {       int     expr; } :
        %if (level <= MAXPRIO)
                    /* The grammar is ambiguous here. If level > MAXPRIO,
                     * this invocation will only scan one factor
                     */
            expr(MAXPRIO+1,val)
            [ %while (prio(tok.t_tokno) >= level)
                    /* Swallow operators as long as their priority is
                     * larger than or equal to the level of this invocation
                     */
                '+' expr(prio('+')+1,&expr)
                                {       *val += expr; }
                    /* This states that '+' groups left to right. If it
                     * should group right to left, the rule should read:
                     * '+' expr(prio('+'),&expr)
                     */
              | '-' expr(prio('-')+1,&expr)
                                {       *val -= expr; }
              | '*' expr(prio('*')+1,&expr)
                                {       *val *= expr; }
              | '/' expr(prio('/')+1,&expr)
                                {       *val /= expr; }
              | '%' expr(prio('%')+1,&expr)
                                {       *val %= expr; }
              | '&' expr(prio('&')+1,&expr)
                                {       *val &= expr; }
              | '|' expr(prio('|')+1,&expr)
                                {       *val |= expr; }
            ]*
                    /* Notice the "*" here. It is important.
                     */
          | '(' expr(1,val) ')'
          | '-' expr(MAXPRIO+1,val)
                                {       *val = -*val; }
          | number(val)
          | IDENT
                                {       *val = regs[tok.t_tval]; }
        ;

number(int *val;) {       int base; }
        : DIGIT
                                {       base = (*val=tok.t_tval)==0?8:10; }
          [ DIGIT
                                {       *val = base * *val + tok.t_tval; }
          ]*        ;

%lexical scanner ;
{
scanner() {
        if (stok.t_tokno) { /* a token has been inserted or read ahead */
                tok = stok;
                stok.t_tokno = 0;
                return tok.t_tokno;
        }
        if (nerrors && tok.t_tokno == '\en') {
                printf("ERROR\en");
                nerrors = 0;
        }
        tok = nexttok();
        return tok.t_tokno;
}

LLmessage(insertedtok) {
        nerrors++;
        if (insertedtok) { /* token inserted, save old token */
                stok = tok;
                tok.t_tval = 0;
                if (insertedtok < 128) tok.t_tval = ptab[insertedtok];
        }
}

main() {
        register *p;

        for (p = ptab; p < &ptab[128]; p++) *p = 0;
        /* for letters, their attribute is their index in the regs array */
        for (p = &ptab['a']; p <= &ptab['z']; p++) *p = p - &ptab['a'];
        /* for digits, their attribute is their value */
        for (p = &ptab['0']; p <= &ptab['9']; p++) *p = p - &ptab['0'];
        /* for operators, their attribute is their priority */
        ptab['*'] = 4;
        ptab['/'] = 4;
        ptab['%'] = 4;
        ptab['+'] = 3;
        ptab['-'] = 3;
        ptab['&'] = 2;
        ptab['|'] = 1;
        return parse();
}   }
.fi
.ft R
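.PP
The priority handling of the expr rule can also be read as ordinary C.
The following sketch is a paraphrase under stated assumptions, not
code generated by \fILLgen\fR: it reads single-digit operands from a
string (the names "src" and "calc" are hypothetical), leaves out the
registers, and silently skips a division by zero.
The parameter "level" plays the same role as in the grammar, and the
loop condition corresponds to the while condition.
.nf
.ft 5
```c
#include <ctype.h>

#define MAXPRIO 5

static const char *src;         /* hypothetical input string */

static int
prio(int op) {                  /* same priorities as in the example */
        switch (op) {
        case '*': case '/': case '%': return 4;
        case '+': case '-': return 3;
        case '&': return 2;
        case '|': return 1;
        }
        return 0;               /* not an operator */
}

static int
expr(int level) {
        int val, rhs, op;

        if (level > MAXPRIO) {  /* scan one factor only */
                if (*src == '(') {
                        src++;
                        val = expr(1);
                        if (*src == ')') src++;
                        return val;
                }
                if (*src == '-') { src++; return -expr(MAXPRIO + 1); }
                return isdigit((unsigned char) *src) ? *src++ - '0' : 0;
        }
        val = expr(MAXPRIO + 1);
        /* Swallow operators whose priority is at least this level */
        while (prio(*src) >= level) {
                op = *src++;
                rhs = expr(prio(op) + 1);       /* +1: group left to right */
                switch (op) {
                case '+': val += rhs; break;
                case '-': val -= rhs; break;
                case '*': val *= rhs; break;
                case '/': if (rhs) val /= rhs; break;
                case '%': if (rhs) val %= rhs; break;
                case '&': val &= rhs; break;
                case '|': val |= rhs; break;
                }
        }
        return val;
}

int
calc(const char *s) { src = s; return expr(1); }
```
.fi
.ft R
.PP
One routine serves all priority levels: a call with a higher level
simply stops earlier, leaving the weaker operators to its caller.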
.bp
.SH
Appendix C : How to use \fILLgen\fR
.PP
This appendix demonstrates how \fILLgen\fR can be used in
combination with the \fImake\fR program, to make effective use
of the fact that \fILLgen\fR only changes output files
when necessary. \fIMake\fR uses a "makefile", which
is a file containing dependencies and associated commands.
A dependency usually indicates that some files depend on other
files. When a file depends on another file and is older than
that other file, the commands associated with the dependency
are executed.
.PP
So, \fImake\fR seems just the program that we always wanted.
However, it
is not very good at handling programs that generate more than
one file.
As usual, there is a way around this problem.
A sample makefile follows:
.sp 1
.ft 5
.nf
# The grammar consists of the files decl.g, stat.g and expr.g.
# The ".o"-files are the result of a C-compilation.

GFILES = decl.g stat.g expr.g
OFILES = decl.o stat.o expr.o Lpars.o
LLOPT =

# As make doesn't handle programs that generate more than one
# file well, we just don't tell make about it.
# We just create a dummy file, and touch it whenever LLgen is
# executed. This way, the dummy in fact depends on the grammar
# files.
# Then, we execute make again, to do the C-compilations and
# such.

all:	dummy
	make parser

dummy:  $(GFILES)
	LLgen $(LLOPT) $(GFILES)
	touch dummy

parser: $(OFILES)
	$(CC) -o parser $(LDFLAGS) $(OFILES)

# Some dependencies without actions :
# make already knows what to do about them

Lpars.o:        Lpars.h
stat.o:         Lpars.h
decl.o:         Lpars.h
expr.o:         Lpars.h

.fi
.ft R
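.PP
The makefile relies on \fILLgen\fR leaving unchanged output files
alone, so that \fImake\fR finds no newer C files and skips their
compilation.
How \fILLgen\fR does this internally is not described here.
The following is only a sketch of one way such a feature could work,
with hypothetical names ("same_contents", "update_if_changed"):
the new output is first written to a temporary file, which replaces
the old file only if the contents differ.
.nf
.ft 5
```c
#include <stdio.h>

/* Return 1 if both files have identical contents, 0 otherwise.
 * A file that cannot be opened counts as different.
 */
static int
same_contents(const char *a, const char *b) {
        FILE *fa = fopen(a, "r"), *fb = fopen(b, "r");
        int ca, cb, same = (fa != NULL && fb != NULL);

        while (same) {
                ca = getc(fa);
                cb = getc(fb);
                if (ca != cb) same = 0;
                else if (ca == EOF) break;
        }
        if (fa) fclose(fa);
        if (fb) fclose(fb);
        return same;
}

/* Install the freshly generated file "tmp" as "target", but only
 * when it differs. Returns 1 if target was replaced, 0 if it was
 * left untouched or the rename failed.
 */
int
update_if_changed(const char *tmp, const char *target) {
        if (same_contents(tmp, target)) {
                remove(tmp);    /* keep the old file and its timestamp */
                return 0;
        }
        remove(target);         /* fails harmlessly if target is absent */
        return rename(tmp, target) == 0;
}
```
.fi
.ft R
.PP
Because an unchanged file keeps its modification time, the second
invocation of \fImake\fR in the makefile above recompiles only the
files that were really rewritten.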