.\" $Id$
|
|
.\" Run this paper off with
|
|
.\" refer [options] -p LLgen.refs LLgen.doc | [n]eqn | tbl | (nt)roff -ms
|
|
.if '\*(>.'' \{\
|
|
. if '\*(<.'' \{\
|
|
. if n .ds >. .
|
|
. if n .ds >, ,
|
|
. if t .ds <. .
|
|
. if t .ds <, ,\
|
|
\}\
|
|
\}
|
|
.ND
|
|
.EQ
|
|
delim @@
|
|
.EN
|
|
.TL
|
|
LLgen, an extended LL(1) parser generator
|
|
.AU
|
|
Ceriel J. H. Jacobs
|
|
.AI
|
|
Dept. of Mathematics and Computer Science
|
|
Vrije Universiteit
|
|
Amsterdam, The Netherlands
|
|
.AB
|
|
\fILLgen\fR provides a
|
|
tool for generating an efficient recursive descent parser
|
|
with no backtrack from
|
|
an Extended Context Free syntax.
|
|
The \fILLgen\fR
|
|
user specifies the syntax, together with code
|
|
describing actions associated with the parsing process.
|
|
\fILLgen\fR
|
|
turns this specification into a number of subroutines that handle the
|
|
parsing process.
|
|
.PP
|
|
The grammar may be ambiguous.
|
|
\fILLgen\fR contains both static and dynamic facilities
|
|
to resolve these ambiguities.
|
|
.PP
|
|
The specification can be split into several files, for each of
|
|
which \fILLgen\fR generates an output file containing the
|
|
corresponding part of the parser.
|
|
Furthermore, only output files that differ from their previous
|
|
version are updated.
|
|
Other output files are not affected in any
|
|
way.
|
|
This allows the user to recompile only those output files that have
|
|
changed.
|
|
.PP
|
|
The subroutine produced by \fILLgen\fR calls a user supplied routine
|
|
that must return the next token. This way, the input to the
|
|
parser can be split into single characters or higher level
|
|
tokens.
|
|
.PP
|
|
An error recovery mechanism is generated almost completely
|
|
automatically.
|
|
It is based on so called \fBdefault choices\fR, which are
|
|
implicitly or explicitly specified by the user.
|
|
.PP
|
|
\fILLgen\fR has successfully been used to create recognizers for
|
|
Pascal, C, and Modula-2.
|
|
.AE
|
|
.NH
|
|
Introduction
|
|
.PP
|
|
\fILLgen\fR
|
|
provides a tool for generating an efficient recursive
|
|
descent parser with no backtrack from an Extended Context Free
|
|
syntax.
|
|
A parser generated by
|
|
\fILLgen\fR
|
|
will be called
|
|
\fILLparse\fR
|
|
for the rest of this document.
|
|
It is assumed that the reader has some knowledge of LL(1) grammars and
|
|
recursive descent parsers.
|
|
For a survey on the subject, see reference
|
|
.[ (
|
|
griffiths
|
|
.]).
|
|
.PP
|
|
Extended LL(1) parsers are an extension of LL(1) parsers. They are
|
|
derived from an Extended Context-Free (ECF) syntax instead of a Context-Free
|
|
(CF) syntax.
|
|
ECF syntax is described in section 2.
|
|
Section 3 provides an outline of a
|
|
specification as accepted by
|
|
\fILLgen\fR and also discusses the lexical conventions of
|
|
grammar specification files.
|
|
Section 4 provides a description of the way the
|
|
\fILLgen\fR
|
|
user can associate
|
|
actions with the syntax. These actions must be written in the programming
|
|
language C,
|
|
.[
|
|
kernighan ritchie
|
|
.]
|
|
which also is the target language of \fILLgen\fR.
|
|
The error recovery technique is discussed in section 5.
|
|
This section also discusses what the user can do about it.
|
|
Section 6 discusses
|
|
the facilities \fILLgen\fR offers
|
|
to resolve ambiguities and conflicts.
|
|
\fILLgen\fR offers facilities to resolve them both at parser
|
|
generation time and during the execution of \fILLparse\fR.
|
|
Section 7 discusses the
|
|
\fILLgen\fR
|
|
working environment.
|
|
It also discusses the lexical analyzer that must be supplied by the
|
|
user.
|
|
This lexical analyzer must read the input stream and break it
|
|
up into basic input items, called \fBtokens\fR for the rest of
|
|
this document.
|
|
Appendix A gives a summary of the
|
|
\fILLgen\fR
|
|
input syntax.
|
|
Appendix B gives an example.
|
|
It is very instructive to compare this example with the one
|
|
given in reference
|
|
.[ (
|
|
yacc
|
|
.]).
|
|
It demonstrates the struggle \fILLparse\fR and other LL(1)
|
|
parsers have with expressions.
|
|
Appendix C gives an example of the \fILLgen\fR features
|
|
allowing the user to recompile only those output files that
|
|
have changed, using the \fImake\fR program.
|
|
.[
|
|
make
|
|
.]
|
|
.NH
|
|
The Extended Context-Free Syntax
|
|
.PP
|
|
The extensions of an ECF syntax with respect to an ordinary CF syntax are:
|
|
.IP 1. 10
|
|
An ECF syntax contains the repetition operator: "N" (N represents a positive
|
|
integer).
|
|
.IP 2. 10
|
|
An ECF syntax contains the closure set operator without and with
|
|
upperbound: "*" and "*N".
|
|
.IP 3. 10
|
|
An ECF syntax contains the positive closure set operator without and with
|
|
upperbound: "+" and "+N".
|
|
.IP 4. 10
|
|
An ECF syntax contains the optional operator: "?", which is a
|
|
shorthand for "*1".
|
|
.IP 5. 10
|
|
An ECF syntax contains the brackets "[" and "]", which can be
used for grouping.
|
|
.PP
|
|
We can describe the syntax of an ECF syntax with an ECF syntax :
|
|
.DS
|
|
.ft CW
|
|
grammar : rule +
|
|
;
|
|
.ft R
|
|
.DE
|
|
This grammar rule states that a grammar consists of one or more
|
|
rules.
|
|
.DS
|
|
.ft CW
|
|
rule : nonterminal ':' productionrule ';'
|
|
;
|
|
.ft R
|
|
.DE
|
|
A rule consists of a left hand side, the nonterminal,
|
|
followed by ":",
|
|
the \fBproduce symbol\fR, followed by a production rule, followed by a
|
|
";", in\%di\%ca\%ting the end of the rule.
|
|
.DS
|
|
.ft CW
|
|
productionrule : production [ '|' production ]*
|
|
;
|
|
.ft R
|
|
.DE
|
|
A production rule consists of one or
|
|
more alternative productions separated by "|". This symbol is called the
|
|
\fBalternation symbol\fR.
|
|
.DS
|
|
.ft CW
|
|
production : term *
|
|
;
|
|
.ft R
|
|
.DE
|
|
A production consists of a possibly empty list of terms.
|
|
So, empty productions are allowed.
|
|
.DS
|
|
.ft CW
|
|
term : element repeats
|
|
;
|
|
.ft R
|
|
.DE
|
|
A term is an element, possibly with a repeat specification.
|
|
.DS
|
|
.ft CW
|
|
element : LITERAL
|
|
| IDENTIFIER
|
|
| '[' productionrule ']'
|
|
;
|
|
.ft R
|
|
.DE
|
|
An element can be a LITERAL, which is basically a single character
between apostrophes; an IDENTIFIER, which is either a
nonterminal or a token; or a production rule
between square brackets.
|
|
.DS
|
|
.ft CW
|
|
repeats : '?'
|
|
| [ '*' | '+' ] NUMBER ?
|
|
| NUMBER ?
|
|
;
|
|
.ft R
|
|
.DE
|
|
These are the repeat specifications discussed above. Notice that
|
|
this specification may be empty.
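.PP
As an illustration of these repeat specifications, the following terms
use hypothetical element names ("digit", "item" and "sign" are chosen
for this example only):
.DS
.ft CW
  digit 4     /* exactly four times */
  item *      /* zero or more times */
  item * 3    /* at most three times */
  item +      /* one or more times */
  item + 3    /* at least once, at most three times */
  sign ?      /* at most once, the same as "sign * 1" */
.ft R
.DE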
|
|
.PP
|
|
The class of ECF languages
|
|
is identical with the class of CF languages. However, in many
|
|
cases recursive definitions of language features can now be
|
|
replaced by iterative ones. This tends to reduce the number of
|
|
nonterminals and gives rise to very efficient recursive descent
|
|
parsers.
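.PP
The following sketch illustrates this replacement for a comma-separated
list; the names "list", "listtail" and "item" are chosen for this example only.
A CF syntax needs an auxiliary recursive nonterminal:
.DS
.ft CW
  list     : item listtail
           ;
  listtail : ',' item listtail
           |        /* empty */
           ;
.ft R
.DE
With the closure operator of an ECF syntax, one iterative rule suffices:
.DS
.ft CW
  list     : item [ ',' item ]*
           ;
.ft R
.DE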
|
|
.NH
|
|
Grammar Specifications
|
|
.PP
|
|
The major part of a
|
|
\fILLgen\fR
|
|
grammar specification consists of an
|
|
ECF syntax specification.
|
|
Names in this syntax specification refer to either tokens or nonterminal
|
|
symbols.
|
|
\fILLgen\fR
|
|
requires token names to be declared as such. This way it
|
|
can be avoided that a typing error in a nonterminal name causes it to
|
|
be accepted as a token name. The token declarations will be
|
|
discussed later.
|
|
A name will be regarded as a nonterminal symbol, unless it is declared
|
|
as a token name.
|
|
If there is no production rule for a nonterminal symbol, \fILLgen\fR
|
|
will complain.
|
|
.PP
|
|
A grammar specification may also include some C routines,
|
|
for instance the lexical analyzer and an error reporting
|
|
routine.
|
|
Thus, a grammar specification file can contain declarations,
|
|
grammar rules and C-code.
|
|
.PP
|
|
Blanks, tabs and newlines are ignored, but may not appear in names or
|
|
keywords.
|
|
Comments may appear wherever a name is legal (which is almost
|
|
everywhere).
|
|
They are enclosed in
|
|
/* ... */, as in C. Comments do not nest.
|
|
.PP
|
|
Names may be of arbitrary length, and can be made up of letters, underscore
|
|
"\_" and non-initial digits. Upper and lower case letters are distinct.
|
|
Only the first 50 characters are significant.
|
|
Notice however, that the names for the tokens will be used by the
|
|
C-preprocessor.
|
|
The number of significant characters therefore depends on the
|
|
underlying C-implementation.
|
|
A safe rule is to make the identifiers distinct in the first six
|
|
characters, case ignored.
|
|
.PP
|
|
There are two kinds of tokens:
|
|
those that are declared and are denoted by a name,
|
|
and literals.
|
|
.PP
|
|
A literal consists of a character enclosed in apostrophes "'".
|
|
The "\e" is an escape character within literals. The following escapes
|
|
are recognized :
|
|
.TS
|
|
center;
|
|
l l.
|
|
\&'\en'	newline
\&'\er'	return
\&'\e''	apostrophe "'"
\&'\e\e'	backslash "\e"
\&'\et'	tab
\&'\eb'	backspace
\&'\ef'	form feed
\&'\exxx'	"xxx" in octal
|
|
.TE
|
|
.PP
|
|
Names representing tokens must be declared before they are used.
|
|
This can be done using the "\fB%token\fR" keyword,
|
|
by writing
|
|
.nf
|
|
.ft CW
|
|
.sp 1
|
|
%token name1, name2, . . . ;
|
|
.ft R
|
|
.fi
|
|
.PP
|
|
\fILLparse\fR is designed to recognize special nonterminal
|
|
symbols called \fBstart symbols\fR.
|
|
\fILLgen\fR allows for more than one start symbol.
|
|
Thus, grammars with more than one entry point are accepted.
|
|
The start symbols must be declared explicitly using the
"\fB%start\fR" keyword. It can be used wherever a declaration is
legal, for instance:
|
|
.nf
|
|
.ft CW
|
|
.sp 1
|
|
%start LLparse, specification ;
|
|
.ft R
|
|
.fi
|
|
.sp 1
|
|
declares "specification" as a start symbol and associates the
|
|
identifier "LLparse" with it.
|
|
"LLparse" will now be the name of the C-function that must be
|
|
called to recognize "specification".
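.PP
A grammar with two entry points could, for example, be declared as follows.
This is only a sketch: the token names, the nonterminals "program" and
"expression", and the routine names "parse_program" and "parse_expr"
are assumptions made for this example.
.nf
.ft CW
.sp 1
  %token IDENT, NUMBER ;
  %start parse_program, program ;
  %start parse_expr, expression ;
.ft R
.fi
.sp 1
The C-function "parse_program" then recognizes "program", and
"parse_expr" recognizes "expression".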
|
|
.NH
|
|
Actions
|
|
.PP
|
|
\fILLgen\fR
|
|
allows arbitrary insertions of actions within the right hand side
|
|
of a production rule in the ECF syntax. An action consists of a number of C
|
|
statements, enclosed in the brackets "{" and "}".
|
|
.PP
|
|
\fILLgen\fR
|
|
generates a parsing routine for each rule in the grammar. The actions
|
|
supplied by the user are just inserted in the proper place.
|
|
There may also be declarations before the statements in the
|
|
action, as
|
|
the "{" and "}" are copied into the target code along with the
|
|
action. The scope of these declarations terminates with the
|
|
closing bracket "}" of the action.
|
|
.PP
|
|
In addition to actions, it is also possible to declare local variables
|
|
in the parsing routine, which can then be used in the actions.
|
|
Such a declaration consists of a number of C variable declarations,
|
|
enclosed in the brackets "{" and "}". It must be placed
|
|
right in front of the ":" in the grammar rule.
|
|
The scope of these local variables consists of the complete
|
|
grammar rule.
|
|
.PP
|
|
In order to facilitate communication between the actions and
|
|
\fILLparse\fR,
|
|
the parsing routines can be given C-like parameters.
|
|
Each parameter must be declared separately, and each of these declarations must
|
|
end with a semicolon.
|
|
For the last parameter, the semicolon is optional.
|
|
.PP
|
|
So, for example
|
|
.nf
|
|
.ft CW
|
|
.sp 1
|
|
expr(int *pval;) { int fact; } :
|
|
/*
|
|
* Rule with one parameter, a pointer to an int.
|
|
* Parameter specifications are ordinary C declarations.
|
|
* One local variable, of type int.
|
|
*/
|
|
factor (&fact) { *pval = fact; }
|
|
/*
|
|
* factor is another nonterminal symbol.
|
|
* One actual parameter is supplied.
|
|
* Notice that the parameter passing mechanism is that
|
|
* of C.
|
|
*/
|
|
[ '+' factor (&fact) { *pval += fact; } ]*
|
|
/*
|
|
* remember the '*' means zero or more times
|
|
*/
|
|
;
|
|
.sp 1
|
|
.ft R
|
|
.fi
|
|
is a rule to recognize a number of factors, separated by "+", and
|
|
to compute their sum.
|
|
.PP
|
|
\fILLgen\fR
|
|
generates C code, so the parameter passing mechanism is that of
|
|
C, as is shown in the example above.
|
|
.PP
|
|
Actions often manipulate attributes of the token just read.
|
|
For instance, when an identifier is read, its name must be
|
|
looked up in a symbol table.
|
|
Therefore, \fILLgen\fR generates code
|
|
such that at a number of places in the grammar rule
|
|
it is defined which token has last been read.
|
|
After a token, the last token read is this token.
|
|
After a "[" or a "|", the last token read is the next token to
|
|
be accepted by \fILLparse\fR.
|
|
At all other places, it is undefined which token has last been
|
|
read.
|
|
The last token read is available in the global integer variable
|
|
\fILLsymb\fR.
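.PP
As a small sketch of the first case, the rule below (with the hypothetical
nonterminal "addop") records which operator was matched.
Each action directly follows a token, so \fILLsymb\fR is defined there:
.nf
.ft CW
.sp 1
  addop(int *op;) :
        '+'  { *op = LLsymb; }   /* LLsymb is '+' here */
      | '-'  { *op = LLsymb; }   /* LLsymb is '-' here */
      ;
.sp 1
.ft R
.fi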
|
|
.PP
|
|
The user may also specify C-code wherever a \fILLgen\fR-declaration is
|
|
legal.
|
|
Again, this code must be enclosed in the brackets "{" and "}".
|
|
This way, the user can define global declarations and
|
|
C-functions.
|
|
To avoid name-conflicts with identifiers generated by
|
|
\fILLgen\fR, \fILLparse\fR only uses names beginning with
|
|
"LL"; the user should avoid such names.
|
|
.NH
|
|
Error Recovery
|
|
.PP
|
|
The error recovery technique used by \fILLgen\fR is a
|
|
modification of the one presented in reference
|
|
.[ (
|
|
automatic construction error correcting
|
|
.]).
|
|
It is based on \fBdefault choices\fR: at
every point in the grammar where there is a
choice, one of the possibilities is designated as the default.
|
|
Thus, in an alternation, one of the productions is marked as a
|
|
default choice, and in a term with a non-fixed repetition
|
|
specification there will also be a default choice (between
|
|
doing the term (once more) and continuing with the rest of the
|
|
production in which the term appears).
|
|
.PP
|
|
When \fILLparse\fR detects an error after having parsed the
|
|
string @s@, the default choices enable it to compute one
|
|
syntactically correct continuation,
|
|
consisting of the tokens @t sub 1~...~t sub n@,
|
|
such that @s~t sub 1~...~t sub n@ is a string of tokens that
|
|
is a member of the language defined by the grammar.
|
|
Notice, that the computation of this continuation must
|
|
terminate, which implies that the default choices may not
|
|
invoke recursive rules.
|
|
.PP
|
|
At each point in this continuation, a certain number of other
tokens could also be syntactically correct. For instance, the token
@t@ is syntactically correct at point @t sub i@ in this
continuation if the string @s~t sub 1~...~t sub i~t~s sub 1@
is a string of the language defined by the grammar for some
string @s sub 1@ and some @i >= 0@.
|
|
.PP
|
|
The set @T@
|
|
containing all these tokens (including @t sub 1 ,~...,~t sub n@) is computed.
|
|
Next, \fILLparse\fR discards zero
|
|
or more tokens from its input, until a token
|
|
@t@ \(mo @T@ is found.
|
|
The error is then corrected by inserting @i@ (@i >= 0@) tokens
|
|
@t sub 1~...~t sub i@, such that the string
|
|
@s~t sub 1~...~t sub i~t~s sub 1@ is a string of the language
|
|
defined by the grammar, for some @s sub 1@.
|
|
Then, normal parsing is resumed.
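.PP
As a worked example, consider a grammar consisting of the single rule
below, with "stat" as its start symbol (the rule and token names are
chosen for this illustration only):
.nf
.ft CW
.sp 1
  stat : IDENT '=' NUMBER ';' ;
.sp 1
.ft R
.fi
Suppose the input is "IDENT NUMBER ;", so that the "=" is missing.
\fILLparse\fR detects the error after having parsed IDENT.
As the rule contains no choices, the only correct continuation is
"= NUMBER ;", so @T@ contains at least "=", NUMBER and ";".
The offending token NUMBER is a member of @T@, so no tokens are discarded.
Inserting the single token "=" makes NUMBER legal, after which normal
parsing is resumed.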
|
|
.PP
|
|
The above is difficult to implement in a recursive descent
|
|
parser, and is not the way \fILLparse\fR does it, but the
|
|
effect is the same. In fact, \fILLparse\fR maintains a list
|
|
of tokens that may not be discarded, which is adjusted as
|
|
\fILLparse\fR proceeds. This list is just a representation
|
|
of the set @T@ mentioned
|
|
above. When an error occurs, \fILLparse\fR discards tokens until
|
|
a token @t@ that is a member of this list is found.
|
|
Then, it continues parsing, following the default choices,
|
|
inserting tokens along the way, until this token @t@ is legal.
|
|
The selection of
|
|
the default choices must guarantee that this will always
|
|
happen.
|
|
.PP
|
|
The default choices are explicitly or implicitly
|
|
specified by the user.
|
|
By default, the default choice in an alternation is the
|
|
alternative with the shortest possible terminal production.
|
|
The user can select one of the other productions in the
|
|
alternation as the default choice by putting the keyword
|
|
"\fB%default\fR" in front of it.
|
|
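.PP
For instance, in the sketch below (the names are assumptions of this
example) the empty statement would implicitly be the default choice, as
it has the shortest terminal production. The keyword makes the
assignment the default choice instead:
.nf
.ft CW
.sp 1
  stat : %default IDENT '=' NUMBER ';'
       | ';'
       ;
.sp 1
.ft R
.fi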
.PP
|
|
By default, for terms with a repetition count containing "*" or
|
|
"?" the default choice is to continue with the rest of the rule
|
|
in which the term appears, and
|
|
.sp 1
|
|
.ft CW
|
|
.nf
|
|
term+
|
|
.fi
|
|
.ft R
|
|
.sp 1
|
|
is treated as
|
|
.sp 1
|
|
.nf
|
|
.ft CW
|
|
term term* .
|
|
.ft R
|
|
.fi
|
|
.PP
|
|
Clearly, it can never be the default choice to do
the term (once more), because this could cause the parser to
loop, inserting tokens forever.
However, the user may not want the parser to skip
tokens that would not have been skipped if doing the term
had been the default choice.
The skipping of such tokens can be prevented by
using the keyword "\fB%persistent\fR".
|
|
For instance, the rule
|
|
.sp 1
|
|
.ft CW
|
|
.nf
|
|
commandlist : command* ;
|
|
.fi
|
|
.ft R
|
|
.sp 1
|
|
could be changed to
|
|
.sp 1
|
|
.ft CW
|
|
.nf
|
|
commandlist : [ %persistent command ]* ;
|
|
.fi
|
|
.ft R
|
|
.sp 1
|
|
The effects of this in case of a syntax error are twofold:
|
|
The set @T@ mentioned above will be extended as if "command" were
|
|
in the default production, so that fewer tokens will be
|
|
skipped.
|
|
Also, if the first token that is not skipped is a member of the
|
|
subset of @T@ arising from the grammar rule for "command",
|
|
\fILLparse\fR will enter that rule.
|
|
So, in fact the default choice
|
|
is determined dynamically (by \fILLparse\fR).
|
|
Again, \fILLgen\fR checks (statically)
|
|
that \fILLparse\fR will always terminate, and if not,
|
|
\fILLgen\fR will complain.
|
|
.PP
|
|
An important property of this error recovery method is that,
|
|
once a rule is started, it will be finished.
|
|
This means that all actions in the rule will be executed
|
|
normally, so that the user can be sure that there will be no
|
|
inconsistencies in his data structures because of syntax
|
|
errors.
|
|
Also, as the method is in fact error correcting, the
|
|
actions in a rule only have to deal with syntactically correct
|
|
input.
|
|
.NH
|
|
Ambiguities and conflicts
|
|
.PP
|
|
As \fILLgen\fR generates a recursive descent parser with no backtrack,
|
|
it must at all times be able to determine what to do,
|
|
based on the current input symbol.
|
|
Unfortunately, this cannot be done for all grammars.
|
|
Two kinds of conflicts can arise :
|
|
.IP 1) 10
|
|
the grammar rule is of the form "production1 | production2",
|
|
and \fILLparse\fR cannot decide which production to choose.
|
|
This we call an \fBalternation conflict\fR.
|
|
.IP 2) 10
|
|
the grammar rule is of the form "[ productionrule ]...",
|
|
where ... specifies a non-fixed repetition count,
|
|
and \fILLparse\fR cannot decide whether to
|
|
choose "productionrule" once more, or to continue.
|
|
This we call a \fBrepetition conflict\fR.
|
|
.PP
|
|
There can be several causes for conflicts: the grammar may be
|
|
ambiguous, or the grammar may require a more complex parser
|
|
than \fILLgen\fR can construct.
|
|
The conflicts can be examined by inspecting the verbose
|
|
(-\fBv\fR) option output file.
|
|
The conflicts can be resolved by rewriting the grammar
|
|
or by using \fBconflict resolvers\fR.
|
|
The mechanism described here is based on the attributed parsing
|
|
of reference
|
|
.[ (
|
|
milton
|
|
.]).
|
|
.PP
|
|
An alternation conflict can be resolved by putting an \fBif condition\fR
|
|
in front of the first conflicting production.
|
|
It consists of a "\fB%if\fR" followed by a
|
|
C-expression between parentheses.
|
|
\fILLparse\fR will then evaluate this expression whenever a
|
|
token is met at this point on which there is a conflict, so
|
|
the conflict will be resolved dynamically.
|
|
If the expression evaluates to
|
|
non-zero, the first conflicting production is chosen,
|
|
otherwise one of the remaining ones is chosen.
|
|
.PP
|
|
An alternation conflict can also be resolved using the keywords
|
|
"\fB%prefer\fR" or "\fB%avoid\fR". "\fB%prefer\fR"
|
|
is equivalent in behaviour to
|
|
"\fB%if\fR (1)". "\fB%avoid\fR" is equivalent to "\fB%if\fR (0)".
|
|
In these cases however, "\fB%prefer\fR" and "\fB%avoid\fR" should be used,
|
|
as they resolve the conflict statically and thus
|
|
give rise to better C-code.
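.PP
A classical application is the "dangling else". In the sketch below, the
token names IF and ELSE and the nonterminals "expr" and "stat" are
assumptions of this example. On the token ELSE, both the first production
between the brackets and the empty one are possible; "\fB%prefer\fR"
statically selects the first, which attaches each ELSE to the nearest IF:
.nf
.ft CW
.sp 1
  ifstat : IF '(' expr ')' stat
           [ %prefer ELSE stat
           |         /* empty */
           ]
         ;
.sp 1
.ft R
.fi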
|
|
.PP
|
|
A repetition conflict can be resolved by putting a \fBwhile condition\fR
|
|
right after the opening bracket "[". This while condition
|
|
consists of a "\fB%while\fR" followed by a C-expression between
|
|
parentheses. Again, \fILLparse\fR will then
|
|
evaluate this expression whenever a token is met
|
|
at this point on which there is a conflict.
|
|
If the expression evaluates to non-zero, the
|
|
repeating part is chosen, otherwise the parser continues with
|
|
the rest of the rule.
|
|
Appendix B will give an example of these features.
|
|
.PP
|
|
A useful aid in writing conflict resolvers is the "\fB%first\fR" keyword.
|
|
It is used to declare a C-macro that forms an expression
|
|
returning 1 if the parameter supplied can start a specified
|
|
nonterminal, for instance:
|
|
.sp 1
|
|
.nf
|
|
.ft CW
|
|
%first fmac, nonterm ;
|
|
.ft R
|
|
.sp 1
|
|
.fi
|
|
declares "fmac" as a macro with one parameter, whose value
|
|
is a token number. If the parameter
|
|
X can start the nonterminal "nonterm", "fmac(X)" is true,
|
|
otherwise it is false.
|
|
.NH
|
|
The LLgen working environment
|
|
.PP
|
|
\fILLgen\fR generates a number of files: one for each input
|
|
file, and two other files: \fILpars.c\fR and \fILpars.h\fR.
|
|
\fILpars.h\fR contains "#\ define"s for the token names.
|
|
\fILpars.c\fR contains the error recovery routines and tables.
|
|
Only those output files that differ from their previous version
|
|
are updated. See appendix C for a possible application of this
|
|
feature.
|
|
.PP
|
|
The names of the output files are constructed as
|
|
follows:
|
|
in the input file name, the suffix after the last period is
replaced by a "c". If no period is present in the input file
name, ".c" is appended to it. \fILLgen\fR checks that a file with the
name constructed this way either is a previous
version of its output or does not exist yet.
|
|
.PP
|
|
The user must provide some environment to obtain a complete
|
|
program.
|
|
Routines called \fImain\fR and \fILLmessage\fR must be defined.
|
|
Also, a lexical analyzer must be provided.
|
|
.PP
|
|
The routine \fImain\fR must be defined, as it must be in every
|
|
C-program. It should eventually call one of the startsymbol
|
|
routines.
|
|
.PP
|
|
The routine \fILLmessage\fR must accept one
|
|
parameter, whose value is a token number, zero or -1.
|
|
.br
|
|
A zero parameter indicates that the current token (the one in
|
|
the external variable \fILLsymb\fR) is deleted.
|
|
.br
|
|
A -1 parameter indicates that the parser expected end of file, but didn't get
|
|
it.
|
|
The parser will then skip tokens until end of file is detected.
|
|
.br
|
|
A parameter that is a token number (a positive parameter)
|
|
indicates that this
|
|
token is to be inserted in front of the token currently in
|
|
\fILLsymb\fR.
|
|
The user can give the token the proper attributes.
|
|
Also, the user must take care, that the token currently in
|
|
\fILLsymb\fR is again returned by the \fBnext\fR call to the
|
|
lexical analyzer, with the proper attributes.
|
|
So, the lexical analyzer must have a facility to push back one
|
|
token.
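.PP
The three cases could be handled along the following lines.
This is a sketch under assumptions, not a prescribed implementation:
the variables "saved_tok" and "have_saved" and the way the lexical
analyzer uses them are hypothetical, and token attributes are left out.
.nf
.ft CW
.sp 1
  {
  #include <stdio.h>

  extern int LLsymb;     /* the current token, see above */
  int saved_tok;         /* hypothetical one-token push back buffer */
  int have_saved = 0;    /* nonzero if saved_tok holds a token */

  LLmessage(tk)
      int tk;
  {
      if (tk == 0) {
          fprintf(stderr, "token %d deleted\en", LLsymb);
      } else if (tk < 0) {
          fprintf(stderr, "end of file expected\en");
      } else {
          fprintf(stderr, "token %d inserted\en", tk);
          saved_tok = LLsymb;   /* must be returned by the next call */
          have_saved = 1;
      }
  }
  }
.sp 1
.ft R
.fi
The lexical analyzer is then assumed to test "have_saved" first and, if
it is set, to clear it and return "saved_tok" before reading further input.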
|
|
.PP
|
|
The user may also supply his own error recovery routines, or handle
|
|
errors differently. For this purpose, the name of a routine to be called
|
|
when an error occurs may be declared using the keyword \fB%onerror\fR.
|
|
This routine takes two parameters.
|
|
The first one is either the token number of the
token expected, or 0. In the latter case, the error occurred at a choice.
In both cases, the routine must ensure that the next call to the lexical
analyzer returns the token that replaces the current one. Of course,
that could well be the current one, in which case
.I LLparse
recovers from the error.
|
|
The second parameter contains a list of tokens that are not skipped at the
|
|
error point. The list is in the form of a null-terminated array of integers,
|
|
whose address is passed.
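.PP
A declaration and a matching routine might look as follows. The routine
name "my_error" and its parameter names are assumptions of this sketch,
which only reports the error. A complete routine must in addition arrange
for the lexical analyzer to return the replacement token on its next
call; that part depends on the lexical analyzer and is omitted here.
.nf
.ft CW
.sp 1
  %onerror my_error ;

  {
  #include <stdio.h>

  my_error(expected, keep)
      int expected;      /* token number expected, or 0 at a choice */
      int *keep;         /* zero-terminated list of tokens not skipped */
  {
      if (expected) fprintf(stderr, "token %d expected\en", expected);
      else fprintf(stderr, "syntax error\en");
      while (*keep) fprintf(stderr, "  would not skip token %d\en", *keep++);
  }
  }
.sp 1
.ft R
.fi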
|
|
.PP
|
|
The user must supply a lexical analyzer to read the input stream and
|
|
break it up into tokens, which are passed to
|
|
.I LLparse .
|
|
It should be an integer valued function, returning the token number.
|
|
The name of this function can be declared using the
|
|
"\fB%lexical\fR" keyword.
|
|
This keyword can be used wherever a declaration is legal and may appear
|
|
only once in the grammar specification, for instance:
|
|
.sp 1
|
|
.nf
|
|
.ft CW
|
|
%lexical scanner ;
|
|
.ft R
|
|
.fi
|
|
.sp 1
|
|
declares "scanner" as the name of the lexical analyzer.
|
|
The default name for the lexical analyzer is "yylex".
|
|
The reason for this funny name is that a useful tool for constructing
|
|
lexical analyzers is the
|
|
.I Lex
|
|
program,
|
|
.[
|
|
lex
|
|
.]
|
|
which generates a routine of that name.
|
|
.PP
|
|
The token numbers are chosen by \fILLgen\fR.
|
|
The token number for a literal
|
|
is the numerical value of the character in the local character set.
|
|
If the tokens have a name,
|
|
the "#\ define" mechanism of C is used to give them a value and
|
|
to allow the lexical analyzer to return their token numbers symbolically.
|
|
These "#\ define"s are collected in the file \fILpars.h\fR which
|
|
can be "#\ include"d in any file that needs the token-names.
|
|
The maximum token number chosen is defined in the macro \fILL_MAXTOKNO\fP.
|
|
.PP
|
|
The lexical analyzer must signal the end
|
|
of input to \fILLparse\fR
|
|
by returning a number less than or equal to zero.
|
|
.NH
|
|
Programs with more than one parser
|
|
.PP
|
|
\fILLgen\fR offers a simple facility for having more than one parser in
|
|
a program: in this case, the user can change the names of global procedures,
|
|
variables, etc, by giving a different prefix, like this:
|
|
.sp 1
|
|
.nf
|
|
.ft CW
|
|
%prefix XX ;
|
|
.ft R
|
|
.fi
|
|
.sp 1
|
|
The effect of this is that all global names start with XX instead of LL, for
|
|
the parser that has this prefix. This holds for the variable \fILLsymb\fP,
|
|
which now is called \fIXXsymb\fP, for the routine \fILLmessage\fP,
|
|
which must now be called \fIXXmessage\fP, and for the macro \fILL_MAXTOKNO\fP,
|
|
which is now called \fIXX_MAXTOKNO\fP.
|
|
\fILL.output\fP is now \fIXX.output\fP, and \fILpars.c\fP and \fILpars.h\fP
|
|
are now called \fIXXpars.c\fP and \fIXXpars.h\fP.
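.PP
As a sketch, a second parser for expressions could start with the
declarations below. The prefix "EX", the routine names "EXlex" and
"ex_parse", and the nonterminal "expression" are assumptions of this example:
.nf
.ft CW
.sp 1
  %prefix EX ;
  %lexical EXlex ;
  %start ex_parse, expression ;
.ft R
.fi
.sp 1
For this parser, the user supplies "EXmessage" instead of "LLmessage",
the last token read is in "EXsymb", and the token definitions are in
\fIEXpars.h\fR.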
|
|
.bp
|
|
.SH
|
|
References
|
|
.[
|
|
$LIST$
|
|
.]
|
|
.bp
|
|
.SH
|
|
Appendix A : LLgen Input Syntax
|
|
.PP
|
|
This appendix has a description of the \fILLgen\fR input syntax,
|
|
as a \fILLgen\fR specification. As a matter of fact, the current
|
|
version of \fILLgen\fR is written with \fILLgen\fR.
|
|
.nf
|
|
.ft CW
|
|
.sp 2
|
|
/*
|
|
* First the declarations of the terminals
|
|
* The order is not important
|
|
*/
|
|
|
|
%token IDENTIFIER; /* terminal or nonterminal name */
|
|
%token NUMBER;
|
|
%token LITERAL;
|
|
|
|
/*
|
|
* Reserved words
|
|
*/
|
|
|
|
%token TOKEN; /* %token */
|
|
%token START; /* %start */
|
|
%token PERSISTENT; /* %persistent */
|
|
%token IF; /* %if */
|
|
%token WHILE; /* %while */
|
|
%token AVOID; /* %avoid */
|
|
%token PREFER; /* %prefer */
|
|
%token DEFAULT; /* %default */
|
|
%token LEXICAL; /* %lexical */
|
|
%token PREFIX; /* %prefix */
|
|
%token ONERROR; /* %onerror */
|
|
%token FIRST; /* %first */
|
|
|
|
/*
|
|
* Declare LLparse to be a C-routine that recognizes "specification"
|
|
*/
|
|
|
|
%start LLparse, specification;
|
|
|
|
specification
|
|
: declaration*
|
|
;
|
|
|
|
declaration
|
|
: START
|
|
IDENTIFIER ',' IDENTIFIER
|
|
';'
|
|
| '{'
|
|
/* Read C-declaration here */
|
|
'}'
|
|
| TOKEN
|
|
IDENTIFIER
|
|
[ ',' IDENTIFIER ]*
|
|
';'
|
|
| FIRST
|
|
IDENTIFIER ',' IDENTIFIER
|
|
';'
|
|
| LEXICAL
|
|
IDENTIFIER
|
|
';'
|
|
| PREFIX
|
|
IDENTIFIER
|
|
';'
|
|
| ONERROR
|
|
IDENTIFIER
|
|
';'
|
|
| rule
|
|
;
|
|
|
|
rule : IDENTIFIER parameters? ldecl?
|
|
':' productions
|
|
';'
|
|
;
|
|
|
|
ldecl : '{'
|
|
/* Read C-declaration here */
|
|
'}'
|
|
;
|
|
|
|
productions
|
|
: simpleproduction
|
|
[ '|' simpleproduction ]*
|
|
;
|
|
|
|
simpleproduction
|
|
: DEFAULT?
|
|
[ IF '(' /* Read C-expression here */ ')'
|
|
| PREFER
|
|
| AVOID
|
|
]?
|
|
[ element repeats ]*
|
|
;
|
|
|
|
element : '{'
|
|
/* Read action here */
|
|
'}'
|
|
| '[' [ WHILE '(' /* Read C-expression here */ ')' ]?
|
|
PERSISTENT?
|
|
productions
|
|
']'
|
|
| LITERAL
|
|
| IDENTIFIER parameters?
|
|
;
|
|
|
|
parameters
|
|
: '(' /* Read C-parameters here */ ')'
|
|
;
|
|
|
|
repeats : /* empty */
|
|
| [ '*' | '+' ] NUMBER?
|
|
| NUMBER
|
|
| '?'
|
|
;
|
|
|
|
.fi
|
|
.ft R
|
|
.bp
|
|
.SH
|
|
Appendix B : An example
|
|
.PP
|
|
This example gives the complete \fILLgen\fR specification of a simple
|
|
desk calculator. It has 26 registers, labeled "a" through "z",
|
|
and accepts arithmetic expressions made up of the C operators
|
|
+, -, *, /, %, &, and |, with their usual priorities.
|
|
The value of the expression is
|
|
printed. As in C, an integer that begins with 0 is assumed to
|
|
be octal; otherwise it is assumed to be decimal.
|
|
.PP
|
|
Although the example is short and not very complicated, it
|
|
demonstrates the use of if and while conditions. In
|
|
the example they are in fact used to reduce the number of
|
|
nonterminals, and to reduce the overhead due to the recursion
|
|
that would be involved in parsing an expression with an
|
|
ordinary recursive descent parser. In an ordinary LL(1)
|
|
grammar there would be one nonterminal for each operator
|
|
priority. The example shows how we can do it all with one
|
|
nonterminal, no matter how many priority levels there are.
|
|
.sp 1
|
|
.nf
|
|
.ft CW
|
|
{
|
|
#include <stdio.h>
|
|
#include <ctype.h>
|
|
#define MAXPRIO 5
|
|
#define prio(op) (ptab[op])
|
|
|
|
struct token {
|
|
int t_tokno; /* token number */
|
|
int t_tval; /* Its attribute */
|
|
} stok = { 0,0 }, tok;
|
|
|
|
int nerrors = 0;
|
|
int regs[26]; /* Space for the registers */
|
|
int ptab[128]; /* Attribute table */
|
|
|
|
struct token
|
|
nexttok() { /* Read next token and return it */
|
|
register c;
|
|
struct token new;
|
|
|
|
while ((c = getchar()) == ' ' || c == '\et') { /* nothing */ }
|
|
if (isdigit(c)) new.t_tokno = DIGIT;
|
|
else if (islower(c)) new.t_tokno = IDENT;
|
|
else new.t_tokno = c;
|
|
new.t_tval = (c >= 0 && c < 128) ? ptab[c] : 0; /* stay inside ptab */
|
|
return new;
|
|
} }
|
|
|
|
%token DIGIT, IDENT;
|
|
%start parse, list;
|
|
|
|
list : stat* ;
|
|
|
|
stat { int ident, val; } :
|
|
%if (stok = nexttok(),
|
|
stok.t_tokno == '=')
|
|
/* The conflict is resolved by looking one further
|
|
* token ahead. The grammar is LL(2)
|
|
*/
|
|
IDENT
|
|
{ ident = tok.t_tval; }
|
|
'=' expr(1,&val) '\en'
|
|
{ if (!nerrors) regs[ident] = val; }
|
|
| expr(1,&val) '\en'
|
|
{ if (!nerrors) printf("%d\en",val); }
|
|
| '\en'
|
|
;
|
|
|
|
expr(int level; int *val;) { int expr; } :
|
|
factor(val)
|
|
[ %while (prio(tok.t_tokno) >= level)
|
|
/* Swallow operators as long as their priority is
|
|
* larger than or equal to the level of this invocation
|
|
*/
|
|
'+' expr(prio('+')+1,&expr)
|
|
{ *val += expr; }
|
|
/* This states that '+' groups left to right. If it
|
|
* should group right to left, the rule should read:
|
|
* '+' expr(prio('+'),&expr)
|
|
*/
|
|
| '-' expr(prio('-')+1,&expr)
|
|
{ *val -= expr; }
|
|
| '*' expr(prio('*')+1,&expr)
|
|
{ *val *= expr; }
|
|
| '/' expr(prio('/')+1,&expr)
|
|
{ *val /= expr; }
|
|
| '%' expr(prio('%')+1,&expr)
|
|
{ *val %= expr; }
|
|
| '&' expr(prio('&')+1,&expr)
|
|
{ *val &= expr; }
|
|
| '|' expr(prio('|')+1,&expr)
|
|
{ *val |= expr; }
|
|
]*
|
|
/* Notice the "*" here. It is important.
|
|
*/
|
|
;
|
|
|
|
factor(int *val;):
|
|
'(' expr(1,val) ')'
|
|
| '-' expr(MAXPRIO+1,val)
|
|
{ *val = -*val; }
|
|
| number(val)
|
|
| IDENT
|
|
{ *val = regs[tok.t_tval]; }
|
|
;
|
|
|
|
number(int *val;) { int base; }
|
|
: DIGIT
|
|
{ base = (*val=tok.t_tval)==0?8:10; }
|
|
[ DIGIT
|
|
{ *val = base * *val + tok.t_tval; }
|
|
]* ;
|
|
|
|
%lexical scanner ;
|
|
{
|
|
scanner() {
|
|
if (stok.t_tokno) { /* a token has been inserted or read ahead */
|
|
tok = stok;
|
|
stok.t_tokno = 0;
|
|
return tok.t_tokno;
|
|
}
|
|
if (nerrors && tok.t_tokno == '\en') {
|
|
printf("ERROR\en");
|
|
nerrors = 0;
|
|
}
|
|
tok = nexttok();
|
|
return tok.t_tokno;
|
|
}
|
|
|
|
LLmessage(insertedtok) {
|
|
nerrors++;
|
|
if (insertedtok) { /* token inserted, save old token */
|
|
stok = tok;
|
|
tok.t_tval = 0;
|
|
if (insertedtok < 128) tok.t_tval = ptab[insertedtok];
|
|
}
|
|
}
|
|
|
|
main() {
|
|
register *p;
|
|
|
|
for (p = ptab; p < &ptab[128]; p++) *p = 0;
|
|
/* for letters, their attribute is their index in the regs array */
|
|
for (p = &ptab['a']; p <= &ptab['z']; p++) *p = p - &ptab['a'];
|
|
/* for digits, their attribute is their value */
|
|
for (p = &ptab['0']; p <= &ptab['9']; p++) *p = p - &ptab['0'];
|
|
/* for operators, their attribute is their priority */
|
|
ptab['*'] = 4;
|
|
ptab['/'] = 4;
|
|
ptab['%'] = 4;
|
|
ptab['+'] = 3;
|
|
ptab['-'] = 3;
|
|
ptab['&'] = 2;
|
|
ptab['|'] = 1;
|
|
parse();
|
|
exit(nerrors);
|
|
} }
|
|
.fi
|
|
.ft R
|
|
.bp
|
|
.SH
|
|
Appendix C : How to use \fILLgen\fR
|
|
.PP
|
|
This appendix demonstrates how \fILLgen\fR can be used in
|
|
combination with the \fImake\fR program, to make effective use
|
|
of the \fILLgen\fR-feature that it only changes output files
|
|
when necessary. \fIMake\fR uses a "makefile", which
|
|
is a file containing dependencies and associated commands.
|
|
A dependency usually indicates that some files depend on other
|
|
files. When a file depends on another file and is older than
|
|
that other file, the commands associated with the dependency
|
|
are executed.
|
|
.PP
|
|
So, \fImake\fR seems just the program that we always wanted.
|
|
However, it
|
|
is not very good at handling programs that generate more than
|
|
one file.
|
|
As usual, there is a way around this problem.
|
|
A sample makefile follows:
|
|
.sp 1
|
|
.ft CW
|
|
.nf
|
|
# The grammar consists of the files decl.g, stat.g and expr.g.
|
|
# The ".o"-files are the result of a C-compilation.
|
|
|
|
GFILES = decl.g stat.g expr.g
|
|
OFILES = decl.o stat.o expr.o Lpars.o
|
|
LLOPT =
|
|
|
|
# As make doesn't handle programs that generate more than one
|
|
# file well, we just don't tell make about it.
|
|
# We just create a dummy file, and touch it whenever LLgen is
|
|
# executed. This way, the dummy in fact depends on the grammar
|
|
# files.
|
|
# Then, we execute make again, to do the C-compilations and
|
|
# such.
|
|
|
|
all:	dummy
	make parser

dummy:	$(GFILES)
	LLgen $(LLOPT) $(GFILES)
	touch dummy

parser:	$(OFILES)
	$(CC) -o parser $(LDFLAGS) $(OFILES)
|
|
|
|
# Some dependencies without actions :
|
|
# make already knows what to do about them
|
|
|
|
Lpars.o: Lpars.h
|
|
stat.o: Lpars.h
|
|
decl.o: Lpars.h
|
|
expr.o: Lpars.h
|
|
|
|
.fi
|
|
.ft R
|