.\" $Header$
.\" Run this paper off with
.\" refer [options] -p LLgen.refs LLgen.doc | [n]eqn | tbl | (nt)roff -ms
.if '\*(>.'' \{\
. if '\*(<.'' \{\
. if n .ds >. .
. if n .ds >, ,
. if t .ds <. .
. if t .ds <, ,\
\}\
\}
.cs 5 22u
.ND
.EQ
delim @@
.EN
.TL
LLgen, an extended LL(1) parser generator
.AU
Ceriel J. H. Jacobs
.AI
Dept. of Mathematics and Computer Science
Vrije Universiteit
Amsterdam, The Netherlands
.AB
\fILLgen\fR provides a
tool for generating an efficient recursive descent parser
with no backtrack from
an Extended Context Free syntax.
The \fILLgen\fR
user specifies the syntax, together with code
describing actions associated with the parsing process.
\fILLgen\fR
turns this specification into a number of subroutines that handle the
parsing process.
.PP
The grammar may be ambiguous.
\fILLgen\fR contains both static and dynamic facilities
to resolve these ambiguities.
.PP
The specification can be split into several files, for each of
which \fILLgen\fR generates an output file containing the
corresponding part of the parser.
Furthermore, only output files that differ from their previous
version are updated.
Other output files are not affected in any
way.
This allows the user to recompile only those output files that have
changed.
.PP
The subroutine produced by \fILLgen\fR calls a user supplied routine
that must return the next token. This way, the input to the
parser can be split into single characters or higher level
tokens.
.PP
An error recovery mechanism is generated almost completely
automatically.
It is based on so-called \fBdefault choices\fR, which are
implicitly or explicitly specified by the user.
.PP
\fILLgen\fR has successfully been used to create recognizers for
Pascal and C.
.AE
.NH
Introduction
.PP
\fILLgen\fR
provides a tool for generating an efficient recursive
descent parser with no backtrack from an Extended Context Free
syntax.
A parser generated by
\fILLgen\fR
will be called
\fILLparse\fR
for the rest of this document.
It is assumed that the reader has some knowledge of LL(1) grammars and
recursive descent parsers.
For a survey on the subject, see reference
.[ (
griffiths
.]).
.PP
Extended LL(1) parsers are an extension of LL(1) parsers. They are
derived from an Extended Context-Free (ECF) syntax instead of a Context-Free
(CF) syntax.
ECF syntax is described in section 2.
Section 3 provides an outline of a
specification as accepted by
\fILLgen\fR and also discusses the lexical conventions of
grammar specification files.
Section 4 provides a description of the way the
\fILLgen\fR
user can associate
actions with the syntax. These actions must be written in the programming
language C,
.[
kernighan ritchie
.]
which also is the target language of \fILLgen\fR.
The error recovery technique is discussed in section 5.
This section also discusses what the user can do about it.
Section 6 discusses
the facilities \fILLgen\fR offers
to resolve ambiguities and conflicts.
\fILLgen\fR offers facilities to resolve them both at parser
generation time and during the execution of \fILLparse\fR.
Section 7 discusses the
\fILLgen\fR
working environment.
It also discusses the lexical analyzer that must be supplied by the
user.
This lexical analyzer must read the input stream and break it
up into basic input items, called \fBtokens\fR for the rest of
this document.
Appendix A gives a summary of the
\fILLgen\fR
input syntax.
Appendix B gives an example.
It is very instructive to compare this example with the one
given in reference
.[ (
yacc
.]).
It demonstrates the struggle \fILLparse\fR and other LL(1)
parsers have with expressions.
Appendix C gives an example of the \fILLgen\fR features
allowing the user to recompile only those output files that
have changed, using the \fImake\fR program.
.[
make
.]
.NH
The Extended Context-Free Syntax
.PP
The extensions of an ECF syntax with respect to an ordinary CF syntax are:
.IP 1. 10
An ECF syntax contains the repetition operator: "N" (N represents a positive
integer).
.IP 2. 10
An ECF syntax contains the closure set operator, without and with an
upper bound: "*" and "*N".
.IP 3. 10
An ECF syntax contains the positive closure set operator, without and with an
upper bound: "+" and "+N".
.IP 4. 10
An ECF syntax contains the optional operator: "?", which is a
shorthand for "*1".
.IP 5. 10
An ECF syntax contains the square brackets "[" and "]", which can be
used for grouping.
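.PP
For instance, the rules
.sp 1
.nf
.ft CW
block : statement + ;
triple : NUMBER 3 ;
optsign : [ '+' | '-' ] ? ;
.ft R
.fi
.sp 1
(with invented token and nonterminal names) describe a block of one or
more statements, a triple of exactly three NUMBERs, and an optional sign.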
.PP
We can describe the syntax of an ECF syntax with an ECF syntax:
.DS
.ft CW
grammar : rule +
;
.ft R
.DE
This grammar rule states that a grammar consists of one or more
rules.
.DS
.ft CW
rule : nonterminal ':' productionrule ';'
;
.ft R
.DE
A rule consists of a left hand side, the nonterminal,
followed by ":",
the \fBproduce symbol\fR, followed by a production rule, followed by a
";", in\%di\%ca\%ting the end of the rule.
.DS
.ft CW
productionrule : production [ '|' production ]*
;
.ft R
.DE
A production rule consists of one or
more alternative productions separated by "|". This symbol is called the
\fBalternation symbol\fR.
.DS
.ft CW
production : term *
;
.ft R
.DE
A production consists of a possibly empty list of terms.
So, empty productions are allowed.
.DS
.ft CW
term : element repeats
;
.ft R
.DE
A term is an element, possibly with a repeat specification.
.DS
.ft CW
element : LITERAL
        | IDENTIFIER
        | '[' productionrule ']'
        ;
.ft R
.DE
An element can be a LITERAL, which basically is a single character
between apostrophes; an IDENTIFIER, which is either a
nonterminal or a token; or a production rule
between square brackets.
.DS
.ft CW
repeats : '?'
        | [ '*' | '+' ] NUMBER ?
        | NUMBER ?
        ;
.ft R
.DE
These are the repeat specifications discussed above. Notice that
this specification may be empty.
.PP
The class of ECF languages
is identical with the class of CF languages. However, in many
cases recursive definitions of language features can now be
replaced by iterative ones. This tends to reduce the number of
nonterminals and gives rise to very efficient recursive descent
parsers.
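For example, where a CF syntax defines a list of items recursively, as in
.sp 1
.nf
.ft CW
list : item
     | item ',' list
     ;
.ft R
.fi
.sp 1
an ECF syntax can describe the same language with one iterative rule:
.sp 1
.nf
.ft CW
list : item [ ',' item ]* ;
.ft R
.fi
.sp 1
The names "list" and "item" are of course only illustrative.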
.NH
Grammar Specifications
.PP
The major part of a
\fILLgen\fR
grammar specification consists of an
ECF syntax specification.
Names in this syntax specification refer to either tokens or nonterminal
symbols.
\fILLgen\fR
requires token names to be declared as such. This way,
a typing error in a nonterminal name cannot cause it to
be accepted as a token name. The token declarations will be
discussed later.
A name will be regarded as a nonterminal symbol, unless it is declared
as a token name.
If there is no production rule for a nonterminal symbol, \fILLgen\fR
will complain.
.PP
A grammar specification may also include some C routines,
for instance the lexical analyzer and an error reporting
routine.
Thus, a grammar specification file can contain declarations,
grammar rules and C-code.
.PP
Blanks, tabs and newlines are ignored, but may not appear in names or
keywords.
Comments may appear wherever a name is legal (which is almost
everywhere).
They are enclosed in
/* ... */, as in C. Comments do not nest.
.PP
Names may be of arbitrary length, and can be made up of letters, underscore
"\_" and non-initial digits. Upper and lower case letters are distinct.
Only the first 50 characters are significant.
Notice however, that the names for the tokens will be used by the
C-preprocessor.
The number of significant characters therefore depends on the
underlying C-implementation.
A safe rule is to make the identifiers distinct in the first six
characters, case ignored.
.PP
There are two kinds of tokens:
those that are declared and are denoted by a name,
and literals.
.PP
A literal consists of a character enclosed in apostrophes "'".
The "\e" is an escape character within literals. The following escapes
are recognized:
.TS
center;
l l.
\&'\en' newline
\&'\er' return
\&'\e'' apostrophe "'"
\&'\e\e' backslash "\e"
\&'\et' tab
\&'\eb' backspace
\&'\ef' form feed
\&'\exxx' "xxx" in octal
.TE
.PP
Names representing tokens must be declared before they are used.
This can be done using the "\fB%token\fR" keyword,
by writing
.nf
.ft CW
.sp 1
%token name1, name2, . . . ;
.ft R
.fi
.PP
\fILLparse\fR is designed to recognize special nonterminal
symbols called \fBstart symbols\fR.
\fILLgen\fR allows for more than one start symbol.
Thus, grammars with more than one entry point are accepted.
The start symbols must be declared explicitly using the
"\fB%start\fR" keyword. It can be used whenever a declaration is
legal, e.g.:
.nf
.ft CW
.sp 1
%start LLparse, specification ;
.ft R
.fi
.sp 1
declares "specification" as a start symbol and associates the
identifier "LLparse" with it.
"LLparse" will now be the name of the C-function that must be
called to recognize "specification".
.NH
Actions
.PP
\fILLgen\fR
allows arbitrary insertions of actions within the right hand side
of a production rule in the ECF syntax. An action consists of a number of C
statements, enclosed in the brackets "{" and "}".
.PP
\fILLgen\fR
generates a parsing routine for each rule in the grammar. The actions
supplied by the user are just inserted in the proper place.
There may also be declarations before the statements in the
action, as
the "{" and "}" are copied into the target code along with the
action. The scope of these declarations terminates with the
closing bracket "}" of the action.
.PP
In addition to actions, it is also possible to declare local variables
in the parsing routine, which can then be used in the actions.
Such a declaration consists of a number of C variable declarations,
enclosed in the brackets "{" and "}". It must be placed
right in front of the ":" in the grammar rule.
The scope of these local variables consists of the complete
grammar rule.
.PP
In order to facilitate communication between the actions and
\fILLparse\fR,
the parsing routines can be given C-like parameters. So, for example
.nf
.ft CW
.sp 1
expr(int *pval;) { int fact; } :
        /*
         * Rule with one parameter, a pointer to an int.
         * Parameter specifications are ordinary C declarations.
         * One local variable, of type int.
         */
        factor (&fact) { *pval = fact; }
        /*
         * factor is another nonterminal symbol.
         * One actual parameter is supplied.
         * Notice that the parameter passing mechanism is that
         * of C.
         */
        [ '+' factor (&fact) { *pval += fact; } ]*
        /*
         * remember the '*' means zero or more times
         */
;
.sp 1
.ft R
.fi
is a rule to recognize a number of factors, separated by "+", and
to compute their sum.
.PP
\fILLgen\fR
generates C code, so the parameter passing mechanism is that of
C, as is shown in the example above.
.PP
Actions often manipulate attributes of the token just read.
For instance, when an identifier is read, its name must be
looked up in a symbol table.
Therefore, \fILLgen\fR generates code
such that at a number of places in the grammar rule
it is defined which token has last been read.
After a token, the last token read is this token.
After a "[" or a "|", the last token read is the next token to
be accepted by \fILLparse\fR.
At all other places, it is undefined which token has last been
read.
The last token read is available in the global integer variable
\fILLsymb\fR.
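For instance, in the rule
.sp 1
.nf
.ft CW
addop(int *op;) :
        [ '+' { *op = LLsymb; } /* after the token '+', LLsymb == '+' */
        | '-' { *op = LLsymb; } /* after the token '-', LLsymb == '-' */
        ]
;
.ft R
.fi
.sp 1
(an invented example) each action directly follows a token, so
\fILLsymb\fR is guaranteed to hold that token.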
.PP
The user may also specify C-code wherever a \fILLgen\fR-declaration is
legal.
Again, this code must be enclosed in the brackets "{" and "}".
This way, the user can define global declarations and
C-functions.
To avoid name conflicts with identifiers generated by
\fILLgen\fR, \fILLparse\fR only uses names beginning with
"LL"; the user should therefore avoid such names.
.NH
Error Recovery
.PP
The error recovery technique used by \fILLgen\fR is a
modification of the one presented in reference
.[ (
automatic construction error correcting
.]).
It is based on \fBdefault choices\fR, which are just
what the name says: default choices at
every point in the grammar where there is a
choice.
Thus, in an alternation, one of the productions is marked as a
default choice, and in a term with a non-fixed repetition
specification there will also be a default choice (between
doing the term (once more) and continuing with the rest of the
production in which the term appears).
.PP
When \fILLparse\fR detects an error after having parsed the
string @s@, the default choices enable it to compute one
syntactically correct continuation,
consisting of the tokens @t sub 1~...~t sub n@,
such that @s~t sub 1~...~t sub n@ is a string of tokens that
is a member of the language defined by the grammar.
Notice that the computation of this continuation must
terminate, which implies that the default choices may not
invoke recursive rules.
.PP
At each point in this continuation, a certain number of other
tokens could also be syntactically correct; for instance, the token
@t@ is syntactically correct at point @t sub i@ in this
continuation if the string @s~t sub 1~...~t sub i~t~s sub 1@
is a string of the language defined by the grammar for some
string @s sub 1@ and @i >= 0@.
.PP
The set @T@
containing all these tokens (including @t sub 1 ,~...,~t sub n@) is computed.
Next, \fILLparse\fR discards zero
or more tokens from its input, until a token
@t@ \(mo @T@ is found.
The error is then corrected by inserting @i@ (@i >= 0@) tokens
@t sub 1~...~t sub i@, such that the string
@s~t sub 1~...~t sub i~t~s sub 1@ is a string of the language
defined by the grammar, for some @s sub 1@.
Then, normal parsing is resumed.
.PP
The above is difficult to implement in a recursive descent
parser, and is not the way \fILLparse\fR does it, but the
effect is the same. In fact, \fILLparse\fR maintains a list
of tokens that may not be discarded, which is adjusted as
\fILLparse\fR proceeds. This list is just a representation
of the set @T@ mentioned
above. When an error occurs, \fILLparse\fR discards tokens until
a token @t@ that is a member of this list is found.
Then, it continues parsing, following the default choices,
inserting tokens along the way, until this token @t@ is legal.
The selection of
the default choices must guarantee that this will always
happen.
.PP
The default choices are explicitly or implicitly
specified by the user.
By default, the default choice in an alternation is the
alternative with the shortest possible terminal production.
The user can select one of the other productions in the
alternation as the default choice by putting the keyword
"\fB%default\fR" in front of it.
.PP
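For instance, in
.sp 1
.nf
.ft CW
statement : %default assignment
          | conditional
          | compound
          ;
.ft R
.fi
.sp 1
the production "assignment" is the default choice, even when one of the
other alternatives has a shorter terminal production.
The nonterminal names are of course only illustrative.
.PP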
By default, for terms with a repetition count containing "*" or
"?" the default choice is to continue with the rest of the rule
in which the term appears, and
.sp 1
.ft CW
.nf
term+
.fi
.ft R
.sp 1
is treated as
.sp 1
.nf
.ft CW
term term* .
.ft R
.fi
.PP
It is also clear that it can never be the default choice to do
the term (once more), because this could cause the parser to
loop, inserting tokens forever.
However, when the user does not want the parser to skip
tokens that would not have been skipped had the term
been the default choice, this can be prevented by
using the keyword "\fB%persistent\fR".
For instance, the rule
.sp 1
.ft CW
.nf
commandlist : command* ;
.fi
.ft R
.sp 1
could be changed to
.sp 1
.ft CW
.nf
commandlist : [ %persistent command ]* ;
.fi
.ft R
.sp 1
The effects of this in case of a syntax error are twofold:
The set @T@ mentioned above will be extended as if "command" were
in the default production, so that fewer tokens will be
skipped.
Also, if the first token that is not skipped is a member of the
subset of @T@ arising from the grammar rule for "command",
\fILLparse\fR will enter that rule.
So, in fact the default choice
is determined dynamically (by \fILLparse\fR).
Again, \fILLgen\fR checks (statically)
that \fILLparse\fR will always terminate, and if not,
\fILLgen\fR will complain.
.PP
An important property of this error recovery method is that,
once a rule is started, it will be finished.
This means that all actions in the rule will be executed
normally, so that the user can be sure that there will be no
inconsistencies in his data structures because of syntax
errors.
Also, as the method is in fact error correcting, the
actions in a rule only have to deal with syntactically correct
input.
.NH
Ambiguities and conflicts
.PP
As \fILLgen\fR generates a recursive descent parser with no backtrack,
it must at all times be able to determine what to do,
based on the current input symbol.
Unfortunately, this cannot be done for all grammars.
Two kinds of conflicts can arise:
.IP 1) 10
the grammar rule is of the form "production1 | production2",
and \fILLparse\fR cannot decide which production to choose.
This we call an \fBalternation conflict\fR.
.IP 2) 10
the grammar rule is of the form "[ productionrule ]...",
where ... specifies a non-fixed repetition count,
and \fILLparse\fR cannot decide whether to
choose "productionrule" once more, or to continue.
This we call a \fBrepetition conflict\fR.
.PP
There can be several causes for conflicts: the grammar may be
ambiguous, or the grammar may require a more complex parser
than \fILLgen\fR can construct.
The conflicts can be examined by inspecting the output file
produced by the verbose (-\fBv\fR) option.
The conflicts can be resolved by rewriting the grammar
or by using \fBconflict resolvers\fR.
The mechanism described here is based on the attributed parsing
of reference
.[ (
milton
.]).
.PP
An alternation conflict can be resolved by putting an \fBif condition\fR
in front of the first conflicting production.
It consists of a "\fB%if\fR" followed by a
C-expression between parentheses.
\fILLparse\fR will then evaluate this expression whenever a
token is met at this point on which there is a conflict, so
the conflict will be resolved dynamically.
If the expression evaluates to
non-zero, the first conflicting production is chosen,
otherwise one of the remaining ones is chosen.
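For instance, a conflict between two productions that both can start with
an identifier, as between declarations and expressions in C-like languages,
could be resolved as follows; \fIlastidisatype\fR is an imaginary user
routine that consults the symbol table:
.sp 1
.nf
.ft CW
statement : %if (lastidisatype())
            declaration
          | expression ';'
          ;
.ft R
.fi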
.PP
An alternation conflict can also be resolved using the keywords
"\fB%prefer\fR" or "\fB%avoid\fR". "\fB%prefer\fR"
is equivalent in behaviour to
"\fB%if\fR (1)". "\fB%avoid\fR" is equivalent to "\fB%if\fR (0)".
In these cases however, "\fB%prefer\fR" and "\fB%avoid\fR" should be used,
as they resolve the conflict statically and thus
give rise to better C-code.
.PP
A repetition conflict can be resolved by putting a \fBwhile condition\fR
right after the opening square bracket "[". This while condition
consists of a "\fB%while\fR" followed by a C-expression between
parentheses. Again, \fILLparse\fR will then
evaluate this expression whenever a token is met
at this point on which there is a conflict.
If the expression evaluates to non-zero, the
repeating part is chosen, otherwise the parser continues with
the rest of the rule.
Appendix B will give an example of these features.
.PP
A useful aid in writing conflict resolvers is the "\fB%first\fR" keyword.
It is used to declare a C-macro that forms an expression
returning 1 if the parameter supplied can start a specified
nonterminal, e.g.:
.sp 1
.nf
.ft CW
%first fmac, nonterm ;
.ft R
.sp 1
.fi
declares "fmac" as a macro with one parameter, whose value
is a token number. If the parameter
X can start the nonterminal "nonterm", "fmac(X)" is true,
otherwise it is false.
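Such a macro is typically used in a conflict resolver.
A sketch, with invented names:
.sp 1
.nf
.ft CW
%first startsdecl, declaration ;

blockitem : %if (startsdecl(LLsymb))
            declaration
          | statement
          ;
.ft R
.fi
.sp 1
Here the conflict is resolved in favor of "declaration" whenever the
token in \fILLsymb\fR can start a declaration.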
.NH
The LLgen working environment
.PP
\fILLgen\fR generates a number of files: one for each input
file, plus two others, \fILpars.c\fR and \fILpars.h\fR.
\fILpars.h\fR contains "#\ define"s for the token names.
\fILpars.c\fR contains the error recovery routines and tables.
Only those output files that differ from their previous version
are updated. See appendix C for a possible application of this
feature.
.PP
The names of the output files are constructed as
follows:
in the input file name, the suffix after the last point is
replaced by a "c". If no point is present in the input file
name, ".c" is appended to it. \fILLgen\fR checks that the
filename constructed this way in fact represents a previous
version, or does not exist already.
.PP
The user must provide some environment to obtain a complete
program.
Routines called \fImain\fR and \fILLmessage\fR must be defined.
Also, a lexical analyzer must be provided.
.PP
The routine \fImain\fR must be defined, as it must be in every
C-program. It should eventually call one of the start symbol
routines.
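A minimal sketch, assuming the "\fB%start\fR" declaration shown in
section 3:
.sp 1
.nf
.ft CW
main() {
        LLparse();      /* the start symbol routine declared with %start */
        exit(0);
}
.ft R
.fi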
.PP
The routine \fILLmessage\fR must accept one
parameter, whose value is a token number, zero or -1.
.br
A zero parameter indicates that the current token (the one in
the external variable \fILLsymb\fR) is deleted.
.br
A -1 parameter indicates that the parser expected end of file, but didn't get
it.
The parser will then skip tokens until end of file is detected.
.br
A parameter that is a token number (a positive parameter)
indicates that this
token is to be inserted in front of the token currently in
\fILLsymb\fR.
The user can give the token the proper attributes.
Also, the user must take care that the token currently in
\fILLsymb\fR is again returned by the \fBnext\fR call to the
lexical analyzer, with the proper attributes.
So, the lexical analyzer must have a facility to push back one
token.
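.PP
A skeleton for such a routine is given below; a complete version, for a
desk calculator, appears in appendix B.
The re-reading of the current token is only indicated by a comment here,
as it depends on the lexical analyzer used.
.sp 1
.nf
.ft CW
extern int LLsymb;

LLmessage(tk) {
        if (tk > 0) {
                /* The token tk is inserted in front of the token
                 * now in LLsymb: give the inserted token its
                 * attributes here, and arrange for the current
                 * token to be returned once more by the next
                 * call of the lexical analyzer.
                 */
                printf("token %d inserted\en", tk);
        }
        else if (tk == 0) {
                /* The token in LLsymb is deleted */
                printf("token %d deleted\en", LLsymb);
        }
        else {
                /* tk == -1: end of file was expected */
                printf("garbage at end of input\en");
        }
}
.ft R
.fi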
.PP
The user may also supply his own error recovery routines, or handle
errors differently. For this purpose, the name of a routine to be called
when an error occurs may be declared using the keyword \fB%onerror\fR.
This routine takes two parameters.
The first one is either the token number of the
token expected, or 0. In the latter case, the error occurred at a choice.
In both cases, the routine must ensure that the next call to the lexical
analyzer returns the token that replaces the current one. Of course,
that could well be the current one, in which case
.I LLparse
recovers from the error.
The second parameter contains a list of tokens that are not skipped at the
error point. The list is in the form of a null-terminated array of integers,
whose address is passed.
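.PP
A sketch of such a routine; the name "recover" and its contents are
only an illustration:
.sp 1
.nf
.ft CW
%onerror recover ;

{
recover(expected, list) int *list; {
        register int *p;

        if (expected) {
                /* the parser expected this particular token */
        } else {
                /* the error occurred at a choice */
        }
        for (p = list; *p; p++) {
                /* *p is a token that will not be skipped */
        }
        /* Arrange for the next call of the lexical analyzer to
         * return the token that is to replace the current one
         * (possibly the current token itself).
         */
}
}
.ft R
.fi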
.PP
The user must supply a lexical analyzer to read the input stream and
break it up into tokens, which are passed to
.I LLparse.
It should be an integer valued function, returning the token number.
The name of this function can be declared using the
"\fB%lexical\fR" keyword.
This keyword can be used wherever a declaration is legal and may appear
only once in the grammar specification, e.g.:
.sp 1
.nf
.ft CW
%lexical scanner ;
.ft R
.fi
.sp 1
declares "scanner" as the name of the lexical analyzer.
The default name for the lexical analyzer is "yylex".
The reason for this funny name is that a useful tool for constructing
lexical analyzers is the
.I Lex
program,
.[
lex
.]
which generates a routine of that name.
.PP
The token numbers are chosen by \fILLgen\fR.
The token number for a literal
is the numerical value of the character in the local character set.
If the tokens have a name,
the "#\ define" mechanism of C is used to give them a value and
to allow the lexical analyzer to return their token numbers symbolically.
These "#\ define"s are collected in the file \fILpars.h\fR which
can be "#\ include"d in any file that needs the token-names.
.PP
The lexical analyzer must signal the end
of input to \fILLparse\fR
by returning a number less than or equal to zero.
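.PP
As an illustration, a simple hand-written lexical analyzer, named
"scanner" as in the "\fB%lexical\fR" example above, might look like
this. IDENTIFIER and NUMBER are assumed to be declared with
"\fB%token\fR", and the handling of attributes is left out.
.sp 1
.nf
.ft CW
#include <stdio.h>
#include <ctype.h>
#include "Lpars.h"      /* token numbers chosen by LLgen */

scanner() {
        register int c;

        while ((c = getchar()) == ' ' || c == '\et' || c == '\en')
                ;                       /* skip layout */
        if (c == EOF) return 0;         /* <= 0 signals end of input */
        if (isdigit(c)) {
                /* collect the number and set its attribute ... */
                return NUMBER;
        }
        if (isalpha(c)) {
                /* collect the name and set its attribute ... */
                return IDENTIFIER;
        }
        return c;                       /* a literal: the character itself */
}
.ft R
.fi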
.bp
.SH
References
.[
$LIST$
.]
.bp
.SH
Appendix A : LLgen Input Syntax
.PP
This appendix has a description of the \fILLgen\fR input syntax,
as a \fILLgen\fR specification. As a matter of fact, the current
version of \fILLgen\fR is written with \fILLgen\fR.
.nf
.ft CW
.sp 2
/*
 * First the declarations of the terminals
 * The order is not important
 */
%token IDENTIFIER; /* terminal or nonterminal name */
%token NUMBER;
%token LITERAL;
/*
 * Reserved words
 */
%token TOKEN; /* %token */
%token START; /* %start */
%token PERSISTENT; /* %persistent */
%token IF; /* %if */
%token WHILE; /* %while */
%token AVOID; /* %avoid */
%token PREFER; /* %prefer */
%token DEFAULT; /* %default */
%token LEXICAL; /* %lexical */
%token ONERROR; /* %onerror */
%token FIRST; /* %first */
/*
 * Declare LLparse to be a C-routine that recognizes "specification"
 */
%start LLparse, specification;

specification
        : declaration*
        ;

declaration
        : START
          IDENTIFIER ',' IDENTIFIER
          ';'
        | '{'
          /* Read C-declaration here */
          '}'
        | TOKEN
          IDENTIFIER
          [ ',' IDENTIFIER ]*
          ';'
        | FIRST
          IDENTIFIER ',' IDENTIFIER
          ';'
        | LEXICAL
          IDENTIFIER
          ';'
        | ONERROR
          IDENTIFIER
          ';'
        | rule
        ;

rule : IDENTIFIER parameters? ldecl?
          ':' productions
          ';'
        ;

ldecl : '{'
          /* Read C-declaration here */
          '}'
        ;

productions
        : simpleproduction
          [ '|' DEFAULT? simpleproduction ]*
        ;

simpleproduction
        : [ IF '(' /* Read C-expression here */ ')'
          | PREFER
          | AVOID
          ]?
          [ element repeats ]*
        ;

element : '{'
          /* Read action here */
          '}'
        | '[' [ WHILE '(' /* Read C-expression here */ ')' ]?
          PERSISTENT?
          productions
          ']'
        | LITERAL
        | IDENTIFIER parameters?
        ;

parameters
        : '(' /* Read C-parameters here */ ')'
        ;

repeats : /* empty */
        | [ '*' | '+' ] NUMBER?
        | NUMBER
        | '?'
        ;
.fi
.ft R
.bp
.SH
Appendix B : An example
.PP
This example gives the complete \fILLgen\fR specification of a simple
desk calculator. It has 26 registers, labeled "a" through "z",
and accepts arithmetic expressions made up of the C operators
+, -, *, /, %, &, and |, with their usual priorities.
The value of the expression is
printed. As in C, an integer that begins with 0 is assumed to
be octal; otherwise it is assumed to be decimal.
.PP
Although the example is short and not very complicated, it
demonstrates the use of if and while conditions. In
the example they are in fact used to reduce the number of
nonterminals, and to reduce the overhead due to the recursion
that would be involved in parsing an expression with an
ordinary recursive descent parser. In an ordinary LL(1)
grammar there would be one nonterminal for each operator
priority. The example shows how we can do it all with one
nonterminal, no matter how many priority levels there are.
.sp 1
.nf
.ft CW
{
#include <stdio.h>
#include <ctype.h>

#define MAXPRIO 5
#define prio(op) (ptab[op])

struct token {
        int t_tokno; /* token number */
        int t_tval; /* Its attribute */
} stok = { 0,0 }, tok;
int nerrors = 0;
int regs[26]; /* Space for the registers */
int ptab[128]; /* Attribute table */

struct token
nexttok() { /* Read next token and return it */
        register c;
        struct token new;

        while ((c = getchar()) == ' ' || c == '\et') { /* nothing */ }
        if (isdigit(c)) new.t_tokno = DIGIT;
        else if (islower(c)) new.t_tokno = IDENT;
        else new.t_tokno = c;
        if (c >= 0) new.t_tval = ptab[c];
        return new;
} }
%token DIGIT, IDENT;
%start parse, list;

list : stat* ;

stat { int ident, val; } :
          %if (stok = nexttok(),
               stok.t_tokno == '=')
          /* The conflict is resolved by looking one further
           * token ahead. The grammar is LL(2)
           */
          IDENT
                { ident = tok.t_tval; }
          '=' expr(1,&val) '\en'
                { if (!nerrors) regs[ident] = val; }
        | expr(1,&val) '\en'
                { if (!nerrors) printf("%d\en",val); }
        | '\en'
        ;

expr(int level, *val;) { int expr; } :
          factor(val)
          [ %while (prio(tok.t_tokno) >= level)
            /* Swallow operators as long as their priority is
             * larger than or equal to the level of this invocation
             */
            '+' expr(prio('+')+1,&expr)
                { *val += expr; }
            /* This states that '+' groups left to right. If it
             * should group right to left, the rule should read:
             * '+' expr(prio('+'),&expr)
             */
          | '-' expr(prio('-')+1,&expr)
                { *val -= expr; }
          | '*' expr(prio('*')+1,&expr)
                { *val *= expr; }
          | '/' expr(prio('/')+1,&expr)
                { *val /= expr; }
          | '%' expr(prio('%')+1,&expr)
                { *val %= expr; }
          | '&' expr(prio('&')+1,&expr)
                { *val &= expr; }
          | '|' expr(prio('|')+1,&expr)
                { *val |= expr; }
          ]*
          /* Notice the "*" here. It is important.
           */
        ;

factor(int *val;):
        | '(' expr(1,val) ')'
        | '-' expr(MAXPRIO+1,val)
                { *val = -*val; }
        | number(val)
        | IDENT
                { *val = regs[tok.t_tval]; }
        ;

number(int *val;) { int base; }
        : DIGIT
                { base = (*val=tok.t_tval)==0?8:10; }
          [ DIGIT
                { *val = base * *val + tok.t_tval; }
          ]* ;
%lexical scanner ;
{
scanner() {
        if (stok.t_tokno) { /* a token has been inserted or read ahead */
                tok = stok;
                stok.t_tokno = 0;
                return tok.t_tokno;
        }
        if (nerrors && tok.t_tokno == '\en') {
                printf("ERROR\en");
                nerrors = 0;
        }
        tok = nexttok();
        return tok.t_tokno;
}

LLmessage(insertedtok) {
        nerrors++;
        if (insertedtok > 0) { /* token inserted, save old token */
                stok = tok;
                tok.t_tval = 0;
                if (insertedtok < 128) tok.t_tval = ptab[insertedtok];
        }
}

main() {
        register *p;

        for (p = ptab; p < &ptab[128]; p++) *p = 0;
        /* for letters, their attribute is their index in the regs array */
        for (p = &ptab['a']; p <= &ptab['z']; p++) *p = p - &ptab['a'];
        /* for digits, their attribute is their value */
        for (p = &ptab['0']; p <= &ptab['9']; p++) *p = p - &ptab['0'];
        /* for operators, their attribute is their priority */
        ptab['*'] = 4;
        ptab['/'] = 4;
        ptab['%'] = 4;
        ptab['+'] = 3;
        ptab['-'] = 3;
        ptab['&'] = 2;
        ptab['|'] = 1;
        parse();
        exit(nerrors);
} }
.fi
.ft R
.bp
.SH
Appendix C : How to use \fILLgen\fR with \fImake\fR
.PP
This appendix demonstrates how \fILLgen\fR can be used in
combination with the \fImake\fR program, to make effective use
of the \fILLgen\fR-feature that it only changes output files
when necessary. \fIMake\fR uses a "makefile", which
is a file containing dependencies and associated commands.
A dependency usually indicates that some files depend on other
files. When a file depends on another file and is older than
that other file, the commands associated with the dependency
are executed.
.PP
So, \fImake\fR seems just the program that we always wanted.
However, it
is not very good at handling programs that generate more than
one file.
As usual, there is a way around this problem.
A sample makefile follows:
.sp 1
.ft CW
.nf
# The grammar consists of the files decl.g, stat.g and expr.g.
# The ".o"-files are the result of a C-compilation.
GFILES = decl.g stat.g expr.g
OFILES = decl.o stat.o expr.o Lpars.o
LLOPT =
# As make doesn't handle programs that generate more than one
# file well, we just don't tell make about it.
# We just create a dummy file, and touch it whenever LLgen is
# executed. This way, the dummy in fact depends on the grammar
# files.
# Then, we execute make again, to do the C-compilations and
# such.
all: dummy
	make parser
dummy: $(GFILES)
	LLgen $(LLOPT) $(GFILES)
	touch dummy
parser: $(OFILES)
	$(CC) -o parser $(LDFLAGS) $(OFILES)
# Some dependencies without actions:
# make already knows what to do about them
Lpars.o: Lpars.h
stat.o: Lpars.h
decl.o: Lpars.h
expr.o: Lpars.h
.fi
.ft R