| .\"	$Id$
 | |
| .\"	Run this paper off with
 | |
| .\"	refer [options] -p LLgen.refs LLgen.doc | [n]eqn | tbl | (nt)roff -ms
 | |
| .if '\*(>.'' \{\
 | |
| .	if '\*(<.'' \{\
 | |
| .		if n .ds >. .
 | |
| .		if n .ds >, ,
 | |
| .		if t .ds <. .
 | |
| .		if t .ds <, ,\
 | |
| \}\
 | |
| \}
 | |
| .cs 5 22u
 | |
| .ND
 | |
| .EQ
 | |
| delim @@
 | |
| .EN
 | |
| .TL
 | |
| LLgen, an extended LL(1) parser generator
 | |
| .AU
 | |
| Ceriel J. H. Jacobs
 | |
| .AI
 | |
| Dept. of Mathematics and Computer Science
 | |
| Vrije Universiteit
 | |
| Amsterdam, The Netherlands
 | |
| .AB
 | |
| \fILLgen\fR provides a
 | |
| tool for generating an efficient recursive descent parser
 | |
| with no backtrack from
 | |
| an Extended Context Free syntax.
 | |
| The \fILLgen\fR
 | |
| user specifies the syntax, together with code
 | |
| describing actions associated with the parsing process.
 | |
| \fILLgen\fR
 | |
| turns this specification into a number of subroutines that handle the
 | |
| parsing process.
 | |
| .PP
 | |
| The grammar may be ambiguous.
 | |
| \fILLgen\fR contains both static and dynamic facilities
 | |
| to resolve these ambiguities.
 | |
| .PP
 | |
| The specification can be split into several files, for each of
 | |
| which \fILLgen\fR generates an output file containing the
 | |
| corresponding part of the parser.
 | |
| Furthermore, only output files that differ from their previous
 | |
| version are updated.
 | |
| Other output files are not affected in any
 | |
| way.
 | |
| This allows the user to recompile only those output files that have
 | |
| changed.
 | |
| .PP
 | |
| The subroutine produced by \fILLgen\fR calls a user supplied routine
 | |
| that must return the next token. This way, the input to the
 | |
| parser can be split into single characters or higher level
 | |
| tokens.
 | |
| .PP
 | |
| An error recovery mechanism is generated almost completely
 | |
| automatically.
 | |
| It is based on so called \fBdefault choices\fR, which are
 | |
| implicitly or explicitly specified by the user.
 | |
| .PP
 | |
| \fILLgen\fR has successfully been used to create recognizers for
 | |
| Pascal, C, and Modula-2.
 | |
| .AE
 | |
| .NH
 | |
| Introduction
 | |
| .PP
 | |
| \fILLgen\fR
 | |
| provides a tool for generating an efficient recursive
 | |
| descent parser with no backtrack from an Extended Context Free
 | |
| syntax.
 | |
| A parser generated by
 | |
| \fILLgen\fR
 | |
| will be called
 | |
| \fILLparse\fR
 | |
| for the rest of this document.
 | |
| It is assumed that the reader has some knowledge of LL(1) grammars and
 | |
| recursive descent parsers.
 | |
| For a survey on the subject, see reference
 | |
| .[ (
 | |
| griffiths
 | |
| .]).
 | |
| .PP
 | |
| Extended LL(1) parsers are an extension of LL(1) parsers. They are
 | |
| derived from an Extended Context-Free (ECF) syntax instead of a Context-Free
 | |
| (CF) syntax.
 | |
| ECF syntax is described in section 2.
 | |
| Section 3 provides an outline of a
 | |
| specification as accepted by
 | |
| \fILLgen\fR and also discusses the lexical conventions of
 | |
| grammar specification files.
 | |
| Section 4 provides a description of the way the
 | |
| \fILLgen\fR
 | |
| user can associate
 | |
| actions with the syntax. These actions must be written in the programming
 | |
| language C,
 | |
| .[
 | |
| kernighan ritchie
 | |
| .]
 | |
| which also is the target language of \fILLgen\fR.
 | |
| The error recovery technique is discussed in section 5.
 | |
| This section also discusses how the user can influence the recovery.
 | |
| Section 6 discusses
 | |
| the facilities \fILLgen\fR offers
 | |
| to resolve ambiguities and conflicts.
 | |
| \fILLgen\fR offers facilities to resolve them both at parser
 | |
| generation time and during the execution of \fILLparse\fR.
 | |
| Section 7 discusses the
 | |
| \fILLgen\fR
 | |
| working environment.
 | |
| It also discusses the lexical analyzer that must be supplied by the
 | |
| user.
 | |
| This lexical analyzer must read the input stream and break it
 | |
| up into basic input items, called \fBtokens\fR for the rest of
 | |
| this document.
 | |
| Appendix A gives a summary of the
 | |
| \fILLgen\fR
 | |
| input syntax.
 | |
| Appendix B gives an example.
 | |
| It is very instructive to compare this example with the one
 | |
| given in reference
 | |
| .[ (
 | |
| yacc
 | |
| .]).
 | |
| It demonstrates the struggle \fILLparse\fR and other LL(1)
 | |
| parsers have with expressions.
 | |
| Appendix C gives an example of the \fILLgen\fR features
 | |
| allowing the user to recompile only those output files that
 | |
| have changed, using the \fImake\fR program.
 | |
| .[
 | |
| make
 | |
| .]
 | |
| .NH
 | |
| The Extended Context-Free Syntax
 | |
| .PP
 | |
| The extensions of an ECF syntax with respect to an ordinary CF syntax are:
 | |
| .IP 1. 10
 | |
| An ECF syntax contains the repetition operator: "N" (N represents a positive
 | |
| integer).
 | |
| .IP 2. 10
 | |
| An ECF syntax contains the closure set operator without and with
 | |
| upperbound: "*" and "*N".
 | |
| .IP 3. 10
 | |
| An ECF syntax contains the positive closure set operator without and with
 | |
| upperbound: "+" and "+N".
 | |
| .IP 4. 10
 | |
| An ECF syntax contains the optional operator: "?", which is a
 | |
| shorthand for "*1".
 | |
| .IP 5. 10
 | |
| An ECF syntax contains parentheses "[" and "]" which can be
 | |
| used for grouping.
 | |
| .PP
 | |
| The syntax of an ECF syntax can itself be described with an ECF syntax:
 | |
| .DS
 | |
| .ft CW
 | |
| grammar         : rule +
 | |
|                 ;
 | |
| .ft R
 | |
| .DE
 | |
| This grammar rule states that a grammar consists of one or more
 | |
| rules.
 | |
| .DS
 | |
| .ft CW
 | |
| rule            : nonterminal ':' productionrule ';'
 | |
|                 ;
 | |
| .ft R
 | |
| .DE
 | |
| A rule consists of a left hand side, the nonterminal,
 | |
| followed by ":",
 | |
| the \fBproduce symbol\fR, followed by a production rule, followed by a
 | |
| ";", in\%di\%ca\%ting the end of the rule.
 | |
| .DS
 | |
| .ft CW
 | |
| productionrule  : production [ '|' production ]*
 | |
|                 ;
 | |
| .ft R
 | |
| .DE
 | |
| A production rule consists of one or
 | |
| more alternative productions separated by "|". This symbol is called the
 | |
| \fBalternation symbol\fR.
 | |
| .DS
 | |
| .ft CW
 | |
| production      : term *
 | |
|                 ;
 | |
| .ft R
 | |
| .DE
 | |
| A production consists of a possibly empty list of terms.
 | |
| So, empty productions are allowed.
 | |
| .DS
 | |
| .ft CW
 | |
| term            : element repeats
 | |
|                 ;
 | |
| .ft R
 | |
| .DE
 | |
| A term is an element, possibly with a repeat specification.
 | |
| .DS
 | |
| .ft CW
 | |
| element         : LITERAL
 | |
|                 | IDENTIFIER
 | |
|                 | '[' productionrule ']'
 | |
|                 ;
 | |
| .ft R
 | |
| .DE
 | |
| An element can be a LITERAL, which basically is a single character
 | |
| between apostrophes; an IDENTIFIER, which is either a
 | |
| nonterminal or a token; or a production rule
 | |
| between square parentheses.
 | |
| .DS
 | |
| .ft CW
 | |
| repeats         : '?'
 | |
|                 | [ '*' | '+' ] NUMBER ?
 | |
|                 | NUMBER ?
 | |
|                 ;
 | |
| .ft R
 | |
| .DE
 | |
| These are the repeat specifications discussed above. Notice that
 | |
| this specification may be empty.
 | |
| .PP
 | |
| The class of ECF languages
 | |
| is identical with the class of CF languages. However, in many
 | |
| cases recursive definitions of language features can now be
 | |
| replaced by iterative ones. This tends to reduce the number of
 | |
| nonterminals and gives rise to very efficient recursive descent
 | |
| parsers.
 | |
| .NH
 | |
| Grammar Specifications
 | |
| .PP
 | |
| The major part of a
 | |
| \fILLgen\fR
 | |
| grammar specification consists of an
 | |
| ECF syntax specification.
 | |
| Names in this syntax specification refer to either tokens or nonterminal
 | |
| symbols.
 | |
| \fILLgen\fR
 | |
| requires token names to be declared as such. This prevents
 | |
| a typing error in a nonterminal name from being silently
 | |
| accepted as a token name. The token declarations will be
 | |
| discussed later.
 | |
| A name will be regarded as a nonterminal symbol, unless it is declared
 | |
| as a token name.
 | |
| If there is no production rule for a nonterminal symbol, \fILLgen\fR
 | |
| will complain.
 | |
| .PP
 | |
| A grammar specification may also include some C routines,
 | |
| for instance the lexical analyzer and an error reporting
 | |
| routine.
 | |
| Thus, a grammar specification file can contain declarations,
 | |
| grammar rules and C-code.
 | |
| .PP
 | |
| Blanks, tabs and newlines are ignored, but may not appear in names or
 | |
| keywords.
 | |
| Comments may appear wherever a name is legal (which is almost
 | |
| everywhere).
 | |
| They are enclosed in
 | |
| /* ... */, as in C. Comments do not nest.
 | |
| .PP
 | |
| Names may be of arbitrary length, and can be made up of letters, underscore
 | |
| "\_" and non-initial digits. Upper and lower case letters are distinct.
 | |
| Only the first 50 characters are significant.
 | |
| Notice however, that the names for the tokens will be used by the
 | |
| C-preprocessor.
 | |
| The number of significant characters therefore depends on the
 | |
| underlying C-implementation.
 | |
| A safe rule is to make the identifiers distinct in the first six
 | |
| characters, case ignored.
 | |
| .PP
 | |
| There are two kinds of tokens:
 | |
| those that are declared and are denoted by a name,
 | |
| and literals.
 | |
| .PP
 | |
| A literal consists of a character enclosed in apostrophes "'".
 | |
| The "\e" is an escape character within literals. The following escapes
 | |
| are recognized:
 | |
| .TS
 | |
| center;
 | |
| l l.
 | |
| \&'\en'	newline
 | |
| \&'\er'	return
 | |
| \&'\e''	apostrophe "'"
 | |
| \&'\e\e'	backslash "\e"
 | |
| \&'\et'	tab
 | |
| \&'\eb'	backspace
 | |
| \&'\ef'	form feed
 | |
| \&'\exxx'	the character with octal code "xxx"
 | |
| .TE
 | |
| .PP
 | |
| Names representing tokens must be declared before they are used.
 | |
| This can be done using the "\fB%token\fR" keyword,
 | |
| by writing
 | |
| .nf
 | |
| .ft CW
 | |
| .sp 1
 | |
| %token  name1, name2, . . . ;
 | |
| .ft R
 | |
| .fi
 | |
| .PP
 | |
| \fILLparse\fR is designed to recognize special nonterminal
 | |
| symbols called \fBstart symbols\fR.
 | |
| \fILLgen\fR allows for more than one start symbol.
 | |
| Thus, grammars with more than one entry point are accepted.
 | |
| The start symbols must be declared explicitly using the
 | |
| "\fB%start\fR" keyword. It can be used whenever a declaration is
 | |
| legal, for instance:
 | |
| .nf
 | |
| .ft CW
 | |
| .sp 1
 | |
| %start LLparse, specification ;
 | |
| .ft R
 | |
| .fi
 | |
| .sp 1
 | |
| declares "specification" as a start symbol and associates the
 | |
| identifier "LLparse" with it.
 | |
| "LLparse" will now be the name of the C-function that must be
 | |
| called to recognize "specification".
 | |
| .NH
 | |
| Actions
 | |
| .PP
 | |
| \fILLgen\fR
 | |
| allows arbitrary insertions of actions within the right hand side
 | |
| of a production rule in the ECF syntax. An action consists of a number of C
 | |
| statements, enclosed in the brackets "{" and "}".
 | |
| .PP
 | |
| \fILLgen\fR
 | |
| generates a parsing routine for each rule in the grammar. The actions
 | |
| supplied by the user are just inserted in the proper place.
 | |
| There may also be declarations before the statements in the
 | |
| action, as
 | |
| the "{" and "}" are copied into the target code along with the
 | |
| action. The scope of these declarations terminates with the
 | |
| closing bracket "}" of the action.
 | |
| .PP
 | |
| In addition to actions, it is also possible to declare local variables
 | |
| in the parsing routine, which can then be used in the actions.
 | |
| Such a declaration consists of a number of C variable declarations,
 | |
| enclosed in the brackets "{" and "}". It must be placed
 | |
| right in front of the ":" in the grammar rule.
 | |
| The scope of these local variables consists of the complete
 | |
| grammar rule.
 | |
| .PP
 | |
| In order to facilitate communication between the actions and
 | |
| \fILLparse\fR,
 | |
| the parsing routines can be given C-like parameters.
 | |
| Each parameter must be declared separately, and each of these declarations must
 | |
| end with a semicolon.
 | |
| For the last parameter, the semicolon is optional.
 | |
| .PP
 | |
| So, for example
 | |
| .nf
 | |
| .ft CW
 | |
| .sp 1
 | |
| expr(int *pval;) { int fact; } :
 | |
|                 /*
 | |
|                  * Rule with one parameter, a pointer to an int.
 | |
|                  * Parameter specifications are ordinary C declarations.
 | |
|                  * One local variable, of type int.
 | |
|                  */
 | |
|         factor (&fact)          { *pval = fact; }
 | |
|                 /*
 | |
|                  * factor is another nonterminal symbol.
 | |
|                  * One actual parameter is supplied.
 | |
|                  * Notice that the parameter passing mechanism is that
 | |
|                  * of C.
 | |
|                  */
 | |
|         [ '+' factor (&fact)    { *pval += fact; } ]*
 | |
|                 /*
 | |
|                  * remember the '*' means zero or more times
 | |
|                  */
 | |
|         ;
 | |
| .sp 1
 | |
| .ft R
 | |
| .fi
 | |
| is a rule to recognize a number of factors, separated by "+", and
 | |
| to compute their sum.
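As an illustration of how such a rule maps onto a parsing routine, here is a minimal hand-written sketch in C. It is not the code that \fILLgen\fR generates. It assumes that a factor is a single decimal digit, that the input is a character string, and that no error recovery is needed; the names "look", "advance" and "parse_sum" are hypothetical.

```c
#include <assert.h>
#include <ctype.h>

static const char *src;   /* hypothetical input text */
static int look;          /* look-ahead character, 0 at end of input */

/* Move to the next input character. */
static void advance(void)
{
    look = (unsigned char) *src;
    if (look)
        src++;
}

/* factor: a single digit (an assumption made for this sketch). */
static void factor(int *pval)
{
    assert(isdigit(look));      /* the sketch has no error recovery */
    *pval = look - '0';
    advance();
}

/* expr(int *pval;) { int fact; } : factor(&fact) [ '+' factor(&fact) ]* ; */
static void expr(int *pval)
{
    int fact;                   /* the local variable of the rule */

    factor(&fact);
    *pval = fact;
    while (look == '+') {       /* the '*' repetition becomes a loop */
        advance();
        factor(&fact);
        *pval += fact;
    }
}

/* Parse a sum of digits and return its value. */
int parse_sum(const char *text)
{
    int value;

    src = text;
    advance();
    expr(&value);
    return value;
}
```

The repetition is handled by iteration instead of recursion, and the result is passed back through a pointer parameter, as in the rule above.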
 | |
| .PP
 | |
| \fILLgen\fR
 | |
| generates C code, so the parameter passing mechanism is that of
 | |
| C, as is shown in the example above.
 | |
| .PP
 | |
| Actions often manipulate attributes of the token just read.
 | |
| For instance, when an identifier is read, its name must be
 | |
| looked up in a symbol table.
 | |
| Therefore, \fILLgen\fR generates code
 | |
| such that at a number of places in the grammar rule
 | |
| it is defined which token has last been read.
 | |
| After a token, the last token read is this token.
 | |
| After a "[" or a "|", the last token read is the next token to
 | |
| be accepted by \fILLparse\fR.
 | |
| At all other places, it is undefined which token has last been
 | |
| read.
 | |
| The last token read is available in the global integer variable
 | |
| \fILLsymb\fR.
 | |
| .PP
 | |
| The user may also specify C-code wherever a \fILLgen\fR-declaration is
 | |
| legal.
 | |
| Again, this code must be enclosed in the brackets "{" and "}".
 | |
| This way, the user can define global declarations and
 | |
| C-functions.
 | |
| To avoid name-conflicts with identifiers generated by
 | |
| \fILLgen\fR, \fILLparse\fR only uses names beginning with
 | |
| "LL"; the user should avoid such names.
 | |
| .NH
 | |
| Error Recovery
 | |
| .PP
 | |
| The error recovery technique used by \fILLgen\fR is a
 | |
| modification of the one presented in reference
 | |
| .[ (
 | |
| automatic construction error correcting
 | |
| .]).
 | |
| It is based on \fBdefault choices\fR: at
 | |
| every point in the grammar where there is a
 | |
| choice, one of the possibilities is designated
 | |
| as the default.
 | |
| Thus, in an alternation, one of the productions is marked as a
 | |
| default choice, and in a term with a non-fixed repetition
 | |
| specification there will also be a default choice (between
 | |
| doing the term (once more) and continuing with the rest of the
 | |
| production in which the term appears).
 | |
| .PP
 | |
| When \fILLparse\fR detects an error after having parsed the
 | |
| string @s@, the default choices enable it to compute one
 | |
| syntactically correct continuation,
 | |
| consisting of the tokens @t sub 1~...~t sub n@,
 | |
| such that @s~t sub 1~...~t sub n@ is a string of tokens that
 | |
| is a member of the language defined by the grammar.
 | |
| Notice that the computation of this continuation must
 | |
| terminate, which implies that the default choices may not
 | |
| invoke recursive rules.
 | |
| .PP
 | |
| At each point in this continuation, a certain number of other
 | |
| tokens could also be syntactically correct; for instance, the token
 | |
| @t@ is syntactically correct at point @t sub i@ in this
 | |
| continuation, if the string @s~t sub 1~...~t sub i~t~s sub 1@
 | |
| is a string of the language defined by the grammar for some
 | |
| string @s sub 1@ and i >= 0.
 | |
| .PP
 | |
| The set @T@
 | |
| containing all these tokens (including @t sub 1 ,~...,~t sub n@) is computed.
 | |
| Next, \fILLparse\fR discards zero
 | |
| or more tokens from its input, until a token
 | |
| @t@ \(mo @T@ is found.
 | |
| The error is then corrected by inserting i (i >= 0) tokens
 | |
| @t sub 1~...~t sub i@, such that the string
 | |
| @s~t sub 1~...~t sub i~t~s sub 1@ is a string of the language
 | |
| defined by the grammar, for some @s sub 1@.
 | |
| Then, normal parsing is resumed.
 | |
| .PP
 | |
| The above is difficult to implement in a recursive descent
 | |
| parser, and is not the way \fILLparse\fR does it, but the
 | |
| effect is the same. In fact, \fILLparse\fR maintains a list
 | |
| of tokens that may not be discarded, which is adjusted as
 | |
| \fILLparse\fR proceeds. This list is just a representation
 | |
| of the set @T@ mentioned
 | |
| above. When an error occurs, \fILLparse\fR discards tokens until
 | |
| a token @t@ that is a member of this list is found.
 | |
| Then, it continues parsing, following the default choices,
 | |
| inserting tokens along the way, until this token @t@ is legal.
 | |
| The selection of
 | |
| the default choices must guarantee that this will always
 | |
| happen.
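The token-discarding step can be sketched in C as follows. This is only a model of the behaviour described above, not the code used by \fILLparse\fR. It assumes that the input is available as an array of token numbers and that the list of tokens that may not be discarded is a zero-terminated array; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Return nonzero if tok occurs in the zero-terminated list. */
static int in_list(int tok, const int *list)
{
    for (; *list != 0; list++)
        if (*list == tok)
            return 1;
    return 0;
}

/*
 * Discard tokens, starting at input[pos], until a token is found that
 * is a member of no_skip, or until the input is exhausted.
 * Returns the index of the first token that is kept (n if none).
 */
size_t skip_to_acceptable(const int *input, size_t n, size_t pos,
                          const int *no_skip)
{
    assert(pos <= n);
    while (pos < n && !in_list(input[pos], no_skip))
        pos++;
    return pos;
}
```

After this step the parser would follow the default choices, inserting tokens, until the token that was kept becomes legal.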
 | |
| .PP
 | |
| The default choices are explicitly or implicitly
 | |
| specified by the user.
 | |
| By default, the default choice in an alternation is the
 | |
| alternative with the shortest possible terminal production.
 | |
| The user can select one of the other productions in the
 | |
| alternation as the default choice by putting the keyword
 | |
| "\fB%default\fR" in front of it.
 | |
| .PP
 | |
| By default, for terms with a repetition count containing "*" or
 | |
| "?" the default choice is to continue with the rest of the rule
 | |
| in which the term appears, and
 | |
| .sp 1
 | |
| .ft CW
 | |
| .nf
 | |
|                 term+
 | |
| .fi
 | |
| .ft R
 | |
| .sp 1
 | |
| is treated as
 | |
| .sp 1
 | |
| .nf
 | |
| .ft CW
 | |
|                 term term* .
 | |
| .ft R
 | |
| .fi
 | |
| .PP
 | |
| It is also clear that it can never be the default choice to do
 | |
| the term (once more), because this could cause the parser to
 | |
| loop, inserting tokens forever.
 | |
| However, when the user does not want the parser to skip
 | |
| tokens that would not have been skipped if the term
 | |
| had been the default choice,
 | |
| the skipping of such a term can be prevented by
 | |
| using the keyword "\fB%persistent\fR".
 | |
| For instance, the rule
 | |
| .sp 1
 | |
| .ft CW
 | |
| .nf
 | |
| commandlist : command* ;
 | |
| .fi
 | |
| .ft R
 | |
| .sp 1
 | |
| could be changed to
 | |
| .sp 1
 | |
| .ft CW
 | |
| .nf
 | |
| commandlist : [ %persistent command ]* ;
 | |
| .fi
 | |
| .ft R
 | |
| .sp 1
 | |
| The effects of this in case of a syntax error are twofold:
 | |
| The set @T@ mentioned above will be extended as if "command" were
 | |
| in the default production, so that fewer tokens will be
 | |
| skipped.
 | |
| Also, if the first token that is not skipped is a member of the
 | |
| subset of @T@ arising from the grammar rule for "command",
 | |
| \fILLparse\fR will enter that rule.
 | |
| So, in fact the default choice
 | |
| is determined dynamically (by \fILLparse\fR).
 | |
| Again, \fILLgen\fR checks (statically)
 | |
| that \fILLparse\fR will always terminate, and if not,
 | |
| \fILLgen\fR will complain.
 | |
| .PP
 | |
| An important property of this error recovery method is that,
 | |
| once a rule is started, it will be finished.
 | |
| This means that all actions in the rule will be executed
 | |
| normally, so that the user can be sure that there will be no
 | |
| inconsistencies in the data structures because of syntax
 | |
| errors.
 | |
| Also, as the method is in fact error correcting, the
 | |
| actions in a rule only have to deal with syntactically correct
 | |
| input.
 | |
| .NH
 | |
| Ambiguities and conflicts
 | |
| .PP
 | |
| As \fILLgen\fR generates a recursive descent parser with no backtrack,
 | |
| it must at all times be able to determine what to do,
 | |
| based on the current input symbol.
 | |
| Unfortunately, this cannot be done for all grammars.
 | |
| Two kinds of conflicts can arise:
 | |
| .IP 1) 10
 | |
| the grammar rule is of the form "production1 | production2",
 | |
| and \fILLparse\fR cannot decide which production to choose.
 | |
| This we call an \fBalternation conflict\fR.
 | |
| .IP 2) 10
 | |
| the grammar rule is of the form "[ productionrule ]...",
 | |
| where ... specifies a non-fixed repetition count,
 | |
| and \fILLparse\fR cannot decide whether to
 | |
| choose "productionrule" once more, or to continue.
 | |
| This we call a \fBrepetition conflict\fR.
 | |
| .PP
 | |
| There can be several causes for conflicts: the grammar may be
 | |
| ambiguous, or the grammar may require a more complex parser
 | |
| than \fILLgen\fR can construct.
 | |
| The conflicts can be examined by inspecting the verbose
 | |
| (-\fBv\fR) option output file.
 | |
| The conflicts can be resolved by rewriting the grammar
 | |
| or by using \fBconflict resolvers\fR.
 | |
| The mechanism described here is based on the attributed parsing
 | |
| of reference
 | |
| .[ (
 | |
| milton
 | |
| .]).
 | |
| .PP
 | |
| An alternation conflict can be resolved by putting an \fBif condition\fR
 | |
| in front of the first conflicting production.
 | |
| It consists of a "\fB%if\fR" followed by a
 | |
| C-expression between parentheses.
 | |
| \fILLparse\fR will then evaluate this expression whenever a
 | |
| token is met at this point on which there is a conflict, so
 | |
| the conflict will be resolved dynamically.
 | |
| If the expression evaluates to
 | |
| non-zero, the first conflicting production is chosen,
 | |
| otherwise one of the remaining ones is chosen.
 | |
| .PP
 | |
| An alternation conflict can also be resolved using the keywords
 | |
| "\fB%prefer\fR" or "\fB%avoid\fR". "\fB%prefer\fR"
 | |
| is equivalent in behaviour to
 | |
| "\fB%if\fR (1)". "\fB%avoid\fR" is equivalent to "\fB%if\fR (0)".
 | |
| In these cases however, "\fB%prefer\fR" and "\fB%avoid\fR" should be used,
 | |
| as they resolve the conflict statically and thus
 | |
| give rise to better C-code.
 | |
| .PP
 | |
| A repetition conflict can be resolved by putting a \fBwhile condition\fR
 | |
| right after the opening parenthesis "[". This while condition
 | |
| consists of a "\fB%while\fR" followed by a C-expression between
 | |
| parentheses. Again, \fILLparse\fR will then
 | |
| evaluate this expression whenever a token is met
 | |
| at this point on which there is a conflict.
 | |
| If the expression evaluates to non-zero, the
 | |
| repeating part is chosen, otherwise the parser continues with
 | |
| the rest of the rule.
 | |
| Appendix B will give an example of these features.
 | |
| .PP
 | |
| A useful aid in writing conflict resolvers is the "\fB%first\fR" keyword.
 | |
| It is used to declare a C-macro that forms an expression
 | |
| returning 1 if the parameter supplied can start a specified
 | |
| nonterminal, for instance:
 | |
| .sp 1
 | |
| .nf
 | |
| .ft CW
 | |
| %first fmac, nonterm ;
 | |
| .ft R
 | |
| .sp 1
 | |
| .fi
 | |
| declares "fmac" as a macro with one parameter, whose value
 | |
| is a token number. If the parameter
 | |
| X can start the nonterminal "nonterm", "fmac(X)" is true,
 | |
| otherwise it is false.
 | |
| .NH
 | |
| The LLgen working environment
 | |
| .PP
 | |
| \fILLgen\fR generates a number of files: one for each input
 | |
| file, and two other files: \fILpars.c\fR and \fILpars.h\fR.
 | |
| \fILpars.h\fR contains "#\ define"s for the token names.
 | |
| \fILpars.c\fR contains the error recovery routines and tables.
 | |
| Only those output files that differ from their previous version
 | |
| are updated. See appendix C for a possible application of this
 | |
| feature.
 | |
| .PP
 | |
| The names of the output files are constructed as
 | |
| follows:
 | |
| in the input file name, the suffix after the last point is
 | |
| replaced by a "c". If no point is present in the input file
 | |
| name, ".c" is appended to it. \fILLgen\fR checks that the
 | |
| filename constructed this way in fact represents a previous
 | |
| version, or does not exist already.
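The naming rule can be sketched as a small C function. This is an illustration only, not taken from the \fILLgen\fR sources. The function name "output_name" is hypothetical, and the check that an existing file is a previous version is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Construct the output file name for the input file name `in`:
 * the suffix after the last point is replaced by "c"; if there is
 * no point, ".c" is appended.
 * Returns 0 on success, -1 if `out` (of `size` bytes) is too small.
 */
int output_name(const char *in, char *out, size_t size)
{
    const char *dot = strrchr(in, '.');
    size_t stem = dot ? (size_t) (dot - in) : strlen(in);

    assert(out != NULL);
    if (stem + 3 > size)          /* stem, ".c" and the terminating NUL */
        return -1;
    memcpy(out, in, stem);
    memcpy(out + stem, ".c", 3);  /* copies the NUL as well */
    return 0;
}
```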
 | |
| .PP
 | |
| The user must provide some environment to obtain a complete
 | |
| program.
 | |
| Routines called \fImain\fR and \fILLmessage\fR must be defined.
 | |
| Also, a lexical analyzer must be provided.
 | |
| .PP
 | |
| The routine \fImain\fR must be defined, as it must be in every
 | |
| C-program. It should eventually call one of the start symbol
 | |
| routines.
 | |
| .PP
 | |
| The routine \fILLmessage\fR must accept one
 | |
| parameter, whose value is a token number, zero or -1.
 | |
| .br
 | |
| A zero parameter indicates that the current token (the one in
 | |
| the external variable \fILLsymb\fR) is deleted.
 | |
| .br
 | |
| A -1 parameter indicates that the parser expected end of file, but didn't get
 | |
| it.
 | |
| The parser will then skip tokens until end of file is detected.
 | |
| .br
 | |
| A parameter that is a token number (a positive parameter)
 | |
| indicates that this
 | |
| token is to be inserted in front of the token currently in
 | |
| \fILLsymb\fR.
 | |
| The user can give the token the proper attributes.
 | |
| Also, the user must take care that the token currently in
 | |
| \fILLsymb\fR is again returned by the \fBnext\fR call to the
 | |
| lexical analyzer, with the proper attributes.
 | |
| So, the lexical analyzer must have a facility to push back one
 | |
| token.
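A minimal sketch of an \fILLmessage\fR routine together with a lexical analyzer that can push back one token is given below. It is an illustration under several assumptions: every non-blank input character is a token, the input is a character string set with the hypothetical routine "set_input", and \fILLsymb\fR is defined locally so that the sketch compiles on its own (in a real program it is supplied by the generated parser). The name "scanner" would have to be declared with "%lexical".

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in for the variable defined by the generated parser (assumption). */
int LLsymb;

static const char *input = "";  /* hypothetical input source */
static int pushed_back = 0;     /* nonzero: saved_token must be returned next */
static int saved_token;

void set_input(const char *s)
{
    assert(s != NULL);
    input = s;
    pushed_back = 0;
}

/* Lexical analyzer: returns the token number, or 0 at end of input. */
int scanner(void)
{
    if (pushed_back) {
        pushed_back = 0;
        return saved_token;
    }
    while (*input == ' ' || *input == '\t' || *input == '\n')
        input++;
    if (*input == '\0')
        return 0;               /* a value <= 0 signals end of input */
    return (unsigned char) *input++;
}

void LLmessage(int tk)
{
    if (tk == 0) {
        fprintf(stderr, "token %d deleted\n", LLsymb);
    } else if (tk == -1) {
        fprintf(stderr, "end of file expected\n");
    } else {
        fprintf(stderr, "token %d inserted\n", tk);
        /* The current token must be returned again by the next call. */
        saved_token = LLsymb;
        pushed_back = 1;
    }
}
```

Pushing back a token only requires a flag and one saved value here, because the tokens in this sketch carry no attributes.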
 | |
| .PP
 | |
| The user may also supply their own error recovery routines, or handle
 | |
| errors differently. For this purpose, the name of a routine to be called
 | |
| when an error occurs may be declared using the keyword \fB%onerror\fR.
 | |
| This routine takes two parameters.
 | |
| The first one is either the token number of the
 | |
| token expected, or 0. In the latter case, the error occurred at a choice.
 | |
| In both cases, the routine must ensure that the next call to the lexical
 | |
| analyzer returns the token that replaces the current one. Of course,
 | |
| that could well be the current one, in which case
 | |
| .I LLparse
 | |
| recovers from the error.
 | |
| The second parameter contains a list of tokens that are not skipped at the
 | |
| error point. The list is in the form of a null-terminated array of integers,
 | |
| whose address is passed.
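A routine of this kind might look like the following sketch, which would be declared with "%onerror syntax_error ;". The routine name, the helper "token_list_length" and the variable "replacement" are hypothetical, and \fILLsymb\fR is defined locally only to keep the sketch self-contained. The sketch keeps the current token as the replacement, so that \fILLparse\fR recovers from the error itself; the lexical analyzer is assumed to return "replacement" on its next call.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in for the variable defined by the generated parser (assumption). */
int LLsymb;

/* Token that the lexical analyzer should return next (hypothetical). */
static int replacement = -1;

/* Number of entries in a zero-terminated token list. */
int token_list_length(const int *list)
{
    int n = 0;

    assert(list != NULL);
    while (list[n] != 0)
        n++;
    return n;
}

/* Error routine: expected is a token number, or 0 for an error at a choice. */
void syntax_error(int expected, int *no_skip)
{
    if (expected != 0)
        fprintf(stderr, "token %d expected\n", expected);
    else
        fprintf(stderr, "unexpected token %d at a choice\n", LLsymb);
    fprintf(stderr, "%d token(s) are not skipped here\n",
            token_list_length(no_skip));
    replacement = LLsymb;       /* keep the current token */
}

int pending_replacement(void)
{
    return replacement;
}
```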
 | |
| .PP
 | |
| The user must supply a lexical analyzer to read the input stream and
 | |
| break it up into tokens, which are passed to
 | |
| .I LLparse .
 | |
| It should be an integer valued function, returning the token number.
 | |
| The name of this function can be declared using the
 | |
| "\fB%lexical\fR" keyword.
 | |
| This keyword can be used wherever a declaration is legal and may appear
 | |
| only once in the grammar specification, for instance:
 | |
| .sp 1
 | |
| .nf
 | |
| .ft CW
 | |
| %lexical scanner ;
 | |
| .ft R
 | |
| .fi
 | |
| .sp 1
 | |
| declares "scanner" as the name of the lexical analyzer.
 | |
| The default name for the lexical analyzer is "yylex".
 | |
| The reason for this funny name is that a useful tool for constructing
 | |
| lexical analyzers is the
 | |
| .I Lex
 | |
| program,
 | |
| .[
 | |
| lex
 | |
| .]
 | |
| which generates a routine of that name.
 | |
| .PP
 | |
| The token numbers are chosen by \fILLgen\fR.
 | |
| The token number for a literal
 | |
| is the numerical value of the character in the local character set.
 | |
| If the tokens have a name,
 | |
| the "#\ define" mechanism of C is used to give them a value and
 | |
| to allow the lexical analyzer to return their token numbers symbolically.
 | |
| These "#\ define"s are collected in the file \fILpars.h\fR which
 | |
| can be "#\ include"d in any file that needs the token-names.
 | |
| The maximum token number chosen is defined in the macro \fILL_MAXTOKNO\fP.
 | |
| .PP
 | |
| The lexical analyzer must signal the end
 | |
| of input to \fILLparse\fR
 | |
| by returning a number less than or equal to zero.
 | |
| .NH
 | |
| Programs with more than one parser
 | |
| .PP
 | |
| \fILLgen\fR offers a simple facility for having more than one parser in
 | |
| a program: in this case, the user can change the names of global procedures,
 | |
| variables, etc., by giving a different prefix, like this:
 | |
| .sp 1
 | |
| .nf
 | |
| .ft CW
 | |
| %prefix XX ;
 | |
| .ft R
 | |
| .fi
 | |
| .sp 1
 | |
| The effect of this is that all global names start with XX instead of LL, for
 | |
| the parser that has this prefix. This holds for the variable \fILLsymb\fP,
 | |
| which now is called \fIXXsymb\fP, for the routine \fILLmessage\fP,
 | |
| which must now be called \fIXXmessage\fP, and for the macro \fILL_MAXTOKNO\fP,
 | |
| which is now called \fIXX_MAXTOKNO\fP.
 | |
| \fILL.output\fP is now \fIXX.output\fP, and \fILpars.c\fP and \fILpars.h\fP
 | |
| are now called \fIXXpars.c\fP and \fIXXpars.h\fP.
 | |
| .bp
 | |
| .SH
 | |
| References
 | |
| .[
 | |
| $LIST$
 | |
| .]
 | |
| .bp
 | |
| .SH
 | |
| Appendix A : LLgen Input Syntax
 | |
| .PP
 | |
| This appendix has a description of the \fILLgen\fR input syntax,
 | |
| as a \fILLgen\fR specification. As a matter of fact, the current
 | |
| version of \fILLgen\fR is written with \fILLgen\fR.
 | |
| .nf
 | |
| .ft CW
 | |
| .sp 2
 | |
| /*
 | |
|  * First the declarations of the terminals
 | |
|  * The order is not important
 | |
|  */
 | |
| 
 | |
| %token  IDENTIFIER;            /* terminal or nonterminal name */
 | |
| %token  NUMBER;
 | |
| %token  LITERAL;
 | |
| 
 | |
| /*
 | |
|  * Reserved words
 | |
|  */
 | |
| 
 | |
| %token  TOKEN;         /* %token */
 | |
| %token  START;         /* %start */
 | |
| %token  PERSISTENT;    /* %persistent */
 | |
| %token  IF;            /* %if */
 | |
| %token  WHILE;         /* %while */
 | |
| %token  AVOID;         /* %avoid */
 | |
| %token  PREFER;        /* %prefer */
 | |
| %token  DEFAULT;       /* %default */
 | |
| %token  LEXICAL;       /* %lexical */
 | |
| %token  PREFIX;        /* %prefix */
 | |
| %token  ONERROR;       /* %onerror */
 | |
| %token  FIRST;         /* %first */
 | |
| 
 | |
| /*
 | |
|  * Declare LLparse to be a C-routine that recognizes "specification"
 | |
|  */
 | |
| 
 | |
| %start  LLparse, specification;
 | |
| 
 | |
| specification
 | |
|         : declaration*
 | |
|         ;
 | |
| 
 | |
| declaration
 | |
|         : START
 | |
|                 IDENTIFIER ',' IDENTIFIER
 | |
|           ';'
 | |
|         | '{'
 | |
|                 /* Read C-declaration here */
 | |
|           '}'
 | |
|         | TOKEN
 | |
|                 IDENTIFIER
 | |
|                 [ ',' IDENTIFIER ]*
 | |
|           ';'
 | |
|         | FIRST
 | |
|                 IDENTIFIER ',' IDENTIFIER
 | |
|           ';'
 | |
|         | LEXICAL
 | |
|                 IDENTIFIER
 | |
|           ';'
 | |
|         | PREFIX
 | |
|                 IDENTIFIER
 | |
|           ';'
 | |
|         | ONERROR
 | |
|                 IDENTIFIER
 | |
| 	  ';'
 | |
|         | rule
 | |
|         ;
 | |
| 
 | |
| rule    : IDENTIFIER parameters? ldecl?
 | |
|                 ':' productions
 | |
|           ';'
 | |
|         ;
 | |
| 
 | |
| ldecl   : '{'
 | |
|                 /* Read C-declaration here */
 | |
|           '}'
 | |
|         ;
 | |
| 
 | |
| productions
 | |
|         : simpleproduction
 | |
|           [ '|' simpleproduction ]*
 | |
|         ;
 | |
| 
 | |
| simpleproduction
 | |
|         : DEFAULT?
 | |
| 	  [ IF '(' /* Read C-expression here */ ')'
 | |
|           | PREFER
 | |
|           | AVOID
 | |
|           ]?
 | |
|           [ element repeats ]*
 | |
|         ;
 | |
| 
 | |
| element : '{'
 | |
|                 /* Read action here */
 | |
|           '}'
 | |
|         | '[' [ WHILE '(' /* Read C-expression here */ ')' ]?
 | |
|                 PERSISTENT?
 | |
|                 productions
 | |
|           ']'
 | |
|         | LITERAL
 | |
|         | IDENTIFIER parameters?
 | |
|         ;
 | |
| 
 | |
| parameters
 | |
|         : '(' /* Read C-parameters here */ ')'
 | |
|         ;
 | |
| 
 | |
| repeats : /* empty */
 | |
|         | [ '*' | '+' ] NUMBER?
 | |
|         | NUMBER
 | |
|         | '?'
 | |
|         ;
 | |
| 
 | |
| .fi
 | |
| .ft R
 | |
| .bp
 | |
| .SH
 | |
| Appendix B : An example
 | |
| .PP
 | |
| This example gives the complete \fILLgen\fR specification of a simple
 | |
| desk calculator. It has 26 registers, labeled "a" through "z",
 | |
| and accepts arithmetic expressions made up of the C operators
 | |
| +, -, *, /, %, &, and |, with their usual priorities.
 | |
| The value of the expression is
 | |
| printed. As in C, an integer that begins with 0 is assumed to
 | |
| be octal; otherwise it is assumed to be decimal.
 | |
| .PP
 | |
| Although the example is short and not very complicated, it
 | |
| demonstrates the use of if and while conditions. In
 | |
| the example they are in fact used to reduce the number of
 | |
| nonterminals, and to reduce the overhead due to the recursion
 | |
| that would be involved in parsing an expression with an
 | |
| ordinary recursive descent parser. In an ordinary LL(1)
 | |
| grammar there would be one nonterminal for each operator
 | |
| priority. The example shows how we can do it all with one
 | |
| nonterminal, no matter how many priority levels there are.
 | |
| .sp 1
 | |
| .nf
 | |
| .ft CW
 | |
| {
 | |
| #include <stdio.h>
 | |
| #include <ctype.h>
 | |
| #include <stdlib.h>
 | |
| #define MAXPRIO      5
 | |
| #define prio(op)     (ptab[op])
 | |
| 
 | |
| struct token {
 | |
|         int     t_tokno;        /* token number */
 | |
|         int     t_tval;         /* Its attribute */
 | |
| } stok = { 0,0 }, tok;
 | |
| 
 | |
| int     nerrors = 0;
 | |
| int     regs[26];               /* Space for the registers */
 | |
| int     ptab[128];              /* Attribute table */
 | |
| 
 | |
| struct token
 | |
| nexttok() {  /* Read next token and return it */
 | |
|         register        c;
 | |
|         struct token    new;
 | |
| 
 | |
|         while ((c = getchar()) == ' ' || c == '\et') { /* nothing */ }
 | |
|         if (isdigit(c)) new.t_tokno = DIGIT;
 | |
|         else if (islower(c)) new.t_tokno = IDENT;
 | |
|         else new.t_tokno = c;
 | |
|         if (c >= 0) new.t_tval = ptab[c];
 | |
|         return new;
 | |
| }   }
 | |
| 
 | |
| %token  DIGIT, IDENT;
 | |
| %start  parse, list;
 | |
| 
 | |
| list    : stat* ;
 | |
| 
 | |
| stat    {       int     ident, val; } :
 | |
|         %if (stok = nexttok(),
 | |
|              stok.t_tokno == '=')
 | |
|                     /* The conflict is resolved by looking one further
 | |
|                      * token ahead. The grammar is LL(2)
 | |
|                      */
 | |
|           IDENT
 | |
|                                 {       ident = tok.t_tval; }
 | |
|           '=' expr(1,&val) '\en'
 | |
|                                 {       if (!nerrors) regs[ident] = val; }
 | |
|         | expr(1,&val) '\en'
 | |
|                                 {       if (!nerrors) printf("%d\en",val); }
 | |
|         | '\en'
 | |
|         ;
 | |
| 
 | |
| expr(int level; int *val;) {       int     expr; } :
 | |
|           factor(val)
 | |
|           [ %while (prio(tok.t_tokno) >= level)
 | |
|                     /* Swallow operators as long as their priority is
 | |
|                      * larger than or equal to the level of this invocation
 | |
|                      */
 | |
|               '+' expr(prio('+')+1,&expr)
 | |
|                                 {       *val += expr; }
 | |
|                     /* This states that '+' groups left to right. If it
 | |
|                      * should group right to left, the rule should read:
 | |
|                      * '+' expr(prio('+'),&expr)
 | |
|                      */
 | |
|             | '-' expr(prio('-')+1,&expr)
 | |
|                                 {       *val -= expr; }
 | |
|             | '*' expr(prio('*')+1,&expr)
 | |
|                                 {       *val *= expr; }
 | |
|             | '/' expr(prio('/')+1,&expr)
 | |
|                                 {       *val /= expr; }
 | |
|             | '%' expr(prio('%')+1,&expr)
 | |
|                                 {       *val %= expr; }
 | |
|             | '&' expr(prio('&')+1,&expr)
 | |
|                                 {       *val &= expr; }
 | |
|             | '|' expr(prio('|')+1,&expr)
 | |
|                                 {       *val |= expr; }
 | |
|           ]*
 | |
|                     /* Notice the "*" here. It is important.
 | |
|                      */
 | |
| 	;
 | |
| 
 | |
| factor(int *val;):
 | |
|             '(' expr(1,val) ')'
 | |
|           | '-' expr(MAXPRIO+1,val)
 | |
|                                 {       *val = -*val; }
 | |
|           | number(val)
 | |
|           | IDENT
 | |
|                                 {       *val = regs[tok.t_tval]; }
 | |
|         ;
 | |
| 
 | |
| number(int *val;) {       int base; }
 | |
|         : DIGIT
 | |
|                                 {       base = (*val=tok.t_tval)==0?8:10; }
 | |
|           [ DIGIT
 | |
|                                 {       *val = base * *val + tok.t_tval; }
 | |
|           ]*        ;
 | |
| 
 | |
| %lexical scanner ;
 | |
| {
 | |
| scanner() {
 | |
|         if (stok.t_tokno) { /* a token has been inserted or read ahead */
 | |
|                 tok = stok;
 | |
|                 stok.t_tokno = 0;
 | |
|                 return tok.t_tokno;
 | |
|         }
 | |
|         if (nerrors && tok.t_tokno == '\en') {
 | |
|                 printf("ERROR\en");
 | |
|                 nerrors = 0;
 | |
|         }
 | |
|         tok = nexttok();
 | |
|         return tok.t_tokno;
 | |
| }
 | |
| 
 | |
| LLmessage(insertedtok) {
 | |
|         nerrors++;
 | |
|         if (insertedtok) { /* token inserted, save old token */
 | |
|                 stok = tok;
 | |
|                 tok.t_tval = 0;
 | |
|                 if (insertedtok < 128) tok.t_tval = ptab[insertedtok];
 | |
|         }
 | |
| }
 | |
| 
 | |
| main() {
 | |
|         register *p;
 | |
| 
 | |
|         for (p = ptab; p < &ptab[128]; p++) *p = 0;
 | |
|         /* for letters, their attribute is their index in the regs array */
 | |
|         for (p = &ptab['a']; p <= &ptab['z']; p++) *p = p - &ptab['a'];
 | |
|         /* for digits, their attribute is their value */
 | |
|         for (p = &ptab['0']; p <= &ptab['9']; p++) *p = p - &ptab['0'];
 | |
|         /* for operators, their attribute is their priority */
 | |
|         ptab['*'] = 4;
 | |
|         ptab['/'] = 4;
 | |
|         ptab['%'] = 4;
 | |
|         ptab['+'] = 3;
 | |
|         ptab['-'] = 3;
 | |
|         ptab['&'] = 2;
 | |
|         ptab['|'] = 1;
 | |
|         parse();
 | |
| 	exit(nerrors);
 | |
| }   }
 | |
| .fi
 | |
| .ft R
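.PP
For comparison, the effect of the %while condition in the
\fIexpr\fR rule can be sketched as a hand-written C function.
This is only an illustration, not \fILLgen\fR output: the names
\fIeval\fR, \fIsrc\fR and \fIrhs\fR are hypothetical, the input is
assumed to be a string without blanks or registers, and division
by zero is assumed to yield 0.

```c
#include <ctype.h>

#define MAXPRIO 5

static const char *src;         /* hypothetical input cursor */

/* Priority of a binary operator; 0 for anything else (compare ptab). */
static int prio(int c)
{
        switch (c) {
        case '*': case '/': case '%': return 4;
        case '+': case '-':           return 3;
        case '&':                     return 2;
        case '|':                     return 1;
        default:                      return 0;
        }
}

static int expr(int level);

/* factor: '(' expr ')' | '-' expr | number */
static int factor(void)
{
        int val = 0, base;

        if (*src == '(') {
                src++;
                val = expr(1);
                if (*src == ')') src++;
                return val;
        }
        if (*src == '-') {
                src++;
                return -expr(MAXPRIO + 1);
        }
        base = (*src == '0') ? 8 : 10;  /* a leading 0 means octal */
        while (isdigit((unsigned char)*src))
                val = base * val + (*src++ - '0');
        return val;
}

/* One function serves every priority level. */
static int expr(int level)
{
        int val = factor();

        /* Swallow operators whose priority is at least this level. */
        while (prio(*src) >= level) {
                int op = *src++;
                /* The +1 makes the operator group left to right. */
                int rhs = expr(prio(op) + 1);

                switch (op) {
                case '+': val += rhs; break;
                case '-': val -= rhs; break;
                case '*': val *= rhs; break;
                case '/': val = rhs ? val / rhs : 0; break; /* assumption */
                case '%': val = rhs ? val % rhs : 0; break; /* assumption */
                case '&': val &= rhs; break;
                case '|': val |= rhs; break;
                }
        }
        return val;
}

/* Evaluate a complete expression held in the string s. */
int eval(const char *s)
{
        src = s;
        return expr(1);
}
```

Under these assumptions, eval("1+2*3") yields 7 and eval("010+1")
yields 9, because 010 is read as octal.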
 | |
| .bp
 | |
| .SH
 | |
| Appendix C. How to use \fILLgen\fR.
 | |
| .PP
 | |
| This appendix demonstrates how \fILLgen\fR can be used in
 | |
| combination with the \fImake\fR program, to take advantage
 | |
| of the fact that \fILLgen\fR only changes output files
 | |
| when necessary. \fIMake\fR uses a "makefile", which
 | |
| is a file containing dependencies and associated commands.
 | |
| A dependency usually indicates that some files depend on other
 | |
| files. When a file depends on another file and is older than
 | |
| that other file, the commands associated with the dependency
 | |
| are executed.
 | |
| .PP
 | |
| So, \fImake\fR seems just the program that we always wanted.
 | |
| However, it
 | |
| is not very good at handling programs that generate more than
 | |
| one file.
 | |
| As usual, there is a way around this problem.
 | |
| A sample makefile follows:
 | |
| .sp 1
 | |
| .ft CW
 | |
| .nf
 | |
| # The grammar consists of the files decl.g, stat.g and expr.g.
 | |
| # The ".o"-files are the result of a C-compilation.
 | |
| 
 | |
| GFILES = decl.g stat.g expr.g
 | |
| OFILES = decl.o stat.o expr.o Lpars.o
 | |
| LLOPT =
 | |
| 
 | |
| # As make doesn't handle programs that generate more than one
 | |
| # file well, we just don't tell make about it.
 | |
| # We just create a dummy file, and touch it whenever LLgen is
 | |
| # executed. This way, the dummy in fact depends on the grammar
 | |
| # files.
 | |
| # Then, we execute make again, to do the C-compilations and
 | |
| # such.
 | |
| 
 | |
| all:	dummy
 | |
|         make parser
 | |
| 
 | |
| dummy:  $(GFILES)
 | |
|         LLgen $(LLOPT) $(GFILES)
 | |
|         touch dummy
 | |
| 
 | |
| parser: $(OFILES)
 | |
|         $(CC) -o parser $(LDFLAGS) $(OFILES)
 | |
| 
 | |
| # Some dependencies without actions :
 | |
| # make already knows what to do about them
 | |
| 
 | |
| Lpars.o:        Lpars.h
 | |
| stat.o:         Lpars.h
 | |
| decl.o:         Lpars.h
 | |
| expr.o:         Lpars.h
 | |
| 
 | |
| .fi
 | |
| .ft R
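.PP
The makefile above relies on \fILLgen\fR leaving an output file
untouched when its contents would not change, so that its
modification time stays old and \fImake\fR does not recompile it.
How \fILLgen\fR achieves this is not described here.
The following C sketch only shows one possible way to get the same
effect; the function name \fIupdate_if_changed\fR is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch, not LLgen's actual code: write text to path only
 * if the file is missing or its contents differ from text.
 * Returns 1 if the file was written, 0 if it was left untouched,
 * and -1 on error.
 */
int update_if_changed(const char *path, const char *text)
{
        size_t len = strlen(text), i = 0;
        int c, same = 0;
        FILE *f = fopen(path, "rb");

        if (f != NULL) {
                /* Compare the old contents with the new text. */
                while ((c = getc(f)) != EOF && i < len
                       && c == (unsigned char)text[i])
                        i++;
                same = (c == EOF && i == len);
                fclose(f);
        }
        if (same) return 0;     /* keep the old modification time */

        f = fopen(path, "wb");
        if (f == NULL) return -1;
        if (fwrite(text, 1, len, f) != len) {
                fclose(f);
                return -1;
        }
        return fclose(f) == 0 ? 1 : -1;
}
```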
 |