diff --git a/util/LLgen/doc/LLgen.1 b/util/LLgen/doc/LLgen.1 new file mode 100644 index 000000000..0436ffa30 --- /dev/null +++ b/util/LLgen/doc/LLgen.1 @@ -0,0 +1,139 @@ +.\" $Id$ +.TH LLGEN 1 "$Revision$" +.ad +.SH NAME +LLgen, an extended LL(1) parser generator +.SH SYNOPSIS +LLgen [ \-vxwans ] [ \-j[\fInum\fP] ] [ \-l\fInum\fP ] [ \-h\fInum\fP ] file ... +.SH DESCRIPTION +\fILLgen\fP +converts a context-free grammar into a set of +functions which form a recursive descent parser with no backtrack. +The grammar may be ambiguous; +ambiguities can be broken by user specifications. +.PP +\fILLgen\fP +reads each +\fIfile\fP +in sequence. +Together, these files must constitute a context-free grammar. +For each file, +\fILLgen\fP +generates an output file, which must be compiled by the +C-compiler. +In addition, it generates the files +\fILpars.c\fP +and +\fILpars.h.\fP +\fILpars.h\fP +contains the +\fIdefine\fP +statements that associate the +\fILLgen\fP-assigned `token-codes' with user declared `token-names'. +This allows other source files, for instance the source file +containing the lexical analyzer, +to access the token-codes by +using the token-names. +\fILpars.c\fP +contains the error recovery routines and tables. It must also +be compiled by the C-compiler. When the generated parser uses non-correcting +error recovery ( +\fB\-n\fP +option) +\fILLgen\fP +also generates a file +\fILncor.c\fP +that contains the non-correcting recovery mechanism. +.PP +\fILLgen\fP +will only update those output files that differ from their previous +version. +This allows +\fILLgen\fP +to be used with +\fImake\fP +(1) conveniently. +.PP +To obtain a working program, the user must also supply a +lexical analyzer, as well as +\fImain\fP +and +\fILLmessage\fP, +an error reporting routine; +\fILex\fP +(1) is a useful program for creating lexical analysers usable +by +\fILLgen\fP. 
+.PP +\fILLgen\fP accepts the following flags: +.IP \fB\-v\fP +create a file called +\fILL.output\fP, +which contains a description of the conflicts that +were not resolved. +If the flag is given more than once, +\fILLgen\fP +will be more "verbose". +If it is given three times, a complete description of the +grammar will be supplied. +.IP \fB\-x\fP +the sets that are computed are extended with the nonterminal +symbols and these extended sets are also included in the +\fILL.output\fP +file. +.IP \fB\-w\fP +no warnings are given. +.IP \fB\-a\fP +Produce ANSI C function headers and prototypes. +.IP \fB\-n\fP +Produce a parser with non-correcting error recovery. +.IP \fB\-s\fP +Simulate the calling of all defined subparsers in all semantic actions. When +using non-correcting error recovery, subparsers that are called in semantic +actions may cause problems; this flag provides a `brute-force' solution. +.IP \fB\-j\fP[\fInum\fP] +when this flag is given, \fILLgen\fP will generate dense switches, +so that the compiler can generate a jump table for it. This will only be +done for switches that have density between +\fIlow_percentage\fP and \fIhigh_percentage\fP, as explained below. +Usually, compilers generate a jumptable when the density of the switch +is above a certain threshold. When jump tables are to be used more often, +\fIhigh_percentage\fP must be set to this threshold, and \fIlow_percentage\fP +must be set to a minimum threshold. There is a time-space trade-off here. +.I num +is the minimum number of cases in a switch for the \fB\-j\fP option to be +effective. The default value (if +.I num +is not given) is 8. +.IP \fB\-l\fP\fInum\fP +The \fIlow_percentage\fP, as described above. Default value is 10. +.IP \fB\-h\fP\fInum\fP +The \fIhigh_percentage\fP, as described above. Default value is 30. 
+.SH FILES +LL.output verbose output file +.br +Lpars.c the error recovery routines +.br +Lncor.c non-correcting error recovery mechanism +.br +Lpars.h defines for token names +.SH "SEE ALSO" +\fIlex\fP(1) +.br +\fImake\fP(1) +.br +\fILLgen, an Extended LL(1) Parser Generator\fP +by C.J.H. Jacobs. +.br +\fITop-down Non-Correcting Error Recovery in LLgen\fP +by A.W van Deudekom and P.J. Kooiman +.SH DIAGNOSTICS +Are intended to be self-explanatory. They are reported +on standard error. A more detailed report is found in the +\fILL.output\fP +file. +.SH AUTHOR +Ceriel J. H. Jacobs +.br +The non-correcting error recovery mechanism is written by +A.W van Deudekom and P.J. Kooiman. diff --git a/util/LLgen/doc/LLgen.n b/util/LLgen/doc/LLgen.n new file mode 100644 index 000000000..3d9786a5b --- /dev/null +++ b/util/LLgen/doc/LLgen.n @@ -0,0 +1,1077 @@ +.\" $Id$ +.\" Run this paper off with +.\" refer [options] -p LLgen.refs LLgen.doc | [n]eqn | tbl | (nt)roff -ms +.if '\*(>.'' \{\ +. if '\*(<.'' \{\ +. if n .ds >. . +. if n .ds >, , +. if t .ds <. . +. if t .ds <, ,\ +\}\ +\} +.cs 5 22u +.ND +.EQ +delim @@ +.EN +.TL +LLgen, an extended LL(1) parser generator +.AU +Ceriel J. H. Jacobs +.AI +Dept. of Mathematics and Computer Science +Vrije Universiteit +Amsterdam, The Netherlands +.AB +\fILLgen\fR provides a +tool for generating an efficient recursive descent parser +with no backtrack from +an Extended Context Free syntax. +The \fILLgen\fR +user specifies the syntax, together with code +describing actions associated with the parsing process. +\fILLgen\fR +turns this specification into a number of subroutines that handle the +parsing process. +.PP +The grammar may be ambiguous. +\fILLgen\fR contains both static and dynamic facilities +to resolve these ambiguities. +.PP +The specification can be split into several files, for each of +which \fILLgen\fR generates an output file containing the +corresponding part of the parser. 
+Furthermore, only output files that differ from their previous +version are updated. +Other output files are not affected in any +way. +This allows the user to recompile only those output files that have +changed. +.PP +The subroutine produced by \fILLgen\fR calls a user supplied routine +that must return the next token. This way, the input to the +parser can be split into single characters or higher level +tokens. +.PP +An error recovery mechanism is generated almost completely +automatically. +It is based on so called \fBdefault choices\fR, which are +implicitly or explicitly specified by the user. +.PP +\fILLgen\fR has successfully been used to create recognizers for +Pascal, C, and Modula-2. +.AE +.NH +Introduction +.PP +\fILLgen\fR +provides a tool for generating an efficient recursive +descent parser with no backtrack from an Extended Context Free +syntax. +A parser generated by +\fILLgen\fR +will be called +\fILLparse\fR +for the rest of this document. +It is assumed that the reader has some knowledge of LL(1) grammars and +recursive descent parsers. +For a survey on the subject, see reference +.[ ( +griffiths +.]). +.PP +Extended LL(1) parsers are an extension of LL(1) parsers. They are +derived from an Extended Context-Free (ECF) syntax instead of a Context-Free +(CF) syntax. +ECF syntax is described in section 2. +Section 3 provides an outline of a +specification as accepted by +\fILLgen\fR and also discusses the lexical conventions of +grammar specification files. +Section 4 provides a description of the way the +\fILLgen\fR +user can associate +actions with the syntax. These actions must be written in the programming +language C, +.[ +kernighan ritchie +.] +which also is the target language of \fILLgen\fR. +The error recovery technique is discussed in section 5. +This section also discusses what the user can do about it. +Section 6 discusses +the facilities \fILLgen\fR offers +to resolve ambiguities and conflicts. 
+\fILLgen\fR offers facilities to resolve them both at parser +generation time and during the execution of \fILLparse\fR. +Section 7 discusses the +\fILLgen\fR +working environment. +It also discusses the lexical analyzer that must be supplied by the +user. +This lexical analyzer must read the input stream and break it +up into basic input items, called \fBtokens\fR for the rest of +this document. +Appendix A gives a summary of the +\fILLgen\fR +input syntax. +Appendix B gives an example. +It is very instructive to compare this example with the one +given in reference +.[ ( +yacc +.]). +It demonstrates the struggle \fILLparse\fR and other LL(1) +parsers have with expressions. +Appendix C gives an example of the \fILLgen\fR features +allowing the user to recompile only those output files that +have changed, using the \fImake\fR program. +.[ +make +.] +.NH +The Extended Context-Free Syntax +.PP +The extensions of an ECF syntax with respect to an ordinary CF syntax are: +.IP 1. 10 +An ECF syntax contains the repetition operator: "N" (N represents a positive +integer). +.IP 2. 10 +An ECF syntax contains the closure set operator without and with +upperbound: "*" and "*N". +.IP 3. 10 +An ECF syntax contains the positive closure set operator without and with +upperbound: "+" and "+N". +.IP 4. 10 +An ECF syntax contains the optional operator: "?", which is a +shorthand for "*1". +.IP 5. 10 +An ECF syntax contains parentheses "[" and "]" which can be +used for grouping. +.PP +We can describe the syntax of an ECF syntax with an ECF syntax : +.DS +.ft CW +grammar : rule + + ; +.ft R +.DE +This grammar rule states that a grammar consists of one or more +rules. +.DS +.ft CW +rule : nonterminal ':' productionrule ';' + ; +.ft R +.DE +A rule consists of a left hand side, the nonterminal, +followed by ":", +the \fBproduce symbol\fR, followed by a production rule, followed by a +";", in\%di\%ca\%ting the end of the rule. 
+.DS +.ft CW +productionrule : production [ '|' production ]* + ; +.ft R +.DE +A production rule consists of one or +more alternative productions separated by "|". This symbol is called the +\fBalternation symbol\fR. +.DS +.ft CW +production : term * + ; +.ft R +.DE +A production consists of a possibly empty list of terms. +So, empty productions are allowed. +.DS +.ft CW +term : element repeats + ; +.ft R +.DE +A term is an element, possibly with a repeat specification. +.DS +.ft CW +element : LITERAL + | IDENTIFIER + | '[' productionrule ']' + ; +.ft R +.DE +An element can be a LITERAL, which basically is a single character +between apostrophes, it can be an IDENTIFIER, which is either a +nonterminal or a token, and it can be a production rule +between square parentheses. +.DS +.ft CW +repeats : '?' + | [ '*' | '+' ] NUMBER ? + | NUMBER ? + ; +.ft R +.DE +These are the repeat specifications discussed above. Notice that +this specification may be empty. +.PP +The class of ECF languages +is identical with the class of CF languages. However, in many +cases recursive definitions of language features can now be +replaced by iterative ones. This tends to reduce the number of +nonterminals and gives rise to very efficient recursive descent +parsers. +.NH +Grammar Specifications +.PP +The major part of a +\fILLgen\fR +grammar specification consists of an +ECF syntax specification. +Names in this syntax specification refer to either tokens or nonterminal +symbols. +\fILLgen\fR +requires token names to be declared as such. This way it +can be avoided that a typing error in a nonterminal name causes it to +be accepted as a token name. The token declarations will be +discussed later. +A name will be regarded as a nonterminal symbol, unless it is declared +as a token name. +If there is no production rule for a nonterminal symbol, \fILLgen\fR +will complain. 
+.PP +A grammar specification may also include some C routines, +for instance the lexical analyzer and an error reporting +routine. +Thus, a grammar specification file can contain declarations, +grammar rules and C-code. +.PP +Blanks, tabs and newlines are ignored, but may not appear in names or +keywords. +Comments may appear wherever a name is legal (which is almost +everywhere). +They are enclosed in +/* ... */, as in C. Comments do not nest. +.PP +Names may be of arbitrary length, and can be made up of letters, underscore +"\_" and non-initial digits. Upper and lower case letters are distinct. +Only the first 50 characters are significant. +Notice however, that the names for the tokens will be used by the +C-preprocessor. +The number of significant characters therefore depends on the +underlying C-implementation. +A safe rule is to make the identifiers distinct in the first six +characters, case ignored. +.PP +There are two kinds of tokens: +those that are declared and are denoted by a name, +and literals. +.PP +A literal consists of a character enclosed in apostrophes "'". +The "\e" is an escape character within literals. The following escapes +are recognized : +.TS +center; +l l. +\&'\en' newline +\&'\er' return +\&'\e'' apostrophe "'" +\&'\e\e' backslash "\e" +\&'\et' tab +\&'\eb' backspace +\&'\ef' form feed +\&'\exxx' "xxx" in octal +.TE +.PP +Names representing tokens must be declared before they are used. +This can be done using the "\fB%token\fR" keyword, +by writing +.nf +.ft CW +.sp 1 +%token name1, name2, . . . ; +.ft R +.fi +.PP +\fILLparse\fR is designed to recognize special nonterminal +symbols called \fBstart symbols\fR. +\fILLgen\fR allows for more than one start symbol. +Thus, grammars with more than one entry point are accepted. +The start symbols must be declared explicitly using the +"\fB%start\fR" keyword. 
It can be used whenever a declaration is +legal, f.i.: +.nf +.ft CW +.sp 1 +%start LLparse, specification ; +.ft R +.fi +.sp 1 +declares "specification" as a start symbol and associates the +identifier "LLparse" with it. +"LLparse" will now be the name of the C-function that must be +called to recognize "specification". +.NH +Actions +.PP +\fILLgen\fR +allows arbitrary insertions of actions within the right hand side +of a production rule in the ECF syntax. An action consists of a number of C +statements, enclosed in the brackets "{" and "}". +.PP +\fILLgen\fR +generates a parsing routine for each rule in the grammar. The actions +supplied by the user are just inserted in the proper place. +There may also be declarations before the statements in the +action, as +the "{" and "}" are copied into the target code along with the +action. The scope of these declarations terminates with the +closing bracket "}" of the action. +.PP +In addition to actions, it is also possible to declare local variables +in the parsing routine, which can then be used in the actions. +Such a declaration consists of a number of C variable declarations, +enclosed in the brackets "{" and "}". It must be placed +right in front of the ":" in the grammar rule. +The scope of these local variables consists of the complete +grammar rule. +.PP +In order to facilitate communication between the actions and +\fILLparse\fR, +the parsing routines can be given C-like parameters. +Each parameter must be declared separately, and each of these declarations must +end with a semicolon. +For the last parameter, the semicolon is optional. +.PP +So, for example +.nf +.ft CW +.sp 1 +expr(int *pval;) { int fact; } : + /* + * Rule with one parameter, a pointer to an int. + * Parameter specifications are ordinary C declarations. + * One local variable, of type int. + */ + factor (&fact) { *pval = fact; } + /* + * factor is another nonterminal symbol. + * One actual parameter is supplied. 
+ * Notice that the parameter passing mechanism is that + * of C. + */ + [ '+' factor (&fact) { *pval += fact; } ]* + /* + * remember the '*' means zero or more times + */ + ; +.sp 1 +.ft R +.fi +is a rule to recognize a number of factors, separated by "+", and +to compute their sum. +.PP +\fILLgen\fR +generates C code, so the parameter passing mechanism is that of +C, as is shown in the example above. +.PP +Actions often manipulate attributes of the token just read. +For instance, when an identifier is read, its name must be +looked up in a symbol table. +Therefore, \fILLgen\fR generates code +such that at a number of places in the grammar rule +it is defined which token has last been read. +After a token, the last token read is this token. +After a "[" or a "|", the last token read is the next token to +be accepted by \fILLparse\fR. +At all other places, it is undefined which token has last been +read. +The last token read is available in the global integer variable +\fILLsymb\fR. +.PP +The user may also specify C-code wherever a \fILLgen\fR-declaration is +legal. +Again, this code must be enclosed in the brackets "{" and "}". +This way, the user can define global declarations and +C-functions. +To avoid name-conflicts with identifiers generated by +\fILLgen\fR, \fILLparse\fR only uses names beginning with +"LL"; the user should avoid such names. +.NH +Error Recovery +.PP +The error recovery technique used by \fILLgen\fR is a +modification of the one presented in reference +.[ ( +automatic construction error correcting +.]). +It is based on \fBdefault choices\fR, which just are +what the word says, default choices at +every point in the grammar where there is a +choice. +Thus, in an alternation, one of the productions is marked as a +default choice, and in a term with a non-fixed repetition +specification there will also be a default choice (between +doing the term (once more) and continuing with the rest of the +production in which the term appears). 
+.PP +When \fILLparse\fR detects an error after having parsed the +string @s@, the default choices enable it to compute one +syntactically correct continuation, +consisting of the tokens @t sub 1~...~t sub n@, +such that @s~t sub 1~...~t sub n@ is a string of tokens that +is a member of the language defined by the grammar. +Notice, that the computation of this continuation must +terminate, which implies that the default choices may not +invoke recursive rules. +.PP +At each point in this continuation, a certain number of other +tokens could also be syntactically correct, f.i. the token +@t@ is syntactically correct at point @t sub i@ in this +continuation, if the string @s~t sub 1~...~t sub i~t~s sub 1@ +is a string of the language defined by the grammar for some +string @s sub 1@ and i >= 0. +.PP +The set @T@ +containing all these tokens (including @t sub 1 ,~...,~t sub n@) is computed. +Next, \fILLparse\fR discards zero +or more tokens from its input, until a token +@t@ \(mo @T@ is found. +The error is then corrected by inserting i (i >= 0) tokens +@t sub 1~...~t sub i@, such that the string +@s~t sub 1~...~t sub i~t~s sub 1@ is a string of the language +defined by the grammar, for some @s sub 1@. +Then, normal parsing is resumed. +.PP +The above is difficult to implement in a recursive descent +parser, and is not the way \fILLparse\fR does it, but the +effect is the same. In fact, \fILLparse\fR maintains a list +of tokens that may not be discarded, which is adjusted as +\fILLparse\fR proceeds. This list is just a representation +of the set @T@ mentioned +above. When an error occurs, \fILLparse\fR discards tokens until +a token @t@ that is a member of this list is found. +Then, it continues parsing, following the default choices, +inserting tokens along the way, until this token @t@ is legal. +The selection of +the default choices must guarantee that this will always +happen. +.PP +The default choices are explicitly or implicitly +specified by the user. 
+By default, the default choice in an alternation is the +alternative with the shortest possible terminal production. +The user can select one of the other productions in the +alternation as the default choice by putting the keyword +"\fB%default\fR" in front of it. +.PP +By default, for terms with a repetition count containing "*" or +"?" the default choice is to continue with the rest of the rule +in which the term appears, and +.sp 1 +.ft CW +.nf + term+ +.fi +.ft R +.sp 1 +is treated as +.sp 1 +.nf +.ft CW + term term* . +.ft R +.fi +.PP +It is also clear, that it can never be the default choice to do +the term (once more), because this could cause the parser to +loop, inserting tokens forever. +However, when the user does not want the parser to skip +tokens that would not have been skipped if the term +would have been the default choice, +the skipping of such a term can be prevented by +using the keyword "\fB%persistent\fR". +For instance, the rule +.sp 1 +.ft CW +.nf +commandlist : command* ; +.fi +.ft R +.sp 1 +could be changed to +.sp 1 +.ft CW +.nf +commandlist : [ %persistent command ]* ; +.fi +.ft R +.sp 1 +The effects of this in case of a syntax error are twofold: +The set @T@ mentioned above will be extended as if "command" were +in the default production, so that fewer tokens will be +skipped. +Also, if the first token that is not skipped is a member of the +subset of @T@ arising from the grammar rule for "command", +\fILLparse\fR will enter that rule. +So, in fact the default choice +is determined dynamically (by \fILLparse\fR). +Again, \fILLgen\fR checks (statically) +that \fILLparse\fR will always terminate, and if not, +\fILLgen\fR will complain. +.PP +An important property of this error recovery method is that, +once a rule is started, it will be finished. +This means that all actions in the rule will be executed +normally, so that the user can be sure that there will be no +inconsistencies in his data structures because of syntax +errors. 
+Also, as the method is in fact error correcting, the +actions in a rule only have to deal with syntactically correct +input. +.NH +Ambiguities and conflicts +.PP +As \fILLgen\fR generates a recursive descent parser with no backtrack, +it must at all times be able to determine what to do, +based on the current input symbol. +Unfortunately, this cannot be done for all grammars. +Two kinds of conflicts can arise : +.IP 1) 10 +the grammar rule is of the form "production1 | production2", +and \fILLparse\fR cannot decide which production to choose. +This we call an \fBalternation conflict\fR. +.IP 2) 10 +the grammar rule is of the form "[ productionrule ]...", +where ... specifies a non-fixed repetition count, +and \fILLparse\fR cannot decide whether to +choose "productionrule" once more, or to continue. +This we call a \fBrepetition conflict\fR. +.PP +There can be several causes for conflicts: the grammar may be +ambiguous, or the grammar may require a more complex parser +than \fILLgen\fR can construct. +The conflicts can be examined by inspecting the verbose +(-\fBv\fR) option output file. +The conflicts can be resolved by rewriting the grammar +or by using \fBconflict resolvers\fR. +The mechanism described here is based on the attributed parsing +of reference +.[ ( +milton +.]). +.PP +An alternation conflict can be resolved by putting an \fBif condition\fR +in front of the first conflicting production. +It consists of a "\fB%if\fR" followed by a +C-expression between parentheses. +\fILLparse\fR will then evaluate this expression whenever a +token is met at this point on which there is a conflict, so +the conflict will be resolved dynamically. +If the expression evaluates to +non-zero, the first conflicting production is chosen, +otherwise one of the remaining ones is chosen. +.PP +An alternation conflict can also be resolved using the keywords +"\fB%prefer\fR" or "\fB%avoid\fR". "\fB%prefer\fR" +is equivalent in behaviour to +"\fB%if\fR (1)". 
"\fB%avoid\fR" is equivalent to "\fB%if\fR (0)". +In these cases however, "\fB%prefer\fR" and "\fB%avoid\fR" should be used, +as they resolve the conflict statically and thus +give rise to better C-code. +.PP +A repetition conflict can be resolved by putting a \fBwhile condition\fR +right after the opening parentheses. This while condition +consists of a "\fB%while\fR" followed by a C-expression between +parentheses. Again, \fILLparse\fR will then +evaluate this expression whenever a token is met +at this point on which there is a conflict. +If the expression evaluates to non-zero, the +repeating part is chosen, otherwise the parser continues with +the rest of the rule. +Appendix B will give an example of these features. +.PP +A useful aid in writing conflict resolvers is the "\fB%first\fR" keyword. +It is used to declare a C-macro that forms an expression +returning 1 if the parameter supplied can start a specified +nonterminal, f.i.: +.sp 1 +.nf +.ft CW +%first fmac, nonterm ; +.ft R +.sp 1 +.fi +declares "fmac" as a macro with one parameter, whose value +is a token number. If the parameter +X can start the nonterminal "nonterm", "fmac(X)" is true, +otherwise it is false. +.NH +The LLgen working environment +.PP +\fILLgen\fR generates a number of files: one for each input +file, and two other files: \fILpars.c\fR and \fILpars.h\fR. +\fILpars.h\fR contains "#-define"s for the tokennames. +\fILpars.c\fR contains the error recovery routines and tables. +Only those output files that differ from their previous version +are updated. See appendix C for a possible application of this +feature. +.PP +The names of the output files are constructed as +follows: +in the input file name, the suffix after the last point is +replaced by a "c". If no point is present in the input file +name, ".c" is appended to it. \fILLgen\fR checks that the +filename constructed this way in fact represents a previous +version, or does not exist already. 
+.PP +The user must provide some environment to obtain a complete +program. +Routines called \fImain\fR and \fILLmessage\fR must be defined. +Also, a lexical analyzer must be provided. +.PP +The routine \fImain\fR must be defined, as it must be in every +C-program. It should eventually call one of the startsymbol +routines. +.PP +The routine \fILLmessage\fR must accept one +parameter, whose value is a token number, zero or -1. +.br +A zero parameter indicates that the current token (the one in +the external variable \fILLsymb\fR) is deleted. +.br +A -1 parameter indicates that the parser expected end of file, but didn't get +it. +The parser will then skip tokens until end of file is detected. +.br +A parameter that is a token number (a positive parameter) +indicates that this +token is to be inserted in front of the token currently in +\fILLsymb\fR. +The user can give the token the proper attributes. +Also, the user must take care, that the token currently in +\fILLsymb\fR is again returned by the \fBnext\fR call to the +lexical analyzer, with the proper attributes. +So, the lexical analyzer must have a facility to push back one +token. +.PP +The user may also supply his own error recovery routines, or handle +errors differently. For this purpose, the name of a routine to be called +when an error occurs may be declared using the keyword \fB%onerror\fR. +This routine takes two parameters. +The first one is either the token number of the +token expected, or 0. In the last case, the error occurred at a choice. +In both cases, the routine must ensure that the next call to the lexical +analyser returns the token that replaces the current one. Of course, +that could well be the current one, in which case +.I LLparse +recovers from the error. +The second parameter contains a list of tokens that are not skipped at the +error point. The list is in the form of a null-terminated array of integers, +whose address is passed. 
+.PP +The user must supply a lexical analyzer to read the input stream and +break it up into tokens, which are passed to +.I LLparse. +It should be an integer valued function, returning the token number. +The name of this function can be declared using the +"\fB%lexical\fR" keyword. +This keyword can be used wherever a declaration is legal and may appear +only once in the grammar specification, f.i.: +.sp 1 +.nf +.ft CW +%lexical scanner ; +.ft R +.fi +.sp 1 +declares "scanner" as the name of the lexical analyzer. +The default name for the lexical analyzer is "yylex". +The reason for this funny name is that a useful tool for constructing +lexical analyzers is the +.I Lex +program, +.[ +lex +.] +which generates a routine of that name. +.PP +The token numbers are chosen by \fILLgen\fR. +The token number for a literal +is the numerical value of the character in the local character set. +If the tokens have a name, +the "#\ define" mechanism of C is used to give them a value and +to allow the lexical analyzer to return their token numbers symbolically. +These "#\ define"s are collected in the file \fILpars.h\fR which +can be "#\ include"d in any file that needs the token-names. +The maximum token number chosen is defined in the macro \fILL_MAXTOKNO\fP. +.PP +The lexical analyzer must signal the end +of input to \fILLparse\fR +by returning a number less than or equal to zero. +.NH +Programs with more than one parser +.PP +\fILLgen\fR offers a simple facility for having more than one parser in +a program: in this case, the user can change the names of global procedures, +variables, etc, by giving a different prefix, like this: +.sp 1 +.nf +.ft CW +%prefix XX ; +.ft R +.fi +.sp 1 +The effect of this is that all global names start with XX instead of LL, for +the parser that has this prefix. 
This holds for the variables \fILLsymb\fP, +which now is called \fIXXsymb\fP, for the routine \fILLmessage\fP, +which must now be called \fIXXmessage\fP, and for the macro \fILL_MAXTOKNO\fP, +which is now called \fIXX_MAXTOKNO\fP. +\fILL.output\fP is now \fIXX.output\fP, and \fILpars.c\fP and \fILpars.h\fP +are now called \fIXXpars.c\fP and \fIXXpars.h\fP. +.bp +.SH +References +.[ +$LIST$ +.] +.bp +.SH +Appendix A : LLgen Input Syntax +.PP +This appendix has a description of the \fILLgen\fR input syntax, +as a \fILLgen\fR specification. As a matter of fact, the current +version of \fILLgen\fR is written with \fILLgen\fR. +.nf +.ft CW +.sp 2 +/* + * First the declarations of the terminals + * The order is not important + */ + +%token IDENTIFIER; /* terminal or nonterminal name */ +%token NUMBER; +%token LITERAL; + +/* + * Reserved words + */ + +%token TOKEN; /* %token */ +%token START; /* %start */ +%token PERSISTENT; /* %persistent */ +%token IF; /* %if */ +%token WHILE; /* %while */ +%token AVOID; /* %avoid */ +%token PREFER; /* %prefer */ +%token DEFAULT; /* %default */ +%token LEXICAL; /* %lexical */ +%token PREFIX; /* %prefix */ +%token ONERROR; /* %onerror */ +%token FIRST; /* %first */ + +/* + * Declare LLparse to be a C-routine that recognizes "specification" + */ + +%start LLparse, specification; + +specification + : declaration* + ; + +declaration + : START + IDENTIFIER ',' IDENTIFIER + ';' + | '{' + /* Read C-declaration here */ + '}' + | TOKEN + IDENTIFIER + [ ',' IDENTIFIER ]* + ';' + | FIRST + IDENTIFIER ',' IDENTIFIER + ';' + | LEXICAL + IDENTIFIER + ';' + | PREFIX + IDENTIFIER + ';' + | ONERROR + IDENTIFIER + ';' + | rule + ; + +rule : IDENTIFIER parameters? ldecl? + ':' productions + ';' + ; + +ldecl : '{' + /* Read C-declaration here */ + '}' + ; + +productions + : simpleproduction + [ '|' simpleproduction ]* + ; + +simpleproduction + : DEFAULT? + [ IF '(' /* Read C-expression here */ ')' + | PREFER + | AVOID + ]? 
+ [ element repeats ]* + ; + +element : '{' + /* Read action here */ + '}' + | '[' [ WHILE '(' /* Read C-expression here */ ')' ]? + PERSISTENT? + productions + ']' + | LITERAL + | IDENTIFIER parameters? + ; + +parameters + : '(' /* Read C-parameters here */ ')' + ; + +repeats : /* empty */ + | [ '*' | '+' ] NUMBER? + | NUMBER + | '?' + ; + +.fi +.ft R +.bp +.SH +Appendix B : An example +.PP +This example gives the complete \fILLgen\fR specification of a simple +desk calculator. It has 26 registers, labeled "a" through "z", +and accepts arithmetic expressions made up of the C operators ++, -, *, /, %, &, and |, with their usual priorities. +The value of the expression is +printed. As in C, an integer that begins with 0 is assumed to +be octal; otherwise it is assumed to be decimal. +.PP +Although the example is short and not very complicated, it +demonstrates the use of if and while conditions. In +the example they are in fact used to reduce the number of +nonterminals, and to reduce the overhead due to the recursion +that would be involved in parsing an expression with an +ordinary recursive descent parser. In an ordinary LL(1) +grammar there would be one nonterminal for each operator +priority. The example shows how we can do it all with one +nonterminal, no matter how many priority levels there are. 
+.sp 1 +.nf +.ft CW +{ +#include <stdio.h> +#include <ctype.h> +#define MAXPRIO 5 +#define prio(op) (ptab[op]) + +struct token { + int t_tokno; /* token number */ + int t_tval; /* Its attribute */ +} stok = { 0,0 }, tok; + +int nerrors = 0; +int regs[26]; /* Space for the registers */ +int ptab[128]; /* Attribute table */ + +struct token +nexttok() { /* Read next token and return it */ + register c; + struct token new; + + while ((c = getchar()) == ' ' || c == '\et') { /* nothing */ } + if (isdigit(c)) new.t_tokno = DIGIT; + else if (islower(c)) new.t_tokno = IDENT; + else new.t_tokno = c; + if (c >= 0) new.t_tval = ptab[c]; + return new; +} } + +%token DIGIT, IDENT; +%start parse, list; + +list : stat* ; + +stat { int ident, val; } : + %if (stok = nexttok(), + stok.t_tokno == '=') + /* The conflict is resolved by looking one further + * token ahead. The grammar is LL(2) + */ + IDENT + { ident = tok.t_tval; } + '=' expr(1,&val) '\en' + { if (!nerrors) regs[ident] = val; } + | expr(1,&val) '\en' + { if (!nerrors) printf("%d\en",val); } + | '\en' + ; + +expr(int level; int *val;) { int expr; } : + factor(val) + [ %while (prio(tok.t_tokno) >= level) + /* Swallow operators as long as their priority is + * larger than or equal to the level of this invocation + */ + '+' expr(prio('+')+1,&expr) + { *val += expr; } + /* This states that '+' groups left to right. If it + * should group right to left, the rule should read: + * '+' expr(prio('+'),&expr) + */ + | '-' expr(prio('-')+1,&expr) + { *val -= expr; } + | '*' expr(prio('*')+1,&expr) + { *val *= expr; } + | '/' expr(prio('/')+1,&expr) + { *val /= expr; } + | '%' expr(prio('%')+1,&expr) + { *val %= expr; } + | '&' expr(prio('&')+1,&expr) + { *val &= expr; } + | '|' expr(prio('|')+1,&expr) + { *val |= expr; } + ]* + /* Notice the "*" here. It is important. 
+ */ + ; + +factor(int *val;): + '(' expr(1,val) ')' + | '-' expr(MAXPRIO+1,val) + { *val = -*val; } + | number(val) + | IDENT + { *val = regs[tok.t_tval]; } + ; + +number(int *val;) { int base; } + : DIGIT + { base = (*val=tok.t_tval)==0?8:10; } + [ DIGIT + { *val = base * *val + tok.t_tval; } + ]* ; + +%lexical scanner ; +{ +scanner() { + if (stok.t_tokno) { /* a token has been inserted or read ahead */ + tok = stok; + stok.t_tokno = 0; + return tok.t_tokno; + } + if (nerrors && tok.t_tokno == '\en') { + printf("ERROR\en"); + nerrors = 0; + } + tok = nexttok(); + return tok.t_tokno; +} + +LLmessage(insertedtok) { + nerrors++; + if (insertedtok) { /* token inserted, save old token */ + stok = tok; + tok.t_tval = 0; + if (insertedtok < 128) tok.t_tval = ptab[insertedtok]; + } +} + +main() { + register *p; + + for (p = ptab; p < &ptab[128]; p++) *p = 0; + /* for letters, their attribute is their index in the regs array */ + for (p = &ptab['a']; p <= &ptab['z']; p++) *p = p - &ptab['a']; + /* for digits, their attribute is their value */ + for (p = &ptab['0']; p <= &ptab['9']; p++) *p = p - &ptab['0']; + /* for operators, their attribute is their priority */ + ptab['*'] = 4; + ptab['/'] = 4; + ptab['%'] = 4; + ptab['+'] = 3; + ptab['-'] = 3; + ptab['&'] = 2; + ptab['|'] = 1; + parse(); + exit(nerrors); +} } +.fi +.ft R +.bp +.SH +Appendix C. How to use \fILLgen\fR. +.PP +This appendix demonstrates how \fILLgen\fR can be used in +combination with the \fImake\fR program, to make effective use +of the \fILLgen\fR-feature that it only changes output files +when necessary. \fIMake\fR uses a "makefile", which +is a file containing dependencies and associated commands. +A dependency usually indicates that some files depend on other +files. When a file depends on another file and is older than +that other file, the commands associated with the dependency +are executed. +.PP +So, \fImake\fR seems just the program that we always wanted. 
+However, it +is not very good at handling programs that generate more than +one file. +As usual, there is a way around this problem. +A sample makefile follows: +.sp 1 +.ft CW +.nf +# The grammar consists of the files decl.g, stat.g and expr.g. +# The ".o"-files are the result of a C-compilation. + +GFILES = decl.g stat.g expr.g +OFILES = decl.o stat.o expr.o Lpars.o +LLOPT = + +# As make doesn't handle programs that generate more than one +# file well, we just don't tell make about it. +# We just create a dummy file, and touch it whenever LLgen is +# executed. This way, the dummy in fact depends on the grammar +# files. +# Then, we execute make again, to do the C-compilations and +# such. + +all: dummy + make parser + +dummy: $(GFILES) + LLgen $(LLOPT) $(GFILES) + touch dummy + +parser: $(OFILES) + $(CC) -o parser $(LDFLAGS) $(OFILES) + +# Some dependencies without actions : +# make already knows what to do about them + +Lpars.o: Lpars.h +stat.o: Lpars.h +decl.o: Lpars.h +expr.o: Lpars.h + +.fi +.ft R diff --git a/util/LLgen/doc/LLgen.refs b/util/LLgen/doc/LLgen.refs new file mode 100644 index 000000000..df73595b8 --- /dev/null +++ b/util/LLgen/doc/LLgen.refs @@ -0,0 +1,54 @@ +%T An ALL(1) Compiler Generator +%A D. R. Milton +%A L. W. Kirchhoff +%A B. R. Rowland +%B Proc. of the SIGPLAN '79 Symposium on Compiler Construction +%D August 1979 +%J SIGPLAN Notices +%N 8 +%P 152-157 +%V 14 + +%T Lex - A Lexical Analyser Generator +%A M. E. Lesk +%I Bell Laboratories +%D October 1975 +%C Murray Hill, New Jersey +%R Comp. Sci. Tech. Rep. No. 39 + +%T Yacc: Yet Another Compiler Compiler +%A S. C. Johnson +%I Bell Laboratories +%D 1975 +%C Murray Hill, New Jersey +%R Comp. Sci. Tech. Rep. No. 32 + +%T The C Programming Language +%A B. W. Kernighan +%A D. M. Ritchie +%I Prentice-Hall, Inc. +%C Englewood Cliffs, New Jersey +%D 1978 + +%A M. Griffiths +%T LL(1) Grammars and Analysers +%E F. L. Bauer and J. 
Eickel +%B Compiler Construction, An Advanced Course +%I Springer-Verlag +%C New York, N.Y. +%D 1974 + +%T Make - A Program for Maintaining Computer Programs +%A S. I. Feldman +%J Software - Practice and Experience +%V 10 +%N 8 +%P 255-265 +%D August 1979 + +%T Methods for the Automatic Construction of Error Correcting Parsers +%A J. R\*:ohrich +%J Acta Informatica +%V 13 +%P 115-139 +%D 1980 diff --git a/util/LLgen/doc/LLgen_NCER.n b/util/LLgen/doc/LLgen_NCER.n new file mode 100644 index 000000000..3693a1525 --- /dev/null +++ b/util/LLgen/doc/LLgen_NCER.n @@ -0,0 +1,2712 @@ +.RP +.TL + + + +Top-down Non-Correcting Error Recovery + in LLgen +.AU +Arthur van Deudekom +Peter Kooiman +.AI +Department of Mathematics and Computer Science +Vrije Universiteit +Amsterdam + + + + + +Supervised by +.AU +dr. D. Grune +.AI +Department of Mathematics and Computer Science +Vrije Universiteit +Amsterdam + +.AB +This paper describes the design and implementation of a parser +generator with non-correcting error recovery based on the extended LL(1) +parser generator LLgen. It describes a top-down algorithm for implementing +this error recovery technique that can handle left-recursive grammars. +The parser generator has been tested with several existing ACK-compilers, +among which C and Modula-2. Various optimizations have been tried and are +discussed in this paper. +.AE +.LP +.nr PS 12 +.nr VS 14 + +.NH +Introduction +.EQ +delim $$ +.EN + +.nr PS 10 +.nr VS 12 +.RS +.LP +One of the trickier problems in constructing parser-generators is what +to do when the input to the generated parser is not well formed. Several +approaches are known, most of which are `correcting', meaning that they +modify the input to make it correct. However, in most cases there are +several possible corrections, and often the one chosen will turn out +to be the wrong one. As a result of such an incorrect choice, spurious error +messages can occur. 
Every programmer knows from experience how the omission +of a single `)' can on occasion lead to pages of error messages. + +.LP +A radically different approach is to just discard all the input up to +and including the offending token, and start with a clean slate at the +token following the offending one. [RICHTER] describes how +this idea can be used to construct a non-correcting error recovery system +that will never introduce spurious error messages. It is, however, +possible that errors are overlooked. + +.LP +In this paper we describe the incorporation of this non-correcting error +recovery into LLgen, an existing LL(1) parser generator. +In this introduction, we will describe in detail this non-correcting error +recovery technique, give an overview of LLgen and how it handles +errors, and finally describe how we have incorporated noncorrecting +error recovery in LLgen. +.RE + +.NH 2 +Non-correcting syntax error recovery + +.LP +Richter describes how syntax error recovery can be done +without making any corrections to the input text. Richter gives three +reasons why recovery without correction is desirable: + +.IP 1 +In most cases there are many possible corrections, the choice among which +will severely influence the further processing of the input. Thus, the +probability of selecting the right correction is not high. + +.IP 2 +The harm done by selecting the wrong correction is often unlimited. + +.IP 3 +The loss of information to the user of a non-correcting recovery technique +need not be grave. + +.LP +The non-correcting technique described by Richter can be summarized as +follows: When a syntax-error has occurred, the input up to and including the +erroneous symbol is discarded; the remainder of the +input is processed by a substring parser of the input +language, that is a parser that recognizes any substring of a string in the input +language. 
When the substring parser detects a syntax error, the offending +symbol is reported as another error, and the input up to and including the +erroneous symbol is discarded. The process is then repeated with the remaining input, possibly +finding other syntax errors, until all the input is scanned. +This process yields what Richter calls a +.I +suffix analysis +.R +of an input string. Formally, given an input string +.I x +, suffix analysis produces a set of strings $w sub k$ and a set of symbols +$ a sub k$ such that +.br + +.IP +$x~ =~ w sub 0 a sub 0 w sub 1 a sub 1~...w sub n-1 a sub n-1 w sub n$ +.LP +and such that: +.br +.IP + $w sub 0$ is the longest prefix of $x$ that is a prefix of +a string in the input language L, formally: there is a string $y$ such that +$w sub 0 y$ is in L, but there is no string $z$ such that $w sub 0 a sub 0 z$ +is in L; +.IP +For $0 < k < n$, $w sub k$ is a longest substring of $x$ that is also a +substring of a string in L, formally: there are strings $u$ and $v$ such that +$u w sub k v$ is in L, but there are no strings $y$ and $z$ such that +$y w sub k a sub k z$ is in L; +.IP +$w sub n$ is a substring of $x$ +that is a substring of a string in L, formally: +there exist $u$ and $v$, such that $u w sub n v$ is in L. Note that +$w sub n$ need not be a suffix of a string in L: if $x$ represents incomplete +input, $w sub n$ is not a suffix of a string in L. + +.LP +Now, the $a sub k$ indicate points at which an error is detected. The +"real" error need not be at $a sub k$; it can have occurred anywhere +within $w sub k a sub k$. +In his paper, Richter shows that, although this method may miss errors, it +will never introduce spurious errors. + +.LP +For implementing the technique, a parser that recognizes any +substring of the input language is needed. If we confine ourselves to +syntactical analysis, it is sufficient to construct a substring +recognizer. 
Richter himself does not give a practical construction, but +[CORMACK] describes how a LR substring parser can be constructed +that handles BC-LR(1,1) grammars. In this paper, we describe the construction +of a LL substring recognizer that can handle any grammar. Furthermore, +our recognizer is actually a suffix-recognizer, that is, a recognizer that +recognizes any suffix of a string in the input language. Our suffix recognizer has the +correct-prefix property, +meaning that it detects the first syntax error as early as possible +in a left-to-right scan of the input. Specifically, if the input language +is L and the invalid input is $x$ , it finds a string $w$ and an input symbol +$a$ such that $x = way$ , there is a string $z$ such that $wz$ +is in L, and there is no string $z$ such that $waz$ is in L. +Because the suffix parser has this correct-prefix property, it can be +used as a substring parser, because it will detect the first input symbol that +is not part of a substring of the language. Because it is a suffix-recognizer, +it additionally will detect incomplete input, because in that case +at the end of the input the parser will not be in an accepting state. + +.NH 2 +Overview of LLgen + +.LP +LLgen is an extended LL(1) parser generator. For a complete description, +see [GRUNE]. +LLgen can actually handle grammars that are not LL(1), because it allows +the use of conflict-resolvers. In case of an LL(1) conflict, these resolvers +are used to statically or dynamically decide which rule to use. As we will see +later, this feature makes it necessary for the suffix-recognizer to +handle grammars that are not LL(1). Semantic actions can occur anywhere +in the grammar rules, and they are executed when their position is +reached during parsing. A typical LLgen rule looks like +.br +.IP +S: A { +.I action +} B +.LP +where the action is a piece of C-code, that will be executed +when the parser is using the rule for S and has recognized A. 
+ +.LP +LLgen-generated parsers use correcting syntax error recovery, based on a +scheme designed by R\*:ohrich [ROEHRICH], inserting or deleting symbols at the point of error detection +until correct input results. This means that actions in the parser will +always be executed in an order that could also have resulted from +syntactically correct input, and most importantly, once a grammar-rule +is started it is guaranteed to be completed. This means that syntactic +errors can never result in inconsistencies for the actions. Actions +only have to deal with syntactically correct input. In a nutshell, the +error recovery in LLgen-parsers works as follows: Suppose the parser is +presented with correct input that breaks off before the end. The error +recovery mechanism now provides a continuation path, chosen in such a +way that all active rules are left as soon as possible. Effectively, the +continuation path is the `shortest way out'. The symbols on this path are +called `acceptable', and end-of-file is also `acceptable'. Furthermore, at +each point along this `shortest path' there can be other terminals that +would be correct; these are `acceptable' as well. Now, when an +error occurs, all symbols that are not acceptable are discarded, until +an acceptable symbol appears in the input. The tokens on the path up to +but not including the acceptable input symbol are inserted. +From then on, normal parsing resumes. + +.NH 2 +Incorporation of non-correcting error recovery in LLgen + +.LP +An important consideration in incorporating the non-correcting recovery +in LLgen was that correct programs should suffer as little as possible +in what regards compilation speed. Furthermore, the existing error +recovery method has the highly desirable property that rules that are +started will be finished too, thus ensuring that errors in the +input text will not cause inconsistencies in the semantic actions. 
We have +implemented the non-correcting error recovery in such a way that this +property is preserved. + +.LP +The way we have achieved these goals is by actually including +the suffix recognizer as a `second recognizer' in the generated parser. +Correct programs are handled in the usual way by the parser, but if an error +occurs the following happens: instead of going to the standard error +recovery routine, the parser starts executing the non-correcting error +handler. This process continues, reporting all errors, until the +end of the input text is reached. Then, control is handed back to +the standard error recovery routine. This routine will now think +there is no more input, and thus start inserting tokens so as to construct +a `shortest way out'. This ensures that all rules that were started are +also finished, and no inconsistencies can occur in the semantic actions. +However, this method does require some modifications to the error reporting +routine. Normally, if the generated parser inserts a token, it reports +this to the user, but in this case this is undesirable. The insertions only +serve to maintain consistency in the semantic actions +and do not signify errors, so reporting of insertions should be suppressed. +.bp +.nr PS 12 +.nr VS 14 +.PS +boxwid = boxwid / 1.5 +boxht = boxht / 1.5 +arcrad = arcrad / 1.5 +movewid = movewid / 1.5 +moveht = moveht / 1.5 +arrowwid = arrowwid / 1.5 +arrowht = arrowht / 1.5 +arrowhead = arrowhead / 1.5 +linewid = linewid / 1.5 +lineht = lineht / 1.5 +.PE +.NH +The LL suffix parser + +.nr PS 10 +.nr VS 12 +.RS +.LP +In this chapter, we describe the construction of the LL suffix parser. +The described parser is not restricted to LL(1) grammars, because the +presence of conflict resolvers in LLgen allows for more general grammars, +that may even be left-recursive. 
We start this chapter with a discussion +of the implications of conflict resolvers, and continue with descriptions +of the parser algorithm, the used data-structures, +the handling of left- and right recursion, and some possible optimizations. +.RE + +.NH 2 +LLgen conflict resolvers and their implications + +.LP +In grammars that are nearly but not completely LL(1), conflicts +will arise in the two places where parsing decisions are made: the choice +of which alternative to start (`alternation conflicts') and the decision +to stop or continue a repeated item (`repetition conflicts'). In order to +allow LLgen to handle this type of grammar, the user can +specify conflict resolvers in those places where conflicts arise. +These resolvers are Boolean expressions labeling an alternative, +and are evaluated when a conflict arises during parsing. If the +expression evaluates to `true' the labeled alternative will be taken. +The Boolean expressions are expressions in C, and can consult +any information available at the point they occur. +However, if a syntactic error has occurred in the input, and the non-correcting +error recovery starts, we can no longer rely on the conflict resolvers to +guide parsing decisions. The suffix recognizer is only concerned with +syntax, and will not execute any semantic actions. It recognizes suffices +of correct input, but does not know or care what prefix would make +the suffix a correct program; as a result, the information that conflict +resolvers could use is not available, because the semantic actions +that would build this information have not been executed. +Therefore, the information used by the conflict resolvers is no longer +reliable, and the suffix parser needs to be able to handle the underlying +grammar without their help. In particular, it has to be able to handle +left-recursive grammars. 
+ +.NH 2 +The suffix parser algorithm + +.LP +Our algorithm needs easy access to the grammar rules; in the description +we assume there is an efficient way to access the grammar rules. In +the next chapter we will describe the details of the actual implementation. +For the moment, we will only consider grammars that are not left- or +right-recursive. In the next section, we will discuss how the algorithm has to be adapted +to handle left- and right recursion. + +.LP +Suppose the grammar is G, and the input to the suffix recognizer is +$a sub 0 a sub 1 ... a sub n-1 a sub n$. Remember that parsing is +always started by the `normal' LLgen generated parser. It's only after +a syntactic error has occurred that the suffix recognizer will be started. +The input to the suffix recognizer thus is the `tail' of the input, starting +at the first symbol after the position where the first syntax error was +found. + +.LP +Now, in order to get parsing going again, the parser scans the grammar +for rules which contain symbol $a sub 0$ in the right hand side: +.br + + A: $alpha ~ a sub 0 ~ beta$ +.br + +.LP +where $alpha$ and $beta$ represent strings of terminals and non-terminals, +possibly empty. Now, for each of these rules found, and for any string +$b sub 0 b sub 1$...$ b sub m$ that can be generated by $beta$ it holds that +$a sub 0 b sub 0 b sub 1$...$b sub m$ is a substring of some string in L. +This can be shown as follows, supposing that the start symbol is S and +S $-> sup * gamma$ A $delta$: +.br + +S $-> sup * gamma$ A $delta$ $-> sup * gamma ~ alpha ~ a sub 0 beta ~ delta +-> sup * gamma ~ alpha ~ a sub 0 b sub 0 b sub 1$...$b sub m delta$ + +.br +Of course, there may very well be more than one such string +$b sub 0 b sub 1$...$b sub m$, and one of these strings can be empty as well, if +$beta$ can produce empty. Now, in what we will call the +.I +predicting phase +.R +the algorithm will +produce all possible symbols $b sub 0$. 
Then, in what we will call the +.I +accepting phase +.R +these symbols are matched against +the input, and those not matching are discarded. Then, entering the next +predicting phase, the algorithm will produce +all symbols $b sub 1$, and match them against the next input symbol in +the subsequent accepting phase, +etc. In case one of the strings $b sub 0$...$b sub m$ is empty, or +the end of one of the strings is reached, some way to continue is +needed; we will discuss this later. First let's see how the +algorithm produces the strings $b sub 0$...$b sub m$ . + +.LP +For each rule in the grammar of the form +.br + + A: $alpha a sub 0 W sub 1 W sub 2$...$W sub p$ +.br + +with each $W sub k$ a terminal or nonterminal, a +.I +prediction graph +.R +is created that looks like this: + +.PS +down; box "$a sub 0$"; arrow; box "$W sub 1$"; arrow +box "$W sub 2$"; arrow dashed; box "$W sub p$" +arrow; box "END" "$[A]$" +.PE + +.LP +The bottom element of these prediction graphs is an end-marker containing the +left-hand side of the rule used. All these graphs have $a sub 0$ on top, and +this $a sub 0$ is matched against the $a sub 0$ in the input in the +accepting phase that follows, removing the +$a sub 0$ from the graph. If the prediction graph is now empty, we have to find a way +to continue; this case is treated later. First we will consider what to do if +the prediction graph is not empty. There are two possibilities: either $W sub 1$ is a +terminal, or it is a nonterminal. If it is a terminal, we are finished for +the moment; if not, the algorithm scans for rules of the form +.br + + $W sub 1$: $U sub 1 U sub 2$...$U sub i$ +.br + +.LP +with each $U sub k$ a terminal or nonterminal. Now, the algorithm substitutes +the top of the prediction graph with the right-hand sides +of all the rules found. Because there can be more than one rule, the +prediction graph can now become a DAG (Directed Acyclic Graph). 
+Supposing there are two rules with $W sub 1$ in the LHS: + +.br + + $W sub 1$: $U sub 1 U sub 2$...$U sub i$ +.br + $W sub 1$: $V sub 1 V sub 2$...$V sub j$ + +.LP +the prediction graph will now look like this: + +.PS +B1: box "$U sub 1$" +move +B2: box "$V sub 1$" +arrow dashed down from bottom of B1 +B3: box "$U sub i$" +arrow dashed down from bottom of B2 +B4:box "$V sub j$" +move to 0.5 +down;move +B5:box "$[W sub 1 ]$" +arrow dashed; +box "$W sub p$" +arrow; +box "END" "$[A]$" +arrow from B3.bottom to B5.top +arrow from B4.bottom to B5.top +.PE + +.LP +The graph element representing $W sub 1$ is left in the graph; the +notation $[W sub 1 ]$ indicates it has been substituted. Such substituted +elements will from now on be ignored by the algorithm. The elements +$U sub 1$ and $V sub 1$ are now `on top' of the prediction graph. + +.LP +If $W sub 1$ can also produce empty, its successor in the prediction graph +has to be processed +as well; the algorithm walks down the graph to this successor, and +there the process is repeated; if it is a terminal we are finished, else we +substitute it with the right-hand sides of its grammar rules. +However, the element that we want to substitute now, say $W sub k$, cannot +be marked `substituted' just like that, because it can be on another +path, on which it cannot be substituted yet. Therefore, a copy of element +$W sub k$ is made, it is marked $[W sub k ]$, and an edge is created +from $[W sub k ]$ to the successor of $W sub k$. 
This produces graphs like +this: +.br +.PS +B1: box "$U sub 1$" +move +B2: box "$V sub 1$" +move +X1:box "$X sub 1$" +arrow dashed down from bottom of B1 +B3: box "$U sub m$" +arrow dashed down from bottom of B2 +B4:box "$V sub j$" +arrow dashed down from bottom of X1 +Xj: box "$X sub j$" +move to 0.5 +down;move +B5:box "$[W sub 1 ]$" +arrow dashed; +B6: box "$W sub k$" +arrow +Wk1:box "$W sub k+1$" +arrow dashed +box "$W sub n$" +arrow; +box "END" "$[A]$" +arrow from B3.bottom to B5.top +arrow from B4.bottom to B5.top +move down from Xj.top;move;move;move +Wk: box "$[W sub k ]$" +arrow from Xj.bottom to Wk.top +arrow from Wk.bottom to Wk1.top +.PE + +.LP +This process of substituting is repeated with all nonterminals that are +now on top of the prediction graph, until there are only terminals on top of +the graph. +This completes the prediction phase of the algorithm, not taking into account +what to do if an END marker appears on top of the graph. +Now, the algorithm enters its accepting phase, in which +the terminals on top are compared with the next symbol in the input. +If a terminal in the graph matches the input, its element is deleted +from the graph, and the substitution process will continue with its +successors, in the next prediction phase. +If a terminal on top of the graph does not +match the input, the path it is on represents a `dead-end', which +does not need to be processed any further. The terminal is no longer +a `top', and the algorithm will not visit it again. 
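The accepting phase described above can be sketched in C. This is a minimal illustration only, not LLgen's implementation: the names struct elem and accept_phase are hypothetical, every element is assumed to have a single successor (as in the graphs before components are joined), and a NULL successor is assumed to stand for the END marker.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical graph element (illustrative names, not LLgen's):
 * one grammar symbol and a single successor.
 */
struct elem {
    int symbol;          /* terminal code on top of a path */
    struct elem *succ;   /* successor; NULL stands for the END marker */
};

/* Accepting phase: compare every terminal on top with the input symbol.
 * A matching top is removed and its successor is handed to the next
 * prediction phase; a non-matching top is a dead end and is dropped.
 * Returns the number of successors stored in next[], which must have
 * room for ntops entries.
 */
size_t accept_phase(struct elem *tops[], size_t ntops,
                    int input, struct elem *next[])
{
    size_t i, n = 0;

    for (i = 0; i < ntops; i++) {
        assert(tops[i] != NULL);
        if (tops[i]->symbol == input)
            next[n++] = tops[i]->succ;
    }
    return n;
}
```

If accept_phase returns 0, no path matches the input symbol, which is the situation in which the recognizer would report an error.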
+ +.LP +There is one tricky situation: consider again this graph: + +.PS +B1: box "$U$" +move +B2: box "$a$" +move to 0.5 +down;move +B5:box "$W sub 1 $" +arrow dashed; +box "$W sub n$" +arrow; +box "END" "$[A]$" +arrow from B1.bottom to B5.top +arrow from B2.bottom to B5.top +.PE + +.LP +Here, the algorithm is processing $W sub 1$ in the predicting phase, and +using some rule it has produced $a$ on top; there is another rule with +$W sub 1$ in its LHS which has produced nonterminal $U$ on top. +Now, suppose $U$ is a nonterminal that can +produce empty. The algorithm then starts substituting $U$, and walks +down to $W sub 1$. What we definitely do not want +is the algorithm to start substituting $W sub 1$ again, because then we +would loop forever. Therefore, if the algorithm starts processing +element $W sub 1$ it should mark it as $[W sub 1 ]$ before it does +anything else. On entering the element +for the second time in the prediction phase, it sees that it is already substituted, +so there is nothing to do. +It then just walks to the successor of $W sub 1$ and +starts substituting it. This is correct, since the fact that the algorithm +enters an element for the second time in a prediction phase means that the element +indirectly can produce the empty string, and thus its successor must +be substituted as well in the prediction phase. + +.LP +It is easy to see that the substitution process will stop: the algorithm can +only loop if it starts processing an element for the second time in a +prediction phase, +or if the processing of an element eventually yields a graph with that +same element on top. +The first case cannot occur because the algorithm marks elements it is +processing as `substituted' before it does anything else, meaning that those elements will not +be processed again; the second case can only occur if the grammar is +left-recursive, which we assumed it was not. 
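The substitution step and the `mark before anything else' rule can be sketched in C as follows. This is an illustration under stated assumptions, not LLgen's code: the toy grammar, the names predict and new_elem, and the fixed-size pool are all hypothetical. As simplifications, elements are marked in place rather than copied (adequate for a single prediction phase only), the terminals on top are collected as a set of characters, and an END marker (a NULL successor) simply ends a path, since the continuation step is not shown.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

struct rule { char lhs; const char *rhs; };

/* Hypothetical toy grammar, not left-recursive:
 * upper case = nonterminal, lower case = terminal, "" = empty.
 */
static const struct rule rules[] = {
    { 'W', "Ub" }, { 'W', "a" }, { 'U', "c" }, { 'U', "" }
};
#define NRULES (sizeof rules / sizeof rules[0])

struct elem {
    char sym;
    int substituted;     /* nonzero: element is [sym] and is skipped */
    struct elem *succ;   /* NULL stands for the END marker */
};

static struct elem pool[64];
static size_t used;

struct elem *new_elem(char sym, struct elem *succ)
{
    struct elem *e;

    assert(used < sizeof pool / sizeof pool[0]);
    e = &pool[used++];
    e->sym = sym;
    e->substituted = 0;
    e->succ = succ;
    return e;
}

/* One prediction phase starting at element e.  The terminals that end
 * up on top are appended to the string tops, which the caller must
 * make large enough (27 bytes suffice for lower-case terminals).
 */
void predict(struct elem *e, char *tops)
{
    size_t r, i, n;

    if (e == NULL)                       /* END marker: not handled here */
        return;
    if (e->substituted) {                /* entered again: walk on */
        predict(e->succ, tops);
        return;
    }
    if (islower((unsigned char)e->sym)) {
        if (strchr(tops, e->sym) == NULL) {
            n = strlen(tops);
            tops[n] = e->sym;
            tops[n + 1] = '\0';
        }
        return;
    }
    e->substituted = 1;                  /* mark before anything else */
    for (r = 0; r < NRULES; r++) {
        struct elem *head = e;

        if (rules[r].lhs != e->sym)
            continue;
        /* Build the right-hand side on top of e, last symbol first. */
        for (i = strlen(rules[r].rhs); i > 0; i--)
            head = new_elem(rules[r].rhs[i - 1], head);
        /* For an empty right-hand side head is e itself, which is now
         * marked, so predict() walks on to the successor of e.
         */
        predict(head, tops);
    }
}
```

Because an element is marked before its rules are expanded, re-entering it through a nonterminal that produces empty only walks on to its successor, which is the termination argument given above.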
+ +.LP +The algorithm simulates +left-most derivations of strings $a sub 0 b sub 0 b sub 1$..$b sub n$ +starting from $a sub 0 W sub 1$..$W sub p$; as we showed before, if +the algorithm recognizes a string $a sub 0 b sub 0$..$b sub n$ that +string is a substring of some string in L. Conversely, because the +algorithm starts out by using all rules of the form +A: $alpha a sub 0 beta$, and then proceeds to simulate all +possible left-most derivations, it will recognize all input +$a sub 0 b sub 0$... $b sub n$ that can be produced starting from +$a sub 0 beta$. + +.LP +Now we will discuss what has to be done if an END marker appears as +top of the prediction graph. +When this happens, it means that starting from some rule +.br + + A: $alpha a sub 0 beta$ + +.br +the algorithm has produced a left-most derivation of a string +$a sub 0 b sub 0$...$b sub n$ starting from $a sub 0 beta$, or that $beta$ can produce +empty and the string so far is just $a sub 0$. The next step is to assume +that we have recognized A and that some string produced by $alpha$ +is part of the prefix that makes the suffix we are recognizing a +correct string in L. Remember that in the END marker we kept record of +the LHS of the rule that has started the graph, and we will now use this +LHS to continue recognizing. What the algorithm does is scan for all +rules of the form: +.br + + B: $gamma$ A $delta$ +.br + +with $gamma$ and $delta$ possibly empty strings of terminals and nonterminals. +The algorithm now starts a new component in the prediction graph, and if $delta$ is +$W sub 1 W sub 2$...$W sub n$ it looks like this: + +.PS +down;box "$W sub 1$"; arrow +box "$W sub 2$"; line dashed; box "$W sub n$" +arrow; box "END" "$[B]$" +.PE + +.LP +Note that the END marker now contains B, because we have started to match +a rule for B. 
If the $delta$ in the rule for B was empty, this just produces +an END marker with B in it; in this case, the process is just repeated +with all rules of the form: +.br + + C: $zeta$ B $eta$ +.br + +.LP +etc., until we have a prediction graph with a nonterminal or terminal on top. +Now, the substitution algorithm is again applied over all nonterminals on +top, until every top contains a terminal. It is possible that during +substitution again an END marker will turn up; if this happens +we again scan for rules to continue with, and so on. +This `continuation algorithm' can only loop if, when +trying to build a new prediction graph for matched symbol A, it produces an empty +graph with again matched symbol A. If this happens, the grammar was +(directly or indirectly) right-recursive, and we assumed that it was not. +Therefore, the algorithm will terminate. The terminals on top of the +new graph after applying this `continuation' algorithm are exactly those +that could follow the string $a sub 0 b sub 0$..$b sub n$ in a substring +of a string in L. +To see this, suppose we have `recognized' the rule +.br + + A: $alpha a sub 0 beta$ + +.br +and $a sub 0 b sub 0 b sub 1$...$b sub n$ is the string produced from +$a sub 0 beta$ by the algorithm. Now, using rule: +.br + + B: $gamma$ A $delta$ + +.br +and supposing that S $-> sup *$ $zeta$ B $eta$ we get +.br + + S $-> sup *$ $zeta$ B $eta$ $->$ $zeta gamma$ A $delta$ $eta$ $-> sup *$ $zeta gamma alpha a sub 0 b sub 0 b sub 1$ ... $b sub n$ $delta$ $eta$ + +.br +.LP +and thus any string produced by a derivation starting from +$delta$ can come right after $a sub 0 b sub 0$...$b sub n$ in a substring +of some string in L. The algorithm will proceed to generate all these +strings starting from $delta$. If $delta$ produces empty, the above +is just repeated. Because in the `continuation' part +all possible rules are considered, the whole algorithm will recognize +all substrings of any string in L. 
In order to determine if we +have actually recognized a suffix of some string in L, we need to +remember if within a predicting phase the `continuation' part of the algorithm has been run +on an END marker containing the start-symbol S; +if this is the case, then the input seen until now is a suffix of some string in L. +Formally, it means that there is a derivation starting from start symbol +$S$ such that if the +input seen until now is $a sub 0 a sub 1$..$a sub n$, then: +.br + + S $-> sup * alpha beta$ $-> sup * alpha a sub 0 a sub 1$..$a sub n$ +.br + +.LP +where $alpha$ can be empty, $beta$ is not empty. + +.NH 2 +The prediction graph data structure + +.LP +The graphs that are produced by the suffix recognizer may grow extremely +large; to facilitate an efficient +implementation we have devised a way of keeping the size of the +data structure under control, in a way that is very similar to +the way described in [TOMITA]. + +.LP +The basic idea is, that in a prediction phase of the algorithm, it is not +necessary to explicitly substitute each nonterminal every time it +turns up as a `top'; it is sufficient to do it once, because the +second substitution will produce exactly the same subgraph starting at +the substituted nonterminal. Here is an example: + +.PS +down;box "$a$";arrow;box "A";arrow dashed;box "[B]";arrow +box "C";arrow dashed;box "END" "[X]" +move right from last box.e; +box "END" "[Y]"; +arrow <- dashed up from last box.top; +box "D";arrow <- up from last box.top +box "B" +.PE + +.LP +Here, in the left component of the graph, nonterminal B has been +substituted. Now, in the same prediction phase, the algorithm again runs into +B, now in the right component. There is no need to compute again +what the substitution will produce, it is exactly the part on top +of B in the left component. 
Therefore, all that is needed is: + +.PS +down;box "$a$";arrow;box "A";arrow dashed; +B1: box "[B]";arrow +box "C";arrow dashed;box "END" "[X]" +move right from last box.e; +box "END" "[Y]"; +arrow <- dashed up from last box.top; +box "D" +arrow from B1.bottom to last box.top +.PE + +So, when, in a prediction phase of the algorithm, a nonterminal is substituted, +the nonterminal is placed on a list, together with a pointer to +the substituted nonterminal. If in the same prediction phase a nonterminal that +is on the list becomes a top, all we need to do is place an edge +between the already substituted one and the successor of the top we are currently +processing. When a prediction phase is finished, the list is cleared. +There is one catch: if we consider again the last picture, +note that if nonterminal B can (directly or indirectly) produce empty, +it is also necessary to substitute D. However, it is not difficult to +determine if a nonterminal can produce empty. LLgen already computes +this information for each nonterminal. + +.LP +Without this `joining together' of graph components, each +element in the graph has exactly one successor, except the END marker, +which has none. +Now that components get joined as described, an element can have any +number of successors. The recognizer algorithm now has to consider all +successors of a graph element instead of one. + +.NH 2 +Handling right recursion + +.LP +The only problem right-recursive grammars cause in the algorithm is in the +`continuation' part; they can cause this part of the algorithm to loop +forever. As an example, consider: +.br + + A: $alpha$ B +.br + B: $beta$ C +.br + C: $gamma$ A + +.LP +Now suppose the `substitution' part of the algorithm has turned up +an END marker with nonterminal A in it. The continuation algorithm will +now produce: + +.PS +box "END" "[A]";move;box "END" "[C]";move;box "END" "[B]";move +box "END" "[A]";move;box "END" "[C]" +.PE + +.LP +etc. etc. 
However, a slight modification to the algorithm suffices +to eliminate this problem; within each prediction phase of the algorithm, we +simply maintain a list of nonterminals that have turned up in an +END marker. As soon as an END marker turns up whose nonterminal is +already in the list, we stop the `continuation' algorithm; the part +of the graph that would be produced by it already has been generated +by an earlier invocation of the algorithm in the same prediction phase. +At the end +of a prediction phase, when all heads are terminals, we clear the list. +This way, no looping can occur; even if the right recursion is +indirect, for instance if in the above example the rule for A had been +.br + + A: $alpha$ B $delta$ +.br +.LP +where $delta$ can produce empty, the algorithm still works; the substitution +of $delta$ will yield an END marker on top, and when trying to find +a continuation for LHS A the algorithm notices A is already on the list. + + +.NH 2 +Handling left recursion + +.LP +Left-recursion is, unfortunately, a much tougher problem than +right-recursion. The result of left-recursive grammar rules is that +the substitution algorithm never stops, because it can keep on building +the graph with the same set of rules without ever turning up a terminal. +One course of action would be to pre-process the grammar rules to +eliminate left-recursion; there are algorithms that eliminate direct +and indirect left-recursion. However, we have taken another course; by +allowing the produced graphs to contain loops, we can handle left +recursion without any modifications to the grammar. As soon as +we come to the point that we want to substitute a nonterminal +which was already substituted earlier on the same path and in +the same prediction phase, we can +make a link from the `older' nonterminal to the successor of +the `new' nonterminal. In this way we have constructed a loop +in the graph. 
As an example, suppose we have the following rules: +.br + +D: A + +A: B a + +B: A | x + +.br +Suppose also that we have nonterminal `D' on top of a stack. We +now start substituting `D': + +.PS +A: box "A" +move +X: box "x" +move to 0.5 +down +move +B: box "[B]" +arrow +box "a" +arrow +box "[A]" +arrow +box "[D]" +arrow dashed +box "END" "[S]" + +arrow from A.s to B.n +arrow from X.s to B.n + +.PE + +.LP +We now have an `A' on top of the stack which was already +substituted on the same path and also in the same prediction phase. To avoid +never-ending substitution we make a loop as follows: + +.PS +A: box "A" dashed +move +X: box "x" +move to 0.5 +down +move +B: box "[B]" +arrow +box "a" +arrow +A2: box "[A]" +arrow +box "[D]" +arrow dashed +box "END" "[S]" + +arrow dashed from A.s to B.n +arrow from X.s to B.n +arc <- from B.w to A2.w +.PE + +.LP +The dashed box with `A' in it means that it can be deleted, because +there is already an occurrence of it in the loop. + +.LP +The most beautiful result of loops in graphs is +that the original parsing algorithm needs only one minor change. +When the algorithm visits an element which has more than one +outgoing edge the algorithm starts tracking down both paths, +just like before, only now there may be one or more backedges among +these edges, but the algorithm need not be aware of this fact. +The only difficulty with loops is that the algorithm might go into +a loop; it continues searching for terminals but it might happen +that there are no valid terminals in the loop. The solution to this +problem is not very difficult; just set a flag at all elements we +visit. When we reach an element which has this flag turned on, we +don't have to search any further. At the end of the prediction phase, when we +have found all possible new heads, all flags are cleared.
+Even if there are no loops in the +prediction graph, setting flags may be used as an optimization: +it is possible that two paths come together at one point. In that situation +it is useless to scan for the second time the part of the graph which +both paths have in common. + +.NH 2 +Some optimizations using reference counts + +.LP +As explained in section 2.2, it is sometimes necessary to copy a +prediction graph element before substituting it. In order to determine +if a certain element has to be copied, it is convenient to maintain +a reference count in each graph element. This reference count keeps +track of the number of edges that enter an element. Now, when we want +to substitute an element with reference count not 0, we need to +copy it, because there is another path in the prediction graph that +contains the element we want to substitute, and on this other path +the element cannot be substituted yet. + +.LP +Maintaining reference counts also enables us to perform another +optimization: remember that if, in a prediction phase, a terminal +is predicted that does not match the current inputsymbol, we from +then on just ignore the path in the graph starting at the terminal. +However, we can safely delete the terminal from the graph; furthermore, +all its successors in the prediction graph that have reference count +0 can be deleted as well, as can their successors with reference +count 0, etc. This way, we delete from the prediction graph +most elements that are no longer accessible, but not all of them; as will +be explained in the next section, loops in the prediction graph +can cause problems. + +.NH 2 +The algorithm to delete inaccessible loops + +.LP +Deleting graph elements which are no longer reachable is not as easy +as it looks when there are loops in the graph, introduced by +the extension to the algorithm that handles left recursive grammars. 
+Suppose for example that we have a very simple loop as in the left +picture below: + +.PS +down +X: box "x" "(0)" +arrow +box "[B]" "(2)" +arrow +box "a" "(1)" +arrow +box "[A]" "(1)" +arrow +box "[D]" "(1)" +arc <- from 2nd box.w to 2nd last box.w + +move right from X.ne +move +move +move +move +move +move +down +box "x" "(0)" dashed +arrow dashed +B: box "[B]" "(1)" +arrow +box "a" "(1)" +arrow +box "[A]" "(1)" +arrow +box "[D]" "(1)" +arc <- from B.w to 2nd last box.w +.PE + +.LP +The number below each symbol indicates the reference count of that element. +Suppose now that we delete `x', then we have the situation depicted in the +picture on the right. The loop consisting of `[B]', `a' and `[A]' is now +unreachable, so all these elements can be deallocated. +The reference count of `[B]' is 1, so it will not be deleted. To be precise +all elements in the loop have their reference counts on 1, and +consequently none of these will be deleted. But we stated earlier +that all elements of the loop cannot be reached anymore and that the +loop had to be deleted! In this example the reference counts of the +loop elements are all 1, but in more complex situations it is also +possible that some of the elements have a reference count of more +than 1. + +.LP +To solve this problem we present an algorithm, devised by E. Wattel, that +determines whether a loop can be deleted or not. +The algorithm consists of two parts. The first part of the algorithm goes as +follows: it presumes that all elements of the loop will indeed be +deleted. Every time it deletes an element it decreases the reference +count of all the successors of the element that are also member of the same +loop. How the algorithm knows which elements belong to the loop and which +do not will be explained later. 
The situation of the example above will now +look like this: + +.PS +down +box "[B]" "(0)" +arrow +box "a" "(0)" +arrow +box "[A]" "(0)" +arrow +box "[D]" "(1)" +arc <- from 1st box.w to 2nd last box.w +.PE + +.LP +The number below each symbol indicates again the reference count +after we have applied the first part of the algorithm. + +.LP +The second part of the algorithm checks and restores the +reference counts of all members of the loop. When it finds +out that one or more reference counts are not 0, it concludes +that it is still possible to enter the loop in some way, and +that it cannot be +deleted yet. In the other case it reports that the loop can be +deleted, which is also true in our example. + +.LP +We will now formally describe the first part of the algorithm +that finds all directed circuits from a given vertex, and determines if +the vertices on those circuits can be deleted. +The algorithm works on prediction-graphs in which every edge that +is in a circuit is marked. Note that a marked edge may be in more than one circuit. +We will call this mark `C'. +The input to the algorithm is such a prediction graph, and a start vertex, +say A. The first part of the algorithm is: + +.IP 1 +Put the start vertex A on a list L; mark all edges `unused' +.IP 2 +If L is empty, stop +.IP 3 +For each vertex in list L, check if there are edges marked both `C' and +`unused'. For each edge found, mark it `used', and traverse it to its +other endpoint; put this endpoint on a new list M, initially empty +.IP 4 +Decrease the reference count of all vertices on M by 1 +.IP 5 +L := M; go to 2 + +.LP +It is clear that the algorithm will terminate: each edge is only traversed once, +and the number of edges is finite. We will now prove some properties of this +part of the algorithm. + +.LP +.I +An edge is traversed by the algorithm if and only if it is on some +directed circuit $A ->$...$->A$.
+.R +.br + +The if-part is easy; if an edge $e$ connecting vertices $W$ and $V$ is on some directed circuit starting in +$A$, then there is a path $A ->$...$-> W -> V$; let $A ->$...$-> W -> V$ be a path +of minimum length from $A$ to $V$. If the length of the path from $A$ to +$W$ is $k$, then after turn $k$ of the algorithm $W$ will be on list L. To see +that this is the case, suppose that $W$ is not on list L after turn $k$; +this means that the edge entering $W$ was already marked used in a +previous turn, but then there would be a shorter path from $A$ +to $W$, contradicting the assumption that the path is of +minimum length. The edge +$e$ is marked `C', because it is in a circuit; it is marked `unused', for if +it were marked used, there would be a shorter path from $A$ to $V$. So, +in turn $k + 1$, the edge $e$ will be traversed. + +.LP +On the other hand, suppose that an edge $e$ is traversed by the algorithm; +we will show by induction on the number of turns the algorithm has made +that $e$ is on a directed circuit $A->$..$->A$. In the first turn, all +edges from $A$ that are marked `C' are traversed, and clearly, if an edge +from $A$ is part of a circuit then that edge is part of a circuit from $A$ to $A$. +Now suppose that in turn $n+1$ an edge $e$ connecting vertices $W$ and +$V$ is traversed. This means the edge is +marked `C', so it is part of some circuit. If there is a path from $V$ to $A$, +we can simply trace a circuit +$A->$...$-> W -> V -> $...$-> A$, and clearly $e$ is on a circuit from +$A$ to $A$. Now, suppose there is no path from $V$ to +$A$. We can always trace a circuit $W -> V ->$...$-> W$ because the +edge from $W$ to $V$ is part of a circuit; and by the +induction hypothesis there is a circuit $A ->$...$-> W ->$...$-> A$. We can +now make a `detour' at $W$, yielding a circuit $A->$...$-> W -> V$... +$-> W ->$...$-> A$. This case is shown in the picture below. +So in either case $e$ is on a circuit from $A$ to $A$. 
+ +.PS +down; +B1: box "A"; +arrow dashed; +B3: box dashed; +arrow dashed; +B2: box "W"; +arrow dashed; box dashed; +arc <- from B1.w to last box.w +arrow right "$e$" "C" from B2.e +box "V"; arrow dashed; box dashed; +arrow dashed -> from last box.n to B3.e +.PE + +.LP +.I +A vertex appears on list L if and only if it is on some directed +circuit from $A$ to $A$. +.R +.br + +.LP +If a vertex is in such a circuit, there is an edge that enters it, which +is part of a circuit from $A$ to $A$; we already showed that this edge +is traversed by the algorithm, and thus the vertex will appear on list +L. Conversely, if a vertex appears on list L, then an edge entering +that vertex has been traversed by the algorithm; we showed that this +edge is part of a circuit from $A$ to $A$, and thus the vertex is +part of a circuit from $A$ to $A$. + +.LP +.I +When the algorithm is finished, each vertex that is part of some +directed circuit from $A$ to $A$ has its reference count decreased by exactly +the number of edges entering it that are part of a directed circuit from $A$ to $A$. +.R +.br + +.LP +Each edge that is part of some circuit from $A$ to $A$ is traversed +exactly once; the reference count of the endpoint is decreased +by one after an edge has been traversed. Thus, if a vertex is the endpoint +of $k$ such edges, its reference count is decreased by $k$. + +.LP +.I +If the reference count of each of the vertices visited by the algorithm +is 0 after the algorithm has finished, all these vertices can be deleted; +if the reference count is not zero for one or more of the visited +vertices, then none of them can be deleted. +.R +.br + +.LP +Suppose all visited vertices have reference count 0; this means that +each of the vertices is only entered by edges that are on a circuit +from $A$ to $A$. Therefore, it holds that any path leading to any +of the visited vertices has to start in one of the visited vertices; there +is no path starting in an unvisited vertex to a visited one.
Thus, +all the visited vertices are unreachable. +Conversely, if one of the visited vertices has reference count not zero, +then there is a path from an unvisited vertex to this vertex. Because from +the vertex with nonzero reference count we can get to $A$, and from $A$ +we can get to any of the other vertices, all visited vertices are +reachable. + +.LP +The second part of the algorithm now checks if all reference counts are +zero, and if they are, it deletes all visited vertices. + + +.NH 2 +Marking loop elements + +.LP +One point we have omitted so far is how the edges in the prediction +graph that are part of a loop get marked. +Basically, a loop can be detected: + + a. when it is made; +.br + b. when we want to know about it. + +.LP +The first approach checks if a loop is constructed +as soon as we join two paths in the graph, and if so, marks all +edges of the loop. The other approach does not do any checking when two +paths are joined together; it starts looking for loops when we want +to delete an element with reference count not 0, marking all edges +belonging to the loops it discovers. In practice it turns out that +we very often encounter elements that we would like to delete, but that have +reference count not 0, whereas the joining of paths occurs relatively +infrequently. We therefore have chosen to check if a loop is created +when two paths in a prediction graph are joined. + +.LP +Now the question arises how to find and mark all edges of +the loop. For this problem we also devised an algorithm. +Because we already know that there is an edge from the element on which +the new path is connected to the successor of the joined element, the +algorithm only has to find a path from this last element back to the first one. +This can be done by a backtracking depth first search; to find a path from +one element to another we have to find a possibly empty path +from one of the successors of the first element to the last element.
As +soon as we have found a path, we can mark all the edges on the path and also +the backedge as loop edges. In case that there is more than one path +back to the first element it is necessary that the algorithm continues +searching after it has found one path. + +.LP +To avoid looping of this algorithm we have to set a flag at the elements +which are on the path already. When the algorithm is backtracking it can +clear the flags at the elements it is leaving. + +.LP +To speed up the searching process we can set flags at the edges we have already +visited that did not lead back to the first element. When the algorithm +encounters such an edge it already knows that this edge is not worth +searching again and can be skipped. At the end of the algorithm these +flags have to be cleared again. + +.LP +One might propose another optimization: as soon as +we reach an edge that is already marked as a loop edge, we +can stop searching for other loop edges. There is, however, +a case in which this can go wrong. Imagine the following situation: + +.PS +down +E: box "[E]" +arrow " C" ljust +D: box "[D]" +arrow " C" ljust +C: box "c" +arrow " C" ljust +box "b" +arrow " C" ljust +A: box "[A]" +arrow +box "a" + +move right from D +move right +J: box "[J]" +down +arrow from J.s " C" ljust +I: box "i" +arrow " C" ljust +H: box "[H]" +arrow from H.s to A.e + +arc <- from E.w to A.w +move left from C +move left +"C" +arc -> from H.e to J.e +move right from I +move right +"C" + +arrow dashed from E.s to J.n + + +.PE + +What we have here is a prediction graph with two loops; all edges that belong +to a loop are again marked with a `C'. Note that the edge between `[H]' +and `[A]' is not a loop edge. Suppose that `[J]' is not yet +completely substituted, i.e. there is another production rule for +J: +.br + +J: E + +.br +The `E' on top of the right path is now joined with the `[E]' +on the left path, which is depicted by the dashed arrow +between `[E]' and `[J]'.
When we take a good look at the graph +we see that the two loops are merged into one. But that is not +the most important observation we have to make: not only the +edge between `[E]' and `[J]' must be marked as a loop edge, but +also the edge between `[H]' and `[A]'! So it is not possible +to stop searching for loop edges as soon as we have found an +edge which was already marked as a loop edge. We have to continue +until we reach the element at which we started: `[E]'. So the +optimization proposed above is incorrect. + + +.NH 2 +Optimizations using FIRST and FOLLOW sets + +.LP +In the algorithm as we have described it, every nonterminal on top of the graph +is substituted until only terminals remain on top; these terminals are +then matched against the current input symbol. However, by using +FIRST sets, we can save considerably on the number of computations +necessary. Suppose one of the top elements of the graph is nonterminal A, +and the current inputsymbol is $a$. Then, it is of no use to substitute +A if terminal $a$ is not in FIRST(A), because then substituting A will +never produce $a$ on top of the graph. So, before substituting a +nonterminal we check if the current inputsymbol is in its FIRST set; if +it is not, we can declare the path the nonterminal is on a dead end, and +delete it, without having to perform the actual substitution. Of course, if +A can produce empty, we still have to consider its successor in the graph. + +.LP +Similarly, when we have an END marker on top, with nonterminal B in +it, and we consider using rule +.br + + D: $alpha$ B C $gamma$ + +.br +We first check if the current inputsymbol is in FIRST(C); if this is +not the case, there is no need to start a graph component with this +rule, because it will never produce the next inputsymbol on top. +Again, if C produces empty, we still have to evaluate the part of the +rule following C. 
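The FIRST-set check described above can be sketched in a few lines. This is only an illustration, not LLgen's implementation: the function name, the dictionary representation of the FIRST sets and the toy grammar are all assumptions made for the example.

```python
def check_nonterminal(nonterminal, lookahead, first, nullable):
    """Decide how to treat a nonterminal that is on top of the graph.

    Returns a pair (substitute, consider_successor):
      substitute          -- True if the lookahead is in FIRST(nonterminal),
                             so substitution can put it on top of the graph
      consider_successor  -- True if the nonterminal can produce empty, so
                             its successor in the graph must still be examined
    If both are False, the path is a dead end and can be deleted.
    """
    substitute = lookahead in first[nonterminal]
    consider_successor = nonterminal in nullable
    return substitute, consider_successor


# Hypothetical toy grammar:  A: 'a' A | empty      C: 'c'
FIRST = {"A": {"a"}, "C": {"c"}}
NULLABLE = {"A"}

for symbol, token in (("A", "a"), ("A", "c"), ("C", "a")):
    print(symbol, token, check_nonterminal(symbol, token, FIRST, NULLABLE))
```

In the last call neither flag is set, which corresponds to declaring the path a dead end without performing the substitution.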
+ +.LP +To circumvent the problems caused in the FIRST set optimization by +nonterminals that produce empty, we can also make use of FOLLOW-sets. +When substituting, if we encounter a nonterminal whose FIRST set does +not contain the current inputsymbol but which can produce empty, +we check if the current inputsymbol is in its FOLLOW set. If it is not, +there is no need to process its successor. Similarly, in case we +are processing an END marker as explained above, there is no need +to process the part of the rule following C if FIRST(C) does not +contain the input symbol, or C produces empty but the inputsymbol +is not in FOLLOW(C). +.bp +.nr PS 12 +.nr VS 14 + +.NH +Test results + +.nr PS 10 +.nr VS 12 +.RS + +.LP +In this chapter, we discuss some test results that were obtained +by recompiling existing ACK compilers with the modified LLgen. +We tried several combinations of possible optimizations, including +`dumb' ones, like no optimization at all, not even deleting unreachable +prediction graph elements. +The incorporation of LLgen with non-correcting error recovery went +smoothly; only minor modifications to the Make-files were necessary. +Specifically, these modifications consisted of passing an extra +flag to LLgen, and including the new generated C-file Lncor.c in +the list of generated C-files. Also, the LLmessage error reporting +routine had to be adapted. We successfully recompiled the C, Modula-2 +and Occam compilers; in the next sections, we discuss some test results +that were obtained with the Modula-2 and C compilers. + +.RE +.LP +.NH 2 +Performance + +.LP +We will now present and discuss, with the aid of some +diagrams, time and space measurements on the non-correcting error +recovery. We have measured the effect of various optimizations. +These optimizations include the first-set optimization and the follow-set +optimization. We also measured the effect of leaving out the loop-deletion +algorithm, regarding both time and space.
We performed our measurements using +C- and Modula-2-programs of three different sizes; one of approximately +750 tokens, one of appr. 5000 tokens and one of appr. 15000 tokens. We have +chosen to represent the sizes of programs in the number of tokens instead of +number of lines, because the number of tokens more realistically +reflects the load the programs put on the error recovery mechanism. Also we give +our time measurements in usertime instead of realtime, because realtime +depends heavily on the load of the system, which usertime does not. +Our space measurements are based on the size of the prediction graphs. +Note that all files are entirely recognized by the non-correcting error +recovery technique. We achieved this by putting a `1' at the beginning +of each file; because then each file starts with a syntax error LLgen +is forced to continue with the non-correcting error recovery. + +.NH 3 +Time and space measurements on the effect of the first-set optimization + +.LP +In the diagram below we show the time measurements we got from recognizing +the C-programs both with and without first-set optimization. + +.G1 +coord x 0, 17000 y 0, 65 +ticks bot out at 750, 5000, 15000 +label bot "Number of tokens" +label left "User Time" "(sec)" left .3 +draw no_opt dashed +draw first_opt dashed + +copy thru X + times size +2 at $1, $2 + times size +2 at $1, $3 + next no_opt at $1, $2 + next first_opt at $1, $3 +X until "XXX" + +742 2.5 .9 +5010 16.3 5.8 +14308 54.2 16.8 +XXX + +copy thru X "$1 $2" size -2 at 11000, $3 X until "XXX" +No optimization 55 +First-set optimization 20 +XXX +.G2 + +.I +.ce +Time measurements of three C-programs with and without first-set optimization +.R + +.LP +Notice the considerable time savings we +get when the first-set optimization is turned on; a factor of slightly more than +3. Obviously this is an extremely useful optimization.
On the other hand +we found there were no measurable time savings when using the follow-set +optimization; for that reason we did not chart the result of this optimization. +It seems that the time savings gained by the optimization are +wasted again by the extra processing time needed. We conclude that +this optimization is of little or no use when we want to save on time. + +.LP +In the following picture the time measurements of three Modula-2 programs +are given, again with and without first-set optimization. + +.G1 +coord x 0, 17000 y 0, 65 +ticks bot out at 750, 5000, 15000 +label bot "Number of tokens" +label left "User Time" "(sec)" left .3 +draw no_opt dashed +draw first_opt dashed +copy thru X + times size +2 at $1, $2 + times size +2 at $1, $3 + next no_opt at $1, $2 + next first_opt at $1, $3 +X until "XXX" + +823 1.3 .6 +4290 7.6 3.5 +16530 30.5 14.3 +XXX + +copy thru X "$1 $2" size -2 at 13000, $3 X until "XXX" +No optimization 30 +First-set optimization 15 +XXX +.G2 + +.I +.ce +Time measurements of three Modula-2-programs with and without first-set optimization +.R + +.LP +From this picture we can conclude mainly the same as above; considerable +time savings when we use the first-set optimization; +the factor is somewhat less, but still more than 2. Again we have omitted +the results of the follow-set optimization, for the same reason as before. + +.LP +There is however one remarkable difference between the two languages: parsing +C-programs takes almost twice as much time as parsing programs of comparable +sizes written in Modula-2. This can be explained by the fact that the +C-grammar is far more complicated than that of Modula-2, and also the +production rules are longer in C, so building, deleting and definitely +traversing the graph will consume more time. + +.LP +Now we come to the space measurements of both C- and Modula-2 programs.
+In the picture below we present the maximum sizes of the prediction graphs, +during the recognition of the three C-programs. + +.G1 +coord x 0, 17000 y 0, 18000 +ticks bot out at 750, 5000, 15000 +label bot "Number of tokens" +label left "Maximum size of" "the prediction graph" "(bytes)"left .3 +draw no_opt dashed +draw first_opt dashed +copy thru X + times size +2 at $1, $2 + times size +2 at $1, $3 + next no_opt at $1, $2 + next first_opt at $1, $3 +X until "XXX" + +742 5568 10444 +5010 7668 12664 +14308 13636 17308 +XXX + +copy thru X "$1 $2" size -2 at 8000, $3 X until "XXX" +No optimization 16000 +First-set optimization 7000 +XXX +.G2 + +.I +.ce +Maximum sizes of the prediction graphs when recognizing three C-programs +.R + +.LP +From this diagram we see that, although the prediction graphs +are smaller when the first-set optimization is used, the space savings are +not as spectacular as the time savings achieved by this optimization. + +.LP +In Modula-2 the first-set optimization also causes a decrease in memory +usage. The savings are less than in C, but still about 1.5 Kb. Again +this can be explained by the fact that the rules of the Modula-2 grammar +are shorter than those of C.
+ +.G1 +coord x 0, 17000 y 0, 12000 +ticks bot out at 750, 5000, 15000 +label bot "Number of tokens" +label left "Maximum size of" "the prediction graph" "(bytes)" left .3 +draw no_opt dashed +draw first_opt dashed +copy thru X + times size +2 at $1, $2 + times size +2 at $1, $3 + next no_opt at $1, $2 + next first_opt at $1, $3 +X until "XXX" + +823 5056 3292 +4290 6420 4664 +16530 11388 9632 +XXX + +copy thru X "$1 $2" size -2 at 8000, $3 X until "XXX" +No optimization 10000 +First-set optimization 4000 +XXX +.G2 + +.I +.ce +Maximum sizes of the prediction graphs when recognizing three Modula-2-programs +.R + +.NH 3 +Input that is recognized in quadratic time + +.LP +The measurements presented may suggest that the time required to +recognize input depends linearly on the length of the input; however, +this is not always the case. When there are recursive rules in the +grammar, the time needed to recognize input that is produced by these +rules can become proportional to the square of the input length. +Consider this set of grammar rules: +.br +.nf + + S: '{' A '}' + A: 'a' A | $epsilon$ + +.fi +.LP +When the input is `{aaa....', the algorithm will produce the following +prediction graphs: + +.PS +up; B1: box "END" "S"; arrow <- ;box "}";arrow <- ;box "A";arrow <- ;box "{"; +move right from B1.se; move +up; B2: box "END" "S"; arrow <-; box "}"; arrow <-; box "[A]"; +arrow <-; box "A"; arrow <-; box "a"; +move right from B2.se; move +up; B3: box "END" "S"; arrow <-; box "}"; arrow <-; box "[A]"; +arrow <-; box "[A]"; arrow <-; box "A"; arrow <-; box "a"; +move right from B3.se;move +up; B4: box "END" "S"; arrow <-; box "}"; arrow <-; box "[A]"; +arrow <-; box "[A]"; arrow <-; box "[A]"; arrow <- ; box "A"; arrow <-;box "a"; +.PE + +.LP +In each prediction phase, a new [A] appears on the prediction graph. However, +since A also produces empty, the prediction algorithm has to traverse all the +elements [A] until it finds the element `}'.
In the first prediction phase, +there is one element [A], in the second there are two, etc, so in all +1 + 2 + 3 + ... + k = $k(k+1) over 2$ elements have to be traversed if +there are k prediction phases, making this proportional to the square +of the input length. We constructed a parser with this simple input grammar +and measured the processing time the error recovery mechanism used. +In the following diagram the dashed line shows the processing time needed; +the dotted line is the curve $t = 13 times 10 sup -6 n sup 2$. Clearly the processing time +is proportional to the square of the number of tokens. + +.G1 +coord x 0, 2100 y 0, 60 +ticks bot out at 500, 1000, 1500, 2000 +label bot "Number of tokens" +label left "User Time" "(sec)" left .3 +draw quad dashed + +copy thru X + times size +2 at $1, $2 + next quad at $1, $2 +X until "XXX" + +500 3.0 +1000 12.4 +1500 28.6 +2000 51.4 +XXX + +draw dotted +for i from 0 to 2100 by 25 do { next at i, 0.000013 * i * i } +.G2 + +.LP +In the grammar used for the C compiler, array initializations are handled by a recursive +rule, so we would expect that the error recovery mechanism needs quadratic +processing time to recognize such an initialization; we made measurements on +the processing time and indeed, the +processing time needed grows proportionally to the square of the size of the input, as the +next figure shows. Here, the processing times are about half of those in +the previous example; this is so because the recursion appears after two +tokens are recognized. Note that the algorithm only takes quadratic time +when it is recognizing input that is generated by a recursive grammar rule. +Other input is still recognized in linear time, regardless of the fact that +there are recursive grammar rules.
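The count 1 + 2 + ... + k used in the argument above can be checked with a small sketch. It only models the number of [A] elements walked over per prediction phase; it does not build prediction graphs, and the function name is an assumption made for the example.

```python
def elements_traversed(k):
    """Total number of [A] elements walked over in k prediction phases,
    assuming phase i has to pass i substituted [A] elements to reach '}'."""
    return sum(range(1, k + 1))


for k in (500, 1000, 2000):
    total = elements_traversed(k)
    # Agrees with the closed form k(k+1)/2 from the text.
    assert total == k * (k + 1) // 2
    print(k, total)
```

Doubling the number of phases roughly quadruples the total, which is the quadratic growth described in the text.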
+ +.G1 +coord x 0, 5000 y 0, 85 +ticks bot out at 1150, 2400, 3600, 4800 +label bot "Number of tokens" +label left "User Time" "(sec)" left .3 +draw quad dashed + +copy thru X + times size +2 at $1, $2 + next quad at $1, $2 +X until "XXX" + +1150 5.1 +2400 20.3 +3600 43.7 +4800 78.6 +XXX +.G2 + +.LP +Unfortunately, there is no easy way to speed up the recognition of these +recursively defined language elements; they are caused by the substituted +tokens that are left in the prediction graph, and we cannot just delete those +`dummies' from the graph during a prediction phase because the `join' part of the +prediction algorithm depends on them. One could traverse the graph after +a prediction phase to delete the dummies, but then the processing +time needed to recognize non-recursively defined language elements would +increase dramatically. However, we feel that in practice things +like large array initializations will not occur in hand-made programs; when +they occur, it is probably in computer-generated programs, which normally +will be correct anyway, meaning that the error recovery never sees them. +When testing such generated programs, one is likely +to use small test-cases, which are handled well by the error recovery. + +.NH 3 +Time measurements on the effect of leaving out the loop-deletion algorithm + +.LP +We now show what effect the loop-deletion algorithm has on processing time. +To put it another way: how much time can be saved when we turn off the +loop-deletion algorithm. In the diagram below we give the measurements of +the three C-programs; note that we do use the first-set optimization. 
+ +.G1 +coord x 0, 17000 y 0, 22 +ticks bot out at 750, 5000, 15000 +label bot "Number of tokens" +label left "User Time" "(sec)" left .3 +draw no_loop dashed +draw loop dashed +copy thru X + times size +2 at $1, $2 + times size +2 at $1, $3 + next no_loop at $1, $2 + next loop at $1, $3 +X until "XXX" + +742 .9 .4 +5010 5.8 6.8 +14308 16.8 20.5 +XXX + +copy thru X "$1 $2" size -2 at 11300, $3 X until "XXX" +With loop-deletion 20 +Without loop-deletion 9 +XXX +.G2 + +.I +.ce +Time measurements on processing three C-programs with and without the loop-deletion algorithm +.R + +The diagram shows that the loop-deletion algorithm +does not dramatically slow down the recognizing process. There is, however, +a measurable time loss of \(+-25%. As we will see later, the loop-deletion +algorithm will turn out to be extremely useful in efficient use of memory +when there are many loops in the graph. + +The effect of the loop-deletion algorithm on parsing Modula-2 programs +is even less than with C-programs; in fact there is no measurable +time loss: + +.G1 +coord x 0, 17000 y 0, 15 +ticks bot out at 750, 5000, 15000 +label bot "Number of tokens" +label left "User Time" "(sec)" left .3 +draw no_loop dashed +draw loop dashed +copy thru X + times size +2 at $1, $2 + times size +2 at $1, $3 + next no_loop at $1, $2 + next loop at $1, $3 +X until "XXX" + +823 .6 .6 +4290 3.5 3.8 +16530 14.3 14.3 +XXX + +copy thru X "$1 $2" size -2 at 11800, $3 X until "XXX" +With loop-deletion 13 +Without loop-deletion 7 +XXX +.G2 + +.I +.ce +Time measurements on processing three Modula-2-programs with and without the loop-deletion algorithm +.R + +There are at least two reasons for this; both result from the relative +simplicity of the Modula-2 grammar. The distance from a head to an +end of stack marker is shorter than in C, and secondly Modula-2 +causes fewer joins to occur than C, meaning that the loop marking algorithm +is run less often and when it is run it has fewer paths to search. 
+ + +.NH 3 +Space measurements on the effect of leaving out the loop-deletion algorithm + +.LP +Clearly, to make any measurements on the space-usage effects of leaving out +the loop-deletion algorithm we need a program that causes the prediction +graph to contain loops; however, we have not been able to devise a C +or Modula-2 program that does this. In order to be able to make measurements, +we added an extra alternative to a rule of the C compiler grammar, making +it directly left-recursive. To make LLgen accept this new grammar, we +put a `%if' directive in the rule. + +.LP +We have input our standard C test program consisting of 800 tokens to +the error recovery routine for this `doctored' C compiler, +and compared the storage needed for the prediction graphs with the +loop-deletion algorithm enabled with the storage needed when the +algorithm is disabled. With the loop-deletion algorithm enabled, the +maximum size of the prediction graph was 5576 bytes. When the loop-deletion +algorithm was disabled, the maximum size of the prediction graph +grew to 12676 bytes; furthermore, 12676 bytes of heap were allocated +for the prediction graph, but not deallocated again, because they were +in use by graph elements that were in inaccessible loops. The user-time +the program needed differed only slightly: 0.9 seconds with the algorithm +disabled against 1.0 seconds with it enabled. Given the +relatively small input program, this data suggests that when loops +are actually being made, the loop-deletion algorithm is definitely +worth the extra overhead it costs, considering the space +that would otherwise be occupied by inaccessible loops. To verify this, +we input the C program consisting of 15000 tokens to the compiler; +execution time increased from 17.3 to 21.1 seconds after enabling +the loop-deletion algorithm, while the maximum size of the prediction graph +shrank from 328664 to 13664 bytes. With the loop-deletion algorithm +disabled, 326720 bytes allocated for the graph were not deallocated again. 
+Again, given the relatively small increase in execution time and the +large reduction of memory usage, we feel that the loop-deletion +algorithm is useful enough to justify the overhead it creates. + +.NH 2 +Problems encountered + +.LP +In this section we describe some of the problems we encountered +while testing the non-correcting error recovery. + +.NH 3 +The LLgen error reporting mechanism. + +.LP +The parsers generated by LLgen call a user-supplied error reporting +routine, usually called LLmessage. This routine is called with an +integer parameter that is positive, zero or negative. When the parameter +is positive the parser has just inserted a token, whose +number is equal to the parameter; if it is zero, the parser +has deleted a token whose number is in a global variable called LLsymb; if +it is negative, it means that LLgen expected end-of-file, but did not +find it. The routine LLmessage is supposed to print an error message, +and when a token is inserted, it should set all necessary attributes. + +.LP +However, when non-correcting error recovery is used, the situation becomes slightly +different; when the parser inserts a token, it is only to keep the +semantic actions consistent, and no longer signifies an error. +Nevertheless, the LLmessage routine still has to be called because the +attributes of the inserted token need to be set. Therefore, when +non-correcting error recovery is used, the LLmessage routine should not +print an error message when the parameter is positive, or else it will +print highly confusing error messages indeed. Furthermore, the +LLmessage routine will usually print a message like `token ... deleted' when +it is called with parameter equal to zero; however, when the non-correcting +error recovery is used, it is more appropriate to report something +like `token ... illegal', as the non-correcting error recovery does +not delete tokens. 
Finally, when an unexpected end-of-file is encountered, +LLgen normally just inserts the missing tokens and calls +LLmessage with the parameter equal to the token number; +when non-correcting error recovery is used we need a way to +actually report we have encountered an unexpected end-of-file. We +achieved this by calling LLmessage with parameter 0 and the +global variable LLsymb set to EOFILE when this situation occurs; the +routine LLmessage should print something like `unexpected end of file' +when it is called with parameter 0 and LLsymb is EOFILE. To facilitate +switching between correcting and non-correcting error recovery, the +file Lpars.h contains a statement `#define LLNONCORR' if non-correcting +error recovery is used. + + +.NH 3 +Parsers being started in semantic actions + +.LP +LLgen allows the programmer to define more than one nonterminal as the +start symbol of the input grammar; it will generate a parsing routine +for each of the start symbols. However, the error recovery code +is generated only once; it is shared by all parsers. +The programmer is free to call any +of the generated parsers whenever he wants; for instance, in the C-compiler +a separate parser for expressions in #if and #elif statements is used. Whenever +the lexical analyzer encounters such a statement, it calls the expression +parser. It is also possible to call a parser in a semantic action of +another parser; in the MODULA-2 compiler a separate parser for +definition modules is used. When the main parser encounters a +FROM defmod IMPORT statement a semantic +action opens the definition module defmod and starts the parser for +definition modules. + +.LP +The fact that subparsers can be started just about anywhere causes +problems when non-correcting error recovery is used. +Suppose a parser calls another parser in a semantic action +to parse a separate input file. 
In the Modula-2 compiler, after +seeing the FROM defmod IMPORT statement a semantic action opens +defmod and parses it; now, if a syntax error occurred before the +FROM IMPORT statement, the non-correcting error recovery will not +execute the action that opens and parses the definition module, but +it will not report an error either, because the statement +FROM defmod IMPORT is part of the input language of the main parser. +However, suppose that during the parsing of a definition module +an error occurs; then, some semantic actions that would normally +be executed during parsing of the definition module will not have +taken place. When normal parsing is now resumed by the main parser, +after the non-correcting error recovery has finished with the +definition module, a lot of spurious semantic errors are likely to be +reported, because the semantic actions that would normally have been +executed during the definition module parsing have not been executed +by the error recovery. Therefore, it is desirable that the main parser +does not resume normal parsing, but instead continues with the non-correcting +error recovery as well. Any syntactic errors in the main program will +still be reported, but no spurious semantic errors will be reported +that way. + +.LP +When the lexical analyzer calls other parsers, as is the case in +the ACK C compiler, recursive invocations of the non-correcting error +recovery routine can occur. This will happen if a parser starts the +error recovery, the error recovery calls the lexical analyzer, which +starts another parser that finds a syntax error and calls the +error recovery again. This is not really a problem, but it has +consequences for the implementation of the error recovery routine. + +.LP +The worst case +occurs when two parsers are involved in parsing one input file, and +the secondary parser (e.g. an inline assembly parser) is called in a semantic +action of the main parser. 
Suppose now that the input text contains +a syntax error; after detecting this error, the parser starts the +non-correcting error recovery. This recovery does not execute any +semantic actions; therefore it will not start the subparser at those points +where the original LLgen generated parser would. As a result, parts +of the program that would be accepted by the subparser will now probably +be rejected as illegal, because the error recovery does not know it +should use another grammar to check these parts. This is a serious +problem, and we have devised and implemented two ways to solve it. + +.LP +The first solution is based on the assumption that whenever a semantic +action occurs in the grammar, another parser can be started at that +point. Obviously, we have no way of knowing which semantic actions start +a parser and which don't, so we assume the worst. +Now, assume that in the grammar there are k symbols defined as +start symbols, say $W sub 1 , W sub 2 , ..., W sub k$. Each of these symbols +will cause LLgen to generate a parser that can be called in any +of the semantic actions of the grammar. We now introduce a new +symbol $X$, and a new grammar rule $X -> W sub 1 X | W sub 2 X | ... | +W sub k X | +epsilon$. +In the grammar the error recovery algorithm uses, we insert this symbol +X at all positions where there are semantic actions in the original grammar, +so a rule $A -> alpha$ { action } $beta$ becomes $A -> alpha X beta$. As a +result, at each position in a grammar rule where a semantic action +occurs, we now accept any input that would be accepted by any of the +parsers. Clearly, this solution is somewhat of a kludge, as it will +accept a lot of input that is not accepted by the original parser. +However, it is guaranteed to never give spurious error messages, because +whenever a parser would be started by the original parser, there now +is an $X$ in the grammar that produces all the strings that would be +accepted by that parser. 
We have implemented this solution, and found +it to be extremely slow, which of course was to be expected given the +number of semantic actions in the average grammar. Furthermore, +because each time a semantic action occurs in the grammar +a string accepted by any of the generated parsers is accepted, including +strings recognized by the currently running parser, error messages +become hard to interpret. As an example, consider the following +C program: +.br +.nf + + + main() + { + int i, j; + + while (i < j + j++; + + i = 1; + j = 2; + + } + + +.fi +.LP +Clearly, there is a `)' missing in the while-statement; +however, if this program is input to the error recovery it will complain +"} illegal", since after recognizing the +expression controlling the while the original parser starts a +semantic action, so the non-correcting recovery will accept a valid +C program at that point; after recognizing the three statements +following the while-statement as a separate program the +recognizer expects the missing `)', but gets `}' instead. + +.LP +Our second solution is based on the observation that if we knew +which semantic actions can start other parsers, we would only +have to introduce the new symbol $X$ at those places where parsers +can get started. We have therefore extended LLgen with a new directive +%substart, which is used to indicate to the parser generator that +another parser may be started. The %substart is followed by the +start symbols that will produce the parsers that can be called, +so %substart A, B, C; indicates that in the semantic action +following the directive the parsers produced by start symbols +A, B, and C can be started. In the grammar used by the error +recovery, a new symbol $X$ will be introduced at this point, +along with a new rule $X -> AX | BX | CX | epsilon$. 
Of course, this +solution can still accept input that would not have been accepted +by the original parser, for instance if a parser is started +conditionally, based on other semantic information. However, it +is a big improvement over the first solution, both in performance +and in the input it accepts. + +.NH 3 +Syntactic errors being handled in semantic actions + +.LP +A programmer may decide to handle certain syntactic errors +in semantic actions, for instance because he is not satisfied with +the standard error recovery. However, since the non-correcting error +recovery does not execute semantic actions, this may cause errors +to remain undetected. We encountered the following example in the ACK +Modula-2 compiler, in the grammar rule for the assignment statement: +.br +.nf + + + Assignment_statement: lvalue + [ + '=' + { + error(":= expected"); + } + + | + + ':=' + ] + expression + ; + +.fi +.LP +This works well in the original LLgen; however, statements like +`j=9' are not treated as syntactic, but as semantic errors. +The original LLgen generated parser +will print the (semantic) error message, but the non-correcting recovery +will not execute the semantic action and therefore the erroneous +input will be accepted. + +.LP +To facilitate the incorporation of non-correcting error recovery in parsers +that use this kind of `trick', we extended LLgen with the %erroneous +directive. The directive indicates to the non-correcting recovery +mechanism that the token following it is not really part of the grammar. +When recognizing input, the error recovery will ignore tokens in the +grammar that have %erroneous in front of them. If in the example above, +the '=' is replaced with %erroneous '=', the non-correcting mechanism will +report an error when it sees a statement like 'j = 9'. See appendix A +for details about the implementation of the %erroneous directive. + +.LP +Another example is in the ACK C compiler. 
For some reason, the +grammar accepts function definitions without `()', so according +to the syntax a function definition can look like: +.br +.nf + + int func + { + .... + } +.fi + +.LP +The absence of the `()', however, causes `func' to be entered in the +symbol table as a non-function, and when the parser encounters the body +a semantic action will complain with the error message "Making function body +for non-function". This again will cause the non-correcting error +recovery to miss errors. Consider this piece of code: +.br +.nf + +int i int j = 1; +{} + +.fi + +.LP +where apparently there's a `;' missing between the declarations +of i and j. The original LLgen-generated parser only gives semantic errors: +.br +.nf +"Making function body for non-function" +"j is not in parameter list" +"Illegal initialization of formal parameter, ignored" +.fi +.LP +As a result, the non-correcting error recovery will not report +any errors in this piece of code, because it does not execute the +semantic actions that recognize and report the error. Unfortunately, +due to the way the C-grammar is written, it is not possible to solve +this problem using a %erroneous directive; the part of the grammar +that deals with declarations would have to be rewritten so as to +syntactically reject functions without `()'. + +.NH 3 +Semantic actions that read input + +.LP +There are no restrictions on what a semantic action can do; +there is nothing to stop the programmer from writing a parser in such +a way that some of the input to the parser is processed by semantic +actions. Obviously, because the non-correcting error recovery does not +execute semantic actions, this kind of parser will not work at all +with the new error recovery. Ironically, LLgen itself is written in +such a fashion; {}-enclosed C-code in its input is processed by +a semantic action in the LLgen grammar. 
We feel that it is bad +practice to write parsers this way; the `eating' of parts of +the input should be done in the lexical analyzer, not in the parser. +After all, in the case of LLgen, one can regard a semantic action +in the input as one token, and thus it should be handled by +the lexical analyzer as such. + +.NH 2 +Examples of error recovery + +.LP +We will now give some examples that compare non-correcting error +recovery with the correcting error recovery used by parsers generated +by `standard' LLgen. + +Consider the next C program, where there is a `)' missing in the +header of function `test'. +.br +.nf + + 1 int test(a,b + 2 + 3 int a,b; + 4 + 5 { + 6 if (a < b) + 7 return(1); + 8 else + 9 return(0); + 10 } +.fi + +.LP +This small error derails the `standard' parser; it produces the +following error messages, where we have left out 7 messages reporting +semantic errors: +.br +.nf + + line 3: , missing before type_identifier + line 3: , missing before identifier + line 3: ) missing before ; + line 5: { deleted + line 6: if deleted + line 6: < deleted + line 6: ) missing before identifier + line 6: ) deleted + line 7: identifier missing before return + line 7: ; missing before return + line 7: { missing before return + line 8: else deleted + +.fi +.LP +In contrast, the parser using non-correcting error recovery produces +only one error message: +.br + + line 3: type_identifier illegal + +This error message correctly pin-points the error: there should +have been a `)' at the position where type-identifier `int' is. + +.LP +Now, an example with Modula-2; consider this program: +.br +.nf + + 1 MODULE test; + 2 + 3 TYPES + 4 ElementRecordType = RECORD + 5 Element: ElementType; + 6 Next, + 7 Prior: ElementPointerType; + 8 END; + 9 + 10 VARS a,b,c: ElementRecordType; + 11 + 12 + 13 BEGIN + 14 + 15 a := b; + 16 + 17 END test. + +.fi +.LP +There are two syntactic errors in this program; on line 3, TYPES should be TYPE, and +on line 10, VARS should be VAR. 
We have left out the type declarations of +ElementType and ElementPointerType; clearly this will generate semantic +errors, but we are only interested in syntactic errors anyway. +The correcting error recovery parser +again derails on this program; it produces the following syntactic error messages: +.br +.nf + + line 3: CONST missing before identifier + line 4: '=' missing before identifier + line 4: RECORD deleted + line 5: ':' deleted + line 5: ';' missing before identifier + line 5: '=' missing before ';' + line 5: number missing before ';' + line 6: ',' deleted + line 7: '=' missing before identifier + line 7: ':' deleted + line 7: ';' missing before identifier + line 7: '=' missing before ';' + line 7: number missing before ';' + line 8: ';' deleted + line 10: identifier deleted + line 10: ',' deleted + line 10: identifier deleted + line 10: ',' deleted + line 10: identifier deleted + line 10: ':' deleted + line 10: identifier deleted + line 10: ';' deleted + line 13: BEGIN deleted + line 15: identifier deleted + line 15: := deleted + line 15: identifier deleted + line 15: ';' deleted + line 17: END deleted + line 17: identifier deleted + +.fi +.LP +The error correction mechanism clearly makes the wrong guess by inserting +CONST on line 3; as a result, all that follows is rejected as incorrect. +In contrast, the non-correcting error recovery mechanism only produces +two error messages: +.br +.nf + + line 3: identifier illegal + line 10: identifier illegal + +.fi +.LP +This again exactly pin-points the errors: the identifiers TYPES and +VARS constitute the only errors in the program. Note that the +presence of more than one error does not cause any problems to the +non-correcting recovery mechanism. 
+ +.bp +.nr PS 12 +.nr VS 14 + +.NH +Conclusion + +.nr PS 10 +.nr VS 12 + +.LP +After implementing and testing a non-correcting error recovery mechanism +we have come to the conclusion that it indeed is superior to correcting +mechanisms as regards the error messages it produces; +the examples we have given clearly show this. However, there is a +clear loss of performance when errors are present in a program, +although we have found this performance +degradation to be acceptable. We feel that the benefits of +better error messages outweigh the loss of performance. In any case, +correct programs do not suffer at all from the incorporation +of a non-correcting recovery mechanism. +The error recovery mechanism we implemented does not make +unreasonable demands on resources; the size of the prediction +graphs stays within reasonable limits. + +.LP +The main problems we encountered had to do with recognizing +`languages within languages', and semantic actions that did +unreasonable things like eating input. The more `well-behaved' a +parser is, the better the results the non-correcting error recovery +mechanism gives. This is also true for the input grammars: with a +language like Modula-2, whose syntax has been designed with parser +generators in mind, the performance of the non-correcting mechanism +is better than with C, whose syntax is extremely hard, if not +impossible, to describe with an LL(1) grammar. + +.bp +.nr PS 12 +.nr VS 14 + +.NH +Bibliography + +.nr PS 10 +.nr VS 12 + +.IP [CORMACK] 12 +Gordon V. Cormack, `An LR substring parser for noncorrecting syntax error +recovery', ACM SIGPLAN Notices, vol. 24, no. 7, p. 161-169, July 1989 + +.IP [GRUNE] 12 +Dick Grune, Ceriel J.H. Jacobs, `A programmer friendly LL(1) parser +generator', Softw. Pract. Exper., vol. 18, no. 1, p. 29-38, Jan 1988 + +.IP [RICHTER] 12 +Helmut Richter, `Noncorrecting syntax error recovery', ACM Trans. Prog. Lang. +Sys., vol. 7, no. 3, p. 
478-489, July 1985 + +.IP [ROEHRICH] 12 +Johannes R\*:ohrich, `Methods for the automatic construction of error +correcting parsers', Acta Inform., vol. 13, no. 2, p. 115-139, Feb 1980 + +.IP [TOMITA] 12 +Masaru Tomita, Efficient parsing for natural language, Kluwer Academic +Publishers, Boston, p.210, 1986 +.bp +.SH +Appendix A: Implementation Issues + +.nr PS 10 +.nr VS 12 +.RS +.LP +In this appendix we will describe some implementation issues; +the data structure used to store the grammar during non-correcting +error recovery, postponing deletions of graph elements until after +the prediction phase, and the implementation of the %substart directive . +.RE + +.SH +A.1 The grammar data structure + +.LP +The grammar data structure used by the non-correcting error recovery technique has +to meet two conditions: easy access to a rule as a whole to make +substituting nonterminals efficient and easy access to each symbol in the RHS +of a rule to make starting error recovery and finding continuations +efficient. To fulfill these conditions we decided to construct the +storage of the grammar as follows. + +.LP +A rule in the grammar is divided in two +parts: a LHS and a RHS. The LHS is represented by a struct `lhs' and +for each symbol in the RHS a struct 'symbol' is constructed. +A struct `lhs' contains the number of the +nonterminal forming the LHS of the rule, a pointer to the RHS, the +first- and follow-sets of the nonterminal and a flag 'empty' which +indicates whether the nonterminal produces empty or not. A struct +`symbol' contains a field indicating the type of the symbol, i.e. +a terminal or a nonterminal, the number of the symbol, a `link' pointer +to a struct `symbol' that represents the same symbol, a `next' pointer +to the rest of the RHS and a pointer back to the LHS. + +.LP +A special struct `symbol' is added to the end of the RHS to indicate +the end of a rule. 
The type of this struct is LLEORULE, the number +is set to -1 and the pointers 'link' and `next' are nil. + +.LP +In case that there is more than one RHS for a LHS, all the RHS's +are put after each other and separated by another special struct +`symbol'. The type of this struct is LLALT, the number is set to +-1 and the 'link' pointer is nil. After the last RHS a `LLEORULE'-struct +marker is added. + +.LP +Finally, to make searching efficient there are two arrays: `terminals' +and `nonterminals'. `terminals' is indexed by the number of a terminal +and contains for each terminal a struct containing a 'link' pointer +to a symbol, representing this terminal, in the RHS of a rule. Because +this symbol has again a 'link' pointer to another symbol representing +the terminal, it is possible by following this chain of pointers +to find all rules containing such a terminal. In a similar way `nonterminals' +is indexed by the number of a nonterminal and contains for each +nonterminal a struct. This struct not only contains a 'link' pointer +linking all rules with this nonterminal, but also contains a 'rule' +pointer. This pointer points to the RHS or RHS's of the rules of which +the nonterminal forms the LHS. + +.LP +As an example, consider the following grammar: + +.br +A: a B +.br +B: a | $epsilon$ +.br + +This will result in the picture below. Note that `pointer' fields +without an arrow indicate nil pointers. 
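The layout described in this section can be sketched in C as follows. This is a minimal illustration and not the LLgen source: the field names `nr', `rhs' and `lhs', the numeric values of the type constants, the symbol numbering and the helper functions are assumptions made for the example, the first- and follow-sets are left out, and the `terminals' array is reduced to a bare pointer per terminal. The sketch builds the structures for the example grammar and shows how the `link' chain finds every occurrence of terminal a.

```c
#include <assert.h>
#include <stddef.h>

/* Symbol types; the numeric values are assumptions for this sketch. */
enum { LLTERM, LLNONTERM, LLALT, LLEORULE };

struct symbol;

struct lhs {
    int nr;               /* number of the nonterminal */
    struct symbol *rhs;   /* first symbol of the first RHS */
    int empty;            /* nonzero if the nonterminal produces empty */
    /* first- and follow-sets are left out of this sketch */
};

struct symbol {
    int type;             /* LLTERM, LLNONTERM, LLALT or LLEORULE */
    int nr;               /* symbol number, -1 for the two markers */
    struct symbol *link;  /* next occurrence of the same symbol */
    struct symbol *next;  /* rest of the RHS */
    struct lhs *lhs;      /* back pointer to the LHS */
};

/* Hypothetical numbering: terminal a = 0, nonterminals A = 0 and B = 1. */
struct lhs rules[2];
struct symbol syms[6];
struct symbol *terminals[1];                 /* heads of the link chains */
struct { struct symbol *link, *rule; } nonterminals[2];

static void set(struct symbol *s, int type, int nr, struct symbol *link,
                struct symbol *next, struct lhs *lhs)
{
    s->type = type; s->nr = nr; s->link = link; s->next = next; s->lhs = lhs;
}

/* Build  A: a B   and   B: a | <empty>. */
void build_example(void)
{
    struct symbol *a1 = &syms[0], *B1 = &syms[1], *end1 = &syms[2];
    struct symbol *a2 = &syms[3], *alt = &syms[4], *end2 = &syms[5];

    rules[0].nr = 0; rules[0].rhs = a1; rules[0].empty = 0;
    rules[1].nr = 1; rules[1].rhs = a2; rules[1].empty = 1;

    set(a1,   LLTERM,    0, NULL, B1,   &rules[0]);
    set(B1,   LLNONTERM, 1, NULL, end1, &rules[0]);
    set(end1, LLEORULE, -1, NULL, NULL, &rules[0]);
    set(a2,   LLTERM,    0, a1,   alt,  &rules[1]);  /* chains on to a1 */
    set(alt,  LLALT,    -1, NULL, end2, &rules[1]);
    set(end2, LLEORULE, -1, NULL, NULL, &rules[1]);  /* empty alternative */

    terminals[0] = a2;
    nonterminals[0].link = NULL; nonterminals[0].rule = a1;
    nonterminals[1].link = B1;   nonterminals[1].rule = a2;
}

/* Number of RHS positions reachable by following a link chain. */
int count_occurrences(const struct symbol *s)
{
    int n = 0;
    for (; s != NULL; s = s->link)
        n++;
    return n;
}

/* Number of alternatives of a rule: one more than its LLALT markers. */
int count_alternatives(const struct lhs *l)
{
    int n = 1;
    const struct symbol *s;

    assert(l != NULL && l->rhs != NULL);
    for (s = l->rhs; s->type != LLEORULE; s = s->next)
        if (s->type == LLALT)
            n++;
    return n;
}
```

After a call to build_example(), following the chain from terminals[0] visits both occurrences of a, and walking the RHS of B finds two alternatives, the second of which is the empty one.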
+ +.PS +dx = 0.05 + +down +A_a: box ht boxht/2 "link" +box invis "a" ljust with .e at A_a.w + +move to A_a.s +move +move + +A: box "link" "rule" +B: box "link" "rule" +line dashed from A.w to A.e +line dashed from B.w to B.e +box invis "A" ljust with .e at A.w +box invis "B" ljust with .e at B.w + +move to A.ne +right +move +move +down + +LHS_A: box wid 1.2 * boxwid ht 2.5 * boxht "`A'" "rhs" "first" "follow" "empty 0" +line dashed from 0.2 to 0.2 +line dashed from 0.4 to 0.4 +line dashed from 0.6 to 0.6 +line dashed from 0.8 to 0.8 + +move to LHS_A.ne + (1,0) + +RHS_a1: box wid 2.0 * boxwid ht 2.5 * boxht "LLTERM" "`a'" "link" "next" "lhs" +line dashed from 0.2 to 0.2 +line dashed from 0.4 to 0.4 +line dashed from 0.6 to 0.6 +line dashed from 0.8 to 0.8 + +move to RHS_a1.ne + (1,0) + +RHS_B: box wid 2.0 * boxwid ht 2.5 * boxht "LLNONTERM" "`B'" "link" "next" "lhs" +line dashed from 0.2 to 0.2 +line dashed from 0.4 to 0.4 +line dashed from 0.6 to 0.6 +line dashed from 0.8 to 0.8 + +move to RHS_B.ne + (1,0) + +RHS_END1: box wid 2.0 * boxwid ht 2.5 *boxht "LLEORULE" "-1" "link" "next" "lhs" +line dashed from 0.2 to 0.2 +line dashed from 0.4 to 0.4 +line dashed from 0.6 to 0.6 +line dashed from 0.8 to 0.8 + + +move to LHS_A.s - (0,1) + +LHS_B: box wid 1.2 * boxwid ht 2.5 * boxht "`B'" "rhs" "first" "follow" "empty 1" +line dashed from 0.2 to 0.2 +line dashed from 0.4 to 0.4 +line dashed from 0.6 to 0.6 +line dashed from 0.8 to 0.8 + +move to LHS_B.ne + (1,0) + +RHS_a2: box wid 2.0 * boxwid ht 2.5 * boxht "LLTERM" "`a'" "link" "next" "lhs" +line dashed from 0.2 to 0.2 +line dashed from 0.4 to 0.4 +line dashed from 0.6 to 0.6 +line dashed from 0.8 to 0.8 + +move to RHS_a2.ne + (1,0) + +RHS_ALT: box wid 2.0 * boxwid ht 2.5 * boxht "LLALT" "-1" "link" "next" "lhs" +line dashed from 0.2 to 0.2 +line dashed from 0.4 to 0.4 +line dashed from 0.6 to 0.6 +line dashed from 0.8 to 0.8 + +move to RHS_ALT.ne + (1,0) + +RHS_END2: box wid 2.0 * boxwid ht 2.5 *boxht "LLEORULE" "-1" 
"link" "next" "lhs" +line dashed from 0.2 to 0.2 +line dashed from 0.4 to 0.4 +line dashed from 0.6 to 0.6 +line dashed from 0.8 to 0.8 + +# Next pointers upper row +.ps 30 +circle radius .01 at 0.75 - (dx, 0) +circle radius .01 at 0.3 - (dx, 0) +circle radius .01 at 0.7 - (dx, 0) +circle radius .01 at 0.7 - (dx, 0) +.ps 10 + +arrow from 0.75 - (dx, 0) to 0.3 +arrow from 0.3 - (dx, 0) to 0.3 +arrow from 0.7 - (dx, 0) to 0.7 +arrow from 0.7 - (dx, 0) to 0.7 + + +# Next pointers lower row +.ps 30 +circle radius .01 at 0.75 - (dx, 0) +circle radius .01 at 0.3 - (dx, 0) +circle radius .01 at 0.7 - (dx, 0) +circle radius .01 at 0.7 - (dx, 0) +.ps 10 + +arrow from 0.75 - (dx, 0) to 0.3 +arrow from 0.3 - (dx, 0) to 0.3 +arrow from 0.7 - (dx, 0) to 0.7 +arrow from 0.7 - (dx, 0) to 0.7 + + +# Link pointers +.ps 30 +circle radius .01 at 0.5 - (2*dx, 0) +circle radius .01 at 0.5 - (dx, 0) +circle radius .01 at 0.25 - (dx, 0) +.ps 10 + +arrow dashed from 0.5 - (2*dx, 0) to RHS_a2.ne - (2*dx,0) +line dashed from 0.5 - (dx, 0) right 4.0 * boxwid then to RHS_a1.ne - (2*dx, 0) -> +line dashed from 0.25 - (dx, 0) right then up .75 then right 7.0 * boxwid then to RHS_B.ne - (2*dx, 0) -> + + +# LHS pointers upper row +.ps 30 +circle radius .01 at 0.9 - (3*dx, 0) +circle radius .01 at 0.9 - (3*dx, 0) +circle radius .01 at 0.9 - (3*dx, 0) +.ps 10 + +line from 0.9 - (3*dx, 0) down -> +line from 0.9 - (3*dx, 0) down -> +line from 0.9 - (3*dx, 0) down then left 8.0 * boxwid then to LHS_A.se -> + + +# LHS pointers lower row +.ps 30 +circle radius .01 at 0.9 - (3*dx, 0) +circle radius .01 at 0.9 - (3*dx, 0) +circle radius .01 at 0.9 - (3*dx, 0) +.ps 10 + +line from 0.9 - (3*dx, 0) down -> +line from 0.9 - (3*dx, 0) down -> +line from 0.9 - (3*dx, 0) down then left 8.0 * boxwid then to LHS_B.se -> + + +# Text above structs +box invis ht boxht/2 "terminals" with .s at A_a.n +box invis ht boxht/2 "nonterminals" with .s at A.n +box invis ht boxht/2 "lhs" with .s at LHS_A.n +box invis ht boxht/2 
"lhs" with .s at LHS_B.n +box invis ht boxht/2 "symbol" with .s at RHS_a1.n +box invis ht boxht/2 "symbol" with .s at RHS_B.n +box invis ht boxht/2 "symbol" with .s at RHS_END1.n +box invis ht boxht/2 "symbol" with .s at RHS_a2.n +box invis ht boxht/2 "symbol" with .s at RHS_ALT.n +box invis ht boxht/2 "symbol" with .s at RHS_END2.n +.PE + +.LP +Note that the empty alternative for `B' is represented in the +data structure by the `LLEORULE-struct' immediately following +the `LLALT'-struct. When there are still other alternatives +the `LLEORULE'-struct is replaced by a `LLALT'-struct followed +by the other alternatives and a `LLEORULE'-struct. +Finally, when the empty rule is the only rule for a +nonterminal the RHS will consist only of a `LLEORULE'-struct. + +.SH +A.2 Delayed deletes + +.LP +We encountered a problem with deleting elements during the +prediction phase. Imagine that we have a nonterminal `B' on top of +the graph, and `B' has two alternatives. Now suppose that we +apply the first alternative and we find out that this alternative leads +to a `dead end', i.e. a head that does not match the input symbol, so we want +to get rid of it. When we delete it immediately the deletion algorithm +will also deallocate `[B]' and possibly some elements below `[B]'. +However, there was another alternative for `[B]' which was not yet +developed and maybe this alternative leads to a head which is legal. +But `[B]' has already been deleted and thus cannot be used anymore. A similar +situation can occur when we want to delete a joined element; +the substitution of a nonterminal +that only produces empty and thus has no element above it in the graph +can also lead to such a situation. We therefore decided to put `dead ends' +on a list, `cleanup_arr[]', and after the prediction phase has +finished we delete all elements on this list, and all their descendants +that become unreachable of course. 
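The idea of delayed deletes can be sketched in C as below. Only the name `cleanup_arr' comes from the text; the element layout, the reference count, the function names and the counter `freed_count' are assumptions made for this illustration. In particular, every element here has a single element below it, whereas joined elements in the real prediction graph can have several, and each dead end is assumed to be put on the list only once.

```c
#include <assert.h>
#include <stdlib.h>

/* A graph element.  The single `below' pointer is a simplification. */
struct elem {
    struct elem *below;   /* element underneath this one */
    int refs;             /* how many elements sit directly on top of it */
};

static struct elem **cleanup_arr;        /* dead ends found in this phase */
static size_t cleanup_len, cleanup_cap;
int freed_count;                         /* for the usage example only */

struct elem *new_elem(struct elem *below)
{
    struct elem *e = malloc(sizeof *e);

    assert(e != NULL);    /* sketch: allocation failure is simply fatal */
    e->below = below;
    e->refs = 0;
    if (below != NULL)
        below->refs++;
    return e;
}

/* During a prediction phase: remember the dead end, free nothing yet. */
void defer_delete(struct elem *e)
{
    if (cleanup_len == cleanup_cap) {
        size_t cap = cleanup_cap ? 2 * cleanup_cap : 16;
        struct elem **p = realloc(cleanup_arr, cap * sizeof *p);

        assert(p != NULL);
        cleanup_arr = p;
        cleanup_cap = cap;
    }
    cleanup_arr[cleanup_len++] = e;
}

/* Free e, then every element below it that nothing else rests on. */
static void release(struct elem *e)
{
    while (e != NULL && e->refs == 0) {
        struct elem *below = e->below;

        free(e);
        freed_count++;
        if (below != NULL)
            below->refs--;
        e = below;
    }
}

/* After the prediction phase has finished: the deletions are now safe. */
void flush_deletes(void)
{
    size_t i;

    for (i = 0; i < cleanup_len; i++)
        release(cleanup_arr[i]);
    cleanup_len = 0;      /* keep the array itself for the next phase */
}
```

If the head produced by the first alternative of `B' is passed to defer_delete() and the second alternative is developed afterwards, `[B]' is still present when it is needed; flush_deletes() then removes the dead end but keeps `[B]', because the second head still rests on it.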
+ +.SH +A.3 Clearing flags + +.LP +We implemented two different ways to clear the flags set by the prediction +phase of the algorithm; the first recursively tracks down the whole graph +following the flags, the second puts all elements visited by +the prediction phase +on a list; after the prediction phase has finished the algorithm walks +through this list clearing the flags of all elements on it. We took measurements +on both algorithms and found out that with small programs the times +did not differ much but large programs were processed faster by the +second algorithm. Therefore we decided to use the second algorithm. + +.LP +To speed up the algorithm even more, we do not deallocate the list +after a prediction phase has finished. We just set the number of +elements on the list to 0. This saves considerably on the number +of `Malloc'-calls. + +.SH +A.4 Implementation of %erroneous directive + +.LP +As explained in chapter 3, the user can put a %erroneous directive +in front of a terminal, making the non-correcting error recovery +mechanism ignore that terminal. However, implementing this directive +was not entirely straightforward; consider, for example, the rule +.br +.nf + + A: 'a' | %erroneous 'b' | 'c'; + +.fi +.LP +Just leaving out terminal 'b' will not do, because then nonterminal +A produces empty all of a sudden, which it did not before. +The rule should become +.br +.nf + + A: 'a' | 'c'; + +.fi +but this is hard to implement in LLgen. We took a different approach: +we introduce a new terminal 'ERRONEOUS', and substitute it for all +terminals with an %erroneous directive in front of them. Thus, the +example rule becomes +.br +.nf + + A: 'a' | ERRONEOUS | 'c'; + +.fi +.LP +Since the terminal ERRONEOUS will never be in the input to the parser, +this has exactly the desired effect; when a predicting phase produces +ERRONEOUS as head of a prediction graph this head will never match the +input. 
In particular, it will not match the terminal that was +originally there (in this case 'b') so that terminal is no longer +regarded as part of the input language at that point. +.bp +.SH +Appendix B: Using the non-correcting error recovery + +.LP +To use the new non-correcting error recovery mechanism, LLgen has to +be called with the new flag -n. LLgen will then create an extra file +called `Lncor.c' which contains the code for the non-correcting recovery +mechanism. This file has to be compiled and linked with the rest +of the program, just like the file `Lpars.c'. + +.LP +The user-supplied error reporting routine `LLmessage' will have to be +modified slightly; when it is called with a positive parameter, it +should only set the attributes of the inserted token, but not report an +error. Note that the lexical analyzer still must return the same token +as it did the last time it was called. When LLmessage is called with +parameter 0, it should report that the token in global variable LLsymb +is illegal; if the value of LLsymb is `EOFILE', the routine should +report an unexpected End-of-file. When LLmessage is called with parameter +-1, it should report that end-of-file was expected. To facilitate +switching between correcting and non-correcting error recovery, +the file Lpars.h contains a statement `#define LLNONCORR' +which indicates that the non-correcting +mechanism is enabled. 
+Here is a +skeleton for the modified LLmessage routine: +.nr PS 8 +.nr VS 10 +.LP +.br +.nf + + #include "Lpars.h" + extern int LLsymb; + + LLmessage(flag) + int flag; + { + if (flag < 0) + { + /* Error message "end-of-file expected" */; + } + else if (flag) + { + /* flag equals the number of the inserted token */ +#ifndef LLNONCORR + + /* Error message "token inserted" */; +#endif + + /* Code to set attributes for inserted token */ + /* Code to make lexical analyzer return same token as before */ + } + else + { + /* The number of the illegal or deleted token is in LLsymb */ +#ifndef LLNONCORR + + /* Error message "token deleted" */; +#else + + if (LLsymb == EOFILE) + { + /* Error message "unexpected end of file" */; + } + else + { + /* Error message "token illegal" */; + } +#endif + + } + + } + +.fi +.nr PS 10 +.nr VS 12 + +.LP +For best results, one should check if the parser calls other parsers +in semantic actions; if this is the case, and the called parser +processes the same input file as the calling parser, then a %substart +should be put in front of the semantic action that starts a parser. +If a semantic action calls parsers defined by start symbols, say +A and B, then `%substart A, B;' should be put in front of the action. +As an alternative, one can use the -s flag of LLgen; this has the +same effect as putting `%substart X, Y, ....;' in front of all +semantic actions, where X, Y, .... are the start symbols of the grammar. +Clearly, it is preferable to analyze the grammar and put %substart +directives only where appropriate. + +Finally, beware of syntactic errors being handled in semantic +actions; e.g., one could have a rule like +.nr PS 8 +.nr VS 10 +.LP +.br +.nf + + Assignment_statement: lvalue + [ + '=' + { + error(":= expected"); + } + + | + + ':=' + ] + expression + ; +.fi + +.nr PS 10 +.nr VS 12 +.LP +To ensure that the non-correcting mechanism will recognize the +`=' as a syntactic error, a `%erroneous' directive should be +put in front of it.