.RP
.TL
Top-down Non-Correcting Error Recovery in LLgen
.AU
Arthur van Deudekom
Peter Kooiman
.AI
Department of Mathematics and Computer Science
Vrije Universiteit
Amsterdam

Supervised by
.AU
dr. D. Grune
.AI
Department of Mathematics and Computer Science
Vrije Universiteit
Amsterdam
.AB
This paper describes the design and implementation of a parser
generator with non-correcting error recovery, based on the extended LL(1)
parser generator LLgen. It describes a top-down algorithm for implementing
this error recovery technique that can handle left-recursive grammars.
The parser generator has been tested with several existing ACK compilers,
among them C and Modula-2. Various optimizations have been tried and are
discussed in this paper.
.AE
.LP
.nr PS 12
.nr VS 14

.NH
Introduction
.EQ
delim $$
.EN

.nr PS 10
.nr VS 12
.RS
.LP
One of the trickier problems in constructing parser generators is what
to do when the input to the generated parser is not well formed. Several
approaches are known, most of which are `correcting', meaning that they
modify the input to make it correct. However, in most cases there are
several possible corrections, and often the one chosen will turn out
to be the wrong one. As a result of such an incorrect choice, spurious error
messages can occur. Every programmer knows from experience how the omission
of a single `)' can on occasion lead to pages of error messages.

.LP
A radically different approach is to just discard all the input up to
and including the offending token, and start with a clean slate at the
token following the offending one. [RICHTER] describes how
this idea can be used to construct a non-correcting error recovery system
that will never introduce spurious error messages. It is, however,
possible that errors are overlooked.

.LP
In this paper we describe the incorporation of this non-correcting error
recovery into LLgen, an existing LL(1) parser generator.
In this introduction, we will describe this non-correcting error
recovery technique in detail, give an overview of LLgen and how it handles
errors, and finally describe how we have incorporated non-correcting
error recovery in LLgen.
.RE

.NH 2
Non-correcting syntax error recovery

.LP
Richter describes how syntax error recovery can be done
without making any corrections to the input text. Richter gives three
reasons why recovery without correction is desirable:

.IP 1
In most cases there are many possible corrections, the choice among which
will severely influence the further processing of the input. Thus, the
probability of selecting the right correction is not high.

.IP 2
The harm done by selecting the wrong correction is often unlimited.

.IP 3
The loss of information to the user of a non-correcting recovery technique
need not be grave.

.LP
The non-correcting technique described by Richter can be summarized as
follows: when a syntax error has occurred, the input up to and including the
erroneous symbol is discarded; the remainder of the
input is processed by a substring parser of the input
language, that is, a parser that recognizes any substring of a string in the input
language. When the substring parser detects a syntax error, the offending
symbol is reported as another error, and the input up to and including the
erroneous symbol is discarded. The process is then repeated with the remaining input, possibly
finding other syntax errors, until all the input is scanned.
This process yields what Richter calls a
.I
suffix analysis
.R
of an input string. Formally, given an input string
.I x
, suffix analysis produces a set of strings $w sub k$ and a set of symbols
$a sub k$ such that
.br

.IP
$x~ =~ w sub 0 a sub 0 w sub 1 a sub 1~...w sub n-1 a sub n-1 w sub n$
.LP
and such that:
.br
.IP
$w sub 0$ is the longest prefix of $x$ that is a prefix of
a string in the input language L; formally: there is a string $y$ such that
$w sub 0 y$ is in L, but there is no string $z$ such that $w sub 0 a sub 0 z$
is in L;
.IP
For $0 < k < n$, $w sub k$ is a longest substring of $x$ that is also a
substring of a string in L; formally: there are strings $u$ and $v$ such that
$u w sub k v$ is in L, but there are no strings $y$ and $z$ such that
$y w sub k a sub k z$ is in L;
.IP
$w sub n$ is a substring of $x$
that is a substring of a string in L; formally:
there exist $u$ and $v$ such that $u w sub n v$ is in L. Note that
$w sub n$ need not be a suffix of a string in L; if $x$ represents incomplete
input, $w sub n$ is not a suffix of a string in L.

.LP
Now, the $a sub k$ indicate points at which an error is detected. The
"real" error need not be at $a sub k$; it can have occurred anywhere
within $w sub k a sub k$.
In his paper, Richter shows that, although this method may miss errors, it
will never introduce spurious errors.

.LP
For implementing the technique, a parser that recognizes any
substring of the input language is needed. If we confine ourselves to
syntactical analysis, it is sufficient to construct a substring
recognizer. Richter himself does not give a practical construction, but
[CORMACK] describes how an LR substring parser can be constructed
that handles BC-LR(1,1) grammars. In this paper, we describe the construction
of an LL substring recognizer that can handle any grammar. Furthermore,
our recognizer is actually a suffix recognizer, that is, a recognizer that
recognizes any suffix of a string in the input language. Our suffix recognizer has the
correct-prefix property,
meaning that it detects the first syntax error as early as possible
in a left-to-right scan of the input. Specifically, if the input language
is L and the invalid input is $x$, it finds a string $w$ and an input symbol
$a$ such that $x = w a y$, there is a string $z$ such that $wz$
is in L, and there is no string $z$ such that $waz$ is in L.
Because the suffix recognizer has this correct-prefix property, it can be
used as a substring parser: it will detect the first input symbol that
is not part of a substring of the language. Because it is a suffix recognizer,
it will additionally detect incomplete input, because in that case
the parser will not be in an accepting state at the end of the input.

.NH 2
Overview of LLgen

.LP
LLgen is an extended LL(1) parser generator. For a complete description,
see [GRUNE].
LLgen can actually handle grammars that are not LL(1), because it allows
the use of conflict resolvers. In case of an LL(1) conflict, these resolvers
are used to statically or dynamically decide which rule to use. As we will see
later, this feature makes it necessary for the suffix recognizer to
handle grammars that are not LL(1). Semantic actions can occur anywhere
in the grammar rules, and they are executed when their position is
reached during parsing. A typical LLgen rule looks like
.br
.IP
S:	A {
.I action
} B
.LP
where the action is a piece of C code that will be executed
when the parser is using the rule for S and has recognized A.

.LP
LLgen-generated parsers use correcting syntax error recovery, based on a
scheme designed by R\*:ohrich [ROEHRICH], which inserts or deletes symbols at the point of error detection
until correct input results. This means that actions in the parser will
always be executed in an order that could also have resulted from
syntactically correct input, and most importantly, once a grammar rule
is started it is guaranteed to be completed. This means that syntactic
errors can never result in inconsistencies for the actions. Actions
only have to deal with syntactically correct input. In a nutshell, the
error recovery in LLgen parsers works as follows: suppose the parser is
presented with correct input that breaks off before the end. The error
recovery mechanism now provides a continuation path, chosen in such a
way that all active rules are left as soon as possible. Effectively, the
continuation path is the `shortest way out'. The symbols on this path are
called `acceptable', and end-of-file is also `acceptable'. Furthermore, at
each point along this `shortest path' there can be other terminals that
would be correct; these are `acceptable' as well. Now, when an
error occurs, all symbols that are not acceptable are discarded, until
an acceptable symbol appears in the input. The tokens on the path up to
but not including the acceptable input symbol are inserted.
From then on, normal parsing resumes.

.NH 2
Incorporation of non-correcting error recovery in LLgen

.LP
An important consideration in incorporating the non-correcting recovery
in LLgen was that correct programs should suffer as little as possible
in terms of compilation speed. Furthermore, the existing error
recovery method has the highly desirable property that rules that are
started will be finished too, thus ensuring that errors in the
input text will not cause inconsistencies in the semantic actions. We have
implemented the non-correcting error recovery in such a way that this
property is preserved.

.LP
The way we have achieved these goals is by including
the suffix recognizer as a `second recognizer' in the generated parser.
Correct programs are handled in the usual way by the parser, but if an error
occurs the following happens: instead of going to the standard error
recovery routine, the parser starts executing the non-correcting error
handler. This process continues, reporting all errors, until the
end of the input text is reached. Then, control is handed back to
the standard error recovery routine. This routine will now think
there is no more input, and thus start inserting tokens so as to construct
a `shortest way out'. This ensures that all rules that were started are
also finished, and no inconsistencies can occur in the semantic actions.
However, this method does require some modifications to the error reporting
routine. Normally, if the generated parser inserts a token, it reports
this to the user, but in this case this is undesirable. The insertions only
serve to maintain consistency in the semantic actions
and do not signify errors, so reporting of insertions should be suppressed.
.bp
.nr PS 12
.nr VS 14
.PS
boxwid = boxwid / 1.5
boxht = boxht / 1.5
arcrad = arcrad / 1.5
movewid = movewid / 1.5
moveht = moveht / 1.5
arrowwid = arrowwid / 1.5
arrowht = arrowht / 1.5
arrowhead = arrowhead / 1.5
linewid = linewid / 1.5
lineht = lineht / 1.5
.PE
.NH
The LL suffix parser

.nr PS 10
.nr VS 12
.RS
.LP
In this chapter, we describe the construction of the LL suffix parser.
The described parser is not restricted to LL(1) grammars, because the
presence of conflict resolvers in LLgen allows for more general grammars,
which may even be left-recursive. We start this chapter with a discussion
of the implications of conflict resolvers, and continue with descriptions
of the parser algorithm, the data structures used,
the handling of left and right recursion, and some possible optimizations.
.RE

.NH 2
LLgen conflict resolvers and their implications

.LP
In grammars that are nearly but not completely LL(1), conflicts
will arise in the two places where parsing decisions are made: the choice
of which alternative to start (`alternation conflicts') and the decision
to stop or continue a repeated item (`repetition conflicts'). In order to
allow LLgen to handle this type of grammar, the user can
specify conflict resolvers in those places where conflicts arise.
These resolvers are Boolean expressions labeling an alternative,
and are evaluated when a conflict arises during parsing. If the
expression evaluates to `true' the labeled alternative will be taken.
The Boolean expressions are expressions in C, and can consult
any information available at the point they occur.
However, if a syntactic error has occurred in the input, and the non-correcting
error recovery starts, we can no longer rely on the conflict resolvers to
guide parsing decisions. The suffix recognizer is only concerned with
syntax, and will not execute any semantic actions. It recognizes suffixes
of correct input, but does not know or care what prefix would make
the suffix a correct program; as a result, the information that conflict
resolvers could use is not available, because the semantic actions
that would build this information have not been executed.
Therefore, the information used by the conflict resolvers is no longer
reliable, and the suffix parser needs to be able to handle the underlying
grammar without their help. In particular, it has to be able to handle
left-recursive grammars.
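.LP
As an illustration, an alternation conflict in LLgen input is resolved
roughly like this (see [GRUNE]; the rule and the C function
is_typedef_name(), which would consult the symbol table built by earlier
semantic actions, are invented for the example):
.DS
statement
	: %if (is_typedef_name())
	  declaration
	| expression ';'
	;
.DE
.LP
When both alternatives are possible on LL(1) grounds, the generated
parser evaluates the C expression after %if and takes the first
alternative if it yields `true'; repetition conflicts are resolved
analogously with %while. It is exactly such expressions that become
meaningless once the suffix recognizer has taken over.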

.NH 2
The suffix parser algorithm

.LP
Our algorithm needs easy access to the grammar rules; in the description
we assume there is an efficient way to access them. In
the next chapter we will describe the details of the actual implementation.
For the moment, we will only consider grammars that are not left- or
right-recursive. In later sections, we will discuss how the algorithm has to be adapted
to handle left and right recursion.

.LP
Suppose the grammar is G, and the input to the suffix recognizer is
$a sub 0 a sub 1 ... a sub n-1 a sub n$. Remember that parsing is
always started by the `normal' LLgen-generated parser. It is only after
a syntactic error has occurred that the suffix recognizer will be started.
The input to the suffix recognizer thus is the `tail' of the input, starting
at the first symbol after the position where the first syntax error was
found.

.LP
Now, in order to get parsing going again, the parser scans the grammar
for rules that contain symbol $a sub 0$ in the right-hand side:
.br

	A:	$alpha ~ a sub 0 ~ beta$
.br

.LP
where $alpha$ and $beta$ represent strings of terminals and nonterminals,
possibly empty. Now, for each of these rules found, and for any string
$b sub 0 b sub 1$...$b sub m$ that can be generated by $beta$, it holds that
$a sub 0 b sub 0 b sub 1$...$b sub m$ is a substring of some string in L.
This can be shown as follows, supposing that the start symbol is S and
S $-> sup * gamma$  A $delta$:
.br

S $-> sup * gamma$ A $delta$ $-> sup * gamma ~ alpha ~ a sub 0 beta ~ delta
-> sup * gamma ~ alpha ~ a sub 0 b sub 0 b sub 1$...$b sub m delta$

.br
Of course, there may very well be more than one such string
$b sub 0 b sub 1$..$b sub m$, and such a string can be empty as well, if
$beta$ can produce the empty string. Now, in what we will call the
.I
predicting phase
.R
the algorithm will
produce all possible symbols $b sub 0$. Then, in what we will call the
.I
accepting phase
.R
these symbols are matched against
the input, and those not matching are discarded. Then, entering the next
predicting phase, the algorithm will produce
all symbols $b sub 1$, and match them against the next input symbol in
the subsequent accepting phase,
etc. In case one of the strings $b sub 0$...$b sub m$ is empty, or
the end of one of the strings is reached, some way to continue is
needed; we will discuss this later. First let's see how the
algorithm produces the strings $b sub 0$...$b sub m$.

.LP
For each rule in the grammar of the form
.br

	A:	$alpha a sub 0 W sub 1 W sub 2$...$W sub p$
.br

with each $W sub k$ a terminal or nonterminal, a
.I
prediction graph
.R
is created that looks like this:

.PS
down; box "$a sub 0$"; arrow; box "$W sub 1$"; arrow
box "$W sub 2$"; arrow dashed; box "$W sub p$"
arrow; box "END" "$[A]$"
.PE

.LP
The bottom element of these prediction graphs is an END marker containing the
left-hand side of the rule used. All these graphs have $a sub 0$ on top, and
this $a sub 0$ is matched against the $a sub 0$ in the input in the
accepting phase that follows, removing the
$a sub 0$ from the graph. If the prediction graph is now empty, we have to find a way
to continue; this case is treated later. First we will consider what to do if
the prediction graph is not empty. There are two possibilities: either $W sub 1$ is a
terminal, or it is a nonterminal. If it is a terminal, we are finished for
the moment; if not, the algorithm scans for rules of the form
.br

	$W sub 1$:	$U sub 1 U sub 2$...$U sub i$
.br

.LP
with each $U sub k$ a terminal or nonterminal. Now, the algorithm substitutes
the top of the prediction graph with the right-hand sides
of all the rules found. Because there can be more than one rule, the
prediction graph can now become a DAG (Directed Acyclic Graph).
Supposing there are two rules with $W sub 1$ in the LHS:

.br

	$W sub 1$:	$U sub 1 U sub 2$...$U sub i$
.br
	$W sub 1$:	$V sub 1 V sub 2$...$V sub j$

.LP
the prediction graph will now look like this:

.PS
B1: box "$U sub 1$"
move
B2: box "$V sub 1$"
arrow dashed down from bottom of B1
B3: box "$U sub i$"
arrow dashed down from bottom of B2
B4:box "$V sub j$"
move to 0.5 <B3.se, B4.sw>
down;move
B5:box "$[W sub 1 ]$"
arrow dashed;
box "$W sub p$"
arrow;
box "END" "$[A]$"
arrow from B3.bottom to B5.top
arrow from B4.bottom to B5.top
.PE

.LP
The graph element representing $W sub 1$ is left in the graph; the
notation $[W sub 1 ]$ indicates it has been substituted. These substituted
elements will from now on be ignored by the algorithm. The elements
$U sub 1$ and $V sub 1$ are now `on top' of the prediction graph.

.LP
If $W sub 1$ can also produce empty, its successor in the prediction graph
has to be processed
as well; the algorithm walks down the graph to this successor, and
there the process is repeated; if it is a terminal we are finished, else we
substitute it with the right-hand sides of its grammar rules.
However, the element that we want to substitute now, say $W sub k$, cannot
be marked `substituted' just like that, because it can be on another
path, on which it cannot be substituted yet. Therefore, a copy of element
$W sub k$ is made, it is marked $[W sub k ]$, and an edge is created
from $[W sub k ]$ to the successor of $W sub k$. This produces graphs like
this:
.br
.PS
B1: box "$U sub 1$"
move
B2: box "$V sub 1$"
move
X1:box "$X sub 1$"
arrow dashed down from bottom of B1
B3: box "$U sub m$"
arrow dashed down from bottom of B2
B4:box "$V sub j$"
arrow dashed down from bottom of X1
Xj: box "$X sub j$"
move to 0.5 <B3.se, B4.sw>
down;move
B5:box "$[W sub 1 ]$"
arrow dashed;
B6: box "$W sub k$"
arrow
Wk1:box "$W sub k+1$"
arrow dashed
box "$W sub n$"
arrow;
box "END" "$[A]$"
arrow from B3.bottom to B5.top
arrow from B4.bottom to B5.top
move down from Xj.top;move;move;move
Wk: box "$[W sub k ]$"
arrow from Xj.bottom to Wk.top
arrow from Wk.bottom to Wk1.top
.PE

.LP
This process of substituting is repeated with all nonterminals that are
now on top of the prediction graph, until there are only terminals on top of
the graph.
This completes the prediction phase of the algorithm, not taking into account
what to do if an END marker appears on top of the graph.
Now, the algorithm enters its accepting phase, in which
the terminals on top are compared with the next symbol in the input.
If a terminal in the graph matches the input, its element is deleted
from the graph, and the substitution process will continue with its
successors, in the next prediction phase.
If a terminal on top of the graph does not
match the input, the path it is on represents a `dead end', which
does not need to be processed any further. The terminal is no longer
a `top', and the algorithm will not visit it again.

.LP
There is one tricky situation: consider again this graph:

.PS
B1: box "$U$"
move
B2: box "$a$"
move to 0.5 <B1.se, B2.sw>
down;move
B5:box "$W sub 1 $"
arrow dashed;
box "$W sub n$"
arrow;
box "END" "$[A]$"
arrow from B1.bottom to B5.top
arrow from B2.bottom to B5.top
.PE

.LP
Here, the algorithm is processing $W sub 1$ in the predicting phase, and
using some rule it has produced $a$ on top; there is another rule with
$W sub 1$ in its LHS which has produced nonterminal $U$ on top.
Now, suppose $U$ is a nonterminal that can
produce empty. The algorithm starts substituting $U$, and walks
down to $W sub 1$. What we definitely do not want
is for the algorithm to start substituting $W sub 1$ again, because then we
would loop forever. Therefore, if the algorithm starts processing
element $W sub 1$ it should make it $[W sub 1 ]$ before it does
anything else. On entering the element
for the second time in the prediction phase, it sees that the element is already
substituted, so there is nothing to do.
It then just walks to the successor of $W sub 1$ and
starts substituting it. This is correct, since the fact that the algorithm
enters an element for the second time in a prediction phase means that the element
can indirectly produce the empty string, and thus its successor must
be substituted as well in the prediction phase.

.LP
It is easy to see that the substitution process will stop: the algorithm can
only loop if it starts processing an element for the second time in a
prediction phase,
or if the processing of an element eventually yields a graph with that
same element on top.
The first case cannot occur because the algorithm marks elements it is
processing as `substituted' before it does anything else, meaning that those elements will not
be processed again; the second case can only occur if the grammar is
left-recursive, which we assumed it was not.
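.LP
To make the predicting phase concrete, here is a small, compilable C
sketch of one predicting step for a toy grammar, under this section's
assumption that the grammar is neither left- nor right-recursive. All
names are ours, not LLgen's: terminals are lower-case characters,
nonterminals upper-case, and where the real recognizer builds a
prediction graph, the sketch merely collects the head symbols.
.DS
/* predict.c -- toy sketch of one predicting phase */
#include <stdio.h>
#include <string.h>
#include <ctype.h>

struct rule { char lhs; const char *rhs; };

/* Toy grammar:  S -> aB,  B -> bS | c | (empty). */
static struct rule grammar[] = {
    { 'S', "aB" }, { 'B', "bS" }, { 'B', "c" }, { 'B', "" },
};
static const int nrules = sizeof grammar / sizeof grammar[0];

static int nullable(char nt)   /* only direct empty rules, for brevity */
{
    for (int i = 0; i < nrules; i++)
        if (grammar[i].lhs == nt && grammar[i].rhs[0] == 0)
            return 1;
    return 0;
}

/* Collect in `heads' every terminal that can turn up on top after
 * predicting from the sentential form `form'. */
static void predict(const char *form, char *heads)
{
    char expanded[64];

    if (form[0] == 0)
        return;                /* here an END marker would turn up */
    if (islower((unsigned char)form[0])) {  /* terminal: a new head */
        if (strchr(heads, form[0]) == NULL)
            strncat(heads, form, 1);
        return;
    }
    for (int i = 0; i < nrules; i++)   /* substitute the nonterminal */
        if (grammar[i].lhs == form[0]) {
            snprintf(expanded, sizeof expanded, "%s%s",
                grammar[i].rhs, form + 1);
            predict(expanded, heads);
        }
    if (nullable(form[0]))     /* empty: process the successor too */
        predict(form + 1, heads);
}

int main(void)
{
    char heads[32] = "";

    predict("B", heads);       /* as if B were on top of the graph */
    puts(heads);               /* prints "bc" */
    return 0;
}
.DE
.LP
In the real recognizer the recursion is replaced by explicit graph
building, and the `substituted' marking described above prevents the
repeated scanning that this naive recursion would do.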

.LP
The algorithm simulates
left-most derivations of strings $a sub 0 b sub 0 b sub 1$..$b sub n$
starting from $a sub 0 W sub 1$..$W sub p$; as we showed before, if
the algorithm recognizes a string $a sub 0 b sub 0$..$b sub n$ that
string is a substring of some string in L. Conversely, because the
algorithm starts out by using all rules of the form
A:	$alpha a sub 0 beta$, and then proceeds to simulate all
possible left-most derivations, it will recognize all input
$a sub 0 b sub 0$... $b sub n$ that can be produced starting from
$a sub 0 beta$.

.LP
Now we will discuss what has to be done if an END marker appears as
top of the prediction graph.
When this happens, it means that starting from some rule
.br

	A:	$alpha a sub 0 beta$

.br
the algorithm has produced a leftmost derivation of a string
$a sub 0 b sub 0 .. b sub n$ starting from $a sub 0 beta$, or that $beta$ can produce
empty and the string so far is just $a sub 0$. The next step is to assume
that we have recognized A and that some string produced by $alpha$
is part of the prefix that makes the suffix we are recognizing a
correct string in L. Remember that in the END marker we kept a record of
the LHS of the rule that started the graph, and we will now use this
LHS to continue recognizing. What the algorithm does is scan for all
rules of the form:
.br

	B:	$gamma$ A $delta$
.br

with $gamma$ and $delta$ possibly empty strings of terminals and nonterminals.
The algorithm now starts a new component in the prediction graph, and if $delta$ is
$W sub 1 W sub 2$...$W sub n$ it looks like this:

.PS
down;box "$W sub 1$"; arrow
box "$W sub 2$"; line dashed; box "$W sub n$"
arrow; box "END" "$[B]$"
.PE

.LP
Note that the END marker now contains B, because we have started to match
a rule for B. If the $delta$ in the rule for B was empty, this just produces
an END marker with B in it; in this case, the process is just repeated
with all rules of the form:
.br

	C:	$zeta$ B $eta$
.br

.LP
etc., until we have a prediction graph with a nonterminal or terminal on top.
Now, the substitution algorithm is again applied over all nonterminals on
top, until every top contains a terminal. It is possible that during
substitution again an END marker will turn up; if this happens
we again scan for rules to continue with, etc.
This `continuation algorithm' can only loop if, when
trying to build a new prediction graph for matched symbol A, it produces an empty
graph with again matched symbol A. If this happens, the grammar was
(directly or indirectly) right-recursive, and we assumed that it was not.
Therefore, the algorithm will terminate. The terminals on top of the
new graph after applying this `continuation' algorithm are exactly those
that could follow the string $a sub 0 b sub 0$..$b sub n$ in a substring
of a string in L.
To see this, suppose we have `recognized' the rule
.br

	A:	$alpha a sub 0 beta$

.br
and $a sub 0 b sub 0 b sub 1$...$b sub n$ is the string produced from
$a sub 0 beta$ by the algorithm. Now, using rule:
.br

	B:	$gamma$ A $delta$

.br
and supposing that S $->$ $zeta$ B $eta$ we get
.br

	S $->$ $zeta$ B $eta$ $->$ $zeta gamma$ A $delta$ $eta$ $->$ $zeta gamma a sub 0 b sub 0 b sub 1$ ... $b sub n$ $delta$ $eta$

.br
.LP
and thus any string produced by a derivation starting from
$delta$ can come right after $a sub 0 b sub 0$...$b sub n$ in a substring
of some string in L. The algorithm will proceed to generate all these
strings starting from $delta$. If $delta$ produces empty, the above
is just repeated. Because in the `continuation' part
all possible rules are considered, the whole algorithm will recognize
all substrings of any string in L. In order to determine if we
have actually recognized a suffix of some string in L, we need to
remember if within a predicting phase the `continuation' part of the algorithm has been run
on an END marker containing the start symbol S;
if this is the case, then the input seen until now is a suffix of some string in L.
Formally, it means that there is a derivation starting from start symbol
$S$ such that if the
input seen until now is $a sub 0 a sub 1$..$a sub n$, then:
.br

	S $-> sup * alpha beta$ $-> sup * alpha a sub 0 a sub 1$..$a sub n$
.br

.LP
where $alpha$ can be empty and $beta$ is not empty.

.NH 2
The prediction graph data structure

.LP
The graphs that are produced by the suffix recognizer may grow extremely
large; to facilitate an efficient
implementation we have devised a way of keeping the size of the
data structure under control, in a way that is very similar to
the one described in [TOMITA].

.LP
The basic idea is that in a prediction phase of the algorithm, it is not
necessary to explicitly substitute each nonterminal every time it
turns up as a `top'; it is sufficient to do it once, because the
second substitution would produce exactly the same subgraph starting at
the substituted nonterminal. Here is an example:

.PS
down;box "$a$";arrow;box "A";arrow dashed;box "[B]";arrow
box "C";arrow dashed;box "END" "[X]"
move right from last box.e;
box "END" "[Y]";
arrow <- dashed up from last box.top;
box "D";arrow <- up from last box.top
box "B"
.PE

.LP
Here, in the left component of the graph, nonterminal B has been
substituted. Now, in the same prediction phase, the algorithm again runs into
B, now in the right component. There is no need to compute again
what the substitution will produce; it is exactly the part on top
of B in the left component. Therefore, all that is needed is:

.PS
down;box "$a$";arrow;box "A";arrow dashed;
B1: box "[B]";arrow
box "C";arrow dashed;box "END" "[X]"
move right from last box.e;
box "END" "[Y]";
arrow <- dashed up from last box.top;
box "D"
arrow from B1.bottom to last box.top
.PE

So, when, in a prediction phase of the algorithm, a nonterminal is substituted,
the nonterminal is placed on a list, together with a pointer to
the substituted nonterminal. If in the same prediction phase a nonterminal that
is on the list becomes a top, all we need to do is place an edge
between the already substituted one and the successor of the top we are currently
processing. When a prediction phase is finished, the list is cleared.
There is one catch: if we consider again the last picture,
note that if nonterminal B can (directly or indirectly) produce empty,
it is also necessary to substitute D. However, it is not difficult to
determine whether a nonterminal can produce empty; LLgen already computes
this information for each nonterminal.

.LP
Without this `joining together' of graph components, each
element in the graph has exactly one successor, except the END marker,
which has none.
Now that components get joined as described, an element can have any
number of successors. The recognizer algorithm now has to consider all
successors of a graph element instead of one.
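.LP
In C, a graph element along these lines might look as follows. This is an
illustrative sketch with field names of our own, not the actual LLgen
data structure; it also carries the bookkeeping fields that the following
sections will use.
.DS
/* One prediction graph element (sketch). */
struct p_elem {
    int            symbol;      /* terminal or nonterminal number */
    int            substituted; /* already expanded in this phase? */
    int            flag;        /* visit flag for graph walks */
    int            refcount;    /* number of incoming edges */
    int            nsucc;       /* number of successors */
    struct p_elem  **succ;      /* edge array; grows when paths join */
};
.DE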

.NH 2
Handling right recursion

.LP
The only problem right-recursive grammars cause in the algorithm is in the
`continuation' part; they can cause this part of the algorithm to loop
forever. As an example, consider:
.br

	A:	$alpha$ B
.br
	B:	$beta$ C
.br
	C:	$gamma$ A

.LP
Now suppose the `substitution' part of the algorithm has turned up
an END marker with nonterminal A in it. The continuation algorithm will
now produce:

.PS
box "END" "[A]";move;box "END" "[C]";move;box "END" "[B]";move
box "END" "[A]";move;box "END" "[C]"
.PE

.LP
etc. etc. However, a slight modification to the algorithm suffices
to eliminate this problem; within each prediction phase of the algorithm, we
simply maintain a list of nonterminals that have turned up in an
END marker. As soon as an END marker turns up whose nonterminal is
already in the list, we stop the `continuation' algorithm; the part
of the graph that it would produce has already been generated
by an earlier invocation of the algorithm in the same prediction phase.
At the end
of a prediction phase, when all heads are terminals, we clear the list.
This way, no looping can occur; even if the right recursion is
indirect, for instance if in the above example the rule for A had been
.br

	A:	$alpha$ B $delta$
.br
.LP
where $delta$ can produce empty, the algorithm still works; the substitution
of $delta$ will yield an END marker on top, and when trying to find
a continuation for LHS A the algorithm notices A is already on the list.
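.LP
In C, the modification is only a few lines. The sketch below, which reuses
the toy grammar table and predict() of the sketch in section 2.2,
continues an END marker for a given nonterminal; the `seen' array is the
per-phase list (again, all names are ours, and the handling of a rest of
rule that derives empty is elided):
.DS
static int seen[256];   /* per-phase list, cleared after each phase */

/* Continue an END marker for nonterminal `lhs' (toy sketch). */
static void continue_nt(char lhs, char *heads)
{
    if (seen[(int)lhs])
        return;            /* its graph was already built this phase */
    seen[(int)lhs] = 1;
    for (int i = 0; i < nrules; i++) {
        const char *p = strchr(grammar[i].rhs, lhs);

        if (p == NULL)
            continue;      /* lhs does not occur in this rule */
        if (p[1] != 0)
            predict(p + 1, heads);   /* rest of the rule: delta */
        else
            continue_nt(grammar[i].lhs, heads);  /* empty delta */
        /* a full version would also continue with grammar[i].lhs
         * whenever the rest of the rule can derive empty */
    }
}
.DE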

.NH 2
Handling left recursion

.LP
Left recursion is, unfortunately, a much tougher problem than
right recursion. The result of left-recursive grammar rules is that
the substitution algorithm never stops, because it can keep on building
the graph with the same set of rules without ever turning up a terminal.
One course of action would be to pre-process the grammar rules to
eliminate left recursion; there are algorithms that eliminate direct
and indirect left recursion. However, we have taken another course; by
allowing the produced graphs to contain loops, we can handle left
recursion without any modifications to the grammar. As soon as
we come to the point that we want to substitute a nonterminal
that was already substituted earlier on the same path and in
the same prediction phase, we
make a link from the `older' nonterminal to the successor of
the `new' nonterminal. In this way we have constructed a loop
in the graph. As an example, suppose we have the following rules:
.br

	D:	A

	A:	B a

	B:	A | x

.br
Suppose also that we have nonterminal `D' on top of the graph. We
now start substituting `D':

.PS
A: box "A"
move
X: box "x"
move to 0.5 <A.se, X.sw>
down
move
B: box "[B]"
arrow
box "a"
arrow
box "[A]"
arrow
box "[D]"
arrow dashed
box "END" "[S]"

arrow from A.s to B.n
arrow from X.s to B.n
.PE

.LP
We now have an `A' on top of the graph which was already
substituted on the same path and also in the same prediction phase. To avoid
never-ending substitution we make a loop as follows:

.PS
A: box "A" dashed
move
X: box "x"
move to 0.5 <A.se, X.sw>
down
move
B: box "[B]"
arrow
box "a"
arrow
A2: box "[A]"
arrow
box "[D]"
arrow dashed
box "END" "[S]"

arrow dashed from A.s to B.n
arrow from X.s to B.n
arc <- from B.w to A2.w
.PE

.LP
The dashed box with `A' in it means that it can be deleted, because
there is already an occurrence of it in the loop.

.LP
The most beautiful result of allowing loops in the graphs is
that the original parsing algorithm needs only one minor change.
When the algorithm visits an element that has more than one
outgoing edge it starts tracking down all these paths,
just like before; there may now be one or more back edges among
these edges, but the algorithm need not be aware of this fact.
The only difficulty with loops is that the algorithm might go round
a loop forever: it continues searching for terminals, but it might happen
that there are no valid terminals in the loop. The solution to this
problem is not very difficult; just set a flag at all elements we
visit. When we reach an element which has this flag turned on, we
don't have to search any further. At the end of the prediction phase, when we
have found all possible new heads, all flags are cleared.
Even if there are no loops in the
prediction graph, setting flags may be used as an optimization:
it is possible that two paths come together at one point. In that situation
it is useless to scan the part of the graph that
both paths have in common a second time.
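.LP
With the visit flag, the search for new heads over a possibly cyclic
graph is an ordinary depth-first walk. A sketch in C, using the element
structure of section 2.3, with is_terminal() standing in for whatever
test the implementation really uses:
.DS
/* Collect all heads reachable from e; terminates even on loops. */
static void find_heads(struct p_elem *e,
                       struct p_elem **heads, int *nheads)
{
    if (e->flag)
        return;             /* a loop, or a shared path tail */
    e->flag = 1;            /* cleared again after this phase */
    if (is_terminal(e->symbol))
        heads[(*nheads)++] = e;     /* found a new head */
    else
        for (int i = 0; i < e->nsucc; i++)
            find_heads(e->succ[i], heads, nheads);
}
.DE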

.NH 2
Some optimizations using reference counts

.LP
As explained in section 2.2, it is sometimes necessary to copy a
prediction graph element before substituting it. In order to determine
whether a certain element has to be copied, it is convenient to maintain
a reference count in each graph element. This reference count keeps
track of the number of edges that enter the element. Now, when we want
to substitute an element whose reference count is not 0, we need to
copy it, because there is another path in the prediction graph that
contains the element we want to substitute, and on this other path
the element cannot be substituted yet.

.LP
Maintaining reference counts also enables us to perform another
optimization: remember that if, in a prediction phase, a terminal
is predicted that does not match the current input symbol, from
then on we just ignore the path in the graph starting at the terminal.
However, we can safely delete the terminal from the graph; furthermore,
all its successors in the prediction graph whose reference count
drops to 0 can be deleted as well, as can their successors with reference
count 0, etc. This way, we delete from the prediction graph
most elements that are no longer accessible, but not all of them; as will
be explained in the next section, loops in the prediction graph
can cause problems.
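.LP
On the element structure of section 2.3, this cascading delete is a short
recursive routine; the sketch below assumes the caller has already removed
the edge that pointed at the deleted head, and that free() comes from
<stdlib.h>:
.DS
/* Delete e if nothing points at it any more, then try its
 * successors. Elements on an unreachable loop keep a nonzero
 * count and survive; the next section deals with those. */
static void p_release(struct p_elem *e)
{
    if (e->refcount > 0)
        return;             /* still reachable along another path */
    for (int i = 0; i < e->nsucc; i++) {
        e->succ[i]->refcount--;     /* remove our edge into it */
        p_release(e->succ[i]);
    }
    free(e->succ);
    free(e);
}
.DE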

.NH 2
The algorithm to delete inaccessible loops

.LP
Deleting graph elements that are no longer reachable is not as easy
as it looks when there are loops in the graph, introduced by
the extension to the algorithm that handles left-recursive grammars.
Suppose for example that we have a very simple loop, as in the left
picture below:

.PS
down
X: box "x" "(0)"
arrow
box "[B]" "(2)"
arrow
box "a" "(1)"
arrow
box "[A]" "(1)"
arrow
box "[D]" "(1)"
arc <- from 2nd box.w to 2nd last box.w

move right from X.ne
move
move
move
move
move
move
down
box "x" "(0)" dashed
arrow dashed
B: box "[B]" "(1)"
arrow
box "a" "(1)"
arrow
box "[A]" "(1)"
arrow
box "[D]" "(1)"
arc <- from B.w to 2nd last box.w
.PE

.LP
The number below each symbol indicates the reference count of that element.
Suppose now that we delete `x'; then we have the situation depicted in the
picture on the right. The loop consisting of `[B]', `a' and `[A]' is now
unreachable, so all these elements should be deallocated.
But the reference count of `[B]' is 1, so it will not be deleted. To be precise,
all elements in the loop have their reference counts at 1, and
consequently none of them will be deleted. Yet we just established
that the elements of the loop cannot be reached anymore and that the
loop has to be deleted! In this example the reference counts of the
loop elements are all 1, but in more complex situations it is also
possible that some of the elements have a reference count of more
than 1.

.LP
To solve this problem we present an algorithm, devised by E. Wattel, that
determines whether a loop can be deleted or not.
The algorithm consists of two parts. The first part goes as
follows: it presumes that all elements of the loop will indeed be
deleted. Every time it deletes an element it decreases the reference
count of all the successors of the element that are also members of the same
loop. How the algorithm knows which elements belong to the loop and which
do not will be explained later. The situation of the example above will now
look like this:

.PS
down
box "[B]" "(0)"
arrow
box "a" "(0)"
arrow
box "[A]" "(0)"
arrow
box "[D]" "(1)"
arc <- from 1st box.w to 2nd last box.w
.PE

.LP
The number below each symbol again indicates the reference count
after we have applied the first part of the algorithm.

.LP
The second part of the algorithm checks and restores the
reference counts of all members of the loop. When it finds
out that one or more reference counts are not 0, it concludes
that it is still possible to enter the loop in some way, and
that the loop cannot be
deleted yet. In the other case it reports that the loop can be
deleted, which is also true in our example.

.LP
We will now formally describe the first part of the algorithm,
which finds all directed circuits from a given vertex and determines if
the vertices on those circuits can be deleted.
The algorithm works on prediction graphs in which every edge that
is in a circuit is marked. Note that a marked edge may be in more than one circuit.
We will call this mark `C'.
The input to the algorithm is such a prediction graph and a start vertex,
say A. The first part of the algorithm is as follows (a C sketch of both
parts is given after the list):

.IP 1
Put the start vertex A on a list L; mark all edges `unused'
.IP 2
If L is empty, stop
.IP 3
For each vertex in list L, check if there are edges marked both `C' and
`unused'. For each edge found, mark it `used', and traverse it to its
other endpoint; put this endpoint on a new list M, initially empty
.IP 4
Decrease the reference count of all vertices on M by 1
.IP 5
L := M; go to 2
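.LP
The following C sketch implements both parts on vertex and edge types of
our own; a real implementation would keep the `C' and `used' marks on the
edges of the structure of section 2.3. All `used' flags are assumed to
start out clear, and the work lists are given a toy bound.
.DS
#include <string.h>

struct edge   { struct vertex *to; int C, used; };
struct vertex { int refcount, nedge; struct edge *e; };

enum { MAXV = 256 };    /* toy bound on the work lists */

/* Part 1: from a, traverse every unused C-edge once, decreasing
 * the reference count of its endpoint; record what was visited. */
static int sweep(struct vertex *a, struct vertex **visited)
{
    struct vertex *L[MAXV], *M[MAXV];
    int nl = 1, nm, nv = 0;

    L[0] = a;                           /* step 1 */
    while (nl > 0) {                    /* step 2 */
        nm = 0;
        for (int i = 0; i < nl; i++)    /* step 3 */
            for (int j = 0; j < L[i]->nedge; j++) {
                struct edge *ed = &L[i]->e[j];

                if (!ed->C || ed->used)
                    continue;
                ed->used = 1;
                ed->to->refcount--;     /* step 4 */
                visited[nv++] = M[nm++] = ed->to;
            }
        memcpy(L, M, nm * sizeof M[0]); /* step 5: L := M */
        nl = nm;
    }
    return nv;
}

/* Part 2: the visited vertices may be deleted only if every one
 * of them ended up with reference count 0. */
static int loop_dead(struct vertex **visited, int nv)
{
    for (int i = 0; i < nv; i++)
        if (visited[i]->refcount != 0)
            return 0;
    return 1;
}
.DE
.LP
If loop_dead() returns false, the decreased reference counts must of
course be restored, as the second part of the algorithm prescribes.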

.LP
It is clear that the algorithm will terminate: each edge is traversed at most once,
and the number of edges is finite. We will now prove some properties of this
part of the algorithm.

.LP
.I
An edge is traversed by the algorithm if and only if it is on some
directed circuit $A ->$...$-> A$.
.R
.br

The if-part is easy: if an edge $e$ connecting vertices $W$ and $V$ is on some directed circuit starting in
$A$, then there is a path $A ->$...$-> W -> V$; let $A ->$...$-> W -> V$ be a path
of minimum length from $A$ to $V$. If the length of the path from $A$ to
$W$ is $k$, then after turn $k$ of the algorithm $W$ will be on list L. To see
that this is the case, suppose that $W$ is not on list L after turn $k$;
this means that the edge entering $W$ was already marked `used' in a
previous turn, but then there would be a shorter path from $A$
to $W$, contradicting the assumption that the path is of
minimum length. The edge
$e$ is marked `C', because it is in a circuit; it is marked `unused', for if
it were marked `used', there would be a shorter path from $A$ to $V$. So,
in turn $k + 1$, the edge $e$ will be traversed.

.LP
On the other hand, suppose that an edge $e$ is traversed by the algorithm;
we will show by induction on the number of turns the algorithm has made
that $e$ is on a directed circuit $A ->$..$-> A$. In the first turn, all
edges from $A$ that are marked `C' are traversed, and clearly, if an edge
from $A$ is part of a circuit then that edge is part of a circuit from $A$ to $A$.
Now suppose that in turn $n+1$ an edge $e$ connecting vertices $W$ and
$V$ is traversed. This means the edge is
marked `C', so it is part of some circuit. If there is a path from $V$ to $A$,
we can simply trace a circuit
$A ->$...$-> W -> V -> $...$-> A$, and clearly $e$ is on a circuit from
$A$ to $A$. Now, suppose there is no path from $V$ to
$A$. We can always trace a circuit $W -> V ->$...$-> W$ because the
edge from $W$ to $V$ is part of a circuit; and by the
induction hypothesis there is a circuit $A ->$...$-> W ->$...$-> A$. We can
now make a `detour' at $W$, yielding a circuit $A ->$...$-> W -> V$...
$-> W ->$...$-> A$. This case is shown in the picture below.
So in either case $e$ is on a circuit from $A$ to $A$.

.PS
down;
B1: box "A";
arrow dashed;
B3: box dashed;
arrow dashed;
B2: box "W";
arrow dashed; box dashed;
arc <- from B1.w to last box.w
arrow right "$e$" "C" from B2.e
box "V"; arrow dashed; box dashed;
arrow dashed  -> from last box.n to B3.e
.PE

.LP
.I
A vertex appears on list L if and only if it is on some directed
circuit from $A$ to $A$.
.R
.br

.LP
If a vertex is on such a circuit, there is an edge that enters it which
is part of a circuit from $A$ to $A$; we already showed that this edge
is traversed by the algorithm, and thus the vertex will appear on list
L. Conversely, if a vertex appears on list L, then an edge entering
that vertex has been traversed by the algorithm; we showed that this
edge is part of a circuit from $A$ to $A$, and thus the vertex is
part of a circuit from $A$ to $A$.

.LP
.I
When the algorithm is finished, each vertex that is part of some
directed circuit from $A$ to $A$ has its reference count decreased by exactly
the number of edges entering it that are part of a directed circuit from $A$ to $A$.
.R
.br

.LP
Each edge that is part of some circuit from $A$ to $A$ is traversed
exactly once; the reference count of the endpoint is decreased
by one after an edge has been traversed. Thus, if a vertex is the endpoint
of $k$ such edges, its reference count is decreased by $k$.

.LP
.I
If the reference count of each of the vertices visited by the algorithm
is 0 after the algorithm has finished, all these vertices can be deleted;
if the reference count is not zero for one or more of the visited
vertices, then none of them can be deleted.
.R
.br

.LP
Suppose all visited vertices have reference count 0; this means that
each of the vertices is only entered by edges that are on a circuit
from $A$ to $A$. Therefore, it holds that any path leading to any
of the visited vertices has to start in one of the visited vertices; there
is no path starting in an unvisited vertex to a visited one. Thus,
all the visited vertices are unreachable.
Conversely, if one of the visited vertices has a nonzero reference count,
then there is a path from an unvisited vertex to this vertex. Because from
the vertex with nonzero reference count we can get to $A$, and from $A$
we can get to any of the other vertices, all visited vertices are
reachable.

.LP
The second part of the algorithm now checks if all reference counts are
zero, and if they are, it deletes all visited vertices.

.NH 2
Marking loop elements

.LP
One point we have omitted so far is how the edges in the prediction
graph that are part of a loop get marked.
Basically, a loop can be detected:

	a. when it is made;
.br
	b. when we want to know about it.

.LP
The first approach checks whether a loop is constructed
as soon as we join two paths in the graph, and if so, marks all
edges of the loop. The other approach does not do any checking when two
paths are joined together; it starts looking for loops when we want
to delete an element with reference count not 0, marking all edges
belonging to the loops it discovers. In practice it turns out that
we very often encounter elements that we would like to delete but that have
reference count not 0, whereas the joining of paths occurs relatively
infrequently. We have therefore chosen to check whether a loop is created
when two paths in a prediction graph are joined.

.LP
Now the question arises how to find and mark all edges of
the loop. For this problem we also devised an algorithm.
Because we already know that there is an edge from the element to which
the new path is connected to the successor of the joined element, the
algorithm only has to find a path from this last element back to the first one.
This can be done by a backtracking depth-first search; to find a path from
one element to another we have to find a possibly empty path
from one of the successors of the first element to the last element. As
soon as we have found a path, we can mark all the edges on the path and also
the back edge as loop edges. If there is more than one path
back to the first element, the algorithm must continue
searching after it has found one path.

.LP
To avoid looping, this algorithm has to set a flag at the elements
that are already on the path. When the algorithm is backtracking it can
clear the flags at the elements it is leaving.
 | 
						|
 | 
						|
.LP
 | 
						|
To speed up the searching process we can set flags at the edges we have already 
 | 
						|
visited but did not lead back to the first element. When the algorithm
 | 
						|
encounters such an edge it already knows that this edge is not worth
 | 
						|
searching again and can be skipped. At the end of the algorithm these
 | 
						|
flags have to be cleared again.
 | 
						|
 | 
						|
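.LP
The following is a minimal C sketch of this search; the element and
edge layout and all names are illustrative assumptions. The caller is
assumed to mark the already-known back edge itself and to clear the
`dead' flags afterwards.
.br
.nf

struct edge;

struct element {
	struct edge *succ;	/* list of outgoing edges */
	int onpath;		/* set while element is on the path */
};

struct edge {
	struct element *to;	/* element this edge leads to */
	struct edge *next;	/* next outgoing edge */
	int loop;		/* set when edge proves to lie on a loop */
	int dead;		/* set when edge is known to lead nowhere */
};

/* Search every (possibly empty) path from `from' back to `goal';
 * mark the edges of each path found as loop edges and return 1 if
 * at least one path was found. */
int mark_loops(struct element *from, struct element *goal)
{
	struct edge *e;
	int found = 0;

	if (from == goal)
		return 1;		/* the empty path */
	if (from->onpath)
		return 0;		/* already on the path: don't loop */
	from->onpath = 1;
	for (e = from->succ; e != NULL; e = e->next) {
		if (e->dead)
			continue;	/* known dead end: skip it */
		if (mark_loops(e->to, goal)) {
			e->loop = 1;	/* edge lies on a circuit */
			found = 1;	/* but keep searching other paths */
		} else if (!e->to->onpath)
			e->dead = 1;	/* remember: not worth retrying */
	}
	from->onpath = 0;		/* backtracking: clear the flag */
	return found;
}

.fi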
.LP
One might propose another optimization: as soon as
we reach an edge that is already marked as a loop edge, we
can stop searching for other loop edges. There is, however,
a case in which this can go wrong. Imagine the following situation:

.PS
down
E: box "[E]"
arrow " C" ljust
D: box "[D]"
arrow " C" ljust
C: box "c"
arrow " C" ljust
box "b"
arrow " C" ljust
A: box "[A]"
arrow 
box "a"

move right from D
move right
J: box "[J]"
down
arrow from J.s " C" ljust
I: box "i"
arrow " C" ljust
H: box "[H]"
arrow from H.s to A.e 

arc <- from E.w to A.w 
move left from C
move left
"C"
arc -> from H.e to J.e
move right from I
move right
"C"

arrow dashed from E.s to J.n

.PE

What we have here is a prediction graph with two loops; all edges that belong
to a loop are again marked with a `C'. Note that the edge between `[H]'
and `[A]' is not a loop edge. Suppose that `[J]' is not yet
completely substituted, i.e. there is another production rule for
J:
.br

J:	E

.br
The `E' on top of the right path is now joined with the `[E]'
on the left path, which is depicted by the dashed arrow
between `[E]' and `[J]'. A closer look at the graph shows
that the two loops are merged into one. But that is not
the most important observation we have to make: not only must the
edge between `[E]' and `[J]' be marked as a loop edge, but
also the edge between `[H]' and `[A]'! So it is not possible
to stop searching for loop edges as soon as we have found an
edge that was already marked as a loop edge. We have to continue
until we reach the element at which we started: `[E]'. So the
optimization proposed above is incorrect.

.NH 2
Optimizations using FIRST and FOLLOW sets

.LP
In the algorithm as we have described it, every nonterminal on top of the graph
is substituted until only terminals remain on top; these terminals are
then matched against the current input symbol. However, by using
FIRST sets, we can save considerably on the number of computations
necessary. Suppose one of the top elements of the graph is nonterminal A,
and the current input symbol is $a$. Then it is of no use to substitute
A if terminal $a$ is not in FIRST(A), because then substituting A will
never produce $a$ on top of the graph. So, before substituting a
nonterminal we check if the current input symbol is in its FIRST set; if
it is not, we can declare the path the nonterminal is on a dead end, and
delete it, without having to perform the actual substitution. Of course, if
A can produce empty, we still have to consider its successor in the graph.

.LP
Similarly, when we have an END marker on top, with nonterminal B in
it, and we consider using rule
.br

	D:	$alpha$ B C $gamma$

.br
we first check if the current input symbol is in FIRST(C); if this is
not the case, there is no need to start a graph component with this
rule, because it will never produce the next input symbol on top.
Again, if C produces empty, we still have to evaluate the part of the
rule following C.

.LP
To circumvent the problems caused in the FIRST-set optimization by
nonterminals that produce empty, we can also make use of FOLLOW sets.
When substituting, if we encounter a nonterminal whose FIRST set does
not contain the current input symbol but which can produce empty,
we check if the current input symbol is in its FOLLOW set. If it is not,
there is no need to process its successor. Similarly, in case we
are processing an END marker as explained above, there is no need
to process the part of the rule following C if FIRST(C) does not
contain the input symbol, or if C produces empty but the input symbol
is not in FOLLOW(C).
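
.LP
The test performed by both optimizations can be summarized in the
following C sketch; in_first, in_follow and produces_empty are assumed
set-membership and property lookups, and are purely illustrative.
.br
.nf

extern int in_first(int nonterm, int token);
extern int in_follow(int nonterm, int token);
extern int produces_empty(int nonterm);

enum action { SUBSTITUTE, TRY_SUCCESSOR, DEAD_END };

/* Decide what to do with a nonterminal on top of a prediction graph
 * path, given the current input token. */
enum action predict_action(int nonterm, int token)
{
	if (in_first(nonterm, token))
		return SUBSTITUTE;	/* substitution can yield token */
	if (produces_empty(nonterm) && in_follow(nonterm, token))
		return TRY_SUCCESSOR;	/* skip it, try its successor */
	return DEAD_END;		/* the path can be deleted */
}

.fi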
.bp
.nr PS 12
.nr VS 14

.NH
Test results

.nr PS 10
.nr VS 12
.RS

.LP
In this chapter, we discuss some test results that were obtained
by recompiling existing ACK compilers with the modified LLgen.
We tried several combinations of possible optimizations, including
`dumb' ones, like no optimization at all, not even deleting unreachable
prediction graph elements.
The incorporation of LLgen with non-correcting error recovery went
smoothly; only minor modifications to the Makefiles were necessary.
Specifically, these modifications consisted of passing an extra
flag to LLgen, and of including the newly generated C-file Lncor.c in
the list of generated C-files. Also, the LLmessage error reporting
routine had to be adapted. We successfully recompiled the C, Modula-2
and Occam compilers; in the next sections, we discuss some test results
that were obtained with the Modula-2 and C compilers.

.RE
.LP
.NH 2
Performance

.LP
We will now present and discuss, with the aid of some
diagrams, time and space measurements on the non-correcting error
recovery. We have measured the effect of various optimizations.
These optimizations include the first-set optimization and the follow-set
optimization. We also measured the effect of leaving out the loop-deletion
algorithm, regarding both time and space. We performed our measurements using
C and Modula-2 programs of three different sizes: one of approximately
750 tokens, one of approximately 5000 tokens and one of approximately
15000 tokens. We have chosen to express the sizes of the programs in
the number of tokens instead of the number of lines, because the number
of tokens more realistically reflects the load the programs put on the
error recovery mechanism. Also, we give our time measurements in user
time instead of real time, because real time depends heavily on the
load of the system, whereas user time does not.
Our space measurements are based on the size of the prediction graphs.
Note that all files are entirely recognized by the non-correcting error
recovery technique. We achieved this by putting a `1' at the beginning
of each file; since each file then starts with a syntax error, LLgen
is forced to continue with the non-correcting error recovery.

.NH 3
Time and space measurements on the effect of the first-set optimization

.LP
In the diagram below we show the time measurements obtained from recognizing
the C programs both with and without the first-set optimization.

.G1
coord x 0, 17000 y 0, 65
ticks bot out at 750, 5000, 15000
label bot "Number of tokens"
label left "User Time" "(sec)" left .3
draw no_opt dashed
draw first_opt dashed

copy thru X
	times size +2 at $1, $2
	times size +2 at $1, $3
	next no_opt at $1, $2
	next first_opt at $1, $3
X until "XXX"

742	2.5	.9
5010	16.3	5.8
14308	54.2	16.8
XXX

copy thru X "$1 $2" size -2 at 11000, $3 X until "XXX"
No optimization 55
First-set optimization 20
XXX
.G2

.I
.ce
Time measurements of three C-programs with and without first-set optimization
.R

.LP
Notice the considerable time savings we
get when the first-set optimization is turned on: a factor of slightly more than
3. Obviously this is an extremely useful optimization. On the other hand,
we found there were no measurable time savings when using the follow-set
optimization; for that reason we did not chart the results of this optimization.
It seems that the time savings gained by the optimization are
wasted again by the extra processing time needed. We conclude that
this optimization is of little or no use when we want to save time.

.LP
In the following picture the time measurements of three Modula-2 programs
are given, again with and without first-set optimization.

.G1
coord x 0, 17000 y 0, 65
ticks bot out at 750, 5000, 15000
label bot "Number of tokens"
label left "User Time" "(sec)" left .3
draw no_opt dashed
draw first_opt dashed
copy thru X
	times size +2 at $1, $2
	times size +2 at $1, $3
	next no_opt at $1, $2
	next first_opt at $1, $3
X until "XXX"

823	1.3	.6
4290	7.6	3.5
16530	30.5	14.3
XXX

copy thru X "$1 $2" size -2 at 13000, $3 X until "XXX"
No optimization 30
First-set optimization 15
XXX
.G2

.I
.ce
Time measurements of three Modula-2-programs with and without first-set optimization
.R

.LP
This picture leads to much the same conclusion as above: considerable
time savings when we use the first-set optimization;
the factor is somewhat less, but still more than 2. Again we have omitted
the results of the follow-set optimization, for the same reason as before.

.LP
There is, however, one remarkable difference between the two languages: parsing
C programs takes almost twice as much time as parsing programs of comparable
sizes written in Modula-2. This can be explained by the fact that the
C grammar is far more complicated than that of Modula-2, and also the
production rules are longer in C, so building, deleting and especially
traversing the graph will consume more time.

.LP
Now we come to the space measurements of both the C and Modula-2 programs.
In the picture below we present the maximum sizes of the prediction graphs
during the recognition of the three C programs.

.G1
coord x 0, 17000 y 0, 18000
ticks bot out at 750, 5000, 15000
label bot "Number of tokens"
label left "Maximum size of" "the prediction graph" "(bytes)" left .3
draw no_opt dashed
draw first_opt dashed
copy thru X
	times size +2 at $1, $2
	times size +2 at $1, $3
	next no_opt at $1, $2
	next first_opt at $1, $3
X until "XXX"

742	5568	10444
5010	7668	12664
14308	13636	17308
XXX

copy thru X "$1 $2" size -2 at 8000, $3 X until "XXX"
No optimization 16000
First-set optimization 7000
XXX
.G2

.I
.ce
Maximum sizes of the prediction graphs when recognizing three C-programs
.R

.LP
From this diagram we see that, although the prediction graphs
are smaller when the first-set optimization is used, the space savings are
not as spectacular as the time savings achieved by this optimization.

.LP
In Modula-2 the first-set optimization also causes a decrease in memory
usage. The savings are less than in C, but still about 1.5 Kb. Again
this can be explained by the fact that the rules of the Modula-2 grammar
are shorter than those of C.

.G1
coord x 0, 17000 y 0, 12000
ticks bot out at 750, 5000, 15000
label bot "Number of tokens"
label left "Maximum size of" "the prediction graph" "(bytes)" left .3
draw no_opt dashed
draw first_opt dashed
copy thru X
	times size +2 at $1, $2
	times size +2 at $1, $3
	next no_opt at $1, $2
	next first_opt at $1, $3
X until "XXX"

823	5056	3292
4290	6420	4664
16530	11388	9632
XXX

copy thru X "$1 $2" size -2 at 8000, $3 X until "XXX"
No optimization 10000
First-set optimization 4000
XXX
.G2

.I
.ce
Maximum sizes of the prediction graphs when recognizing three Modula-2-programs
.R

.NH 3
Input that is recognized in quadratic time

.LP
The measurements presented may suggest that the time required to
recognize input depends linearly on the length of the input; however,
this is not always the case. When there are recursive rules in the
grammar, the time needed to recognize input that is produced by these
rules can become proportional to the square of the input length.
Consider this set of grammar rules:
.br
.nf

	S:	'{' A '}'
	A:	'a' A | $epsilon$

.fi
.LP
When the input is `{aaa....', the algorithm will produce the following
prediction graphs:

.PS
up; B1: box "END" "S"; arrow <- ;box "}";arrow <- ;box "A";arrow <- ;box "{";
move right from B1.se; move
up; B2: box "END" "S"; arrow <-; box "}"; arrow <-; box "[A]"; 
arrow <-; box "A"; arrow <-; box "a";
move right from B2.se; move
up; B3: box "END" "S"; arrow <-; box "}"; arrow <-; box "[A]";
arrow <-; box "[A]"; arrow <-; box "A"; arrow <-; box "a";
move right from B3.se;move
up; B4: box "END" "S"; arrow <-; box "}"; arrow <-; box "[A]";
arrow <-; box "[A]"; arrow <-; box "[A]"; arrow <- ; box "A"; arrow <-;box "a";
.PE

.LP
In each prediction phase, a new [A] appears on the prediction graph. However,
since A also produces empty, the prediction algorithm has to traverse all the
elements [A] until it finds the element `}'. In the first prediction phase,
there is one element [A], in the second there are two, etc., so in all
1 + 2 + 3 + ... + k = $k(k+1) over 2$ elements have to be traversed if
there are k prediction phases, making this proportional to the square
of the input length. We constructed a parser with this simple input grammar
and measured the processing time the error recovery mechanism used.
In the following diagram the dashed line shows the processing time needed;
the dotted line is the curve $t = 1.3 times 10 sup {-5} n sup 2$. Clearly
the processing time is proportional to the square of the number of tokens.

.G1
coord x 0, 2100 y 0, 60
ticks bot out at 500, 1000, 1500, 2000
label bot "Number of tokens"
label left "User Time" "(sec)" left .3
draw quad dashed

copy thru X
        times size +2 at $1, $2
        next quad at $1, $2
X until "XXX"

500  3.0
1000 12.4
1500 28.6
2000 51.4
XXX

draw dotted
for i from 0 to 2100 by 25 do { next at i, 0.000013 * i * i }
.G2

.LP
In the grammar used for the C compiler, array initializations are handled by a recursive
rule, so we would expect that the error recovery mechanism needs quadratic
processing time to recognize such an initialization. We made measurements of
the processing time, and indeed the
processing time grows proportionally to the square of the size of the input, as the
next figure shows. Here, the processing times are about half of those in
the previous example; this is because the recursion appears after two
tokens are recognized. Note that the algorithm only takes quadratic time
when it is recognizing input that is generated by a recursive grammar rule.
Other input is still recognized in linear time, regardless of the fact that
there are recursive grammar rules.

.G1
coord x 0, 5000 y 0, 85
ticks bot out at 1150, 2400, 3600, 4800
label bot "Number of tokens"
label left "User Time" "(sec)" left .3
draw quad dashed

copy thru X
        times size +2 at $1, $2
        next quad at $1, $2
X until "XXX"

1150 5.1
2400 20.3
3600 43.7
4800 78.6
XXX
.G2

.LP
Unfortunately, there is no easy way to speed up the recognition of these
recursively defined language elements; the cost is caused by the substituted
nonterminals that are left in the prediction graph, and we cannot just delete those
`dummies' from the graph during a prediction phase, because the `join' part of the
prediction algorithm depends on them. One could traverse the graph after
a prediction phase to delete the dummies, but then the processing
time needed to recognize non-recursively defined language elements would
increase dramatically. However, we feel that in practice things
like large array initializations will not occur in hand-written programs; when
they occur, it is probably in computer-generated programs, which normally
will be correct anyway, meaning that the error recovery never sees them.
When testing such generated programs, one is likely
to use small test cases, which are handled well by the error recovery.

.NH 3
Time measurements on the effect of leaving out the loop-deletion algorithm

.LP
We now show what effect the loop-deletion algorithm has on processing time;
to put it another way: how much time can be saved by turning off the
loop-deletion algorithm. In the diagram below we give the measurements of
the three C programs; note that the first-set optimization is used.

.G1
coord x 0, 17000 y 0, 22
ticks bot out at 750, 5000, 15000
label bot "Number of tokens"
label left "User Time" "(sec)" left .3
draw no_loop dashed
draw loop dashed
copy thru X
	times size +2 at $1, $2
	times size +2 at $1, $3
	next no_loop at $1, $2
	next loop at $1, $3
X until "XXX"

742	.9	.4
5010	5.8	6.8
14308	16.8	20.5
XXX

copy thru X "$1 $2" size -2 at 11300, $3 X until "XXX"
With loop-deletion 20
Without loop-deletion 9
XXX
.G2

.I
.ce
Time measurements on processing three C-programs with and without the loop-deletion algorithm
.R

.LP
The diagram shows that the loop-deletion algorithm
does not dramatically slow down the recognizing process. There is, however,
a measurable time loss of \(+-25%. As we will see later, the loop-deletion
algorithm turns out to be extremely useful in keeping memory usage down
when there are many loops in the graph.

.LP
The effect of the loop-deletion algorithm on parsing Modula-2 programs
is even smaller than with C programs; in fact there is no measurable
time loss:

.G1
coord x 0, 17000 y 0, 15
ticks bot out at 750, 5000, 15000
label bot "Number of tokens"
label left "User Time" "(sec)" left .3
draw no_loop dashed
draw loop dashed
copy thru X
	times size +2 at $1, $2
	times size +2 at $1, $3
	next no_loop at $1, $2
	next loop at $1, $3
X until "XXX"

823	.6	.6
4290	3.5	3.8
16530	14.3	14.3
XXX

copy thru X "$1 $2" size -2 at 11800, $3 X until "XXX"
With loop-deletion 13
Without loop-deletion 7
XXX
.G2

.I
.ce
Time measurements on processing three Modula-2-programs with and without a loop-deletion algorithm
.R

.LP
There are at least two reasons for this; both result from the relative
simplicity of the Modula-2 grammar. First, the distance from a head to an
end-of-stack marker is shorter than in C; secondly, Modula-2
causes fewer joins to occur than C, meaning that the loop-marking algorithm
is run less often, and when it is run it has fewer paths to search.

.NH 3
Space measurements on the effect of leaving out the loop-deletion algorithm

.LP
Clearly, to make any measurements on the space-usage effects of leaving out
the loop-deletion algorithm we need a program that causes the prediction
graph to contain loops; however, we have not been able to devise a C
or Modula-2 program that does this. In order to be able to make measurements,
we added an extra alternative to a rule of the C compiler grammar, making
it directly left-recursive. To make LLgen accept this new grammar, we
put a `%if' directive in the rule.

.LP
We have input our standard C test program consisting of 800 tokens to
the error recovery routine of this `doctored' C compiler,
and compared the storage needed for the prediction graphs with the
loop-deletion algorithm enabled with the storage needed when the
algorithm is disabled. With the loop-deletion algorithm enabled, the
maximum size of the prediction graph was 5576 bytes. When the loop-deletion
algorithm was disabled, the maximum size of the prediction graph
grew to 12676 bytes; furthermore, 12676 bytes of heap were allocated
for the prediction graph but not deallocated again, because they were
in use by graph elements that were in inaccessible loops. The user time
the program needed increased only slightly, from 0.9 to 1.0 seconds. Given the
relatively small input program, this data suggests that when loops
are actually being made, the loop-deletion algorithm is definitely
worth the extra overhead it costs, considering the space
that would otherwise be occupied by inaccessible loops. To verify this,
we input the C program consisting of 15000 tokens to the compiler;
execution time increased from 17.3 to 21.1 seconds after enabling
the loop-deletion algorithm, while the maximum size of the prediction graph
shrank from 328664 to 13664 bytes. With the loop-deletion algorithm
disabled, 326720 bytes allocated for the graph were not deallocated again.
Again, given the relatively small increase in execution time and the
large reduction in memory usage, we feel that the loop-deletion
algorithm is useful enough to justify the overhead it creates.

.NH 2
Problems encountered

.LP
In this section we describe some of the problems we encountered
while testing the non-correcting error recovery.

.NH 3
The LLgen error reporting mechanism

.LP
The parsers generated by LLgen call a user-supplied error reporting
routine, usually called LLmessage. This routine is called with an
integer parameter that is positive, zero or negative. When the parameter
is positive the parser has just inserted a token, whose
number is equal to the parameter; if it is zero, the parser
has deleted a token whose number is in a global variable called LLsymb; if
it is negative, it means that LLgen expected end-of-file, but did not
find it. The routine LLmessage is supposed to print an error message,
and when a token is inserted, it should set all necessary attributes.

.LP
However, when non-correcting error recovery is used, the situation becomes slightly
different; when the parser inserts a token, it does so only to keep the
semantic actions consistent, and the insertion no longer signifies an error.
However, the LLmessage routine still has to be called, because the
attributes of the inserted token need to be set. Therefore, when
non-correcting error recovery is used, the LLmessage routine should not
print an error message when the parameter is positive, or else it will
print highly confusing error messages indeed. Furthermore, the
LLmessage routine will usually print a message like `token ... deleted' when
it is called with parameter equal to zero; however, when the non-correcting
error recovery is used, it is more appropriate to report something
like `token ... illegal', as the non-correcting error recovery does
not delete tokens. Finally, when an unexpected end-of-file is encountered,
LLgen normally just inserts the missing tokens and calls
LLmessage with the parameter equal to the token number;
when non-correcting error recovery is used, we need a way to
actually report that we have encountered an unexpected end-of-file.
We achieved this by calling LLmessage with parameter 0 and the
global variable LLsymb set to EOFILE when this situation occurs; the
routine LLmessage should print something like `unexpected end of file'
when it is called with parameter 0 and LLsymb is EOFILE. To facilitate
switching between correcting and non-correcting error recovery, the
file Lpars.h contains a statement `#define LLNONCORR' if non-correcting
error recovery is used.

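.LP
An LLmessage routine adapted to non-correcting error recovery could
thus look roughly as follows; symbol_name and set_token_attributes
are illustrative helpers, not part of LLgen.
.br
.nf

#include <stdio.h>
#include "Lpars.h"	/* token numbers, EOFILE, LLNONCORR */

extern int LLsymb;			/* token involved in the error */
extern char *symbol_name(int tok);		/* illustrative */
extern void set_token_attributes(int tok);	/* illustrative */

void LLmessage(int tok)
{
	if (tok > 0) {
		/* Token inserted; with non-correcting recovery this only
		 * keeps the semantic actions consistent, so set the
		 * attributes but do not report an error. */
		set_token_attributes(tok);
#ifndef LLNONCORR
		fprintf(stderr, "%s missing\n", symbol_name(tok));
#endif
	} else if (tok == 0) {
#ifdef LLNONCORR
		if (LLsymb == EOFILE)
			fprintf(stderr, "unexpected end of file\n");
		else
			fprintf(stderr, "%s illegal\n", symbol_name(LLsymb));
#else
		fprintf(stderr, "%s deleted\n", symbol_name(LLsymb));
#endif
	} else
		fprintf(stderr, "end of file expected\n");
}

.fi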
.NH 3
Parsers being started in semantic actions

.LP
LLgen allows the programmer to define more than one nonterminal as a
start symbol of the input grammar; it will generate a parsing routine
for each of the start symbols. However, the error recovery code
is generated only once; it is shared by all parsers.
The programmer is free to call any
of the generated parsers whenever he wants; for instance, in the C compiler
a separate parser for expressions in #if and #elif statements is used. Whenever
the lexical analyzer encounters such a statement, it calls the expression
parser. It is also possible to call a parser in a semantic action of
another parser; in the Modula-2 compiler a separate parser for
definition modules is used. When the main parser encounters a
FROM defmod IMPORT statement, a semantic
action opens the definition module defmod and starts the parser for
definition modules.

.LP
The fact that subparsers can be started just about anywhere causes
problems when non-correcting error recovery is used.
Suppose a parser calls another parser in a semantic action
to parse a separate input file. In the Modula-2 compiler, after
seeing the FROM defmod IMPORT statement, a semantic action opens
defmod and parses it; now, if a syntax error occurred before the
FROM IMPORT statement, the non-correcting error recovery will not
execute the action that opens and parses the definition module, but
it will not report an error either, because the statement
FROM defmod IMPORT is part of the input language of the main parser.
However, suppose that during the parsing of a definition module
an error occurs; then some semantic actions that would normally
be executed during the parsing of the definition module will not have
taken place. If the main parser now resumes normal parsing
after the non-correcting error recovery has finished with the
definition module, a lot of spurious semantic errors are likely to be
reported, because the semantic actions that would normally have been
executed while parsing the definition module have not been executed
by the error recovery. Therefore, it is desirable that the main parser
does not resume normal parsing, but instead continues with the non-correcting
error recovery as well. Any syntactic errors in the main program will
still be reported, but no spurious semantic errors will appear
this way.

.LP
When the lexical analyzer calls other parsers, as is the case in
the ACK C compiler, recursive invocations of the non-correcting error
recovery routine can occur. This will happen if a parser starts the
error recovery, the error recovery calls the lexical analyzer, which
starts another parser that finds a syntax error and calls the
error recovery again. This is not really a problem, but it has
consequences for the implementation of the error recovery routine.

.LP
The worst case
occurs when two parsers are involved in parsing one input file, and
the secondary parser (e.g. an inline assembly parser) is called in a semantic
action of the main parser. Suppose now that the input text contains
a syntax error; after detecting this error, the parser starts the
non-correcting error recovery. This recovery does not execute any
semantic actions; therefore it will not start the subparser at those points
where the original LLgen-generated parser would. As a result, parts
of the program that would be accepted by the subparser will now probably
be rejected as illegal, because the error recovery does not know it
should use another grammar to check these parts. This is a serious
problem, and we have devised and implemented two ways to solve it.

.LP
The first solution is based on the assumption that whenever a semantic
action occurs in the grammar, another parser can be started at that
point. Obviously, we have no way of knowing which semantic actions start
a parser and which don't, so we assume the worst.
Now, assume that in the grammar there are $k$ symbols defined as
start symbols, say $W sub 1 , W sub 2 , ..., W sub k$. Each of these symbols
will cause LLgen to generate a parser that can be called in any
of the semantic actions of the grammar. We now introduce a new
symbol $X$, and a new grammar rule
$X -> W sub 1 X | W sub 2 X | ... | W sub k X | epsilon$.
In the grammar the error recovery algorithm uses, we insert this symbol
$X$ at all positions where there are semantic actions in the original grammar,
so a rule $A -> alpha$ { action } $beta$ becomes $A -> alpha X beta$. As a
result, at each position in a grammar rule where a semantic action
occurs, we now accept any input that would be accepted by any of the
parsers. Clearly, this solution is somewhat of a kludge, as it will
accept a lot of input that is not accepted by the original parser.
However, it is guaranteed never to give spurious error messages, because
whenever a parser would be started by the original parser, there now
is an $X$ in the grammar that produces all the strings that would be
accepted by that parser. We have implemented this solution, and found
it to be extremely slow, which of course was to be expected given the
number of semantic actions in the average grammar. Furthermore,
because each time a semantic action occurs in the grammar
a string accepted by any of the generated parsers is accepted, including
strings recognized by the currently running parser, error messages
become hard to interpret. As an example, consider the following
C program:
.br
.nf

	main()
	{
		int i, j;

		while (i < j 
           		j++;

		i = 1;
		j = 2;

	}

.fi
.LP
Clearly, there is a `)' missing in the while-statement;
however, if this program is input to the error recovery, it will complain
"} illegal": after recognizing the
expression controlling the while, the original parser starts a
semantic action, so the non-correcting recovery will accept a valid
C program at that point; after recognizing the three statements
following the while-statement as a separate program, the
recognizer expects the missing `)', but gets `}' instead.

.LP
Our second solution is based on the observation that if we knew
which semantic actions can start other parsers, we would only
have to introduce the new symbol $X$ at those places where parsers
can get started. We have therefore extended LLgen with a new directive
%substart, which is used to indicate to the parser generator that
another parser may be started. The %substart directive is followed by the
start symbols that will produce the parsers that can be called,
so %substart A, B, C; indicates that in the semantic action
following the directive the parsers produced by start symbols
A, B, and C can be started. In the grammar used by the error
recovery, a new symbol $X$ will be introduced at this point,
along with a new rule $X -> AX | BX | CX | epsilon$. Of course, this
solution can still accept input that would not have been accepted
by the original parser, for instance if a parser is started
conditionally, based on other semantic information. However, it
is a big improvement over the first solution, both in performance
and in the input it accepts.

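.LP
As an illustration, a rule whose semantic action may start the parser
generated for a start symbol `assembly' could be written as follows;
the rule itself is hypothetical, modeled on the directive syntax above.
.br
.nf

	statement:	ASM_SYMBOL
			%substart assembly;
			{
			 parse_assembly();	/* may start subparser */
			}
			;

.fi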
.NH 3
Syntactic errors being handled in semantic actions

.LP
A programmer may decide to handle certain syntactic errors
in semantic actions, for instance because he is not satisfied with
the standard error recovery. However, since the non-correcting error
recovery does not execute semantic actions, this may cause errors
to remain undetected. We encountered the following example in the ACK
Modula-2 compiler, in the grammar rule for the assignment statement:
.br
.nf

	Assignment_statement:	lvalue
				[
				 	'='
					{
					 error(":= expected");
					}

					|

					':='
				]
				expression
				;

.fi
.LP
This works well in the original LLgen; however, statements like
`j=9' are now treated not as syntactic errors, but as semantic ones.
The original LLgen-generated parser
will print the (semantic) error message, but the non-correcting recovery
will not execute the semantic action, and therefore the erroneous
input will be accepted.

.LP
To facilitate the incorporation of non-correcting error recovery in parsers
that use this kind of `trick', we extended LLgen with the %erroneous
directive. The directive indicates to the non-correcting recovery
mechanism that the token following it is not really part of the grammar.
When recognizing input, the error recovery will ignore tokens in the
grammar that have %erroneous in front of them. If, in the example above,
the '=' is replaced with %erroneous '=', the non-correcting mechanism will
report an error when it sees a statement like `j = 9'. See appendix B
for details about the implementation of the %erroneous directive.

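.LP
With the directive, the assignment rule shown above becomes:
.br
.nf

	Assignment_statement:	lvalue
				[
				 	%erroneous '='
					{
					 error(":= expected");
					}
					|
					':='
				]
				expression
				;

.fi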
.LP
Another example is in the ACK C compiler. For some reason, the
grammar accepts function definitions without `()', so according
to the syntax a function definition can look like:
.br
.nf

	int func
	{
	  ....
	}
.fi

.LP
The absence of the `()', however, causes `func' to be entered in the
symbol table as a non-function, and when the parser encounters the body,
a semantic action will complain with the error message "Making function body
for non-function". This again will cause the non-correcting error
recovery to miss errors. Consider this piece of code:
.br
.nf

int i int j = 1;
{}

.fi

.LP
where apparently there's a `;' missing between the declarations
of i and j. The original LLgen-generated parser only gives semantic errors:
.br
.nf
"Making function body for non-function"
"j is not in parameter list"
"Illegal initialization of formal parameter, ignored"
.fi
.LP
As a result, the non-correcting error recovery will not report
any errors in this piece of code, because it does not execute the
semantic actions that recognize and report the error. Unfortunately,
due to the way the C grammar is written, it is not possible to solve
this problem using a %erroneous directive; the part of the grammar
that deals with declarations would have to be rewritten so as to
syntactically reject functions without `()'.

.NH 3
Semantic actions that read input

.LP
There are no restrictions on what a semantic action can do;
there is nothing to stop the programmer from writing a parser in such
a way that some of the input to the parser is processed by semantic
actions. Obviously, because the non-correcting error recovery does not
execute semantic actions, this kind of parser will not work at all
with the new error recovery. Ironically, LLgen itself is written in
such a fashion; {}-enclosed C-code in its input is processed by
a semantic action in the LLgen grammar. We feel that it is bad
practice to write parsers this way; the `eating' of parts of
the input should be done in the lexical analyzer, not in the parser.
After all, in the case of LLgen, one can regard a semantic action
in the input as one token, and thus it should be handled by
the lexical analyzer as such.

.NH 2
Examples of error recovery

.LP
We will now give some examples that compare non-correcting error
recovery with the correcting error recovery used by parsers generated
by `standard' LLgen.

.LP
Consider the next C program, where there is a `)' missing in the
header of function `test'.
.br
.nf

	1 	int test(a,b
	2
	3	int a,b;
	4
	5	{
	6		if (a < b)
	7			return(1);
	8		else
	9			return(0);
	10	}
.fi

.LP
This small error derails the `standard' parser; it produces the
following error messages, where we have left out 7 messages reporting
semantic errors:
.br
.nf

	line 3: , missing before type_identifier
	line 3: , missing before identifier
	line 3: ) missing before ;
	line 5: { deleted
	line 6: if deleted
	line 6: < deleted
	line 6: ) missing before identifier
	line 6: ) deleted
	line 7: identifier missing before return
	line 7: ; missing before return
	line 7: { missing before return
	line 8: else deleted

.fi
.LP
In contrast, the parser using non-correcting error recovery produces
only one error message:
.br

	line 3: type_identifier illegal

This error message correctly pinpoints the error: there should
have been a `)' at the position where the type identifier `int' is.

.LP
Now, an example with Modula-2; consider this program:
.br
.nf

	1	MODULE test;
	2
	3	TYPES
	4		ElementRecordType = RECORD
	5		Element: ElementType;
	6		Next,
	7		Prior: ElementPointerType;
	8	END;
	9
	10	VARS a,b,c: ElementRecordType;
	11
	12
	13	BEGIN
	14
	15		a := b;
	16
	17	END test.

.fi
.LP
There are two syntactic errors in this program; on line 3, TYPES should be TYPE, and
on line 10, VARS should be VAR. We have left out the type declarations of
ElementType and ElementPointerType; clearly this will generate semantic
errors, but we are only interested in syntactic errors anyway.
The correcting error recovery parser
again derails on this program; it produces the following syntactic error messages:
.br
.nf

	line 3: CONST missing before identifier
	line 4: '=' missing before identifier
	line 4: RECORD deleted
	line 5: ':' deleted
	line 5: ';' missing before identifier
	line 5: '=' missing before ';'
	line 5: number missing before ';'
	line 6: ',' deleted
	line 7: '=' missing before identifier
	line 7: ':' deleted
	line 7: ';' missing before identifier
	line 7: '=' missing before ';'
	line 7: number missing before ';'
	line 8: ';' deleted
	line 10: identifier deleted
	line 10: ',' deleted
	line 10: identifier deleted
	line 10: ',' deleted
	line 10: identifier deleted
	line 10: ':' deleted
	line 10: identifier deleted
	line 10: ';' deleted
	line 13: BEGIN deleted
	line 15: identifier deleted
	line 15: := deleted
	line 15: identifier deleted
	line 15: ';' deleted
	line 17: END deleted
	line 17: identifier deleted

.fi
.LP
The error correction mechanism clearly makes the wrong guess by inserting
CONST on line 3; as a result, all that follows is rejected as incorrect.
In contrast, the non-correcting error recovery mechanism only produces
two error messages:
.br
.nf

	line 3: identifier illegal
	line 10: identifier illegal

.fi
.LP
This again exactly pinpoints the errors: the identifiers TYPES and
VARS constitute the only errors in the program. Note that the
presence of more than one error does not cause any problems for the
non-correcting recovery mechanism.

.bp
.nr PS 12
.nr VS 14

.NH
Conclusion

.nr PS 10
.nr VS 12

.LP
After implementing and testing a non-correcting error recovery mechanism,
we have come to the conclusion that it is indeed superior to correcting
mechanisms as regards the error messages it produces;
the examples we have given clearly show this. However, there is a
clear loss of performance when errors are present in a program,
although we have found this performance
degradation to be acceptable. We feel that the benefits of
better error messages outweigh the loss of performance. In any case,
correct programs do not suffer at all from the incorporation
of a non-correcting recovery mechanism.
The error recovery mechanism we implemented does not make
unreasonable demands on resources; the size of the prediction
graphs stays within reasonable limits.

.LP
The main problems we encountered had to do with recognizing
`languages within languages', and with semantic actions that did
unreasonable things like eating input. The more `well-behaved' a
parser is, the better the results the non-correcting error recovery
mechanism gives. This is also true for the input grammars: with a
language like Modula-2, whose syntax has been designed with parser
generators in mind, the performance of the non-correcting mechanism
is better than with C, whose syntax is extremely hard, if not
impossible, to describe with an LL(1) grammar.

.bp
.nr PS 12
.nr VS 14

.NH
Bibliography

.nr PS 10
.nr VS 12

.IP [CORMACK] 12
Gordon V. Cormack, `An LR substring parser for noncorrecting syntax error
recovery', ACM SIGPLAN Notices, vol. 24, no. 7, p. 161-169, July 1989

.IP [GRUNE] 12
Dick Grune, Ceriel J.H. Jacobs, `A programmer friendly LL(1) parser
generator', Softw. Pract. Exper., vol. 18, no. 1, p. 29-38, Jan 1988

.IP [RICHTER] 12
Helmut Richter, `Noncorrecting syntax error recovery', ACM Trans. Prog. Lang.
Sys., vol. 7, no. 3, p. 478-489, July 1985

.IP [ROEHRICH] 12
Johannes R\*:ohrich, `Methods for the automatic construction of error
correcting parsers', Acta Inform., vol. 13, no. 2, p. 115-139, Feb 1980

.IP [TOMITA] 12
Masaru Tomita, Efficient parsing for natural language, Kluwer Academic
Publishers, Boston, p. 210, 1986
.bp
.SH
Appendix A: Implementation Issues

.nr PS 10
.nr VS 12
.RS
.LP
In this appendix we describe some implementation issues:
the data structure used to store the grammar during non-correcting
error recovery, postponing deletions of graph elements until after
the prediction phase, and the implementation of the %substart directive.
.RE

.SH
A.1 The grammar data structure

.LP
The grammar data structure used by the non-correcting error recovery technique has
to meet two conditions: easy access to a rule as a whole, to make
substituting nonterminals efficient, and easy access to each symbol in the RHS
of a rule, to make starting error recovery and finding continuations
efficient. To fulfill these conditions we decided to organize the
storage of the grammar as follows.

.LP
A rule in the grammar is divided into two
parts: an LHS and an RHS. The LHS is represented by a struct `lhs', and
for each symbol in the RHS a struct `symbol' is constructed.
A struct `lhs' contains the number of the
nonterminal forming the LHS of the rule, a pointer to the RHS, the
FIRST and FOLLOW sets of the nonterminal, and a flag `empty' which
indicates whether the nonterminal produces empty or not. A struct
`symbol' contains a field indicating the type of the symbol, i.e.
a terminal or a nonterminal, the number of the symbol, a `link' pointer
to a struct `symbol' that represents the same symbol, a `next' pointer
to the rest of the RHS, and a pointer back to the LHS.

.LP
A special struct `symbol' is added to the end of the RHS to indicate
the end of a rule. The type of this struct is LLEORULE, the number
is set to -1, and the pointers `link' and `next' are nil.

.LP
When there is more than one RHS for an LHS, all the RHS's
are put after each other, separated by another special struct
`symbol'. The type of this struct is LLALT, the number is set to
-1, and the `link' pointer is nil. After the last RHS an
LLEORULE marker is added.

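.LP
In C, the representation described above amounts to declarations like
the following; this is a sketch, and the exact declarations in the
implementation may differ (the FIRST and FOLLOW sets, for instance,
are shown simply as bitset pointers).
.br
.nf

#define LLTERM		0	/* symbol is a terminal */
#define LLNONTERM	1	/* symbol is a nonterminal */
#define LLEORULE	2	/* marks the end of a rule */
#define LLALT		3	/* separates two RHS's of one LHS */

struct symbol {
	int type;		/* LLTERM, LLNONTERM, LLEORULE, LLALT */
	int number;		/* symbol number; -1 for the markers */
	struct symbol *link;	/* same symbol elsewhere in a RHS */
	struct symbol *next;	/* rest of this RHS */
	struct lhs *lhs;	/* back pointer to the rule's LHS */
};

struct lhs {
	int number;		/* number of the LHS nonterminal */
	struct symbol *rhs;	/* the RHS (or RHS's) of this rule */
	char *first, *follow;	/* FIRST and FOLLOW sets (bitsets) */
	int empty;		/* nonzero if nonterminal produces empty */
};

.fi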
.LP
Finally, to make searching efficient, there are two arrays: `terminals'
and `nonterminals'. `terminals' is indexed by the number of a terminal
and contains for each terminal a struct containing a `link' pointer
to a symbol, representing this terminal, in the RHS of a rule. Because
this symbol in turn has a `link' pointer to another symbol representing
the terminal, it is possible, by following this chain of pointers,
to find all rules containing a given terminal. In a similar way, `nonterminals'
is indexed by the number of a nonterminal and contains a struct for each
nonterminal. This struct not only contains a `link' pointer
linking all rules with this nonterminal, but also a `rule'
pointer. This pointer points to the RHS or RHS's of the rules of which
the nonterminal forms the LHS.

.LP
As an example, consider the following grammar:

.br
A:	a B
.br
B:	a | $epsilon$
.br

This will result in the picture below. Note that `pointer' fields
without an arrow indicate nil pointers.

.PS
dx = 0.05

down
A_a: box ht boxht/2 "link"
box invis "a" ljust with .e at A_a.w

move to A_a.s
move
move

A: box "link" "rule"
B: box "link" "rule"
line dashed from A.w to A.e
line dashed from B.w to B.e
box invis "A" ljust with .e at A.w
box invis "B" ljust with .e at B.w

move to A.ne
right
move
move
down

LHS_A: box wid 1.2 * boxwid ht 2.5 * boxht "`A'" "rhs" "first" "follow" "empty 0"
line dashed from 0.2 <LHS_A.nw, LHS_A.sw> to 0.2 <LHS_A.ne, LHS_A.se>
line dashed from 0.4 <LHS_A.nw, LHS_A.sw> to 0.4 <LHS_A.ne, LHS_A.se>
line dashed from 0.6 <LHS_A.nw, LHS_A.sw> to 0.6 <LHS_A.ne, LHS_A.se>
line dashed from 0.8 <LHS_A.nw, LHS_A.sw> to 0.8 <LHS_A.ne, LHS_A.se>

move to LHS_A.ne + (1,0)

RHS_a1: box wid 2.0 * boxwid ht 2.5 * boxht "LLTERM" "`a'" "link" "next" "lhs"
line dashed from 0.2 <RHS_a1.nw, RHS_a1.sw> to 0.2 <RHS_a1.ne, RHS_a1.se>
line dashed from 0.4 <RHS_a1.nw, RHS_a1.sw> to 0.4 <RHS_a1.ne, RHS_a1.se>
line dashed from 0.6 <RHS_a1.nw, RHS_a1.sw> to 0.6 <RHS_a1.ne, RHS_a1.se>
line dashed from 0.8 <RHS_a1.nw, RHS_a1.sw> to 0.8 <RHS_a1.ne, RHS_a1.se>

move to RHS_a1.ne + (1,0)

RHS_B: box wid 2.0 * boxwid ht 2.5 * boxht "LLNONTERM" "`B'" "link" "next" "lhs"
line dashed from 0.2 <RHS_B.nw, RHS_B.sw> to 0.2 <RHS_B.ne, RHS_B.se>
line dashed from 0.4 <RHS_B.nw, RHS_B.sw> to 0.4 <RHS_B.ne, RHS_B.se>
line dashed from 0.6 <RHS_B.nw, RHS_B.sw> to 0.6 <RHS_B.ne, RHS_B.se>
line dashed from 0.8 <RHS_B.nw, RHS_B.sw> to 0.8 <RHS_B.ne, RHS_B.se>

move to RHS_B.ne + (1,0)

RHS_END1: box wid 2.0 * boxwid ht 2.5 *boxht "LLEORULE" "-1" "link" "next" "lhs"
line dashed from 0.2 <RHS_END1.nw, RHS_END1.sw> to 0.2 <RHS_END1.ne,RHS_END1.se>
line dashed from 0.4 <RHS_END1.nw, RHS_END1.sw> to 0.4 <RHS_END1.ne,RHS_END1.se>
line dashed from 0.6 <RHS_END1.nw, RHS_END1.sw> to 0.6 <RHS_END1.ne,RHS_END1.se>
line dashed from 0.8 <RHS_END1.nw, RHS_END1.sw> to 0.8 <RHS_END1.ne,RHS_END1.se>


move to LHS_A.s - (0,1)

LHS_B: box wid 1.2 * boxwid ht 2.5 * boxht "`B'" "rhs" "first" "follow" "empty 1"
line dashed from 0.2 <LHS_B.nw, LHS_B.sw> to 0.2 <LHS_B.ne, LHS_B.se>
line dashed from 0.4 <LHS_B.nw, LHS_B.sw> to 0.4 <LHS_B.ne, LHS_B.se>
line dashed from 0.6 <LHS_B.nw, LHS_B.sw> to 0.6 <LHS_B.ne, LHS_B.se>
line dashed from 0.8 <LHS_B.nw, LHS_B.sw> to 0.8 <LHS_B.ne, LHS_B.se>

move to LHS_B.ne + (1,0)

RHS_a2: box wid 2.0 * boxwid ht 2.5 * boxht "LLTERM" "`a'" "link" "next" "lhs"
line dashed from 0.2 <RHS_a2.nw, RHS_a2.sw> to 0.2 <RHS_a2.ne, RHS_a2.se>
line dashed from 0.4 <RHS_a2.nw, RHS_a2.sw> to 0.4 <RHS_a2.ne, RHS_a2.se>
line dashed from 0.6 <RHS_a2.nw, RHS_a2.sw> to 0.6 <RHS_a2.ne, RHS_a2.se>
line dashed from 0.8 <RHS_a2.nw, RHS_a2.sw> to 0.8 <RHS_a2.ne, RHS_a2.se>

move to RHS_a2.ne + (1,0)

RHS_ALT: box wid 2.0 * boxwid ht 2.5 * boxht "LLALT" "-1" "link" "next" "lhs"
line dashed from 0.2 <RHS_ALT.nw, RHS_ALT.sw> to 0.2 <RHS_ALT.ne, RHS_ALT.se>
line dashed from 0.4 <RHS_ALT.nw, RHS_ALT.sw> to 0.4 <RHS_ALT.ne, RHS_ALT.se>
line dashed from 0.6 <RHS_ALT.nw, RHS_ALT.sw> to 0.6 <RHS_ALT.ne, RHS_ALT.se>
line dashed from 0.8 <RHS_ALT.nw, RHS_ALT.sw> to 0.8 <RHS_ALT.ne, RHS_ALT.se>

move to RHS_ALT.ne + (1,0)

RHS_END2: box wid 2.0 * boxwid ht 2.5 *boxht "LLEORULE" "-1" "link" "next" "lhs"
line dashed from 0.2 <RHS_END2.nw, RHS_END2.sw> to 0.2 <RHS_END2.ne,RHS_END2.se>
line dashed from 0.4 <RHS_END2.nw, RHS_END2.sw> to 0.4 <RHS_END2.ne,RHS_END2.se>
line dashed from 0.6 <RHS_END2.nw, RHS_END2.sw> to 0.6 <RHS_END2.ne,RHS_END2.se>
line dashed from 0.8 <RHS_END2.nw, RHS_END2.sw> to 0.8 <RHS_END2.ne,RHS_END2.se>

# Next pointers upper row
.ps 30
circle radius .01 at 0.75 <A.ne, A.se> - (dx, 0)
circle radius .01 at 0.3 <LHS_A.ne, LHS_A.se> - (dx, 0)
circle radius .01 at 0.7 <RHS_a1.ne, RHS_a1.se> - (dx, 0)
circle radius .01 at 0.7 <RHS_B.ne, RHS_B.se> - (dx, 0)
.ps 10

arrow from 0.75 <A.ne, A.se> - (dx, 0) to 0.3 <LHS_A.nw, LHS_A.sw>
arrow from 0.3 <LHS_A.ne, LHS_A.se> - (dx, 0) to 0.3 <RHS_a1.nw,RHS_a1.sw>
arrow from 0.7 <RHS_a1.ne, RHS_a1.se> - (dx, 0) to 0.7 <RHS_B.nw,RHS_B.sw>
arrow from 0.7 <RHS_B.ne, RHS_B.se> - (dx, 0) to 0.7 <RHS_END1.nw, RHS_END1.sw>


# Next pointers lower row
.ps 30
circle radius .01 at 0.75 <B.ne, B.se> - (dx, 0)
circle radius .01 at 0.3 <LHS_B.ne, LHS_B.se> - (dx, 0)
circle radius .01 at 0.7 <RHS_a2.ne, RHS_a2.se> - (dx, 0)
circle radius .01 at 0.7 <RHS_ALT.ne, RHS_ALT.se> - (dx, 0)
.ps 10

arrow from 0.75 <B.ne, B.se> - (dx, 0) to 0.3 <LHS_B.nw, LHS_B.sw>
arrow from 0.3 <LHS_B.ne, LHS_B.se> - (dx, 0) to 0.3 <RHS_a2.nw,RHS_a2.sw>
arrow from 0.7 <RHS_a2.ne, RHS_a2.se> - (dx, 0) to 0.7 <RHS_ALT.nw,RHS_ALT.sw>
arrow from 0.7 <RHS_ALT.ne, RHS_ALT.se> - (dx, 0) to 0.7 <RHS_END2.nw, RHS_END2.sw>


# Link pointers
.ps 30
circle radius .01 at 0.5 <RHS_a1.ne, RHS_a1.se> - (2*dx, 0)
circle radius .01 at 0.5 <A_a.ne, A_a.se> - (dx, 0)
circle radius .01 at 0.25 <B.ne, B.se> - (dx, 0)
.ps 10

arrow dashed from 0.5 <RHS_a1.ne, RHS_a1.se> - (2*dx, 0) to RHS_a2.ne - (2*dx,0)
line dashed from 0.5 <A_a.ne, A_a.se> - (dx, 0) right 4.0 * boxwid then to RHS_a1.ne - (2*dx, 0) ->
line dashed from 0.25 <B.ne, B.se> - (dx, 0) right then up .75 then right 7.0 * boxwid then to RHS_B.ne - (2*dx, 0) ->


# LHS pointers upper row
.ps 30
circle radius .01 at 0.9 <RHS_a1.ne, RHS_a1.se> - (3*dx, 0)
circle radius .01 at 0.9 <RHS_B.ne, RHS_B.se> - (3*dx, 0)
circle radius .01 at 0.9 <RHS_END1.ne, RHS_END1.se> - (3*dx, 0)
.ps 10

line from 0.9 <RHS_a1.ne, RHS_a1.se> - (3*dx, 0) down ->
line from 0.9 <RHS_B.ne, RHS_B.se> - (3*dx, 0) down ->
line from 0.9 <RHS_END1.ne, RHS_END1.se> - (3*dx, 0) down then left 8.0 * boxwid then to LHS_A.se ->


# LHS pointers lower row
.ps 30
circle radius .01 at 0.9 <RHS_a2.ne, RHS_a2.se> - (3*dx, 0)
circle radius .01 at 0.9 <RHS_ALT.ne, RHS_ALT.se> - (3*dx, 0)
circle radius .01 at 0.9 <RHS_END2.ne, RHS_END2.se> - (3*dx, 0)
.ps 10

line from 0.9 <RHS_a2.ne, RHS_a2.se> - (3*dx, 0) down ->
line from 0.9 <RHS_ALT.ne, RHS_ALT.se> - (3*dx, 0) down ->
line from 0.9 <RHS_END2.ne, RHS_END2.se> - (3*dx, 0) down then left 8.0 * boxwid then to LHS_B.se ->


# Text above structs
box invis ht boxht/2 "terminals" with .s at A_a.n
box invis ht boxht/2 "nonterminals" with .s at A.n
box invis ht boxht/2 "lhs" with .s at LHS_A.n
box invis ht boxht/2 "lhs" with .s at LHS_B.n
box invis ht boxht/2 "symbol" with .s at RHS_a1.n
box invis ht boxht/2 "symbol" with .s at RHS_B.n
box invis ht boxht/2 "symbol" with .s at RHS_END1.n
box invis ht boxht/2 "symbol" with .s at RHS_a2.n
box invis ht boxht/2 "symbol" with .s at RHS_ALT.n
box invis ht boxht/2 "symbol" with .s at RHS_END2.n
.PE

.LP
Note that the empty alternative for `B' is represented in the
data structure by the `LLEORULE'-struct immediately following
the `LLALT'-struct. When there are further alternatives,
the `LLEORULE'-struct is replaced by an `LLALT'-struct followed
by the other alternatives and an `LLEORULE'-struct.
Finally, when the empty rule is the only rule for a
nonterminal, the RHS consists only of an `LLEORULE'-struct.

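.LP
The following sketch shows how this encoding can be decoded, using the
hypothetical declarations sketched above: an alternative is empty
precisely when an `LLALT' or `LLEORULE' marker appears where an
alternative should begin.
.nr PS 8
.nr VS 10
.LP
.br
.nf

	/* Return 1 if one of the alternatives of l is empty. */
	int
	has_empty_alternative(struct lhs *l)
	{
		struct symbol *s = l->rhs;
		int at_start = 1;	/* at the start of an alternative */

		for (;;) {
			if (at_start && (s->type == LLALT || s->type == LLEORULE))
				return 1;	/* alternative with no symbols */
			if (s->type == LLEORULE)
				return 0;
			at_start = (s->type == LLALT);
			s = s->next;
		}
	}

.fi
.nr PS 10
.nr VS 12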
.SH
A.2 Delayed deletes

.LP
We encountered a problem with deleting elements during the
prediction phase. Imagine that we have a nonterminal `B' on top of
the graph, and `B' has two alternatives. Now suppose that we
apply the first alternative and find out that it leads
to a `dead end', i.e. a head that does not match the input symbol, so we want
to get rid of it. If we delete it immediately, the deletion algorithm
will also deallocate `[B]' and possibly some elements below `[B]'.
However, there was another alternative for `[B]' which was not yet
developed, and maybe this alternative leads to a head which is legal.
But `[B]' has already been deleted and thus cannot be used anymore. A similar
situation can occur when we want to delete a joined element;
the substitution of a nonterminal
that only produces empty, and thus has no element above it in the graph,
can also lead to such a situation. We therefore decided to put `dead ends'
on a list, `cleanup_arr[]'; after the prediction phase has
finished we delete all elements on this list, together with all their
descendants that become unreachable.
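.LP
A minimal sketch of this mechanism follows; apart from
`cleanup_arr[]' all names are assumptions, and the element type and
deletion routine merely stand in for their counterparts in the real
implementation.
.nr PS 8
.nr VS 10
.LP
.br
.nf

	struct graph_elem;			/* prediction graph element */
	extern void delete_elem(struct graph_elem *);

	#define MAXCLEAN 512			/* capacity is an assumption */
	struct graph_elem *cleanup_arr[MAXCLEAN];
	int n_cleanup;

	/* During the prediction phase: remember a dead end instead
	 * of deleting it on the spot, so elements shared with
	 * not-yet-developed alternatives stay available. */
	void
	mark_dead_end(struct graph_elem *e)
	{
		cleanup_arr[n_cleanup++] = e;
	}

	/* After the prediction phase: deleting is now safe;
	 * delete_elem() is assumed to also reclaim descendants
	 * that become unreachable. */
	void
	do_delayed_deletes(void)
	{
		int i;

		for (i = 0; i < n_cleanup; i++)
			delete_elem(cleanup_arr[i]);
		n_cleanup = 0;
	}

.fi
.nr PS 10
.nr VS 12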
.SH
A.3 Clearing flags

.LP
We implemented two different ways to clear the flags set by the prediction
phase of the algorithm; the first recursively tracks down the whole graph
following the flags, the second puts all elements visited by
the prediction phase
on a list; after the prediction phase has finished, the algorithm walks
through this list, clearing the flags of all elements on it. We took
measurements of both algorithms and found that for small programs the times
did not differ much, but large programs were processed faster by the
second algorithm. We therefore decided to use the second algorithm.

.LP
To speed up the algorithm even more, we do not deallocate the list
after a prediction phase has finished; we just set the number of
elements on the list to 0. This saves considerably on the number
of `Malloc'-calls.

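.LP
A sketch of this list-based clearing with the reuse optimization is
given below; all names, and the growth strategy of the list, are
assumptions.
.nr PS 8
.nr VS 10
.LP
.br
.nf

	#include <stdlib.h>

	struct graph_elem;
	extern void clear_flag(struct graph_elem *);

	static struct graph_elem **visited;	/* grows on demand */
	static int n_visited, max_visited;

	/* Called for every element the prediction phase flags;
	 * error handling for realloc is omitted in this sketch. */
	void
	note_visited(struct graph_elem *e)
	{
		if (n_visited == max_visited) {
			max_visited = max_visited ? 2 * max_visited : 256;
			visited = realloc(visited,
				max_visited * sizeof(*visited));
		}
		visited[n_visited++] = e;
	}

	/* After the prediction phase: clear the flags; the list
	 * itself is kept, so later phases need no new Malloc. */
	void
	clear_visited_flags(void)
	{
		int i;

		for (i = 0; i < n_visited; i++)
			clear_flag(visited[i]);
		n_visited = 0;
	}

.fi
.nr PS 10
.nr VS 12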
.SH
A.4 Implementation of the %erroneous directive

.LP
As explained in chapter 3, the user can put a %erroneous directive
in front of a terminal, making the non-correcting error recovery
mechanism ignore that terminal. However, implementing this directive
was not entirely straightforward; consider, for example, the rule
.br
.nf

	A:	'a' | %erroneous 'b' | 'c';

.fi
.LP
Just leaving out terminal 'b' will not do, because then nonterminal
A would suddenly produce empty, which it did not before.
The rule should become
.br
.nf

	A:	'a' | 'c';

.fi
but this is hard to implement in LLgen. We took a different approach:
we introduce a new terminal 'ERRONEOUS' and substitute it for all
terminals with a %erroneous directive in front of them. Thus, the
example rule becomes
.br
.nf

	A:	'a' | ERRONEOUS | 'c';

.fi
.LP
Since the terminal ERRONEOUS will never be in the input to the parser,
this has exactly the desired effect; when a prediction phase produces
ERRONEOUS as the head of a prediction graph, this head will never match the
input. In particular, it will not match the terminal that was
originally there (in this case 'b'), so that terminal is no longer
regarded as part of the input language at that point.
.bp
.SH
Appendix B: Using the non-correcting error recovery

.LP
To use the new non-correcting error recovery mechanism, LLgen has to
be called with the new flag -n. LLgen will then create an extra file
called `Lncor.c' which contains the code for the non-correcting recovery
mechanism. This file has to be compiled and linked with the rest
of the program, just like the file `Lpars.c'.

.LP
The user-supplied error reporting routine `LLmessage' will have to be
modified slightly; when it is called with a positive parameter, it
should only set the attributes of the inserted token, but not report an
error. Note that the lexical analyzer still must return the same token
as it did the last time it was called. When LLmessage is called with
parameter 0, it should report that the token in the global variable LLsymb
is illegal; if the value of LLsymb is `EOFILE', the routine should
report an unexpected end-of-file. When LLmessage is called with parameter
-1, it should report that end-of-file was expected. To facilitate
switching between correcting and non-correcting error recovery,
the file Lpars.h contains a statement `#define LLNONCORR'
which indicates that the non-correcting
mechanism is enabled.
Here is a
skeleton for the modified LLmessage routine:
.nr PS 8
.nr VS 10
.LP
.br
.nf

	#include "Lpars.h"
	extern int LLsymb;

	LLmessage(flag)
	int flag;
	{
		if (flag < 0)
		{
			/* Error message "end-of-file expected" */;
		}
		else if (flag)
		{
			/* flag equals the number of the inserted token */
#ifndef LLNONCORR

			/* Error message "token inserted" */;
#endif

			/* Code to set attributes for inserted token */
			/* Code to make lexical analyzer return same token as before */
		}
		else
		{
			/* The number of the illegal or deleted token is in LLsymb */
#ifndef LLNONCORR

			/* Error message "token deleted" */;
#else

			if (LLsymb == EOFILE)
			{
				/* Error message "unexpected end of file" */
			}
			else
			{
				/* Error message "token illegal" */;
			}
#endif
		}
	}

.fi
.nr PS 10
.nr VS 12

.LP
For best results, one should check whether the parser calls other parsers
in semantic actions; if this is the case, and the called parser
processes the same input file as the calling parser, then a %substart
directive should be put in front of the semantic action that starts a parser.
If a semantic action calls parsers defined by startsymbols, say
A and B, then `%substart A, B;' should be put in front of the action.
As an alternative, one can use the -s flag of LLgen; this has the
same effect as putting `%substart X, Y, ....;' in front of all
semantic actions, where X, Y, .... are the startsymbols of the grammar.
Clearly, it is preferable to analyze the grammar and put %substart
directives only where appropriate.
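.LP
For example, a rule whose action runs the parser for startsymbol B on
the same input could look as follows. This is a hypothetical fragment:
the rule itself and the parser name `Bparse' are assumptions, with
`Bparse' standing for whatever routine the grammar's %start directive
associates with B.
.nr PS 8
.nr VS 10
.LP
.br
.nf

	S:	'a'
		%substart B;
		{ Bparse();	/* parser for startsymbol B */ }
		'b'
		;

.fi
.nr PS 10
.nr VS 12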
.LP
Finally, beware of syntactic errors being handled in semantic
actions; e.g., one could have a rule like
.nr PS 8
.nr VS 10
.LP
.br
.nf

        Assignment_statement:   lvalue
                                [
                                        '='
                                        {
                                         error(":= expected");
                                        }

                                        |

                                        ':='
                                ]
                                expression
                                ;
.fi

.nr PS 10
.nr VS 12
.LP
To ensure that the non-correcting mechanism will recognize the
`=' as a syntactic error, a `%erroneous' directive should be
put in front of it.