diff --git a/ipl/packs/ibpag2/README b/ipl/packs/ibpag2/README
new file mode 100644
index 0000000..c2f5d82
--- /dev/null
+++ b/ipl/packs/ibpag2/README
@@ -0,0 +1,1093 @@
+
+
+
+
+
+
+ A User's Manual for Ibpag2
+ (Icon-Based Parser Generation System 2)
+ Version 1.2
+
+ - or -
+
+ How to Use an LR-based Parser Generator
+
+
+ Richard L. Goerwitz, III
+ University of Chicago
+
+
+
+
+
+
+1.__What_is_Ibpag2?
+
+ Ibpag2 is a so-called "parser generator," i.e. a tool for
+automating the process of generating a recognizer and/or parser from
+abstract structural descriptions of an input language. Put in more
+practical terms, Ibpag2 is a piece of software that a) reads a source
+file containing a grammar that defines an input language, and then b)
+outputs an automaton that recognizes that language. The user may, at
+his or her option, specify actions this automaton should take when it
+sees various substructures within its input language. By default,
+however, the parser simply recognizes a given sequence as belonging,
+or not, to that language.
+
+ Ibpag2 utilizes so-called "LR" table generation and parsing
+algorithms. These algorithms facilitate construction of reasonably
+fast deterministic pushdown automata that are powerful enough to
+handle most commonly used programming language constructs. LR-based
+systems come in three main flavors: SLR(1), LALR(1), and LR(1). The
+LR(1) flavor is fairly easy to implement, but uses too many resources
+to be practical. LALR(1) algorithms are harder to implement, but much
+faster, and the parse tables they construct use considerably less
+memory than do those of their LR(1) counterparts. SLR(1) algorithms
+are the easiest to implement, compile the fastest, and use about as
+much memory as LALR(1)s. SLR(1) is the least powerful of the three,
+though, so there is a tradeoff. Ibpag2 is an "enhanced" SLR(1) parser
+generator. It is enhanced in the sense that it can operate both in
+its native SLR(1) mode, and in a more powerful "quasi-GLR" mode (on
+which, see section 5 below).
+
+ As its full title ("Icon-Based Parser Generator 2") implies,
+Ibpag2 is written in Icon [2,3], as are the automata it creates.
+Ibpag2 has been tested with Icon version 8.10. So far I have only run
+it on an i386 box running Xenix 2.3.3, and on a Sun 4 running some
+version of SunOS. I have many reports, though, of it running under
+other UNIX variants. It will probably also run under other operating
+systems, though modifications will in some instances be required.
+Using Ibpag2 under MS-DOS may not be possible, on account of the way
+it manages memory.
+
+ The Ibpag2 distribution adheres to de facto UNIX installation
+standards: Just set the appropriate variables in the makefile, and
+then "make install." For those who are using a non-UNIX system, or
+who have not installed such a package before, there is a section at
+the end entitled "Installing Ibpag2" that details the installation
+procedure (section 6).
+
+ Aside from the above-mentioned installation section (6), the
+remainder of this document aims to provide the reader a) with a
+simple, practical explanation of what LR-family parser generators are
+and how they work (section 2), and b) with a set of directions
+specifically on how to use Ibpag2 (section 3). There is also an
+advanced section on debugging (4), and one on using Ibpag2 with non-LR
+and/or ambiguous languages (5). The discussion is geared for those
+that have little or no experience in parsing or automaton theory. For
+very advanced reading, consult the bibliography. For a brief summary
+of Ibpag's command-line options, see the main Ibpag2 source file,
+ibpag2.icn, or invoke ibpag2 with the -h (help) option.
+
+ In general, be warned that Ibpag2 works best with small or
+medium-sized grammars. Its parse tables have to be reconstructed at
+run-time, and the code for doing this can become a bit cumbersome for
+grammars with more than 100 rules and fifty or so terminal symbols. I
+myself have processed grammars with as many as 300 terminals and 400
+rules. Although the resulting automata run well enough, the output
+files are over 300k, and Ibpag2 takes a long time to create them. If
+you must use Ibpag2 with a very large grammar, try the -c
+command-line option (which produces compressed parse tables). This
+option is discussed below, in section 4. Compiling (rather than
+interpreting) Ibpag2 may result in much faster processing, as will
+resetting your BLOCKSIZE and STRSIZE environment variables. See the
+installation section (6) below on using the Icon compiler to create
+the Ibpag2 executable. Good starting values for BLOCKSIZE and STRSIZE
+are triple their default values (i.e. 3 x 65000). These variables are
+discussed in the Icon manual page.
+
+ My ultimate aim in writing this document has been to make
+accessible to the non-CS portion of the Icon community what for them
+might seem an inaccessible branch of applied parsing and automaton
+theory. I am a philologist myself, and feel that there is a great
+deal that can and ought to be done to make advanced tools accessible
+to people with other interests than twiddling bits or pondering the
+true meaning of epsilon closures :-).
+
+ Any comments on the Ibpag2 system itself or its documentation
+will be gratefully received. Write to me at the address appended to
+the installation section (6).
+
+
+2.__What_is_an_LR_Parser_Generator?
+
+ Back in the late 50s and 60s, linguists, mathematicians, and
+software engineers all became intensely interested in the formal
+properties of languages: Can they be described as a series of logical
+structures and relations? Can computers recognize and manipulate
+these structures efficiently? Linguists, in particular, quickly
+realized that the amount of structural complexity, ambiguity, and pure
+noise in natural language would render it computationally intractable,
+especially given the limited memory/throughput of then available CPUs.
+Mathematicians and engineers, however, found that many of the
+formalized notations they dealt with could, in fact, be (re)designed
+in such a way that efficient computer processing could - at least in
+principle - be achieved.
+
+ Principle, in this case, did not squarely meet reality until
+viable parser generation tools came into being. Parser generation
+tools map an abstract structural description of a formal notation or
+"language" to working computer code. Ideally, the designer simply
+makes assertions like:
+
+ an expression is composed of either
+ 1) a term (e.g. 10), or
+ 2) an expression, a "+" or "-", and another expression
+
+Parser generator systems translate these assertions (the "grammar")
+into a machine, i.e. automaton, that can recognize and/or manipulate
+input streams that conform to the "language" so described.
+
+ Let me dwell, for a moment, on the toy expression grammar
+offered above. Note that it describes a set of simple mathematical
+constructs like:
+
+ 9
+ 9 + 3
+ 9 + 3 - 8
+
+According to the specifications given above, the nine, three, and
+eight alone constitute terms - which are also expressions (via rule
+1). Because these terms are also expressions, "9 + 3" can be reduced
+to a larger expression by rule 2. The same is true for "9 + 3 - 8,"
+except that here rule 2 must apply twice - once for "9 + 3," and then
+again for that and the remainder of the line - in effect grouping the
+expressions as ( ( (9) + (3) ) - (8) ). It is also possible to group
+the expression ( (9) + ( (3) - (8) ) ), although for the discussion
+that immediately follows this second grouping will be ignored (see
+below on the terms "precedence" and "associativity").
+
+ If we add actions to the above grammar specification, we can
+create a calculator-like automaton. Traditionally, LR-family automata
+(like the ones Ibpag2 creates) contain a parser, one or more stacks,
+and a set of action tables. The parser reads from an input stream
+segmented into "tokens" (e.g. TERM, '+', '-'), and then manipulates
+its stacks according to directives contained in so-called "action" and
+"goto" tables. As it reads the input stream, the parser matches rules
+with action code specified by the programmer, e.g. rule 2 above might
+be matched with code that added/subtracted the expressions on either
+side of the '+'/'-' operator, and produced (in calculator style) the
+result. Alternatively, it might be matched with code that generated
+an equivalent construct in another language.
+
+ In the case of our toy expression grammar above, the
+corresponding LR automaton operates as follows. Omitting and/or
+simplifying some of the inner details, it first looks at the input
+stream to see what the next token is. If the next token is an
+operator or end-of-input, it checks the top of its stack. If the top
+of the stack has a term on it, that term is popped off, and pushed
+back on, this time renamed as an expression (rule 1 above). The input
+token is then shifted from the input stream onto the stack, unless it
+is the end-of-input token, in which case the parser returns with a
+result. If the top of the stack has an expression on it (rather than
+a term), the parser pops the top three elements off of the stack, and
+then either subtracts the third element from the first or adds the two
+together, depending on whether the second element down was the
+addition or subtraction operator, and the result is pushed onto the
+stack as yet another expression.
+
+ Even in this much-simplified form, the automaton's structure
+is complex. Let us look briefly, therefore, at a practical example of
+its actual workings.  If we were to feed it "9 + 3 - 8," our
+calculator would take the following actions:
+
+ 1) read the 9, and push it onto the stack as a term
+ 2) see a plus sign on the input stream
+ 3) pop the term (9) off of the stack and push it back on again
+ (this time calling it an expression)
+ 4) push the plus sign onto the stack
+ 5) read the 3, and push it onto the stack as a term
+ 6) see a minus sign on the input stream
+ 7) pop the 3 off of the stack and push it back on again (this
+ time calling it an expression)
+ 8) see a minus sign still waiting on the input stream
+ 9) pop 9, +, and 3 off of the stack, apply the plus operator
+      to 9 and 3, then push the result onto the stack again as a
+ single expression (the stack now has 12 on top)
+ 10) read the minus sign, and push it onto the stack
+ 11) read the 8, and push it onto the stack as a term
+ 12) see the end of input coming up on the input stream
+ 13) pop the 8 off of the stack and push it back on again as an
+ expression
+ 14) see the end-of-input token still sitting on the input
+ stream
+ 15) pop 12, -, and 8 off of the stack, apply the minus operator
+ to 12 and 8, then push the result onto the stack again (the
+ stack now has 4 on top)
+ 16) return the "answer" (i.e. 4)
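+
+       For readers who think best in code, here is a minimal,
+hand-coded Icon sketch of the loop just traced.  It is purely
+illustrative: the names (toy_parse and so forth) are mine, the
+operators arrive as hand-made strings rather than real tokens, and it
+hard-wires the left-to-right grouping described above.  The iiparse()
+routines that Ibpag2 itself generates are table-driven and far more
+general:
+
+    procedure toy_parse(tokens)
+        local stack, tok, right, op, left
+        stack := []                      # parse stack; the top is the end
+        every tok := !tokens do {
+            if type(tok) == "integer" then
+                put(stack, tok)          # shift a term
+            else {
+                # an operator (or the "$" end marker) is waiting; if a
+                # whole "expr op expr" sits on the stack, reduce it first
+                if *stack >= 3 then {
+                    right := pull(stack)
+                    op := pull(stack)
+                    left := pull(stack)
+                    if op == "+" then
+                        put(stack, left + right)
+                    else
+                        put(stack, left - right)
+                }
+                if tok ~=== "$" then
+                    put(stack, tok)      # then shift the operator
+            }
+        }
+        return pull(stack)               # value of the final expression
+    end
+
+    procedure main()
+        # "9 + 3 - 8", tokenized by hand; "$" marks end of input
+        write(toy_parse([9, "+", 3, "-", 8, "$"]))   # writes 4
+    end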
+
+ This series of actions is hard to describe, and even more so
+to model as part of a hand-written computer program. And, even if
+such a program were written by hand, this program would have to be
+modified, at times radically, every time the grammar it assumes was
+augmented or changed. What I am leading up to is that, with a parser
+generator, the hand compilation stage can be eliminated by allowing
+the programmer simply to declare his/her tokens and language specs,
+then have the appropriate automaton constructed with little, or no,
+human intervention. This is why parser generation tools were critical
+to the development of not just theoretically feasible, but truly
+*practical*, LR-based computer language design systems.
+
+
+3.__Using_Ibpag2
+
+ To recode the above toy expression grammar in
+Ibpag2-compatible format is relatively simple, especially if we omit
+the actions initially, and concentrate on simple recognition. We need
+only a set of token declarations and three rules. Certain
+modifications will have to be made to the token declarations later on.
+For general illustration's sake, however, the following will suffice:
+
+ %token TERM, '+', '-'
+ %%
+ expression : TERM
+ expression : expression, '+', expression
+ expression : expression, '-', expression
+
+TERM, and the addition and subtraction operators, are the tokens (i.e.
+the terminal symbols out of which the grammar is constructed - the
+things that the input stream is segmented into). Note the %token
+keyword used to declare them. The colon means "is composed of." The
+double percent sign separates token declarations from the grammar
+proper.
+
+ Adding in our actions - which above were keyed to a complex
+set of decisions based on input tokens and stack conditions - requires
+just a few extra lines of Ibpag2 action code, set off in curly braces:
+
+ %token TERM, '+', '-'
+ %%
+ expression : TERM { return arg1 }
+ expression : expression, '+', expression { return arg1 + arg3 }
+ expression : expression, '-', expression { return arg1 - arg3 }
+
+Using a "|" shorthand for repeated left-hand sides of rules, we may
+reformat this as:
+
+ %token TERM, '+', '-'
+ %%
+ expression : TERM { return arg1 }
+ | expression, '+', expression { return arg1 + arg3 }
+ | expression, '-', expression { return arg1 - arg3 }
+
+ ArgX above refers to the Xth element of the right-hand side of
+the preceding rule. So, for example, arg1 in "{ return arg1 }" above
+refers to TERM - the only right-hand side element of the first rule.
+The action "{ return arg1 }" means, "once you find a TERM and have
+renamed it as an expression, use the value of TERM as the value for
+that expression." By way of contrast, the action "{ return arg1 +
+arg3 }" means, in conjunction with the rule it follows: "When you find
+an expression consisting of a sub-expression, a plus operator, and
+another sub-expression, use the value of sub-expression 1 + the value
+of sub-expression 2 as the value for the expression as a whole."
+Technically, the action "{ return arg1 }" for expression : TERM is not
+necessary, since the Ibpag2 parser, by default, pushes the value of
+the last RHS arg onto the stack. For epsilon productions (to be
+discussed below), it pushes &null.
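+In other words, the first rule could just as well be written with no
+action field at all:
+
+      expression : TERM
+
+and the parser would behave identically.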
+
+ One serious problem with this set of specifications is that
+the operators '-' and '+' are left associative. We humans take this
+for granted, because correct algebraic grouping is something our
+high-school math teachers burned into us. The computer, though, has
+to be told, pedantically, how to group addition and subtraction
+expressions. It has to be explicitly instructed, in other words, to
+group expressions like "9 + 32 - 4" as (9 + 32) - 4. Without
+instructions of this kind, the parser does not know, after it has read
+"9 + 32" and is looking at a minus sign, whether to shift the minus
+sign onto the stack, and eventually try to group as 9 + (32 - 4), or
+to reduce "9 + 32" to an expression and group as (9 + 32) - 4.
+Although in this case the grouping may not seem to matter, it
+sometimes does.  Some operators group right to left.  The unary minus
+sign, for example, is one such operator (--4 groups as (- (- 4))).
+Associativity is declared in Ibpag2 with %left and %right (in place of
+a plain %token declaration).  To include the unary minus sign in our
+grammar, we might append yet another rule, declaring associativities
+as we go:
+
+ %token TERM
+ %left '+', '-'
+ %right '-'
+ %%
+ expression : TERM { return arg1 }
+ | expression, '+', expression { return arg1 + arg3 }
+ | expression, '-', expression { return arg1 - arg3 }
+ | '-', expression { return - arg2 }
+
+The trouble with this arrangement is that the minus sign was already
+declared as left associative. To get around the conflict we use a
+"dummy" token declaration, and a %prec declaration in the applicable
+rule:
+
+ %token TERM
+ %left '+', '-'
+ %right UMINUS
+ %%
+ expression : TERM { return arg1 }
+ | expression, '+', expression { return arg1 + arg3 }
+ | expression, '-', expression { return arg1 - arg3 }
+ | '-', expression %prec UMINUS { return - arg2 }
+
+The %prec declaration simply tells the parser that, even though the
+rule contains a '-' operator, the rule should be handled as if the
+operator were UMINUS. UMINUS is not actually used as a symbol in the
+right-hand side of any rule (hence the designation "dummy").  It is
+there simply to make the minus sign in the last rule behave as if it
+were a different operator from the one in the second-to-last rule.
+
+ Let us now add in multiplication and division operators to our
+calculator specifications, and see what happens. Let me reiterate
+here that the action "{ return arg1 }" for rule 1 (expression : TERM)
+is not strictly necessary, since the default is to push the last RHS
+arg onto the value stack:
+
+ %token TERM
+ %left '+', '-'
+ %left '*', '/'
+ %right UMINUS
+ %%
+ expression : TERM { return arg1 }
+ | expression, '+', expression { return arg1 + arg3 }
+ | expression, '-', expression { return arg1 - arg3 }
+ | expression, '*', expression { return arg1 * arg3 }
+ | expression, '/', expression { return arg1 / arg3 }
+ | '-', expression %prec UMINUS { return - arg2 }
+
+Note that the multiplication and division operators were defined
+*after* the addition and subtraction operators. The reason for this
+is that, technically speaking, the grammar itself is ambiguous. If we
+treat all operators identically, the parser will not be able to tell
+whether "9 + 1 * 3" should be parsed as (9 + 1) * 3 or as 9 + (1 * 3).
+As we all know from our high-school algebra, multiplication has a
+higher precedence than addition. You do the multiplications before
+the additions, in other words, no matter where they occur. To tell
+the parser to behave in this same manner, we declare '*' after '+'.
+Note that, despite their higher priority, the '*' and '/' operators
+are still left associative. Hence, given "3 / 4 * 7," the parser will
+group its input as (3 / 4) * 7. As a brain teaser, try to figure out
+how the parser might group the input "9 + 3 / 4 * 7." Remember that
+higher-precedence rules get done first, but that same-precedence rules
+get done according to associativity.
+
+ The only fundamental problem remaining with the above grammar
+is that it assumes that the end of the input coincides with the end of
+the line. Is it possible to redefine the language described as
+consisting of arbitrarily many lines?  The answer to this question is
+"yes." One can simply add another set of productions to the grammar
+that state, essentially, that the input language consists of lines
+made up of an expression and a carriage return or of nothing. Nothing
+is indicated by the keyword epsilon. Note that only the first rule
+has an action field:
+
+ lines : lines, expression, '\n' { write(arg2) }
+ | lines, '\n'
+ | epsilon
+
+This rule-series may seem rather abstruse, but it becomes a bit
+clearer when you think about what happens on actual input. If there
+is no input (epsilon), nothing gets printed, because lines : epsilon
+has no action field. If the parser sees an expression and a newline,
+the parser takes this as an instance of epsilon, plus an expression,
+plus a newline. This, then, becomes the first component of rule 1 if
+another expression + newline follows, or of rule two if just a newline
+occurs. Every time an instance of rule 1 occurs, the action "{
+write(arg2) }" is executed, i.e. the value of the expression gets
+printed.  If this still seems hard to fathom, try walking through it
+step-by-step. Even experienced hands may find these sorts of rules
+difficult to construct and debug.
+
+ Note that "lines" is now the so-called "start symbol" of our
+grammar. It is, in other words, the goal of every parse. By default
+the left-hand side symbol of the first rule is the start symbol. This
+may be overridden with a %start declaration in the tokens section (on
+which, see the sample Ibpag2 input file below).
+
+ With our new, multi-line start symbol in place, the only piece
+that needs to be added, in order to make our calculator specification
+a full working input to Ibpag2, is a tokenizer. A tokenizer is a
+routine that reads input from a file or from some other stream (e.g.
+the user's console), and then segments this input into tokens that its
+parser can understand. In some cases, the tokens must be accompanied
+by a literal value. For example, if we encounter a TERM, we return
+TERM, just as it is listed in the %token declaration. But what is the
+literal value of a TERM token? It could be, for example, 9, or 5, or
+700. The tokenizer returns the symbol TERM, in this case, but then
+records that TERM's actual value by setting some global variable. In
+Ibpag2's parser, this variable is assumed to be "iilval." In the
+tokenizer, therefore, one might write
+
+ iilval := (literal value)
+ suspend TERM
+
+For literal operators like '+' and '*', there is no need to set
+iilval, since their literal value is irrelevant. One simply returns
+these as integers (usually via "suspend ord(c)").
+
+ The tokenizer routine is normally appended to the grammar
+after another double percent sign. Everything after this second
+double percent sign is copied literally to the output file.
+Alternatively, the tokenizer can be $included via Icon's preprocessor.
+Ibpag2 demands that the tokenizer be called iilex, that it take a
+single file argument, that it be a generator, and that it fail when it
+reaches end-of-input. Combined with our "lines" productions, the
+addition of an iilex routine to our calculator grammar yields the
+following Ibpag2 input file:
+
+ %token TERM
+ %left '+', '-'
+ %left '*', '/'
+ %right UMINUS
+
+ %start lines
+
+ %%
+
+ expression : TERM { return arg1 }
+ | expression, '+', expression { return arg1 + arg3 }
+ | expression, '-', expression { return arg1 - arg3 }
+ | expression, '*', expression { return arg1 * arg3 }
+ | expression, '/', expression { return arg1 / arg3 }
+ | '-', expression %prec UMINUS { return - arg2 }
+
+ lines : lines, expression, '\n' { write(arg2) }
+ | lines, '\n'
+ | epsilon
+
+ %%
+
+ procedure iilex(infile)
+
+ local nextchar, c, num
+
+        # a co-expression that produces the input one character at a
+        # time, with newlines tacked back onto the end of each line
+        nextchar := create !(!infile || "\n" || "\n")
+ c := @nextchar | fail
+
+ repeat {
+ if any(&digits, c) then {
+ if not (\num ||:= c) then
+ num := c
+ } else {
+ if iilval := \num then {
+ suspend TERM
+ num := &null
+ }
+ if any('+-*/()\n', c) then {
+ iilval := c
+ suspend ord(c)
+ } else {
+ if not any(' \t', c) then {
+ # deliberate error - will be handled later
+ suspend &null
+ }
+ }
+ }
+ c := @nextchar | break
+ }
+    # at end of input, flush any number still being collected
+    if iilval := \num then {
+ return TERM
+ num := &null
+ }
+
+ end
+
+ procedure main()
+ return iiparse(&input, 1)
+ end
+
+As noted above, the tokenizer (iilex) must be a generator. It must
+suspend integers either directly (e.g. ord(c)), or else via symbolic
+defines like TERM, created by Ibpag2 on the basis of %token, %right,
+%left, and %nonassoc declarations. The tokenizer must fail on end of
+input.
+
+ If you like, cut the above code out, place it in a temporary
+file, tmp.ibp, and then feed this file to Ibpag2 by typing "ibpag2 -f
+tmp.ibp -o tmp.icn." If your system supports input and output
+redirection, type: "ibpag2 < tmp.ibp > tmp.icn." Ibpag2 will turn
+your grammar specifications and actions into a routine called iiparse.
+If you look above, you will see that I appended a main procedure that,
+in fact, calls iiparse(). Iiparse() takes two arguments: 1) an input
+stream, and 2) a switch that, if nonnull, tells the parser to fail
+rather than abort on unrecoverable errors. When Ibpag2 is finished
+creating its output file (tmp.icn above), compile that file the way
+you would compile any other Icon program (e.g. "icont tmp"). Finally,
+run the executable. You should be able to type in various simple
+arithmetic expressions and have the program spit back answers each
+time you hit a return. The only problem you might encounter is that
+the parser aborts on erroneous input.
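+
+       Note, incidentally, what the second argument to iiparse() is
+doing in the main() procedure above: because it is nonnull, an error
+the parser cannot recover from makes iiparse() fail, rather than
+aborting the whole program, and main() simply passes that failure
+along.  A main() that reports the problem itself might look like the
+following sketch:
+
+    procedure main()
+        # only a sketch: with a nonnull second argument, iiparse()
+        # fails on unrecoverable errors, so we can test for that
+        if iiparse(&input, 1) then
+            return
+        else {
+            write(&errout, "input could not be parsed")
+            exit(1)
+        }
+    end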
+
+ The issue of erroneous input brings up yet another point of
+general Ibpag2 usage. Normally, if one is processing input, one does
+not want to abort on errors, but rather to emit an error message and
+continue processing - if this is at all possible.  To do this,
+Ibpag2 provides a simple but fairly effective mechanism: A reserved
+"error" token.
+
+ When Ibpag2 encounters an error, it will remove symbols from
+its stack until it has backtracked to a point where the error token is
+legal. It then shifts the error token onto the stack, and tries to
+re-start the token stream at the point where it left off, discarding
+tokens if necessary in order to get itself resynchronized. The parser
+considers itself resynchronized when it has successfully read and
+shifted three tokens after shifting the error token. Until then it
+remains in an error state, and will not output additional error
+messages as it discards tokens.
+
+ This explanation may sound a bit abstruse, but in practice it
+turns out to be quite simple.  To implement error handling for our
+calculator, we really have to add only one production to the end of
+the "lines" section:
+
+ lines : lines, expression, '\n' { write(arg2) }
+ | lines, '\n'
+ | epsilon
+ | error, '\n' {
+ write("syntax error; try again:")
+ iierrok
+ }
+
+Given the above grammar, the parser will handle errors as follows: If
+an error occurs (say it has an expression then an operator on its
+stack and sees a newline on the input stream) the parser will throw
+out the operator, then check if the error token would be OK in this
+state (which it would not). Then it would throw out the expression.
+At this point, the stack is in the ready-to-read-a-lines state - the
+state it was in before it read the last expression. Since "lines" may
+consist of error and '\n,' the error token is legal here, and so the
+parser pushes error onto the stack, then looks back at the input
+stream (where a newline is still waiting). Since the newline now
+completes the rule lines : error, '\n', the parser pushes the newline
+onto its stack, then executes the action associated with this
+production, i.e. it writes "syntax error; try again:" to the console,
+prompting the user for additional input.
+
+ The keyword "iierrok" in the above error production's action
+field is there for a subtle, but important, reason: It tells the
+parser to consider itself resynchronized, even if three tokens have
+not yet been shifted. If iierrok were not in the action code for this
+rule, and the user were to supply more bad input after the prompt,
+then the parser would simply discard those tokens, without emitting
+another error message. Why? Because, as you will recall, the parser
+discards tokens after an error, in efforts to resynchronize itself.
+Until it reads and shifts three tokens successfully, it considers
+itself in an error state, and will not emit additional error messages.
+The three-token resync rule is there to prevent a cascade of
+irrelevant error messages touched off by a single error. In our
+calculator's case above, though, we are smarter than the parser. We
+know that it is resynchronized as soon as it reduces error, '\n' to
+lines. So if a syntax error occurs on the next token, it should be
+reported.  Adding "iierrok" to the action ensures that the parser will
+do just this.
+
+ In addition to iierrok, there are several other directives
+Ibpag2 accepts as part of the action code segments. These are as
+follows:
+
+ iiclearin clear the current input token
+ IIERROR perform error recovery
+ IIACCEPT simulate an accept action
+
+There are several other directives (all implemented as macros) that
+Ibpag2 accepts in GLR mode. For a discussion of GLR mode, see below,
+section 5. IIERROR in particular, and error recovery in general, work
+a bit differently in that mode than they do in Ibpag2's normal (i.e.
+LR) mode.
+
+ There are admittedly many other topics that might be covered
+here. This treatment, however, is intended as a general nontechnical
+introduction, and not as a complete textbook on parser generation use.
+If you want to learn more about this topic, consult the bibliography.
+Also, check the UNIX manual pages on the YACC utility (Yet Another
+Compiler Compiler). Ibpag's input format is fairly close (too close,
+perhaps) to YACC's. In fact, most of what is said about YACC in UNIX
+documentation can be carried directly over to Ibpag2. Several salient
+differences, though, should be kept in mind:
+
+ 1) YACC's "$$ = x" constructs are replaced by "return x" (e.g.
+ "$$ = $1 + $3" -> "return $1 + $3" [$1 is a synonym for
+ "arg1", $3 for "arg3", etc.])
+
+ 2) all variables within a given action are, by default, local
+ to that action; i.e. they cannot be accessed by other
+ actions unless you declare them global elsewhere (e.g. in
+ the pass-through part of the declarations section %{ ...
+ %})
+
+ 3) the %union and %type declarations/tags are not needed by
+ Ibpag2 (both for better and for worse)
+
+ 4) tokens and symbols are separated from each other by a comma
+ in Ibpag2 files (e.g. %token '+', '-' and S : NP, VP)
+
+ 5) epsilon is indicated by the keyword "epsilon" (e.g. REL :
+ epsilon), and not by an empty RHS
+
+ 6) both epsilon and error *may* be declared as %tokens for
+ reasons of precedence, although they retain hard-coded
+ internal values (-2 and -1, respectively)
+
+ 7) all actions must follow the last RHS symbol of the rule
+ they apply to (preceded by an optional %prec directive); to
+ achieve S : NP { action1 }, VP { action2 }, insert a dummy
+ rule: S : NP, dummy, VP { action2 }; dummy : epsilon {
+ action1 } ;
+
+ 8) YYERROR, YYACCEPT, yyclearin, and yyerrok are the same,
+ except they are written IIERROR, IIACCEPT, iiclearin, and
+ iierrok (i.e. "ii" replaces "yy")
+
+ 9) Ibpag2's input files are tokenized as modified Icon files,
+ and, as a consequence, Icon's reserved words must not be
+ used as symbols (e.g. "if : if, then" is no go)
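+
+       To make differences 1 and 4 concrete, here is one and the same
+rule written both ways (a sketch only; "expr" is just an illustrative
+nonterminal):
+
+      /* YACC */
+      expr : expr '+' expr   { $$ = $1 + $3; } ;
+
+      # Ibpag2
+      expr : expr, '+', expr { return arg1 + arg3 }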
+
+I myself find YACC to be ugly. As a result, Ibpag2 is not an exact
+YACC clone. I would like to underscore the fact that I have no
+intention to move in this direction, either. It's as YACC-like as
+it's going to get!
+
+ Both YACC and non-YACC users should note number 9 in the above
+list. Don't use things like "while," "every," "do," etc. as symbols
+in your grammar! Just use the same rules for Ibpag2 nonterminals as
+for Icon variables, and you'll be OK.
+
+ For those that just can't bear using anything but a strictly
+YACC-conformant system, I've included a preprocessor with the Ibpag2
+distribution called (at one user's recommendation) "iacc." Iacc reads
+&input - assumed to be a YACCish grammar - and sends to &output an
+Ibpag2-conformant file. I have not tested this file extensively, and
+there are likely to be bugs in the way I've handled the necessary 2
+token lookaheads and value stack references. Give it a whirl, though,
+if you are feeling adventurous. The only reason I personally use Iacc
+is that some YACCs (e.g. BSD YACC) have particularly nice debugging
+messages and help. If my grammar is particularly complex, I just run
+it through YACC without action code first, then use Iacc to convert it
+to Ibpag2 format. Iacc's output, as I noted, is not meant to be
+pretty, so I invariably end up doing a little editing - usually just
+respacing a few rules, and re-inserting any comments that I might have
+put in the original YACC file.
+
+ In general, Ibpag2 (like YACC) handles epsilon moves and
+indirect cycles. LR-mode shift-reduce conflicts are also handled in
+the normal way (i.e. pick the rule with the highest priority, and, in
+cases where the priority is the same, check the associativities). In
+contrast to YACC, Ibpag2 flags reduce/reduce conflicts as errors
+(since these often conceal deeper precedence problems and just plain
+kludges). Reduce/reduce conflict errors are easily enough remedied,
+if need be, via (dummy) precedences. One can convert these errors to
+warnings by specifying -y on the command line. With the -y option,
+reduce/reduce conflicts are resolved in favor of the rule that occurs
+first in the grammar. The -y switch also prevents Ibpag2 from
+aborting on shift/reduce conflicts, telling it instead to resolve in
+favor of shift. Basically, -y is a partial YACC compatibility switch.
+Normally (i.e. in SLR mode) Ibpag2 is much more finicky than YACC
+about conflicts in its grammars.
+
+ Also in contrast to YACC, Ibpag2 supports multiple
+simultaneous parsers. Ibpag2 normally names its main parser routine
+iiparse(). By using the -m command-line option, however, you can
+override this default behavior, and force Ibpag2 to augment this name
+in some uniquely identifiable fashion. For example, "ibpag2 -m _1 <
+tmp.ibp > tmp.icn" will force Ibpag2 to write a parser called
+"iiparse_1" to tmp.icn. Note that, instead of calling iilex, this
+iiparse_1() routine will now call iilex_1, and all necessary global
+variables will have _1 appended to them (e.g. errors will become
+errors_1). I don't expect that many people will have occasion to use
+this feature. It is there, though, for those that want it.
+
+
+4.__Debugging
+
+ Constructing and debugging LR(1) family parsers can sometimes
+be hair-raising, even with a parser generator.  Several precautions
+can be taken, however, to minimize the agony. The first is to declare
+all tokens initially as part of a single %token declaration, i.e. with
+no precedences, and with the same associativities. Also, leave out
+action code until the grammar seems to be working. In this stage, you
+can even run the grammar through (BSD)YACC or GNU Bison. All you
+would need to do is remove the commas between tokens and symbols, and
+place a semicolon at the end of every rule. During this and all
+debugging stages, supply Ibpag2 with a -v command-line switch. This
+will cause Ibpag2 to write a summary of rules, tokens, and its two
+state tables to "ibpag2.output" (a bit like GNU Bison, but with a
+hard-coded name). If you get messages about conflicts in your parse
+tables (e.g. "unresolvable reduce/reduce conflict, state 5, token
+257, rules 4,5"), this file will tell you which rules these are, and
+what token number 257 is. Use precedences and associativities to
+clear these problems up as they arise. If you are comfortable having
+reduce/reduce errors resolved by the order in which the conflicting
+rules occur, then use the -y command-line switch. With -y on the
+command line, Ibpag2 will always resolve in favor of the earlier rule.
+This option will also cause it to resolve all shift/reduce conflicts
+in favor of shift.
+
+       There are certain languages that are not ambiguous, but for
+which SLR(1) parser generators like Ibpag2 will nevertheless fail to
+produce a conflict-free parse table.  The classic example is
+
+ expr : lval, '=', rval | rval
+ lval : '*', rval | ID
+ rval : lval
+
+C programmers will recognize this as a toy expression grammar with
+code for identifiers, assignments, and pointers. The problem is that
+if we feed this grammar to Ibpag2, it will claim that there is a
+conflict on lookahead '='. In truth, there is no ambiguity. The SLR
+parser simply doesn't remember the pathway the parser used to get to
+the state it is in when it sees '=' on the input stream. Whether the
+parser gets into this state by seeing '*' plus an ID, or by seeing
+just an ID, it knows to turn the ID into an lval. Then it knows to
+turn lval into rval. At this point, though, it doesn't know whether
+to shift the = sign via rule 1, or to turn rval and the preceding '*'
+into an lval. The parser has "forgotten" that the '*' is there
+waiting one level down on the stack!
+
+ The solution to this problem is actually quite simple (at
+least in concept). Just provide a unique pathway in the grammar for
+the conflicting rules. In this case, they are rules 1 and 5 (the
+first and last):
+
+ expr : lval, '=', rval | rval
+ lval : '*', pval | ID
+ pval : lval
+ rval : lval
+
+Now when the parser sees '*,' it can only have a pval after it. Never
+mind that pval is composed of precisely the same things as rval. The
+point is that the parser generator follows a different route after
+seeing '*' than if it starts with ID and no preceding '*'.  Hence it
+"remembers" that the '*' is back on the stack, waiting for the
+"lval : '*', pval" rule to apply. There is no more conflict.
+
+ Go ahead and run these grammars through Ibpag2 if you aren't
+sure what is going on. Remember to declare ID as a token, and to
+place "%%" in the appropriate spot!
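+
+       Packaged as a minimal Ibpag2 input file, for example, the
+rewritten (conflict-free) grammar might look like the sketch below.
+No action code or tokenizer is included, since the point is just to
+see whether Ibpag2 itself reports any conflicts; swap the original
+"lval : '*', rval" rule back in (and drop pval) to reproduce the
+conflict:
+
+      %token ID
+      %%
+      expr : lval, '=', rval | rval
+      lval : '*', pval | ID
+      pval : lval
+      rval : lval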
+
+ If you get your parser up and running, but find that it is not
+functioning quite the way you expect, add the following line somewhere
+near the start of Ibpag2's output file:
+
+ $define IIDEBUG
+
+If you like, you can add it to the beginning of your Ibpag2 input
+file. Place it in the declarations section (before the first double
+percent sign), and surround it by %{ and %}, e.g.:
+
+ %{
+ $define IIDEBUG
+ %}
+
+This tells Ibpag2 to send $define IIDEBUG straight through to the
+output file.
+
+ What defining IIDEBUG does is tell iiparse, once compiled, to
+emit profuse debugging messages about the parser's actions, and about
+the state of its stacks. This display will not make a whole lot of
+sense to anyone who doesn't understand LR-family parsers, so those who
+want to access this feature should perhaps go through a standard
+reference like Aho, Sethi, and Ullman [1].
+
+ If, after you are finished debugging your grammar, you find
+that Ibpag2's output files are rather large, you may try saving space
+by compressing the action and goto tables. This is accomplished by
+invoking Ibpag2 with the -c (compress) option. Using this option
+makes debugging difficult, and makes the parser run a bit more slowly.
+It also only helps for rather large grammars with long nonterminal
+symbol names. Don't even consider it until the grammar is thoroughly
+debugged and you have determined that the output file's size is just
+too great for practical use. Even then, compression may or may not
+help, depending on how long your nonterminal names are. In general,
+Ibpag2 is best as a teaching tool, or as a production system for
+medium or small grammars.
+
+
+5.__Using_Ibpag2_with_Non-LR_Grammars
+
+ There may be times when you *want* to parse languages that no
+LR-based algorithm can handle. There may be times, that is, when the
+grammar you want to use contains conflicts or ambiguities that are
+there by design, and not by oversight. For example, you may want to
+parse a natural language. Full-blown natural languages involve many
+highly ambiguous constructs, and are not LR-parsable.  Invoked with
+the -a option, however, Ibpag2 can parse or recognize certain natural
+languages, or, more practically speaking, certain NL subsets. The
+letter "a" in -a is supposed to stand for "ambiguous," although what
+this option really does is put Ibpag2 into a quasi-GLR mode - i.e.
+into a kind of "generalized" LR mode in which it can accept non-LR
+grammars [4,5].
+
+ User-visible changes to Ibpag2's operation in quasi-GLR mode
+(i.e. with the -a option) are as follows:
+
+ 1) iiparse() is now a generator
+ 2) action code can use suspend as well as return
+ 3) IIERROR places the current thread in an error state (i.e.
+ it doesn't *necessarily* trigger error recovery; see below)
+ 4) there are two new action-code directives (iiprune and
+ iiisolate) and a general define (AUTO_PRUNE)
+ 5) conflicts due to ambiguities in the grammar no longer
+ result in aborted processing (so, e.g., if you do not
+ specify the -y option on a grammar with reduce/reduce
+ conflicts, Ibpag2 will simply generate a parser capable of
+ producing multiple parses for the same input)
+
+ In quasi-GLR mode, iiparse() should be invoked in a way that
+will render multiple results usable, if they are available (e.g.
+"every result := iiparse(&input) do...").  Action code is also allowed
+to produce more than one value (i.e. to use suspend). When it does
+so, iiparse() creates separate parse threads for each value. So, for
+instance, if your action code for some production suspends both of the
+following lists,
+
+ ["noun", "will", "gloss: desire"]
+ ["noun", "will", "gloss: legal document mandating how _
+ one's possessions are to be disposed _
+ of after one's death"],
+
+iiparse() would create two separate parse threads - one for each
+result. Note that in this case, the syntactic structure of each
+thread is the same. It is their semantics (i.e. the stuff on the
+value stack) that differs.
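+
+       Concretely, action code of this sort might look like the
+following sketch.  The production and the WILL token are hypothetical
+- they belong to no grammar in this manual - but the point is simply
+that each suspend hands iiparse() another candidate value, and
+iiparse() forks a separate parse thread for each one:
+
+      noun : WILL {
+          suspend ["noun", "will", "gloss: desire"]          # thread 1
+          suspend ["noun", "will", "gloss: legal document"]  # thread 2
+      }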
+
+ If you use the iierrok and iiclearin macros in your action
+code before suspending any result, their effect persists through all
+subsequent suspensions and resulting parse threads.  If you use these
+macros after suspending one or more times, however, they are valid
+only for the parse thread generated by the next suspension. By way of
+contrast, the IIERROR macro *always* flags only the next parse thread
+as erroneous. Likewise, IIACCEPT always simulates an accept action on
+the next suspension only. IIERROR and IIACCEPT, in other words, never
+have any effect on subsequent suspensions and parse threads other than
+the one that immediately follows them. This is true of iierrok and
+iiclearin only when used after the first suspension.
+
+ In quasi-GLR mode, IIERROR (number three in the difference
+list above) becomes a mechanism for placing the current parse thread
+in error mode. This is similar to, but not quite identical to, how
+IIERROR functions in straight LR mode. In quasi-GLR mode, if other
+threads can carry on the parse without error the erroneous parse
+thread is quietly clobbered. Full-blown error recovery only occurs if
+all of the other parsers halt as well. This makes sense if you think
+about it. Why keep erroneous threads around when there are threads
+still continuing a valid parse? For some large interactive systems,
+it might be necessary to keep bogus threads around longer, and weed
+them out only after a lengthy grading process. If you are
+constructing a system such as this, you'll have to modify Ibpag2's
+iiglrpar.lib file. In particular, you'll need to change the segment
+in iiparse() that takes out the trash, so to speak, in such a way that
+it does so only if the error count in a given parser either rises
+above a specific threshold or else exceeds the number of errors in
+the "most correct" parser by a certain amount. This is not that hard
+to do. I just don't expect that most parsers people generate with
+Ibpag2 will use IIERROR or error recovery in general in so involved a
+fashion.
+
+ Iiprune and iiisolate (number 4 above) are used to control the
+growth of the parallel parser array. In order to give straightforward
+(read "implementationally trivial") support for action code, Ibpag2
+cannot create a parse "forest" in the sense that a standard GLR parser
+does. Instead, it simply duplicates the current parser environment
+whenever it encounters a conflict in its action table. Even if the
+conflict turns out to reflect only a local ambiguity, the parsers, by
+default, remain separate. Put differently, Ibpag2's quasi-GLR parser,
+by default, makes no direct effort to reduce the size of its parser
+arrays or to alter the essentially linear structure of their value and
+state stacks. Size reduction, where necessary and/or desirable, is up
+to the programmer. What the iiprune macro is there to do is to give
+the programmer a way of pruning a given thread out of the active
+parser list. Iiisolate allows him or her to prune out every thread
+*but* the current one. AUTO_PRUNE makes the parser behave more like a
+standard GLR parser, instructing it to prune parse threads that are
+essentially duplicating another parse thread's efforts. The parser,
+though, does not build a parse tree per se, the way most GLR parsers
+typically do, but rather manipulates its value stack like a
+traditional LR-family parser.
+
+ Iiprune is useful when, for example, the semantics (i.e. your
+"action" code segments) determine that a given parse thread is no
+longer viable, and you want to signal the syntactic analyzer not to
+continue pursuing it. The difference between iiprune and IIERROR is
+that iiprune clobbers the current parser immediately. IIERROR only
+puts it into an error state. If all active parsers end up in an error
+state, and none can shift additional input symbols, then the IIERROR
+macro induces error recovery. Iiprune does not. NB: iiprune, if used
+in action code that suspends multiple results, cancels the current and
+remaining results (i.e. it does not clobber parsers already spun off
+by previous suspensions of that same code; it merely
+cuts the result sequence). Iiprune essentially stands in for "fail"
+in this situation. Fail itself can be used in the code, but be warned
+that iiparse() will still push *at least one* value onto its value
+stack, even if a given action code segment fails. This keeps the
+value stack in sync with the syntax. To avoid confusion, I recommend
+not using "fail" in any action code.
+
+ Iiisolate is useful if, during error recovery, you prompt the
+user interactively, or do something else that cannot be elegantly done
+in parallel for two or more distinct parse threads. Iiisolate allows
+you to preserve only the current parse thread, and to clobber the
+rest. Iiisolate can also be useful as a way of making sure that only
+one thread carries on the parse in non-error situations. Suppose that
+we have a series of productions:
+
+        sentences : sentences, sentence, '\n'
+                { print_parse(arg2) }
+              | sentences, '\n'
+              | error, '\n'
+              | epsilon
+
+If we get a sentence with more than one parse, all of the underlying
+threads that produced these parses will be active for the next
+sentence as well. In many situations this will not be what we want.
+If our desire is to have only one active parse thread at the start of
+each sentence, we simply tell our lexical analyzer to suspend two
+newlines every time it sees a newline on the input stream. This
+ensures that the second rule will always apply right after the first.
+We then insert iiisolate directives for both it and the one error
+production:
+
+        sentences : sentences, sentence, '\n'
+                { print_parse(arg2) }
+              | sentences, '\n'
+                { iiisolate }
+              | error, '\n'
+                { iiisolate; iierrok }
+              | epsilon
+
+The effect here is to allow multiple parsers to be generated only
+while parsing "sentence". The iiisolate directive, in other words,
+sees to it that no sentence parse will ever begin with multiple active
+parsers. As with LR mode, iierrok clears the error flag for the
+(current) parser.
+
+ Note that if you use iiisolate in action code that suspends
+multiple results, iiisolate will clobber all parsers but the one
+generated by the next suspension.
+
+ If there is no need for close control over the details of the
+parser array, and you wish only to clobber parsers that end up doing
+the same thing as some other parser (and hence returning identical
+values), then just make sure you add "$define AUTO_PRUNE" to the
+pass-through code section at the top of the file. Put differently,
+defining AUTO_PRUNE instructs the quasi-GLR parser to weed out parsers
+that are in the same state, and which have identical value stacks.
+AUTO_PRUNE can often be used in place of iiisolate in situations like
+the one discussed just above. Its only drawback is that it slows
+the parser a bit.
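+
+       As with the IIDEBUG define in section 4, this line can simply
+be placed in the declarations section of your Ibpag2 file, surrounded
+by %{ and %}, so that it is passed through to the output file:
+
+      %{
+      $define AUTO_PRUNE
+      %}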
+
+ Other than these deviations (action code and iiparse becoming
+generators, IIERROR's altered behavior, and the addition of iiprune,
+iiisolate, and AUTO_PRUNE), Ibpag2's quasi-GLR mode - at least on the
+surface - works pretty much like its straight LR mode. In fact, if
+you take one of your SLR(1) grammars, and run it through Ibpag2 using
+the -a option, you probably won't notice any difference in the
+resulting automaton unless you do some debugging or perform some
+timing tests (the GLR parser is slower, though for straight SLR(1)
+grammars not by much). Even with non-SLR(1) grammars, the quasi-GLR
+parser will clip along merrily, using all the same sorts of rules,
+action code, and macros that you would typically use in LR mode!
+
+
+6.__Installing_Ibpag2
+
+ If you are a UNIX user, or have a generic "make" utility, you
+are in luck. Just edit Makefile.dist according to the directions
+given in that file, rename it as "makefile," then execute "make."
+Ibpag2 should be created automatically. If everything goes smoothly,
+then "make install" (su-ing root, if both possible and necessary for
+correct installation of the parser library files).  Check with your system
+administrator if you are on a public system, and aren't sure what to
+do.
+
+ Please be sure to read the directions in the makefile
+carefully, and set DESTDIR and LIBDIR to the directory where you want
+the executable and parser file to reside. Also, make sure the paths
+you specify are correct for your Icon executables. Although Ibpag2
+will apparently compile using iconc, I would recommend using the
+interpreter, icont, first, unless you are planning on working with a
+large grammar.
+
+ If you are using some other system - one that lacks "make" -
+then shame on your manufacturer :-). You'll be a bit inconvenienced.
+Try typing:
+
+ icont -o ibpag2 follow.icn ibpag2.icn ibreader.icn \
+ ibtokens.icn ibutil.icn ibwriter.icn iohno.icn \
+ outbits.icn slritems.icn slrtbls.icn shrnktbl.icn \
+ version.icn slshupto.icn
+
+The backslashes merely indicate that the next line is a continuation.
+The whole thing should, in other words, be on a single line. As noted
+above, you may compile rather than interpret - if your OS supports the
+Icon compiler. Just replace "icont" above with "iconc." The
+resulting executable will run considerably faster than with "icont,"
+although the time required to compile it may be large, and the (still
+somewhat experimental) compiler may not work smoothly in all
+environments.
+
+       If your operating system supports environment variables, and
+you have set up your LPATH according to the specifications in the Icon
+distribution (see below), then you may copy iiparse.lib and
+iiglrpar.lib to some directory in your LPATH.  If you do not do this,
+or if your OS does not support environment variables, then you must be
+in the directory where you keep your Ibpag2 files when you use it, or
+else invoke Ibpag2 with the -p dirname option (where dirname is the
+directory that holds the iiparse.lib and iiglrpar.lib files that come
+with the Ibpag2 distribution). The .lib files contain template
+parsers that are critical to Ibpag2's operation. Ibpag2 will abort if
+it cannot find them.
+
+ If your operating system permits the creation of macros or
+batch files, it might be useful to create one that changes
+automatically to the Ibpag2 source directory, and runs the executable.
+This has the side-benefit of making it easier for Ibpag2 to find the
+parser library files, iiparse.lib and iiglrpar.lib. Under DOS, for
+instance, one might create a batch file that says:
+
+ c:
+ cd c:\ibpag2
+ iconx ibpag2 %1 %2 %3 %4 %5 %6 %7 %8 %9
+
+DOS, it turns out, has to execute Icon files indirectly through iconx,
+so this technique has yet another advantage in that it hides the
+second level of indirection - although it prevents you from using
+input and output redirection. Naturally, the above example assumes
+that Ibpag2 is in c:\ibpag2.
+
+ Ibpag2 assumes the existence on your system, not only of an
+Icon interpreter or compiler, but also of an up-to-date Icon Program
+Library.  There are several routines included in the IPL that Ibpag2
+uses. Make sure you (or the local system administrators) have put the
+IPL online, and have translated the appropriate object modules. Set
+your IPATH environment variable to point to the place where the object
+modules reside. Set LPATH to point to the modules' source files.
+Both IPATH and LPATH are documented in the doc directory of the Icon
+source tree (ipd224.doc). If your system does not support environment
+variables, copy ximage.icn, options.icn, ebcdic.icn, and escape.icn
+from the IPL into the Ibpag2 source directory, and compile them in
+with the rest of the Ibpag2 source files, either by adding them to the
+SRC variable in the makefile, or by adding them manually to the "icont
+-o ..." command line given above.
+
+ If you have any problems installing or using Ibpag2, please
+feel free to drop me, Richard Goerwitz, an e-mail message at
+goer@midway.uchicago.edu, or (via the post) at:
+
+ 5410 S. Ridgewood Ct., 2E
+ Chicago, IL 60615
+
+
+7.__Bibliography
+
+1. Aho, Alfred V., Sethi, Ravi, and Ullman, Jeffrey D. Compilers.
+ Addison-Wesley: Reading, Massachusetts, second printing, 1988.
+
+2. Griswold, Ralph E. and Griswold, Madge T. The Icon Programming
+ Language. Prentice-Hall, Inc.: Englewood Cliffs, New Jersey, USA,
+ second edition, 1990.
+
+3. Griswold, Ralph E., Jeffery, Clinton L., and Townsend, Gregg M.
+ Version 8.10 of the Icon Programming Language. Univ. of Arizona
+ Icon Project Document 212, 1993. (obtain via anonymous FTP from
+ cs.arizona.edu ~ftp/icon/docs/ipd212.doc)
+
+4. Tomita, Masaru. Efficient Parsing for Natural Language. Boston:
+ Kluwer Academic Publishers, c. 1985.
+
+5. Tomita, Masaru editor. Generalized LR Parsing. Boston: Kluwer
+ Academic Publishers, 1991.