.so /n/pipe/usr/vol2/ADM/mac .XX yacc 347 "Yacc: A Parser Generator" .ds Y \f2Yacc\fP .ds y \f2yacc\fP .nr PI 4n .rn SH sH .de SH .SP 1.5 .sH \\$1. \\$2 .PP .nr H5 0 .nr H4 0 .nr H3 0 .nr H2 0 .nr H1 \\$1 .. .de SS .NH 2 \\$1 .PP .. .de Q{ .in +20p .X{ \\$1 .. .de P{ .Q{ \\$1 .tr _\(ru .ft CW .lg 0 .. .de Q} .X} \\$1 .in -20p .. .de P} .lg .Q} \\$1 .. .de X{ .KS .lg 0 .ie \\n(.$ .SP \\$1 .el .SP .5 .nf .ft 1 .nr t 20p .ta \\ntu +\\ntu +\\ntu +\\ntu +\\ntu +\\ntu +\\ntu +\\ntu +\\ntu +\\ntu +\\ntu +\\ntu +\\ntu +\\ntu +\\ntu .. .de X} .ft 1 .fi .lg .ie \\n(.$ .SP \\$1 .el .SP .5 .KE .. .de FI .LP .DS B .fi .nr PS 9 .ps 9 \\f3Figure \\$1\\$2.\\f1 .. .\" EF - end figure .de EF .DE .fi .nr PS 10 .ps 10 .if \\n(.$ .SP \\$1 .if !\\n(.$ .SP .LP .. .PS arrowht = .08; arrowwid = .04 # default values are .1 and .05 vs = 1/6 # 1/6 = 12p; for vertical distances pagewid = 6; fillval = .98 # ln -- line to $1, $2 chop elliptically # chopw and choph are the chop ellipse width and height define ln % { [ x = $1*hu; y = $2*vu xsq = x*x rsq = xsq + y*y; r = sqrt(rsq); asq = chopw*chopw/4; bsq = choph*choph/4 esq = (asq - bsq)/asq p = sqrt(bsq/(1-esq*(xsq/rsq)))/r; O: "" B: O + p*x, p*y E: O + (1-p)*x, (1-p)*y P: O + x,y line $3 from B to E ] with .O at Here } move to Here + $1*hu, $2*vu % # usage: adirection([e|n|w|s], [right|up|...], [1|0|1|0], Object) define adirection % axdir = $3 move to $4 $2 % define aright % adirection(e, right, 1, $1) % define aleft % adirection(w, left, 1, $1) % define anup % adirection(n, up, 0, $1) % define adown % adirection(s, down, 0, $1) % define anell % ax = $1.x - Here.x; ay = $1.y - Here.y Anell: Here + ax,ay if axdir then| line $3 from Here to Here + ax,0 chop 0 chop arcrad if ax*ay > 0 then@ arc @else@ arc cw @ axdir = 0 |else| line $3 from Here to Here + 0,ay chop 0 chop arcrad if ax*ay < 0 then@ arc @else@ arc cw @ axdir = 1 | line $2 to Anell % axdir = 1 .PE .fp 8 SS S .ds BU \f8\(bu\fP .EQ define Small % size 7 "$1" % define f2 % "\&" font 2 "$1" % define cdot % "\s7\f8\N'183'\fP\s0" % define => % "\v'-.15m'\f8\N'222'\fP\v'.15m'" % define C++ % "\f1C\h'-.14m'+\h'-.18'+\fP" % define <! % "\f8\N'225'\fP" % define >! % "\f8\N'241'\fP" % delim $$ .EN .ND "July 1, 1989" .TL Yacc: A Parser Generator\(dg .AU Stephen C. Johnson Ravi Sethi .AI .MH .AB .PP Since the early 1970s, \*y has been used to implement hundreds of languages, big and small. Its applications range from small desk calculators, to medium-sized preprocessors for typesetting, to large compiler front ends for complete programming languages. .PP A \*y specification is based on a collection of grammar rules that describe the syntax of a language; \*y turns the specification into a syntax analyzer. A pure syntax analyzer merely checks whether or not an input string conforms to the syntax of the language. .PP We can go beyond pure syntax analysis by attaching code in C or $C++$ to a grammar rule; such code is called an action, and is executed whenever the rule is applied during syntax analysis. Thus, a desk calculator might use actions to evaluate an expression, and a compiler front end might use actions to emit intermediate code. .PP \*Y allows us to build parsers from LALR(1) grammars without necessarily learning the underlying theory. .AE .@tag SH _SH1_ .FS \(dg Prepared by R. Sethi from\&|reference(v7yacc)\&. S. C. Johnson is presently with Ardent Computer, 880 West Maude Ave., Sunnyvale, California 94086. .FE .SH _SH1_ "Introduction" .LP \*Y is a tool for building syntax analyzers, also known as .I parsers . This section introduces the basic features of \*y. We review grammars, build a pure parser for real numbers, and then augment the parser to evaluate numbers during parsing. The language of real numbers is a toy; realistic examples appear in Section _SH2_. .PP Uppercase letters are distinct from lowercase letters in \*y specifications. Thus, .CW digit and .CW DIGIT are distinct names. The font of a name is chosen purely for readability, so .CW fraction , and $fraction$ (within diagrams) refer to the same name. .SS "Further Reading" \*Y is designed to handle a single but significant part of the total job of building a translator or interpreter for a language. The remaining parts of the job must be implemented in a host programming language, presumed to be C|reference(Kernighan Ritchie 1988) or $C++$|reference(Stroustrup book 1986). .PP \f2Lex\fP|reference(latest lex), a tool for building lexical analyzers, works in harmony with \*y. It can be easily used to produce quite complicated lexical analyzers, but there remain some languages (Fortran, for example) whose lexical analyzers must be crafted by hand. .PP Kernighan and Pike|reference(Kernighan Pike) illustrate program development using \*y and \f2lex\fP by gradually extending an expression evaluator into an interpreter for a language comparable to Basic. Schreiner and Friedman|reference(Schreiner Friedman) conduct a book-length case study of how to create a compiler using \*y and \f2lex\fP. .PP Textbooks on compilers such as \&|reference(Aho Sethi Ullman 1986)\& provide more information on the behavior and construction of parsers than will be covered here. The algorithms underlying \*y are also discussed in a survey of LR parsing \&|reference(Aho Johnson surveys)\&. A feature that sets \*y apart \(em its ability to build fast compact LR parsers from ambiguous grammars \(em is based on the theory developed in \&|reference(Aho Johnson Ullman ambiguous)\&. .PP Among the earliest applications of \*y are .I eqn |reference(Kernighan Cherry 1975), a language for typesetting mathematics, and .I pcc , the Portable C Compiler|reference(Johnson portable acm). ....... .SS "Grammars, Reviewed" The .I syntax of a language imposes a hierarchical structure, called a .I "parse tree" , on strings in the language. The following is a parse tree for the string .CW 3.14 in a language of real numbers: .KS .ps 9 .ft CW .PS [ hu = .5; vu = 1.5*vs; choph = vs; chopw = .7 boxht = vs; boxwid = .8*vs "$realNumber$" { ln(-1,-1); "$integerPart$" ln( 0,-1); "$digit$" ln( 0,-2); "3" }{ ln( 0,-4); "." }{ ln( 1.5,-1); "$fraction$" { ln(-0.5,-1); "$digit$" ln( 0,-2); "1" }{ ln( 0.5,-1); "$fraction$" ln( 0,-1); "$digit$" ln( 0,-1); "4" } } ] .PE .KE .PP The leaves at the bottom of a parse tree are labeled with .I terminals or .I tokens ; tokens represent themselves. By contrast, the other nodes of a parse tree are labeled with .I nonterminals . Each node in the parse tree is based on a rule, called a .I production , that defines a nonterminal in terms of a sequence of terminals and nonterminals. The root of the parse tree for .CW 3.14 is based on the following informally stated production: .Q{ A real number consists of an integer part, a point, and a fraction. .Q} This production is written as follows in the notation accepted by \*y: .P{ realNumber : integerPart '.' fraction ; .P} .PP Together, the tokens, the nonterminals, the productions, and a distinguished nonterminal, called the .I "start symbol" , constitute a .I grammar for a language. Both tokens and nonterminals are referred to as .I "grammar symbols" , or simply .I symbols . .SS "Grammars in Yacc Specifications" The three sections of a \*y specification are for optional declarations, productions, and optional user-supplied routines. The productions are the heart of a specification; they comprise all but the first two and last two lines of Figure _FI1_. The sections are separated by double percent .CW %% marks \(em the percent symbol is generally used by \*y as an escape character. If the user-routines section is omitted, the second .CW %% mark can be omitted as well. .KF .ps 9 .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .P{ %start lines %% lines : /* empty */ | lines realNumber '\en' ; realNumber : integerPart '.' fraction ; integerPart : digit | integerPart digit ; fraction : digit | digit fraction ; digit : '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ; %% int yylex() { return getchar(); } .P} .@tag FI _FI1_ .FI _FI1_ A complete \*y specification for sequences of real numbers, one per line. .EF 0 .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .SP .KE .PP Blanks, tabs, and newlines are ignored. Comments can appear wherever a name can; they are enclosed between .CW /* and .CW */ , as in C. .PP It is possible, and desirable, for the start symbol of a grammar to be declared explicitly, using the .CW %start keyword, as in .P{ %start lines .P} Otherwise, the start symbol is taken from the first production in the specification. .PP A name in the production section is presumed to represent a nonterminal unless it is explicitly declared to represent a token. There are no token declarations in Figure _FI1_. The following declaration of token .CW DIGIT is from an example later in this section. .P{ %token DIGIT .P} .PP Single-character tokens need not be declared; a .I literal is a character enclosed in single quotes. As in C, the backslash .CW \e is an escape character within literals, and the following C escapes are recognized: .ft CW .TS lw4 0 l l. '\en' \f1newline\fP '\er' \f1return\fP '\e'' \f1single quote \fP'\f1\fP '\et' \f1tab\fP '\eb' \f1backspace\fP '\ef' \f1form feed\fP '\e$xxx$' $xxx$\f1 in octal\fP .TE .ft 1 For technical reasons, the .CW NUL character, .CW '\e0' or .CW 0 , should never be used in productions. .PP A production .P{ realNumber : integerPart '.' fraction ; .P} defines a nonterminal, called its .I "left side" , in terms of a sequence of grammar symbols, called its .I "right side" . A colon separates the two sides, and a semicolon marks the end of the production. The left side of this production is .CW realNumber . Its right side consists of .CW integerPart , the literal .CW '.' , and .CW fraction . .PP The right side of a production can be empty, as in the following production with no symbols between the colon and the terminating semicolon: .P{ lines : ; .P} .PP Productions with the same left side can be combined, and written with a vertical bar separating the right sides. The productions .P{ integerPart : digit ; integerPart : integerPart digit ; .P} can be rewritten equivalently as .P{ integerPart : digit | integerPart digit ; .P} In words, an integer part is either a single digit or a (smaller) integer part followed by a digit. Thus, an integer part consists of a string of one or more digits. .PP A fraction also consists of a string of one or more digits, described by the productions .P{ fraction : digit | digit fraction ; .P} The nonterminals .CW integerPart and .CW fraction impose different hierarchical structures on strings of digits. Note how a tree for .CW integerPart grows down to the left, whereas a tree for .CW fraction grows down to the right: .KS .ps 9 .ft CW .PS [ hu = .5; vu = 1.5*vs; choph = vs; chopw = .7 boxht = vs IntegerPart:\ [ "$integerPart$" { ln( 1,-1); "$digit$" ln( 0,-1); "3" } ln(-1,-1); "$integerPart$" { ln( 1,-1); "$digit$" ln( 0,-1); "2" } ln(-1,-1); "$integerPart$" { ln( 0,-1); "$digit$" ln( 0,-1); "1" }{ box invis wid .8 at Here } ] Fraction:\ [ "$fraction$" { ln(-1,-1); "$digit$" ln( 0,-1); "7" } ln( 1,-1); "$fraction$" { ln(-1,-1); "$digit$" ln( 0,-1); "8" } ln( 1,-1); "$fraction$" { ln( 0,-1); "$digit$" ln( 0,-1); "9" }{ box invis wid .6 at Here } ] with .e at IntegerPart.w + pagewid-2*20/72,0 ] .PE .KE .LP Semantic considerations influence the choice of hierarchical structure, and hence the choice of productions for a nonterminal, as we shall see in Section _SH2_. .SS "Using Yacc" When \*y is applied to a specification, the output is a file of C code, called .CW y.tab.c (the name might differ due to local file-system conventions). Suppose that the specification in Figure _FI1_ appears in a file .CW real.y . A program for reading a sequence of real numbers can then be compiled into a file .CW a.out by the following .UX system commands: .ft CW .TS lw4 0 l8 l. yacc real.y \f2generates C code into file \&\fPy.tab.c cc y.tab.c -ly \f2compiles executable program into file \&\fPa.out .TE .LP The flag .CW -ly , which must appear after .CW y.tab.c , refers to a tiny library, described below. .PP Figure _FI2_ illustrates the use of \*y. \*Y and the C compiler are represented by dashed boxes since they are used once, at ``compiler-construction time.'' The constructed compiler, consists of the lexical analyzer, the syntax analyzer, and any user routines. .KF .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .nr PS 9 .ps 9 .vs 11 .PS [ fillval = 1 boxht = 2.5*vs; boxwid = .6 lineht = vs; linewid = .25 right box invis "character" "stream" wid .7 line -> box "lexical" "analyzer" fill line -> box invis "token" "stream" line -> box "syntax" "analyzer" fill { move to last box.n up line <- box "C" "compiler" dashed line <- box "\*y" dashed line <- box invis ht 1.5*vs "grammar" } line -> box invis "optional" "output" ] .PE .@tag FI _FI2_ .FI _FI2_ \*Y handles the syntax analysis part of an application. .EF 0 .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .SP .KE .PP \*Y confines itself to building fast parsers. All other aspects of an application, such as initialization, lexical analysis, and error reporting, must be programmed separately. Some relevant function names in the C code are as follows: .SP .5 .IP \*(BU .CW yyparse is the parser generated by \*y. It returns 0 if the entire input is parsed successfully; otherwise it returns 1. .SP .5 .IP \*(BU .CW yylex is called repeatedly by .CW yyparse ; it reads input characters and returns tokens. A routine that groups characters into tokens is called a .I "lexical analyzer" . The lexical analyzer at the bottom of Figure _FI1_ .P{ .25 int yylex() { return getchar(); } .P} .25 simply returns each individual character as a token. .SP .IP \*(BU .CW main is the start-up routine. Execution of a C program begins in a function called .CW main . In general, .CW main might read command-line arguments and options and perform initialization before calling .CW yyparse . .SP .5 .IP \*(BU .CW yyerror is called by .CW yyparse if an error occurs during parsing, usually with the terse message .CW "syntax error" .'' `` Parsing terminates when an error is detected unless ``error productions'' are used as described in Section _SH5_. .LP The library .CW -ly contains default versions of .CW main and .CW yyerror : .P{ int main() { return yyparse(); } #include <stdio.h> void yyerror(s) char *s; { fprintf(stderr, "%s\en", s); } .P} This version of .CW main simply returns the result obtained from .CW yyparse , and this version of .CW yyerror simply prints the message it is called with. .PP In case the .CW -ly library is not available, the specification in Figure _FI1_ can be completed by adding these three lines at the bottom. .EQ delim off .EN .SS "Actions and Attributes" Actions attached to a production are executed each time the production is applied during parsing. An .I action consists of one or more C statements, enclosed in curly braces .CW { and .CW } . Within an action, pseudo-variables starting with .CW $ signs refer to values associated with the symbols in the production. Such values are called .I attributes . The pseudo-variable for the left side is .CW $$ . In the local version of \*y, the pseudo-variable for a symbol on the right side is formed by prefixing a .CW $ sign to either its name or its position. .PP The complete specification in Figure _FI3_ includes the production and action .P{ realNumber : integerPart fraction { $$ = $integerPart + $fraction; } ; .P} The action is the single statement .P{ $$ = $integerPart + $fraction; .P} .EQ delim @@ .EN It defines the attribute value associated with the left side to be the sum of the attribute values associated with the two symbols on the right side.@"" sup _FS1_@ .FS .@tag FS _FS1_ @"" sup _FS1_@ When the same symbol @s@ appears several times on the right side, its pseudo-variable can be written as .CW $@s@#1 , .CW $@s@#2 , .CW $@s@#3 . .FE .KF .ps 9 .nf \s5\l'\n(LLu\&\(ul'\s0 .P{ %token DIGIT %start lines %{ #def\&ine YYSTYPE double %} %% lines : /* empty */ | lines realNumber '\en' { printf("%g\en", $realNumber); } ; realNumber : integerPart '.' fraction { $$ = $integerPart + $fraction; } ; integerPart : DIGIT | integerPart DIGIT { $$ = $integerPart*10 + $DIGIT; } ; fraction : DIGIT { $$ = $DIGIT*0.1; } | DIGIT fraction { $$ = ($DIGIT + $fraction)*0.1; } ; %% #include <ctype.h> int yylex() { int c; c = getchar(); if( ! isdigit(c) ) return c; yylval = c - '0'; return DIGIT; } .P} .@tag FI _FI3_ .FI _FI3_ The actions in this \*y specification evaluate real numbers during parsing. .EF 0 .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .SP .KE .PP All versions of \*y support pseudo-variables like .CW $1 and .CW $2 formed from positions on the right side. Using positions, this action can be rewritten equivalently as .EQ delim off .EN .P{ realNumber : integerPart fraction { $$ = $1 + $2; } ; .P} .PP The .CW $ -prefix notation permits one attribute per grammar symbol. By default, all attributes have the same type, .CW int . This default can be changed by defining .CW YYSTYPE , as on line 4 of Figure _FI3_, where .CW YYSTYPE is defined to be .CW double because the specification deals with real numbers. See Section _SH6_ for how to customize the types of attributes; that is, to allow different attributes to have different types. (As in Figure _FI3_, any C code in the declarations section must be enclosed between .CW %{ and .CW %} .) .PP Attributes for tokens are computed by the lexical analyzer. Conceptually, a lexical analyzer returns a pair, consisting of a token and an associated attribute value. Consider, for example, the token .CW DIGIT in Figure _FI3_. When the lexical analyzer reads the character .CW 1 it returns .CW DIGIT with attribute value 1, when it reads .CW 2 it returns .CW DIGIT with attribute value 2, and so on. .PP Specifically, the parser .CW yyparse expects the lexical analyzer .CW yylex to leave the attribute value in a global variable .CW yylval , which is automatically declared by \*y to have type .CW YYSTYPE . The function .CW yylex in Figure _FI3_ assigns a value to .CW yylval just before it returns the token .CW DIGIT . .PP The tree in Figure _FI4_ illustrates the evaluation of the real number .CW 321.789 . Starting in the bottom-left corner of the figure, the lexical analyzer sets the attribute value 3 at the leftmost leaf for the token .CW DIGIT . The parent of this leaf is based on the production and action .KF .EQ delim $$ .EN .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .ps 9 .PS [ define intp % { "$f2($integerPart)^=^$1$" } % define frac % { "$f2($fraction)^=^$1$" } % define DIGIT % { "$Small($DIGIT)^=^$1$" } % hu = .6; vu = 2.5*vs; choph = vs; chopw = .7 boxht = vs "$f2($realNumber)^=^321.789$" { ln(-2,-1); intp(321) { ln( 1,-1); DIGIT(1) } ln(-1,-1); intp(32) { ln( 1,-1); DIGIT(2) } ln(-1,-1); intp(3) { ln( 0,-1); DIGIT(3) }{ box invis wid .85 at Here } }{ ln( 2,-1); frac(0.789) { ln(-1,-1); DIGIT(7) } ln( 1,-1); frac(0.89) { ln(-1,-1); DIGIT(8) } ln( 1,-1); frac(0.9) { ln( 0,-1); DIGIT(9) }{ box invis wid .8 at Here } } ] .PE .@tag FI _FI4_ .FI _FI4_ Attribute values during the evaluation of .CW 321.789 . .EF 0 .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .SP .EQ delim off .EN .KE .P{ integerPart : DIGIT { $$ = $1; } ; .P} This action is omitted from the specification in Figure _FI3_ because, by default, the parser sets .CW $$ , the attribute of the left side, to .CW $1 , the attribute of the first symbol on the right side. .PP Working up the tree, the next node is based on .P{ integerPart : integerPart DIGIT { $$ = $integerPart*10 + $DIGIT; } ; .P} .PP The effect of the actions is perhaps easier to see at the nodes for .CW fraction . At the only node based on .P{ fraction : DIGIT { $$ = $DIGIT*0.1; } ; .P} the value of .CW $DIGIT is 9 and the value of .CW $fraction is 0.9. .PP The attributes at the nodes in Figure _FI4_ are said to be synthesized because they are defined solely in terms of the attributes at the children of the node. Since \*y generates bottom-up parsers, bottom-up evaluation of synthesized attributes fits naturally with parsing. Actions can also be used to simulate some ``inherited attributes,'' which are context dependent. .PP Actions attached to productions are examined further in Section _SH2_; their execution order becomes significant when they do input/output, assign values to variables, call functions with side effects, or otherwise affect the state of a computation. .SS "A Style for Specifications" As in any language, choose a style that makes the code easy to read, preferably a consistent (and accepted) style that can be read by others. The main concern in a \*y specification is to make the productions visible through the morass of action code. .PP The following style hints owe much to Brian Kernighan: .SP .5 .IP \*(BU 4 Use all capital letters for token names, all lower case letters for nonterminal names. This hint comes under the heading of ``knowing who to blame when things go wrong.'' .SP .25 .IP \*(BU Put productions and actions on separate lines. Either can then be changed independently. .SP .25 .IP \*(BU Put all productions with the same left side together. Put the left side in only once, and let all following productions begin with a vertical bar. .SP .25 .IP \*(BU Put a semicolon only after the last production with a given left side, and put the semicolon on a separate line. New productions can then be easily added. .SP .25 .IP \*(BU Indent production bodies by one tab stop, and action bodies by two tab stops. .PP The examples in this paper sometimes deviate from this style to conserve space. .EQ delim $$ .EN .@tag SH _SH2_ .SH _SH2_ "Evaluation And Translation Of Expressions" Both actions and productions must be considered when a \*y specification is designed. Without actions, a parser would silently analyze input strings, complaining only if it detects an error. With suitable actions, a parser can become an evaluator or a translator. Typically, the desired actions influence the choice of productions. .PP For example, the actions for evaluating the integer and fractional parts of a real number motivate different productions for the sequences of digits represented by .CW integerPart and .CW fraction in Section _SH1_. In the integer part, the contribution of a digit depends on the number of digits to its right; the contribution of .CW 3 in .CW 321.789 is 300, where the number of zeros depends on the number of digits to its right. In the fractional part, however, the contribution of a digit depends on the number of digits to its left; the contribution of .CW 9 in .CW 321.789 is 0.009, where the number of zeros depends on the number of digits to its left. .PP Arithmetic expressions are a fertile source of examples for \*y because the specifications of expression evaluators and translators are quite short, and the ideas carry over to richer languages. This section begins with a specification for an expression evaluator. The evaluator benefits from \*y's facilities for specifying the associativity and precedence of operators within expressions. The next example, a translator from infix into postfix notation, illustrates parsing order. .SS "Grammars for Expressions" Expressions are characterized by the operators within them; the choice of productions for expressions depends on the associativity and precedence of operators. .PP An operator .CW OP is .I "left associative" if an expression .P{ expr$"" sub 1$ OP expr$"" sub 2$ OP expr$"" sub 3$ .P} is evaluated as if it were parenthesized as .P{ (expr$"" sub 1$ OP expr$"" sub 2$) OP expr$"" sub 3$ .P} Similarly, the operator is .I "right associative" if the expression is evaluated as if it were parenthesized as .P{ expr$"" sub 1$ OP (expr$"" sub 2$ OP expr$"" sub 3$) .P} .PP An operator .CW OPA has .I "lower precedence" than an operator .CW OPB if the following equivalences hold (that is, the expressions to the left and right of the $==$ signs have the same values): .ft CW .TS lw4 0 r2 c2 l. expr$"" sub 1$ OPA expr$"" sub 2$ OPB expr$"" sub 3$ $==$ expr$"" sub 1$ OPA (expr$"" sub 2$ OPB expr$"" sub 3$) expr$"" sub 1$ OPB expr$"" sub 2$ OPA expr$"" sub 3$ $==$ (expr$"" sub 1$ OPB expr$"" sub 2$) OPA expr$"" sub 3$ .TE .PP The traditional grammar for arithmetic expressions uses three nonterminals .CW expr , .CW term , and .CW factor , where .CW expr represents an expression and .CW term and .CW factor represent subexpressions. Tokens .CW NUMBER and .CW VAR represent numbers and variables. The grammar is .P{ %token NUMBER VAR %% expr : expr '+' term | expr '-' term | term ; term : term '*' factor | term '/' factor | factor ; factor : NUMBER | VAR | '(' expr ')' ; .P} In words, an expression is a sequence of terms separated by .CW + or .CW - signs. A term is a sequence of factors separated by .CW * or .CW / signs. Thus, .P{ b*b - 4*a*c .P} is an expression containing two terms .CW b*b and .CW 4*a*c . The term .CW 4*a*c has three factors .CW 4 , .CW a , and .CW c . A factor is either a number, a variable, or a parenthesized expression. .PP The grammar for expressions dates back to Backus's introduction of BNF\&|reference(Backus 1960)\&, a notation for writing grammars. .PP A desk calculator can be based on this grammar by adding actions, as in .EQ delim off .EN .P{ expr : expr '-' term { $$ = $expr - $term; } ; .P} .EQ delim $$ .EN This production and action respect the left associativity of the minus operator, and correctly evaluate .CW 7-1-2 to 4 \(em check it by drawing a parse tree. .PP The traditional grammar generalizes to operators at $n^>=^1$ precedence levels; all operators at the same level have the same associativity and precedence. Set up a nonterminal .CW expr$"" sub i$ for precedence level $i$, with level $1$ being the lowest. In the traditional grammar, the nonterminals .CW expr , .CW term , and .CW factor correspond to .CW expr$"" sub 1$ , .CW expr$"" sub 2$ , and .CW expr$"" sub 3$ , respectively. If the operators at level $i^<^n$ are left associative, then the productions for .CW expr$"" sub i$ have the form .P{ expr$"" sub i$ : expr$"" sub i$ OP expr$"" sub i+1$ ; .P} Here, .CW OP represents an operator at level $i$. Otherwise, if the operators at level $i$ are right associative, then the productions have the form .P{ expr$"" sub i$ : expr$"" sub i+1$ OP expr$"" sub i$ ; .P} .PP At each precedence level $i^<^n$, there is an additional production of the form .P{ expr$"" sub i$ : expr$"" sub i+1$ ; .P} .SS "Associativity and Precedence Declarations" \*Y has special facilities for declaring the associativity and precedence of operators, which will be introduced by considering the specification in Figure _FI5_. .KF .EQ delim off .EN .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .ps 9 .P{ %{ #def\&ine YYSTYPE double %} %token NUMBER %left '+' '-' %left '*' '/' %right '^' %left UMINUS %% lines : lines expr '\en' { printf("%g\en", $expr); } | lines '\en' | /* empty */ ; expr : expr '+' expr { $$ = $1 + $3; } | expr '-' expr { $$ = $1 - $3; } | expr '*' expr { $$ = $1 * $3; } | expr '/' expr { $$ = $1 / $3; } | expr '^' expr { $$ = pow($1, $3); } | '-' expr %prec UMINUS { $$ = - $expr; } | '(' expr ')' { $$ = $expr; } | NUMBER ; .P} .@tag FI _FI5_ .FI _FI5_ An expression evaluator based on precedence declarations for tokens. .EF 0 .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .SP .EQ delim $$ .EN .KE .PP The tokens of the evaluator in Figure _FI5_ are parentheses, .CW NUMBER , and the operators .P{ \&'+' '-' '*' '/' '^' UMINUS .P} (Ignore .CW UMINUS for the moment.) .PP The associativity and precedence of tokens are declared on lines beginning with one of the keywords .CW %left , .CW %right , or .CW %nonassoc . All of the tokens on a line have the same associativity and precedence; they have lower precedence than the tokens on successive lines. These declarations will be referred to as .I precedence declarations. .PP (The keyword .CW %nonassoc describes operators that do not associate with themselves; for example, .P{ A .LT. B .LT. C .P} is illegal in Fortran because .CW .LT. is nonassociative.) .PP The precedence declarations .P{ %left '+' '-' %left '*' '/' %right '^' .P} specify that .CW + and .CW - are left associative and have lower precedence than the left-associative operators .CW * and .CW / , which in turn have lower precedence than the right-associative operator .CW ^ . .PP The operator .CW ^ in the grammar represents exponentiation, as in .CW 2^3 , which evaluates to 8. This operator is right associative; thus, .CW 2^2^3 is equivalent to .CW 2^(2^3) , .CW 2^8 , and .CW 256 . .PP The nonterminals of the grammar are .CW lines and .CW expr . The grammar expects a sequence of expressions, on separate lines. A typical production for .CW expr has the form .EQ delim off .EN .P{ expr : expr '+' expr { $$ = $1 + $3; } .P} Alternatively, we can write the action as .P{ expr : expr '+' expr { $$ = $expr#1 + $expr#2; } .P} .EQ delim $$ .EN .PP With these precedence declarations, the expression .P{ 2 ^ 2 ^ 3 * 4 - 5 * 6 - 7 * 8 .P} is evaluated as if it were parenthesized as .P{ ( (2^(2^3))*4 - 5*6 ) - 7*8 .P} .PP The user routines for the evaluator are in Figure _FI6_. The lexical analyzer .CW yylex returns a token each time it is called. It skips blanks. If it sees a digit or a decimal point, it returns the token .CW NUMBER after using the C library function .CW scanf to read a number into the global variable .CW yylval (as mentioned in Section 1, the attribute value, if any, associated with a token must be left in .CW yylval ). Otherwise, the lexical analyzer returns a single character as a token. .KF .EQ delim off .EN .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .ps 9 .P{ %% #include <stdio.h> #include <ctype.h> #include <math.h> int yylex() { int c; while ( ( c = getchar() ) == ' ' ); if ( (c == '.') || (isdigit(c)) ) { ungetc(c, stdin); scanf("%lf", &yylval); return NUMBER; } return c; } .P} .@tag FI _FI6_ .FI _FI6_ User routines for the evaluator in Figure _FI5_. .EF 0 .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .SP .EQ delim $$ .EN .KE .PP The parser uses precedence declarations to decide when to apply a production. Suppose that the input has the form .P{ expr * expr \&\f1$...$\fP .P} and the parser has to decide whether or not to apply the multiplication production .P{ expr : expr '*' expr .P} If the next symbol in the input is .CW + , as in .P{ expr * expr + \&\f1$...$\fP .P} the parser applies the multiplication production because the token .CW + has lower precedence than .CW * . However, if the next symbol in the input is .CW ^ , the parser defers the multiplication production because .CW ^ has higher precedence than .CW * . .PP The treatment of the unary minus operator deserves special mention. The evaluator in Figure _FI5_ accepts the expression .CW 10^-1 in lieu of $10 sup -1~==~0.1$. In other words, .CW 10^-1 is treated like .CW 10^(-1) , with unary minus having higher precedence than .CW ^ . The precedence of the minus operator therefore depends on whether it is used as a binary or as a unary operator. .PP .PP \*Y provides the keyword .CW %prec for overriding the declared precedence of a token. A .CW %left declaration in Figure _FI5_ gives .CW - the lowest precedence, along with .CW + . The keyword .CW %prec in .P{ expr : '-' expr %prec UMINUS .P} overrides the declared precedence of .CW - . When this production is applied, the high precedence of token .CW UMINUS is used instead. The expression .CW 3-10^-1 is therefore equivalent to .CW 3-(10^(-1)) . .PP When several tokens appear in a production, the parser normally uses the precedence of the last token on the right side to decide whether to apply a production. The .CW %prec keyword overrides this normal behavior. For example, the .CW %prec in .P{ expr : '(' TYPENAME ')' expr %prec TYPENAME .P} is necessary for the production to take the precedence of token .CW TYPENAME ; otherwise the production takes its precedence from the closing parenthesis. .PP It is recommended that precedence declarations be used in a ``cookbook'' fashion, until some experience is gained. How \*y uses precedence declarations is examined further in Section _SH4_. .SS "Execution Order for Actions" The execution order of actions is significant because actions can have side effects. One way to visualize the order is to imagine a traversal of a parse tree in which the children of each node are visited depth-first from left to right, starting at the root. Suppose a node has two children $c$ and $d$, with $c$ to the left of $d$. .I Depth-first implies that all the nodes in the subtree for $c$ are visited before any nodes are visited in the subtree for $d$. The tree in Figure _FI7_ includes actions as pseudo-symbols, attached by dashed lines. Actions are executed in the order they would be visited in a depth-first left-to-right traversal. .KF .EQ define Small % size 7 "$1" % .EN .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .nr PS 9 .ps 9 .PS [ define NUM % { "$Small(NUM)$" "$Small($NUM)^=^$1$" at Here + 0,-vs } % hu = .3; vu = vs; choph = vs; chopw = .7 boxht = vs "$expr$" { ln(-5,-3); "$expr$" G1:"" { ln(-2,-2); NUM(2) G2:"" }{ ln( 1,-2, dashed .04); "{print $Small($NUM)$}" G3:"" } }{ ln(-1,-3); "+" }{ ln( 2,-3); "$expr$" { ln(-3,-4); "$expr$" { ln(-2,-2); NUM(3) G4: "" }{ ln( 1,-2, dashed .04); "{print $Small($NUM)$}" } }{ ln(-.5,-4); "\(**" }{ ln( 2.5,-4); "$expr$" { ln(-2,-4); NUM(5) }{ ln( 1,-4, dashed .04); "{print $Small($NUM)$}" } }{ ln( 5,-4, dashed .04); "{print \(**}" box invis wid .6 ht vs at Here } }{ ln( 6,-3, dashed .04); "{print +}" } H1: G1 + -3*hu,0 H2: G2 + -5*hu,-3*vu H3: G2.x, H2.y .ps 36 spline -> from H1 to H2 to H3 .ps\n(PS ] .PE .@tag FI _FI7_ .FI _FI7_ Actions are executed in a depth-first left-to-right order. .EF 0 .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .SP .KE .PP The parse tree in Figure _FI7_ is based on the following specification of an infix-to-postfix translator: .EQ delim off .EN .P{ %token NUM %left '+' %left '*' %% expr : expr '+' expr { printf(" +"); } | expr '*' expr { printf(" *"); } | '(' expr ')' | NUM { printf(" %d", $NUM); } ; .P} .EQ delim $$ .EN The translation is emitted incrementally during parsing, so the execution order of the print statements is critical. The translation of .CW 2+3*5 is .P{ 2 3 5 * + .P} .SS "Actions Embedded Within Rules" \*Y permits an action to be written in the middle of a production as well as at the end; actions in the middle are called .I embedded actions. .PP Embedded actions are useful for keeping track of context information. For example, consider the typeset text ``$E sub 1$'', specified by the .I eqn input .P{ E sub 1 .P} The smaller point size of 1, relative to that of $E$, is dictated by the context; specifically, by the preceding keyword .CW sub . The .I eqn grammar uses embedded actions to maintain the current point size in variable .CW ps . The embedded action .CW ps$^$-=$^$del in .P{ box : box { ps -= del; } SUB box { ps += del; } .P} reduces the point size before a subscript is processed; the other action .CW ps$^$+=$^$del restores the point size. Nonterminal .CW box represents an .I eqn construct, and token .CW SUB represents the input characters .CW sub . .PP Each embedded action is implemented by manufacturing a fresh nonterminal, called a .I marker nonterminal. \*Y actually treats this .I eqn example as if it had been written: .EQ delim @@ .EN .P{ box : box _ACT SUB box { ps += del; } ; $ACT : /* empty */ { ps -= del; } ; .P} The fresh marker nonterminal .CW _ACT marks the position of the embedded action .CW ps@^@-=@^@del . .PP Within an embedded action, .CW $$ refers to the attribute value of its marker nonterminal. Thus, the two occurrences of .CW $$ in .P{ a : b { $$ = 1; } c { x = $2; $$ = $c; } ; .P} refer to different nonterminals, shown explicitly in .P{ a : b _ACT c { x = $2; $$ = $c; } ; _ACT : /* empty */ { $$ = 1; } ; .P} In words, the effect of .P{ a : b { $$ = 1; } c { x = $2; $$ = $c; } ; .P} is to make 1 the attribute value of the implicit marker nonterminal in position 2, assign 1 to variable .CW x , and make .CW $c the attribute value of the left side .CW a . .PP Note that .CW c is at position 3, so the above production can be rewritten using .CW $3 instead of .CW $c : .P{ a : b { $$ = 1; } c { x = $2; $$ = $3; } .P} .EQ delim $$ .EN .@tag SH _SH3_ .SH _SH3_ "How The Parser Works" The algorithm used to go from a grammar to a parser is complex and will not be discussed here, but the parser itself is relatively simple. Its two main actions are .SP .5 .IP \*(BU 4 shift to the next input symbol and .IP \*(BU 4 reduce by applying a production. .SP .5 .LP Some familiarity with such actions is helpful in deciphering messages about ``shift/reduce'' and ``reduce/reduce'' conflicts, which warn of potential ambiguities in the grammar that could lead the parser astray. .PP \*Y places a human-readable description of the generated parser into a file .CW y.output , when it is invoked with the .CW -v (for verbose) option. This section deals with the parsing background behind .CW y.output files; parsing conflicts themselves are considered in Section _SH4_. .PP The running example in this section is the following grammar for real numbers: .P{ %token D P %% real : intp P frac ; intp : D | intp D ; frac : D | D frac ; .P} The abbreviated names .CW intp for integer part, .CW frac for fraction, and .CW D for digit, conserve space in diagrams. The use of token .CW P for a decimal point .CW '.' avoids confusion with other uses of dots within .CW y.output files. .SS "Shift-Reduce Parsing" A .I "bottom-up" parser works from the leaves (bottom) of a parse tree towards the root. The following sequence of tree snapshots illustrates a bottom-up parse of the token stream .CW DDPDD , corresponding to the real number .CW 21.89 . For the moment, the digit represented by a token .CW D appears below the token. The trees are .KS .ps 9 .PS [ define D % { "\f2\s8D\s0\fP" "\s6$1\s0" at Here + 0,-.5*vs } % define Pt % { "\f2\s8P\s0\fP" "." at Here + 0,-.5*vs } % hu = .125; vu = 1.5*vs; choph = vs; chopw = .4 boxht = vs; boxwid = .8*vs; movewid = .5 S1: [ # "$real$" { ln(-1,-1, invis); # "$intp$" { ln(-1,-1, invis); # "$intp$" ln( 0,-1, invis); D(2) }{ ln( 0,-2, invis); D(1) } }{ ln( 0,-3, invis); Pt }{ ln( 1,-1, invis); # "$frac$" { ln( 0,-2, invis); D(8) }{ ln( 1,-1, invis); # "$frac$" ln( 0,-1, invis); D(9) } } ] move S2: [ # "$real$" { ln(-1,-1, invis); # "$intp$" { ln(-1,-1, invis); "$intp$" ln( 0,-1); D(2) }{ ln( 0,-2, invis); D(1) } }{ ln( 0,-3, invis); Pt }{ ln( 1,-1, invis); # "$~~frac$" { ln( 0,-2, invis); D(8) }{ ln( 1,-1, invis); # "$frac$" ln( 0,-1, invis); D(9) } } ] move S3: [ # "$real$" { ln(-1,-1, invis); "$intp~$" { ln(-1,-1); "$intp$" ln( 0,-1); D(2) }{ ln( 0,-2); D(1) } }{ ln( 0,-3, invis); Pt }{ ln( 1,-1, invis); # "$~~frac$" { ln( 0,-2, invis); D(8) }{ ln( 1,-1, invis); # "$frac$" ln( 0,-1, invis); D(9) } } ] move S4: [ # "$real$" { ln(-1,-1, invis); "$intp~$" { ln(-1,-1); "$intp$" ln( 0,-1); D(2) }{ ln( 0,-2); D(1) } }{ ln( 0,-3, invis); Pt }{ ln( 1,-1, invis); # "$frac$" { ln( 0,-2, invis); D(8) }{ ln( 1,-1, invis); "$frac$" ln( 0,-1); D(9) } } ] move S5: [ # "$real$" { ln(-1,-1, invis); "$intp~~$" { ln(-1,-1); "$intp$" ln( 0,-1); D(2) }{ ln( 0,-2); D(1) } }{ ln( 0,-3, invis); Pt }{ ln( 1,-1, invis); "$~~frac$" { ln( 0,-2); D(8) }{ ln( 1,-1); "$frac$" ln( 0,-1); D(9) } } ] move S6: [ "$real$" { ln(-1,-1); "$intp~~$" { ln(-1,-1); "$intp$" ln( 0,-1); D(2) }{ ln( 0,-2); D(1) } }{ ln( 0,-3); Pt }{ ln( 1,-1); "$~~frac$" { ln( 0,-2); D(8) }{ ln( 1,-1); "$frac$" ln( 0,-1); D(9) } } ] "$=>$" at .5<S1.e, S2.w> "$=>$" at .5<S2.e, S3.w> "$=>$" at .5<S3.e, S4.w> "$=>$" at .5<S4.e, S5.w> "$=>$" at .5<S5.e, S6.w> ] .PE .KE .LP Let us redraw these trees to line up their uncovered portions; that is, the roots of the completed subtrees. The redrawn sequence is .KS .ps 9 .PS [ define D % { "\f2\s8D\s0\fP" } % define Pt % { "\f2\s8P\s0\fP" # "." at Here + 0,-.5*vs } % hu = .125; vu = 1.5*vs; choph = vs; chopw = .4 boxht = vs; boxwid = .8*vs; movewid = hu; sep = .5 S1: [ D(1); move; D(2); move; Pt; move; D(8); move; D(9) ] S2: [ { "$intp~~$"; ln( 0,-1); D(1) } move D(2); move; Pt; move; D(8); move; D(9) ] with .nw at S1.ne + sep,0 S3: [ { "$intp~~$" { ln(-1,-1); "$intp$" ln( 0,-1); D(1) }{ ln( 0,-2); D(2) } } move Pt; move; D(8); move; D(9) ] with .nw at S2.ne + sep,0 S4: [ { "$intp~~$" { ln(-1,-1); "$intp$" ln( 0,-1); D(1) }{ ln( 0,-2); D(2) } } move Pt; move; D(8) move { "$~~frac$"; ln( 0,-1); D(9) } ] with .nw at S3.ne + sep,0 S5: [ { "$intp~~$" { ln(-1,-1); "$intp$" ln( 0,-1); D(1) }{ ln( 0,-2); D(2) } } move Pt move { "$~~frac$" { ln( 0,-2); D(8) }{ ln( 1,-1); "$frac$" ln( 0,-1); D(9) } } ] with .nw at S4.ne + sep,0 S6: [ "$real$" { ln(-1,-1); "$intp~~$" { ln(-1,-1); "$intp$" ln( 0,-1); D(1) }{ ln( 0,-2); D(2) } }{ ln( 0,-3); Pt }{ ln( 1,-1); "$~~frac$" { ln( 0,-2); D(8) }{ ln( 1,-1); "$frac$" ln( 0,-1); D(9) } } ] with .nw at S5.ne + sep,0 "$=>~$" at .5<S1.ne, S2.nw> "$=>$" at .5<S2.ne, S3.nw> "$=>$" at .5<S3.ne, S4.nw> "$~~=>$" at .5<S4.ne, S5.nw> "$=>$" at .5<S5.ne, S6.nw> ] .PE .KE .PP The uncovered portions suffice, as long as the grammar is unambiguous. The real-number grammar is indeed unambiguous, so the preceding sequence of partial trees is characterized by the snapshots .KS .ps 9 .PS [ define N % box invis $1 wid .25 % define D % box invis "\f2\s8D\s0\fP" wid .125 % define P % box invis "\f2\s8P\s0\fP" wid .125 % define derives % box invis wid .4 "$=>$" % boxht = vs; boxwid = .8*vs [ D; D; P; D; D ] derives [ N("$intp$"); D; P; D; D ] derives [ N("$intp$"); P; D; D ] derives [ N("$intp$"); P; D; N("$frac$") ] derives [ N("$intp$"); P; N("$frac$") ] derives [ N("$real$") ] ] .PE .KE .PP These snapshots correspond to a sequence of reduce actions; a .I reduce action replaces the right side of a production by its left side. A .I shift action advances the parser to the next unexamined input token; such tokens are called .I lookahead symbols. .EQ delim off .EN .PP The key problem of shift-reduce parsing is that of deciding when to shift and when to reduce; \*y generates tables for this purpose that are explained later in this section. Meanwhile, an example of a shift-reduce parse appears in Figure _FI8_. The input token stream is again .CW DDPDD , and a special token .CW $end marks its end. In the figure, a pointer appears before the current lookahead symbol. The first action, a shift, advances the pointer past the leftmost .CW D : .EQ delim $$ .EN .KF .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .ps 9 .PS [ define N % box invis $1 wid .25 % define D % box invis "\f2\s8D\s0\fP" wid .125 % define P % box invis "\f2\s8P\s0\fP" wid .125 % define E % box invis "$f2($end)$" wid .3 % define action % Act: [ right B: box invis wid actionwid PTR: box invis wid .1 $1 ljust at B.w ] with .PTR.n at S.PTR.s % define shift % action("shift") % define reduce % action("reduce") % define ptr % PTR: box invis wid .1; { move left .025; up; line <- } % define handle % { line from last box.sw to last box.se } % boxht = vs arrowht = .05; arrowwid = .025; lineht = .75*vs; dy = vs down actionwid = .9 T1: S:[ right ptr; D; D; P; D; D; E ] shift S:[ right D; handle; ptr; D; P; D; D; E ] with .PTR.n at Act.PTR.s reduce S:[ right N("$intp$"); ptr; D; P; D; D; E ] with .PTR.n at Act.PTR.s shift S:[ right N("$intp$"); handle; D; handle; ptr; P; D; D; E ] with .PTR.n at Act.PTR.s reduce S:[ right N("$intp$"); ptr; P; D; D; E ] with .PTR.n at Act.PTR.s shift B1: S:[ right N("$intp$"); P; ptr; D; D; E ] with .PTR.n at Act.PTR.s actionwid = 1.2 T2: S:[ right box invis wid actionwid PTR: box invis wid .05 ] with .sw at T1.se + 1.2,0 shift S:[ right N("$intp$"); P; D; ptr; D; E ] with .PTR.n at Act.PTR.s shift S:[ right N("$intp$"); P; D; D; handle; ptr; E ] with .PTR.n at Act.PTR.s reduce S:[ right N("$intp$"); P; D; handle; N("$frac$"); handle; ptr; E ] with .PTR.n at Act.PTR.s reduce S:[ right N("$intp$"); handle; P; handle; N("$frac$"); handle; ptr; E ] with .PTR.n at Act.PTR.s reduce B2: S:[ right N("$real$"); ptr; E ] with .PTR.n at Act.PTR.s G: .5<T1.se, T2.sw> + 0,vs line from G to G.x, B2.s.y ] .PE .@tag FI _FI8_ .FI _FI8_ Shift-reduce parsing of .CW DDPDD . .EF .nf \s5\l'\n(LLu\&\(ul'\s0 .EF 0 .fi .SP .ps 9 .PS define N % box invis $1 wid .25 % define D % box invis "\f2\s8D\s0\fP" wid .125 % define P % box invis "\f2\s8P\s0\fP" wid .125 % define E % box invis wid .05; box invis "$f2($end)$" wid .25 % define shift % box invis wid .8 "shift" % define reduce % box invis wid .8 "reduce" % define lookahead % box invis wid .1; { move left .05; up; line <- } % define handle % { line from last box.sw to last box.se } % [ boxht = vs arrowht = .05; arrowwid = .025; lineht = .75*vs lookahead; D; D; P; D; D; E shift D; lookahead; D; P; D; D; E ] .PE .LP The second action reduces the token .CW D immediately to the left of the pointer (right sides to be reduced are underlined for clarity). The reduction replaces the right side .CW D by its left side .CW intp : .ps 9 .PS [ boxht = vs arrowht = .05; arrowwid = .025; lineht = .75*vs D; handle; lookahead; D; P; D; D; E reduce N("$intp$"); lookahead; D; P; D; D; E ] .PE .LP After the next .CW D is shifted, the right side .CW intp$~$D is reduced to the left side .CW intp : .ps 9 .PS [ boxht = vs arrowht = .05; arrowwid = .025; lineht = .75*vs N("$intp$"); handle; D; handle; lookahead; P; D; D; E reduce N("$intp$"); lookahead; P; D; D; E ] .PE .LP Successive shift actions now advance the lookahead pointer all the way to the endmarker. Finally, a sequence of reduce actions completes the parse. .PP It is no accident that a right side to be reduced always appears immediately to the left of the pointer in Figure _FI8_. This observation is the basis for a stack-implementation of shift-reduce parsing. Informally, the symbols to the left of the pointer are held on a stack, so a right side to be reduced appears at the top of the stack. .SS "Parser States" Instead of grammar symbols, a \*y-generated parser works with states, which encode some parsing context together with a grammar symbol. The context summarizes prior parsing actions. For example, states tell a parser for real numbers that a digit to the left of a decimal point reduces to .CW intp , but that a digit to the right reduces to .CW frac . .PP A .I state consists of a collection of items, where an .I item is a production with a dot inserted in the right side \(em some versions of \*y use an underscore in place of the dot. One of the states of the real-number parser is .P{ real : intp.P frac intp : intp.D .P} The dot tells us that an .CW intp has just been seen, and that the parser expects to see a .CW P or a .CW D . .PP In this state, the parser shifts on lookahead .CW P . The shift is recorded by (conceptually) moving the dot past .CW P to obtain the item .P{ real : intp P.frac .P} .PP Since, .CW frac now appears to the right of the dot, the parser expects the incoming symbols to match a .CW frac . Although they are not shown, the productions for .CW frac are implicitly carried with the item. The full version, or .I closure , of this state is obtained by adding the productions for .CW frac (with a dot at the beginning of the right side): .P{ real : intp P.frac frac : .D frac : .D frac .P} Since both productions for .CW frac begin with the token .CW D , no more productions are added; otherwise, we would continue adding productions until all nonterminals to the right of the dot were considered. .PP With this closure, on lookahead .CW D , the parser shifts to a state containing the items .P{ frac : D. frac : D.frac .P} .PP The parser states and their transitions constitute an automaton. The automaton for the real-number grammar appears in Figure _FI9_. The solid arrows are for shift transitions, due to tokens. The dashed arrows, for transitions due to nonterminals, are used during reductions. .KF .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .nr PS 9 .ps 9 .PS [ define state % if "$4"!="" then 'nitems=$4' else 'nitems=1' S$1: box fill at O + $2*hu, $3*vu ht nitems*vs "$1" at S$1.n + 0,vs/2 % arcrad = vs/3; boxht = vs; boxwid = 1.2; fillval = 1 dashwid = .03 hu = .7; vu = 4*vs O: "" state(0, 0, 0) box invis "$f2($accept)^:~cdot~real~f2($end)$" at S0 state(1, 3, 0) box invis "$f2($accept)^:~real~cdot~f2($end)$" at S1 state(2, 1,-1, 2) [ down B1: box invis B2: box invis "$real^:~intp~cdot~P~frac$" ljust at B1.w + .075,0 "$intp^:~intp~cdot~D$" ljust at B2.w + .075,0 ] at S2 state(5, 4,-1) box invis "$intp^:~intp~D~cdot$" at S5 state(4, 2,-2) box invis "$real^:~intp~P~cdot~frac$" at S4 state(6, 5,-2) box invis "$real^:~intp~P~frac~cdot$" at S6 state(7, 3,-3, 2) [ down B1: box invis B2: box invis box invis "$frac^:~D~cdot$" ljust at B1.w + .1,0 box invis "$frac^:~D~cdot~frac$" ljust at B2.w + .1,0 ] at S7 state(8, 6,-3) box invis "$frac^:~D~frac~cdot$" at S8 state(3, 0,-3) box invis "$intp^:~D~cdot$" at S3 S9: box invis at O + 6*hu, 0 "\0\0\f3accept\fP" ljust at S9.w G02: .66<S0.sw,S0.s> G03: .33<S0.sw,S0.s> G2: .5<S2.sw,S2.s> G4: .5<S4.sw,S4.s> G7a: .5<S7.sw,S7.s> G7b: .5<S7.s,S7.se> G7c: S7.s + 0,-1.5*vs line -> from S0.e to S1.w dashed line -> from S2.e to S5.w line -> from S4.e to S6.w dashed line -> from S7.e to S8.w dashed line -> from S1.e to S9.w line -> from G03.s to G03.x, S3.n.y adown(G02); anell(S2.w, -> dashed, dashed) adown(G2); anell(S4.w, ->) adown(G4); anell(S7.w, ->) adown(G7a); anell(G7c); anell(G7b, ->) "$real$" at .5<S0.e, S1.w> + 0, vs/2 "$intp$" at (G02.x+S2.w.x)/2, S2.y + vs/2 "$~D$" ljust at G03.x,S4.y "$f2($end)$" at .5<S1.e, S9.w> + 0, vs/2 "$D$" at .5<S2.e, S5.w> + 0, vs/2 "$P$" at (G2.x+S4.w.x)/2, S4.y + vs/2 "$frac$" at .5<S4.e, S6.w> + 0, vs/2 "$D$" at (G4.x+S7.w.x)/2, S7.y + vs/2 "$frac$" at .5<S7.e, S8.w> + 0, vs/2 "$D$" at G7c + 0, vs/2 ] .PE .@tag FI _FI9_ .FI _FI9_ States and transitions for the real-number grammar. The solid arrows are for shifts, and the dashed arrows are for transitions during reductions. .EF 0 .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .SP .KE .EQ delim @@ .EN .PP For technical reasons, \*y augments a grammar by adding a new starting nonterminal .CW $accept , which derives the old starting nonterminal and an endmarker .CW $end . The starting state 0 has an item with a dot to the left of the old starting symbol, as in .P{ $accept : .real $end .P} The closure of state 0 contains the items .P{ $accept : .real $end real : .intp P frac intp : .D intp : .intp D .P} Since the only token to the right of a dot is .CW D , the very first token must be a .CW D . .PP Some of the states of the real-number parser in Figure _FI9_ are (informally) .SP .5 .IP 0. .I "The starting state" . The item .CW $accept:@^@.real@^@$end tells us that the entire input, upto the endmarker, must match .CW real . .SP .5 .IP 2. .I "Within the integer part" . An .CW intp has been seen. The lookahead token must either be a .CW D (another digit in the integer part) or a .CW P (the decimal point). .SP .5 .IP 7. .I "Within the fraction part" . Shift as long as the lookahead token is a .CW D . Otherwise, reduce the last .CW D to .CW frac . .SP .5 .PP State 2 is displayed as follows in the .CW y.output file for the real-number grammar: .P{ state 2 real : intp.P frac intp : intp.D .sp .5 D shift 5 P shift 4 . error .P} After the two items is a summary of the actions in this state. With lookahead .CW D , the parser shifts to state 5, and with lookahead .CW P it shifts to state 4. The default action (represented by .CW . '') `` is to report an error. .SS "Parsing Actions" .PP The parser holds states on a stack, with the current state on top. The starting state of the automaton in Figure _FI9_ is state 0. With lookahead .CW D , the automaton shifts to state 3 (in the bottom-left corner of the figure) by pushing 3 onto the stack and removing the lookahead .CW D from the input. For ease of comparison with Figure _FI8_, the symbol .CW D appears below the state in the following diagram: .EQ delim $$ .EN .KS .ps 9 .PS [ define Disp % box invis $2 { move to last box + 0,-del; $1 } % define N % Disp($1, wid .25) % define D % Disp("\f2\s8D\s0\fP", wid .1) % define P % Disp("\f2\s8P\s0\fP", wid .1) % define E % Disp("$f2($end)$", wid .3) % define action % Act: [ right B: box invis wid 1 PTR: box invis wid .1 $1 ljust at B.w ] with .PTR.n at S.PTR.s % define shift % action("shift") % define reduce % action("$1") % define ptr % PTR: box invis wid .05 { move left .025; move up boxht/2; line <- } box invis wid .3 { move left .15; $1 } box invis wid .05 { move left .025; move up boxht/2; line <- } % define handle % { line from last box.sw to last box.se } % define state % [ down ELEM: box "$1" $2 $3 ] with .ELEM.w at Here; move to last [].ELEM.e; right % boxht = vs; boxwid = .2 arrowht = .05; arrowwid = .025; lineht = .5*vs; dy = vs movewid = .1; del = .015 L1: S:[ right state(0) ptr() D; D; P; D; D; E ] shift L2: S:[ right state(0); state(3, D) ptr() D; P; D; D; E ] with .PTR.n at Act.PTR.s + 0,-boxht/2 ] .PE .KE .LP The top of the state stack has its own pointer, separate from the lookahead pointer to the next input symbol. .PP The only item in state 3 is .P{ intp : D. .P} Whenever the dot in an item is at the end of the production, one of the possible actions is a reduction by that production. Since this state has no other actions, the parser chooses to reduce. .PP A reduce action has two phases: (a) pop the states corresponding to the right side, and (b) push a state corresponding to the left side. The following diagram illustrates the reduction of the leading .CW D in .CW DDPDD to .CW intp : .KS .ps 9 .PS [ define Disp % box invis $2 { move to last box + 0,-del; $1 } % define N % Disp($1, wid .25) % define D % Disp("\f2\s8D\s0\fP", wid .1) % define P % Disp("\f2\s8P\s0\fP", wid .1) % define E % Disp("$f2($end)$", wid .3) % define action % Act: [ right B: box invis wid 1 PTR: box invis wid .1 $1 ljust at B.w ] with .PTR.n at S.PTR.s + 0,-boxht % define shift % action("shift") % define reduce % action("$1") % define ptr % PTR: box invis wid .05 { move left .025; move up boxht/2; line <- } box invis wid .3 { move left .15; $1 } box invis wid .05 { move left .025; move up boxht/2; line <- } % define handle % { line from last box.sw to last box.se } % define state % [ down ELEM: box "$1" $2 $3 ] with .ELEM.w at Here; move to last [].ELEM.e; right % boxht = vs; boxwid = .2 arrowht = .05; arrowwid = .025; lineht = .5*vs; dy = vs movewid = .1; del = .015 L1: S:[ right state(0); state(3, D, handle) ptr() D; P; D; D; E ] reduce(pop) L2: S:[ right state(0) ptr("$intp$") D; P; D; D; E ] with .PTR.n at Act.PTR.s + 0,-boxht reduce(goto) L3: S:[ right state(0); state(2, N("$intp$")) ptr() D; P; D; D; E ] with .PTR.n at Act.PTR.s + 0,-boxht dx = .75; dy = -lineht-vs/2-del "Pop the states corresponding to the right side \f2\s8D\s0\fP."\ ljust at L1.ne + dx,dy "Prepare to push a state for the left side $intp$."\ ljust at L2.ne + dx,dy "State 0 goes to 2 under $intp$."\ ljust at L3.ne + dx,dy box invis wid 2.75 with .nw at L1.ne + dx,dy ] .PE .KE .PP The parser has only four actions available to it, called shift, reduce, accept, and error. A move of the parser is as follows: .SP .5 .IP 1. 4 Based on its current state, the parser decides whether it needs a lookahead token to choose the next action. If it needs one, and does not already have one, it calls .CW yylex to obtain the next token. .SP .5 .IP 2. Using the current state $p$, and the lookahead token $t$ if needed, the parser chooses an action. .RS .SP .25 .IP a) 4 A .I shift action to state $q$ is done by pushing state $q$ onto the state stack and clearing the lookahead token. .SP .25 .IP b) A .I reduce action by a production is done in two phases. In the first phase, the parser pops from the stack a number of states equal to the number of symbols on the right side. Let $q$ be the state uncovered after the states are popped, and let the nonterminal on the left side of the production take $q$ to $r$ (see the dashed arrows in Figure _FI9_). In the second phase, the parser pushes state $r$ onto the stack. .SP .25 .EQ delim off .EN .IP c) The .I accept action occurs after the entire input has been seen and matched. This action occurs only when the lookahead token is the endmarker .CW $end . .EQ delim $$ .EN .SP .25 .IP d) An .I error action occurs when the parser can no longer continue parsing according to the productions. That is, the input tokens seen so far and the current lookahead cannot possibly be followed by anything that would result in a legal input. See Section _SH5_ for error recovery. .RE .@tag SH _SH4_ .SH _SH4_ "Ambiguity and Conflicts" A .I "shift/reduce conflict" occurs if the parser cannot decide between a shift action and a reduce action. Similarly, a .I "reduce/reduce conflict" occurs if the parser cannot decide between two legal reductions. .PP Conflicts definitely occur when a grammar is .I ambiguous ; that is, if some input string has more than one parse tree. Conflicts can also occur when a grammar, although consistent, requires a more complex parser than \*y is capable of constructing. Finally, conflicts are sometimes introduced when an action is embedded into the middle of a production. .PP \*Y produces a parser even when conflicts occur. Rules used to choose between two competing actions are called .I "disambiguating rules" . The default disambiguating rules are: .SP .5 .IP 1. 4 In a shift/reduce conflict, the default is to shift. .SP .25 .IP 2. In a reduce/reduce conflict, the default is to reduce by the earlier production in the specification. .SP .5 .PP Although the effect of disambiguating rules can often be achieved by rewriting the grammar to avoid conflicts, experience suggests that this rewriting is somewhat unnatural and produces slower parsers. .PP The rest of this section considers two situations, one in which the default disambiguating rules do not have the intended effect, and one in which they do. It is recommended that shift/reduce conflicts be investigated using the .CW y.output file created by running \*y with the .CW -v option. ....... .SS "To Shift or To Reduce" \*Y reports 2 shift/reduce conflicts when it is applied to .EQ delim off .EN .P{ %token NUM %% expr : expr '\en' { printf("\en"); } | expr '-' expr { printf(" -"); } | NUM { printf(" %g", $NUM); } ; .P} .EQ delim $$ .EN .PP The hope here is to translate an infix expression, terminated by a newline, into postfix notation. Thus, we want .P{ 2 - 1 - 1 \en .P} to be translated into .P{ 2 1 - 1 - \en .P} The answer would be 0 if the expression were evaluated by subtracting 1 from 2, and then subtracting 1 from the result. Unfortunately, the output of the parser (with a suitable user-routines section) is .P{ 2 1 1 \en - - .P} What happened? The problem can be traced to the preference for shift in a shift/reduce conflict, which makes .CW - right associative, and gives .CW \en higher precedence than .CW - . .PP A slightly edited version of the .CW y.output file for this grammar appears in Figure _FI10_. Unfortunately, both productions and states are numbered, leaving room for confusion. The action .P{ . reduce 2 .P} refers to \f4production\fP 2, whereas the action .P{ \en shift 3 .P} refers to \f4state\fP 3. .KF .EQ delim off .EN .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .ps 9 .ft CW .PS [ define text % [ boxht = vs B: box $2 invis $1 ljust at B.w ] % boxht = vs C1: [ boxwid = 2; down text("state 0") text("", ht vs/4) text(" $accept : .expr $end ") text("", ht vs/4) text(" NUM shift 2") text(" . error") text(" expr goto 1") text("", ht vs/2) text("state 1") text("", ht vs/4) text(" $accept : expr.$end ") text(" expr : expr.\en ") text(" expr : expr.- expr ") text(" $end accept") text("", ht vs/4) text(" \en shift 3") text(" - shift 4") text(" . error") text("", ht vs/2) text("state 2") text("", ht vs/4) text(" expr : NUM. (3)") text("", ht vs/4) text(" . reduce 3") ] C2: [ boxwid = 2.5; down text("state 3") text("", ht vs/4) text(" expr : expr \en. (1)") text("", ht vs/4) text(" . reduce 1") text("", ht vs/2) text("state 4") text("", ht vs/4) text(" expr : expr -.expr ") text("", ht vs/4) text(" NUM shift 2") text(" . error") text(" expr goto 5") text("", ht vs/2) .ft CB text("5: conflict (shift \en, reduce 2)") text("5: conflict (shift -, reduce 2)") .ft CW text("state 5") text("", ht vs/4) text(" expr : expr.\en ") text(" expr : expr.- expr ") text(" expr : expr - expr. (2)") text("", ht vs/4) text(" \en shift 3") text(" - shift 4") text(" . reduce 2") ] with .ne at C1.nw + pagewid-.75,0 ] .PE .EQ delim $$ .EN .@tag FI _FI10_ .FI _FI10_ A slightly edited .CW y.output file. .EF 0 .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .SP .KE .PP The two shift/reduce conflicts are in state 5. In this state, by default, the parser shifts the lookahead symbols .CW \en and .CW - instead of using the item .P{ expr : expr - expr. (2) .P} to reduce by production (2). This reduction can occur only when the right side is on top of the stack, so, when the conflict occurs, the stack must contain states corresponding to .ps 9 .ft CW .PS Indent: box wid pagewid ht 0 invis [ boxwid = .45; boxht = 1/6 Bot: box "\f1$...$\fP" invis box "expr$\"\" sub a$" box "-" box "expr$\"\" sub b$" line from Bot.nw to Bot.ne to Bot.se to Bot.sw ] with .w at Indent.w + 20/72,0 .PE .LP (The subscripts merely distinguish between the two occurrences of .CW expr .) .PP The other items in state 5 tell us more about the choices faced by the parser. With lookahead .CW \en the parser shifts, by the default disambiguating rule. Thus, it treats .CW expr$"" sub b$ on top of the stack as if it were in the item .P{ expr : expr$"" sub b$.\en .P} instead of the item .P{ expr : expr$"" sub a$ - expr$"" sub b$. (2) .P} Similarly, with lookahead .CW - , the parser treats .CW expr$"" sub b$ as if it were the first subexpression in .P{ expr : expr$"" sub b$.- expr .P} .PP Putting these observations together, the input .CW 2-1-1\en results in the stack eventually containing .P{ expr - expr - expr \en .P} .PP A sequence of reductions now occurs; the corresponding actions print a newline and two minus signs.$"" sup _FS2_$ .FS .SP .@tag FS _FS2_ $"" sup _FS2_$ This explanation assumes that the newline is the last character of the input. If it is not, the parser will wait for more input after reducing the right side .CW expr\en . At this point, the stack contains .P{ .25 expr - expr - expr .P} .25 The parser needs more input before it can choose whether to shift on newline, shift on minus, or reduce on any other token. .FE .PP The addition of the precedence declarations .P{ %nonassoc '\en' %left '-' .P} eliminates the conflicts in state 5, resulting in the new state .P{ state 5 expr : expr.\en expr : expr.- expr expr : expr - expr. (2) .sp .5 . reduce 2 .P} Here, the reduction takes precedence over the potential shifts, thereby implementing the left associativity of .CW - and its higher precedence over .CW \en . .SS "Precedence Declarations" Precedence declarations give rise to disambiguating rules for resolving parsing conflicts. .PP As mentioned in Section _SH2_, a precedence declaration starts with .CW %left , .CW %right , or .CW %nonassoc . These keywords specify the associativity of the tokens in a declaration. All the tokens in a declaration have the same precedence; tokens in successive declarations have higher precedence. A token in a precedence declaration need not be, but can be, declared by .CW %token as well. .PP The disambiguating rules are as follows. .SP .5 .IP 1. 4 Although declared only for tokens and literals, precedence information is attached to each production as well. The precedence and associativity of a production is that of the last token or literal on its right side. The .CW %prec construction overrides this default. .SP .5 .IP 2. If there is a shift/reduce conflict, and both the lookahead symbol (to be shifted) and the production (to be reduced) have precedence declarations, then the conflict is resolved in favor of the action (shift or reduce) with the higher precedence. If the precedences are the same, then the associativity is used; left associative implies reduce, right associative implies shift, and nonassociative implies error. .SP .5 .IP 3. In the absence of precedence information, the default disambiguating rules given earlier in this section are used. More precisely, suppose that there is no precedence information for either the lookahead symbol or a production to be reduced by. A shift/reduce conflict is resolved by shifting. A reduce/reduce conflict is resolved by reducing by the production that appears earlier in the specification. Conflicts resolved by these default rules are reported. .SP .5 .LP Conflicts resolved by precedence are not counted in the shift/reduce and reduce/reduce conflicts reported by \*y. Thus, mistakes in precedence declarations can mask errors in the design of the productions. .SS "Shift a Dangling Else" As an example of the power of the default disambiguating rules consider a fragment from a programming language involving an if-then-else construction: .P{ stmt : IF '(' expr ')' stmt | IF '(' expr ')' stmt ELSE stmt ; .P} Here, .CW IF and .CW ELSE are tokens, .CW stmt is a nonterminal for statements, and .CW expr is a nonterminal for expressions. Call the first production the .I "simple-if" production and the second the .I "else-if" production. .PP These two productions are ambiguous, since .P{ IF ( expr$"" sub 1$ ) IF ( expr$"" sub 2$ ) stmt$"" sub a$ ELSE stmt$"" sub b$ .P} can be structured in two ways .ft CB .TS lw4 0 l7 | l. IF ( expr$"" sub 1$ ) { IF ( expr$"" sub 1$ ) { \f(CWIF ( expr$"" sub 2$ ) stmt$"" sub a$\f(CB \f(CWIF ( expr$"" sub 2$ ) stmt$"" sub a$\f(CB \f(CW}\f(CB \f(CWELSE stmt$"" sub b$\f(CB ELSE stmt$"" sub b$ } .TE .LP The second interpretation, with an .CW ELSE matching the nearest preceding unmatched .CW IF , is the one taken by most programming languages. It is the interpretation obtained when shift/reduce conflicts are resolved in favor of shift actions. .PP The .CW y.output file for a grammar containing the simple-if and if-else productions contains a shift/reduce conflict, illustrated by .P{ 8: shift/reduce conflict (shift 9, red'n 1) on ELSE state 8 stmt : IF ( expr ) stmt. (1) stmt : IF ( expr ) stmt.ELSE stmt .sp .5 ELSE shift 9 . reduce 1 .P} This conflict occurs when the lookahead symbol is .CW ELSE and the parser stack contains the right side of the simple-if production .P{ \f1$...$\fP IF ( expr ) stmt .P} The parser chooses to shift the .CW ELSE rather than reduce by the simple-if production. This choice allows .P{ \f1$...$\fP IF ( expr ) stmt ELSE stmt .P} to be successfully reduced by the if-else production. Note that a premature application of the simple-if rule (with lookahead .CW ELSE ) would lead to the stack contents .P{ \f1$...$\fP stmt ELSE stmt .P} which cannot be parsed further. .PP A shift on lookahead .CW ELSE also matches an .CW ELSE with the nearest preceding unmatched .CW IF . With stack contents .P{ \f1$...$\fP \f(CWIF ( expr )\fP IF ( expr ) stmt .P} and lookahead .CW ELSE , the parser shifts, so the stack eventually holds .P{ \f1$...$\fP \f(CWIF ( expr )\fP IF ( expr ) stmt ELSE stmt .P} A reduction by the if-else production now yields .P{ \f1$...$\fP \f(CWIF ( expr )\fP stmt .P} which can be reduced by the simple-if production. .@tag SH _SH5_ .SH _SH5_ "Error Handling" Error handling is an extremely difficult area, and many of the problems are semantic ones. When an error is found, it may be necessary to undo the effect of actions \(em for example, to reclaim parse-tree storage, to delete or alter symbol-table entries \(em and, typically, set flags to avoid generation of further output. .PP It is seldom acceptable to stop all processing when an error is found; further syntax errors might be found if parsing continues. But, how do we get the parser ``restarted'' after an error is detected. One approach is to discard tokens from the input until parsing can be continued. .PP \*Y provides a simple, but reasonably general, feature for discarding tokens. The token name .CW error is reserved for error handling. On the right side of a production, .CW error suggests a place where error recovery is planned. When an error occurs, the parser pops its stack until it enters a state where the token .CW error is legal. It then behaves as if .CW error were the current lookahead, and performs the action encountered. The lookahead is then reset to the token that caused the error. If no error productions are specified, parsing halts when an error is detected. .PP In order to prevent a cascade of error messages, the parser, after detecting an error, remains in the error state until three tokens have been successfully read and shifted. If an error is detected when the parser is already in error state, no message is given, and the input token is quietly deleted. .PP For example, a production .P{ stmt : error .P} would, in effect, mean that on a syntax error the parser would attempt to skip over the statement in which the error was seen. More precisely, the parser will scan ahead, looking for three tokens that might legally follow a statement, and start processing at the first of these; if the beginnings of statements are not sufficiently distinctive, it may make a false start in the middle of a statement, and end up reporting a second error where there is in fact no error. .PP Actions can be used with these special error rules. These actions might attempt to reinitialize tables, reclaim symbol table space, etc. .PP Productions with just .CW error on the right side are very general, but difficult to control. Somewhat easier are productions such as .P{ stmt : error ';' .P} Here, upon error, the parser attempts to skip over the statement, but will do so by skipping to the next semicolon. All tokens after the error and before the next semicolon cannot be shifted, and are discarded. When the semicolon is seen, this right side will be reduced, and any ``cleanup'' action associated with it performed. .PP Another form of error production arises in interactive applications, where it may be desirable to permit a line to be reentered after an error. A possible error production might be .EQ delim off .EN .P{ input : error '\en' { printf("Reenter last line: "); } input { $$ = $input; } ; .P} One potential difficulty with this approach is that the parser must correctly process three input tokens before it admits that it has correctly resynchronized after the error. If the reentered line contains an error in the first two tokens, the parser deletes the offending tokens, and gives no message; this is clearly unacceptable. For this reason, there is a mechanism that can be used to force the parser to believe that an error has been fully recovered from. The statement .P{ yyerrok; .P} in an action resets the parser to its normal mode. The last example is better written .P{ input : error '\en' { yyerrok; printf("Reenter last line: "); } input { $$ = $input; } ; .P} .EQ delim $$ .EN .PP As mentioned above, the token seen immediately after the .CW error symbol is the input token at which the error was discovered. Sometimes, this is inappropriate; for example, an error recovery action might take upon itself the job of finding the correct place to resume input. In this case, the previous lookahead token must be cleared. The statement .P{ yyclearin; .P} in an action has this effect. For example, suppose the action after error is to call some sophisticated resynchronization routine, supplied by the user, that attempts to advance the input to the beginning of the next valid statement. After this routine is called, the next token returned by .CW yylex is presumably the first token in a legal statement; the old, illegal token must be discarded, and the error state reset. This can be done by a rule like .P{ stmt : error { resynch(); yyerrok; yyclearin; } ; .P} .PP These mechanisms are admittedly crude, but do allow for a simple, fairly effective recovery of the parser from many errors; moreover, the user can get control to deal with the error actions required by other portions of the program. .@tag SH _SH6_ .SH _SH6_ "The Yacc Environment" From a specification, \*y creates a file of C programs, called .CW y.tab.c on most systems. .SS "Program Organization" Consider a specification file of the form .P{ %{ $<!$\f2user supplied code within declarations\fP$>!$ #define YYSTYPE $<!$\f1desired type\fP$>!$ %} $<!$\f2declarations section\fP$>!$ %% $<!$\f2productions\fP$>!$ %% $<!$\f2user-routines section\fP$>!$ .P} From this specification, \*y creates a file .CW y.tab.c , organized as follows: .P{ $<!$\&\f2user supplied code within declarations\fP$>!$ #define YYSTYPE $<!$\f1desired type\fP$>!$ $<!$\f2token and other declarations\fP$>!$ $<!$\f2user-routines section\fP$>!$ $<!$\f2parser tables\fP$>!$ yyparse() { \&\f1$...$\fP } .P} Actions are incorporated into .CW yyparse . .PP As mentioned in Section _SH1_, .CW yyparse , the code for the parser, expects the following functions to be supplied: .P{ int yylex() { \f2a lexical analyzer; returns a token\fP } int main(\f1$...$\fP) { \f1$...$\fP yyparse(); \f1$...$\fP } void yyerror(s) char *s; { \f2print an error message pointed to by\fP s } .P} .PP The function .CW main decides when it wants to call .CW yyparse , which returns 0 if the parser accepts, and 1 if an error is detected and no error recovery is possible. .PP A user-supplied function .CW yyerror is called with a string containing an error message. Applications typically do more than simply print the message; for example, the message might be accompanied by the input line number on which the error was detected. The external integer variable .CW yychar contains the lookahead token number at the time an error is detected; this may be of some interest in giving better diagnostics. .PP The external variable .CW yydebug is normally set to 0. It it is set to a nonzero value, the parser will output a verbose description of its actions, including the tokens read and the parser actions. Depending on the operating environment, it may be possible to set this variable by using a debugging system. .SS "Lexical Tie-Ins" The lexical analyzer .CW yylex must return an integer, the token number, representing the lookahead token. An attribute value associated with the token must be placed in the global variable .CW yylval . The lexical-analyzer generator \f2lex\fP|reference(latest lex) can be used together with \*y. .PP The parser and the lexical analyzer must agree on the token numbers in order for communication between them to take place. Token numbers may be chosen by \*y, or chosen by the user. In either case, the .CW #def\&ine mechanism of C is used to allow the lexical analyzer to refer to these numbers symbolically. For example, the declarations section .P{ %token IF ELSE .P} leads to the following definitions in the file .CW y.tab.c : .P{ # define IF 257 # define ELSE 258 .P} .PP If .CW yylex is included in the user-routines section, it is within the scope of these definitions, so .CW IF and .CW ELSE can be used as the names of token numbers in .CW yylex . .PP A file .CW y.tab.h containing the definition of token numbers can be created by running \*y with the .CW -d option. .PP The approach of treating token names as defined constants leads to clear, easily modified lexical analyzers; the only pitfall is the need to avoid using any token names that are reserved or significant in C or the parser. For example, the use of the token names .CW if and .CW while will almost certainly cause severe difficulties when the lexical analyzer is compiled. The token name .CW error is reserved for error handling; see Section _SH5_. .PP \*Y chooses token numbers if the user does not. The default token number for a literal character is the numerical value of the character in the local character set. Other names are assigned token numbers starting at 257. .PP To assign a number to a token (including a literal), the first appearance of the token name or literal in the declarations section can be immediately followed by a nonnegative integer. This integer is taken to be the token number of the name or literal. Names and literals not defined by this mechanism retain their default definition. It is important that all token numbers be distinct. .PP For historical reasons, the endmarker must have token number 0 or negative. This token number cannot be redefined by the user; thus, all lexical analyzers must be prepared to return 0 or negative as a token number upon reaching the end of their input. .SS "Communicating Context to the Lexical Analyzer" Some lexical decisions depend on context. For example, the lexical analyzer might want to delete blanks normally, but not within quoted strings. Or names might be entered into a symbol table in declarations, but not in expressions. .PP One way of handling this situation is to create a global flag that is examined by the lexical analyzer, and set by actions. For example, suppose a program consists of zero or more declarations, followed by zero or more statements. Consider: .P{ %{ int dflag; %} \&\f2$...$ other declarations $...$\fP %% prog : decls stmts ; decls : /* empty */ { dflag = 1; } | decls declaration ; stmts : /* empty */ { dflag = 0; } | stmts statement ; \&\f2$...$ other productions $...$\fP .P} .PP The flag .CW dflag is now 1 when reading declarations and 0 when reading statements, .ul except for the first token in the first statement. This token must be seen by the parser before it can tell that the declarations have ended and the productions have begun. In many cases, this single token exception does not affect the lexical scan. .PP This kind of ``backdoor'' approach can be elaborated to a noxious degree. Nevertheless, it represents a way of doing some things that are difficult, if not impossible, to do otherwise. .PP Some programming languages permit the user to use words like .CW if , which are normally reserved, as label or variable names, provided that such use does not conflict with the legal use of these names in the programming language. This is extremely hard to do in the framework of \*y; it is difficult to pass information to the lexical analyzer telling it ``this instance of .CW if is a keyword, and that instance is a variable''. The user can make a stab at it, using flags like .CW dflag , above, but it is difficult. It is better that the keywords be .I reserved ; that is, be forbidden for use as variable names. There are powerful stylistic reasons for preferring this, anyway. .SS "Support for Arbitrary Attribute Types" By default, the values returned by actions and the lexical analyzer are integers. Other types can be supported by defining .CW YYSTYPE in the declarations section, as in .P{ %{ #define YYSTYPE double %} .P} .CW YYSTYPE is not a normal variable, because the parser contains the following lines to define it to be .CW int if it has not already been defined by the user: .P{ #ifndef YYSTYPE #define YYSTYPE int #endif .P} .PP Clearly, .CW YYSTYPE can be defined to be any type, including a union type. .PP The .CW %union mechanism of \*y attempts to make the underlying union transparent, in all but a few places where \*y needs help in determining which field of the union is intended. .PP Unions are declared in the declarations section, an example being .P{ %union { int ival; double dval; char * sval; } .P} A union type with these members is created for the \*y value stack, and for the global variables .CW yylval and .CW yyval . With the .CW -d option, \*y copies the union type into the .CW y.tab.h file. The type of the union can be referred to as .CW YYSTYPE . .PP The type of each attribute must now correspond to one of the union members. The construction .P{ <name> .P} indicates a union member name. If the construction follows .CW %token , .CW %left , .CW %right , or .CW %nonassoc , then the union member name is associated with the tokens in that declaration. Another keyword .CW %type is used similarly to associate union member names with nonterminals. Thus, we might use .P{ %type <dval> expr term .P} .EQ delim off .EN .PP There remain a couple of cases where these mechanisms are insufficient. If there is an action within a rule, the value returned by this action has no a priori type. Similarly, reference to left context values leaves \*y with no easy way of knowing the type. In this case, a type can be imposed on the reference by inserting a union member name, between .CW < and .CW > , immediately after the first .CW $ and immediately before the symbol name or number. An example of this usage is .P{ rule : aaa { $<intval>$ = 3; } bbb { fun( $<intval>2, $<other>0 ); } ; .P} where the union member names .CW intval and .CW other are inserted within references. This syntax has little to recommend it, but the situation arises rarely. .PP The facilities in this subsection are not triggered until they are used: in particular, the use of .CW %type will turn on these mechanisms. When they are used, there is a fairly strict level of checking. For example, use of .CW $\f2n\fP or .CW $$ to refer to something with no defined type is diagnosed. If these facilities are not triggered, the \*y value stack is used to hold .CW int s, as was true historically. .EQ delim $$ .EN .@tag SH _SH8_ .SH _SH8_ "Acknowledgements" The original acknowledgements, from |reference(v7yacc), are as follows. ``\*Y owes much to a most stimulating collection of users, who have goaded me beyond my inclination, and frequently beyond my ability, in their endless search for `one more feature'. Their irritating unwillingness to learn how to do things my way has usually led to my doing things their way; most of the time, they have been right. B. W. Kernighan, P. J. Plauger, S. I. Feldman, C. Imagna, M. E. Lesk, and A. Snyder will recognize some of their ideas in the current version of \*y. C. B. Haley contributed to the error recovery algorithm. D. M. Ritchie, B. W. Kernighan, and M. O. Harris helped translate this document into English. Al Aho also deserves special credit for bringing the mountain to Mohammed, and other favors.'' .PP This version of \*y has benefited from thoughtful comments by B. W. Kernighan, M. F. Fernandez, and M. Tasman. .NH 1 References .LP |reference_placement .SH "Appendix A" "Yacc Input Syntax" This appendix has a description of the \*y input syntax, as a \*y specification. Context dependencies, etc., are not considered. Ironically, the \*y input specification language is most naturally specified as an LR(2) grammar; the sticky part comes when an identifier is seen in a rule, immediately following an action. If this identifier is followed by a colon, it is the start of the next rule; otherwise it is a continuation of the current rule, which just happens to have an action embedded in it. As implemented, the lexical analyzer looks ahead after seeing an identifier, and decide whether the next token (skipping blanks, newlines, comments, etc.) is a colon. If so, it returns the token .CW C_IDENTIFIER . Otherwise, it returns .CW IDENTIFIER . Literals (quoted strings) are also returned as .CW IDENTIFIER s, but never as part of .CW C_IDENTIFIER s. .P{ /* grammar for the input to \*y */ /* basic entities */ %token IDENTIFIER /* includes identifiers and literals */ %token C_IDENTIFIER /* identifier (but not literal) followed by colon */ %token NUMBER /* [0-9]+ */ /* reserved words: %type => TYPE, %left => LEFT, etc. */ .sp .5 %token LEFT RIGHT NONASSOC TOKEN PREC TYPE START UNION .sp .5 %token MARK /* the %% mark */ %token LCURL /* the %{ mark */ %token RCURL /* the %} mark */ .sp .5 /* ascii character literals stand for themselves */ %start spec .P} 0 .P{ %% .sp .5 spec : defs MARK rules tail ; .sp .5 tail : MARK { \f2In this action, eat up the rest of the file\fP } | /* empty: the second MARK is optional */ ; .sp .5 defs : /* empty */ | defs def ; .sp .5 def : START IDENTIFIER | UNION { \f2Copy union definition to output\fP } | LCURL { \f2Copy C code to output file\fP } RCURL | ndefs rword tag nlist ; .sp .5 rword : TOKEN | LEFT | RIGHT | NONASSOC | TYPE ; tag : /* empty: union tag is optional */ | \'<\' IDENTIFIER \'>\' ; .sp .5 nlist : nmno | nlist nmno | nlist \',\' nmno ; .sp .5 nmno : IDENTIFIER /* NOTE: literal illegal with %type */ | IDENTIFIER NUMBER /* NOTE: illegal with %type */ ; .P} .P{ /* rules section */ .sp .5 rules : C_IDENTIFIER rbody prec | rules rule ; .sp .5 rule : C_IDENTIFIER rbody prec | '|' rbody prec ; .sp .5 rbody : /* empty */ | rbody IDENTIFIER | rbody act ; .sp .5 act : \'{\' { \f2Copy action, translate $$, etc.\fP } \'}\' ; .sp .5 prec : /* empty */ | PREC IDENTIFIER | PREC IDENTIFIER act | prec \';\' ; .P}