.so ../ADM/mac .XX 22 359 "Yacc: A Parser Generator" .\".nr PD 1u \" bwk: little bit of paddability .\".nr PI 2n .rn SH sH .fp 9 CB CW .de SH .SP 1.5 .sH \\$1. \\$2 .PP .nr H5 0 .nr H4 0 .nr H3 0 .nr H2 0 .nr H1 \\$1 .. .de SS .NH 2 \\$1 .PP .. .de CB .nr Ft \\n(.f \&\\$3\&\\f(CB\\$1\&\\f\\n(Ft\\$2 .. .de Q{ .in +20p .X{ \\$1 .. .de P{ .Q{ \\$1 .tr _\(ru .ft CB .lg 0 .. .de Q} .X} \\$1 .in -20p .. .de P} .lg .Q} \\$1 .. .de X{ .KS .lg 0 .ie \\n(.$ .SP \\$1 .el .SP .5 .nf .ft 1 .nr t 20p .ta \\ntu +\\ntu +\\ntu +\\ntu +\\ntu +\\ntu +\\ntu +\\ntu +\\ntu +\\ntu +\\ntu +\\ntu +\\ntu +\\ntu +\\ntu .. .de X} .ft 1 .fi .lg .ie \\n(.$ .SP \\$1 .el .SP .5 .KE .. .de FI .LP .DS B .fi .nr PS 9 .ps 9 \\f3Figure \\$1\\$2.\\f1 .. .\" EF - end figure .de EF .DE .fi .nr PS 10 .ps 10 .if \\n(.$ .SP \\$1 .if !\\n(.$ .SP .LP .. .PS arrowht = .08; arrowwid = .04 # default values are .1 and .05 vs = 1/6 # 1/6 = 12p; for vertical distances pagewid = 6; fillval = .98 # ln -- line to $1, $2 chop elliptically # chopw and choph are the chop ellipse width and height define ln % { [ x = $1*hu; y = $2*vu xsq = x*x rsq = xsq + y*y; r = sqrt(rsq); asq = chopw*chopw/4; bsq = choph*choph/4 esq = (asq - bsq)/asq p = sqrt(bsq/(1-esq*(xsq/rsq)))/r; O: "" B: O + p*x, p*y E: O + (1-p)*x, (1-p)*y P: O + x,y line $3 from B to E ] with .O at Here } move to Here + $1*hu, $2*vu % # usage: adirection([e|n|w|s], [right|up|...], [1|0|1|0], Object) define adirection % axdir = $3 move to $4 $2 % define aright % adirection(e, right, 1, $1) % define aleft % adirection(w, left, 1, $1) % define anup % adirection(n, up, 0, $1) % define adown % adirection(s, down, 0, $1) % define anell % ax = $1.x - Here.x; ay = $1.y - Here.y Anell: Here + ax,ay if axdir then| line $3 from Here to Here + ax,0 chop 0 chop arcrad if ax*ay > 0 then@ arc @else@ arc cw @ axdir = 0 |else| line $3 from Here to Here + 0,ay chop 0 chop arcrad if ax*ay < 0 then@ arc @else@ arc cw @ axdir = 1 | line $2 to Anell % axdir = 1 .PE .fp 8 SS S .EQ define Small % size 7 "$1" % define f2 % "\&" font 2 "$1" % define cdot % "\s7\f8\N'183'\fP\s0" % define => % "\v'-.15m'\f8\N'222'\fP\v'.15m'" % define C++ % "\f1C\h'-.14m'+\h'-.18'+\fP" % define <! % "\f8\N'225'\fP" % define >! % "\f8\N'241'\fP" % delim $$ .EN .ND "July 1, 1989" .TL Yacc: A Parser Generator\(dg .AU Stephen C. Johnson Ravi Sethi .AI .MH .AB Since the early 1970s, Yacc has been used to implement hundreds of languages, big and small. Its applications range from small desk calculators to medium-sized preprocessors for typesetting to large compiler front ends for complete programming languages. .PP A Yacc specification is based on a collection of grammar rules describing the syntax of a language; Yacc turns the specification into a syntax analyzer. A pure syntax analyzer merely checks whether or not an input string conforms to the syntax of the language. .PP We can go beyond pure syntax analysis by attaching code in C or $C++$ to a grammar rule; such code is called an action, and is executed whenever the rule is applied during syntax analysis. Thus, a desk calculator might use actions to evaluate an expression, and a compiler front end might use actions to emit intermediate code. .PP Yacc allows us to build parsers from LALR(1) grammars without necessarily learning the underlying theory. .AE .@tag SH _SH1_ .FS \(dg Prepared by R. Sethi from\&|reference(v7yacc)\&. S. C. Johnson is presently with Ardent Computer, 880 West Maude Ave., Sunnyvale, California 94086. .FE .SH _SH1_ "Introduction" .LP Yacc is a tool for building syntax analyzers, also known as .I parsers . This section introduces the basic features of Yacc. We review grammars, build a pure parser for real numbers, and then augment the parser to evaluate numbers during parsing. The language of real numbers is a toy; realistic examples appear in Section _SH2_. .PP Uppercase letters are distinct from lowercase letters in Yacc specifications. Thus, .CW digit and .CW DIGIT are distinct names. The font of a name is chosen purely for readability, so .CW fraction , and $fraction$ (within diagrams) refer to the same name. .SS "Further Reading" Yacc is designed to handle a single but significant part of the total job of building a translator or interpreter for a language. The remaining parts of the job must be implemented in a host programming language, presumed to be C|reference(Kernighan Ritchie 1988) or $C++$|reference(Stroustrup book 1986). .PP Another progam generator, designed to work in harmony with Yacc, is Lex|reference(latest lex), a tool developed for building lexical analyzers. Lex can be easily used to produce quite complicated lexical analyzers, but there remain some languages (such as Fortran) that do not fit any theoretical framework, and whose lexical analyzers must be crafted by hand. .PP Kernighan and Pike illustrate program development using Yacc and Lex by gradually extending an expression evaluator into an interpreter for a language comparable to Basic|reference(Kernighan Pike). Schreiner and Friedman conduct a book-length case study of how to create a compiler using Yacc and Lex|reference(Schreiner Friedman). .PP Textbooks on compilers often provide more information on the behavior and construction of parsers than will be covered here; see for example \&|reference(Aho Sethi Ullman 1986)\&. The algorithms underlying Yacc are also discussed in the survey of LR parsing by \&|reference(Aho Johnson surveys)\&. A feature that sets Yacc apart \(em its ability to build fast compact LR parsers from ambiguous grammars \(em is based on the theory developed by \&|reference(Aho Johnson Ullman ambiguous)\&. .PP Among the earliest applications of Yacc are EQN|reference(Kernighan Cherry 1975), a language for typesetting mathematics, and PCC, the Portable C Compiler|reference(Johnson portable acm). .SS "Grammars, Reviewed" The .I syntax of a language imposes a hierarchical structure, called a .I "parse tree" , on strings in the language. The following is a parse tree for the string .CW 3.14 in a language of real numbers: .KS .ps 9 .ft CB .PS [ hu = .5; vu = 1.5*vs; choph = vs; chopw = .7 boxht = vs; boxwid = .8*vs "$realNumber$" { ln(-1,-1); "$integerPart$" ln( 0,-1); "$digit$" ln( 0,-2); "3" }{ ln( 0,-4); "." }{ ln( 1.5,-1); "$fraction$" { ln(-0.5,-1); "$digit$" ln( 0,-2); "1" }{ ln( 0.5,-1); "$fraction$" ln( 0,-1); "$digit$" ln( 0,-1); "4" } } ] .PE .KE .PP The leaves at the bottom of a parse tree are labeled with .I terminals or .I tokens ; tokens represent themselves. By contrast, the other nodes of a parse tree are labeled with .I nonterminals . Each node in the parse tree is based on a rule that defines a nonterminal in terms of a sequence of terminals and nonterminals. Such rules are called .I productions . The root of the preceding parse tree is based on the following informally stated production: .Q{ A real number consists of an integer part, a point, and a fraction. .Q} This production is written as follows in the notation accepted by Yacc: .P{ realNumber : integerPart '.' fraction ; .P} .PP Together, the tokens, the nonterminals, the productions, and a distinguished nonterminal, called the .I "start symbol" , constitute a .I grammar for a language. Both tokens and nonterminals can be referred to as .I "grammar symbols" , or simply .I symbols . .SS "Grammars in Yacc Specifications" A Yacc specification has three sections for optional declarations, productions, and optional user-supplied programs. The productions are the heart of a specification; they are darker for emphasis in Figure _FI2_. The sections are separated by double percent .CW %% marks \(em the percent symbol is generally used by Yacc as an escape character. If the programs section is omitted, the second .CW %% mark can be omitted as well. .KF bottom .ps 9 .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .PS [ define text % [ B: box ht vs wid pagewid invis $1 ljust at B.w + 20/72,0 ] % down .ft CW text("%start lines") text("%%") .ft CB text("lines : /* empty */") text(" | lines realNumber '\en'") text(" ;") text("realNumber : integerPart '.' fraction") text(" ;") text("integerPart : digit") text(" | integerPart digit") text(" ;") text("fraction : digit") text(" | digit fraction") text(" ;") text("digit : '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'") text(" ;") .ft CW text("%%") text("#include <stdio.h>") text("int yylex() { return getchar(); }") text("int main() { return yyparse(); }") text("void yyerror(s) char *s; { fprintf(stderr, \"%s\en\", s); }") ] .PE .@tag FI _FI2_ .FI _FI2_ A complete Yacc specification for sequences of real numbers, one per line. .EF 0 .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .SP .KE .PP Blanks, tabs, and newlines are ignored. Comments can appear wherever a name can; they are enclosed between .CW /* and .CW */ , as in C. .PP It is possible, and desirable, for the start symbol of a grammar to be declared explicitly, using the .CW %start keyword, as in .P{ %start lines .P} Otherwise, the start symbol is taken from the first production in the specification. .PP A name in the production section is presumed to represent a nonterminal unless it is explicitly declared to represent a token. There are no token declarations in Figure _FI2_; later in this section, .CW DIGIT will be declared to be a token by writing .P{ %token DIGIT .P} .PP Single-character tokens need not be declared; a .I literal is a character enclosed in single quotes. As in C, the backslash .CW \e is an escape character within literals, and the following C escapes are recognized: .ft CB .TS lw4 0 l l. '\en' \f1newline\fP '\er' \f1return\fP '\e'' \f1single quote \fP'\f1\fP '\et' \f1tab\fP '\eb' \f1backspace\fP '\ef' \f1form feed\fP '\exxx' \f1``\fPxxx\f1'' in octal .TE For technical reasons, the .CW NUL character, .CW '\e0' or .CW 0 , should never be used in productions. .PP A production .P{ realNumber : integerPart '.' fraction ; .P} defines a nonterminal, called its .I "left side" , in terms of a sequence of grammar symbols, called its .I "right side" . A colon separates the two sides, and a semicolon marks the end of the production. The left side of this production is .CW realNumber . Its right side consists of .CW integerPart , the literal .CW '.' , and .CW fraction . .PP The right side of a production can be empty, as in the following production with no symbols between the colon and the terminating semicolon: .P{ lines : ; .P} .PP Productions with the same left side can be combined, and written with a vertical bar separating the right sides. The productions .P{ integerPart : digit ; integerPart : integerPart digit ; .P} can be rewritten equivalently as .P{ integerPart : digit | integerPart digit ; .P} In words, an integer part is either a single digit or a (smaller) integer part followed by a digit. Thus, an integer part consists of a string of one or more digits. .PP A fraction also consists of a string of one or more digits; however, the productions for .CW fraction impose a different hierarchical structure on strings of digits. The productions are .P{ fraction : digit | digit fraction ; .P} Note how a tree for .CW integerPart grows down to the left, whereas a tree for .CW fraction grows down to the right: .KS .ps 9 .ft CB .PS [ hu = .5; vu = 1.5*vs; choph = vs; chopw = .7 boxht = vs IntegerPart:\ [ "$integerPart$" { ln( 1,-1); "$digit$" ln( 0,-1); "3" } ln(-1,-1); "$integerPart$" { ln( 1,-1); "$digit$" ln( 0,-1); "2" } ln(-1,-1); "$integerPart$" { ln( 0,-1); "$digit$" ln( 0,-1); "1" }{ box invis wid .8 at Here } ] Fraction:\ [ "$fraction$" { ln(-1,-1); "$digit$" ln( 0,-1); "7" } ln( 1,-1); "$fraction$" { ln(-1,-1); "$digit$" ln( 0,-1); "8" } ln( 1,-1); "$fraction$" { ln( 0,-1); "$digit$" ln( 0,-1); "9" }{ box invis wid .6 at Here } ] with .e at IntegerPart.w + pagewid-2*20/72,0 ] .PE .KE .LP The different hierarchical structures imposed by .CW integerPart and .CW fraction on strings of digits facilitate the evaluation of real numbers, as we shall see later in this section. .SS "Using Yacc" Parsers generated by Yacc need to be supplemented before a self-contained application is built. Only four lines are devoted to such supplementary code at the end of Figure _FI2_. In other words, Yacc confines itself to generating fast parsers, leaving us with both the control and responsibility for all other aspects of an application. .PP The use of Yacc is illustrated in Figure _FI3_. The code of a Yacc-generated parser is a function, .CW yyparse , which returns 0 if the entire input is parsed successfully; otherwise, it returns 1. The function .CW yyparse gets tokens by repeatedly calling .CW yylex , a lexical analyzer. .KF bottom .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .nr PS 9 .ps 9 .vs 11 .PS [ fillval = 1 boxht = 2.5*vs; boxwid = .6 lineht = vs; linewid = .25 right box invis "character" "stream" wid .7 line -> box "lexical" "analyzer" fill line -> box invis "token" "stream" line -> box "syntax" "analyzer" fill { move to last box.n up line <- box "C" "compiler" line <- box "Yacc" fill line <- box invis ht 1.5*vs "grammar" } line -> box invis "optional" "output" ] .PE .@tag FI _FI3_ .FI _FI3_ Yacc handles the syntax analysis part of an application. .EF 0 .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .SP .KE .PP A .I "lexical analyzer" reads input characters and returns tokens. The simplest lexical analyzer returns each individual character as a token; it is defined by .P{ int yylex() { return getchar(); } .P} .PP Execution of a C program begins in a function called .CW main , so .CW main must call .CW yyparse. The code for .CW main in Figure _FI2_ .P{ int main() { return yyparse(); } .P} simply returns the result obtained from .CW yyparse ; more generally, .CW main might read command-line arguments and options before calling .CW yyparse . .PP If a syntax error occurs during parsing, .CW yyparse calls a user-supplied function .CW yyerror with a terse message, usually .CW "syntax error" .'' `` The following implementation of .CW yyerror prints the message : .P{ void yyerror(s) char *s; { fprintf(stderr, "%s\en", s); } .P} Parsing terminates when an error is detected unless ``error productions'' are used as described in Section _SH5_. .PP We have now examined all the pieces of the specification in Figure _FI2_ except .P{ #include <stdio.h> .P} which provides access to input/output facilities from a standard library. .PP When Yacc is applied to a specification, the output is a file of C programs, called .CW y.tab.c on most systems (the name might differ due to local file-system conventions). Suppose that the specification in Figure _FI2_ appears in a file .CW real.y . A program for reading a sequence of real numbers can then be constructed by the following .UX system commands: .ft CB .TS lw4 0 l8 l. yacc real.y \f2generates C code into file \&\fPy.tab.c cc y.tab.c \f2compiles executable program into file \&\fPa.out .TE .LP The compiled program .CW a.out silently reads real numbers, one per line, until the end of the input is reached, or until a syntax error occurs. The specification will now be extended to evaluate real numbers as they are read. .EQ delim off .EN .SS "Actions and Attributes" An action attached to a production is executed each time the production is applied during parsing. An .I action consists of one or more C statements, enclosed in curly braces .CW { and .CW } . Within an action, pseudo-variables starting with .CW $ signs refer to values associated with the symbols in the production. The pseudo-variable for the left side is .CW $$ ; the pseudo-variable for a symbol on the right side is formed by prefixing a .CW $ sign to its name or to its position. .PP The action in .P{ realNumber : integerPart fraction { $$ = $integerPart + $fraction; } ; .P} is the single statement .P{ $$ = $integerPart + $fraction; .P} .EQ delim @@ .EN It defines the value associated with the left side to be the sum of the values associated with the two symbols on the right side.@"" sup _FS#_@ .FS .@tag FS _FS#_ @"" sup _FS#_@ When the same symbol @s@ appears several times on the right side, its pseudo-variable can be written as .CW $@s@#1 , .CW $@s@#2 , .CW $@s@#3 . .FE Using pseudo-variables formed from positions on the right side, instead of names, the action can be rewritten equivalently as .EQ delim off .EN .P{ realNumber : integerPart fraction { $$ = $1 + $2; } ; .P} This production and action are from the complete specification in Figure _FI2A_. .PP Earlier versions of Yacc support only positional pseudo-variables like .CW $1 and .CW $2 . .PP The type of the values associated with grammar symbols is given by .CW YYSTYPE , defined to be .CW double in Figure _FI2A_. The default type for .CW YYSTYPE is .CW int . Any C code in the declarations section must be enclosed between .CW %{ and .CW %} . See Section _SH6_ for how to associate different types of values with different grammar symbols. .KF .ps 9 .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .PS [ define text % [ B: box ht vs wid pagewid invis $1 ljust at B.w + 20/72,0 ] % down .ft CW text("%token DIGIT") text("%start lines") text("%{") text("#def\&ine YYSTYPE double") text("%}") text("%%") .ft CB text("lines : lines realNumber '\en'") .ft CW text(" { printf(\"%g\en\", $realNumber); }") .ft CB text(" | /* empty */") text(" ;") text("realNumber : integerPart '.' fraction") .ft CW text(" { $$ = $integerPart + $fraction; }") .ft CB text(" ;") text("integerPart : DIGIT") text(" | integerPart DIGIT") .ft CW text(" { $$ = $integerPart*10 + $DIGIT; }") .ft CB text(" ;") text("fraction : DIGIT") .ft CW text(" { $$ = $DIGIT*0.1; }") .ft CB text(" | DIGIT fraction") .ft CW text(" { $$ = ($DIGIT + $fraction)*0.1; }") .ft CB text(" ;") .ft CW text("%%") text("#include <stdio.h>") text("#include <ctype.h>") text("int yylex() {") text(" int c;") text(" c = getchar();") text(" if( ! isdigit(c) ) return c;") text(" yylval = c - '0';") text(" return DIGIT;") text("}") text("int main() { return yyparse(); }") text("void yyerror(s) char *s; { fprintf(stderr, \"%s\en\", s); }") ] .PE .@tag FI _FI2A_ .FI _FI2A_ The actions in this Yacc specification evaluate real numbers during parsing. .EF 0 .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .SP .KE .PP The tree in Figure _FI2C_ illustrates the evaluation of the real number .CW 321.789 . (The tree is technically not a parse tree because its root is not for the start symbol .CW lines , and the nodes are labeled with values rather than grammar symbols.) .KF .EQ delim $$ .EN .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .ps 9 .ft CB .PS [ define intp % { "$f2($integerPart)^=^$1$" } % define frac % { "$f2($fraction)^=^$1$" } % define DIGIT % { "$Small($DIGIT)^=^$1$" } % hu = .6; vu = 2.5*vs; choph = vs; chopw = .7 boxht = vs "$f2($realNumber)^=^321.789$" { ln(-2,-1); intp(321) { ln( 1,-1); DIGIT(1) } ln(-1,-1); intp(32) { ln( 1,-1); DIGIT(2) } ln(-1,-1); intp(3) { ln( 0,-1); DIGIT(3) }{ box invis wid .85 at Here } }{ ln( 2,-1); frac(0.789) { ln(-1,-1); DIGIT(7) } ln( 1,-1); frac(0.89) { ln(-1,-1); DIGIT(8) } ln( 1,-1); frac(0.9) { ln( 0,-1); DIGIT(9) }{ box invis wid .8 at Here } } ] .PE .@tag FI _FI2C_ .FI _FI2C_ Attribute values during the evaluation of .CB 321.789 . .EF 0 .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .SP .EQ delim off .EN .KE .PP A value associated with a node in a parse tree is called an .I attribute . For the moment, the attribute at a node will be defined solely in terms of the attributes at the children of the node. .PP Attributes for tokens are defined by the lexical analyzer. Conceptually, a lexical analyzer returns a pair, consisting of a token and an associated attribute value. For example, in Figure _FI2A_, .CW DIGIT is a token. When the lexical analyzer reads the character .CW 1 it returns .CW DIGIT with attribute value 1, when it reads .CW 2 it returns .CW DIGIT with attribute value 2, and so on. .PP Specifically, the parser .CW yyparse expects the lexical analyzer .CW yylex to leave the attribute value in a global variable .CW yylval . The function .CW yylex in Figure _FI2A_ assigns a value to .CW yylval just before it returns the token .CW DIGIT . .PP Starting in the bottom-left corner of Figure _FI2C_, the lexical analyzer sets the attribute value 3 at the leftmost leaf for the token .CW DIGIT . The parent of this leaf is based on the production and action .P{ integerPart : DIGIT { $$ = $1; } ; .P} This action is omitted in Figure _FI2A_ because, by default, the parser sets the attribute of the left side to that of the first symbol on the right side. .PP Working up the tree, the next node is based on .P{ integerPart : integerPart DIGIT { $$ = $integerPart + $DIGIT; } ; .P} .PP The effect of the actions is perhaps easier to see at the nodes for .CW fraction . At the only node based on .P{ fraction : DIGIT { $$ = $DIGIT*0.1; } ; .P} the value of .CW $DIGIT is 9 and the value of .CW $fraction is 0.9. .PP Since Yacc generates bottom-up parsers, bottom-up evaluation of attribute values fits naturally with parsing. Actions attached to productions are examined further in Section _SH2_; their execution order becomes significant when they do input/output, assign values to variables, call functions with side effects, or otherwise affect the state of a computation. .SS "A Style for Specifications" Choose a style that makes the productions visible through the morass of action code. .SP .5 .IP a. 4 Use all capital letters for token names, all lower case letters for nonterminal names. This hint comes under the heading of ``knowing who to blame when things go wrong.'' .SP .25 .IP b. Put productions and actions on separate lines. Either can then be changed independently. .SP .25 .IP c. Put all productions with the same left side together. Put the left side in only once, and let all following productions begin with a vertical bar. .SP .25 .IP d. Put a semicolon only after the last production with a given left side, and put the semicolon on a separate line. New productions can then be easily added. .SP .25 .IP e. Indent production bodies by one tab stop, and action bodies by two tab stops. .PP These style hints owe much to Brian Kernighan. The examples in this paper sometimes deviate from this style to conserve space. .EQ delim $$ .EN .@tag SH _SH2_ .SH _SH2_ "Evaluation And Translation Of Expressions" Both actions and productions must be considered when a Yacc specification is designed. Without actions, a parser would silently analyze input strings, complaining only if it detects an error. With suitable actions, a parser can become an evaluator or a translator. Typically, the desired actions influence the choice of productions. .PP For example, the actions for evaluating the integer and fractional parts of a real number motivate different productions for the sequences of digits represented by .CW integerPart and .CW fraction in Section _SH1_. In the integer part, the contribution of a digit depends on the number of digits to its right; the contribution of .CW 3 in .CW 321.789 is 300, where the number of zeros depends on the number of digits to its right. In the fractional part, however, the contribution of a digit depends on the number of digits to its left; the contribution of .CW 9 in .CW 321.789 is 0.009, where the number of zeros depends on the number of digits to its left. .PP Arithmetic expressions are a fertile source of examples for Yacc because the specifications of expression evaluators and translators are quite short and the ideas carry over to richer languages. This section begins with a specification for an expression evaluator. The evaluator benefits from Yacc's facilities for specifying the associativity and precedence of operators within expressions. The next example, a translator from infix into postfix notation, illustrates parsing order. .SS "Grammars for Expressions" Expressions are characterized by the operators within them; the choice of productions for expressions depends on the associativity and precedence of operators. .PP An operator .CW OP is .I "left associative" if an expression .P{ expr$"" sub 1$ OP expr$"" sub 2$ OP expr$"" sub 3$ .P} is evaluated as if it were parenthesized as .P{ (expr$"" sub 1$ OP expr$"" sub 2$) OP expr$"" sub 3$ .P} Similarly, the operator is .I "right associative" if the expression is evaluated as if it were parenthesized as .P{ expr$"" sub 1$ OP (expr$"" sub 2$ OP expr$"" sub 3$) .P} .PP An operator .CW OPA has .I "lower precedence" than an operator .CW OPB if the following equivalences hold (that is, the expressions to the left and right of the $==$ signs have the same values): .ft CB .TS lw4 0 r2 c2 l. expr$"" sub 1$ OPA expr$"" sub 2$ OPB expr$"" sub 3$ $==$ expr$"" sub 1$ OPA (expr$"" sub 2$ OPB expr$"" sub 3$) expr$"" sub 1$ OPB expr$"" sub 2$ OPA expr$"" sub 3$ $==$ (expr$"" sub 1$ OPB expr$"" sub 2$) OPA expr$"" sub 3$ .TE .PP The traditional grammar for arithmetic expressions uses three nonterminals .CW expr , .CW term , and .CW factor , where .CW expr represents an expression and .CW term and .CW factor represent subexpressions: .P{ %token NUMBER VAR %% expr : expr '+' term | expr '-' term | term ; term : term '*' factor | term '/' factor | factor ; factor : NUMBER | VAR | '(' expr ')' ; .P} In words, an expression is a sequence of terms separated by .CW + or .CW - signs. A term is a sequence of factors separated by .CW * or .CW / signs. Thus, .P{ b*b - 4*a*c .P} is an expression containing two terms .CW b*b and .CW 4*a*c . The term .CW 4*a*c has three factors .CW 4 , .CW a , and .CW c . A factor is either a number, a variable, or a parenthesized expression. .PP Ths grammar for expressions dates back to 1960 when Backus\&|reference(Backus 1960 %no_cite)\& introduced BNF, a notation for writing grammars. .PP A desk calculator can be based on this grammar by adding actions, as in .EQ delim off .EN .P{ expr : expr '-' term { $$ = $expr - $term; } .P} .EQ delim $$ .EN This production and action respect the left associativity of the minus operator, and correctly evaluate .CW 7-1-2 to 4 \(em check it by drawing a parse tree. .PP The traditional grammar generalizes to operators at $n^>=^1$ precedence levels; all operators at the same level have the same associativity and precedence. Set up a nonterminal .CW expr$"" sub i$ for precedence level $i$, with level $1$ being the lowest. In the traditional grammar, the nonterminals .CW expr , .CW term , and .CW factor correspond to .CW expr$"" sub 1$ , .CW expr$"" sub 2$ , and .CW expr$"" sub 3$ , respectively. If the operators at level $i^<^n$ are left associative, then the productions for .CW expr$"" sub i$ have the form .P{ expr$"" sub i$ : expr$"" sub i$ OP expr$"" sub i+1$ ; .P} Here, .CW OP represents an operator at level $i$. Otherwise, if the operators at level $i$ are right associative, then the productions have the form .P{ expr$"" sub i$ : expr$"" sub i+1$ OP expr$"" sub i$ ; .P} .PP At each precedence level $i^<^n$, there is an additional production of the form .P{ expr$"" sub i$ : expr$"" sub i+1$ ; .P} .SS "Associativity and Precedence Declarations" Yacc has special facilities for declaring the associativity and precedence of operators. They will be introduced by considering the specification in Figure _PREC_. .KF .EQ delim off .EN .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .ps 9 .ft CB .PS [ define text @ [ B: box ht vs wid pagewid invis $1 ljust at B.w + 20/72,0 ] @ down text("%{") text("#def\&ine YYSTYPE double") text("%}") text("%token NUMBER") text("%left '+' '-'") text("%left '*' '/'") text("%right '^'") text("%left '~' UMINUS") text("%%") text("lines : lines expr '\en' { printf(\"%g\en\", $expr); }") text(" | lines '\en'") text(" | /* empty */") text(" ;") text("expr : expr '+' expr { $$ = $1 + $3; }") text(" | expr '-' expr { $$ = $1 - $3; }") text(" | expr '*' expr { $$ = $1 * $3; }") text(" | expr '/' expr { $$ = $1 / $3; }") text(" | '-' expr %prec UMINUS { $$ = - $expr; }") text(" | '~' expr { $$ = - $expr; }") text(" | expr '^' expr { $$ = pow($1, $3); }") text(" | '(' expr ')' { $$ = $expr; }") text(" | NUMBER") text(" ;") ] .PE .@tag FI _PREC_ .FI _PREC_ An expression evaluator based on precedence declarations for tokens. .EF 0 .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .SP .EQ delim $$ .EN .KE .PP The tokens of the evaluator in Figure _PREC_ are parentheses, .CW NUMBER , and the operators .P{ \&'+' '-' '*' '/' '^' '~' UMINUS .P} (Ignore .CW ~ and .CW UMINUS for the moment.) .PP The associativity and precedence of tokens are declared on lines beginning with one of the keywords .CW %left , .CW %right , or .CW %nonassoc . All of the tokens on a line have the same associativity and precedence; they have lower precedence than operators on successive lines. These declarations will be referred to as .I precedence declarations. .PP (The keyword .CW %nonassoc is used to describe operators that must not associate with themselves; for example, .P{ A .LT. B .LT. C .P} is illegal in Fortran because .CW .LT. is nonassociative.) .PP The precedence declarations .P{ %left '+' '-' %left '*' '/' %right '^' .P} specify that .CW + and .CW - are left associative and have lower precedence than the left-associative operators .CW * and .CW / , which in turn have lower precedence than the right-associative operator .CW ^ . .PP The operator .CW ^ in the grammar represents exponentiation, as in .CW 2^3 , which evaluates to 8. This operator is right associative; thus, .CW 2^2^3 is equivalent to .CW 2^(2^3) . .PP The nonterminals of the grammar are .CW lines and .CW expr . The grammar expects a sequence of expressions, on separate lines. A typical production for .CW expr has the form .EQ delim off .EN .P{ expr : expr '+' expr { $$ = $1 + $3; } .P} Alternatively, we can write the action as .P{ expr : expr '+' expr { $$ = $expr#1 + $expr#2; } .P} .EQ delim $$ .EN .PP With these precedence declarations, the expression .P{ 2 ^ 2 ^ 3 * 4 - 5 * 6 - 7 * 8 .P} is evaluated as if it were parenthesized as .P{ ( (2^(2^3))*4 - 5*6 ) - 7*8 .P} .PP The programs section for the evaluator is in Figure _PRECB_. The lexical analyzer .CW yylex returns a token each time it is called. It skips blanks. If it sees a digit or a decimal point, it returns the token .CW NUMBER after using the C library function .CW scanf to read a number into the global variable .CW yylval (as mentioned in Section 1, the attribute value, if any, associated with a token must be left in .CW yylval ). Otherwise, the lexical analyzer returns a character as a token. .KF .EQ delim off .EN .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .ps 9 .ft CB .PS [ define text @ [ B: box ht vs wid pagewid invis $1 ljust at B.w + 20/72,0 ] @ down text("%%") text("#include <stdio.h>") text("#include <ctype.h>") text("#include <math.h>") text("int yylex() {") text(" int c;") text(" while ( ( c = getchar() ) == ' ' );") text(" if ( (c == '.') || (isdigit(c)) ) {") text(" ungetc(c, stdin);") text(" scanf(\"%le\", &yylval);") text(" return NUMBER;") text(" }") text(" return c;") text("}") text("int main() { return yyparse(); }") text("void yyerror(s) char * s; { fprintf(stderr, \"%s\en\", s); }") ] .PE .@tag FI _PRECB_ .FI _PRECB_ Programs section for the evaluator in Figure _PREC_. .EF 0 .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .SP .EQ delim $$ .EN .KE .PP The parser uses precedence declarations to decide when to apply a production. Suppose that the input has the form .P{ expr * expr \&\f1$...$\fP .P} and the parser has to decide whether or not to apply the multiplication production .P{ expr : expr '*' expr .P} If the next symbol in the input is .CW + , as in .P{ expr * expr + \&\f1$...$\fP .P} the parser applies the multiplication production because the token .CW + has lower precedence than .CW * . However, if the next symbol in the input is .CW ^ , the parser defers the multiplication production because .CW ^ has higher precedence than .CW * . .PP The treatment of the unary minus operator deserves special mention. The evaluator in Figure _PREC_ accepts the expression .CW 10^-1 in lieu of $10 sup -1~==~0.1$. In other words, .CW 10^-1 is treated like .CW 10^(-1) , with unary minus having higher precedence than .CW ^ . To help distinguish between the two uses of the minus operator, the evaluator recognizes .CW ~ as a synonym for unary minus. Unary .CW ~ has the highest precedence and binary .CW - has the lowest precedence in Figure _PREC_. The expression .CW 3-10^~1 is therefore equivalent to .CW 3-(10^(~1)) . .PP The keyword .CW %prec in the production .P{ expr : '-' expr %prec UMINUS .P} specifies that the precedence of the unary minus operator is taken from that of the token .CW UMINUS . Since .CW ~ has this same precedence, .CW 3-10^-1 is equivalent to .CW 3-(10^(-1)) . .PP The .CW %prec keyword is used to override the precedence of the tokens on the right side of a production. Without .CW %prec the parser uses the precedence of the last token on the right side to decide whether to apply a production. Thus, the .CW %prec in .P{ expr : '(' TYPENAME ')' expr %prec TYPENAME .P} is necessary for the production to take the precedence of token .CW TYPENAME ; otherwise the production takes its precedence from the closing parenthesis. .PP It is recommended that precedence declarations be used in a ``cookbook'' fashion, until some experience is gained. How Yacc uses precedence declarations is examined further in Section _SH4_. .SS "Execution Order for Actions" The execution order of actions is significant because actions can have side effects. One way to visualize the order is to imagine a traversal of a parse tree in which the children of each node are visited depth-first from left to right, starting at the root. Suppose a node has two children $c$ and $d$, with $c$ to the left of $d$. .I Depth-first implies that all the nodes in the subtree for $c$ are visited before any nodes are visited in the subtree for $d$. The tree in Figure _FI.ACTIONS_ includes actions as pseudo-symbols, attached by dashed lines. Actions are executed in the order they would be visited in a depth-first left-to-right traversal. .KF .EQ define Small % size 7 "$1" % .EN .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .nr PS 9 .ps 9 .PS [ define NUM % { "$Small(NUM)$" "$Small($NUM)^=^$1$" at Here + 0,-vs } % hu = .3; vu = vs; choph = vs; chopw = .7 boxht = vs "$expr$" { ln(-5,-3); "$expr$" G1:"" { ln(-2,-2); NUM(2) G2:"" }{ ln( 1,-2, dashed .04); "{print $Small($NUM)$}" G3:"" } }{ ln(-1,-3); "+" }{ ln( 2,-3); "$expr$" { ln(-3,-4); "$expr$" { ln(-2,-2); NUM(3) }{ ln( 1,-2, dashed .04); "{print $Small($NUM)$}" } }{ ln(-.5,-4); "\(**" }{ ln( 2.5,-4); "$expr$" { ln(-2,-4); NUM(5) }{ ln( 1,-4, dashed .04); "{print $Small($NUM)$}" } }{ ln( 5,-4, dashed .04); "{print \(**}" box invis wid .6 ht vs at Here } }{ ln( 6,-3, dashed .04); "{print +}" } H1: G1 + -3*hu,0 H2: G2 + -5*hu,-3*vu H3: G2.x, H2.y .ps 36 spline -> from H1 to H2 to H3 .ps\n(PS ] .PE .@tag FI _FI.ACTIONS_ .FI _FI.ACTIONS_ Actions are executed in a depth-first left-to-right order. .EF 0 .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .SP .KE .PP The parse tree in Figure _FI.ACTIONS_ is based on the following specification of an infix-to-postfix translator: .EQ delim off .EN .P{ %token NUM %left '+' %left '*' %% expr : expr '+' expr { printf(" +"); } | expr '*' expr { printf(" *"); } | '(' expr ')' | NUM { printf(" %d", $NUM); } ; .P} .EQ delim $$ .EN The translation is emitted incrementally during parsing, so the execution order of the print statements is critical. The translation of .CW 2+3*5 is .P{ 2 3 5 * + .P} .SS "Actions Embedded Within Rules" Yacc permits an action to be written in the middle of a production as well as at the end; such actions are called .I embedded actions. .PP Embedded actions are useful for keeping track of context information. In the EQN grammar, the context includes the current point size. The EQN input .P{ E sub 1 .P} is typeset as $E sub 1$. Note that the subscript $1$ has a smaller point size than $E$. Nonterminal .CW box in the following production represents an EQN construct. The current point size is maintained in variable .CW ps . The embedded action .CW ps$^$-=$^$del reduces the point size before the subscript box is processed; the other action .CW ps$^$+=$^$del restores the point size. Token .CW SUB represents the input characters .CW sub . .P{ box : box { ps -= del; } SUB box { ps += del; } .P} .PP Embedded actions are implemented by manufacturing a fresh nonterminal with one production deriving the empty string. Yacc actually treats this EQN example as if it had been written: .EQ delim off .EN .P{ box : box $ACT SUB box { ps +=del; } ; $ACT : /* empty */ { ps -= del; } .P} Here .CW $ACT is a fresh nonterminal. .PP Section _SH6_ considers the use of embedded actions to pass left-to-right context in a Yacc specification. .PP Embedded actions can have associated attribute values; within an embedded action, .CW $$ refers to its attribute value, not to the attribute of the left side of the production. Thus, 1 is the attribute value for the embedded action in .P{ a : b { $$ = 1; } c { x = $2; y = $c; } .P} The embedded action is at the second position in the right side, so its value is referred to as .CW $2 in the assignment .CW x=$2 . Note that .CW c is at position 3, so the above production can be rewritten using .CW $3 instead of .CW $c : .P{ a : b { $$ = 1; } c { x = $2; y = $3; } .P} .EQ delim $$ .EN .@tag SH _SH3_ .SH _SH3_ "How The Parser Works" Some familiarity with parsing is helpful in deciphering messages from Yacc about ``shift/reduce'' and ``reduce/reduce'' conflicts. Such messages merit investigation; they are a warning that the generated parser will make choices that may or may not be consistent with the grammar designer's intent. .PP When invoked with the .CW -v (for verbose) option, Yacc places a human-readable description of the generated parser into the file .CW y.output . This section deals with the parsing background behind .CW y.output files; parsing conflicts themselves are considered in Section _SH4_. .PP The running example in this section is the following grammar for real numbers: .P{ %token D P %% real : intp P frac ; intp : D | intp D ; frac : D | D frac ; .P} The abbreviated names .CW intp for integer part, .CW frac for fraction, and .CW D for digit, conserve space in diagrams. The use of token .CW P for a decimal point .CW '.' avoids confusion with other uses of dots within .CW y.output files. .SS "Shift-Reduce Parsing" A .I "bottom-up" parser works from the leaves (bottom) of a parse tree towards the root. The following sequence of tree snapshots illustrates a bottom-up parse of the token stream .CW DDPDD , corresponding to the real number .CW 21.89 . For the moment, the digit represented by a token .CW D appears below the token. The trees are .KS .ps 9 .ft CB .PS [ define D % { "\f2\s8D\s0\fP" "\s6$1\s0" at Here + 0,-.5*vs } % define Pt % { "\f2\s8P\s0\fP" "." at Here + 0,-.5*vs } % hu = .125; vu = 1.5*vs; choph = vs; chopw = .4 boxht = vs; boxwid = .8*vs; movewid = .5 S1: [ # "$real$" { ln(-1,-1, invis); # "$intp$" { ln(-1,-1, invis); # "$intp$" ln( 0,-1, invis); D(2) }{ ln( 0,-2, invis); D(1) } }{ ln( 0,-3, invis); Pt }{ ln( 1,-1, invis); # "$frac$" { ln( 0,-2, invis); D(8) }{ ln( 1,-1, invis); # "$frac$" ln( 0,-1, invis); D(9) } } ] move S2: [ # "$real$" { ln(-1,-1, invis); # "$intp$" { ln(-1,-1, invis); "$intp$" ln( 0,-1); D(2) }{ ln( 0,-2, invis); D(1) } }{ ln( 0,-3, invis); Pt }{ ln( 1,-1, invis); # "$~~frac$" { ln( 0,-2, invis); D(8) }{ ln( 1,-1, invis); # "$frac$" ln( 0,-1, invis); D(9) } } ] move S3: [ # "$real$" { ln(-1,-1, invis); "$intp~$" { ln(-1,-1); "$intp$" ln( 0,-1); D(2) }{ ln( 0,-2); D(1) } }{ ln( 0,-3, invis); Pt }{ ln( 1,-1, invis); # "$~~frac$" { ln( 0,-2, invis); D(8) }{ ln( 1,-1, invis); # "$frac$" ln( 0,-1, invis); D(9) } } ] move S4: [ # "$real$" { ln(-1,-1, invis); "$intp~$" { ln(-1,-1); "$intp$" ln( 0,-1); D(2) }{ ln( 0,-2); D(1) } }{ ln( 0,-3, invis); Pt }{ ln( 1,-1, invis); # "$frac$" { ln( 0,-2, invis); D(8) }{ ln( 1,-1, invis); "$frac$" ln( 0,-1); D(9) } } ] move S5: [ # "$real$" { ln(-1,-1, invis); "$intp~~$" { ln(-1,-1); "$intp$" ln( 0,-1); D(2) }{ ln( 0,-2); D(1) } }{ ln( 0,-3, invis); Pt }{ ln( 1,-1, invis); "$~~frac$" { ln( 0,-2); D(8) }{ ln( 1,-1); "$frac$" ln( 0,-1); D(9) } } ] move S6: [ "$real$" { ln(-1,-1); "$intp~~$" { ln(-1,-1); "$intp$" ln( 0,-1); D(2) }{ ln( 0,-2); D(1) } }{ ln( 0,-3); Pt }{ ln( 1,-1); "$~~frac$" { ln( 0,-2); D(8) }{ ln( 1,-1); "$frac$" ln( 0,-1); D(9) } } ] "$=>$" at .5<S1.e, S2.w> "$=>$" at .5<S2.e, S3.w> "$=>$" at .5<S3.e, S4.w> "$=>$" at .5<S4.e, S5.w> "$=>$" at .5<S5.e, S6.w> ] .PE .KE .LP Let us redraw these trees to line up their uncovered portions; that is, the roots of the completed subtrees. The redrawn trees are .KS .ps 9 .PS [ define D % { "\f2\s8D\s0\fP" } % define Pt % { "\f2\s8P\s0\fP" # "." at Here + 0,-.5*vs } % hu = .125; vu = 1.5*vs; choph = vs; chopw = .4 boxht = vs; boxwid = .8*vs; movewid = hu; sep = .5 S1: [ D(1); move; D(2); move; Pt; move; D(8); move; D(9) ] S2: [ { "$intp~~$"; ln( 0,-1); D(1) } move D(2); move; Pt; move; D(8); move; D(9) ] with .nw at S1.ne + sep,0 S3: [ { "$intp~~$" { ln(-1,-1); "$intp$" ln( 0,-1); D(1) }{ ln( 0,-2); D(2) } } move Pt; move; D(8); move; D(9) ] with .nw at S2.ne + sep,0 S4: [ { "$intp~~$" { ln(-1,-1); "$intp$" ln( 0,-1); D(1) }{ ln( 0,-2); D(2) } } move Pt; move; D(8) move { "$~~frac$"; ln( 0,-1); D(9) } ] with .nw at S3.ne + sep,0 S5: [ { "$intp~~$" { ln(-1,-1); "$intp$" ln( 0,-1); D(1) }{ ln( 0,-2); D(2) } } move Pt move { "$~~frac$" { ln( 0,-2); D(8) }{ ln( 1,-1); "$frac$" ln( 0,-1); D(9) } } ] with .nw at S4.ne + sep,0 S6: [ "$real$" { ln(-1,-1); "$intp~~$" { ln(-1,-1); "$intp$" ln( 0,-1); D(1) }{ ln( 0,-2); D(2) } }{ ln( 0,-3); Pt }{ ln( 1,-1); "$~~frac$" { ln( 0,-2); D(8) }{ ln( 1,-1); "$frac$" ln( 0,-1); D(9) } } ] with .nw at S5.ne + sep,0 "$=>~$" at .5<S1.ne, S2.nw> "$=>$" at .5<S2.ne, S3.nw> "$=>$" at .5<S3.ne, S4.nw> "$~~=>$" at .5<S4.ne, S5.nw> "$=>$" at .5<S5.ne, S6.nw> ] .PE .KE .PP Shift-reduce parsers store only the uncovered portions of a parse tree, as in the following snapshots, obtained from the preceding sequence of trees: .KS .ps 9 .PS [ define N % box invis $1 wid .25 % define D % box invis "\f2\s8D\s0\fP" wid .125 % define P % box invis "\f2\s8P\s0\fP" wid .125 % define derives % box invis wid .4 "$=>$" % boxht = vs; boxwid = .8*vs [ D; D; P; D; D ] derives [ N("$intp$"); D; P; D; D ] derives [ N("$intp$"); P; D; D ] derives [ N("$intp$"); P; D; N("$frac$") ] derives [ N("$intp$"); P; N("$frac$") ] derives [ N("$real$") ] ] .PE .KE .PP Such derivations can be broken down into sequences of shift and reduce actions. A .I shift action advances the parser to the next unexamined input token; such tokens are called .I lookahead symbols. A .I reduce action replaces the right side of a production by its left side. The key problem of shift-reduce parsing is that of deciding when to shift and when to reduce; Yacc generates tables for this purpose that are explained later in this section. .EQ delim off .EN .PP A shift-reduce parse of the input token stream .CW DDPDD appears in Figure _FI.SHIFTREDUCE_. A special token .CW $end marks the end of the input. Within each snapshot, a pointer appears before the current lookahead symbol. The first action, a shift, advances the pointer past the leftmost .CW D : .EQ delim $$ .EN .KF .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .ps 9 .PS [ define N % box invis $1 wid .25 % define D % box invis "\f2\s8D\s0\fP" wid .125 % define P % box invis "\f2\s8P\s0\fP" wid .125 % define E % box invis "$f2($end)$" wid .3 % define action % Act: [ right B: box invis wid actionwid PTR: box invis wid .1 $1 ljust at B.w ] with .PTR.n at S.PTR.s % define shift % action("shift") % define reduce % action("reduce") % define ptr % PTR: box invis wid .1; { move left .025; up; line <- } % define handle % { line from last box.sw to last box.se } % boxht = vs arrowht = .05; arrowwid = .025; lineht = .75*vs; dy = vs down actionwid = .9 T1: S:[ right ptr; D; D; P; D; D; E ] shift S:[ right D; handle; ptr; D; P; D; D; E ] with .PTR.n at Act.PTR.s reduce S:[ right N("$intp$"); ptr; D; P; D; D; E ] with .PTR.n at Act.PTR.s shift S:[ right N("$intp$"); handle; D; handle; ptr; P; D; D; E ] with .PTR.n at Act.PTR.s reduce S:[ right N("$intp$"); ptr; P; D; D; E ] with .PTR.n at Act.PTR.s shift B1: S:[ right N("$intp$"); P; ptr; D; D; E ] with .PTR.n at Act.PTR.s actionwid = 1.2 T2: S:[ right box invis wid actionwid PTR: box invis wid .05 ] with .sw at T1.se + 1.2,0 shift S:[ right N("$intp$"); P; D; ptr; D; E ] with .PTR.n at Act.PTR.s shift S:[ right N("$intp$"); P; D; D; handle; ptr; E ] with .PTR.n at Act.PTR.s reduce S:[ right N("$intp$"); P; D; handle; N("$frac$"); handle; ptr; E ] with .PTR.n at Act.PTR.s reduce S:[ right N("$intp$"); handle; P; handle; N("$frac$"); handle; ptr; E ] with .PTR.n at Act.PTR.s reduce B2: S:[ right N("$real$"); ptr; E ] with .PTR.n at Act.PTR.s G: .5<T1.se, T2.sw> + 0,vs line from G to G.x, B2.s.y ] .PE .@tag FI _FI.SHIFTREDUCE_ .FI _FI.SHIFTREDUCE_ Shift-reduce parsing of .CB DDPDD . .EF .nf \s5\l'\n(LLu\&\(ul'\s0 .EF 0 .fi .SP .ps 9 .PS define N % box invis $1 wid .25 % define D % box invis "\f2\s8D\s0\fP" wid .125 % define P % box invis "\f2\s8P\s0\fP" wid .125 % define E % box invis wid .05; box invis "$f2($end)$" wid .25 % define shift % box invis wid .8 "shift" % define reduce % box invis wid .8 "reduce" % define lookahead % box invis wid .1; { move left .05; up; line <- } % define handle % { line from last box.sw to last box.se } % [ boxht = vs arrowht = .05; arrowwid = .025; lineht = .75*vs lookahead; D; D; P; D; D; E shift D; lookahead; D; P; D; D; E ] .PE .LP The second action reduces the token .CW D immediately to the left of the pointer (right sides to be reduced are underlined for clarity). The reduction replaces the right side .CW D by the left side .CW intp : .ps 9 .PS [ boxht = vs arrowht = .05; arrowwid = .025; lineht = .75*vs D; handle; lookahead; D; P; D; D; E reduce N("$intp$"); lookahead; D; P; D; D; E ] .PE .LP After the next .CW D is shifted, the right side .CW intp$~$D is reduced to the left side .CW intp : .ps 9 .PS [ boxht = vs arrowht = .05; arrowwid = .025; lineht = .75*vs N("$intp$"); handle; D; handle; lookahead; P; D; D; E reduce N("$intp$"); lookahead; P; D; D; E ] .PE .LP Successive shift actions now advance the lookahead pointer all the way to the endmarker. Finally, a sequence of reduce actions completes the parse. .PP It is no accident that a right side to be reduced always appears immediately to the left of the pointer in Figure _FI.SHIFTREDUCE_. This observation is the basis for a stack-implementation of shift-reduce parsing. Informally, the symbols to the left of the pointer are held on a stack, so a right side to be reduced appears at the top of the stack. .PP Parsers generated by Yacc use a stack; however, instead of symbols the stack holds states. It is these states that are displayed in a .CW y.output file. .SS "Parser States" How does a parser know that a digit to the left of a decimal point reduces to .CW intp , but that a digit to the right reduces to .CW frac ? The parser uses states to summarize prior parsing actions. Some of the states of the real-number parser are (informally) .SP .5 .IP 0. 4 .I "The starting state" . The very first token must be a .CW D ; it reduces to .CW intp . .SP .5 .IP 2. .I "Within the integer part" . An .CW intp has been seen. The lookahead token must either be a .CW D (another digit in the integer part) or a .CW P (the decimal point). .SP .5 .IP 7. .I "Within the fraction part" . Shift as long as the lookahead token is a .CW D . Otherwise, reduce the last .CW D to .CW frac . .SP .5 .PP State 2 is displayed as follows in the .CW y.output file for the real-number grammar: .P{ state 2 real : intp.P frac intp : intp.D .sp .5 .ft CW D shift 5 P shift 4 . error .P} .LP A .I state consists of a collection of items, where an .I item is a production with a dot inserted in the right side\(emsome versions of Yacc use an underscore to mark the position of the dot. The items of state 2 are .P{ real : intp.P frac intp : intp.D .P} In this state, the parser shifts on lookahead .CW P . The shift is recorded by (conceptually) moving the dot past .CW P to obtain the item .P{ real : intp P.frac .P} This item is the sole item in state 4. In words, state 4 records that a .CW P has been seen and that the incoming tokens are now expected to match .CW frac . .PP The states and their transitions constitute an automaton. The automaton for the real-number grammar appears in Figure _FI.GOTO_. The solid arrows are for shift transitions, due to tokens. The dashed arrows, for transitions due to nonterminals, are used during reductions. .KF .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .nr PS 9 .ps 9 .PS [ define state % if "$4"!="" then 'nitems=$4' else 'nitems=1' S$1: box fill at O + $2*hu, $3*vu ht nitems*vs "$1" at S$1.n + 0,vs/2 % arcrad = vs/3; boxht = vs; boxwid = 1.2; fillval = 1 dashwid = .03 hu = .7; vu = 4*vs O: "" state(0, 0, 0) box invis "$f2($accept)^:~cdot~real~f2($end)$" at S0 state(1, 3, 0) box invis "$f2($accept)^:~real~cdot~f2($end)$" at S1 state(2, 1,-1, 2) [ down B1: box invis B2: box invis "$real^:~intp~cdot~P~frac$" ljust at B1.w + .075,0 "$intp^:~intp~cdot~D$" ljust at B2.w + .075,0 ] at S2 state(5, 4,-1) box invis "$intp^:~intp~D~cdot$" at S5 state(4, 2,-2) box invis "$real^:~intp~P~cdot~frac$" at S4 state(6, 5,-2) box invis "$real^:~intp~P~frac~cdot$" at S6 state(7, 3,-3, 2) [ down B1: box invis B2: box invis box invis "$frac^:~D~cdot$" ljust at B1.w + .1,0 box invis "$frac^:~D~cdot~frac$" ljust at B2.w + .1,0 ] at S7 state(8, 6,-3) box invis "$frac^:~D~frac~cdot$" at S8 state(3, 0,-3) box invis "$intp^:~D~cdot$" at S3 S9: box invis at O + 6*hu, 0 "\0\0\f3accept\fP" ljust at S9.w G02: .66<S0.sw,S0.s> G03: .33<S0.sw,S0.s> G2: .5<S2.sw,S2.s> G4: .5<S4.sw,S4.s> G7a: .5<S7.sw,S7.s> G7b: .5<S7.s,S7.se> G7c: S7.s + 0,-1.5*vs line -> from S0.e to S1.w dashed line -> from S2.e to S5.w line -> from S4.e to S6.w dashed line -> from S7.e to S8.w dashed line -> from S1.e to S9.w line -> from G03.s to G03.x, S3.n.y adown(G02); anell(S2.w, -> dashed, dashed) adown(G2); anell(S4.w, ->) adown(G4); anell(S7.w, ->) adown(G7a); anell(G7c); anell(G7b, ->) "$real$" at .5<S0.e, S1.w> + 0, vs/2 "$intp$" at (G02.x+S2.w.x)/2, S2.y + vs/2 "$~D$" ljust at G03.x,S4.y "$f2($end)$" at .5<S1.e, S9.w> + 0, vs/2 "$D$" at .5<S2.e, S5.w> + 0, vs/2 "$P$" at (G2.x+S4.w.x)/2, S4.y + vs/2 "$frac$" at .5<S4.e, S6.w> + 0, vs/2 "$D$" at (G4.x+S7.w.x)/2, S7.y + vs/2 "$frac$" at .5<S7.e, S8.w> + 0, vs/2 "$D$" at G7c + 0, vs/2 ] .PE .@tag FI _FI.GOTO_ .FI _FI.GOTO_ States and transitions for the real-number grammar. .EF 0 .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .SP .KE .PP The parser holds states on a stack, with the current state on top. The starting state of the automaton in Figure _FI.GOTO_ is state 0. With lookahead .CW D , the automaton shifts to state 3 (in the bottom-left corner of the figure) by pushing 3 onto the stack and removing the lookahead .CW D from the input. For ease of comparison with Figure _FI.SHIFTREDUCE_, the symbol .CW D appears below the state in the following diagram: .KS .ps 9 .PS [ define Disp % box invis $2 { move to last box + 0,-del; $1 } % define N % Disp($1, wid .25) % define D % Disp("\f2\s8D\s0\fP", wid .1) % define P % Disp("\f2\s8P\s0\fP", wid .1) % define E % Disp("$f2($end)$", wid .3) % define action % Act: [ right B: box invis wid 1 PTR: box invis wid .1 $1 ljust at B.w ] with .PTR.n at S.PTR.s % define shift % action("shift") % define reduce % action("$1") % define ptr % PTR: box invis wid .05 { move left .025; move up boxht/2; line <- } box invis wid .3 { move left .15; $1 } box invis wid .05 { move left .025; move up boxht/2; line <- } % define handle % { line from last box.sw to last box.se } % define state % [ down ELEM: box "$1" $2 $3 ] with .ELEM.w at Here; move to last [].ELEM.e; right % boxht = vs; boxwid = .2 arrowht = .05; arrowwid = .025; lineht = .5*vs; dy = vs movewid = .1; del = .015 L1: S:[ right state(0) ptr() D; D; P; D; D; E ] shift L2: S:[ right state(0); state(3, D) ptr() D; P; D; D; E ] with .PTR.n at Act.PTR.s + 0,-boxht/2 ] .PE .KE .LP The top of the state stack has its own pointer, separate from the lookahead pointer to the next input symbol. .PP The only item in state 3 is .P{ intp : D. .P} Whenever the dot in an item is at the end of the production, one of the possible actions is a reduction by that production. Since the state has no other actions, the parser chooses to reduce. .PP A reduce action has two phases: (a) pop the states corresponding to the right side, and (b) push a state corresponding to the left side. The following diagram illustrates the reduction of the leading .CW D in .CW DDPDD to .CW intp : .KS .ps 9 .PS [ define Disp % box invis $2 { move to last box + 0,-del; $1 } % define N % Disp($1, wid .25) % define D % Disp("\f2\s8D\s0\fP", wid .1) % define P % Disp("\f2\s8P\s0\fP", wid .1) % define E % Disp("$f2($end)$", wid .3) % define action % Act: [ right B: box invis wid 1 PTR: box invis wid .1 $1 ljust at B.w ] with .PTR.n at S.PTR.s + 0,-boxht % define shift % action("shift") % define reduce % action("$1") % define ptr % PTR: box invis wid .05 { move left .025; move up boxht/2; line <- } box invis wid .3 { move left .15; $1 } box invis wid .05 { move left .025; move up boxht/2; line <- } % define handle % { line from last box.sw to last box.se } % define state % [ down ELEM: box "$1" $2 $3 ] with .ELEM.w at Here; move to last [].ELEM.e; right % boxht = vs; boxwid = .2 arrowht = .05; arrowwid = .025; lineht = .5*vs; dy = vs movewid = .1; del = .015 L1: S:[ right state(0); state(3, D, handle) ptr() D; P; D; D; E ] reduce(pop) L2: S:[ right state(0) ptr("$intp$") D; P; D; D; E ] with .PTR.n at Act.PTR.s + 0,-boxht reduce(goto) L3: S:[ right state(0); state(2, N("$intp$")) ptr() D; P; D; D; E ] with .PTR.n at Act.PTR.s + 0,-boxht dx = .75; dy = -lineht-vs/2-del "Pop the states corresponding to the right side \f2\s8D\s0\fP."\ ljust at L1.ne + dx,dy "Prepare to push a state for the left side $intp$."\ ljust at L2.ne + dx,dy "State 0 goes to 2 under $intp$."\ ljust at L3.ne + dx,dy box invis wid 2.75 with .nw at L1.ne + dx,dy ] .PE .KE .SS "Parsing Actions" The parser has only four actions available to it, called shift, reduce, accept, and error. A move of the parser is as follows: .SP .5 .IP 1. 4 Based on its current state, the parser decides whether it needs a lookahead token to choose the next action. If it needs one, and does not already have one, it calls .CW yylex to obtain the next token. .SP .5 .IP 2. Using the current state $p$, and the lookahead token $t$ if needed, the parser chooses an action. .RS .SP .25 .IP a) 4 A .I shift action to state $q$ is done by pushing state $q$ onto the state stack and clearing the lookahead token. .SP .25 .IP b) A .I reduce action by a production is done in two phases. In the first phase, the parser pops from the stack a number of states equal to the number of symbols on the right side. Let $q$ be the state uncovered after the states are popped, and let the nonterminal on the left side of the production take $q$ to $r$. In the second phase, the parser pushes state $r$ onto the stack. .SP .25 .EQ delim off .EN .IP c) The .I accept action occurs after the entire input has been seen and matched. This action occurs only when the lookahead token is the endmarker .CW $end . .EQ delim $$ .EN .SP .25 .IP d) An .I error action occurs when the parser can no longer continue parsing according to the productions. That is, the input tokens seen so far and the current lookahead cannot possibly be followed by anything that would result in a legal input. Error recovery is covered in Section _SH5_. .RE .@tag SH _SH4_ .SH _SH4_ "Ambiguity and Conflicts" A .I "shift/reduce conflict" occurs if the parser cannot decide between a shift action and a reduce action. Similarly, a .I "reduce/reduce conflict" occurs if the parser cannot decide between two legal reductions. .PP Conflicts definitely occur when a grammar is .I ambiguous ; that is, if some input string has more than one parse tree. Conflicts can also occur when a grammar, although consistent, requires a more complex parser than Yacc is capable of constructing. Finally, conflicts are sometimes introduced when an action is embedded into the middle of a production. .PP Yacc produces a parser even when conflicts occur. Rules used to choose between two competing actions is called .I "disambiguating rules" . The default disambiguating rules are: .SP .5 .IP 1. 4 In a shift/reduce conflict, the default is to shift. .SP .25 .IP 2. In a reduce/reduce conflict, the default is to reduce by the earlier production in the specification. .SP .5 .PP Although the effect of disambiguating rules can often be achieved by rewriting the grammar to avoid conflicts, experience suggests that this rewriting is somewhat unnatural and produces slower parsers. .PP The rest of this section considers two situations; one in which the default disambiguating rules do not have the intended effect, and one in which they do. It is recommended that shift/reduce conflicts be investigated using the .CW y.output file created by running Yacc with the .CW -v option. .SS "To Shift or To Reduce" Yacc reports 2 shift/reduce conflicts when it is applied to .EQ delim off .EN .P{ %token NUM %% expr : expr '\en' { printf("\en"); return(0); } | expr '-' expr { printf(" -"); } | NUM { printf(" %g", $NUM); } ; .P} .EQ delim $$ .EN This specification marks the end of an expression by a newline character; the .CW return in the associated action terminates parsing immediately after echoing the newline. .PP The hope here is to translate an infix expression into postfix notation. Thus we want .P{ 2 - 1 - 1 .P} to be translated into .P{ 2 1 - 1 - .P} The answer would be 0 if the expression were evaluated by subtracting 1 from 2, and then subtracting 1 from the result. Unfortunately, the output of the parser (with a suitable program section) is .P{ 2 1 1 .P} Whatever happened to the minus operators? The problem can be traced to the preference for shift in a shift/reduce conflict, which makes .CW - right associative, and gives .CW \en higher precedence than .CW - . .KF .EQ delim off .EN .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .ps 9 .ft CW .PS [ define text % [ boxht = vs B: box $2 invis $1 ljust at B.w ] % boxht = vs C1: [ boxwid = 2; down text("state 0") text(" $accept : .expr $end ") text("", ht vs/3) text(" NUM shift 2") text(" . error") text(" expr goto 1") text("", ht vs/2) text("state 1") text("", ht vs/3) text(" $accept : expr.$end ") text(" expr : expr.\en ") text(" expr : expr.- expr ") text(" $end accept") text(" \en shift 3") text(" - shift 4") text(" . error") text("", ht vs/2) text("state 2") text("", ht vs/3) text(" expr : NUM. (3)") text(" . reduce 3") ] C2: [ boxwid = 2.5; down text("state 3") text(" expr : expr \en. (1)") text("", ht vs/3) text(" . reduce 1") text("", ht vs/2) text("state 4") text(" expr : expr -.expr ") text("", ht vs/3) text(" NUM shift 2") text(" . error") text(" expr goto 5") text("", ht vs/2) .ft CB text("5: conflict (shift \en, reduce 2)") text("5: conflict (shift -, reduce 2)") text("state 5") text(" expr : expr.\en ") text(" expr : expr.- expr ") text(" expr : expr - expr. (2)") text("", ht vs/3) text(" \en shift 3") text(" - shift 4") text(" . reduce 2") .ft CW ] with .ne at C1.nw + pagewid-.75,0 ] .PE .EQ delim $$ .EN .@tag FI _FI.Y.OUTPUT_ .FI _FI.Y.OUTPUT_ A slightly edited .CW y.output file. .EF 0 .nf \s5\l'\n(LLu\&\(ul'\s0 .fi .SP .KE .PP A slightly edited version of the .CW y.output file for this grammar appears in Figure _FI.Y.OUTPUT_. The 2 shift/reduce conflicts are in state 5. In this state, by default, the parser shifts the lookahead symbols .CW \en and .CW - instead of using the item .P{ expr : expr - expr. (2) .P} to reduce by production (2). A reduction occurs only when a right side is at the top of the parser stack, so the stack must contain states corresponding to .P{ \f1$...$\fP expr$"" sub a$ - expr$"" sub b$ .P} (The subscripts merely distinguish between the two occurrences of .CW expr .) .PP The other items in state 5 tell us more about the choices faced by the parser. With lookahead .CW \en the parser shifts, by default. Thus, it treats .CW expr$"" sub b$ on top of the stack as if it were in the item .P{ expr : expr$"" sub b$.\en .P} instead of the item .P{ expr : expr$"" sub a$ - expr$"" sub b$. (2) .P} Similarly, with lookahead .CW - , the parser treats .CW expr$"" sub b$ as if it were the first subexpression in .P{ expr : expr$"" sub b$.- expr .P} .PP Putting these observations together, the input .CW 2-1-1\en will result in the stack eventually containing .P{ expr - expr - expr \en .P} .PP Now, when the right side .CW expr$^$\en is reduced, its action will print a newline and return, abandoning the parse in midstream. .PP The addition of the precedence declarations .P{ %nonassoc '\en' %left '-' .P} eliminates the conflicts in state 5, resulting in the new state .P{ state 5 expr : expr.\en expr : expr.- expr expr : expr - expr. (2) .sp .5 . reduce 2 .P} Here, the reduction takes precedence over the potential shifts, thereby implementing the left associativity of .CW - and its higher precedence over .CW \en . .SS "Precedence Declarations" Precedence declarations give rise to disambiguating rules for resolving parsing conflicts. As mentioned in Section _SH2_, a precedence declaration starts with .CW %left , .CW %right , or .CW %nonassoc . These keywords specify the associativity of the tokens in a declaration. All the tokens in a declaration have the same precedence; tokens in successive declarations have higher precedence. A token in a precedence declaration need not be, but can be, declared by .CW %token as well. .PP The disambiguating rules are as follows. .SP .5 .IP 1. 4 Although declared only for tokens and literals, precedence information is attached to each production as well. The precedence and associativity of a production is that of the last token or literal on its right side. The .CW %prec construction overrides this default. .SP .5 .IP 2. If there is a shift/reduce conflict, and both the lookahead symbol (to be shifted) and the production (to be reduced) have precedence declarations, then the conflict is resolved in favor of the action (shift or reduce) with the higher precedence. If the precedences are the same, then the associativity is used; left associative implies reduce, right associative implies shift, and nonassociative implies error. .SP .5 .IP 3. In the absence of precedence information, the default disambiguating rules given earlier in this section are used. More precisely, suppose that there is no precedence information for either the lookahead symbol or a production to be reduced by. A shift/reduce conflict is resolved by shifting. A reduce/reduce conflict is resolved by reducing by the production that appears earlier in the specification. Conflicts resolved by these default rules are reported. .SP .5 .LP Conflicts resolved by precedence are not counted in the shift/reduce and reduce/reduce conflicts reported by Yacc. Thus, mistakes in precedence declarations can mask errors in the design of the productions. .SS "Shift a Dangling Else" As an example of the power of the default disambiguating rules consider a fragment from a programming language involving an if-then-else construction: .P{ stmt : IF '(' expr ')' stmt | IF '(' expr ')' stmt ELSE stmt ; .P} Here, .CW IF and .CW ELSE are tokens, .CW stmt is a nonterminal for statements, and .CW expr is a nonterminal for expressions. Call the first production the .I "simple-if" production and the second the .I "else-if" production. .PP These two productions are ambiguous, since .P{ IF ( expr$"" sub 1$ ) IF ( expr$"" sub 2$ ) stmt$"" sub a$ ELSE stmt$"" sub b$ .P} can be structured in two ways .ft CW .TS lw4 0 l7 | l. IF ( expr$"" sub 1$ ) { IF ( expr$"" sub 1$ ) { .ft CB IF ( expr$"" sub 2$ ) stmt$"" sub a$ IF ( expr$"" sub 2$ ) stmt$"" sub a$ \f(CW}\fP ELSE stmt$"" sub b$ .ft CW ELSE stmt$"" sub b$ } .TE .LP The second interpretation, with an .CW ELSE matching the nearest preceding unmatched .CW IF , is the one taken by most programming languages. It is the interpretation obtained when shift/reduce conflicts are resolved in favor of shift actions. .PP The .CW y.output file for a grammar containing the simple-if and if-else productions contains a shift/reduce conflict, illustrated by .P{ .ft CW 8: shift/reduce conflict (shift 9, red'n 1) on ELSE .ft CB state 8 stmt : IF ( expr ) stmt. (1) stmt : IF ( expr ) stmt.ELSE stmt .ft CW .sp .5 ELSE shift 9 . reduce 1 .P} This conflict occurs when the lookahead symbol is .CW ELSE and the parser stack contains the right side of the simple-if production .P{ \f1$...$\fP IF ( expr ) stmt .P} The parser chooses to shift the .CW ELSE rather than reduce by the simple-if production. This choice allows .P{ \f1$...$\fP IF ( expr ) stmt ELSE stmt .P} to be successfully reduced by the if-else production. Note that a premature application of the simple-if rule (with lookahead .CW ELSE ) would lead to the stack contents .P{ \f1$...$\fP stmt ELSE stmt .P} which cannot be parsed further. .PP A shift on lookahead .CW ELSE also matches an .CW ELSE with the nearest preceding unmatched .CW IF . With stack contents .P{ \f1$...$\fP \f(CWIF ( expr )\fP IF ( expr ) stmt .P} and lookahead .CW ELSE , the parser shifts, so the stack eventually holds .P{ \f1$...$\fP \f(CWIF ( expr )\fP IF ( expr ) stmt ELSE stmt .P} A reduction by the if-else production now yields .P{ \f1$...$\fP \f(CWIF ( expr )\fP stmt .P} which can be reduced by the simple-if production. .@tag SH _SH5_ .SH _SH5_ "Error Handling" Error handling is an extremely difficult area, and many of the problems are semantic ones. When an error is found, it may be necessary to undo the effect of actions\(emfor example, to reclaim parse-tree storage, to delete or alter symbol-table entries\(emand, typically, set flags to avoid generation of further output. .PP It is seldom acceptable to stop all processing when an error is found; further syntax errors might be found if parsing continues. But, how do we get the parser ``restarted'' after an error is detected. One approach is to discard tokens from the input until parsing can be continued. .PP Yacc provides a simple, but reasonably general, feature for discarding tokens. The token name .CW error is reserved for error handling. On the right side of a production, .CW error suggests a place where error recovery is planned. When an error occurs, the parser pops its stack until it enters a state where the token .CW error is legal. It then behaves as if .CW error were the current lookahead, and performs the action encountered. The lookahead is then reset to the token that caused the error. If no error productions are specified, parsing halts when an error is detected. .PP In order to prevent a cascade of error messages, the parser, after detecting an error, remains in the error state until three tokens have been successfully read and shifted. If an error is detected when the parser is already in error state, no message is given, and the input token is quietly deleted. .PP For example, a production .P{ stmt : error .P} would, in effect, mean that on a syntax error the parser would attempt to skip over the statement in which the error was seen. More precisely, the parser will scan ahead, looking for three tokens that might legally follow a statement, and start processing at the first of these; if the beginnings of statements are not sufficiently distinctive, it may make a false start in the middle of a statement, and end up reporting a second error where there is in fact no error. .PP Actions can be used with these special error rules. These actions might attempt to reinitialize tables, reclaim symbol table space, etc. .PP Productions with just .CW error on the right side are very general, but difficult to control. Somewhat easier are productions such as .P{ stmt : error ';' .P} Here, upon error, the parser attempts to skip over the statement, but will do so by skipping to the next semicolon All tokens after the error and before the next semicolon cannot be shifted, and are discarded. When the semicolon is seen, this right side will be reduced, and any ``cleanup'' action associated with it performed. .PP Another form of error production arises in interactive applications, where it may be desirable to permit a line to be reentered after an error. A possible error production might be .EQ delim off .EN .P{ input : error '\en' { printf("Reenter last line: "); } input { $$ = $input; } ; .P} One potential difficulty with this approach is that the parser must correctly process three input tokens before it admits that it has correctly resynchronized after the error. If the reentered line contains an error in the first two tokens, the parser deletes the offending tokens, and gives no message; this is clearly unacceptable. For this reason, there is a mechanism that can be used to force the parser to believe that an error has been fully recovered from. The statement .P{ yyerrok; .P} in an action resets the parser to its normal mode. The last example is better written .P{ input : error '\en' { yyerrok; printf("Reenter last line: "); } input { $$ = $input; } ; .P} .EQ delim $$ .EN .PP As mentioned above, the token seen immediately after the .CW error symbol is the input token at which the error was discovered. Sometimes, this is inappropriate; for example, an error recovery action might take upon itself the job of finding the correct place to resume input. In this case, the previous lookahead token must be cleared. The statement .P{ yyclearin; .P} in an action has this effect. For example, suppose the action after error is to call some sophisticated resynchronization routine, supplied by the user, that attempts to advance the input to the beginning of the next valid statement. After this routine is called, the next token returned by .CW yylex is presumably the first token in a legal statement; the old, illegal token must be discarded, and the error state reset. This can be done by a rule like .P{ stmt : error { resynch(); yyerrok; yyclearin; } ; .P} .PP These mechanisms are admittedly crude, but do allow for a simple, fairly effective recovery of the parser from many errors; moreover, the user can get control to deal with the error actions required by other portions of the program. .@tag SH _SH6_ .SH _SH6_ "The Yacc Environment" From a specification, Yacc creates a file of C programs, called .CW y.tab.c on most systems. .SS "Program Organization" User-code in a specification file .P{ %{ $<!$\f2user supplied code within declarations\fP$>!$ #define YYSTYPE $<!$\f1desired type\fP$>!$ %} $<!$\f2declarations section\fP$>!$ %% $<!$\f2productions\fP$>!$ %% $<!$ \f1program section\fP$>!$ .P} leads to a .CW y.tab.c file organized as follows: .P{ $<!$\&\f2user supplied code within declarations\fP$>!$ #define YYSTYPE $<!$\f1desired type\fP$>!$ $<!$\f2token and other declarations\fP$>!$ $<!$\f2program section\fP$>!$ $<!$\f2parser tables\fP$>!$ yyparse() { \&\f1$...$\fP } .P} Actions are incorporated into .CW yyparse . .PP As mentioned in Section _SH1_, .CW yyparse , the code for the parser, expects the following functions to be supplied: .P{ int yylex() { \f2a lexical analyzer; returns a token\fP } int main(\f1$...$\fP) { \f1$...$\fP yyparse(); \f1$...$\fP } void yyerror(s) char *s; { \f2print an error message pointed to by\fP s } .P} .PP The function .CW main decides when it wants to call .CW yyparse , which returns 0 if the parser accepts, and 1 if an error is detected and no error recovery is possible. .PP A user-supplied function .CW yyerror is called with a string containing an error message. The average application will want to do more than simply print the message; for example, the message might be accompanied by the input line number on which the error was detected. The external integer variable .CW yychar contains the lookahead token number at the time an error is detected; this may be of some interest in giving better diagnostics. .PP The external variable .CW yydebug is normally set to 0. It it is set to a nonzero value, the parser will output a verbose description of its actions, including the tokens read and the parser actions. Depending on the operating environment, it may be possible to set this variable by using a debugging system. .SS "Lexical Tie-Ins" The lexical analyzer .CW yylex must return an integer, the token number, representing the lookahead token. An attribute value associated with the token must be placed in the global variable .CW yylval . .PP The parser and the lexical analyzer must agree on the token numbers in order for communication between them to take place. Token numbers may be chosen by Yacc, or chosen by the user. In either case, the .CW #def\&ine mechanism of C is used to allow the lexical analyzer to refer to these numbers symbolically. For example, the declarations section .P{ %token IF ELSE .P} leads to the following definitions in the file .CW y.tab.c : .P{ # define IF 257 # define ELSE 258 .P} .PP If .CW yylex is included in the programs section, it is within the scope of these definitions, so .CW IF and .CW ELSE can be used as the names of token numbers in .CW yylex . .PP A file .CW y.tab.h containing the definition of token numbers can be created by running Yacc with the .CW -d option. .PP The approach of treating token names as defined constants leads to clear, easily modified lexical analyzers; the only pitfall is the need to avoid using any token names that are reserved or significant in C or the parser. For example, the use of the token names .CW if and .CW while will almost certainly cause severe difficulties when the lexical analyzer is compiled. The token name .CW error is reserved for error handling; see Section _SH5_. .PP Yacc chooses token numbers if the user does not. The default token number for a literal character is the numerical value of the character in the local character set. Other names are assigned token numbers starting at 257. .PP To assign a number to a token (including a literal), the first appearance of the token name or literal in the declarations section can be immediately followed by a nonnegative integer. This integer is taken to be the token number of the name or literal. Names and literals not defined by this mechanism retain their default definition. It is important that all token numbers be distinct. .PP For historical reasons, the endmarker must have token number 0 or negative. This token number cannot be redefined by the user; thus, all lexical analyzers must be prepared to return 0 or negative as a token number upon reaching the end of their input. .SS "Communicating Context to the Lexical Analyzer" Some lexical decisions depend on context. For example, the lexical analyzer might want to delete blanks normally, but not within quoted strings. Or names might be entered into a symbol table in declarations, but not in expressions. .PP One way of handling this situation is to create a global flag that is examined by the lexical analyzer, and set by actions. For example, suppose a program consists of zero or more declarations, followed by zero or more statements. Consider: .P{ %{ int dflag; %} \&\f2$...$ other declarations $...$\f(CB %% prog : decls stmts ; decls : /* empty */ { dflag = 1; } | decls declaration ; stmts : /* empty */ { dflag = 0; } | stmts statement ; \&\f2$...$ other productions $...$\f(CB .P} .PP The flag .CW dflag is now 1 when reading declarations and 0 when reading statements, .ul except for the first token in the first statement. This token must be seen by the parser before it can tell that the declarations have ended and the productions have begun. In many cases, this single token exception does not affect the lexical scan. .PP This kind of ``backdoor'' approach can be elaborated to a noxious degree. Nevertheless, it represents a way of doing some things that are difficult, if not impossible, to do otherwise. .PP Some programming languages permit the user to use words like .CW if , which are normally reserved, as label or variable names, provided that such use does not conflict with the legal use of these names in the programming language. This is extremely hard to do in the framework of Yacc; it is difficult to pass information to the lexical analyzer telling it ``this instance of .CW if is a keyword, and that instance is a variable''. The user can make a stab at it, using flags like .CW dflag , above, but it is difficult. It is better that the keywords be .I reserved ; that is, be forbidden for use as variable names. There are powerful stylistic reasons for preferring this, anyway. .SS "Support for Arbitrary Attribute Types" By default, the values returned by actions and the lexical analyzer are integers. Other types can be supported by defining .CW YYSTYPE in the declarations section, as in .P{ %{ #define YYSTYPE double %} .P} .CW YYSTYPE is not a normal variable, because the parser contains the following lines to define it to be .CW int if it has not already been defined by the user: .P{ #ifndef YYSTYPE #define YYSTYPE int #endif .P} .PP Clearly, .CW YYSTYPE can be defined to be any type, including a union type. .PP The .CW %union mechanism of Yacc attempts to make the underlying union transparent, in all but a few places where Yacc needs help in determining which field of the union is intended. .PP Unions are declared in the declarations section, an example being .P{ %union { int ival; double dval; char * sval; } .P} A union type with these members is created for the Yacc value stack, and for the global variables .CW yylval and .CW yyval . With the .CW -d option, Yacc copies the union type into the .CW y.tab.h file. The type of the union can be referred to as .CW YYSTYPE . .PP The type of each attribute must now correspond to one of the union members. The construction .P{ <name> .P} indicates a union member name. If the construction follows .CW %token , .CW %left , .CW %right , or .CW %nonassoc , then the union member name is associated with the tokens in that declaration. Another keyword .CW %type is used similarly to associate union member names with nonterminals. Thus, we might use .P{ %type <dval> expr term .P} .EQ delim off .EN .PP There remain a couple of cases where these mechanisms are insufficient. If there is an action within a rule, the value returned by this action has no a priori type. Similarly, reference to left context values leaves Yacc with no easy way of knowing the type. In this case, a type can be imposed on the reference by inserting a union member name, between .CW < and .CW > , immediately after the first .CW $ . An example of this usage is .P{ rule : aaa { $<intval>$ = 3; } bbb { fun( $<intval>2, $<other>0 ); } ; .P} This syntax has little to recommend it, but the situation arises rarely. .PP The facilities in this subsection are not triggered until they are used: in particular, the use of .CW %type will turn on these mechanisms. When they are used, there is a fairly strict level of checking. For example, use of .CW $\f2n\fP or .CW $$ to refer to something with no defined type is diagnosed. If these facilities are not triggered, the Yacc value stack is used to hold .CW int s, as was true historically. .EQ delim $$ .EN .NH 1 References .LP |reference_placement