.po 7
.hc ~
.de TS
.br
.nf
.sp
.ul 0
..
.de TE
.sp
.fi
..
.ND July 21, 1975
.RP
.TM 75-1274-15 39199 39199-11
.TL
Lex - A Lexical Analyzer ~Generator~
.AU "MH 2C-569" 6377
M. E. Lesk
.AI
.MH
.AB
Lex helps write programs whose control flow
is directed by instances of regular
expressions in the input stream.
It is well suited for editor-script type transformations and
for segmenting input in preparation for
a parsing routine.
.PP
Lex source is a table of regular expressions and corresponding program fragments.
The table is translated to a program
which reads an input stream, copying it to an output stream
and partitioning the input
into strings which match the given expressions.
As each such string is recognized the corresponding
program fragment is executed.
The recognition of the expressions
is performed by a deterministic finite automaton
generated by Lex.
The program fragments written by the user are executed in the order in which the
corresponding regular expressions occur in the input stream.
.if n .if \n(tm .ig
.PP
The lexical analysis
programs written with Lex accept ambiguous specifications
and choose the longest
match possible at each input point.
If necessary, substantial look~ahead
is performed on the input, but the
input stream will be backed up to the
end of the current partition, so that the user
has general freedom to manipulate it.
.PP
Lex can be used to generate analyzers in either
C or Ratfor, a language
which can be translated automatically to portable Fortran.
It is available on the PDP-11 UNIX, Honeywell GCOS,
and IBM OS systems.
Lex is designed to simplify
interfacing with Yacc, for those
with access to this compiler-compiler system.
..
.AE
.OK
Programming Languages
Compilers
.CS 35 0 35 2 4 4
.SH
.ce 1
Table of Contents
.LP
.ce 100
.TS
r 1l 2r .
1.	Introduction.	1
2.	Lex Source.	5
3.	Lex Regular Expressions.	6
4.	Lex Actions.	9
5.	Ambiguous Source Rules.	13
6.	Lex Source Definitions.	16
7.	Usage.	17
8.	Lex and Yacc.	20
9.	Examples.	20
10.	Left Context Sensitivity.	23
11.	Character Set.	27
12.	Program Sizes and Timings.	28
13.	Summary of Source Format.	33
14.	Caveats and Bugs.	35
15.	Acknowledgments.	35
16.	References.	35
.TE
.ce 0
.NH
Introduction.
.PP
Lex is a program generator designed for
lexical processing of character input streams.
It accepts a high-level, problem-oriented specification
for character string matching,
and
produces a program in a general purpose language which recognizes
regular expressions.
The regular expressions are specified by the user in the
source specifications given to Lex.
The Lex-written code recognizes these expressions
in an input stream and partitions the input stream into
strings matching the expressions.  At the bound~aries
between strings
program sections
provided by the user are executed.
The Lex source file associates the regular expressions and the
program fragments.
As each expression appears in the input to the program written by Lex,
the corresponding fragment is executed.
.PP
The user supplies the additional code
beyond expression matching
needed to complete his tasks, possibly
including code written by other generators.
The program that recognizes the expressions is generated in the
general purpose programming language employed for the
user's program fragments.
Thus, a high level expression
language is provided to write the string expressions to be
matched while the user's freedom to write actions
is unimpaired.
This spares a user who wants a string manipulation
language for input analysis from having to write the processing programs in that same,
often inappropriate, string handling language.
.PP
Lex is not a complete language, but rather a generator representing
a new language feature which can be added to
different programming languages, called "host languages".
Just as general purpose languages
can produce code to run on different computer hardware,
Lex can write code in different host languages.
The host language is used for the output code generated by Lex
and also for the program fragments added by the user.
Compatible run-time libraries for the different host languages
are also provided.
This makes Lex adaptable to different environments and
different users.
Each application
may be directed to the combination of hardware and host language appropriate
to the task, the user's background, and the properties of local
implementations.
At
present there are only
two host languages,
C[1] and Fortran (in the form of the Ratfor language[2]).
Lex itself exists on UNIX, GCOS, and OS/370; but the
code generated by Lex may be taken anywhere the appropriate
compilers exist.
.PP
Lex first turns the user's expressions and actions
(called
.ul
source
in this memo) into the host general-purpose language;
the generated program is named
.ul
yylex.
The
.ul
yylex
program
will recognize expressions
in a stream
(called
.ul
input
in this memo)
and perform the specified actions for each expression as it is detected.
.DS
           _____
          |     |
Source -> | Lex | -> yylex
          |_____|
.sp
          _______
         |       |
Input -> | yylex | -> Output
         |_______|
.sp
      An overview of Lex
.sp
         Figure 1
.DE
.PP
For a trivial example, consider a program to delete
from the input
all blanks or tabs at the ends of lines.
.DS
%%
[ \et]+$	;
.DE
is all that is required.
The program
contains a %% delimiter to mark the beginning of the rules, and
one rule.
This rule contains a regular expression
which matches one or more
instances of the characters blank or tab
(written \et for visibility, in accordance with the C language convention)
just prior to the end of a line.
The brackets indicate the character
class made of blank and tab; the + indicates "one or more ...";
and the $ indicates "end of line", as in QED.
No action is specified,
so the program generated by Lex (yylex) will ignore these characters.
Everything else will be copied.
To change any remaining
string of blanks or tabs to a single blank,
add another rule:
.ta 10 20
.DS
%%
[ \et]+$	;
[ \et]+	printf(" ");
.DE
The finite automaton generated for this
source will scan for both rules at once,
observing at
the termination of the string of blanks or tabs
whether or not there is a newline character, and executing
the desired rule action.
The first rule matches all strings of blanks or tabs
at the end of lines, and the second
rule all remaining strings of blanks or tabs.
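.PP
The combined effect of the two rules can be sketched directly in C.
This is only an illustration, not the code Lex generates:
the name squeeze_line is hypothetical, the end of the string stands in
for the end of the line, and the tab is written as its ASCII code.
.DS
```c
#include <stddef.h>

/* Hypothetical helper, not part of Lex: applies the effect of the two
   rules to one line of text (without its newline).  out must have room
   for strlen(in) + 1 bytes. */
void squeeze_line(const char *in, char *out)
{
    size_t i = 0, o = 0;

    while (in[i] != 0) {
        if (in[i] == ' ' || in[i] == 9) {       /* 9 is ASCII tab */
            while (in[i] == ' ' || in[i] == 9)
                i++;                             /* consume the whole run */
            if (in[i] != 0)
                out[o++] = ' ';                  /* second rule: run inside the line */
            /* first rule: a run at the end of the line produces nothing */
        } else {
            out[o++] = in[i++];                  /* default action: copy */
        }
    }
    out[o] = 0;
}
```
.DE
Like the generated automaton, the sketch decides between the two rules
only when it reaches the end of a run of blanks or tabs.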
.PP
Lex can be used alone for simple transformations, or
for analysis and statistics gathering on a lexical level.
Lex can also be used with a parser generator
to perform the lexical analysis phase; it is particularly
easy to interface Lex and Yacc [3].
Lex programs recognize only regular expressions;
Yacc writes parsers that accept a large class of context free grammars,
but require a lower level analyzer to recognize input tokens.
Thus, a combination of Lex and Yacc is often appropriate.
When used as a preprocessor for a later parser generator,
Lex is used to partition the input stream,
and the parser generator assigns structure to
the resulting pieces.
The flow of control
in such a case (which might be the first half of a compiler,
for example) is shown in the next figure.
Additional programs,
written by other generators
or by hand, can
be added easily to programs written by Lex.
.DS
          lexical       grammar
           rules         rules
             |             |
             v             v
           _____        ______
          |     |      |      |
          | Lex |      | Yacc |
          |_____|      |______|
             |             |
             v             v
          _______      _________
         |       |    |         |
Input -> | yylex | -> | yyparse | -> Parsed input
         |_______|    |_________|
.sp
              Lex with Yacc
.sp
                Figure 2
.DE
Yacc users
will realize that the name
.ul
yylex
is what Yacc expects its lexical analyzer to be named,
so that the use of this name by Lex simplifies
interfacing.
.PP
Lex generates a deterministic finite automaton from the regular expressions
in the source [4].
The automaton is interpreted, rather than compiled, in order
to save space.
The result is still a fast analyzer.
In particular, the time taken by a Lex program
to recognize and partition an input stream is
proportional to the length of the input.
The number of Lex rules or
the complexity of the rules is
not important in determining speed,
unless rules which include
forward context require a significant amount of re~scanning.
What does increase with the number and complexity of rules
is the size of the finite
automaton, and therefore the size of the program
generated by Lex.
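.PP
To make the idea of an interpreted automaton concrete, the following
is a minimal sketch of a table-driven recognizer for the single expression [0-9]+.
The table layout and the names delta and dfa_match are assumptions made
for this illustration; the tables Lex actually writes are organized differently.
.DS
```c
/* Transition table for a two-state automaton recognizing [0-9]+.
   Rows are states, columns are character classes (0 = digit, 1 = other);
   -1 means no transition.  State 1 is the only accepting state. */
static const int delta[2][2] = {
    { 1, -1 },   /* state 0: start */
    { 1, -1 }    /* state 1: one or more digits seen */
};

/* Returns the length of the longest prefix of s matching [0-9]+,
   or 0 if there is none.  Each character is examined once. */
int dfa_match(const char *s)
{
    int state = 0, last_accept = 0, i;

    for (i = 0; s[i] != 0; i++) {
        int cls = (s[i] >= '0' && s[i] <= '9') ? 0 : 1;
        state = delta[state][cls];
        if (state < 0)
            break;                /* no transition: stop scanning */
        if (state == 1)
            last_accept = i + 1;  /* remember the longest match so far */
    }
    return last_accept;
}
```
.DE
The loop does a fixed amount of work per input character whatever the
size of the table, which is why more rules enlarge the program
rather than slow it down.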
.PP
In the program written by Lex, the user's fragments
(representing the
.ul
actions
to be performed as each regular expression
is found)
are gathered
as cases of a switch (in C) or branches of a computed GOTO
(in Ratfor).
The automaton interpreter directs the control flow.
Opportunity is provided for the user to insert either
declarations or additional statements in the routine containing
the actions, or to
add subroutines outside this action routine.
.PP
Lex is not limited to source which can
be interpreted on the basis of one character
look~ahead.
For example,
if there are two rules, one looking for "ab" and another for "abcdefg",
and the input stream is "abcdefh", Lex will recognize "ab" and leave
the input pointer just before "cd...".
Such backup is more costly
than the processing of simpler languages.
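.PP
The outcome of this example can be reproduced with a small C sketch.
It compares literal patterns one at a time, which is not how the
Lex automaton works, and the name longest_literal is hypothetical;
it shows only which match is finally reported.
.DS
```c
#include <string.h>

/* Returns the length of the longest pattern that is a prefix of input,
   or 0 if none matches.  Patterns are literal strings here. */
size_t longest_literal(const char *input, const char *const pats[], int npats)
{
    size_t best = 0;
    int i;

    for (i = 0; i < npats; i++) {
        size_t n = strlen(pats[i]);
        if (n > best && strncmp(input, pats[i], n) == 0)
            best = n;
    }
    return best;
}
```
.DE
With the patterns "ab" and "abcdefg" and the input "abcdefh" the result is 2:
the longer pattern fails at its last character, so the two characters
of "ab" are taken and scanning resumes at "cd...".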
.NH
Lex Source.
.PP
The general format of Lex source is:
.DS
{definitions}
%%
{rules}
%%
{user subroutines}
.DE
where the definitions and the user subroutines
are often omitted.
The second %% is optional, but the first is required
to mark the beginning of the rules.
The absolute minimum Lex program is thus
.DS
%%
.DE
(no definitions, no rules) which translates into a program
which copies the input to the output unchanged.
.PP
In the outline of Lex programs shown above, the
.I
rules
.R
represent the user's control
decisions; they are a table, in which the left column
contains
.I
regular expressions
.R
(see section 3)
and the right column contains
.I
actions,
.R
program fragments to be executed when the expressions
are recognized.
Thus an individual rule might appear
.DS
integer		printf("found keyword INT");
.DE
to look for the string "integer" in the input stream and
print the message "found keyword INT" whenever it appears.
In this example the host procedural language is C and
the C library function
.I
printf
.R
is used to print the string.
The end
of the expression is indicated by the first blank or tab character.
If the action is merely a single C expression,
it can just be given on the right side of the line; if it is
compound, or takes more than a line, it should be enclosed in
braces.
As a slightly more useful example, suppose it is desired to
change a number of words from British to American spelling.
Lex rules such as
.DS
colour		printf("color");
mechanise		printf("mechanize");
petrol		printf("gas");
.DE
would be a start.  These rules are not quite enough,
since
the word "petroleum" would become "gaseum"; a way of dealing
with this will be described later.
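.PP
The "petroleum" problem is easy to reproduce outside Lex.
The sketch below is a plain substring replacement in C, under the
assumption that the output buffer is large enough;
replace_all is a hypothetical name used only here.
.DS
```c
#include <string.h>

/* Copies in to out, replacing every occurrence of from with to.
   out must be large enough for the result; from must be non-empty. */
void replace_all(const char *in, const char *from, const char *to, char *out)
{
    size_t flen = strlen(from), tlen = strlen(to);

    while (*in != 0) {
        if (strncmp(in, from, flen) == 0) {
            memcpy(out, to, tlen);     /* the action of the rule */
            out += tlen;
            in += flen;
        } else {
            *out++ = *in++;            /* default action: copy */
        }
    }
    *out = 0;
}
```
.DE
Replacing "petrol" by "gas" in the input "petroleum" gives "gaseum",
because nothing in the rule says that the match must be a whole word.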
.NH
Lex Regular Expressions.
.PP
The definitions of regular expressions are very similar to those
in QED [5].
A regular
expression specifies a set of strings to be matched.
It contains text characters (which match the corresponding
characters in the strings being compared)
and operator characters (which specify
repetitions, choices, and other features).
The letters of the alphabet and the digits are
always text characters; thus the regular expression
.DS
integer
.DE
matches the string
.ul
integer
wherever it appears
and the expression
.DS
a57D
.DE
looks for the string
.ul
a57D.
.PP
.I
Operators.
.R
The operator characters are
.DS
" \e [ ] ^ - ? . * + | ( ) $ / { } % < >
.DE
and if they are to be used as text characters, an escape
should be used.
The quotation mark operator (")
indicates that whatever is contained between a pair of quotes
is to be taken as text characters.
Thus
.DS
xyz"++"
.DE
matches the string xyz++
when it appears.  Note that a part of a string may be quoted.
It is harmless but unnecessary to quote an ordinary
text character; the expression
.DS
"xyz++"
.DE
is the same as the one above.
Thus by quoting every non-alphanumeric character
being used as a text character, the user can avoid remembering
the list above of current
operator characters, and is safe should further extensions to Lex
lengthen the list.
.PP
An operator character may also be turned into a text character
by preceding it with \e as in
.DS
xyz\e+\e+
.DE
which
is another, less readable, equivalent of the above expressions.
Another use of the quoting mechanism is to get a blank into
an expression; normally, as explained above, blanks or tabs end
a rule.
Any blank character not contained within [] (see below) must
be quoted.
Several normal C escapes with \e
are recognized: \en is newline, \et is tab, and \eb is backspace.
To enter \e itself, use \e\e.
Since newline is illegal in an expression, \en must be used;
it is not
required to escape tab and backspace.
Every character but blank, tab, newline and the list above is always
a text character.
.PP
.I
Character classes.
.R
Classes of characters can be specified using the operator pair [].
The construction [abc] matches a
single character, which may be a, b, or c.
Within square brackets,
most operator meanings are ignored.
Only three characters are special:
these are \e,  - and ^.  The - character
indicates ranges.  For example,
.DS
[a-z0-9<>_]
.DE
indicates the character class containing all the lower case letters,
the digits,
the angle brackets, and underline.
Ranges may be given in either order.
Using - between any pair of characters which are
not both upper case letters, both lower case letters, or both digits
is implementation dependent and will get a warning message.
(E.g., [0-z] in ASCII is many more characters
than it is in EBCDIC).
If it is desired to include the
character - in a character class, it should be first or
last; thus
.DS
[-+0-9]
.DE
matches all the digits and the two signs.
.PP
In character classes,
the ^ operator must appear as the first character
after the left bracket; it indicates that the resulting string
is to be complemented with respect to the computer character set.
Thus
.DS
[^abc]
.DE
matches all characters except a, b, or c, including
all special or control characters; or
.DS
[^a-zA-Z]
.DE
is any character which is not a letter.
The \e character provides the usual escapes within
character class brackets.
.PP
.I
Arbitrary character.
.R
To match almost any character, the operator character
.DS
\&.
.DE
is the class of all characters except newline.
Escaping into octal is possible:
.DS
[\e40-\e176]
.DE
matches all printable characters in the ASCII character set, from octal
40 (blank) to octal 176 (tilde).
.PP
.I
Optional expressions.
.R
The operator ? indicates
an optional element of an expression.
Thus
.DS
ab?c
.DE
matches either ac or abc.
.PP
.I
Repeated expressions.
.R
Repetitions of classes are indicated by the operators * and +.
.DS
a*
.DE
is any number of consecutive a characters, including zero; while
.DS
a+
.DE
is one or more instances of a.
For example,
.DS
[a-z]+
.DE
is all strings of lower case letters.
And
.DS
[A-Za-z][A-Za-z0-9]*
.DE
indicates all alphanumeric strings with a leading
alphabetic character.
This is a typical expression for recognizing identifiers in
computer languages.
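.PP
For comparison, the same identifier test written by hand in C might
look like the sketch below.
The name ident_len is hypothetical; the standard functions isalpha and
isalnum are assumed to agree with the ranges in the expression, which
holds in the default C locale.
.DS
```c
#include <ctype.h>

/* Returns the length of the identifier at the start of s, or 0 if s
   does not begin with a letter. */
int ident_len(const char *s)
{
    int n = 0;

    if (!isalpha((unsigned char)s[0]))
        return 0;                        /* [A-Za-z] must come first */
    while (isalnum((unsigned char)s[n]))
        n++;                             /* then [A-Za-z0-9]* */
    return n;
}
```
.DE
Applied to "x25 = 3" this returns 3, and applied to "9abc" it returns 0.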
.PP
.I
Alternation and Grouping.
.R
The operator |
indicates alternation:
.DS
(ab|cd)
.DE
matches either
.ul
ab
or
.ul
cd.
Note that parentheses are used for grouping, although
they are
not necessary on the outside level;
.DS
ab|cd
.DE
would have sufficed.
Parentheses
can be used for more complex expressions:
.DS
(ab|cd+)?(ef)*
.DE
matches such strings as "abefef", "efefef",
"cdef", or "cddd"; but not "abc", "abcd", or "abcdef".
.PP
.I
Context sensitivity.
.R
Lex will recognize a small amount of surrounding
context.  The two simplest operators for this are
^ and $.
If the first character of an expression is ^,
the expression will only be matched at the beginning
of a line (after a newline character, or at the beginning of
the input stream).
This can never conflict with the other meaning of ^,
complementation
of character classes, since that only applies within
the [] operators.
If the very last character is $,
the expression will only be matched at the end of a line (when
immediately followed by newline).
The latter operator is a special case of the / operator character,
which indicates trailing context.
The expression
.DS
ab/cd
.DE
matches the string
ab,
but only if followed by
.ul
cd.
Thus
.DS
ab$
.DE
is the same as
.DS
ab/\en
.DE
Left context is handled in Lex by
.I
start conditions
.R
as explained in section 10.  If a rule is only to be executed
when the Lex automaton interpreter is in start condition
.I
x,
.R
the rule should be prefixed by
.DS
<x>
.DE
using the angle bracket operator characters.
If we considered "being at the beginning of a line" to be
start condition ONE,
then the ^ operator
would be equivalent to
.DS
<ONE>
.DE
Start conditions are explained more fully later.
.PP
.I
Definitions.
.R
The operators {} specify
definition expansion.  For example
.DS
{digit}
.DE
looks for a predefined string named "digit" and inserts it
at that point in the expression.
The definitions are given in the first part of the Lex
input, before the rules.
.PP
Finally, initial % is special, being the separator
for Lex source segments.
.NH
Lex Actions.
.PP
When an expression written as above is matched, Lex
executes the corresponding action.  This section describes
some features of Lex which aid in writing actions.  Note
that there is a default action, which
consists of copying the input to the output.  This
is performed on all strings not otherwise matched.  Thus
the Lex user who wishes to absorb the entire input, without
producing any output, must provide rules to match everything.
When Lex is being used with Yacc, this is the normal
situation.
One may consider that actions are what is done instead of
copying the input to the output; thus, in general,
a rule which merely copies can be omitted.
Also, a character combination
which is omitted from the rules
and which appears as input
is likely to be printed on the output, thus calling
attention to the gap in the rules.
.PP
One of the simplest things that can be done is to ignore
the input.   Specifying a C null statement, ";", as an action
causes this result.  A frequent rule is
.DS
[ \et\en]		;
.DE
which causes the three spacing characters (blank, tab, and newline)
to be ignored.
.PP
Another easy way to avoid writing actions is the action character
"|", which indicates that the action for this rule is the action
for the next rule.
The previous example could also have been written
.DS
" "		|
"\et"		|
"\en"		;
.DE
with the same result, although in different style.
The quotes around \en and \et are not required.
.PP
In more complex actions, the user
will
often want to know the actual text that matched some expression
like
.tr `-
[a`z]+.
Lex leaves this text in an external character
array named
.I
yytext.
.R
Thus, to print the name found,
a rule like
.DS
[a-z]+		printf("%s", yytext);
.DE
will print
the string in
.I
yytext.
.R
The C function
.I
printf
.R
accepts a format argument and data to be printed;
in this case, the format is "print string" (% indicating
data conversion, and s indicating string type),
and the data are the characters
in
.I
yytext.
.R
So this just places
the matched string
on the output.
This action
is so common that
it may be written as ECHO:
.DS
[a-z]+		ECHO;
.DE
is the same as the above.
Since the default action is just to
print the characters found, one might ask why
give a rule, like this one, which merely specifies
the default action?
Such rules are often required
to avoid matching some other rule
which is not desired.  For example, if there is a rule
which matches
"read"
it will normally match the instances of "read" contained in
"bread" or "readjust"; to avoid
this,
a rule
of the form
[a`z]+
is needed.
This is explained further below.
.PP
Sometimes it is more convenient to know the end of what
has been found; hence Lex also provides a count
.I
yyleng
.R
of the number of characters matched.
To count both the number
of words and the number of characters in words in the input, the user might write
.DS
[a-zA-Z]+	{words++; chars =+ yyleng;}
.DE
which accumulates in
.ul
chars
the number
of characters in the words recognized.
The last character in the string matched can
be accessed by
.DS
yytext[yyleng-1]
.DE
in C or
.DS
yytext(yyleng)
.DE
in Ratfor.
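.PP
The counting done by this rule can be sketched in plain C as follows.
The name count_words is hypothetical, and the local variable len
merely plays the part that yyleng plays in a Lex action.
.DS
```c
#include <ctype.h>

/* Adds to *words the number of maximal runs of letters in s and
   to *chars their total length. */
void count_words(const char *s, int *words, int *chars)
{
    int i = 0;

    while (s[i] != 0) {
        int len = 0;                       /* plays the part of yyleng */
        while (isalpha((unsigned char)s[i + len]))
            len++;
        if (len > 0) {
            (*words)++;
            *chars += len;
            i += len;
        } else {
            i++;                           /* not a letter: skip one character */
        }
    }
}
```
.DE
Starting from zero counts, the input "hello, big world" gives 3 words and 13 characters.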
.PP
Occasionally, a Lex
action may decide that a rule has not recognized the correct
span of characters.
Two routines are provided to aid with this situation.
First,
.I
yymore()
.R
can be called to indicate that the next input expression recognized is to be
tacked on to the end of this input.  Normally,
the next input string would overwrite the current
entry in
.I
yytext.
.R
Second,
.I
yyless (n)
.R
may be called to indicate that not all the characters matched
by the currently successful expression are wanted right now.
The argument
.I
n
.R
indicates the number of characters
in
.I
yytext
.R
to be retained.
Further characters previously matched
are
returned to the input.  This provides the same sort of
look~ahead offered by the "/" operator,
but in a different form.
.PP
.I
Example:
.R
Consider a language which defines
a string as a set of characters between quotation (") marks, and provides that
to include a " in a string it must be preceded by a \e.  The
regular expression which matches that is somewhat confusing,
so that it might be preferable to write
.DS
\e"[^"]*	{
	if (yytext[yyleng-1] == '\e\e')
		yymore();
	else
		... normal user processing
	}
.DE
which will, when faced with a string such as
.I
"abc\e"def"
.R
first match
the five characters
"abc\e
;
then
the call to yymore() will
cause the next part of the string,
"def
.R
, to be tacked on the end.
Note that the final quote terminating the string should be picked
up in the code labeled "normal user processing".
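.PP
The way repeated matches accumulate can be followed in a small C sketch.
It is an assumption-laden illustration, not the Lex run-time code:
quoted_len is a hypothetical name, and the backslash is written as its
ASCII code 92.
.DS
```c
/* Returns the number of characters from the opening quote at s[0] up
   to, but not including, the closing quote; 0 if s does not start with
   a quote.  A quote preceded by a backslash (ASCII 92) does not end
   the string. */
int quoted_len(const char *s)
{
    int n;

    if (s[0] != '"')
        return 0;
    n = 1;
    for (;;) {
        while (s[n] != 0 && s[n] != '"')
            n++;                       /* one match of the rule */
        if (s[n] == '"' && s[n - 1] == 92)
            n++;                       /* as with yymore(): keep the text, go on */
        else
            break;                     /* normal user processing would start here */
    }
    return n;
}
```
.DE
For the string used in the example the result is 9: five characters
from the first match and four from the second, with the closing quote
still unread.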
.PP
The function
.I
yyless()
.R
might be used to reprocess
text in various circumstances.  Consider the C problem of distinguishing
the ambiguity of "=-a".
Suppose it is desired to treat this as "=- a"
but print a message.  A rule might be
.DS
=-[a-zA-Z]	{
		printf("Operator (=-) ambiguous\en");
		yyless(yyleng-1);
		... action for =- ...
		}
.DE
which prints a message, returns the letter after the
operator to the input stream, and treats the operator as "=-".
Alternatively it might be desired to treat this as "= -a".
To do this, just return the minus
sign as well as the letter to the input:
.DS
=-[a-zA-Z]	{
		printf("Operator (=-) ambiguous\en");
		yyless(yyleng-2);
		... action for = ...
		}
.DE
will perform the other interpretation.
Note that the expressions for the two cases might more easily
be written
.DS
=-/[A-Za-z]
.DE
in the first case and
.DS
=/-[A-Za-z]
.DE
in the second;
no backup would be required in the rule action.
It is not necessary to recognize the whole identifier
to observe the ambiguity.
The
possibility of "=-3", however, makes
.DS
=-/[^ \et\en]
.DE
a still better rule.
.PP
In addition to these routines, Lex also permits
access to the I/O routines
it uses.
They are:
.IP 1)
.I
input()
.R
which returns the next input character;
.IP 2)
.I
output(c)
.R
which writes the character
.I
c
.R
on the output; and
.IP 3)
.I
unput(c)
.R
which pushes the character
.I
c
.R
back onto the input stream to be read later by
.I
input().
.R
.LP
Suitable versions of these routines are in the library.
There is another important routine in Ratfor, named
.I
lexshf,
.R
which is described below under "Character Set".
These routines
define the relationship between external files and
internal characters, and must all be retained
or modified consistently.
They may be redefined, to
cause input or output to be transmitted to or from strange
places, including other programs or internal memory;
but the character set used must be consistent in all routines;
a value of zero returned by
.I
input
.R
must mean end of file; and
the relationship between
.I
unput
.R
and
.I
input
.R
must be retained
or the Lex look~ahead will not work.
Lex does not look ahead at all if it does not have to,
but every rule ending in "+", "*", "?", or "$", or containing "/",
implies look~ahead.
Look~ahead is also necessary to match an expression that is a prefix
of another expression.
See below for a discussion of the character set used by Lex.
The standard Lex library imposes
a 100 character limit on backup.
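.PP
The required relationship between the input and unput routines
can be illustrated with a small pushback stack.
This is a sketch under stated assumptions, not the library source:
the names set_source, next_input and push_back are hypothetical,
the input is a string in memory, and the limit of 100 is taken from
the figure quoted above.
.DS
```c
#define BACKUP_LIMIT 100    /* the backup limit quoted in the text */

static const char *src_text = "";       /* hypothetical input source */
static char pushed[BACKUP_LIMIT];       /* characters waiting to be reread */
static int npushed = 0;

void set_source(const char *s)
{
    src_text = s;
    npushed = 0;
}

/* Returns the next character, or 0 at end of input. */
int next_input(void)
{
    if (npushed > 0)
        return pushed[--npushed];       /* pushed-back characters come first */
    return *src_text != 0 ? *src_text++ : 0;
}

/* Pushes c back; returns 0 on success, -1 if the buffer is full. */
int push_back(int c)
{
    if (npushed >= BACKUP_LIMIT)
        return -1;
    pushed[npushed++] = (char)c;
    return 0;
}
```
.DE
Characters must be pushed back in the reverse of the order in which
they were read, so that the last one pushed is the first one read again.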
.PP
Another Lex library routine that the user will sometimes want
to redefine is
.I
yywrap()
.R
which is called whenever Lex reaches an end-of-file.
If
.I
yywrap
.R
returns a 1, Lex continues with the normal wrapup on end of input.
Sometimes, however, it is convenient to arrange for more
input to arrive
from a new source.
In this case, the user should provide
a
.I
yywrap
.R
which
arranges for new input and
returns 0.  This instructs Lex to continue processing.
The default
.I
yywrap
.R
always returns 1.
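.PP
The convention can be seen in a self-contained C sketch in which the
input sources are two made-up strings.
The names wrap_input, wrap_yywrap and read_all are hypothetical;
only the meaning of the return values follows the description above.
.DS
```c
static const char *wrap_sources[] = { "first ", "second" };  /* made-up inputs */
static int wrap_current;
static const char *wrap_pos;

/* Returns the next character of the current source, 0 at its end. */
static int wrap_input(void)
{
    return *wrap_pos != 0 ? *wrap_pos++ : 0;
}

/* Same convention as yywrap(): 0 means more input has been arranged,
   1 means wrap up. */
static int wrap_yywrap(void)
{
    int count = (int)(sizeof wrap_sources / sizeof wrap_sources[0]);

    if (wrap_current + 1 < count) {
        wrap_pos = wrap_sources[++wrap_current];
        return 0;
    }
    return 1;
}

/* Copies all sources into out (which must be large enough) and
   returns the number of characters copied. */
int read_all(char *out)
{
    int n = 0, c;

    wrap_current = 0;
    wrap_pos = wrap_sources[0];
    for (;;) {
        c = wrap_input();
        if (c == 0) {
            if (wrap_yywrap())
                break;          /* no more input: finish */
            continue;           /* new source arranged: keep reading */
        }
        out[n++] = (char)c;
    }
    out[n] = 0;
    return n;
}
```
.DE
The reading loop never needs to know how many sources there are;
that knowledge is confined to the wrap routine.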
.PP
This routine is also a convenient place
to print tables, summaries, etc. at the end
of a program.  Note that it is not
possible to write a normal rule which recognizes
end-of-file; the only access to this condition is
through
.I
yywrap.
.R
In fact, unless a private version of
.I
input()
.R
is supplied
a file containing nulls
cannot be handled,
since a value of 0 returned by
.I
input
.R
is taken to be end-of-file.
.PP
In Ratfor all of the standard I/O library
routines,
.I
input, output, unput, yywrap,
.R
and
.I
lexshf,
.R
are defined as integer functions.
This requires
.I
input
.R
and
.I
yywrap
.R
to be called with arguments.  One dummy
argument is supplied and ignored.
.NH
Ambiguous Source Rules.
.PP
Lex can handle ambiguous specifications.
When more than one expression can match the
current input, Lex chooses as follows:
.IP 1)
The longest match is preferred.
.IP 2)
Among rules which matched the same number of characters,
the rule given first is preferred.
.LP
Thus, suppose the rules
.DS
integer		keyword action ...;
[a-z]+		identifier action ...;
.DE
to be given in that order.  If the input is "integers",
it is taken as an identifier, because "[a-z]+" matches
8 characters while "integer" matches only 7.  If the input
is "integer", both rules match 7 characters, and
the keyword rule is selected because it was given first.
Anything shorter (e.g. "int") will not match the expression "integer"
and so the identifier interpretation is used.
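.PP
The two preference rules can be written out by hand for this pair of
expressions.
The C sketch below is an illustration only;
choose_rule and the three constants are hypothetical names.
.DS
```c
#include <string.h>
#include <ctype.h>

enum { KEYWORD = 0, IDENTIFIER = 1, NO_MATCH = -1 };

/* Applies the two rules "integer" and [a-z]+ at the start of s.
   Stores the match length in *len and returns the rule chosen. */
int choose_rule(const char *s, int *len)
{
    int klen = strncmp(s, "integer", 7) == 0 ? 7 : 0;
    int ilen = 0;

    while (islower((unsigned char)s[ilen]))
        ilen++;

    if (klen == 0 && ilen == 0) {
        *len = 0;
        return NO_MATCH;
    }
    if (klen >= ilen) {          /* on a tie the rule given first wins */
        *len = klen;
        return KEYWORD;
    }
    *len = ilen;                 /* otherwise the longer match wins */
    return IDENTIFIER;
}
```
.DE
The inputs "integers", "integer" and "int" give an identifier of length 8,
the keyword, and an identifier of length 3 respectively.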
.PP
The principle of preferring the longest
match makes rules containing
expressions like
\&.*
dangerous.
For example,
.DS
\&'.*'
.DE
might seem a good way of recognizing
a string in single quotes.
But it is an invitation for the program to read far
ahead, looking for a distant
single quote.
Presented with the input
.DS
\&'first' quoted string here, 'second' here
.DE
the above expression will match
.DS
\&'first' quoted string here, 'second'
.DE
which is probably not what was wanted.
A better rule is of the form
.DS
\&'[^'\en]*'
.DE
which, on the above input, will stop
after 'first'.
The consequences
of errors like this are mitigated by the fact
that the . operator will not match newline.
Thus expressions like .* stop on the
current line.
Don't try to defeat this with expressions like
[.\en]+
or
equivalents;
the Lex generated program will try to read
the entire input file, causing
internal buffer overflows.
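.PP
The difference between the two quoted-string expressions can be
checked with the following C sketch.
Both function names are hypothetical, and the quote and newline are
written as their ASCII codes 39 and 10.
.DS
```c
/* 39 is the ASCII code of the single quote, 10 that of newline. */
enum { QUOTE = 39, NEWLINE = 10 };

/* Length matched by quote, any characters, quote with longest-match
   semantics: runs to the LAST quote on the current line.
   Returns 0 if there is no match. */
int greedy_quote_len(const char *s)
{
    int i, last = 0;

    if (s[0] != QUOTE)
        return 0;
    for (i = 1; s[i] != 0 && s[i] != NEWLINE; i++)
        if (s[i] == QUOTE)
            last = i + 1;
    return last;
}

/* Length matched when quotes and newlines are excluded from the
   middle: stops at the FIRST closing quote.
   Returns 0 if there is no match. */
int bounded_quote_len(const char *s)
{
    int i;

    if (s[0] != QUOTE)
        return 0;
    for (i = 1; s[i] != 0 && s[i] != NEWLINE; i++)
        if (s[i] == QUOTE)
            return i + 1;
    return 0;
}
```
.DE
On the sample input above the first function returns 36, reaching the
quote after "second", while the second returns 7.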
.PP
Note that Lex is normally partitioning
the input stream, not searching for all possible matches
of each expression.
This means that each character is accounted for
once and only once.
For example, suppose it is desired to
count occurrences of both "she" and "he" in an input text.
Some Lex rules to do this might be
.DS
she		s++;
he		h++;
\en		|
\&.		;
.DE
where the last two rules ignore everything besides "he" and "she".
Remember that . does not include newline.
Since "she" includes "he", Lex will normally
.I
not
.R
recognize
the instances of "he" included in "she",
since once it has passed a "she" those characters are gone.
.PP
Sometimes the user would like to override this choice.  The action
REJECT
means "go do the next alternative."
It causes whatever rule was second choice after the current
rule to be executed.
The position of the input pointer is adjusted accordingly.
Suppose the user really wants to count the included instances of "he"
in the previous example;
one possible alternative is
the rule set
.DS
she		{s++; REJECT;}
he		{h++; REJECT;}
\en		|
\&.		;
.DE
After counting each expression, it is rejected; whenever appropriate,
the other expression will then be counted.  In this example, of course,
the user could note that "she" includes "he" but not
vice versa, and omit the REJECT action on "he";
in other cases, however, it
would not be possible a priori to tell
which input characters
were in both classes.
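.PP
The two ways of counting can be compared in a C sketch.
It imitates the effect of the rule sets above by direct string
comparison and is not the mechanism Lex uses;
count_she_he and its reject flag are hypothetical.
.DS
```c
#include <string.h>

/* Counts "she" and "he" in text.  With reject == 0 the input is
   partitioned, so the "he" inside a "she" is not seen; with
   reject != 0 every starting position is tried for both words. */
void count_she_he(const char *text, int reject, int *s, int *h)
{
    *s = *h = 0;
    while (*text != 0) {
        if (strncmp(text, "she", 3) == 0) {
            (*s)++;
            text += reject ? 1 : 3;    /* rejecting gives the text back */
        } else if (strncmp(text, "he", 2) == 0) {
            (*h)++;
            text += reject ? 1 : 2;
        } else {
            text++;                    /* the rules that ignore a character */
        }
    }
}
```
.DE
For the input "she said he" the partitioning count is one "she" and one "he";
with rejection it is one "she" and two "he".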
.PP
Consider the two rules
.DS
a[bc]+	{ ... ; REJECT;}
a[cd]+	{ ... ; REJECT;}
.DE
If the input is "ab", only the first rule matches,
and on "ad" only the second matches.
The input string "accb"
matches the first rule for four characters
and then the second rule for three characters.
In contrast, the input "accd" agrees with
the second rule for four characters and then the first
rule for three.
.PP
In general, REJECT is useful whenever
the purpose of Lex is not to partition the input
stream but to detect all examples of some items
in the input, and the instances of these items
may overlap or include each other.
Suppose a digram table of the input is desired;
normally the digrams overlap, that is the word "the"
is considered to contain
both "th" and
"he".
Assuming a two-dimensional array named
.ul
digram
to be incremented, the appropriate
source is
.DS
%%
[a-z][a-z]	{digram[yytext[0]][yytext[1]]++; REJECT;}
\&.		;
\en		;
.DE
where the REJECT is necessary to pick up
a letter pair beginning at every character, rather than at every
other character.
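.PP
The same table can be built by a direct loop in C, which makes the
overlap explicit.
In this sketch the table is indexed by letter offset rather than by
character code, an assumption that requires a character set such as
ASCII in which the lower case letters are contiguous;
count_digrams is a hypothetical name.
.DS
```c
/* Adds to table the count of every adjacent pair of lower case
   letters in s, starting a pair at every character position. */
void count_digrams(const char *s, int table[26][26])
{
    int i;

    for (i = 0; s[i] != 0 && s[i + 1] != 0; i++) {
        if (s[i] >= 'a' && s[i] <= 'z' && s[i + 1] >= 'a' && s[i + 1] <= 'z')
            table[s[i] - 'a'][s[i + 1] - 'a']++;
    }
}
```
.DE
For the word "the" both the entry for "th" and the entry for "he" are incremented.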
.NH
Lex Source Definitions.
.PP
Remember the format of the Lex
source:
.DS
{definitions}
%%
{rules}
%%
{user routines}
.DE
So far only the rules have been described.  The user needs
additional options,
though, to define variables for use in his program and for use
by Lex.
These can go either in the definitions section
or in the rules section.
.PP
Remember that Lex is turning the rules into a program.
Any source not intercepted by Lex is copied
into the generated program.  There are three classes
of such things.
.IP 1)
Any line which begins with a blank or tab and which is not
part of a Lex rule or action is copied into
the Lex generated program.
Such source input prior to the first %% delimiter will be external
to any function in the code; if it appears immediately after the first
%%,
it appears in an appropriate place for declarations
in the function written by Lex which contains the actions.
This material must look like program fragments,
and should precede the first Lex rule.
.IP
As a side effect of the above, lines which begin with a blank
or tab, and which contain a comment,
are passed through to the generated program.
This can be used to include comments in either the Lex source or
the generated code.  The comments should follow the host
language convention.
.IP 2)
Anything included between lines containing
only
%{ and %} is
copied out as above.  The delimiters are discarded.
This format permits entering text like preprocessor statements that
must begin in column 1,
or copying lines that do not look like programs.
.IP 3)
Anything after the second %% delimiter, regardless of formats, etc.,
is copied out after the Lex output.
.PP
Definitions intended for Lex are given
before the first %% delimiter.  Any line in this section
not contained between %{ and %}, and beginning
in column 1, is assumed to define Lex substitution strings.
The format of such lines is
.DS
name translation
.DE
and it
causes the string given as a translation to
be associated with the name.
The name and translation
must be separated by at least one blank or tab.
The translation can then be called out
by the {name} syntax in a rule.
For example,
.DS
D		[0-9]
E		[DEde][-+]?{D}+
%%
{D}+		printf("integer");
{D}+"."{D}*({E})?	|
{D}*"."{D}+({E})?	|
{D}+{E}			printf("real");
.DE
uses the strings {D} (representing the digits)
and {E} (representing an exponent field)
to abbreviate the rules.
Note the first two rules for real numbers;
both require a decimal point and contain
an optional exponent field,
but the first requires at least one digit before the
decimal point and the second requires at least one
digit after the decimal point.
To correctly handle the problem
posed by a Fortran expression such as "35.EQ.I",
which does not contain a real number, a context-sensitive
rule such as
.DS
[0-9]+/"."EQ	printf("integer");
.DE
could be used in addition to the normal rule for integers.
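.PP
To make the three real number rules concrete, here is a hand-written
C sketch which classifies a complete string in the same way.
It is an illustration under stated assumptions, not what Lex generates:
the names classify, INTEGER, REAL and NOT_NUMBER are hypothetical,
and the routine tests a whole string rather than finding the
longest match in a stream, so it does not show the trailing
context rule.
.DS
```c
#include <ctype.h>

enum { NOT_NUMBER, INTEGER, REAL };

/* Skip a run of digits starting at *p; return how many were skipped. */
static int digits(const char **p)
{
    int n = 0;

    while (isdigit((unsigned char)**p)) {
        (*p)++;
        n++;
    }
    return n;
}

/* Classify a whole string according to the rules above. */
int classify(const char *s)
{
    int before, after = 0, dot = 0, expo = 0;

    before = digits(&s);                 /* {D}* or {D}+ */
    if (*s == '.') {
        dot = 1;
        s++;
        after = digits(&s);
    }
    if (*s == 'D' || *s == 'E' || *s == 'd' || *s == 'e') {   /* {E} */
        s++;
        if (*s == '-' || *s == '+')
            s++;
        if (digits(&s) == 0)
            return NOT_NUMBER;           /* an exponent needs a digit */
        expo = 1;
    }
    if (*s != '\0')
        return NOT_NUMBER;               /* trailing text */
    if (dot)                             /* a digit on either side will do */
        return (before > 0 || after > 0) ? REAL : NOT_NUMBER;
    if (before == 0)
        return NOT_NUMBER;
    return expo ? REAL : INTEGER;        /* {D}+{E} or {D}+ */
}
```
.DE
With this sketch "35." and ".5" are both real, "35" is an integer,
and "35.EQ.I" taken as a whole is not a number at all.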
.PP
The definitions
section may also contain other commands, including the
selection of a host language, a character set table,
a list of start conditions, or a lexical analyzer name change.
All of these are discussed later.
.NH
Usage.
.PP
There are two steps in
compiling a Lex source program.
First, the Lex source must be turned into a generated program
in the host general purpose language.
Then this program must be compiled and loaded, usually with
a library of Lex subroutines.
The generated program
is on a file named "lex.yy.c" for a C host
language source and "lex.yy.r" for a Ratfor
host environment.
There are two I/O libraries, one for C defined in terms
of the C
portable library [6], and
the other defined in terms of Ratfor.
To indicate that a Lex source file
is intended to be used with the Ratfor host language,
make the first line of the file "%R".
.PP
The C programs generated by Lex are slightly different
on OS/370, because the
OS compiler is less powerful than the UNIX or GCOS compilers,
and does less at compile time.
C programs generated on GCOS and UNIX are the same.
The C host language is default, but may be explicitly
requested by making the first line of the source file
"%C".
.PP
The Ratfor generated by Lex is the same
on all systems, but cannot be compiled directly on TSO.
See below for instructions.  The Ratfor I/O
library, however, varies slightly
because
the different Fortrans
disagree on the method of indicating end-of-input
and the name of the library routine for logical AND.
The Ratfor I/O library, dependent on
Fortran character I/O, is quite slow.
In particular it reads all input lines as 80A1 format; this
truncates any longer
line, discarding your data,
and pads any shorter line with blanks.
The library version of
.I
input
.R
removes the padding (including any trailing blanks from
the original input) before processing.
Each source
file using a Ratfor host should begin with the "%R" command.
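.PP
The combined effect of 80A1 input and the removal of padding
can be imitated in C.
The routine below is only a sketch of that behavior as described above,
not the library code; the names card_image and CARD are hypothetical.
.DS
```c
#include <string.h>

#define CARD 80

/* Imitate reading one line in 80A1 format and then stripping the
   padding: text past column 80 is lost, and trailing blanks
   (padding or original) are removed.  out must hold CARD + 1 bytes.
   Returns the length of the result. */
size_t card_image(const char *line, char *out)
{
    size_t n = strcspn(line, "\n");   /* length without the newline */

    if (n > CARD)
        n = CARD;                     /* truncate long lines */
    memcpy(out, line, n);
    while (n > 0 && out[n - 1] == ' ')
        n--;                          /* drop trailing blanks */
    out[n] = '\0';
    return n;
}
```
.DE
Note that blanks at the end of an input line cannot be told apart
from padding, so they disappear as well.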
.PP
.I
UNIX.
.R
The libraries are accessed by the loader flags
-llc for C and -llr for Ratfor; the C name may
be abbreviated to -ll.
So an appropriate
set of commands is
.KS
.in 5
.TS
c 5c 
l l .
C Host	Ratfor Host
.sp
lex source	lex source
cc lex.yy.c -ll -lp	rc -2 lex.yy.r -llr
.TE
.in 0
.KE
The resulting program is placed on the usual file
.I
a.out
.R
for later execution.
To use Lex with Yacc see below.
Although the current Lex library uses the C portable library,
the Lex output itself does not do so;
if private versions of
.I
input,
output
.R
and
.I
unput
.R
are given, the portable library can be avoided.
In this case, the "-lp" option to the "cc" command
should be omitted.
Note the "-2" option in the
Ratfor compile command;
this requests the larger version of the compiler,
a useful precaution.
.PP
.I
GCOS.
.R
The Lex commands on GCOS are stored in the "." library.
The appropriate command sequences are:
.KS
.TS
c 5c 
l l .
C Host	Ratfor Host
.sp
\&./lex source	./lex source
\&./cc lex.yy.c ./lexclib h=	./rc a= lex.yy.r ./lexrlib h=
.TE
.KE
The resulting program is placed on the usual
file
.I
\&.program
.R
for later execution (as
indicated by the "h=" option);
it may be copied
to a permanent file if desired.
Note the "a=" option in the Ratfor
compile command; this indicates that the Fortran
compiler is to run in ASCII mode.
.PP
.I
TSO.
.R
Lex is just barely available on TSO.  Restrictions imposed
by the compilers which must be used with its output make it rather
inconvenient.
To use the C version, type
.DS
exec 'dot.lex.clist(lex)' 'sourcename'
exec 'dot.lex.clist(cload)' 'libraryname membername'
.DE
The first command analyzes the source file and
writes a C program
on file
.ul
lex.yy.text.
The second command runs this file through the C compiler
and links it with the Lex C library (stored on 'hr289.lcl.load')
placing the final object program in your file
.ul
libraryname.LOAD(membername)
as a completely linked load module.
The compiling command uses a special version of the C compiler
command on TSO which provides an unusually large intermediate
assembler file to compensate for the unusual bulk of C-compiled
Lex programs on the OS system.
Even so, almost any Lex source program is too big to compile, and
must be split.
.PP
The same Lex command will compile Ratfor Lex programs,
leaving a file
.ul
lex.yy.rat
instead of
.ul
lex.yy.text
in your directory.
The Ratfor program must be edited, however, to compensate
for peculiarities of IBM Ratfor.
A command sequence to do this, and then compile and load,
is available.  The full commands are:
.DS
exec 'dot.lex.clist(lex)' 'sourcename'
exec 'dot.lex.clist(rload)' 'libraryname membername'
.DE
with the same overall effect as the C language commands.
However, the Ratfor commands will run in a 150K byte
partition, while the C commands require 250K bytes to operate.
.PP
The steps involved in
processing
the generated Ratfor program are:
.IP a.
Edit the Ratfor program.
.RS
.IP 1.
Remove all tabs.
.IP 2.
Change all lower case letters to upper case letters.
.IP 3.
Convert the file to an 80-column card image file.
.RE
.IP b.
Process the Ratfor through the Ratfor preprocessor to get Fortran code.
.IP c.
Compile the Fortran.
.IP d.
Load with the
libraries 'hr289.lrl.load' and 'sys1.fortlib'.
.LP
The final load module
will only read input in 80-character fixed length records.
.NH
Lex and Yacc.
.PP
If you want to use Lex with Yacc, note that what Lex writes is a program
named
.I
yylex(),
.R
the name required by Yacc for its analyzer.
Normally, the default main program on the Lex library
calls this routine, but if Yacc is loaded, and its main
program is used, Yacc will call
.I
yylex().
.R
In this case each Lex rule should end with
.DS
return(token);
.DE
where the appropriate token value is returned.
An easy way to get access
to Yacc's names for tokens is to
compile the Lex output file as part of
the Yacc output file by placing the line
.DS
# include "lex.yy.c"
.DE
in the last section of Yacc input.
Supposing the grammar to be
named "good" and the lexical rules to be named "better"
the UNIX command sequence can just be:
.DS
yacc good
lex better
cc y.tab.c -ly -ll -lp
.DE
The Yacc library (-ly) should be loaded before the Lex library,
to obtain a main program which invokes the Yacc parser.
The Lex and Yacc programs can be generated in
either order.
.NH
Examples.
.PP
As a trivial problem, consider copying an input file while
adding 3 to every positive number divisible by 7.
Here is a suitable Lex source program
.DS
%%
	int k;
[0-9]+	{
	scanf(-1, yytext, "%d", &k);
	if (k%7 == 0)
		printf("%d", k+3);
	else
		printf("%d",k);
	}
.DE
to do just that.
The rule [0-9]+ recognizes strings of digits;
.I
scanf
.R
converts the digits to binary
and stores the result in
.ul
k.
The operator % (remainder) is used to check whether
.ul
k
is divisible by 7; if it is,
it is incremented by 3 as it is written out.
It may be objected that this program will alter such
input items as
49.63
or
X7.
Furthermore, it increments the absolute value
of all negative numbers divisible by 7.
To avoid this, just add a few more rules after the active one,
as here:
.DS
%%
	int k;
-?[0-9]+	{
	scanf(-1, yytext, "%d", &k);
	printf("%d", k%7 == 0 ? k+3 : k);
	}
-?[0-9.]+	ECHO;
[A-Za-z][A-Za-z0-9]+	ECHO;
.DE
Numerical strings containing
a "." or preceded by a letter will be picked up by
one of the last two rules, and not changed.
The "if-else" has been replaced by
a C conditional expression to save space;
the form
.ul
a?b:c
means "if a then b else c".
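.PP
The action for the numeric rule can be tried on its own outside Lex.
In this sketch the name adjusted_value is hypothetical, and
.ul
sscanf
is used on the assumption that it is the present-day equivalent of the
string form of
.ul
scanf
shown above.
.DS
```c
#include <stdio.h>

/* Action for the rule -?[0-9]+ : text holds the matched characters,
   as yytext would.  Returns the number that the rule prints.
   adjusted_value is a hypothetical name. */
int adjusted_value(const char *text)
{
    int k = 0;

    sscanf(text, "%d", &k);           /* convert the digits to binary */
    return k % 7 == 0 ? k + 3 : k;    /* conditional expression a?b:c */
}
```
.DE
Thus "49" is printed as 52 while "50" is left alone.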
.PP
For an example of statistics gathering, here
is a program which histograms the lengths
of words, where a word is defined as a string of letters.
.DS
		int lengs[100];
%%
[a-z]+	lengs[yyleng]++;
\&.		|
\en		;
%%
yywrap()
{
int i;
printf("Length  No. words\en");
for(i=0; i<100; i++)
	if (lengs[i] > 0)
		printf("%5d%10d\en",i,lengs[i]);
return(1);
}
.DE
This program
accumulates the histogram, while producing no output.  At the end
of the input it prints the table.
The final statement
.I
return(1);
.R
indicates that Lex is to perform wrapup.  If
.I
yywrap
.R
returns zero (false)
it implies that further input is available
and the program is
to continue reading and processing.
Providing a
.I
yywrap
.R
that never
returns true causes an infinite loop.
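.PP
For comparison, the same tally can be kept by a plain C loop.
This is a sketch only; the names word_lengths and MAXLEN are
hypothetical, and the check against MAXLEN is an added guard
which the Lex program above does not contain.
.DS
```c
#include <ctype.h>

#define MAXLEN 100

/* Tally word lengths in text; a word is a run of lower-case letters,
   as in the rule [a-z]+ above.  Words of MAXLEN letters or more are
   skipped so that the array is never overrun. */
void word_lengths(const char *text, int lengs[MAXLEN])
{
    int len = 0;

    for (;; text++) {
        if (islower((unsigned char)*text)) {
            len++;                  /* still inside a word */
            continue;
        }
        if (len > 0 && len < MAXLEN)
            lengs[len]++;           /* a word has just ended */
        len = 0;
        if (*text == '\0')
            break;
    }
}
```
.DE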
.PP
As a larger example,
here are some parts of a program written by N. L. Schryer
to convert double precision Fortran to single precision Fortran.
Because Fortran does not distinguish upper and lower case letters,
this routine begins by defining a set of classes including
both cases of each letter:
.DS
a	[aA]
b	[bB]
c	[cC]
\&...
z	[zZ]
.DE
An additional class recognizes white space:
.DS
W	[ \et]*
.DE
The first rule changes
"double precision" to "real", or "DOUBLE PRECISION" to "REAL".
.DS
{d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} {
		printf(yytext[0]=='d'? "real" : "REAL");
		}
.DE
Care is taken throughout this program to preserve the case
(upper or lower)
of the original program.
The conditional operator is used to
select the proper form of the keyword.
The next rule copies continuation card indications to
avoid confusing them with constants:
.DS
^"     "[^ 0]	ECHO;
.DE
In the regular expression, the quotes surround the
blanks.
It is interpreted as
"beginning of line, then five blanks, then
anything but blank or zero".
Note the two different meanings of ^.
There follow some rules to change double precision
constants to ordinary floating constants.
.DS
[0-9]+{W}{d}{W}[+-]?{W}[0-9]+	|
[0-9]+{W}"."{W}{d}{W}[+-]?{W}[0-9]+	|
"."{W}[0-9]+{W}{d}{W}[+-]?{W}[0-9]+	{
			/* convert constants */
			for(p=yytext; *p != 0; p++)
				if (*p == 'd' || *p == 'D')
					*p=+ 'e'- 'd';
			ECHO;
			}
.DE
After the floating point constant is recognized, it is
scanned by the
.ul
for
loop
to find the letter 'd' or 'D'.
The program then adds 'e'-'d' to that, which converts
it to the next letter of the alphabet.
The modified constant, now single-precision,
is written out again.
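.PP
The conversion loop can be exercised separately.
The following sketch uses the hypothetical name single_precision
and spells the assignment operator +=, as present-day compilers expect;
it assumes a character set in which 'd' and 'e', and 'D' and 'E',
are adjacent, as in ASCII.
.DS
```c
/* Replace the exponent letter of a double precision constant,
   keeping its case: 'd' becomes 'e' and 'D' becomes 'E'.
   single_precision is a hypothetical name. */
void single_precision(char *text)
{
    char *p;

    for (p = text; *p != '\0'; p++)
        if (*p == 'd' || *p == 'D')
            *p += 'e' - 'd';      /* step to the next letter */
}
```
.DE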
There follows a series of names which must be respelled to remove
their initial "d".
By using the
array
.I
yytext
.R
the same action suffices for all the names (only a sample of
a rather long list is given here).
.DS
{d}{s}{i}{n}	|
{d}{c}{o}{s}	|
{d}{s}{q}{r}{t}	|
{d}{a}{t}{a}{n}	|
\&...
{d}{f}{l}{o}{a}{t}		printf("%s",yytext+1);
.DE
Another list of names must have initial "d" changed to initial "a":
.DS
{d}{l}{o}{g}	|
{d}{l}{o}{g}10	|
{d}{m}{i}{n}1	|
{d}{m}{a}{x}1	{
		yytext[0] =+ 'a' - 'd';
		ECHO;
		}
.DE
And one routine
must have initial "d" changed to initial "r":
.DS
{d}1{m}{a}{c}{h}	{yytext[0] =+ 'r'  - 'd';
			ECHO;
			}
.DE
To avoid such names as "dsinx" being detected as instances
of "dsin", some final rules pick up longer words as identifiers
and copy some surviving characters:
.DS
[A-Za-z][A-Za-z0-9]*	|
[0-9]+		|
\en		|
\&.		ECHO;
.DE
Note that this program is not complete; it
does not deal with the spacing problems in Fortran or
with the use of keywords as identifiers.
.NH
Left Context Sensitivity.
.PP
Sometimes
it is desirable to have several sets of lexical rules
to be applied at different times in the input.
For example, a compiler preprocessor might distinguish
preprocessor statements and analyze them differently
from ordinary statements.
This requires
sensitivity
to prior context, and there are several ways of handling
such problems.
The "^" operator, for example, is a prior context operator,
recognizing immediately preceding left context just as "$" recognizes
immediately following right context.
Adjacent left context could be extended, to produce a facility similar to
that for adjacent right context, but it is unlikely
to be as useful, since often the relevant left context
appeared some time earlier, such as at the beginning of a line.
.PP
This section describes three means of dealing
with different environments: a simple use of flags,
when only a few rules change from one environment to another,
the use of
.I
start conditions
.R
on rules,
and the possibility of making multiple lexical analyzers all run
together.
In each case, there are rules which recognize the need to change the
environment in which the
following input text is analyzed, and set some parameter
to reflect the change.  This may be a flag explicitly tested by
the user's action code; such a flag is the simplest way of dealing
with the problem, since Lex is not involved at all.
It may be more convenient,
however,
to have Lex remember the flags as initial conditions on the rules.
Any rule may be associated with a start condition.  It will only
be recognized when Lex is in
that start condition.
The current start condition may be changed at any time.
Finally, if the sets of rules for the different environments
are very dissimilar,
clarity may be best achieved by writing several distinct lexical
analyzers, and switching from one to another as desired.
.PP
Consider the following problem: copy the input to the output,
changing the word "magic" to "first" on every line which began
with the letter "a", changing "magic" to "second" on every line
which began with the letter "b", and changing
"magic" to "third" on every line which began
with the letter "c".  All other words and all other lines
are left unchanged.
.PP
These rules are so simple that the easiest way
to do this job is with a flag:
.DS
	int flag;
%%
^a	{flag = 'a'; ECHO;}
^b	{flag = 'b'; ECHO;}
^c	{flag = 'c'; ECHO;}
\en	{flag =  0 ; ECHO;}
magic	{
	switch (flag)
		{
		case 'a': printf("first"); break;
		case 'b': printf("second"); break;
		case 'c': printf("third"); break;
		default: ECHO; break;
		}
	}
.DE
This should be adequate.
.PP
To handle the same problem with start conditions, each
start condition must be introduced to Lex in the definitions section
with a line reading
.DS
%Start	name1 name2 ...
.DE
where the conditions may be named in any order.
The word "Start" may be abbreviated to "s" or "S".
The conditions may be referenced at the
head of a rule with the "<>" brackets:
.DS
<name1>expression
.DE
is a rule which is only recognized when Lex is in the
start condition "name1".
To enter a start condition,
execute the action statement
.DS
BEGIN name1;
.DE
which changes the start condition to "name1".
To resume the normal state,
.DS
BEGIN 0;
.DE
resets the initial condition
of the Lex automaton interpreter.
A rule may be active in several
start conditions:
.DS
<name1,name2,name3>
.DE
is a legal prefix.  Any rule not beginning with the
"<>" prefix operator is always active.
.PP
The same example as before can be written:
.DS
%START AA BB CC
%%
^a	{ECHO; BEGIN AA;}
^b	{ECHO; BEGIN BB;}
^c	{ECHO; BEGIN CC;}
\en	{ECHO; BEGIN 0;}
<AA>magic	printf("first");
<BB>magic	printf("second");
<CC>magic	printf("third");
.DE
where the logic is exactly the same as in the previous
method of handling the problem, but Lex does the work
rather than the user's code.
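.PP
The logic shared by both versions can also be written out by hand,
which shows what the flag or start condition is remembering.
This is a sketch under stated assumptions, not Lex output:
the names replace_magic and append are hypothetical, and the output
goes to a caller's buffer instead of the output stream.
.DS
```c
#include <string.h>

/* Append len characters of s to out, never overrunning cap bytes. */
static size_t append(char *out, size_t n, size_t cap,
                     const char *s, size_t len)
{
    while (len-- > 0 && n + 1 < cap)
        out[n++] = *s++;
    return n;
}

/* Copy in to out, replacing "magic" according to the first letter
   of the current line; cap is the size of out. */
void replace_magic(const char *in, char *out, size_t cap)
{
    int state = 0;      /* 0, 'a', 'b' or 'c': the start condition */
    int bol = 1;        /* nonzero at the beginning of a line */
    size_t n = 0;

    if (cap == 0)
        return;
    while (*in != '\0') {
        const char *piece = NULL;
        size_t len = 1;

        if (bol && (*in == 'a' || *in == 'b' || *in == 'c'))
            state = *in;                    /* like BEGIN AA, etc. */
        if (strncmp(in, "magic", 5) == 0) {
            len = 5;
            if (state == 'a')
                piece = "first";
            else if (state == 'b')
                piece = "second";
            else if (state == 'c')
                piece = "third";
        }
        bol = (*in == '\n');
        if (bol)
            state = 0;                      /* like BEGIN 0 */
        if (piece != NULL)
            n = append(out, n, cap, piece, strlen(piece));
        else
            n = append(out, n, cap, in, len);   /* like ECHO */
        in += len;
    }
    out[n] = '\0';
}
```
.DE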
.PP
To have several distinct analyzers cooperate,
it is necessary to change
all the "yy" names used internally by Lex.  A command line
of the form
.DS
lex +abc file
.DE
causes the output program to use names with "abc" instead of "yy".
For example, the output file name will be "lex.abc.c".
Why is this useful?  Well, Lex really writes code under the
name
.I
yylexl();
.R
in this case, it will be
.I
abclexl().
.R
The name
.I
yylex()
.R
merely calls a function
pointed to by the cell
.I
yyplex.
.R
Normally,
.I
yyplex
.R
points to
.I
yylexl()
.R
but it can be assigned
.I
abclexl()
.R
instead.
Thus, lexical
analyzers can be switched in midstream.
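.PP
The switching mechanism is an ordinary C function pointer.
The sketch below imitates it with stand-in routines;
every name in it (main_lexl, alt_lexl, plex, lex_entry, use_alternate)
is hypothetical and merely plays the role of the
corresponding Lex name.
.DS
```c
/* Two stand-in analyzers; real ones would be written by Lex.
   All names in this sketch are hypothetical. */
static int main_lexl(void) { return 1; }
static int alt_lexl(void)  { return 2; }

/* The cell that selects the active analyzer, in the role of yyplex. */
int (*plex)(void) = main_lexl;

/* The fixed entry point, in the role of yylex(). */
int lex_entry(void)
{
    return (*plex)();
}

/* Switch analyzers in midstream. */
void use_alternate(int on)
{
    plex = on ? alt_lexl : main_lexl;
}
```
.DE
Callers always use the one entry point;
only the contents of the cell change.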
.PP
If there were many complex rules, however, several
lexical analyzers might be a better way of doing this.
The same problem as above is written this way
in the following example.
Note that this version is substantially longer.
The use of several distinct
analyzers is best adapted to a
case of many conflicting rules.
.DS
.I
File lby
.R
%{
extern aalexl(), bblexl(), cclexl(), (*yyplex)();
%}
%%
^a	{yyplex=aalexl; ECHO; return(1);}
^b	{yyplex=bblexl; ECHO; return(1);}
^c	{yyplex=cclexl; ECHO; return(1);}
%%
main()
{
while (yylex());
cexit();
}
.sp
.I
File lba
.R
%{
extern yylexl(), (*yyplex)();
%}
%%
magic	printf("first");
\en	{ECHO; yyplex=yylexl; return(1);}
.sp
.I
File lbb
.R
%{
extern yylexl(), (*yyplex)();
%}
%%
magic	printf("second");
\en	{ECHO; yyplex=yylexl;return(1);}
.DE
File
.I
lbc
.R
is just like
.I
lba
.R
and
.I
lbb
.R
with "magic" printed as "third".
Note that the "main" analyzer,
file
.I
lby,
.R
switches the function pointed to by
.I
yyplex
.R
each time it sees a special beginning of line character.
The subsidiary analyzers all switch back to the main
one every time they reach the end of a line.
To run this collection, the commands
.DS
lex lby
lex +aa lba
lex +bb lbb
lex +cc lbc
cc lex.yy.c lex.aa.c lex.bb.c lex.cc.c -ll -lp
.DE
would analyze and compile
these lexical analyzers into a program to do the job.
.PP
In addition to changing the name of the analyzer
by command line option, if the first line of the
Lex source
is
.DS
%+ XX
.DE
the name will be changed to XX.
.PP
As far as Lex is concerned, changing analyzers is simple;
but Yacc users should be cautious, because Yacc may look ahead
a token, so that switching at the right time may be tricky.
.PP
The multiple lexical analyzer feature does not provide
the ability to pipe the output
stream from one Lex routine into another Lex
program as input.
Since the I/O routines
.ul
input
and
.ul
output
are not renamed,
several analyzers still share the same input stream.
A new option to provide analyzers with different I/O routine
names may be added later.
A still more interesting feature would
be the ability to combine two such machines at compile time.
.NH
Character Set.
.PP
The programs generated by Lex handle
character I/O only through the routines
.I
input,
output,
.R
and
.I
unput.
.R
Thus the character representation
provided in these routines
is accepted by Lex and employed to return
values in
.I
yytext.
.R
For internal use
a character is represented as a small integer
which, if the standard library is used,
has a value equal to the integer value of the bit
pattern representing the character on the host computer.
In C, the I/O routines are assumed to deal directly
in this representation.  In Ratfor, it is anticipated
that many users will prefer left-adjusted rather than right-adjusted
characters; thus the routine
.I
lexshf
.R
is called to change the representation delivered by
.I
input
.R
into a right-adjusted integer.
If the user changes the I/O library,
the routine
.I
lexshf
.R
should also be changed to a compatible version.
The Ratfor library
I/O system is arranged to represent the letter
.I
a
.R
as in the Fortran value
.I
1Ha
.R
while in C the letter
.I
a
.R
is represented as the character constant
.I
\&'a'.
.R
If this interpretation is changed, by providing I/O
routines which translate the characters,
Lex must be told about
it, by giving a translation table.
This table must be in the definitions section,
and must be bracketed by lines containing  only
"%T".
The table contains lines of the form
.DS
{integer} {character string}
.DE
which indicate the value associated with each character.
Thus
.DS
%T
 1	Aa
 2	Bb
\&...
26	Zz
27	\en
28	+
29	-
30	0
31	1
\&...
39	9
%T
.DE
maps the lower and upper case letters together into the integers 1 through 26,
newline into 27, + and - into 28 and 29, and the
digits into 30 through 39.
Note the escape for newline.
If a table is supplied, every character that is to appear either
in the rules or in any valid input must be included
in the table.
No character
may be assigned the number 0, and no character may be
assigned a bigger number than the size of the hardware character set.
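.PP
An I/O routine which translates characters needs the same mapping
that the table gives to Lex.
As a sketch, the example table could be built in C as follows;
the name build_table is hypothetical, the letters and digits are
assumed to be contiguous as in ASCII, and the value 0 is used here
only to mark characters left out of the table.
.DS
```c
/* Fill table so that table[c] is the internal code for character c,
   following the %T example above; 0 marks characters left out.
   build_table is a hypothetical name. */
void build_table(unsigned char table[256])
{
    int i;

    for (i = 0; i < 256; i++)
        table[i] = 0;
    for (i = 0; i < 26; i++) {          /* both cases share one code */
        table['A' + i] = (unsigned char)(i + 1);
        table['a' + i] = (unsigned char)(i + 1);
    }
    table['\n'] = 27;
    table['+'] = 28;
    table['-'] = 29;
    for (i = 0; i < 10; i++)            /* digits: 30 through 39 */
        table['0' + i] = (unsigned char)(30 + i);
}
```
.DE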
.PP
It is not likely that C users will wish to use
the character table feature; but for Fortran portability
it may be essential.
.PP
Although the contents of the Lex Ratfor library routines
for input and output run almost unmodified on UNIX, GCOS, and OS/370,
they are not really machine independent, and would not work with CDC
or Burroughs Fortran compilers.
The user is of course
welcome to replace
.ul
input, output, unput
and
.ul
lexshf
but to replace them by completely portable Fortran routines
is likely to cause a substantial decrease in the speed of Lex Ratfor
programs.
A simple way to produce portable routines would be to
leave
.ul
input
and
.ul
output
as routines that read with 80A1 format,
but replace
.ul
lexshf
by a table lookup routine.
.NH
Program Sizes and Timings.
.PP
Consider the following sequence of Lex source programs, all of
which will read an entire C program and ignore it.
As a reference point,
program 0 refers to a routine that merely calls
.I
input
.R
to read all the characters of the input, without doing anything
with them.  Lex is not involved in program 0.
.PP
The first Lex program
is program 1, which
ignores the input character by character.
.DS
Program 1:
.sp
%%
\&.	;
\en	;
.DE
.PP
The next program recognizes keywords and some identifiers,
but nothing else.
.DS
Program 2:
.sp
.ta 12
%%
int	;
char	;
float	;
double	;
struct	;
auto	;
register	;
static	;
extern	;
return	;
goto	;
if	;
else	;
switch	;
break	;
continue	;
while	;
do	;
for	;
default	;
case	;
entry	;
[a-zA-Z]+	;
\en	;
\&.	;
.DE
The next program, program 3, recognizes operators and constants as well as
the keywords and identifiers.
It is printed in double column to save space; please don't conclude
that Lex would accept this format.
Many of the operators in the first column
that are quoted need not be, but it is easier to
quote them all than remember that (for example)
the quotes in "=+" are essential while those in "=-"
are unnecessary.
Furthermore the extra quotes offer security
against extensions to Lex which introduce new operators.
.DS
Program 3:
.sp
.TS
L L10 L L .
%%		float	;
[ \et]	;	double	;
^\e001	;	struct	;
"\en"	;	auto	;
"/*"	;	register	;
"||"	;	static	;
"&&"	;	extern	;
"++"	;	return	;
"--"	;	goto	;
"->"	;	if	;
"!="	;	else	;
"<="	;	switch	;
"<<"	;	break	;
">="	;	continue	;
">>"	;	while	;
"=="	;	do	;
"=+"	;	for	;
"=-"	;	default	;
"=*"	;	case	;
"=/"	;	entry	;
"=%"	;	[a-zA-Z_][a-zA-Z0-9_]*	;
"=<<"	;	[0-9]+	;
"=>>"	;	'	;
"=&"	;	\e"	;
"=|"	;	[0-9]+"."[0-9]*	;
"=^"	;	"."[0-9]+	;
int	;	.	;
char	;
.TE
.DE
And finally, Program 4 partitions the program into the longest
pieces, recognizing a line at a time:
.DS
Program 4:
.sp
%%
\&.*	;
\en	;
.DE
The following tables give the results of running all the above
through Lex and then compiling and timing them.
The first table gives the statistics of the Lex processing
itself, while the remaining tables
give the properties of the Lex-generated programs.
Remember that the reference program, program 0,
contains only I/O and support routines, but not any Lex output.
The performance of Lex output itself is more fairly represented
by the differences between the measured size and time and the size and
time for program 0 in the same host language and operating environment.
.PP
The Lex compilation measurements give the processor time,
in seconds, to translate the different source programs.
The size of the output program generated, in terms of both the
number of states in the finite automaton
and the number of entries in the
transition table,
is also given.  These, of course, do not change from system
to system.  All times are given for generating C output;
there would be no significant difference to generate Ratfor.
.sp
.KS
.ce 100
Table 1
.sp
Measurements of Lex Compilations
.TS
c c1 s3 c3 c1 s1 s 
c c c c c c c 
l n n n n n n .
	Automaton	Table	CPU Time 
Source	Rules	States	Size	(UNIX)	(GCOS)	(OS)
Program 0	0	0	0	0	0	0
Program 1	2	4	129	2.1	1.9	1.6
Program 2	25	101	290	27.4	15.8	7.4
Program 3	55	158	439	45.7	27.9	12.0
Program 4	2	4	256	3.4	2.8	1.8
.TE
.ce 0
.KE
The number of states increases with the length and
the intricacy of the rules in a complex way.
The time required for Lex
to process the source program varies similarly.
Lex spends roughly one-third
of its time processing the rules, one-third condensing the
resulting tables,
and one-third further compressing and formatting them.
.PP
The speed and size of the object programs follow
quite a different
pattern.  The time taken to process an input file
is directly proportional to the length of the input.
Beyond that, the speed of the program is related
to the lengths of the
partitions, reflecting the overhead involved in
transferring in and out of the search routine.
The tables below give the program sizes in bytes,
including all I/O library and basic support routines
as well as Lex written code and tables.
The times given represent the number of microseconds
per character to read C input text and ignore it.
.sp
.KS
.ce 100
Table 2
.sp
UNIX Measurements
.TS
c c s c s 
c c c c c 
l n n5 n n .
	Ratfor	   C   
Source	Size	Time	Size	Time
Program 0	6176	3700	1490	111
Program 1	10740	6640	3062	301
Program 2	13706	6230	4568	261
Program 3	16108	6350	5802	270
Program 4	11756	4570	3570	145
.TE
.ce 0
.KE
Ratfor on UNIX is
much slower and bulkier than C.
The default
I/O routines for Lex in Ratfor are not tailored to UNIX, and could
be substantially accelerated
by using local features such
as integer*2.
Since the main reason for using Ratfor on UNIX is
to improve portability, however, this seems pointless.
.PP
In contrast, the C routines used for
basic I/O were specially written for UNIX, and are not in
the portable library.
If the portable library were used, it would add
about 1700 bytes to each program size and
about 100 microseconds per character to the timings.
If comparisons between UNIX and the other systems
are desired, timings adjusted
for the portable library would be fairer.
The time taken by program 0 on the Honeywell 6000,
for example, could
be decreased by about a factor of 2
by using machine-tailored I/O routines.
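.PP
The comparison against program 0 is a simple subtraction.
As a worked example, the sketch below uses the C times for UNIX
from Table 2; the names unix_c_time and incremental_time
are hypothetical.
.DS
```c
/* Time in microseconds per character from Table 2, C host on UNIX,
   for programs 0 through 4. */
static const int unix_c_time[5] = { 111, 301, 261, 270, 145 };

/* Cost attributable to the Lex output itself: measured time less
   the time of the I/O-only reference, program 0. */
int incremental_time(int program)
{
    return unix_c_time[program] - unix_c_time[0];
}
```
.DE
By this measure program 1 costs 190 microseconds per character
beyond the basic I/O, and program 4 only 34.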
.PP
On GCOS the Fortran compiler is much better,
and it must be remembered that the Ratfor input routine is handling
about seven times as many characters as the C input routine
since every input line is padded to 80 characters.
In the table below, the sizes are rounded up by GCOS
to the nearest multiple of 320 words, or 1280 bytes.
If sizes were measured in words rather
than bytes,
the GCOS sizes would appear half as large as they do below
relative to UNIX.
On the other hand, if the timings were measured relative
to machine operation
times, rather than to real time, UNIX would appear more than twice as fast
as it now does relative to GCOS.
.sp
.KS
.ce 100
Table 3
.sp
GCOS Measurements
.TS
c c s c s 
c c c c c 
l n n5 n n .
	Ratfor	   C   
Source	Size	Time	Size	Time
Program 0	35840	251	35840	77
Program 1	40960	371	39680	153
Program 2	44800	361	42240	157
Program 3	47360	391	44800	171
Program 4	42240	332	40960	114
.TE
.ce 0
.KE
A factor of two in the incremental sizes and times
with respect to program 0 could probably be achieved
by generating analyzers in GMAP rather than in C or Fortran.
.PP
On OS/370, the hardware is faster still,
but the C and Fortran runtime I/O libraries are slower.
This makes the content of the lexical analyzer less
important in comparison to the basic character fetch
operation.
Thus, a user really interested in optimizing his IBM 370
programs should first of all provide a faster
set of
.ul
input, unput,
and
.ul
output
routines.
.sp
.KS
.ce 100
Table 4
.sp
OS/370 Measurements
.TS
c c s c s 
c c c c c 
l n n5 n n .
	Ratfor	   C   
Source	Size	Time	Size	Time
Program 0	25592	267	47280	86
Program 1	32960	313	52832	122
Program 2	36936	316	56720	153
Program 3	39416	323	58992	154
Program 4	34992	277	54888	107
.TE
.ce 0
.KE
As before, sizes are in bytes, and times are in microseconds per character.
Programs 2, 3 and 4 were too large for the IBM
C compiler, and were compiled by splitting the
Lex translation into several files.
.NH
Summary of Source Format.
.PP
The general form of a Lex source file is:
.DS
{definitions}
%%
{rules}
%%
{user subroutines}
.DE
The definitions section contains
a combination of
.IP 1)
Definitions, in the form "name space translation".
.IP 2)
Included code, in the form "space code".
.IP 3)
Included code, in the form
.DS
%{
code
%}
.DE
.ns
.IP 4)
Character set tables, in the form
.DS
%T
number space character-string
\&...
%T
.DE
.ns
.IP 5)
Start conditions, given in the form
.DS
%S name1 name2 ...
.DE
.ns
.IP 6)
A name change, which must precede any rules or included code, in the form
"%+ xx".
.IP 7)
A language specifier, which must also precede any rules
or included code, in the form
"%C" for C or "%R" for Ratfor.
.LP
Lines in the rules section have the form "expression  action"
where the action may be continued on succeeding
lines by using braces to delimit it.
.PP
Regular expressions in Lex use the following
operators:
.DS
x	the character "x"
"x"	an "x", even if x is an operator.
\ex	an "x", even if x is an operator.
[xy]	the character x or y.
[x-z]	the characters x, y or z.
[^x]	any character but x.
\&.	any character but newline.
^x	an x at the beginning of a line.
<y>x	an x when Lex is in start condition y.
x$	an x at the end of a line.
x?	an optional x.
x*	0,1,2, ... instances of x.
x+	1,2,3, ... instances of x.
x|y	an x or a y.
(x)	an x.
x/y	an x but only if followed by y.
{xx}	the translation of xx from the definitions section.
.DE
.NH
Caveats and Bugs.
.PP
A. V. Aho has pointed out that there are regular expressions whose conversion
to deterministic machines produces pathological
consequences (exponential growth of the machine size).
Anyone encountering such an expression in a realistic problem
is asked to write to the author.
.PP
REJECT does not rescan the input; instead it remembers the results of the previous
scan.  This means that if a rule with trailing context is found, and
REJECT executed, the user
must not have used
.ul
unput
to change the characters forthcoming
from the input stream.
This is the only restriction on the user's ability to manipulate
the not-yet-processed input.
.PP
TSO Lex is an older version.
The multiple analyzer
mechanism doesn't work.  Neither do yymore, yyless, REJECT,
start conditions, or variable length trailing context.
Any significant Lex source is also too big for the IBM C compiler
when translated.
.NH
Acknowledgments.
.PP
As should
be obvious from the above, the outside of Lex
is patterned
on Yacc and the inside on Aho's string matching routines.
Therefore, both S. C. Johnson and A. V. Aho
are really originators
of much of Lex,
as well as debuggers of it.
Many thanks are due to both.
.SG MH-1274-MEL-unix
.sp 2
.NH
References.
.IP 1.
D. M. Ritchie, B. W. Kernighan, and M. E. Lesk,
.I
The C Programming Language,
.R
Computing Science Technical Report,
to appear 1975,
.MH
.IP 2.
B. W. Kernighan,
.I
Ratfor: A Preprocessor for a Rational Fortran,
.R
to appear in
.I
Software Practice and Experience, 1975.
.IP 3.
S. C. Johnson,
.I
Yacc: Yet Another Compiler Compiler,
.R
Computing Science Technical Report No. 32,
1975,
.MH
.if \n(tm (also TM 75-1273-6)
.IP 4.
A. V. Aho and M. J. Corasick,
.I
Efficient String Matching: An Aid to Bibliographic Search,
.R
Comm. ACM
.B
18,
.R
333-340 (1975).
.IP 5.
B. W. Kernighan, D. M. Ritchie and K. L. Thompson,
.I
QED Text Editor,
.R
Computing Science Technical Report No. 5,
1972,
.MH
.IP 6.
M. E. Lesk,
.I
The Portable C Library,
.R
contained in reference [1], above,
Computing Science Technical Report,
to appear 1975,
.MH
.if \n(tm (also TM 75-1274-11)
.LP