[TUHS] Regular Expressions

Doug McIlroy doug at cs.dartmouth.edu
Sun Aug 2 07:12:54 AEST 2020


> 1. What's the provenance of regex in unix (when did it appear, in what form, etc)?
> 2. What are the 'best' implementations throughout unix (keep it pre1980s)?
> 3. What are some of the milestones along the way (major changes, forks, disagreements)?


The editor ed was in Unix from day 1. For the necessarily tiny
 implementation, Ken discarded various features
from the ancestral qed. Among the casualties was alternation
in regular expressions. It has never fully returned.

Ken's original paper described a method for simulating all paths
of a nondeterministic finite automaton in parallel, although he
didn't describe it in these exact terms. This meant he had to
keep track of up to n possible states, where n is the number of
terminal symbols in the regular expression.

"Computing Reviews" published a scathing critique of the paper:
everyone knows a deterministic automaton can recognize regular
expressions with one state transition per input character; what
a waste of time to have to keep track of multiple states! What the
review missed was that the size of the DFA can be exponential in n.
For one-shot use, as in an editor, it can take far longer to construct
the DFA than to run it.

This lesson came home with a vengeance when Al Aho wrote egrep,
which implemented full regular expressions as DFA's. I happened
to be writing calendar(1) at the same time, and used egrep to

search calendar files for dates in rather free formats for today
and all days through the next working day.  Here's an example
(egrep interprets newline as "|"):

(^|[ (,;])(([Aa]ug[^ ]* *|(08|8)/)0*1)([^0123456789]|$)
(^|[ (,;])((\* *)0*1)([^0123456789]|$)
(^|[ (,;])(([Aa]ug[^ ]* *|(08|8)/)0*2)([^0123456789]|$)
(^|[ (,;])((\* *)0*2)([^0123456789]|$)
(^|[ (,;])(([Aa]ug[^ ]* *|(08|8)/)0*3)([^0123456789]|$)
(^|[ (,;])((\* *)0*3)([^0123456789]|$)

Much to Al's chagrin, this regular expression took the better
part of a minute to compile into a DFA, which would then run in
microseconds. The trouble was that the DFA was enormously
bigger than the input--only a tiny fraction of the machine's
states would be visited; the rest were useless. That led
him to the brilliant idea of constructing the machine on
the fly, creating only the states that were pertinent to
the input at hand. That innovation made the DFA again
competitive with an NFA.

Doug


More information about the TUHS mailing list