[COFF] Requesting thoughts on extended regular expressions in grep.

Fri Mar 3 23:42:15 AEST 2023

Hi Dan,

> > If you want to understand:
> >
> > - the maths of regular expressions,
> > - the syntax of regexps which these days expresses more than REs, and
> > - the regexp engines in programs, the differences in how they work
> >   and what they match, and
> > - how to efficiently steer an engine's internals
> >
> > then I recommend Jeffrey Friedl's Mastering Regular Expressions.
> > http://regex.info/book.html
>
> I'm afraid I must sound a note of caution about Friedl's book.  Russ
> Cox alludes to some of the problems in the "History and References"
> section of his page (https://swtch.com/~rsc/regexp/regexp1.html), that
> was linked earlier

Russ says:

 1 ‘Finally, any discussion of regular expressions would be incomplete
    without mentioning Jeffrey Friedl's book Mastering Regular Expressions,
    perhaps the most popular reference among today's programmers.
 2  Friedl's book teaches programmers how best to use today's regular
    expression implementations, but not how best to implement them.
 3  What little text it devotes to implementation issues perpetuates the
    widespread belief that recursive backtracking is the only way to
    simulate an NFA.
 4  Friedl makes it clear that he [neither understands nor respects] the
    underlying theory.’  http://regex.info/blog/2006-09-15/248

I think Grant is after what Russ addresses in sentence 2.  :-)

> The impression is that Friedl shows wonderfully how to _use_ regular
> expressions, but does not understand the theory behind their
> implementation.

Yes, Friedl does show that wonderfully.  From long-ago memory, Friedl
understands enough to have diagrams of NFAs and DFAs clocking through
their inputs, showing the differences in number of states, etc.

Yes, Friedl says an NFA must recursively backtrack.  As Russ says in #3,
it was a ‘widespread belief’.  Friedl didn't originate it; I ‘knew’ it
before reading his book.  Friedl was at the sharp end of regexps,
needing to process large amounts of text, at Yahoo! IIRC.  He
investigated how the programs available behaved; he didn't start at the
theory and come up with a new program best suited to his needs.

> Personally, I'd stick with Russ's stuff, especially as `egrep` is the
> target here.

Russ's stuff is great.  He refuted that widespread belief, for one
thing.  But Russ isn't trying to teach a programmer how to best use the
regexp engine in sed, grep, egrep, Perl, PCRE, ... whereas Friedl takes
the many pages needed to do this.

It depends what one wants to learn first.

As Friedl says in the post Russ linked to:

   ‘As a user, you don't care if it's regular, nonregular, unregular,
    irregular, or incontinent.  So long as you know what you can expect
    from it (something this chapter will show you), you know all you need
    to care about.

   ‘For those wishing to learn more about the theory of regular expressions,
    the classic computer-science text is chapter 3 of Aho, Sethi, and
    Ullman's Compilers — Principles, Techniques, and Tools (Addison-Wesley,
    1986), commonly called “The Dragon Book” due to the cover design.
    More specifically, this is the “red dragon”.  The “green dragon”
    is its predecessor, Aho and Ullman's Principles of Compiler Design.’

In addition to the Dragon Book, Hopcroft and Ullman's ‘Automata Theory,
Languages, and Computation’ goes further into the subject.  Chapter two
has DFA, NFA, epsilon transitions, and uses searching text as an
example.  Chapter three is regular expressions, four is regular
languages.  Pushdown automata is chapter six.

Too many books, not enough time to read.  :-)

-- 
Cheers, Ralph.