[COFF] Requesting thoughts on extended regular expressions in grep.

Ralph Corderoy ralph at inputplus.co.uk
Wed Mar 8 03:34:51 AEST 2023


Hi Dan,

> I'm sure that with enough effort, it _is_ possible to make a 50K RE
> that is understandable to mere mortals.  But it begs the question: why
> bother?

It could be the quickest way to express the intent.

> The answer to that question, in my mind, shows the difference between
> a clever programmer and a pragmatic engineer.

I think those two can overlap.  :-)

> I submit that it's time to reach for another tool well before you get
> to an RE that big

Why, if the grammar is type three in Chomsky's hierarchy, i.e. a regular
grammar?  I think sticking with code aimed at regular grammars, or more
likely regexps, will do better than, say, a parser generator for a
type-two context-free grammar.

As well as the lex(1) family, there's Ragel as another example.
http://www.colm.net/open-source/ragel/

> 3.  Optimizing regular expressions.  You're still bound by the known
> theoretical properties of finite automata here.

Well, we're back to the RE v. regexp again.  /^[0-9]+\.jpeg$/ is matched
by some engines by first checking the last five bytes are ‘.jpeg’.

    $ debugperl -Dr -e \
    >     '"123546789012354678901235467890123546789012.jpg" =~ /^[0-9]+\.jpeg$/'
    ...
    Matching REx "^[0-9]+\.jpeg$" against "123546789012354678901235467890123546789012.jpg"
    Intuit: trying to determine minimum start position...
      doing 'check' fbm scan, [1..46] gave -1
      Did not find floating substr ".jpeg"$...
    Match rejected by optimizer
    Freeing REx: "^[0-9]+\.jpeg$"
    $

Boyer-Moore string searching can be used.  Common-subregexp-elimination
can spot repetitive fragment of regexp and factor them into a single set
of states along with pairing the route into them with the appropriate
route out.

The more regexp engines are optimised, the more benefit to the
programmer from sticking to a regexp rather than, say, ad hoc parsing.

The theory of REs is interesting and important, but regexps deviate from
it ever more.

-- 
Cheers, Ralph.


More information about the COFF mailing list