[COFF] Requesting thoughts on extended regular expressions in grep.

Wed Mar 8 02:14:49 AEST 2023

On Mon, Mar 6, 2023 at 11:01 PM Ed Bradford <egbegb2 at gmail.com> wrote:
>[snip]
> I think it is possible to make a 50K RE that is understandable. However, it requires
> a lot of 'splainin' throughout the code. I'm naive though; I will eventually discover
> a lack of truth in that belief, if such exists.

Actually, I believe you. I'm sure that with enough effort, it _is_
possible to make a 50K RE that is understandable to mere mortals. But
it raises the question: why bother? The answer to that question, in my
mind, shows the difference between a clever programmer and a pragmatic
engineer. I submit that it's time to reach for another tool well
before you get to an RE that big, and if one is still considering such
a thing, one must really ask what properties of REs and the problem at
hand one thinks lend themselves to that as the solution.

>[snip]
> It sounds to me like an "optimizer" is needed. There is already a compiler
> that uses FA's.

I'm not sure what you're referring to here, though you were replying
to me. There are a couple of different threads floating around:

1. Writing really big regular expressions: this is probably a bad
idea. Don't do it (see below).
2. Writing a recognizer for dates. Yeah, the small REs you have for
that are fine. If you want to extend those to arbitrary date formats,
I think you'll find it starts getting ugly.
3. Optimizing regular expressions. You're still bound by the known
theoretical properties of finite automata here.

> Is someone else going to create a program
> to look for dates without using regular expressions?

Many people have already done so. :-)
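
As one small illustration, here is a minimal sketch of a date recognizer
that uses no regular expression at all. It assumes a single ISO-style
YYYY-MM-DD format; the name parse_iso_date is made up for this example
and is not taken from any existing program.

```python
from datetime import date


def parse_iso_date(text):
    """Return a date for 'YYYY-MM-DD', or None if text is not a valid date.

    Hypothetical helper: it handles exactly one format, on purpose.
    """
    parts = text.split("-")
    # Require exactly three fields of widths 4, 2 and 2.
    if len(parts) != 3 or [len(p) for p in parts] != [4, 2, 2]:
        return None
    # Only plain ASCII digits are allowed in each field.
    if not all(p.isascii() and p.isdigit() for p in parts):
        return None
    year, month, day = (int(p) for p in parts)
    try:
        # date() itself rejects impossible dates such as 2023-02-30.
        return date(year, month, day)
    except ValueError:
        return None
```

Note that the calendar check (days per month, leap years) falls out of
the date constructor here, which is the part that tends to get ugly
when it is encoded in an RE.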

> Today, I write small-sized RE's. If I write a giant RE, there is nothing preventing
> the owner of RE world to change how they are used. For instance: compile your RE
> and a subroutine/function is produced that performs the RE search.

I'm not sure I understand what you mean.

The theory here is well understood: we know recognizers for regular
languages can be built from DFAs that run in time linear in the size
of their inputs, but we also know that constructing such a DFA can
take time and space exponential in the size of the RE, and is thus
impractical for many REs.
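
To make the blow-up concrete, here is a minimal sketch using the
textbook language "the n-th symbol from the end is an 'a'" over the
alphabet {a, b}. Its NFA needs only n+1 states, but the subset
construction reaches 2^n DFA states. The function names and the state
numbering are assumptions made for this example.

```python
def nfa_step(states, symbol, n):
    """Advance a set of NFA states by one input symbol.

    State 0 loops on every symbol and also guesses, on 'a', that this
    is the n-th symbol from the end. States 1..n-1 just count the
    remaining symbols. State n is accepting and has no way out.
    """
    nxt = set()
    for s in states:
        if s == 0:
            nxt.add(0)
            if symbol == "a":
                nxt.add(1)
        elif s < n:
            nxt.add(s + 1)
    return frozenset(nxt)


def count_dfa_states(n):
    """Count DFA states reachable via the subset construction."""
    start = frozenset({0})
    seen = {start}
    work = [start]
    while work:
        current = work.pop()
        for symbol in "ab":
            target = nfa_step(current, symbol, n)
            if target not in seen:
                seen.add(target)
                work.append(target)
    return len(seen)


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(n, count_dfa_states(n))
```

Each DFA state has to remember which of the last n symbols were 'a',
so the count doubles every time n grows by one, while the RE itself
only grows by a few characters.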

We know that NDFA simulators can be built in time and space linear in
the length of the RE, but that the resulting recognizers will be
superlinear at runtime, proportional to the product of the length of
input, number of states, and number of edges between states in the state
transition graph. For a very large regular expression, that's going to
be a pretty big number, and even on modern CPUs won't be particularly
fast. Compilation to native code won't really help you.
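
A minimal sketch of that kind of simulation follows, assuming an NFA
without epsilon transitions that is stored as a plain dictionary. The
names nfa_accepts and nth_from_end_nfa are invented for this example.
The point is only that every input character costs a pass over the
currently active states and their outgoing edges.

```python
def nfa_accepts(transitions, start, accepting, text):
    """Simulate an NFA by tracking the set of active states.

    transitions maps (state, symbol) to a set of successor states.
    """
    current = {start}
    for ch in text:
        nxt = set()
        # Work per character grows with the number of active states
        # and the edges leaving them.
        for state in current:
            nxt |= transitions.get((state, ch), set())
        current = nxt
        if not current:
            return False  # no live states left, so no match is possible
    return not current.isdisjoint(accepting)


def nth_from_end_nfa(n):
    """Build the NFA for 'the n-th symbol from the end is an a'."""
    transitions = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for i in range(1, n):
        transitions[(i, "a")] = {i + 1}
        transitions[(i, "b")] = {i + 1}
    return transitions


if __name__ == "__main__":
    nfa = nth_from_end_nfa(3)
    for word in ("abb", "bab", "aabb"):
        print(word, nfa_accepts(nfa, 0, {3}, word))
```

Turning the inner loop into native code changes the constant factor,
but the loop over active states is still executed for every character
of input.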

There is no "owner of RE world" that can change that. If you can find
some way to do so, I think that would qualify as a major breakthrough
in computer science.

> RE is a language, not necessarily an implementation.
> At least that is my understanding.

Regular expressions describe regular languages, but as I mentioned
above, the theory gives the currently understood bounds for their
performance characteristics. It's kinda like the speed of light in
this regard; we can't really make it go faster.

        - Dan C.