[COFF] Requesting thoughts on extended regular expressions in grep.

Dan Cross crossd at gmail.com
Fri Mar 3 23:47:39 AEST 2023


On Thu, Mar 2, 2023 at 10:53 PM Grant Taylor via COFF <coff at tuhs.org> wrote:
>[snip
> Here's an example of the full RE:
>
> ^\w{3} [ :[:digit:]]{11} [._[:alnum:]-]+
> postfix/msa/smtpd\[[[:digit:]]+\]: timeout after STARTTLS from
> [._[:alnum:]-]+\[[.:[:xdigit:]]+\]$
>
> As you can see the "[ :[:digit:]]{11}" is actually only a sub-part of a
> larger RE and there is bounding & delimiting around the subpart.

Oh, for sure; to be clear, it was obvious that in the earlier
discussion the original was just part of something larger.

FWIW, this RE seems ok to me; the additional context makes it unlikely
to match something else accidentally.

> This is to match a standard message from postfix via standard SYSLOG.
>
> > Ah. I suspect this relies on domain knowledge about the format of log
> > lines to match reliably. Otherwise it could match, `___ 123 456:789`
> > which is probably not what you are expecting.
>
> Yep.
>
> Though said domain knowledge isn't anything special in and of itself.

It needn't be special.  The point is simply that there's some external
knowledge that can be brought to bear to guide the shape of the REs.
In this case, you know that log lines won't begin with `___ 123
456:789` or other similar junk.

> [snip]
> > Basically, a regular expression is a regular expression if you can build
> > a machine with no additional memory that can tell you whether or not a
> > given string matches the RE examining its input one character at a time.
>
> I /think/ that I could build a complex nested tree of switch statements
> to test each character to see if things match what they should or not.
> Though I would need at least one variable / memory to hold absolutely
> minimal state to know where I am in the switch tree.  I think a number
> to identify the switch statement in question would be sufficient.  So
> I'm guessing two bytes of variable and uncounted bytes of program code.

Kinda. The "machine" in this case is actually an abstraction, like a
Turing machine. The salient point here is that REs map to finite state
machines, and in particular, one need not keep (say) a stack of prior
states when simulating them. Note that even in an NDFA simulation,
where one keeps track of what states one may be in, one doesn't need
to keep track of how one got into those states.

Obviously in a real implementation you've got the program counter,
register contents, local variables, etc, all of which consume "memory"
in the conventional sense. But the point is that you don't need
additional memory proportional to anything other than the size of the
RE. DFA implementation could be implemented entirely with `switch` and
`goto` if one wanted, as opposed to a bunch of mutually recursive
function calls, NDFA simulation similarly except that you need some
(bounded) additional memory to hold the active set of states. Contrast
this with a pushdown automata, which can parse a context-free
language, in which a stack is maintained that can store additional
information relative to the input (for example, an already seen
character). Pushdown automata can, for example, recognize matched
parenthesis while regular languages cannot.

Anyway, sorry, this is all rather more theoretical than is perhaps
interesting or useful. Bottom line is, I think your REs are probably
fine. `egrep` will complain at you if they are not, and I wouldn't
worry too much about optimizing them: I'd "stop" whenever you're happy
that you've got something understandable that matches what you want it
to match.

        - Dan C.


More information about the COFF mailing list