[COFF] Requesting thoughts on extended regular expressions in grep.

Fri Mar 3 11:05:51 AEST 2023

On 3/2/23 2:53 PM, Dan Cross wrote:
> Well, obviously the former matches any sequence 3 of 
> alpha-numerics/underscores at the beginning of a string, while the 
> latter only matches abbreviations of months in the western calendar; 
> that is, the two REs are matching very different things (the latter 
> is a strict subset of the former).

I completely agree with you.  That's also why I'm wanting to start 
utilizing the latter, more specific RE.  But I don't know where the line 
of over complicating things is to avoid crossing it.

> But I suspect you mean in a more general sense.

Yes and no.  Does the comment above clarify at all?

> ...do you really want to match a space, a colon and a single digit 
> 11 times ...

Yes.

> ... in a single string?

What constitutes a single string?  ;-)  I sort of rhetorically ask.

The log lines start with

MMM dd hh:mm:ss

Where:
  - MMM is the month abbreviation
  - dd  is the day of the month
  - hh  is the hour of the day
  - mm  is the minute of the hour
  - ss  is the second of the minute

So, yes, there are eleven characters that fall into the class consisting 
of a space or a colon or a number.

Is that a single string?  It depends what you're looking at, the 
sequences of non white space in the log? No.  The patter that I'm 
matching ya.

> Using character classes would greatly simplify what you're trying to 
> do. It seems like this could be simplified to (untested) snippet:

Agreed.

I'm starting with the examples that came with; "^\w{3} [ 
:[:digit:]]{11}", the logcheck package that I'm working with and 
evaluating what I want to do.

I actually like the idea of dividing out the following:

  - months that have 31 days: Jan, Mar, May, Jul, Aug, Oct, and Dec
  - months that have 30 days: Apr, Jun, Sep, Nov
  - month that have 28/29 days: Feb

> ( [1-9]|[12][[0-9]]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9]

Aside:  Why do you have the double square brackets in "[12][[0-9]]"?

> For this, I'd probably eschew `[:digit:]`. Named character classes 
> are for handy locale support, or in lieu of typing every character 
> in the alphabet (though we can use ranges to abbreviate that), but 
> it kind of seems like that's not coming into play here and, IMHO, 
> `[0-9]` is clearer in context.

ACK

"[[:digit:]]+" was a construct that I'm parroting.  It and 
[.:[:xdigit:]]+ are good for some things.  But they definitely aren't 
the best for all things.

Hence trying to find the line of being more accurate without going too far.

> It's not clear to me that dates, in their generality, can be 
> matched with regular expressions.  Consider leap years; you'd almost 
> necessarily have to use backtracking for that, but I admit I haven't 
> thought it through.

Given the context that these extended regular expressions are going to 
be used in, logcheck -- filtering out known okay log entries to email 
what doesn't get filtered -- I'm okay with having a few things slip 
through like leap day / leap seconds / leap frogs.

> `\w` is a GNU extension; I'd probably avoid it on portability grounds 
> (though `\b` is very handy).

I hear, understand, and acknowledge your concern.  At present, these 
filters are being used in a package; logcheck, which I believe is 
specific to Debian and ilk.  As such, GNU grep is very much a thing.

I'm also not a fan of the use of `\w` and would prefer to (...|...) things.

> The thing about regular expressions is that they describe regular 
> languages, and regular languages are those for which there exists a 
> finite automaton that can recognize the language. An important class 
> of finite automata are deterministic finite automata; by definition, 
> recognition by such automata are linear in the length of the input.
> 
> However, construction of a DFA for any given regular expression can be 
> superlinear (in fact, it can be exponential) so practically speaking, 
> we usually construct non-deterministic finite automata (NDFAs) and 
> "simulate" their execution for matching. NDFAs generalize DFAs (DFAs 
> are a subset of NDFAs, incidentally) in that, in any non-terminal 
> state, there can be multiple subsequent states that the machine can 
> transition to given an input symbol. When executed, for any state, 
> the simulator will transition to every permissible subsequent state 
> simultaneously, discarding impossible states as they become evident.
> 
> This implies that NDFA execution is superlinear, but it is bounded, 
> and is O(n*m*e), where n is the length of the input, m is the number 
> of nodes in the state transition graph corresponding to the NDFA, and 
> e is the maximum number of edges leaving any node in that graph (for 
> a fully connected graph, that would m, so this can be up to O(n*m^2)). 
> Construction of an NDFA is O(m), so while it's slower to execute, it's 
> actually possible to construct in a reasonable amount of time. Russ's 
> excellent series of articles that Clem linked to gives details and 
> algorithms.

I only vaguely understand those three paragraphs as they are deeper 
computer science than I've gone before.

I think I get the gist of them but could not explain them if my life 
depended upon it.

> In practical terms? Basically, don't worry about it too much. Egrep 
> will generate an NDFA simulation that's going to be acceptably fast 
> for all but the weirdest cases.

ACK

It sounds like I can make any reasonable extended regular expression a 
human can read and I'll probably be good.

Thank you for the detailed response Dan.  :-)

-- 
Grant. . . .
unix || die

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4017 bytes
Desc: S/MIME Cryptographic Signature
URL: <http://www.tuhs.org/pipermail/coff/attachments/20230302/58354682/attachment.p7s>