[COFF] Requesting thoughts on extended regular expressions in grep.
Grant Taylor via COFF
coff at tuhs.org
Fri Mar 3 11:05:51 AEST 2023
On 3/2/23 2:53 PM, Dan Cross wrote:
> Well, obviously the former matches any sequence 3 of
> alpha-numerics/underscores at the beginning of a string, while the
> latter only matches abbreviations of months in the western calendar;
> that is, the two REs are matching very different things (the latter
> is a strict subset of the former).
I completely agree with you. That's also why I'm wanting to start
utilizing the latter, more specific RE. But I don't know where the line
of over complicating things is to avoid crossing it.
> But I suspect you mean in a more general sense.
Yes and no. Does the comment above clarify at all?
> ...do you really want to match a space, a colon and a single digit
> 11 times ...
Yes.
> ... in a single string?
What constitutes a single string? ;-) I sort of rhetorically ask.
The log lines start with
MMM dd hh:mm:ss
Where:
- MMM is the month abbreviation
- dd is the day of the month
- hh is the hour of the day
- mm is the minute of the hour
- ss is the second of the minute
So, yes, there are eleven characters that fall into the class consisting
of a space or a colon or a number.
Is that a single string? It depends what you're looking at, the
sequences of non white space in the log? No. The patter that I'm
matching ya.
> Using character classes would greatly simplify what you're trying to
> do. It seems like this could be simplified to (untested) snippet:
Agreed.
I'm starting with the examples that came with; "^\w{3} [
:[:digit:]]{11}", the logcheck package that I'm working with and
evaluating what I want to do.
I actually like the idea of dividing out the following:
- months that have 31 days: Jan, Mar, May, Jul, Aug, Oct, and Dec
- months that have 30 days: Apr, Jun, Sep, Nov
- month that have 28/29 days: Feb
> ( [1-9]|[12][[0-9]]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9]
Aside: Why do you have the double square brackets in "[12][[0-9]]"?
> For this, I'd probably eschew `[:digit:]`. Named character classes
> are for handy locale support, or in lieu of typing every character
> in the alphabet (though we can use ranges to abbreviate that), but
> it kind of seems like that's not coming into play here and, IMHO,
> `[0-9]` is clearer in context.
ACK
"[[:digit:]]+" was a construct that I'm parroting. It and
[.:[:xdigit:]]+ are good for some things. But they definitely aren't
the best for all things.
Hence trying to find the line of being more accurate without going too far.
> It's not clear to me that dates, in their generality, can be
> matched with regular expressions. Consider leap years; you'd almost
> necessarily have to use backtracking for that, but I admit I haven't
> thought it through.
Given the context that these extended regular expressions are going to
be used in, logcheck -- filtering out known okay log entries to email
what doesn't get filtered -- I'm okay with having a few things slip
through like leap day / leap seconds / leap frogs.
> `\w` is a GNU extension; I'd probably avoid it on portability grounds
> (though `\b` is very handy).
I hear, understand, and acknowledge your concern. At present, these
filters are being used in a package; logcheck, which I believe is
specific to Debian and ilk. As such, GNU grep is very much a thing.
I'm also not a fan of the use of `\w` and would prefer to (...|...) things.
> The thing about regular expressions is that they describe regular
> languages, and regular languages are those for which there exists a
> finite automaton that can recognize the language. An important class
> of finite automata are deterministic finite automata; by definition,
> recognition by such automata are linear in the length of the input.
>
> However, construction of a DFA for any given regular expression can be
> superlinear (in fact, it can be exponential) so practically speaking,
> we usually construct non-deterministic finite automata (NDFAs) and
> "simulate" their execution for matching. NDFAs generalize DFAs (DFAs
> are a subset of NDFAs, incidentally) in that, in any non-terminal
> state, there can be multiple subsequent states that the machine can
> transition to given an input symbol. When executed, for any state,
> the simulator will transition to every permissible subsequent state
> simultaneously, discarding impossible states as they become evident.
>
> This implies that NDFA execution is superlinear, but it is bounded,
> and is O(n*m*e), where n is the length of the input, m is the number
> of nodes in the state transition graph corresponding to the NDFA, and
> e is the maximum number of edges leaving any node in that graph (for
> a fully connected graph, that would m, so this can be up to O(n*m^2)).
> Construction of an NDFA is O(m), so while it's slower to execute, it's
> actually possible to construct in a reasonable amount of time. Russ's
> excellent series of articles that Clem linked to gives details and
> algorithms.
I only vaguely understand those three paragraphs as they are deeper
computer science than I've gone before.
I think I get the gist of them but could not explain them if my life
depended upon it.
> In practical terms? Basically, don't worry about it too much. Egrep
> will generate an NDFA simulation that's going to be acceptably fast
> for all but the weirdest cases.
ACK
It sounds like I can make any reasonable extended regular expression a
human can read and I'll probably be good.
Thank you for the detailed response Dan. :-)
--
Grant. . . .
unix || die
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4017 bytes
Desc: S/MIME Cryptographic Signature
URL: <http://www.tuhs.org/pipermail/coff/attachments/20230302/58354682/attachment.p7s>
More information about the COFF
mailing list