[COFF] Requesting thoughts on extended regular expressions in grep.
Grant Taylor via COFF
coff at tuhs.org
Fri Mar 3 13:53:08 AEST 2023
On 3/2/23 8:04 PM, Dan Cross wrote:
> I guess what I'm saying is, match what you want to match and don't sweat
> the small stuff.
ACK
> Not exactly. :-)
>
> What I understand you to mean, based on this and the rest of your note,
> is that you want to find a good division point between overly specific,
> complex REs and simpler, easy to understand REs that are less specific.
> The danger with the latter is that they may match things you don't
> intend, while the former are harder to maintain and (arguably) more
> brittle. I can sympathize.
You got it.
> For the purposes of grep/egrep, that'll be a logical "line" of text,
> terminated by a newline, though the newline itself isn't considered part
> of the text for matching. I believe the `-z` option can be used to set a
> NUL byte as the "line" terminator; presumably this lets one match
> strings with embedded newlines, though I haven't tried.
Fair enough. That's also sort of what I thought might be the case.
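
To illustrate the record-terminator idea, here is a rough Python sketch (not grep itself) that searches the same input twice: once split on newline, once split on NUL. The helper name matching_records and the sample data are made up for illustration.

```python
import re

def matching_records(data: str, pattern: str, terminator: str = "\n") -> list[str]:
    """Return the records of `data` (split on `terminator`) that contain a match."""
    rx = re.compile(pattern)
    records = data.split(terminator)
    # A trailing terminator leaves one empty record at the end; drop it.
    if records and records[-1] == "":
        records.pop()
    return [r for r in records if rx.search(r)]

# Hypothetical input: two NUL-terminated records, the first spanning two lines.
data = "first line\nsecond line\0third line\0"

# Newline as terminator: no single record contains "line\nsecond".
by_newline = matching_records(data, r"line\nsecond")
# NUL as terminator: the embedded newline is ordinary text inside a record.
by_nul = matching_records(data, r"line\nsecond", terminator="\0")

print(by_newline)
print(by_nul)
```

With newline as the terminator the pattern can never span two lines; with NUL as the terminator the first record contains the newline and the pattern matches it.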
> "string" in this context is the input you're attempting to match
> against. `egrep` will attempt to match your pattern against each "line"
> of text it reads from the files it's searching. That is, each line in
> your log file(s).
*nod*
> But consider what `[ :[:digit:]]{11}` means: you've got a character
> class consisting of space, colon and a digit; {11} means "match any of
> the characters in that class exactly 11 times" (as opposed to other
> variations on the '{}' syntax that say "at least m times", "at most n
> times", or "between n and m times").
Yep, I'm well aware of that.
> But that'll match all sorts of things that don't look like 'dd
> hh:mm:ss':
That's one of the reasons that I'm interested in coming up with a more
precise regular expression ... without being overly complex.
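
A small sketch of the trade-off, in Python rather than egrep. Python's re module has no POSIX bracket classes such as [:digit:], so the class is written as [ :0-9] here. The STRICT pattern is only one possible tightening, and it assumes a space-padded day of month followed by a 24-hour hh:mm:ss; both names are made up for illustration.

```python
import re

# Translation of [ :[:digit:]]{11}: any 11 characters drawn from space, colon, digit.
LOOSE = re.compile(r"[ :0-9]{11}")

# Hypothetical stricter form: day " 1".."31", then hh:mm:ss with range checks.
STRICT = re.compile(
    r"( [1-9]|[12][0-9]|3[01]) ([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]"
)

samples = [
    " 3 13:53:08",  # a plausible 'dd hh:mm:ss'
    "1" * 11,       # eleven digits
    ":" * 11,       # eleven colons
    "1" * 9,        # too short for either pattern
]

# fullmatch() tests the whole sample, so only the 11-character shape is compared.
results = {s: (bool(LOOSE.fullmatch(s)), bool(STRICT.fullmatch(s))) for s in samples}
for s, (loose, strict) in results.items():
    print(repr(s), "loose:", loose, "strict:", strict)
```

The loose pattern accepts the first three samples, while the stricter one accepts only the timestamp, at the cost of a noticeably longer expression.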
> (The first line is my typing; the second is output from egrep except for
> the short line of 9 '1's, for which egrep had no output. The last two
> lines are matching space characters and egrep echoing the match, but I'm
> guessing gmail will eat those.)
>
> Note that there are inputs with more than 11 characters that match; this
> is because there is some 11-character substring that matches the RE in
> those lines. In any event, I suspect this would generally not be what
> you want. But if nothing else in your input can match the RE (which you
> might know a priori because of domain knowledge about whatever is
> generating those logs) then it's no big deal, even if the RE was capable
> of matching more things generally.
Yep.
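
The substring behaviour Dan describes can be sketched in Python, again using [ :0-9] as a stand-in for [ :[:digit:]]. Here re.search plays the role of grep's default "match anywhere in the line", and re.fullmatch stands in for requiring the whole line to match. The sample line is made up.

```python
import re

rx = re.compile(r"[ :0-9]{11}")

# Hypothetical input: fifteen digits wrapped in other characters.
line = "x" + "1" * 15 + "x"

anywhere = rx.search(line)       # succeeds if any 11-character substring matches
whole_line = rx.fullmatch(line)  # succeeds only if the entire line matches

print(anywhere is not None, whole_line is not None)
print(anywhere.group())          # the leftmost 11-character run that matched
```

The line is longer than 11 characters and still matches under search semantics, because an 11-character run inside it satisfies the RE. Anchoring with ^ and $, as the full RE does, is what rules this out.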
Here's an example of the full RE:
^\w{3} [ :[:digit:]]{11} [._[:alnum:]-]+
postfix/msa/smtpd\[[[:digit:]]+\]: timeout after STARTTLS from
[._[:alnum:]-]+\[[.:[:xdigit:]]+\]$
As you can see the "[ :[:digit:]]{11}" is actually only a sub-part of a
larger RE and there is bounding & delimiting around the subpart.
This is to match a standard message from postfix via standard SYSLOG.
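
For anyone who wants to poke at it outside of egrep, here is a rough Python translation. It assumes the three wrapped lines above join with single spaces, and it replaces the POSIX classes with ASCII ranges ([:digit:] as 0-9, [:alnum:] as A-Za-z0-9, [:xdigit:] as 0-9A-Fa-f), which may differ from what a given locale does. Both sample lines are invented for illustration.

```python
import re

FULL_RE = re.compile(
    r"^\w{3} [ :0-9]{11} [._A-Za-z0-9-]+ "
    r"postfix/msa/smtpd\[[0-9]+\]: timeout after STARTTLS from "
    r"[._A-Za-z0-9-]+\[[.:0-9A-Fa-f]+\]$",
    re.ASCII,  # keep \w to [A-Za-z0-9_]
)

# Made-up log lines; host name, PID and address are placeholders.
hit = ("Mar  3 13:53:08 mail postfix/msa/smtpd[12345]: "
       "timeout after STARTTLS from unknown[192.0.2.10]")
miss = ("Mar  3 13:53:08 mail postfix/msa/smtpd[12345]: "
        "connect from unknown[192.0.2.10]")

for line in (hit, miss):
    print(bool(FULL_RE.search(line)), line)
```

The surrounding literal text and the ^ and $ anchors do most of the work here; the loose timestamp sub-expression only has to cope with whatever can appear between the month and the host name.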
> Ah. I suspect this relies on domain knowledge about the format of log
> lines to match reliably. Otherwise it could match, `___ 123 456:789`
> which is probably not what you are expecting.
Yep.
Though said domain knowledge isn't anything special in and of itself.
> Sure. One nice thing about `egrep` et al is that you can put the REs
> into a file and include them with `-f`, as opposed to having them all
> directly on the command line.
Yep. logcheck makes extensive use of many files like this to do its work.
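
The pattern-file idea can be sketched in Python as follows. This imitates the general shape of `egrep -f`, not its exact semantics, and it says nothing about how logcheck is implemented. The file name, the patterns and the log lines are all hypothetical, and skipping blank lines is a choice made for this sketch.

```python
import re
import tempfile
from pathlib import Path

def load_patterns(path: Path) -> list[re.Pattern]:
    """Compile one regex per non-empty line of the pattern file."""
    return [re.compile(line) for line in path.read_text().splitlines() if line]

def filter_lines(lines: list[str], patterns: list[re.Pattern]) -> list[str]:
    """Keep lines matched by at least one pattern, preserving input order."""
    return [ln for ln in lines if any(p.search(ln) for p in patterns)]

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical pattern file, one RE per line.
    pattern_file = Path(tmp) / "patterns.txt"
    pattern_file.write_text("timeout after STARTTLS\nlost connection\n")
    patterns = load_patterns(pattern_file)

# Made-up log fragments.
log = [
    "connect from unknown[192.0.2.10]",
    "timeout after STARTTLS from unknown[192.0.2.10]",
    "lost connection after EHLO from unknown[192.0.2.10]",
]

selected = filter_lines(log, patterns)
print(selected)
```

Keeping the REs in a file means each one sits on its own line, where it can be commented on in review and changed without touching the command line.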
> Typo. :-)
ACKK
> That seems reasonable.
Thank you for the logic CRC.
> Aside: I found the note on its website amusing: Brought to you by the
> UK's best gambling sites! "Only gamble with what you can afford to
> lose." Yikes!
Um ... that's concerning.
> I'd proceed with caution here; it also seems to be in the FreeBSD and
> DragonFly ports collections and Homebrew on the Mac (but so is GNU grep
> for all of those).
Fair enough.
My use case is on Linux where GNU egrep is a thing.
> Yeah. IMHO `\w` is too general for what you're trying to do.
I think that `\w` is a good primer, but not where I want things to end
up long term.
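
One possible direction, sketched in Python: replace `\w{3}` with an explicit alternation of month abbreviations. This assumes the English three-letter names used in traditional syslog timestamps. The names LOOSE_PREFIX and MONTH_PREFIX are made up, and [ :0-9] again stands in for [ :[:digit:]].

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Current form: any three "word" characters before the timestamp field.
LOOSE_PREFIX = re.compile(r"^\w{3} [ :0-9]{11}", re.ASCII)
# Hypothetical tighter form: only a known month abbreviation is accepted.
MONTH_PREFIX = re.compile(r"^(" + MONTHS + r") [ :0-9]{11}")

samples = ["Mar  3 13:53:08", "___ 123 456:789"]
verdicts = {
    s: (bool(LOOSE_PREFIX.match(s)), bool(MONTH_PREFIX.match(s))) for s in samples
}
for s, (loose, month) in verdicts.items():
    print(repr(s), "\\w{3}:", loose, "months:", month)
```

The `___ 123 456:789` string from earlier in the thread gets past `\w{3}` but not past the month alternation, and the alternation is still short enough to read at a glance.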
> Basically, a regular expression is a regular expression if you can build
> a machine with no additional memory that can tell you whether or not a
> given string matches the RE examining its input one character at a time.
I /think/ that I could build a complex nested tree of switch statements
to test each character to see if things match what they should or not.
Though I would need at least one variable / memory to hold absolutely
minimal state to know where I am in the switch tree. I think a number
to identify the switch statement in question would be sufficient. So
I'm guessing two bytes of variable and uncounted bytes of program code.
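
A minimal sketch of that idea in Python, using if/elif in place of a switch. It recognises exactly two digits, a colon, two digits, a colon and two digits, and the only thing it remembers between characters is one small integer. The function name and the choice of pattern are mine, purely for illustration.

```python
def matches_hms(text: str) -> bool:
    """Recognise exactly dd:dd:dd (d = ASCII digit) with a single integer of state."""
    state = 0  # number of characters accepted so far (0..8); -1 is the dead state
    for ch in text:
        if state in (0, 1, 3, 4, 6, 7):
            # Positions that must hold a digit.
            state = state + 1 if ch in "0123456789" else -1
        elif state in (2, 5):
            # Positions that must hold a colon.
            state = state + 1 if ch == ":" else -1
        else:
            # State 8 (already complete) or -1 (already failed):
            # any further character means no match.
            state = -1
    return state == 8  # accept only if all eight characters were consumed

for candidate in ["13:53:08", "13:53:0", "13:53:088", "ab:cd:ef"]:
    print(repr(candidate), matches_hms(candidate))
```

Note that this checks a whole string rather than searching for a match anywhere in a line; a searching variant would need a somewhat different set of states, but still only a fixed, finite number of them.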
> I think that's about right.
Thank you again Dan.
> Sure thing!
:-)
--
Grant. . . .
unix || die