<div dir="ltr"><div class="gmail_default" style="font-family:monospace,monospace">  Thanks, Grant and contributors in<br>this thread,<br><br>Great thread on REs. I bought and read<br>the book (it's on the floor over there<br>in the corner and I'm not getting up).<br><br>My task was finding dates in binary<br>and text files. It turns out REs work just<br>fine for that. Because I was looking at<br>both text files and binary files, I<br>wrote my stuff using 8-bit Python<br>"bytes" rather than Python "str" text,<br>which is Unicode in Python 3 and would<br>need the data decoded first. (I use<br>Python because it works on Linux,<br>Macs, and Windows and reduces the<br>number of RE implementations I have<br>to deal with to one.)<br><br>I finished my first round of the<br>program in late fall of 2022. Then<br>I put it down and now I am<br>revisiting it. I was creating:<br><br>  A Python program to search for<br>  media files (pictures and movies)<br>  and copy them to another<br>  directory tree, copying only the<br>  unique ones (deduplication), and<br>  renaming each with<br>  <br>    <b>YYYY-MM-DD-</b><br><br>  as a prefix.<br>  <br><br>Here is a list of observations from my<br>programming.<br><br>1. REs are quite unreadable. I defined<br>   a lot of Python variables and simply<br>   added them together in Python to make<br>   a larger byte string (see below).</div><div class="gmail_default" style="font-family:monospace,monospace">   The resulting<br>   expressions were shorter on screen<br>   and more readable. Furthermore,<br>   I could construct them incrementally.<br>   I insist on readable code<br>   because I frequently put things down<br>   for a month or more. 
A while back<br>   it was a sad day when I restarted<br>   something and simply had to throw it<br>   away, moaning, "What was that<br>   programmer thinking?"<br><br>   Here is an example RE for<br>       YYYY-MM-DD<br><br>      # FR = front   BA = back<br>      # ymdt is the pattern text, ymdc the compiled RE<br>      ymdt = FRSEP + Y_ + SEP + M_ + SEP + D_ + BASEP<br>      ymdc = re.compile(ymdt)<br><br>     <br>1a. I also had a hard time defining<br>    delimiters. There are delimiters<br>    for the beginning, delimiters<br>    for internal separation,<br>    and delimiters for the end.<br><br>    The significant thing is that the RE<br>    has to match even when the date is<br>    the very first string in the file or<br>    the very last. That also complicates<br>    buffered reading immensely. Hence, I wrote<br>    the whole program by reading the<br>    file into a single Python variable.<br>    However, when files became much<br>    larger than memory, Python simply<br>    ground to a halt, as did my Windows<br>    machine. I then rewrote it using a<br>    memory-mapped file (for all files)<br>    and the problem was fixed.<br><br>2. Dates are formatted in a number of<br>   ways. I chose exactly one<br>   format to learn about REs<br>   and how to construct them and use<br>   them. Even the book didn't elaborate<br>   on everything. I could not find<br>   detailed documentation on some of<br>   the interfaces in the book.<br><br>   On a whim, I asked ChatGPT<br>   to write a Python module that returns<br>   a list of offsets and dates in a file.<br>   Surprisingly, it wrote one that was<br>   quite credible. It had bugs, but it<br>   knew more about how to use the various<br>   functional interfaces in REs than I<br>   did.<br><br>3. Testing an RE is maybe even more<br>   difficult than writing one. I have<br>   not given any serious effort to<br>   verification testing yet.<br><br>I would like to extend my program to<br>any date format. That would require<br>a much bigger RE. 
I have been led to<br>believe that a 50 Kbyte or 500 Kbyte<br>RE works just as well (if not<br>as fast) as a 100-byte RE. I think<br>with parentheses and<br>pipe symbols suitably used,<br>one could match<br><br>  Monday, March 6, 2023<br>  2023-03-06 <br>  Mar 6, 2023<br>  or<br>  ...<br><br>I'm just guessing, though. This<br>thread has been very informative.<br>I have much to read.<br>Thank you all.<br><br>Ed Bradford<br>Pflugerville, TX<br><br><br><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Mar 2, 2023 at 12:55 PM Grant Taylor via COFF &lt;<a href="mailto:coff@tuhs.org" target="_blank">coff@tuhs.org</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>

<br>

I'd like some thoughts ~> input on extended regular expressions used <br>

with grep, specifically GNU grep -e / egrep.<br>

<br>

What are the pros / cons to creating extended regular expressions like <br>

the following:<br>

<br>

    ^\w{3}<br>

<br>

vs:<br>

<br>

    ^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)<br>

<br>

Or:<br>

<br>

    [ :[:digit:]]{11}<br>

<br>

vs:<br>

<br>

    ( 1| 2| 3| 4| 5| 6| 7| 8| <br>

9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31) <br>

(0|1|2)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]<br>

<br>

I'm currently eliding the 61st (60) second, the 32nd day, and dealing <br>

with February having fewer days for simplicity.<br>

<br>

For matching patterns like the following in log files?<br>

<br>

    Mar  2 03:23:38<br>

<br>

I'm working on organically training logcheck to match known good log <br>

entries.  So I'm *DEEP* in the bowels of extended regular expressions <br>

(GNU egrep) that runs over all logs hourly.  As such, I'm interested in <br>

making sure that my REs are both efficient and accurate or at least not <br>

WILDLY badly structured.  The pedantic part of me wants to avoid <br>

wildcard type matches (\w), even if they are bounded (\w{3}), unless it <br>

truly is for unpredictable text.<br>

<br>

I'd appreciate any feedback and recommendations from people who have <br>

been using and / or optimizing (extended) regular expressions for longer <br>

than I have been using them.<br>

<br>

Thank you for your time and input.<br>

<br>

<br>

<br>

-- <br>

Grant. . . .<br>

unix || die<br>

<br>

</blockquote></div><br clear="all"><div><br></div><span>-- </span><br><div dir="ltr"><font face="'courier new', monospace"><span style="font-weight:900"><div>Advice is judged by results, not by intentions.</div><div>  Cicero</div></span></font><div><br></div></div></div>