<div dir="ltr"><div class="gmail_default" style="font-family:monospace,monospace"> Thanks, Grant and contributors in<br>this thread,<br><br>Great thread on RE's. I bought and read<br>the book (it's on the floor over there<br>in the corner and I'm not getting up).<br><br>My task was finding dates in binary<br>and text files. It turns out RE's work just<br>fine for that. Because I was looking at<br>both text files and binary files, I<br>wrote my stuff using 8-bit python<br>"bytes" rather than python "str" text,<br>which is Unicode, not raw bytes. (I use<br>python because it works on Linux,<br>Macs, and Windows and reduces the<br>number of RE implementations I have<br>to deal with to 1).<br><br>I finished my first round of the<br>program in late fall of 2022. Then<br>I put it down and now I am<br>revisiting it. I was creating:<br><br> A Python program to search for<br> media files (pictures and movies)<br> and copy them to another<br> directory tree, copying only the<br> unique ones (deduplication), and<br> renaming each with<br> <br> <b>YYYY-MM-DD-</b><br><br> as a prefix.<br> <br><br>Here is a list of observations from my<br>programming.<br><br>1. RE's are quite unreadable. I defined<br> a lot of python variables and simply<br> added them together in python to make<br> a larger byte string (see below).</div><div class="gmail_default" style="font-family:monospace,monospace"> The resulting<br> expressions were shorter on screen<br> and more readable. Furthermore,<br> I could construct them incrementally.<br> I insist on readable code<br> because I frequently put things down<br> for a month or more. A while back<br> it was a sad day when I restarted<br> something and simply had to throw it<br> away, moaning, "What was that<br> programmer thinking?".<br><br> Here is an example RE for<br> YYYY-MM-DD<br><br> # FR = front BA = back<br> # ymdt is the uncompiled pattern,<br> # ymdc the compiled one<br> ymdt = FRSEP + Y_ + SEP + M_ + SEP + D_ + BASEP<br> ymdc = re.compile( ymdt )<br><br> <br>1a. 
I also had a hard time defining<br> delimiters. There are delimiters<br> for the beginning, delimiters<br> for internal separation,<br> and delimiters for the end.<br><br> The significant thing is I have<br> to find a match even if it is the very<br> first string in the file or the<br> very last. That also complicates<br> buffered reading immensely. Hence, I wrote<br> the whole program by reading the<br> file into a single python variable.<br> However, when files became much<br> larger than memory, python simply<br> ground to a halt as did my Windows<br> machine. I then rewrote it using a<br> memory mapped file (for all files)<br> and the problem was fixed.<br><br>2. Dates are formatted in a number of<br> ways. I chose exactly one<br> format to learn about RE's<br> and how to construct them and use<br> them. Even the book didn't elaborate<br> on everything. I could not find<br> detailed documentation on some of<br> the interfaces in the book.<br><br> On a whim, I asked ChatGPT<br> to write a python module that returns<br> a list of offsets and dates in a file.<br> Surprisingly, it wrote one that was<br> quite credible. It had bugs but it<br> knew more about how to use the various<br> functional interfaces in RE's than I<br> did.<br><br>3. Testing an RE is maybe even more<br> difficult than writing one. I have<br> not given any serious effort to<br> verification testing yet.<br><br>I would like to extend my program to<br>any date format. That would require<br>a much bigger RE. I have been led to<br>believe that a 50Kbyte or 500Kbyte<br>RE works just as well (if not<br>as fast) as a 100 byte RE. I think<br>with parentheses and<br>pipe-symbols suitably used,<br>one could match<br><br> Monday, March 6, 2023<br> 2023-03-06 <br> Mar 6, 2023<br> or<br> ...<br><br>I'm just guessing, though. 
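<br><br>A minimal sketch of that guess, and not the<br>program described above: the fragment names<br>(YEAR, MONTH_NAME, WEEKDAY, find_dates and so on)<br>are made up for illustration. It assumes English<br>names, covers only the three sample formats, and<br>does not check that a date is real (it would<br>accept Feb 31).<br>
<pre>
```python
import re

# Hypothetical building blocks. Everything is bytes, so the same
# pattern can be run over binary files as well as text files.
YEAR = rb"[12][0-9]{3}"
MONTH_NUM = rb"(?:0[1-9]|1[0-2])"
DAY_NUM = rb"(?:0[1-9]|[12][0-9]|3[01])"
DAY = rb"(?:[12][0-9]|3[01]|0?[1-9])"
MONTH_NAME = (
    rb"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?"
    rb"|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?"
    rb"|Nov(?:ember)?|Dec(?:ember)?)"
)
WEEKDAY = rb"(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day"

# One named fragment per date format, built by simple concatenation.
ISO = YEAR + rb"-" + MONTH_NUM + rb"-" + DAY_NUM   # 2023-03-06
US = MONTH_NAME + rb" " + DAY + rb", " + YEAR      # Mar 6, 2023
LONG = WEEKDAY + rb", " + US                       # Monday, March 6, 2023

# Parentheses and pipe symbols join the formats into one RE.
# \b keeps a date from matching inside a longer run of letters or digits.
ANY_DATE = re.compile(rb"\b(?:" + LONG + rb"|" + US + rb"|" + ISO + rb")\b")


def find_dates(buf):
    """Return a list of (offset, matched bytes) for each date in buf."""
    return [(m.start(), m.group()) for m in ANY_DATE.finditer(buf)]
```
</pre>
Calling find_dates() on a bytes object returns<br>the offset and text of each match. Because the re<br>module works on bytes-like objects, the same call<br>should also accept a memory mapped file.<br><br>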
This<br>thread has been very informative.<br>I have much to read.<br>Thank all of you.<br><br>Ed Bradford<br>Pflugerville, TX<br><br><br><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Mar 2, 2023 at 12:55 PM Grant Taylor via COFF <<a href="mailto:coff@tuhs.org" target="_blank">coff@tuhs.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>
<br>
I'd like some thoughts ~> input on extended regular expressions used <br>
with grep, specifically GNU grep -e / egrep.<br>
<br>
What are the pros / cons to creating extended regular expressions like <br>
the following:<br>
<br>
^\w{3}<br>
<br>
vs:<br>
<br>
^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)<br>
<br>
Or:<br>
<br>
[ :[:digit:]]{11}<br>
<br>
vs:<br>
<br>
( 1| 2| 3| 4| 5| 6| 7| 8| <br>
9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31) <br>
(0|1|2)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]<br>
<br>
I'm currently eliding the 61st (60) second, the 32nd day, and dealing <br>
with February having fewer days for simplicity.<br>
<br>
For matching patterns like the following in log files?<br>
<br>
Mar 2 03:23:38<br>
<br>
I'm working on organically training logcheck to match known good log <br>
entries. So I'm *DEEP* in the bowels of extended regular expressions <br>
(GNU egrep) that runs over all logs hourly. As such, I'm interested in <br>
making sure that my REs are both efficient and accurate or at least not <br>
WILDLY badly structured. The pedantic part of me wants to avoid <br>
wildcard type matches (\w), even if they are bounded (\w{3}), unless it <br>
truly is for unpredictable text.<br>
<br>
I'd appreciate any feedback and recommendations from people who have <br>
been using and / or optimizing (extended) regular expressions for longer <br>
than I have been using them.<br>
<br>
Thank you for your time and input.<br>
<br>
<br>
<br>
-- <br>
Grant. . . .<br>
unix || die<br>
<br>
</blockquote></div><br clear="all"><div><br></div><span>-- </span><br><div dir="ltr"><font face="'courier new', monospace"><span style="font-weight:900"><div>Advice is judged by results, not by intentions.</div><div> Cicero</div></span></font><div><br></div></div></div>