From chet.ramey at case.edu Wed Mar 1 00:53:44 2023 From: chet.ramey at case.edu (Chet Ramey) Date: Tue, 28 Feb 2023 09:53:44 -0500 Subject: [COFF] [TUHS] Re: Generational development [was Re: Re: Early GUI on Linux] In-Reply-To: References: <16241ceb-fe92-7f25-bda0-0b327847728d@case.edu> <735c811e-62ce-5384-b83f-a3887baac89d@case.edu> <5a7aa991-7656-3faf-b34a-d613736716fd@case.edu> Message-ID: <708986db-d22e-3b1b-7dad-c15025697e42@case.edu> On 2/27/23 7:28 PM, Dan Cross wrote: > Huh? Rustup is the context that this came up in: I think if you look back in the thread, you'll find that the message from segaloco was a reply to a message of mine where I criticized the practice of piping from `wget' to `sh'. That's the context. >> But just because you don't run `sudo sh' when using >> `rustup' doesn't mean there aren't a disturbingly large number of >> installers -- or whatever -- for which that is the recommended workflow. >> >> Nor does the fact that `rustup' is a safe example mean that this is a safe >> practice in general. I posit that it's a bad idea in general to blindly >> run scripts you download from the Internet, and it's especially bad to >> do it as root. Depending on how you accept risk, you can choose to do >> things about it, but that's often not part of recommendations. > > I cannot help but point out that this is moving the goalposts somewhat > from the specific context that I was responding to. If we're now > talking about things in general then I agree with you. We were talking about the general practice before Matt used `rustup' as a specific example. I'm glad we agree it's a bad idea. >> In any case, if you want >> to, you can have a workflow where you rebuild configure yourself. > > This is true, but then there's the autotools source stuff that you've > got to inspect as well, and on and on. Sure, there's always a limit to where trust takes over. 
It's ultimately who you trust to do the packaging: is it your distro/OS
vendor, your package manager (e.g., macports, homebrew), free software
distributors (e.g., signed tar files from gnu.org), or the authors
themselves?

> Or perhaps they just cargo-cult it and don't
> really think about it, which (I think) hews closer to the argument
> that folks here have been making.

That's pretty close to the point I was making originally.

Chet

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU chet at case.edu http://tiswww.cwru.edu/~chet/

From crossd at gmail.com Wed Mar 1 01:25:04 2023
From: crossd at gmail.com (Dan Cross)
Date: Tue, 28 Feb 2023 10:25:04 -0500
Subject: [COFF] [TUHS] Re: Generational development [was Re: Re: Early GUI on Linux]
In-Reply-To: <708986db-d22e-3b1b-7dad-c15025697e42@case.edu>
References: <16241ceb-fe92-7f25-bda0-0b327847728d@case.edu>
 <735c811e-62ce-5384-b83f-a3887baac89d@case.edu>
 <5a7aa991-7656-3faf-b34a-d613736716fd@case.edu>
 <708986db-d22e-3b1b-7dad-c15025697e42@case.edu>
Message-ID:

On Tue, Feb 28, 2023 at 9:53 AM Chet Ramey wrote:
> On 2/27/23 7:28 PM, Dan Cross wrote:
> > Huh? Rustup is the context that this came up in:
>
> I think if you look back in the thread, you'll find that the message from
> segaloco was a reply to a message of mine where I criticized the practice
> of piping from `wget' to `sh'. That's the context.

Yes, it is quite clear we were speaking past one another.

- Dan C.

From chet.ramey at case.edu Wed Mar 1 02:03:47 2023
From: chet.ramey at case.edu (Chet Ramey)
Date: Tue, 28 Feb 2023 11:03:47 -0500
Subject: [COFF] [TUHS] Re: Generational development [was Re: Re: Early GUI on Linux]
In-Reply-To:
References: <16241ceb-fe92-7f25-bda0-0b327847728d@case.edu>
 <735c811e-62ce-5384-b83f-a3887baac89d@case.edu>
 <5a7aa991-7656-3faf-b34a-d613736716fd@case.edu>
 <708986db-d22e-3b1b-7dad-c15025697e42@case.edu>
Message-ID:

On 2/28/23 10:25 AM, Dan Cross wrote:
> On Tue, Feb 28, 2023 at 9:53 AM Chet Ramey wrote:
>> On 2/27/23 7:28 PM, Dan Cross wrote:
>>> Huh? Rustup is the context that this came up in:
>>
>> I think if you look back in the thread, you'll find that the message from
>> segaloco was a reply to a message of mine where I criticized the practice
>> of piping from `wget' to `sh'. That's the context.
>
> Yes, it is quite clear we were speaking past one another.

OK, let's not do that any more. :-)

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU chet at case.edu http://tiswww.cwru.edu/~chet/

From lars at nocrew.org Thu Mar 2 16:41:31 2023
From: lars at nocrew.org (Lars Brinkhoff)
Date: Thu, 02 Mar 2023 06:41:31 +0000
Subject: [COFF] [TUHS] Re: Unix v7 icheck dup problem
In-Reply-To: (John Cowan's message of "Wed, 1 Mar 2023 20:56:12 -0500")
References: <20230302013628.8E40618C07B@mercury.lcs.mit.edu>
Message-ID: <7wsfenslic.fsf@junk.nocrew.org>

John Cowan writes:
>> which Rob Austein re-wrote into "Alice's PDP-10".
> I didn't know that one was done at MIT.

This spells out the details:
https://www.hactrn.net/sra/alice/alice.glossary

From coff at tuhs.org Fri Mar 3 04:54:49 2023
From: coff at tuhs.org (Grant Taylor via COFF)
Date: Thu, 2 Mar 2023 11:54:49 -0700
Subject: [COFF] Requesting thoughts on extended regular expressions in grep.
Message-ID: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>

Hi,

I'd like some thoughts ~> input on extended regular expressions used
with grep, specifically GNU grep -e / egrep.

What are the pros / cons to creating extended regular expressions like
the following:

    ^\w{3}

vs:

    ^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)

Or:

    [ :[:digit:]]{11}

vs:

    ( 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31) (0|1|2)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]

I'm currently eliding the 61st (60) second, the 32nd day, and dealing
with February having fewer days for simplicity.

For matching patterns like the following in log files?

    Mar  2 03:23:38

I'm working on organically training logcheck to match known good log
entries. So I'm *DEEP* in the bowels of extended regular expressions
(GNU egrep) that runs over all logs hourly. As such, I'm interested in
making sure that my REs are both efficient and accurate or at least not
WILDLY badly structured. The pedantic part of me wants to avoid
wildcard type matches (\w), even if they are bounded (\w{3}), unless it
truly is for unpredictable text.

I'd appreciate any feedback and recommendations from people who have
been using and / or optimizing (extended) regular expressions for longer
than I have been using them.

Thank you for your time and input.

--
Grant. . . .
unix || die

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4017 bytes
Desc: S/MIME Cryptographic Signature
URL:

From clemc at ccc.com Fri Mar 3 05:23:25 2023
From: clemc at ccc.com (Clem Cole)
Date: Thu, 2 Mar 2023 14:23:25 -0500
Subject: [COFF] Requesting thoughts on extended regular expressions in grep.
In-Reply-To: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> Message-ID: Grant - check out Russ Cox's web page on this very subject: Implementing Regular Expressions ᐧ On Thu, Mar 2, 2023 at 1:55 PM Grant Taylor via COFF wrote: > Hi, > > I'd like some thoughts ~> input on extended regular expressions used > with grep, specifically GNU grep -e / egrep. > > What are the pros / cons to creating extended regular expressions like > the following: > > ^\w{3} > > vs: > > ^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) > > Or: > > [ :[:digit:]]{11} > > vs: > > ( 1| 2| 3| 4| 5| 6| 7| 8| > 9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31) > (0|1|2)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]] > > I'm currently eliding the 61st (60) second, the 32nd day, and dealing > with February having fewer days for simplicity. > > For matching patterns like the following in log files? > > Mar 2 03:23:38 > > I'm working on organically training logcheck to match known good log > entries. So I'm *DEEP* in the bowels of extended regular expressions > (GNU egrep) that runs over all logs hourly. As such, I'm interested in > making sure that my REs are both efficient and accurate or at least not > WILDLY badly structured. The pedantic part of me wants to avoid > wildcard type matches (\w), even if they are bounded (\w{3}), unless it > truly is for unpredictable text. > > I'd appreciate any feedback and recommendations from people who have > been using and / or optimizing (extended) regular expressions for longer > than I have been using them. > > Thank you for your time and input. > > > > -- > Grant. . . . > unix || die > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From coff at tuhs.org Fri Mar 3 05:38:19 2023 From: coff at tuhs.org (Grant Taylor via COFF) Date: Thu, 2 Mar 2023 12:38:19 -0700 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. In-Reply-To: References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> Message-ID: On 3/2/23 12:23 PM, Clem Cole wrote: > Grant - check out Russ Cox's web page on this very subject: Implementing > Regular Expressions Thank you for the pointer Clem. It's at the top of my reading list. I'll dig into the articles listed thereon later today. -- Grant. . . . unix || die -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 4017 bytes Desc: S/MIME Cryptographic Signature URL: From crossd at gmail.com Fri Mar 3 07:53:31 2023 From: crossd at gmail.com (Dan Cross) Date: Thu, 2 Mar 2023 16:53:31 -0500 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. In-Reply-To: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> Message-ID: On Thu, Mar 2, 2023 at 1:55 PM Grant Taylor via COFF wrote: > I'd like some thoughts ~> input on extended regular expressions used > with grep, specifically GNU grep -e / egrep. > > What are the pros / cons to creating extended regular expressions like > the following: > > ^\w{3} > > vs: > > ^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) Well, obviously the former matches any sequence 3 of alpha-numerics/underscores at the beginning of a string, while the latter only matches abbreviations of months in the western calendar; that is, the two REs are matching very different things (the latter is a strict subset of the former). But I suspect you mean in a more general sense. > Or: > > [ :[:digit:]]{11} ...do you really want to match a space, a colon and a single digit 11 times in a single string? 
> vs:
>
> ( 1| 2| 3| 4| 5| 6| 7| 8|
> 9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31)
> (0|1|2)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]

Using character classes would greatly simplify what you're trying to
do. It seems like this could be simplified to (untested) snippet:

    ( [1-9]|[12][[0-9]]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9]

For this, I'd probably eschew `[:digit:]`. Named character classes
are for handy locale support, or in lieu of typing every character
in the alphabet (though we can use ranges to abbreviate that), but
it kind of seems like that's not coming into play here and, IMHO,
`[0-9]` is clearer in context.

> I'm currently eliding the 61st (60) second, the 32nd day, and dealing
> with February having fewer days for simplicity.

It's not clear to me that dates, in their generality, can be
matched with regular expressions. Consider leap years; you'd almost
necessarily have to use backtracking for that, but I admit I haven't
thought it through.

> For matching patterns like the following in log files?
>
> Mar  2 03:23:38
>
> I'm working on organically training logcheck to match known good log
> entries. So I'm *DEEP* in the bowels of extended regular expressions
> (GNU egrep) that runs over all logs hourly. As such, I'm interested in
> making sure that my REs are both efficient and accurate or at least not
> WILDLY badly structured. The pedantic part of me wants to avoid
> wildcard type matches (\w), even if they are bounded (\w{3}), unless it
> truly is for unpredictable text.

`\w` is a GNU extension; I'd probably avoid it on portability grounds
(though `\b` is very handy).

The thing about regular expressions is that they describe regular
languages, and regular languages are those for which there exists a
finite automaton that can recognize the language. An important class
of finite automata are deterministic finite automata; by definition,
recognition by such automata are linear in the length of the input.

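As an illustration of that linear-time claim, here is a minimal sketch of a hand-built DFA for the time portion of the (untested) snippet above, `[0-2][0-9]:[0-5][0-9]:[0-5][0-9]`. It is written in Python rather than anything egrep does internally; the names `ALLOWED` and `dfa_accepts` are made up for this example. Because every position in `hh:mm:ss` is fixed, the state is just "how many characters have been accepted so far".

```python
# Hypothetical hand-built DFA for [0-2][0-9]:[0-5][0-9]:[0-5][0-9].
# State i means "i characters of hh:mm:ss accepted so far"; state 8 accepts.
# ALLOWED[i] lists the characters that move the machine from state i to i + 1.
ALLOWED = ["012", "0123456789", ":", "012345", "0123456789", ":", "012345", "0123456789"]


def dfa_accepts(text: str) -> bool:
    """Return True if text is exactly one hh:mm:ss field per the RE above."""
    state = 0
    for ch in text:  # exactly one transition per input character
        if state >= len(ALLOWED) or ch not in ALLOWED[state]:
            return False  # no transition defined for this character: reject
        state += 1
    return state == len(ALLOWED)


print(dfa_accepts("03:23:38"))  # True
print(dfa_accepts("03:63:38"))  # False
```

The loop does a constant amount of work per character, which is where "linear in the length of the input" comes from. Like the RE it mirrors, this accepts hours such as 29; tightening that is a separate question.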
However, construction of a DFA for any given regular expression can be superlinear (in fact, it can be exponential) so practically speaking, we usually construct non-deterministic finite automata (NDFAs) and "simulate" their execution for matching. NDFAs generalize DFAs (DFAs are a subset of NDFAs, incidentally) in that, in any non-terminal state, there can be multiple subsequent states that the machine can transition to given an input symbol. When executed, for any state, the simulator will transition to every permissible subsequent state simultaneously, discarding impossible states as they become evident. This implies that NDFA execution is superlinear, but it is bounded, and is O(n*m*e), where n is the length of the input, m is the number of nodes in the state transition graph corresponding to the NDFA, and e is the maximum number of edges leaving any node in that graph (for a fully connected graph, that would m, so this can be up to O(n*m^2)). Construction of an NDFA is O(m), so while it's slower to execute, it's actually possible to construct in a reasonable amount of time. Russ's excellent series of articles that Clem linked to gives details and algorithms. > I'd appreciate any feedback and recommendations from people who have > been using and / or optimizing (extended) regular expressions for longer > than I have been using them. > > Thank you for your time and input. In practical terms? Basically, don't worry about it too much. Egrep will generate an NDFA simulation that's going to be acceptably fast for all but the weirdest cases. - Dan C. From stuff at riddermarkfarm.ca Fri Mar 3 09:01:40 2023 From: stuff at riddermarkfarm.ca (Stuff Received) Date: Thu, 2 Mar 2023 18:01:40 -0500 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. 
In-Reply-To: References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> Message-ID: On 2023-03-02 14:23, Clem Cole wrote: > Grant - check out Russ Cox's web page on this very subject: Implementing > Regular Expressions > Clem, why are you linking through streaklinks.com? N. From steffen at sdaoden.eu Fri Mar 3 09:46:12 2023 From: steffen at sdaoden.eu (Steffen Nurpmeso) Date: Fri, 03 Mar 2023 00:46:12 +0100 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. In-Reply-To: References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> Message-ID: <20230302234612.qQ4rn%steffen@sdaoden.eu> Stuff Received wrote in : |On 2023-03-02 14:23, Clem Cole wrote: |> Grant - check out Russ Cox's web page on this very subject: Implementing |> Regular Expressions |> m%2F%7Ersc%2Fregexp%2F> | |Clem, why are you linking through streaklinks.com? I do not want to be unfriendly; (but) i use firefox-bin (mozilla-compiled) and my only extension is uMatrix that i have been pointed to, and i can only recommend it highly to anyone (though the "modern" web mostly requires to turn off tracking protection and numerous white flags in uMatrix to work), maybe even to those who simply put their browser into a container. Anyhow, uMatrix gives you the following, and while i have not tried to selectively click me through to get to the target, i could have done so: streak.com www.streak.com cloudflare.com cdnjs.cloudflare.com d3e54v103j8qbb.cloudfront.net facebook.net connect.facebook.net google.com www.google.com ajax.googleapis.com storage.googleapis.com intercom.io widget.intercom.io licdn.com snap.licdn.com pdst.fm cdn.pdst.fm producthunt.com api.producthunt.com sentry-cdn.com js.sentry-cdn.com website-files.com assets.website-files.com assets-global.website-files.com google-analytics.com www.google-analytics.com googletagmanager.com www.googletagmanager.com Randomized links i find just terrible. 
IETF started using randomized archive links, which are mesmerising; most often mailman archive links give you a bit of orientation by themselves, isn't that more appealing. --steffen | |Der Kragenbaer, The moon bear, |der holt sich munter he cheerfully and one by one |einen nach dem anderen runter wa.ks himself off |(By Robert Gernhardt) From coff at tuhs.org Fri Mar 3 11:05:51 2023 From: coff at tuhs.org (Grant Taylor via COFF) Date: Thu, 2 Mar 2023 18:05:51 -0700 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. In-Reply-To: References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> Message-ID: <688396c8-7a25-5cd6-282c-49f1b13117d4@spamtrap.tnetconsulting.net> On 3/2/23 2:53 PM, Dan Cross wrote: > Well, obviously the former matches any sequence 3 of > alpha-numerics/underscores at the beginning of a string, while the > latter only matches abbreviations of months in the western calendar; > that is, the two REs are matching very different things (the latter > is a strict subset of the former). I completely agree with you. That's also why I'm wanting to start utilizing the latter, more specific RE. But I don't know where the line of over complicating things is to avoid crossing it. > But I suspect you mean in a more general sense. Yes and no. Does the comment above clarify at all? > ...do you really want to match a space, a colon and a single digit > 11 times ... Yes. > ... in a single string? What constitutes a single string? ;-) I sort of rhetorically ask. The log lines start with MMM dd hh:mm:ss Where: - MMM is the month abbreviation - dd is the day of the month - hh is the hour of the day - mm is the minute of the hour - ss is the second of the minute So, yes, there are eleven characters that fall into the class consisting of a space or a colon or a number. Is that a single string? It depends what you're looking at, the sequences of non white space in the log? No. The patter that I'm matching ya. 
> Using character classes would greatly simplify what you're trying to > do. It seems like this could be simplified to (untested) snippet: Agreed. I'm starting with the examples that came with; "^\w{3} [ :[:digit:]]{11}", the logcheck package that I'm working with and evaluating what I want to do. I actually like the idea of dividing out the following: - months that have 31 days: Jan, Mar, May, Jul, Aug, Oct, and Dec - months that have 30 days: Apr, Jun, Sep, Nov - month that have 28/29 days: Feb > ( [1-9]|[12][[0-9]]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9] Aside: Why do you have the double square brackets in "[12][[0-9]]"? > For this, I'd probably eschew `[:digit:]`. Named character classes > are for handy locale support, or in lieu of typing every character > in the alphabet (though we can use ranges to abbreviate that), but > it kind of seems like that's not coming into play here and, IMHO, > `[0-9]` is clearer in context. ACK "[[:digit:]]+" was a construct that I'm parroting. It and [.:[:xdigit:]]+ are good for some things. But they definitely aren't the best for all things. Hence trying to find the line of being more accurate without going too far. > It's not clear to me that dates, in their generality, can be > matched with regular expressions. Consider leap years; you'd almost > necessarily have to use backtracking for that, but I admit I haven't > thought it through. Given the context that these extended regular expressions are going to be used in, logcheck -- filtering out known okay log entries to email what doesn't get filtered -- I'm okay with having a few things slip through like leap day / leap seconds / leap frogs. > `\w` is a GNU extension; I'd probably avoid it on portability grounds > (though `\b` is very handy). I hear, understand, and acknowledge your concern. At present, these filters are being used in a package; logcheck, which I believe is specific to Debian and ilk. As such, GNU grep is very much a thing. 
I'm also not a fan of the use of `\w` and would prefer to (...|...) things. > The thing about regular expressions is that they describe regular > languages, and regular languages are those for which there exists a > finite automaton that can recognize the language. An important class > of finite automata are deterministic finite automata; by definition, > recognition by such automata are linear in the length of the input. > > However, construction of a DFA for any given regular expression can be > superlinear (in fact, it can be exponential) so practically speaking, > we usually construct non-deterministic finite automata (NDFAs) and > "simulate" their execution for matching. NDFAs generalize DFAs (DFAs > are a subset of NDFAs, incidentally) in that, in any non-terminal > state, there can be multiple subsequent states that the machine can > transition to given an input symbol. When executed, for any state, > the simulator will transition to every permissible subsequent state > simultaneously, discarding impossible states as they become evident. > > This implies that NDFA execution is superlinear, but it is bounded, > and is O(n*m*e), where n is the length of the input, m is the number > of nodes in the state transition graph corresponding to the NDFA, and > e is the maximum number of edges leaving any node in that graph (for > a fully connected graph, that would m, so this can be up to O(n*m^2)). > Construction of an NDFA is O(m), so while it's slower to execute, it's > actually possible to construct in a reasonable amount of time. Russ's > excellent series of articles that Clem linked to gives details and > algorithms. I only vaguely understand those three paragraphs as they are deeper computer science than I've gone before. I think I get the gist of them but could not explain them if my life depended upon it. > In practical terms? Basically, don't worry about it too much. 
Egrep > will generate an NDFA simulation that's going to be acceptably fast > for all but the weirdest cases. ACK It sounds like I can make any reasonable extended regular expression a human can read and I'll probably be good. Thank you for the detailed response Dan. :-) -- Grant. . . . unix || die -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 4017 bytes Desc: S/MIME Cryptographic Signature URL: From coff at tuhs.org Fri Mar 3 11:08:45 2023 From: coff at tuhs.org (Grant Taylor via COFF) Date: Thu, 2 Mar 2023 18:08:45 -0700 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. In-Reply-To: References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> Message-ID: <46d95806-8ea6-0eb4-41c9-1ae33e004faa@spamtrap.tnetconsulting.net> On 3/2/23 4:01 PM, Stuff Received wrote: > Clem, why are you linking through streaklinks.com? Here's a direct link to the page that I landed on when following Clem's link: Link - Implementing Regular Expressions - https://swtch.com/~rsc/regexp/ I didn't pay attention to Clem's link beyond the fact that I got to the desired page without needing to tilt at my various filtering plugins. Though the message I'm replying to has caused a few brain cells to find themselves in confusion ~> curiosity. -- Grant. . . . unix || die -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 4017 bytes Desc: S/MIME Cryptographic Signature URL: From dave at horsfall.org Fri Mar 3 12:10:31 2023 From: dave at horsfall.org (Dave Horsfall) Date: Fri, 3 Mar 2023 13:10:31 +1100 (EST) Subject: [COFF] Requesting thoughts on extended regular expressions in grep. 
In-Reply-To: <46d95806-8ea6-0eb4-41c9-1ae33e004faa@spamtrap.tnetconsulting.net>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <46d95806-8ea6-0eb4-41c9-1ae33e004faa@spamtrap.tnetconsulting.net>
Message-ID:

On Thu, 2 Mar 2023, Grant Taylor via COFF wrote:

> Though the message I'm replying to has caused a few brain cells to find
> themselves in confusion ~> curiosity.

Because evil things can happen with URL redirectors; personally I like to
know where I'm going beforehand...

-- Dave

From crossd at gmail.com Fri Mar 3 13:04:32 2023
From: crossd at gmail.com (Dan Cross)
Date: Thu, 2 Mar 2023 22:04:32 -0500
Subject: [COFF] Requesting thoughts on extended regular expressions in grep.
In-Reply-To: <688396c8-7a25-5cd6-282c-49f1b13117d4@spamtrap.tnetconsulting.net>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <688396c8-7a25-5cd6-282c-49f1b13117d4@spamtrap.tnetconsulting.net>
Message-ID:

On Thu, Mar 2, 2023 at 8:06 PM Grant Taylor via COFF wrote:
> On 3/2/23 2:53 PM, Dan Cross wrote:
> > Well, obviously the former matches any sequence 3 of
> > alpha-numerics/underscores at the beginning of a string, while the
> > latter only matches abbreviations of months in the western calendar;
> > that is, the two REs are matching very different things (the latter
> > is a strict subset of the former).
>
> I completely agree with you. That's also why I'm wanting to start
> utilizing the latter, more specific RE. But I don't know where the line
> of over complicating things is to avoid crossing it.

I guess what I'm saying is, match what you want to match and don't
sweat the small stuff.

> > But I suspect you mean in a more general sense.
>
> Yes and no. Does the comment above clarify at all?

Not exactly. :-)

What I understand you to mean, based on this and the rest of your note,
is that you want to find a good division point between overly specific,
complex REs and simpler, easy to understand REs that are less specific.
The danger with the latter is that they may match things you don't intend, while the former are harder to maintain and (arguably) more brittle. I can sympathize. > > ...do you really want to match a space, a colon and a single digit > > 11 times ... > > Yes. > > > ... in a single string? > > What constitutes a single string? ;-) I sort of rhetorically ask. For the purposes of grep/egrep, that'll be a logical "line" of text, terminated by a newline, though the newline itself isn't considered part of the text for matching. I believe the `-z` option can be used to set a NUL byte as the "line" terminator; presumably this lets one match strings with embedded newlines, though I haven't tried. > The log lines start with > > MMM dd hh:mm:ss > > Where: > - MMM is the month abbreviation > - dd is the day of the month > - hh is the hour of the day > - mm is the minute of the hour > - ss is the second of the minute > > So, yes, there are eleven characters that fall into the class consisting > of a space or a colon or a number. > > Is that a single string? It depends what you're looking at, the > sequences of non white space in the log? No. The patter that I'm > matching ya. "string" in this context is the input you're attempting to match against. `egrep` will attempt to match your pattern against each "line" of text it reads from the files its searching. That is, each line in your log file(s). But consider what `[ :[:digit:]]{11}` means: you've got a character class consisting of space, colon and a digit; {11} means "match any of the characters in that class exactly 11 times" (as opposed to other variations on the '{}' syntax that say "at least m times", "at most n times", or "between n and m times"). 
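For anyone wanting to poke at the interval forms just listed, here is a small sketch. It uses Python's `re` module as a stand-in for egrep, which is an assumption of this example and not what logcheck runs: Python has no `[:digit:]`, so the class is spelled `[ :0-9]`, and the helper name `matches` is made up. It tests whole strings, whereas egrep looks for a match anywhere in the line.

```python
import re

# Space, colon, or digit; a stand-in for the thread's [ :[:digit:]] class.
CLS = "[ :0-9]"


def matches(interval: str, text: str) -> bool:
    """True if the whole of text is CLS repeated as the interval allows."""
    return re.fullmatch(CLS + interval, text) is not None


stamp = " 2 03:23:38"                # 11 characters, day is space-padded
print(matches("{11}", stamp))        # True:  exactly 11
print(matches("{11}", "1" * 9))      # False: only 9
print(matches("{11,}", "1" * 13))    # True:  at least 11
print(matches("{,11}", "1" * 9))     # True:  at most 11 (Python spelling)
print(matches("{9,11}", "1" * 13))   # False: between 9 and 11
```

Whether a particular grep accepts the `{,n}` spelling is worth checking before relying on it; the other three forms are standard ERE interval expressions.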
But that'll match all sorts of things that don't look like 'dd
hh:mm:ss':

term% egrep '[ :[:digit:]]{11}'
11111111111
11111111111
111111111
1111111111111
1111111111111
::::::::::::::::
::::::::::::::::
aaaa bbbbb
aaaa bbbbb
term%

(The first line is my typing; the second is output from egrep
except for the short line of 9 '1's, for which egrep had no output.
The last two lines are matching space characters and egrep echoing
the match, but I'm guessing gmail will eat those.)

Note that there are inputs with more than 11 characters that match;
this is because there is some 11-character substring that matches the
RE in those lines. In any event, I suspect this would generally not
be what you want. But if nothing else in your input can match the RE
(which you might know a priori because of domain knowledge about
whatever is generating those logs) then it's no big deal, even if the
RE was capable of matching more things generally.

> > Using character classes would greatly simplify what you're trying to
> > do. It seems like this could be simplified to (untested) snippet:
>
> Agreed.
>
> I'm starting with the examples that came with; "^\w{3} [
> :[:digit:]]{11}", the logcheck package that I'm working with and
> evaluating what I want to do.

Ah. I suspect this relies on domain knowledge about the format of log
lines to match reliably. Otherwise it could match,

`___ 123 456:789`

which is probably not what you are expecting.

> I actually like the idea of dividing out the following:
>
> - months that have 31 days: Jan, Mar, May, Jul, Aug, Oct, and Dec
> - months that have 30 days: Apr, Jun, Sep, Nov
> - month that have 28/29 days: Feb

Sure. One nice thing about `egrep` et al is that you can put the REs
into a file and include them with `-f`, as opposed to having them all
directly on the command line.

> > ( [1-9]|[12][[0-9]]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9]
>
> Aside: Why do you have the double square brackets in "[12][[0-9]]"?

Typo.
:-) > > For this, I'd probably eschew `[:digit:]`. Named character classes > > are for handy locale support, or in lieu of typing every character > > in the alphabet (though we can use ranges to abbreviate that), but > > it kind of seems like that's not coming into play here and, IMHO, > > `[0-9]` is clearer in context. > > ACK > > "[[:digit:]]+" was a construct that I'm parroting. It and > [.:[:xdigit:]]+ are good for some things. But they definitely aren't > the best for all things. > > Hence trying to find the line of being more accurate without going too far. > > > It's not clear to me that dates, in their generality, can be > > matched with regular expressions. Consider leap years; you'd almost > > necessarily have to use backtracking for that, but I admit I haven't > > thought it through. > > Given the context that these extended regular expressions are going to > be used in, logcheck -- filtering out known okay log entries to email > what doesn't get filtered -- I'm okay with having a few things slip > through like leap day / leap seconds / leap frogs. That seems reasonable. > > `\w` is a GNU extension; I'd probably avoid it on portability grounds > > (though `\b` is very handy). > > I hear, understand, and acknowledge your concern. At present, these > filters are being used in a package; logcheck, Aside: I found the note on it's website amusing: Brought to you by the UK's best gambling sites! "Only gamble with what you can afford to lose." Yikes! > which I believe is > specific to Debian and ilk. As such, GNU grep is very much a thing. I'd proceed with caution here; it also seems to be in the FreeBSD and DragonFly ports collections and Homebrew on the Mac (but so is GNU grep for all of those). > I'm also not a fan of the use of `\w` and would prefer to (...|...) things. Yeah. IMHO `\w` is too general for what you're trying to do. 
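To make the "(...|...)" idea concrete, here is one possible sketch of a stricter timestamp prefix, with the months split by length as suggested earlier in the thread. Several things here are assumptions and not from the thread: Python's `re` stands in for egrep, the names `DAY_29`, `DAY_30`, `DAY_31`, `TIME` and `STAMP` are invented, the day is assumed to be space-padded to two columns (which the 11-character count implies), and the hour is tightened to 00-23, which is stricter than the `[0-2][0-9]` in the snippet quoted above.

```python
import re

# Illustrative building blocks; days are assumed space-padded ("Mar  2").
DAY_29 = r"( [1-9]|[12][0-9])"           # Feb, leap day allowed every year
DAY_30 = r"( [1-9]|[12][0-9]|30)"
DAY_31 = r"( [1-9]|[12][0-9]|3[01])"
TIME = r"([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]"  # no leap second

STAMP = re.compile(
    r"^((Jan|Mar|May|Jul|Aug|Oct|Dec) " + DAY_31
    + r"|(Apr|Jun|Sep|Nov) " + DAY_30
    + r"|Feb " + DAY_29
    + r") " + TIME
)

print(bool(STAMP.match("Mar  2 03:23:38 host sshd[1]: ok")))  # True
print(bool(STAMP.match("Feb 30 00:00:00 host x")))            # False
print(bool(STAMP.match("___ 123 456:789")))                   # False
```

The pattern uses only alternation, bracket ranges and literals, so it should carry over to `grep -E` unchanged, though that has not been checked here. It still cannot know whether a given year has a Feb 29, which matches the "let a few things slip through" stance taken above.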
> > The thing about regular expressions is that they describe regular > > languages, and regular languages are those for which there exists a > > finite automaton that can recognize the language. An important class > > of finite automata are deterministic finite automata; by definition, > > recognition by such automata are linear in the length of the input. > > > > However, construction of a DFA for any given regular expression can be > > superlinear (in fact, it can be exponential) so practically speaking, > > we usually construct non-deterministic finite automata (NDFAs) and > > "simulate" their execution for matching. NDFAs generalize DFAs (DFAs > > are a subset of NDFAs, incidentally) in that, in any non-terminal > > state, there can be multiple subsequent states that the machine can > > transition to given an input symbol. When executed, for any state, > > the simulator will transition to every permissible subsequent state > > simultaneously, discarding impossible states as they become evident. > > > > This implies that NDFA execution is superlinear, but it is bounded, > > and is O(n*m*e), where n is the length of the input, m is the number > > of nodes in the state transition graph corresponding to the NDFA, and > > e is the maximum number of edges leaving any node in that graph (for > > a fully connected graph, that would m, so this can be up to O(n*m^2)). > > Construction of an NDFA is O(m), so while it's slower to execute, it's > > actually possible to construct in a reasonable amount of time. Russ's > > excellent series of articles that Clem linked to gives details and > > algorithms. > > I only vaguely understand those three paragraphs as they are deeper > computer science than I've gone before. > > I think I get the gist of them but could not explain them if my life > depended upon it. 
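The simulation described in the quoted paragraphs can be sketched in a few lines, which may help with the gist. This is a toy, not egrep's implementation: the transition table is a hand-written NFA for the textbook expression `(a|b)*abb`, and the names `NFA`, `START`, `ACCEPT` and `nfa_accepts` are invented for the example. The point is only that the simulator carries a set of possible states forward and drops the ones that become impossible.

```python
# Hand-written NFA for (a|b)*abb. State 0 has two successors on "a",
# which is the non-determinism; state 3 is the accepting state.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, ACCEPT = 0, 3


def nfa_accepts(text: str) -> bool:
    """Simulate the NFA by tracking every state it could be in at once."""
    current = {START}
    for ch in text:
        following = set()
        for state in current:              # follow every permissible edge
            following |= NFA.get((state, ch), set())
        if not following:                  # every path died: reject early
            return False
        current = following
    return ACCEPT in current


print(nfa_accepts("ababb"))  # True
print(nfa_accepts("abab"))   # False
```

Each input character costs at most one pass over the current state set and the edges leaving it, which is the n * m * e shape of the bound quoted above.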
Basically, a regular expression is truly "regular" if you can build a machine with only a fixed, finite amount of memory (independent of the length of the input) that can tell you whether or not a given string matches the RE by examining its input one character at a time. > > In practical terms? Basically, don't worry about it too much. Egrep > > will generate an NDFA simulation that's going to be acceptably fast > > for all but the weirdest cases. > > ACK > > It sounds like I can make any reasonable extended regular expression a > human can read and I'll probably be good. I think that's about right. > Thank you for the detailed response Dan. :-) Sure thing! - Dan C. From coff at tuhs.org Fri Mar 3 13:34:19 2023 From: coff at tuhs.org (Grant Taylor via COFF) Date: Thu, 2 Mar 2023 20:34:19 -0700 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. In-Reply-To: References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> <46d95806-8ea6-0eb4-41c9-1ae33e004faa@spamtrap.tnetconsulting.net> Message-ID: <57a22cdd-2523-d8fd-4004-360da77d4ba0@spamtrap.tnetconsulting.net> On 3/2/23 7:10 PM, Dave Horsfall wrote: > Because evil things can happen with URL redirectors; personally I like to > know where I'm going beforehand... I absolutely agree. The confusion was why someone would purposefully choose to use a redirector. -- Grant. . . . unix || die -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 4017 bytes Desc: S/MIME Cryptographic Signature URL: From coff at tuhs.org Fri Mar 3 13:53:08 2023 From: coff at tuhs.org (Grant Taylor via COFF) Date: Thu, 2 Mar 2023 20:53:08 -0700 Subject: [COFF] Requesting thoughts on extended regular expressions in grep.
In-Reply-To: References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> <688396c8-7a25-5cd6-282c-49f1b13117d4@spamtrap.tnetconsulting.net> Message-ID: <1519cce3-1c38-8a9c-cfdd-b39484bd163b@spamtrap.tnetconsulting.net> On 3/2/23 8:04 PM, Dan Cross wrote: > I guess what I'm saying is, match what you want to match and don't sweat > the small stuff. ACK > Not exactly. :-) > > What I understand you to mean, based on this and the rest of your note, > is that you want to find a good division point between overly specific, > complex REs and simpler, easy to understand REs that are less specific. > The danger with the latter is that they may match things you don't > intend, while the former are harder to maintain and (arguably) more > brittle. I can sympathize. You got it. > For the purposes of grep/egrep, that'll be a logical "line" of text, > terminated by a newline, though the newline itself isn't considered part > of the text for matching. I believe the `-z` option can be used to set a > NUL byte as the "line" terminator; presumably this lets one match > strings with embedded newlines, though I haven't tried. Fair enough. That's also sort of what I thought might be the case. > "string" in this context is the input you're attempting to match > against. `egrep` will attempt to match your pattern against each "line" > of text it reads from the files its searching. That is, each line in > your log file(s). *nod* > But consider what `[ :[:digit:]]{11}` means: you've got a character > class consisting of space, colon and a digit; {11} means "match any of > the characters in that class exactly 11 times" (as opposed to other > variations on the '{}' syntax that say "at least m times", "at most n > times", or "between n and m times"). Yep, I'm well aware of the that. > But that'll match all sorts of things that don't look like 'dd > hh:mm:ss': That's one of the reasons that I'm interested in coming up with a more precise regular expression ... 
without being overly complex. > (The first line is my typing; the second is output from egrep except for > the short line of 9 '1's, for which egrep had no output. That last two > lines are matching space characters and egrep echoing the match, but I'm > guessing gmail will eat those.) > > Note that there are inputs with more than 11 characters that match; this > is because there is some 11-character substring that matches the RE  in > those lines. In any event, I suspect this would generally not be what > you want. But if nothing else in your input can match the RE (which you > might know a priori because of domain knowledge about whatever is > generating those logs) then it's no big deal, even if the RE was capable > of matching more things generally. Yep. Here's an example of the full RE: ^\w{3} [ :[:digit:]]{11} [._[:alnum:]-]+ postfix/msa/smtpd\[[[:digit:]]+\]: timeout after STARTTLS from [._[:alnum:]-]+\[[.:[:xdigit:]]+\]$ As you can see the "[ :[:digit:]]{11}" is actually only a sub-part of a larger RE and there is bounding & delimiting around the subpart. This is to match a standard message from postfix via standard SYSLOG. > Ah. I suspect this relies on domain knowledge about the format of log > lines to match reliably. Otherwise it could match, `___ 123 456:789` > which is probably not what you are expecting. Yep. Though said domain knowledge isn't anything special in and of itself. > Sure.  One nice thing about `egrep` et al is that you can put the REs > into a file and include them with `-f`, as opposed to having them all > directly on the command line. Yep. logcheck makes extensive use of many files like this to do it's work. > Typo.  :-) ACKK > That seems reasonable. Thank you for the logic CRC. > Aside: I found the note on it's website amusing: Brought to you by the > UK's best gambling sites! "Only gamble with what you can afford to > lose." Yikes! Um ... that's concerning. 
> I'd proceed with caution here; it also seems to be in the FreeBSD and > DragonFly ports collections and Homebrew on the Mac (but so is GNU grep > for all of those). Fair enough. My use case is on Linux where GNU egrep is a thing. > Yeah. IMHO `\w` is too general for what you're trying to do. I think that `\w` is a good primer, but not where I want things to end up long term. > Basically, a regular expression is a regular expression if you can build > a machine with no additional memory that can tell you whether or not a > given string matches the RE examining its input one character at a time. I /think/ that I could build a complex nested tree of switch statements to test each character to see if things match what they should or not. Though I would need at least one variable / memory to hold absolutely minimal state to know where I am in the switch tree. I think a number to identify the switch statement in question would be sufficient. So I'm guessing two bytes of variable and uncounted bytes of program code. > I think that's about right. Thank you again Dan. > Sure thing! :-) -- Grant. . . . unix || die From ralph at inputplus.co.uk Fri Mar 3 20:59:28 2023 From: ralph at inputplus.co.uk (Ralph Corderoy) Date: Fri, 03 Mar 2023 10:59:28 +0000 Subject: [COFF] Requesting thoughts on extended regular expressions in grep.
In-Reply-To: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> Message-ID: <20230303105928.E88AB215AA@orac.inputplus.co.uk> Hi Grant, > What are the pros / cons to creating extended regular expressions like > the following: If you want to understand: - the maths of regular expressions, - the syntax of regexps which these days expresses more than REs, and - the regexp engines in programs, the differences in how they work and what they match, and - how to efficiently steer an engine's internals then I recommend Jeffrey Friedl's Mastering Regular Expressions. http://regex.info/book.html > For matching patterns like the following in log files? > > Mar 2 03:23:38 Do you want speed of matching with some false positives or validation by regexp rather than post-lexing logic and to what depth, e.g. does this month have a ‘31st’? /^... .. ..:..:../ You'd said egrep, which is NDFA, but in other engines, alternation order can matter, e.g. ‘J’ starts the most months and some months have more days than others. /^(J(an|u[nl])|Ma[ry]|A(ug|pr)|Oct|Dec|... -- Cheers, Ralph. From crossd at gmail.com Fri Mar 3 23:11:23 2023 From: crossd at gmail.com (Dan Cross) Date: Fri, 3 Mar 2023 08:11:23 -0500 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. In-Reply-To: <20230303105928.E88AB215AA@orac.inputplus.co.uk> References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> <20230303105928.E88AB215AA@orac.inputplus.co.uk> Message-ID: On Fri, Mar 3, 2023 at 5:59 AM Ralph Corderoy wrote: > [snip] > > If you want to understand: > > - the maths of regular expressions, > - the syntax of regexps which these days expresses more than REs, and > - the regexp engines in programs, the differences in how they work and > what they match, and > - how to efficiently steer an engine's internals > > then I recommend Jeffrey Friedl's Mastering Regular Expressions. 
> http://regex.info/book.html I'm afraid I must sound a note of caution about Friedl's book. Russ Cox alludes to some of the problems in the "History and References" section of his page (https://swtch.com/~rsc/regexp/regexp1.html), which was linked earlier, and he links to this post: http://regex.info/blog/2006-09-15/248 The impression is that Friedl shows wonderfully how to _use_ regular expressions, but does not understand the theory behind their implementation. It is certainly true that today what many people refer to as "regular expressions" are not in fact regular (and require more than a finite automaton to implement, putting them beyond the regular languages in terms of expressiveness). Personally, I'd stick with Russ's stuff, especially as `egrep` is the target here. - Dan C. From ralph at inputplus.co.uk Fri Mar 3 23:42:15 2023 From: ralph at inputplus.co.uk (Ralph Corderoy) Date: Fri, 03 Mar 2023 13:42:15 +0000 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. In-Reply-To: References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> <20230303105928.E88AB215AA@orac.inputplus.co.uk> Message-ID: <20230303134215.3ED63215AA@orac.inputplus.co.uk> Hi Dan, > > If you want to understand: > > > > - the maths of regular expressions, > > - the syntax of regexps which these days expresses more than REs, and > > - the regexp engines in programs, the differences in how they work > > and what they match, and > > - how to efficiently steer an engine's internals > > > > then I recommend Jeffrey Friedl's Mastering Regular Expressions. > > http://regex.info/book.html > > I'm afraid I must sound a note of caution about Friedl's book.
Russ > Cox alludes to some of the problems in the "History and References" > section of his page (https://swtch.com/~rsc/regexp/regexp1.html), that > was linked earlier Russ says: 1 ‘Finally, any discussion of regular expressions would be incomplete without mentioning Jeffrey Friedl's book Mastering Regular Expressions, perhaps the most popular reference among today's programmers. 2 Friedl's book teaches programmers how best to use today's regular expression implementations, but not how best to implement them. 3 What little text it devotes to implementation issues perpetuates the widespread belief that recursive backtracking is the only way to simulate an NFA. 4 Friedl makes it clear that he [neither understands nor respects] the underlying theory.’ http://regex.info/blog/2006-09-15/248 I think Grant is after what Russ addresses in sentence 2. :-) > The impression is that Friedl shows wonderfully how to _use_ regular > expressions, but does not understand the theory behind their > implementation. Yes, Friedl does show that wonderfully. From long-ago memory, Friedl understands enough to have diagrams of NFAs and DFAs clocking through their inputs, showing the differences in number of states, etc. Yes, Friedl says an NFA must recursively backtrack. As Russ says in #3, it was a ‘widespread belief’. Friedl didn't originate it; I ‘knew’ it before reading his book. Friedl was at the sharp end of regexps, needing to process large amounts of text, at Yahoo! IIRC. He investigated how the programs available behaved; he didn't start at the theory and come up with a new program best suited to his needs. > Personally, I'd stick with Russ's stuff, especially as `egrep` is the > target here. Russ's stuff is great. He refuted that widespread belief, for one thing. But Russ isn't trying to teach a programmer how to best use the regexp engine in sed, grep, egrep, Perl, PCRE, ... whereas Friedl takes the many pages needed to do this. It depends what one wants to learn first. 
As Friedl says in the post Russ linked to: ‘As a user, you don't care if it's regular, nonregular, unregular, irregular, or incontinent. So long as you know what you can expect from it (something this chapter will show you), you know all you need to care about. ‘For those wishing to learn more about the theory of regular expressions, the classic computer-science text is chapter 3 of Aho, Sethi, and Ullman's Compilers — Principles, Techniques, and Tools (Addison-Wesley, 1986), commonly called “The Dragon Book” due to the cover design. More specifically, this is the “red dragon”. The “green dragon” is its predecessor, Aho and Ullman's Principles of Compiler Design.’ In addition to the Dragon Book, Hopcroft and Ullman's ‘Automata Theory, Languages, and Computation’ goes further into the subject. Chapter two has DFA, NFA, epsilon transitions, and uses searching text as an example. Chapter three is regular expressions, four is regular languages. Pushdown automata is chapter six. Too many books, not enough time to read. :-) -- Cheers, Ralph. From crossd at gmail.com Fri Mar 3 23:47:39 2023 From: crossd at gmail.com (Dan Cross) Date: Fri, 3 Mar 2023 08:47:39 -0500 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. In-Reply-To: <1519cce3-1c38-8a9c-cfdd-b39484bd163b@spamtrap.tnetconsulting.net> References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> <688396c8-7a25-5cd6-282c-49f1b13117d4@spamtrap.tnetconsulting.net> <1519cce3-1c38-8a9c-cfdd-b39484bd163b@spamtrap.tnetconsulting.net> Message-ID: On Thu, Mar 2, 2023 at 10:53 PM Grant Taylor via COFF wrote: >[snip > Here's an example of the full RE: > > ^\w{3} [ :[:digit:]]{11} [._[:alnum:]-]+ > postfix/msa/smtpd\[[[:digit:]]+\]: timeout after STARTTLS from > [._[:alnum:]-]+\[[.:[:xdigit:]]+\]$ > > As you can see the "[ :[:digit:]]{11}" is actually only a sub-part of a > larger RE and there is bounding & delimiting around the subpart. 
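How loose the `[ :[:digit:]]{11}` sub-part is on its own can be sketched like this. Assumptions: Python's `re` stands in for egrep, and since Python has no POSIX classes, `[0-9]` replaces `[[:digit:]]`. The `TIGHT` pattern is a hypothetical alternative that spells out the "dd hh:mm:ss" shape; it is not taken from logcheck.

```python
import re

# Python's re has no POSIX classes, so [0-9] stands in for [[:digit:]].
LOOSE = re.compile(r"[ :0-9]{11}")

# Hypothetical tighter form: space-padded day, then hh:mm:ss.
TIGHT = re.compile(r"[ 0-3][0-9] [0-2][0-9]:[0-5][0-9]:[0-5][0-9]")

samples = [
    " 2 03:23:38",  # plausible syslog day and time (day is space-padded)
    "28 23:59:59",  # another plausible one
    "123 456:789",  # junk made of the right characters
    "1" * 11,       # eleven digits
    ":" * 11,       # eleven colons
]

loose_hits = [s for s in samples if LOOSE.fullmatch(s)]
tight_hits = [s for s in samples if TIGHT.fullmatch(s)]
print(loose_hits)
print(tight_hits)
```

The tighter form is still not a validator: it would accept a day of `39` or an hour of `29`, and it would reject a leap second written as `:60`. It only trades a little length for fewer accidental matches.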
Oh, for sure; to be clear, it was obvious that in the earlier discussion the original was just part of something larger. FWIW, this RE seems ok to me; the additional context makes it unlikely to match something else accidentally. > This is to match a standard message from postfix via standard SYSLOG. > > > Ah. I suspect this relies on domain knowledge about the format of log > > lines to match reliably. Otherwise it could match, `___ 123 456:789` > > which is probably not what you are expecting. > > Yep. > > Though said domain knowledge isn't anything special in and of itself. It needn't be special. The point is simply that there's some external knowledge that can be brought to bear to guide the shape of the REs. In this case, you know that log lines won't begin with `___ 123 456:789` or other similar junk. > [snip] > > Basically, a regular expression is a regular expression if you can build > > a machine with no additional memory that can tell you whether or not a > > given string matches the RE examining its input one character at a time. > > I /think/ that I could build a complex nested tree of switch statements > to test each character to see if things match what they should or not. > Though I would need at least one variable / memory to hold absolutely > minimal state to know where I am in the switch tree. I think a number > to identify the switch statement in question would be sufficient. So > I'm guessing two bytes of variable and uncounted bytes of program code. Kinda. The "machine" in this case is actually an abstraction, like a Turing machine. The salient point here is that REs map to finite state machines, and in particular, one need not keep (say) a stack of prior states when simulating them. Note that even in an NDFA simulation, where one keeps track of what states one may be in, one doesn't need to keep track of how one got into those states. 
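The idea of a single number that records "where I am" can be sketched as a small deterministic machine. This is a toy with hypothetical names (`step`, `dfa_matches`), hand-written for the pattern `[0-9][0-9]:[0-9][0-9]`; Python has no `goto`, so a plain function plays the role of the switch.

```python
DIGITS = set("0123456789")
DEAD = -1    # no match is possible any more
ACCEPT = 5   # state reached after exactly dd:dd


def step(state, ch):
    """Return the next state for one input character."""
    if state in (0, 1, 3, 4) and ch in DIGITS:
        return state + 1
    if state == 2 and ch == ":":
        return 3
    return DEAD


def dfa_matches(text):
    """Run the machine; the only memory it keeps is one integer."""
    state = 0
    for ch in text:
        state = step(state, ch)
        if state == DEAD:
            return False
    return state == ACCEPT


print(dfa_matches("03:23"))
print(dfa_matches("3:23"))
```

The integer `state` is the whole of the machine's memory, and its range is fixed by the pattern, not by the length of the input.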
Obviously in a real implementation you've got the program counter, register contents, local variables, etc., all of which consume "memory" in the conventional sense. But the point is that you don't need additional memory proportional to anything other than the size of the RE. A DFA could be implemented entirely with `switch` and `goto` if one wanted, as opposed to a bunch of mutually recursive function calls; NDFA simulation is similar, except that you need some (bounded) additional memory to hold the active set of states. Contrast this with a pushdown automaton, which can parse a context-free language, in which a stack is maintained that can store additional information relative to the input (for example, an already seen character). Pushdown automata can, for example, recognize matched parentheses, while finite automata cannot. Anyway, sorry, this is all rather more theoretical than is perhaps interesting or useful. Bottom line is, I think your REs are probably fine. `egrep` will complain at you if they are not, and I wouldn't worry too much about optimizing them: I'd "stop" whenever you're happy that you've got something understandable that matches what you want it to match. - Dan C. From dave at horsfall.org Sat Mar 4 02:12:31 2023 From: dave at horsfall.org (Dave Horsfall) Date: Sat, 4 Mar 2023 03:12:31 +1100 (EST) Subject: [COFF] Requesting thoughts on extended regular expressions in grep. In-Reply-To: <20230303105928.E88AB215AA@orac.inputplus.co.uk> References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> <20230303105928.E88AB215AA@orac.inputplus.co.uk> Message-ID: On Fri, 3 Mar 2023, Ralph Corderoy wrote: > You'd said egrep, which is NDFA, but in other engines, alternation order > can matter, e.g. ‘J’ starts the most months and some months have more > days than others. > > /^(J(an|u[nl])|Ma[ry]|A(ug|pr)|Oct|Dec|...
I can't help but provide an extract from my antispam log summariser (AWK): # Yes, I have a warped sense of humour here. /^[JFMAMJJASOND][aeapauuuecoc][nbrrynlgptvc] [ 0123][0-9] / \ { date = sprintf("%4d/%.2d/%.2d", year, months[substr($0, 1, 3)], substr($0, 5, 2)) Etc. The idea is not to validate so much as to grab a line of interest to me and extract the bits that I want. In this case I trust the source (the Sendmail log), but of course that is not always the case... When doing things like this, you need to ask yourself at least the following questions: 1) What exactly am I trying to do? This is fairly important :-) 2) Can I trust the data? Bobby Tables, Reflections on Trusting Trust... 3) Etc. And let's not get started on the difference betwixt "trusted" and "trustworthy" (that distinction keeps security bods awake at night). -- Dave From crossd at gmail.com Sat Mar 4 03:13:13 2023 From: crossd at gmail.com (Dan Cross) Date: Fri, 3 Mar 2023 12:13:13 -0500 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. In-Reply-To: References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> <20230303105928.E88AB215AA@orac.inputplus.co.uk> Message-ID: On Fri, Mar 3, 2023 at 11:12 AM Dave Horsfall wrote: > [snip] > # Yes, I have a warped sense of humour here. > /^[JFMAMJJASOND][aeapauuuecoc][nbrrynlgptvc] [ 0123][0-9] / \ > { > date = sprintf("%4d/%.2d/%.2d", > year, months[substr($0, 1, 3)], substr($0, 5, 2)) If I may, I'd like to point out something fairly subtle here that, I think, bears on the original question (paraphrased as, "where does one draw the line between concision and understandability?"). Note Dave's class to match the first letter of the month: `[JFMAMJJASOND]`. One may notice that a few letters are repeated (J, M, A), and one _could_ shorten this to: `[JFMASOND]`. 
But I can see a serious argument where that may be regarded as a mistake; in particular, the original is easy to validate by just saying the names of the month out loud as one scans the list. For the shorter version, I'd worry that I would miss something or make a mistake. The lesson here is keep it simple and don't over-optimize! > Etc. The idea is not to validate so much as to grab a line of interest to > me and extract the bits that I want. > [snip] Too true. A few years ago, Rob Pike gave a talk about lexing in Go that bears on this that's worth a listen: https://www.youtube.com/watch?v=HxaD_trXwRE - Dan C. From ralph at inputplus.co.uk Sat Mar 4 03:38:56 2023 From: ralph at inputplus.co.uk (Ralph Corderoy) Date: Fri, 03 Mar 2023 17:38:56 +0000 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. In-Reply-To: References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> <20230303105928.E88AB215AA@orac.inputplus.co.uk> Message-ID: <20230303173856.B615421D37@orac.inputplus.co.uk> Hi, > Dave Horsfall wrote: > > # Yes, I have a warped sense of humour here. > > /^[JFMAMJJASOND][aeapauuuecoc][nbrrynlgptvc] [ 0123][0-9] / \ ... > in particular, the original is easy to validate by just saying the > names of the month out loud as one scans the list. Some clients pay me to read code and find fault. It's a hard habit to break. ‘coc’ smells wrong. :-) A bit of vi's :map later... Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dcc The regexp works, of course, but in this case removing the redundancy would also fix the ‘fault’. -- Cheers, Ralph. From coff at tuhs.org Sat Mar 4 05:06:17 2023 From: coff at tuhs.org (Grant Taylor via COFF) Date: Fri, 3 Mar 2023 12:06:17 -0700 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. 
In-Reply-To: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> Message-ID: Thank you all for very interesting and engaging comments & threads to chase / pull / untangle. I'd like to expand / refine my original question a little bit. On 3/2/23 11:54 AM, Grant Taylor via COFF wrote: > I'd like some thoughts ~> input on extended regular expressions used > with grep, specifically GNU grep -E / egrep. While doing some reading of the references that Clem provided I came across multiple indications that back-references can be problematic from a performance standpoint. So I'd like to know if all back-references are problematic, or if very specific back-references are okay. Suppose I have the following two lines: aaa aaa aaa bbb Does the following RE w/ back-reference introduce a big performance penalty? (aaa|bbb) \1 As in: % echo "aaa aaa" | egrep "(aaa|bbb) \1" aaa aaa I can easily see how a back-reference to something that is not a fixed length can become a rabbit hole. But I'm wondering if a back-reference to -- what I think is called -- an alternation (with length fixed in the RE) is a performance hit or not. Now to read and reply to the many good comments that people have shared. :-) -- Grant. . . . unix || die From crossd at gmail.com Sat Mar 4 05:09:41 2023 From: crossd at gmail.com (Dan Cross) Date: Fri, 3 Mar 2023 14:09:41 -0500 Subject: [COFF] Requesting thoughts on extended regular expressions in grep.
In-Reply-To: <20230303173856.B615421D37@orac.inputplus.co.uk> References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> <20230303105928.E88AB215AA@orac.inputplus.co.uk> <20230303173856.B615421D37@orac.inputplus.co.uk> Message-ID: On Fri, Mar 3, 2023 at 12:39 PM Ralph Corderoy wrote: > > Dave Horsfall wrote: > > > # Yes, I have a warped sense of humour here. > > > /^[JFMAMJJASOND][aeapauuuecoc][nbrrynlgptvc] [ 0123][0-9] / \ > ... > > in particular, the original is easy to validate by just saying the > > names of the month out loud as one scans the list. > > Some clients pay me to read code and find fault. It's a hard habit to > break. ‘coc’ smells wrong. :-) > > A bit of vi's :map later... > > Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dcc > > The regexp works, of course, but in this case removing the redundancy > would also fix the ‘fault’. Ha! Good catch. I'd probably just write it as, `(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)` which isn't much longer than the original anyway. - Dan C. From coff at tuhs.org Sat Mar 4 05:19:29 2023 From: coff at tuhs.org (Grant Taylor via COFF) Date: Fri, 3 Mar 2023 12:19:29 -0700 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. In-Reply-To: <20230303134215.3ED63215AA@orac.inputplus.co.uk> References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> <20230303105928.E88AB215AA@orac.inputplus.co.uk> <20230303134215.3ED63215AA@orac.inputplus.co.uk> Message-ID: <21e8477c-c388-7b90-ed10-21c7f76f0892@spamtrap.tnetconsulting.net> On 3/3/23 6:42 AM, Ralph Corderoy wrote: > I think Grant is after what Russ addresses in sentence 2. :-) You are mostly correct. The motivation for this thread is very much so wanting to learn "how best to use today's regular expression implementations". However there is also the part of me that wants to have a little bit of understanding behind why the former is the case. > Yes, Friedl does show that wonderfully. 
From long-ago memory, Friedl > understands enough to have diagrams of NFAs and DFAs clocking through > their inputs, showing the differences in number of states, etc. It seems like I need to find another copy of Friedl's book. -- My current copy is boxed up for a move nearly 1k miles away. :-/ > Yes, Friedl says an NFA must recursively backtrack. As Russ says in #3, > it was a ‘widespread belief’. Friedl didn't originate it; I ‘knew’ it > before reading his book. Friedl was at the sharp end of regexps, > needing to process large amounts of text, at Yahoo! IIRC. He > investigated how the programs available behaved; he didn't start at the > theory and come up with a new program best suited to his needs. It sounds like I'm coming from a similar position of "what is the best* way to process this corpus" more than "what is the underlying theory behind what I'm wanting to do". > Russ's stuff is great. He refuted that widespread belief, for one > thing. But Russ isn't trying to teach a programmer how to best use the > regexp engine in sed, grep, egrep, Perl, PCRE, ... whereas Friedl takes > the many pages needed to do this. :-) > It depends what one wants to learn first. I'm learning that I'm more of a technician that wants to know how to use the existing tools to the best of his / their ability. While having some interest in theory behind things. > As Friedl says in the post Russ linked to: > > ‘As a user, you don't care if it's regular, nonregular, unregular, > irregular, or incontinent. So long as you know what you can expect > from it (something this chapter will show you), you know all you need > to care about. Yep. That's the position that I would be in if someone were paying me to write the REs that I'm writing. 
> ‘For those wishing to learn more about the theory of regular expressions, > the classic computer-science text is chapter 3 of Aho, Sethi, and > Ullman's Compilers — Principles, Techniques, and Tools (Addison-Wesley, > 1986), commonly called “The Dragon Book” due to the cover design. > More specifically, this is the “red dragon”. The “green dragon” > is its predecessor, Aho and Ullman's Principles of Compiler Design.’ This all sounds interesting to me, and like something I might add to my collection of books. But it also sounds like something that will be an up hill read and vast learning opportunity. > In addition to the Dragon Book, Hopcroft and Ullman's ‘Automata Theory, > Languages, and Computation’ goes further into the subject. Chapter two > has DFA, NFA, epsilon transitions, and uses searching text as an > example. Chapter three is regular expressions, four is regular > languages. Pushdown automata is chapter six. > > Too many books, not enough time to read. :-) Yep. Even inventorying and keeping track of the books can be time consuming. -- Thankfully I took some time to do exactly that and have access to that information on the super computer in my pocket. -- Grant. . . . unix || die -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 4017 bytes Desc: S/MIME Cryptographic Signature URL: From coff at tuhs.org Sat Mar 4 05:26:41 2023 From: coff at tuhs.org (Grant Taylor via COFF) Date: Fri, 3 Mar 2023 12:26:41 -0700 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. 
In-Reply-To: References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> <688396c8-7a25-5cd6-282c-49f1b13117d4@spamtrap.tnetconsulting.net> <1519cce3-1c38-8a9c-cfdd-b39484bd163b@spamtrap.tnetconsulting.net> Message-ID: <9bb089cd-1317-6bcf-3bd3-231ce96b333c@spamtrap.tnetconsulting.net> On 3/3/23 6:47 AM, Dan Cross wrote: > Oh, for sure; to be clear, it was obvious that in the earlier > discussion the original was just part of something larger. Good. For a moment I thought that you might be thinking it was stand alone. > FWIW, this RE seems ok to me; the additional context makes it unlikely > to match something else accidentally. :-) > It needn't be special. The point is simply that there's some external > knowledge that can be brought to bear to guide the shape of the REs. ACK I've heard "domain (specific) knowledge" used to refer to both extremely specific training in a field and -- as you have -- data that is having something done to it. > In this case, you know that log lines won't begin with `___ 123 > 456:789` or other similar junk. They darned well had better not. > Kinda. The "machine" in this case is actually an abstraction, like a > Turing machine. The salient point here is that REs map to finite state > machines, and in particular, one need not keep (say) a stack of prior > states when simulating them. Note that even in an NDFA simulation, > where one keeps track of what states one may be in, one doesn't need > to keep track of how one got into those states. ACK > Obviously in a real implementation you've got the program counter, > register contents, local variables, etc, all of which consume > "memory" in the conventional sense. But the point is that you don't > need additional memory proportional to anything other than the size > of the RE. 
DFA implementation could be implemented entirely with > `switch` and `goto` if one wanted, as opposed to a bunch of mutually > recursive function calls, NDFA simulation similarly except that > you need some (bounded) additional memory to hold the active set > of states. Contrast this with a pushdown automata, which can parse > a context-free language, in which a stack is maintained that can > store additional information relative to the input (for example, > an already seen character). Pushdown automata can, for example, > recognize matched parenthesis while regular languages cannot. I think I understand the gist of what you're saying, but I need to re-read it and think about it a little bit. > Anyway, sorry, this is all rather more theoretical than is perhaps > interesting or useful. Apology returned to sender as unnecessary. You are providing the requested thought provoking discussion, which is exactly what I asked for. I feel like I'm going to walk away from this thread wiser based on the thread's content plus all additional reading material on top of the thread itself. > Bottom line is, I think your REs are probably fine. `egrep` will > complain at you if they are not, and I wouldn't worry too much about > optimizing them: I'd "stop" whenever you're happy that you've got > something understandable that matches what you want it to match. Thank you (again) Dan. :-) -- Grant. . . . unix || die -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 4017 bytes Desc: S/MIME Cryptographic Signature URL: From crossd at gmail.com Sat Mar 4 05:31:14 2023 From: crossd at gmail.com (Dan Cross) Date: Fri, 3 Mar 2023 14:31:14 -0500 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. 
In-Reply-To: References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> Message-ID: On Fri, Mar 3, 2023 at 2:06 PM Grant Taylor via COFF wrote: > Thank you all for very interesting and engaging comments & threads to > chase / pull / untangle. > > I'd like to expand / refine my original question a little bit. > > On 3/2/23 11:54 AM, Grant Taylor via COFF wrote: > > I'd like some thoughts ~> input on extended regular expressions used > > with grep, specifically GNU grep -e / egrep. > > While some reading of the references that Clem provided I came across > multiple indications that back-references can be problematic from a > performance stand point. > > So I'd like to know if all back-references are problematic, or if very > specific back-references are okay. The thing about backreferences is that they're not representable in the regular languages because they require additional state (the thing the backref refers to), so you cannot construct a DFA corresponding to them, nor an NDFA simulator (this is where Freidl gets things wrong!); you really need a pushdown automata and then you're in the domain of the context-free languages. Therefore, "regexps" that use back references are not actually regular expressions. Yet, popular engines support them...but how? Well, pretty much all of them use a backtracking implementation, which _can_ be exponential in both time and space. Now, that said, there are plenty of REs, even some with backrefs, that'll execute plenty fast enough on backtracking implementations; it really depends on the expressions in question and the size of strings you're trying to match against. But you lose the bounding guarantees DFAs and NDFAs provide. > Suppose I have the following two lines: > > aaa aaa > aaa bbb > > Does the following RE w/ back-reference introduce a big performance penalty? 
> > (aaa|bbb) \1 > > As in: > > % echo "aaa aaa" | egrep "(aaa|bbb) \1" > aaa aaa > > I can easily see how a back reference to something that is not a fixed > length can become a rabbit hole. But I'm wondering if a back-reference > to -- what I think is called -- an alternation (with length fixed in the > RE) is a performance hit or not. Well, it's more about the implementation strategy than the specific expression here. Could this become exponential? I don't think this one would, no; but others may, particularly if you use Kleene closures in the alternation. This _is_ something that appears in the wild, by the way, not just in theory; I did a change to Google's spelling service code to replace PCRE with re2 precisely because it was blowing up with exponential memory usage on some user input. The problems went away, but I had to rewrite a bunch of the REs involved. - Dan C. From coff at tuhs.org Sat Mar 4 05:36:35 2023 From: coff at tuhs.org (Grant Taylor via COFF) Date: Fri, 3 Mar 2023 12:36:35 -0700 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. In-Reply-To: References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> <20230303105928.E88AB215AA@orac.inputplus.co.uk> Message-ID: <8648a720-62a6-1ed2-b0ba-2dcc38097da6@spamtrap.tnetconsulting.net> On 3/3/23 9:12 AM, Dave Horsfall wrote: > I can't help but provide an extract from my antispam log summariser > (AWK): > > # Yes, I have a warped sense of humour here. > /^[JFMAMJJASOND][aeapauuuecoc][nbrrynlgptvc] [ 0123][0-9] / \ > { > date = sprintf("%4d/%.2d/%.2d", > year, months[substr($0, 1, 3)], substr($0, 5, 2)) Thank you for sharing that Dave. > Etc. The idea is not to validate so much as to grab a line of interest > to me and extract the bits that I want. Fair enough. Using bracket expressions for the three letters is definitely another idea that I hadn't considered. 
But I believe I like what I think is -- what I'm going to describe as -- the more precise alternation listing out each month. (Jan|Feb|Mar... Such an alternation is not going to match Jer like the three bracket expressions will. I also believe that the alternation will be easier to maintain in the future. Especially by someone other than me that has less experience with REs. > In this case I trust the source (the Sendmail log), but of course > that is not always the case... I trust that syslog will produce consistent line beginnings more than I trust the data that is provided to syslog. But I'd still like to be able to detect "Jer" or "Dot" if syslog ever tosses it's cookies. > When doing things like this, you need to ask yourself at least the > following questions: > > 1) What exactly am I trying to do? This is fairly important :-) Filter out known to be okay log entries. > 2) Can I trust the data? Bobby Tables, Reflections on Trusting > Trust... Given that I'm effectively negating things and filtering out log entries that I want to not see (because they are okay) I'm comfortable with trusting the data from syslog. Brown M&Ms come to mind. > 3) Etc. > > And let's not get started on the difference betwixt "trusted" and > "trustworthy" (that distinction keeps security bods awake at night). ACK -- Grant. . . . unix || die -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 4017 bytes Desc: S/MIME Cryptographic Signature URL: From ralph at inputplus.co.uk Sat Mar 4 20:07:17 2023 From: ralph at inputplus.co.uk (Ralph Corderoy) Date: Sat, 04 Mar 2023 10:07:17 +0000 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. 
In-Reply-To: References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> Message-ID: <20230304100717.E8F882021A@orac.inputplus.co.uk> Hi Grant, > Suppose I have the following two lines: > > aaa aaa > aaa bbb > > Does the following RE w/ back-reference introduce a big performance > penalty? > > (aaa|bbb) \1 > > As in: > > % echo "aaa aaa" | egrep "(aaa|bbb) \1" > aaa aaa You could measure the number of CPU instructions and experiment. $ echo xyzaaa aaaxyz >f $ ticks() { LC_ALL=C perf stat -e instructions egrep "$@"; } $ $ ticks '(aaa|bbb) \1' References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> <20230303105928.E88AB215AA@orac.inputplus.co.uk> <20230303134215.3ED63215AA@orac.inputplus.co.uk> <21e8477c-c388-7b90-ed10-21c7f76f0892@spamtrap.tnetconsulting.net> Message-ID: <20230304101533.D9CCF2021A@orac.inputplus.co.uk> Hi, Grant wrote: > Even inventorying and keeping track of the books can be time > consuming. -- Thankfully I took some time to do exactly that and > have access to that information on the super computer in my pocket. I seek recommendations for an Android app to comfortably read PDFs on a mobile phone's screen. They were intended to be printed as a book. In particular, once I've zoomed and panned to get the interesting part of a page as large as possible, swiping between pages should persist that view. An extra point for allowing odd and even pages to use different panning. -- Cheers, Ralph. From ralph at inputplus.co.uk Sat Mar 4 20:26:51 2023 From: ralph at inputplus.co.uk (Ralph Corderoy) Date: Sat, 04 Mar 2023 10:26:51 +0000 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. 
In-Reply-To: <8648a720-62a6-1ed2-b0ba-2dcc38097da6@spamtrap.tnetconsulting.net> References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> <20230303105928.E88AB215AA@orac.inputplus.co.uk> <8648a720-62a6-1ed2-b0ba-2dcc38097da6@spamtrap.tnetconsulting.net> Message-ID: <20230304102651.D73622021A@orac.inputplus.co.uk> Hi Grant, > the more precise alternation listing out each month. (Jan|Feb|Mar... For those regexp engines which test each alternative in turn, ordering the months most-frequent first would give a slight win. :-) It really is a rabbit hole once you start. Typically not worth entering, but it can be fun if you like that kind of thing. > I trust that syslog will produce consistent line beginnings more than > I trust the data that is provided to syslog. But I'd still like to be > able to detect "Jer" or "Dot" if syslog ever tosses it's cookies. You could develop your regexps to find lines of interest and then flip them about, e.g. egrep's -v, to see what lines are missed and consider if any are interesting. Repeat. But this happens at development time. Or at run time, you can have a ‘loose’ regexp to let all expected lines in through the door and then match with one or more ‘tight’ regexps, baulking if none do. There's no right answer in general. -- Cheers, Ralph. From ralph at inputplus.co.uk Sun Mar 5 01:15:53 2023 From: ralph at inputplus.co.uk (Ralph Corderoy) Date: Sat, 04 Mar 2023 15:15:53 +0000 Subject: [COFF] A second Unix Patent In-Reply-To: <202303041123.324BND9W061456@ultimate.com> References: <20230304015746.DD95518C08D@mercury.lcs.mit.edu> <20230304092216.287E22020E@orac.inputplus.co.uk> <202303041123.324BND9W061456@ultimate.com> Message-ID: <20230304151553.AD3EC210F2@orac.inputplus.co.uk> Hi Phil, Copying to the COFF list, hope that's okay. I thought it might interest them. > > $ units -1v '26^3 16 bit' 64KiB > > Works only for GNU units. That's interesting, thanks. 
I've access to a FreeBSD 12.3-RELEASE-p6, if that version number means something to you. Its units groks ^ to mean power when applied to a unit, as the fine units(1) says, but not to a number. Whereas * works. $ units yd^3 ft^3 * 27 / 0.037037037 $ $ units 6\*7 21 * 2 / 0.5 $ $ units 2^4 64 * 0.03125 / 32 $ The last one silently treats 2^4 as 2; I'd say that's a bug. It has Ki- and byte allowing $ units -t Kibyte bit 8192 but lacks GNU's B byte Fair enough, though I think that's common enough now to be included. FreeBSD also seems to have another bug: demanding a space between the quantity and the unit for fundamental ‘!’ units. $ units m 8m conformability error 1 m 8 $ units m '8 m' * 0.125 / 8 $ I found this when attempting the obvious $ units Kibyte 8bit conformability error 8192 bit 8 $ units Kibyte '8 bit' * 1024 / 0.0009765625 $ Whilst I'm not a GNU acolyte, in this case its version of units does seem to have had a bit more TLC. :-) -- Cheers, Ralph. From egbegb2 at gmail.com Mon Mar 6 20:01:47 2023 From: egbegb2 at gmail.com (Ed Bradford) Date: Mon, 6 Mar 2023 04:01:47 -0600 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. In-Reply-To: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> Message-ID: Thanks, Grant and contributors in this thread, Great thread on RE's. I bought and read the book (it's on the floor over there in the corner and I'm not getting up). My task was finding dates in binary and text files. It turns out RE's work just fine for that. Because I was looking at both text files and binary files, I wrote my stuff using 8-bit python "bytes" rather than python "text" which is, I think, 7-bit in python. (I use python because it works on both Linux, Macs and Windows and reduces the number of RE implementations I have to deal with to 1). I finished my first round of the program late fall of 2022. 
Then I put it down and now I am revisiting it. I was creating: A Python program to search for media files (pictures and movies) and copy them to another directory tree, copying only the unique ones (deduplication), and renaming each with *YYYY-MM-DD-* as a prefix. Here is a list of observations from my programming. 1. RE's are quite unreadable. I defined a lot of python variables and simply added them together in python to make a larger byte string (see below). The resulting expressions were shorter on screen and more readable. Furthermore, I could construct them incrementally. I insist on readable code because I frequently put things down for a month or more. A while back it was a sad day when I restarted something and simply had to throw it away, moaning, "What was that programmer thinking?". Here is an example RE for YYYY-MM-DD # FR = front BA = back # ymdt is text version ymdt = FRSEP + Y_ + SEP + M_ + SEP + D_ + BASEP ymdc = re.compile( ymdt ) 1a. I also had a time defining delimiters. There are delimiters for the beginning, delimiters for internal separation, and delimiters for the end. The significant thing is I have to find the RE if it is the very first string in the file or the very last. That also complicates buffered reading immensely. Hence, I wrote the whole program by reading the file into a single python variable. However, when files become much larger than memory, python simply ground to a halt as did my Windows machine. I then rewrote it using a memory mapped file (for all files) and the problem was fixed. 2. Dates are formatted in a number of ways. I chose exactly one format to learn about RE's and how to construct them and use them. Even the book didn't elaborate everything. I could not find detailed documentation on some of the interfaces in the book. On a whim, I asked chatGPT to write a python module that returns a list of offsets and dates in a file. Surprisingly, it wrote one that was quite credible. 
It had bugs but it knew more about how to use the various functional interfaces in RE's than I did. 3. Testing an RE is maybe even more difficult than writing one. I have not given any serious effort to verification testing yet. I would like to extend my program to any date format. That would require a much bigger RE. I have been led to believe that a 50Kbyte or 500Kbyte RE works just as well (if not as fast) as a 100 byte RE. I think with parentheses and pipe-symbols suitably used, one could match Monday, March 6, 2023 2023-03-06 Mar 6, 2023 or ... I'm just guessing, though. This thread has been very informative. I have much to read. Thank all of you. Ed Bradford Pflugerville, TX On Thu, Mar 2, 2023 at 12:55 PM Grant Taylor via COFF wrote: > Hi, > > I'd like some thoughts ~> input on extended regular expressions used > with grep, specifically GNU grep -e / egrep. > > What are the pros / cons to creating extended regular expressions like > the following: > > ^\w{3} > > vs: > > ^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) > > Or: > > [ :[:digit:]]{11} > > vs: > > ( 1| 2| 3| 4| 5| 6| 7| 8| > 9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31) > (0|1|2)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]] > > I'm currently eliding the 61st (60) second, the 32nd day, and dealing > with February having fewer days for simplicity. > > For matching patterns like the following in log files? > > Mar 2 03:23:38 > > I'm working on organically training logcheck to match known good log > entries. So I'm *DEEP* in the bowels of extended regular expressions > (GNU egrep) that runs over all logs hourly. As such, I'm interested in > making sure that my REs are both efficient and accurate or at least not > WILDLY badly structured. The pedantic part of me wants to avoid > wildcard type matches (\w), even if they are bounded (\w{3}), unless it > truly is for unpredictable text. 
> > I'd appreciate any feedback and recommendations from people who have > been using and / or optimizing (extended) regular expressions for longer > than I have been using them. > > Thank you for your time and input. > > > > -- > Grant. . . . > unix || die > > -- Advice is judged by results, not by intentions. Cicero -------------- next part -------------- An HTML attachment was scrubbed... URL: From crossd at gmail.com Tue Mar 7 07:01:51 2023 From: crossd at gmail.com (Dan Cross) Date: Mon, 6 Mar 2023 16:01:51 -0500 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. In-Reply-To: References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> Message-ID: On Mon, Mar 6, 2023 at 5:02 AM Ed Bradford wrote: >[snip] > I would like to extend my program to > any date format. That would require > a much bigger RE. I have been led to > believe that a 50Kbyte or 500Kbyte > RE works just as well (if not > as fast) as a 100 byte RE. I think > with parentheses and > pipe-symbols suitably used, > one could match > > Monday, March 6, 2023 > 2023-03-06 > Mar 6, 2023 > or > ... This reminds me of something that I wanted to bring up. Perhaps one _could_ define a sufficiently rich regular expression that one could match a number of date formats. However, I submit that one _should not_. REs may be sufficiently powerful, but in all likelihood what you'll end up with is an unreadable mess; it's like people who abuse `sed` or whatever to execute complex, general purpose programs: yeah, it's a clever hack, but that doesn't mean you should do it. Pick the right tool for the job. REs are a powerful tool, but they're not the right tool for _every_ job, and I'd argue that once you hit a threshold of complexity that'll be mostly self-evident, it's time to move on to something else. As for large vs small REs.... 
When we start talking about differences of orders of magnitude in size, we start talking about real performance implications; in general an NDFA simulation of a regular expression will have on the order of the length of the RE in states, so when the length of the RE is half a million symbols, that's half-a-million states, which practically speaking is a pretty big number, even though it's bounded is still a pretty big number, and even on modern CPUs. I wouldn't want to poke that bear. - Dan C. From steffen at sdaoden.eu Tue Mar 7 07:49:05 2023 From: steffen at sdaoden.eu (Steffen Nurpmeso) Date: Mon, 06 Mar 2023 22:49:05 +0100 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. In-Reply-To: References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> Message-ID: <20230306214905.vK5oe%steffen@sdaoden.eu> Dan Cross wrote in : |On Mon, Mar 6, 2023 at 5:02 AM Ed Bradford wrote: |>[snip] |> I would like to extend my program to |> any date format. That would require |> a much bigger RE. I have been led to ... |> one could match |> |> Monday, March 6, 2023 |> 2023-03-06 |> Mar 6, 2023 |> or ... |This reminds me of something that I wanted to bring up. Me too. If it becomes something regular and stable maybe turn into a dedicated parser. (As a lex yacc bison byacc refuser, but these surely can too.) |Perhaps one _could_ define a sufficiently rich regular expression that |one could match a number of date formats. However, I submit that one |_should not_. REs may be sufficiently powerful, but in all likelihood ... Kurt Shoens implemented some date template parser for BSD Mail in about 1980 that was successively changed many years later by Edward Wang in 1988 ([1] commit [309eb459e35f77985851ce143ad2f9da5f0d90da], 1988-07-08 18:41:33 -0800). There is strftime(3), but it came later than both to CSRG, and the Wang thing (in usr.bin/mail/head.c) is a dedicated thing. 
(Ie /* Template characters for cmatch_data.tdata: * 'A' An upper case char * 'a' A lower case char * ' ' A space * '0' A digit * 'O' An optional digit or space; MUST be followed by '0space'! * ':' A colon * '+' Either a plus or a minus sign */ and then according strings like "Aaa Aaa O0 00:00:00 0000".) [1] https://github.com/robohack/ucb-csrg-bsd.git --steffen | |Der Kragenbaer, The moon bear, |der holt sich munter he cheerfully and one by one |einen nach dem anderen runter wa.ks himself off |(By Robert Gernhardt) From lm at mcvoy.com Tue Mar 7 11:43:11 2023 From: lm at mcvoy.com (Larry McVoy) Date: Mon, 6 Mar 2023 17:43:11 -0800 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. In-Reply-To: References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> Message-ID: <20230307014311.GN5398@mcvoy.com> On Mon, Mar 06, 2023 at 04:01:51PM -0500, Dan Cross wrote: > On Mon, Mar 6, 2023 at 5:02 AM Ed Bradford wrote: > >[snip] > > I would like to extend my program to > > any date format. That would require > > a much bigger RE. I have been led to > > believe that a 50Kbyte or 500Kbyte > > RE works just as well (if not > > as fast) as a 100 byte RE. I think > > with parentheses and > > pipe-symbols suitably used, > > one could match > > > > Monday, March 6, 2023 > > 2023-03-06 > > Mar 6, 2023 > > or > > ... > > This reminds me of something that I wanted to bring up. > > Perhaps one _could_ define a sufficiently rich regular expression that > one could match a number of date formats. However, I submit that one > _should not_. REs may be sufficiently powerful, but in all likelihood > what you'll end up with is an unreadable mess; it's like people who > abuse `sed` or whatever to execute complex, general purpose programs: > yeah, it's a clever hack, but that doesn't mean you should do it. Dan, I agree with you. I ran a software company for almost 20 years and the main thing I contributed was "lets be dumb".
Lets write code that is easy to read, easy to bug fix. Smart engineers love to be clever, they would be the folks that wrote those long RE that worked magic. But that magic was something they understood and nobody else did. Less is more. Less is easy to support. From egbegb2 at gmail.com Tue Mar 7 14:01:14 2023 From: egbegb2 at gmail.com (Ed Bradford) Date: Mon, 6 Mar 2023 22:01:14 -0600 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. In-Reply-To: <20230307014311.GN5398@mcvoy.com> References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> <20230307014311.GN5398@mcvoy.com> Message-ID: I have made an attempt to make my RE stuff readable and supportable. I think I write more description that I do RE "code". As for, *it won't be comprehendable,* Machine language was unreadable and then along came assembly language. Assembly language was unreadable, then came higher level languages. Even higher level languages are unsupportable if not well documented and mostly simple to understand ("you are not expected to understand this" notwithstanding). The jump from machine language to python today was unimagined in early times. [ As an old timer, I see inflection points between: machine language and assembly language assembly language and high level languages and high level languages and python. But that's just me. ] I think it is possible to make a 50K RE that is understandable. However, it requires a lot of 'splainin' throughout the code. I'm naive though; I will eventually discover a lack of truth in that belief, if such exists. I repeat. I put stuff down for months at a time. My metric is *coming back to it* *and understanding where I left off*. So far, I can do that for this RE program that works for small files, large files, binary files and text files for exactly one pattern: YYYY[-MM-DD] I constructed this RE with code like this: # ymdt is YYYY-MM-DD RE in text. 
# looking only for 1900s and 2000s years and no later than today. _YYYY = "(19\d\d|20[01]\d|202" + "[0-" + lastYearRE) + "]" + "){1}" # months _MM = "(0[1-9]|1[012])" # days _DD = "(0[1-9]|[12]\d|3[01])" ymdt = _YYYY + '[' + _INTERNALSEP + _MM + _INTERNALSEP + ']'{0,1) For the whole file, RE I used ymdthf = _FRSEP + ymdt + _BASEP where FRSEP is front separator which includes a bunch of possible separators, excluding numbers and letters, or-ed with the up arrow "beginning of line" RE mark. BASEP is back separator is same as FRSEP with "^" replaced with "$". I then aimed ymdthf at "data" the thing that represents the entire memory mapped file (where there is only one beginning and one end). Again, I say validating an RE is as difficult or more than writing one. What does it miss? Dates are an excellent test ground for RE's. Latitude and longitude is another. Ed PS: I thought I was on the COFF mailing list. I received this email by direct mail to from Larry. I haven't seen any other comments on my submission. I might have unsubscribed, but now I regret it. Dear powers that be: Please resubscribe me. -------------- next part -------------- An HTML attachment was scrubbed... URL: From egbegb2 at gmail.com Tue Mar 7 14:19:42 2023 From: egbegb2 at gmail.com (Ed Bradford) Date: Mon, 6 Mar 2023 22:19:42 -0600 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. In-Reply-To: References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> Message-ID: Hi Dan, It sounds to me like an "optimizer" is needed. There is alreay a compiler that uses FA's. Is someone else going to create a program to look for dates without using regular expressions? Today, I write small-sized RE's. If I write a giant RE, there is nothing preventing the owner of RE world to change how they are used. For instance. Compile your RE and a subroutine/function is produced that performs the RE search. RE is a *language*, not necessarily an implementation. 
At least that is my understanding. Ed On Mon, Mar 6, 2023 at 3:02 PM Dan Cross wrote: > On Mon, Mar 6, 2023 at 5:02 AM Ed Bradford wrote: > >[snip] > > I would like to extend my program to > > any date format. That would require > > a much bigger RE. I have been led to > > believe that a 50Kbyte or 500Kbyte > > RE works just as well (if not > > as fast) as a 100 byte RE. I think > > with parentheses and > > pipe-symbols suitably used, > > one could match > > > > Monday, March 6, 2023 > > 2023-03-06 > > Mar 6, 2023 > > or > > ... > > This reminds me of something that I wanted to bring up. > > Perhaps one _could_ define a sufficiently rich regular expression that > one could match a number of date formats. However, I submit that one > _should not_. REs may be sufficiently powerful, but in all likelihood > what you'll end up with is an unreadable mess; it's like people who > abuse `sed` or whatever to execute complex, general purpose programs: > yeah, it's a clever hack, but that doesn't mean you should do it. > > Pick the right tool for the job. REs are a powerful tool, but they're > not the right tool for _every_ job, and I'd argue that once you hit a > threshold of complexity that'll be mostly self-evident, it's time to > move on to something else. > > As for large vs small REs.... When we start talking about differences > of orders of magnitude in size, we start talking about real > performance implications; in general an NDFA simulation of a regular > expression will have on the order of the length of the RE in states, > so when the length of the RE is half a million symbols, that's > half-a-million states, which practically speaking is a pretty big > number, even though it's bounded is still a pretty big number, and > even on modern CPUs. > > I wouldn't want to poke that bear. > > - Dan C. > -- Advice is judged by results, not by intentions. Cicero -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ralph at inputplus.co.uk Tue Mar 7 21:39:49 2023 From: ralph at inputplus.co.uk (Ralph Corderoy) Date: Tue, 07 Mar 2023 11:39:49 +0000 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. In-Reply-To: References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> <20230307014311.GN5398@mcvoy.com> Message-ID: <20230307113949.501602135B@orac.inputplus.co.uk> Hi Ed, > I have made an attempt to make my RE stuff readable and supportable. Readable to you, which is fine because you're the prime future reader. But it's less readable than the regexp to those that know and read them because of the indirection introduced by the variables. You've created your own little language of CAPITALS rather than the lingua franca of regexps. :-) > Machine language was unreadable and then along came assembly language. > Assembly language was unreadable, then came higher level languages. Each time the original language was readable because practitioners had to read and write it. When its replacement came along, the old skill was no longer learnt and the language became ‘unreadable’. > So far, I can do that for this RE program that works for small files, > large files, binary files and text files for exactly one pattern: >     YYYY[-MM-DD] > I constructed this RE with code like this: > # ymdt is YYYY-MM-DD RE in text. > # looking only for 1900s and 2000s years and no later than today. > _YYYY = "(19\d\d|20[01]\d|202" + "[0-" + lastYearRE) + "]" + "){1}" ‘{1}’ is redundant. > # months > _MM   = "(0[1-9]|1[012])" > # days > _DD   = "(0[1-9]|[12]\d|3[01])" > ymdt = _YYYY + '[' + _INTERNALSEP + >                      _MM          + >                      _INTERNALSEP + >                ']'{0,1) I think we're missing something as the ‘'['’ is starting a character class which is odd for wrapping the month and the ‘{0,1)’ doesn't have matching brackets and is outside the string. BTW, ‘{0,1}’ is more readable to those who know regexps as ‘?’. 
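For reference, here is a minimal Python sketch of what the quoted snippet appears to be aiming for: the same YYYY[-MM-DD] pattern, assembled from named pieces, with `?` in place of `{0,1}` and balanced brackets. Every name in it (`LAST_YEAR_DIGIT`, `SEP_CLASS`, `find_dates`) is hypothetical and not taken from Ed's program. The 2023 cut-off, the separator set, the `\b` word boundaries, and the rule that both separators must be the same character are assumptions too. That last rule uses a back-reference, so strictly speaking the pattern is no longer a regular expression in the theoretical sense discussed earlier in the thread.

```python
import re

# Hypothetical names throughout; this is an illustration, not Ed's actual code.
LAST_YEAR_DIGIT = 3        # assumption: accept years 1900..2023
SEP_CLASS = r"[-:._]"      # assumption: a single separator character

YYYY = rf"(?:19\d\d|20[01]\d|202[0-{LAST_YEAR_DIGIT}])"
MM = r"(?:0[1-9]|1[0-2])"
DD = r"(?:0[1-9]|[12]\d|3[01])"

# "(?: ... )?" makes the whole separator-MM-separator-DD tail optional.
# (?P=sep) is a back-reference: the second separator must equal the first.
YMD = re.compile(
    rf"\b{YYYY}(?:(?P<sep>{SEP_CLASS}){MM}(?P=sep){DD})?\b".encode("ascii")
)

def find_dates(data: bytes):
    """Return (byte offset, matched bytes) for each date-like match in data."""
    return [(m.start(), m.group()) for m in YMD.finditer(data)]
```

Because the pattern is compiled from bytes, it can be run over anything that supports the buffer protocol, which should include a memory-mapped file. A year with a malformed tail, such as `2000-01.01`, is reported as the bare year.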
> For the whole file, RE I used > ymdthf = _FRSEP + ymdt + _BASEP > where FRSEP is front separator which includes > a bunch of possible separators, excluding numbers and letters, or-ed > with the up arrow "beginning of line" RE mark. It sounds like you're wanting a word boundary; something provided by regexps. In Python, it's ‘\b’. >>> re.search(r'\bfoo\b', 'endfoo foostart foo ends'), (<re.Match object; span=(16, 19), match='foo'>,) Are you aware of the /x modifier to a regexp which ignores internal whitespace, including linefeeds? This allows a large regexp to be split over lines. There's a comment syntax too. See https://docs.python.org/3/library/re.html#re.X GNU grep isn't too shabby at looking through binary files. I can't use /x with grep so in a bash script, I'd do it manually. \< and \> match the start and end of a word, a bit like Python's \b. re=' .?\< (19[0-9][0-9]|20[01][0-9]|202[0-3]) ( ([-:._]) (0[1-9]|1[0-2]) \3 (0[1-9]|[12][0-9]|3[01]) )? \>.? ' re=${re//$'\n'/} re=${re// /} printf '%s\n' 2001-04-01,1999_12_31 1944.03.01,1914! 2000-01.01 >big-binary-file LC_ALL=C grep -Eboa "$re" big-binary-file | sed -n l which gives 0:2001-04-01,$ 11:1999_12_31$ 22:1944.03.01,$ 33:1914!$ 39:2000-$ showing: - the byte offset within the file of each match, - along with any byte before and after it if it's not a \n and not already matched, just to show the word-boundary at work, - with any non-printables escaped into octal by sed. > I thought I was on the COFF mailing list. I'm sending this to just the list. > I received this email by direct mail to from Larry. Perhaps your account on the list is configured to not send you an email if it sees your address in the header's fields. -- Cheers, Ralph. From crossd at gmail.com Wed Mar 8 02:14:49 2023 From: crossd at gmail.com (Dan Cross) Date: Tue, 7 Mar 2023 11:14:49 -0500 Subject: [COFF] Requesting thoughts on extended regular expressions in grep.
In-Reply-To: References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> <20230307014311.GN5398@mcvoy.com> Message-ID: On Mon, Mar 6, 2023 at 11:01 PM Ed Bradford wrote: >[snip] > I think it is possible to make a 50K RE that is understandable. However, it requires > a lot of 'splainin' throughout the code. I'm naive though; I will eventually discover > a lack of truth in that belief, if such exists. Actually, I believe you. I'm sure that with enough effort, it _is_ possible to make a 50K RE that is understandable to mere mortals. But it begs the question: why bother? The answer to that question, in my mind, shows the difference between a clever programmer and a pragmatic engineer. I submit that it's time to reach for another tool well before you get to an RE that big, and if one is still considering such a thing, one must really ask what properties of REs and the problem at hand one thinks lend itself to that as the solution. >[snip] > It sounds to me like an "optimizer" is needed. There is alreay a compiler > that uses FA's. I'm not sure what you're referring to here, though you were replying to me. There are a couple of different threads floating around: 1. Writing really big regular expressions: this is probably a bad idea. Don't do it (see below). 2. Writing a recognizer for dates. Yeah, the small REs you have for that are fine. If you want to extend those to arbitrary date formats, I think you'll find it starts getting ugly. 3. Optimizing regular expressions. You're still bound by the known theoretical properties of finite automata here. > Is someone else going to create a program > to look for dates without using regular expressions? Many people have already done so. :-) > Today, I write small-sized RE's. If I write a giant RE, there is nothing preventing > the owner of RE world to change how they are used. For instance. Compile your RE > and a subroutine/function is produced that performs the RE search. 
I'm not sure I understand what you mean.

The theory here is well-understood: we know recognizers for regular
languages can be built from DFAs, that run in time linear in the size
of their inputs, but we also know that constructing such a DFA can be
exponential in space and time, and thus impractical for many REs. We
know that NDFA simulators can be built in time and space linear in the
length of the RE, but that the resulting recognizers will be
superlinear at runtime, proportional to the product of the length of
input, number of states, and number of edges between states in the
state transition graph. For a very large regular expression, that's
going to be a pretty big number, and even on modern CPUs won't be
particularly fast. Compilation to native code won't really help you.

There is no "owner of RE world" that can change that. If you can find
some way to do so, I think that would qualify as a major breakthrough
in computer science.

> RE is a language, not necessarily an implementation.
> At least that is my understanding.

Regular expressions describe regular languages, but as I mentioned
above, the theory gives the currently understood bounds for their
performance characteristics. It's kinda like the speed of light in
this regard; we can't really make it go faster.

        - Dan C.

From tytso at mit.edu Wed Mar 8 02:42:14 2023
From: tytso at mit.edu (Theodore Ts'o)
Date: Tue, 7 Mar 2023 11:42:14 -0500
Subject: [COFF] [TUHS] Re: Origins of the frame buffer device
In-Reply-To: <20230306232429.GL5398@mcvoy.com>
References: <8BD57BAB138946830AF560E17376A63B.for-standards-violators@oclsc.org> <20230306232429.GL5398@mcvoy.com>
Message-ID: <20230307164214.GC960946@mit.edu>

(Moving to COFF)

On Mon, Mar 06, 2023 at 03:24:29PM -0800, Larry McVoy wrote:
> But even that seems suspect, I would think they could put some logic
> in there that just doesn't feed power to the GPU if you aren't using
> it but maybe that's harder than I think.
>
> If it's not about power then I don't get it, there are tons of transistors
> waiting to be used, they could easily plunk down a bunch of GPUs on the
> same die so why not?  Maybe the dev timelines are completely different
> (I suspect not, I'm just grabbing at straws).

Other potential reasons:

1) Moving functionality off-CPU also allows for those devices to have
their own specialized video memory that might be faster (SDRAM) or
dual-ported (VRAM) without having to add that complexity to the more
general system DRAM and/or the CPU's Northbridge.

2) In some cases, an off-chip co-processor may not need any access to
the system memory as well.  An example of this is the "bump in the
wire" in-line crypto engines (ICE) which are located between the
Southbridge and the eMMC/UFS flash storage device.  If you are using
an Android device, it's likely to have an ICE.  The big advantage is
that it avoids needing to have a bounce buffer on the write path,
where the file system encryption layer has to copy-and-encrypt data
from the page cache to a bounce buffer, and then the encrypted block
will then get DMA'ed to the storage device.

3) From an architectural perspective, not all use cases need various
co-processors, whether it is doing cryptography, or running some kind
of machine-learning module, or image manipulation to simulate bokeh,
or create HDR images, etc.  While RISC-V does have the concept of
instruction set extensions, which can be developed without getting
permission from the "owners" of the core CPU ISA (e.g., ARM, Intel,
etc.), it's a lot more convenient for someone who doesn't need to bend
the knee to ARM, inc. (or their new corporate overloads) or Intel, to
simply put that extension outside the core ISA.
(More recently, there is an interesting lawsuit about whether it's "allowed" to put a 3rd party co-processor on the same SOC without paying $$$$$ to the corporate overload, which may make this point moot --- although it might cause people to simply switch to another ISA that doesn't have this kind of lawsuit-happy rent-seeking....) In any case, if you don't need to play Quake with 240 frames per second, then there's no point putting the GPU in the core CPU architecture, and it may turn out that the kind of co-processor which is optimized for running ML models is different, and it is often easier to make changes to the programming model for a GPU, compared to making changes to a CPU's ISA. - Ted From ralph at inputplus.co.uk Wed Mar 8 03:34:51 2023 From: ralph at inputplus.co.uk (Ralph Corderoy) Date: Tue, 07 Mar 2023 17:34:51 +0000 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. In-Reply-To: References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> <20230307014311.GN5398@mcvoy.com> Message-ID: <20230307173451.D94B421C9B@orac.inputplus.co.uk> Hi Dan, > I'm sure that with enough effort, it _is_ possible to make a 50K RE > that is understandable to mere mortals. But it begs the question: why > bother? It could be the quickest way to express the intent. > The answer to that question, in my mind, shows the difference between > a clever programmer and a pragmatic engineer. I think those two can overlap. :-) > I submit that it's time to reach for another tool well before you get > to an RE that big Why, if the grammar is type three in Chomsky's hierarchy, i.e. a regular grammar? I think sticking with code aimed at regular grammars, or more likely regexps, will do better than, say, a parser generator for a type-two context-free grammar. As well as the lex(1) family, there's Ragel as another example. http://www.colm.net/open-source/ragel/ > 3. Optimizing regular expressions. 
You're still bound by the known
> theoretical properties of finite automata here.

Well, we're back to the RE v. regexp again.  /^[0-9]+\.jpeg$/ is matched
by some engines by first checking the last five bytes are ‘.jpeg’.

    $ debugperl -Dr -e \
    > '"123546789012354678901235467890123546789012.jpg" =~ /^[0-9]+\.jpeg$/'
    ...
    Matching REx "^[0-9]+\.jpeg$" against "123546789012354678901235467890123546789012.jpg"
    Intuit: trying to determine minimum start position...
      doing 'check' fbm scan, [1..46] gave -1
      Did not find floating substr ".jpeg"$...
    Match rejected by optimizer
    Freeing REx: "^[0-9]+\.jpeg$"
    $

Boyer-Moore string searching can be used.  Common-subregexp-elimination
can spot repetitive fragments of regexp and factor them into a single set
of states along with pairing the route into them with the appropriate
route out.

The more regexp engines are optimised, the more benefit to the
programmer from sticking to a regexp rather than, say, ad hoc parsing.

The theory of REs is interesting and important, but regexps deviate from
it ever more.

-- 
Cheers, Ralph.

From coff at tuhs.org Wed Mar 8 04:31:55 2023
From: coff at tuhs.org (Grant Taylor via COFF)
Date: Tue, 7 Mar 2023 11:31:55 -0700
Subject: [COFF] Requesting thoughts on extended regular expressions in grep.
In-Reply-To: <20230307113949.501602135B@orac.inputplus.co.uk>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> <20230307014311.GN5398@mcvoy.com> <20230307113949.501602135B@orac.inputplus.co.uk>
Message-ID:

On 3/7/23 4:39 AM, Ralph Corderoy wrote:
> Readable to you, which is fine because you're the prime future
> reader. But it's less readable than the regexp to those that know
> and read them because of the indirection introduced by the variables.
> You've created your own little language of CAPITALS rather than the
> lingua franca of regexps.
:-)

I want to agree, but then I run into things like this:

^\w{3} [ :[:digit:]]{11} [._[:alnum:]-]+ postfix(/smtps)?/smtpd\[[[:digit:]]+\]: disconnect from [._[:alnum:]-]+\[[.:[:xdigit:]]+\]( helo=[[:digit:]]+(/[[:digit:]]+)?)?( ehlo=[[:digit:]]+(/[[:digit:]]+)?)?( starttls=[[:digit:]]+(/[[:digit:]]+)?)?( auth=[[:digit:]]+(/[[:digit:]]+)?)?( mail=[[:digit:]]+(/[[:digit:]]+)?)?( rcpt=[[:digit:]]+(/[[:digit:]]+)?)?( data=[[:digit:]]+(/[[:digit:]]+)?)?( bdat=[[:digit:]]+(/[[:digit:]]+)?)?( rset=[[:digit:]]+(/[[:digit:]]+)?)?( noop=[[:digit:]]+(/[[:digit:]]+)?)?( quit=[[:digit:]]+(/[[:digit:]]+)?)?( unknown=[[:digit:]]+(/[[:digit:]]+)?)?( commands=[[:digit:]]+(/[[:digit:]]+)?)?$

Which is produced by this m4:

define(`DAEMONPID', `$1\[DIGITS\]:')dnl
define(`DATE', `\w{3} [ :[:digit:]]{11}')dnl
define(`DIGIT', `[[:digit:]]')dnl
define(`DIGITS', `DIGIT+')dnl
define(`HOST', `[._[:alnum:]-]+')dnl
define(`HOSTIP', `HOST\[IP\]')dnl
define(`IP', `[.:[:xdigit:]]+')dnl
define(`VERB', `( $1=DIGITS`'(/DIGITS)?)?')dnl
^DATE HOST DAEMONPID(`postfix(/smtps)?/smtpd') disconnect from HOSTIP`'VERB(`helo')VERB(`ehlo')VERB(`starttls')VERB(`auth')VERB(`mail')VERB(`rcpt')VERB(`data')VERB(`bdat')VERB(`rset')VERB(`noop')VERB(`quit')VERB(`unknown')VERB(`commands')$

I only consider myself to be an /adequate/ m4 user.  Though I've done
some things that are arguably creating new languages.

I personally find the generated regular expression to be onerous to
read and understand, much less modify.  I would be highly dependent on
my editor's (vim's) parenthesis / square bracket matching (%)
capability and / or would need to explode the RE into multiple
components on multiple lines to have a hope of accurately understanding
or modifying it.

Conversely I think that the m4 is /largely/ find and replace with a
little syntactic sugar around the definitions.
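
For readers without m4 to hand, the same "named fragments plus find and replace" idea can be sketched in Python. This is only an illustration under stated assumptions, not the m4 above: the fragment names are carried over from the macros, the POSIX bracket classes are swapped for ASCII ranges because Python's re module does not support [[:digit:]] and friends, and the sample log line is invented.

```python
import re

# Fragments named after the m4 macros.  Assumption: ASCII ranges stand in
# for the POSIX classes, which Python's re module does not understand.
DIGITS = r"[0-9]+"
DATE = r"\w{3} [ :0-9]{11}"
HOST = r"[._A-Za-z0-9-]+"
IP = r"[.:0-9A-Fa-f]+"
HOSTIP = HOST + r"\[" + IP + r"\]"


def daemonpid(name):
    """Daemon name followed by '[pid]:', like DAEMONPID(...)."""
    return name + r"\[" + DIGITS + r"\]:"


def verb(name):
    """Optional ' name=N' or ' name=N/M' counter, like VERB(...)."""
    return "( " + name + "=" + DIGITS + "(/" + DIGITS + ")?)?"


VERBS = ["helo", "ehlo", "starttls", "auth", "mail", "rcpt", "data",
         "bdat", "rset", "noop", "quit", "unknown", "commands"]

PATTERN = (
    "^" + DATE + " " + HOST + " "
    + daemonpid(r"postfix(/smtps)?/smtpd")
    + " disconnect from " + HOSTIP
    + "".join(verb(v) for v in VERBS)
    + "$"
)
DISCONNECT = re.compile(PATTERN)

# Hypothetical log line, made up for this sketch.
SAMPLE = ("Mar  7 11:31:55 mail.example.net postfix/smtpd[12345]: "
          "disconnect from client.example.org[192.0.2.10] "
          "ehlo=1 mail=1 rcpt=0/1 quit=1 commands=3/4")

if __name__ == "__main__":
    print(bool(DISCONNECT.match(SAMPLE)))
```

A typo in verb() would show up in all thirteen counters at once, and be fixed in all of them at once, which is the maintenance argument being made for the constructors.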
I also think that anyone that does understand regular expressions and the concept of find & replace is likely to be able to both recognize patterns -- as in "VERB(...)" corresponds to "( $1=DIGITS`'(/DIGITS)?)?", that "DIGITS" corresponds to "DIGIT+", and that "DIGIT" corresponds to "[[:digit:]]". There seems to be a point between simple REs w/o any supporting constructor and complex REs with supporting constructor where I think it is better to have the constructors. Especially when duplication comes into play. If nothing else, the constructors are likely to reduce one-off typo errors. The typo will either be everywhere the constructor was used, or similarly be fixed everywhere at the same time. Conversely, finding an unmatched parenthesis or square bracket in the RE above will be annoying at best if not likely to be more daunting. > Each time the original language was readable because practitioners > had to read and write it. When its replacement came along, the old > skill was no longer learnt and the language became ‘unreadable’. I feel like there is an analogy between machine code and assembly language as well as assembly language and higher level languages. My understanding is that the computer industry has vastly agreed that the higher level language is easier to understand and maintain. > ‘{1}’ is redundant. That may very well be. But what will be more maintainable / easier to correct in the future; adding `{2}` when necessary or changing the value of `1` to `2`? I think this is an example of tradeoff of not strictly required to make something more maintainable down the road. Sort of like fleet vehicles vs non-fleet vehicles. > BTW, ‘{0,1}’ is more readable to those who know regexps as ‘?’. I think this is another example of the maintainability. > I'm sending this to just the list. I'm also replying to only the COFF mailing list. > Perhaps your account on the list is configured to not send you an > email if it sees your address in the header's fields. 
There is a reasonable chance that the COFF mailing list and / or your account therein is configured to minimize duplicates meaning the COFF mailing list won't send you a copy if it sees your subscribed address as receiving a copy directly. I personally always prefer the mailing list copy and shun the direct copies. I think that the copy from the mailing list keeps the discussion on the mailing list and avoids accidental replies bypassing the mailing list. -- Grant. . . . unix || die -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 4017 bytes Desc: S/MIME Cryptographic Signature URL: From crossd at gmail.com Wed Mar 8 04:33:00 2023 From: crossd at gmail.com (Dan Cross) Date: Tue, 7 Mar 2023 13:33:00 -0500 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. In-Reply-To: <20230307173451.D94B421C9B@orac.inputplus.co.uk> References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> <20230307014311.GN5398@mcvoy.com> <20230307173451.D94B421C9B@orac.inputplus.co.uk> Message-ID: On Tue, Mar 7, 2023 at 12:34 PM Ralph Corderoy wrote: > > I'm sure that with enough effort, it _is_ possible to make a 50K RE > > that is understandable to mere mortals. But it begs the question: why > > bother? > > It could be the quickest way to express the intent. Ok, I challenge you to find me anything for which the quickest way to express the intent is a 50 *thousand* symbol regular expression. :-) > > The answer to that question, in my mind, shows the difference between > > a clever programmer and a pragmatic engineer. > > I think those two can overlap. :-) Indeed they can. But I have grave doubts that this is a good example of such overlap. > > I submit that it's time to reach for another tool well before you get > > to an RE that big > > Why, if the grammar is type three in Chomsky's hierarchy, i.e. a regular > grammar? 
I think sticking with code aimed at regular grammars, or more > likely regexps, will do better than, say, a parser generator for a > type-two context-free grammar. Is there an extant, non-theoretical, example of such a grammar? > As well as the lex(1) family, there's Ragel as another example. > http://www.colm.net/open-source/ragel/ This is moving the goal posts more than a bit. I'm suggesting that a 50k-symbol RE is unlikely to be the best solution to any reasonable problem. A state-machine generator, even one with 50k statements, is not a 50k RE. > > 3. Optimizing regular expressions. You're still bound by the known > > theoretical properties of finite automata here. > > Well, we're back to the RE v. regexp again. /^[0-9]+\.jpeg$/ is matched > by some engines by first checking the last five bytes are ‘.jpeg’. ...in general, in order to find the end, won't _something_ have to traverse the entire input? (Note that I said, "in general". Allusions to mmap'ed files or seeking to the end of a file are not general, since they don't apply well to important categories of input sources, such as pipes or network connections.) > $ debugperl -Dr -e \ > > '"123546789012354678901235467890123546789012.jpg" =~ /^[0-9]+\.jpeg$/' > ... > Matching REx "^[0-9]+\.jpeg$" against "123546789012354678901235467890123546789012.jpg" > Intuit: trying to determine minimum start position... > doing 'check' fbm scan, [1..46] gave -1 > Did not find floating substr ".jpeg"$... > Match rejected by optimizer > Freeing REx: "^[0-9]+\.jpeg$" > $ > > Boyer-Moore string searching can be used. Common-subregexp-elimination > can spot repetitive fragment of regexp and factor them into a single set > of states along with pairing the route into them with the appropriate > route out. Well, that's what big-O notation accounts for. I'm afraid none of this really changes the time bounds, however, when applied in general. 
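
As a small illustration of the kind of heuristic under discussion, here is a minimal Python sketch of a literal-suffix pre-check in front of a regexp. It is not Perl's optimizer and not how any particular engine is implemented; the function name is made up, and fullmatch is used so that the pre-check and the regexp agree on what "ends with .jpeg" means.

```python
import re

# Same shape as /^[0-9]+\.jpeg$/, expressed with fullmatch so the literal
# suffix test below is an exact necessary condition for a match.
JPEG_NAME = re.compile(r"[0-9]+\.jpeg")


def is_numbered_jpeg(name: str) -> bool:
    """Hypothetical helper: cheap literal check first, regexp second."""
    # Heuristic: any match must end in ".jpeg", so reject early when the
    # suffix is absent.  This prunes work on this pattern only; it does
    # not change the worst-case bounds for regexps in general.
    if not name.endswith(".jpeg"):
        return False
    return JPEG_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    print(is_numbered_jpeg("123546789012354678901235467890123546789012.jpg"))
    print(is_numbered_jpeg("42.jpeg"))
```

The early rejection helps only because this particular pattern has a fixed literal tail; a pattern without one gets no benefit, which is the sense in which such tricks are heuristics rather than general improvements.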
> The more regexp engines are optimised, the more benefit to the > programmer from sticking to a regexp rather than, say, ad hoc parsing. This is comparing apples and oranges. There may be all sorts of heuristics that we can apply to specific regular expressions to prune the search space, and that's great. But by their very nature, heuristics are not always generally applicable. As an analogy, we know that we cannot solve _the_ halting problem, but we also know that we can solve _many_ halting problem_s_. For example, a compiler can recognize that any of, `for(;;);` or `while(1);` or `loop {}` do not halt, and so on, ad nauseum, but even if some oracle can recognize arbitrarily many such halting problems, we still haven't solved the general problem. > The theory of REs is interesting and important, but regexps deviate from > it ever more. Yup. My post should not be construed as suggesting that regexps are not useful, or that they should not be a part of a programmer's toolkit. My post _should_ be construed as a suggestion that they are not always the best solution, and a part of being an engineer is finding that dividing line. - Dan C. From rtomek at ceti.pl Wed Mar 8 07:49:14 2023 From: rtomek at ceti.pl (Tomasz Rola) Date: Tue, 7 Mar 2023 22:49:14 +0100 Subject: [COFF] Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.) In-Reply-To: <20230304101533.D9CCF2021A@orac.inputplus.co.uk> References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> <20230303105928.E88AB215AA@orac.inputplus.co.uk> <20230303134215.3ED63215AA@orac.inputplus.co.uk> <21e8477c-c388-7b90-ed10-21c7f76f0892@spamtrap.tnetconsulting.net> <20230304101533.D9CCF2021A@orac.inputplus.co.uk> Message-ID: On Sat, Mar 04, 2023 at 10:15:33AM +0000, Ralph Corderoy wrote: > Hi, > > Grant wrote: > > Even inventorying and keeping track of the books can be time > > consuming. 
-- Thankfully I took some time to do exactly that and > > have access to that information on the super computer in my pocket. > > I seek recommendations for an Android app to comfortably read PDFs on a > mobile phone's screen. They were intended to be printed as a book. In > particular, once I've zoomed and panned to get the interesting part of a > page as large as possible, swiping between pages should persist that > view. An extra point for allowing odd and even pages to use different > panning. My own recommendation for this is to get a dedicated ebook reader. It will feel a bit clumsy to have both a cretinphone and another thing with you, but at least the thing is doing the job. At least, mine keeps cropping across pages. Also, the e-ink/epaper display of ebook reader is not supposed to screw your eyes and/or circadian rhythms (not that I know anything specific, but I find it very strange that people shine blue light into their eyes for extended periods of time and do not even quietly protest - well, perhaps it is akin to what goes between human and a dog, they become alike to each other, now, when a human has cretinphone...). Or, if it just one pdf to read, then you should be fine reading it on bigger screen. HTH -- Regards, Tomasz Rola -- ** A C programmer asked whether computer had Buddha's nature. ** ** As the answer, master did "rm -rif" on the programmer's home ** ** directory. And then the C programmer became enlightened... ** ** ** ** Tomasz Rola mailto:tomasz_rola at bigfoot.com ** From rtomek at ceti.pl Wed Mar 8 08:46:04 2023 From: rtomek at ceti.pl (Tomasz Rola) Date: Tue, 7 Mar 2023 23:46:04 +0100 Subject: [COFF] Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.) 
In-Reply-To: References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> <20230303105928.E88AB215AA@orac.inputplus.co.uk> <20230303134215.3ED63215AA@orac.inputplus.co.uk> <21e8477c-c388-7b90-ed10-21c7f76f0892@spamtrap.tnetconsulting.net> <20230304101533.D9CCF2021A@orac.inputplus.co.uk> Message-ID: On Tue, Mar 07, 2023 at 10:49:14PM +0100, Tomasz Rola wrote: [...] > people shine blue light into their eyes for extended periods of time > and do not even quietly protest - well, perhaps it is akin to what > goes between human and a dog, they become alike to each other, now, > when a human has cretinphone...). To answer unasked question, I own a cretinphone too :-). And few dumbs. Together, they sum up to something like cretinphone on steroids. -- Regards, Tomasz Rola -- ** A C programmer asked whether computer had Buddha's nature. ** ** As the answer, master did "rm -rif" on the programmer's home ** ** directory. And then the C programmer became enlightened... ** ** ** ** Tomasz Rola mailto:tomasz_rola at bigfoot.com ** From egbegb2 at gmail.com Wed Mar 8 21:22:56 2023 From: egbegb2 at gmail.com (Ed Bradford) Date: Wed, 8 Mar 2023 05:22:56 -0600 Subject: [COFF] Requesting thoughts on extended regular expressions in grep. In-Reply-To: <20230307113949.501602135B@orac.inputplus.co.uk> References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> <20230307014311.GN5398@mcvoy.com> <20230307113949.501602135B@orac.inputplus.co.uk> Message-ID: Thank you for the very useful comments. However, I disagree with you about the RE language. While I agree all RE experts don't need that, when I was hiring and gave some software to a new hire (whether an experienced programmer or a recent college grad) simply handing over huge RE's to my new hire was a daunting task to that person. I wrote that stuff that way to help remind me and anyone who might use the python program. I don't claim success. It does help me. 
When you say '{1}' is redundant, I think I did that
to avoid any possibility of conflicts with the next
string that is concatenated to the *Y_* (e.g. '*' or '+' or '{4,7}').
I am embarrassed I did not communicate that in the code.
I had to think about it for a couple of hours before I recalled
the "why". I will fix that.

(it would be difficult to discuss this RE if I had to write

    "(19\d\d|20[01]\d|202" + "[0-" + lastYearRE + "]" + ")"

rather than just *Y_*).

My initial thoughts on naming were I wanted the definition to be
defined in exactly one place in the software. Python and the BTL folks
told me to never use a constant in code. Always name it. Hence, I gave
it a name. Each name might be used in multiple places. They might be
imported.

You are correct, the expression is unbalanced. I tried to remove the
text2bytes(lastYearRE*)* call so the expression in this email was all
text. I failed to remove the trailing *)* when I removed the call to
text2bytes(). My hasty transcriptions might have produced similar
errors in my email.

Recall, my focus was on any file of any size. I'm on Windows 10 and an
m1 MacBook. Python works on both. I don't have a Linux machine or
enough desktop space to host one. I'm also mildly fed-up with virtual
machines.

Friedl taught me one thing. Most RE implementations are different. I'm
trying to write a program that I could give to anyone and could
reliably find a date (an RE) in any file. YYYY, MM, DD, HR, MI, SE, TH
are words my user could use in the command line or in an options
dialog. LAT and LON might also be possibilities. CST, EST, MST, PST,
... also. A 500 gigabyte archive or directory/folder of pictures and
movies would be a great test target.

I very much appreciate your comments. If this discussion is boring to
others, I would be happy to take it to emails.

I like your program. My experience with RE, grep, python, and sed
suggests that anything but gnu grep and sed might not work due to the
different implementations.
I've been out of the Unix software business for 30 years after starting work at BTL in the 1970s and working on Version 6. I didn't know "printf" was now built into bash! That was a surprise. It's an incremental improvement, but doesn't compare with f-strings in python. *The interactive interpreter for python should have* *a "bash" mode?!* Does grep use a memory mapped file for its search, thereby avoiding all buffering boundaries? That too, would be new information to me. The additional complexity of dealing with buffering is more than annoying. Do you have any thoughts on how to verify a program that uses RE's. I've given no thought until now. My first thought for dates would be to write a separate module that simply searched through the file looking for 4 numbers in a row without using RE's, recording the offsets and 16 characters after and 1 character before in a python list of (offset,str) of tuples, ddddList, and using *dddd**List* as a proxy for the entire file. I could then aim my RE's at *ddddList*. *[A list of tuples in python* *is wonderful! !]* It seems to me '*' and '+' and {x,y} are the performance hogs in RE's. My RE's avoid them. One pass, I think, should suffice. What do you think? I haven't "archived" my 350 GB of pictures and movies, but one pass over all files therein ought to suffice, right? Two different programs that use different algorithms should be pretty good proof of correctness wouldn't you think? My RE's have no stars or pluses. If there is a mismatch before a match, give up and move on. On my Windows 10 machine, I have cygwin. Microsoft says my CPU doesn't have a TPM and the specific Intel Core I7 on my system is not supported so Windows 11 is not happening. Microsoft is DOS personified. (An unkind editorial remark about the low quality of software coming from Microsoft.) Anyway, I thank you again for your patience with me and your observations. I value your views and the other views I've seen here on coff at tuhs.org. 
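
The cross-check idea described earlier in this message (a separate pass that looks for four digits in a row without any RE, records offsets with a little context, and is then compared against what the RE finds) might be sketched roughly as follows. This is a guess at the intent, not the actual program: the function names, the choice to record every overlapping run of four digits, and the exact context window (1 byte before, the 4 digits, 16 bytes after) are all assumptions.

```python
import re


def find_dddd(data: bytes, before: int = 1, after: int = 16):
    """Return (offset, context) tuples wherever four ASCII digits start.

    No regular expressions are used here, so this pass can act as an
    independent check on an RE-based search.
    """
    hits = []
    run = 0  # length of the current run of consecutive digits
    for i, byte in enumerate(data):
        run = run + 1 if 0x30 <= byte <= 0x39 else 0
        if run >= 4:
            start = i - 3
            context = data[max(0, start - before):start + 4 + after]
            hits.append((start, context))
    return hits


# Year part only, borrowed from the grep example quoted in this thread.
YEAR = re.compile(rb"19[0-9][0-9]|20[01][0-9]|202[0-3]")


def unexplained(data: bytes):
    """Offsets the RE reports that the plain scan never saw (ideally none)."""
    seen = {offset for offset, _ in find_dddd(data)}
    return [m.start() for m in YEAR.finditer(data) if m.start() not in seen]


if __name__ == "__main__":
    sample = b"x2001-04-01,1999_12_31\n"
    for offset, context in find_dddd(sample):
        print(offset, context)
    print(unexplained(sample))
```

An empty list from unexplained() only says the two methods agree on this input; it is evidence of correctness rather than proof, and reading a very large file in pieces would still need care at the buffer boundaries.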
I welcome all input to my education and will share all I have done so far with anyone who wants to collaborate, test, or is just curious. GOAL: run python program from an at-cost thumb drive that: reaps all media files from a user specified directory/folder tree and Adds files to the thumb drive. *Adds files* means Original file system is untouched Adds only unique files (hash codes are unique) Creates on the thumb drive a relative directory wherein the original file was found Prepends a "YYYY-MM-DD-" string to the filename if one can be found (EXIF is great shortcut). Copies srcroot/relative_path/oldfilename to thumbdrive/relative_path/YYYY-MM-DD-oldfilename or thumbdrive/relative_path/0000-oldfilename. Can also incrementally add new files by just scanning anywhere in any other computer file system or any other computer. Must work on Mac, Windows, and Linux What I have is a working prototype. It works on Mac and Windows. It doesn't do the date thing very well, and there are other shortcomings. I have delivered exactly one Christmas present to my favorite person in the world - a 400 GB SSD drive with all our pictures and media we have ever taken. The next things are to *add *more media and *re-unique-ify* (check) what is already present on the SSD drive and *improve the proper choice of "YYYY-MM-DD-" prefix* to filenames. I am retired and this is fun. I'm too old to want to get rich. Ed Bradford Pflugerville, TX egbegb2 at gmail.com On Tue, Mar 7, 2023 at 5:40 AM Ralph Corderoy wrote: > Hi Ed, > > > I have made an attempt to make my RE stuff readable and supportable. > > Readable to you, which is fine because you're the prime future reader. > But it's less readable than the regexp to those that know and read them > because of the indirection introduced by the variables. You've created > your own little language of CAPITALS rather than the lingua franca of > regexps. :-) > > > Machine language was unreadable and then along came assembly language. 
> > Assembly language was unreadable, then came higher level languages. > > Each time the original language was readable because practitioners had > to read and write it. When its replacement came along, the old skill > was no longer learnt and the language became ‘unreadable’. > > > So far, I can do that for this RE program that works for small files, > > large files, binary files and text files for exactly one pattern: > > YYYY[-MM-DD] > > I constructed this RE with code like this: > > # ymdt is YYYY-MM-DD RE in text. > > # looking only for 1900s and 2000s years and no later than today. > > _YYYY = "(19\d\d|20[01]\d|202" + "[0-" + lastYearRE) + "]" + "){1}" > > ‘{1}’ is redundant. > > > # months > > _MM = "(0[1-9]|1[012])" > > # days > > _DD = "(0[1-9]|[12]\d|3[01])" > > ymdt = _YYYY + '[' + _INTERNALSEP + > > _MM + > > _INTERNALSEP + > > ']'{0,1) > > I think we're missing something as the ‘'['’ is starting a character > class which is odd for wrapping the month and the ‘{0,1)’ doesn't have > matching brackets and is outside the string. > > BTW, ‘{0,1}’ is more readable to those who know regexps as ‘?’. > > > For the whole file, RE I used > > ymdthf = _FRSEP + ymdt + _BASEP > > where FRSEP is front separator which includes > > a bunch of possible separators, excluding numbers and letters, or-ed > > with the up arrow "beginning of line" RE mark. > > It sounds like you're wanting a word boundary; something provided by > regexps. In Python, it's ‘\b’. > > >>> re.search(r'\bfoo\b', 'endfoo foostart foo ends'), > (,) > > Are you aware of the /x modifier to a regexp which ignores internal > whitespace, including linefeeds? This allows a large regexp to be split > over lines. There's a comment syntax too. See > https://docs.python.org/3/library/re.html#re.X > > GNU grep isn't too shabby at looking through binary files. I can't use > /x with grep so in a bash script, I'd do it manually. \< and \> match > the start and end of a word, a bit like Python's \b. 
> > re=' > .?\< > (19[0-9][0-9]|20[01][0-9]|202[0-3]) > ( > ([-:._]) > (0[1-9]|1[0-2]) > \3 > (0[1-9]|[12][0-9]|3[01]) > )? > \>.? > ' > re=${re//$'\n'/} > re=${re// /} > > printf '%s\n' 2001-04-01,1999_12_31 1944.03.01,1914! 2000-01.01 > >big-binary-file > LC_ALL=C grep -Eboa "$re" big-binary-file | sed -n l > > which gives > > 0:2001-04-01,$ > 11:1999_12_31$ > 22:1944.03.01,$ > 33:1914!$ > 39:2000-$ > > showing: > > - the byte offset within the file of each match, > - along with the any before and after byte if it's not a \n and not > already matched, just to show the word-boundary at work, > - with any non-printables escaped into octal by sed. > > > I thought I was on the COFF mailing list. > > I'm sending this to just the list. > > > I received this email by direct mail to from Larry. > > Perhaps your account on the list is configured to not send you an email > if it sees your address in the header's fields. > > -- > Cheers, Ralph. > -- Advice is judged by results, not by intentions. Cicero -------------- next part -------------- An HTML attachment was scrubbed... URL: From crossd at gmail.com Thu Mar 9 05:52:43 2023 From: crossd at gmail.com (Dan Cross) Date: Wed, 8 Mar 2023 14:52:43 -0500 Subject: [COFF] [TUHS] the wheel of reincarnation goes sideways In-Reply-To: References: Message-ID: [bumping to COFF] On Wed, Mar 8, 2023 at 2:05 PM ron minnich wrote: > The wheel of reincarnation discussion got me to thinking: > > What I'm seeing is reversing the rotation of the wheel of reincarnation. Instead of pulling the task (e.g. graphics) from a special purpose device back into the general purpose domain, the general purpose computing domain is pushed into the special purpose device. > > I first saw this almost 10 years ago with a WLAN modem chip that ran linux on its 4 core cpu, all of it in a tiny package. It was faster, better, and cheaper than its traditional embedded predecessor -- because the software stack was less dedicated and single-company-created. 
Take Linux, add some stuff, voila! WLAN modem. > > Now I'm seeing it in peripheral devices that have, not one, but several independent SoCs, all running Linux, on one card. There's even been a recent remote code exploit on, ... an LCD panel. > > Any of these little devices, with the better part of a 1G flash and a large part of 1G DRAM, dwarfs anything Unix ever ran on. And there are more and more of them, all over the little PCB in a laptop. > > The evolution of platforms like laptops to becoming full distributed systems continues. > The wheel of reincarnation spins counter clockwise -- or sideways? About a year ago, I ran across an email written a decade or more prior on some mainframe mailing list where someone wrote something like, "wow! It just occurred to me that my Athlon machine is faster than the ES/3090-600J I used in 1989!" Some guy responded angrily, rising to the wounded honor of IBM, raving about how preposterous this was because the mainframe could handle a thousand users logged in at one time and there's no way this Linux box could ever do that. I was struck by the absurdity of that; it's such a ridiculous non-comparison. The mainframe had layers of terminal concentrators, 3270 controllers, IO controllers, etc, etc, and a software ecosystem that made heavy use of all of that, all to keep user interaction _off_ of the actual CPU (I guess freeing that up to run COBOL programs in batch mode...); it's not as though every time a mainframe user typed something into a form on their terminal it interrupted the primary CPU. Of course, the first guy was right: the AMD machine probably _was_ more capable than a 3090 in terms of CPU performance, RAM and storage capacity, and raw bandwidth between the CPU and IO subsystems. But the 3090 was really more like a distributed system than the Athlon box was, with all sorts of offload capabilities. For that matter, a thousand users probably _could_ telnet into the Athlon system. 
With telnet in line mode, it'd probably even be decently responsive.

So often it seems to me like end-user systems are just continuing to adopt "large system" techniques. Nothing new under the sun.

> I'm no longer sure the whole idea of the wheel of reincarnation is even applicable.

I often feel like the wheel has fallen onto its side, and we're continually picking it up from the edge and flipping it over, ad nauseam.

        - Dan C.

From coff at tuhs.org Thu Mar 9 06:18:42 2023
From: coff at tuhs.org (Tom Ivar Helbekkmo via COFF)
Date: Wed, 08 Mar 2023 21:18:42 +0100
Subject: [COFF] the wheel of reincarnation goes sideways
In-Reply-To: (Dan Cross's message of "Wed, 8 Mar 2023 14:52:43 -0500")
References:
Message-ID:

Dan Cross writes:

> About a year ago, I ran across an email written a decade or more prior
> on some mainframe mailing list where someone wrote something like,
> "wow! It just occurred to me that my Athlon machine is faster than the
> ES/3090-600J I used in 1989!" Some guy responded angrily, rising to
> the wounded honor of IBM, raving about how preposterous this was
> because the mainframe could handle a thousand users logged in at one
> time and there's no way this Linux box could ever do that.
>
> I was struck by the absurdity of that; it's such a ridiculous
> non-comparison.

I did one of those.

Back in the early nineties, I had a 286 box running MINIX 1.5 as my home workstation, and a similar one running DOS at work. My job, however, was as one of a team of sysadmins caring for a VAX-780 running VMS. I used C-TeX to format documents on the DOS PC, and spent a couple of days porting it to the VMS C compiler. Performance was utterly dismal at first, but once I realized that the stdio stuff in the standard library was the problem, I modified C-TeX to do output to binary files of fixed-size 512-byte blocks in RMS, the VMS file system.
In the small hours of the night, I discovered that the big and expensive VAX-780 was able to pretty much exactly match my 286-box when formatting documents. The very next day, I found that the same machine did the TeX formatting just as fast, while a hundred or so other people were actively using it for their own work. -tih -- Most people who graduate with CS degrees don't understand the significance of Lisp. Lisp is the most important idea in computer science. --Alan Kay From cowan at ccil.org Thu Mar 9 11:22:39 2023 From: cowan at ccil.org (John Cowan) Date: Wed, 8 Mar 2023 20:22:39 -0500 Subject: [COFF] [TUHS] Re: the wheel of reincarnation goes sideways In-Reply-To: References: Message-ID: On Wed, Mar 8, 2023 at 2:53 PM Dan Cross wrote: > > Now I'm seeing it in peripheral devices that have, not one, but several > independent SoCs, all running Linux, on one card. There's even been a > recent remote code exploit on, ... an LCD panel. > I remember at one time I had on my desk a PC with an 80x86 CPU and an Ethernet card that had an 80(x+1)86 chip inside. I think x=0, but I'm not sure. > But the > 3090 was really more like a distributed system than the Athlon box > was, with all sorts of offload capabilities. For that matter, a > thousand users probably _could_ telnet into the Athlon system. With > telnet in line mode, it'd probably even be decently responsive. > I find that difficult to believe. It seems too high by an order of magnitude. Another thing that doesn't get mentioned much is that classic mainframes had SRAM, so their memory bandwidth was enormous. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From crossd at gmail.com Fri Mar 10 05:55:44 2023 From: crossd at gmail.com (Dan Cross) Date: Thu, 9 Mar 2023 14:55:44 -0500 Subject: [COFF] [TUHS] Re: the wheel of reincarnation goes sideways In-Reply-To: References: Message-ID: On Wed, Mar 8, 2023 at 8:22 PM John Cowan wrote: > On Wed, Mar 8, 2023 at 2:53 PM Dan Cross wrote: >> But the >> 3090 was really more like a distributed system than the Athlon box >> was, with all sorts of offload capabilities. For that matter, a >> thousand users probably _could_ telnet into the Athlon system. With >> telnet in line mode, it'd probably even be decently responsive. > > I find that difficult to believe. It seems too high by an order of magnitude. I'm not going to claim it would be zippy, but I do think it would work acceptably. Suppose that 1000 users telnet'ed into the x86 machine, but remained essentially idle; what resources would that consume? We'd have 1000 open TCP connections, a thousand shell processes, a thousand telnetd's, etc. All of that would consume some amount of RAM (though there'd be a lot of sharing of text and read-only data and so on), some VM space requiring RAM for paging structures and so on, some accounting data in the kernel, 1000 pseudo-ttys allocated, entries in the process table, etc. But, most of those shells would spend most of their time blocked waiting on input, so wouldn't consume CPU continuously, and similarly with the TCP connections mostly idle, the kernel is not generally wasting a lot of processor time on the login sessions. There'd be some bookkeeping data on disk, but that would be small. System overhead would amount to maybe a few megabytes, I'd imagine. If all of those users ran telnet in line mode, then the system isn't getting pounded with interrupts all the time, even if they're executing commands (the per-character overhead would be absorbed by the client). 
I don't think I have a machine of quite the Athlon vintage, but I _do_ have a machine with a Ryzen processor that's a couple of years old down in my basement. As an experiment, I wrote a little "expect" script to login to that machine a thousand times, doing so recursively: that is, the script starts off ssh'ing into the machine, and then in that session, logs in again, and so on, a thousand times, before finally going interactive. I used encryption, public-key authentication, and compression, and bounced through a "jump host" for each session, ensuring that I'm using the network for each login. The effect here is that typing into the final shell sort of simulates 1000 users typing simultaneously, complete with all the glorious interrupt and scheduler overhead that implies. Response time in that connection is not bad; certainly on par with the 3090 I used for a while in the early 90s. If I login in another window, it doesn't even register that there are a thousand "users" logged in, even if I'm running something chatty in the "thousand users" window. By contrast, the mainframe required a tremendous amount of offload support to shield the CPU from all of that bursty user activity. They made user actions look like block transfers, thus amortizing (much) of the overhead of interactivity. With the same load, the mainframe is storing some state data in memory regarding which users are logged in or connected or dialed or whatever, but the situation isn't that much different than mostly-idle telnet connections in line-mode: save that it's even more favorable to the mainframe in that much of the interaction is per-screen of data, as opposed to per-line. The difference in interactivity and offload is why I think the comparison is poor. If the mainframe handled user sessions the same way the x86 machine handled telnet logins, I imagine it would be swamped way worse than the AMD machine (or whatever it was that person was writing about 10 or 15 years ago). 
Perhaps a better comparison would be to a web server that was accepting HTTP requests from 1000 different clients. I'm quite sure that x86 machines of the Athlon era could cope with that load. > Another thing that doesn't get mentioned much is that classic mainframes had SRAM, so their memory bandwidth was enormous. I suspect this has less of a difference than one would hope when comparing against a modern machine. The specific comparison in this case was against an IBM 3090-600J. It appears to use SRAM for cache ("high speed buffer" in IBM-speak), but seems to use DRAM for central and expanded storage. In this reference I found on bitsavers, they make a big deal about their "one million bit memory chip", but that's DRAM (http://www.bitsavers.org/pdf/ibm/3090/G580-1005-0_The_IBM_3090_Processor_Family_Jul87.pdf; see "IBM Advances the Technology" on page 10). Moreover, that machine supported up to 6 CPUs running at a clock rate of 69 MHz. That same reference says they could bring cycle times down to 17.2ns using ECL chips; DDR2 can match that. My Mac Studio blows it out of the water. For systems older than the 3090, I'm not sure that the SRAM difference matters much at all: those machines had tiny memories compared to even modern cell phones, and their CPUs and buses were pitifully slow. Even if they had more RAM bandwidth than machines now (which I do not think is really true), they couldn't use it. Indeed, I suspect their total memory sizes were smaller than L3 cache (which is SRAM) on modern machines. - Dan C. 
From lm at mcvoy.com Fri Mar 10 06:09:32 2023
From: lm at mcvoy.com (Larry McVoy)
Date: Thu, 9 Mar 2023 12:09:32 -0800
Subject: [COFF] [TUHS] Re: the wheel of reincarnation goes sideways
In-Reply-To:
References:
Message-ID: <20230309200932.GK9225@mcvoy.com>

On Thu, Mar 09, 2023 at 02:55:44PM -0500, Dan Cross wrote:
> On Wed, Mar 8, 2023 at 8:22 PM John Cowan wrote:
> > On Wed, Mar 8, 2023 at 2:53 PM Dan Cross wrote:
> >> But the
> >> 3090 was really more like a distributed system than the Athlon box
> >> was, with all sorts of offload capabilities. For that matter, a
> >> thousand users probably _could_ telnet into the Athlon system. With
> >> telnet in line mode, it'd probably even be decently responsive.
> >
> > I find that difficult to believe. It seems too high by an order of magnitude.
>
> I'm not going to claim it would be zippy, but I do think it would work
> acceptably.
>
> Suppose that 1000 users telnet'ed into the x86 machine, but remained
> essentially idle; what resources would that consume? We'd have 1000
> open TCP connections, a thousand shell processes, a thousand
> telnetd's, etc.

The early Unix code really did not like stuff like this. Lots of linear scans through what were assumed to be short lists. I still remember an SGI Challenge being brought to its knees by a bunch of racks of modems. The same machine could move a ton of data but not when it was being forced through a zillion sockets.

Linux seems well past that problem but it's possible that back in the Athlon days it still sucked. I pinged Linus; if he remembers when the kernel got taught to scale on sockets, I'll report back.

--lm

From stewart at serissa.com Sat Mar 11 00:20:48 2023
From: stewart at serissa.com (Larry Stewart)
Date: Fri, 10 Mar 2023 09:20:48 -0500
Subject: [COFF] [TUHS] Re: Conditions, AKA exceptions.
(Was: I can't drive 55: "GOTO considered harmful" 55th anniversary) In-Reply-To: <20230310131512.891A8212A8@orac.inputplus.co.uk> References: <20230310131512.891A8212A8@orac.inputplus.co.uk> Message-ID: <498576F7-6881-4176-B187-F4ACB0A42F76@serissa.com> TLDR exceptions don't make it better, they make it different. The Mesa and Cedar languages at PARC CSL were intended to be "Systems Languages" and fully embraced exceptions. The problem is that it is extremely tempting for the author of a library to use them, and equally tempting for the authors of library calls used by the first library, and so on. At the application level, literally anything can happen on any call. The Cedar OS was a library OS, where applications ran in the same address space, since there was no VM. In 1982 or so I set out to write a shell for it, and was determined that regardless of what happened, the shell should not crash, so I set out to guard every single call with handlers for every exception they could raise. This was an immensely frustrating process because while the language suggested that the author of a library capture exceptions on the way by and translate them to one at the package level, this is a terrible idea in its own way, because you can't debug - the state of the ultimate problem was lost. So no one did this, and at the top level, literally any exception could occur. Another thing that happens with exceptions is that programmers get the bright idea to use them for conditions which are uncommon, but expected, so any user of the function has to write complicated code to deal with these cases. On the whole, I came away with a great deal of grudging respect for ERRNO as striking a great balance between ease of use and specificity. 
I also evolved Larry's Theory of Exceptions, which is that it is the programmer's job to sort exceptional conditions into actionable categories: (1) resolvable by the user (bad arguments) (2) Temporary (out of network sockets or whatever) (3) resolvable by the sysadmin (config) (4) real bug, resolvable by the author.

The usual practice of course is the popup "Received unknown error, OK?"

-Larry

> On Mar 10, 2023, at 8:15 AM, Ralph Corderoy wrote:
>
> Hi Noel,
>
>>> if you say above that most people are unfamiliar with them due to
>>> their use of goto then that's probably wrong
>>
>> I didn't say that.
>
> Thanks for clarifying; I did know it was a possibility.
>
>> I was just astonished that in a long thread about handling exceptional
>> conditions, nobody had mentioned . . . exceptions. Clearly, either
>> unfamiliarity (perhaps because not many languages provide them - as you
>> point out, Go does not), or not top of mind.
>
> Or perhaps those happy to use gotos also tend to be those who dislike
> exceptions. :-)
>
> Anyway, I'm off-TUHS-pic so follow-ups set to goto COFF.
>
> --
> Cheers, Ralph.

From bakul at iitbombay.org Sat Mar 11 03:11:25 2023
From: bakul at iitbombay.org (Bakul Shah)
Date: Fri, 10 Mar 2023 09:11:25 -0800
Subject: [COFF] [TUHS] Re: Conditions, AKA exceptions. (Was: I can't drive 55: "GOTO considered harmful" 55th anniversary)
Message-ID: <445734B5-EFA3-4224-8ABE-81D35A4D9CEC@iitbombay.org>

To make exception handling robust, I think every exception needs to be explicitly handled somewhere. If an exception is not handled by a function, that fact must be specified in the function declaration. In effect the compiler can check that every exception has a handler somewhere. I think you can implement it using different syntactic sugar than Go’s obnoxious error handling but basically the same (though you may be tempted to make it more efficient).
> On Mar 10, 2023, at 6:21 AM, Larry Stewart wrote: > TLDR exceptions don't make it better, they make it different. > > The Mesa and Cedar languages at PARC CSL were intended to be "Systems Languages" and fully embraced exceptions. > > The problem is that it is extremely tempting for the author of a library to use them, and equally tempting for the authors of library calls used by the first library, and so on. > At the application level, literally anything can happen on any call. > > The Cedar OS was a library OS, where applications ran in the same address space, since there was no VM. In 1982 or so I set out to write a shell for it, and was determined that regardless of what happened, the shell should not crash, so I set out to guard every single call with handlers for every exception they could raise. > > This was an immensely frustrating process because while the language suggested that the author of a library capture exceptions on the way by and translate them to one at the package level, this is a terrible idea in its own way, because you can't debug - the state of the ultimate problem was lost. So no one did this, and at the top level, literally any exception could occur. > > Another thing that happens with exceptions is that programmers get the bright idea to use them for conditions which are uncommon, but expected, so any user of the function has to write complicated code to deal with these cases. > > On the whole, I came away with a great deal of grudging respect for ERRNO as striking a great balance between ease of use and specificity. > > I also evolved Larry's Theory of Exceptions, which is that it is the programmer's job to sort exceptional conditions into actionable categories: (1) resolvable by the user (bad arguments) (2) Temporary (out of network sockets or whatever) (3) resolvable by the sysadmin (config) (4) real bug, resolvable by the author. > > The usual practice of course is the popup "Received unknown error, OK?" 
> > -Larry > >> On Mar 10, 2023, at 8:15 AM, Ralph Corderoy wrote: >> >> Hi Noel, >> >>>> if you say above that most people are unfamiliar with them due to >>>> their use of goto then that's probably wrong >>> I didn't say that. >> >> Thanks for clarifying; I did know it was a possibility. >> >>> I was just astonished that in a long thread about handling exceptional >>> conditions, nobody had mentioned . . . exceptions. Clearly, either >>> unfamiliarity (perhaps because not many laguages provide them - as you >>> point out, Go does not), or not top of mind. >> >> Or perhaps those happy to use gotos also tend to be those who dislike >> exceptions. :-) >> >> Anyway, I'm off-TUHS-pic so follow-ups set to goto COFF. >> >> -- >> Cheers, Ralph. -------------- next part -------------- An HTML attachment was scrubbed... URL: From coff at tuhs.org Sat Mar 11 03:28:44 2023 From: coff at tuhs.org (segaloco via COFF) Date: Fri, 10 Mar 2023 17:28:44 +0000 Subject: [COFF] [TUHS] Re: Conditions, AKA exceptions. (Was: I can't drive 55: "GOTO considered harmful" 55th anniversary) In-Reply-To: <445734B5-EFA3-4224-8ABE-81D35A4D9CEC@iitbombay.org> References: <445734B5-EFA3-4224-8ABE-81D35A4D9CEC@iitbombay.org> Message-ID: On the flip side something I've always thought would be powerful, at least in development, is a way to tell any and all procedures being called to ignore their exception/condition handling and hard-crash. Of course you don't want to take that kind of hammer to a production situation, but a way to override any and all handling so that real errors become apparent would be incredibly nice. If nothing else, I could provide much better stack traces to vendors when I'm particularly stuck on something and convinced it isn't my fault. Maybe such a thing exists in C# but I've never gone looking for it, all I know is catching an exception from some vendor library with zero useful information makes me want to take a hammer to much more than the code... - Matt G. 
------- Original Message ------- On Friday, March 10th, 2023 at 9:11 AM, Bakul Shah wrote: > To make exceptional handling robust, I think every exception needs to be explicitly handled somewhere. If an exception not handled by a function, that fact must be specified in the function declaration. In effect the compiler can check that every exception has a handler somewhere. I think you can implement it using different syntactic sugar than Go’s obnoxious error handling but basically the same (though you may be tempted to make more efficient). > >> On Mar 10, 2023, at 6:21 AM, Larry Stewart wrote: > >> TLDR exceptions don't make it better, they make it different. >> >> The Mesa and Cedar languages at PARC CSL were intended to be "Systems Languages" and fully embraced exceptions. >> >> The problem is that it is extremely tempting for the author of a library to use them, and equally tempting for the authors of library calls used by the first library, and so on. >> At the application level, literally anything can happen on any call. >> >> The Cedar OS was a library OS, where applications ran in the same address space, since there was no VM. In 1982 or so I set out to write a shell for it, and was determined that regardless of what happened, the shell should not crash, so I set out to guard every single call with handlers for every exception they could raise. >> >> This was an immensely frustrating process because while the language suggested that the author of a library capture exceptions on the way by and translate them to one at the package level, this is a terrible idea in its own way, because you can't debug - the state of the ultimate problem was lost. So no one did this, and at the top level, literally any exception could occur. >> >> Another thing that happens with exceptions is that programmers get the bright idea to use them for conditions which are uncommon, but expected, so any user of the function has to write complicated code to deal with these cases. 
>> >> On the whole, I came away with a great deal of grudging respect for ERRNO as striking a great balance between ease of use and specificity. >> >> I also evolved Larry's Theory of Exceptions, which is that it is the programmer's job to sort exceptional conditions into actionable categories: (1) resolvable by the user (bad arguments) (2) Temporary (out of network sockets or whatever) (3) resolvable by the sysadmin (config) (4) real bug, resolvable by the author. >> >> The usual practice of course is the popup "Received unknown error, OK?" >> >> -Larry >> >>> On Mar 10, 2023, at 8:15 AM, Ralph Corderoy wrote: >> >>> >> >>> Hi Noel, >> >>> >> >>>>> if you say above that most people are unfamiliar with them due to >> >>>>> their use of goto then that's probably wrong >> >>>> >> >>>> I didn't say that. >> >>> >> >>> Thanks for clarifying; I did know it was a possibility. >> >>> >> >>>> I was just astonished that in a long thread about handling exceptional >> >>>> conditions, nobody had mentioned . . . exceptions. Clearly, either >> >>>> unfamiliarity (perhaps because not many laguages provide them - as you >> >>>> point out, Go does not), or not top of mind. >> >>> >> >>> Or perhaps those happy to use gotos also tend to be those who dislike >> >>> exceptions. :-) >> >>> >> >>> Anyway, I'm off-TUHS-pic so follow-ups set to goto COFF. >> >>> >> >>> -- >> >>> Cheers, Ralph. -------------- next part -------------- An HTML attachment was scrubbed... URL: From lm at mcvoy.com Sat Mar 11 03:34:53 2023 From: lm at mcvoy.com (Larry McVoy) Date: Fri, 10 Mar 2023 09:34:53 -0800 Subject: [COFF] [TUHS] Re: Conditions, AKA exceptions. 
(Was: I can't drive 55: "GOTO considered harmful" 55th anniversary) In-Reply-To: References: <445734B5-EFA3-4224-8ABE-81D35A4D9CEC@iitbombay.org> Message-ID: <20230310173453.GA9225@mcvoy.com> On Fri, Mar 10, 2023 at 05:28:44PM +0000, segaloco via COFF wrote: > On the flip side something I've always thought would be powerful, at least in development, is a way to tell any and all procedures being called to ignore their exception/condition handling and hard-crash. Of course you don't want to take that kind of hammer to a production situation, but a way to override any and all handling so that real errors become apparent would be incredibly nice. #include From imp at bsdimp.com Sat Mar 11 03:34:57 2023 From: imp at bsdimp.com (Warner Losh) Date: Fri, 10 Mar 2023 10:34:57 -0700 Subject: [COFF] [TUHS] Re: Conditions, AKA exceptions. (Was: I can't drive 55: "GOTO considered harmful" 55th anniversary) In-Reply-To: <20230310131512.891A8212A8@orac.inputplus.co.uk> References: <20230310121550.9A80718C080@mercury.lcs.mit.edu> <20230310131512.891A8212A8@orac.inputplus.co.uk> Message-ID: On Fri, Mar 10, 2023 at 6:15 AM Ralph Corderoy wrote: > Hi Noel, > > > > if you say above that most people are unfamiliar with them due to > > > their use of goto then that's probably wrong > > > > I didn't say that. > > Thanks for clarifying; I did know it was a possibility. > Exception handling is a great leap sideways. it's a supercharged goto with steroids on top. In some ways more constrained, in other ways more prone to abuse. Example: I diagnosed performance problems in a program that would call into 'waiting' threads that would read data from a pipe and then queue work. Easy, simple, straightforward design. Except they used exceptions to then process the packets rather than having a proper lockless producer / consumer queue. 
Exceptions are great for keeping the code linear and ignoring error conditions logically, but still having them handled "somewhere" above the current code and writing the code such that when it gets an abort, partial work is cleaned up and trashed. Global exception handlers are both good and bad. All errors become tracebacks to where it occurred. People often don't disambiguate between expected and unexpected exceptions, so programming errors get lumped in with remote devices committing protocol errors get lumped in with your config file had a typo and /dve/ttyU2 doesn't exist. It can be hard for the user to know what comes next when it's all jumbled together. In-line error handling, at least, can catch the expected things and give a more reasonable error near to where it happened so I know if my next step is vi prog.conf or email support at prog.com. So it's a hate hate relationship with both. What do I hate the least? That's a three drink minimum for the answer. Warner -------------- next part -------------- An HTML attachment was scrubbed... URL: From bakul at iitbombay.org Sat Mar 11 03:35:44 2023 From: bakul at iitbombay.org (Bakul Shah) Date: Fri, 10 Mar 2023 09:35:44 -0800 Subject: [COFF] [TUHS] Re: Conditions, AKA exceptions. (Was: I can't drive 55: "GOTO considered harmful" 55th anniversary) In-Reply-To: References: Message-ID: <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org> During development the runtime should simply invoke a debugger in this case. This should be perfectly doable but for some reason it is considered acceptable to crash a program! I don’t want to run a program *under a debugger* but want it invoked at the right time! > On Mar 10, 2023, at 9:28 AM, segaloco wrote: > >  > On the flip side something I've always thought would be powerful, at least in development, is a way to tell any and all procedures being called to ignore their exception/condition handling and hard-crash. 
Of course you don't want to take that kind of hammer to a production situation, but a way to override any and all handling so that real errors become apparent would be incredibly nice. > > If nothing else, I could provide much better stack traces to vendors when I'm particularly stuck on something and convinced it isn't my fault. Maybe such a thing exists in C# but I've never gone looking for it, all I know is catching an exception from some vendor library with zero useful information makes me want to take a hammer to much more than the code... > > - Matt G. > ------- Original Message ------- > On Friday, March 10th, 2023 at 9:11 AM, Bakul Shah wrote: > >> To make exceptional handling robust, I think every exception needs to be explicitly handled somewhere. If an exception not handled by a function, that fact must be specified in the function declaration. In effect the compiler can check that every exception has a handler somewhere. I think you can implement it using different syntactic sugar than Go’s obnoxious error handling but basically the same (though you may be tempted to make more efficient). >> >>> On Mar 10, 2023, at 6:21 AM, Larry Stewart wrote: >>> >>> TLDR exceptions don't make it better, they make it different. >>> >>> The Mesa and Cedar languages at PARC CSL were intended to be "Systems Languages" and fully embraced exceptions. >>> >>> The problem is that it is extremely tempting for the author of a library to use them, and equally tempting for the authors of library calls used by the first library, and so on. >>> At the application level, literally anything can happen on any call. >>> >>> The Cedar OS was a library OS, where applications ran in the same address space, since there was no VM. In 1982 or so I set out to write a shell for it, and was determined that regardless of what happened, the shell should not crash, so I set out to guard every single call with handlers for every exception they could raise. 
>>> >>> This was an immensely frustrating process because while the language suggested that the author of a library capture exceptions on the way by and translate them to one at the package level, this is a terrible idea in its own way, because you can't debug - the state of the ultimate problem was lost. So no one did this, and at the top level, literally any exception could occur. >>> >>> Another thing that happens with exceptions is that programmers get the bright idea to use them for conditions which are uncommon, but expected, so any user of the function has to write complicated code to deal with these cases. >>> >>> On the whole, I came away with a great deal of grudging respect for ERRNO as striking a great balance between ease of use and specificity. >>> >>> I also evolved Larry's Theory of Exceptions, which is that it is the programmer's job to sort exceptional conditions into actionable categories: (1) resolvable by the user (bad arguments) (2) Temporary (out of network sockets or whatever) (3) resolvable by the sysadmin (config) (4) real bug, resolvable by the author. >>> >>> The usual practice of course is the popup "Received unknown error, OK?" >>> >>> -Larry >>> >>>> On Mar 10, 2023, at 8:15 AM, Ralph Corderoy wrote: >>>> >>>> Hi Noel, >>>> >>>>>> if you say above that most people are unfamiliar with them due to >>>>>> their use of goto then that's probably wrong >>>>> >>>>> I didn't say that. >>>> >>>> Thanks for clarifying; I did know it was a possibility. >>>> >>>>> I was just astonished that in a long thread about handling exceptional >>>>> conditions, nobody had mentioned . . . exceptions. Clearly, either >>>>> unfamiliarity (perhaps because not many laguages provide them - as you >>>>> point out, Go does not), or not top of mind. >>>> >>>> Or perhaps those happy to use gotos also tend to be those who dislike >>>> exceptions. :-) >>>> >>>> Anyway, I'm off-TUHS-pic so follow-ups set to goto COFF. >>>> >>>> -- >>>> Cheers, Ralph. 
>>>
>

From lm at mcvoy.com Sat Mar 11 03:42:22 2023
From: lm at mcvoy.com (Larry McVoy)
Date: Fri, 10 Mar 2023 09:42:22 -0800
Subject: [COFF] [TUHS] Re: Conditions, AKA exceptions. (Was: I can't drive 55: "GOTO considered harmful" 55th anniversary)
In-Reply-To: <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org>
References: <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org>
Message-ID: <20230310174222.GB9225@mcvoy.com>

On Fri, Mar 10, 2023 at 09:35:44AM -0800, Bakul Shah wrote:
> During development the runtime should simply invoke a debugger in this case. This should be perfectly doable but for some reason it is considered acceptable to crash a program! I don't want to run a program *under a debugger* but want it invoked at the right time!

Indeed.

void
gdb_backtrace(void)
{
	FILE	*f;
	char	*cmd;

	unless (getenv("_BK_BACKTRACE")) return;
	unless ((f = efopen("BK_TTYPRINTF")) || (f = fopen(DEV_TTY, "w"))) {
		f = stderr;
	}
	cmd = aprintf("gdb -batch -ex backtrace '%s/bk' %u 1>&%d 2>&%d",
	    bin, getpid(), fileno(f), fileno(f));
	system(cmd);
	free(cmd);
	if (f != stderr) fclose(f);
}

From coff at tuhs.org Sat Mar 11 03:43:28 2023
From: coff at tuhs.org (segaloco via COFF)
Date: Fri, 10 Mar 2023 17:43:28 +0000
Subject: [COFF] [TUHS] Re: Conditions, AKA exceptions. (Was: I can't drive 55: "GOTO considered harmful" 55th anniversary)
In-Reply-To: <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org>
References: <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org>
Message-ID:

Yeah it's a pain and different in different languages. My horror stories are mainly C# since that's what day job stuff is these days (backend anyway).

The way assert does it is great, one little cpp define and it all goes away. However, that being compile time, it only applies to what is yours; if you're stuck with someone else's object code, you get what you get :/

- Matt G.
------- Original Message ------- On Friday, March 10th, 2023 at 9:35 AM, Bakul Shah wrote: > During development the runtime should simply invoke a debugger in this case. This should be perfectly doable but for some reason it is considered acceptable to crash a program! I don’t want to run a program *under a debugger* but want it invoked at the right time! > >> On Mar 10, 2023, at 9:28 AM, segaloco wrote: > >>  >> On the flip side something I've always thought would be powerful, at least in development, is a way to tell any and all procedures being called to ignore their exception/condition handling and hard-crash. Of course you don't want to take that kind of hammer to a production situation, but a way to override any and all handling so that real errors become apparent would be incredibly nice. >> >> If nothing else, I could provide much better stack traces to vendors when I'm particularly stuck on something and convinced it isn't my fault. Maybe such a thing exists in C# but I've never gone looking for it, all I know is catching an exception from some vendor library with zero useful information makes me want to take a hammer to much more than the code... >> >> - Matt G. >> ------- Original Message ------- >> On Friday, March 10th, 2023 at 9:11 AM, Bakul Shah wrote: >> >>> To make exceptional handling robust, I think every exception needs to be explicitly handled somewhere. If an exception not handled by a function, that fact must be specified in the function declaration. In effect the compiler can check that every exception has a handler somewhere. I think you can implement it using different syntactic sugar than Go’s obnoxious error handling but basically the same (though you may be tempted to make more efficient). >>> >>>> On Mar 10, 2023, at 6:21 AM, Larry Stewart wrote: >>> >>>> TLDR exceptions don't make it better, they make it different. >>>> >>>> The Mesa and Cedar languages at PARC CSL were intended to be "Systems Languages" and fully embraced exceptions. 
>>>>
>>>> The problem is that it is extremely tempting for the author of a library to use them, and equally tempting for the authors of library calls used by the first library, and so on.
>>>> At the application level, literally anything can happen on any call.
>>>>
>>>> The Cedar OS was a library OS, where applications ran in the same address space, since there was no VM. In 1982 or so I set out to write a shell for it, and was determined that regardless of what happened, the shell should not crash, so I set out to guard every single call with handlers for every exception they could raise.
>>>>
>>>> This was an immensely frustrating process because while the language suggested that the author of a library capture exceptions on the way by and translate them to one at the package level, this is a terrible idea in its own way, because you can't debug - the state of the ultimate problem was lost. So no one did this, and at the top level, literally any exception could occur.
>>>>
>>>> Another thing that happens with exceptions is that programmers get the bright idea to use them for conditions which are uncommon, but expected, so any user of the function has to write complicated code to deal with these cases.
>>>>
>>>> On the whole, I came away with a great deal of grudging respect for ERRNO as striking a great balance between ease of use and specificity.
>>>>
>>>> I also evolved Larry's Theory of Exceptions, which is that it is the programmer's job to sort exceptional conditions into actionable categories: (1) resolvable by the user (bad arguments) (2) Temporary (out of network sockets or whatever) (3) resolvable by the sysadmin (config) (4) real bug, resolvable by the author.
>>>>
>>>> The usual practice of course is the popup "Received unknown error, OK?"
>>>>
>>>> -Larry
>>>>
>>>>> On Mar 10, 2023, at 8:15 AM, Ralph Corderoy wrote:
>>>>>
>>>>> Hi Noel,
>>>>>
>>>>>>> if you say above that most people are unfamiliar with them due to
>>>>>>> their use of goto then that's probably wrong
>>>>>>
>>>>>> I didn't say that.
>>>>>
>>>>> Thanks for clarifying; I did know it was a possibility.
>>>>>
>>>>>> I was just astonished that in a long thread about handling exceptional
>>>>>> conditions, nobody had mentioned . . . exceptions. Clearly, either
>>>>>> unfamiliarity (perhaps because not many languages provide them - as you
>>>>>> point out, Go does not), or not top of mind.
>>>>>
>>>>> Or perhaps those happy to use gotos also tend to be those who dislike
>>>>> exceptions. :-)
>>>>>
>>>>> Anyway, I'm off-TUHS-pic so follow-ups set to goto COFF.
>>>>>
>>>>> --
>>>>> Cheers, Ralph.

From bakul at iitbombay.org Sat Mar 11 03:47:29 2023
From: bakul at iitbombay.org (Bakul Shah)
Date: Fri, 10 Mar 2023 09:47:29 -0800
Subject: [COFF] [TUHS] Re: Conditions, AKA exceptions. (Was: I can't drive 55: "GOTO considered harmful" 55th anniversary)
In-Reply-To:
References:
Message-ID: <1DCF3FAD-ADAA-4FEC-8A76-739DF67A4859@iitbombay.org>

I should add that (compared to goto or setjmp/longjmp), by making exceptions a language thing, the compiler can attach more context to the exception event (or condition). In the scheme I outlined, the vendor library function must declare what exceptions it doesn’t handle and the compiler can pass more context that may not make sense to a library user but may help its developer pinpoint the cause.

> On Mar 10, 2023, at 9:28 AM, segaloco wrote:
>
> If nothing else, I could provide much better stack traces to vendors when I'm particularly stuck on something and convinced it isn't my fault.
> Maybe such a thing exists in C# but I've never gone looking for it, all I know is catching an exception from some vendor library with zero useful information makes me want to take a hammer to much more than the code...

From crossd at gmail.com Sat Mar 11 04:03:23 2023
From: crossd at gmail.com (Dan Cross)
Date: Fri, 10 Mar 2023 13:03:23 -0500
Subject: [COFF] [TUHS] Re: Conditions, AKA exceptions. (Was: I can't drive 55: "GOTO considered harmful" 55th anniversary)
In-Reply-To: <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org>
References: <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org>
Message-ID:

On Fri, Mar 10, 2023 at 12:36 PM Bakul Shah wrote:
> During development the runtime should simply invoke a debugger in this case. This should be perfectly doable but for some reason it is considered acceptable to crash a program! I don’t want to run a program *under a debugger* but want it invoked at the right time!

Common Lisp implementations have been doing that for years! Too bad using Lisp means bringing all the rest of the Lisp stuff with it, including the attitude. Oh well. :-)

        - Dan C.

> On the flip side something I've always thought would be powerful, at least in development, is a way to tell any and all procedures being called to ignore their exception/condition handling and hard-crash. Of course you don't want to take that kind of hammer to a production situation, but a way to override any and all handling so that real errors become apparent would be incredibly nice.
>
> If nothing else, I could provide much better stack traces to vendors when I'm particularly stuck on something and convinced it isn't my fault. Maybe such a thing exists in C# but I've never gone looking for it, all I know is catching an exception from some vendor library with zero useful information makes me want to take a hammer to much more than the code...
>
> - Matt G.
> ------- Original Message -------
> On Friday, March 10th, 2023 at 9:11 AM, Bakul Shah wrote:
>
> To make exception handling robust, I think every exception needs to be explicitly handled somewhere. If an exception is not handled by a function, that fact must be specified in the function declaration. In effect the compiler can check that every exception has a handler somewhere. I think you can implement it using different syntactic sugar than Go’s obnoxious error handling but basically the same (though you may be tempted to make it more efficient).
>
> On Mar 10, 2023, at 6:21 AM, Larry Stewart wrote:
>
> TLDR exceptions don't make it better, they make it different.
>
> The Mesa and Cedar languages at PARC CSL were intended to be "Systems Languages" and fully embraced exceptions.
>
> The problem is that it is extremely tempting for the author of a library to use them, and equally tempting for the authors of library calls used by the first library, and so on.
> At the application level, literally anything can happen on any call.
>
> The Cedar OS was a library OS, where applications ran in the same address space, since there was no VM. In 1982 or so I set out to write a shell for it, and was determined that regardless of what happened, the shell should not crash, so I set out to guard every single call with handlers for every exception they could raise.
>
> This was an immensely frustrating process because while the language suggested that the author of a library capture exceptions on the way by and translate them to one at the package level, this is a terrible idea in its own way, because you can't debug - the state of the ultimate problem was lost. So no one did this, and at the top level, literally any exception could occur.
>
> Another thing that happens with exceptions is that programmers get the bright idea to use them for conditions which are uncommon, but expected, so any user of the function has to write complicated code to deal with these cases.
>
> On the whole, I came away with a great deal of grudging respect for ERRNO as striking a great balance between ease of use and specificity.
>
> I also evolved Larry's Theory of Exceptions, which is that it is the programmer's job to sort exceptional conditions into actionable categories: (1) resolvable by the user (bad arguments) (2) Temporary (out of network sockets or whatever) (3) resolvable by the sysadmin (config) (4) real bug, resolvable by the author.
>
> The usual practice of course is the popup "Received unknown error, OK?"
>
> -Larry
>
> On Mar 10, 2023, at 8:15 AM, Ralph Corderoy wrote:
>
> > Hi Noel,
>
> > if you say above that most people are unfamiliar with them due to
> > their use of goto then that's probably wrong
>
> > I didn't say that.
>
> > Thanks for clarifying; I did know it was a possibility.
>
> > I was just astonished that in a long thread about handling exceptional
> > conditions, nobody had mentioned . . . exceptions. Clearly, either
> > unfamiliarity (perhaps because not many languages provide them - as you
> > point out, Go does not), or not top of mind.
>
> > Or perhaps those happy to use gotos also tend to be those who dislike
> > exceptions. :-)
>
> > Anyway, I'm off-TUHS-pic so follow-ups set to goto COFF.
>
> > --
> > Cheers, Ralph.
>
> >

From steffen at sdaoden.eu Sat Mar 11 04:09:38 2023
From: steffen at sdaoden.eu (Steffen Nurpmeso)
Date: Fri, 10 Mar 2023 19:09:38 +0100
Subject: [COFF] [TUHS] Re: Conditions, AKA exceptions. (Was: I can't drive 55: "GOTO considered harmful" 55th anniversary)
In-Reply-To: <498576F7-6881-4176-B187-F4ACB0A42F76@serissa.com>
References: <20230310131512.891A8212A8@orac.inputplus.co.uk> <498576F7-6881-4176-B187-F4ACB0A42F76@serissa.com>
Message-ID: <20230310180938.6rYu2%steffen@sdaoden.eu>

Larry Stewart wrote in
 <498576F7-6881-4176-B187-F4ACB0A42F76 at serissa.com>:
 |TLDR exceptions don't make it better, they make it different.
 ...
 |On the whole, I came away with a great deal of grudging respect for \
 |ERRNO as striking a great balance between ease of use and specificity.

From my user space point of view i never understood why there is no dedicated hardware register / (plus) error indicating flag that callers could cheaply and easily test. (Maybe there is on some processor platforms, beside a one such where errno then can be placed in some per-thread structure stored there. Still this requires another dedicated return value.)

I ran away from the exceptions i got used to with JAVA to -fno-rtti -fno-exceptions when i looked at the object output of g++ 2.95.?, and saw in the support code they use heap memory for this etc.

 |I also evolved Larry's Theory of Exceptions, which is that it is the \
 |programmer's job to sort exceptional conditions into actionable \
 |categories: (1) resolvable by the user (bad arguments) (2) Temporary \
 |(out of network sockets or whatever) (3) resolvable by the sysadmin \
 |(config) (4) real bug, resolvable by the author.
 ...

Really interesting point, like SMTP and other protocols which classify errors in categories.

Errors are one of my waving-helplessly topics, where you simply have to let things go and where "perfection" just cannot be achieved in real-life (or add .. as time passes by). Often you just do not find the correct answer, with errno the name sometimes fits, but the decade-old description does not really, and very fast you end up with overloading (eg come to a second ENODATA because ESRCH is something different, or reuse EILSEQ for bogus input even though the function already used to use EILSEQ for non-convertible output).
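
The four categories quoted above can be sketched as a mapping from errno values. This is only an illustration: the enum, the function name `classify_errno`, and the choice of which errno value lands in which category are assumptions made for the example, not anything proposed in the thread, and a real program would probably decide per call site.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical categories following "Larry's Theory of Exceptions". */
enum err_class {
    ERR_NONE,       /* no error */
    ERR_USER,       /* (1) resolvable by the user, e.g. bad arguments */
    ERR_TEMPORARY,  /* (2) temporary, e.g. out of descriptors; retry later */
    ERR_SYSADMIN,   /* (3) resolvable by the sysadmin, e.g. configuration */
    ERR_BUG         /* (4) real bug, resolvable by the author */
};

/* Illustrative mapping only; the assignments below are assumptions. */
enum err_class classify_errno(int err)
{
    assert(err >= 0);  /* errno values are zero or positive */

    switch (err) {
    case 0:
        return ERR_NONE;
    case EINVAL: case ENOENT: case EISDIR: case ENOTDIR:
        return ERR_USER;
    case EAGAIN: case EINTR: case EMFILE: case ENFILE: case ENOBUFS:
        return ERR_TEMPORARY;
    case EACCES: case EPERM: case ENOSPC: case EROFS:
        return ERR_SYSADMIN;
    default:
        /* Anything not anticipated is treated as a bug to report. */
        return ERR_BUG;
    }
}
```

A caller could then pick a reaction per category (re-prompt, retry, tell the operator, or abort with a report) instead of showing an undifferentiated "unknown error" popup.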

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

From bakul at iitbombay.org Sat Mar 11 04:57:13 2023
From: bakul at iitbombay.org (Bakul Shah)
Date: Fri, 10 Mar 2023 10:57:13 -0800
Subject: [COFF] [TUHS] Re: Conditions, AKA exceptions. (Was: I can't drive 55: "GOTO considered harmful" 55th anniversary)
In-Reply-To:
References: <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org>
Message-ID: <2CBA9AD7-BC25-40BF-ADE6-A6494D95A4B6@iitbombay.org>

On Mar 10, 2023, at 10:03 AM, Dan Cross wrote:
>
> On Fri, Mar 10, 2023 at 12:36 PM Bakul Shah wrote:
>> During development the runtime should simply invoke a debugger in this case. This should be perfectly doable but for some reason it is considered acceptable to crash a program! I don’t want to run a program *under a debugger* but want it invoked at the right time!
>
> Common Lisp implementations have been doing that for years! Too bad
> using Lisp means bringing all the rest of the Lisp stuff with it,
> including the attitude. Oh well. :-)

It can even fix the problem and continue!

Note that such things don't have to be *tied* to Lisp. But that would require a change in mindset.

From marzhall.o at gmail.com Sat Mar 11 05:57:40 2023
From: marzhall.o at gmail.com (Marshall Conover)
Date: Fri, 10 Mar 2023 14:57:40 -0500
Subject: [COFF] [TUHS] Re: Conditions, AKA exceptions.
 (Was: I can't drive 55: "GOTO considered harmful" 55th anniversary)
In-Reply-To: <2CBA9AD7-BC25-40BF-ADE6-A6494D95A4B6@iitbombay.org>
References: <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org> <2CBA9AD7-BC25-40BF-ADE6-A6494D95A4B6@iitbombay.org>
Message-ID:

While all this error and exception discussion is going down, I have to mention this piece:

http://joeduffyblog.com/2016/02/07/the-error-model/

The author worked at MS on their "Midori" research OS, and discussed what went into their decisions around using return codes, exceptions, etc. I felt it was a nice breakdown of the pros and cons of the different approaches, and fleshed out the concepts in my mind a bit. I thought others might enjoy it as well.

That said, I absolutely loathe exceptions with all my heart. In my experience, along Warner and Matt's lines, they're more prone to the sort of abuse that wastes my time than they are productive. It's not that they can't be used well, they just so often aren't.

Cheers,

Marshall

On Fri, Mar 10, 2023 at 1:57 PM Bakul Shah wrote:

> On Mar 10, 2023, at 10:03 AM, Dan Cross wrote:
> >
> > On Fri, Mar 10, 2023 at 12:36 PM Bakul Shah wrote:
> >> During development the runtime should simply invoke a debugger in this case. This should be perfectly doable but for some reason it is considered acceptable to crash a program! I don’t want to run a program *under a debugger* but want it invoked at the right time!
> >
> > Common Lisp implementations have been doing that for years! Too bad
> > using Lisp means bringing all the rest of the Lisp stuff with it,
> > including the attitude. Oh well. :-)
>
> It can even fix the problem and continue!
>
> Note that such things don't have to be *tied* to Lisp. But that
> would require a change in mindset.

From ralph at inputplus.co.uk Sat Mar 11 21:25:08 2023
From: ralph at inputplus.co.uk (Ralph Corderoy)
Date: Sat, 11 Mar 2023 11:25:08 +0000
Subject: [COFF] continue N. (Was: I can't drive 55...)
In-Reply-To: <20230310165552.czZmL%steffen@sdaoden.eu>
References: <20230309230130.q4I-f%steffen@sdaoden.eu> <20230310165552.czZmL%steffen@sdaoden.eu>
Message-ID: <20230311112508.7306220145@orac.inputplus.co.uk>

Hi Steffen,

COFF'd.

> Very often i find myself needing a restart necessity, so "continue
> N" would that be. Then again when "N" is a number instead of
> a label this is a (let alone maintenance) mess but for shortest
> code paths.

Do you mean ‘continue’ which re-tests the condition or more like Perl's ‘redo’ which re-starts the loop's body?

    ‘The "redo" command restarts the loop block without evaluating the
    conditional again. The "continue" block, if any, is not executed.’
        -- perldoc -f redo

So like a ‘goto redo’ in

    while (...) {
    redo:
        ...
        if (...)
            goto redo;
        ...
    }

--
Cheers, Ralph.

From ralph at inputplus.co.uk Sat Mar 11 21:28:49 2023
From: ralph at inputplus.co.uk (Ralph Corderoy)
Date: Sat, 11 Mar 2023 11:28:49 +0000
Subject: [COFF] Conditions, AKA exceptions.
In-Reply-To: <20230310174222.GB9225@mcvoy.com>
References: <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org> <20230310174222.GB9225@mcvoy.com>
Message-ID: <20230311112849.22C0920145@orac.inputplus.co.uk>

Hi Larry,

>     cmd = aprintf("gdb -batch -ex backtrace '%s/bk' %u 1>&%d 2>&%d",
>         bin, getpid(), fileno(f), fileno(f));
>
>     system(cmd);

I also came up with this, probably on an SGI Iris Indigo, and got it added to the Unix Programming FAQ. :-)

    6.5 How can I generate a stack dump from within a running program?
    http://www.faqs.org/faqs/unix-faq/programmer/faq/

It works surprisingly often, i.e. the process is healthy enough to run system(3).

--
Cheers, Ralph.
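
The gdb trick quoted above can be sketched as a small pair of helpers. This is a minimal sketch under stated assumptions: gdb is installed and permitted to attach to the process (ptrace restrictions on some systems may prevent that), the path of the running executable is known to the caller, and snprintf stands in for the `aprintf` helper, whose definition is not shown in the thread. The names `build_gdb_cmd` and `dump_stack_via_gdb` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: format the gdb command line into buf.
 * exe is the path of the running binary, pid the process to attach to,
 * and fd the descriptor that should receive gdb's stdout and stderr.
 * Note: exe must not contain a single quote, since it is quoted naively.
 * Returns 0 on success, -1 if buf is too small. */
int build_gdb_cmd(char *buf, size_t size, const char *exe, pid_t pid, int fd)
{
    assert(buf != NULL && exe != NULL);

    int n = snprintf(buf, size,
                     "gdb -batch -ex backtrace '%s' %ld 1>&%d 2>&%d",
                     exe, (long)pid, fd, fd);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Hypothetical helper: ask gdb to print this process's stack to fd.
 * Returns whatever system(3) returns, or -1 if the command did not fit. */
int dump_stack_via_gdb(const char *exe, int fd)
{
    char cmd[512];

    if (build_gdb_cmd(cmd, sizeof cmd, exe, getpid(), fd) != 0)
        return -1;
    return system(cmd);
}
```

Something like `dump_stack_via_gdb("/path/to/program", 2)` could be called from an error path; as the message notes, it only helps while the process is still healthy enough to run system(3).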

From paul.winalski at gmail.com Sun Mar 12 01:42:51 2023
From: paul.winalski at gmail.com (Paul Winalski)
Date: Sat, 11 Mar 2023 10:42:51 -0500
Subject: [COFF] continue N. (Was: I can't drive 55...)
In-Reply-To: <20230311112508.7306220145@orac.inputplus.co.uk>
References: <20230309230130.q4I-f%steffen@sdaoden.eu> <20230310165552.czZmL%steffen@sdaoden.eu> <20230311112508.7306220145@orac.inputplus.co.uk>
Message-ID:

Regarding the general subject of using GOTOs:

The first computer on which I did hands-on programming was an IBM S/360 model 25. It had 32K of memory available for user programs--that's both instructions and data. It executed code at about a 30 KIPS (yes--KILO instructions/second) rate. When you're programming on a machine that is that slow and with that limited an address space, every instruction counts. You couldn't afford either the space or the time to execute conditional tests just to avoid a GOTO.

Programming using GOTOs doesn't necessarily mean you're writing rat's nest or spaghetti code. Yes, you can make a mess using GOTOs, and perhaps messy code is easier when GOTOs are allowed, but structured programming just for its own sake can lead to convoluted and messy program structure as well. What was rat's nest control flow with GOTOs can turn into rat's nest data flow of state variables.

It's also worth noting that one of the main functions of a modern optimizing compiler is to take your nice, structured program and put all those rat's nest GOTOs (unconditional branch instructions) back so the thing will execute more quickly.

-Paul W.

From steffen at sdaoden.eu Sun Mar 12 03:51:02 2023
From: steffen at sdaoden.eu (Steffen Nurpmeso)
Date: Sat, 11 Mar 2023 18:51:02 +0100
Subject: [COFF] continue N. (Was: I can't drive 55...)
In-Reply-To: <20230311112508.7306220145@orac.inputplus.co.uk>
References: <20230309230130.q4I-f%steffen@sdaoden.eu> <20230310165552.czZmL%steffen@sdaoden.eu> <20230311112508.7306220145@orac.inputplus.co.uk>
Message-ID: <20230311175102.Yl3ha%steffen@sdaoden.eu>

Ralph Corderoy wrote in
 <20230311112508.7306220145 at orac.inputplus.co.uk>:
 |Hi Steffen,
 |
 |COFF'd.
 |
 |> Very often i find myself needing a restart necessity, so "continue
 |> N" would that be. Then again when "N" is a number instead of
 |> a label this is a (let alone maintenance) mess but for shortest
 |> code paths.
 |
 |Do you mean ‘continue’ which re-tests the condition or more like Perl's
 |‘redo’ which re-starts the loop's body?

No Ralph, i unspecifically meant multiple nested loops where some inner has to restart/continue the outer (at some point). So a bit like that of "man perlsyn", but with deeper nesting

  If you need both "next" and "last", you have to do both and also use a
  loop label:

      LOOP: {
          do {{
              next if $x == $y;
              last LOOP if $x == $y**2;
              # do something here
          }} until $x++ > $z;
      }

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

From crossd at gmail.com Sun Mar 12 06:32:12 2023
From: crossd at gmail.com (Dan Cross)
Date: Sat, 11 Mar 2023 15:32:12 -0500
Subject: [COFF] [TUHS] Re: the wheel of reincarnation goes sideways
In-Reply-To: <20230309200932.GK9225@mcvoy.com>
References: <20230309200932.GK9225@mcvoy.com>
Message-ID:

On Thu, Mar 9, 2023 at 3:09 PM Larry McVoy wrote:
> On Thu, Mar 09, 2023 at 02:55:44PM -0500, Dan Cross wrote:
> > On Wed, Mar 8, 2023 at 8:22 PM John Cowan wrote:
> > > On Wed, Mar 8, 2023 at 2:53 PM Dan Cross wrote:
> > >> But the
> > >> 3090 was really more like a distributed system than the Athlon box
> > >> was, with all sorts of offload capabilities. For that matter, a
> > >> thousand users probably _could_ telnet into the Athlon system.
> > >> With telnet in line mode, it'd probably even be decently responsive.
> > >
> > > I find that difficult to believe. It seems too high by an order of magnitude.
> >
> > I'm not going to claim it would be zippy, but I do think it would work
> > acceptably.
> >
> > Suppose that 1000 users telnet'ed into the x86 machine, but remained
> > essentially idle; what resources would that consume? We'd have 1000
> > open TCP connections, a thousand shell processes, a thousand
> > telnetd's, etc.
>
> The early Unix code really did not like stuff like this. Lots of linear
> scans through what were assumed to be short lists. I still remember an
> SGI Challenge being brought to its knees by a bunch of racks of modems.
> The same machine could move a ton of data but not when it was being
> forced through a zillion sockets.

Oh for sure I wouldn't try it on a VAX or PDP-11. I'm a bit surprised by the SGI thing, to be honest, but only a bit: as you say, I think that was just before the big push to make Unix really scalable.

> Linux seems well past that problem but it's possible that back in the
> Athlon days it still sucked. I pinged Linus, if he remembers when the
> kernel got taught to scale on sockets I'll report back.

Thanks, I'm curious what he says.

        - Dan C.

From bakul at iitbombay.org Sun Mar 12 09:28:08 2023
From: bakul at iitbombay.org (Bakul Shah)
Date: Sat, 11 Mar 2023 15:28:08 -0800
Subject: [COFF] [TUHS] Re: the wheel of reincarnation goes sideways
In-Reply-To:
References:
Message-ID: <8BD706F3-4F50-4836-91ED-10179F06C177@iitbombay.org>

On Mar 9, 2023, at 11:55 AM, Dan Cross wrote:
>
> Suppose that 1000 users telnet'ed into the x86 machine, but remained
> essentially idle; what resources would that consume? We'd have 1000
> open TCP connections, a thousand shell processes, a thousand
> telnetd's, etc.
> All of that would consume some amount of RAM (though
> there'd be a lot of sharing of text and read-only data and so on),
> some VM space requiring RAM for paging structures and so on, some
> accounting data in the kernel, 1000 pseudo-ttys allocated, entries in
> the process table, etc. But, most of those shells would spend most of
> their time blocked waiting on input, so wouldn't consume CPU
> continuously, and similarly with the TCP connections mostly idle, the
> kernel is not generally wasting a lot of processor time on the login
> sessions. There'd be some bookkeeping data on disk, but that would be
> small. System overhead would amount to maybe a few megabytes, I'd
> imagine.

Not the same but in 1995 at Real Networks our server s/w running on a 50MHz or 100MHz Pentium could handle 1000 TCP control connections (mostly idle) and 1000 UDP "streams", each sending 10 packets/second, which was the limiting factor. IIRC we had reduced per socket tcp send/recv buffer size to a small number. I don't recall now whether these machines had more than 16MB but we didn't want to tie up lots of memory in idle buffers.

We got a real boost in traffic in Oct'95 when people all over the world wanted to know the verdict in O.J. Simpson's murder trial in real time! After that I added code for feeding live streams to any downstream servers so that theoretically a 3 level distribution tree can deliver live data to a billion people.

From tytso at mit.edu Sun Mar 12 14:23:48 2023
From: tytso at mit.edu (Theodore Ts'o)
Date: Sat, 11 Mar 2023 23:23:48 -0500
Subject: [COFF] Conditions, AKA exceptions.
In-Reply-To: <20230311112849.22C0920145@orac.inputplus.co.uk>
References: <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org> <20230310174222.GB9225@mcvoy.com> <20230311112849.22C0920145@orac.inputplus.co.uk>
Message-ID: <20230312042348.GJ860405@mit.edu>

On Sat, Mar 11, 2023 at 11:28:49AM +0000, Ralph Corderoy wrote:
> Hi Larry,
>
> >     cmd = aprintf("gdb -batch -ex backtrace '%s/bk' %u 1>&%d 2>&%d",
> >         bin, getpid(), fileno(f), fileno(f));
> >
> >     system(cmd);
>
> I also came up with this, probably on an SGI Iris Indigo, and got it
> added to the Unix Programming FAQ. :-)
>
>     6.5 How can I generate a stack dump from within a running program?
>     http://www.faqs.org/faqs/unix-faq/programmer/faq/
>
> It works surprisingly often, i.e. the process is healthy enough to run
> system(3).

On Linux (or some other system using glibc) a limited facility is built into the C library. So you can just do something like this:

	#include <execinfo.h>	/* backtrace(), backtrace_symbols_fd() */

	{
		void *stack_syms[32];
		int frames;

		/* capture up to 32 return addresses from the current stack */
		frames = backtrace(stack_syms, 32);
		/* write them to fd 2 (stderr); does not call malloc(3) */
		backtrace_symbols_fd(stack_syms, frames, 2);
	}

This is convenient if you want a stack trace, but the binary might be on a rescue floppy which doesn't have space for gdb, or the user might not have gdb installed.

I use this for the fsck for ext4, and the nice thing is that it works even with a stripped binary.

For example:

Signal (7) SIGBUS (sent from pid 4261) si_code=SI_USER
e2fsck(+0x36691)[0x564da1ed2691]
/lib/x86_64-linux-gnu/libc.so.6(+0x3bf90)[0x7f6e21c0bf90]
/lib/x86_64-linux-gnu/libc.so.6(read+0xd)[0x7f6e21cc80ed]
e2fsck(ask_yn+0x1de)[0x564da1ec90de]
e2fsck(fix_problem+0xfc0)[0x564da1ecc7b0]
e2fsck(+0x235b3)[0x564da1ebf5b3]
e2fsck(+0x252d3)[0x564da1ec12d3]
/lib/x86_64-linux-gnu/libext2fs.so.2(ext2fs_dblist_iterate3+0x5f)[0x7f6e21e430cf]
e2fsck(e2fsck_pass2+0x18b)[0x564da1ebdd7b]
e2fsck(e2fsck_run+0x5a)[0x564da1eb0c3a]
e2fsck(main+0x16cb)[0x564da1eacdbb]
/lib/x86_64-linux-gnu/libc.so.6(+0x2718a)[0x7f6e21bf718a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f6e21bf7245]
e2fsck(_start+0x21)[0x564da1eaefc1]

For more information see:

https://github.com/tytso/e2fsprogs/blob/master/e2fsck/sigcatcher.c#L379

					- Ted

From ralph at inputplus.co.uk Sun Mar 12 20:44:17 2023
From: ralph at inputplus.co.uk (Ralph Corderoy)
Date: Sun, 12 Mar 2023 10:44:17 +0000
Subject: [COFF] Conditions, AKA exceptions.
In-Reply-To: <20230312042348.GJ860405@mit.edu>
References: <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org> <20230310174222.GB9225@mcvoy.com> <20230311112849.22C0920145@orac.inputplus.co.uk> <20230312042348.GJ860405@mit.edu>
Message-ID: <20230312104417.DC1DF215AA@orac.inputplus.co.uk>

Hi Ted,

> > >     cmd = aprintf("gdb -batch -ex backtrace '%s/bk' %u 1>&%d 2>&%d",
> > >         bin, getpid(), fileno(f), fileno(f));
...
> > It works surprisingly often, i.e. the process is healthy enough to
> > run system(3).
>
> On Linux (or some other system using glibc) a limited facility is
> built into the C library. So you can just do something like this:
...
>         frames = backtrace(stack_syms, 32);
>         backtrace_symbols_fd(stack_syms, frames, 2);

Since ’99, yes. :-) backtrace(3) says glibc 2.1 added it.

> I use this for the fsck for ext4, and the nice thing is that it works
> even with a stripped binary.
>
> For example:

Yes, that is nice.

> https://github.com/tytso/e2fsprogs/blob/master/e2fsck/sigcatcher.c#L379

Thanks, I've made a note.

Do you ever find things are so messed up that stdio has trouble whereas using write(2) with compile-time memory allocations for a buffer would have a better chance of reaching the TTY?

--
Cheers, Ralph.

From paul.winalski at gmail.com Mon Mar 13 02:46:40 2023
From: paul.winalski at gmail.com (Paul Winalski)
Date: Sun, 12 Mar 2023 12:46:40 -0400
Subject: [COFF] Conditions, AKA exceptions.
In-Reply-To: <20230312104417.DC1DF215AA@orac.inputplus.co.uk>
References: <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org> <20230310174222.GB9225@mcvoy.com> <20230311112849.22C0920145@orac.inputplus.co.uk> <20230312042348.GJ860405@mit.edu> <20230312104417.DC1DF215AA@orac.inputplus.co.uk>
Message-ID:

On 3/12/23, Ralph Corderoy wrote:
>
> Do you ever find things are so messed up that stdio has trouble whereas
> using write(2) with compile-time memory allocations for a buffer would
> have a better chance of reaching the TTY?

I hate it when that happens. Even worse is when adding the write(2) with compile-time memory allocations makes the bug go away. I once had to spend three days camped out in someone's office debugging a compiler crash. The crash only happened 4 hours into a massive multi-file compilation, and this guy's login session was the only one where the problem reproduced under the debugger. Heisenbugs are hell.

-Paul W.

From lm at mcvoy.com Mon Mar 13 02:53:19 2023
From: lm at mcvoy.com (Larry McVoy)
Date: Sun, 12 Mar 2023 09:53:19 -0700
Subject: [COFF] Conditions, AKA exceptions.
In-Reply-To:
References: <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org> <20230310174222.GB9225@mcvoy.com> <20230311112849.22C0920145@orac.inputplus.co.uk> <20230312042348.GJ860405@mit.edu> <20230312104417.DC1DF215AA@orac.inputplus.co.uk>
Message-ID: <20230312165319.GN9225@mcvoy.com>

On Sun, Mar 12, 2023 at 12:46:40PM -0400, Paul Winalski wrote:
> On 3/12/23, Ralph Corderoy wrote:
> >
> > Do you ever find things are so messed up that stdio has trouble whereas
> > using write(2) with compile-time memory allocations for a buffer would
> > have a better chance of reaching the TTY?
>
> I hate it when that happens. Even worse is when adding the write(2)
> with compile-time memory allocations makes the bug go away. I once
> had to spend three days camped out in someone's office debugging a
> compiler crash. The crash only happened 4 hours into a massive
> multi-file compilation, and this guy's login session was the only one
> where the problem reproduced under the debugger. Heisenbugs are hell.

I had one like that. Sometimes, rarely, suninstall would throw a panic(psig) which meant that someone in the kernel had messed with the process' signal mask, which is a no-no. Turns out that the SCSI twins had heard that people were interrupting suninstall if it took too long, so under certain conditions, the SCSI tape driver would disable SIGINT. It was (obviously) my fault because I was doing POSIX conformance and I was the last person in many kernel files. Took me a long time to track that one down.

--
---
Larry McVoy           Retired to fishing          http://www.mcvoy.com/lm/boat

From ralph at inputplus.co.uk Tue Mar 14 02:47:18 2023
From: ralph at inputplus.co.uk (Ralph Corderoy)
Date: Mon, 13 Mar 2023 16:47:18 +0000
Subject: [COFF] Conditions, AKA exceptions.
In-Reply-To:
References: <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org> <2CBA9AD7-BC25-40BF-ADE6-A6494D95A4B6@iitbombay.org>
Message-ID: <20230313164718.4169E21F37@orac.inputplus.co.uk>

Hi Marshall,

> While all this error and exception discussion is going down, I have to
> mention this piece:
> http://joeduffyblog.com/2016/02/07/the-error-model/
>
> The author worked at MS on their "midori" research OS, and discussed
> what went into their decisions around using return codes, exceptions,
> etc. I felt it was a nice breakdown of the pros and cons of the
> different approaches, and fleshed out the concepts in my mind a bit.
> I thought others might enjoy it as well.

Thanks, it was a long read but enjoyable.

> That said, I absolutely loathe exceptions with all my heart.

I'm not a fan either. The exceptions Joe introduces above are more of a simpler syntax for handling return codes. He gives the expanded equivalent at one point. I also liked his enthusiasm for ‘abandonment’, similar to a BUG() macro.

--
Cheers, Ralph.

From paul.winalski at gmail.com Tue Mar 14 03:10:17 2023
From: paul.winalski at gmail.com (Paul Winalski)
Date: Mon, 13 Mar 2023 13:10:17 -0400
Subject: [COFF] Conditions, AKA exceptions.
In-Reply-To: <20230313164718.4169E21F37@orac.inputplus.co.uk>
References: <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org> <2CBA9AD7-BC25-40BF-ADE6-A6494D95A4B6@iitbombay.org> <20230313164718.4169E21F37@orac.inputplus.co.uk>
Message-ID:

On 3/13/23, Ralph Corderoy wrote:
>
>> That said, I absolutely loathe exceptions with all my heart.
>
> I'm not a fan either.

Exceptions play merry hell with compiler optimizations. If you are in a piece of code where an exception can occur, unless you have knowledge of the global side-effects of the handler(s) that might get invoked you must abandon any attempts to do data flow analysis of global data items. The C++ Standard Library is fond of using throw and catch exception handling.
An optimizing compiler pretty much has to throw out all data flow optimization involving global variables, or things passed to a callee by pointer, if anything in the call chain calls a C++ Standard Library routine. From a compiler writer's perspective, the name STD for the C++ Standard Library is most apt. STD routines are a disease that infects anything that touches them. -Paul W. From dave at horsfall.org Tue Mar 14 07:12:53 2023 From: dave at horsfall.org (Dave Horsfall) Date: Tue, 14 Mar 2023 08:12:53 +1100 (EST) Subject: [COFF] Conditions, AKA exceptions. In-Reply-To: References: <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org> <2CBA9AD7-BC25-40BF-ADE6-A6494D95A4B6@iitbombay.org> <20230313164718.4169E21F37@orac.inputplus.co.uk> Message-ID: On Mon, 13 Mar 2023, Paul Winalski wrote: > From a compiler writer's perspective, the name STD for the C++ Standard > Library is most apt. STD routines are a disease that infects anything > that touches them. .sig! .sig! -- Dave From crossd at gmail.com Tue Mar 14 08:34:38 2023 From: crossd at gmail.com (Dan Cross) Date: Mon, 13 Mar 2023 18:34:38 -0400 Subject: [COFF] [TUHS] Re: the wheel of reincarnation goes sideways In-Reply-To: References: Message-ID: I don't know if a thousand users ever logged in there at one time, but they do tend to have a lot of simultaneous logins. On Mon, Mar 13, 2023 at 6:16 PM Peter Pentchev wrote: > > On Wed, Mar 08, 2023 at 02:52:43PM -0500, Dan Cross wrote: > > [bumping to COFF] > > > > On Wed, Mar 8, 2023 at 2:05 PM ron minnich wrote: > > > The wheel of reincarnation discussion got me to thinking: > [snip] > > > The evolution of platforms like laptops to becoming full distributed systems continues. > > > The wheel of reincarnation spins counter clockwise -- or sideways? > > > > About a year ago, I ran across an email written a decade or more prior > > on some mainframe mailing list where someone wrote something like, > > "wow!
It just occurred to me that my Athlon machine is faster than the > > ES/3090-600J I used in 1989!" Some guy responded angrily, rising to > > the wounded honor of IBM, raving about how preposterous this was > > because the mainframe could handle a thousand users logged in at one > > time and there's no way this Linux box could ever do that. > [snip] > > For that matter, a > > thousand users probably _could_ telnet into the Athlon system. With > > telnet in line mode, it'd probably even be decently responsive. > > sdf.org (formerly sdf.lonestar.org) comes to mind... > > G'luck, > Peter > > -- > Peter Pentchev roam at ringlet.net roam at debian.org pp at storpool.com > PGP key: http://people.FreeBSD.org/~roam/roam.key.asc > Key fingerprint 2EE7 A7A5 17FC 124C F115 C354 651E EFB0 2527 DF13 From ken.unix.guy at gmail.com Sun Mar 26 08:25:36 2023 From: ken.unix.guy at gmail.com (KenUnix) Date: Sat, 25 Mar 2023 18:25:36 -0400 Subject: [COFF] 3B2/400 Unix System V r3 man Message-ID: Hi. Was a man page kit ever made for Unix System V r3? I am running it under a 3B2/400 sim. If it is available, where could I get it? Thanks, Ken -- WWL 📚 From ken.unix.guy at gmail.com Mon Mar 27 00:00:28 2023 From: ken.unix.guy at gmail.com (KenUnix) Date: Sun, 26 Mar 2023 10:00:28 -0400 Subject: [COFF] Fortran Question for Unix System-V r3 Message-ID: Fortran question for Unix System-5 r3. When executing fortran programs requiring input the screen will show a blank screen. After entering input anyway the program completes under Unix System V *r3*. When the same program is compiled under Unix System V *r1* it works as expected. Sounds like on Unix System V *r3* the output buffer is not being flushed. I tried re-compiling F77. No help.
Fortran code follows:

      PROGRAM EASTER
      INTEGER YEAR,METCYC,CENTRY,ERROR1,ERROR2,DAY
      INTEGER EPACT,LUNA
C A PROGRAM TO CALCULATE THE DATE OF EASTER
      PRINT '(A)',' INPUT THE YEAR FOR WHICH EASTER'
      PRINT '(A)',' IS TO BE CALCULATED'
      PRINT '(A)',' ENTER THE WHOLE YEAR, E.G. 1978 '
      READ '(A)',YEAR
C CALCULATING THE YEAR IN THE 19 YEAR METONIC CYCLE-METCYC
      METCYC = MOD(YEAR,19)+1
      IF(YEAR.LE.1582)THEN
        DAY = (5*YEAR)/4
        EPACT = MOD(11*METCYC-4,30)+1
      ELSE
C CALCULATING THE CENTURY-CENTRY
        CENTRY = (YEAR/100)+1
C ACCOUNTING FOR ARITHMETIC INACCURACIES
C IGNORES LEAP YEARS ETC.
        ERROR1 = (3*CENTRY/4)-12
        ERROR2 = ((8*CENTRY+5)/25)-5
C LOCATING SUNDAY
        DAY = (5*YEAR/4)-ERROR1-10
C LOCATING THE EPACT(FULL MOON)
        EPACT = MOD(11*METCYC+20+ERROR2-ERROR1,30)
        IF(EPACT.LT.0)EPACT=30+EPACT
        IF((EPACT.EQ.25.AND.METCYC.GT.11).OR.EPACT.EQ.24)THEN
          EPACT=EPACT+1
        ENDIF
      ENDIF
C FINDING THE FULL MOON
      LUNA=44-EPACT
      IF(LUNA.LT.21)THEN
        LUNA=LUNA+30
      ENDIF
C LOCATING EASTER SUNDAY
      LUNA=LUNA+7-(MOD(DAY+LUNA,7))
C LOCATING THE CORRECT MONTH
      IF(LUNA.GT.31)THEN
        LUNA = LUNA - 31
        PRINT '(A)',' FOR THE YEAR ',YEAR
        PRINT '(A)',' EASTER FALLS ON APRIL ',LUNA
      ELSE
        PRINT '(A)',' FOR THE YEAR ',YEAR
        PRINT '(A)',' EASTER FALLS ON MARCH ',LUNA
      ENDIF
      END

Any help would be appreciated, Ken -- WWL 📚 From paul.winalski at gmail.com Mon Mar 27 01:04:49 2023 From: paul.winalski at gmail.com (Paul Winalski) Date: Sun, 26 Mar 2023 11:04:49 -0400 Subject: [COFF] Fortran Question for Unix System-V r3 In-Reply-To: References: Message-ID: On 3/26/23, KenUnix wrote: > Fortran question for Unix System-5 r3. > > When executing fortran programs requiring input the screen will > show a blank screen. After entering input anyway the program completes > under Unix System V *r3*. > > When the same program is compiled under Unix System V *r1* it > works as expected. > > Sounds like on Unix System V *r3* the output buffer is not being flushed.
> I tried re-compiling F77. No help. Re-compiling F77 doesn't help because the bug is in the Fortran run-time library (RTL), not in the compiler. The routine that implements the READ statement should be flushing the write buffer before doing the actual read. Clearly it isn't. Their test system probably didn't have very many (if any) tests for interactive behavior. That sort of thing is difficult to automate. -Paul W. From ken.unix.guy at gmail.com Tue Mar 28 23:26:12 2023 From: ken.unix.guy at gmail.com (KenUnix) Date: Tue, 28 Mar 2023 09:26:12 -0400 Subject: [COFF] Unix V r3 question Message-ID: Hi. Does anyone have the "man" pages for Basic for System-V r3? Thanks, Ken -- WWL 📚 From ken.unix.guy at gmail.com Tue Mar 28 23:31:14 2023 From: ken.unix.guy at gmail.com (KenUnix) Date: Tue, 28 Mar 2023 09:31:14 -0400 Subject: [COFF] Unix System V r1 -> Unix System V r3 Message-ID: Hi, Has anyone been successful in communicating using cu or some other method to transfer files between two sims running Unix System V? If so I would appreciate some help. Thanks, Ken -- WWL 📚 From gingell at computer.org Thu Mar 30 09:07:57 2023 From: gingell at computer.org (Rob Gingell) Date: Wed, 29 Mar 2023 16:07:57 -0700 Subject: [COFF] [TUHS] Re: Origins of the frame buffer device In-Reply-To: <7w7cvr4x36.fsf@junk.nocrew.org> References: <20230305185202.91B7B18C08D@mercury.lcs.mit.edu> <7w7cvr4x36.fsf@junk.nocrew.org> Message-ID: [Redirected to COFF for some anecdotal E&S-related history and non-UNIX terminal room nostalgia.] On 3/7/23 9:43 PM, Lars Brinkhoff wrote: > Noel Chiappa wrote: >>> The first frame buffers from Evans and Sutherland were at University >>> of Utah, DOD SITES and NYIT CGL as I recall. Circa 1974 to 1978. >> >> Were those on PDP-11's, or PDP-10's?
(Really early E+S gear attached to >> PDP-10's; '74-'78 sounds like an interim period.) > > The Picture System from 1974 was based on a PDP-11/05. It looks like > vector graphics rather than a frame buffer though. > > http://archive.computerhistory.org/resources/text/Evans_Sutherland/EvansSutherland.3D.1974.102646288.pdf E&S LDS-1s used PDP-10s as their host systems. LDS-2s could at least in principle use several different hosts (including spanning a range of word sizes, e.g., a SEL-840 with 24 bit words or a 16 bit PDP-11.) The Line Drawing Systems drove calligraphic displays. No frame buffers. The early Picture Systems (like the brochure referenced by Lars) also drove calligraphic displays but did sport a line segment "refresh buffer" so that screen refreshes weren't dependent on the whole pipeline's performance. At least one heavily customized LDS-2 (described further below) produced raster output by 1974 (and likely earlier in design and testing) and had a buffer for raster refresh which exhibited some of what we think of as the functionality of a frame buffer fitting the time frame referenced by Noel for other E&S products. On 3/8/23 10:21 AM, Larry McVoy wrote: > I really miss terminal rooms. I learned so much looking over the > shoulders of more experienced people. Completely agree. They were the "playground learning" that did all of educate, build craft and community, and occasionally bestow humility. Although it completely predates frame buffer technology, the PDP-10 terminal room of the research computing environment at CWRU in the 1970s was especially remarkable as well as personally influential. All (calligraphic) graphics terminals and displays (though later a few Datapoint CRTs appeared.) There was an LDS-1 hosted on the PDP-10 and later an LDS-2 (which was co-located but not part of the PDP-10 environment.) 
The chair of the department, Edward (Ted) Glaser, had been recruited from MIT in 1968 and was heavily influential in guiding the graphics orientation of the facilities, and later, in the design of the customized LDS-2. Especially remarkable as he had been blind since he was 8. He had a comprehensive vision of systems and thinking about them that influenced a lot about the department's programs and research. When I arrived in 1972, I only had a fleeting overlap with the LDS-1 to experience some of its games (color wheel lorgnettes and carrier landings!). The PDP-10 was being modified for TENEX and the LDS-1 was being decommissioned. I recall a tablet and button box for LDS-1 input devices. The room was kept dimly lit with the overhead lighting off and only the glow of the displays and small wattage desk lamps. It shared the raised floor environment with the PDP-10 machine room (though was walled off from it) and so had a "quiet-loud" aura from all the white noise. The white noise cocooned you but permitted conversation and interaction with others that didn't usually disturb the uninvolved. The luxury terminals were IMLAC PDS-1s. There was a detachable switch and indicator console that could be swapped between them for debugging or if you simply liked having the blinking lights in view. When not in use for real work the IMLACs would run Space War, much to the detriment of IMLAC keyboards. They could handle pretty complex displays, like, a screen full of dense text before flicker might set in. Light pens provided pointing input. The bulk of the terminals were an array of DEC VT02s. Storage tube displays (so no animation possible), but with joysticks for pointing and interacting. There were never many VT02s made and we always believed we had the largest single collection of them. None of these had character generators. The LDS-1 and the IMLACs drew their own characters programmatically. A PDP-8/I drove the VT02s and stroked all the characters. 
It did it at about 2400 baud but when the 8 got busy you could perceive the drawing of the characters like a scribe on speed. If you stood way back to take in the room you could also watch the PDP-8 going around as the screens brightened momentarily as the characters/images were drawn. I was told that CWRU wrote the software for the PDP-8 and gave it to DEC; in return DEC gave CWRU $1 and the biggest line printer they sold. (The line printer did upper and lower case, and the University archivists swooned when presented with theses printed on it -- RUNOFF being akin to magic in a typewriter-primitive world.) Until the Datapoint terminals arrived all the devices in the room were either computers themselves or front-ended by one. Although I only saw it happen once, the LDS-1 with its rather intimate connection to the -10 was particularly attuned to the status of TOPS-10 and would flash "CRASH" before users could tell that something was wrong vs. just being slow. (We would later run TOPS-10 for amusement. The system had 128K words in total: 4 MA10 16K bays and 1 MD10 64K bay. TENEX needed a minimum of 80K to "operate" though it'd be misleading to describe that configuration as "running". If we lost the MD10 bay that meant no TENEX so we had a DECtape-swapping configuration of TOPS-10 for such moments because, well, a PDP-10 with 8 DECtapes twirling is pretty humorously theatrical.) All the displays (even the later Datapoints) had green or blue-green phosphors. This had the side effect that several hours of staring at them made anything that was white look pink. This was especially pronounced in the winter in that being Cleveland it wasn't that unusual to leave to find a large deposit of seemingly psychedelic snow that hadn't been there when you went in. The LDS-2 arrived in the winter of 1973-4. It was a highly modified LDS-2 that produced raster graphics and shaded images in real-time.
It was the first system to do that and was called the Case Shaded Graphics System (SGS). (E&S called it the Halftone System as it wouldn't do color in real-time. In addition to a black & white raster display, it had a 35mm movie camera, a Polaroid camera, and an RGB filter that would triple-expose each frame and so in a small way retained the charm of the lorgnettes used on the LDS-1 to make color happen but not in real-time.) It was hosted by a PDP-11/40 running RT-11. Declining memory prices helped enable the innovations in the SGS as it incorporated more memory components than the previous calligraphic systems. The graphics pipeline was extended such that after translation and clipping there was a Y-sort box that ordered the polygons from top to bottom for raster scanning, followed by a Visible Surface Processor that separated hither from yon, and finally a Gouraud Shader that produced the final image to a monitor or one of the cameras. Physically the system was 5 or maybe 6 bays long not including the 11/40 bay. The SGS had some teething problems after its delivery. Ivan Sutherland even came to Cleveland to work on it though he has claimed his main memory of that is the gunfire he heard from the Howard Johnson's hotel next to campus. The University was encircled by several distressed communities at the time. A "bullet hole through glass" decal appeared on the window of the SGS's camera bay to commemorate his experience. The SGS configuration was unique but a number of its elements were incorporated into later Picture Systems. It's my impression that the LDS systems were pretty "one off" and the Picture Systems became the (relative) "volume, off the shelf" product from E&S. (I'd love to read a history of all the things E&S did in that era.) By 1975-6 the SGS was being used by projects ranging from SST stress analyses to mathematicians producing videos of theoretical concepts.
The exaggerated images of stresses on aircraft structures got pretty widely distributed and referenced at the time. The SGS was more of a production system used by other departments and entities rather than computer graphics research as such; in some ways its (engineering) research utility was achieved by its having existed. One student, Ben Jones, created an extended ALGOL-60 to allow programming in something other than the assembly language. As the SGS came online in 1975 the PDP-10 was being decommissioned and the calligraphic technologies associated with it vanished along with it. A couple of years later a couple of Teraks appeared and by the end of the 1970s frame buffers as we generally think of them were economically practical. That along with other processing improvements rendered the SGS obsolete and so it was decommissioned in 1980 and donated to the Computer History Museum where I imagine it sits in storage next to a LINC-8 or the Ark of the Covenant or something. One of the SGS's bays (containing the LDS-2 Channel Control, the front of the pipeline LDS program interpreter running out of the host's memory) and the PDP-11 interface is visible via this link: https://www.computerhistory.org/collections/catalog/102691213 The bezels on the E&S bays were cosmetically like the DEC ones of the same era. They were all smoked glass so the blinking lights were visible but had to be raised if you wanted to see the identifying legends for them.