<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body style="overflow-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;"><br id="lineBreakAtBeginningOfMessage"><div><br><blockquote type="cite"><div>On Sep 20, 2024, at 1:58 PM, Warner Losh <imp@bsdimp.com> wrote:</div><br class="Apple-interchange-newline"><div><div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Sep 20, 2024 at 9:16 PM Bakul Shah via TUHS <<a href="mailto:tuhs@tuhs.org">tuhs@tuhs.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">You are a bit late with your screed. You will find posts<br>

with similar sentiments starting back in the 1980s in Usenet<br>

groups such as comp.lang.{c,misc,pascal}.<br>

<br>

Perhaps a more interesting (but likely pointless) question<br>

is what is the *least* that can be done to fix C's major<br>

problems.<br>

<br>

Compilers can easily add bounds checking for the array[index]<br>

construct but ptr[index] cannot be checked, unless we make<br>

a ptr a heavyweight object such as (address, start, limit).<br>

One can see how code can be generated for code such as this:<br>

<br>

        Foo x[count];<br>

        Foo* p = x + n; // or &x[n]<br>

<br>

Code such as "Foo *p = malloc(size);" would require the<br>

compiler to know how malloc behaves to be able to compute<br>

the limit. But for a user to write a similar function will<br>

require some language extension.<br>

<br>
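A minimal sketch of the (address, start, limit) idea, written by hand in today's C<br>
rather than generated by a compiler. The names FooPtr, fat_from_array and fat_index<br>
are hypothetical, and returning NULL stands in for whatever trap a real<br>
implementation would raise on a bounds violation.<br>

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int value; } Foo;

/* Hypothetical "heavyweight" pointer: (address, start, limit). */
typedef struct {
    Foo *addr;   /* where the pointer currently points */
    Foo *start;  /* first element of the underlying array */
    Foo *limit;  /* one past the last element */
} FooPtr;

/* What "Foo *p = x + n;" could become when x is "Foo x[count]". */
static FooPtr fat_from_array(Foo *base, size_t count, size_t n)
{
    assert(n <= count);  /* x + count (one past the end) is still legal */
    FooPtr p = { base + n, base, base + count };
    return p;
}

/* Checked form of p[i]; yields NULL instead of touching memory
 * outside [start, limit). A compiler would trap here instead. */
static Foo *fat_index(FooPtr p, ptrdiff_t i)
{
    if (i < p.start - p.addr || i >= p.limit - p.addr)
        return NULL;
    return p.addr + i;
}
```

With "Foo x[4]" and "p = fat_from_array(x, 4, 1)", fat_index(p, -1) yields &x[0]<br>
and fat_index(p, 3) yields NULL, since that index would land one past the end.<br>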

[Of course, if we did that, adding proper support for<br>

multidimensional slices would be far easier. But that<br>

is an exploration for another day!]<br></blockquote><div><br></div><div>The CHERI architecture extensions do this. It pushes this info into hardware</div><div>where all pointers point to a region (gross simplification) that also grants you</div><div>rights to the area (including read/write/execute). It's really cool, but it does come</div><div>at a cost in performance. Each pointer is a pointer, and a capability that's basically</div><div>a cryptographically signed bit of data that's the bounds and access permissions</div><div>associated with the pointer. There are more details on their web site:</div><div><a href="https://www.cl.cam.ac.uk/research/security/ctsrd/cheri/">https://www.cl.cam.ac.uk/research/security/ctsrd/cheri/</a></div></div></div></div></blockquote><div><br></div>Capabilities are heavier weight and perhaps overkill to use as pointers.</div><div>And that doesn't help programs on normal processors. I view a capability</div><div>architecture as better suited for microkernels -- a cap call would be akin to</div><div>a syscall + upcall to a server running in user code. For example</div><div>"read(file-cap, buffer-cap, size)" would need to be delivered to a fileserver</div><div>process etc. Basically a cap. is a ptr *across* a protection domain. We want</div><div>type safety (including bounds checking) within a protection domain (a process).</div><div><br></div><div><div>A compiler can often elide bounds checks or push them out of a loop.</div><div>Similarly for other smaller changes. The idea is to try to "fix" C with as little</div><div>rewriting as possible. 
Nobody is going to fund rewriting all 10M lines of</div><div>kernel code in C (& more in user code) into Rust (not to mention such from</div><div>scratch rewrites usually result in incompatibilities).</div><div><br></div><div>But we still seem to want maximum performance and maximum security</div><div>without paying for it (and if pushed, we live with bugs but not lower</div><div>performance even if processors are orders of magnitude faster now).</div></div><div><br><blockquote type="cite"><div><div dir="ltr"><div class="gmail_quote"><div><br></div><div>CHERI-BSD is a FreeBSD variant that runs on both CHERI variants (aarch64 and</div><div>riscv64) and where most of the research has been done.  There's a Linux</div><div>variant as well.</div><div><br></div><div>Members of this project know way too many of the corner cases of the C language</div><div>from porting most popular software to the CHERI...  And have gone on screeds of</div><div>their own. The only one I can easily find is</div><div><a href="https://people.freebsd.org/~brooks/talks/asiabsdcon2017-helloworld/helloworld.pdf">https://people.freebsd.org/~brooks/talks/asiabsdcon2017-helloworld/helloworld.pdf</a><br></div><div><br></div><div>Warner<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

Converting enums to behave like Pascal scalars would<br>

likely break things. The question is, can such breakage<br>

be fixed automatically (by source code conversion)?<br>

<br>

C's union type is used in two different ways: 1: similar<br>

to a sum type, which can be done type safely and 2: to<br>

cheat. The compiler should produce a warning when it can't<br>

verify a typesafe use -- one can add "unsafe" or some such<br>

to let the user absolve the compiler of such check.<br>

<br>
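As a sketch of use 1, here is a union wrapped in a tag so that it behaves like a<br>
sum type. The Shape type and its helper functions are hypothetical names chosen<br>
for illustration. The point is only that every read of a union member is guarded<br>
by a tag check, which is the discipline a compiler could try to verify.<br>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { SHAPE_CIRCLE, SHAPE_RECT } ShapeTag;

/* Hypothetical tagged union: the tag records which member is live. */
typedef struct {
    ShapeTag tag;
    union {
        struct { double radius; } circle;
        struct { double w, h; } rect;
    } u;
} Shape;

static Shape make_circle(double radius)
{
    Shape s;
    s.tag = SHAPE_CIRCLE;
    s.u.circle.radius = radius;
    return s;
}

static Shape make_rect(double w, double h)
{
    Shape s;
    s.tag = SHAPE_RECT;
    s.u.rect.w = w;
    s.u.rect.h = h;
    return s;
}

/* Type-safe use: a member is read only after its tag has been checked.
 * Returns false if the tag is not one of the known variants. */
static bool shape_area(const Shape *s, double *out)
{
    assert(s != NULL && out != NULL);
    switch (s->tag) {
    case SHAPE_CIRCLE:
        *out = 3.141592653589793 * s->u.circle.radius * s->u.circle.radius;
        return true;
    case SHAPE_RECT:
        *out = s->u.rect.w * s->u.rect.h;
        return true;
    }
    return false;
}
```

Use 2 (cheating) would be code that writes u.circle and then reads u.rect without<br>
consulting the tag; that is the case where a warning, or an explicit "unsafe"<br>
marker, would apply.<br>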

[Maybe naively] I tend to think one can evolve C this way<br>

and fix a lot of code &/or make a lot of bugs more explicit.<br>

<br>

> On Sep 20, 2024, at 10:11 AM, G. Branden Robinson <<a href="mailto:g.branden.robinson@gmail.com" target="_blank">g.branden.robinson@gmail.com</a>> wrote:<br>

> <br>

> At 2024-09-21T01:07:11+1000, Dave Horsfall wrote:<br>

>> Unless I'm mistaken (quite possible at my age), the OP was referring<br>

>> to that in C, pointers and arrays are pretty much the same thing i.e.<br>

>> "foo[-2]" means "take the pointer 'foo' and go back two things"<br>

>> (whatever a "thing" is).<br>

> <br>

> "in C, pointers and arrays are pretty much the same thing" is a common<br>

> utterance but misleading, and in my opinion, better replaced with a<br>

> different one.<br>

> <br>

> We should instead say something more like:<br>

> <br>

> In C, pointers and arrays have compatible dereference syntaxes.<br>

> <br>

> They do _not_ have compatible _declaration_ syntaxes.<br>

> <br>

> Chapter 4 of van der Linden's _Expert C Programming: Deep C Secrets_<br>

> (1994) tackles this issue head-on and at length.<br>

> <br>

> Here's the salient point.<br>

> <br>

> "Consider the case of an external declaration `extern char *p;` but a<br>

> definition of `char p[10];`.  When we retrieve the contents of `p[i]`<br>

> using the extern, we get characters, but we treat it as a pointer.<br>

> Interpreting ASCII characters as an address is garbage, and if you're<br>

> lucky the program will coredump at that point.  If you're not lucky it<br>

> will corrupt something in your address space, causing a mysterious<br>

> failure at some point later in the program."<br>

> <br>

>> C is just a high level assembly language;<br>

> <br>

> I disagree with this common claim too.  Assembly languages correspond to<br>

> well-defined machine models.[1]  Those machine models have memory<br>

> models.  C has no memory model--deliberately, because that would have<br>

> gotten in the way of performance.  (In practice, C's machine model was<br>

> and remains the PDP-11,[2] with aspects thereof progressively sanded off<br>

> over the years in repeated efforts to salvage the language's reputation<br>

> for portability.)<br>

> <br>

>> there is no such object as a "string" for example: it's just an "array<br>

>> of char" with the last element being "\0" (viz: "strlen" vs. "sizeof").<br>

> <br>

> Yeah, it turns out we need a well-defined string type much more<br>

> powerfully than, it seems, anyone at the Bell Labs CSRC appreciated.<br>

> string.h was tacked on (by Nils-Peter Nelson, as I understand it) at the<br>

> end of the 1970s and C aficionados have defended the language's<br>

> purported perfection with such vigor that they annexed the haphazardly<br>

> assembled standard library into the territory that they defend with much<br>

> rhetorical violence and overstatement.  From useless or redundant return<br>

> values to const-carelessness to Schlemiel the Painter algorithms in<br>

> implementations, it seems we've collectively made every mistake that<br>

> could be made with Nelson's original, minimal API, and taught those<br>

> mistakes as best practices in tutorials and classrooms.  A sorry affair.<br>

> <br>

> So deep was this disdain for the string as a well-defined data type, and<br>

> moreover one conceptually distinct from an array (or vector) of integral<br>

> types that Stroustrup initially repeated the mistake in C++.  People can<br>

> easily roll their own, he seemed to have thought.  Eventually he thought<br>

> again, but C++ took so long to get standardized that by then, damage was<br>

> done.<br>

> <br>

> "A string is just an array of `char`s, and a `char` is just a<br>

> byte"--another hasty equivalence that surrendered a priceless hostage to<br>

> fortune.  This is the sort of fallacy indulged by people excessively<br>

> wedded to machine language programming and who apply its perspective to<br>

> every problem statement uncritically.<br>

> <br>

> Again and again, with signed vs. unsigned bytes, "wide" vs. "narrow"<br>

> characters, and "base" vs. "combining" characters, the champions of the<br>

> "portable assembly" paradigm charged like Lord Cardigan into the pike<br>

> and musket lines of the character type as one might envision it in a<br>

> machine register.  (This insistence on visualizing register-level<br>

> representations has prompted numerous other stupidities, like the use of<br>

> an integral zero at the _language level_ to represent empty, null, or<br>

> false literals for as many different data types as possible.  "If it<br>

> ends up as a zero in a register," the thinking appears to have gone, "it<br>

> should look like a zero in the source code."  Generations of code--and<br>

> language--cowboys have screwed us all over repeatedly with this hasty<br>

> equivalence.)<br>

> <br>

> Type theorists have known better for decades.  But type theory is (1)<br>

> hard (it certainly is, to cowboys) and (2) has never enjoyed a trendy<br>

> day in the sun (for which we may be grateful), which means that it is<br>

> seldom on the path one anticipates to a comfortable retirement from a<br>

> Silicon Valley tech company (or several) on a private yacht.<br>

> <br>

> Why do I rant so splenetically about these issues?  Because the result<br>

> of such confusion is _bugs in programs_.  You want something concrete?<br>

> There it is.  Data types protect you from screwing up.  And the better<br>

> your data types are, the more care you give to specifying what sorts of<br>

> objects your program manipulates, the more thought you give to the<br>

> invariants that must be maintained for your program to remain in a<br>

> well-defined state, the fewer bugs you will have.<br>

> <br>

> But, nah, better to slap together a prototype, ship it, talk it up to<br>

> the moon as your latest triumph while interviewing with a rival of the<br>

> company you just delivered that prototype to, and look on in amusement<br>

> when your brilliant achievement either proves disastrous in deployment<br>

> or soaks up the waking hours of an entire team of your former colleagues<br>

> cleaning up the steaming pile you voided from your rock star bowels.<br>

> <br>

> We've paid a heavy price for C's slow and seemingly deeply grudging<br>

> embrace of the type concept.  (The lack of controlled scope for<br>

> enumeration constants is one example; the horrifyingly ill-conceived<br>

> choice of "typedef" as a keyword indicating _type aliasing_ is another.)<br>

> Kernighan did not help by trashing Pascal so hard in about 1980.  He was<br>

> dead right that Pascal needed, essentially, polymorphic subprograms in<br>

> array types.  Wirth not speccing the language to accommodate that back<br>

> in 1973 or so was a sad mistake.  But Pascal got a lot of other stuff<br>

> right--stuff that the partisanship of C advocates refused to countenance<br>

> such that they ended up celebrating C's flaws as features.  No amount of<br>

> Jonestown tea could quench their thirst.  I suspect the truth was more<br>

> that they didn't want to bother having to learn any other languages.<br>

> (Or if they did, not any language that anyone else on their team at work<br>

> had any facility with.)  A rock star plays only one instrument, no?<br>

> People didn't like it when Eddie Van Halen played keyboards instead of<br>

> guitar on stage, so he stopped doing that.  The less your coworkers<br>

> understand your work, the more of a genius you must be.<br>

> <br>

> Now, where was I?<br>

> <br>

>> What's the length of "abc" vs. how many bytes are needed to store it?<br>

> <br>

> Even what is meant by "length" has several different correct answers!<br>

> Quantity of code points in the sequence?  Number of "grapheme clusters"<br>

> a.k.a. "user-perceived characters" as Unicode puts it?  Width as<br>

> represented on the output device?  On an ASCII device these usually had<br>

> the same answer (control characters excepted).  But even at the Bell<br>

> Labs CSRC in the 1970s, thanks to troff, the staff knew that they didn't<br>

> necessarily have to.  (How wide is an em dash?  How many bytes represent<br>

> it, in the formatting language and in the output language?)<br>

> <br>
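To make the different "lengths" concrete, a small sketch follows. It assumes the<br>
em dash is stored as UTF-8 (an assumption, since the thread names no encoding),<br>
and utf8_code_points is a hypothetical helper that does not validate its input.<br>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "abc": strlen() gives 3, but the array needs 4 bytes for the NUL. */
static const char abc[] = "abc";

/* U+2014 EM DASH, assuming UTF-8: three bytes plus the NUL. */
static const char em_dash[] = "\xE2\x80\x94";

/* Count code points in a UTF-8 string by skipping continuation
 * bytes (those of the form 10xxxxxx). No validation is done. */
static size_t utf8_code_points(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

Here strlen(abc) is 3 while sizeof abc is 4, and strlen(em_dash) is 3 while<br>
utf8_code_points(em_dash) is 1. Neither number says how wide the dash is on the<br>
output device.<br>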

>> Giggle...  In a device driver I wrote for V6, I used the expression<br>

>> <br>

>>    "0123"[n]<br>

>> <br>

>> and the two programmers whom I thought were better than me had to ask<br>

>> me what it did...<br>

>> <br>

>> -- Dave, brought up on PDP-11 Unix[*]<br>

> <br>

> I enjoy this application of that technique, courtesy of Alan Cox.<br>

> <br>

>  fsck-fuzix: blow 90 bytes on a progress indicator<br>

> <br>

>  static void progress(void)<br>

>  {<br>

>      static uint8_t progct;<br>

>      progct++;<br>

>      progct&=3;<br>

>      printf("%c\010", "-\\|/"[progct]);<br>

>      fflush(stdout);<br>

>  }<br>

> <br>

>> I still remember the days of BOS/PICK/etc, and I staked my career on<br>

>> Unix.<br>

> <br>

> Not a bad choice.  Your exposure to and recollection of other ways of<br>

> doing things, I suspect, made you a more valuable contributor than those<br>

> who mazed themselves with thoughts of "the Unix way" to the point that<br>

> they never seriously considered any other.<br>

> <br>

> It's fine to prefer "the C way" or "the Unix way", if you can<br>

> intelligibly define what that means as applied to the issue in dispute,<br>

> and coherently defend it.  Demonstrating an understanding of the<br>

> alternatives, and being able to credibly explain why they are inferior<br>

> approaches, is how to do advocacy correctly.<br>

> <br>

> But it is not the cowboy way.  The rock star way.<br>

> <br>

> Regards,<br>

> Branden<br>

> <br>

> [1] Unfortunately I must concede that this claim is less true than it<br>

>    used to be thanks to the relentless pursuit of trade-secret means of<br>

>    optimizing hardware performance.  Assembly languages now correspond,<br>

>    particularly on x86, to a sort of macro language that imperfectly<br>

>    masks a massive amount of microarchitectural state that the<br>

>    implementors themselves don't completely understand, at least not in<br>

>    time to get the product to market.  Hence the field day of<br>

>    speculative execution attacks and similar.  It would not be fair to<br>

>    say that CPUs of old had _no_ microarchitectural state--the Z80, for<br>

>    example, had the not-completely-official `W` and `Z` registers--but<br>

>    they did have much less of it, and correspondingly less attack<br>

>    surface for screwing your programs.  I do miss the days of<br>

>    deterministic cycle counts for instruction execution.  But I know<br>

>    I'd be sad if all the caches on my workaday machine switched off.<br>

> <br>

> [2] <a href="https://queue.acm.org/detail.cfm?id=3212479" rel="noreferrer" target="_blank">https://queue.acm.org/detail.cfm?id=3212479</a><br>

<br>

</blockquote></div></div>

</div></blockquote></div><br></body></html>