[TUHS] Maximum Array Sizes in 16 bit C
    Bakul Shah via TUHS 
    tuhs at tuhs.org
       
    Sat Sep 21 08:04:12 AEST 2024
    
    
  
> On Sep 20, 2024, at 1:58 PM, Warner Losh <imp at bsdimp.com> wrote:
> 
> 
> 
> On Fri, Sep 20, 2024 at 9:16 PM Bakul Shah via TUHS <tuhs at tuhs.org> wrote:
>> You are a bit late with your screed. You will find posts
>> with similar sentiments starting back in the 1980s in Usenet
>> groups such as comp.lang.{c,misc,pascal}.
>> 
>> Perhaps a more interesting (but likely pointless) question
>> is what is the *least* that can be done to fix C's major
>> problems.
>> 
>> Compilers can easily add bounds checking for the array[index]
>> construct but ptr[index] can not be checked, unless we make
>> a ptr a heavy weight object such as (address, start, limit).
>> One can see how code can be generated for code such as this:
>> 
>>         Foo x[count];
>>         Foo* p = x + n; // or &x[n]
>> 
>> Code such as "Foo *p = malloc(size);" would require the
>> compiler to know how malloc behaves to be able to compute
>> the limit. But for a user to write a similar function will
>> require some language extension.
>> 
>> [Of course, if we did that, adding proper support for
>> multidimensional slices would be far easier. But that
>> is an exploration for another day!]
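
To make the (address, start, limit) idea concrete, here is a rough hand-written
sketch of what such a heavyweight pointer could look like in plain C. The names
FatPtr, fat_from_array, fat_add and fat_load are hypothetical, and int stands in
for Foo; a compiler doing this for real would generate the equivalent checks
itself rather than expose functions like these.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fat pointer: an address plus the bounds it came from. */
typedef struct {
    int *addr;   /* current position */
    int *start;  /* first element of the underlying array */
    int *limit;  /* one past the last element */
} FatPtr;

/* What "Foo *p = x;" would carry along for "Foo x[count];". */
static FatPtr fat_from_array(int *x, size_t count)
{
    assert(x != NULL);
    FatPtr p = { x, x, x + count };
    return p;
}

/* Checked "p + n". As in plain C, one past the end is allowed. */
static bool fat_add(FatPtr p, ptrdiff_t n, FatPtr *out)
{
    if (n < p.start - p.addr || n > p.limit - p.addr)
        return false;
    p.addr += n;
    *out = p;
    return true;
}

/* Checked "p[i]": reads only inside [start, limit). */
static bool fat_load(FatPtr p, ptrdiff_t i, int *out)
{
    if (i < p.start - p.addr || i >= p.limit - p.addr)
        return false;
    *out = p.addr[i];
    return true;
}
```

With "int x[4]" and a pointer advanced by 2, fat_load accepts indices -2 through
1 and rejects everything else, which is the check that a bare ptr[index] cannot
do. The cost is visible too: the pointer is three words instead of one.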
> 
> The CHERI architecture extensions do this. It pushes this info into hardware
> where all pointers point to a region (gross simplification) that also grants you
> rights to the area (including read/write/execute). It's really cool, but it does come
> at a cost in performance. Each pointer is a pointer, and a capability that's basically
> a cryptographically signed bit of data that's the bounds and access permissions
> associated with the pointer. There's more details on their web site:
> https://www.cl.cam.ac.uk/research/security/ctsrd/cheri/
Capabilities are heavier weight and perhaps an overkill to use as pointers.
And that doesn't help programs on normal processors. I view a capability
architecture better suited for microkernels -- a cap call would be akin to
a syscall + upcall to a server running in user code. For example
"read(file-cap, buffer-cap, size)" would need to be delivered to a fileserver
process etc. Basically a cap. is a ptr *across* a protection domain. We want
type safety (including bounds checking) within a protection domain (a process).
A compiler can often elide bounds checks or push them out of a loop.
Similarly for other smaller changes. The idea is to try to "fix" C with as little
rewriting as possible. Nobody is going to fund rewriting all 10M lines of
kernel code in C (& more in user code) into Rust (not to mention that such
from-scratch rewrites usually result in incompatibilities).
But we still seem to want maximum performance and maximum security
without paying for it (and if pushed, we live with bugs but not lower
performance even if processors are orders of magnitude faster now).
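
As an illustration of pushing a bounds check out of a loop, here is a small
sketch written by hand. The function names sum_checked and sum_hoisted are made
up for this example; the point is only that the two forms accept and reject
exactly the same ranges, so a compiler that can prove the index stays in
[lo, hi) may replace the first form with the second.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Naive form: one bounds check per access. */
static bool sum_checked(const int *a, size_t len,
                        size_t lo, size_t hi, long *out)
{
    assert(a != NULL && out != NULL);
    long s = 0;
    for (size_t i = lo; i < hi; i++) {
        if (i >= len)          /* checked on every iteration */
            return false;
        s += a[i];
    }
    *out = s;
    return true;
}

/* Hoisted form: i only takes values in [lo, hi), so a single test
 * before the loop covers every access inside it. */
static bool sum_hoisted(const int *a, size_t len,
                        size_t lo, size_t hi, long *out)
{
    assert(a != NULL && out != NULL);
    if (lo < hi && hi > len)   /* checked once */
        return false;
    long s = 0;
    for (size_t i = lo; i < hi; i++)
        s += a[i];
    *out = s;
    return true;
}
```

The hoisted loop body is the same as the unchecked C one would write today, so
in cases like this the safety costs one comparison per loop, not one per
element.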
> 
> CHERI-BSD is a FreeBSD variant that runs on both CHERI variants (aarch64 and
> riscv64) and where most of the research has been done.  There's also a Linux
> variant as well.
> 
> Members of this project know way too many of the corner cases of the C language
> from porting most popular software to the CHERI...  And have gone on screeds of
> their own. The only one I can easily find is
> https://people.freebsd.org/~brooks/talks/asiabsdcon2017-helloworld/helloworld.pdf
> 
> Warner
>  
>> Converting enums to behave like Pascal scalars would
>> likely break things. The question is, can such breakage
>> be fixed automatically (by source code conversion)?
>> 
>> C's union type is used in two different ways: 1: similar
>> to a sum type, which can be done type safely and 2: to
>> cheat. The compiler should produce a warning when it can't
>> verify a typesafe use -- one can add "unsafe" or some such
>> to let the user absolve the compiler of such check.
>> 
>> [Maybe naively] I tend to think one can evolve C this way
>> and fix a lot of code &/or make a lot of bugs more explicit.
>> 
>> > On Sep 20, 2024, at 10:11 AM, G. Branden Robinson <g.branden.robinson at gmail.com> wrote:
>> > 
>> > At 2024-09-21T01:07:11+1000, Dave Horsfall wrote:
>> >> Unless I'm mistaken (quite possible at my age), the OP was referring
>> >> to that in C, pointers and arrays are pretty much the same thing i.e.
>> >> "foo[-2]" means "take the pointer 'foo' and go back two things"
>> >> (whatever a "thing" is).
>> > 
>> > "in C, pointers and arrays are pretty much the same thing" is a common
>> > utterance but misleading, and in my opinion, better replaced with a
>> > different one.
>> > 
>> > We should instead say something more like:
>> > 
>> > In C, pointers and arrays have compatible dereference syntaxes.
>> > 
>> > They do _not_ have compatible _declaration_ syntaxes.
>> > 
>> > Chapter 4 of van der Linden's _Expert C Programming: Deep C Secrets_
>> > (1994) tackles this issue head-on and at length.
>> > 
>> > Here's the salient point.
>> > 
>> > "Consider the case of an external declaration `extern char *p;` but a
>> > definition of `char p[10];`.  When we retrieve the contents of `p[i]`
>> > using the extern, we get characters, but we treat it as a pointer.
>> > Interpreting ASCII characters as an address is garbage, and if you're
>> > lucky the program will coredump at that point.  If you're not lucky it
>> > will corrupt something in your address space, causing a mysterious
>> > failure at some point later in the program."
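
The mismatched extern itself needs two translation units and is undefined
behaviour, so the sketch below only shows the underlying difference in a
single file: an array object is its elements, while a pointer object is a
separate piece of storage holding an address. The helper names are made up
for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static char arr[10] = "abcdefghi";  /* array object: 10 bytes of chars */
static char *ptr = arr;             /* pointer object: holds an address */

/* sizeof sees the difference even though arr[i] and ptr[i] look alike. */
static size_t array_object_size(void)   { return sizeof arr; }
static size_t pointer_object_size(void) { return sizeof ptr; }

/* The array's own address is the address of its first element. */
static bool array_storage_is_elements(void)
{
    return (void *)&arr == (void *)&arr[0];
}

/* The pointer's own address is not where the characters live. */
static bool pointer_storage_is_elements(void)
{
    return (void *)&ptr == (void *)&ptr[0];
}

/* Compatible dereference syntax: both spellings reach the same char. */
static bool same_element(size_t i)
{
    assert(i < sizeof arr);
    return arr[i] == ptr[i];
}
```

With "extern char *p" against a definition "char p[10]", the compiler emits the
pointer-style access: it loads the first sizeof(char *) bytes of the array as
if they were an address, which is the garbage van der Linden describes.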
>> > 
>> >> C is just a high level assembly language;
>> > 
>> > I disagree with this common claim too.  Assembly languages correspond to
>> > well-defined machine models.[1]  Those machine models have memory
>> > models.  C has no memory model--deliberately, because that would have
>> > gotten in the way of performance.  (In practice, C's machine model was
>> > and remains the PDP-11,[2] with aspects thereof progressively sanded off
>> > over the years in repeated efforts to salvage the language's reputation
>> > for portability.)
>> > 
>> >> there is no such object as a "string" for example: it's just an "array
>> >> of char" with the last element being "\0" (viz: "strlen" vs. "sizeof").
>> > 
>> > Yeah, it turns out we need a well-defined string type much more
>> > powerfully than, it seems, anyone at the Bell Labs CSRC appreciated.
>> > string.h was tacked on (by Nils-Peter Nelson, as I understand it) at the
>> > end of the 1970s and C aficionados have defended the language's
>> > purported perfection with such vigor that they annexed the haphazardly
>> > assembled standard library into the territory that they defend with much
>> > rhetorical violence and overstatement.  From useless or redundant return
>> > values to const-carelessness to Schlemiel the Painter algorithms in
>> > implementations, it seems we've collectively made every mistake that
>> > could be made with Nelson's original, minimal API, and taught those
>> > mistakes as best practices in tutorials and classrooms.  A sorry affair.
>> > 
>> > So deep was this disdain for the string as a well-defined data type, and
>> > moreover one conceptually distinct from an array (or vector) of integral
>> > types that Stroustrup initially repeated the mistake in C++.  People can
>> > easily roll their own, he seemed to have thought.  Eventually he thought
>> > again, but C++ took so long to get standardized that by then, damage was
>> > done.
>> > 
>> > "A string is just an array of `char`s, and a `char` is just a
>> > byte"--another hasty equivalence that surrendered a priceless hostage to
>> > fortune.  This is the sort of fallacy indulged by people excessively
>> > wedded to machine language programming and who apply its perspective to
>> > every problem statement uncritically.
>> > 
>> > Again and again, with signed vs. unsigned bytes, "wide" vs. "narrow"
>> > characters, and "base" vs. "combining" characters, the champions of the
>> > "portable assembly" paradigm charged like Lord Cardigan into the pike
>> > and musket lines of the character type as one might envision it in a
>> > machine register.  (This insistence on visualizing register-level
>> > representations has prompted numerous other stupidities, like the use of
>> > an integral zero at the _language level_ to represent empty, null, or
>> > false literals for as many different data types as possible.  "If it
>> > ends up as a zero in a register," the thinking appears to have gone, "it
>> > should look like a zero in the source code."  Generations of code--and
>> > language--cowboys have screwed us all over repeatedly with this hasty
>> > equivalence.)
>> > 
>> > Type theorists have known better for decades.  But type theory is (1)
>> > hard (it certainly is, to cowboys) and (2) has never enjoyed a trendy
>> > day in the sun (for which we may be grateful), which means that it is
>> > seldom on the path one anticipates to a comfortable retirement from a
>> > Silicon Valley tech company (or several) on a private yacht.
>> > 
>> > Why do I rant so splenetically about these issues?  Because the result
>> > of such confusion is _bugs in programs_.  You want something concrete?
>> > There it is.  Data types protect you from screwing up.  And the better
>> > your data types are, the more care you give to specifying what sorts of
>> > objects your program manipulates, the more thought you give to the
>> > invariants that must be maintained for your program to remain in a
>> > well-defined state, the fewer bugs you will have.
>> > 
>> > But, nah, better to slap together a prototype, ship it, talk it up to
>> > the moon as your latest triumph while interviewing with a rival of the
>> > company you just delivered that prototype to, and look on in amusement
>> > when your brilliant achievement either proves disastrous in deployment
>> > or soaks up the waking hours of an entire team of your former colleagues
>> > cleaning up the steaming pile you voided from your rock star bowels.
>> > 
>> > We've paid a heavy price for C's slow and seemingly deeply grudging
>> > embrace of the type concept.  (The lack of controlled scope for
>> > enumeration constants is one example; the horrifyingly ill-conceived
>> > choice of "typedef" as a keyword indicating _type aliasing_ is another.)
>> > Kernighan did not help by trashing Pascal so hard in about 1980.  He was
>> > dead right that Pascal needed, essentially, polymorphic subprograms in
>> > array types.  Wirth not speccing the language to accommodate that back
>> > in 1973 or so was a sad mistake.  But Pascal got a lot of other stuff
>> > right--stuff that the partisanship of C advocates refused to countenance
>> > such that they ended up celebrating C's flaws as features.  No amount of
>> > Jonestown tea could quench their thirst.  I suspect the truth was more
>> > that they didn't want to bother having to learn any other languages.
>> > (Or if they did, not any language that anyone else on their team at work
>> > had any facility with.)  A rock star plays only one instrument, no?
>> > People didn't like it when Eddie Van Halen played keyboards instead of
>> > guitar on stage, so he stopped doing that.  The less your coworkers
>> > understand your work, the more of a genius you must be.
>> > 
>> > Now, where was I?
>> > 
>> >> What's the length of "abc" vs. how many bytes are needed to store it?
>> > 
>> > Even what is meant by "length" has several different correct answers!
>> > Quantity of code points in the sequence?  Number of "grapheme clusters"
>> > a.k.a. "user-perceived characters" as Unicode puts it?  Width as
>> > represented on the output device?  On an ASCII device these usually had
>> > the same answer (control characters excepted).  But even at the Bell
>> > Labs CSRC in the 1970s, thanks to troff, the staff knew that they didn't
>> > necessarily have to.  (How wide is an em dash?  How many bytes represent
>> > it, in the formatting language and in the output language?)
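
A small sketch of how the different notions of "length" come apart, assuming
the strings are valid UTF-8. The helper utf8_code_points is hypothetical and
does no validation; it simply counts every byte that is not a continuation
byte (10xxxxxx). Grapheme clusters and display width need Unicode tables and
are not attempted here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "abc": 3 characters by strlen, 4 bytes of storage with the NUL. */
static const char abc[] = "abc";

/* U+2014 EM DASH encoded as UTF-8: three bytes, one code point. */
static const char em_dash[] = "\xE2\x80\x94";

/* 'e' followed by U+0301 COMBINING ACUTE ACCENT: three bytes,
 * two code points, one user-perceived character. */
static const char e_acute[] = "e\xCC\x81";

/* Count code points in a NUL-terminated string, assuming valid UTF-8. */
static size_t utf8_code_points(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)  /* not a continuation byte */
            n++;
    }
    return n;
}
```

So strlen(abc) is 3 while sizeof abc is 4, and for the em dash strlen reports 3
where utf8_code_points reports 1. None of these numbers says how wide the text
is on the output device.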
>> > 
>> >> Giggle...  In a device driver I wrote for V6, I used the expression
>> >> 
>> >>    "0123"[n]
>> >> 
>> >> and the two programmers whom I thought were better than me had to ask
>> >> me what it did...
>> >> 
>> >> -- Dave, brought up on PDP-11 Unix[*]
>> > 
>> > I enjoy this application of that technique, courtesy of Alan Cox.
>> > 
>> >  fsck-fuzix: blow 90 bytes on a progress indicator
>> > 
>> >  static void progress(void)
>> >  {
>> >      static uint8_t progct;
>> >      progct++;
>> >      progct&=3;
>> >      printf("%c\010", "-\\|/"[progct]);
>> >      fflush(stdout);
>> >  }
>> > 
>> >> I still remember the days of BOS/PICK/etc, and I staked my career on
>> >> Unix.
>> > 
>> > Not a bad choice.  Your exposure to and recollection of other ways of
>> > doing things, I suspect, made you a more valuable contributor than those
>> > who mazed themselves with thoughts of "the Unix way" to the point that
>> > they never seriously considered any other.
>> > 
>> > It's fine to prefer "the C way" or "the Unix way", if you can
>> > intelligibly define what that means as applied to the issue in dispute,
>> > and coherently defend it.  Demonstrating an understanding of the
>> > alternatives, and being able to credibly explain why they are inferior
>> > approaches, is how to do advocacy correctly.
>> > 
>> > But it is not the cowboy way.  The rock star way.
>> > 
>> > Regards,
>> > Branden
>> > 
>> > [1] Unfortunately I must concede that this claim is less true than it
>> >    used to be thanks to the relentless pursuit of trade-secret means of
>> >    optimizing hardware performance.  Assembly languages now correspond,
>> >    particularly on x86, to a sort of macro language that imperfectly
>> >    masks a massive amount of microarchitectural state that the
>> >    implementors themselves don't completely understand, at least not in
>> >    time to get the product to market.  Hence the field day of
>> >    speculative execution attacks and similar.  It would not be fair to
>> >    say that CPUs of old had _no_ microarchitectural state--the Z80, for
>> >    example, had the not-completely-official `W` and `Z` registers--but
>> >    they did have much less of it, and correspondingly less attack
>> >    surface for screwing your programs.  I do miss the days of
>> >    deterministic cycle counts for instruction execution.  But I know
>> >    I'd be sad if all the caches on my workaday machine switched off.
>> > 
>> > [2] https://queue.acm.org/detail.cfm?id=3212479
>> 