[COFF] [TUHS] Re: To NDEBUG or not to NDEBUG, that is the question
Douglas McIlroy via COFF
coff at tuhs.org
Wed Nov 12 09:14:36 AEST 2025
That's an impressive tale that I will remember in contrast to the
ancient Bell Labs debugging story below. The somewhat garbled
description comes from a transcript of a talk I gave at the Labs in
the mid-90s. (I can't imagine what I said that is recorded as "to no
specs".)
Another branch of computing was for the government. The whole Whippany
Laboratories [time check] Whippany, where we took on contracts for the
government particularly in the computing era in anti-missile defense,
missile defense, and underwater sound.

Missile defense was a very impressive undertaking. It was about in the
early ’63 time frame when it was estimated the amount of computation to
do a reasonable job of tracking incoming missiles would be 30 M floating
point operations a second. In the day of the Cray that doesn’t sound
like a great lot, but it’s more than your high end PCs can do. And the
machines were supposed to be reliable. They designed the machines at
Whippany, a twelve-processor multiprocessor, to no specs, enormously
rugged, one watt transistors. This thing in real life performed
remarkably well.

There were sixty-five missile shots, tests across the Pacific Ocean
?...? and Lorinda Cherry here actually sat there waiting for them to
come in. [laughter] And only a half dozen of them really failed.

As a measure of the interest in reliability, one of them failed
apparently due to processor error. Two people were assigned to look at
the dumps; enormous amounts of telemetry and logging information were
taken during these tests, which are truly expensive to run. Two people
were assigned to look at the dumps. A year later they had not found the
trouble. The team was beefed up. They finally decided that there was a
race condition in one circuit. They then realized that this particular
kind of race condition had not been tested for in all the simulations.
They went back and simulated the entire hardware system to see if its
a remote possibility of any similar cases, found twelve of them, and
changed the hardware. But to spend over a year looking for a bug is a
sign of what reliability meant.
Doug
On Tue, Nov 11, 2025 at 5:31 PM Steffen Nurpmeso via COFF <coff at tuhs.org> wrote:
>
> David Barto via COFF wrote in
> <C279FA32-538B-407D-9397-7A1CBE4DA243 at kdbarto.org>:
> |At a company I worked for we caught any exception (OOM, SIGTERM,
> |SIGHUP as examples) that would cause the application to exit. In
> |the exception handler we wrote out 100’s of MB of state data of the
> |program, including stack traces for all the threads (1000’s of
> |those) along with data structures and anything else we could think
> |of. (Memory allocation traces and queries that were running as
> |examples). This was done with very carefully crafted code which
> |could not call any other functions, nor allocate any memory.
> |
> |This was all written in a format that allowed us to load it into
> |the same database in our office where we could then write queries
> |against the data to see what happened and where the program was
> |when it occurred. We called the data dump an 'x-ray' and the
> |program that loaded it into the database and supported us examining
> |the data 'the doctor'.
> |
> |A common thing to hear was “I’m running the doctor on an x-ray from
> |customer <foo>”, or “the X-ray showed that we designed the query
> |wrong, it should have had a join <here> which would reduce the
> |memory footprint by N-GB”.
> |
> |As far as post-mortem debugging it was an amazing environment and
> |was exceptional at finding bugs in the code without having to use a
> |standard debugger. No core files required.[1]
> |
> |It also let us ’Take an X-Ray’ of the running system while on the
> |phone with the customer, allowing us to examine what was happening
> |before they did “the next step” which would crash the system.
> |
> | David
> |
> |[1] - there were several users of the system who would not let a
> |core file leave the building b/c of security.
>
> That is (beside the frightening sheer data size) fascinating.
> I.e., that you could analyze shipped code, as used by customers, to
> this extent. It could be that this is one of the key competences
> that make up "the difference".
> Still -- I would blindly have assumed that such expensive code
> check paths do not belong in shipped code, except perhaps in "beta"
> or "release candidate" builds that customers "could" run, maybe in
> conjunction with some later discount. But that post-mortem setup is
> impressive, yes.
>
> In the context of the subject of this thread, another thing comes
> to mind: the GNU C library *reduced* that extensive backtrace(3)
> info -- and a full eight years have passed since this one:
>
> https://patchwork.ozlabs.org/comment/1760227/
>
> which says
>
> malloc: Abort on heap corruption, without a backtrace [BZ #21754]
>
> ...
>
> I really think we should avoid generating backtraces on heap
> corruption. The process is in a precarious state at this point,
> and doing more work at this point risks obscuring the root cause
> of the corruption or enabling code execution exploits.
>
> When it came to backporting the corrupted arena avoidance
> patches, we asked internally if our support folks needed the
> backtraces, and they said they could live without it.
>
> The attached patch is just the minimum change required to get
> going. We can do quite a few cleanups afterwards (including
> removal of the corrupt arena avoidance logic, which does not
> work reliable anyway and never can).
>
> with, e.g., this response:
>
> It took me a long time of talking to support people to become
> convinced that what we provide here is relatively useless. The
> large-scale SOS reports we collect at Red Hat are much much more
> comprehensive than the dump glibc provides. Your pressure in
> this area has spurred me to these discussions, and what
> I thought was a semi-useful feature in the past is really
> looking like an unused security hazard. Thank you for changing
> my mind :-)
>
> Now this is something completely different from whether assert()
> is compiled in as a no-op or not, yet it came to my mind.
>
> I only wanted to add a different view, and I keep hoping that the
> control software of, for example, nuclear power plants, airplanes
> (regardless of the fact that I do not use them), certain chemical
> plants, etc. does not have active assert() code paths, nor even
> the VERIFY approach of OpenSSL that Rich Salz described.
> So I will not push development and test time down to users; I am
> happy not to go R(apid)A(pplication)D(evelopment). A good thing
> needs time, and most often many eyes. Slow food. But sometimes
> too many eyes make things worse -- for example, the IETF email
> arena beyond the pure and plain SMTP core. But they do work for
> who knows whom.
>
> --steffen
> |
> |Der Kragenbaer, The moon bear,
> |der holt sich munter he cheerfully and one by one
> |einen nach dem anderen runter wa.ks himself off
> |(By Robert Gernhardt)