[COFF] [TUHS] Re: To NDEBUG or not to NDEBUG, that is the question
Steffen Nurpmeso via COFF
coff at tuhs.org
Wed Nov 12 08:31:25 AEST 2025
David Barto via COFF wrote in
<C279FA32-538B-407D-9397-7A1CBE4DA243 at kdbarto.org>:
|At a company I worked for we caught any exception (OOM, SIGTERM, SIGHUP \
|as examples) that would cause
|the application to exit. In the exception handler we wrote out 100’s \
|of MB of state data of the program, including
|stack traces for all the threads (1000’s of those) along with data \
|structures and anything else we could think of.
|(Memory allocation traces and queries that were running as examples). \
|This was done with very carefully crafted
|code which could not call any other functions, nor allocate any memory.
|
|This was all written in a format that allowed us to load it into the \
|same database in our office where we could then
|write queries against the data to see what happened and where the program \
|was when it occurred. We called
|the data dump an 'x-ray' and the program that loaded it into the database \
|and supported us examining the data
|’the doctor’.
|
|A common thing to hear was “I’m running the doctor on an x-ray from \
|customer <foo>”, or “the X-ray showed that
|we designed the query wrong, it should have had a join <here> which \
|would reduce the memory footprint by N-GB”
|
|As far as post-mortem debugging it was an amazing environment and was \
|exceptional at finding bugs in the code
|without having to use a standard debugger. No core files required.[1]
|
|It also let us ’Take an X-Ray’ of the running system while on the phone \
|with the customer, allowing us to examine
|what was happening before they did “the next step” which would crash \
|the system.
|
| David
|
|[1] - there were several users of the system who would not let a core \
|file leave the building b/c of security.
That is (besides the frightening sheer data size) fascinating.
I.e., that you could analyze shipout code as used by customers to
this extent. It could be that this is one of the key competences
that make up "the difference".
Still -- i would blindly assume that the more expensive code
check paths, and all that, do not belong in shipout code, except
possibly in "beta" or "release candidate" states that customers
"could" opt into, maybe in conjunction with some later discount,
you know? But that post-mortem is impressive, yes.
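To make the "no other functions, no memory allocation" constraint
concrete, here is a minimal sketch of such a handler, assuming a
POSIX system. It is not the x-ray code described above; the names
fmt_uint, emit, on_fatal and install_xray_handler are hypothetical,
and the handler only reports the signal number, using nothing but
stack buffers, write(2) and _exit(2).

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Format v as decimal into dst without snprintf() or malloc().
 * Returns the number of bytes written, or 0 if cap is too small. */
size_t fmt_uint(char *dst, size_t cap, unsigned long v)
{
    char tmp[24]; /* enough for a 64-bit value (20 digits) */
    size_t n = 0;

    do {
        tmp[n++] = (char)('0' + (int)(v % 10));
        v /= 10;
    } while (v != 0);

    if (n > cap)
        return 0;
    for (size_t i = 0; i < n; i++)
        dst[i] = tmp[n - 1 - i]; /* digits were produced in reverse */
    return n;
}

/* Thin wrapper around write(2); a failure cannot be handled here. */
static void emit(const char *p, size_t len)
{
    ssize_t r = write(STDERR_FILENO, p, len);
    (void)r;
}

/* Hypothetical handler: stack buffers and async-signal-safe calls only. */
static void on_fatal(int sig)
{
    static const char msg[] = "x-ray: caught signal ";
    char num[24];
    size_t n = fmt_uint(num, sizeof num, (unsigned long)sig);

    emit(msg, sizeof msg - 1);
    emit(num, n);
    emit("\n", 1);
    _exit(128 + sig);
}

/* Install the handler for one signal; returns 0 on success, -1 on error. */
int install_xray_handler(int sig)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_fatal;
    sigemptyset(&sa.sa_mask);
    return sigaction(sig, &sa, NULL);
}
```

A program would call install_xray_handler(SIGTERM) and the like once
at startup. A real dump would write preallocated state to a file
descriptor opened in advance, but the restriction stays the same.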
In the context of this thread's subject, another thing is that,
for example, the GNU/Linux C library *reduced* that extensive
backtrace(3) info .. and really eight years have passed since
this one:
https://patchwork.ozlabs.org/comment/1760227/
which says
  malloc: Abort on heap corruption, without a backtrace [BZ #21754]
  ...
  I really think we should avoid generating backtraces on heap
  corruption. The process is in a precarious state at this point,
  and doing more work at this point risks obscuring the root cause
  of the corruption or enabling code execution exploits.
  When it came to backporting the corrupted arena avoidance
  patches, we asked internally if our support folks needed the
  backtraces, and they said they could live without it.
  The attached patch is just the minimum change required to get
  going. We can do quite a few cleanups afterwards (including
  removal of the corrupt arena avoidance logic, which does not
  work reliable anyway and never can).
with, e.g., this response
  It took me a long time of talking to support people to become
  convinced that what we provide here is relatively useless. The
  large-scale SOS reports we collect at Red Hat are much much more
  comprehensive than the dump glibc provides. Your pressure in
  this area has spurred me to these discussions, and what
  I thought was a semi-useful feature in the past is really
  looking like an unused security hazard. Thank you for changing
  my mind :-)
Now this is something completely different from whether assert()
is compiled in or is a no-op, yet it came to my mind.
I only wanted to add a different view, and keep on hoping that
the control software of, for example, nuclear power plants,
airplanes (even though i do not use them), certain chemical
plants, etc. does not have active assert() code paths, and not
even the VERIFY approach of OSSL that Rich Salz mentioned.
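For readers who want the distinction spelled out, here is a small
sketch under stated assumptions. CHECK_OR_FAIL and bounded_len are
hypothetical names invented for illustration; this is not OSSL's
actual macro. The assert() disappears when NDEBUG is defined, while
the always-on check stays in every build and lets the caller fail
cleanly instead of aborting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical always-on check: unlike assert() it survives NDEBUG.
 * It logs the failure and evaluates to 0, so the caller can bail out. */
#define CHECK_OR_FAIL(expr) \
    ((expr) ? 1 : (fprintf(stderr, "%s:%d: check failed: %s\n", \
                           __FILE__, __LINE__, #expr), 0))

/* Store the length of s in *out if s is terminated within cap bytes.
 * Returns 0 on success, -1 on bad input. */
int bounded_len(const char *s, size_t cap, size_t *out)
{
    size_t n = 0;

    /* Programmer error: compiled out when NDEBUG is defined. */
    assert(out != NULL);

    /* Runtime condition: checked in every build. */
    if (!CHECK_OR_FAIL(s != NULL))
        return -1;

    while (n < cap && s[n] != '\0')
        n++;

    /* No terminator within cap bytes: report it, do not abort. */
    if (!CHECK_OR_FAIL(n < cap))
        return -1;

    *out = n;
    return 0;
}
```

Building the same file with and without -DNDEBUG only changes the
assert() line; whether even the always-on path belongs in control
software is exactly the question raised above.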
So i will not push development and test time down to users; i am
happy not to go R(apid)A(pp)D(evel): a good thing needs time, and
most often many eyes. Slow food. But sometimes too many eyes make
things worse, for example in the IETF email arena beyond the pure
and plain SMTP core. But they do work for who knows whom.
--steffen
|
|Der Kragenbaer, The moon bear,
|der holt sich munter he cheerfully and one by one
|einen nach dem anderen runter wa.ks himself off
|(By Robert Gernhardt)