Hi,
the kernel panic after tcsh here documents is understood.
And fixed, at least on my system.
The essential hint was Johnny's observation that on his system he gets
an "Illegal instruction - core dumped" and no kernel panic.
I'm using a self-built PDP 11/70 on an FPGA, see
https://github.com/wfjm/w11/
https://wfjm.github.io/home/w11/
which doesn't have a floating point unit. Therefore the kernel is built
with floating point emulation, thus with
FPSIM YES # floating point simulator
In a kernel with FPSIM activated the trap handler trap(), see
http://www.retro11.de/ouxr/211bsd/usr/src/sys/pdp/trap.c.html
calls fpsim() for each user mode illegal instruction trap. If it was
a floating point instruction, fpsim() emulates it, returns 0,
and trap() simply returns. If not, fpsim() returns the abort signal
type, and trap() calls psignal() with this signal type, which in
general will terminate the offending process.
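
The control flow just described can be sketched in a few lines. This is a toy model in Python, used only for illustration and not the 2.11BSD source; the helper looks_like_fp() and the callback style are made up for the demo, and it assumes that PDP-11 floating point opcodes are those in the octal range 170000-177777.

```python
# Hypothetical model of the trap()/fpsim() flow described above; the names
# mirror the kernel functions, but none of this is the 2.11BSD source.
SIGILL = 4  # value of SIGILL as quoted from assym.h


def fpsim(opcode, is_fp_instruction):
    """Return 0 if the instruction was emulated, else the signal to deliver."""
    if is_fp_instruction(opcode):
        return 0          # emulated successfully, nothing to report
    return SIGILL         # not a floating point instruction


def trap(opcode, is_fp_instruction, psignal):
    """Handle a user mode illegal instruction trap with FPSIM enabled."""
    sig = fpsim(opcode, is_fp_instruction)
    if sig == 0:
        return None       # emulated, the process simply resumes
    psignal(sig)          # otherwise deliver the signal fpsim() asked for
    return sig


def looks_like_fp(opcode):
    # Assumption for the demo: FP opcodes occupy 170000-177777 octal.
    return (opcode & 0o170000) == 0o170000


delivered = []
trap(0o170001, looks_like_fp, delivered.append)   # an FP opcode: emulated
trap(0o000045, looks_like_fp, delivered.append)   # not FP: signal delivered
print(delivered)
```

The point of the sketch is only that whatever fpsim() returns goes straight into psignal(), so a wrong return value is never checked on the way.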
The kernel panic is due to a coding error in mch_fpsim.s. Look in
http://www.retro11.de/ouxr/211bsd/usr/src/sys/pdp/mch_fpsim.s.html
the code after label badins:
badins: / Illegal Instruction
mov $SIGILL.,r0
br 2b
The constant SIGILL is defined in assym.h as
#define SIGILL 4.
Thus after substitution the mov instruction is
mov $4..,r0
with *two dots* !!! The 'as' assembler generates from this
mov #160750,r0
So r0 will contain an invalid signal number, which is returned by fpsim() to
trap(). This signal number is passed to psignal(), which starts with
mask = sigmask(sig);
prop = sigprop[sig];
The access to sigprop[sig] results in an address in I/O space, causes a
UNIBUS timeout, and in consequence the kernel panic.
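
One plausible reading of where 160750 comes from, sketched in Python for illustration only. It rests on assumptions that are not stated in this mail: that in 'as' the token "4." is decimal 4, that a lone "." is the location counter, and that two operands with no operator between them are added. The address 0160744 of the mov instruction is taken from the instruction trace in the P.S.; the function name is made up.

```python
# Assumptions (not stated in the mail): in 'as', "4." is decimal 4, "." is
# the location counter, and adjacent operands without an operator are added.
def eval_juxtaposed(decimal_literal, location_counter):
    """Value of an 'as' expression like "4.." under the assumptions above."""
    return (decimal_literal + location_counter) & 0o177777  # 16-bit wrap


# 0160744 is the address of the mov instruction in the instruction trace.
value = eval_juxtaposed(4, 0o160744)
print(oct(value))
```

Under these assumptions the result matches the constant seen in the generated code, which would also explain why the bogus signal number differs from one kernel build to another.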
After fixing the "$SIGILL." to "$SIGILL" (removing the extraneous '.') and
three similar cases the kernel doesn't panic anymore; tcsh now crashes with an
illegal instruction trap.
The question remains why tcsh runs into an illegal instruction. Now that I get
a tcsh core dump, adb gives the answer
adb tcsh tcsh.core
$c
0172774: _rscan(0176024,0174434) from ~heredoc+0246
0176040: _heredoc(067676) from ~execute+0234
0176126: _execute(067040,01512,0,0) from ~execute+03410
0176222: _execute(066754,01512,0,0) from ~process+01224
0176274: _process(01) from ~main+06030
0177414: _main() from start+0104
heredoc(), which is located in OV1, calls rscan(), which is in OV6 with
rscan(Dv, Dtestq);
where Dtestq is a function pointer to Dtestq(), which, like heredoc(), is in OV1.
rscan(), which has the signature
rscan(t, f)
register Char **t;
void (*f) ();
uses 'f' in the statement
(*f) (*p++);
The problem is that
- heredoc() and Dtestq() are in OV1
- that's why in the end ~Dtestq is used as the function pointer, as
for all overlay internal function invocations
- rscan() is in OV6, when it's called, overlay is switched OV1 -> OV6
- this invalidates the function pointer, which points to some random
code location, which happens to hold '000045', causing a trap.
It is clear that in this context _Dtestq, the forwarder in the base, must
be used and not ~Dtestq, the entry point in the overlay. The generated
code for 'rscan(Dv, Dtestq)' is
~heredoc+0230: mov $0174434,(sp) # arg Dtestq: uses ~Dtestq
~heredoc+0234: mov r5,-(sp)
~heredoc+0236: add $0177764,(sp) # arg Dv
~heredoc+0242: jsr pc,*$_rscan
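
Why the raw overlay address goes stale can be shown with a toy model. This is a Python sketch for illustration, not the 2.11BSD overlay mechanism: the Machine class, its method names and the table contents are made up, and only the address 0174434 and the word 000045 come from the listings in this mail.

```python
# Toy model, not the real overlay mechanism: one shared address window
# whose contents depend on which overlay is currently mapped.
OVERLAY_WINDOW = {
    1: {0o174434: "Dtestq"},       # ~Dtestq lives here while OV1 is mapped
    6: {0o174434: "word 000045"},  # in OV6 the same address holds other bits
}


class Machine:
    def __init__(self):
        self.current_overlay = 1

    def call_raw(self, address):
        """Jump to an address inside the overlay window, as (*f)() does."""
        return OVERLAY_WINDOW[self.current_overlay].get(address, "garbage")

    def call_via_base_thunk(self, overlay, address):
        """What a base forwarder like _Dtestq does: map, call, restore."""
        saved = self.current_overlay
        self.current_overlay = overlay
        try:
            return self.call_raw(address)
        finally:
            self.current_overlay = saved


m = Machine()
m.current_overlay = 6                       # heredoc() has called rscan() in OV6
print(m.call_raw(0o174434))                 # stale ~Dtestq pointer
print(m.call_via_base_thunk(1, 0o174434))   # forwarder reaches the real Dtestq
```

The forwarder costs an overlay switch per call, which is presumably why overlay internal calls normally bypass it.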
Since rscan() is very small and only used by heredoc() I simply moved the
code of rscan() from sh.glob.c (OV6) to sh.dol.c where also heredoc() and
Dtestq() are defined.
After that tcsh works fine with here documents
./tcsh
cat >x.x <<EOF
1
$TERM
$PWD
EOF
cat x.x
1
vt100-long
/usr/src/bin/tcsh
Bottom line
- fpsim was broken all the time
- tcsh was broken all the time
I'll convert this into proper patches and send them to Steven, but this will
take some time because I have to tidy up my system to be again in a
position to provide proper and clean patch sets.
With best regards, Walter
P.S.: debugging the kernel issue was quite easy because the w11a CPU has
three essential 'built into the cpu' debug tools:
- a 'cpu monitor', which records 144 bits of processor state for the last 256
instructions or vector fetches, see
https://github.com/wfjm/w11/blob/master/rtl/w11a/pdp11_dmcmon.vhd
- a 'breakpoint unit' which allows to set instruction or data breakpoints
- an 'ibus monitor' which records the last 512 ibus transactions
After setting a breakpoint on the trap 004/010 handler an inspection of the
instruction trace gave the essential information. Below a very condensed
and annotated excerpt
nc ....pc cprptnzvc ..dsrc ..ddst ..dres vmaddr vmdata
#
# the "(*f) (*p++)" in tcsh, running onto an illegal instruction
#
15 145210 uu00-.... 000105 173052 000105 w d 173052 000105 mov r0,(sp)
25 145212 uu00-.... 173050 174434 174434 w d 173050 145216 jsr pc,@n(r5)
19 174434 uu00-.... 000010 173064 000010 r i 174434 000045 ?000045?
1 174434 uu00-.... 000012 173064 000012 r d 000010 000045 !VFETCH 010 RIT
#
# the "mov $SIGILL.,r0" in fpsim(), load 160750 instead of 000004
#
17 160744 ku00-n..c 160750 000045 160750 r i 160746 160750 mov #n,r0
14 160750 ku00-n..c 160752 160750 160732 r i 160750 000770 br .-14
#
# the "sigprop[sig]" access in psignal(), which accesses 174036
# which leads to an external bus (or UNIBUS) time out and IIT trap
#
23 161314 ku00-.z.. 000000 147500 000000 w d 147500 000000 mov r1,n(r5)
9 161320 ku00-.z.. 174036 000000 000000 Ebto 174036 013066 movb n(r3),r0
3 161320 ku00-.z.. 000006 000000 000006 r d 000004 013066 !VFETCH 004 IIT
Arnold gets it right on the Pascal indexing.
In UCSD Pascal you could specify any array bounds you would like and
the compiler would 0 base them for you by always doing a subtraction,
or addition if your min was negative, of your min array index. So a little
run time cost for non-zero based arrays.
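
A minimal sketch of that rebasing, written in Python purely for illustration. The class name BoundedArray is made up, and this shows the general technique (check the bounds, then subtract the lower bound), not what any particular Pascal compiler emitted.

```python
class BoundedArray:
    """Array with arbitrary inclusive bounds stored in a zero-based list."""

    def __init__(self, low, high, fill=0):
        if high < low:
            raise ValueError("upper bound below lower bound")
        self.low, self.high = low, high
        self._cells = [fill] * (high - low + 1)

    def _offset(self, index):
        # Run-time bounds check, then the subtraction that rebases to zero.
        if not self.low <= index <= self.high:
            raise IndexError(f"index {index} outside [{self.low}..{self.high}]")
        return index - self.low

    def __getitem__(self, index):
        return self._cells[self._offset(index)]

    def __setitem__(self, index, value):
        self._cells[self._offset(index)] = value


foo = BoundedArray(5, 10)      # like: foo : array[5..10] of integer;
foo[5] = 42                    # stored in cell 0
neg = BoundedArray(-3, 3)      # negative minimum: the "subtraction" adds 3
neg[-3] = 7
print(foo._cells, neg._cells)
```

For a zero-based array the subtraction is by 0 and can be dropped, which is the small run time saving mentioned above.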
I’m not sure how other Pascal compilers did this.
I find it interesting that there are now a slew of testing programs
(Valgrind, Address Sanitizer, Purify, etc) that will add the ‘missing’
array bounds checking for C.
David
> On Jun 7, 2017, at 10:01 AM, tuhs-request(a)minnie.tuhs.org wrote:
>
> Date: Wed, 07 Jun 2017 07:20:43 -0600
> From: arnold(a)skeeve.com
> To: tuhs(a)tuhs.org, ag4ve.us(a)gmail.com
> Subject: Re: [TUHS] Array index history
> Message-ID: <201706071320.v57DKhmJ026303(a)freefriends.org>
> Content-Type: text/plain; charset=us-ascii
>
> Pascal (IIRC) allowed you to specify upper and lower bounds, something
> like
>
> foo : array[5..10] of integer;
>
> with runtime bounds checking on array accesses. (I could be wrong ---
> it's been a LLLLOOONNNGGG time.)
>
> HTH,
>
> Arnold
On 2017-06-07 19:01, "Ron Natalie"<ron(a)ronnatalie.com> wrote:
> The original FORTRAN and BASIC arrays started indexing at one because everybody other than computer scientists start counting at 1.
FORTRAN, yes. BASIC (which dialect might we be talking about?) normally
actually starts with 0. However, BASIC is weird, in that the DIM
statement is actually specifying the highest usable index, and not the
size of the array.
Thus:
DIM X(10)
means you get an array with 11 elements. So, people who wanted to use
arrays starting at 1 would still be happy, and if you wanted to start at
0, that also worked. You might unintentionally have a bit of wasted
memory, though.
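
The DIM behaviour can be mimicked in a couple of lines. This is a Python sketch for illustration only; the helper dim() is made up and models just the "highest usable index" rule described above, not any specific BASIC dialect.

```python
def dim(highest_index, fill=0):
    """Model BASIC's DIM X(n): n is the highest usable index, not the size."""
    return [fill] * (highest_index + 1)


x = dim(10)            # DIM X(10)
x[0] = 1               # works if you count from 0
x[10] = 2              # works if you count from 1
print(len(x))
```

Someone who only ever uses indices 1 to 10 never notices element 0, which is the bit of wasted memory.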
> These languages were for scientists and the beginner, so you wanted to make things compatible with their normal concepts.
True.
> PASCAL on the other hand required you to give the minimum and maximum index for the array.
In a way, PASCAL makes the most sense. You say what range you want,
and you get that. Anything works, and it's up to you.
That said, PASCAL could get a bit ugly when passing arrays as arguments
to functions because of this.
> Of course, C’s half-assaed implementation of arrays kind of depends on zero-indexing to work.
:-)
Johnny
--
Johnny Billquist || "I'm on a bus
|| on a psychedelic trip
email: bqt(a)softjar.se || Reading murder books
pdp is alive! || tryin' to stay hip" - B. Idol
On 2017-06-07 22:14, "Walter F.J. Mueller"<w.f.j.mueller(a)retro11.de> wrote:
> Hi,
>
> a few remarks on the feedback on the kernel panic after a 'here document' in tcsh.
>
> To Michael Kjörling question:
> > I'm curious whether the same thing happens if you try that in some
> > other shell? (Not sure how widely here documents were supported back
> > then, but I'm asking anyway.)
> And Johnny Billquist remark
> > Not sure if any of the other shells have this.
>
> 'here documents' are available and work fine in sh and csh.
> And are in fact used, examples
Ah. Thanks. Too lazy to check.
> To Michael Kjörling remark
> > The PC value in the panic report ("pc 161324") strikes me as high
> and Johnny Billquist remark
> > This is in kernel mode, and that is in the I/O page.
>
> 211bsd uses split I/D space and uses all 64 kB I space for code.
D'oh! Color me stupid. I should have thought of that.
> The top 8 kB are in fact the overlay area, and the crash happened
> in overlay 4 (as indicated by ov 4). With a simple
>
> nm /unix | sort | grep " 4"
>
> one gets
>
> 161254 t ~psignal 4
> 162302 t ~issignal 4
>
> so the crash is just 050 bytes after the entry point of psignal. So the
> PC address is fine and not the problem. For psignal look at
>
> http://www.retro11.de/ouxr/211bsd/usr/src/sys/sys/kern_sig.c.html#s:_psignal
>
> the crash must be one of the first lines. psignal is an internal kernel
> function, called from
>
> http://www.retro11.de/ouxr/211bsd/usr/src/sys/sys/kern_sig.c.html#xref:s:_p…
>
> and has nothing to do with the libc function psignal
>
> http://www.retro11.de/ouxr/211bsd/usr/man/cat3/psignal.0.html
> http://www.retro11.de/ouxr/211bsd/usr/src/lib/libc/gen/psignal.c.html
The libc function would be in user mode, so that one was pretty clear.
Ok. Digging through this a little for real then.
psignal gets called with a signal from the trap handler. The actual
signal is weird. It would appear to be 0160750, which would be -7704 if
I'm counting right. That does not make sense as a signal.
The psignal code pulls a value based on the signal number, which is the
line:
prop = sigprop[sig];
which uses the signal number as an index. With a random, weird signal
number, this accesses wherever that might end up. Which is when you get
the crash.
On my system, sigprop is at address 0012172, which, with a signal of
-7704 ends up at address 0173142, which by (un)luck happens to be in the
middle of the diagnostics bootstrap rom space. So I don't get a Unibus
timeout error, while you do. Probably because sigprop is at a slightly
different address in your kernel.
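
The arithmetic above can be checked with a few lines. This is a Python sketch for illustration; the function names are made up, and it assumes sigprop is an array of bytes, so that the index is added to the base address unscaled.

```python
def to_signed16(word):
    """Interpret a 16-bit word as a two's complement integer."""
    word &= 0xFFFF
    return word - 0x10000 if word & 0x8000 else word


def indexed_address(base, index):
    """Address of element `index` of a byte array, wrapped to 16 bits."""
    return (base + index) & 0xFFFF


sig = to_signed16(0o160750)
print(sig)                                    # the signal number as a signed int
print(oct(indexed_address(0o012172, sig)))    # where sigprop[sig] ends up
```

With a different base address for sigprop the same index lands elsewhere in the I/O page, so whether a device answers there depends on the kernel build and the machine configuration.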
So, the real question is how trap can be calling psignal with such a
broken signal number.
I might dig further down that question another day. But unless you
already got this far, I might have saved you a few minutes of digging. I
did start looking into the trap code, which is in pdp/trap.c, but this
is not entirely straightforward. It goes through a bunch of things
trying to decide what signal to send, before actually calling psignal.
> To Johnny Billquist remark
> > Could you (Walter) try the latest version of 2.11BSD and see if you
> > still get that crash?
>
> very interesting that you see a core dump of tcsh rather than a kernel panic.
Indeed.
> Whatever tcsh does, it should not lead to a kernel panic, and if it does,
> it is primarily a bug of the kernel. It looks like there are two issues,
> one in tcsh, and one in the kernel. I've a hunch where this might come from,
> but that will take a weekend or two to check on.
Agree that the kernel should not crash on this.
Also, tcsh should not really crash either, but it's a separate issue,
even though one might have triggered the other here.
But yes, there are two bugs in here.
If you can recreate the kernel crash on the latest version, that would
be good.
But it smells like trap.c has some path where it does not even set what
signal to deliver, and then calls psignal with whatever the variable i
got at the function start. Which would be some random stuff on the stack.
Johnny
On 2017-06-08 22:17, Dave Horsfall<dave(a)horsfall.org> wrote:
>
> Just to diverge from this thread a little, it probably isn't all that
> remarkable that programming languages tend to reflect the hardware for
> which they were designed.
>
> Thus, for example, we have the C construct:
>
> do { ... } while (--i);
>
> which translated right into the PDP-11's "SOB" instruction (and
> reminiscent of FORTRAN's insistence that DO loops are run at least once
> (there was a CACM article about that once; anyone have a pointer to it?)).
>
> And of course the afore-mentioned FORTRAN, which really reflects the
> underlying IBM 70x architecture (shudder).
FORTRAN stopped running loops at least once with FORTRAN 77.
The last version that insisted on running loops at least once was FORTRAN IV.
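
The difference between these loop styles can be counted out in a short sketch. It is Python, used only for illustration; the function names are made up, and the trip count formula is the one I believe the FORTRAN 77 standard gives, MAX(INT((m2 - m1 + m3)/m3), 0). The "always once" variant models the older behaviour as described above.

```python
def do_while_trips(i):
    """Trips of C's `do { ... } while (--i);` for a starting value i >= 1."""
    trips = 0
    while True:
        trips += 1          # the body runs before the test
        i -= 1
        if i == 0:
            break
    return trips


def fortran77_trips(m1, m2, m3=1):
    """Iteration count of DO I = m1, m2, m3 under the FORTRAN 77 rule."""
    return max(int((m2 - m1 + m3) / m3), 0)


def one_trip_do(m1, m2, m3=1):
    """Older behaviour as described above: the body always runs once."""
    return max(fortran77_trips(m1, m2, m3), 1)


print(do_while_trips(3), fortran77_trips(5, 1), one_trip_do(5, 1))
```

A DO loop from 5 to 1 with step 1 is the case where the two FORTRAN behaviours differ: zero trips under the newer rule, one trip under the older one.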
Johnny
I learned the other day that array indexes in some languages start at 1
instead of 0. This seems to be an old trend that changed around the 70s?
Who started this? Why was the change made?
It seems to have come about around the same time as C, but interestingly
enough Lua is kinda in between (you can start an array at 0 or 1).
Smalltalk can probably have a 0 base index just by its nature, but I
wonder whether that would work in a 40 year old interpreter.
> Basically, until C came along, the standard practice was for indices
> to start at 1. Certainly Fortran and Pascal did it that way.
Mercury Autocode used 0.
http://www.homepages.ed.ac.uk/jwp/history/mercury/manual/autocode/4.jpg
-- Richard
Hi,
a few remarks on the feedback on the kernel panic after a 'here document' in tcsh.
To Michael Kjörling's question:
> I'm curious whether the same thing happens if you try that in some
> other shell? (Not sure how widely here documents were supported back
> then, but I'm asking anyway.)
And Johnny Billquist's remark
> Not sure if any of the other shells have this.
'here documents' are available and work fine in sh and csh.
And they are in fact used, for example
/usr/adm/daily (a /bin/sh script)
su uucp << EOF
/etc/uucp/clean.daily
EOF
/usr/crash/why (a /bin/csh script)
adb -k {unix,core}.$1 << 'EOF'
version/sn"Backtrace:"n
$c
'EOF'
To Michael Kjörling's remark
> The PC value in the panic report ("pc 161324") strikes me as high
and Johnny Billquist's remark
> This is in kernel mode, and that is in the I/O page.
211bsd uses split I/D space and uses all 64 kB I space for code.
The top 8 kB are in fact the overlay area, and the crash happened
in overlay 4 (as indicated by ov 4). With a simple
nm /unix | sort | grep " 4"
one gets
161254 t ~psignal 4
162302 t ~issignal 4
so the crash is just 050 bytes after the entry point of psignal. So the
PC address is fine and not the problem. For psignal look at
http://www.retro11.de/ouxr/211bsd/usr/src/sys/sys/kern_sig.c.html#s:_psignal
the crash must be one of the first lines. psignal is an internal kernel
function, called from
http://www.retro11.de/ouxr/211bsd/usr/src/sys/sys/kern_sig.c.html#xref:s:_p…
and has nothing to do with the libc function psignal
http://www.retro11.de/ouxr/211bsd/usr/man/cat3/psignal.0.html
http://www.retro11.de/ouxr/211bsd/usr/src/lib/libc/gen/psignal.c.html
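
The lookup done by eye on the nm output can be written down as a small sketch. This is Python for illustration only; the function locate() is made up, and the table holds just the two overlay 4 symbols quoted above.

```python
import bisect

# Overlay 4 symbols as printed by the nm pipeline above (octal addresses),
# sorted by address.
SYMBOLS = [(0o161254, "~psignal"), (0o162302, "~issignal")]


def locate(pc, symbols):
    """Return (name, offset) of the last symbol at or below pc."""
    addresses = [addr for addr, _ in symbols]
    pos = bisect.bisect_right(addresses, pc) - 1
    if pos < 0:
        raise ValueError("pc is below the first symbol")
    addr, name = symbols[pos]
    return name, pc - addr


name, offset = locate(0o161324, SYMBOLS)
print(name, oct(offset))
```

This only works within one overlay, since different overlays reuse the same address range.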
To Johnny Billquist's remark
> Could you (Walter) try the latest version of 2.11BSD and see if you
> still get that crash?
very interesting that you see a core dump of tcsh rather than a kernel panic.
Whatever tcsh does, it should not lead to a kernel panic, and if it does,
it is primarily a bug of the kernel. It looks like there are two issues,
one in tcsh, and one in the kernel. I've a hunch where this might come from,
but that will take a weekend or two to check on.
With best regards, Walter
On 2017-06-06 04:00, Michael Kjörling <michael(a)kjorling.se> wrote:
>
> On 5 Jun 2017 16:12 +0200, from w.f.j.mueller(a)retro11.de (Walter F.J. Mueller):
>> I'm using 211bsd (Version 447) and found that a 'here document' in tcsh
>> leads to a kernel panic. It's absolutely reproducible on my system, both
>> when I run it on my FPGA PDP-11 or in simh. Just doing
>>
>> tcsh
>> cat << EOF
> I'm curious whether the same thing happens if you try that in some
> other shell? (Not sure how widely here documents were supported back
> then, but I'm asking anyway.)
Not sure if any of the other shells have this. We're basically talking
csh, sh and ksh unless I remember wrong.
But it's a good question. If no one else has tried it by tomorrow, I
could check.
>> is enough, and I get
>>
>> ka6 31333 aps 147472
>> pc 161324 ps 30004
>> ov 4
>> cpuerr 20
>> trap type 0
>> panic: trap
>> syncing disks... done
>>
>> looking at the crash dump gives
>>
>> cd /etc/crash
>> ./why 4
>> Backtrace:
>> 0147372: _boot(05000,0100) from ~panic+072
>> 0147414: _etext(011350) from ~trap+0350
>> 0147450: ~trap() from call+040
>> 0147516: _psignal(0101520,0160750) from ~trap+0364
>> 0147554: ~trap() from call+040
>>
>> so the crash is in psignal, which is afaik the kernel internal
>> mechanism to dispatch signals.
> The PC value in the panic report ("pc 161324") strikes me as high, but
> 161324 octal is 58068 decimal, so it's not excessively so, and perhaps
> in line with what one might expect to see with a kernel pinned near
> top of memory. Are the offsets in the backtrace constant, i.e. does it
> always crash on the same code?
161324 is way high. This is in kernel mode, and that is in the I/O page.
Basically no code lives in the I/O page (some boot roms and hardware
diagnostics excepted). This smells like corrupted memory (pointer or
stack), or something else very funny.
> Not knowing what cpuerr 20 is specifically doesn't help, and at least
> http://www.retro11.de/ouxr/29bsd/usr/src/sys/sys/trap.c.html#n:112
> (which doesn't seem to be too far from what you are running) isn't
> terribly enlightening; CPUERR is simply a pointer into a memory-mapped
> register of some kind, as seen at
> http://www.retro11.de/ouxr/29bsd/usr/include/sys/iopage.h.html#m:CPUERR,
> and at least pdp11_cpumod.c from the simh source code at
> http://simh.trailing-edge.com/interim/pdp11_cpumod.c wasn't terribly
> enlightening, though of course I could be looking in entirely the
> wrong place.
Like others said - the cpu error register is documented in the processor
handbook.
020 means Unibus Timeout, which is consistent with trying to access
something in the I/O page, where there is no device configured to
respond to that address.
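
Decoding the register value can be sketched as follows. This is Python for illustration only; the function name is made up, and the bit table is how I remember the PDP-11/70 CPU error register from the processor handbook, so treat everything except the 020 entry as an assumption to verify there.

```python
# Bit assignments of the PDP-11/70 CPU error register as remembered from
# the processor handbook; only the 020 entry is confirmed by this thread.
CPUERR_BITS = {
    0o200: "illegal HALT",
    0o100: "odd address error",
    0o040: "non-existent memory",
    0o020: "UNIBUS timeout",
    0o010: "yellow zone stack limit",
    0o004: "red zone stack limit",
}


def decode_cpuerr(value):
    """List the names of all error bits set in a cpuerr value."""
    return [name for bit, name in CPUERR_BITS.items() if value & bit]


print(decode_cpuerr(0o20))
```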
I just tried the same thing on a simh system here, and I do not get a
crash. This is on 2.11BSD at patch level 449, running on an emulated 11/94.
I do however get tcsh to crash.
simh:/home/bqt> su -
Password:
erase, kill ^U, intr ^C
# tcsh
simh:/# cat << EOF
Illegal instruction - core dumped
#
Suspended (tty input)
simh:/home/bqt>
simh:/home/bqt> cat /VERSION
Current Patch Level: 448
Date: January 5, 2010
Yes, it says patch level 448, but it really is 449. This was the system
where I worked together with Steven when doing the 449 patch set, but I
never got around to actually updating the VERSION file itself.
Also, this was while running on the console.
Could you (Walter) try the latest version of 2.11BSD and see if you
still get that crash?
Johnny
Hi,
I'm using 211bsd (Version 447) and found that a 'here document' in tcsh
leads to a kernel panic. It's absolutely reproducible on my system, both
when I run it on my FPGA PDP-11 or in simh. Just doing
tcsh
cat << EOF
is enough, and I get
ka6 31333 aps 147472
pc 161324 ps 30004
ov 4
cpuerr 20
trap type 0
panic: trap
syncing disks... done
looking at the crash dump gives
cd /etc/crash
./why 4
Backtrace:
0147372: _boot(05000,0100) from ~panic+072
0147414: _etext(011350) from ~trap+0350
0147450: ~trap() from call+040
0147516: _psignal(0101520,0160750) from ~trap+0364
0147554: ~trap() from call+040
so the crash is in psignal, which is afaik the kernel internal
mechanism to dispatch signals.
Questions:
1. has anybody seen this before ?
2. any idea what the reason could be ?
With best regards, Walter