[TUHS] 211bsd: kernel panic after a 'here document' in tcsh

Sat Jun 10 22:58:43 AEST 2017

Hi,

the kernel panic after tcsh here documents is understood.
And fixed, at least on my system.

The essential hint was Johnny's observation that on his system he gets
an "Illegal instruction - core dumped" and no kernel panic.

I'm using a self-build PDP 11/70 on an FPGA, see
   https://github.com/wfjm/w11/
   https://wfjm.github.io/home/w11/
which doesn't have a floating point unit. Therefore the kernel is build
with floating point emulation, thus with
   FPSIM   YES      # floating point simulator

In a kernel with FPSIM activated the trap handler trap(), see
   http://www.retro11.de/ouxr/211bsd/usr/src/sys/pdp/trap.c.html
calls for each user mode illegal instruction trap fpsim(). In case
it was a floating point instruction fpsim() emulates it, returns 0,
and trap() simply returns. If not, fpsim() returns the abort signal
type, and trap() calls psignal() with this signal type, which in
general will terminate the offending process.

The kernel panic is due to a coding error in mch_fpsim.s. Look in
   http://www.retro11.de/ouxr/211bsd/usr/src/sys/pdp/mch_fpsim.s.html
the code after label badins:

    badins:                         / Illegal Instruction
            mov     $SIGILL.,r0
            br      2b

The constant SIGILL is defined in assym.h as

    #define SIGILL 4.

Thus after substitution the mov instruction is

            mov     $4..,r0

with *two dots* !!! The 'as' assembler generates from this

            mov #160750,r0

So r0 will contain a invalid signal number, which is returned by fpsim() to
trap(). This signal number is passed to psignal(), which starts with

      mask = sigmask(sig);
      prop = sigprop[sig];

The access to sigprop[sig] results into an address in IO space, causes an
UNIBUS timeout, and in consequence the kernel panic.

After fixing the "$SIGILL." to "$SIGILL"  (removing the extraneous '.') and
three similar cases the kernel doesn't panic anymore, tcsh crashed with an
illegal instruction trap.

Remains the question why tcsh runs onto an illegal instruction. Getting now
a tcsh core dump adb gives the answer

   adb tcsh tcsh.core
     $c
       0172774: _rscan(0176024,0174434) from ~heredoc+0246
       0176040: _heredoc(067676) from ~execute+0234
       0176126: _execute(067040,01512,0,0) from ~execute+03410
       0176222: _execute(066754,01512,0,0) from ~process+01224
       0176274: _process(01) from ~main+06030
       0177414: _main() from start+0104

heredoc(), which is located in OV1, calls rscan(), which is in OV6 with

    rscan(Dv, Dtestq);

where Dtestq is a function pointer to Dtestq(), which is as heredoc() in OV1.
rscan(), which has the signature

      rscan(t, f)
           register Char **t;
           void    (*f) ();

uses 'f' in the statement

       (*f) (*p++);

The problem is that
   - heredoc() and Dtestq() are in OV1
   - that's why in the end ~Dtestq is used a function pointer, like
     for all overlay internal function invocations
   - rscan() is in OV6, when it's called, overlay is switched OV1 -> OV6
   - this invalidates the function pointer, which points to some random
     code location, which happens to hold '000045', causing a trap.

It is clear that in this context _Dtestq, the forwarder in the base, must
be used and not ~Dtestq, the entry point in the overlay. The generated
code for 'rscan(Dv, Dtestq)' is

       ~heredoc+0230:  mov     $0174434,(sp)         # arg Dtestq: uses ~Dtestq
       ~heredoc+0234:  mov     r5,-(sp)
       ~heredoc+0236:  add     $0177764,(sp)         # arg Dv
       ~heredoc+0242:  jsr     pc,*$_rscan

Since rscan() is very small and only used by heredoc() I simply moved the
code of rscan() from sh.glob.c (OV6) to sh.dol.c where also heredoc() and
Dtestq() is defined.

After that tcsh works fine with here documents
   ./tcsh
   cat >x.x <<EOF
   1
   $TERM
   $PWD
   EOF

   cat x.x
     1
     vt100-long
     /usr/src/bin/tcsh

Bottom line
   - fpsim was broken all the time
   - tcsh was broken all the time

I'm convert this into proper patches and send them to Steven, but this will
take some time because I've to tidy up my system to be again in the
position to provide proper and clean patch sets.

             With best regards,       Walter

P.S.: debugging the kernel issue was quite easy because the w11a CPU has
three essential 'build into the cpu' debug tools:
- a 'cpu monitor', which records 144 bits of processor state for the last 256
   instructions or vector fetches, see
     https://github.com/wfjm/w11/blob/master/rtl/w11a/pdp11_dmcmon.vhd
- a 'breakpoint unit' which allows to set instruction of data breakpoints
- an 'ibus monitor' which records the last 512 ibus transactions
After setting a breakpoint on the trap 004/010 handler an inspection of the
instruction trace gave the essential information. Below a very condensed
and annotated excerpt

  nc ....pc cprptnzvc ..dsrc ..ddst ..dres      vmaddr vmdata
#
# the "(*f) (*p++)" in tcsh, running onto an illegal instruction
#
  15 145210 uu00-.... 000105 173052 000105 w  d 173052 000105 mov r0,(sp)
  25 145212 uu00-.... 173050 174434 174434 w  d 173050 145216 jsr pc, at n(r5)
  19 174434 uu00-.... 000010 173064 000010 r  i 174434 000045 ?000045?
   1 174434 uu00-.... 000012 173064 000012 r  d 000010 000045 !VFETCH 010 RIT
#
# the "mov $SIGILL.,r0" in fpsim(), load 160750 instead of 000004
#
  17 160744 ku00-n..c 160750 000045 160750 r  i 160746 160750 mov #n,r0
  14 160750 ku00-n..c 160752 160750 160732 r  i 160750 000770 br .-14
#
# the "sigprop[sig]" access in psignal(), which accesses 174036
# which leads to a external bus (or UNIBUS) time out and IIT trap
#
  23 161314 ku00-.z.. 000000 147500 000000 w  d 147500 000000 mov r1,n(r5)
   9 161320 ku00-.z.. 174036 000000 000000 Ebto 174036 013066 movb n(r3),r0
   3 161320 ku00-.z.. 000006 000000 000006 r  d 000004 013066 !VFETCH 004 IIT