4BSD/usr/man/man8/crash.8

Compare this file to the similar file:
Show the results in this format:
.TH CRASH 8 VAX-11
.UC 4
.SH NAME
crash \- what happens when the system crashes
.SH DESCRIPTION
This section explains what happens when the system crashes and how
you can get a crash dump for analysis of non-transient problems.
.PP
When the system crashes voluntarily it prints a message of the form
.IP
panic: why i gave up the ghost
.LP
on the console, and then invokes an automatic reboot procedure as
described in
.IR reboot (8).
If the auto-reboot switch is off on the console, then the processor
will simply halt at this point.
Otherwise the registers and the top few locations of the stack will
be printed on the console, and then the system will check the disks
and (unless some unexpected inconsistency is encountered), resume
multi-user operations.
.PP
The system has a large number of internal consistency checks; if one
of these fails, then it will panic with a very short message indicating
which one failed.  In the absence of a dump, little can be done about
one of these.  If the problem recurs, you should arrange to get a dump
for further analysis by running with auto-reboot disabled during normal
working hours and then following the procedure described below.
.PP
The most common cause of system failures is hardware failure, which
can reflect itself in different ways.  Here are the messages which
you are likely to encounter, with some hints as to causes.
Left unstated in all cases is the possibility that hardware or software
error produced the message in some unexpected way.
.TP
IO err in push
.ns
.TP
hard IO err in swap
The system encountered an error trying to write to the paging device
or an error in reading critical information from a disk drive.
You should fix your disk if it is broken or unreliable.
.TP
Timeout table overflow
.ns
.TP
ran out of bdp's
.ns
.TP
ran out of uba map
These really shouldn't be panics, but until we fix up the data structures
involved, running out of entries causes a crash.  If the timeout table
overflows, you should make it bigger.  If you run out of bdp's or uba map
you probably have a buggy device driver in your system, allocating and
not releasing UNIBUS resources.
.TP
KSP not valid
.ns
.TP
SBI fault
.ns
.TP
Machine check
.ns
.TP
CHM? in kernel
These indicate either a serious bug in the system or, more often,
a glitch or failing hardware.  For the machine check, the top part of
the resulting stack frame gives more information.  You can refer to a
VAX 11/780 System Maintenance Guide for information on machine checks.
If machine checks or SBI faults recur, check out the hardware or call
field service.  If the other faults recur, there is likely a bug somewhere
in the system, although these can be caused by a flakey processor.
Run processor microdiagnostics.
.TP
trap type %d, code=%d
A unexpected trap has occurred within the system; the trap types are:
.RS
.TP 10
0
reserved addressing mode
.br
.ns
.TP 10
1
privileged instruction
.br
.ns
.TP 10
2
BPT
.br
.ns
.TP 10
3
XFC
.br
.ns
.TP 10
4
reserved operand
.br
.ns
.TP 10
5
CHMK (system call)
.br
.ns
.TP 10
6
arithmetic trap
.br
.ns
.TP 10
7
reschedule trap (software level 3)
.br
.ns
.TP 10
8
segmentation fault
.br
.ns
.TP 10
9
protection fault
.br
.ns
.TP 10
10
trace pending (TP bit)
.RE
.IP
The favorite trap type in system crashes is trap type 9, indicating
a wild reference.  The code is the referenced address.  If you look
down the stack, just after the trap type and the code are the pc and
the ps of the processor when it trapped, showing you where in the
system the problem occurred.  These problems tend to be easy to track
down if they are kernel bugs since the processor stops cold, but random
flakiness seems to cause this sometimes, e.g. we have trapped with
code 80000800 three times in six months as an instruction fetch went across
this page boundary in the kernel but have been unable to find any reason
for this to have happened.
.TP
init died
The system initialization process has exited.  This is bad news, as no new
users will then be able to log in.  Rebooting is the only fix, so the
system just does it right away.
.PP
That completes the list of panic types you are likely to see.
Now for the crash dump procedure:
.PP
At the moment a dump can be taken only on magnetic tape.
Before you do anything, be sure that a clean tape is mounted with a ring-in
on the tape drive if you plan to make a dump.
.PP
Write the date and time on the console log.
Use the console commands to examine the registers, program status long word,
and the top several locations on the stack.
A suggested command sequence, which is executed by the \*(lq@DUMP\*(rq
console command script, is:
.DS
.nf
	E PSL<return>
	E R0/NE:F<return>
	E SP<return>
	E/V @ /NE:40<return>
.fi
.DE
If hardware problems dictate a special set of commands be executed when
the system crashes, a sequence of commands can be saved using the console
command \*(lqLINK\*(rq to be reexecuted with \*(lqPERFORM\*(rq (which can be
abbreviated \*(lqP\*(rq).
If a dump is to be taken on magnetic tape (this is a good idea
in most any case where the cause of the crash is not immediately obvious)
then the following commands will (should) be executed:
.DS
.nf
	D PSL 0<return>
	D PC 80000200<return>
	C<return>
.fi
.DE
These commands are actually part of the standard \*(lq@DUMP\*(rq script.
This should write a copy of all of memory
on the tape, followed by two EOF marks.
Caution:
Any error is taken to mean the end of memory has been reached.
This means that you must be sure the ring is in,
the tape is ready, and the tape is clean and new.
.PP
If there are not 40(hex) locations active on the kernel stack when the
procedure is begun, then the console may begin to print error diagnostics.
You can stop this by hitting \*(lq^C\*(rq (control-C), and then give the
last three commands above.
.PP
If the dump fails, you can try again,
but some of the registers will be lost.
See below for what to do with the tape.
.PP
To restart after a crash, follow the directions in
.IR reboot (8);
if the virtual memory subsystem is suspected as the cause of the crash,
then a version of the system other than \*(lqvmunix\*(rq should be booted
which will leave the paging areas temporarily intact
for use by the post-mortem analysis program
.I analyze.
After checking your root file system consistency with
.IR fsck (8),
you can read the core dump tape into the file /vmcore with
.IP
dd if=/dev/rmt0 of=/vmcore bs=20b
.LP
It does not work to use just
.IR cp (1),
as the tape is blocked.
With the system still in single-user mode, run the analysis program
.I analyze,
e.g.:
.IP
analyze \-s /dev/drum /vmcore /vmunix
.LP
and save the output.
Then boot up
\*(lqvmunix\*(rq
and let it do the automatic reboot, i.e. to boot multi-user from
an RM03/RM05/RP06 on the MASSBUS
.IP
>>> BOOT RPM
.PP
After rebooting, to analyze a dump you should execute
.I "ps \-alxk"
to print the process table at the time of the crash.
Use
.IR adb (1)
to examine
.IR /vmcore .
The location
.I dumpstack\-80000000
is the bottom of a stack onto which were pushed the stack pointer
.BR sp ,
.B PCBB
(containing the physical address of a
.IR u_area ),
.BR MAPEN ,
.BR IPL ,
and registers
.BR r13 \- r0
(in that order).
.BR r13 (fp)
is the system frame pointer and the stack is used in standard
.B calls
format.  Use
.IR  adb (1)
to get a reverse calling order.
In most cases this procedure will give
an idea of what is wrong.
A more complete discussion
of system debugging is impossible here.
See, however,
.IR analyze (8)
for some more hints.
.SH "SEE ALSO"
analyze(8), reboot(8)
.br
.I "VAX 11/780 System Maintenance Guide"
for more information about machine checks.
.SH BUGS