Process restart.

Bennet Yee bsy at PLAY.MACH.CS.CMU.EDU
Sun Nov 13 11:14:58 AEST 1988


In article <18 at elgar.UUCP> ag at elgar.UUCP (Keith Gabryelski) writes:
>
>Long running processes that don't have any means of shutdown/restart
>built into them are what I am thinking of.
>
>Let's say we have this process computing prime numbers (or some other
>simple case) and the system needs to be shutdown because of some fatal
>error.  Can a snapshot be done?

I did exactly this about two years ago.  My implementation of M.O. Rabin's
probabilistic primality test ran for about a week of real time on a uVax II,
surviving multiple reboots and system crashes, before finding a 1000-digit
probabilistic prime....  I don't know how much real CPU time it took -- the
machine was a general-purpose machine (I ran my program niced 19) and I
didn't keep track of timing info.  In retrospect it would have been easy: I
had it checkpoint every 5 minutes of CPU time anyway, so all I needed to do
was increment a counter at each checkpoint.  Anyway, since the program's I/O
behavior is very simple (it generated output only just before completing,
and I only redirected its stdout to a file), it was particularly simple to
checkpoint the process.
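
A minimal sketch of the "checkpoint every 5 minutes of CPU time and bump a
counter" idea, not the original code.  It assumes getrusage() is available;
the names cpu_seconds, checkpoint_due, approx_total_cpu, ck_last_cpu and
ck_count are hypothetical.  The counter only survives a crash if it is part
of the state that gets checkpointed.

```c
#include <assert.h>
#include <sys/time.h>
#include <sys/resource.h>

#define CK_INTERVAL_SEC 300.0          /* 5 minutes of CPU time */

static double ck_last_cpu = 0.0;       /* CPU seconds at the last checkpoint */
static unsigned long ck_count = 0;     /* checkpoints taken so far */

/* User + system CPU seconds used by this process, or -1.0 on error. */
double cpu_seconds(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1.0;
    return (double)ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6
         + (double)ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
}

/* Return 1 (and count it) if a checkpoint is due at `now` CPU seconds. */
int checkpoint_due(double now)
{
    assert(now >= 0.0);
    /* After a restart the process CPU clock starts again from zero,
     * so a restored ck_last_cpu can be larger than `now`. */
    if (now < ck_last_cpu)
        ck_last_cpu = 0.0;
    if (now - ck_last_cpu < CK_INTERVAL_SEC)
        return 0;
    ck_last_cpu = now;
    ck_count++;
    return 1;
}

/* Rough total CPU time across restarts: one interval per checkpoint. */
double approx_total_cpu(void)
{
    return ck_count * CK_INTERVAL_SEC;
}
```

The main loop would call checkpoint_due(cpu_seconds()) every so often and
write a checkpoint whenever it returns 1.  The estimate undercounts by
whatever CPU time was used after the last checkpoint before each crash.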

I thought about the case of replacing open/close with library routines and
syscall'ing the traps after saving state; at a checkpoint, we can fstat the
known descriptors so we can restore.  This would work only for files, of
course, and I didn't bother.  I may do this at a later date....
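
A rough sketch of what such open/close wrappers might look like, since this
part was never written.  Everything here (ck_open, ck_close,
ck_record_offsets, ck_reopen_all, the fixed-size table) is hypothetical, and
it handles plain files only.  It records offsets with lseek() rather than
stat'ing the descriptors, which is an assumption about what needs saving.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define CK_MAX_FILES 32
#define CK_PATH_MAX  256

struct ck_file {
    int   in_use;
    int   fd;                  /* descriptor number the program holds */
    int   flags;               /* flags originally passed to open() */
    off_t offset;              /* file offset at the last checkpoint */
    char  path[CK_PATH_MAX];
};

static struct ck_file ck_files[CK_MAX_FILES];

/* open() replacement that remembers how the file was opened. */
int ck_open(const char *path, int flags, mode_t mode)
{
    int i, fd;

    assert(path != NULL);
    if (strlen(path) >= CK_PATH_MAX)
        return -1;
    if ((fd = open(path, flags, mode)) < 0)
        return -1;
    for (i = 0; i < CK_MAX_FILES; i++) {
        if (!ck_files[i].in_use) {
            ck_files[i].in_use = 1;
            ck_files[i].fd = fd;
            ck_files[i].flags = flags;
            ck_files[i].offset = 0;
            strcpy(ck_files[i].path, path);
            return fd;
        }
    }
    close(fd);                 /* table full */
    return -1;
}

/* close() replacement that forgets the descriptor. */
int ck_close(int fd)
{
    int i;

    for (i = 0; i < CK_MAX_FILES; i++)
        if (ck_files[i].in_use && ck_files[i].fd == fd)
            ck_files[i].in_use = 0;
    return close(fd);
}

/* Call just before writing a checkpoint. */
void ck_record_offsets(void)
{
    int i;

    for (i = 0; i < CK_MAX_FILES; i++)
        if (ck_files[i].in_use)
            ck_files[i].offset = lseek(ck_files[i].fd, 0, SEEK_CUR);
}

/* Call after restoring the table: reopen each file on its old descriptor
 * number and seek back.  Assumes nothing else is using those numbers. */
int ck_reopen_all(void)
{
    int i, fd;

    for (i = 0; i < CK_MAX_FILES; i++) {
        struct ck_file *f = &ck_files[i];

        if (!f->in_use)
            continue;
        fd = open(f->path, f->flags & ~(O_CREAT | O_TRUNC | O_EXCL));
        if (fd < 0)
            return -1;
        if (fd != f->fd) {
            if (dup2(fd, f->fd) < 0) {
                close(fd);
                return -1;
            }
            close(fd);
        }
        if (f->offset >= 0 && lseek(f->fd, f->offset, SEEK_SET) < 0)
            return -1;
    }
    return 0;
}
```

The table itself lives in the data segment, so a checkpoint of the
data/stack portion of the address space would carry it along.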

The code that I _do_ have simply checkpoints the data/stack portion of the
address space.  Note that this includes the stdio buffers etc., so if I _did_
decide to save file descriptor states all I need to do at restart is to
lseek to the old location... assuming the program doesn't lseek around also.
If it did, I'd have to copy all the files to get _their_ state at the time
of the checkpoint (bleh).  Restart is performed by running the program with
a switch specifying the checkpoint file, whereupon the state from the file
is loaded into the current address space (i.e., your program would have to
recognize a flag and call my restore function).  I have versions of this
code running on Vaxen and IBM RTs.
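
To illustrate the restart-with-a-switch flow only: the sketch below is NOT
the data/stack dump described above.  It swaps in a simpler, portable
stand-in that saves one explicit struct.  The struct layout, the "-r" flag,
and the names ck_save, ck_restore and ck_restart_file are all hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define CK_MAGIC 0x434b5054UL          /* "CKPT", guards against junk files */

/* Hypothetical program state; a real program would list its own. */
struct state {
    unsigned long magic;
    unsigned long candidate;
    unsigned long rounds_done;
};

/* Write the state to FILE.tmp, then rename over FILE, so a crash in
 * mid-write never destroys the previous good checkpoint. */
int ck_save(const char *file, const struct state *s)
{
    char tmp[512];
    FILE *fp;
    int n;

    assert(file != NULL && s != NULL);
    n = snprintf(tmp, sizeof tmp, "%s.tmp", file);
    if (n < 0 || n >= (int)sizeof tmp)
        return -1;
    if ((fp = fopen(tmp, "wb")) == NULL)
        return -1;
    if (fwrite(s, sizeof *s, 1, fp) != 1) {
        fclose(fp);
        remove(tmp);
        return -1;
    }
    if (fclose(fp) != 0) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, file);
}

/* Load the state from FILE; leaves *s untouched on any failure. */
int ck_restore(const char *file, struct state *s)
{
    struct state t;
    FILE *fp;

    assert(file != NULL && s != NULL);
    if ((fp = fopen(file, "rb")) == NULL)
        return -1;
    if (fread(&t, sizeof t, 1, fp) != 1) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    if (t.magic != CK_MAGIC)
        return -1;
    *s = t;
    return 0;
}

/* Return the argument following "-r", or NULL if the flag is absent. */
const char *ck_restart_file(int argc, char **argv)
{
    int i;

    for (i = 1; i + 1 < argc; i++)
        if (strcmp(argv[i], "-r") == 0)
            return argv[i + 1];
    return NULL;
}
```

main() would call ck_restart_file(argc, argv) first; if it returns a name,
call ck_restore() and resume from the loaded state, otherwise start fresh.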

I currently have three 1000-digit probabilistic primes.  Does any factoring
wizard want a 2000 digit compos... :-)

To generate 100-digit probabilistic primes (probability 1 - 2^-40), it takes
129.3u 0.7s 2:28 87% on an IBM RT/APC and 290.2u 0.1s 8:49 54% on a uVax III.
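
For reference, a small sketch of the Rabin (Miller-Rabin) test itself.  A
composite survives one round with a random base with probability at most
1/4, so 20 rounds give an error of at most 4^-20 = 2^-40; that the figure
above came from 20 rounds is an inference, not something stated.  This
version works on 64-bit integers only, not cmump multi-precision numbers,
and uses a fixed-seed xorshift generator purely for illustration.  All
function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* (a * b) mod m without overflowing 64 bits, by shift-and-add. */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m)
{
    uint64_t r = 0;

    a %= m;
    while (b) {
        if (b & 1)
            r = (r >= m - a) ? r - (m - a) : r + a;
        a = (a >= m - a) ? a - (m - a) : a + a;
        b >>= 1;
    }
    return r;
}

/* (b ^ e) mod m by repeated squaring. */
static uint64_t powmod(uint64_t b, uint64_t e, uint64_t m)
{
    uint64_t r = 1 % m;

    b %= m;
    while (e) {
        if (e & 1)
            r = mulmod(r, b, m);
        b = mulmod(b, b, m);
        e >>= 1;
    }
    return r;
}

/* xorshift64: fine for a demo, not a good source of random bases. */
static uint64_t rng_state = 88172645463325252ULL;

static uint64_t next_rand(void)
{
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}

/* Return 0 if n is certainly composite, 1 if n passed `rounds` rounds. */
int is_probable_prime(uint64_t n, int rounds)
{
    uint64_t d, x, a;
    int s = 0, i, r;

    assert(rounds > 0);
    if (n < 4)
        return n == 2 || n == 3;
    if ((n & 1) == 0)
        return 0;
    d = n - 1;                         /* write n - 1 = d * 2^s, d odd */
    while ((d & 1) == 0) {
        d >>= 1;
        s++;
    }
    for (i = 0; i < rounds; i++) {
        a = 2 + next_rand() % (n - 3); /* base in [2, n - 2] */
        x = powmod(a, d, n);
        if (x == 1 || x == n - 1)
            continue;
        for (r = 1; r < s; r++) {
            x = mulmod(x, x, n);
            if (x == n - 1)
                break;
        }
        if (r == s)
            return 0;                  /* a is a witness: n is composite */
    }
    return 1;
}
```

Calling is_probable_prime(n, 20) corresponds to the 1 - 2^-40 confidence
level.  A prime always passes; only composites can be misclassified.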

The primality code uses the cmump library package developed here at CMU
(cmump is based on the mp package from BTL), so it probably won't be useful
unless you have a source license or you're willing to rewrite it.  As for the
checkpointing code, I'm willing [and able] to share.  I only use Unix
syscalls and the code should have no Mach dependencies.

-bsy
-- 
Internet:	bsy at cs.cmu.edu		Bitnet:	bsy%cs.cmu.edu%smtp at interbit
CSnet:	bsy%cs.cmu.edu at relay.cs.net	Uucp:	...!seismo!cs.cmu.edu!bsy
USPS:	Bennet Yee, CS Dept, CMU, Pittsburgh, PA 15213-3890
Voice:	(412) 268-7571