2.11BSD/doc/2.10/ovpap

.RP
.TL
Running Large Text Processes
.if n .br
on Small
.UX
Systems
.AU
Charles Haley
.AI
.MH
.AU
William Joy
.AI
Computer Science Division
Department of Electrical Engineering and Computer Science
University of California, Berkeley  94720
.AU
William F. Jolitz
.AI
U.S. Geological Survey
345 Middlefield Road
Menlo Park, California  94025
.AB
.PP
We describe a set of simple modifications to the
.UX 
system, which permit larger programs to be run than has previously
been possible.  In particular, the
.I f77
and
.I a68
compilers and version 2 of the
.I ex
editor,
which previously would not run on the non-separate I/D machines
such as the 11/23, 11/34 and 11/40, may be run, without source code
modification, using this scheme.
This scheme will also allow processes larger than 65K bytes of instruction
space to run on all 11/ cpu's with segmentation hardware.
.PP
The overlay scheme used has been designed so that it is transparent to
the C programmer.  Information about which routines are overlayed and in
which overlay they reside is not needed until load time, and only the
overlay loader
.I ovld,
need deal with this.
The system mechanism for implementing overlays should function for
languages other than C (such as
.I a68)
but the current
.I ovld
implementation deals specifically with creating load modules for C.
Since 
.I f77
actually generates code via the second pass of the C compiler,
it also can be used to generate code for overlaying.
.AE
.SH
Introduction
.PP
The cheap and wide availability of small PDP-11's makes it desirable to have
all of the programs available on the 11/70's and 11/45's of larger
.UX
installations available on the smaller machines such as 11/34's and 11/40's.
To date, this has not been possible, because the smaller machines do
not have the separate instruction and data scheme found on 11/45's and
11/70's which allows 16 bits of instruction space separate from the data
space.
.PP
We have designed and implemented a scheme for running large processes
on machines without this separate I/D feature.  It may also be used to
run processes larger than 65K bytes of instruction space on 11/45's and 11/70's.
.SH
Strategy
.PP
The basic strategy is quite simple.  We resist the complexity of most
overlay schemes, and opt for the following points:
.IP 1)
The overlaying should be (almost) completely invisible to the programmer.
.IP 2)
No restrictions should be made on the language features available to
overlay programs.  In particular, function pointers in C must continue
to work, with all the same properties (uniqueness, etc.)
.IP 3)
The basic system interface should not impose the C runtime organization
any more than the current system does; in particular, other languages
such as
.I a68
should be usable in an overlay fashion, perhaps using a different loader.
.IP 4)
The strategy should be simple to implement.
.SH
New executable file formats
.PP
We have added two new ``magic numbers'' for executable files:
0430 and 0431.  The 0430 magic number corresponds to an overlayed
version of 0410 (shared text) executable files, and the 0431 number
corresponds to an overlayed version of separate I/D spaces files (which
normally have magic number 0411).
.PP
The
.I a.out
file format for these files differs from the normal file format as
follows:
.IP \(**
After the 8 word header and before the text of the program begins
is placed an 8 word array of overlay information.  The first word
of information is the maximum size of any of the overlays, and
the rest of the information gives the sizes of each of the (up to 7)
overlays.
.IP \(**
The text space follows the newly added overlay information, which is
then followed by the text of each overlay.  The overlay text
sizes are all multiple of ``core clicks'', i.e. rounded up to a multiple
of 64 bytes in size.
.LP
The rest of the
.I a.out
file is in the normal format.
.SH
Segmentation register layout
.PP
When an 0430 or 0431 process is run, the overlay information and the
text for all of the overlays are saved by the system.
At any given point during program execution, only one of the
(up to) 7 overlays will be mapped into the process address space,
but the process can request, using a new
system call, that an overlay of its choice be mapped into a
portion of its address space shared by the overlays.
This call actually implemented using the emulator trap instruction
of the PDP11 to reduce the overlay switching overhead.
.PP
Thus, there are conceptually four possible usages for each segmentation
register:
.IP 1)
It may be part of the text segment (as before).
.IP 2)
It may be one of the overlay text segment registers, mapping
address space after the text segment (1) but before the data segment (3).
.IP 3)
It may be one of the data segment registers, or
.IP 4)
It may be one of the stack segment registers.
.SH
System management of 0430 and 0431 processes
.PP
There are three major aspects of system handling of the new form
of processes:
.IP 1)
The
.I exec
and related system calls must know how to establish such processes,
and how to detect that they will not fit.
.IP 2)
The
.I estabur
and
.I sureg
mechanisms must know how to set up the segmentation registers for such
processes.
This interface must be modifiable by a system call to permit the currently
chosen overlay to be mapped.
.IP 3)
The scheduler and swapping mechanisms must understand these processes
and allow for enough core space for them (they use more than they
would appear to from the first 8 word header, e.g.).
To simplify this, we have chosen, in this implementation,
to swap the basic text and all overlay text for such processes as one piece.
.PP
The considerations here are relatively straightforward, and will
not be discussed in more detail.
.SH
Link editor changes
.PP
We have added two new options to the link editor, and made modifications
as necessary to make the overlay loading as transparent as possible.
This modified link editor is called
.I ovld
and is identical to the normal loader with the addition of the following
two options:
.IP \fB\-Z\fR
marks the beginning of an overlay.  The routines in the files to the
next
.B \-Z
or
.B \-L
option are placed in the next overlay (numerically).
.IP \fB\-L\fR
marks the end of all overlays.  The rest of the routines go into the
base segment.\(dg
.FS
\(dg \fBL\fR was chosen because it was unused and can be thought
of as ``library''.  The \fBZ\fR option has no mnemonic value.
.FE
.PP
Here is a sample loader command to
.I ovld
which loads the
.I ex
editor into a base segment and four overlays:\(dd
.FS
\(dd The \fB\-lovc\fR library differs from \fB\-lc\fR in that it is compiled
with
.I ovcc
instead of
.I cc.  The difference between the source for
.I ovcc
and
.I cc:
is that
.I ovcc
uses a one word larger stack mark (it stacks overlay numbers of return
addresses in the extra word).  Unfortunately, this requires that the
library routines also allocate and preserve this extra word if they
are to live in overlays or call overlaid routines which may cause
overlay switching.  Thus, for generality, we have them always save
and restore this number.
.FE
.DS
ovld \-X /lib/crt0.o \-n\e
    \-Z ex\_addr.o ex\_cmds.o ex\_cmds2.o ex\_cmdsub.o ex\_re.o ex\_set.o ex.o\e
    \-Z ex\_vadj.o ex\_vmain.o ex\_voperate.o ex\_vwind.o ex\_vops3.o\e
    \-Z ex\_v.o ex\_vget.o ex\_vops.o ex\_vops2.o ex\_vput.o\e
    \-Z ex\_get.o ex\_io.o ex\_temp.o ex\_tty.o\e
    \-L ex\_put.o ex\_subr.o printf.o strings.o doprnt.o\e
       ex\_data.o termlib/termlib.a \-lovc
.DE
and a (modified version of the)
.I size
command run on the resulting
.I a.out
file yields:
.nf
15104+(15808,14784,13696,9152)+2946+7294 = 25344b = 061400b (68544 total text)
.fi
.PP
We have designed this overlay to use two segmentation registers
(each register maps 8192 bytes) for the root segment, and two registers
for each overlay.  This leaves four segmentation registers, which could
map 24K bytes of data and bss and dynamic space, and one register which
could map 8K bytes of stack.
.PP
As normally loaded for an 11/70, this version of the editor uses 64000 bytes
of text space.  The additional 5K bytes is taken up by interface code
to handle the overlays, which we will describe shortly.
.PP
One other point to be noted is that the namelist (symbol table) format
for the
.I a.out
file has been changed slightly for the overlay loaded
.I a.out's.
Previously, there was an unused byte in the format (basically, the high
byte of the ``type'' field of the namelist), and this field is now used
to contain the segment number where overlay routines reside.
Consider the following files:
.DS
---x0.c---
main()
{
    foo();
    foobar();
}
base() { }
.DE
.DS
---x1.c---
foo()
{
    base();
    ov2();
}
ov1() { }
.DE
.DS
---x2.c---
foobar()
{
    base();
    ov1();
}
ov2() { }
.DE
An appropriately modified
.I nm
command on a file loaded via:
.DS
ovld -X -n /lib/crt0.o  x0.o -Z x1.o -Z x2.o -L ovcsv.o -lovc
.DE
produces the following output:
.br
.ne 5
.ID
000000 f crt0.o
000000 a indir
000000 T start
000001 a exit
000001 a exit
000074 T _main
000074 f x0.o
000074 t ~main
000120 T _base
000120 t ~base
000134 T _exit
000134 f cuexit.o
000152 T __cleanu
000152 f fakcu.o
000154 T ovcsv
000154 f ovcsv.o
000172 T csv
000210 T cret
000210 T ovcret
000256 T _etext
000256 T _foo     1
000276 T _foobar  2
000316 T _ov2     2
000336 T _ov1     1
020000 f x1.o
020000 f x2.o
020000 t ~foo
020000 t ~foobar
020024 t ~ov1
020024 t ~ov2
040002 D __ovno
040004 B _environ
104000 a emt
.DE
Note that the addresses for
.I \_foo
and
.I \_foobar
appear in the base segment.  In fact, the true routines appear in
the overlaid segment (at
.I ~foo
and
.I ~foobar)
but the base segment contains an interface routine, which both insures
that they are mapped into the address space before transferring to them
and also allows function pointers to exist and work as normal.
.SH
Thunks
.PP
To interface each routine which is in an overlay to the outside world,
we add some interface which we (somewhat abusing the terminology) call
a ``thunk''.  This code has the following form:
.DS
\_foo:
        mov     $foo's\_ovno,r0
        cmp     r0,*$\_\_ovno
        beq     1f
        emt     0
1:
        jmp     *$~foo
.DE
Thus there is 16 bytes of interface code for each overlaid routine.
.PP
This code forces register 0 to be the number of the subroutine's overlay,
which is then checked against the currently loaded overlays number
(found in
.I \_\_ovno
).  If they are not the same,
the system is called (via an emulator trap) to make them so. Notice
that 
.I \_\_ovno
is still set to the previous overlay number so that it may be
preserved on the stack for our ultimate return.
.PP
The C save and restore sequences for use with overlayed
text register programs are coded as follows:
.br
.DS

/ C register save and restore -- version 7/75
/ modified by wnj && cbh 6/79 for overlayed text registers
/ modified by wf jolitz 2/80 to work and use emt syscall
/ we define ovcsv and ovcret which overlay routines call
/ even though ovcret is (currently) the same as cret
/ the loader finagles the .o files so this happens

.li
.globl  csv
.li
.globl  cret
.li
.globl  ovcsv, ovcret
.li
.globl  \_\_ovno
.li
.globl  \_etext
.li
.data
\_\_ovno: 0
.li
.text

emt= 0104000    / overlays switched by emulator trap. ovno in r0.

/ csv for routines in overlays
/ the previous overlay is in \_\_ovno, which is saved on the stack.
/ after it is saved, \_\_ovno is set to the current overlay number
/ which has been put in r0 by the thunk.
/
ovcsv:  mov     r5,r1
        mov     sp,r5
.DE
.DS
        mov     \_\_ovno,-(sp)
        mov     r0,\_\_ovno
        jbr     1f
/
/ only root segment routines call csv, and when it is called
/ no overlays have been changed, so we just save the previous overlay
/ number on the stack. note that r0 is'nt set to the current overlay
/ because we were'nt called through a thunk.
/
csv:    mov     r5,r1
        mov     sp,r5
        mov     \_\_ovno,-(sp)    / overlay is extra (first) word in mark
/ rest is old code common with ovcsv
/
1:
        mov     r4,-(sp)
        mov     r3,-(sp)
        mov     r2,-(sp)
        jsr     pc,(r1)         / jsr part is sub $2,sp
/
/ at this point, the stack frame looks like this:
/
.ta 2.5i
/          __________________	
/          |  return addr	|
/          |_________________	|
/ r5->  | old r5	|
/          |_________________	|
/          | previous ovnumber	|
/          |_________________	|
/          | old r4	|
/          |_________________	|
/          | old r3	|
/          |_________________	|
/ sp->  | old r2	|
/          |_________________	|
/


ovcret:         / same as cret, i think
cret:
        mov     r5,r2
/ get the overlay out of the mark, and if it is non-zero
/ make sure it is the currently loaded one
        mov     -(r2),r4
        bne     1f              / zero is easy
2:
        mov     -(r2),r4
        mov     -(r2),r3
        mov     -(r2),r2
        mov     r5,sp
        mov     (sp)+,r5
        rts     pc
/ not returning to root segment, so check that the right
/ overlay is loaded, and if not ask UNIX for help
1:
        cmp     r4,\_\_ovno
        beq     2b              / lucked out!
/ if return address is in root segment, then nothing to do
        cmp     2(r5),$\_etext
        blt     2b
/ returning to wrong overlay --- do something!
        mov     r0,r3
        mov     r4,r0
        emt
        mov     r4,\_\_ovno
        mov     r3,r0
/ intr. routines may run between these, so should force segment \_\_ovno
        br      2b
.DE
.PP
One subtle point here is that routines which are in overlays
are made to call the routines
.I ovcsv
and
.I ovcret
rather than
.I csv
and
.I cret.
This allows for slightly faster saving of the previous overlay value in
this case.
.SH
System Implementation
.PP
We have modified a Version 7
.UX
System to support 0430 and 0431 executable files. This version keeps
the overlays with the text segement, and uses an emulator trap to
switch overlays. Here is a description of some of the changes required:
.IP 1.
\fIgetxfile\fR,
in \fIsys1.c\fR, was modified to recognize 0430 and 0431 as legal magic numbers.
When either of them are encountered, the second 8 word header (containing
the sizes of the overlays) is read. The base or start of the overlayed area
of core, and the base of the data space is set into the per-process data area.
Each overlays size is checked against the maximum size.  A table of offsets is
computed for the starting location of each of the overlays.  The overlayed text
is skipped over to allow the data segement to be loaded correctly.
.IP 2.
\fIsbreak\fR,
in \fIsys1.c\fR, was modified to notice that the amount of text space in a
overlayed process includes the area set aside for (but not necessarly present)
an overlay.
.IP 3.
\fIxalloc\fR,
in \fItext.c\fR, was changed to allow the loading of each of the overlays from
the executable file.  Each overlay is loaded into the user's text space after
it has been mapped in by an
.I estabur
call.
.IP 4.
\fItrap\fR,
in \fItrap.c\fR, was changed to allow the use of an emulator trap as a system
call to switch the overlays of a type 0430 or 0431 p process.  This `call'
only works if used by an overlay process, it defaults to the normal trap
sequence if the call is invalid.  If valid, the current overlay number is
updated, and
.I estabur
is called to remap the user text space.
.IP 5.
\fIestabur\fR,
in \fIureg.c\fR, contains the changes that are the heart of the text overlays.
First, the executable size parameters are checked to see that they are within
the limits of the PDP-11 segmentation hardware.  The size of the text segment
is then coerced to include the overlays prepended to the end of the text 
segment.  Next, the segmentation registers are created, with space in the
addressing space being reserved even if an overlay is not active.
If an overlay is active, these registers are set to point at its location
in core.
.I estabur
then is used to map in the i'th overlay, and must be called every place in the
system when the given overlay must be accessed.
.IP 6.
The per-process data structure in \fIuser.h\fR has been changed to include the
offset table and base locations used by the overlays. They are appended to the
data structure to minimize system software conversion problems.
.IP 7.
The defined constant MAXMEM must be changed to reflect the new memory limit
per-process. Since this reflects the use of swap area as well, busy systems
might be advised to increase the size of the swap area as well.
.RE
.PP
Some areas of the system have yet to be changed to support overlays. In order
to \fIptrace\fR(UA2) them, the current overlay number has to become writeable
in this call. Profiling should also be changed to support overlayed processes.
.SH
Future improvements
.PP
If interactive debugging of 0430 and 0431 files is to occur, then
.I adb
must be changed to deal with the new format of the 
.I a.out
files.  We have not yet made the needed changes.
.PP
The mechanism here substantially improves the capability of a large class
of small 11's.  For machines with small amounts of real memory, it would
be nice if the text images of these 0430 and 0431 files would not
have to be completely resident to run.  Thus the individual overlays
could be swapped rather than being made part of the larger text segment.
This appears substantially more difficult to implement than the
present mechanism, for two reasons:
.IP 1)
It is a major change to the text mechanism, basically allowing
more than one text per process, and making the amount of core
required by a process much more dynamic.  Care must be taken in changing
the text mechanism of the system to allow this.
.IP 2)
Substantially more changes are needed to the scheduling algorithm in
the system to assign appropriate priorities to a new class of objects:
overlay text portions which are not currently ``active''.  It
seems pointless to implement this scheme if they are simply
abandoned as soon as they become free.  We suggest that they be
given ``abandon'' priority which keeps them just longer than slow terminal
i/o waits.