4.3BSD/usr/contrib/icon/docs/tr84-11a.roff

.TR 84-11
.DA "August 5, 1984"
.Gr
.TL
A Tour Through the C Implementation of Icon; Version 5.9
.AU
Ralph E. Griswold
.AU
Robert K. McConeghy
.AU
William H. Mitchell
.AB
This report documents the C implementation
of Version 5.9 of the Icon programming language.
This report concentrates on the
major parts of the system \(em
the translator, the linker, and the interpreter.
An additional section discusses how the implementation
can be modified for new language features.
.AE
.SH
Introduction
.PP
This report describes the C implementation of Version 5.9 of
the Icon programming language [1].
Most of the system is coded in C [2] and is designed to be run under \*U.
In addition to the C portion of the system, there is some assembly
language code.
To date, the C implementation has been adapted to the
PDP-11\u\(dg\d, VAX-11, and Onyx C8002.
.Un
.FS
\u\(dg\dPDP and VAX are trademarks of Digital Equipment Corporation.
.FE
This implementation is intended to be portable
to other computers running under UNIX,
but portability was not a primary design goal.
Reference 3 describes the process of transporting this implementation
and contains detailed descriptions of the assembly language routines for
the VAX implementation.
.PP
The implementation of the Icon system consists of three parts:
a translator, a linker,
and an interpreter. The interpreter contains a run-time system that
includes routines for the operations that are needed to execute an Icon program.
The translator converts an Icon source program into
an intermediate code, called
.I ucode .
The linker combines separately translated ucode files,
binds inter-procedure references,
and produces interpretable binary output, called \fIicode\fR.
.PP
The reference language for this report is Version 5.9 of Icon [4].
This report is intended to be used in conjunction with
the source listings for Version 5.9,
although a general overview of the system
can be obtained from this document alone.
.NH
The Translator
.PP
The Icon translator is written entirely in C
and consists of
12 files of source code and 10 header files.
The translator builds a parse tree for each Icon procedure,
then traverses the tree to generate code.
Three of the 12 source files contain only data initialization
and are automatically generated from specification files.
In addition, the LALR(1) parser is automatically generated
by the \fIYacc\fR parser generator [5].
.PP
The ucode output from the translator consists of two files.
One file, with the suffix \fM.u1\fR, contains
intermediate code corresponding to the procedures in the program.
The second file, with the suffix \fM.u2\fR,
contains global symbol table information.
Both files are printable.
.PP
The following sections discuss the four parts of the translator:
the lexical analyzer,
the parser,
the code generator,
and the symbol table manager.
.NH 2
The Lexical Analyzer
.PP
The lexical analyzer reads the Icon source program,
breaks it into tokens,
and delivers the tokens to the parser as requested.
A token is the basic syntactic unit of the Icon language;
it may be an identifier, a literal, a reserved word,
or an operator (operators include punctuation).
.PP
The lexical analyzer consists of four source files:
\fMlex.c\fR,
\fMchar.c\fR,
\fMoptab.c\fR,
and
\fMtoktab.c\fR.
The latter two of these files contain operator and token tables, respectively,
and are automatically generated
from operator and token specification files,
described below.
The file
\fMchar.c\fR
contains character mapping tables and the file
\fMlex.c\fR
contains the lexical analyzer itself.
.PP
The parser requests a token by calling
\fMyylex\fR,
which finds the next token in the source program
and determines its token type and value.
The parser bases its moves on the token type:
if the token is an operator or reserved word,
the token type specifically identifies the operator or reserved word;
otherwise, the token type indicates one of the six \*(oqprimitive\*(cq types:
identifier, integer literal, real literal, string literal, cset literal, or end-of-file.
The token value is a leaf node of the parse tree,
which, for the primitive types,
contains the source program representation of the token.
The token value node also contains
the source-program line and column numbers where the token starts.
A pointer to this node is placed in the global variable
\fMyychar\fR,
and
\fMyylex\fR
returns the token type.
.PP
The lexical analyzer finds the next token
by skipping white space, including comments.
The first character of the new token indicates which of the classes it belongs to.
A letter or underscore begins an identifier or reserved word,
a digit begins an integer or real literal,
a double quote begins a string literal,
a single quote begins a cset literal,
and any other character is assumed to begin an operator.
An identifier or reserved word is completed
by gathering all subsequent letters, digits, and underscores.
The reserved word table is consulted to determine if the token
is an identifier or a reserved word.
Numeric literals are recognized by a finite-state automaton,
which distinguishes real from integer literals by the
presence of a decimal point or the letter \*(oqe\*(cq.
A quoted literal is completed by reading
until the opening delimiter is repeated,
converting escapes in the process and continuing to new lines as necessary.
A table-driven finite-state automaton, described below,
recognizes operators.
.PP
An important task of the lexical analyzer is semicolon insertion.
The grammar requires that semicolons separate expressions
in a compound expression or procedure body,
so they must be inserted into the token stream
where they are omitted in the source program.
This process is table driven.
Associated with each token type are two flags,
.I BEGINNER
and
.I ENDER .
The
.I BEGINNER
flag is true if a token may legally begin an expression
(i.e., if it may follow a semicolon).
Similarly, the
.I ENDER
flag is true if a token may legally end an expression
(i.e., if it may precede a semicolon).
When a newline appears between two tokens, the
.I ENDER
flag of the first is true, and the
.I BEGINNER
flag of the second is true,
then a semicolon is inserted between the two tokens.
.PP
The token table is initialized in the file
\fMtoktab.c\fR.
The table is divided into three sections:
primitive types, reserved words, and operators.
The primitive types are fixed in the first six
slots in the table,
and must not be changed,
since they are referenced directly from the code.
The reserved words follow
and must be in alphabetical order.
The operators follow in no special order.
The last entry merely marks the end of the table.
.PP
Also in
\fMtoktab.c\fR
is an index to reserved words.
To speed up the search for reserved words,
this table hashes the search
using the first letter as the hash value.
The reserved words that begin with that letter then are examined linearly.
.PP
The operator table, in
\fMoptab.c\fR,
describes a finite-state automaton that
recognizes each operator in the language.
Each state is represented by an array of structures.
Each structure in the array
corresponds to a transition on the input symbol.
The structure contains three fields:
an input symbol,
an action,
and a value used by the action.
The recognizer starts in state 0;
the current input symbol is the first character
of the operator.
In a given state with a given input symbol,
the recognizer searches the array associated with the current state
for an entry that matches the current input symbol.
Failing a match, the last entry of the array, with
the input symbol field of 0,
is used.
The recognizer then performs one of the following actions,
depending on the value of the action field:
.in .5i
.IP \(bu \w'\(bu'u+1m
goes to the new state indicated by the value field
and gets the next input character
.IP \(bu
issues an error
.IP \(bu
returns the value field as
a pointer to the token table entry for the operator
.IP \(bu
returns the value field,
but pushes the current input character back onto the input.
.in 0
.LP
The difference between the last two actions is that
some operators are recognized immediately (e.g., \*(oq\fM;\fR\*(cq),
while others are not recognized until the character
following the operator is read (e.g., \*(oq\fM=\fR\*(cq).
.PP
The token table
and operator table
are automatically constructed by the Icon program
\fMmktoktab.icn\fR.
This program reads the specification file
\fMtokens\fR
and builds the file
\fMtoktab.c\fR.
The file
\fMtokens\fR
contains a list of all the tokens,
their token types (given as defined constants),
and any associated flags.
This list is divided into the three sections detailed above.
The program then reads the specification file
\fMoptab\fR
and builds the file
\fMoptab.c\fR.
The former is a skeleton for the operator table;
it contains the state tables,
but the program fills in the pointers to the token table entries.
.NH 2
The Parser
.PP
The parser, in the file
\fMparse.c\fR,
is automatically generated by \fIYacc\fR.
The grammar and semantic actions are contained in the file
\fMicon.g\fR.
From these specifications,
\fIYacc\fR generates parser tables for an LALR(1) parser.
.PP
In addition to the grammar,
\fMicon.g\fR
contains
a list of all the token types in the language
and declarations necessary to the actions.
\fIYacc\fR assigns an integer value to each token type,
and generates define statements,
which are written to the file
\fMtoken.h\fR.
These defined constants are the token types
returned by the lexical analyzer.
.PP
The grammar is context-free,
with actions associated with most of the rules.
An action is invoked when the corresponding rule is reduced.
The actions perform two duties:
maintaining the symbol tables
and constructing the parse tree.
The parse tree is built from the bottom up \(em
the leaves are supplied by the lexical analyzer
and the actions build trees from the leaves and from smaller trees
with each reduction.
.PP
The parser requests tokens from the lexical analyzer,
building a parse tree until it reduces a procedure.
At this point, it passes the root of the parse tree
to the code generator.
Once the intermediate code has been generated,
the parse tree is discarded,
and a new tree is begun for the next procedure.
.PP
Record and global declarations
affect only the symbol table
and do not generate parse trees.
Files named in link directives produce link instructions in the ucode
output.
.PP
A complete parse tree is rooted at a
\f3proc\fR
node, which identifies the procedure
and points to the subtrees for the
\fMinitial\fR
clause (if any)
and the body of the procedure.
Each node in the parse tree represents a source program construction
or some implicit semantic action.
A node can contain up to six fields,
the first of which is the node type.
The second and third fields are always line and column numbers
that are used for error messages and tracing.
Any additional fields contain information about the construction,
and possibly pointers to subtrees.
Appendix A contains a description of all the node types.
.PP
The grammar, shown in Appendix B, has several ambiguities.
The well-known \*(oqdangling else\*(cq problem exists
not only in the
\fMif-then-else\fR
expression, but also in the
\fMwhile-do\fR,
\fMuntil-do\fR,
\fMevery-do\fR,
and \fMto-by\fR
expressions.
In each of these expressions, the last clause is optional,
so that when the parser sees an
\fMelse\fR,
for example,
it does not know whether to shift the token
(associating it with the most recent
\fMif\fR),
or to reduce the preceding
\fMif-then\fR
expression (leaving the else \*(oqdangling\*(cq).
The latter choice is obviously incorrect, since the
\fMelse\fR
would never be shifted, and \fIYacc\fR correctly resolves such conflicts
in favor of the shift.
Thus, each
\fMelse\fR
is paired with the most recent unpaired
\fMif\fR.
All the control structures (except
\fMcase\fR)
have an additional ambiguity:
they do not have a closed syntax,
yet they may appear in an expression at the highest precedence level.
For example, the expression
.DS
.tr -\-+\(pl*\(**
\fMx := y + if a = b then z else -z * 3
.tr --++**
.ft R
.DE
could parse in either of two ways:
.DS
.tr -\-+\(pl*\(**
\fMx := y + (if a = b then z else (-z * 3))
x := y + (if a = b then z else -z) * 3
.tr --++**
.ft R
.DE
This problem, too, is resolved in favor of the shift,
such that the first parse is always used.
Thus, in the absence of parentheses,
the entire expression to the right of a control structure
is part of the control structure.
.PP
Little attention has been paid to error recovery.
A few error productions have been placed in the grammar
to enable \fIYacc\fR to recover from syntax errors;
the technique for doing so is described by Aho and Johnson [6].
The parser is slightly modified by the editor script
\fMpscript\fR
so that the parser state is passed to the routine
\fMyyerror\fR.
This routine prints an error message from the file
\fMsynerr.h\fR
that is associated with the current parser state.
This error table currently is constructed by
hand from the \fMy.output\fR file obtained by
running \fIYacc\fR with the
\f3\-v\fR
option.
.NH 2
The Code Generator
.PP
The parser calls the code generator upon
recognition of each Icon procedure,
giving it the root of the parse tree.
The code generator traverses the parse tree recursively,
emitting ucode.
Appendix C contains a description of ucode.
.PP
The file
\fMcode.c\fR
contains both the tree node allocation and the code generation routines.
The included header file
\fMcode.h\fR
contains macros and definitions needed by the code generator,
while
\fMtree.h\fR
defines the tree nodes and the macros that allocate them.
The macros in
\fMtree.h\fR
provide the interface between the parser and the code generator.
.PP
The tree traversal routine,
\fMtraverse\fR,
is a recursive procedure with one argument,
a pointer to the root of a tree or subtree
for which code is to be generated.
The routine examines the type field of the root
and, through a switch statement,
generates a sequence of ucode instructions as determined by the type.
If the node has subtrees,
\fMtraverse\fR
calls itself recursively at the appropriate point
to generate code for the subtree.
For example, the code generated for a binary operator
first generates code for its two subexpressions,
then emits the code that calls the appropriate run-time library routine.
.PP
The returned value of the traversal routine is used for counting
elements of expression lists.
If the root of the tree being traversed is an
\f3elist\fR
(expression list) node,
\fMtraverse\fR
returns the sum of the returned values of its two subtrees.
Otherwise, it returns 1.
This count is used when generating code for
procedure calls and lists with explicit elements,
which need to know the number of arguments
to be pushed onto the stack.
.PP
When generating code for loops,
the code generator needs to save three pieces of information
for each nested loop:
the
.I "break label" ,
the
.I "next label" ,
and the expression nesting level.
This information is kept on the
.I "loop stack" .
The break label is a label placed just past the end of the loop;
it is the place where control is passed when the loop is finished.
The next label is placed near the end of the loop,
at a point where the next iteration of the loop can be started.
The code for
\fMbreak\fR
and
\fMnext\fR
expressions branches to these labels,
but in either case, any incomplete expression frames (see Section 3.2)
within the loop must first be popped from the stack.
The expression nesting level counts the number of currently active
expression frames within the current loop;
an
\f3unmark\fR
instruction is generated for that many expression frames
(less one for a
\fMnext\fR
expression).
.PP
The possibility of nested
\fMcase\fR
expressions requires that certain information be kept on a
.I "case stack" .
For each case expression,
the code generator allocates a label for the end of the expression
and pushes it onto the case stack.
When a
\fMdefault\fR
clause is encountered,
its subtree is placed on the top of the case stack to delay
code generation for it until the end of the \fMcase\fR expression.
.NH 2
The Symbol Table Manager
.PP
The symbol table manager consists of the symbol table data structures
and routines that operate upon these data structures.
The source code for the symbol table manager is contained in two files.
The file
\fMkeyword.c\fR
contains only the keyword name table and is automatically constructed from a
keyword specification file discussed below.
The remainder of the symbol table manager is located in the file
\fMsym.c\fR.
.PP
The symbol table manager operates with two logical data structures,
the symbol table proper and the string space.
When the lexical analyzer identifies a token as either
an identifier or a literal,
the lexical analyzer requests the symbol table manager
to enter the token into the string space.
The symbol table manager returns a pointer into
the string space for that string.
The lexical analyzer then places this pointer in the token value node.
To help keep the size of the string space small,
all entries are hashed,
and only one copy of any string is kept.
This has the added benefit that two strings can be compared
by checking only the pointers into the string space.
.PP
The parser determines the context of the token
and requests the symbol table manager
to enter the token into the symbol table proper.
It is the responsibility of the symbol table manager
to verify that the use of the token is consistent with prior use.
Appropriate diagnostics are issued if the use is inconsistent.
.PP
The symbol table proper is physically divided into three separate structures:
the
.I global ,
.I local ,
and
.I literal
tables.
Each of these tables is hashed,
using the pointer into the string space as the key.
Since this pointer is an offset into the string space,
hashing is simply and effectively performed by taking the rightmost
.I n
bits of the offset
(where
.if n .I n "" 2**
.if t 2\s-2\u\fIn\fR\d\s+2
is the size of the hash vector for the table).
.PP
The global table contains identifiers that have been declared as
globals, procedures, or records.
The local table holds all identifiers declared as locals,
formal parameters for procedure declarations, field names for record
declarations, and all other identifiers referenced in the procedure
(including those declared as global elsewhere).
The literal table contains entries for literal strings and csets, integers,
and floating-point constants.
.PP
Both the local and literal tables are associated with the current
procedure being parsed
and are written to the \fM.u1\fR file
when the procedure has been successfully parsed.
If a record declaration has been parsed, then the local table,
containing only the field name identifiers, is written to the
\&\fM.u2\fR file.
After all procedure, record, and global declarations in a Icon source
file have been parsed, the global table is written into the global
declarations file.
.PP
An entry into any of the three symbol table sections is a structure with
three fields:
a link,
a name,
and a flag.
The link field holds the pointer to the next entry in the same hash bucket.
The name is the pointer to the identifier or literal name in the string space.
The flag field contains the type
.I "formal parameter" , (
.I "static local" ,
.I "procedure name" ,
etc.) of the entry.
Global table entries have a fourth field,
an integer providing the number of formal parameters for a procedure declaration,
or the number of fields in a record declaration.
.PP
Lookup in the local and global tables is merely the process of
following a hash chain until an entry of the same name is found or until
the hash chain is exhausted.
If a previous entry is found, the flags of the existing and new entries
are compared, and diagnostics are printed if the use of the new entry
conflicts with the previous usage.
The new entry is ignored whenever such an inconsistency is found.
.PP
The literal table uses the same lookup procedure, except
the search down the hash chain stops when an entry is found
with the same textual form and flag fields.
Thus the string literal
\fM"123"\fR
and the integer literal
\fM123\fR
have separate entries in the literal table, even though
they have the same string representations.
A consequence of this technique is that the integer
literals
\fM123\fR
and
\fM0123\fR
have separate entries in the literal table,
even though they have the same numeric value.
Since most programmers use a reasonably consistent style
when expressing literals,
this technique usually does not produce many
duplicate constants.
.PP
.tr ##
A final task of the symbol table manager is
the identification of keyword names.
(Note that keywords are of the form \fM&\fIname\fR.)
The symbol table manager maintains a list of the legal keyword names
and, upon request, returns a numeric identification for a keyword name
to the parser.
An automatic procedure exists for creating the keyword name table:
the Icon program
\fMmkkeytab.icn\fR
reads the specification file
\fMkeywords\fR
and produces the keyword name table in
\fMkeyword.c\fR.
The file
\fMkeywords\fR
is simply a list of the keyword names and a numeric identification for each.
Since the number of keyword names is small, and only a few
references to keywords are typical in an Icon program, lookup in the
keyword name table is done using a linear search.
.PP
The sizes of the respective portions of the symbol table
may be altered with command line arguments to the Icon translator.
.NH
Linker
.PP
The Icon linker is written entirely in C. It consists of eight files of
source code and three header files.
The linker performs three tasks:
combining the global symbol tables from one or more
runs of the translator,
resolving undeclared identifiers,
and translating ucode to icode.
The resulting combined global symbol table is used
for determining the scope of undeclared identifiers
during the second task.
The second and third tasks are done during a single pass
over each intermediate code file.
A single file of assembly code is produced.
.PP
The symbol table module, in the file
\fMlsym.c\fR,
is similar to the symbol table module of the translator,
except that there is an additional table for storing
field names of records.
The input module, in the file
\fMllex.c\fR,
recognizes the instructions in both the global symbol table files
and the intermediate code files.
The global symbol tables are merged by the routine \fMglobals\fR in
\fMglob.c\fR,
and the intermediate code files are produced by the routines in
\fMlcode.c\fR.
Of the remaining source files,
\fMilink.c\fR
and
\fMlmem.c\fR
contain the main program, miscellaneous support routines,
and memory initialization.
The files
\fMbuiltin.c\fR
and
\fMopcode.c\fR
contain table initializations for the list of built-in procedures
(functions)
and the ucode operations, respectively.
.PP
The first phase of the linker reads
the global symbol table file from each translator run,
and entering all the global symbols into one combined table.
The format of a global symbol table file is described in Appendix C.
This phase also builds the record/field table that cross-references
records and field names,
and sets the trace flag for execution-time tracing
if any of the files being linked were translated with the
\f3\-t\fR
option.
.PP
As records are entered into the global symbol table
and the record/field table,
they are numbered, starting from 1.
These record numbers are used to index the record/field table
at run-time when referencing a field.
.PP
When the linker encounters a link instruction, the named file is added
to the end of a linked list of files to be linked. The list initially
consists of the files named as arguments. Names are not added to the
list if they are already on it.
.PP
The second phase reads each intermediate code file in sequence,
emitting icode as each procedure is encountered.
Appendix C describes the intermediate code.
The intermediate code contains a prologue for each procedure,
beginning with a
\f3proc\fR
opcode, followed by a series of
\f3loc\fR
opcodes describing the local symbol table, a series of
\f3con\fR
opcodes describing the constant table, and a
\f3declend\fR
opcode terminating the prologue.
The local symbol table contains not only local symbols,
but all identifiers referenced in the procedure \(em
global, local, or undeclared.
When an undeclared identifier is entered into the local symbol table,
its scope is resolved by the following steps:
.in .5i
.IP \(bu \w'\(bu'u+1m
if the identifier has been entered in the global symbol table,
it is entered into the local symbol table as a global identifier
.IP \(bu
if the identifier matches the name of a function,
it is entered into the local symbol table as a function
.IP \(bu
otherwise it is entered as a local identifier
and a warning is issued if the linker was run with the
\f3\-u\fR
option
.in 0
.LP
The constant table contains an entry for each literal used
in the procedure.
.PP
The linker outputs icode in several regions.
The first region contains constants, procedure blocks, and code
for the Icon procedures.
The next region contains the record/field table and procedure blocks
for record constructors.
The next four regions contain
the global identifiers, the names of the global identifiers,
the static identifiers, and the identifier table.
The icode is a sequence of instructions,
each with an opcode and, in some cases, operands. The sizes of
opcodes and operands depend on the machine architecture and the
implementor's judgement. On the
VAX-11, opcodes are one byte long and operands are four bytes long.
Most instructions correspond exactly to instructions
in the ucode that is output by the translator.
The opcode values are those used internally by the linker
(defined in the file
\fMlink/opcode.h\fR).
.PP
Fields are provided in the global symbol and literal tables
for associating a location with each entry.
As the prologue is being read, each cset, real, or long-integer
literal entered into the literal table is output immediately
and its location is stored in the literal table.
Thus, the locations of all constants are known before their reference.
.PP
The same is true of references to procedures,
since these references only occur in the initialization for global identifiers,
which is not output until all procedures have been output.
When the prologue for a procedure has been completely processed,
the procedure data block is output, and its location is noted in the
global symbol table.
.PP
References to program labels require backpatching,
since there often are forward references.
Because program label references are always local to the current procedure,
the linker buffers the output code for a procedure.
A table of values for all program labels is initialized to zero at
the beginning of each procedure.
When a label is referenced and its table entry is zero,
the location of the reference is negated and stored in the table entry
and a zero is output for the operand.
If a label's table entry is negative,
the location of the reference is negated and stored in the table entry
as before, but the previous value of the table entry is output for the operand.
This forms a linked list of references to the as-yet-undefined label.
When a label is defined, each reference on the linked list is replaced
with the correct value of the label.
.PP
References to global and static identifiers are determined at run-time.
The
\f3glob\fR
and
\f3static\fR
instructions have an integer operand referring to the identifier by
position in the global or static identifier region.
When one of these instructions is interpreted,
the actual address is calculated from the position and the
known address of the global or static identifier region.
References to functions are also resolved at run-time.
Each function is assigned an integer index
(its position in the table of functions in
\fMbuiltin.c\fR).
When the global identifier initialization for a function is output,
the negated index is output instead of an address.
The interpreter fills in the correct address during program initialization.
.PP
Once the prologue has been processed,
a procedure data block (see Section 3.1)
is emitted.
Opcodes following the prologue represent execution-time operations,
and cause code to be emitted.
.PP
The record/field table is a two-dimensional matrix,
first indexed by a field number assigned to each identifier that
is used as a field name,
next by a record number assigned to each record type.
The value at the selected position in the table is the
index of the field in a record of the given type, or \-1
if the given record type does not contain the given field.
.PP
The initial value for global and static identifiers
is the null value unless the global identifier is a procedure,
function, or record constructor,
in which case the initial value is a descriptor of type
.I procedure
pointing to the appropriate procedure data block.
The values output use the data representations described
in Section 3.1.
.PP
The names of global and static identifiers are output as
.I "string qualifier"
descriptors (see Section 3.1)
and are used by the function
\fMdisplay\fR.
All string qualifiers contained in the generated procedure data blocks
and global and static names point into the identifier table,
which is just a static string space for that purpose.
.NH
The Interpreter
.PP
The interpreter consists of an interpretive loop and a collection of
run-time routines that
collectively provide support for the execution
of an Icon program.
.ig
This library is searched by the loader for those
routines necessary for a particular Icon program.
The assembly code generated by the linker contains
subroutine calls to library routines to perform most
high-level operations where in-line code would be inappropriate.
An executable program is created by assembling
the linker output, then loading a startup routine and
the assembler output with the run-time library and the C library.
The startup routine, run-time library, and C library together
form the run-time system.
..
.PP
Three directories contain routines relating directly to source-language
operations:
\fMfunctions\fR,
\fMoperators\fR,
and
\fMlib\fR.
The first two directories contain
one routine per function or operator, respectively.
The
\fMlib\fR
directory contains routines relating to Icon control structures.
A fourth directory,
\fMrt\fR,
contains routines for performing common operations
needed by many routines in the other three directories.
In particular,
\fMrt\fR
contains routines that handle storage allocation and reclamation,
type conversion, data comparison,
integer arithmetic with overflow checking,
generator suspension,
and tracing.
The directory \fMiconx\fR contains initialization and the interpreter proper.
.PP
In each of the four run-time directories, all of the object files are
archived in a \fMLib\fR file which is randomized to speed loading.
The \fMLib\fR files are loaded together with a startup routine and the
interpretive loop to produce the interpreter.
.PP
Most of the run-time system is coded in C,
but some of the routines are coded in assembly language.
The interpretive loop and startup routines are written in assembly language, as is
integer arithmetic with overflow checking
(C does not provide this),
as well as other routines concerned with stack management.
.PP
The interpreter is loaded with the run-time libraries and the C
library to form the program \fMiconx\fR, which interprets
icode.
.PP
Before the interpreter begins executing the Icon program,
it reads in the icode file generated by the linker.
The first eight words of this file contain header information
indicating the total size of the rest of the file,
the initial value of
\fM&trace\fR,
and the relative offsets from the beginning of the file to the
various regions.
These offsets are converted to actual addresses by adding the base
address of the icode buffer.
Several pointers in the icode must also be relocated.
The interpreter sweeps through the global identifiers,
looking for procedures, functions, and record constructors.
For each function, it supplies the address of the appropriate procedure block.
For each procedure, it relocates pointers
from its procedure block to the procedure entry point,
as well as to strings representing the procedure
and local identifier names in the identifier table.
For each record, it supplies the address of
\fMmkrec\fR,
the routine that constructs new records,
as the entry point field in the procedure block.
.PP
The interpreter then begins execution by invoking the value of the first global
identifier, which corresponds to the procedure
\fMmain\fR.
If there is no main procedure, the first global identifier has the
null value and a run-time error is reported.
The routine
\fMinvoke\fR
sets
the
\fIinterpreter program counter \fR(\fIipc\fR)
to the entry point, and branches to
\fMinterp\fR, which is contained in \fMiconx/interp.s\fR.
.PP
The routine
\fMinterp\fR
is the main interpreter loop.
It fetches the next opcode,
and branches to the appropriate processing routine through a jump table.
.NH 2
Data Representations
.PP
Icon has two elementary forms of data objects \(em values and variables.
Values often can be converted from one data type to another.
When this is done automatically, it is called
.I coercion .
There are three kinds of variables, each discussed below:
.I "natural variables" ,
.I "created variables" ,
and
.I "trapped variables" .
The process of obtaining the value referred to by a variable is called
.I dereferencing .
.PP
Every data object is represented by a two-word
.I descriptor ,
which may, depending on the type of the object,
contain a value or
refer to some other area of memory for the actual value.
The first word of the descriptor always indicates the data type,
and the second word either contains the value or a pointer to it.
There are six descriptor formats, shown in Appendix D:
.I null ,
.I "string qualifier" ,
.I "integer" ,
.I value ,
.I variable ,
and
.I "trapped variable" .
These formats are distinguished from one another by the three high-order
bits of the first word,
except that a
.I null
descriptor is distinguished from a
.I "string qualifier"
only by the contents of the second word.
Among
.I "integer" ,
.I value ,
and
.I "trapped variable"
descriptors, the low-order six bits of the first word
identify the type of object represented,
while the remaining high-order bits in the first word are flags that
classify the object (for example, whether the second word contains
a pointer
\(em historically, a \*(oqfloating address\*(cq [7]).
.PP
The
.I null
descriptor represents the null value.
A
.I "string qualifier"
represents a string, and contains the length of the string
and a pointer to the first character of the string.
An
.I "integer"
descriptor represents an integer small enough
to fit in the second word of the descriptor.
This includes all integers on computers whose C \fIints\fR are the same
size as C \fIlongs\fR (such as the VAX-11).
All data types other than \f3integer\fR, \f3string\fR, and \f3null\fR are
represented by
.I value
descriptors.
A value descriptor
contains a pointer to a
.I "data block"
of appropriate format for a value of the given type.
On computers whose C \fIlongs\fR are longer than C \fIints\fR
(such as the PDP-11),
an integer that requires more bits than there are in an \fIint\fR is contained
in a \fIlong integer\fR data block.
The data block formats for each data type are shown in Appendix D.
.PP
A
.I variable
descriptor represents either a natural variable
or a created variable.
A natural variable contains a pointer to a descriptor at a fixed location
(for a global identifier) or a location on the stack (for a local identifier)
where the value of the variable is stored.
A created variable, formed by a table, list, or field reference,
contains a pointer to a descriptor in the aggregate
where the referenced element is located.
Since such elements are in the heap,
created variables also contain an offset that indicates
the distance (in words) from the beginning of the data block
to the referenced descriptor.
This offset is used during the marking phase of garbage collection,
discussed in Section 3.3.
.PP
A
.I "trapped variable"
[8] descriptor represents a variable for which special action is necessary
upon dereferencing or assignment.
Such variables include substrings,
non-existent elements of tables,
and certain keywords.
Each type of trapped variable is distinguished by the first word
of the descriptor.
.PP
Substring trapped variables,
created by a section or subscripting operation,
contain a pointer to a data block that contains a
.I variable
descriptor identifying the value from which the substring was taken,
an integer indicating the beginning position of the substring,
and an integer showing the length of the substring.
With this information, assignment to a substring of a variable
can modify the contents of the variable properly.
Substrings of non-variables do not produce
substring trapped variables, since assignment to such substrings
is meaningless and illegal;
instead, forming the substring of a non-variable produces a
string qualifier.
.PP
Table element trapped variables,
formed by referencing a non-existent element of a table,
similarly contain a pointer to a data block that contains
enough information for assignment to add the element to the
referenced table or to supply the default table value.
.PP
The keywords
\fM&pos\fR,
\fM&random\fR,
and
\fM&trace\fR
are handled via trapped variables (\fM&subject\fR is handled differently).
These trapped variables
need no additional information.
It is sufficient to know the type of trapped variable on dereferencing \(em
the value of the keyword can be accessed and returned.
On assignment, the new value is coerced to the appropriate type,
checked for validity, and assigned to the keyword.
.PP
Strings formed during program execution are placed in the
.I "string space" ;
string qualifiers for these strings point into this region.
Substrings of existing strings are not allocated again;
instead, a string qualifier is formed that points into the
existing string.
When storage is exhausted in the string space,
the garbage collector (see Section 3.3)
is invoked to reclaim unused space and compact the region;
if enough space cannot be reclaimed, the region is expanded if possible.
.PP
Data blocks formed during program execution are placed in the
.I heap .
Data blocks have a rigid format dictated by the garbage collection
algorithm.
The first word of the block always contains a type code
which identifies the structure of the rest of the block.
Descriptors follow all non-descriptor information in the block.
If the size of the block is not determined by its type,
the size (in bytes) is contained in the second word of the block.
When storage is exhausted in the heap,
the garbage collector is invoked to reclaim unused space and compact the heap;
if enough space cannot be reclaimed, the heap is expanded if possible.
.PP
Co-expression stack blocks are allocated in a separate region and
are treated specially by the garbage collector.
.NH 2
Stack Organization
.PP
The stack is the focus of activity
during the execution of an Icon program.
All operators, functions, and procedures
expect to find their arguments at the top of the stack,
and replace the arguments with the result of their computation.
Local variables for Icon procedures are also kept on the stack.
The arguments, local variables, and temporaries on the stack
for an active Icon procedure are collectively called a
.I "procedure frame" .
This is one of several kinds of
.I "stack frames"
discussed in this section.
Appendix E summarizes the layouts of the stack frames for the PDP-11 and VAX-11.
See [3] for a detailed discussion of stack frames.
Each co-expression also has a stack. For uniformity, the main stack
is treated as the stack for the co-expression \fM&main\fR.
.PP
On the PDP-11 and VAX-11 stacks start in high memory and grow downward.
On these computers, a push causes the stack pointer to decrease and a
pop causes the stack pointer to increase. Thus ``above'' and ``below''
refer, respectively, to ``older'' and ``newer'' information on the
stack. An exception to this is the phrase
``top of the stack'', which is used to
refer to the \fIlowest\fR memory location. The description of relative
stack locations that follows is based on this kind of architecture
and nomenclature.
.PP
Before an Icon procedure calls another Icon procedure,
the caller pushes the procedure to be called
(a descriptor \(em procedures are data objects in Icon)
onto the stack.
The caller then pushes each argument (also a descriptor) onto the stack,
leftmost argument first.
The caller then pushes one word onto the stack
indicating the number of arguments supplied,
which may be different from the number of arguments expected.
The run-time library routine
\fMinvoke\fR
is then called, which checks that the first descriptor pushed above
actually does represent an integer, procedure, or a variable whose value is
an integer or a procedure.
An integer indicates the selection of one of the
arguments resulting from mutual evaluation.
A procedure, on the other hand,
points to a procedure data block,
which contains various information about the called procedure,
including the number of arguments expected,
the number of local variables used,
and the procedure's entry point address.
The routine
\fMinvoke\fR
next adjusts the number of arguments supplied to match the number expected,
deleting excess arguments or supplying the null value for missing ones.
This adjustment is performed by moving the portion of the stack
below the arguments up or down, as appropriate.
It then dereferences the arguments.
A
.I "procedure marker"
is then pushed onto the stack,
and the
.I "procedure frame pointer"
is set to point to the new procedure marker.
The procedure marker contains, among other things,
the return address in the calling procedure
and the previous value of the procedure frame pointer.
Next, the null value is pushed onto the stack
as the initial value for each local variable.
The routine
\fMinvoke\fR
then transfers control to the procedure's entry point,
and execution of the Icon program resumes in the new procedure.
.PP
When a procedure is ready to return to its caller,
it pushes its return value (a descriptor) on the stack.
It then transfers control to
\fMpret\fR,
which moves the return value
to the location occupied by the descriptor that represented the
called procedure.
That is, the return value is stored in place of the
first descriptor that was pushed
at the beginning of the calling sequence described above.
The return sequence
then restores the state of the previous procedure
from the current procedure marker
(the procedure marker that the procedure frame pointer
currently points to).
This includes restoring the previous value of the procedure frame pointer,
retrieving the return address,
and popping the returning procedure's local variables,
procedure marker, and arguments.
Thus, when the calling procedure regains control,
the arguments have been popped
and the return value is now at the top of the stack.
.PP
Functions and operators are written in C,
and therefore obey the C calling sequence.
By design, the Icon calling sequence described above
is similar to the C calling sequence.
When an Icon procedure calls a function, a
.I boundary
on the stack is introduced,
where the stack above the boundary is regimented by Icon standards,
and the stack below the boundary contains C information.
This boundary is important during garbage collection:
The garbage collector must ignore the area of the stack
below the boundary,
since the structure of this area is unknown,
whereas the structure of the area above the boundary is
well-defined.
In particular, all data above the boundary is contained
in descriptors or is defined by the structure of a frame,
so that all pointers into the heap or string space may be
located during a garbage collection.
.PP
Functions and operators are written to \*(oqstraddle\*(cq
the boundary.
From above, they are designed to resemble Icon procedures;
from below, they are C procedures.
An Icon procedure calls a function in the same way as it
calls another Icon procedure;
in fact, functions are procedure-typed
data objects just as Icon procedures are.
When
\fMinvoke\fR
recognizes that a function is being called,
it bypasses the argument adjustment
if the field in the procedure data block that indicates
the number of arguments expected contains \-1,
which indicates that the function can take an arbitrary number of arguments.
It also does not allocate stack space for local variables, since
any such variables are C variables and are allocated by the C
function itself.
C procedures have an entry sequence that creates a new
procedure frame;
since
\fMinvoke\fR
has already done this,
the entry point for functions must be set past any instructions
that are involved in procedure frame creation.
.PP
The first formal parameter of
all functions is
\fMnargs\fR,
which corresponds to the word that contains the number of
arguments supplied.
For functions that expect a fixed number of arguments,
they are also listed as arguments, in reverse order.
For functions that can take an arbitrary number of arguments,
there is a macro
\fMARG(\fIn\fM)\fR
that uses the address and contents
of
\fMnargs\fR
to calculate the location of the
.I n \^th
argument.
Thus,
\fMARG(1)\fR
accesses the first argument (as a descriptor),
and
\fMARG(nargs)\fR
accesses the last argument.
Each function is responsible for supplying defaults for missing arguments
and for dereferencing arguments that are variables.
Because of the calling protocol,
\fMARG(0)\fR
accesses the location where the return value should be stored.
Every function must place its result there and
then return through normal C conventions.
Each function must also supply a procedure data block that contains
the number of arguments expected (or \-1), its entry point,
and a string qualifier representing its name.
.PP
Operators are very similar to functions. The only difference is that
operators are called directly (rather than being called through \fMinvoke\fR)
and must set the boundary themselves.
.ig
.PP
Operators are written like functions,
with one exception.
Since operators are not variables
(as function and procedure names are),
the name of the operator is known at translation time,
and the Icon procedure calls it directly
(at its normal entry point)
without going through
\fMinvoke\fR.
Thus, operators do not need procedure data blocks.
Instead of pushing a procedure descriptor pointing to a procedure block,
a null value is pushed instead to hold a place for the return value.
..
.PP
When an operator or function fails to produce a result,
it calls
\fMfail\fR.
This routine initiates backtracking as described below.
.PP
Expressions are evaluated within an
.I "expression frame" .
When the evaluation of an expression is complete,
whether it has produced a result or failed,
the expression frame must be popped from the stack
and the result of the expression must be pushed back onto the stack.
The expression frame marks the stack height at the point
that the expression began to be evaluated,
so that the stack may be restored to its original state
when the evaluation of the expression is complete.
The stack normally would be restored to the original height
(that is, the pops would match the pushes)
except when an expression fails at some midpoint in its evaluation.
The expression frame is also used to limit backtracking:
backtracking is restricted in the language
to the current expression instance only.
.PP
When evaluation of an expression begins, an
.I "expression marker"
is pushed on the stack, the
.I "expression frame pointer"
is set to point to it, and the
.I "generator frame pointer" ,
discussed below, is cleared.
The marker contains the previous values of the
expression and generator frame pointers
and a failure label.
When an expression produces a result,
that result, on the top of the stack,
is popped and saved.
Then the stack is popped to the expression marker,
and the previous values of the two frame pointers are restored.
The marker is popped
and the result of the expression is pushed back onto the stack,
now a part of the previous expression frame.
If an expression fails to produce a result,
\fMfail\fR
pops the stack to the expression marker,
restores the previous values of the two frame pointers,
and branches to the failure label.
In the special case that the failure label is zero,
\fMfail\fR
is effectively called again to indicate failure in the new expression frame.
Thus the failure is propagated from one expression to an enclosing one.
.PP
If an expression has any generators,
then there is a
.I "generator frame"
within the current expression frame
for each generator that is inactive
(that is, that has produced a value but is not yet exhausted).
A generator frame preserves the state of the stack
at the point just before the generator
(whether it be operator, function, or procedure)
suspended (became inactive).
If
\fMfail\fR
is called and there are inactive generators,
then instead of exiting the current expression frame,
the most recently inactivated generator is reactivated
by restoring the stack to the state saved in the most recent
generator frame.
.PP
A function or operator suspends itself by calling
\fMsuspend\fR.
This routine preserves the state of the stack
by duplicating the current expression frame,
bounded on one end by the most recent generator frame
(or, if there are no inactive generators, the current expression frame)
and bounded on the other end by the beginning of the
argument list of the suspending function or operator.
A generator marker is pushed onto the stack,
followed by the duplicate expression frame.
The routine
\fMsuspend\fR
then causes the suspending function or operator to return to its caller,
instead of itself returning.
.PP
When reactivated by
\fMfail\fR,
the stack is restored to the generator marker,
which is used to restore the various frame pointers.
Then the marker is popped.
The stack is then in the same state that it was in when
\fMsuspend\fR
was called.
The routine
\fMfail\fR
then returns to the generator
as if the original call to
\fMsuspend\fR
had returned.
Thus the following schema is typical of operators and functions
that generate a sequence of values.
.DS
.ss 9
.ft M
\fIinitialize\fM;
\fMwhile\fM (\fInot exhausted\fM) {
\fIcompute next value\fM;
\fIstore return value\fM;
suspend()
}
fail();
.ft R
.ss 4
.DE
The effect of resuming an expression containing generators is that
\fMsuspend\fR
actually causes the generator to return.
If alternatives are needed, backtracking occurs,
and the apparent effect is that
\fMsuspend\fR
has finally returned.
The generator computes the next result,
and suspends with it.
When the generator is exhausted,
it merely fails without suspending,
which just passes the failure back to the next most recently inactivated generator,
if any.
.PP
Just as functions and operators can return normally, suspend, or fail,
so can Icon procedures.
The mechanics are essentially the same, but the differences in stack
layout require different primitives.
When Icon procedures return normally, the return value is presumed to
be at the top of the stack and
\fMpret\fR
is called.
Similarly, Icon procedures call
\fMpsusp\fR
to suspend.
Both of these routines also dereference the return result
if it is a local variable.
The routine
\fMpfail\fR
causes an Icon procedure to return with no result.
.PP
The same three primitives are also needed at the expression level:
\fMeret\fR,
\fMesusp\fR,
and
\fMefail\fR.
Unlike
\fMunmark\fR,
\fMeret\fR
is not a library routine, but is generated as in-line code.
Both cause an exit from the current expression frame; but
\fMeret\fR
supplies a result to the enclosing expression,
while
\fMunmark\fR
does not.
The routine
\fMesusp\fR
creates a inactive generator before supplying a result to the
enclosing expression;
it is used by the alternation control structure.
The routine
\fMefail\fR
simply causes backtracking within the current expression frame.
In fact,
\fMfail\fR
and
\fMpfail\fR
merely exit their procedure frame before branching to
\fMefail\fR.
.NH 2
Storage Allocation and Reclamation
.PP
During program execution, storage allocation is necessary when a data
object is created.
The three primitive routines
\fMallocate\fR,
\fMalcstr\fR, and \fMalcestk\fR
allocate storage in the heap, string space, and co-expression stack space, respectively.
All three routines return pointers to the beginning of newly
allocated space.
None of the routines is responsible for ensuring that enough space
remains in the data regions.
Ensuring that enough space remains in the data regions is
the responsibility of a
.I "predictive need"
strategy described below.
.PP
In the heap,
\fMallocate(n)\fR
returns a pointer to
\fMn\fR
contiguous bytes of storage.
Because a wide variety of objects may reside in the heap, a number of
support routines are provided to simplify the storing of various objects.
There is a specific routine to allocate a block for each datatype in the heap.
Where appropriate, these routines have the actual values to be stored
as their arguments.
All of the routines call
\fMallocate\fR
to obtain storage for the object.
.PP
In the string space,
\fMalcstr(s,\*bl)\fR
allocates room for a string of length
\fMl\fR
and copies the string pointed to by
\fMs\fR
into this space.
Since some routines such as
\fMleft\fR,
\fMright\fR,
and
\fMcenter\fR
need room in the string space in which to construct a string,
a call to
\fMalcstr\fR
with the defined constant
\fMNULL\fR
as the first argument results in the
allocation of storage without attempting to copy a string.
.PP
In the co-expression stack region, \fMalcestk()\fR allocates a new
co-expression stack.
.PP
Source code for all of the allocation routines is contained in the file
\fMrt/alc.c\fR.
Almost all interaction with the storage management is made through these
routines.
Two exceptions occur in string concatenation (\fMoperators/cat.c\fR) and
reading a fixed number of
bytes from a file (\fMfunctions/reads.c\fR).
In these cases, it is simpler and more efficient to have these operations
deal directly with storage management.
.PP
As mentioned earlier, a
.I "predictive need"
strategy is employed to ensure that enough room remains for data storage.
Simply put,
.I "predictive need"
states that it is the responsibility of any routine
that calls an allocation routine both to ensure that enough room remains
in the proper data region and to maintain the validity of any temporary
pointers into the data regions, should a
.I "garbage collection"
be necessary to free storage space.
.PP
Since the check for storage space only needs to occur before the allocation
takes place, each routine may perform this check at its convenience.
This approach permits the minimization of the number of temporary
pointers that must be protected during garbage collection.
As an aid, space for several descriptors is automatically protected
by the procedure invocation mechanism, and usually is used to hold
information pertaining to the arguments of the procedure (see Section 3.4).
.PP
Routines to ensure space are provided for each of the three
storage regions.
The routine
\fMsneed(n)\fR
ensures that at least
\fMn\fR
bytes of storage remain in the string space, and
\fMhneed(n)\fR
performs the same function in the heap.
\fMesneed()\fR ensures that there is a co-expression stack
available.
If either routine finds that there is insufficient storage remaining,
it invokes the
.I "garbage collector"
in an attempt to obtain that storage.
If that fails, then program execution is aborted with an appropriate
diagnostic.
.PP
Garbage collection, or
.I "storage reclamation" ,
is a process that identifies all valid allocated data.
In the string and heap regions, valid data is compacted
in order to provide a contiguous area of unused storage.
The algorithm used for identifying valid data is based upon
the algorithm described by Hanson [7].
Only the more novel features are discussed here.
.PP
Whenever a predictive need request discovers that insufficient
storage remains in either the heap or string space, the garbage
collector is invoked to reclaim unused space in all regions.
This approach is more efficient in situations where all regions
are heavily allocated and only slightly less efficient otherwise.
.PP
The approach is to sweep through the permanent data regions and the
stack, looking for descriptors that are either pointers into the heap or
string qualifiers.
When a string qualifier is found, a pointer to that qualifier is
saved in a temporary data region at the end of the heap.
If the descriptor is a pointer into the heap, then that heap data
block contains valid information.
The block is marked as valid, the descriptor is placed on a back chain
headed in the block,
and the marking process is called
recursively on any descriptors within that block.
Blocks that are already marked as valid are not processed a second time.
To simplify the marking of heap blocks, all data blocks have been
designed so that all descriptors within them exist as a contiguous
section at the end of the block.
Thus to sweep through the descriptors within a block, the marking
algorithm need only know the size of the block and the
location of the first descriptor.
Information concerning a data block's size, as well as the
offset for the first descriptor is in the file
\fMrt/dblocks.c\fR.
.PP
Valid co-expression stacks also may contain string qualifiers and
pointers to other valid data; such stacks are included in the
marking phase.
.PP
After the marking phase is completed, the string region is compacted.
The algorithm used is described by Hanson [9].
The pointers to the string qualifiers are sorted so that the order of
all valid strings within the string space is identified.
The string qualifiers are then processed in order, and modified as
the valid strings are compacted.
If this compaction does not free enough space within the
string space to satisfy the request, the heap must be moved in order
to provide more room in the string space.
An attempt is also made to provide
some additional
``breathing room''
in the string space to permit future expansion.
.PP
The heap cannot be moved until after the valid pointers into it are
adjusted and the storage is compacted.
The pointer adjustment and heap compaction phases are two linear passes through
the heap which must be performed during standard heap garbage collection.
The only difference when the heap is to be moved is that the adjusted
pointers point to where that data will be after the heap has been moved.
If not enough breathing room is freed in the heap, then
more space is requested from the operating system.
As a last step, if the string space needs more room, the heap is
relocated.
.PP
This method has proved to be quite satisfactory for most applications.
A shortcoming of the implementation is the absence of a process for
decreasing the size of a data region, should it become too large.
It is also possible that insufficient room would be available for storing
the pointers to the string qualifiers, even though enough storage
would become available if the heap were collected separately.
In practice, this has not been a problem.
The source code for the garbage collector is contained in the files
\fMrt/gcollect.s\fR, \fMrt/gc.c\fR, and \fMrt/sweep.c\fR.
.NH 2
Coding Conventions
.PP
The calling conventions for functions and operators have been mentioned
earlier.
Several other aspects of the run-time system are explained here.
.PP
All header files for the run-time system are in the directory
\fMh\fR.
The file
\fMh/rt.h\fR
(or, for assembly-language routines,
\fMh/defs.s\fR)
is included by almost every source file in the run-time system, and contains
machine-dependent defined constants, run-time data structure declarations,
and defined constants and macros for flags,
type codes, argument accessing, and bit manipulations.
.PP
During the execution of an Icon program, many type
conversions are done on temporary values, where data storage is not
required beyond the bounds of the current operation.
For this reason, the type conversion routines all operate with pointers passed
to them that reference
buffers in the calling procedure.
Any routine calling for type conversion must determine if
heap or string space storage is needed, and perform the
allocation.
Most of the conversion routines return the type of the result or
\fMNULL\fR
if the conversion cannot be performed.
One exception is
\fMcvstr\fR
which, in addition to
\fMNULL\fR,
returns
\fM2\fR
if the object was already a string, and
\fM1\fR
if the object
had to be converted to a string.
This distinction makes it possible to avoid a large number of
predictive-need checks.
The second exception is
\fMcvnum\fR,
which returns either
real numbers or integers
and makes no attempt to distinguish between
short and long integers.
.PP
As mentioned in Section 3.3, there is space set aside to hold
temporary descriptors and to protect the validity of these
descriptors during garbage collection.
The garbage collector knows about this region, and
.I tends
it during storage reclamation.
The region is defined in the file
\fMh/gc.h\fR,
and is bounded by the labels
\fMtended\fR
and
\fMetended\fR.
This area can be referenced from C by considering
\fMtended\fR
to be an array of descriptors.
Since a garbage collection can occur
only during a call to
\fMsneed\fR
or
\fMhneed\fR,
or between suspension and reactivation,
the only places where C routines need to ensure that all pointers
into the heap or string space are tended
are just before calls to
\fMsneed\fR,
\fMhneed\fR,
or
\fMsuspend\fR.
.PP
All function names are preceded by the letter
\fMX\fR,
and their procedure blocks are preceded by the letter
\fMB\fR.
This prevents name collisions between Icon procedures and other routines,
such as those for operators, type conversions, and storage management.
Reference from the generated code to functions is made entirely through
the procedure block;
the entry point field of the procedure block references the function itself.
.NH
Modifying the Implementation
.PP
This section is intended to serve as a brief guide
for those who wish to modify the Icon system.
It is not comprehensive;
it only points to various parts of the
implementation that need to be considered
when making various kinds of changes.
.PP
Perhaps the most common kind of change that one might expect to make
is to add new functions (built-in procedures).
To add a function, first write it according to the conventions
described in Section 3.4.
(Use an existing function similar to the new one as a prototype.
Appendix F contains several example functions.) Be
especially careful to observe the rules
concerning storage allocation and tended descriptors.
.PP
.c note PI here
Prepare to add the new function to the run-time library
by moving the source code into the
\fMfunctions\fR
directory and adding its name to
\fMfunctions/Makefile\fR
(the name must be added in three places \(em
there are many examples already in the \fMMakefile\fR).
Then add the name to \fMh/pdef.h\fR
in proper alphabetical order.
Use other functions as a guide to the format of the entry.
.PP
When all changes have been made to the source code,
go to the Icon root directory and run
.Ds
make Icon
.De
This runs
\fMmake\fR
in each of the system directories \(em
\fMtran\fR,
\fMlink\fR,
\fMfunctions\fR,
\fMoperators\fR,
\fMlib\fR,
\fMrt\fR, and \fMiconx\fR
\(em and then copies the new versions into the
\fMbin\fR
directory [10].
.PP
Adding a new operator is more complicated.
Again, the first step is to write the routine, place it in the
\fMoperators\fR
directory, and add the appropriate entry to the
\fMMakefile\fR
there.
Next, the operator must be added to the translator, as follows:
.IP (1) \w'(1)'u+1n
Add the operator to the operator table in
\fMtran/optab\fR;
the structure of the table is described in Section 1.1.
.IP (2)
Create a unique name for the new token
and make a new token table entry in
\fMtran/tokens\fR
in the operators section of the table.
Although the operators section of the table is in alphabetical order
by token name
as distributed,
there is no need to preserve this order.
.IP (3)
If a running Version 5 of Icon is not available,
edit the files
\fMtran/optab.c\fR
and
\fMtran/toktab.c\fR
to correspond to the changes made in steps 1 and 2.
This sometimes involves a renumbering of token table entries
in both files (but nowhere else).
If a running Version 5 of Icon is available,
a
\fMmake\fR
in
\fMtran\fR
executes
\fMmktoktab\fR
to produce the new token tables.
.IP (4)
Add the operator to the grammar in
\fMtran/icon.g\fR.
The token name must be added to the list of terminal symbols
at the beginning of the grammar file,
and the operator must be inserted into the syntax at the
appropriate precedence level.
If the precedence is the same as that of an existing operator,
simply add the operator as an alternative to the existing production;
otherwise, insert a new production,
and change the production at the next lower precedence level
to refer to the new one.
The semantic action should create either a
.I BINOP
or a
.I UNOP
node in the parse tree;
use existing actions as a prototype.
.IP (5)
The new operator must now be added to the code generator in
\fMtran/code.c\fR.
Insert a case in either of the routines
\fMbinop\fR
or
\fMunop\fR
for the new token name
that assigns a new intermediate code opcode to
\fMname\fR,
as for other operators \(em
this causes the new opcode to be emitted into the ucode.
The opcode should have the same name as the library routine
that performs the operation.
.IP (6)
The new intermediate code opcode must also be added to the linker.
Add a defined constant to
\fMlink/opcode.h\fR;
order here is not important.
Then add the opcode name and the defined constant to
\fMlink/opcode.c\fR;
alphabetical order must be preserved here,
since a binary search is used.
Then edit the code generator in
\fMlink/lcode.c\fR,
adding a case in the routine
\fMgencode\fR
with either the binary or the unary operators.
The standard processing here emits code that
evaluates the operand(s),
then calls a library routine
with the same name as the intermediate code opcode.
.IP (7)
Add entries for the operator to \fMh/pnames.h\fR, using other operators as a guide.
.LP
The system is then be ready to be made as described above.
.PP
Adding a new control structure is similar in nature to
adding a new operator.
Most often, a new reserved word must be added to
\fMtran/tokens\fR;
this part of the token table must be kept in alphabetical order.
The new token must be added to the grammar,
and productions must be added,
usually at the highest precedence level
(the same as
\fMif\fR,
for example).
The semantic action for the new production probably will
involve creating a parse tree node of a new type.
The new node type should be added to
\fMtran/tree.h\fR
and a new case in the routine
\fMtraverse\fR
(in
\fMtran/code.c\fR)
should be added to generate intermediate code.
The intermediate code generated can use any of the existing
opcodes or can use new ones created specifically for the new control structure.
If new opcodes are created,
they must be added to the linker as described above,
and a new case in the routine
\fMgencode\fR
must generate code for it.
The generated code can be either entirely in-line
or can call a new library routine.
If new code generation templates are needed,
modify the code emission routines
in
\fMlink/lcode.c\fR.
If the code calls a new library routine,
add it to the
\fMlib\fR
directory and the
\fMMakefile\fR
there.
Then the system is ready to be made.
.PP
Modifying the semantics of existing control structures,
operators, or functions,
often involves changing only the generated in-line code
or a library routine.
Modifying the syntax without disturbing any semantics
usually requires only a change to the grammar.
.PP
Adding a new datatype means making many of the above changes.
A new datatype code must be added to
\fMh/rt.h\fR
and
\fMh/defs.s\fR,
and a new data block format must be defined in \fMh/rt.h\fR.
The size and location of the first descriptor of the new data block
must be entered in
\fMrt/dblocks.c\fR
so that the garbage collector knows how to treat the block.
The routines in
\fMfunctions/image.c\fR
and
\fMrt/outimage.c\fR
must be extended so that images of the new datatype can be produced.
The routines in \fMfunctions/type.c\fR and \fMrt/equiv.c\fR must
also be modified to account for the new type.
In addition, \fMrt/anycmp.c\fR must be extended so that objects
of the new type can be sorted relative to other types.
New functions and operators on the new type may be needed,
and possibly new coercion routines must be added to
\fMrt\fR.
.PP
Adding a new keyword entails a change to
\fMtran/keywords\fR
(and, if a running Version 5 of Icon is not available, to
\fMtran/keyword.h\fR)
and a new case in
\fMlib/keyword.c\fR.
A
\fMmake\fR
in
\fMtran\fR
runs the program
\fMmkkeytab\fR
to produce both
\fMtran/keyword.h\fR
and
\fMtran/keyword.c\fR.
Many keywords require trapped variables,
which requires changes to
\fMh/rt.h\fR,
\fMoperators/asgn.c\fR,
and
\fMrt/deref.c\fR;
the trapped variable for
\fM&random\fR
is a good model.
.PP
As mentioned above,
the examples in this section are intended
to identify what parts of the system are affected
by certain kinds of changes or extensions.
A thorough understanding of the system is
suggested, however, for other than minor changes.
.bp
.SH
Acknowledgements
.PP
This report is a revision of earlier descriptions of the C
implementation of Icon [11, 12], which were co-authored by Cary Coutant
and Steve Wampler.
Much of the material in this report is taken from the earlier ones.
.PP
Many features of the current implementation of Icon are based
upon the original Ratfor implementation by
Dave Hanson, Tim Korb, and Walt Hansen [13, 14].
.sp 1
.SH
References
.IP 1. 5n
Griswold, Ralph E. and Madge T. Griswold.
.I "The Icon Programming Language" .
Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1983.
.IP 2. 5n
Kernighan, Brian W., and Dennis M. Ritchie.
.I "The C Programming Language" .
Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1978.
.IP 3. 5n
Mitchell, William W. \fIPorting the UNIX Implementation of Icon\fR.
Technical Report TR 83-10d, Department of Computer Science, The
University of Arizona, Tucson, Arizona, August 1984.
.IP 4. 5n
Griswold, Ralph E., Robert K. McConeghy, and William H. Mitchell. \fIVersion 5.9 of Icon\fR.
Technical Report, Department of Computer Science, The University of
Arizona, Tucson, Arizona, July 1984.
.IP 5. 5n
Johnson, Stephen C.
``Yacc: Yet Another Compiler-Compiler''
.I "Unix Programmer's Manual, Seventh Edition" .
Bell Telephone Laboratories, Inc., Murray Hill, New Jersey, 1979.
.IP 6. 5n
Aho, A. V., and S. C. Johnson.
``LR Parsing''"
.I "Computing Surveys"
.B 6 ,
2 (June 1974), 99-124.
.IP 7. 5n
Hanson, David R.
``Storage Management for an Implementation of SNOBOL4'',
.I "Software\(emPractice and Experience"
.B 7 ,
2 (March 1977), 179-192.
.IP 8. 5n
Hanson, David R.
``Variable Associations in SNOBOL4'',
.I "Software\(emPractice and Experience"
.B 6 ,
2 (April 1976), 245-254.
.IP 9. 5n
Hanson, David R.
.I "The Manipulation of Variable-Length String Data in Fortran IV" .
Technical Report,
Department of Computer Science,
The University of Arizona, Tucson, Arizona,
May 1975.
.IP 10. 5n
Griswold, Ralph E. and William H. Mitchell. \fIInstallation and Maintenance
Guide for Version 5.9 of Icon\fR. Technical Report TR 84-13,
Department of Computer Science,
The University
of Arizona,
Tucson, Arizona. August 1984.
.IP 11. 5n
Coutant, Cary A. and Stephen B. Wampler. \fIA Tour Through the
C Implementation of Icon; Version 5\fR. Technical Report TR 81-11a,
Department of Computer Science, The University of Arizona, Tucson,
Arizona, December 1981.
.IP 12. 5n
Griswold, Ralph E., William H. Mitchell, and Stephen B. Wampler.
\fIThe C Implementation of Icon; A Tour Through Version 5\fR.
Technical Report TR 83-11a, Department of Computer Science,
The University of Arizona, Tucson, Arizona. December 1983.
.IP 13. 5n
Korb, John Timothy.
.I "The Design and Implementation of a Goal-Directed Programming Language" .
Ph.D. Dissertation, Technical Report TR 79-11,
Department of Computer Science,
The University of Arizona, Tucson, Arizona,
June 1979.
.IP 14. 5n
Hanson, David R., and Walter J. Hansen.
.I "Icon Implementation Notes" .
Technical Report TR 79-12a,
Department of Computer Science,
The University of Arizona, Tucson, Arizona,
February 1980.