4.4BSD/usr/src/share/doc/index/index.4.3/README

The process of putting together the index:

A. For the 4.3 distribution and the first Usenix printing:

1. The Indexer (a hairy lisp program) was run on the inputs,
which consisted of troff -a documents and troff -a man pages (many files)
that had been specially marked up mechanically so that the page numbers
carried TeX-like tags.   The run took more than a day for 3500 pages...

(The program runs on a Symbolics lisp machine; it was a collaborative
effort among several programmers at Thinking Machines Corporation in 
Cambridge, MA.)  There are a few knobs to adjust the sort of indexing
considered significant, but several passes over the index and over the
documents were necessary to weed out badly chosen occurrences of
index-worthy terms.  This took weeks of human time.

After that there was a final pass of sed/emacs editing to make things barely
reasonable enough to publish, resulting in a file with the following content:
each line contains one occurrence of an index term considered worthy of
indexing, the name of the file containing it, a not-too-useful
percentage-within-the-file for the occurrence, and the page number within
the document.

For example (some blanks compressed out for readability):
"here" shell scripts				 CSH.1:  40% CSH(1)-8
#define						 CC.1:  46% CC(1)-2
    "						 F77.1:  28% F77(1)-1
    "						 SMM.19:  96% SMM:19-26
    "						 PS1.01:  77% PS1:1-26
#else						 PS1.01:  82% PS1:1-27
#endif						 PS1.01:  82% PS1:1-28
#if						 PS1.01:  80% PS1:1-27
#if !defined					 PS1.01:  81% PS1:1-27
#ifdef						 CTAGS.1:	94%
    "						 PS1.01:  81% PS1:1-27
#ifdefs, removing				 UNIFDEF.1:   2% UNIFDEF(1)-1
#ifndef						 PS1.01:  81% PS1:1-27
#include					 F77.1:  29% F77(1)-1
    "						 PS1.01:  79% PS1:1-27
#include files					 HIER.7:  46% HIER(7)-4
#include path					 CC.1:  50% CC(1)-2
#line						 PS1.01:  82% PS1:1-28
#undef						 PS1.01:  79% PS1:1-26

Wrong page numbers can result from initial pages with the page number in a
footer (usually only on the first page of -me documents, so these references
appear to be on page 0).  Bad page numbers can also result in the
unfortunate case where, for some reason, the ASCII approximation that was
indexed is not exactly what ended up being printed (or was printed on a
different printer).  In that case they're often "almost right",
usually off by a page toward the end of the document.

It was also necessary to index the documents in several batches, because the
volumes would freeze individually, even though this meant duplicating some
human work.  The result is several output files which must be consolidated by
sorting and merging.  Notice that the presence of the ditto marks
(representing additional occurrences of duplicate index terms, and intended
as a convenience to the reader) makes sorting and merging difficult.

2. The duphead program replaces the ditto marks in any indexes
with headers duplicated from the header line above them. 
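
The distributed duphead program is not reproduced here; as a sketch of the
idea, a small shell script along these lines would do the same job, copying
the term from the nearest header line above onto each ditto line so that
sort(1) sees a complete key:

    #!/bin/sh
    # duphead sketch: remember the index term from each header line and
    # substitute it for the ditto mark on the lines that follow it.
    awk '
        $1 != "\"" {                    # header line
            term = $0
            sub(/\t.*/, "", term)       # keep everything before the first tab
            print
            next
        }
        {                               # ditto line
            sub(/^[ \t]*"[ \t]*/, term "\t")
            print
        }
    ' "$@"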

3. The resulting form of the index is then split (according to the first
character of the index term) into 27 files (one for each letter and one for
everything else), then sorted, then merged, using splitindex.sh.  (This
stratagem is useful for shortening the sorting time.)
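
The strategy, sketched as a shell script (this is not the distributed
splitindex.sh; the bucket.* and index.sorted names are made up):

    #!/bin/sh
    # splitindex sketch: bucket the duplicated-header index files named on
    # the command line by the first character of each line, sort the 27
    # small buckets separately, and paste them back together in order.
    rm -f bucket.*
    awk '{
        c = substr($0, 1, 1)
        if (c >= "A" && c <= "Z") c = tolower(c)
        if (c < "a" || c > "z") c = "other"
        print > ("bucket." c)
    }' "$@"
    for f in bucket.*
    do
        sort -f -o $f $f
    done
    cat bucket.other bucket.[a-z] > index.sorted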

4. The unduphead program replaces the duplicated heads with ditto marks again.
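
Again as a sketch of the idea rather than the program itself: once the file
is sorted, a term that repeats the one on the line above can be folded back
into a ditto mark.

    #!/bin/sh
    # unduphead sketch: print each index term in full only the first time
    # it appears; subsequent lines with the same term get a ditto mark.
    awk '
        {
            term = $0; sub(/\t.*/, "", term)
            if (term == prev) {
                rest = $0; sub(/^[^\t]*\t+ */, "", rest)
                printf("  \"    %s\n", rest)
            } else
                print
            prev = term
        }
    ' "$@"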

One unsolved problem with this coalescing procedure:
the indexer has some knowledge of singular and plural rules in English,
and tries to reduce all plurals to their singular form, which has caused some
amusing bugs (e.g. /sys is obviously a plural term, the singular of which
it believes must be /sie...).  The unduphead procedure has no such knowledge,
so if an entry is listed as both "filesystem" and "filesystems", or, even
worse, "file systems", these remain separate entries.  Anyway, such entries
need to be combined manually.

So now the index looks like this:
  
"here" shell scripts		CSH.1:  40% CSH(1)-8
#define		CC.1:  46% CC(1)-2
  "    F77.1:  28% F77(1)-1
  "    PS1.01:  77% PS1:1-26
  "    PS1.02:	7% PS1:2-5
  "    PS1.18:	31% PS1:18-8
  "    SMM.19:  96% SMM:19-26
#else		PS1.01:  82% PS1:1-27
#endif		PS1.01:  82% PS1:1-27
#if		PS1.01:  80% PS1:1-27  
#if !defined		PS1.01:  81% PS1:1-27
#ifdef		CTAGS.1:	94%
  "    PS1.01:  81% PS1:1-27 
#ifdefs, removing		UNIFDEF.1:   2% UNIFDEF(1)-1
#ifndef		PS1.01:  81% PS1:1-27
#include		F77.1:  29% F77(1)-1
  "    PS1.01:  79% PS1:1-27
  "    PS1.04:  79% PS1:4-40 
#include files		HIER.7:  46% HIER(7)-4  

5. The reduce program condenses the information in consecutive entries with
the same index term, using heuristically derived rules and some knowledge about
the documents.  For example, references to most of a document's pages
will be reduced to a single reference to the entire document;
individual references to contiguous pages will be reduced to a reference to
a range of pages, and ranges will be combined even across single-page gaps, 
etc.  (The guiding belief is that if there are many references to a term in 
a document it is probably "about" that term in a global sense.)
Reduce also groups together the references within published volumes and
outputs them in volume order (User man pages together, User supplementary
docs together, etc.)  Reduce also inserts troff macros, which are
defined in index.macs.  
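
The real reduce program is too involved to show here, but the core of its
range heuristic amounts to something like the following toy sketch, which
reads one page number per line and prints collapsed ranges (the knowledge
about documents and volumes, and the troff output, are omitted):

    #!/bin/sh
    # reduce sketch (range heuristic only): collapse sorted page numbers
    # into first-last ranges, combining runs even across a one-page gap.
    sort -n | awk '
        NR == 1 { first = last = $1; next }
        $1 <= last + 2 { last = $1; next }      # contiguous or one-page gap
        {
            out = (first == last) ? first : first "-" last
            print out
            first = last = $1
        }
        END {
            if (NR > 0) {
                out = (first == last) ? first : first "-" last
                print out
            }
        }
    '

Fed the pages 26, 27, and 29, for instance, it prints 26-29.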

Macro and defined symbol usage:
.L is used for large headings, which appear only at the beginning of a 
letter or other index category.
.X is used for each index entry.
The aliased symbol \*(tx is a dash; this hack is the only way we could think
of to have a dash appear but not cause it to be recognized by troff's
hyphenation routines, which would normally break index entries across
the dash.  This way an individual entry will always be within the same
line.

6. The output now looks something like the following, and is ready to print
using print.sh (which merely supplies the right troff argument magic):
.L "4.3BSD System Index"
.L "special characters and numbers"
.X "#define"
cc(1)\*(tx2, f77(1)\*(tx1, \s-1PS1\s0:1\*(tx26, \s-1PS1\s0:2\*(tx5, \s-1PS1\s0:18\*(tx8, \s-1SMM\s0:19\*(tx26
.X "#else"
\s-1PS1\s0:1\*(tx27
.X "#endif"
\s-1PS1\s0:1\*(tx27
.X "#if"
\s-1PS1\s0:1\*(tx27 
.X "#if !defined"
\s-1PS1\s0:1\*(tx27
.X "#ifdef"
ctags(1), \s-1PS1\s0:1\*(tx27 
.X "#ifdefs, removing"
unifdef(1)
.X "#ifndef"
\s-1PS1\s0:1\*(tx27
.X "#include"
f77(1)\*(tx1, \s-1PS1\s0:1\*(tx27, \s-1PS1\s0:4\*(tx40 
.X "#include files"
hier(7)\*(tx4 
.X "#include path"
cc(1)\*(tx2
.X "#line"
\s-1PS1\s0:1\*(tx28
.X "#undef"
\s-1PS1\s0:1\*(tx26
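
print.sh itself is not shown here; its general shape is presumably something
like this (index.troff stands for whatever the generated index file is
called, and the spooler options are a guess):

    #!/bin/sh
    # print.sh sketch: format the finished index with its macro file and
    # hand the result to the print spooler.
    troff index.macs index.troff | lpr -t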

B. The future.
1. The index terms need to be inserted back into the troff sources
for the documents.  The current scheme is to use a macro which writes
the index term and the current output byte count and page offset to a
separate file descriptor.  This will enable documents to evolve without
losing the indexing work.

2. An index of page numbers and byte offsets in troff -a needs to be
made.  This can be used for fast access to individual pages of the 
formatted documents for browsing.
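
As a sketch of how such an index might be built, assuming each page in the
troff -a output can be recognized by some marker line (the PAGEMARK pattern
below is a placeholder; the real marker depends on the troff and macros
used):

    #!/bin/sh
    # page-offset index sketch: emit "page-number byte-offset" pairs for a
    # formatted document, for later use by a browser that seeks to pages.
    PAGEMARK='^<beginning of page>'             # placeholder pattern
    awk -v mark="$PAGEMARK" '
        $0 ~ mark { printf("%d %d\n", ++page, offset) }
        { offset += length($0) + 1 }            # + 1 for the newline
    ' document.txt                              # hypothetical file name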

3.  A browser needs to be written to take advantage of the possible views
of documents.

At least these views should be accommodated:
- index view
- table of contents view
- apropos view

What is a view?  It's an editable (at least searchable and scrollable) list
of index terms (and how it's organized is a matter of taste and religion)
and a list of pointers to places in documents.  The browser needs to
understand only the pointers, not how the index is organized.

Other requirements:
The browser should keep a list of places in the index where the user has
been, as well as a list of places in documents.  It should be easy to return
to places visited recently.

The browser should run on bitmap displays with full function, and on
curses-supporting asynch terminals with reduced function.