V10/vol2/index/sandt.notes

Indexing Vol2

47 ``unrelated'' papers -> 40
724 pages -> 617+index
troff, tex, monk

Term Generation
titles & headings - with stop list for Introduction, Conclusions etc
repeated noun phrases
distinguished words/distinguished word+next
	.I
	.CW
	.UL
code for:	single entry only
		global entry only
		font change
result: paper.terms
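
Sketch of the distinguished-word step, in Python rather than whatever was actually used.
Assumes the font macros carry the word as their first argument on the same line;
the function name and the (macro, word) output shape are made up for illustration.

```python
import re

# Matches .I, .CW and .UL requests that have at least one argument.
FONT_MACRO = re.compile(r'^\.(I|CW|UL)\s+(.*)$')

def distinguished_terms(lines):
    """Return (macro, word) pairs for candidate index terms."""
    terms = []
    for line in lines:
        m = FONT_MACRO.match(line)
        if not m:
            continue
        macro, rest = m.group(1), m.group(2).strip()
        if not rest:
            continue
        if rest.startswith('"'):
            # A quoted first argument is taken whole.
            end = rest.find('"', 1)
            word = rest[1:end] if end != -1 else rest[1:]
        else:
            # Otherwise only the first word is the distinguished one.
            word = rest.split()[0]
        if word:
            terms.append((macro, word))
    return terms
```

Keeping the macro name with the word is one way to carry the font-change code forward.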

problems:
	single letter commands: sam a, sed s
	common words that are commands: anim again
	commands that match parts of words: uucp pack (not package), con

	partial solution: CW a, I pack
	but would like to match terms in tables, examples & pictures

Phrase finding
diction - fgrep for English with longest match
	works on ``sentences'' not lines (.?!)
	maps uppercase to lower, some punctuation to space
	in terms: map upper to lower, substitute something for terms with .
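
Sketch of longest-match phrase finding over sentences, in Python.
This is not diction itself, only an illustration of the idea; all names are hypothetical,
and terms containing . are not handled here.

```python
import re

def sentences(text):
    # Split on . ? ! so a phrase can match across a line break.
    return [s for s in re.split(r'[.?!]', text) if s.strip()]

def words(s):
    # Fold case and map punctuation and whitespace to single spaces.
    return re.sub(r'[^a-z0-9]+', ' ', s.lower()).split()

def find_phrases(text, terms):
    """Return matched terms in order, preferring the longest match at each position."""
    term_set = {tuple(words(t)) for t in terms}
    term_set.discard(())
    longest = max((len(t) for t in term_set), default=0)
    found = []
    for sent in sentences(text):
        w = words(sent)
        i = 0
        while i < len(w):
            # Try the longest candidate first, then shorter ones.
            for n in range(min(longest, len(w) - i), 0, -1):
                if tuple(w[i:i + n]) in term_set:
                    found.append(' '.join(w[i:i + n]))
                    i += n
                    break
            else:
                i += 1
    return found
```

Note that the output is already lowercased, which is where case gets lost.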

result: linenumber term1
	term2
	term3
	...
	where case is lost in term?
restore case & terms with .
run paper.sed to remove paper dependencies
	i.e.	monk & tex font changes
		paper specific string definitions
		kludges to avoid matching problems (CW a etc)
		add some consistency between papers (DMD5620)

using line number as an approximation insert
	.Tm phrase or \index{phrase}
in a safe place in the paper
	i.e. not inside .TS/.TE .PS/.PE
	(for tex s/\t/<tab>/ in phrases)
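
Sketch of the safe-place insertion for the troff case, in Python.
Assumes hits arrive as (line number, phrase) pairs and that .TS/.TE and .PS/.PE
are the only regions to avoid; the function name is hypothetical.

```python
def insert_index_requests(lines, hits):
    """Insert '.Tm phrase' after each hit line, deferring past tables and pictures.

    hits: list of (line_number, phrase), line numbers 1-based.
    """
    by_line = {}
    for lineno, phrase in hits:
        by_line.setdefault(lineno, []).append(phrase)

    closers = {'.TS': '.TE', '.PS': '.PE'}
    out, pending, waiting_for = [], [], None
    for n, line in enumerate(lines, start=1):
        out.append(line)
        pending.extend(by_line.get(n, []))
        req = line[:3]
        if waiting_for is None and req in closers:
            waiting_for = closers[req]      # entering a table or picture
        elif waiting_for is not None and req == waiting_for:
            waiting_for = None              # just left it
        if waiting_for is None:
            out.extend('.Tm ' + p for p in pending)
            pending = []
    # Flush anything left if the input ended inside an unclosed region.
    out.extend('.Tm ' + p for p in pending)
    return out
```

A phrase found inside a table ends up just after .TE, so its page number can be off
when the table crosses a page boundary.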

run troff/tex/monk to produce file with page numbers
result:
	page number<tab>phrase #.*

create file of
	phrase:papername:page_number
	papername:phrase:page_number
	or just 1 of above if single/global only

sort by phrase, accumulating page numbers and tagging with font info
result: paper.ind
	con, ipc, 536	c
	ipc, con, 536	c
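
Sketch of the sort-and-accumulate step, in Python.
Assumes input records of the form first:second:page_number as above;
the font tag is left out, and the function name is hypothetical.

```python
from collections import defaultdict

def accumulate(records):
    """Merge records with the same two terms into one line of sorted page numbers."""
    pages = defaultdict(set)
    for rec in records:
        # Split from the right so a ':' inside the first term survives.
        first, second, page = rec.rsplit(':', 2)
        pages[(first, second)].add(int(page))
    return ['%s, %s, %s' % (a, b, ', '.join(str(n) for n in sorted(p)))
            for (a, b), p in sorted(pages.items())]
```

Using a set drops duplicate hits on the same page.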

Making the index
get page ranges for each paper
trash any \\f* stuff that remains (breaks sort)
sort */*.ind
put size changes around strings of caps (with numbers $ / etc)
for multiple lines with the same first term
	if term in papernames - put out papername in italic & page ranges
	then only put out 2nd term & pages - arranging for CW font where coded
	putting line breaks & indentation in appropriate places for long lines
results: lots better than commercial index with v7 vol2
	about the right length - bwk books .015, Bentley pearls .05, gerard .01, this is .012
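
Sketch of the size-change step for strings of caps, in Python.
The one-point reduction (\s-1 ... \s+1) and the exact character class are assumptions;
the notes do not say what size change or pattern was used.

```python
import re

# A capital followed by one or more capitals, digits, '$' or '/'.
CAPS_RUN = re.compile(r'\b[A-Z][A-Z0-9$/]+')

def shrink_caps(entry):
    """Wrap runs of capitals in troff size changes (assumed to be one point)."""
    return CAPS_RUN.sub(lambda m: '\\s-1' + m.group(0) + '\\s+1', entry)
```

An ordinary capitalized word such as Typesetting is left alone, since only its first letter is a capital.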

problems
	folding uppercase to lowercase destroys ability to recognize case where it matters
		(sam d D commands)
	lack of consistency between papers
		typesetter, typesetting etc
		-ms vs ms
	would like some terms to be multiply entered or qualified
		Typesetting Mathematics
		Mathematics, Typesetting
	papers that explain by example hard to index
	usefulness of headings varies greatly among the papers -
		for some they're useless, for others they're really all that's
		needed
	some terms simply not literally in the papers
	it's impossible to tell what terms are missing