4.4BSD/usr/src/usr.bin/grep/README/pep4grep.doc4

#  How long a time lies in one little word! -- Shakespeare, Richard II, I, iii

#  Fine words butter no parsnips. -- Southern proverb

		[ef]?grep Implementation Changes

'grep' r.e. translation:

     To buy speed for the novice 'grep' user who deigns not to learn the
extended 'egrep' syntax, we translate 'grep' r.e.'s to the 'egrep' superset.
It is straightforward enough to surround search patterns meaningful to
'egrep' but not to 'grep'.  Odd cases include the -w option, not implemented
in standard 'egrep', the defunct -y option, and "tagged expressions", which
are done via an exec() of /bin/grep.  Tagged exprs, like

	grep '\(....\).*\1' /usr/dict/words

which outputs chaff like "beriberi", "couscous", "hodgepodge", and
"lightweight", are weird.  The irregularity these exprs lend coupled with
a low complexity/utility ratio kept them from being part of 'egrep'.
But for this feature, old 'grep' code could be thrown away.

'fgrep' improvement / (partial) unification:

     In the new release, we trap low-complexity disjunctions such as

		egrep 'boyer|moore' file
or
		fgrep 'boyer\n
		moore' file

(or with "-f patfile" in place of the pattern) with a method to superimpose
the non-terminals within the Boyer/Moore table.  When scanning text backwards,
other programming tricks short-circuit some tests against the pattern.
Sparing further details, which might make for a more formal writeup, it
suffices to say that although worst-case complexity here is O(Rn) with string
length 'n', and R == r.e. size, average-case for text is still sublinear.  E.g.

	egrep 'silver|orange|purple'  	# hard-to-rhyme color test in eg.shell

looks at ~55000 chars in /usr/dict/words, whereas (three) separate invocations
of egrep on the individual color words make the code look at ~40000 bytes per
word.  Aho/Corrasick's 'fgrep', in contrast, must look at all 200KB in the
dictionary.  The elegant "trie" construction within "fgrep" excels, however,
for medium/large R.  An equally ambitious "reverse trie", supplementing our
extant "alternation mush", would attenuate worst-case behavior while preserving
low R speedup.  We save the addition for another day.

     Since the syntax for [ef]grep is similar, we thought of making egrep
hand off to fgrep for sufficiently large metacharacter-free R, as there is no
strong reason to make the user conscious of the separate algorithms.  Certain
technicalities prevent this.  For one, we are not willing to invent an 'egrep'
option to inform the code to interpret a file of (say a hundred) word
alternatives containing some innocent metacharacter, that it is literal
'fgrep' input, rather than a closure-containing 'egrep' pattern which would
otherwise make egrep explode.  More work could be done here.

     Our motivation?  Is this not all overblown?  Perhaps, but now you can
build a simple fast "NSA filter", or search for the seven dwarfs at leisure.
Besides, the final nail needed to be driven into 'bm/match', which tried
to do parallel match, but actually shuffled things out of order during its
simplistic block-based scheme.  These programs, part of source archive also,
are now historical curiosities.

Kanji egrep:

     Copious notes are in README.kanji.mods.  The March 1987 Unix Review
was indispensable for pointing out the embedded "SS2" Katakana pitfalls.
The modularity of our code as a semi-dependent filter was necessary for this
exploration, as we have no access to AT&T/Unisoft Kanji code.  Again, JUNET
or Sigma project people -- please respond with grep war stories or usage notes.

Worst-case r.e. handling:

     The first code release elaborately called upon a function mcilroy()
to pipe partial match output to old egrep for tough expressions, whose
backtracking might swamp regexp().  Some details of the DFA/NDFA tradeoff
were discussed in pep4grep.doc[12].  Due largely to feedback from John Gilmore
of the GNU project, the strategy was revised.  egrep.c function kernighan()
("let others do the hard part") now usurps calls to costly popen() by invoking
exec() on old egrep when necessary.  Rough partial match statistics gathered
on the fly determine the handoff.  You may revise the time reported previously
for
	egrep 'hoe.*g' /usr/dict/words

from 1.2 user, 1.1 sys seconds (VAX 11/750, 4.3BSD, Fuji disks) to 0.8u, 0.4s.
For those public-spirited souls who really want to build a PD egrep out of
what we offer, sans fallback from regexp() to an AT&T /usr/bin/egrep, the
slippery test "egrep 'g+h+o' /usr/dict/words" will prove enlightening.

Faster -n option:

     By popular demand.  Though Boyer/Moore techniques subvert line numbering,
we've made it faster with brute force (loop unrolling helps VAXEN, but not
CRISPS).  Timing tests for this and other options appear in the eg.shell script.

Not so fast:

	-v, -b, -w, various r.e.'s with no rexexp() "residue"

(you'll still have to use the layered "grep host /etc/hosts | grep -w host"
for speed.)

Other contra-indications for new [ef]?grep:

	Monster patterns

     The amazing expressions output by /usr/lib/calendar still beg for
the lazy evaluation technique rolled into edition 8 egrep by Prof. Aho of
Princeton.  Hinted at on p. 105 in Kernighan & Pike, lazy evaluation reduces
standard egrep r.e. compile time.  Here the possible O(R**2) machine
construction cost is eliminated to amortize complexity at run-time and 
shifted to such only if a bad match actually happens.  Whew!  Fortunately,
this is not so important for simple r.e. fare, where H. Spencer's regexp()
works well, but it certainly helps calendar(1).

     The catch with lazy eval. is that it slows down simple matching (15-20%
for /usr/dict/words on VAX), so it hasn't been adopted by System V egrep.
Note that our egrep, deferring to the underlying one in /usr/bin, doesn't
care much about these hideous beasts -- it just doesn't do better on them.
However, [ef]?grep does well by the Kadhafi matcher (eg.shell, again).

	Long lines, small alphabets

     Finally, a comment on one rapidly burgeoning application area
where new egrep should not be blindly proscribed -- genome sequencing.
Though line limits have been raised (to 8192 byte buffers), much of
GENBANK has no newlines.  The code would need modification for scanning
freestyle.  Also, locating ACGT sequences with the current "superimposed BMG"
over a 4-letter alphabet might actually be worse, but the global homology
crowd probably uses a >20 letter protein alphabet (for other reasons).
At any rate, genetic string search generally relies on more sophisticated
methods such as dynamic programming ala Sankoff/Kruskal.

     On the other hand, large alphabets such as Kanji probably help
performance.  As do parallel transfer disks, MACH file mapping, ...
Your suggestions welcome.

     James A. Woods (ames!jaw)
     NASA Ames Research Center

P.S.  Preserving author credit, [ef]?grep may be redistributed as you wish.