4.3BSD-Reno/src/usr.bin/grep/README/pep4grep.doc3

Subject: Get Hep to Kanji-Ready Five-Algorithm [ef]?grep

#  "I need very little,
      and of that little,
         I need very little."  -- St. Francis of Assisi

     Hybrid blitz 'egrep', whose urquell is a veritable chimaera of at least
five string search techniques, is now further tuned.

     Posted to USENET (and the mod.sources archive) some months ago, our
implementation of "plug-compatible" egrep.c has been extended to support:

	transparent 'grep' expression translation, to bring the speed of
	hyper-'egrep' to bear upon searches specified via the less capable
	'grep' syntax.

	interception of 'fgrep' for alternations of low (<= 7) cardinality,
	using a novel method of Boyer-Moore table superimposition and
	pre-computation magic.  the resulting speedup applies also to simple
	metacharacter-free 'egrep'-style alternations.

	(the above two improvements are made invisible by linking the
	grep/egrep/fgrep triumvirate.)

	a revised strategy of fallback to standard 'egrep' for hard
	cases, which eliminates costly popen() plumbing in favor of a
	statistically-based re-exec() dynamic.

	more complete application of fast match to standard options,
	including -n line numbering.

	preparation for Kanji pattern input, based upon parity-marked EUC
	codes.  new egrep.c is "eight-bit clean".  the fast algorithms
	unfortunately currently apply only to meta-free patterns and
	simple alternations; full Kanji regular expression treatment
	remains problematic.  however, the new code will pass difficult
	input through to [ef]?grep utilities in the UNIX Japan standard
	substrate.

     Kanji capability is perhaps the most important addition, as the
number of UNIX systems in the Orient proliferate, providing a "new market"
for Boyer-Moore-style search.  However, actual search efficacy remains
unknown until the Gaijin author gains feedback from JUNET or points beyond.
For all we know, Nippon text search utilities may already incorporate
the methods.  Tests conducted so far have been with artificial Kanji files.

     In case you are w(o|a)ndering, be reminded that no vendor source
changes are required to use the code.  It is configured as a turbo-charged
"front-end" to the existing section one commands, though it is (barely)
conceivable to adapt things, at a loss in worst-case efficiency, for
(heaven forfend!) non-Unix systems running C.  And, since we do not offer
a minimalist grep, do not expect it to run well on minimalist UNIX clones.

     Below appears a brief timing run on Webster's 2nd wordlist.  Notes
on implementation changes since original release follow in the next message.
If you want to skip the rest, fine.  The easy instructions -- download
from comp.sources [or 'anonymous' ftp of egrep.shar.Z from ames-aurora.arpa
for the (im)patient], and run:

	make
	sh eg.shell	# regression test amusement
	make install

after perusing README.FIRST.  Though the bundle in ~ftp/pub at NASA Ames
Research Center contains prerequisite support from Univ. of Toronto's
Henry Spencer, we are not re-posting regcomp()/regexec() to comp.sources.
John Gilmore of the GNU project has a modified regexec(), but it is not
mandatory for running the new egrep.

     Contrary to popular belief, old egrep is not I/O bound for large
large files on most machines.  The new version is.  One sample:

	time egrep 'u.*nix' /usr/dict/web2	(2.5 MB)
	  (yielding {Coturnix, Pseudophoenix, [Tt]urnix}), shows

		        user	 sys      real	 (load ave. < 1)

	VAX 11/750, 4.3 BSD, Fujitsu Eagles
	   (new)	6.8      3.8       11
	   (old)       70.0      5.5       87

	Sun 3/160, 3.2 OS, Eagle on SMD
	   (new)	1.7      2.2	    5
	   (old)       14.7      1.5       16

	Cray 2, Sys 5, no disk striping
	   (new)        .93      .11        1
	   (old)       8.07      .21        8

notes:  New egrep was actually run with -i option, but not old egrep.
Also, fumbling for three-character residue is not [ef]?grep's forte,
so the example is conservative.

Sun 3 has higher sys time for some unknown reason (a guess:  VAX 4.3 kernel
handles read() calls with oddsize buffers differently?).  Cray 2 reportedly
does disk I/O at 5-10 megabytes per second, but the architecture/compiler
is bad at byte addressing -- no cache, no vectors here.  Unfair comparison:
new egrep on Sun beats old egrep on Cray 2, even with fast Cray I/O!

Speculation:  the code might be useful on the Macintosh II, even if the Unix
filesystem (Sys 5?) were to waste 3/4 of the 1 MB/sec SCSI disk bandwidth.
Mac 2 testers please forward info to ames!jaw.

     Another metric is inner loop efficiency:

				# instructions
	VAX Berkeley cc			5
	Sun 68020 3.2 cc		6
	Stallman's GNU 68020 cc		4
	MIPS R2000			6
	Cray 2			       25

Thanks goes to mips!dce (David Elliott) for his testing time, as well as
to RMS for two-upping Sun's C compiler.

     Of course, if you have a Connection Machine dedicated to running their
admirable full-text keyworder on "in-core" text, you won't need [ef]?grep at
all.  And, for unindexed text on fine-grained parallel machines, reg. expr.
search algorithms can be made to run with a lower time bound (see the Hillis
book).  But if your files are on disk, even a specialized search chip won't help
once things become I/O or bus limited.  For this reason, vectorization on a
Cray(ette) would be a bust, though Cray buffs may write the author for other
scalar speedup ideas...

[continued]