Subject: Get Hep to Kanji-Ready Five-Algorithm [ef]?grep # "I need very little, and of that little, I need very little." -- St. Francis of Assisi Hybrid blitz 'egrep', whose urquell is a veritable chimaera of at least five string search techniques, is now further tuned. Posted to USENET (and the mod.sources archive) some months ago, our implementation of "plug-compatible" egrep.c has been extended to support: transparent 'grep' expression translation, to bring the speed of hyper-'egrep' to bear upon searches specified via the less capable 'grep' syntax. interception of 'fgrep' for alternations of low (<= 7) cardinality, using a novel method of Boyer-Moore table superimposition and pre-computation magic. the resulting speedup applies also to simple metacharacter-free 'egrep'-style alternations. (the above two improvements are made invisible by linking the grep/egrep/fgrep triumvirate.) a revised strategy of fallback to standard 'egrep' for hard cases, which eliminates costly popen() plumbing in favor of a statistically-based re-exec() dynamic. more complete application of fast match to standard options, including -n line numbering. preparation for Kanji pattern input, based upon parity-marked EUC codes. new egrep.c is "eight-bit clean". the fast algorithms unfortunately currently apply only to meta-free patterns and simple alternations; full Kanji regular expression treatment remains problematic. however, the new code will pass difficult input through to [ef]?grep utilities in the UNIX Japan standard substrate. Kanji capability is perhaps the most important addition, as the number of UNIX systems in the Orient proliferate, providing a "new market" for Boyer-Moore-style search. However, actual search efficacy remains unknown until the Gaijin author gains feedback from JUNET or points beyond. For all we know, Nippon text search utilities may already incorporate the methods. Tests conducted so far have been with artificial Kanji files. In case you are w(o|a)ndering, be reminded that no vendor source changes are required to use the code. It is configured as a turbo-charged "front-end" to the existing section one commands, though it is (barely) conceivable to adapt things, at a loss in worst-case efficiency, for (heaven forfend!) non-Unix systems running C. And, since we do not offer a minimalist grep, do not expect it to run well on minimalist UNIX clones. Below appears a brief timing run on Webster's 2nd wordlist. Notes on implementation changes since original release follow in the next message. If you want to skip the rest, fine. The easy instructions -- download from comp.sources [or 'anonymous' ftp of egrep.shar.Z from ames-aurora.arpa for the (im)patient], and run: make sh eg.shell # regression test amusement make install after perusing README.FIRST. Though the bundle in ~ftp/pub at NASA Ames Research Center contains prerequisite support from Univ. of Toronto's Henry Spencer, we are not re-posting regcomp()/regexec() to comp.sources. John Gilmore of the GNU project has a modified regexec(), but it is not mandatory for running the new egrep. Contrary to popular belief, old egrep is not I/O bound for large large files on most machines. The new version is. One sample: time egrep 'u.*nix' /usr/dict/web2 (2.5 MB) (yielding {Coturnix, Pseudophoenix, [Tt]urnix}), shows user sys real (load ave. < 1) VAX 11/750, 4.3 BSD, Fujitsu Eagles (new) 6.8 3.8 11 (old) 70.0 5.5 87 Sun 3/160, 3.2 OS, Eagle on SMD (new) 1.7 2.2 5 (old) 14.7 1.5 16 Cray 2, Sys 5, no disk striping (new) .93 .11 1 (old) 8.07 .21 8 notes: New egrep was actually run with -i option, but not old egrep. Also, fumbling for three-character residue is not [ef]?grep's forte, so the example is conservative. Sun 3 has higher sys time for some unknown reason (a guess: VAX 4.3 kernel handles read() calls with oddsize buffers differently?). Cray 2 reportedly does disk I/O at 5-10 megabytes per second, but the architecture/compiler is bad at byte addressing -- no cache, no vectors here. Unfair comparison: new egrep on Sun beats old egrep on Cray 2, even with fast Cray I/O! Speculation: the code might be useful on the Macintosh II, even if the Unix filesystem (Sys 5?) were to waste 3/4 of the 1 MB/sec SCSI disk bandwidth. Mac 2 testers please forward info to ames!jaw. Another metric is inner loop efficiency: # instructions VAX Berkeley cc 5 Sun 68020 3.2 cc 6 Stallman's GNU 68020 cc 4 MIPS R2000 6 Cray 2 25 Thanks goes to mips!dce (David Elliott) for his testing time, as well as to RMS for two-upping Sun's C compiler. Of course, if you have a Connection Machine dedicated to running their admirable full-text keyworder on "in-core" text, you won't need [ef]?grep at all. And, for unindexed text on fine-grained parallel machines, reg. expr. search algorithms can be made to run with a lower time bound (see the Hillis book). But if your files are on disk, even a specialized search chip won't help once things become I/O or bus limited. For this reason, vectorization on a Cray(ette) would be a bust, though Cray buffs may write the author for other scalar speedup ideas... [continued]