Minix1.5/lib/string/READ_ME

Here is the latest revision of my string package, written in assembly for
PC's and compatibles using Minix.  This is virtually complete as far as
the draft ANSI C standard goes, except that strcoll() and strxfrm() are
not really implemented (i.e. they are really just front ends for strcmp()
and strncpy(), respectively).  Once again, the V7 and Berkeley compatibility
routines are not included, but a cdiff is provided for the 1.5.0 <string.h>
that defines them as (non-name-space-polluting) macros.  memccpy() is not
included, since I cannot find consistent documentation of what it does.

Henry Spencer's C routines are widely used, very reliable, very portable,
and are easily compiled into reasonably efficient code.  They cannot, of
course, take advantage of special architectural features, which Intel
processors possess in abundance.  If the best that could be done were a
10-20% improvement in the string code, which I would consider fairly
typical for assembly over C, I wouldn't consider it worthwhile.  But my
rewritten routines show much larger improvements for typical inputs -
from 40% to 95%, depending on the function.

The code that I write tends to use a lot of str[len|cpy|cmp]() and
mem[set|cpy]().  The improvements for these routines are substantial
enough that I use my assembly language versions.  The recent Dhrystone
1.1 posting by ast shows a 40% increase in Dhrystone rating on my
machine with these routines (only strcpy() and strcmp() are used there).

The code is faster for a number of reasons.  It uses special instructions
not generated by the C compiler, pays careful attention to register
contents, uses simplified linkage, unrolls most loops once, and takes
advantage of word alignment where possible.  The first three involve
fairly simple adaptations of Spencer's C code to the Intel architecture.
The last two are sometimes unpleasant.  Unrolling loops once saves 10-15%
in most cases; the attention to alignment saves 3-5% on top of that.  The
code is less clear (in some cases, MUCH less clear) and harder to debug,
but 20% is not to be sniffed at.
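
To illustrate the last two techniques, here is a rough sketch in C rather
than in the assembly this package actually uses.  The names sketch_strcpy()
and sketch_memset() are hypothetical and do not appear in the package; the
real routines differ in detail.  The first shows a loop unrolled once, so
that the backward branch is taken once per two bytes.  The second shows the
shape of alignment handling: an odd leading byte is stored on its own, after
which the body works on even addresses (in assembly the pair of byte stores
would be a single word store).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name: copy src to dst with the loop body unrolled once. */
char *sketch_strcpy(char *dst, const char *src)
{
    char *d = dst;

    for (;;) {
        if ((*d++ = *src++) == '\0')   /* first copy of the body */
            break;
        if ((*d++ = *src++) == '\0')   /* second (unrolled) copy */
            break;
    }
    return dst;
}

/* Hypothetical name: fill n bytes, getting onto an even address first. */
void *sketch_memset(void *s, int c, size_t n)
{
    unsigned char *p = s;
    unsigned char b = (unsigned char)c;

    /* Head: a single byte store if the buffer starts on an odd address. */
    if (n != 0 && ((uintptr_t)p & 1) != 0) {
        *p++ = b;
        n--;
    }
    /* Body: p is even here; the two byte stores stand in for one
       16-bit word store in the assembly version. */
    while (n >= 2) {
        p[0] = b;
        p[1] = b;
        p += 2;
        n -= 2;
    }
    /* Tail: at most one byte is left over. */
    if (n != 0)
        *p = b;
    return s;
}
```

A C compiler may or may not turn the paired stores into a word store, which
is part of why the package is written in assembly.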

The code was optimized on a Toshiba 5100, which has an 80386 and uncached
1-wait state 32 bit memory.  As a result, the code may not be optimal on
other machines.  I expect it to be quite good for 16 bit CPU's, with
perhaps slightly less improvement on 8088's, where the attention to
alignment is wasted.

I am open to bug reports and suggestions for improvement.  I am also
interested in reports of performance on other machines.  To that end, I
have included programs that compute the improvement for a variety of
routines automagically; please email the results to me along with a
description of your CPU and memory architecture.  I have included a
copy of the program's output for my machine in the file Perf.T5100;
I hope the improvements shown there will be typical.
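
For readers who want to check a single routine by hand, the following is a
minimal sketch of one way to time it.  It is not the program included in
this package.  The names time_strlen() and pct_improvement() are
hypothetical, and the sketch assumes that "improvement" means the percentage
reduction in CPU time; the included programs may compute their figures
differently.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: CPU seconds spent in reps calls of a
   strlen-like routine on the string s. */
double time_strlen(size_t (*fn)(const char *), const char *s, long reps)
{
    volatile size_t sink = 0;   /* keeps the calls from being optimized away */
    clock_t start = clock();
    long i;

    for (i = 0; i < reps; i++)
        sink += fn(s);
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Assumed definition: percentage reduction in time, old versus new. */
double pct_improvement(double old_secs, double new_secs)
{
    if (old_secs <= 0.0)
        return 0.0;             /* nothing measurable to compare against */
    return 100.0 * (old_secs - new_secs) / old_secs;
}
```

One would call time_strlen() once with the old routine and once with the new
one, using the same string and repetition count, and pass the two results to
pct_improvement().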

To use the new string package, check that the macro definitions at the
top of the makefile are compatible with your configuration.  By default,
the makefile generates the performance comparison with the existing library.
If you "make install", packed versions of the routines will be installed in
your library.  These routines are now compatible with the posted 1.5.0 headers.

Note also that the locale specific routines strcoll() and strxfrm() are
dependent on others in the library (strcoll() needs strcmp(); strxfrm() needs
strncpy() and strlen()).  Make sure they are placed appropriately in the
library.
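
The dependencies follow from these two routines being front ends.  A
minimal C sketch of what such front ends look like is given here; the
package's own versions are in assembly, and the names sketch_strcoll() and
sketch_strxfrm() are hypothetical, chosen so as not to clash with the real
library routines.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical name: collation that simply defers to strcmp(). */
int sketch_strcoll(const char *s1, const char *s2)
{
    return strcmp(s1, s2);
}

/* Hypothetical name: "transform" that copies at most n bytes with
   strncpy() and reports the full length of the source via strlen(). */
size_t sketch_strxfrm(char *dst, const char *src, size_t n)
{
    if (n != 0)                 /* dst may be a null pointer when n is 0 */
        strncpy(dst, src, n);
    return strlen(src);
}
```

As with strncpy(), the result in dst is not null-terminated when the source
is n bytes or longer; the return value tells the caller how much room a
complete copy would need.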


Norbert Schlenker (nfs@princeton.edu)