Here is the latest revision of my string package, written in assembly for PCs and compatibles running Minix. This is virtually complete as far as the draft ANSI C standard goes, except that strcoll() and strxfrm() are not really implemented (i.e. they are really just front ends for strcmp() and strncpy(), respectively). Once again, the V7 and Berkeley compatibility routines are not included, but a cdiff is provided for the 1.5.0 <string.h> that defines them as (non-name-space-polluting) macros. memccpy() is not included, since I cannot find consistent documentation as to its function.

Henry Spencer's C routines are widely used, very reliable, very portable, and are easily compiled into reasonably efficient code. They cannot, of course, take advantage of special architectural features, which Intel processors possess in abundance. If the best that could be done were a 10-20% improvement in the string code, which I would consider fairly typical for assembly over C, I wouldn't consider it worthwhile. But my rewritten routines show much larger improvements for typical inputs: from 40% to 95%, depending on the function.

The code that I write tends to use a lot of str[len|cpy|cmp]() and mem[set|cpy](). The improvements for these routines are substantial enough that I use my assembly language versions. The recent Dhrystone 1.1 posting by ast shows a 40% increase in Dhrystone rating on my machine with these routines (only strcpy() and strcmp() are used there).

The code is faster for a number of reasons. It uses special instructions not generated by the C compiler, pays careful attention to register contents, uses simplified linkage, unrolls most loops once, and takes advantage of word alignment where possible. The first three involve fairly simple adaptations of Spencer's C code to the Intel architecture. The last two are sometimes unpleasant. Unrolling loops once saves 10-15% in most cases; the attention to alignment saves 3-5% on top of that.
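
As a rough illustration of the last two techniques (unrolling once and word alignment), here is a sketch in C rather than in the posted assembly. The name sketch_memset() is hypothetical and is not one of the routines in the package. The two-byte memcpy() calls stand in for the 16-bit word stores that an assembly version would presumably issue directly; they are used here only to keep the C portable.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration only; not one of the posted routines. */
void *sketch_memset(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    unsigned char b = (unsigned char)c;
    /* The fill byte duplicated into both halves of a 16-bit word. */
    uint16_t w = (uint16_t)(b | (b << 8));

    /* Alignment: store one leading byte if p sits on an odd address. */
    if (n > 0 && ((uintptr_t)p & 1u) != 0) {
        *p++ = b;
        n--;
    }

    /* Main loop unrolled once: two word-sized stores per iteration,
       so the loop test and branch are paid half as often. */
    while (n >= 4) {
        memcpy(p, &w, 2);
        memcpy(p + 2, &w, 2);
        p += 4;
        n -= 4;
    }

    /* Tail: at most three bytes remain. */
    while (n > 0) {
        *p++ = b;
        n--;
    }
    return dst;
}
```

The extra leading and trailing cases are where the loss of clarity comes from: a plain byte loop needs none of them.
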
The code is less clear (in some cases, MUCH less clear) and harder to debug, but 20% is not to be sniffed at.

The code was optimized on a Toshiba 5100, which has an 80386 and uncached 1-wait-state 32-bit memory. As a result, the code may not be optimal on other machines. I expect it to be quite good for 16-bit CPUs, with perhaps slightly less improvement on 8088s, where the attention to alignment is wasted.

I am open to bug reports and suggestions for improvement. I am also interested in reports of performance on other machines. To that end, I have included programs that compute the improvement for a variety of routines automagically; please email the results to me along with a description of your CPU and memory architecture. I have included a copy of the program's output for my machine in the file Perf.T5100; I hope the improvements shown there will be typical.

To use the new string package, check that the macro definitions at the top of the makefile are compatible with your configuration. By default, the makefile generates the performance comparison with the existing library. If you "make install", packed versions of the routines will be installed in your library. These routines are now compatible with the posted 1.5.0 headers. Note also that the locale-specific routines strcoll() and strxfrm() depend on others in the library (strcoll() needs strcmp(); strxfrm() needs strncpy() and strlen()). Make sure they are placed appropriately in the library.

Norbert Schlenker (nfs@princeton.edu)