[TUHS] Command line options and complexity

Wed Mar 11 03:38:23 AEST 2020

On Tue, Mar 10, 2020 at 12:16 PM Doug McIlroy <doug at cs.dartmouth.edu> wrote:

> > The idea of a simple rule is great, but the suggested rule fails on
> > sort -u which afaik came after sort | uniq for performance reasons.
>
> As the guilty party for most of sort's comparison options, I can
> attest that efficiency was not an objective of -u. It was invented
> precisely because uniq had proved useful, but not when one was
> interested in uniqueness only of some key aspect of the data.
>
> -u differs from uniq in that -u selects samples based on
> equality of keys, not equality of lines. In the default
> case of whole-line keys, sort -u of course does exactly
> what sort|uniq does.
>
> For many applications of -u with keys, the non-key fields
> are not of interest. Then sed s/nonkeys//|sort|uniq may
> suffice. But sed did not exist when -u was invented.
> And not all sort key specs are easily imitated in sed.
>

This raises questions of stability: when keys are non-unique and the
sortable data also has non-key fields, which "records" (lines) are kept and
which are discarded? Surely the "first" is kept and subsequent entries with
the same key are suppressed. I confess I don't know enough about the
internals of sort to know even what algorithm it uses (I assume a disk-based
merge sort?), and I would imagine these details have changed over time.
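
To make the question concrete, here is a small sketch using made-up
records (the data and variable names are purely illustrative, and it
assumes a POSIX-style sort, uniq and sed). It shows that sort -u and
sort | uniq agree on whole-line keys, and that with a key on field 1 only
one line per key survives. Which line that is will depend on the
implementation, so the sketch prints the result rather than predicting it.

```shell
# Fix the collation order so results do not depend on the locale.
LC_ALL=C
export LC_ALL

# Made-up records: field 1 is the key, field 2 is a non-key field.
data='b 2
a 1
b 1
a 2
a 1'

# Whole-line keys: sort -u and sort | uniq give the same output.
whole_u=$(printf '%s\n' "$data" | sort -u)
whole_uniq=$(printf '%s\n' "$data" | sort | uniq)

# Key restricted to field 1: one line is kept per distinct key.
keyed=$(printf '%s\n' "$data" | sort -u -k1,1)

# Imitation with sed: strip the non-key field, then sort | uniq.
stripped=$(printf '%s\n' "$data" | sed 's/ .*//' | sort | uniq)

# Inspect which record survived for each key on this system.
printf '%s\n' "$keyed"
```

The sed imitation yields only the keys, whereas sort -u -k1,1 keeps a
whole record per key, which is exactly where the choice of record becomes
visible.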

        - Dan C.