[TUHS] Command line options and complexity

Thu Mar 5 16:12:24 AEST 2020

On 2020-03-04 16:50:34, Random832 spake thus:
[...]
> Sure, but "stdin is a sequence of any type, and the argument is an expression that operates on that type or the name of a property that that type has" is universal enough.
> 
> The part that has to operate on a specific structure isn't the command, it's the arguments.
> 
> For example, a powershell pipeline to produce a list of files sorted by modified date is:
> 
> gci . | sort lastwritetime | select name
> 
> all three *commands* are universal - not all objects have a "lastwritetime" and "name" property, but sort and select can operate on any property that the sequence of objects passed into it has.

There are some examples of that type of thing in widely used Unix tools;
my use of 'sort -k1,1n' further down is demonstrating such a use case (the
'sort' command is being told that it is operating on numbers). But beyond
some lowest common denominator types ("number", "string", ...) how many
commands can really usefully operate on a large number of types? For
example, a program that can operate on IP addresses is probably doing
something different than a program that wants to operate on email
addresses.

I could see where named properties of some object can be used more
generally than types, but again there are widely used tools that do do
that (e.g., jq(1)). IMHO, though, they are more cumbersome to use than
most of the commands I need to use minute to minute.

> (gci is an alias for get-childitem... it also has aliases ls and dir, but I'm emphasizing that it's not exclusive to directories)
> 
> *assuming that ls -t didn't exist*, to do this with unix tools that operate on text you would need:
> 
> ls -l | [somehow convert the date to a sortable format, probably in awk] | sort | [somehow pick the filename alone out of the output - possibly with cut or sed or awk again]

(Just nit-picking at this particular example)

You could do it without ls[0]:

    $ stat -c '%Y %n' * | sort -k1,1n | xargs -L1 sh -c 'echo "$@"'

That doesn't seem so bad to me, but if it was something I needed regularly
I'd of course put it in an alias[1] or (more likely) a short script file.
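
To spell out why the timestamp vanishes at the end of that pipeline: the
first word xargs hands to 'sh -c' lands in $0, not in "$@". Here is a small
sketch with invented input (the epoch values, file names, and function names
are made up for illustration, so it runs without any particular files):

```shell
# Hypothetical "mtime name" lines standing in for the output of stat(1).
fake_stat() {
    printf '%s\n' '300 c.txt' '100 a.txt' '200 b.txt'
}

# Same tail as the pipeline above: for each line, xargs runs
#   sh -c 'echo "$@"' <mtime> <name>
# so <mtime> becomes $0 and only <name> is left in "$@".
names_by_mtime() {
    fake_stat | sort -k1,1n | xargs -L1 sh -c 'echo "$@"'
}

names_by_mtime
```

That prints a.txt, b.txt and c.txt, one per line. Since xargs splits on
blanks and treats quotes specially, names containing quote characters or
runs of spaces would likely need more care than this.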

> and it's very difficult to get tools like awk, sort, and cut to work on formats that contain more than one field that may contain embedded spaces (you can just about get away with it for ls output because the date is always three "words").
[...]

Yes, that's often true. And when I encounter it I typically start out by
seeing if I can inject and remove tokens in the data at key places in the
pipeline. Beyond anything trivial, though, I then quickly start reaching
for tools to put the data into some form that more easily allows for it
(CSV, JSON, ...). But that invariably adds other complications (such as
the need to find or build tools to marshal/unmarshal the data, and to
deal with data-domain-specific notions of null-vs-empty-string).
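
One possible reading of "inject and remove tokens", sketched with invented
data: put a delimiter that cannot occur inside the fields (a tab here)
between them as early as possible, let sort(1) and cut(1) key on it, and
swap it for something readable at the very end. The table contents and the
function names are assumptions made up for illustration.

```shell
tab=$(printf '\t')

# Hypothetical source: name, country, and an invented size figure,
# tab-separated. printf reuses its format until the arguments run out,
# so this emits three rows.
cities() {
    printf '%s\t%s\t%s\n' \
        'New York'  'United States' 8 \
        'Sao Paulo' 'Brazil'        12 \
        'Cape Town' 'South Africa'  4
}

# Sort numerically on field 3, keep fields 1-2, then remove the token.
cities_by_size() {
    cities | sort -t "$tab" -k3,3n | cut -f1,2 | sed "s/$tab/, /"
}

cities_by_size
```

Both multi-word fields survive the trip because no stage in the middle ever
splits on spaces.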

For the (more common (for me)) case where there is only one field that
contains embedded spaces, I just try to get 'em at the end of the line
and let the shell deal with it:

    $ some-command | while read -r first second rest; do ... ; done
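
A runnable version of that idiom, with a made-up stand-in for
'some-command' (the records and the variable names are hypothetical):

```shell
# Hypothetical records: two simple fields, then free text with spaces.
records() {
    printf '%s\n' '42 alice Senior Staff Engineer' '7 bob Intern'
}

# read(1) splits on whitespace and hands whatever is left over to the
# last variable, so the embedded spaces in the third field are kept.
split_records() {
    records | while read -r id user title; do
        printf '%s|%s|%s\n' "$id" "$user" "$title"
    done
}

split_records
```

The output is "42|alice|Senior Staff Engineer" followed by "7|bob|Intern".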

> Maybe it would be enough to have the universal interface be "tables" (i.e. text streams in some format that supports adequate escaping of embedded row and column delimiters)... or maybe even just table rows, and let the user deal with memorizing column numbers (or let each originating command support a fully general way to specify what columns are requested, as ps alone does on modern systems) Of course, this isn't *really* different from allowing any data structure - after all, the value for any field could itself be a fully escaped table in text format.
[...]

Well, in some sense with byte streams you have a table of newline-delimited
bytes (rows), and byte subfields separated by whitespace (columns). And
anything on top of that could (in some context, and with some syntax) be
considered just further escaped tables in text format. I think that's
essentially the same thing that you said, only with the outermost table
syntax removed. But like you said, this isn't really different from
allowing any data structure. Importantly, though, it doesn't impose any
particular data structure, either.

I've worked at a couple of different places that had in-house tools for
working with explicit table semantics in command line suites, and where
they fit the data domain, that was hugely useful. Generally speaking, they
were special purpose enough to warrant their own tools, but still general
purpose enough to be composable (they were designed for use in shell pipelines)
and applicable in domains beyond the intentions of their original authors.

Still, the burden of "thinking in tables" would make them too heavyweight
for a lot of common use cases. Sometimes my data structure is "paragraphs
of text":

    $ lorem -p 3 | perl -00 -wnle '2 == $. && print' | wc -w

Other times I want a tree (JSON, s-expressions, ...), or even a stream of
trees[2]. I consider it a feature that these more complex data structures
are not assumed or imposed in contexts where they are not needed.

Take care,
-Al 

[0] You could get 'ls' to do it, too (without '-t'), but here the use of
    TIME_STYLE is a presumably non-portable (but handy!) GNU-ism:

        $ TIME_STYLE='+%s' ls -l | tail -n +2 | sort -k6,6n | xargs -L1 sh -c 'shift 5; echo "$@"'

    It's different from the '-t' option, though, in that it forces a
    predictable date field format in the output of 'ls -l', so side-steps
    the need for downstream date parsing altogether and simply jumps into
    sorting (after chopping off the 'total N' header (groans all around)).

[1] E.g.,

        $ # read 'bmt' as: "by mtime"

        $ alias bmt='stat -c "%Y %n" * | sort -k1,1n | xargs -L1 sh -c '"'echo "'"$@"'"'"

        $ bmt

[2] Probably flattened.