[TUHS] Canonical Historic Approach to iconv(1)

segaloco via TUHS tuhs at tuhs.org
Thu Nov 28 04:56:16 AEST 2024


So a project I'm working on has recently developed a need to store UTF-8 Japanese kana text in source files for readability, but then process those source files through tools only guaranteed to support single-byte code points, with something mapping the UTF-8 code points to single-byte points in the destination execution environment.  After a bit of futzing, I've landed on the definition of iconv(1) provided by the Single UNIX Specification to push this character mapping concern to the tip of my pipelines.  It is working well thus far and insulates the utilities down-pipe from needing multi-byte support (I'm looking at you, Apple).
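
For illustration, the shape of the pipeline is roughly the following.  The charmap file names and downstream tools are placeholders for my real ones, and this assumes an iconv(1) that implements the SUS form of -f/-t where an argument containing a slash is taken as the pathname of a charmap file rather than a codeset name:

    # hypothetical: ./utf8.cmap and ./target.cmap are charmap files
    # describing the source and destination encodings; everything
    # after iconv only ever sees single-byte text
    iconv -f ./utf8.cmap -t ./target.cmap source.txt | tool1 | tool2 > out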

I started thumbing through my old manuals and noted that iconv(1) is not a historic utility; rather, the SUS picked it up from HP-UX along the way.

Was there any older utility or set of practices for converting files between character encodings besides the ASCII/EBCDIC stuff in dd(1)?  As I understand it, iconv(1) just recognizes sequences of bytes, maps each to a symbolic name, then emits the complementary sequence of bytes assigned to that symbolic name in a second charmap file (sketched below).  That sounds like a simple filter operation that could be done in a few other ways.  I'm curious whether any particular approach was relatively ubiquitous, or whether this was an exercise largely left to the individual, with solutions wide and varied.  My tool chain doesn't need to work on historic UNIX, but it would be cool to understand how to make it work on the least common denominator.
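
To make that concrete, here's a minimal, hypothetical pair of charmaps; the symbolic name and byte values are just examples (U+30A2 KATAKANA LETTER A as three UTF-8 bytes on one side, the JIS X 0201 half-width value 0xB1 on the other), and iconv does nothing more than the lookup-and-substitute between the two:

    # utf8.cmap -- symbolic name paired with its UTF-8 byte sequence
    <code_set_name> MY-UTF8
    <mb_cur_max>    3
    <mb_cur_min>    1
    CHARMAP
    <KATAKANA-A>    \xe3\x82\xa2
    END CHARMAP

    # target.cmap -- same symbolic name paired with the single-byte value
    <code_set_name> MY-SBCS
    <mb_cur_max>    1
    <mb_cur_min>    1
    CHARMAP
    <KATAKANA-A>    \xb1
    END CHARMAP

Hand both to iconv via -f and -t (as in the pipeline above) and every occurrence of the three-byte sequence on input should come out as the one-byte value on output.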

- Matt G.

