[TUHS] Canonical Historic Character Encoding Conversion?

Steffen Nurpmeso via TUHS tuhs at tuhs.org
Thu Nov 20 01:20:22 AEST 2025


A bit late, but

segaloco via TUHS wrote in
 <orUW3q4tPxUlZdEdjIzXJlp7dS-d0DoPmgeizRguFLRsJMUxjP9TWqOHCIuO6U1ALub3dVJ\
 ttj3SNNgSFtRhTaJvQH1Ljcc1n2VX4_uRtZA=@protonmail.com>:
 |So I was working up a draft procedure for identifying strings in
 |NES and other tile-based video game titles, and was intending
 |for iconv(1) to be my central encoding converter.  Unfortunately,
 |I was today years old when I noticed:
 |
 |> The implementation need not support the use of charmap files
 |> for codeset conversion unless the POSIX2_LOCALEDEF symbol is
 |> defined on the system.
 |
 |However the rationale section of the page then states:
 |
 |> The iconv utility can be used portably only when the user
 |> provides two charmap files as option-arguments.
 |
 |Don't these two statements contradict one another, one stating
 |that a feature is not required, and the other stating that a
 |feature is required for portability?  Isn't the whole point of
 |POSIX portability?

No; POSIX lets you create your own locale environment via the
localedef(1) command, whose input is a standardized, extensive
definition syntax that must have been a pain to create.

(It seems to me that those who *really* deal with locales, the
Unicode-related CLDR (and ICU) framework, things like Perl, use
their own machinery instead; and i *think* .. no, i confirm that
these truly capable frameworks are what the free Unix software
world uses as input to converters which boil "the real thing"
down to those POSIX artifacts.  That is: the ISO C and (thus,
though better; think wcwidth(3)) POSIX environment has never been
able to truly deal with the languages, or calendars, or whatever,
of the world.)

What you refer to seems to be the distinction between built-in
character sets (no slash / in the codeset name) and locally
created additional charmaps (at least one slash).

Anyone should note well: iconv has always been a broken
interface, even with the changes of the latest (2024) POSIX.  One
still cannot generate a "normalized" character set name, or get
any information on a "MIME" or "mean" name variant, which means
that anyone who actually cares has to create and use her or his
(or supersuperfluid etc etc) own character set database, like, e.g.

  {"us-ascii","ansi x 3 4 1968","iso ir 6","us ascii","ansi x 3 4 1986","iso 646 irv 1991","iso 646 us","ascii","us","ibm 367","cp 367","csascii"}

or ("for giggles")

  {"euc-jp","extended unix code packed format for japanese","cseucpkdfmtjapanese","euc jp"}
  {"extended_unix_code_fixed_width_for_japanese","extended unix code fixed width for japanese","cseucfixwidjapanese"}

I attach a sh/awk script which (downloads and) generates that
database; define DBG to get the real name (in comments); define
FMT as one of txt, c, or list to choose between output formats.
(Last tested in 2023-12.)

*Note* that IANA introduced some errors into the official
character set database when they converted the files from
wonderful plain ASCII text to XML format, and they are not
willing (as of my last look) to fix that.  (The FMT "txt" mimics
that wonderful old format, which did not miss a bit.)

This means that certain entries may need manual fixing, but i have
forgotten the details, and, sorry, i am too lazy to look it all up
now; given that they simply "hung" the issue without fixing it,
you know.  "Fuck it" is, i think, the correct term.

Yes, the iconv interface is a mess.  You cannot query any
information about the character set actually in use.  But it
matters, for example for email: if the real charset is US-ASCII,
things can be done differently, you know.  One knows nothing: is
it multibyte byte-based, is it multibyte otherwise, etc.  And
unfortunately the internal codeset names, especially on systems
twenty years old and older, and especially for ASCII, can hardly
be tested.

And then things like UnixWare use special, non-standardized
character set names for certain Unicode subsets (i have forgotten
the details after losing my running UnixWare VM, and a new
installation was refused by the installer; sigh!!); no one can do
any sensible testing with such.

And then of course there is the prototype mismatch between "char
const** [__restrict__]" and "char** [__restrict__]", and all the
warnings about "intermediate pointers to be safe" when one wants
to use "char**" vs "char const**", etc etc.

It seems there is support for at least extending the iconv
interfaces of the future so that application developers can
differentiate between conversion failures / incapabilities with
respect to "illegal input" vs "output impossible", both of which
now end up as EILSEQ even though the consequences *could* be
different.  But not yet, i think.

 |In any case, what this has me wondering is if there was any,
 |older, more guaranteed method for arbitrary encoding conversions?
 |There is the well-known ASCII-EBCDIC conversion in dd(1), but
 |this mechanism does not seem to be extensible to other, arbitrary
 |character encodings.
 |
 |Is there some historic mechanism I'm glossing over?  In the end I
 |can go with another approach, but I want to use the canonical
 |UNIX way if at all possible, I don't want the same shock I had
 |today after carefully drafting several charmaps for a project
 |only to find that the POSIX standard had no teeth in that area
 |and pretty much guarantees nothing.

iconv, with all its problems, is i think the only option available
by default, *if* it is available by default.

 |- Matt G.
 --End of <orUW3q4tPxUlZdEdjIzXJlp7dS-d0DoPmgeizRguFLRsJMUxjP9TWqOHCIuO6U\
 1ALub3dVJttj3SNNgSFtRhTaJvQH1Ljcc1n2VX4_uRtZA=@protonmail.com>

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

