[TUHS] Canonical Historic Character Encoding Conversion?

Fri Nov 21 03:32:33 AEST 2025

P.S.: for the complete picture, as i see it.

Steffen Nurpmeso wrote in
 <20251119152022.2-uWUzRn at steffen%sdaoden.eu>:
 ...
 |iconv, with all its problems, is i think the only option available
 |by default, *if* it is available by default.

It may be necessary, depending on what you want, to append the
"//TRANSLIT" string to the name of the character set to encode to.
My Linux manual now says

      If  the  string //TRANSLIT is appended to to‐encoding, characters
      being converted are  transliterated  when  needed  and  possible.
      This  means  that  when  a character cannot be represented in the
      target character set, it can be approximated through one or  sev‐
      eral  similar looking characters.  Characters that are outside of
      the target character set and cannot  be  transliterated  are  re‐
      placed with a question mark (?) in the output.

but, when i wrote the workaround, i noted this for myself as
a "bug", as it is not present on other systems, and not in POSIX,
which says

   • If no indicator suffix was specified when the conversion
   descriptor cd was opened, or the //TRANSLIT indicator suffix
   was specified but no transliteration of the character is
   possible, iconv( ) shall perform an implementation-defined
   conversion on the character and it shall be counted in the
   return value of the iconv( ) call.

which reads to me as saying that an "implementation-defined"
conversion is required, and i do not count "failure of the function"
as such, also given that, historically, replacement characters were
then simply inserted.  (I must be very much mistaken otherwise?  But this
can very much happen *of course*; for example, i would have
sworn that, in C, if i assign a char[X] a string like say "Y\0",
that the compiler would accept that the \0 was already seen,
and not warn if that \0 fits the buffer, yet the "otherwise"
automatically added \0 would exceed the buffer, but no, that is
not what actually happened, already back around Y2K!)
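
To make the distinction concrete, here is a minimal sketch of how one
might tell the possible reactions apart for a given to-encoding name,
with or without the "//TRANSLIT" suffix.  The names conv_outcome and
classify_conversion() are made up for this illustration, and which
outcome you get for non-ASCII input depends entirely on the iconv
implementation at hand.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

enum conv_outcome {
    CONV_ERROR = -1,    /* iconv_open() failed, or an error other than EILSEQ */
    CONV_EXACT = 0,     /* iconv() returned 0: nothing had to be substituted */
    CONV_SUBSTITUTED,   /* iconv() returned a count > 0 */
    CONV_REJECTED       /* iconv() failed with EILSEQ */
};

/* Hypothetical helper: convert the UTF-8 string `in` to `tocode`
 * (for example "us-ascii" or "us-ascii//TRANSLIT") and report how
 * the implementation reacted.  The converted bytes are discarded. */
static enum conv_outcome classify_conversion(const char *tocode, const char *in)
{
    char inbuf[64], outbuf[256], *inp = inbuf, *outp = outbuf;
    size_t inl = strlen(in), outl = sizeof outbuf, rv;
    iconv_t cd;
    int err;

    if (inl > sizeof inbuf)
        return CONV_ERROR;
    memcpy(inbuf, in, inl);     /* iconv() wants a non-const pointer */

    if ((cd = iconv_open(tocode, "utf-8")) == (iconv_t)-1)
        return CONV_ERROR;      /* pair or suffix not supported here */

    rv = iconv(cd, &inp, &inl, &outp, &outl);
    err = errno;                /* save before iconv_close() can change it */
    iconv_close(cd);

    if (rv == (size_t)-1)
        return err == EILSEQ ? CONV_REJECTED : CONV_ERROR;
    return rv > 0 ? CONV_SUBSTITUTED : CONV_EXACT;
}
```

Calling classify_conversion("us-ascii", "\303\244") on a system that
follows the POSIX text quoted above should give CONV_SUBSTITUTED,
whereas a system that insists on the suffix would give CONV_REJECTED.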

The corresponding C code is something like this:

  #define n_ICONV_ASCII_NAME "us-ascii"
  #define n_ICONV_UTF8_NAME "utf-8"

          /* For cross-compilations this needs to be evaluated once at runtime */
  # ifndef mx_ICONV_NEEDS_TRANSLIT
          if(mx_ICONV_NEEDS_TRANSLIT == FAL0){
                  for(;;){
                          char inb[8], *inbp, oub[8], *oubp;
                          size_t inl, oul;

                          /* U+1FA78/f0 9f a9 b8/;DROP OF BLOOD */
                          memcpy(inbp = inb, "\360\237\251\270", sizeof("\360\237\251\270"));
                          inl = sizeof("\360\237\251\270") -1;
                          oul = sizeof oub;
                          oubp = oub;

                          if((id = iconv_open((mx_ICONV_NEEDS_TRANSLIT
                                           ? n_ICONV_ASCII_NAME "//TRANSLIT" : n_ICONV_ASCII_NAME), n_ICONV_UTF8_NAME)
                                          ) == (iconv_t)-1)
                                  break;

                          if(iconv(id, &inbp, &inl, &oubp, &oul) == (size_t)-1){
                                  iconv_close(id);
                                  if(mx_ICONV_NEEDS_TRANSLIT)
                                          break;
                                  mx_ICONV_NEEDS_TRANSLIT = TRUM1;
                          }else{
                                  iconv_close(id);
                                  mx_ICONV_NEEDS_TRANSLIT = TRU1;
                                  break;
                          }
                  }
          }
  # endif /* ifndef mx_ICONV_NEEDS_TRANSLIT */
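
For readers who want to try the idea outside of the surrounding
program, the following is a standalone sketch of the same kind of
runtime probe.  It is not the code above: probe_one() and
probe_needs_translit() are hypothetical names, the test character is
a different one, and the return convention (0, 1, -1) is an assumption
made only for this example.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Try to convert one non-ASCII character from UTF-8 to `tocode`.
 * Returns 1 if iconv() accepted it, 0 if the open or the
 * conversion failed. */
static int probe_one(const char *tocode)
{
    char inb[8], oub[16], *inbp = inb, *oubp = oub;
    size_t inl = 3, oul = sizeof oub, rv;
    iconv_t cd;

    /* U+20AC/e2 82 ac/;EURO SIGN */
    memcpy(inb, "\342\202\254", 3);

    if ((cd = iconv_open(tocode, "utf-8")) == (iconv_t)-1)
        return 0;
    rv = iconv(cd, &inbp, &inl, &oubp, &oul);
    iconv_close(cd);
    return rv != (size_t)-1;
}

/* Returns 0 if the plain name is enough, 1 if only the name with
 * "//TRANSLIT" appended works, -1 if neither variant converts. */
static int probe_needs_translit(void)
{
    if (probe_one("us-ascii"))
        return 0;
    if (probe_one("us-ascii//TRANSLIT"))
        return 1;
    return -1;
}
```

The result would be evaluated once at startup and then used to pick
the to-encoding name for all later iconv_open() calls.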

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)