[TUHS] Canonical Historic Character Encoding Conversion?
Steffen Nurpmeso via TUHS
tuhs at tuhs.org
Fri Nov 21 03:32:33 AEST 2025
P.S.: for the complete picture, as i see it.
Steffen Nurpmeso wrote in
<20251119152022.2-uWUzRn at steffen%sdaoden.eu>:
...
|iconv, with all its problems, is i think the only option available
|by default, *if* it is available by default.
It may be necessary, depending on what you want, to append the
"//TRANSLIT" string to the name of the character set to encode to.
My Linux manual now says
If the string //TRANSLIT is appended to to‐encoding, characters
being converted are transliterated when needed and possible.
This means that when a character cannot be represented in the
target character set, it can be approximated through one or several
similar looking characters. Characters that are outside of
the target character set and cannot be transliterated are replaced
with a question mark (?) in the output.
but, when i wrote the workaround, i noted this for myself as
a "bug", as it is not present on other systems, and not in POSIX,
which says
• If no indicator suffix was specified when the conversion
descriptor cd was opened, or the //TRANSLIT indicator suffix
was specified but no transliteration of the character is
possible, iconv( ) shall perform an implementation-defined
conversion on the character and it shall be counted in the
return value of the iconv( ) call.
which i read as saying that an "implementation-defined" conversion
is required, and i do not count "failure of the function" as such,
also given that, historically, replacement characters were then
simply added. (Or am i very much mistaken? That can very much
happen *of course*; for example, i would have sworn that, in C,
if i initialize a char[X] with a string like, say, "Y\0", the
compiler would accept that the \0 was already seen, and not warn
if that \0 fits the buffer even though the "otherwise"
automatically added \0 would exceed the buffer; but no, that is
not what actually happened, already back around Y2K!)
The corresponding C code is something like this:
#define n_ICONV_ASCII_NAME "us-ascii"
#define n_ICONV_UTF8_NAME "utf-8"
/* For cross-compilations this needs to be evaluated once at runtime */
# ifndef mx_ICONV_NEEDS_TRANSLIT
   if(mx_ICONV_NEEDS_TRANSLIT == FAL0){
      for(;;){
         char inb[8], *inbp, oub[8], *oubp;
         size_t inl, oul;

         /* U+1FA78/f0 9f a9 b9/;DROP OF BLOOD */
         memcpy(inbp = inb, "\360\237\251\271", sizeof("\360\237\251\271"));
         inl = sizeof("\360\237\251\271") -1;
         oul = sizeof oub;
         oubp = oub;

         if((id = iconv_open((mx_ICONV_NEEDS_TRANSLIT
               ? n_ICONV_ASCII_NAME "//TRANSLIT" : n_ICONV_ASCII_NAME),
               n_ICONV_UTF8_NAME)) == (iconv_t)-1)
            break;

         if(iconv(id, &inbp, &inl, &oubp, &oul) == (size_t)-1){
            iconv_close(id);
            if(mx_ICONV_NEEDS_TRANSLIT)
               break;
            mx_ICONV_NEEDS_TRANSLIT = TRUM1;
         }else{
            iconv_close(id);
            mx_ICONV_NEEDS_TRANSLIT = TRU1;
            break;
         }
      }
   }
# endif /* ifndef mx_ICONV_NEEDS_TRANSLIT */
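
For readers who want to try the idea in isolation, here is a minimal
standalone sketch of the same kind of runtime probe. It is not the
code quoted above: the helper names build_tocode(), try_convert() and
probe_translit(), the buffer sizes, and the 0/1/-1 return convention
are all assumptions made for illustration. It assumes an iconv(3)
declared with a "char **" input buffer, as on glibc and musl; some
systems declare it differently or need -liconv when linking.

```c
#include <assert.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write "<name>" or "<name>//TRANSLIT" into buf.
 * Returns buf, or NULL if the result would not fit */
static char *build_tocode(char *buf, size_t size, char const *name,
      int translit){
   int n;

   assert(buf != NULL && name != NULL);
   n = snprintf(buf, size, "%s%s", name, (translit ? "//TRANSLIT" : ""));
   return (n < 0 || (size_t)n >= size) ? NULL : buf;
}

/* Try to convert one character that US-ASCII cannot represent.
 * Returns 1 if iconv(3) did not report an error, 0 otherwise */
static int try_convert(char const *tocode){
   /* U+1FA78 DROP OF BLOOD, encoded as UTF-8 */
   char inb[] = "\360\237\251\271", oub[16];
   char *inbp = inb, *oubp = oub;
   size_t inl = sizeof(inb) - 1, oul = sizeof oub, rv;
   iconv_t cd;

   if((cd = iconv_open(tocode, "utf-8")) == (iconv_t)-1)
      return 0;
   rv = iconv(cd, &inbp, &inl, &oubp, &oul);
   iconv_close(cd);
   return (rv != (size_t)-1);
}

/* Assumed convention: 0 if the plain name suffices, 1 if the
 * //TRANSLIT suffix is needed, -1 if neither variant converts */
static int probe_translit(void){
   char name[64];
   int i;

   for(i = 0; i < 2; ++i)
      if(build_tocode(name, sizeof name, "us-ascii", i) != NULL &&
            try_convert(name))
         return i;
   return -1;
}
```

Calling probe_translit() once at startup and caching the result would
give the "evaluated once at runtime" behaviour the comment above asks
for; what it returns depends on the iconv implementation in use.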
--steffen
|
|Der Kragenbaer, The moon bear,
|der holt sich munter he cheerfully and one by one
|einen nach dem anderen runter wa.ks himself off
|(By Robert Gernhardt)