[TUHS] Bell Foreign-Language UNIX Efforts

Thu Mar 23 09:33:46 AEST 2023

I've often pondered on the storage differential that non-ASCII languages rack up.

Let's say one primarily stores documents in Japanese.  That puts you in the 2-bytes-per-character range with Shift JIS, EUC-JP, or UTF-16, and at 3 bytes per character for kana and common kanji in UTF-8.  Going simply by character count, the same number of characters takes two to three times as many bytes as ASCII text.  Of course, Japanese isn't the best example of this being a problem: as with logographic scripts generally (and even with kana syllables), it takes fewer characters to convey a thought, which cuts the character count for a complete sentence, even for katakana versus Hepburn romanization, by at least half.  All in all, it may break even or come out ahead, given that a single kanji can stand for several syllables.  Hebrew or Arabic might seem like better examples of scripts that suffer data bloat, but being abjads that mark vowel sounds with optional diacritics, they likewise land somewhere near Japanese kana, where a single glyph represents what in the Latin alphabet would be at least two letters.  I would imagine Cyrillic users, for instance, actually do take the storage hit, since their entire script is outside ASCII *and* it is a full alphabet, neither an abjad nor logographic.  Can't say I've worked with much Cyrillic text though.  That's not even to mention scripts where a diacritic may be a separate combining code point that has to be paired with a base character.
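A rough way to put numbers on this, as a minimal Python sketch.  The sample words (all meaning "cat", plus "é" for the combining case) and the helper name utf8_cost are my own picks for illustration, and a handful of single words says nothing about how whole documents compare.  Note that in UTF-8 the kana and kanji here cost three bytes each; the two-byte figure only holds for the likes of Shift JIS or UTF-16.

```python
import unicodedata


def utf8_cost(text: str) -> tuple[int, int]:
    """Return (number of code points, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


# Illustrative samples only: the word "cat" in several scripts.
samples = {
    "English": "cat",
    "Japanese (kanji)": "猫",
    "Japanese (hiragana)": "ねこ",
    "Russian": "кошка",
    "Hebrew": "חתול",
}

for name, text in samples.items():
    chars, size = utf8_cost(text)
    print(f"{name}: {chars} code points, {size} bytes in UTF-8")

# The same visible "é" can be one precomposed code point or a base
# letter followed by a separate combining accent (NFD form).
precomposed = "\u00e9"
decomposed = unicodedata.normalize("NFD", precomposed)
print("precomposed:", utf8_cost(precomposed))
print("decomposed: ", utf8_cost(decomposed))
```

In this tiny sample the single kanji matches the English word byte for byte, the hiragana spelling doubles it, and the Cyrillic word pays two bytes for every letter.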

This is, of course, way off topic for the list, so I don't want to start a whole side thread on it, but how computers store language has interested me for a long time, especially in my reverse-engineering research on old games, looking at how different studios implemented their own code pages for non-ASCII scripts.  For example, I've seen plenty of older (8/16-bit) Japanese games that obviously don't use UTF-8, either because of its overhead in constrained console environments or because they predate it, but also don't use Shift JIS or other known encodings, opting instead for a custom character table mapping bytes, usually to kana, although I haven't really peeked into any engines that use kanji.  Kanji was uncommon, as video games were typically marketed towards children, who weren't expected to know enough kanji to read complicated text.  You see the same today in children's media in Japan, where the hiragana reading of a given kanji is displayed adjacent to it (furigana).
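To make the custom-table idea concrete, here is a minimal Python sketch.  The table layout, the 0x80 base, the space at 0x00, and the 0xFF terminator are all invented for illustration and are not taken from any real game or engine; actual titles each had their own arrangement.

```python
# Hypothetical one-byte-per-kana table; NOT from any real game.
KANA = "あいうえおかきくけこさしすせそ"
TABLE = {0x80 + i: ch for i, ch in enumerate(KANA)}
TABLE[0x00] = " "   # assumed space byte
END = 0xFF          # assumed end-of-string marker


def decode(raw: bytes) -> str:
    """Decode bytes using the made-up table, stopping at END."""
    out = []
    for b in raw:
        if b == END:
            break
        # Unknown bytes become "?" so gaps in the table stay visible.
        out.append(TABLE.get(b, "?"))
    return "".join(out)


raw = bytes([0x85, 0x8A, END])   # "かさ" under this table
text = decode(raw)
print(text)
print("custom table:", len(raw) - 1, "bytes (plus terminator)")
print("Shift JIS:   ", len(text.encode("shift_jis")), "bytes")
print("UTF-8:       ", len(text.encode("utf-8")), "bytes")
```

With one byte per kana, a table like this halves the Shift JIS size and cuts UTF-8 to a third, which is presumably part of the appeal on a cartridge with little ROM to spare.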

I think one resounding conclusion of this thread though is we all owe Rob and Ken (and colleagues) a great deal for nailing this matter down in such a well-engineered way.  Long live UTF-8!

- Matt G.

------- Original Message -------
On Wednesday, March 22nd, 2023 at 3:33 PM, Steffen Nurpmeso <steffen at sdaoden.eu> wrote:

> Rob Pike wrote in
> CAKzdPgwYPxK9oYemG5-vPgRR7mSfj_qkjD5-iJnLffP-23PUaQ at mail.gmail.com:
> 
> |The appendix version named it plain UTF, repurposing the extant name to the
> |new encoding. The -8 came later, as it is in these linked documents,
> |because some people wanted a UTF-7 and a UTF-16. Those people should be
> |punished.
> 
> I agree, but please with a but.
> 
> For one especially so since UTF-7 (that i like) then didn't make
> it all through, but only here and there.
> Ie, if it would have been used for anything mail and DNS related
> to keep 7-bit compat. Instead they introduced monstrosities like
> IDNA for DNS, mUTF-7 (locale charset -> UTF-16BE -> mUTF-7) etc.
> 
> 
> That i hated: IDNA. If they would have said we give up on
> backward compatibility around Y2K, and the old stuff grows out;
> and 255 bytes UTF-8 is surely enough for domain names for some
> time (even percent encoded) even for those encodings which need
> four byte for one codepoint, and it simply does not work before.
> Like so they introduced those backward incompatibilities that they
> wanted to avoid.
> 
> I did oppose strongly in the past, but UTF-16 has merits for some
> languages as well as for coding, even though you have to be able
> to deal with surrogates, .. and with grapheme boundaries, if you
> are doing it right, so 1:many is there anyhow. I mean, wchar_t is
> often 32-bit, and then not even UTF-32, at least possibly. But
> still you have the 1:many, so it buys you nothing.
> All-UTF-8 is of course great imho. (Asian people may disagree.)
> 
> --steffen
> |
> |Der Kragenbaer, The moon bear,
> |der holt sich munter he cheerfully and one by one
> |einen nach dem anderen runter wa.ks himself off
> |(By Robert Gernhardt)