[TUHS] Bell Foreign-Language UNIX Efforts

Sun Mar 19 23:32:23 AEST 2023

On 19-Mar-23 7:00, segaloco via TUHS wrote:
> Good evening or whichever time of day you find yourself in.  I was reading up on Japanese computer history when I got to thinking specifically on where UNIX plays in with it all, which then led to some further curiosity with non-English UNIX in general.
> 
> In the midst of documentation searches/study, I've spotted French and what I believe to be Japanese documentation bearing Bell/AT&T logos.  I've also seen a few things pop up in German although they looked to be university resources, not something from the Bell System.  In any case, is there any clear historical record on efforts within the USG/USL line, or research for that matter, towards the end of foreign language support or perhaps even single polyglot installations?  Would BSD have been more poised for this sort of thing being more widely utilized in the academic scene?

I think the most significant development that came out of Unix regarding
internationalization was the proposal and adoption of Unicode and UTF-8.
This was published in 1993 in the USENIX Technical Conference proceedings:

Pike, Rob, and Ken Thompson. "Hello World or Καλημέρα κόσμε or こんにちは世界." 
Proceedings of the Winter 1993 USENIX Conference. 1993.

At the time of the decision to adopt Unicode and UTF-8 in Unix (Plan 9 
actually) there was no consensus on international character 
representations and encodings.  Many systems extended ASCII with 8-bit
characters to represent those required in a particular country.  These
"code pages" were standardized in numerous mutually incompatible
ISO-8859-X variants.  My understanding is that for many Asian (Chinese,
Japanese, and Korean) languages the situation was even worse, with ISO 
2022 being used to shift mid-string from one character set encoding to 
another.
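
To illustrate the incompatibility, here is a minimal Python sketch (the
language, the byte value 0xE1, and the helper name decode_all are my own
choices for illustration).  It decodes the same single byte under three
ISO-8859 parts and gets three unrelated characters.

```python
# The same 8-bit value names a different character in each ISO-8859 part.
raw = bytes([0xE1])

# Python codec names for Latin-1, Cyrillic, and Greek.
CODECS = ("iso8859_1", "iso8859_5", "iso8859_7")


def decode_all(data: bytes) -> dict:
    """Decode the same bytes under each code page listed in CODECS."""
    return {name: data.decode(name) for name in CODECS}


decoded = decode_all(raw)
for name, text in decoded.items():
    print(f"{name}: {text!r} (U+{ord(text):04X})")
```

Without out-of-band knowledge of the code page in use, a program
receiving that byte has no way to tell which character was meant.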

In addition, Unicode was a draft standard for unified 16-bit character 
codes promoted by a group of US companies.  It was battling against the ISO
10646 draft, which had taken the approach of allocating character set 
blocks to national bodies, thus creating a sparse 32-bit representation 
with considerable redundancy between similar languages.  Furthermore, 
the ISO 10646 standard proposed a (non-required) UTF multibyte encoding 
(now known as UTF-1), which was not self-synchronized, because bytes 
used for representing ASCII characters were also employed as parts of 
multibyte sequences.

The Bell Labs team took the bold approach of adopting the draft Unicode
standard and an X/Open proposal for encoding multibyte characters using
only bytes with the top bit set.  At the time the encoding was known as
UTF-2; it is what we now call UTF-8.  UTF-8 makes it easier to achieve
backward compatibility in existing code; for example, code scanning a
string for the "/" file path separator will never encounter it in the
UTF-8 representation of non-ASCII characters.
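
The property can be checked with a short Python sketch.  The sample path
and the helper names (non_ascii_bytes_have_top_bit, split_path) are
hypothetical, picked only for illustration; the byte-oriented split
stands in for legacy code that knows nothing about Unicode.

```python
def non_ascii_bytes_have_top_bit(text: str) -> bool:
    """True if every byte that encodes a non-ASCII character is >= 0x80."""
    for ch in text:
        if ord(ch) >= 0x80:
            if any(b < 0x80 for b in ch.encode("utf-8")):
                return False
    return True


def split_path(raw: bytes) -> list:
    """Byte-oriented split on '/', as encoding-unaware code would do it."""
    return raw.split(b"/")


# Hypothetical path mixing ASCII, Greek, and Japanese components.
path = "usr/Καλημέρα/こんにちは世界/file"
encoded = path.encode("utf-8")

# Splitting the raw bytes never cuts a multibyte character in half,
# so every component still decodes cleanly.
parts = [p.decode("utf-8") for p in split_path(encoded)]
print(parts)
```

The only 0x2F bytes in the encoded path are the three real separators,
so the byte-level split yields the same components as a character-level
one would.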

The Plan 9 choices proved wise and prescient.  I do not know how much 
the Plan 9 implementation and the USENIX paper influenced further 
developments (its authors may enlighten us), but in the end Unicode 
converged with ISO 10646 becoming a single standard, and UTF-8 was 
widely adopted.

The Plan 9 team's decision to adopt UTF-8 was by no means a given. 
Consider the case of Microsoft, which released Windows NT with Unicode 
support in the same year.  Windows NT supported a wide-character
encoding, not UTF-8: initially UCS-2 and later UTF-16.  To achieve
backward compatibility the Windows API offers
two functions for each call involving strings: a so-called "ANSI" 
version (actually using the currently active code page) and a "Wide" 
(Unicode) version.  Furthermore, text files use a byte order mark to 
inform programs regarding their character representation, and in C/C++
code string literals are often wrapped in a special macro (such as
TEXT() or _T()) to facilitate porting to wide characters.  In the end,
in 2019 Microsoft yielded, supporting
UTF-8 in its Windows API through code page 65001 (CP_UTF8), and 
recommending its use.  The double APIs and BOM files are still with us 
as a reminder that deficient technical decisions come at a cost.
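
As a rough illustration of that cost, the Python sketch below compares
the two encodings on a hypothetical three-character string.  The
sniff_encoding helper is my own simplification, not any Windows API: it
handles only UTF-8 and UTF-16 byte order marks, ignores UTF-32, and
assumes BOM-less data is UTF-8.

```python
import codecs

# Hypothetical sample: U+012F (i with ogonek), a path separator, and 'a'.
text = "\u012f/a"

utf8 = text.encode("utf-8")
utf16le = text.encode("utf-16-le")

# In little-endian UTF-16, U+012F is the bytes 0x2F 0x01, so a byte-wise
# scan sees a spurious '/'.  In UTF-8 only the real separator is 0x2F.
slashes_utf8 = utf8.count(b"/")
slashes_utf16 = utf16le.count(b"/")


def sniff_encoding(data: bytes) -> str:
    """Guess a codec name from a leading BOM (UTF-8 and UTF-16 only)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return "utf-8"  # assumption: no BOM means UTF-8


# A reader must look at the BOM before it can interpret the wide text.
with_bom = codecs.BOM_UTF16_LE + utf16le
roundtrip = with_bom.decode(sniff_encoding(with_bom))
print(slashes_utf8, slashes_utf16, roundtrip == text)
```

Byte-oriented code therefore cannot be reused on wide-character data
without changes, which is what forces the parallel set of API calls.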

Diomidis - https://www.spinellis.gr