[TUHS] Character sets

Johnny Billquist bqt at update.uu.se
Mon Mar 28 08:19:32 AEST 2016


On 2016-03-27 23:59, Greg 'groggy' Lehey wrote:
> On Sunday, 27 March 2016 at 23:53:32 +0200, Johnny Billquist wrote:
>> On 2016-03-27 23:49, Greg 'groggy' Lehey wrote:
>>> On Sunday, 27 March 2016 at 13:47:43 +0200, Johnny Billquist wrote:
>>>> On 2016-03-27 13:29, John Cowan wrote:
>>>>> Johnny Billquist scripsit:
>>>>>
>>>>>> On 2016-03-27 08:18, Greg 'groggy' Lehey<grog at lemis.com> wrote:
>>>>>>> Isn't it wonderful that we no longer have issues with character
>>>>>>> representation?
>>>>>>
>>>>>> I hope that comment was meant as a joke, ironic, cynical, or whatever...
>>>>>
>>>>> Undoubtedly.  But things *are* much better than they used to be:
>>>>> we can now do everything within a single character set, and convert
>>>>> only at the boundaries (and increasingly, only in one direction).
>>>>
>>>> Haha. Yes... Except that you now have multiple representations of each
>>>> character within one character set. So what has improved???
>>>
>>> In the Good Old Days, characters were all the same size, and you could
>>> do nice, simple things like
>>>
>>>     while (*c && *c++ != " ");
>>>
>>> Now you need a whole library to do the same thing.
>>
>> Another one I noted a while ago was that functions and command in Unix,
>> such as lpq, which try to print things in nice columns now fail, because
>> the code don't actually know how many characters have been output.
>>
>> And let's not even talk about such wonderful concepts as colors in the
>> character set definition... Unicode seems to have it all... I wonder how
>> many code points exist for 'A'. It's definitely more than one...
>
> For some definition of A, of course.  In addition there's clearly at
> least Α (0x391) and А (0x410).

Oh, definitely. I'm trying to limit myself to Latin-A at the moment. 
Otherwise the list will just be ridiculously long.

You have, of course, U+41, but you also have U+FF21. But if you want to 
go slightly silly, you also have U+1F110, U+1F130, U+1F150, U+1F170, 
U+1F1E6, U+E0041... And god know if I've missed some other ones.
Of course whitespace and other typographic details matters. That's why 
we have different code points for the letter, depending on things like 
whitespace.

	Johnny

-- 
Johnny Billquist                  || "I'm on a bus
                                   ||  on a psychedelic trip
email: bqt at softjar.se             ||  Reading murder books
pdp is alive!                     ||  tryin' to stay hip" - B. Idol



More information about the TUHS mailing list