[TUHS] Character sets

Random832 random832 at fastmail.com
Mon Mar 28 15:12:58 AEST 2016


On Sun, Mar 27, 2016, at 21:58, John Cowan wrote:
> Random832 scripsit:
> 
> > Sure it does, but replace that != " " with !isblank(*c), and it doesn't
> > work anymore since it ignores multibyte characters. 
> 
> In which locales does isblank() actually return true on characters other
> than space and tab?  (This is a straight question.)

See, no, that's a trick question. None of the other blank class
characters are single-byte, so of course isblank doesn't. The following
characters return true on is*w*blank for me: U+00a0 U+1680 U+2000 U+2001
U+2002 U+2003 U+2004 U+2005 U+2006 U+2007 U+2008 U+2009 U+200a U+200b
U+202f U+205f U+3000 (Oddly enough, isblank(0xA0) is true even in the
UTF-8 locale, though of course U+00a0 is actually a multibyte character
"\xc2\xa0".) So, if what you _want_ is to find the next blank character,
doing this loop with isblank won't work. If what you want is to find
space or tab, sure. But that's why grep is so slow on patterns
containing \s.
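
For what it's worth, here is a rough sketch of the multibyte-aware version
of that loop. It is only an illustration, not anything from grep or libc:
find_blank_mb is a made-up name, and it assumes the caller has already done
setlocale(LC_CTYPE, "") so that mbrtowc() decodes the locale's encoding.
Which characters iswblank() accepts still depends on the locale and the C
library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: return a pointer to the first character in s that
 * iswblank() accepts in the current locale, or NULL if there is none. */
static const char *find_blank_mb(const char *s)
{
    mbstate_t st;
    size_t left;

    assert(s != NULL);
    memset(&st, 0, sizeof st);      /* start in the initial shift state */
    left = strlen(s);

    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &st);

        if (n == 0)
            break;                  /* decoded a NUL; should not happen here */
        if (n == (size_t)-1 || n == (size_t)-2) {
            /* Invalid or truncated sequence: skip one byte, reset state. */
            memset(&st, 0, sizeof st);
            n = 1;
        } else if (iswblank((wint_t)wc)) {
            return s;               /* points at the first byte of the blank */
        }
        s += n;
        left -= n;
    }
    return NULL;
}
```

The cost is visible right there: every character goes through mbrtowc()
and a wide-character class lookup instead of a one-byte comparison.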


