Re: i18n-ISSUE-411: Definition of whitespace should come from Unicode

Eric Prud'hommeaux scripsit:

> I suspect that whitespace is pretty consistently treated as the four
> control codes at this point. In 2006 I tried a more inclusive definition
> of whitespace in SPARQL but folks said "what the hell is this? Everybody
> knows that whitespace is four characters." Had things like non-breaking,
> zero-width, all-singing space stayed in SPARQL, parsers would have
> required multi-byte lexers and the interoperability of incomplete
> implementations would have suffered.
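
To make the multi-byte point concrete, here is a minimal Python sketch of
a whitespace skipper working on raw UTF-8 bytes. The function names are
mine and hypothetical, not taken from any SPARQL implementation, and the
wider set is only an illustrative subset. With the four-character
definition one byte comparison per input byte is enough; with anything
wider the lexer has to match multi-byte sequences.

```python
# Hypothetical sketch: skipping whitespace over raw UTF-8 bytes.
ASCII_WS = frozenset(b"\t\n\r ")  # tab, LF, CR, space as byte values

def skip_ascii_ws(buf: bytes, pos: int = 0) -> int:
    """Index of the first byte at or after pos that is not one of the four."""
    while pos < len(buf) and buf[pos] in ASCII_WS:
        pos += 1
    return pos

# A few non-ASCII whitespace characters as UTF-8 byte sequences
# (no-break space, em space, ideographic space); not the full list.
MULTIBYTE_WS = tuple(ch.encode("utf-8") for ch in "\u00a0\u2003\u3000")

def skip_wider_ws(buf: bytes, pos: int = 0) -> int:
    """Same as above, but also accepting the multi-byte sequences."""
    while pos < len(buf):
        if buf[pos] in ASCII_WS:
            pos += 1
            continue
        for seq in MULTIBYTE_WS:
            if buf.startswith(seq, pos):
                pos += len(seq)
                break
        else:
            # No single-byte or multi-byte whitespace matched here.
            return pos
    return pos
```

The second function is still short, but it can no longer decide on one
byte at a time, which is the cost an incomplete implementation would be
tempted to skip.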

If we look in detail at the 25 characters that Unicode says are of type
WS, here's what we find:

Tab, CR, LF, Space:  the Fantastic Four.

Vertical tab, form feed: mostly obsolete ASCII controls.

Next line (U+0085): ANSI's failed attempt to create a newline character
distinct from LF; no one uses it.

Line separator, paragraph separator (U+2028..U+2029): Unicode's failed
attempt to create these distinct from LF and LF+LF; no one uses them either.

Em quad and space, en quad and space (1/2 em), 1/3 em space, 1/4 em
space, 1/6 em space, figure space, punctuation space, thin space, hair
space (U+2000..U+200A): fixed-width spaces introduced into Unicode for
compatibility with old typesetting software, but rarely or never used.

Medium mathematical space (U+205F):  also has a fixed width (4/18 em,
the space around a mathematical operator), but was introduced into
Unicode much later, and I'm not sure exactly why.

No-break space (U+00A0): works more like a printing character that
doesn't happen to print anything than like whitespace.  Commonly used as
a method of forcing horizontal whitespace for (crude) formatting purposes.

> The downside is that someone typing in some script with its own
> whitespace (does that exist?) must use ASCII space, but they have to
> anyways because all of the language keywords are in ASCII.

Of the three remaining space characters, two are like that:

Ideographic space (U+3000):  yet another fixed-width space, but actually
heavily used in Japanese text.

Ogham space mark (U+1680):  like Devanagari letters, Ogham letters hang
down from a head line.  Unlike Devanagari, the head line connects words.
Consequently, a head line without a letter needs to be represented
specially, so U+1680 works like a space but puts ink on the surface.
Naturally, Ogham is not much used.

Narrow no-break space (U+202F):  used in Mongolian within certain words;
words are separated by the ordinary space.
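
For anyone who wants to check the tally, the list above can be written
down as a small Python table. This is only a sketch: the group labels are
my own shorthand for the paragraphs above, and the code points are typed
in by hand rather than read from the White_Space property, which Python's
standard unicodedata module does not expose directly.

```python
import unicodedata

# Code points typed in by hand from the list above; labels are informal.
WS_GROUPS = {
    "fantastic four": [0x0009, 0x000A, 0x000D, 0x0020],
    "obsolete ASCII controls": [0x000B, 0x000C],
    "next line": [0x0085],
    "line/paragraph separator": [0x2028, 0x2029],
    "fixed-width typesetting spaces": list(range(0x2000, 0x200B)),
    "medium mathematical space": [0x205F],
    "no-break space": [0x00A0],
    "remaining three": [0x3000, 0x1680, 0x202F],
}

ALL_WS = frozenset(cp for cps in WS_GROUPS.values() for cp in cps)

def describe(cp: int) -> str:
    """Format one code point with its general category and UTF-8 length."""
    ch = chr(cp)
    name = unicodedata.name(ch, "<control>")  # control characters have no name
    return f"U+{cp:04X} {unicodedata.category(ch)} {len(ch.encode('utf-8'))}B {name}"

if __name__ == "__main__":
    for label, cps in WS_GROUPS.items():
        print(label)
        for cp in cps:
            print("   ", describe(cp))
    print(len(ALL_WS), "characters in total")
```

Only the first two groups fit in a single UTF-8 byte; everything else
takes two or three.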

-- 
John Cowan          http://www.ccil.org/~cowan        cowan@ccil.org
Humpty Dump Dublin squeaks through his norse
                Humpty Dump Dublin hath a horrible vorse
But for all his kinks English / And his irismanx brogues
                Humpty Dump Dublin's grandada of all rogues.  --Cousin James

Received on Sunday, 8 March 2015 17:50:30 UTC