- From: John Cowan <cowan@mercury.ccil.org>
- Date: Sun, 8 Mar 2015 13:49:58 -0400
- To: Eric Prud'hommeaux <eric@w3.org>
- Cc: Andrew Sullivan <ajs@anvilwalrusden.com>, public-ldp-comments@w3.org, cowan@ccil.org, Steven Atkin <atkin@us.ibm.com>, www-international@w3.org
Eric Prud'hommeaux scripsit:
> I suspect that whitespace is pretty consistently treated as the four
> control codes at this point. In 2006 I tried a more inclusive definition
> of whitespace in SPARQL but folks said "what the hell is this? Everybody
> knows that whitespace is four characters." Had things like non-breaking,
> zero-width, all-singing space stayed in SPARQL, parsers would have
> required multi-byte lexers and the interoperability of incomplete
> implementations would have suffered.
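To make the multi-byte point concrete, here is a minimal sketch, assuming the
input is encoded as UTF-8 (the thread does not say which encoding a given
lexer sees). The names SAMPLES, utf8_width and WIDTHS are illustrative only,
not taken from any SPARQL implementation. It shows that the four ASCII
whitespace characters each fit in one byte, while the other spaces need two
or three.

```python
# Sketch only: compares UTF-8 encoded widths of a few whitespace characters.
# SAMPLES, utf8_width and WIDTHS are hypothetical names for illustration.
SAMPLES = {
    "tab": "\u0009",
    "line feed": "\u000A",
    "carriage return": "\u000D",
    "space": "\u0020",
    "no-break space": "\u00A0",
    "ogham space mark": "\u1680",
    "ideographic space": "\u3000",
}


def utf8_width(ch: str) -> int:
    """Return the number of bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# Map each sample name to its encoded width in bytes.
WIDTHS = {name: utf8_width(ch) for name, ch in SAMPLES.items()}

if __name__ == "__main__":
    for name, width in WIDTHS.items():
        print(f"{name}: {width} byte(s)")
```

A lexer that only has to recognize the first four can test single bytes; one
that accepts the rest has to decode multi-byte sequences first.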
If we look in detail at the 25 characters that Unicode says are of type
WS, here's what we find:
Tab, CR, LF, Space: the Fantastic Four.
Vertical tab, form feed: mostly obsolete ASCII controls.
Next line (U+0085): ANSI's failed attempt to create a newline character
distinct from LF; no one uses it.
Line separator, paragraph separator (U+2028..U+2029): Unicode's failed
attempt to create these distinct from LF and LF+LF; no one uses them either.
Em quad and space, en quad and space (1/2 em), 1/3 em space, 1/4 em
space, 1/6 em space, figure space, punctuation space, thin space, hair
space (U+2000..U+200A): fixed-width spaces introduced into Unicode for
compatibility with old typesetting software, but rarely or never used.
Medium mathematical space (U+205F): also has a fixed width (4/18 em,
the space around a mathematical operator), but was introduced into
Unicode much later, and I'm not sure exactly why.
No-break space (U+00A0): works more like a printing character that
doesn't happen to print anything than like whitespace. Commonly used as
a method of forcing horizontal whitespace for (crude) formatting purposes.
> The downside is that someone typing in some script with its own
> whitespace (does that exist?) must use ASCII space, but they have to
> anyways because all of the language keywords are in ASCII.
Of the three remaining space characters, two are like that:
Ideographic space (U+3000): yet another fixed-width space, but actually
heavily used in Japanese text.
Ogham space mark (U+1680): like Devanagari letters, Ogham letters hang
down from a head line. Unlike Devanagari, the head line connects words.
Consequently, a head line without a letter needs to be represented
specially, so U+1680 works like a space but puts ink on the surface.
Naturally, Ogham is not much used.
Narrow no-break space (U+202F): Used in Mongolian within certain words;
words are separated by the ordinary space.
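
For anyone who wants to check the tally, the groups above can be written down
as a small Python sketch. The code points are the ones listed in this message;
the names WHITE_SPACE and describe are illustrative only. Python's unicodedata
module has no names for the control characters, so the sketch substitutes the
placeholder "<control>" for those.

```python
import unicodedata

# The groups enumerated above, in the same order. WHITE_SPACE and describe
# are hypothetical names used only for this sketch.
WHITE_SPACE = (
    [0x0009, 0x000A, 0x000D, 0x0020]   # tab, LF, CR, space
    + [0x000B, 0x000C]                 # vertical tab, form feed
    + [0x0085]                         # next line
    + [0x2028, 0x2029]                 # line and paragraph separator
    + list(range(0x2000, 0x200B))      # U+2000..U+200A, fixed-width spaces
    + [0x205F]                         # medium mathematical space
    + [0x00A0]                         # no-break space
    + [0x3000, 0x1680, 0x202F]         # ideographic, Ogham, narrow no-break
)


def describe(cp: int) -> str:
    """Format a code point with its Unicode name, or a placeholder."""
    name = unicodedata.name(chr(cp), "<control>")
    return f"U+{cp:04X} {name}"


if __name__ == "__main__":
    for cp in WHITE_SPACE:
        print(describe(cp))
    print(len(WHITE_SPACE), "characters in total")
```

The list comes out at 25 entries, matching the count given at the top.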
--
John Cowan http://www.ccil.org/~cowan cowan@ccil.org
Humpty Dump Dublin squeaks through his norse
Humpty Dump Dublin hath a horrible vorse
But for all his kinks English / And his irismanx brogues
Humpty Dump Dublin's grandada of all rogues. --Cousin James
Received on Sunday, 8 March 2015 17:50:32 UTC