- From: John Cowan <cowan@mercury.ccil.org>
- Date: Sun, 8 Mar 2015 13:49:58 -0400
- To: Eric Prud'hommeaux <eric@w3.org>
- Cc: Andrew Sullivan <ajs@anvilwalrusden.com>, public-ldp-comments@w3.org, cowan@ccil.org, Steven Atkin <atkin@us.ibm.com>, www-international@w3.org
Eric Prud'hommeaux scripsit:

> I suspect that whitespace is pretty consistently treated as the four
> control codes at this point. In 2006 I tried a more inclusive definition
> of whitespace in SPARQL but folks said "what the hell is this? Everybody
> knows that whitespace is four characters." Had things like non-breaking,
> zero-width, all-singing space stayed in SPARQL, parsers would have
> required multi-byte lexers and the interoperability of incomplete
> implementations would have suffered.

If we look in detail at the 25 characters that Unicode says are of type
WS, here's what we find:

Tab, CR, LF, Space: the Fantastic Four.

Vertical tab, form feed: mostly obsolete ASCII controls.

Next line (U+0085): ANSI's failed attempt to create a newline character
distinct from LF that no one uses.

Line separator, paragraph separator (U+2028..U+2029): Unicode's failed
attempt to create these distinct from LF and LF+LF that no one uses.

Em quad and space, en quad and space (1/2 em), 1/3 em space, 1/4 em space,
1/6 em space, figure space, punctuation space, thin space, hair space
(U+2000..U+200A): fixed-width spaces introduced into Unicode for
compatibility with old typesetting software, but rarely or never used.

Medium mathematical space (U+205F): also has a fixed width (4/18 em,
the space around a mathematical operator), but was introduced into
Unicode much later, and I'm not sure exactly why.

No-break space (U+00A0): works more like a printing character that
doesn't happen to print anything than like whitespace. Commonly used
as a method of forcing horizontal whitespace for (crude) formatting
purposes.

> The downside is that someone typing in some script with its own
> whitespace (does that exist?) must use ASCII space, but they have to
> anyways because all of the language keywords are in ASCII.

Of the three remaining space characters, two are like that:

Ideographic space (U+3000): yet another fixed-width space, but actually
heavily used in Japanese text.

Ogham space mark (U+1680): like Devanagari letters, Ogham letters hang
down from a head line. Unlike Devanagari, the head line connects words.
Consequently, a head line without a letter needs to be represented
specially, so U+1680 works like a space but puts ink on the surface.
Naturally, Ogham is not much used.

Narrow no-break space (U+202F): Used in Mongolian within certain words;
words are separated by the ordinary space.

-- 
John Cowan          http://www.ccil.org/~cowan        cowan@ccil.org
Humpty Dump Dublin squeaks through his norse
Humpty Dump Dublin hath a horrible vorse
But for all his kinks English / And his irismanx brogues
Humpty Dump Dublin's grandada of all rogues.  --Cousin James
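
For readers who want to see the two definitions side by side, here is a minimal Python sketch, not taken from any SPARQL or LDP implementation. The code points are transcribed by hand from the inventory above rather than read from the Unicode database, and the names `FANTASTIC_FOUR`, `UNICODE_WS`, `is_ascii_ws`, `is_unicode_ws`, and `describe` are made up for illustration.

```python
import unicodedata

# The four characters most ASCII-oriented grammars treat as whitespace.
FANTASTIC_FOUR = frozenset("\t\n\r ")

# The 25 code points enumerated in the message, transcribed by hand
# (an assumption of this sketch: the list is not derived from the UCD).
UNICODE_WS = frozenset(
    [0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0020, 0x0085, 0x00A0, 0x1680]
    + list(range(0x2000, 0x200B))  # U+2000..U+200A, the typographic spaces
    + [0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)


def is_ascii_ws(ch: str) -> bool:
    """True if ch is one of tab, LF, CR, or space."""
    return ch in FANTASTIC_FOUR


def is_unicode_ws(ch: str) -> bool:
    """True if ch is one of the 25 code points listed above."""
    return ord(ch) in UNICODE_WS


def describe(cp: int) -> str:
    """Format a code point with its Unicode name; controls have no name."""
    return f"U+{cp:04X} {unicodedata.name(chr(cp), '<control>')}"
```

A lexer built on `is_ascii_ws` can work a byte at a time, whereas `is_unicode_ws` needs decoded code points, which is the multi-byte cost mentioned in the quoted text. Calling `describe` on each member of `UNICODE_WS - {ord(c) for c in FANTASTIC_FOUR}` lists the 21 characters that the narrower definition leaves out.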
Received on Sunday, 8 March 2015 17:50:30 UTC