Re: i18n-ISSUE-411: Definition of whitespace should come from Unicode

Nice summary, John. Some comments below.

A./

On 3/8/2015 10:49 AM, John Cowan wrote:
> Eric Prud'hommeaux scripsit:
>
>> I suspect that whitespace is pretty consistently treated as the four
>> control codes at this point. In 2006 I tried a more inclusive definition
>> of whitespace in SPARQL but folks said "what the hell is this? Everybody
>> knows that whitespace is four characters." Had things like non-breaking,
>> zero-width, all-singing space stayed in SPARQL, parsers would have
>> required multi-byte lexers and the interoperability of incomplete
>> implementations would have suffered.
> If we look in detail at the 25 characters that Unicode says are of type
> WS, here's what we find:
>
> Tab, CR, LF, Space:  the Fantastic Four.
>
> Vertical tab, form feed: mostly obsolete ASCII controls.
>
> Next line (U+0085): ANSI's failed attempt to create a newline character
> distinct from LF that no one uses.
>
> Line separator, paragraph separator (U+2028..U+2029): Unicode's failed
> attempt to create these distinct from LF and LF+LF that no one uses.
>
> Em quad and space, en quad and space (1/2 em), 1/3 em space, 1/4 em
> space, 1/6 em space, figure space, punctuation space, thin space, hair
> space (U+2000..U+200A): fixed-width spaces introduced into Unicode for
> compatibility with old typesetting software, but rarely or never used.

They are useful whenever one needs to create a fixed amount of relative space
between two items that otherwise float inline. This is rarely needed in ordinary
text, but not unknown in mathematical and other specialized typesetting.

Their other use, in historic hot-lead typography, was to make up the space
needed for indents and the like; that is now handled differently. Positional
offsets from the margin are thus different from fixed relative space between
adjacent characters.

The latter could be represented in styles by the relatively awkward means of
setting up spans with explicit inter-character spacing, but that's fragile. So,
rare, but not "never used".

> Medium mathematical space (U+205F):  also has a fixed width (4/18 em,
> the space around a mathematical operator), but was introduced into
> Unicode much later, and I'm not sure exactly why.

It is for use in mathematical typesetting, where it can express the relative
space between two symbols on the line. It is most useful when characters are
used as operators but are not recognized as such by the typesetting software,
which otherwise should have been able to supply the correct spacing
automatically.

>
> No-break space (U+00A0): works more like a printing character that
> doesn't happen to print anything than like whitespace.  Commonly used as
> a method of forcing horizontal whitespace for (crude) formatting purposes.

Also commonly used to "keep together" pairs of words, like title and name, in
ordinary text.
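
A small sketch of that "keep together" behavior, using Python's textwrap
module as a stand-in for a line breaker. The sample sentence and the width of
14 are made up for illustration; real layout engines follow the Unicode line
breaking rules, which are far richer than this.

```python
import textwrap

NBSP = "\u00a0"  # NO-BREAK SPACE

def wrap(text, width=14):
    # textwrap only breaks lines at ASCII whitespace (and after hyphens),
    # so U+00A0 is never offered as a break opportunity.
    return textwrap.wrap(text, width=width)

# With an ordinary space, the title can be stranded at the end of a line.
plain = wrap("Please ask Dr. Smith now")

# With NBSP between title and name, the pair moves to the next line together.
glued = wrap("Please ask Dr." + NBSP + "Smith now")

for line in plain + ["--"] + glued:
    print(repr(line))
```

The first call leaves "Dr." at the end of the first line; the second keeps
"Dr." and "Smith" on the same line.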

Its most common use, by far, is as a placeholder in empty paragraphs in HTML
documents. :)   (Needless to say, that is not what it was encoded for...)

The fact that HTML collapses whitespace (deeming it to belong to the syntactic
substrate, rather than the text content) is the reason why NBSP occurs instead
of SPACE in HTML documents for this purpose.
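
To make the collapsing concrete, here is a deliberately simplified model of
it. The function name is mine, and the rule shown (runs of tab, LF, FF, CR and
space become one space) is only an approximation of what browsers do under
default CSS white-space handling.

```python
import re

# The five characters HTML treats as ASCII whitespace:
# tab, line feed, form feed, carriage return, space.
_HTML_WS_RUN = re.compile(r"[ \t\n\f\r]+")

def collapse_html_whitespace(text):
    """Replace each run of HTML ASCII whitespace with a single space.

    Hypothetical helper for illustration; U+00A0 is not in the set,
    so it is left untouched.
    """
    return _HTML_WS_RUN.sub(" ", text)

print(repr(collapse_html_whitespace("a  \n\t b")))        # run of SPACE etc. collapses
print(repr(collapse_html_whitespace("a\u00a0\u00a0b")))   # NBSPs survive as content
```

Under this model a paragraph holding only spaces ends up with nothing but
collapsible whitespace, while one holding a single NBSP keeps a real character
of content, which is why NBSP gets used as the placeholder.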

>
>> The downside is that someone typing in some script with its own
>> whitespace (does that exist?) must use ASCII space, but they have to
>> anyways because all of the language keywords are in ASCII.
> Of the three remaining space characters, two are like that:
>
> Ideographic space (U+3000):  yet another fixed-width space, but actually
> heavily used in Japanese text.
>
> Ogham space mark (U+1680):  like Devanagari letters, Ogham letters hang
> down from a head line.  Unlike Devanagari, the head line connects words.
> Consequently, a head line without a letter needs to be represented
> specially, so U+1680 works like a space but puts ink on the surface.
> Naturally, Ogham is not much used.
>
> Narrow no-break space (U+202F):  Used in Mongolian within certain words;
> words are separated by the ordinary space.

Also, it is not encoded in the Mongolian block, because it is useful in certain
other contexts outside Mongolian; those are mainly notational.
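
For anyone who wants to experiment with the two candidate definitions, here is
a rough sketch. The code points are the 25 from John's list above; the names
FANTASTIC_FOUR, UNICODE_WS and split_on are made up for this example, and the
splitter is a toy, not a proposal for any particular grammar.

```python
# The narrow definition: tab, LF, CR, space.
FANTASTIC_FOUR = frozenset("\t\n\r ")

# The 25 characters enumerated in this thread as Unicode whitespace.
UNICODE_WS = frozenset(
    [chr(cp) for cp in range(0x0009, 0x000E)]                   # tab, LF, VT, FF, CR
    + [chr(cp) for cp in (0x0020, 0x0085, 0x00A0, 0x1680)]      # space, NEL, NBSP, Ogham
    + [chr(cp) for cp in range(0x2000, 0x200B)]                 # en quad .. hair space
    + [chr(cp) for cp in (0x2028, 0x2029, 0x202F, 0x205F, 0x3000)]
)

def split_on(text, whitespace):
    """Split text into tokens, treating only `whitespace` as separators."""
    tokens, current = [], []
    for ch in text:
        if ch in whitespace:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens

sample = "a\u3000b c"  # ideographic space, then an ordinary space
print(split_on(sample, FANTASTIC_FOUR))
print(split_on(sample, UNICODE_WS))
```

With the narrow set the ideographic space stays inside the first token; with
the wide set it separates tokens like any other space.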

Received on Sunday, 8 March 2015 18:20:46 UTC