
Re: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization

From: Jonathan Kew <jonathan@jfkew.plus.com>
Date: Wed, 4 Feb 2009 23:32:53 +0000
Cc: Andrew Cunningham <andrewc@vicnet.net.au>, public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
Message-Id: <B3149E4D-4ADD-4D99-BDC0-8D010BCF970E@jfkew.plus.com>
To: Henri Sivonen <hsivonen@iki.fi>

On 4 Feb 2009, at 12:36, Henri Sivonen wrote:

> Before anyone accuses me of “Western bias”, I'd like to point out that there is precedent for preferring (in my opinion quite reasonably) a faster kind of Unicode string equality relation over one that could be claimed to make more sense to users of Western languages: XML deliberately uses code point for code point string comparison for start/end tag names instead of Unicode-wise correct case-insensitive comparison for performance reasons (and to avoid having to pick a collation locale). (I'm assuming here that we can agree that bicameral scripts are big in the “West” and case-insensitive equality makes some intuitive sense for the users of bicameral scripts.)

The choice of case-sensitivity vs. case-insensitivity is not really comparable. It's true that most current programming and markup languages are case-sensitive, I think, although this is not universal. (What does the HTML spec say? The UAs I'm accustomed to seem to treat it as case-insensitive.) Similarly with names in filesystems: both case-sensitive and case-insensitive systems are in widespread use. There is presumably a performance win for those that are case-sensitive, but it doesn't appear to be a compelling criterion for system developers.
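To make concrete why Unicode-wise correct case-insensitive comparison is more involved than plain codepoint comparison, here is a minimal sketch in Python (my own illustration, not from the thread; it uses Python's `str.casefold`, which implements Unicode full case folding):

```python
# Simple lowercasing handles ASCII, but Unicode-correct case-insensitive
# comparison needs full case folding: German "ß" case-folds to "ss",
# which no per-character lowercasing will produce.
upper = "STRASSE"
lower = "straße"

print(upper.lower() == lower)                # False: naive lowering falls short
print(upper.casefold() == lower.casefold())  # True: full case folding matches
```

A codepoint-for-codepoint comparison, by contrast, is a single pass with no tables and no locale questions, which is the performance argument being quoted above.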

However, the key difference (in my mind, at least) is that for bicameral Western scripts, the user can clearly see the difference between upper and lower case. Yes, we do "equate" upper- and lower-case letters in some sense, but we are also well aware that they exist as two different things, and we are aware of which we're using at any given time. So if the markup system in use is case-sensitive, it is easy for the user to see whether the data is consistent and correct. Where normalization is concerned, this is not so: canonically-equivalent Unicode sequences are supposed to be essentially indistinguishable to the user, except through the use of special low-level facilities for examining the underlying codepoint sequence.
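The point about canonical equivalence can be shown in a few lines of Python (the example strings are mine, not from the thread; `unicodedata.normalize` is the standard-library normalization function):

```python
import unicodedata

# Two canonically-equivalent spellings of "café": they render identically,
# yet differ at the codepoint level.
nfc = "caf\u00e9"    # precomposed: U+00E9 LATIN SMALL LETTER E WITH ACUTE
nfd = "cafe\u0301"   # decomposed: "e" + U+0301 COMBINING ACUTE ACCENT

print(nfc == nfd)                                # False: raw codepoint comparison
print(unicodedata.normalize("NFC", nfd) == nfc)  # True once both are in NFC
```

This is exactly the situation described above: a user looking at the rendered text cannot tell the two apart, yet a codepoint-for-codepoint matcher treats them as different identifiers.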

For a better analogy, imagine having to work with a language like HTML or XML or CSS, but where the distinction between lines terminated by CR, LF, or CR/LF is significant to the interpretation of your markup. Now imagine sharing files and fragments with co-workers using different tools on different platforms, where the usual line-end conventions vary. Some of your editors may show the different kinds of line ending differently, perhaps, but many will display them all identically. That's not a scenario I would wish on anyone. I don't want "human-readable" source code that has to be hex-dumped before I can be sure of its meaning.
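The hex-dump remark can be illustrated with a quick sketch (my own example, assuming a hypothetical markup where the terminator mattered):

```python
# Two visually identical lines of CSS that differ only in line terminator.
lf = "h1 { color: red }\n"      # LF-terminated
crlf = "h1 { color: red }\r\n"  # CRLF-terminated

print(lf == crlf)               # False, though most editors render them alike
# Only dumping the underlying bytes reveals why:
print(lf.encode("ascii").hex())
print(crlf.encode("ascii").hex())
```

If a language made that difference significant, this byte-level dump would be the only reliable way to audit a file's meaning, which is the scenario being objected to.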

JK
Received on Wednesday, 4 February 2009 23:33:50 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Wednesday, 4 February 2009 23:33:53 GMT