
Re: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization

From: Henri Sivonen <hsivonen@iki.fi>
Date: Thu, 5 Feb 2009 17:22:43 +0200
Cc: Andrew Cunningham <andrewc@vicnet.net.au>, public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
Message-Id: <B015AAE8-84D3-4156-B2EF-87532289A396@iki.fi>
To: Jonathan Kew <jonathan@jfkew.plus.com>

On Feb 5, 2009, at 01:32, Jonathan Kew wrote:

> On 4 Feb 2009, at 12:36, Henri Sivonen wrote:
>> Before anyone accuses me of “Western bias”, I'd like to point  
>> out that there is precedent for preferring (in my opinion quite  
>> reasonably) a faster kind of Unicode string equality relation over  
>> one that could be claimed to make more sense to users of Western  
>> languages: XML deliberately uses code point for code point string  
>> comparison for start/end tag names instead of Unicode-wise correct  
>> case-insensitive comparison for performance reasons (and to avoid  
>> having to pick a collation locale). (I'm assuming here that we can  
>> agree that bicameral scripts are big in the “West” and
>> case-insensitive equality makes some intuitive sense for the users
>> of bicameral scripts.)
> The choice of case-sensitivity vs case-insensitivity is not really  
> comparable.

My point is that it's generally not helpful to bring out the Western  
bias[1] thing in discussions of using Unicode in computer languages.  
Previously, too, performance has been preferred over full natural  
language complexity for computer language identifier equality  
comparison and in that instance clearly it could not have been an  
issue of Western bias. The thing is that comparing computer language  
identifiers code point for code point is the common-sense thing to do.  
If you consider the lack of case-insensitivity, some languages are not
perfectly convenienced. If you consider the lack of normalization,
another (overlapping) set of languages is not perfectly convenienced.
If you consider the sensitivity to diacritics, yet another set of  
languages is not perfectly convenienced. No language is prohibited by  
code point for code point comparison, though.
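
To make the normalization point concrete, here is a minimal Python
sketch (illustrative only; the thread is about CSS and HTML
identifiers, not Python) showing that two canonically equivalent
strings differ code point for code point until they are normalized:

```python
import unicodedata

# Two canonically equivalent spellings of "café":
# NFC uses the precomposed U+00E9; NFD uses "e" plus U+0301.
composed = "caf\u00e9"         # 4 code points
decomposed = "cafe\u0301"      # 5 code points

# Code point for code point comparison says they differ...
assert composed != decomposed

# ...while a normalization-sensitive comparison says they match.
assert unicodedata.normalize("NFC", composed) == \
       unicodedata.normalize("NFC", decomposed)
```

The fast comparison is the plain `!=`; the normalizing one has to
transform both operands first, which is the cost being debated here.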

> It's true that most current programming and markup languages are  
> case sensitive, I think, although this is not universal. (What does  
> the HTML spec say? The UAs I'm accustomed to seem to treat it as  
> case-insensitive.)

HTML5 says the parser replaces A through Z with a through z and
thereafter the comparisons are done code point for code point. This is
done for backward compatibility. Doing it this way works because the
conforming HTML vocabulary is entirely in the Basic Latin range.

Also, doing it this way avoids the problem of sneaking scripts past
ASCII-oriented blacklist-based gatekeepers by writing <SCRİPT>.
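
As a sketch of why the ASCII-only lowercasing is the safer parser
behavior, compare it with a hypothetical "clever" filter that applies
full Unicode lowercasing and then strips combining marks
(`ascii_lower` and `overzealous_lower` are illustrative names of my
own, not anything from the HTML5 spec):

```python
import unicodedata

def ascii_lower(s):
    """HTML5-style lowercasing: map only A-Z to a-z."""
    return "".join(
        chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in s
    )

def overzealous_lower(s):
    """Hypothetical gatekeeper: full Unicode lowercasing, then
    normalization and stripping of combining marks."""
    lowered = unicodedata.normalize("NFD", s.lower())
    return "".join(c for c in lowered
                   if unicodedata.category(c) != "Mn")

evil = "SCR\u0130PT"  # SCRİPT, U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE

# With ASCII-only lowercasing, the dotted İ survives, so the tag
# name never compares equal to "script" -- the parser and a naive
# ASCII filter agree that this is not a script element.
assert ascii_lower(evil) == "scr\u0130pt"
assert ascii_lower(evil) != "script"

# A Unicode-folding comparison collapses SCRİPT to "script", so a
# parser using it would treat the element as a script even when an
# ASCII-oriented blacklist upstream saw nothing to block.
assert overzealous_lower(evil) == "script"
```

The bypass exists only when the filter and the parser disagree about
case folding; pinning both to ASCII-only folding removes the gap.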

> Similarly with names in filesystems: both case-sensitive and
> case-insensitive systems are in widespread use. There is presumably a
> performance win for those that are case sensitive, but it doesn't  
> appear to be a compelling criterion for system developers.

File names are exposed to all end users. However, class names and
selectors are exposed only to a more technically savvy group of people
who deal with code.

(Aside: Letting the non-NFC, not-quite-NFD form of HFS+ file names
leak into e.g. URIs is pretty annoying.)
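
A small Python illustration of that annoyance (HFS+ actually
decomposes names per a frozen, slightly modified variant of NFD, hence
"not-quite-NFD"; plain NFD is used below as an approximation):

```python
import unicodedata
from urllib.parse import quote

name_nfc = "caf\u00e9.txt"                         # as the user typed it
name_hfs = unicodedata.normalize("NFD", name_nfc)  # roughly as HFS+ stores it

# Percent-encoded into a URI, the two forms no longer match, so a
# link built from the typed name misses the file name the
# filesystem reports, and vice versa.
assert quote(name_nfc) == "caf%C3%A9.txt"
assert quote(name_hfs) == "cafe%CC%81.txt"
assert quote(name_nfc) != quote(name_hfs)
```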

> However, the key difference (in my mind, at least) is that for  
> bicameral Western scripts, the user can clearly see the difference  
> between upper and lower case.

Sure, but that's not part of the point I was making.

> For a better analogy, imagine having to work with a language like  
> HTML or XML or CSS, but where the distinction between lines  
> terminated by CR, LF, or CR/LF is significant to the interpretation  
> of your markup. Now imagine sharing files and fragments with
> co-workers using different tools on different platforms, where the
> usual line-end conventions vary. Some of your editors may show the  
> different kinds of line differently, perhaps, but many will display  
> them all identically. That's not a scenario I would wish on anyone.  
> I don't want "human-readable" source code that has to be hex-dumped  
> before I can be sure of its meaning.

CRLF-to-LF normalization is so troublesome to implement in
performance-critical code that if it is an analogy for anything, the
conclusion should be firmly against introducing any more things that
are analogous to CRLF. Normalizing CRLF to LF may seem like no big
deal conceptually, but implementing it where e.g. the CR comes from
document.write() and the LF from the network stream gets more complex
than it first appears if you also want to maintain sane buffering that
doesn't copy data too many times.
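
A sketch of that buffering hazard, assuming a simple streaming
normalizer (my own illustration, not any actual parser's code): a CR
at the end of one chunk forces the normalizer to carry state into the
next chunk, which may arrive from a different source.

```python
class CRLFNormalizer:
    """Streaming CRLF/CR -> LF normalizer. The awkward part: a CR
    at the end of one chunk must be remembered until the next chunk
    (possibly from a different source, e.g. document.write() vs.
    the network) reveals whether an LF follows."""

    def __init__(self):
        self.pending_cr = False

    def feed(self, chunk):
        out = []
        for ch in chunk:
            if self.pending_cr:
                self.pending_cr = False
                out.append("\n")       # emit LF for the held CR
                if ch == "\n":
                    continue           # CRLF pair: swallow the LF
            if ch == "\r":
                self.pending_cr = True # hold CR across the boundary
            else:
                out.append(ch)
        return "".join(out)

    def finish(self):
        """Flush a trailing CR at end of input."""
        if self.pending_cr:
            self.pending_cr = False
            return "\n"
        return ""
```

For example, feeding "a\r" and then "\nb" yields "a" and then "\nb":
the CR cannot be resolved until the second chunk arrives, which is
exactly the statefulness that complicates zero-copy buffering.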

[1] http://lists.w3.org/Archives/Public/www-style/2009Jan/0481.html
Henri Sivonen
Received on Thursday, 5 February 2009 15:23:31 UTC
