- From: Jonathan Kew <jonathan@jfkew.plus.com>
- Date: Wed, 4 Feb 2009 23:32:53 +0000
- To: Henri Sivonen <hsivonen@iki.fi>
- Cc: Andrew Cunningham <andrewc@vicnet.net.au>, public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
On 4 Feb 2009, at 12:36, Henri Sivonen wrote:

> Before anyone accuses me of “Western bias”, I'd like to point out
> that there is precedent for preferring (in my opinion quite
> reasonably) a faster kind of Unicode string equality relation over
> one that could be claimed to make more sense to users of Western
> languages: XML deliberately uses code point for code point string
> comparison for start/end tag names instead of Unicode-wise correct
> case-insensitive comparison for performance reasons (and to avoid
> having to pick a collation locale). (I'm assuming here that we can
> agree that bicameral scripts are big in the “West” and
> case-insensitive equality makes some intuitive sense for the users
> of bicameral scripts.)

The choice of case-sensitivity vs case-insensitivity is not really comparable. It's true that most current programming and markup languages are case sensitive, I think, although this is not universal. (What does the HTML spec say? The UAs I'm accustomed to seem to treat it as case-insensitive.) Similarly with names in filesystems: both case-sensitive and case-insensitive systems are in widespread use. There is presumably a performance win for those that are case sensitive, but it doesn't appear to be a compelling criterion for system developers.

However, the key difference (in my mind, at least) is that for bicameral Western scripts, the user can clearly see the difference between upper and lower case. Yes, we do "equate" upper and lower case letters in some sense, but we are also well aware that they exist as two different things, and we are aware of which we're using at any given time. So if the markup system in use is case-sensitive, it is easy for the user to see whether the data is consistent and correct.

Where normalization is concerned, this is not so: canonically-equivalent Unicode sequences are supposed to be essentially indistinguishable to the user, except through the use of special low-level facilities for examining the underlying codepoint sequence.

For a better analogy, imagine having to work with a language like HTML or XML or CSS, but where the distinction between lines terminated by CR, LF, or CR/LF is significant to the interpretation of your markup. Now imagine sharing files and fragments with co-workers using different tools on different platforms, where the usual line-end conventions vary. Some of your editors may show the different kinds of line differently, perhaps, but many will display them all identically.

That's not a scenario I would wish on anyone. I don't want "human-readable" source code that has to be hex-dumped before I can be sure of its meaning.

JK
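P.S. To make the normalization point concrete, here is a minimal sketch of canonical equivalence (in Python, using its standard unicodedata module, chosen here purely for illustration and not tied to any particular CSS or XML implementation): two spellings of "café" that are supposed to display identically, yet compare unequal code point for code point until both are normalized to the same form.

```python
import unicodedata

# Two canonically-equivalent spellings of "café":
# one with precomposed U+00E9, one with "e" followed by combining acute U+0301.
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"

# They are meant to be indistinguishable when displayed...
print(precomposed, decomposed)      # both render as "café"

# ...but a code-point-for-code-point comparison sees two different strings.
print(precomposed == decomposed)    # False

# Only after normalizing both to the same form (NFC here) do they compare equal.
print(precomposed == unicodedata.normalize("NFC", decomposed))   # True
```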