- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Thu, 5 Feb 2009 17:22:43 +0200
- To: Jonathan Kew <jonathan@jfkew.plus.com>
- Cc: Andrew Cunningham <andrewc@vicnet.net.au>, public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
On Feb 5, 2009, at 01:32, Jonathan Kew wrote:

> On 4 Feb 2009, at 12:36, Henri Sivonen wrote:
>
>> Before anyone accuses me of “Western bias”, I'd like to point
>> out that there is precedent for preferring (in my opinion quite
>> reasonably) a faster kind of Unicode string equality relation over
>> one that could be claimed to make more sense to users of Western
>> languages: XML deliberately uses code point for code point string
>> comparison for start/end tag names instead of Unicode-wise correct
>> case-insensitive comparison for performance reasons (and to avoid
>> having to pick a collation locale). (I'm assuming here that we can
>> agree that bicameral scripts are big in the “West” and
>> case-insensitive equality makes some intuitive sense for the users
>> of bicameral scripts.)
>
> The choice of case-sensitivity vs case-insensitivity is not really
> comparable.

My point is that it's generally not helpful to bring out the Western
bias[1] thing in discussions of using Unicode in computer languages.
Performance has been preferred over full natural language complexity
for computer language identifier equality comparison before, too, and
in that instance it clearly could not have been an issue of Western
bias.

The thing is that comparing computer language identifiers code point
for code point is the common-sense thing to do. If you consider the
lack of case-insensitivity, some languages are not perfectly
convenienced. If you consider the lack of normalization, another
(overlapping) set of languages is not perfectly convenienced. If you
consider the sensitivity to diacritics, yet another set of languages
is not perfectly convenienced. No language is prohibited by code point
for code point comparison, though.

> It's true that most current programming and markup languages are
> case sensitive, I think, although this is not universal. (What does
> the HTML spec say? The UAs I'm accustomed to seem to treat it as
> case-insensitive.)

HTML5 says the parser replaces A through Z with a through z, and
thereafter the comparisons are done code point for code point. This is
done for backward compatibility. Doing it this way works, because the
conforming HTML vocabulary is entirely in the Basic Latin range. Also,
doing it this way avoids the problem of sneaking scripts past
ASCII-oriented blacklist-based gatekeepers by writing <SCRİPT>.

> Similarly with names in filesystems: both case-sensitive and
> case-insensitive systems are in widespread use. There is presumably a
> performance win for those that are case sensitive, but it doesn't
> appear to be a compelling criterion for system developers.

File names are exposed to all end users. However, class names and
selectors are only exposed to a more technically savvy group of
people who deal with code.

(Aside: Letting the non-NFC, not-quite-NFD form of HFS+ file names
leak into e.g. URIs is pretty annoying.)

> However, the key difference (in my mind, at least) is that for
> bicameral Western scripts, the user can clearly see the difference
> between upper and lower case.

Sure, but that's not part of the point I was making.

> For a better analogy, imagine having to work with a language like
> HTML or XML or CSS, but where the distinction between lines
> terminated by CR, LF, or CR/LF is significant to the interpretation
> of your markup. Now imagine sharing files and fragments with
> co-workers using different tools on different platforms, where the
> usual line-end conventions vary. Some of your editors may show the
> different kinds of line differently, perhaps, but many will display
> them all identically. That's not a scenario I would wish on anyone.
> I don't want "human-readable" source code that has to be hex-dumped
> before I can be sure of its meaning.

CRLF to LF normalization is, implementation-wise, so troublesome and
annoying in performance-critical code that if it is an analogy for
anything, the conclusion should be firmly against introducing any more
things that are analogous to CRLF.

I'm sure normalizing CRLF to LF seems like no big deal conceptually,
but implementing e.g. CRLF to LF normalization where the CR comes from
document.write() and the LF from the network stream gets more complex
than it first appears if you also want to maintain sane buffering that
doesn't copy data too many times.

[1] http://lists.w3.org/Archives/Public/www-style/2009Jan/0481.html

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
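
To illustrate the CRLF point above: a minimal sketch, in Python, of newline normalization when the CR and the LF of one pair arrive in different chunks (say, the CR from document.write() and the LF from the network stream). The class name `NewlineNormalizer` and its `feed()` method are hypothetical names made up for this illustration, and the sketch assumes that a lone CR is also turned into LF. It shows only the state that has to be carried across chunk boundaries; it copies strings freely and so does nothing about the buffering cost mentioned above.

```python
class NewlineNormalizer:
    """Hypothetical helper: turns CRLF and lone CR into LF, chunk by chunk."""

    def __init__(self) -> None:
        # True when the previous chunk ended in CR. That CR has already
        # been emitted as LF, so a leading LF in the next chunk must be dropped.
        self._pending_cr = False

    def feed(self, chunk: str) -> str:
        if not chunk:
            return ""  # empty chunk: keep the pending state untouched
        if self._pending_cr and chunk[0] == "\n":
            # Second half of a CRLF pair split across two chunks.
            chunk = chunk[1:]
        self._pending_cr = chunk.endswith("\r")
        # Within a chunk, pairs first, then any remaining lone CRs.
        return chunk.replace("\r\n", "\n").replace("\r", "\n")


def naive(chunk: str) -> str:
    """Per-chunk normalization with no memory of the previous chunk."""
    return chunk.replace("\r\n", "\n").replace("\r", "\n")


chunks = ["a\r", "\nb\r\nc", "\rd"]  # the first CRLF pair is split across chunks
normalizer = NewlineNormalizer()
stateful_result = "".join(normalizer.feed(c) for c in chunks)
naive_result = "".join(naive(c) for c in chunks)
```

With these chunks the stateless version yields an extra blank line after "a", because it converts the trailing CR and the following LF separately, while the stateful version yields a single LF there. Even this toy version needs one bit of cross-chunk state; avoiding extra copies on top of that is where the real complexity comes from.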
Received on Thursday, 5 February 2009 15:23:34 UTC