- From: Robert J Burns <rob@robburns.com>
- Date: Thu, 5 Feb 2009 13:59:50 -0600
- To: "Tab Atkins Jr." <jackalmage@gmail.com>
- Cc: public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
Hi Tab,

On Feb 5, 2009, at 1:42 PM, Tab Atkins Jr. wrote:

> On Thu, Feb 5, 2009 at 1:30 PM, Robert J Burns <rob@robburns.com> wrote:
>> On Feb 5, 2009, at 01:32, Jonathan Kew wrote:
>>> On 4 Feb 2009, at 12:36, Henri Sivonen wrote:
>>> It's true that most current programming and markup languages are
>>> case sensitive, I think, although this is not universal. (What does
>>> the HTML spec say? The UAs I'm accustomed to seem to treat it as
>>> case-insensitive.)
>> HTML5 says the parser replaces A through Z with a through z and
>> thereafter the comparisons are done code point for code point. This is
>> done for backward compatibility. Doing it this way works, because the
>> conforming HTML vocabulary is entirely in the Basic Latin range.
>> Also, doing it this way avoids the problem of sneaking scripts past
>> ASCII-oriented black list-based gatekeepers by writing <SCRİPT>.
>>
>> So HTML can take a performance hit like that for case sensitivity,
>> but for canonical normalization it would be an undue burden. How is
>> that not Western bias?
>
> Case sensitivity can be dealt with without a significant performance
> hit because case normalization happens at the parser level (as
> specified in the section you are quoting and responding to),
> converting things *once* as they arrive. The rest of the system can
> completely ignore case issues and rely on code-point comparisons
> instead.
>

Just as canonical normalization can occur at the parser level.

> It's been stated before that if we were allowed to do a similar eager
> normalization to a particular normal form (NFC was the suggested form,
> but the choice is irrelevant here), this really wouldn't pose any
> problem. The issue is that at least one person has stated that eager
> normalization should not be done.

Both Henri and Anne have argued against parser-stage normalization for canonically equivalent character sequences.
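
As a rough illustration of the parser-stage approach discussed above, here is a minimal Python sketch. It is not taken from HTML5 or from any UA: the function names `ascii_lower` and `normalize_at_parse` are hypothetical, and applying NFC in the same pass is the proposal being debated in this thread, not existing behaviour.

```python
import unicodedata


def ascii_lower(s: str) -> str:
    """Map A-Z to a-z only; leave every other code point untouched."""
    return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in s)


def normalize_at_parse(name: str) -> str:
    """Hypothetical parser step: ASCII-lowercase, then convert to NFC, once."""
    return unicodedata.normalize("NFC", ascii_lower(name))


# After the single parse-time pass, plain code point comparison suffices.
assert normalize_at_parse("SCRIPT") == "script"

# U+0130 (capital I with dot above) is outside A-Z, so it is left alone
# and the name does not collapse to "script".
assert normalize_at_parse("SCR\u0130PT") != "script"

# Precomposed and decomposed spellings compare equal after the pass.
assert normalize_at_parse("caf\u00e9") == normalize_at_parse("cafe\u0301")
```

The point of the sketch is only that both steps run once per string as it arrives, so later comparisons stay a simple code point match.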

> Having to handle case normalization on the fly in every string
> comparison *would* be a horrific performance hit, which is why it's
> done eagerly. Thus this does not show any Western bias.

The topic of discussion in this sub-thread is parser normalization (I guess what you're calling eager normalization). I am in favor of it and Henri is against it. So this is about the same type of performance hit that case normalization takes at the parser level. Regardless, my point about Western bias is that case sensitivity has been dealt with in all sorts of ways in nearly every spec. However, canonical normalization has not been dealt with in any satisfactory way, and Henri continues to argue that it should not be dealt with in a satisfactory way (or that how it has been dealt with should be deemed satisfactory by fiat).

At the very least we need to normalize non-singletons (characters whose canonical decomposition is not to only one character). Any combining characters need to be reordered into the order of their canonical combining class, and precomposed characters need to be normalized (which could still leave untouched the singleton decompositions that have other authoring problems).

Take care,
Rob
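
To make the singleton distinction and the reordering step above concrete, here is a small Python sketch using the standard `unicodedata` module. The helper `is_singleton` is a hypothetical name introduced for illustration, and the sample characters are my own choices, not examples from the thread.

```python
import unicodedata


def is_singleton(ch: str) -> bool:
    """True if ch canonically decomposes to exactly one other character."""
    d = unicodedata.decomposition(ch)
    # Compatibility decompositions carry a tag such as "<compat>"; skip them.
    return bool(d) and not d.startswith("<") and len(d.split()) == 1


# U+212B ANGSTROM SIGN decomposes to U+00C5 alone: a singleton.
assert is_singleton("\u212b")
# U+00E9 decomposes to "e" + U+0301: a non-singleton.
assert not is_singleton("\u00e9")

# Canonical reordering: dot below (combining class 220) must precede
# dot above (class 230), whatever order the author typed them in.
typed_one_way = "q\u0307\u0323"
typed_other_way = "q\u0323\u0307"
assert typed_one_way != typed_other_way
assert (unicodedata.normalize("NFD", typed_one_way)
        == unicodedata.normalize("NFD", typed_other_way))
```

Note that `unicodedata.normalize` also maps singletons, so a normalizer that deliberately leaves them untouched, as suggested above, would need extra logic that this sketch does not attempt.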
Received on Thursday, 5 February 2009 20:00:28 UTC