Re: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization from Robert J Burns on 2009-02-05 (www-style@w3.org from February 2009)

From: Robert J Burns <rob@robburns.com>
Date: Thu, 5 Feb 2009 13:59:50 -0600
To: "Tab Atkins Jr." <jackalmage@gmail.com>
Cc: public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
Message-Id: <B6553850-E940-48A3-8BAC-0FE18BFBC056@robburns.com>

Hi Tab,

On Feb 5, 2009, at 1:42 PM, Tab Atkins Jr. wrote:

> On Thu, Feb 5, 2009 at 1:30 PM, Robert J Burns <rob@robburns.com> wrote:
>> On Feb 5, 2009, at 01:32, Jonathan Kew wrote:
>>> On 4 Feb 2009, at 12:36, Henri Sivonen wrote:
>>> It's true that most current programming and markup languages are
>>> case sensitive, I think, although this is not universal. (What does
>>> the HTML spec say? The UAs I'm accustomed to seem to treat it as
>>> case-insensitive.)
>> HTML5 says the parser replaces A through Z with a through z and
>> thereafter the comparisons are done code point for code point. This is
>> done for backward compatibility. Doing it this way works, because the
>> conforming HTML vocabulary is entirely in the Basic Latin range.
>> Also, doing it this way avoids the problem of sneaking scripts past
>> ASCII-oriented black list-based gatekeepers by writing <SCRİPT>.
>>
>> So HTML can take a performance hit like that for case sensitivity, but
>> for canonical normalization it would be an undue burden. How is that
>> not Western bias?
>
> Case sensitivity can be dealt with without a significant performance
> hit because case normalization happens at the parser level (as
> specified in the section you are quoting and responding to),
> converting things *once* as they arrive.  The rest of the system can
> completely ignore case issues and rely on code-point comparisons
> instead.
>

Just as canonical normalization can occur at the parser level.
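
To make that concrete, here is a minimal sketch in Python of what
normalizing once at the parser stage could look like. The function name
normalize_token is hypothetical, and NFC is only an assumption (it is
the form suggested in this thread); nothing here is taken from any spec
or implementation.

```python
import unicodedata

def normalize_token(token: str) -> str:
    """Hypothetical parser hook: normalize an identifier once, at tokenization."""
    # ASCII-only lowercasing (A-Z to a-z), mirroring the HTML5 rule quoted above.
    lowered = "".join(
        chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in token
    )
    # Assumption: NFC is the chosen normal form; any single form would do.
    return unicodedata.normalize("NFC", lowered)

precomposed = "caf\u00e9"   # e-acute as the single code point U+00E9
decomposed = "cafe\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT

# Before normalization the two spellings differ code point for code point.
assert precomposed != decomposed
# After the one-time pass, a plain code-point comparison treats them as equal.
assert normalize_token(precomposed) == normalize_token(decomposed)
```

After that single pass the rest of the system can keep comparing code
point for code point, just as it does for case today.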

> It's been stated before that if we were allowed to do a similar eager
> normalization to a particular normal form (NFC was the suggested form,
> but the choice is irrelevant here), this really wouldn't pose any
> problem.  The issue is that at least one person has stated that eager
> normalization should not be done.

Both Henri and Anne have argued against parser-stage normalization of
canonically equivalent character sequences.

> Having to handle case normalization on the fly in every string
> comparison *would* be a horrific performance hit, which is why it's
> done eagerly.  Thus this does not show any Western bias.

The topic of this sub-thread is parser-stage normalization (I guess
that is what you're calling eager normalization). I am in favor of it
and Henri is against it. So this is the same type of performance hit
that case normalization takes at the parser level. Regardless, my
point about Western bias is that case sensitivity has been dealt with
in all sorts of ways in nearly every spec. Canonical normalization,
however, has not been dealt with in any satisfactory way, and Henri
continues to argue that it should not be (or that how it has been
dealt with should be deemed satisfactory by fiat). At the very least
we need to normalize non-singletons (characters whose canonical
decomposition is more than one character). Combining characters need
to be reordered according to their canonical combining class, and
precomposed characters need to be normalized (which could still leave
untouched the singleton decompositions, which have other authoring
problems).
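
For anyone unsure what those three cases look like in practice, the
following Python sketch illustrates them with the standard unicodedata
module. The helper is_singleton is a hypothetical name introduced only
for this illustration, and the sample characters are my own choices.

```python
import unicodedata

def is_singleton(ch: str) -> bool:
    """Hypothetical helper: True if ch canonically decomposes to exactly one code point."""
    d = unicodedata.decomposition(ch)
    # Compatibility mappings carry a "<tag>" prefix; canonical ones do not.
    return bool(d) and not d.startswith("<") and len(d.split()) == 1

# 1. Reordering by canonical combining class:
#    U+0307 DOT ABOVE has class 230, U+0323 DOT BELOW has class 220.
above_first = "q\u0307\u0323"
below_first = "q\u0323\u0307"
assert unicodedata.combining("\u0307") == 230
assert unicodedata.combining("\u0323") == 220
assert above_first != below_first
assert unicodedata.normalize("NFD", above_first) == below_first

# 2. Non-singleton: U+00E9 decomposes to two code points (e + combining acute).
assert unicodedata.decomposition("\u00e9") == "0065 0301"
assert not is_singleton("\u00e9")

# 3. Singleton: U+212B ANGSTROM SIGN decomposes to the single code point U+00C5.
assert unicodedata.decomposition("\u212b") == "00C5"
assert is_singleton("\u212b")
```

Full NFC also maps the singleton in case 3 to U+00C5; a more limited
pass of the kind described above would handle cases 1 and 2 and could
leave case 3 alone.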

Take care,
Rob

Received on Thursday, 5 February 2009 20:00:28 UTC