Re: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization

On Thu, Feb 5, 2009 at 1:30 PM, Robert J Burns <rob@robburns.com> wrote:
> On Feb 5, 2009, at 01:32, Jonathan Kew wrote:
>> On 4 Feb 2009, at 12:36, Henri Sivonen wrote:
>> It's true that most current programming and markup languages are
>> case sensitive, I think, although this is not universal. (What does
>> the HTML spec say? The UAs I'm accustomed to seem to treat it as
>> case-insensitive.)
> HTML5 says the parser replaces A through Z with a through z and
> thereafter the comparisons are done code point for code point. This is
> done for backward compatibility. Doing it this way works, because the
> conforming HTML vocabulary is entirely in the Basic Latin range.
> Also, doing it this way avoids the problem of sneaking scripts past
> ASCII-oriented blacklist-based gatekeepers by writing <SCRİPT>.
>
> So HTML can take a performance hit like that for case sensitivity, but for
> canonical normalization it would be an undue burden. How is that not Western
> bias?

Case sensitivity can be handled without a significant performance hit
because case normalization happens at the parser level (as specified
in the section you are quoting and responding to), converting things
*once* as they arrive.  The rest of the system can then ignore case
issues entirely and rely on plain code-point comparisons instead.
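
To make that concrete, here is a rough Python sketch of the "fold once
at the boundary, compare code points afterwards" shape (illustrative
only, not the spec's actual algorithm; ascii_lowercase is just a name
I made up):

    # Map only A-Z to a-z; every other code point is left untouched,
    # which is the ASCII-only folding the quoted text describes.
    ASCII_UPPER_TO_LOWER = {c: c + 0x20 for c in range(ord('A'), ord('Z') + 1)}

    def ascii_lowercase(name):
        return name.translate(ASCII_UPPER_TO_LOWER)

    # Done once when the name arrives from the tokenizer...
    tag = ascii_lowercase("SCRİPT")   # -> "scrİpt", *not* "script"

    # ...so everything downstream is a plain code-point comparison.
    print(tag == "script")                         # False: İ can't sneak through
    print(ascii_lowercase("SCRIPT") == "script")   # True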

It's been pointed out before that if we were allowed to do a similar
eager normalization to a particular normal form (NFC was the suggested
form, but the exact choice is irrelevant here), this really wouldn't
pose any problem.  The sticking point is that at least one person has
argued that eager normalization should not be done.
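
For comparison, the eager version of canonical normalization is just
as simple to express.  A minimal Python sketch (normalize_on_input is
an invented name, and NFC is used only because it was the suggested
form):

    import unicodedata

    def normalize_on_input(s):
        # Done once, when a string enters the system; after this,
        # plain == is enough.  (Illustrative helper, not from any spec.)
        return unicodedata.normalize('NFC', s)

    decomposed  = "e\u0301"    # 'e' + COMBINING ACUTE ACCENT
    precomposed = "\u00e9"     # 'é' as a single precomposed code point

    print(decomposed == precomposed)                   # False
    print(normalize_on_input(decomposed) ==
          normalize_on_input(precomposed))             # True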

Having to handle case normalization on the fly in every string
comparison *would* be a horrific performance hit, which is exactly why
it's done eagerly.  So this does not show any Western bias.
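
The difference between the two shapes looks roughly like this (a toy
sketch; normalize here is just a stand-in for whatever folding or
normalization is under discussion, not HTML5's actual ASCII-only rule):

    def normalize(s):
        return s.casefold()   # stand-in only

    # On the fly: every single comparison re-normalizes both operands.
    def lazy_equal(a, b):
        return normalize(a) == normalize(b)

    # Eager: each string is normalized once as it arrives...
    selectors = [normalize(s) for s in ("div", "SPAN", "p")]
    tag = normalize("DIV")          # once, at parse time
    # ...and the many later comparisons are plain code-point checks.
    print(tag in selectors)         # True, with no re-normalization here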

~TJ

Received on Thursday, 5 February 2009 19:42:38 UTC