Re: Unicode Normalization from Henri Sivonen on 2009-02-04 (www-style@w3.org from February 2009)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Wed, 4 Feb 2009 13:23:04 +0200
To: Robert J Burns <rob@robburns.com>
Cc: public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
Message-Id: <8A714527-0D3B-465C-9F70-DF5489489C7E@iki.fi>

On Feb 3, 2009, at 20:51, Robert J Burns wrote:

>> Making existing browsers normalize before string equality checks is
>> also too late.
>
> But doing so in the parser as I and others have suggested should work
> fine.

For clarity, I meant that the browsers already deployed cannot be
changed. You can't retroactively make them perform a normalization step
anywhere in their processing before an identifier comparison happens.

I don't think we have performance data showing that normalization in
the parser would be "fine" in terms of performance, and without such
data it is quite reasonable to assume unfavorable performance
characteristics.
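
As a rough illustration of the kind of measurement that is missing, here
is a minimal Python sketch that times a plain string comparison against
a normalize-then-compare. The function name, sample strings, and
iteration count are all made up for illustration, and a Python
micro-benchmark says nothing definitive about a browser's parser; it
only shows the shape of the experiment.

```python
import timeit
import unicodedata


def time_comparisons(a, b, number=100):
    """Return (plain_seconds, normalized_seconds) for `number` comparisons.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Plain comparison: code point for code point, no preprocessing.
    plain = timeit.timeit(lambda: a == b, number=number)
    # Normalize both operands to NFC before every comparison.
    normalized = timeit.timeit(
        lambda: unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b),
        number=number,
    )
    return plain, normalized


# Two canonically equivalent inputs: precomposed vs. decomposed "é".
sample_a = "caf\u00e9 " * 1000
sample_b = "cafe\u0301 " * 1000

plain_s, normalized_s = time_comparisons(sample_a, sample_b)
print(f"plain: {plain_s:.6f}s  normalize+compare: {normalized_s:.6f}s")
```

Real data would have to come from instrumenting an actual parser on
representative content.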

> Unicode depends on two canonically equivalent but byte-wise  
> different strings matching.

No, it doesn't. Unicode itself doesn't depend on one kind of equality  
check. Unicode enables a wide variety of equality relations between  
strings. Different equality relations are appropriate for different  
purposes.
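
To make that concrete, here is a small Python sketch of four different
equality relations over the same strings, using the standard
unicodedata module. The function names are mine, chosen for
illustration; which relation is appropriate depends on the purpose.

```python
import unicodedata


def codepoint_equal(a, b):
    # Identity of the code point sequences; no normalization at all.
    return a == b


def canonical_equal(a, b):
    # Canonical equivalence, checked by comparing NFC forms.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def compatibility_equal(a, b):
    # Compatibility equivalence, checked by comparing NFKC forms.
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


def caseless_equal(a, b):
    # Simple caseless match via case folding (no normalization applied).
    return a.casefold() == b.casefold()


composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT
ligature = "\ufb01"      # U+FB01 LATIN SMALL LIGATURE FI

print(codepoint_equal(composed, decomposed))   # False
print(canonical_equal(composed, decomposed))   # True
print(canonical_equal(ligature, "fi"))         # False
print(compatibility_equal(ligature, "fi"))     # True
print(caseless_equal("Stra\u00dfe", "STRASSE"))  # True
```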

> We cannot hope to eliminate such strings from the internet, so this  
> is something that implementations have to deal with. I think most  
> everyone here is on the same page on that, but I want you to  
> understand too.

One way of dealing with it is to specify that implementations do their
string identity comparisons code point for code point, so that
comparisons between strings that differ in normalization evaluate to
false uniformly across implementations.
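
A minimal Python sketch of such a comparison follows. The function name
and the sample identifiers are hypothetical; the point is only that the
result depends on nothing but the two code point sequences, so every
implementation gets the same answer.

```python
def identifiers_match(selector_ident: str, attr_value: str) -> bool:
    """Compare two identifiers code point for code point.

    No normalization is applied, so canonically equivalent strings with
    different code point sequences do not match.
    """
    if len(selector_ident) != len(attr_value):
        return False
    return all(ord(x) == ord(y) for x, y in zip(selector_ident, attr_value))


nfc_name = "caf\u00e9"    # precomposed é (4 code points)
nfd_name = "cafe\u0301"   # e + combining acute accent (5 code points)

print(identifiers_match(nfc_name, nfc_name))  # True
print(identifiers_match(nfc_name, nfd_name))  # False
```

In Python the built-in == on str already behaves this way; the explicit
loop is spelled out only to show what is being compared.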

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Wednesday, 4 February 2009 11:23:49 UTC