Re: Unicode Normalization

From: Henri Sivonen <hsivonen@iki.fi>
Date: Wed, 4 Feb 2009 13:23:04 +0200
Cc: public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
Message-Id: <8A714527-0D3B-465C-9F70-DF5489489C7E@iki.fi>
To: Robert J Burns <rob@robburns.com>

On Feb 3, 2009, at 20:51, Robert J Burns wrote:

>> Making existing browsers normalize before string equality checks is
>> also too late.
> But doing so in the parser as I and others have suggested should work
> fine.

To clarify, I meant that the browsers already deployed are already
deployed. You can't retrofit a normalization step anywhere in their
processing before an identifier comparison happens.

I don't think we have performance data showing that normalization in
the parser would be "fine" in terms of performance, and without data
it is quite reasonable to assume unfavorable performance.

> Unicode depends on two canonically equivalent but byte-wise  
> different strings matching.

No, it doesn't. Unicode itself doesn't depend on one kind of equality
check. Unicode enables a wide variety of equality relations between
strings. Different equality relations are appropriate for different
purposes.

> We cannot hope to eliminate such strings from the internet, so this  
> is something that implementations have to deal with. I think most  
> everyone here is on the same page on that, but I want you to  
> understand too.

One way of dealing with it is to specify that implementations do their
string identity comparisons code point for code point, so that
comparisons between strings that differ only in normalization evaluate
to false uniformly across implementations.
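
To make the two equality relations concrete, here is a minimal Python
sketch. The function names code_point_equal and canonically_equivalent
are made up for this illustration, and NFC is just one possible choice
of normalization form; unicodedata.normalize is from the Python
standard library. It is not a proposal for how any browser should
implement this.

```python
import unicodedata


def code_point_equal(a: str, b: str) -> bool:
    """Compare code point for code point, with no normalization step."""
    return a == b


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare after normalizing both strings to NFC (an assumed choice)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Two canonically equivalent spellings of the same identifier.
precomposed = "caf\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

print(code_point_equal(precomposed, decomposed))        # False
print(canonically_equivalent(precomposed, decomposed))  # True
```

Under the code point rule, the two identifiers above never match in any
implementation. Under the normalizing rule they match, but only in
implementations that actually perform the extra step.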

Henri Sivonen
Received on Wednesday, 4 February 2009 11:23:50 UTC
