Re: Unicode Normalization from Robert J Burns on 2009-02-03 (www-style@w3.org from February 2009)

From: Robert J Burns <rob@robburns.com>
Date: Tue, 3 Feb 2009 12:51:06 -0600
To: public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
Message-Id: <AEF14883-C137-4ED2-85D4-A7893DC8776F@robburns.com>

Hi Henri,

>
>> On Feb 3, 2009, at 15:10, Richard Ishida wrote:
>>
>>  I didn't have to look hard for a problem. If you install the Tlicho
>>  (Tłįchǫ or Dogrib) keyboard on Windows (see a picture at
>>  http://rishida.net/scripts/pickers/tlicho/) and type the name of the
>>  language itself, it comes out in NFD.
>>  It is also possible to incorrectly order multiple diacritics (ie.
>>  not even NFD). You could say that the keyboard *ought* to churn out
>>  NFC, but it's too late. People using those keyboards will be
>>  producing content that may look different to that created by people
>>  using other input methods.
>
>
> Making existing browsers normalize before string equality checks is
> also too late.

But doing so in the parser, as I and others have suggested, should work
fine.

>
> When considering what software to change in a future version, to me it
> seems more sensible to change the software that is less performance-
> critical, is closer to the problem and doesn't depend on wide
> consistent deployment to address the problem for a given Web author.
> That is, it seems more sensible to make the input methods produce
> consistently ordered output. This should be within the realm of
> possibility; after all, producing pre-composed characters with
> European diacritic dead keys is a solved problem.

Another common misconception is that normalization is only about
combining characters. The normalization algorithm also replaces
singletons, that is, single code points whose canonical decomposition
is a different single code point. Therefore one cannot simply require
input methods to normalize on the fly. And even if one could, which
normalization form would that be? As I said before, NFC versus NFD is a
rather bikeshed-like disagreement.
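
To make the singleton point concrete, here is a rough sketch using
Python's standard unicodedata module. The three characters and the
helper name normalized_forms are just my picks for illustration; none
of this comes from any proposal on the list.

```python
import unicodedata

# Singletons: code points whose canonical decomposition is a single,
# different code point. No combining marks need to be typed to get them.
# (Illustrative selection only.)
SINGLETONS = {
    "\u212B": "ANGSTROM SIGN",
    "\u2126": "OHM SIGN",
    "\u212A": "KELVIN SIGN",
}


def normalized_forms(ch):
    """Return the (NFC, NFD) forms of ch as lists of 'U+XXXX' strings."""
    def codepoints(s):
        return ["U+%04X" % ord(c) for c in s]
    return (codepoints(unicodedata.normalize("NFC", ch)),
            codepoints(unicodedata.normalize("NFD", ch)))


for ch, name in SINGLETONS.items():
    nfc, nfd = normalized_forms(ch)
    print(name, "NFC:", nfc, "NFD:", nfd)
```

ANGSTROM SIGN, for instance, becomes U+00C5 under NFC and U+0041 U+030A
under NFD, so neither form leaves the original code point in place even
though the user typed no combining character at all.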

Unicode expects two canonically equivalent strings to match even when
they differ byte-wise. We cannot hope to eliminate such strings from the
Internet, so this is something that implementations have to deal with.
I think almost everyone here is on the same page on that, but I want you
to understand it too.
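
As a minimal sketch of what "matching" means here, again assuming
Python's unicodedata module: the helper canonically_equal is a
hypothetical name of mine, and comparing NFD forms is only one way to
test canonical equivalence, not a claim about how any browser or parser
does it.

```python
import unicodedata


def canonically_equal(a, b):
    """Compare two strings under canonical equivalence.

    Assumption: both sides are brought to NFD before comparing; NFC on
    both sides would serve equally well for this purpose.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The language name from Richard's example, written with precomposed
# letters and with separate combining ogoneks (U+0328).
precomposed = "T\u0142\u012Fch\u01EB"
decomposed = "T\u0142i\u0328cho\u0328"

# Two diacritics typed in different orders: dot below (U+0323) and
# dot above (U+0307) on the same base letter.
below_then_above = "a\u0323\u0307"
above_then_below = "a\u0307\u0323"

print(precomposed == decomposed)                               # code point comparison
print(canonically_equal(precomposed, decomposed))              # canonical comparison
print(canonically_equal(below_then_above, above_then_below))   # reordered marks
```

The plain == comparison fails for the first pair while the normalized
comparison succeeds for both, which is the behaviour an implementation
has to provide somewhere, whether in the parser or at comparison time.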

Take care,
Rob

Received on Tuesday, 3 February 2009 18:51:45 UTC