Re: Unicode Normalization

Hi Aryeh,

On Feb 3, 2009, at 8:53 AM, Aryeh Gregor wrote:

> On Tue, Feb 3, 2009 at 5:04 AM, Robert J Burns <rob@robburns.com> wrote:
>> The problem with this is that there would have to be a prior agreement so
>> that a Unicode processing application could count on everything received
>> already as NFC and that's simply not the case. If a Unicode UA is incapable
>> of processing NFD (which also implies it cannot process NFC characters that
>> are combining characters) then it would be up to that application to convert
>> internally to something it could handle (just what conversion it would do, I
>> don't know).
>
> Who's talking about a Unicode UA being unable to process NFD?

Henri raised this issue right before the fragment you quote from me.  
There Henri says:

>>> The central reason for using NFC for interchange (i.e. what goes over
>>> HTTP) is that legacy software (including the text rendering code in
>>> legacy browsers) works better with NFC.

To me that implies Henri thinks we need to promote NFC to help legacy
software that cannot process combining characters. But that overlooks
the fact that NFC text can still contain combining characters: any
sequence without a precomposed form stays decomposed.
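
To make that concrete, here is a minimal sketch using Python's standard
unicodedata module. The helper name has_combining is my own, and the two
sample strings are just illustrations I picked:

```python
import unicodedata

def has_combining(text):
    """Return True if any code point in text has a nonzero combining class."""
    return any(unicodedata.combining(ch) != 0 for ch in text)

# "e" + COMBINING ACUTE ACCENT has a precomposed form (U+00E9),
# so NFC folds the pair into a single code point.
composed = unicodedata.normalize("NFC", "e\u0301")

# "q" + COMBINING DOT ABOVE has no precomposed form,
# so the combining mark survives NFC normalization.
still_decomposed = unicodedata.normalize("NFC", "q\u0307")

print(has_combining(composed))
print(has_combining(still_decomposed))
```

So software that cannot cope with combining marks is not rescued by NFC
alone; it only sees fewer of them.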

> The question on the table seems to be whether UAs should normalize all
> input to NFC when they parse it.  This would permit them to process
> NFC, NFD, or any other normalized or non-normalized input.  They would
> then probably end up sending responses like form data in NFC even if
> they received the original input in NFD.  If the server prefers to use
> NFD internally, it's up to the server to then convert back to NFD on
> its end.

Yes, that's precisely what my messages were arguing for[1]. This needs
to be done at the parser level, and XML's dependence on Unicode
implies it should probably already be happening in XML parsers now
(i.e., it is an implementation error to not canonically normalize one
way or the other for string comparisons).
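
As a rough sketch of what "at the parser level" could mean, assuming
Python's unicodedata module and a made-up parse_names function standing in
for a real tokenizer (no actual XML or CSS parser works exactly like this):

```python
import unicodedata

def parse_names(raw_text):
    """Hypothetical parser step: normalize once on input, then tokenize.

    Everything downstream of this function only ever sees NFC, so plain
    string equality gives normalization-insensitive comparisons.
    """
    normalized = unicodedata.normalize("NFC", raw_text)
    return normalized.split()

# A class name written decomposed in the markup ...
markup_names = parse_names("cafe\u0301 menu")
# ... and the same name written precomposed in a stylesheet.
stylesheet_names = parse_names("caf\u00e9")

matches = stylesheet_names[0] in markup_names
```

Normalizing to NFD at the same point would serve equally well; what matters
is that both sides of every comparison go through the same form.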

> We aren't really talking about transmission formats here, AFAICT, or
> at least that wasn't the original question.  The question is whether
> it's acceptable for browsers to internally normalize all input somehow
> (to NFC, NFD, whatever) as soon as it's received, so that they can
> ensure that they make correct comparisons according to the Unicode
> standard.  This is relevant to CSS because it seems to be the best way
> of ensuring that CSS comparisons aren't normalization-sensitive.

I agree.

> I'm not clear on what exactly the objections are to that, other than
> possibly violating the XML standard (it would be surprising to me if
> that did violate XML).

Quite the opposite: I think it violates the XML standard not to treat
canonically equivalent strings as equal when comparing them.

> The only practical objection I can see is that
> some sites might be broken and not do normalization themselves.  You
> could have something like user registers with a name in NFD (or
> entirely unnormalized) in non-normalizing browser -> site saves to
> database -> same user tries to log in later in a normalizing browser
> -> login fails because site thinks the names are different.  I don't
> know whether this would be a problem in practice.

Fixing this in the implementations (XML parsers, CSS parsers, and
otherwise) begins to fix that problem.
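
For the login scenario, a site that normalizes on both save and lookup no
longer cares which kind of browser sent the name. A minimal sketch,
assuming Python's unicodedata module; the UserStore class and the
placeholder hash value are hypothetical, not any real site's code:

```python
import unicodedata

class UserStore:
    """Hypothetical in-memory user table keyed by NFC-normalized names."""

    def __init__(self):
        self._users = {}

    @staticmethod
    def _key(name):
        # Normalize on every access so stored and submitted names agree.
        return unicodedata.normalize("NFC", name)

    def register(self, name, password_hash):
        self._users[self._key(name)] = password_hash

    def lookup(self, name):
        return self._users.get(self._key(name))

store = UserStore()
# Registration from a non-normalizing browser: "e" + COMBINING DIAERESIS.
store.register("Zoe\u0308", "hash-placeholder")
# Later login from a normalizing browser: precomposed U+00EB.
found = store.lookup("Zo\u00eb")
```

Without the _key step the two spellings would be different dictionary
keys, which is exactly the failed login described above.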

Anne wrote:
> (As far as I can tell XML is Unicode
> Normalization agnostic. It merely recommends authors to do a certain
> thing. We can certainly recommend authors to do a certain thing in HTML
> and CSS too...)

XML is not Unicode agnostic. Unicode is a normative reference in terms
of text handling, so an XML UA is by definition also a Unicode UA.
That means an implementation needs to have some reason for comparing
two byte-wise unequal though canonically equivalent strings and
determining that they do not match. I haven't heard anyone here say why
an XML processor needs to support (and therefore promote) such errors.
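
To show the difference between byte-wise and canonical comparison, here is
a small sketch with Python's unicodedata module. The function name
canonically_equivalent is my own choice, and the three spellings of
A-with-ring are only sample data:

```python
import unicodedata

precomposed = "\u00c5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030a"   # "A" + COMBINING RING ABOVE
angstrom = "\u212b"      # ANGSTROM SIGN, a canonical singleton

def canonically_equivalent(a, b):
    """Compare two strings after bringing both to the same normal form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# The UTF-8 encodings differ, so a byte-wise comparison says "no match".
bytes_equal = precomposed.encode("utf-8") == decomposed.encode("utf-8")

# After normalization all three spellings compare equal.
all_equivalent = (canonically_equivalent(precomposed, decomposed)
                  and canonically_equivalent(precomposed, angstrom))
```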

Take care,
Rob

[1]: <http://lists.w3.org/Archives/Public/public-i18n-core/2009JanMar/0120.html>.
This only went to the I18N list and not the CSS list.

Received on Tuesday, 3 February 2009 18:39:18 UTC