Re: Unicode Normalization from Robert J Burns on 2009-02-04 (www-style@w3.org from February 2009)

From: Robert J Burns <rob@robburns.com>
Date: Wed, 4 Feb 2009 15:07:59 -0600
To: "Anne van Kesteren" <annevk@opera.com>
Cc: "Aryeh Gregor" <Simetrical+w3c@gmail.com>, public-i18n-core@w3.org, jonathan@jfkew.plus.com, "W3C Style List" <www-style@w3.org>
Message-Id: <750C89EC-87E6-427E-A2DE-4B21F2C757B5@robburns.com>

Hi Anne,

On Feb 4, 2009, at 11:01 AM, Anne van Kesteren wrote:

> On Tue, 03 Feb 2009 19:38:32 +0100, Robert J Burns  
> <rob@robburns.com> wrote:
>> Anne wrote:
>>> (As far as I can tell XML is Unicode
>>> Normalization agnostic. It merely recommends authors to do a certain
>>> thing. We can certainly recommend authors to do a certain thing in  
>>> HTML
>>> and CSS too...)
>>
>> XML is not Unicode agnostic.
>
> I did not say that.

Some more explanation of what you meant would be helpful. Are you just
saying that XML is Unicode normalization agnostic in the sense that it
doesn't matter which normalization form (NFC or NFD) is used when
comparing strings? If that's what you meant, then we agree. If you meant
that XML is Unicode normalization agnostic in that it doesn't care (or
know?) whether two canonically equivalent strings are a match, then
I disagree. Unicode is fairly clear that two
canonically equivalent strings are equivalent even if their code
points differ.
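
To make the two senses concrete, here is a minimal sketch using
Python's standard unicodedata module. The helper name
canonically_equal and the sample strings are my own illustration,
not anything defined by XML or Unicode. It compares the same text
first by raw code points and then under canonical equivalence:

```python
import unicodedata

# Two spellings of "é": precomposed, and base letter plus combining acute.
precomposed = "\u00e9"   # U+00E9
decomposed = "e\u0301"   # U+0065 U+0301

def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare under canonical equivalence.

    Normalizing both sides to the same form (NFD here; NFC would work
    equally well) makes the comparison independent of how each string
    was originally encoded.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Raw code point comparison treats the two spellings as different.
codepoint_match = precomposed == decomposed

# Comparison after normalization treats them as the same string.
canonical_match = canonically_equal(precomposed, decomposed)

# Which form is chosen does not matter, as long as both sides use it.
nfc_same = unicodedata.normalize("NFC", decomposed) == precomposed
```

The first sense of "agnostic" is the last line: NFC or NFD, the
answer is the same. The second sense is the difference between
codepoint_match and canonical_match.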

>> Unicode is a normative reference in terms of text handling. So an  
>> XML UA is by definition also a Unicode UA. That means that an  
>> implementation needs to have some reason for comparing two byte- 
>> wise unequal  though canonically equivalent strings and determining  
>> they do not match. I haven't heard anyone here say why an XML  
>> processor needs to support (and therefore promote) such errors.
>
> The XML grammar is expressed in Unicode codepoints so comparison  
> also happens on that level.

However, Unicode has a SHOULD requirement that two canonically
equivalent strings match even when their code points differ. Unicode's
Chapter 3 (conformance clause C6) says:

> A process shall not assume that the interpretations of two canonical- 
> equivalent character sequences are distinct.

If the strings are not interpreted as distinct then they are  
interpreted as the same: as equivalent strings even though they have  
different code points (I admit Unicode's wording here is poor, but  
this is how I understand this recommendation).

The next question is why Unicode leaves a SHOULD here and not a
MUST. My reading is that it opens up the possibility that a UA
might have a good reason to treat the strings as distinct and
therefore to ignore the SHOULD recommendation. I have
a hard time thinking of what types of UA might want to ignore this.
When one UA ignores it, it might create a need for other UAs to ignore
it. But even more, I can think of no reason that XML (or CSS or HTML),
as a consumer of Unicode, would want to ignore this recommendation.
You have said that a use case might be that authors want to use
canonically equivalent characters as semantically distinct. Do you
have any real-world examples of this? And even if we could find some
real-world examples, why would XML (or CSS or text/HTML or
JavaScript) really want to facilitate such poor practice?

Take for example the following three strings (NFD, NFC and non- 
normalized):

〈this string〉
〈this string〉
〈this string〉

There are no combining marks here. The angle brackets each represent  
canonically equivalent characters where UAs should not interpret them  
as distinct (but may do so if there's some reason to do so). Ideally  
we wouldn't have this problem in Unicode, but we do. In many ways it  
is more nefarious than case-sensitivity since (unless there's a buggy  
font) the glyphs should be identical so there's no way for an author  
to track down the problem (this is in contrast to the non-canonical  
compatibility characters where at least glyph difference should point  
authors to the problem).
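
As a rough illustration of the bracket case, the sketch below assumes
the characters involved are U+2329/U+232A (LEFT-/RIGHT-POINTING ANGLE
BRACKET) and U+3008/U+3009 (LEFT/RIGHT ANGLE BRACKET); the message
above does not spell out the code points, so that pairing is my
assumption. It uses only Python's standard unicodedata module:

```python
import unicodedata

# Three visually identical strings built from different bracket code points.
cjk = "\u3008this string\u3009"        # U+3008 ... U+3009
technical = "\u2329this string\u232a"  # U+2329 ... U+232A
mixed = "\u2329this string\u3009"      # one of each

samples = (cjk, technical, mixed)

# As raw code point sequences, all three are different strings.
raw_distinct = len(set(samples))

# After normalization they collapse to a single string, because
# U+2329 and U+232A have singleton canonical decompositions.
nfc_forms = {unicodedata.normalize("NFC", s) for s in samples}
nfd_forms = {unicodedata.normalize("NFD", s) for s in samples}

# The canonical mapping recorded in the Unicode Character Database.
left_mapping = unicodedata.decomposition("\u2329")
```

A processor that compares the raw strings sees three different names;
one that normalizes first sees one.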

To me it is an error to use canonically equivalent characters as
semantically distinct characters. I agree with Henri that input systems
should address this. However, input systems have absolutely zero
specifications that I am aware of that provide guidance on this. The
Unicode Standard doesn't address input systems. I think it should, and
that such norms should address the canonical and compatibility
equivalent characters, but until that happens there's nothing saying
what input systems should do (or MUST do). So to say this is a bug in
the input systems that should be addressed there is not really
tenable. Even if new input systems addressed the situation, there
would still be Unicode 1-6 UAs out there producing non-normalized
content. Moreover, even when an input system fixes the situation
internally, there will still be one UA that outputs NFC and another UA
that outputs NFD, so the only way to meet the need of internally
matching equivalent Unicode strings is to perform parse-time (or
on-the-fly) normalization. Again, it would be better if Unicode had never
admitted canonical and compatibility equivalent characters (the
reasons given for doing so are dubious and they now occupy 5,000 code
points out of 65,000 in what has now become a tightly packed and
therefore precious Basic Multilingual Plane).
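
A minimal sketch of what parse-time normalization could look like,
assuming Python's standard unicodedata module. The NormalizedNames
class and the sample names are hypothetical and only stand in for
whatever table a parser uses to match element names, selectors, or
identifiers:

```python
import unicodedata

class NormalizedNames:
    """Hypothetical name table that normalizes every key to NFC.

    Normalizing both when a name is defined and when it is looked up
    means the match no longer depends on which form the producing UA
    happened to emit.
    """

    def __init__(self):
        self._table = {}

    @staticmethod
    def _key(name: str) -> str:
        return unicodedata.normalize("NFC", name)

    def define(self, name: str, value: str) -> None:
        self._table[self._key(name)] = value

    def lookup(self, name: str):
        return self._table.get(self._key(name))

names = NormalizedNames()
names.define("caf\u00e9", "rule-1")    # written by a UA that outputs NFC
found = names.lookup("cafe\u0301")     # written by a UA that outputs NFD

# For contrast: a plain dict keyed on raw code points misses the match.
plain = {"caf\u00e9": "rule-1"}
missed = plain.get("cafe\u0301")
```

Normalizing once at the boundary keeps the cost out of every later
comparison; normalizing on the fly at each comparison would be the
other option mentioned above.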

Finally, Henri compared this to case sensitivity. Indeed, I think we
should think of this as something like case sensitivity, but in a
situation where XML (and even Unicode) failed to be clear about how to
interpret case-differing strings. Imagine the XML specification had failed
to address case sensitivity, and then some UAs went
with case-insensitive string comparisons and others went with
case-sensitive string comparisons. This would be something we would need to
address in clarifying XML. I think it is the same with canonical
equivalence. It needs to be addressed with the same precision with which XML
dealt with case sensitivity. However, that has not been done. Simply
saying that the input systems will take care of it seems absurd to me
(especially when such guidance doesn't yet come from Unicode or W3C).

Take care,
Rob
Received on Wednesday, 4 February 2009 21:08:40 UTC