Re: Unicode Normalization

From: Robert J Burns <rob@robburns.com>
Date: Thu, 5 Feb 2009 04:11:03 -0600
Cc: "Aryeh Gregor" <Simetrical+w3c@gmail.com>, public-i18n-core@w3.org, jonathan@jfkew.plus.com, "W3C Style List" <www-style@w3.org>
Message-Id: <8BE0C3D3-7A39-42DF-912B-DB3A3FEB72F9@robburns.com>
To: "Anne van Kesteren" <annevk@opera.com>

Hi Anne,

On Feb 5, 2009, at 3:14 AM, Anne van Kesteren wrote:

> On Wed, 04 Feb 2009 22:07:59 +0100, Robert J Burns  
> <rob@robburns.com> wrote:
>> [...] If you meant that XML is Unicode normalization agnostic in  
>> that it doesn't care (or know?) whether two canonically equivalent  
>> strings are a match then there I disagree with that. Unicode is  
>> fairly clear that two canonically equivalent strings are equivalent  
>> even if their code points differ.
>
> That's what I mean. There are many different comparison algorithms.  
> Unicode definitely does not make it non-conforming to compare two  
> strings codepoint for codepoint. I'm not sure why you think it does.

I've elaborated extensively on why I think Unicode doesn't expect  
Unicode UAs to treat canonically equivalent strings as distinct. I'd
be curious what your interpretation is of these Unicode criteria, or
even of canonical equivalence. Why do you think the Unicode Standard
even speaks of canonical equivalence and canonical decompositions? As
I've said, my interpretation is that it is because the process of assigning
code points to the UCS has for one reason or another brought in  
redundant characters/graphemes. To cope with this, Unicode has created  
the concept of canonical equivalence where the characters with  
canonical decompositions should be considered equivalent (as in string  
comparisons) to their decomposed character sequence.
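
To make the concept concrete, here is a minimal sketch using Python's
standard unicodedata module. The variable names are only illustrative
and are not taken from any specification. It shows a precomposed
character and its canonical decomposition differing code point for
code point, yet matching once both are normalized to the same form:

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The canonical decomposition recorded in the Unicode Character Database.
mapping = unicodedata.decomposition(composed)

# Compared code point for code point, the two strings differ.
codepoint_equal = composed == decomposed

# Normalized to NFD (canonical decomposition), they are identical.
canonical_equal = (
    unicodedata.normalize("NFD", composed)
    == unicodedata.normalize("NFD", decomposed)
)

print(mapping, codepoint_equal, canonical_equal)
```

Normalizing both sides to NFC instead of NFD would give the same
answer for this pair; the choice of form only matters for what the
stored string looks like afterwards.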

If you think these criteria and concepts are about something entirely  
different, I wish you would share your ideas with the list rather than
simply expressing contrarian viewpoints.

>>> The XML grammar is expressed in Unicode codepoints so comparison  
>>> also happens on that level.
>>
>> However Unicode has a SHOULD requirement that two canonically  
>> equivalent but codepoint differing strings match. Unicode's Chapter  
>> 3 (C6 norm) says:
>>
>>> A process shall not assume that the interpretations of two  
>>> canonical-equivalent character sequences are distinct.
>
> I suggest to read all of C6. Martin Dürst already pointed out long  
> ago that this does not always apply:
>
>  http://lists.w3.org/Archives/Public/www-style/2009Feb/0020.html

I have read all of C6 and I've provided my interpretation of that  
criterion in depth (more depth than some probably cared to hear). If  
your interpretation of these concepts is different, it would be  
helpful for you to lay out those differences rather than just
suggesting I go read the Unicode Standard. I have said that  
comparisons code point to code point are allowed in Unicode, but that  
I think the Unicode Standard implies that the process should have a  
reason for doing so rather than deviating from the ideal situation  
where an "implementation would always interpret two canonical  
equivalent character sequences identically".
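
As a rough illustration of the two behaviours I am contrasting, the
sketch below puts them behind one hypothetical helper, strings_match
(my own name, not an API from Unicode or any UA). It assumes NFC as
the common form and uses Python's unicodedata module. The sample
strings carry the same two combining marks in different orders, which
makes them canonically equivalent but not code point identical:

```python
import unicodedata

def strings_match(a: str, b: str, canonical: bool = True) -> bool:
    """Compare two strings, optionally honoring canonical equivalence.

    Hypothetical helper for illustration only.
    """
    if not canonical:
        # Strict comparison: every code point must be the same.
        return a == b
    # Canonical comparison: normalize both sides to the same form first.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# "q" with COMBINING DOT ABOVE (class 230) and COMBINING DOT BELOW
# (class 220), typed in two different orders.
x = "q\u0307\u0323"
y = "q\u0323\u0307"

strict_result = strings_match(x, y, canonical=False)
canonical_result = strings_match(x, y)

print(strict_result, canonical_result)
```

A process that has a reason to compare code point for code point can
pass canonical=False; the default models the ideal case described in
the passage quoted above.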

If your interpretation is different from mine, then why don't you
explain it here on this list?

Take care,
Rob
Received on Thursday, 5 February 2009 10:11:44 GMT