
Re: Unicode Normalization

From: Jonathan Kew <jonathan@jfkew.plus.com>
Date: Thu, 5 Feb 2009 00:12:57 +0000
Cc: public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
Message-Id: <237BEB96-8113-4266-94DA-F2E05A709BA7@jfkew.plus.com>
To: Robert J Burns <rob@robburns.com>

On 4 Feb 2009, at 23:46, Robert J Burns wrote:

>
> A slight correction that isn't really germane to the present
> conversation, but that I felt I should make nonetheless, since it
> helps improve understanding of the various issues.
>
> On Feb 4, 2009, at 3:07 PM, I wrote:
>> Take for example the following three strings (NFD, NFC and
>> non-normalized):
>>
>> 〈this string〉
>> 〈this string〉
>> 〈this string〉
>
>
> Actually, the first form is non-normalized too. The second string
> conforms to both NFC and NFD. The third string is non-normalized
> as well.
>
> Just to provide further clarification: each line is a separate string
> in which the interior "this string" is an identical code point
> sequence, irrelevant for normalization purposes. The angle brackets
> themselves, however, have been encoded twice in Unicode as different
> code points, even though Unicode offers no semantically distinct
> interpretation for the two.

And as an illustration of just how unwise it would be for someone to  
use these distinct but canonically-equivalent characters to represent  
a significant distinction in markup: when I copy and paste those lines  
from my email client into a text editor, and examine the resulting  
codepoints, I find that all three lines are identical. Some process --  
I'm not sure whether it is my mail client's Copy command, my text  
editor's Paste, or the operating system pasteboard in between -- has  
helpfully applied Unicode normalization to the data. So if that was a  
semantically important distinction in the hypothetical markup language  
you're using, it just got destroyed. By processes that are fully  
Unicode-compliant.

(I know that you did indeed use different characters in the original  
mail, and they reached my mail client in that form, because I can  
examine the bytes in the message and see that this was the case. But  
simply copying the text to a plain-text editor changes that.)
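
For anyone who wants to reproduce the effect, here is a minimal sketch in Python. It assumes the brackets in question were U+2329/U+232A and U+3008/U+3009, which is one pair of canonically equivalent angle brackets; the exact code points in the original mail are not reproduced here, and the helper name normalized_set is just for illustration.

```python
import unicodedata

# Assumed code points for illustration; the exact characters used in the
# original mail are not reproduced here.
variants = [
    "\u2329this string\u232A",  # LEFT-/RIGHT-POINTING ANGLE BRACKET
    "\u3008this string\u3009",  # LEFT/RIGHT ANGLE BRACKET (CJK block)
    "\u2329this string\u3009",  # one of each
]

def normalized_set(form, strings):
    """Return the set of distinct strings left after normalizing to `form`."""
    return {unicodedata.normalize(form, s) for s in strings}

print(len(set(variants)))                    # 3 distinct strings before
print(len(normalized_set("NFC", variants)))  # 1 after NFC
print(len(normalized_set("NFD", variants)))  # 1 after NFD
```

Any tool in the chain that normalizes to either form would collapse the three lines into one in the same way, which is consistent with what I saw after the copy and paste.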

JK
Received on Thursday, 5 February 2009 00:13:46 GMT
