Re: Unicode Normalization

From: Jonathan Kew <jonathan@jfkew.plus.com>
Date: Thu, 5 Feb 2009 00:12:57 +0000
Cc: public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
Message-Id: <237BEB96-8113-4266-94DA-F2E05A709BA7@jfkew.plus.com>
To: Robert J Burns <rob@robburns.com>

On 4 Feb 2009, at 23:46, Robert J Burns wrote:

> A slight correction on this that isn't really all that germane to  
> the present conversation, but I felt I should make nonetheless and  
> this correction helps improve understanding of the various issues.
> On Feb 4, 2009, at 3:07 PM, I wrote:
>> Take for example the following three strings (NFD, NFC and non- 
>> normalized):
>> 〈this string〉
>> 〈this string〉
>> 〈this string〉
> Actually the first form is non-normalized too. The second string is  
> conforming to both NFC and NFD. The third string is non-normalized  
> as well.
> Just to provide further clarification each line is a separate string  
> where the interior "this string" is an identical code point sequence  
> irrelevant for normalization purposes. The angle brackets themselves  
> however, have been encoded repeatedly as different code points  
> despite Unicode offering no semantically distinct interpretation  
> between the two code points.

And as an illustration of just how unwise it would be for someone to  
use these distinct but canonically-equivalent characters to represent  
a significant distinction in markup: when I copy and paste those lines  
from my email client into a text editor, and examine the resulting  
code points, I find that all three lines are identical. Some process --  
I'm not sure whether it is my mail client's Copy command, my text  
editor's Paste, or the operating system pasteboard in between -- has  
helpfully applied Unicode normalization to the data. So if that was a  
semantically important distinction in the hypothetical markup language  
you're using, it just got destroyed. By processes that are fully  

(I know that you did indeed use different characters in the original  
mail, and they reached my mail client in that form, because I can  
examine the bytes in the message and see that this was the case. But  
simply copying the text to a plain-text editor changes that.)
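
The effect described above can be reproduced with a short Python sketch. The
exact code points in the original message are not recoverable from this
archive, so the sketch assumes the two bracket pairs were U+2329/U+232A
(LEFT/RIGHT-POINTING ANGLE BRACKET) and U+3008/U+3009 (LEFT/RIGHT ANGLE
BRACKET), which Unicode defines as canonically equivalent. The helper name
normalize_all and the particular mix of brackets in each string are
illustrative only.

```python
import unicodedata

# Assumed code points: the archive renders all three strings identically,
# so the pairing below is a guess at what the original message contained.
LEFT_2329, RIGHT_232A = "\u2329", "\u232A"  # LEFT/RIGHT-POINTING ANGLE BRACKET
LEFT_3008, RIGHT_3009 = "\u3008", "\u3009"  # LEFT/RIGHT ANGLE BRACKET

strings = [
    LEFT_2329 + "this string" + RIGHT_232A,  # not normalized
    LEFT_3008 + "this string" + RIGHT_3009,  # already both NFC and NFD
    LEFT_2329 + "this string" + RIGHT_3009,  # mixed, not normalized
]

def normalize_all(form):
    """Return the three strings normalized to the given form (hypothetical helper)."""
    return [unicodedata.normalize(form, s) for s in strings]

for form in ("NFC", "NFD"):
    normalized = normalize_all(form)
    # Count distinct strings before and after normalization.
    print(form, len(set(strings)), "->", len(set(normalized)))
```

Under these assumptions, any process that applies NFC or NFD along the way
collapses the three distinct inputs into the second form, which is the kind
of loss a copy-and-paste step can introduce.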

Received on Thursday, 5 February 2009 00:13:48 UTC