Re: Unicode Normalization from Jonathan Kew on 2009-02-05 (www-style@w3.org from February 2009)

From: Jonathan Kew <jonathan@jfkew.plus.com>
Date: Thu, 5 Feb 2009 09:29:16 +0000
To: Andrew Cunningham <andrewc@vicnet.net.au>
Cc: Robert J Burns <rob@robburns.com>, public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
Message-Id: <E697AE43-CAB3-41F0-B3DC-21BDCB2D4C71@jfkew.plus.com>

On 5 Feb 2009, at 02:50, Andrew Cunningham wrote:

> Hi
>
> Jonathan Kew wrote:
>> And as an illustration of just how unwise it would be for someone  
>> to use these distinct but canonically-equivalent characters to  
>> represent a significant distinction in markup: when I copy and  
>> paste those lines from my email client into a text editor, and  
>> examine the resulting codepoints, I find that all three lines are  
>> identical. Some process -- I'm not sure whether it is my mail  
>> client's Copy command, my text editor's Paste, or the operating  
>> system pasteboard in between -- has helpfully applied Unicode  
>> normalization to the data. So if that was a semantically important  
>> distinction in the hypothetical markup language you're using, it  
>> just got destroyed. By processes that are fully Unicode-compliant.
>>
>> (I know that you did indeed use different characters in the  
>> original mail, and they reached my mail client in that form,  
>> because I can examine the bytes in the message and see that this  
>> was the case. But simply copying the text to a plain-text editor  
>> changes that.)
>
> Now the question is: which characters did you receive? U+003C and
> U+003E, which weren't present in the example? Just curious.

U+3008 and U+3009 (as expected; these are the canonical decompositions  
of U+2329 and U+232A respectively).

If the software had silently replaced them with U+003C and U+003E, I'd  
be complaining; those are not canonically equivalent, and are normally  
quite distinct in design.
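
For anyone who wants to check this themselves, here is a minimal sketch in Python. The language is only my choice for illustration; it relies on the standard unicodedata module, and the helper name codepoints and the sample string are made up for this example. It normalizes a string containing U+2329 and U+232A and lists the resulting codepoints. It says nothing about which step of the copy/paste path did the normalizing in my case.

```python
import unicodedata


def codepoints(text):
    """Return the codepoints of text in U+XXXX notation (hypothetical helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Sample string: U+2329, a plain "x", U+232A.
original = "\u2329x\u232A"

# Both canonical forms replace the brackets with U+3008 / U+3009,
# because U+2329 and U+232A have singleton canonical decompositions.
nfc = unicodedata.normalize("NFC", original)
nfd = unicodedata.normalize("NFD", original)

print(codepoints(original))  # ['U+2329', 'U+0078', 'U+232A']
print(codepoints(nfc))       # ['U+3008', 'U+0078', 'U+3009']
print(codepoints(nfd))       # ['U+3008', 'U+0078', 'U+3009']

# The character database records the decomposition directly.
print(unicodedata.decomposition("\u2329"))  # 3008
print(unicodedata.decomposition("\u232A"))  # 3009

# Even compatibility normalization does not produce U+003C / U+003E.
nfkc = unicodedata.normalize("NFKC", original)
print("<" in nfkc, ">" in nfkc)  # False False
```

Note that once the text has been normalized there is no way to recover which of the two codepoints was originally typed, which is why the distinction cannot safely carry meaning in markup.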

JK

Received on Thursday, 5 February 2009 09:30:09 UTC