- From: Jonathan Kew <jonathan@jfkew.plus.com>
- Date: Thu, 5 Feb 2009 09:29:16 +0000
- To: Andrew Cunningham <andrewc@vicnet.net.au>
- Cc: Robert J Burns <rob@robburns.com>, public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
On 5 Feb 2009, at 02:50, Andrew Cunningham wrote:

> HI
>
> Jonathan Kew wrote:
>> And as an illustration of just how unwise it would be for someone
>> to use these distinct but canonically-equivalent characters to
>> represent a significant distinction in markup: when I copy and
>> paste those lines from my email client into a text editor, and
>> examine the resulting codepoints, I find that all three lines are
>> identical. Some process -- I'm not sure whether it is my mail
>> client's Copy command, my text editor's Paste, or the operating
>> system pasteboard in between -- has helpfully applied Unicode
>> normalization to the data. So if that was a semantically important
>> distinction in the hypothetical markup language you're using, it
>> just got destroyed. By processes that are fully Unicode-compliant.
>>
>> (I know that you did indeed use different characters in the
>> original mail, and they reached my mail client in that form,
>> because I can examine the bytes in the message and see that this
>> was the case. But simply copying the text to a plain-text editor
>> changes that.)
>
> Now the question is which characters did you receive? U+003c and
> U+003e? Which weren't present in the example? Just curious

U+3008 and U+3009 (as expected; these are the canonical decompositions of U+2329 and U+232A respectively). If the software had silently replaced them with U+003C and U+003E, I'd be complaining; those are not canonically equivalent, and are normally quite distinct in design.

JK
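
The canonical mapping of U+2329 and U+232A to U+3008 and U+3009 can be checked with a short script. What follows is only a minimal sketch, assuming Python 3 and its standard `unicodedata` module; the helper name `codepoints` and the sample string are illustrative and were not part of the exchange above.

```python
import unicodedata


def codepoints(text):
    """Return the code points of text as a list of U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Hypothetical sample: U+2329, the letter "x", U+232A.
original = "\u2329x\u232A"

print("input", codepoints(original))

# Apply each of the four Unicode normalization forms in turn.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, original)
    print(form, codepoints(normalized))
```

Under every form the angle brackets should come out as U+3008 and U+3009, never as U+003C and U+003E, which fits the point that any normalizing step in a copy-and-paste path erases the distinction between U+2329 and U+3008 without turning either into an ASCII character.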
Received on Thursday, 5 February 2009 09:30:09 UTC