- From: Addison Phillips <addison@yahoo-inc.com>
- Date: Mon, 02 Jul 2007 11:50:02 -0700
- To: Bjoern Hoehrmann <derhoermi@gmx.net>
- CC: Martin Duerst <duerst@it.aoyama.ac.jp>, public-i18n-core@w3.org, public-iri@w3.org
[The following is a personal response.]

Bjoern Hoehrmann wrote:
> There should be no SHOULD, it's critical that applications get this
> right. Where normalization is necessary or beneficial, it should be
> applied to the text content before any IRI processing takes place.

I agree with this and am not sure why the change was made.

The problem with this edit is that step 1b is now doing two things where formerly it did only one. Previously, it specified that the IRI be converted to a normalized Unicode character sequence without specifying how that took place. Now it specifies converting from the legacy encoding and *then* (perhaps) normalizing. It reduces the requirement for NFC from an inherent MUST to an explicit SHOULD.

I understand that encoding converters may or may not produce a sequence that is NFC. For example, mapping a sequence containing the combining forms of the Japanese dakuten or handakuten characters (i.e. U+3099, U+309A) to Unicode from a Japanese encoding results in a combining sequence in several converters I have handy. I think it is acceptable, and even smart, not to require the transcoding process to be normalizing. However, that wasn't the requirement in 1b. Normalization could be applied outside the transcoding process and still be conformant with the old text.

So, I think this change is counter-productive. It would have been better to say:

--
b. If the IRI is in some digital representation (e.g., an octet stream) in some known non-Unicode character encoding, convert the IRI to a sequence of characters from the UCS normalized according to NFC. Note that not all transcoding processes produce normalized text and that normalization might need to be checked after transcoding or applied separately.
--

WRT Bjoern's note:
> Besides, this does not resolve disputes as to when step b) would apply
> at all.

I don't understand this comment, however.
I'm not sure what disputes could arise here, since this section specifies a process for mapping IRIs to URIs. It doesn't specify any particular Unicode encoding (or that one be used at all), but it does require that the text be a sequence of characters in the Unicode character set. I note that the whole of XML, for example, is based on this exact same idea [1].

Addison

--
Addison Phillips
Globalization Architect -- Yahoo! Inc.
Chair -- W3C Internationalization Core WG

Internationalization is an architecture.
It is not a feature.
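
To illustrate the point above that normalization can be applied outside the transcoding process, here is a minimal Python sketch. It is only an illustration, not text from the IRI specification: the helper name `to_ucs_nfc` is hypothetical, and the decomposed string is constructed by hand to stand in for what a non-normalizing converter might emit (whether a given converter actually produces combining sequences depends on the converter).

```python
import unicodedata


def to_ucs_nfc(octets: bytes, encoding: str) -> str:
    """Hypothetical helper: transcode from a known legacy encoding, then apply NFC."""
    # Step 1: transcoding. The result may or may not already be in NFC.
    text = octets.decode(encoding)
    # Step 2: normalization, applied separately from the converter.
    return unicodedata.normalize("NFC", text)


# Stand-in for converter output that uses combining marks:
# KA (U+304B) + combining dakuten (U+3099),
# HA (U+306F) + combining handakuten (U+309A).
decomposed = "\u304b\u3099\u306f\u309a"

# NFC composes each pair into a single precomposed character
# (U+304C GA and U+3071 PA).
composed = unicodedata.normalize("NFC", decomposed)

# The helper applied to real legacy-encoded octets (EUC-JP here).
from_legacy = to_ucs_nfc("\u304c".encode("euc_jp"), "euc_jp")
```

Keeping the two steps separate matches the proposed wording: the converter is not required to normalize, but the character sequence handed to the rest of the IRI-to-URI mapping is in NFC either way.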
Received on Monday, 2 July 2007 18:50:25 UTC