- From: Addison Phillips <addison@yahoo-inc.com>
- Date: Mon, 02 Jul 2007 11:50:02 -0700
- To: Bjoern Hoehrmann <derhoermi@gmx.net>
- CC: Martin Duerst <duerst@it.aoyama.ac.jp>, public-i18n-core@w3.org, public-iri@w3.org
[The following is a personal response.]
Bjoern Hoehrmann wrote:
> There should be no SHOULD, it's critical that applications get this
> right. Where normalization is necessary or beneficial, it should be
> applied to the text content before any IRI processing takes place.
I agree with this and am not sure why the change was made.
The problem with this edit is that step 1b. is now doing two
things where formerly it did only one thing. Previously, it specified
that the IRI be converted to a normalized Unicode character sequence
without specifying how that took place.
Now it specifies converting from the legacy encoding and *then*
(perhaps) normalizing. This reduces the requirement for NFC from an
inherent MUST to an explicit SHOULD.
Now I understand that encoding converters may or may not produce a
sequence that is NFC. For example, mapping a sequence containing the
combining flavors of Japanese dakuten or handakuten characters (i.e.
U+3099, U+309A) to Unicode from a Japanese encoding will result in a
combining sequence in several converters I have handy. I think it
acceptable and even smart not to require the transcoding process to be
normalizing. However, that wasn't the requirement in 1b. Normalization
could be applied outside the transcoding process and still be conformant
with the old text.
So, I think this change is counter-productive. It would have been
better to say:
--
b. If the IRI is in some digital representation (e.g., an
octet stream) in some known non-Unicode character
encoding, convert the IRI to a sequence of characters
from the UCS normalized according to NFC. Note that not
all transcoding processes produce normalized text and that
normalization might need to be checked after transcoding
or applied separately.
--
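To illustrate "applied separately", here is a minimal Python sketch, not text proposed for the spec. The helper name transcode_to_nfc is hypothetical, and windows-1258 (cp1258) is used only as a convenient stand-in for a legacy encoding whose converter yields combining sequences; the Japanese converters mentioned above are not reproduced here. The last line applies the same NFC step to a kana followed by combining dakuten.

```python
import unicodedata


def transcode_to_nfc(octets: bytes, encoding: str) -> str:
    """Decode octets from a known legacy encoding, then apply NFC separately."""
    # Step 1: transcode. The converter is not required to normalize.
    text = octets.decode(encoding)
    # Step 2: normalize outside the converter, so the result is always NFC.
    return unicodedata.normalize("NFC", text)


# Assumption for illustration: windows-1258 stores Vietnamese tone marks as
# combining characters, so plain decoding yields "e" followed by U+0301.
legacy_octets = b"e\xec"
decoded_only = legacy_octets.decode("cp1258")
normalized = transcode_to_nfc(legacy_octets, "cp1258")

# Compare the code points before and after the separate NFC step.
print([hex(ord(c)) for c in decoded_only])
print([hex(ord(c)) for c in normalized])

# The same NFC step composes a kana followed by combining dakuten (U+3099).
print(unicodedata.normalize("NFC", "\u304b\u3099") == "\u304c")
```

The point of the sketch is that the converter itself stays non-normalizing, while the overall step still delivers NFC text, which is all the old wording of 1b required.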
WRT Bjoern's note:
> Besides, this does not resolve disputes as to when step b) would apply
> at all.
I don't understand this comment, however. I'm not sure what disputes
could arise here, since this section specifies a process for mapping
IRIs to URIs. It doesn't specify any particular Unicode encoding
(or that one be used at all), but it does require that the text be a
sequence of characters in the Unicode character set. I note that the
whole of XML, for example, is based on this exact same idea [1].
Addison
--
Addison Phillips
Globalization Architect -- Yahoo! Inc.
Chair -- W3C Internationalization Core WG
Internationalization is an architecture.
It is not a feature.
Received on Monday, 2 July 2007 18:50:27 UTC