RE: [i18n+P&C] IRI/URI normalization from Phillips, Addison on 2009-08-14 (public-i18n-core@w3.org from July to September 2009)

From: Phillips, Addison <addison@amazon.com>
Date: Fri, 14 Aug 2009 09:01:47 -0700
To: Marcin Hanclik <Marcin.Hanclik@access-company.com>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
CC: "public-webapps@w3.org" <public-webapps@w3.org>, "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <4D25F22093241741BC1D0EEBC2DBB1DA01AC93547B@EX-SEA5-D.ant.amazon.com>

Hello Marcin,

Thank you for the note. This is a PERSONAL response.

I immediately spotted some red flags in your email. You state:

> 1. The widget configuration document may contain only US-ASCII
> characters, and thus conform to P&C.

This appears to me to be false. The widget configuration document is defined as an XML document, and the examples all declare the encoding as UTF-8. See, for example, [1]. A UTF-8 encoded configuration document can represent any Unicode character sequence. For that matter, an XML document can also use numeric character references (NCRs) to represent characters not in the document's character encoding, and the configuration document could use any valid character encoding recognized by the XML processor.

However, you go on to say:

> 4. To use the non-US-ASCII feature-name, I would percent-encode it,
> as e.g. in [2]. (This seems to be the core of the problem, namely
> usage of feature-name specified in one language within the
> configuration document and text editor using another
> language/encoding).

Percent encoding is described in the IRI spec as part of mapping IRIs to URIs. If you encode a feature-name that way, something has to decode it later. But it is not necessary to percent-encode it, even if you use US-ASCII as the character encoding of your configuration document. You can use NCRs, for example (these are decoded by the XML processor in the WUA).
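
To make the difference concrete, here is a minimal sketch in Python. It is only an illustration, not anything taken from the P&C spec: the feature-name IRI is made up, the configuration document is reduced to a namespace-free `<widget>`/`<feature>` pair, and `feature_name` is a hypothetical helper. It assumes the XML processor behaves like Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote, unquote

# Hypothetical feature-name IRI containing U+00E9 (e with acute accent).
FEATURE = "http://example.org/caf\u00e9"


def feature_name(doc: bytes) -> str:
    """Return the name attribute of the first <feature> element."""
    return ET.fromstring(doc).find("feature").get("name")


# 1. UTF-8 document with the character written literally.
utf8_doc = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<widget><feature name="' + FEATURE + '"/></widget>'
).encode("utf-8")

# 2. US-ASCII document with the same character written as an NCR.
ascii_doc = (
    b'<?xml version="1.0" encoding="US-ASCII"?>'
    b'<widget><feature name="http://example.org/caf&#xE9;"/></widget>'
)

# 3. US-ASCII document with the character percent-encoded (IRI-to-URI mapping).
pct_doc = (
    '<?xml version="1.0" encoding="US-ASCII"?>'
    '<widget><feature name="' + quote(FEATURE, safe=":/") + '"/></widget>'
).encode("ascii")

# The XML processor resolves the NCR, so documents 1 and 2 agree.
assert feature_name(utf8_doc) == FEATURE
assert feature_name(ascii_doc) == FEATURE

# Percent-escapes are opaque to the XML processor; an extra decoding
# step is needed before the value matches the original IRI.
assert feature_name(pct_doc) != FEATURE
assert unquote(feature_name(pct_doc)) == FEATURE
```

In this sketch the NCR form needs no cooperation from whoever consumes the attribute value, while the percent-encoded form only matches after an explicit decoding step.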

> Proposed solutions (OR-ed):
> 
> a. Define a rule similar to "10.1.4 Rule for Getting a Single
> Attribute Value" (or a statement in that rule) that would specify
> the IRI/URI normalization according to RFC3987 (section 5.3.2.3).

I would support changing this, although I note that the widget document goes out of its way to prohibit URL-encoding (percent encoding) of IRIs. As noted above, there are other ways to put non-ASCII path characters into your configuration document.

> 
> c. Mandate only UTF-8 encoded configuration documents and disallow
> other encodings (like Shift-JIS, ISO-XY etc).

This would be wrong. It doesn't really solve the problem anyway. The character encoding of the serialized XML document is not the limit on the characters that can be represented in it, although it might be inconvenient for a lot of NCRs to appear in a document.
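
As a rough illustration of both halves of that point, the following Python sketch serializes one attribute value in two encodings. The feature name is invented for the example, and the behavior shown is that of Python's `xml.etree.ElementTree` serializer, which I am assuming here as a stand-in for any XML tool that falls back to NCRs.

```python
import xml.etree.ElementTree as ET

# Hypothetical feature name: six katakana characters, none available in US-ASCII.
NAME = "\u30a6\u30a3\u30b8\u30a7\u30c3\u30c8"

feature = ET.Element("feature", {"name": NAME})

# Characters the target encoding cannot hold are written as NCRs.
ascii_bytes = ET.tostring(feature, encoding="us-ascii")
# In UTF-8 the same characters are written directly.
utf8_bytes = ET.tostring(feature, encoding="utf-8")

# One NCR per non-ASCII character: legal, but not pleasant to read or edit.
ncr_count = ascii_bytes.count(b"&#")

# Parsing the ASCII-only serialization recovers the original name.
round_tripped = ET.fromstring(ascii_bytes).get("name")

print(ascii_bytes.decode("ascii"))
print(ncr_count, round_tripped == NAME)
```

The ASCII serialization carries exactly the same information as the UTF-8 one; it is just harder for a human to work with.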

Please note, I am not recommending that anyone actually use other character encodings. I always, personally, recommend that people use UTF-8 for XML. But if you need to use a legacy encoding, the spec should not necessarily prevent you from doing so.

> 
> d. Mandate only US-ASCII feature-names (probably bad/against
> internationalization).

The I18N WG would certainly object to this.

I hope that helps. The I18N WG will consider this at our next meeting (next week).

Kind Regards,

Addison


[1] http://www.w3.org/TR/2009/CR-widgets-20090723/#configuration-document


Addison Phillips
Globalization Architect -- Lab126
Chair -- W3C Internationalization WG

Internationalization is not a feature.
It is an architecture.

Received on Friday, 14 August 2009 16:02:28 UTC