RE: [i18n+P&C] IRI/URI normalization from Marcin Hanclik on 2009-08-14 (public-webapps@w3.org from July to September 2009)

From: Marcin Hanclik <Marcin.Hanclik@access-company.com>
Date: Fri, 14 Aug 2009 23:03:57 +0200
To: "Phillips, Addison" <addison@amazon.com>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
CC: "public-webapps@w3.org" <public-webapps@w3.org>, "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <FAA1D89C5BAF1142A74AF116630A9F2C2890BCA503@OBEEX01.obe.access-company.com>
Hi Addison,

A self-correction to my message below.
I used the wrong character table (ISO-8859-2 instead of Unicode).

http://example.com/&#163;&#243;dzki&#163;piewnik&#172;d&#188;b&#179;owy

should be:

http://example.com/&#x141;&#xf3;dzki&#x15a;piewnik&#x179;d&#x17a;b&#x142;owy
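The mix-up can be reproduced with a short Python sketch using only the standard library. The helper name `to_ncr` is hypothetical and exists only for this illustration. A numeric character reference always carries the Unicode code point, so the ISO-8859-2 byte value 163 (0xA3), which stands for Ł in that table, is read as the pound sign instead.

```python
import html

IRI = "http://example.com/ŁódzkiŚpiewnikŹdźbłowy"


def to_ncr(text: str) -> str:
    """Replace each non-ASCII character with a hexadecimal NCR (hypothetical helper)."""
    return "".join(c if ord(c) < 128 else f"&#x{ord(c):x};" for c in text)


# Byte value of Ł in ISO-8859-2 versus its Unicode code point.
legacy_byte = "Ł".encode("iso-8859-2")[0]  # 163 (0xA3)
code_point = ord("Ł")                      # 321 (0x141)

# A processor resolves &#163; to U+00A3 (pound sign), not to Ł.
wrong = html.unescape("&#163;")
right = html.unescape("&#x141;")

print(to_ncr(IRI))
# ascii() keeps the output printable on any console.
print(legacy_byte, code_point, ascii(wrong), ascii(right))
```

Unescaping the output of `to_ncr(IRI)` gives back the original IRI, which is the round trip the corrected form relies on.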


Thanks.

Kind regards,
Marcin
________________________________________
From: public-iri-request@w3.org [public-iri-request@w3.org] On Behalf Of Marcin Hanclik [Marcin.Hanclik@access-company.com]
Sent: Friday, August 14, 2009 9:27 PM
To: Phillips, Addison; public-i18n-core@w3.org
Cc: public-webapps@w3.org; public-iri@w3.org
Subject: RE: [i18n+P&C] IRI/URI normalization

Hi Addison,

Great thanks for your rapid answer!

The red flags tell me that some clarification is needed; I was probably not clear enough.
So here it goes.

>>> 1. The widget configuration document may contain only US-ASCII
>>> characters, and thus conform to P&C.
>>
>>This appears to me to be false.
The scenario I presented is meant to be just a use case.
I imagine a hypothetical situation in which someone wants to use a feature whose name includes non-US-ASCII characters, but she/he has only a US-ASCII editor at hand.

I quote here some fragments of my initial email in WebApps [1].

>P&C says:
>"An attribute defined as containing a valid IRI. A valid IRI is one that matches the IRI  token of the [RFC3987] specification."
>Again, is it the syntax in config.xml (i.e. impossible on byte level) or later?
>
>DOM3Core http://www.w3.org/TR/DOM-Level-3-Core/core.html says
>"A solution for loading a Document and saving it persistently is proposed in [DOM Level 3 Load and Save]."
>
>DOM3LS http://www.w3.org/TR/DOM-Level-3-LS/load-save.html will normalize the entities AFAIK, but probably will not normalize percent-encoded characters in URI/IRI.
>
>http://www.w3.org/TR/REC-xml-names/#iri-use says:
>"Because of the risk of confusion between URIs that would be equivalent if dereferenced, the use of %-escaped characters in namespace names is strongly discouraged."

You say:
>>The widget configuration document is defined as an XML document. The examples all declare the encoding as UTF-8.
>>See, for example [1]. A UTF-8 encoded configuration document can represent any Unicode
>>character sequence. For that matter, an XML document can also use numeric character entities
>>to represent characters not in the document's character encoding and the configuration document could use any
>>valid character encoding recognized by the XML processor.
OK.
RFC3987 provides a normalization method between IRI and URI based on percent-encoding,
whereas XML assumes another encoding based on numeric character entities.
XML1.1 says [2]:
[10]    AttValue           ::=          '"' ([^<&"] | Reference)* '"'
                        |  "'" ([^<&'] | Reference)* "'"
[67]    Reference          ::=           EntityRef | CharRef
and so on.
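
The practical difference between the two escaping mechanisms can be sketched in Python with the standard-library XML parser. The `widget`/`feature` markup below is only an illustrative fragment, not a complete or validated configuration document. The parser resolves character references in attribute values, as the productions above allow, but it leaves percent-escapes untouched.

```python
import xml.etree.ElementTree as ET

# Pure US-ASCII source text; the two attributes escape the same two
# characters (U+0141 and U+00F3) in different ways.
DOC = (
    "<widget>"
    '<feature name="http://example.com/&#x141;&#xf3;dzki"/>'
    '<feature name="http://example.com/%C5%81%C3%B3dzki"/>'
    "</widget>"
)

names = [f.get("name") for f in ET.fromstring(DOC).iter("feature")]

# ascii() keeps the output printable on any console.
for n in names:
    print(ascii(n))
```

After parsing, the first value contains the actual characters while the second still contains the literal `%C5%81%C3%B3` text, so a plain string comparison treats them as different names.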

So, my understanding is:
the IRI

http://example.com/ŁódzkiŚpiewnikŹdźbłowy

could be encoded as

http://example.com/%C5%81%C3%B3dzki%C5%9Apiewnik%C5%B9d%C5%BAb%C5%82owy

based on RFC3987, or as

http://example.com/&#163;&#243;dzki&#163;piewnik&#172;d&#188;b&#179;owy

based on the XML character entity encoding method.
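
A minimal Python sketch of the percent-encoded form, assuming only the path segment needs converting (the host part, which RFC3987 treats separately, is ignored here). The function name `iri_segment_to_uri` is hypothetical. It encodes the segment as UTF-8 and percent-encodes every byte outside the unreserved set.

```python
from urllib.parse import quote, unquote

BASE = "http://example.com/"
SEGMENT = "ŁódzkiŚpiewnikŹdźbłowy"


def iri_segment_to_uri(segment: str) -> str:
    """UTF-8 encode a path segment, then percent-encode it (hypothetical helper)."""
    # safe="" means only letters, digits and "-._~" are left as is.
    return quote(segment, safe="", encoding="utf-8")


uri = BASE + iri_segment_to_uri(SEGMENT)
print(uri)

# Decoding the percent-escapes as UTF-8 restores the original segment.
restored = unquote(uri[len(BASE):], encoding="utf-8")
```

The printed URI is pure US-ASCII, so it can be typed in a US-ASCII editor and pasted into a browser as is.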

In general, IMHO, P&C should simply specify how IRIs could/should be encoded.
The method based on character entities looks odd to me. That form cannot be used as a URL, e.g. in the browser (and we could imagine a situation in which a feature name, being an IRI, points to something as is, by copy-pasting the string).
For practical reasons, one method would be enough, I think.

The XML Namespaces specification uses a URI as an attribute value in XML; this seems similar to our situation.
[4] says:
"The IRI references below are also all different for the purposes of identifying namespaces:
      http://www.example.org/~wilbur
      http://www.example.org/%7ewilbur
      http://www.example.org/%7Ewilbur"
So, as I see it, the normalization based on RFC3987 is not mandated there and is even wrong for the namespace identification use case.
We may have further confusion :(
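
The contrast can be sketched in Python. The three names are compared first as plain strings, which is how the namespace specification treats them, and then after a simplified percent-encoding normalization of the kind RFC3987 section 5.3.2.3 refers to. The function name `normalize_percent_escapes` is hypothetical, and it only decodes escapes of unreserved characters and uppercases the remaining ones; it is not a full IRI normalizer.

```python
import re
import string

# Characters that never need percent-encoding in a URI.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def normalize_percent_escapes(iri: str) -> str:
    """Decode escapes of unreserved characters; uppercase all other escapes."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", fix, iri)


NAMES = [
    "http://www.example.org/~wilbur",
    "http://www.example.org/%7ewilbur",
    "http://www.example.org/%7Ewilbur",
]

distinct_as_strings = len(set(NAMES))
distinct_normalized = len({normalize_percent_escapes(n) for n in NAMES})
print(distinct_as_strings, distinct_normalized)  # prints "3 1"
```

So whether two feature names match depends entirely on whether the comparing party normalizes first, which is the point that needs to be pinned down.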

>>I always, personally, recommend that people use UTF-8 for XML.
Mandating UTF-8 as the encoding for the configuration document is one of the proposed solutions.
There are free UTF-8-capable editors, so my hypothetical situation could be easily overcome.
The only thing we seem to need is clarification of what the P&C specification mandates.

>>But if you need to use a legacy encoding, the spec should not necessarily prevent you from doing so.
I wonder whether, for practical reasons, legacy encodings should be excluded after all.

P&C says:
"A user agent must support the following specifications:

    * [XML].
    * [XMLNS].
    * [DOM3CORE].
    * [UTF-8]."

So by including [XML], it seems that encodings other than UTF-8 are implicitly mandated, or are they?
I am not sure whether this is the understanding in WebApps.
Also, depending on the interpretation, a P&C-compliant WUA might not be able to process a config.xml written in ISO-XY.
E.g. I read the above as a requirement for config.xml to be encoded in either US-ASCII or UTF-8.

Additionally, XML1.1 [3] says:
"processors are, of course, not required to support all IANA-registered encodings"

>>I hope that helps. The I18N WG will consider this at our next meeting (next week).
Thanks again.

Kind regards,
Marcin

[1] http://lists.w3.org/Archives/Public/public-webapps/2009JulSep/0365.html
[2] http://www.w3.org/TR/xml11/#NT-AttValue
[3] http://www.w3.org/TR/xml11/#charencoding
[4] http://www.w3.org/TR/REC-xml-names/#NSNameComparison


________________________________________
From: Phillips, Addison [addison@amazon.com]
Sent: Friday, August 14, 2009 6:01 PM
To: Marcin Hanclik; public-i18n-core@w3.org
Cc: public-webapps@w3.org; public-iri@w3.org
Subject: RE: [i18n+P&C] IRI/URI normalization

Hello Marcin,

Thank you for the note. This is a PERSONAL response.

I immediately spotted some red flags in your email. You state:

> 1. The widget configuration document may contain only US-ASCII
> characters, and thus conform to P&C.

This appears to me to be false. The widget configuration document is defined as an XML document. The examples all declare the encoding as UTF-8. See, for example [1]. A UTF-8 encoded configuration document can represent any Unicode character sequence. For that matter, an XML document can also use numeric character entities to represent characters not in the document's character encoding and the configuration document could use any valid character encoding recognized by the XML processor.

However, you go on to say:

> 4. To use the non-US-ASCII feature-name, I would percent-encode it,
> as e.g. in [2]. (This seems to be the core of the problem, namely
> usage of feature-name specified in one language within the
> configuration document and text editor using another
> language/encoding).

Percent encoding is described in the IRI spec as part of mapping IRIs to URIs. If you encode it, you should decode it later. But it is not necessary to percent-encode it, even if you use US-ASCII-7 as the character encoding of your configuration document. You can use NCRs, for example (these are decoded by the XML processor in the WUA).

> Proposed solutions (OR-ed):
>
> a. Define a rule similar to "10.1.4 Rule for Getting a Single
> Attribute Value" (or a statement in that rule) that would specify
> the IRI/URI normalization according to RFC3987 (section 5.3.2.3).

I would support changing this, although I note that the widget document goes out of its way to prohibit URL-encoding (percent encoding) of IRIs. As noted above, there are other ways to put non-ASCII path characters into your configuration document.

>
> c. Mandate only UTF-8 encoded configuration documents and disallow
> other encodings (like Shift-JIS, ISO-XY etc).

This would be wrong. It doesn't really solve the problem anyway. The character encoding of the serialized XML document is not the limit on the characters that can be represented in it, although it might be inconvenient for a lot of NCRs to appear in a document.
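
This point can be illustrated with a small Python sketch using the standard-library ElementTree module. The `feature` element and its `name` value are assumptions chosen for the illustration. Serializing to US-ASCII does not lose the non-ASCII characters; the serializer writes them as numeric character references, and a parser restores them.

```python
import xml.etree.ElementTree as ET

# "Łódzki" written with escapes so the source file itself stays ASCII.
NAME = "http://example.com/\u0141\u00f3dzki"

elem = ET.Element("feature", {"name": NAME})

# ET.tostring() emits US-ASCII bytes by default and replaces every
# character outside that range with a decimal character reference.
ascii_bytes = ET.tostring(elem)
print(ascii_bytes.decode("ascii"))

# Parsing the ASCII-only document gives back the original string.
round_tripped = ET.fromstring(ascii_bytes).get("name")
```

The serialized bytes contain `&#321;&#243;dzki`, which is valid but harder to read and edit than the UTF-8 form.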

Please note, I am not recommending that anyone actually use other character encodings. I always, personally, recommend that people use UTF-8 for XML. But if you need to use a legacy encoding, the spec should not necessarily prevent you from doing so.

>
> d. Mandate only US-ASCII feature-names (probably bad/against
> internationalization).

The I18N WG would certainly object to this.

I hope that helps. The I18N WG will consider this at our next meeting (next week).

Kind Regards,

Addison


[1] http://www.w3.org/TR/2009/CR-widgets-20090723/#configuration-document

Addison Phillips
Globalization Architect -- Lab126
Chair -- W3C Internationalization WG

Internationalization is not a feature.
It is an architecture.




________________________________________

Access Systems Germany GmbH
Essener Strasse 5  |  D-46047 Oberhausen
HRB 13548 Amtsgericht Duisburg
Geschaeftsfuehrer: Michel Piquemal, Tomonori Watanabe, Yusuke Kanda

www.access-company.com

CONFIDENTIALITY NOTICE
This e-mail and any attachments hereto may contain information that is privileged or confidential, and is intended for use only by the
individual or entity to which it is addressed. Any disclosure, copying or distribution of the information by anyone else is strictly prohibited.
If you have received this document in error, please notify us promptly by responding to this e-mail. Thank you.


Received on Friday, 14 August 2009 21:04:43 UTC