[P&C] Low-level internationalization, XML deserialization, IRI or URI, IRI normalization from Marcin Hanclik on 2009-07-25 (public-iri@w3.org from July 2009)

From: Marcin Hanclik <Marcin.Hanclik@access-company.com>
Date: Sat, 25 Jul 2009 11:11:08 +0200
To: "public-webapps@w3.org" <public-webapps@w3.org>
CC: "connolly@w3.org" <connolly@w3.org>, "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <FAA1D89C5BAF1142A74AF116630A9F2C2890BCA4E3@OBEEX01.obe.access-company.com>

Hi Marcos, All,

Regarding the use of IRIs in the widget configuration document, I do not know which specification is responsible for mandating IRI normalization.
It is possible that I simply have not yet found an existing explanation of this issue, so if you know of one, I would be grateful for a pointer.

Here are more details.

The P&C spec mixes the targets of the grammars (or low-level format specifications) it operates on.
E.g.
the sections about Zip archive operate on bytes
http://www.w3.org/TR/widgets/#zip-archive
http://www.w3.org/TR/widgets/#version-of-zip-needed-to-extract-a-file-

and
zip-rel-path grammar
http://www.w3.org/TR/widgets/#zip-rel-path
operates on characters, not bytes (it may not be fully clear from the P&C text).

XML Fifth Edition refers only to the URI specification; it does not know about IRIs.

WUA must support XML and UTF-8:
http://www.w3.org/TR/widgets/#dependencies-on-other-specifications-and

The configuration document is only required to be XML:
http://www.w3.org/TR/widgets/#configuration-document
and its encoding may be virtually any that is registered with IANA (my assumption).

So we can have the following situation:
The WUA that I develop widgets for has a very interesting feature whose IRI is truly international (Polish in this case):

http://example.com/ŁódzkiŚpiewnikŹdźbłowy

I.e. the IRI contains characters outside of the US-ASCII character set.
Then, I may not have a UTF-X capable editor at hand, so I convert the IRI to a URI as in
http://tools.ietf.org/html/rfc3987#section-3.1, Step 2, and
I write the following config.xml with the legacy US-ASCII encoding:

<?xml version="1.0" encoding="us-ascii"?>
<widget …>
…
<feature name="http://example.com/%C5%81%C3%B3dzki%C5%9Apiewnik%C5%B9d%C5%BAb%C5%82owy" />
…
</widget>
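
As a rough illustration of the Step 2 conversion referenced above, here is a minimal Python sketch. It is an assumption-laden simplification, not the full RFC 3987 algorithm: the function name iri_to_uri is hypothetical, every non-ASCII character is percent-encoded as its UTF-8 bytes, NFC normalization is applied first, and internationalized host names are not treated specially.

```python
import unicodedata


def iri_to_uri(iri: str) -> str:
    """Map an IRI to a US-ASCII URI (simplified sketch, hypothetical helper).

    Assumption: every non-ASCII character is percent-encoded as its
    UTF-8 byte sequence; ASCII characters are copied unchanged.
    """
    # Normalize to NFC first so that composed and decomposed input
    # produce the same byte sequence.
    iri = unicodedata.normalize("NFC", iri)
    parts = []
    for ch in iri:
        if ord(ch) < 0x80:
            parts.append(ch)
        else:
            parts.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(parts)


feature_iri = "http://example.com/ŁódzkiŚpiewnikŹdźbłowy"
feature_uri = iri_to_uri(feature_iri)
print(feature_uri)
```

Under these assumptions the printed value is the same US-ASCII string used as the feature name in the config.xml above.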

http://tools.ietf.org/html/rfc3987#section-3.2 provides a method to convert URI to IRI.
However, I am not sure whether this conversion is mandated in P&C, since P&C just says that, e.g., the name attribute is an IRI.
I am not sure whether it must already have IRI syntax in config.xml (not possible in my case, since I use US-ASCII only) or only at a later stage.

Percent encoding is allowed in IRIs:
http://tools.ietf.org/html/rfc3987#section-2.2
and
"Terminals in the ABNF are characters, not bytes."

Therefore it seems possible that the above config.xml, when parsed by a WUA that supports XML and UTF-8, will refer to a feature whose IRI would be

http://example.com/%C5%81%C3%B3dzki%C5%9Apiewnik%C5%B9d%C5%BAb%C5%82owy

on the character level. Then, this valid IRI has to be checked for equivalence with

http://example.com/ŁódzkiŚpiewnikŹdźbłowy

based on the algorithm specified in http://tools.ietf.org/html/rfc3987#section-5.1
and
http://tools.ietf.org/html/rfc3987#section-5.3.1

http://tools.ietf.org/html/rfc3987#section-5.3.2, specifically section 5.3.2.3 mentions percent-encoding normalization.
I am not sure whether DOM3Core Load&Save mechanisms perform such normalization (see also below).
P&C does not specify it.
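
To make the difference between simple string comparison (section 5.3.1) and comparison after percent-encoding normalization (section 5.3.2.3) concrete, here is a minimal Python sketch. It rests on assumptions: the helper names are hypothetical, both inputs are mapped to an all-ASCII URI form before comparison, and scheme or host case normalization is not covered.

```python
import re
import string
import unicodedata

# Characters that never need percent-encoding (RFC 3986 "unreserved").
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def to_uri_form(iri: str) -> str:
    """Percent-encode every non-ASCII character as UTF-8 bytes (simplified)."""
    iri = unicodedata.normalize("NFC", iri)
    return "".join(
        ch if ord(ch) < 0x80
        else "".join(f"%{b:02X}" for b in ch.encode("utf-8"))
        for ch in iri
    )


def _fix_escape(match: re.Match) -> str:
    """Decode escapes of unreserved characters; uppercase the hex of the rest."""
    ch = chr(int(match.group(1), 16))
    return ch if ch in UNRESERVED else "%" + match.group(1).upper()


def normalize_for_comparison(iri: str) -> str:
    return re.sub(r"%([0-9A-Fa-f]{2})", _fix_escape, to_uri_form(iri))


def equivalent(a: str, b: str) -> bool:
    return normalize_for_comparison(a) == normalize_for_comparison(b)


IRI_FORM = "http://example.com/ŁódzkiŚpiewnikŹdźbłowy"
URI_FORM = ("http://example.com/"
            "%C5%81%C3%B3dzki%C5%9Apiewnik%C5%B9d%C5%BAb%C5%82owy")

print(IRI_FORM == URI_FORM)            # simple string comparison
print(equivalent(IRI_FORM, URI_FORM))  # comparison after normalization
```

With simple string comparison the two feature names differ; after the normalization sketched here they compare as equal. Which of the two behaviours a WUA should implement is exactly the open question.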

P&C says:
"An attribute defined as containing a valid IRI. A valid IRI is one that matches the IRI token of the [RFC3987] specification."
Again, does this refer to the syntax in config.xml (i.e. impossible at the byte level) or to a later stage?

DOM3Core http://www.w3.org/TR/DOM-Level-3-Core/core.html says
"A solution for loading a Document and saving it persistently is proposed in [DOM Level 3 Load and Save]."

DOM3LS http://www.w3.org/TR/DOM-Level-3-LS/load-save.html will normalize the entities AFAIK, but probably will not normalize percent-encoded characters in URI/IRI.
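
The distinction can be observed with an ordinary XML parser. The sketch below uses Python's xml.etree.ElementTree as a stand-in, which is an assumption: it is not a DOM3 Load and Save implementation, and the shortened feature name and the omission of the widget namespace are for illustration only. It shows a character reference being resolved while a percent-escape in the same attribute is left untouched.

```python
import xml.etree.ElementTree as ET

# A US-ASCII encoded configuration document (illustrative, no namespace).
# The attribute mixes a percent-escape (%C5%81) with an XML character
# reference (&#xF3;), both of which are plain ASCII on the byte level.
CONFIG = (
    b'<?xml version="1.0" encoding="us-ascii"?>'
    b'<widget>'
    b'<feature name="http://example.com/%C5%81&#xF3;dzki"/>'
    b'</widget>'
)

root = ET.fromstring(CONFIG)
name = root.find("feature").get("name")

# The parser has replaced &#xF3; with the character U+00F3,
# but it has not decoded the percent-escape.
print(name)
```

So, at least with this parser, any percent-encoding normalization would have to happen in a separate step after XML parsing.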

Proposal

http://www.w3.org/TR/REC-xml-names/#iri-use says:
"Because of the risk of confusion between URIs that would be equivalent if dereferenced, the use of %-escaped characters in namespace names is strongly discouraged."

So maybe P&C should state something similar, e.g.

"Because of the risk of confusion between IRIs that would be equivalent if dereferenced, the use of %-escaped characters in feature names is strongly discouraged."

As a result, percent-encoded IRIs would not appear in the configuration document, and the author of the configuration document would need a UTF-8 capable editor (this may be too strict a requirement; it is just a proposal).

Alternatively, we could specify in P&C that the attributes – that are currently specified as being IRI – shall actually be "IRI or URI" depending on the encoding of the config.xml.

A third option would be to say something about IRI/URI normalization.

More comments:

The part of
http://www.w3.org/html/wg/href/draft.html#parsing-urls
namely:
"How does this compare to just parsing using the IRI grammar of RFC 3987?"
makes me think that the problem (I assume my problem and that of Web addresses are similar) is not yet fully solved in any spec.
I apologize if I have overlooked something.

The latest draft for IRI is this one:
http://tools.ietf.org/html/draft-duerst-iri-bis-06
and it is being discussed also in W3C, see e.g. very recent comments from Anne at
http://lists.w3.org/Archives/Public/public-iri/2009Jul/

These are the documents that could help more:
http://www.w3.org/International/articles/idn-and-iri/
http://www.w3.org/html/wg/href/draft-ietf.html (seems to be just a newer version of the above draft)

Please let me know what you think.
Thanks.

Kind regards,
Marcin

________________________________________

Access Systems Germany GmbH
Essener Strasse 5 | D-46047 Oberhausen
HRB 13548 Amtsgericht Duisburg
Geschaeftsfuehrer: Michel Piquemal, Tomonori Watanabe, Yusuke Kanda

www.access-company.com

Received on Saturday, 25 July 2009 09:11:54 UTC