Re: [P&C] Low-level internationalization, XML deserialization, IRI or URI, IRI normalization from Marcos Caceres on 2009-07-27 (public-webapps@w3.org from July to September 2009)

From: Marcos Caceres <marcosc@opera.com>
Date: Mon, 27 Jul 2009 18:23:53 +0200
To: Marcin Hanclik <Marcin.Hanclik@access-company.com>
Cc: "public-webapps@w3.org" <public-webapps@w3.org>, "connolly@w3.org" <connolly@w3.org>, "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <b21a10670907270923k41aa7f65ke69ba46c910e453e@mail.gmail.com>
2009/7/25 Marcin Hanclik <Marcin.Hanclik@access-company.com>:
> Hi Marcos, All,
>
> Regarding the usage of IRI in the widget configuration document, I do not know which specification is responsible for mandating the IRI normalization.
> It is possible that I simply have not yet found the proper existing explanation to the issue, so if you know it, I would be grateful to get this information.
>
> These are more details.
>
> The P&C spec mixes the targets of the grammars (or low-level format specifications) it operates on.
> E.g.
> the sections about Zip archive operate on bytes
> http://www.w3.org/TR/widgets/#zip-archive
> http://www.w3.org/TR/widgets/#version-of-zip-needed-to-extract-a-file-
>
> and
> zip-rel-path grammar
> http://www.w3.org/TR/widgets/#zip-rel-path
> operates on characters, not bytes (it may not be fully clear from the P&C text).
>
> XML Fifth Edition refers only to the URI specification; it does not know about IRI.
>
> WUA must support XML and UTF-8:
> http://www.w3.org/TR/widgets/#dependencies-on-other-specifications-and
>
> The configuration document is only required to be XML:
> http://www.w3.org/TR/widgets/#configuration-document
> and its encoding may be virtually any that is registered with IANA (my assumption).
>
> So we can have the following situation:
> The WUA, that I develop widgets for, has a very interesting feature, whose IRI is really international (Polish in this case):
>
> http://example.com/ŁódzkiŚpiewnikŹdźbłowy
>
> I.e. the IRI contains characters outside of the US-ASCII character set.
> Then, I may not have a UTF-X capable editor at hand, so I convert the IRI to URI as in
> http://tools.ietf.org/html/rfc3987#section-3.1, Step 2. and
> I write the following config.xml with US-ASCII legacy encoding:
>
> <?xml version="1.0" encoding="us-ascii"?>
> <widget …>
> …
> <feature name="http://example.com/%C5%81%C3%B3dzki%C5%9Apiewnik%C5%B9d%C5%BAb%C5%82owy" />
> …
> </widget>
>
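
A minimal sketch of the Step 2 conversion described above, assuming Python's standard urllib.parse.quote. The SAFE set and the iri_to_uri name are illustrative assumptions, not anything defined by P&C or RFC 3987, and the sketch ignores the optional ToASCII handling of the host part.

```python
from urllib.parse import quote

# Characters left untouched: RFC 3986 reserved and unreserved punctuation,
# plus "%" so that escapes already present are not encoded a second time.
# (Assumption: the input is already a syntactically valid IRI.)
SAFE = ":/?#[]@!$&'()*+,;=-._~%"


def iri_to_uri(iri: str) -> str:
    """Percent-encode each non-ASCII character as its UTF-8 octets."""
    return quote(iri, safe=SAFE, encoding="utf-8")


if __name__ == "__main__":
    print(iri_to_uri("http://example.com/ŁódzkiŚpiewnikŹdźbłowy"))
```

Applied to the Polish IRI, this yields the same percent-encoded string that appears in the name attribute of the config.xml above.
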
> http://tools.ietf.org/html/rfc3987#section-3.2 provides a method to convert URI to IRI.
> However, I am not sure whether this conversion is mandated in P&C, since P&C just says that e.g. the name attribute is an IRI:
> I am not sure whether it should have IRI syntax in config.xml (not possible in my case, since I use US-ASCII only) or later.
>
> Percent encoding is allowed in IRIs:
> http://tools.ietf.org/html/rfc3987#section-2.2
> and
> "Terminals in the ABNF are characters, not bytes."
>
> Therefore it seems possible that the above config.xml, when parsed by XML- and UTF-8-supporting WUA, will refer to a feature whose IRI would be
>
> http://example.com/%C5%81%C3%B3dzki%C5%9Apiewnik%C5%B9d%C5%BAb%C5%82owy
>
> on the character level. Then, this valid IRI has to be checked for equivalence with
>
> http://example.com/ŁódzkiŚpiewnikŹdźbłowy
>
> based on the algorithm specified in http://tools.ietf.org/html/rfc3987#section-5.1
> and
> http://tools.ietf.org/html/rfc3987#section-5.3.1
>
> http://tools.ietf.org/html/rfc3987#section-5.3.2, specifically section 5.3.2.3 mentions percent-encoding normalization.
> I am not sure whether DOM3Core Load&Save mechanisms perform such normalization (as also below).
> P&C does not specify it.
>
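
To make the equivalence problem concrete, here is a hedged Python sketch of one possible comparison: map both strings to URI form, decode escapes of unreserved ASCII characters, and upper-case the remaining hex digits. This is one reading of the percent-encoding normalization in RFC 3987 section 5.3.2.3, not something P&C mandates; the names normalize and equivalent are hypothetical.

```python
import re
from urllib.parse import quote

SAFE = ":/?#[]@!$&'()*+,;=-._~%"
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize(iri: str) -> str:
    # Map to URI form: non-ASCII characters become UTF-8 percent-escapes.
    uri = quote(iri, safe=SAFE, encoding="utf-8")

    def fix(match: re.Match) -> str:
        ch = chr(int(match.group(1), 16))
        # Decode escapes of unreserved ASCII; keep all others, upper-cased.
        return ch if ch in UNRESERVED else "%" + match.group(1).upper()

    return _ESCAPE.sub(fix, uri)


def equivalent(a: str, b: str) -> bool:
    return normalize(a) == normalize(b)


if __name__ == "__main__":
    escaped = "http://example.com/%C5%81%C3%B3dzki%C5%9Apiewnik%C5%B9d%C5%BAb%C5%82owy"
    literal = "http://example.com/ŁódzkiŚpiewnikŹdźbłowy"
    print(escaped == literal)            # plain character comparison
    print(equivalent(escaped, literal))  # comparison after normalization
```

A plain character-by-character comparison treats the two feature names as different, while the normalized comparison treats them as the same. Escapes of reserved characters such as %2F are deliberately left alone.
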
> P&C says:
> "An attribute defined as containing a valid IRI. A valid IRI is one that matches the IRI  token of the [RFC3987] specification."
> Again, is it the syntax in config.xml (i.e. impossible on byte level) or later?
>
> DOM3Core http://www.w3.org/TR/DOM-Level-3-Core/core.html says
> "A solution for loading a Document and saving it persistently is proposed in [DOM Level 3 Load and Save]."
>
> DOM3LS http://www.w3.org/TR/DOM-Level-3-LS/load-save.html will normalize the entities AFAIK, but probably will not normalize percent-encoded characters in URI/IRI.
>
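
The difference between entity handling and percent-escapes can be checked with any XML parser. The sketch below uses Python's xml.etree.ElementTree as a stand-in (an assumption: it is not a DOM3 Load and Save implementation), and the shortened widget document and feature names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A US-ASCII document with two feature names: one written with numeric
# character references, one written with percent-escapes.
DOC = (
    b'<?xml version="1.0" encoding="us-ascii"?>'
    b"<widget>"
    b'<feature name="http://example.com/&#x141;&#xF3;dzki"/>'
    b'<feature name="http://example.com/%C5%81%C3%B3dzki"/>'
    b"</widget>"
)

root = ET.fromstring(DOC)
names = [feature.get("name") for feature in root.findall("feature")]

if __name__ == "__main__":
    for name in names:
        print(name)
```

The parser resolves the character references into the actual characters, but hands the percent-escaped name to the application unchanged, so any percent-encoding normalization would have to happen after parsing.
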
> Proposal
>
> http://www.w3.org/TR/REC-xml-names/#iri-use says:
> "Because of the risk of confusion between URIs that would be equivalent if dereferenced, the use of %-escaped characters in namespace names is strongly discouraged."
>
> So maybe P&C shall state something similar, e.g.
>
> "Because of the risk of confusion between IRIs that would be equivalent if dereferenced, the use of %-escaped characters in feature names is strongly discouraged."
>

Yeah, that seems reasonable. I've added it.

> This could result in percent-encoded IRIs not being present in the configuration document, and in the need for the configuration document developer to use a UTF-8 capable editor (this may be too hard a requirement; it is just a proposal).
>
> Alternatively, we could specify in P&C that the attributes – that are currently specified as being IRI – shall actually be "IRI or URI" depending on the encoding of the config.xml.
>
> A third option would be to say something about IRI/URI normalization.
>
> More comments:
>
> The part of
> http://www.w3.org/html/wg/href/draft.html#parsing-urls
> namely:
> "How does this compare to just parsing using the IRI grammar of RFC 3987?"
> makes me think that the problem (I assume my problem and the Web addresses are similar) is not yet fully solved in any spec.
> I am sorry for any ignorance if such is identified.
>
> The latest draft for IRI is this one:
> http://tools.ietf.org/html/draft-duerst-iri-bis-06
> and it is being discussed also in W3C, see e.g. very recent comments from Anne at
> http://lists.w3.org/Archives/Public/public-iri/2009Jul/
>
> These are the documents that could help more:
> http://www.w3.org/International/articles/idn-and-iri/
> http://www.w3.org/html/wg/href/draft-ietf.html (seems to be just a newer version of the above draft)
>
> Please let me know what you think.
> Thanks.
>
> Kind regards,
> Marcin
>
> ________________________________________
>
> Access Systems Germany GmbH
> Essener Strasse 5  |  D-46047 Oberhausen
> HRB 13548 Amtsgericht Duisburg
> Geschaeftsfuehrer: Michel Piquemal, Tomonori Watanabe, Yusuke Kanda
>
> www.access-company.com



-- 
Marcos Caceres
http://datadriven.com.au
Received on Monday, 27 July 2009 16:25:05 UTC