- From: Marcos Caceres <marcosc@opera.com>
- Date: Mon, 27 Jul 2009 18:23:53 +0200
- To: Marcin Hanclik <Marcin.Hanclik@access-company.com>
- Cc: "public-webapps@w3.org" <public-webapps@w3.org>, "connolly@w3.org" <connolly@w3.org>, "public-iri@w3.org" <public-iri@w3.org>
2009/7/25 Marcin Hanclik <Marcin.Hanclik@access-company.com>:
> Hi Marcos, All,
>
> Regarding the usage of IRI in the widget configuration document, I do not know which specification is responsible for mandating the IRI normalization.
> It is possible that I simply have not yet found the proper existing explanation to the issue, so if you know it, I would be grateful to get this information.
>
> Here are more details.
>
> The P&C spec mixes the targets of the grammars (or low-level format specifications) it operates on.
> E.g.
> the sections about Zip archive operate on bytes
> http://www.w3.org/TR/widgets/#zip-archive
> http://www.w3.org/TR/widgets/#version-of-zip-needed-to-extract-a-file-
>
> and
> zip-rel-path grammar
> http://www.w3.org/TR/widgets/#zip-rel-path
> operates on characters, not bytes (it may not be fully clear from the P&C text).
>
> XML Fifth Edition refers only to the URI specification; it does not know about IRI.
>
> WUA must support XML and UTF-8:
> http://www.w3.org/TR/widgets/#dependencies-on-other-specifications-and
>
> The configuration document is only required to be XML:
> http://www.w3.org/TR/widgets/#configuration-document
> and its encoding may be virtually any that is registered with IANA (my assumption).
>
> So we can have the following situation:
> The WUA that I develop widgets for has a very interesting feature, whose IRI is really international (Polish in this case):
>
> http://example.com/ŁódzkiŚpiewnikŹdźbłowy
>
> I.e. the IRI contains characters outside of the US-ASCII character set.
> Then, I may not have a UTF-X capable editor at hand, so I convert the IRI to URI as in
> http://tools.ietf.org/html/rfc3987#section-3.1, Step 2,
> and I write the following config.xml with US-ASCII legacy encoding:
>
> <?xml version="1.0" encoding="us-ascii"?>
> <widget …>
> …
> <feature name="http://example.com/%C5%81%C3%B3dzki%C5%9Apiewnik%C5%B9d%C5%BAb%C5%82owy" />
> …
> </widget>
>
> http://tools.ietf.org/html/rfc3987#section-3.2 provides a method to convert URI to IRI.
> However, I am not sure whether this conversion is mandated in P&C, since P&C just says that e.g. the name attribute is an IRI:
> I am not sure whether it should have IRI syntax in config.xml (not possible in my case, since I use US-ASCII only) or later.
>
> Percent encoding is allowed in IRIs:
> http://tools.ietf.org/html/rfc3987#section-2.2
> and
> "Terminals in the ABNF are characters, not bytes."
>
> Therefore it seems possible that the above config.xml, when parsed by an XML- and UTF-8-supporting WUA, will refer to a feature whose IRI would be
>
> http://example.com/%C5%81%C3%B3dzki%C5%9Apiewnik%C5%B9d%C5%BAb%C5%82owy
>
> on the character level. Then, this valid IRI has to be checked for equivalence with
>
> http://example.com/ŁódzkiŚpiewnikŹdźbłowy
>
> based on the algorithm specified in http://tools.ietf.org/html/rfc3987#section-5.1
> and
> http://tools.ietf.org/html/rfc3987#section-5.3.1
>
> http://tools.ietf.org/html/rfc3987#section-5.3.2, specifically section 5.3.2.3, mentions percent-encoding normalization.
> I am not sure whether DOM3Core Load&Save mechanisms perform such normalization (see also below).
> P&C does not specify it.
>
> P&C says:
> "An attribute defined as containing a valid IRI. A valid IRI is one that matches the IRI token of the [RFC3987] specification."
> Again, is it the syntax in config.xml (i.e. impossible on byte level) or later?
>
> DOM3Core http://www.w3.org/TR/DOM-Level-3-Core/core.html says
> "A solution for loading a Document and saving it persistently is proposed in [DOM Level 3 Load and Save]."
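
To make the comparison described above concrete, here is a rough Python sketch. It is an illustration under stated assumptions, not something P&C or RFC 3987 prescribes in this form: the function names (iri_to_uri, normalize_escapes, same_feature) are made up for this example, the conversion only covers the percent-encoding step (RFC 3987 section 3.1, Step 2), and the normalization only covers percent-encoding (escape case and escaped unreserved ASCII). Scheme/host case folding, path normalization and Unicode normalization are deliberately left out.

```python
import re

# Unreserved ASCII characters (RFC 3986, section 2.3). Escapes of these
# characters can be decoded without changing the meaning of the URI.
_UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def iri_to_uri(iri: str) -> str:
    """Percent-encode every non-ASCII character as its UTF-8 octets.

    Hypothetical helper: ASCII characters (including existing %XX
    escapes) are passed through untouched.
    """
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


def normalize_escapes(uri: str) -> str:
    """Decode escaped unreserved characters; upper-case all other escapes."""
    def fix(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in _UNRESERVED else match.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", fix, uri)


def same_feature(a: str, b: str) -> bool:
    """Compare two feature names after mapping both to normalized URI form."""
    return normalize_escapes(iri_to_uri(a)) == normalize_escapes(iri_to_uri(b))


iri = "http://example.com/ŁódzkiŚpiewnikŹdźbłowy"
uri = "http://example.com/%C5%81%C3%B3dzki%C5%9Apiewnik%C5%B9d%C5%BAb%C5%82owy"
print(same_feature(iri, uri))  # True
```

Without a step like this, a plain character-by-character comparison of the two feature names from the example would report them as different, which is exactly the ambiguity in question.
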
>
> DOM3LS http://www.w3.org/TR/DOM-Level-3-LS/load-save.html will normalize the entities AFAIK, but probably will not normalize percent-encoded characters in URI/IRI.
>
> Proposal
>
> http://www.w3.org/TR/REC-xml-names/#iri-use says:
> "Because of the risk of confusion between URIs that would be equivalent if dereferenced, the use of %-escaped characters in namespace names is strongly discouraged."
>
> So maybe P&C shall state something similar, e.g.
>
> "Because of the risk of confusion between IRIs that would be equivalent if dereferenced, the use of %-escaped characters in feature names is strongly discouraged."
>

Yeah, that seems reasonable. I've added it.

> This could result in percent-encoded IRIs not being present in the configuration document, and the need for the configuration document developer to use a UTF-8 capable editor (it may be too hard a requirement, it is just a proposal).
>
> Alternatively, we could specify in P&C that the attributes that are currently specified as being IRI shall actually be "IRI or URI" depending on the encoding of the config.xml.
>
> A third option would be to say something about IRI/URI normalization.
>
> More comments:
>
> The part of
> http://www.w3.org/html/wg/href/draft.html#parsing-urls
> namely:
> "How does this compare to just parsing using the IRI grammar of RFC 3987?"
> makes me think that the problem (I assume my problem and the Web addresses are similar) is not yet fully solved in any spec.
> I am sorry for any ignorance if such is identified.
>
> The latest draft for IRI is this one:
> http://tools.ietf.org/html/draft-duerst-iri-bis-06
> and it is being discussed also in W3C, see e.g. very recent comments from Anne at
> http://lists.w3.org/Archives/Public/public-iri/2009Jul/
>
> These are the documents that could help more:
> http://www.w3.org/International/articles/idn-and-iri/
> http://www.w3.org/html/wg/href/draft-ietf.html (seems to be just a newer version of the above draft)
>
> Please let me know what you think.
> Thanks.
>
> Kind regards,
> Marcin
>
> ________________________________________
>
> Access Systems Germany GmbH
> Essener Strasse 5 | D-46047 Oberhausen
> HRB 13548 Amtsgericht Duisburg
> Managing Directors: Michel Piquemal, Tomonori Watanabe, Yusuke Kanda
>
> www.access-company.com
>
> CONFIDENTIALITY NOTICE
> This e-mail and any attachments hereto may contain information that is privileged or confidential, and is intended for use only by the
> individual or entity to which it is addressed. Any disclosure, copying or distribution of the information by anyone else is strictly prohibited.
> If you have received this document in error, please notify us promptly by responding to this e-mail. Thank you.
>

--
Marcos Caceres
http://datadriven.com.au
Received on Monday, 27 July 2009 16:25:05 UTC