RE: [P&C] Low-level internationalization, XML deserialization, IRI or URI, IRI normalization from Marcin Hanclik on 2009-07-28 (public-webapps@w3.org from July to September 2009)

From: Marcin Hanclik <Marcin.Hanclik@access-company.com>
Date: Tue, 28 Jul 2009 12:27:46 +0200
To: "marcosc@opera.com" <marcosc@opera.com>
CC: "public-webapps@w3.org" <public-webapps@w3.org>, "connolly@w3.org" <connolly@w3.org>, "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <FAA1D89C5BAF1142A74AF116630A9F2C2890AF8568@OBEEX01.obe.access-company.com>
Hi Marcos,

>>Yeah, that seems reasonable. I've added it.
I have not seen your change; I do not know where to look for it.
Anyway, I think the actual problem remains unsolved.

Maybe we should ask the developers?

The goal of widgets seems to be to reach the widest possible audience of developers and to provide a simple solution mainly for them. The WUA itself may get a bit complex, I think.

If e.g. a WUA exposes some nice feature identified by an IRI (i.e. with characters outside of US-ASCII), then this mainly affects developers, who may not have a Unicode-capable editor.
config.xml can be written e.g. in Shift-JIS or some other locale-specific charset (that people are used to), so a Unicode requirement also does not seem to solve the problem, but just imposes something new.

Also maybe we should ask i18n people?

So we may want to specify the following (one of, more ideas welcome):

a) the definition of feature names SHOULD/MUST use only US-ASCII characters. IMHO, this would mean that we actually assume that feature names are URIs, not IRIs.
Rationale/advantage: simplicity
Disadvantage: we are not i18n anymore

b) Specify in P&C that the IRI/URI, once retrieved from the configuration document, must be normalized according to the algorithm specified in RFC3987 section 5.
Section 5.1 http://tools.ietf.org/html/rfc3987#section-5.1 says:
"Applications using IRIs as identity tokens with no relationship to a
   protocol MUST use the Simple String Comparison (see section 5.3.1)."
It may be valid for P&C.
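
As a rough illustration of what Simple String Comparison would mean for feature names, here is a minimal Python sketch. The two strings are the example IRI and its percent-encoded form from the quoted message below; the function name `simple_string_equal` and the constants are made up for this sketch and are not taken from P&C or RFC 3987.

```python
# Sketch only: Simple String Comparison in the sense of RFC 3987
# section 5.3.1, i.e. a character-by-character comparison with no
# normalization at all. All names here are illustrative assumptions.

# The feature IRI as a WUA might know it internally (non-ASCII characters,
# written with escapes so the snippet itself stays ASCII-only).
IMPLEMENTED = "http://example.com/\u0141\u00f3dzki\u015apiewnik\u0179d\u017ab\u0142owy"

# The same feature as written in a US-ASCII config.xml (percent-encoded).
DECLARED = "http://example.com/%C5%81%C3%B3dzki%C5%9Apiewnik%C5%B9d%C5%BAb%C5%82owy"


def simple_string_equal(a: str, b: str) -> bool:
    """Compare two IRIs character by character, without any normalization."""
    return a == b


print(simple_string_equal(DECLARED, DECLARED))     # True
print(simple_string_equal(IMPLEMENTED, DECLARED))  # False
```

Under this rule the percent-encoded spelling and the non-ASCII spelling name two different features, unless a normalization step is required before the comparison.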

My preference is b), but I think that before any update of the P&C we need some discussion, as mentioned above, e.g. with the i18n people.

Additionally - but this may be out of the scope of P&C - we may have to specify how to compare the feature names (and probably the other attributes) with the IRIs/features implemented in the WUA.

Thanks.

Kind regards,
Marcin

Marcin Hanclik
ACCESS Systems Germany GmbH
Tel: +49-208-8290-6452  |  Fax: +49-208-8290-6465
Mobile: +49-163-8290-646
E-Mail: marcin.hanclik@access-company.com

-----Original Message-----
From: marcosscaceres@gmail.com [mailto:marcosscaceres@gmail.com] On Behalf Of Marcos Caceres
Sent: Monday, July 27, 2009 6:24 PM
To: Marcin Hanclik
Cc: public-webapps@w3.org; connolly@w3.org; public-iri@w3.org
Subject: Re: [P&C] Low-level internationalization, XML deserialization, IRI or URI, IRI normalization

2009/7/25 Marcin Hanclik <Marcin.Hanclik@access-company.com>:
> Hi Marcos, All,
>
> Regarding the usage of IRI in the widget configuration document, I do not know which specification is responsible for mandating the IRI normalization.
> It is possible that I simply have not yet found the proper existing explanation to the issue, so if you know it, I would be grateful to get this information.
>
> These are more details.
>
> The P&C spec mixes the targets of the grammars (or low-level format specifications) it operates on.
> E.g.
> the sections about Zip archive operate on bytes
> http://www.w3.org/TR/widgets/#zip-archive

> http://www.w3.org/TR/widgets/#version-of-zip-needed-to-extract-a-file-

>
> and
> zip-rel-path grammar
> http://www.w3.org/TR/widgets/#zip-rel-path

> operates on characters, not bytes (it may not be fully clear from the P&C text).
>
> XML Fifth Edition refers only to URI specification, it does not know about IRI.
>
> WUA must support XML and UTF-8:
> http://www.w3.org/TR/widgets/#dependencies-on-other-specifications-and

>
> The configuration document is only required to be XML:
> http://www.w3.org/TR/widgets/#configuration-document

> and its encoding may be virtually any that is registered with IANA (my assumption).
>
> So we can have the following situation:
> The WUA, that I develop widgets for, has a very interesting feature, whose IRI is really international (Polish in this case):
>
> http://example.com/ŁódzkiŚpiewnikŹdźbłowy

>
> I.e. the IRI contains characters outside of the US-ASCII character set.
> Then, I may not have a UTF-X capable editor at hand, so I convert the IRI to a URI as in
> http://tools.ietf.org/html/rfc3987#section-3.1, Step 2. and
> I write the following config.xml with US-ASCII legacy encoding:
>
> <?xml version="1.0" encoding="us-ascii"?>
> <widget …>
> …
> <feature name="http://example.com/%C5%81%C3%B3dzki%C5%9Apiewnik%C5%B9d%C5%BAb%C5%82owy" />
> …
> </widget>
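
The IRI-to-URI conversion used for the config.xml above can be sketched in Python as follows. This is a simplified reading of RFC 3987 section 3.1, Step 2: every non-ASCII character is replaced by the percent-encoded octets of its UTF-8 form. The function name `iri_to_uri` is an assumption made for this sketch, and the sketch deliberately ignores the special handling of internationalized host names and the normalization of the input string that the RFC also discusses.

```python
# Sketch only: percent-encode the non-ASCII characters of an IRI using
# UTF-8, roughly as in RFC 3987 section 3.1, Step 2. ASCII characters,
# including existing "%XX" sequences, are passed through unchanged.
# Assumption: the host part is plain ASCII, so no IDN handling is needed.

def iri_to_uri(iri: str) -> str:
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            # US-ASCII character: copy as is.
            out.append(ch)
        else:
            # Non-ASCII character: one "%XX" triplet per UTF-8 octet.
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


# The example IRI from this message, written with escapes.
EXAMPLE_IRI = "http://example.com/\u0141\u00f3dzki\u015apiewnik\u0179d\u017ab\u0142owy"

print(iri_to_uri(EXAMPLE_IRI))
# http://example.com/%C5%81%C3%B3dzki%C5%9Apiewnik%C5%B9d%C5%BAb%C5%82owy
```

The result is pure US-ASCII, so it can be stored in a configuration document that uses any ASCII-compatible legacy encoding.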
>
> http://tools.ietf.org/html/rfc3987#section-3.2 provides a method to convert URI to IRI.
> However, I am not sure whether this conversion is mandated in P&C, since P&C just says that e.g. the name attribute is an IRI:
> I am not sure whether it should have IRI syntax in config.xml (not possible in my case, since I use US-ASCII only) or later.
>
> Percent encoding is allowed in IRIs:
> http://tools.ietf.org/html/rfc3987#section-2.2

> and
> "Terminals in the ABNF are characters, not bytes."
>
> Therefore it seems possible that the above config.xml, when parsed by XML- and UTF-8-supporting WUA, will refer to a feature whose IRI would be
>
> http://example.com/%C5%81%C3%B3dzki%C5%9Apiewnik%C5%B9d%C5%BAb%C5%82owy

>
> on the character level. Then, this valid IRI has to be checked for equivalence with
>
> http://example.com/ŁódzkiŚpiewnikŹdźbłowy

>
> based on the algorithm specified in http://tools.ietf.org/html/rfc3987#section-5.1

> and
> http://tools.ietf.org/html/rfc3987#section-5.3.1

>
> http://tools.ietf.org/html/rfc3987#section-5.3.2, specifically section 5.3.2.3 mentions percent-encoding normalization.
> I am not sure whether DOM3Core Load&Save mechanisms perform such normalization (as also below).
> P&C does not specify it.
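
To make the missing step concrete, here is a minimal Python sketch of one possible normalization that a WUA could apply to both IRIs before comparing them. It is an assumption, not something P&C or DOM3LS specifies: it upper-cases the hex digits of percent-encodings, decodes percent-encoded ASCII unreserved characters (in the spirit of section 5.3.2.3), and additionally decodes percent-encoded UTF-8 sequences for non-ASCII characters (in the spirit of the URI-to-IRI conversion of section 3.2). The names `normalize_for_comparison` and `_decode_run` are hypothetical, and the test for "non-ASCII character allowed in an IRI" is simplified to "code point U+00A0 or above".

```python
import re

# ASCII "unreserved" characters from RFC 3986 section 2.3.
_UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)

# One or more consecutive "%XX" triplets.
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _decode_run(match):
    """Decode one run of percent-encoded octets where that is safe."""
    run = match.group(0)
    raw = bytes.fromhex(run.replace("%", ""))
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 as a whole: keep the run, only upper-case the hex.
        return run.upper()
    out = []
    for ch in text:
        if ch in _UNRESERVED or ord(ch) >= 0xA0:
            # Simplification: treat every code point >= U+00A0 as decodable.
            out.append(ch)
        else:
            # Reserved or other ASCII: keep it percent-encoded, upper-case hex.
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


def normalize_for_comparison(iri: str) -> str:
    """Return a form of the IRI intended only for equivalence checks."""
    return _PCT_RUN.sub(_decode_run, iri)


declared = "http://example.com/%C5%81%C3%B3dzki%C5%9Apiewnik%C5%B9d%C5%BAb%C5%82owy"
implemented = "http://example.com/\u0141\u00f3dzki\u015apiewnik\u0179d\u017ab\u0142owy"

print(normalize_for_comparison(declared) == normalize_for_comparison(implemented))  # True
```

With such a step applied on both sides, the percent-encoded feature name from the US-ASCII config.xml and the non-ASCII IRI compare as equal under plain string comparison; without it, they do not.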
>
> P&C says:
> "An attribute defined as containing a valid IRI. A valid IRI is one that matches the IRI  token of the [RFC3987] specification."
> Again, is it the syntax in config.xml (i.e. impossible on byte level) or later?
>
> DOM3Core http://www.w3.org/TR/DOM-Level-3-Core/core.html says
> "A solution for loading a Document and saving it persistently is proposed in [DOM Level 3 Load and Save]."
>
> DOM3LS http://www.w3.org/TR/DOM-Level-3-LS/load-save.html will normalize the entities AFAIK, but probably will not normalize percent-encoded characters in URI/IRI.
>
> Proposal
>
> http://www.w3.org/TR/REC-xml-names/#iri-use says:
> "Because of the risk of confusion between URIs that would be equivalent if dereferenced, the use of %-escaped characters in namespace names is strongly discouraged."
>
> So maybe P&C shall state something similar, e.g.
>
> "Because of the risk of confusion between IRIs that would be equivalent if dereferenced, the use of %-escaped characters in feature names is strongly discouraged."
>

Yeah, that seems reasonable. I've added it.

> This could result in percent-encoded IRIs not being present in the configuration document, and in the configuration document developer needing a UTF-8 capable editor (this may be too hard a requirement; it is just a proposal).
>
> Alternatively, we could specify in P&C that the attributes – that are currently specified as being IRI – shall actually be "IRI or URI" depending on the encoding of the config.xml.
>
> A third option would be to say something about IRI/URI normalization.
>
> More comments:
>
> The part of
> http://www.w3.org/html/wg/href/draft.html#parsing-urls

> namely:
> "How does this compare to just parsing using the IRI grammar of RFC 3987?"
> makes me think that the problem (I assume my problem and the Web addresses are similar) is not yet fully solved in any spec.
> I am sorry for any ignorance if such is identified.
>
> The latest draft for IRI is this one:
> http://tools.ietf.org/html/draft-duerst-iri-bis-06

> and it is being discussed also in W3C, see e.g. very recent comments from Anne at
> http://lists.w3.org/Archives/Public/public-iri/2009Jul/

>
> These are the documents that could help more:
> http://www.w3.org/International/articles/idn-and-iri/

> http://www.w3.org/html/wg/href/draft-ietf.html (seems to be just a newer version of the above draft)
>
> Please let me know what you think.
> Thanks.
>
> Kind regards,
> Marcin
>
> ________________________________________
>
> Access Systems Germany GmbH
> Essener Strasse 5  |  D-46047 Oberhausen
> HRB 13548 Amtsgericht Duisburg
> Geschaeftsfuehrer: Michel Piquemal, Tomonori Watanabe, Yusuke Kanda
>
> www.access-company.com
>
> CONFIDENTIALITY NOTICE
> This e-mail and any attachments hereto may contain information that is privileged or confidential, and is intended for use only by the
> individual or entity to which it is addressed. Any disclosure, copying or distribution of the information by anyone else is strictly prohibited.
> If you have received this document in error, please notify us promptly by responding to this e-mail. Thank you.
>
>



--
Marcos Caceres
http://datadriven.com.au


Received on Tuesday, 28 July 2009 10:29:11 UTC