Re: Proposed resolution of HRRI/IRI discussion from Konrad Lanz on 2007-11-05 (public-xml-core-wg@w3.org from November 2007)

From: Konrad Lanz <Konrad.Lanz@iaik.tugraz.at>
Date: Mon, 05 Nov 2007 14:36:36 +0100
To: Richard Tobin <richard@inf.ed.ac.uk>
CC: Martin Duerst <duerst@it.aoyama.ac.jp>, "Grosso, Paul" <pgrosso@ptc.com>, Richard Ishida <ishida@w3.org>, public-i18n-core@w3.org, public-xml-core-wg@w3.org, public-iri@w3.org
Message-ID: <472F1C64.1050507@iaik.tugraz.at>
Hi Richard,

Richard Tobin wrote:
>> Richard Tobin wrote:
>>> That's an attraction of them, but as far as the XML Core WG is 
>>> concerned our motivation is merely to simplify and clarify our 
>>> specs, several of which describe identifiers of this kind.  So we
>>>  want LEIRIs to match those definitions, and they don't currently
>>>  allow square brackets unescaped.
> Konrad Lanz wrote:
>> I would argue they do allow them, obviously for IPv6 hosts and more
>>  interestingly in the fragment of a URI reference by referring to 
>> RFC 2732 and its amendment to the grammar of RFC 2396.
> 
> Sorry, I should have said the existing definitions require 
> implementations to leave square brackets untouched, and not escape 
> them.

Let's assume this is what the specs currently say (the specs as they
stand, not LEIRI/HRRI, which they will refer to in a way that is not yet
defined) and let's call this Assumption I (A-I).

> [...] I would expect them to do the %-encoding specified and then
> pass the result on to a generic URI-retrieval library function.

So all the characters specified will be percent-encoded and
*potentially* result in a valid URI reference or the generic
URI-retrieval library would throw an error.

Let's call this Assumption II (A-II).
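
A minimal sketch of the escaping step in A-II, in Python. The set of
escaped characters is an assumption based on the XLink/XML Base style
wording (all non-ASCII characters plus the RFC 2396 excluded characters,
except #, % and the square brackets); legacy_escape is a hypothetical
name, not taken from any of the specs.

```python
# Assumed set of ASCII characters that the legacy rule escapes,
# in addition to controls, space and all non-ASCII characters.
# Deliberately NOT escaped: '#', '%', '[' and ']'.
ESCAPED_ASCII = set('<>"{}|\\^`')


def legacy_escape(value):
    """Percent-encode only the characters the legacy rule names."""
    out = []
    for ch in value:
        code = ord(ch)
        if code <= 0x20 or code >= 0x7F or ch in ESCAPED_ASCII:
            # Encode as UTF-8 and escape every resulting byte as %HH.
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)
```

The result is only *potentially* a URI reference: anything containing
stray #, % or square brackets passes through this step unchanged.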

>> This section about escaping - found in various RECs like XPointer 
>> Framework, XMLDsig, XLink, XInclude, XML Base, RDF ... - *excludes* 
>> square brackets [] from the list of characters that will be/should
>>  be escaped for some string to *potentially* be/become a URI 
>> reference.
> 
> Yes.
> 
>> What "these specifications" usually don't say is that the described
>>  percent-encoding is necessary, but not necessarily sufficient for 
>> strings to be converted to a valid URI reference. (except for RDF, 
>> XLink)
> 
> They say what the implementation should do to make them into URIs. If
>  after that they aren't URIs, then the user has made an error.

So if an implementation receives, for instance, the value
"#xpointer(//*[@attr='%#true#%25'])" as input from the user, it will
accept it according to A-I, not percent-encode it, and put it into the
XML document as is.

On accessing the value, however, implementations would then *try* to
escape it according to A-II and throw an error, because the result is
still not a valid URI reference.
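
To make the failure concrete, here is a rough Python sketch that lists
the characters in the fragment that the RFC 3986 fragment grammar does
not accept. The function name invalid_fragment_chars is hypothetical,
and the check covers the fragment only, not the rest of the reference.

```python
import re

# One valid fragment token per RFC 3986: an unreserved or sub-delim
# character, ':', '@', '/', '?', or a %HH percent-encoded triplet.
_FRAGMENT_TOKEN = re.compile(
    r"[A-Za-z0-9\-._~!$&'()*+,;=:@/?]|%[0-9A-Fa-f]{2}"
)


def invalid_fragment_chars(reference):
    """Return the characters after the first '#' that are not allowed."""
    _, _, fragment = reference.partition("#")
    bad = []
    i = 0
    while i < len(fragment):
        match = _FRAGMENT_TOKEN.match(fragment, i)
        if match:
            i = match.end()
        else:
            bad.append(fragment[i])
            i += 1
    return bad
```

For the value above this reports the two square brackets, the two
additional # characters and the % that is not followed by two hex
digits.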

I would assume, however, that the cost of throwing the error at this
stage is higher than the cost of escaping square brackets in the
fragment: percent-encoding is considerably cheaper than reporting the
error back to the original author of the value. (The XML document
containing the value has quite likely traveled, been published or been
sent to someone else already. The same might apply to every # except the
first one from the left, and to every % that is not followed by two hex
digits.)

That is why I believe that implementations would be, or become, lenient
and accept such lapses. I also sense that language similar to the
previous paragraph might sneak into specs again, and we would
essentially be back where we are now.
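
A hedged sketch of what such lenient behaviour might look like in
Python: escape square brackets, every # after the first, and every % that
does not start a %HH triplet, all within the fragment only. This is an
illustration under those assumptions, not something any of the specs
prescribe, and repair_fragment is a hypothetical name.

```python
import re


def repair_fragment(reference):
    """Leniently percent-encode stray '%', '#', '[' and ']' in the fragment."""
    head, sep, fragment = reference.partition("#")
    if not sep:
        return reference  # no fragment, nothing to repair
    # Escape stray '%' first so the escapes added below stay intact.
    fragment = re.sub(r"%(?![0-9A-Fa-f]{2})", "%25", fragment)
    fragment = (
        fragment.replace("#", "%23")
        .replace("[", "%5B")
        .replace("]", "%5D")
    )
    return head + "#" + fragment
```

Note that an existing "%25" is left alone, so the repair cannot tell an
already escaped % from a literal "%25" the author meant to write; that is
exactly the kind of conflict with the original meaning that makes this
leniency risky.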

I do understand, however, that making the LEIRI and HRRI specs more
tolerant would make them more complicated. I would assume that the added
complexity would be minimal if the only addition were to allow square
brackets in the fragment. This should be considered, if only for the
sake of honoring the "legacy" grammar amendments of RFC 2732 and for
being compatible with non-percent-encoded square brackets in XPointers
and XPath (see next quote).

http://www.w3.org/TR/xpath :
http://www.w3.org/TR/xpath20/ :
> [...] XPath uses a compact, non-XML syntax to facilitate use of XPath
> within URIs and XML attribute values. [...]

>> What "these specifications" from my reading actually do say however
>>  is that one does not percent-encode %, #, [ and ] at all.
>> 
>> And "these specifications" often do not make clear that - as we 
>> know today - this is to be understood as an algorithm for 
>> interpreting the value rather than a constraint on the value 
>> itself.

> The algorithm specifies what the implementation must do in order to 
> produce a URI.  If that doesn't produce a URI, it's an error. Perhaps
>  this is not explicit enough in all the specifications, but that's 
> certainly the intention.

What I'm still not quite sure about is whether the intention currently
is to

A) throw an error when generating the value, before putting it into the
actual XML document (or throw the error on validation), which is
equivalent to saying the value is a LEIRI/HRRI, or

B) throw an error on interpreting/dereferencing/absolutizing/accessing
the value, which conforms to A-I.

>> Unfortunately they didn't and hopefully referring to LEIRI/HRRI 
>> will clarify the situation assuming a value has the constraint of 
>> being a LEIRI/HRRI and isn't as now just interpreted as LEIRI/HRRI
>>  plus yet again some additional percent-encoding rules.
> 
> Yes, the idea is to say that these values must be LEIRIs.

Sounds more like A).
If so, it may be worth saying this explicitly somewhere.

I just have doubts that this will actually be adopted consistently
across all the specs in this way, as it would contradict A-I.

Further, the ambiguous grammar of RFC 2732 and the use of XPointer
(although it has been a WD since 2002 or so) have already led to deployed
documents not properly escaping square brackets (cf. EBICS).

To conclude, I think it's good to move towards A), but one should
consider that the ambiguities in RFC 2732 and the current algorithms in
the current specs have been misunderstood as actually allowing [, ], %
and # in the fragment where they do not conflict with their original
meaning.

Konrad

-- 
Konrad Lanz, IAIK/SIC - Graz University of Technology
Inffeldgasse 16a, 8010 Graz, Austria
Tel: +43 316 873 5547
Fax: +43 316 873 5520
https://www.iaik.tugraz.at/aboutus/people/lanz
http://jce.iaik.tugraz.at

Certificate chain (including the EuroPKI root certificate):
https://europki.iaik.at/ca/europki-at/cert_download.htm
Received on Monday, 5 November 2007 13:37:26 UTC