Re: Proposed resolution of HRRI/IRI discussion

From: Konrad Lanz <Konrad.Lanz@iaik.tugraz.at>
Date: Fri, 02 Nov 2007 15:47:49 +0100
Message-ID: <472B3895.1020709@iaik.tugraz.at>
To: Richard Tobin <richard@inf.ed.ac.uk>
CC: Martin Duerst <duerst@it.aoyama.ac.jp>, "Grosso, Paul" <pgrosso@ptc.com>, Richard Ishida <ishida@w3.org>, public-i18n-core@w3.org, public-xml-core-wg@w3.org, public-iri@w3.org
Martin Duerst wrote:
>> My understanding was that one of the motivations behind Legacy 
>> Extended IRIs was to allow as little escaping as possible e.g. in 
>> XPointers.

One could potentially go as far as replacing % with %25 iff the % is not
followed by two hex digits and hence does not start a percent-encoding,
or percent-encoding every # except the first one from the left ...
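A minimal sketch of those two lenient rules, in Python. The function names
are hypothetical and this is only an illustration of the idea, not anything
taken from the HRRI draft or the referring specifications:

```python
import re

# A '%' that is NOT followed by two hex digits cannot start a
# percent-encoding, so it is treated as data and escaped as %25.
_STRAY_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def escape_stray_percent(s: str) -> str:
    """Replace each '%' that does not start a percent-encoding by '%25'."""
    return _STRAY_PERCENT.sub("%25", s)


def escape_extra_hashes(s: str) -> str:
    """Keep the first '#' from the left; percent-encode all later ones."""
    head, sep, tail = s.partition("#")
    return head + sep + tail.replace("#", "%23")


def lenient_escape(s: str) -> str:
    """Apply both rules (hypothetical helper for illustration)."""
    return escape_extra_hashes(escape_stray_percent(s))


print(lenient_escape("#xpointer(//*[@atr='#%%%'])"))
```

Note that square brackets are left alone here, in line with percent-encoding
"as little as reasonably possible".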

Well my understanding would be to percent-encode "as little as
reasonably possible" and that would include square brackets in the fragment.

Richard Tobin wrote:
> That's an attraction of them, but as far as the XML Core WG is 
> concerned our motivation is merely to simplify and clarify our specs,
>  several of which describe identifiers of this kind.  So we want 
> LEIRIs to match those definitions, and they don't currently allow 
> square brackets unescaped.

I would argue they do allow them, obviously for IPv6 hosts and more
interestingly in the fragment of a URI reference by referring to
RFC 2732 and its amendment to the grammar of RFC 2396.

My guess would be that many implementations are lenient and accept non
percent-encoded square brackets in the fragment anyway so one might just
as well legalize them for LEIRI / HRRI.

> I haven't checked the ancient history of this, but even XML 1.0 2nd 
> edition knew about RFC 2732 and excluded square brackets: 
> http://www.w3.org/TR/2000/REC-xml-20001006#sec-external-ent

This section about escaping - found in various RECs such as XPointer
Framework, XMLDsig, XLink, XInclude, XML Base, RDF ... - *excludes* square
brackets [] from the list of characters that will be/should be escaped
for some string to *potentially* be/become a URI reference.

http://www.ietf.org/internet-drafts/draft-walsh-tobin-hrri-01.txt :
> ... which allow the use of characters which must be escaped in a 
> legal IRI, such as delimiters and a few other ASCII characters. 
> Examples include XML System Identifiers[4], the href attribute in 
> XLink[5], and XML Base attributes[6]. These specifications all 
> describe, with slightly different wording, the same algorithm for 
> converting that string to a URI or IRI.

What "these specifications" (except for RDF and XLink) usually don't say
is that the described percent-encoding is necessary, but not necessarily
sufficient, for a string to be converted to a valid URI reference.

What "these specifications" from my reading actually do say however is
that one does not percent-encode %, #, [ and ] at all.

And "these specifications" often do not make clear that - as we know
today - this is to be understood as an algorithm for interpreting the
value rather than a constraint on the value itself.

Although this is not entirely true, as the grammar of RFC 2396 + RFC 2732
/ RFC 3986 still reflects back on the set of allowable values via the
interpretation. Otherwise #xpointer(//*[@atr='#%%%']) would be valid
until it is actually dereferenced, and one would have to distinguish
between allowable and interpretable values, which I would find very
confusing.
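To make this concrete, here is a small Python sketch of the fragment
production as I read it in RFC 2396 (fragment = *uric), with "[" and "]"
moved into 'reserved' as per RFC 2732. The names are hypothetical and the
character classes are my transcription of the grammar, so please check them
against the RFCs before relying on them:

```python
import re

# uric = reserved | unreserved | escaped
# 'reserved' as in RFC 2396, plus "[" and "]" as added by RFC 2732.
_RESERVED = r";/?:@&=+$,\[\]"
# unreserved = alphanum | mark
_UNRESERVED = r"A-Za-z0-9\-_.!~*'()"
# escaped = "%" hex hex
_FRAGMENT = re.compile(
    rf"(?:[{_RESERVED}{_UNRESERVED}]|%[0-9A-Fa-f]{{2}})*"
)


def is_valid_fragment(fragment: str) -> bool:
    """True if the text after the first '#' matches *uric."""
    return _FRAGMENT.fullmatch(fragment) is not None


# Square brackets pass, a second '#' and a bare '%' do not.
print(is_valid_fragment("xpointer(//*[@atr])"))
print(is_valid_fragment("xpointer(//*[@atr='#%%%'])"))
print(is_valid_fragment("xpointer(//*[@atr='%23%25%25%25'])"))
```

Under these assumptions the example above is rejected only because of the
extra # and the stray % characters, not because of the square brackets.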

What "these specifications" should have said however is that there is a
constraint to percent-encode %, #, [ and ] unless used for
percent-encoding itself, separating the fragment or delimiting an IPv6
host respectively.

Or better: %, #, [ and ] are not percent-encoded before the
interpretation, because they are assumed to be only used for
percent-encoding, separating the fragment or delimiting an IPv6 host
respectively and hence must be escaped if used otherwise as data.

Unfortunately they didn't. Hopefully referring to LEIRI/HRRI will
clarify the situation, assuming a value is constrained to be a
LEIRI/HRRI and isn't, as now, just interpreted as a LEIRI/HRRI plus yet
again some additional percent-encoding rules.

The silence of "these specifications" is, from my point of view,
confusing but no big problem for % and #: looking at the grammar
of RFC 2396 as amended by RFC 2732, one recognizes that % can only be
used for percent-encoding and # only for separating the fragment.

The subtle difference with square brackets [] however is that the
grammar in RFC 2732 actually allows them in fragments, whereas
its prose didn't, and it is hence at least to some extent ambiguous,
and so are the referring specifications.

Konrad

-- 
Konrad Lanz, IAIK/SIC - Graz University of Technology
Inffeldgasse 16a, 8010 Graz, Austria
Tel: +43 316 873 5547
Fax: +43 316 873 5520
https://www.iaik.tugraz.at/aboutus/people/lanz
http://jce.iaik.tugraz.at

Certificate chain (including the EuroPKI root certificate):
https://europki.iaik.at/ca/europki-at/cert_download.htm

Received on Friday, 2 November 2007 14:48:31 GMT