Re: Fwd: Re: HRRIs, IRIs, etc from Martin Duerst on 2007-06-22 (public-xml-core-wg@w3.org from June 2007)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Fri, 22 Jun 2007 19:39:25 +0900
To: Richard Tobin <richard@inf.ed.ac.uk>, Bjoern Hoehrmann <derhoermi@gmx.net>, "Grosso, Paul" <pgrosso@ptc.com>
Cc: <public-iri@w3.org>, <www-xml-linking-comments@w3.org>, <public-xml-core-wg@w3.org>, <public-i18n-core@w3.org>
Message-Id: <6.0.0.20.2.20070622181646.0692f910@localhost>

Hello Richard,

At 00:59 07/06/21, Richard Tobin wrote:

>> You should simply drop this effort and use IRI References instead. There
>> is a high cost associated with yet another notion of resource identifier
>> technology
>
>This is not another notion of resource identifier.  It is the existing
>notion used for XML system identifier, XLink href, and several other
>things.  We are merely providing a name and a single place for a
>definition that already exists in multiple specs.

If these things are not resource identifiers, then what are they?

>> Simply prohibit anything but IRI references

That would constitute a normative change to several specs.
In my opinion, that may be inappropriate for spaces and
a few other characters, in particular in the context of XPointer,
but it would definitely be highly appropriate for arbitrary
control characters (if you have ever encountered a URI/IRI
with an arbitrary control character, i.e. not TAB/CR/LF, I'd really
like to know).

>> and,
>> if necessary, specify "utf-8-percent-escape all disallowed characters"
>> as error recovery method.

That would not be a normative change, at least not if you consider
observable behavior to be the relevant criterion.

At least for the XML spec itself, there may be a point of
view that simply saying "it's an IRI" won't change anything.
I'll try to explain this below.

Looking at the definition of a PubidLiteral for a moment
(http://www.w3.org/TR/REC-xml/#NT-PubidLiteral), it just
specifies a range of characters that can be used, nothing
more in terms of syntax, although it could be argued that
some syntaxes (those with several // included) are much
more likely, or even highly expected to make a PubidLiteral
usable in a wider context (which 'Public' suggests in the
first place).

Likewise, the syntax for SystemLiteral is specified simply
as a string of characters (from a much wider repertoire).
To say that this is an IRI does not restrict this syntax.
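
As a rough illustration of how little syntax these two productions impose,
here is a minimal Python sketch. The function names are hypothetical, and
the check is simplified: it assumes double-quoted literals and does not
verify that SystemLiteral content consists of legal XML characters. The
PubidChar set is taken from the XML 1.0 Recommendation.

```python
import re

# PubidChar from XML 1.0:
#   #x20 | #xD | #xA | [a-zA-Z0-9] | [-'()+,./:=?;!*#@$_%]
# (hypothetical helper name; assumes a double-quoted literal)
_PUBID_CONTENT = re.compile(r"[ \r\na-zA-Z0-9\-'()+,./:=?;!*#@$_%]*")


def is_pubid_literal_content(text: str) -> bool:
    """True if every character of text is a PubidChar."""
    return _PUBID_CONTENT.fullmatch(text) is not None


def is_system_literal_content(text: str, quote: str = '"') -> bool:
    """True if text could sit between the given quotes as a SystemLiteral.

    The only restriction checked here is that the delimiting quote itself
    does not appear; nothing URI- or IRI-specific is tested.
    """
    return quote not in text
```

A string such as "http://example.org/a b{c}" fails the PubidLiteral check
(because of the braces) but passes the SystemLiteral check unchanged, which
is the point made above: the grammar alone does not restrict it to URI or
IRI syntax.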

It is a well acknowledged fact that URI and IRI syntax are
very difficult to check (because there are scheme-dependent
restrictions, and so on) and that therefore, any strict
checking (in the way e.g. the XML syntax is checked for
well-formedness) is not appropriate for URIs or IRIs.

The rest (namely conversion of disallowed characters to
%hh-encoding) seems to already be covered under the following
paragraph from the IRI spec:

   Systems accepting IRIs MAY also deal with the printable characters in
   US-ASCII that are not allowed in URIs, namely "<", ">", '"', space,
   "{", "}", "|", "\", "^", and "`", in step 2 above.  If these
   characters are found but are not converted, then the conversion
   SHOULD fail.  Please note that the number sign ("#"), the percent
   sign ("%"), and the square bracket characters ("[", "]") are not part
   of the above list and MUST NOT be converted.  Protocols and formats
   that have used earlier definitions of IRIs including these characters
   MAY require percent-encoding of these characters as a preprocessing
   step to extract the actual IRI from a given field.  This
   preprocessing MAY also be used by applications allowing the user to
   enter an IRI.
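
To make the conversion described in that paragraph concrete, here is a
minimal Python sketch. The function name is hypothetical, and it simplifies
the IRI spec's mapping: every non-ASCII character is encoded as UTF-8 and
percent-escaped, the printable US-ASCII characters listed above are
converted as well, and "#", "%", "[" and "]" are left alone. It is one
possible reading, not a complete implementation of the IRI-to-URI mapping.

```python
# Printable US-ASCII characters that the quoted paragraph says MAY be
# converted: < > " space { } | \ ^ `
CONVERTIBLE_ASCII = set('<>" {}|\\^`')


def escape_for_uri(text: str) -> str:
    """Percent-escape convertible ASCII and all non-ASCII characters.

    Hypothetical helper: '#', '%', '[' and ']' pass through unchanged,
    as the quoted paragraph requires.
    """
    out = []
    for ch in text:
        if ch in CONVERTIBLE_ASCII or ord(ch) > 0x7F:
            # Encode as UTF-8, then write each byte as %HH.
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)
```

With this sketch, "http://example.org/a b" becomes "http://example.org/a%20b",
while a string such as "x#frag%41[1]" comes back unchanged.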

I'm not saying that this interpretation is the only one possible,
and I'm not sure how it would apply to XLink and others, but
I wanted to show it here as one point of view.

Regards,    Martin.

>That would constitute a normative change to several specs.
>
>-- Richard

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp

Received on Friday, 22 June 2007 10:40:22 UTC