HRRI vs IRI in XML

Hi,

Sorry I was out of the loop for a bit. I see from the email threads
that we've got some improved wording proposed for the list of
characters that have to be escaped if they appear in HRRI and some
improved wording for the security considerations section. I'll
incorporate those as soon as I can.

However, as far as I can tell, we still don't have a clear
understanding about whether we need HRRI or not.

Here's how I see it. Sorry if this is a little repetitive; I'm hoping
that considering this issue from a higher level again will help.

1. The XML Recommendation says that a system identifier consists of a
single or double quote followed by any characters followed by a
matching quote:

  SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")

Any attempt to limit the characters allowed in a system identifier
would be a backwards incompatible change to XML. That is simply not an
option.
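
To make the point concrete, here is a minimal Python sketch of that
production as a regular expression. The names SYSTEM_LITERAL and
is_system_literal are mine, purely for illustration, and the sketch
ignores the separate constraint that the content must consist of
legal XML characters.

```python
import re

# SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")
SYSTEM_LITERAL = re.compile(r'"[^"]*"|\'[^\']*\'')

def is_system_literal(text):
    """Return True if text matches the SystemLiteral production."""
    return SYSTEM_LITERAL.fullmatch(text) is not None

examples = [
    '"http://example.com/doc.dtd"',   # an ordinary URI
    '"my docs/<draft>.dtd"',          # space, "<" and ">"
    "'r\u00e9sum\u00e9 {v2}.dtd'",    # non-ASCII, space, "{" and "}"
    '"unbalanced\'',                  # quotes do not match
]
results = [is_system_literal(e) for e in examples]
```

Only the last example is rejected, and only because its quotes don't
match. Nothing in the production cares whether the content is a URI.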

2. Because we knew that system identifiers allowed characters that
couldn't appear in URIs, we added some wording to clarify how
processors must escape those characters if they needed URIs.

Over time, this text was refined, using fragments taken from drafts of
the IRI spec, and is now "cut-and-pasted" into several
recommendations.

It's become clear that this cut-and-paste approach is tedious and
error-prone and does not scale. Asking future specs to continue this
cut-and-paste process from one or another of the existing specs is
just not helpful to the community.
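
For readers who haven't looked at that wording lately, the escaping
it describes is roughly the following. This is a sketch under my own
assumptions, not a quotation of any one Recommendation: I assume the
characters to escape are controls, space, all non-ASCII characters
and the ASCII characters < > " { } | \ ^ `, that each is converted to
UTF-8 and every octet written as %HH, and that "#", "%", "[" and "]"
are left alone. The function name is hypothetical.

```python
# Assumption: these ASCII characters are escaped, in addition to
# controls and all non-ASCII characters. "#", "%", "[" and "]" are
# deliberately not in the set.
ASCII_TO_ESCAPE = set('<>"{}|\\^` ')

def escape_for_uri(identifier):
    """Percent-encode the characters assumed to be disallowed in a URI."""
    out = []
    for ch in identifier:
        cp = ord(ch)
        if cp < 0x21 or cp > 0x7E or ch in ASCII_TO_ESCAPE:
            # Convert to UTF-8, then write each octet as %HH.
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)

escaped = escape_for_uri("my docs/r\u00e9sum\u00e9#v[1].dtd")
# escaped == "my%20docs/r%C3%A9sum%C3%A9#v[1].dtd"
```

Because "%" is never escaped, running the function over its own
output changes nothing. The details that are easy to get wrong here
(which characters, which encoding, what happens to "%") are exactly
the ones that drift when the text is copied from spec to spec.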

3. The HRRI spec proposes to instantiate the very liberal repertoire
of characters allowed in a system identifier (and all the other
places) in a short, stand-alone specification. This specification will
have a name and will be available for normative reference.

I understand that perhaps the world would be a better place if we
didn't need another name for another flavor of a string that serves
the role of identifying a resource. But that's not an option; see
point 1.

Martin's message that quoted this paragraph from the IRI spec gave a
glimmer of hope that perhaps we could avoid point 3:

   Systems accepting IRIs MAY also deal with the printable characters in
   US-ASCII that are not allowed in URIs, namely "<", ">", '"', space,
   "{", "}", "|", "\", "^", and "`", in step 2 above.  If these
   characters are found but are not converted, then the conversion
   SHOULD fail.  Please note that the number sign ("#"), the percent
   sign ("%"), and the square bracket characters ("[", "]") are not part
   of the above list and MUST NOT be converted.  Protocols and formats
   that have used earlier definitions of IRIs including these characters
   MAY require percent-encoding of these characters as a preprocessing
   step to extract the actual IRI from a given field.  This
   preprocessing MAY also be used by applications allowing the user to
   enter an IRI.

Unfortunately, our problem is that system identifiers can contain not
just "printable characters in US-ASCII that are not allowed in URIs"
but a wide range of characters from elsewhere in Unicode that are not
allowed in URIs (or IRIs).
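
A rough way to see the gap is to sort characters into those the
quoted paragraph names and those it doesn't reach. The sketch below
is an illustration only. The ranges are the ucschar production of RFC
3987 as I read it, the category labels are my own, and it does not
attempt to validate IRI syntax.

```python
# The ten printable US-ASCII characters named in the quoted paragraph.
ASCII_LEGACY = set('<>"{}|\\^` ')

# ucschar ranges from RFC 3987: three BMP ranges, planes 1 to 13 up to
# xFFFD, and part of plane 14.
UCSCHAR_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
UCSCHAR_RANGES += [(p << 16, (p << 16) | 0xFFFD) for p in range(1, 14)]
UCSCHAR_RANGES.append((0xE1000, 0xEFFFD))

def classify(ch):
    """Return a coarse category label for a single character."""
    cp = ord(ch)
    if cp < 0x80:
        if ch in ASCII_LEGACY:
            return "ascii-legacy"    # covered by the quoted paragraph
        if cp < 0x20 or cp == 0x7F:
            return "ascii-control"
        return "ascii-other"
    if any(lo <= cp <= hi for lo, hi in UCSCHAR_RANGES):
        return "ucschar"             # fine in an IRI
    # Not ucschar. Private use characters land here too; RFC 3987
    # allows those only in the query component.
    return "not-ucschar"

labels = {name: classify(ch) for name, ch in [
    ("less-than", "<"),
    ("e-acute", "\u00e9"),
    ("next-line", "\u0085"),
    ("private-use", "\ue000"),
    ("noncharacter", "\ufdd0"),
]}
```

The last three are all legal characters in XML 1.0, so they can
appear in a system identifier, yet none of them is in the paragraph's
list.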

Question: Is the paragraph from the IRI spec above intended to be
broader than a literal reading would suggest? Is it the intent of the
IRI spec that systems accepting IRIs MAY also deal with characters not
allowed in URIs by converting them?

If so, then perhaps we can simply say that system identifiers are IRIs
and note this provision in the IRI spec for what I'll call "legacy"
identifiers.

If not, then I think we must proceed with the HRRI spec.

Thoughts?

                                        Be seeing you,
                                          norm

-- 
Norman Walsh <ndw@nwalsh.com> | A great deal may be done by severity,
http://nwalsh.com/            | more by love, but most by clear
                              | discernment and impartial
                              | justice.--Goethe

Received on Wednesday, 18 July 2007 15:52:02 UTC