Re: HRRI vs IRI in XML from Martin Duerst on 2007-08-08 (public-i18n-core@w3.org from July to September 2007)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Wed, 08 Aug 2007 12:56:48 +0900
To: Norman Walsh <ndw@nwalsh.com>, public-xml-core-wg@w3.org, public-iri@w3.org, Richard Ishida <ishida@w3.org>, Felix Sasaki <fsasaki@w3.org>, www-xml-linking-comments@w3.org, public-i18n-core@w3.org
Message-Id: <6.0.0.20.2.20070808112425.080b5660@localhost>
Hello Norm, others,

Sorry for the delay in responding; in summer, everything moves
a bit slower.

At 00:50 07/07/19, Norman Walsh wrote:
>Hi,
>
>Sorry I was out of the loop for a bit. I see from the email threads
>that we've got some improved wording proposed for the list of
>characters that have to be escaped if they appear in HRRI and some
>improved wording for the security considerations section. I'll
>incorporate those as soon as I can.
>
>However, as far as I can tell, we still don't have a clear
>understanding about whether we need HRRI or not.
>
>Here's how I see it. Sorry if this is a little repetitive; I'm hoping
>that considering this issue from a higher level again will help.

I think laying out the issues clearly can only help. Thanks for doing this.

>1. The XML Recommendation says that a system identifier consists of a
>single or double quote followed by any characters followed by a
>matching quote:
>
>  SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")
>
>Any attempt to limit the characters allowed in a system identifier
>would be a backwards incompatible change to XML. That is simply not an
>option.

Well, it would sure look like a backwards-incompatible change on
the spec level. But how many XML documents would indeed turn
non-well-formed if one e.g. disallowed general control characters
in the C0 area (I'm not speaking about TAB/CR/LF)?

As far as I understand, these characters cannot appear in XML 1.0.
They can appear, in the form of numeric character references (NCRs),
in XML 1.1, but the above grammar rule doesn't allow NCRs in
System Literals. The XML REC mentions this explicitly, as follows:
"Note that a SystemLiteral can be parsed without scanning for markup."

So in fact changing the SystemLiteral production to exclude general
C0 control characters wouldn't change anything at all.
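To make the argument above concrete, here is a minimal sketch in Python. It assumes the reading in which the Char production of XML 1.0 also constrains the contents of a SystemLiteral. The function names are made up for illustration and are not taken from any XML library.

```python
import re

# SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")
# Transcribed directly from the production quoted above.
SYSTEM_LITERAL = re.compile(r'"[^"]*"|\'[^\']*\'')


def is_xml10_char(ch: str) -> bool:
    """XML 1.0 Char production:
    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    """
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def matches_system_literal(text: str) -> bool:
    """True if text matches the SystemLiteral production taken in isolation."""
    return SYSTEM_LITERAL.fullmatch(text) is not None


def allowed_in_xml10_document(text: str) -> bool:
    """True if text matches SystemLiteral AND every character is a Char.

    Assumption: Char applies to the whole document, including system
    identifiers. This is the interpretation argued for above.
    """
    return matches_system_literal(text) and all(is_xml10_char(c) for c in text)
```

Under that assumption, a literal such as `'"a\x01b"'` matches the SystemLiteral production taken in isolation but is already rejected by Char, so excluding general C0 controls from the production itself would not make any currently well-formed XML 1.0 document ill-formed.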

[There is potentially another interpretation of the grammar in the
XML spec, which is that the Char production
(http://www.w3.org/TR/REC-xml/#NT-Char) does not restrict the
contents of SystemLiteral, but in that case, it would also not
restrict the contents of http://www.w3.org/TR/REC-xml/#NT-CharData,
which would mean that arbitrary element content could contain
such control characters including NUL characters/bytes.
I think it would probably be best to fix this by explicitly
using the Char production in SystemLiteral and the other
relevant places. If I need to submit an erratum, please tell
me where.]

This is of course different for e.g. C1 control characters and for
URI-like fields in XML attributes or element content. But even for
these, the question remains of how many XML documents there are really
out there that use any of these characters (for any other purpose
than to prove that there are indeed such documents).

>2. Because we knew that system identifiers allowed characters that
>couldn't appear in URIs, we added some wording to clarify how
>processors must escape those characters if they needed URIs.

Well, I think it's actually slightly different. Because we wanted
System Literals to accept something like IRIs (which didn't have
that name yet at that time), we added wording to clarify how to
convert these into URIs. I do not remember the SystemLiteral
production ever having been brought up in the discussion, neither
in the way above (we need to describe the conversion because
SystemLiteral allows anything) nor the other way round (to make
sure that we can use more than just URIs, we have to make the
SystemLiteral production more general than US-ASCII). But
these things happened a long time ago.

My guess is that the main motivation for having the SystemLiteral
production the way it is is that people who wrote the XML spec
understood one of the general principles of URI/IRI syntax, which
is that it's a bad idea to unnecessarily restrict this in specs that
carry URIs/IRIs, because this creates unnecessary dependencies.

>Over time, this text was refined, using fragments taken from drafts of
>the IRI spec, and is now "cut-and-pasted" into several
>recommendations.
>
>It's become clear that this cut-and-paste approach is tedious and
>error-prone and does not scale. Asking future specs to continue this
>cut-and-paste process from one or another of the existing specs is
>just not helpful to the community.

I agree. However, please note that many other W3C specs currently
have circumscriptive texts. In some cases, these have been written
in expectation of the IRI spec being available as an RFC, in other
cases, they are there to allow the use of old terminology (URI) with
new meaning (IRI). For some examples, please see
http://www.w3.org/International/iri-edit/spec-use-survey.html,
a page I have started to put together to get an overview of the
different ways the issue we are discussing here is addressed in
W3C specs. Please feel free to add to that page (if you have
access rights) or to suggest additions.

>3. The HRRI spec proposes to instantiate the very liberal repertoire
>of characters allowed in a system identifier (and all the other
>places) in a short, stand-alone specification. This specification will
>have a name and will be available for normative reference.

The "all" in "all the other places" is misleading, because it very
much depends on the scale at which things are looked at.

>I understand that perhaps the world would be a better place if we
>didn't need another name for another flavor of a string that serves
>the role of identifying a resource. But that's not an option, see
>point 1.

I don't think it's productive to write "that's not an option" without
actual technical arguments to back it up. I'm still waiting for the first
XML document that contains any of the characters in question in
any of the URI/IRI-like slots under discussion here (of course
this would exclude documents that have been created just to show
that such documents exist, but I haven't even seen one of these).
I'm still waiting for anybody to come forward and claim that they
actually need or want to use any of the obscure "characters"
(not talking about printable US-ASCII or TAB/CR/LF/Space here).

If the XML Core WG said "we think that the risk is extremely
low, but we don't want to take this risk", I could to some
extent understand this, and it's ultimately the job of the
XML Core WG to decide how they want to proceed with their
specs. However, I think that the overall effect on the community
should be considered when looking at the benefits and problems
of different approaches. For the overall community, the
benefit of having a single concept, defined by a single
specification, is very high compared to the issue of the XML
Core WG wanting to save a few lines in a few specs that otherwise
may be needed to avoid a risk that is extremely small.


>Martin's message that quoted this paragraph from the IRI spec gave a
>glimmer of hope that perhaps we could avoid 3.
>
>   Systems accepting IRIs MAY also deal with the printable characters in
>   US-ASCII that are not allowed in URIs, namely "<", ">", '"', space,
>   "{", "}", "|", "\", "^", and "`", in step 2 above.  If these
>   characters are found but are not converted, then the conversion
>   SHOULD fail.  Please note that the number sign ("#"), the percent
>   sign ("%"), and the square bracket characters ("[", "]") are not part
>   of the above list and MUST NOT be converted.  Protocols and formats
>   that have used earlier definitions of IRIs including these characters
>   MAY require percent-encoding of these characters as a preprocessing
>   step to extract the actual IRI from a given field.  This
>   preprocessing MAY also be used by applications allowing the user to
>   enter an IRI.
>
>Unfortunately, our problem is that system identifiers can contain not
>just "printable characters in US-ASCII that are not allowed in URIs"
>but a wide range of characters from elsewhere in Unicode that are not
>allowed in URIs (or IRIs).
>
>Question: Is the paragraph from the IRI spec above intended to be
>broader than a literal reading would suggest? Is it the intent of the
>IRI spec that systems accepting IRIs MAY also deal with characters not
>allowed in URIs by converting them?

This is a very interesting thought. What I have said earlier is that
I think it would be possible to extend the above paragraph to other
kinds of characters in an (already started) update of the IRI spec.

I'm quite a bit more sceptical about dealing with this just as an
erratum, because looking at all the drafts of the IRI spec listed at
http://www.w3.org/International/iri-edit/#Published, there never seems
to have been any question about whether control characters (both
general C0 and all of C1) should be allowed or not. Nobody ever
came up and requested that these be allowed, in any way, and I'm
still not seeing any actual need at all. The above note was specifically
put in to address the actual and expressed needs of some people in
the XML community (see my earlier email with references to the
email archive).
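As an illustration of the IRI spec paragraph quoted above, the following is a rough Python sketch of the optional preprocessing it describes: percent-encoding only the printable US-ASCII characters named in that paragraph, while leaving "#", "%", "[" and "]" untouched. The function name is hypothetical and the sketch covers nothing beyond that one paragraph.

```python
# The ten printable US-ASCII characters listed in the quoted paragraph:
# "<", ">", '"', space, "{", "}", "|", "\", "^", and "`".
LEGACY_ASCII = set('<>" {}|\\^`')


def escape_legacy_ascii(text: str) -> str:
    """Percent-encode the listed characters and pass everything else through.

    "#", "%", "[" and "]" are deliberately not in LEGACY_ASCII, because the
    quoted text says they MUST NOT be converted.
    """
    return "".join(
        "%{:02X}".format(ord(ch)) if ch in LEGACY_ASCII else ch
        for ch in text
    )
```

For example, `escape_legacy_ascii("a b<c>#[x]%20")` gives `a%20b%3Cc%3E#[x]%20`. Control characters pass through unchanged, and this is the gap under discussion here.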


>If so, then perhaps we can simply say that system identifiers are IRIs
>and note this provision in the IRI spec for what I'll call "legacy"
>identifiers.

This is essentially what I proposed, except that this would happen
in a new version of the IRI spec. There is a huge difference between
using an erratum (with very little feedback possibilities from the
community on whether this was indeed intended, and very little
room for adding additional warning text), and an updated spec,
where we can make sure we spend all the necessary time on
getting the wording correct and adding all the necessary
warnings.

>If not, then I think we must proceed with the HRRI spec.

"must" is quite strong. What about looking at what other specs
did? What about going for something along the lines of:

A SystemLiteral SHOULD be an IRI [RFC3987 (or its successor)].
Note: This includes the provision in the IRI spec for dealing with
      printable characters in US-ASCII that are not allowed in URIs.
Note: XML processors MUST/SHOULD also convert characters outside
      the repertoire of characters allowed in IRIs according to
      Section 3.1 of [RFC 3987].

With the erratum, you would have used the first three lines.
Without the erratum, your text gets a bit longer. You might
even want to tweak the second note to cover some of what the
RDF specs say (essentially, processors may issue warnings
when they see something that doesn't conform to the IRI spec).
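To show what the second note would ask of a processor, here is a hedged Python sketch of just the percent-encoding step from Section 3.1 of RFC 3987 (encode the character as UTF-8, then percent-encode each octet), applied to every character that is not allowed in a URI. Extending it to control characters is the assumption made by the note, not something RFC 3987 itself provides for. The name `to_uri_like` is invented for this example, and the other steps of Section 3.1 are left out.

```python
# Characters that may appear literally in a URI (RFC 3986), plus "%" so
# that existing percent-escapes are left alone.
URI_CHARS = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "-._~"          # unreserved punctuation
    ":/?#[]@"       # gen-delims
    "!$&'()*+,;="   # sub-delims
    "%"
)


def to_uri_like(text: str) -> str:
    """Percent-encode, via UTF-8, every character not allowed in a URI.

    This covers non-ASCII characters (as in RFC 3987) and, by assumption,
    also control characters and the printable US-ASCII characters that
    URIs exclude. Lone surrogates are not handled.
    """
    out = []
    for ch in text:
        if ch in URI_CHARS:
            out.append(ch)
        else:
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)
```

A processor following the RDF-style approach mentioned above could additionally issue a warning whenever the output differs from the input for reasons other than non-ASCII characters.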

If you wait for a new version of the IRI spec, you should be
able to then use some text such as:

A SystemLiteral is an IRI according to [RFCXXXX], including
the provisions in section Y.Z of [RFCXXXX].

We will be able to make sure that Y.Z covers your needs, and
hopefully the needs of other W3C (and other) specs, and we will
greatly reduce the confusion for the overall community and have
technology converge to what's really needed, rather than
diverge for the sake of non-existing backwards compatibility
needs.

Hope this helps.


Regards,     Martin.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
Received on Wednesday, 8 August 2007 03:58:23 UTC