Re: HRRI vs IRI in XML from Norman Walsh on 2007-08-09 (public-xml-core-wg@w3.org from August 2007)

From: Norman Walsh <ndw@nwalsh.com>
Date: Thu, 09 Aug 2007 08:47:14 -0400
To: Martin Duerst <duerst@it.aoyama.ac.jp>
CC: public-xml-core-wg@w3.org, public-iri@w3.org, Richard Ishida <ishida@w3.org>, Felix Sasaki <fsasaki@w3.org>, www-xml-linking-comments@w3.org, public-i18n-core@w3.org
Message-ID: <87643olvsd.fsf@nwalsh.com>
/ Martin Duerst <duerst@it.aoyama.ac.jp> was heard to say:
| At 00:50 07/07/19, Norman Walsh wrote:
|>1. The XML Recommendation says that a system identifier consists of a
|>single or double quote followed by any characters followed by a
|>matching quote:
|>
|>  SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")
|>
|>Any attempt to limit the characters allowed in a system identifier
|>would be a backwards incompatible change to XML. That is simply not an
|>option.
|
| Well, it would sure look like a backwards-incompatible change on
| the spec level. But how many XML documents would indeed turn
| non-well-formed if one e.g. disallowed general control characters
| in the C0 area (I'm not speaking about TAB/CR/LF)?
|
| As far as I understand, these characters cannot appear in XML 1.0.
| They can appear, in the form of numeric character references (NCRs),
| in XML 1.1, but the above grammar rule doesn't allow NCRs in
| System Literals. The XML REC mentions this explicitly, as follows:
| "Note that a SystemLiteral can be parsed without scanning for markup."
|
| So in fact changing the SystemLiteral production to exclude general
| C0 control characters wouldn't change anything at all.

Fair enough. That still leaves the non-IRI space character (and other
non-IRI characters?) and a bunch of non-URI characters.
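
To make that concrete, here is a rough Python sketch of the SystemLiteral production quoted above, with and without Martin's suggested Char restriction. The function names are mine, purely for illustration, and the Char ranges are the XML 1.0 ones as I read them; treat it as a sketch rather than a conformance test.

```python
import re

# SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")
SYSTEM_LITERAL = re.compile("\"[^\"]*\"|'[^']*'")


def is_system_literal(text):
    """True if text (quotes included) matches the bare production."""
    return SYSTEM_LITERAL.fullmatch(text) is not None


def is_xml10_char(ch):
    """XML 1.0 Char: TAB, LF, CR, and the non-control, non-surrogate ranges."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_system_literal_with_char(text):
    """The production with every character also required to be a Char."""
    return is_system_literal(text) and all(is_xml10_char(c) for c in text)


# Space and braces pass either way; a C0 control passes only the bare rule.
print(is_system_literal_with_char('"this {string}.html"'))
print(is_system_literal('"a\x01b"'), is_system_literal_with_char('"a\x01b"'))
```

Even with the Char restriction in place, the space and the braces still get through, which is the part that matters for the rest of this thread.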

| [There is potentially another interpretation of the grammar in the
| XML spec, which is that the Char production
| (http://www.w3.org/TR/REC-xml/#NT-Char) does not restrict the
| contents of SystemLiteral, but in that case, it would also not
| restrict the contents of http://www.w3.org/TR/REC-xml/#NT-CharData,
| which would mean that arbitrary element content could contain
| such control characters including NUL characters/bytes.
| I think it would probably be best to fix this by explicitly
| using the Char production in SystemLiteral and the other
| relevant places. If I need to submit an erratum, please tell
| me where.]

This message is probably sufficient.

| This is of course different for e.g. C1 control characters and for
| URI-like fields in XML attributes or element content. But even for
| these, the question remains of how many XML documents there really are
| out there that use any of these characters (for any other purpose
| than to prove that there are indeed such documents).

The XML 1.1 experience has (absolutely, utterly) convinced me that no
backwards-incompatible change to XML, no matter how negligible the
practical impact, is acceptable. Backwards incompatibility is simply
not an option.

|>2. Because we knew that system identifiers allowed characters that
|>couldn't appear in URIs, we added some wording to clarify how
|>processors must escape those characters if they needed URIs.
|
| Well, I think it's actually slightly different.

Fine. We can argue about the history over a beer sometime :-)

|>Over time, this text was refined, using fragments taken from drafts of
|>the IRI spec, and is now "cut-and-pasted" into several
|>recommendations.
|>
|>It's become clear that this cut-and-paste approach is tedious and
|>error-prone and does not scale. Asking future specs to continue this
|>cut-and-paste process from one or another of the existing specs is
|>just not helpful to the community.
|
| I agree. However, please note that many other W3C specs currently
| have circumscriptive texts.

I don't see how that helps. We have to describe strings that aren't
URIs or IRIs. We don't want IRIs, we don't want to use the term URI
and mean IRI, we want to use a term that means "this {string}".

|>3. The HRRI spec proposes to instantiate the very liberal repertoire
|>of characters allowed in a system identifier (and all the other
|>places) in a short, stand-alone specification. This specification will
|>have a name and will be available for normative reference.
|
| The "all" in "all the other places" is misleading, because it very
| much depends on the scale at which things are looked at.

I meant "all the other places that currently refer to the XLink 1.0
spec for their description of what characters are allowed".

|>I understand that perhaps the world would be a better place if we
|>didn't need another name for another flavor of a string that serves
|>the role of identifying a resource. But that's not an option, see
|>point 1.
|
| I don't think it's productive to write "that's not an option" without
| actual backup technical arguments.

The principal argument isn't technical, it's political. I'm not saying
it's "not an option" on the basis of some self-righteous personal
opinion, I'm saying it because I have scars and burns from the last
time we did something backwards incompatible to XML. The community
will not stand for it.

I'd be *delighted* to be proved wrong. Get consensus from the
community that it would be ok to change the definition of system
literal or the value of href attributes so that "this {string}.html"
no longer accesses the resource we'd identify with the URI
"this%20%7bstring%7d.html" but is instead an error and I'll be the
first in line to fix the specs.
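
For concreteness, a minimal Python sketch of the escaping that example assumes. The function name escape_for_uri and the exact escape set are illustrative assumptions, modeled on my reading of the list in section 4.2.2 of the XML Recommendation (C0 controls, DEL, space, the delimiters < > ", the "unwise" characters { } | \ ^ `, and everything above #x7F). It is not a normative algorithm.

```python
# ASCII code points to escape: #x0-#x20 (controls and space), #x7F,
# and the delimiter / "unwise" characters. Assumed set, see the note above.
ESCAPE_ASCII = set(range(0x00, 0x21)) | {0x7F} | {ord(c) for c in '<>"{}|\\^`'}


def escape_for_uri(identifier):
    """Percent-escape the UTF-8 bytes of characters not allowed in URIs.

    '%', '#', '[' and ']' are deliberately left alone.
    """
    out = []
    for byte in identifier.encode("utf-8"):
        if byte in ESCAPE_ASCII or byte > 0x7F:
            out.append(f"%{byte:02x}")  # lowercase hex, as in the example above
        else:
            out.append(chr(byte))
    return "".join(out)


print(escape_for_uri("this {string}.html"))  # this%20%7bstring%7d.html
```

The hex digits of a percent-escape are case-insensitive, so "%7B" and "%7b" identify the same resource; lowercase is used here only to match the string above.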

Specs have to reflect reality. The reality is system identifiers and
href attributes contain values that aren't valid in URIs or IRIs so we
need to do something.

|>Martin's message that quoted this paragraph from the IRI spec gave a
|>glimmer of hope that perhaps we could avoid 3.
|>
|>   Systems accepting IRIs MAY also deal with the printable characters in
|>   US-ASCII that are not allowed in URIs, namely "<", ">", '"', space,
|>   "{", "}", "|", "\", "^", and "`", in step 2 above.  If these
|>   characters are found but are not converted, then the conversion
|>   SHOULD fail.  Please note that the number sign ("#"), the percent
|>   sign ("%"), and the square bracket characters ("[", "]") are not part
|>   of the above list and MUST NOT be converted.  Protocols and formats
|>   that have used earlier definitions of IRIs including these characters
|>   MAY require percent-encoding of these characters as a preprocessing
|>   step to extract the actual IRI from a given field.  This
|>   preprocessing MAY also be used by applications allowing the user to
|>   enter an IRI.
|>
|>Unfortunately, our problem is that system identifiers can contain not
|>just "printable characters in US-ASCII that are not allowed in URIs"
|>but a wide range of characters from elsewhere in Unicode that are not
|>allowed in URIs (or IRIs).
|>
|>Question: Is the paragraph from the IRI spec above intended to be
|>broader than a literal reading would suggest? Is it the intent of the
|>IRI spec that systems accepting IRIs MAY also deal with characters not
|>allowed in URIs by converting them?
|
| This is a very interesting thought. What I have said earlier is that
| I think it would be possible to extend the above paragraph to other
| kinds of characters in an (already started) update of the IRI spec.

If the IRI spec can be extended to cover the characters we need, then
I think we could say that these things are IRIs. That means we need
to answer two questions:

1. Will the folks maintaining the IRI spec agree to extend that
   paragraph to cover all of the characters we need to have in order
   to maintain perfect backwards-compatibility with XML 1.0 and 1.1?

2. What is a realistic timeline for IRI v.Next?

|>If so, then perhaps we can simply say that system identifiers are IRIs
|>and note this provision in the IRI spec for what I'll call "legacy"
|>identifiers.
|
| This is essentially what I proposed, except that this would happen
| in a new version of the IRI spec. There is a huge difference between
| using an erratum (with very little feedback possibilities from the
| community on whether this was indeed intended, and very little
| room for adding additional warning text), and an updated spec,
| where we can make sure we spend all the necessary time on
| getting the wording correct and adding all the necessary
| warnings.
|
|>If not, then I think we must proceed with the HRRI spec.
|
| "must" is quite strong. What about looking at what other specs
| did? What about going for something along the lines of:

Perhaps. I'll see what the Core WG says next week.

| A SystemLiteral SHOULD be an IRI [RFC3987 (or its successor)].
| Note: This includes the provision in the IRI spec for dealing with
|       printable characters in US-ASCII that are not allowed in URIs.
| Note: XML processors MUST/SHOULD also convert characters outside
|       the repertoire of characters allowed in IRIs according to
|       Section 3.1 of [RFC 3987].
|
| With the erratum, you would have used the first three lines.
| Without the erratum, your text gets a bit longer. You might
| even want to tweak the second note to cover some of what the
| RDF specs say (essentially, processors may issue warnings
| when they see something that doesn't conform to the IRI spec).
|
| If you wait for a new version of the IRI spec, you should be
| able to then use some text such as:
|
| A SystemLiteral is an IRI according to [RFCXXXX], including
| the provisions in section Y.Z of [RFCXXXX]. We will be able
| to make sure that Y.Z covers your needs, and hopefully the
| needs of other W3C (and other) specs, and we will greatly
| reduce the confusion for the overall community and have
| technology converge to what's really needed, rather than
| diverge for the sake of non-existing backwards compatibility
| needs.
|
| Hope this helps.

Thanks, Martin.

                                        Be seeing you,
                                          norm

-- 
Norman Walsh <ndw@nwalsh.com> | There is no such thing as an absolute
http://nwalsh.com/            | certainty, but there is assurance
                              | sufficient for the purposes of human
                              | life.--John Stuart Mill
Received on Thursday, 9 August 2007 12:48:30 UTC