- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Wed, 08 Aug 2007 12:56:48 +0900
- To: Norman Walsh <ndw@nwalsh.com>, public-xml-core-wg@w3.org, public-iri@w3.org, Richard Ishida <ishida@w3.org>, Felix Sasaki <fsasaki@w3.org>, www-xml-linking-comments@w3.org, public-i18n-core@w3.org
Hello Norm, others,

Sorry for the delay in responding; in summer, everything moves a bit slower.

At 00:50 07/07/19, Norman Walsh wrote:
>Hi,
>
>Sorry I was out of the loop for a bit. I see from the email threads
>that we've got some improved wording proposed for the list of
>characters that have to be escaped if they appear in HRRI and some
>improved wording for the security considerations section. I'll
>incorporate those as soon as I can.
>
>However, as far as I can tell, we still don't have a clear
>understanding about whether we need HRRI or not.
>
>Here's how I see it. Sorry if this is a little repetitive; I'm hoping
>that considering this issue from a higher level again will help.

I think laying out the issues clearly can only help. Thanks for doing this.

>1. The XML Recommendation says that a system identifier consists of a
>single or double quote followed by any characters followed by a
>matching quote:
>
>  SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")
>
>Any attempt to limit the characters allowed in a system identifier
>would be a backwards incompatible change to XML. That is simply not an
>option.

Well, it would sure look like a backwards-incompatible change on the
spec level. But how many XML documents would actually become
non-well-formed if one e.g. disallowed general control characters in
the C0 area (I'm not speaking about TAB/CR/LF)? As far as I understand,
these characters cannot appear in XML 1.0. They can appear, in the form
of numeric character references (NCRs), in XML 1.1, but the above
grammar rule doesn't allow NCRs in System Literals. The XML REC mentions
this explicitly, as follows: "Note that a SystemLiteral can be parsed
without scanning for markup." So in fact changing the SystemLiteral
production to exclude general C0 control characters wouldn't change
anything at all.
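
To make the point about C0 control characters concrete, here is a minimal sketch in Python. It assumes the reading in which the Char production of XML 1.0 constrains every character of a document, including those inside a SystemLiteral. The function names are hypothetical and exist only for this illustration.

```python
import re

# SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")
SYSTEM_LITERAL = re.compile(r'"[^"]*"|\'[^\']*\'')


def matches_system_literal(literal: str) -> bool:
    """True if the string matches the bare SystemLiteral production."""
    return SYSTEM_LITERAL.fullmatch(literal) is not None


def is_xml10_char(ch: str) -> bool:
    """XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] |
    [#xE000-#xFFFD] | [#x10000-#x10FFFF]"""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def general_c0_controls(text: str) -> list[str]:
    """General C0 control characters in text (TAB/CR/LF are not counted)."""
    return [ch for ch in text if ord(ch) < 0x20 and ch not in "\t\n\r"]


# The bare production accepts a literal containing U+0001 ...
literal = "'a\x01b'"
print(matches_system_literal(literal))
# ... but that character is not a Char, so under the assumption above
# it could not appear in a well-formed XML 1.0 document anyway.
print([is_xml10_char(ch) for ch in general_c0_controls(literal)])
```

Under that assumption, every general C0 control character fails `is_xml10_char`, while C1 control characters such as U+0085 pass it, which is why the C1 case is a separate question.
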
[There is potentially another interpretation of the grammar in the XML
spec, which is that the Char production
(http://www.w3.org/TR/REC-xml/#NT-Char) does not restrict the contents
of SystemLiteral, but in that case, it would also not restrict the
contents of http://www.w3.org/TR/REC-xml/#NT-CharData, which would mean
that arbitrary element content could contain such control characters,
including NUL characters/bytes. I think it would probably be best to fix
this by explicitly using the Char production in SystemLiteral and the
other relevant places. If I need to submit an erratum, please tell me
where.]

This is of course different for e.g. C1 control characters and for
URI-like fields in XML attributes or element content. But even for
these, the question remains of how many XML documents are really out
there that use any of these characters (for any other purpose than to
prove that there are indeed such documents).

>2. Because we knew that system identifiers allowed characters that
>couldn't appear in URIs, we added some wording to clarify how
>processors must escape those characters if they needed URIs.

Well, I think it's actually slightly different. Because we wanted
System Literals to accept something like IRIs (which didn't have that
name yet at that time), we added wording to clarify how to convert these
into URIs. I do not remember the SystemLiteral production ever having
been brought up in the discussion, neither in the way above (we need to
describe the conversion because SystemLiteral allows anything) nor the
other way round (to make sure that we can use more than just URIs, we
have to make the SystemLiteral production more general than US-ASCII).
But these things happened a long time ago.
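
As an illustration of the kind of conversion meant here, the following Python sketch percent-encodes the UTF-8 bytes of every character that cannot appear in a URI. The exact set of escaped characters (controls, space, `<`, `>`, `"`, `{`, `}`, `|`, `\`, `^`, `` ` `` and everything outside US-ASCII) is an assumption for this example, loosely following the escaping step described for system identifiers in the XML Recommendation; the name `escape_to_uri` is hypothetical. It is a sketch, not the normative algorithm of any spec.

```python
# Printable US-ASCII characters that are not allowed in URIs
# (assumed set for this illustration).
ESCAPE_ASCII = set(' <>"{}|\\^`')


def escape_to_uri(identifier: str) -> str:
    """Percent-encode characters that cannot appear in a URI.

    Each such character is converted to UTF-8 and every resulting
    byte is written as %HH. "#", "%", "[" and "]" are left alone.
    """
    out = []
    for ch in identifier:
        cp = ord(ch)
        if cp < 0x20 or cp >= 0x7F or ch in ESCAPE_ASCII:
            out.append("".join(f"%{byte:02X}" for byte in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


print(escape_to_uri("http://example.org/my doc.dtd"))
print(escape_to_uri("http://example.org/r\u00e9sum\u00e9.dtd#top"))
```

Leaving "%" untouched means that an identifier which is already a URI passes through unchanged, so the conversion can be applied without first deciding whether the input was a URI or something IRI-like.
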
My guess is that the main motivation for having the SystemLiteral
production the way it is is that the people who wrote the XML spec
understood one of the general principles of URI/IRI syntax, which is
that it's a bad idea to restrict that syntax unnecessarily in specs that
carry URIs/IRIs, because this creates unnecessary dependencies.

>Over time, this text was refined, using fragments taken from drafts of
>the IRI spec, and is now "cut-and-pasted" into several
>recommendations.
>
>It's become clear that this cut-and-paste approach is tedious and
>error-prone and does not scale. Asking future specs to continue this
>cut-and-paste process from one or another of the existing specs is
>just not helpful to the community.

I agree. However, please note that many other W3C specs currently have
circumscriptive texts. In some cases, these have been written in
expectation of the IRI spec being available as an RFC; in other cases,
they are there to allow the use of old terminology (URI) with new
meaning (IRI). For some examples, please see
http://www.w3.org/International/iri-edit/spec-use-survey.html, a page I
have started to put together to get an overview of the different ways
the issue we are discussing here is addressed in W3C specs. Please feel
free to add to that page (if you have access rights) or to suggest
additions.

>3. The HRRI spec proposes to instantiate the very liberal repertoire
>of characters allowed in a system identifier (and all the other
>places) in a short, stand-alone specification. This specification will
>have a name and will be available for normative reference.

The "all" in "all the other places" is misleading, because it very much
depends on the scale at which things are looked at.

>I understand that perhaps the world would be a better place if we
>didn't need another name for another flavor of a string that serves
>the role of identifying a resource. But that's not an option, see
>point 1.
I don't think it's productive to write "that's not an option" without
actual technical arguments to back it up. I'm still waiting for the
first XML document that contains any of the characters in question in
any of the URI/IRI-like slots under discussion here (of course this
would exclude documents that have been created just to show that such
documents exist, but I haven't even seen one of these). I'm still
waiting for anybody to come forward and claim that they actually need or
want to use any of the obscure "characters" (not talking about printable
US-ASCII or TAB/CR/LF/Space here).

If the XML Core WG said "we think that the risk is extremely low, but we
don't want to take this risk", I could to some extent understand this,
and it's ultimately the job of the XML Core WG to decide how they want
to proceed with their specs. However, I think that the overall effect on
the community should be considered when looking at the benefits and
problems of different approaches. For the overall community, the benefit
of having a single concept, defined by a single specification, is very
high compared to the issue of the XML Core WG wanting to save a few
lines in a few specs that otherwise may be needed to avoid a risk that
is extremely small.

>Martin's message that quoted this paragraph from the IRI spec gave a
>glimmer of hope that perhaps we could avoid 3.
>
>   Systems accepting IRIs MAY also deal with the printable characters in
>   US-ASCII that are not allowed in URIs, namely "<", ">", '"', space,
>   "{", "}", "|", "\", "^", and "`", in step 2 above. If these
>   characters are found but are not converted, then the conversion
>   SHOULD fail. Please note that the number sign ("#"), the percent
>   sign ("%"), and the square bracket characters ("[", "]") are not part
>   of the above list and MUST NOT be converted.
>   Protocols and formats
>   that have used earlier definitions of IRIs including these characters
>   MAY require percent-encoding of these characters as a preprocessing
>   step to extract the actual IRI from a given field. This
>   preprocessing MAY also be used by applications allowing the user to
>   enter an IRI.
>
>Unfortunately, our problem is that system identifiers can contain not
>just "printable characters in US-ASCII that are not allowed in URIs"
>but a wide range of characters from elsewhere in Unicode that are not
>allowed in URIs (or IRIs).
>
>Question: Is the paragraph from the IRI spec above intended to be
>broader than a literal reading would suggest? Is it the intent of the
>IRI spec that systems accepting IRIs MAY also deal with characters not
>allowed in URIs by converting them?

This is a very interesting thought. What I have said earlier is that I
think it would be possible to extend the above paragraph to other kinds
of characters in an (already started) update of the IRI spec. I'm quite
a bit more sceptical about dealing with this just as an erratum, because
looking at all the drafts of the IRI spec listed at
http://www.w3.org/International/iri-edit/#Published, there never seems
to have been any question about whether control characters (both general
C0 and all of C1) should be allowed or not. Nobody ever came up and
requested that these be allowed, in any way, and I'm still not seeing
any actual need at all. The above note was specifically put in to
address the actual and expressed needs of some people in the XML
community (see my earlier email with references to the email archive).

>If so, then perhaps we can simply say that system identifiers are IRIs
>and note this provision in the IRI spec for what I'll call "legacy"
>identifiers.

This is essentially what I proposed, except that this would happen in a
new version of the IRI spec.
There is a huge difference between using an erratum (with very little
feedback possibilities from the community on whether this was indeed
intended, and very little room for adding additional warning text), and
an updated spec, where we can make sure we spend all the necessary time
on getting the wording correct and adding all the necessary warnings.

>If not, then I think we must proceed with the HRRI spec.

"must" is quite strong. What about looking at what other specs did?
What about going for something along the lines of:

   A SystemLiteral SHOULD be an IRI [RFC3987 (or its successor)].
   Note: This includes the provision in the IRI spec for dealing with
   printable characters in US-ASCII that are not allowed in URIs.
   Note: XML processors MUST/SHOULD also convert characters outside
   the repertoire of characters allowed in IRIs according to
   Section 3.1 of [RFC 3987].

With the erratum, you would have used the first three lines. Without the
erratum, your text gets a bit longer. You might even want to tweak the
second note to cover some of what the RDF specs say (essentially,
processors may issue warnings when they see something that doesn't
conform to the IRI spec).

If you wait for a new version of the IRI spec, you should be able to
then use some text such as:

   A SystemLiteral is an IRI according to [RFCXXXX], including the
   provisions in section Y.Z of [RFCXXXX].

We will be able to make sure that Y.Z covers your needs, and hopefully
the needs of other W3C (and other) specs, and we will greatly reduce the
confusion for the overall community and have technology converge to
what's really needed, rather than diverge for the sake of non-existing
backwards compatibility needs.

Hope this helps.

Regards, Martin.

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
Received on Wednesday, 8 August 2007 04:00:31 UTC