- From: Norman Walsh <ndw@nwalsh.com>
- Date: Wed, 18 Jul 2007 11:50:32 -0400
- To: public-xml-core-wg@w3.org, public-iri@w3.org, Richard Ishida <ishida@w3.org>, Felix Sasaki <fsasaki@w3.org>, www-xml-linking-comments@w3.org, public-i18n-core@w3.org
- Message-ID: <87sl7logmv.fsf@nwalsh.com>
Hi,

Sorry I was out of the loop for a bit. I see from the email threads that we've got some improved wording proposed for the list of characters that have to be escaped if they appear in HRRI and some improved wording for the security considerations section. I'll incorporate those as soon as I can.

However, as far as I can tell, we still don't have a clear understanding about whether we need HRRI or not. Here's how I see it. Sorry if this is a little repetitive; I'm hoping that considering this issue from a higher level again will help.

1. The XML Recommendation says that a system identifier consists of a single or double quote followed by any characters followed by a matching quote:

     SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")

   Any attempt to limit the characters allowed in a system identifier would be a backwards incompatible change to XML. That is simply not an option.

2. Because we knew that system identifiers allowed characters that couldn't appear in URIs, we added some wording to clarify how processors must escape those characters if they needed URIs. Over time, this text was refined, using fragments taken from drafts of the IRI spec, and is now "cut-and-pasted" into several recommendations.

   It's become clear that this cut-and-paste approach is tedious and error-prone and does not scale. Asking future specs to continue this cut-and-paste process from one or another of the existing specs is just not helpful to the community.

3. The HRRI spec proposes to instantiate the very liberal repertoire of characters allowed in a system identifier (and all the other places) in a short, stand-alone specification. This specification will have a name and will be available for normative reference.

   I understand that perhaps the world would be a better place if we didn't need another name for another flavor of a string that serves the role of identifying a resource. But that's not an option; see point 1.
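
For readers who want to see the kind of escaping point 2 refers to, here is a rough Python sketch. It is not the text of any recommendation. The function name is hypothetical, and the set of characters treated as disallowed (ASCII controls, space, a handful of ASCII delimiters, and everything outside ASCII) is an assumption made for illustration. The sketch converts each such character to UTF-8 and writes every resulting byte as %HH, leaving "#", "%", "[" and "]" untouched.

```python
# Hypothetical helper for illustration only; not taken from any spec text.
# Assumption: these printable ASCII characters are treated as disallowed.
DISALLOWED_ASCII = set(' <>"{}|\\^`')


def escape_system_identifier(sysid: str) -> str:
    """Percent-encode characters assumed not to be allowed in a URI."""
    out = []
    for ch in sysid:
        cp = ord(ch)
        needs_escape = (
            cp <= 0x1F              # ASCII control characters
            or cp == 0x7F           # DEL
            or cp > 0x7F            # anything outside US-ASCII
            or ch in DISALLOWED_ASCII
        )
        if needs_escape:
            # Encode as UTF-8, then write each byte as %HH.
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
        else:
            # Everything else, including '#', '%', '[' and ']', is kept.
            out.append(ch)
    return "".join(out)
```

Under these assumptions, a system identifier such as "my file.xml" would come out as "my%20file.xml", while an identifier that is already a valid URI passes through unchanged.
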
Martin's message that quoted this paragraph from the IRI spec gave a glimmer of hope that perhaps we could avoid 3.

   Systems accepting IRIs MAY also deal with the printable characters in US-ASCII that are not allowed in URIs, namely "<", ">", '"', space, "{", "}", "|", "\", "^", and "`", in step 2 above. If these characters are found but are not converted, then the conversion SHOULD fail. Please note that the number sign ("#"), the percent sign ("%"), and the square bracket characters ("[", "]") are not part of the above list and MUST NOT be converted. Protocols and formats that have used earlier definitions of IRIs including these characters MAY require percent-encoding of these characters as a preprocessing step to extract the actual IRI from a given field. This preprocessing MAY also be used by applications allowing the user to enter an IRI.

Unfortunately, our problem is that system identifiers can contain not just "printable characters in US-ASCII that are not allowed in URIs" but a wide range of characters from elsewhere in Unicode that are not allowed in URIs (or IRIs).

Question: Is the paragraph from the IRI spec above intended to be broader than a literal reading would suggest? Is it the intent of the IRI spec that systems accepting IRIs MAY also deal with characters not allowed in URIs by converting them?

If so, then perhaps we can simply say that system identifiers are IRIs and note this provision in the IRI spec for what I'll call "legacy" identifiers.

If not, then I think we must proceed with the HRRI spec.

Thoughts?

                                        Be seeing you,
                                          norm

-- 
Norman Walsh <ndw@nwalsh.com> | A great deal may be done by severity,
http://nwalsh.com/            | more by love, but most by clear
                              | discernment and impartial
                              | justice.--Goethe
Received on Wednesday, 18 July 2007 15:52:02 UTC