- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Wed, 30 May 2007 10:42:01 +0900
- To: public-iri@w3.org
- Cc: Richard Ishida <ishida@w3.org>, Felix Sasaki <fsasaki@w3.org>, Norman Walsh <Norman.Walsh@Sun.COM>, Richard Tobin <richard@inf.ed.ac.uk>, Michel Suignard <michelsu@microsoft.com>, Lisa Dusseault <lisa@osafoundation.org>, Chris Newman <Chris.Newman@Sun.COM>
[This email was a contribution to a private discussion, but some
participants have requested that it be posted in a public location
so that discussion can continue in public. The participants have
agreed that the email can be made public with a few omissions.
I'm forwarding the email below to public-iri@w3.org, because it is
closely related to IRIs.]

>Date: Wed, 02 May 2007 18:14:32 +0900
>To: "Richard Ishida" <ishida@w3.org>
>From: Martin Duerst <duerst@it.aoyama.ac.jp>
>Subject: Re: HRRIs, IRIs, etc

>Hello Richard,
>
>Many thanks for your mail. [...]

>At 19:21 07/05/01, Richard Ishida wrote:
>>Hi Martin,
>>
>>There is a discussion about to take place [...]
>>about how specs should define "protocol or format
>>elements ...which are to be interpreted as URI references". Part of that
>>discussion will include HRRIs[1]. Do you have any particular thoughts on
>>the usefulness or not or issues with HRRI's?
>>
>>Cheers,
>>RI
>>
>>[1] http://www.w3.org/XML/2007/04/hrri/draft-walsh-tobin-hrri-00.html

[Note: currently at
http://www.ietf.org/internet-drafts/draft-walsh-tobin-hrri-01.txt]

>I think that there are definitely some issues that need to be addressed;
>I think I already pointed that out several months ago (I can
>try to dig up the email if needed).
>
>I think the reasons for the current state are mainly two:
>a) Difficulty of coordinating over a very large range of
>   organizations/groups/people (IETF, W3C with many groups and participants)
>b) Long overall timeline which led to ad-hoc solutions.
>
>First, to recap the history, things started out when approximate
>descriptions of what later were to be called IRIs were put into
>the XML spec.
>This originally came from HTML4, where it was worded
>as an error treatment:
>http://www.w3.org/TR/html401/appendix/notes.html#h-B.2.1
>(B.2.1 Non-ASCII characters in URI attribute values)
>
>
>Looking back at the first edition of the XML REC, it seems to me
>actually rather clear that originally, spaces were not considered:
>http://www.w3.org/TR/1998/REC-xml-19980210#sec-external-ent
>
>   An XML processor should handle a non-ASCII character in a URI by
>   representing the character in UTF-8 as one or more bytes, and then
>   escaping these bytes with the URI escaping mechanism (i.e., by
>   converting each byte to %HH, where HH is the hexadecimal notation
>   of the byte value).
>
>Note the use of the term "non-ASCII", which I take to exclude spaces.
>
>The above was changed in the second edition of XML, as follows:
>http://www.w3.org/TR/2000/REC-xml-20001006#sec-external-ent
>
>   URI references require encoding and escaping of certain characters.
>   The disallowed characters include all non-ASCII characters, plus the
>   excluded characters listed in Section 2.4 of [IETF RFC 2396], except
>   for the number sign (#) and percent sign (%) characters and the square
>   bracket characters re-allowed in [IETF RFC 2732]. Disallowed characters
>   must be escaped as follows: ...
>
>For your reference, the excluded characters in RFC 2396 are as follows:
>   control = <US-ASCII coded characters 00-1F and 7F hexadecimal>
>   space   = <US-ASCII coded character 20 hexadecimal>
>   delims  = "<" | ">" | "#" | "%" | <">
>   unwise  = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
>
>The erratum responsible for this change, at
>http://www.w3.org/XML/xml-19980210-errata#E78, points to
>http://www.w3.org/International/Group/issues/xml/#uris.percent.hh
>for motivation. There we find a reference to "equivalent text in
>the Character Model".
>A pointer to the Charmod WD around the time
>of this erratum is http://www.w3.org/TR/1999/WD-charmod-19991129/#URIs,
>where we indeed can find mention of characters excluded in URIs,
>as defined in RFC 2396.
>
>E78 overrides an earlier erratum:
>http://www.w3.org/XML/xml-19980210-errata#E49 which has a pointer
>to XML Syntax WG minutes at
>http://lists.w3.org/Archives/Member/w3c-xml-syntax-wg/1999May/0026.html
>for motivation; digging around, it looks like the correct minutes might
>be at
>http://lists.w3.org/Archives/Member/w3c-xml-syntax-wg/1999May/0020.html,
>but there, E49 seems to be about something else. There is a supporting
>note which is slightly more consistent, but doesn't provide much more info:
>http://lists.w3.org/Archives/Member/w3c-xml-syntax-wg/1999May/0021.html
>Anyway, the rationale here reads:
>   Original only discussed non-ASCII characters,
>   include case of reserved characters.
>
>We could probably dig deeper around that time to find out exactly
>what motivated the roundup inclusion of all ASCII characters, but my
>recollection from that time (vague at best) is probably best summarized
>by "we thought it would be nice to have".
>
>
>What I also remember in this context is an extensive debate on
>the w3c-xml-plenary list (see start at
>http://lists.w3.org/Archives/Member/w3c-xml-plenary/2003Feb/0001.html),
>and a BOF at the IETF in San Francisco (apparently shortly after),
>see http://lists.w3.org/Archives/Public/uri/2003Mar/0043.html.
>The former didn't lead to any conclusion, the latter led to very
>clear input that spaces (at least in principle) should be avoided.
>After forming a dedicated mailing list (see announcement e.g.
>at http://lists.w3.org/Archives/Member/w3c-xml-plenary/2003Apr/),
>the issue of additional characters was the first issue listed
>(see http://www.w3.org/International/iri-edit/#iadditional-00).
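
The difference between the two XML wordings quoted above can be made concrete with a small sketch. It assumes the straightforward reading of the quoted text: the first edition escapes only non-ASCII characters, while the second edition also escapes the RFC 2396 excluded characters, leaving "#", "%", "[" and "]" alone. The function names and the uppercase hex digits are choices made for this illustration and come from neither spec.

```python
# Excluded characters from RFC 2396 section 2.4, as listed in the email.
CONTROL = {chr(c) for c in range(0x00, 0x20)} | {"\x7f"}
SPACE = {" "}
DELIMS = set('<>#%"')
UNWISE = set("{}|\\^[]`")

# XML second edition: "#", "%" and the square brackets are not escaped.
EXEMPT = set("#%[]")
DISALLOWED_ASCII = (CONTROL | SPACE | DELIMS | UNWISE) - EXEMPT


def percent_encode(ch):
    """Encode one character as UTF-8 and write each byte as %HH."""
    return "".join(f"%{b:02X}" for b in ch.encode("utf-8"))


def escape_first_edition(ref):
    """Hypothetical helper: escape non-ASCII characters only."""
    return "".join(percent_encode(c) if ord(c) > 0x7F else c for c in ref)


def escape_second_edition(ref):
    """Hypothetical helper: also escape the excluded ASCII characters."""
    return "".join(
        percent_encode(c) if ord(c) > 0x7F or c in DISALLOWED_ASCII else c
        for c in ref
    )


sample = "my doc/r\u00e9sum\u00e9.xml#top"
print(escape_first_edition(sample))   # space is left untouched
print(escape_second_edition(sample))  # space becomes %20
```

Under this reading, a space passes through the first-edition rule unchanged and only becomes `%20` under the second-edition rule, which is the change in scope the email describes.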
>
>
>Please also note that the current IRI spec doesn't completely
>disallow IRIs with spaces and such; RFC 3987 contains the following
>paragraph:
>   Systems accepting IRIs MAY also deal with the printable characters in
>   US-ASCII that are not allowed in URIs, namely "<", ">", '"', space,
>   "{", "}", "|", "\", "^", and "`", in step 2 above. If these
>   characters are found but are not converted, then the conversion
>   SHOULD fail. Please note that the number sign ("#"), the percent
>   sign ("%"), and the square bracket characters ("[", "]") are not part
>   of the above list and MUST NOT be converted. Protocols and formats
>   that have used earlier definitions of IRIs including these characters
>   MAY require percent-encoding of these characters as a preprocessing
>   step to extract the actual IRI from a given field. This
>   preprocessing MAY also be used by applications allowing the user to
>   enter an IRI.
>
>When comparing the list of characters above with the previous list,
>please note that some of the differences (e.g. for '['/']') are
>due to changes in RFC 3986 (URI spec). The other difference is the
>absence of control characters; these were excluded because anything
>besides CR/LF/TAB is almost totally silly and even CR/LF/TAB are
>extremely prone to data loss (e.g. attribute value normalization)
>and misinterpretation (try putting a TAB on the side of a bus).
>
>
>So what we have got here is a mess, and we have to find a way out
>of it. Giving things a name is definitely a good idea. The question
>is where to do that, and what name to give it.
>
>Doing it as a separate draft is a good first step; that's the
>way the IETF works. Obviously, in that case, the draft should
>actually be submitted to the IETF asap, and not just put up at
>the W3C, so that it can be widely discussed.
[Note: The draft has been published in the meantime, see above.]
>I also suggest sending
>a note to public-iri@w3.org and other related mailing lists.
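
The RFC 3987 paragraph quoted above can be sketched roughly as follows. This is a simplified illustration, not the full IRI-to-URI mapping: it assumes every non-ASCII character is simply percent-encoded as UTF-8 (no IDNA handling of the host, no check against the ucschar/iprivate ranges), and it chooses to reject ASCII control characters outright, which is an assumption of this sketch motivated by the email's remarks rather than something the quoted paragraph states. The function name and the `convert_extra` flag are hypothetical.

```python
# Printable US-ASCII characters that RFC 3987 says systems MAY convert.
EXTRA_PRINTABLE = set('<>" {}|\\^`')


def percent_encode(ch):
    """Encode one character as UTF-8 and write each byte as %HH."""
    return "".join(f"%{b:02X}" for b in ch.encode("utf-8"))


def iri_to_uri_sketch(iri, convert_extra=False):
    """Hypothetical helper illustrating the optional conversion.

    "#", "%", "[" and "]" fall through unchanged, since the RFC says
    they MUST NOT be converted.
    """
    out = []
    for ch in iri:
        if ch in EXTRA_PRINTABLE:
            if not convert_extra:
                # "If these characters are found but are not converted,
                # then the conversion SHOULD fail."
                raise ValueError(f"character {ch!r} not allowed in a URI")
            out.append(percent_encode(ch))
        elif ord(ch) < 0x20 or ord(ch) == 0x7F:
            # Assumption of this sketch: ASCII control characters are rejected.
            raise ValueError(f"control character U+{ord(ch):04X} rejected")
        elif ord(ch) > 0x7F:
            out.append(percent_encode(ch))
        else:
            out.append(ch)
    return "".join(out)


print(iri_to_uri_sketch("http://example.org/a b#x[1]", convert_extra=True))
```

With `convert_extra=False` the same input raises `ValueError`, mirroring the "conversion SHOULD fail" clause; a format in the "IRI with space" family would correspond to always running with `convert_extra=True`.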
>
>However, if possible, this should be integrated into the IRI spec,
>because the IRI spec already contains the relevant provision, which
>just needs to be labeled and maybe tweaked (re. control characters).
>The IRI spec needs to be updated anyway in order to move to Draft
>Standard, and these changes shouldn't affect that move, because
>labeling something doesn't really change the spec, and clarifying
>exactly which characters are affected also shouldn't be a problem
>because on the XML side, we should have quite a few implementations
>anyway, and that's what counts for draft standard.
>
>
>As for the name, I'm rather sceptical about "Human Readable
>Resource Identifiers". My reasons are twofold:
>1) It contains a lot of judgement, somehow suggesting that
>   it's better than what we had up to now. "Why don't we use
>   Human Readable Resource Identifiers" will become difficult
>   to answer even if the technical reasons (such as not having
>   the luxury of the protective XML escaping/delimiting environment)
>   are totally obvious.
>2) I'm doubtful that it's actually accurate, and somewhat worried
>   that people may request even more. Adding spaces and a few
>   symbols in some cases can aid readability quite a bit, the
>   case in point being structured XPointers. On the other hand,
>   things such as "http://" are inherently incomprehensible to
>   a lot of people, and the rest of the URI/IRI/whatever structure
>   is also not common knowledge for the everyday user.
>   People may ask "On this occasion, why don't we allow translations
>   for scheme (protocol) names?", or "Http URIs often get used without
>   the scheme part, why don't we also allow that?", or "Well, I have
>   this idea for a much better, more humanly readable scheme/mechanism,
>   why don't we use that one?" and so on.
>
>Choosing a more neutral name (my working name today is
>"IRI with space", but I'm sure there are better names, another
>that I just came up with, which may at least serve as a good
>working acronym, is IRI++) may help discussions to move forward
>quite a bit more smoothly.
>
>
>Besides the two big issues of where to put this and how to name it,
>there are other important issues, namely, in no particular order
>and without any guarantee to be complete:
>1) What specs in the XML area currently use IRI++, and where
>   do they differ (they may not differ, but then again, they may,
>   in subtle or not so subtle ways).
>2) What other W3C specs currently use IRI++, in exactly what definition.
>   Definitely RDF does, but SPARQL on my last reading excluded spaces
>   and friends. What about HTML4? CSS?... What about other specs
>   (e.g. Atom,...)
>
>   One goal of the above two points is to make sure we don't
>   create a new spec with a new term only to realize later that
>   we needed two or more such terms.
>
>3) What issues are raised by the fact that namespaces and RDF
>   use character-by-character comparison, rather than a comparison
>   higher up on the ladder.
>4) What additional security issues are raised. The current draft
>   to me looks seriously incomplete e.g. on issues related to
>   control characters (imagine including a DEL or two at
>   crucial places, and a display infrastructure that didn't
>   show the DEL and the preceding character; this may be
>   a way in on a spoofing attack).
>5) What recommendations should the IRI spec (and the specs using
>   IRI++) make re. IRI++. It may be difficult to change
>   some of the XML specs to suddenly disallow arbitrary control
>   characters (in XML 1.0 at least we are mostly safe anyway,
>   but not in XML 1.1).
>   But having some strong recommendations against
>   such use in the IRI spec (and if necessary also in the specs
>   that use IRI++, although ideally, the IRI spec should take
>   care of that) may be very appropriate for interoperability
>   and security reasons.
>
>   In that sense, the labeling of the current draft as a BCP
>   (Best Current Practice) seems questionable. Allowing spaces
>   and some symbols may be extremely convenient in some contexts
>   (e.g. hand-authoring XPointers), but it shouldn't be sold
>   as the ideal thing to do.
>
>
>Hope this is enough material for today. Regards, Martin.
>
>#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
>#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
Received on Wednesday, 30 May 2007 01:45:54 UTC