Fwd: Re: HRRIs, IRIs, etc

[This email was a contribution to a private discussion,
but some participants have requested that it be posted
in a public location so that discussion can continue in
public. The participants have agreed that the email can
be made public with a few omissions.
I'm forwarding the email below to public-iri@w3.org,
because it is closely related to IRIs.]

>Date: Wed, 02 May 2007 18:14:32 +0900
>To: "Richard Ishida" <ishida@w3.org>
>From: Martin Duerst <duerst@it.aoyama.ac.jp>
>Subject: Re: HRRIs, IRIs, etc

>Hello Richard,
>
>Many thanks for your mail.

[...]

>At 19:21 07/05/01, Richard Ishida wrote:
>>Hi Martin,
>>
>>There is a discussion about to take place

[...]

>>about how specs should define "protocol or format 
>>elements ...which are to be interpreted as URI references".  Part of that 
>>discussion will include HRRIs[1].  Do you have any particular thoughts on 
>>the usefulness or not or issues with HRRI's?
>>
>>Cheers,
>>RI
>>
>>[1] http://www.w3.org/XML/2007/04/hrri/draft-walsh-tobin-hrri-00.html

[Note: currently at http://www.ietf.org/internet-drafts/draft-walsh-tobin-hrri-01.txt]

>I think that there are definitely some issues that need to be addressed;
>I believe I already pointed that out several months ago (I can
>try to dig up the email if needed).
>
>I think the reasons for the current state are mainly two:
>a) Difficulty of coordinating over a very large range of
>   organizations/groups/people (IETF, W3C with many groups and participants)
>b) A long overall timeline, which led to ad-hoc solutions.
>
>First, to recap the history, things started out when approximate
>descriptions of what later were to be called IRIs were put into
>the XML spec. This originally came from HTML4, where it was worded
>as an error treatment:
>http://www.w3.org/TR/html401/appendix/notes.html#h-B.2.1
>(B.2.1 Non-ASCII characters in URI attribute values)
>
>
>Looking back at the first edition of the XML REC, it seems to me
>actually rather clear that originally, spaces were not considered:
>http://www.w3.org/TR/1998/REC-xml-19980210#sec-external-ent
>
>   An XML processor should handle a non-ASCII character in a URI by
>   representing the character in UTF-8 as one or more bytes, and then
>   escaping these bytes with the URI escaping mechanism (i.e., by
>   converting each byte to %HH, where HH is the hexadecimal notation
>   of the byte value).
>
>Note the use of the term "non-ASCII", which I take to exclude spaces.
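
[Note: As a rough illustration of the first-edition rule quoted above, here
is a minimal Python sketch. The function name escape_non_ascii and the
example URI are made up for illustration and do not come from any spec.
The point is only that ASCII characters, including the space, pass through
untouched.]

```python
def escape_non_ascii(uri_text: str) -> str:
    """Percent-escape only non-ASCII characters, byte by byte in UTF-8."""
    out = []
    for ch in uri_text:
        if ord(ch) > 0x7F:
            # Represent the character in UTF-8, then escape each byte as %HH.
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
        else:
            # Every ASCII character is left alone, spaces included.
            out.append(ch)
    return "".join(out)


# Hypothetical example: the space survives, the two "é" do not.
print(escape_non_ascii("http://example.org/r\u00e9s um\u00e9"))
```
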
>
>The above was changed in the second edition of XML, as follows:
>http://www.w3.org/TR/2000/REC-xml-20001006#sec-external-ent
>
>   URI references require encoding and escaping of certain characters.
>   The disallowed characters include all non-ASCII characters, plus the
>   excluded characters listed in Section 2.4 of [IETF RFC 2396], except
>   for the number sign (#) and percent sign (%) characters and the square
>   bracket characters re-allowed in [IETF RFC 2732]. Disallowed characters
>   must be escaped as follows: ...
>
>For your reference, the excluded characters in RFC 2396 are as follows:
>   control     = <US-ASCII coded characters 00-1F and 7F hexadecimal>
>   space       = <US-ASCII coded character 20 hexadecimal>
>   delims      = "<" | ">" | "#" | "%" | <">
>   unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
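
[Note: The second-edition character set can be sketched in Python as
follows. The names XML2E_DISALLOWED_ASCII and escape_xml2e are hypothetical.
The escaping steps are elided in the quote above; this sketch assumes the
same UTF-8 / %HH mechanism as in the first-edition text.]

```python
# Excluded characters of RFC 2396, as quoted above.
CONTROL = {chr(c) for c in range(0x00, 0x20)} | {"\x7f"}
SPACE = {" "}
DELIMS = set('<>#%"')
UNWISE = set("{}|\\^[]`")

# Second edition: all excluded characters except "#", "%", "[" and "]".
XML2E_DISALLOWED_ASCII = (CONTROL | SPACE | DELIMS | UNWISE) - set("#%[]")


def escape_xml2e(system_id: str) -> str:
    """Escape non-ASCII characters plus the disallowed ASCII characters."""
    out = []
    for ch in system_id:
        if ord(ch) > 0x7F or ch in XML2E_DISALLOWED_ASCII:
            # Assumption: UTF-8 bytes written as %HH, as in the first edition.
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


# Space, "<" and ">" are escaped; "#", "%", "[" and "]" are kept as they are.
print(escape_xml2e("a b<c>#d%20[e]"))
```
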
>
>The erratum responsible for this change, at
>http://www.w3.org/XML/xml-19980210-errata#E78, points to
>http://www.w3.org/International/Group/issues/xml/#uris.percent.hh
>for motivation. There we find a reference to "equivalent text in
>the Character Model". A pointer to the Charmod WD around the time
>of this erratum is http://www.w3.org/TR/1999/WD-charmod-19991129/#URIs,
>where we indeed can find mention of characters excluded in URIs,
>as defined in RFC 2396.
>
>E78 overrides an earlier erratum:
>http://www.w3.org/XML/xml-19980210-errata#E49 which has a pointer
>to XML Syntax WG minutes at
>http://lists.w3.org/Archives/Member/w3c-xml-syntax-wg/1999May/0026.html
>for motivation; digging around, it looks like the correct minutes might
>be at
>http://lists.w3.org/Archives/Member/w3c-xml-syntax-wg/1999May/0020.html,
>but there, E49 seems to be about something else. There is a supporting
>note which is slightly more consistent, but doesn't provide much more info:
>http://lists.w3.org/Archives/Member/w3c-xml-syntax-wg/1999May/0021.html
>Anyway, the rationale here reads:
>   Original only discussed non-ASCII characters,
>      include case of reserved characters.
>
>We could probably dig deeper around that time to find out exactly
>what motivated the wholesale inclusion of all these ASCII characters, but my
>recollection from that time (vague at best) is probably best summarized
>by "we thought it would be nice to have".
>
>
>What I also remember in this context is an extensive debate on
>the w3c-xml-plenary list (see start at
>http://lists.w3.org/Archives/Member/w3c-xml-plenary/2003Feb/0001.html),
>and a BOF at the IETF in San Francisco (apparently shortly after),
>see http://lists.w3.org/Archives/Public/uri/2003Mar/0043.html.
>The former didn't lead to any conclusion, the latter led to very
>clear input that spaces (at least in principle) should be avoided.
>After forming a dedicated mailing list (see announcement e.g.
>at http://lists.w3.org/Archives/Member/w3c-xml-plenary/2003Apr/),
>the issue of additional characters was the first issue listed
>(see http://www.w3.org/International/iri-edit/#iadditional-00).
>
>
>Please also note that the current IRI spec doesn't completely
>disallow IRIs with spaces and such; RFC 3987 contains the following
>paragraph:
>   Systems accepting IRIs MAY also deal with the printable characters in
>   US-ASCII that are not allowed in URIs, namely "<", ">", '"', space,
>   "{", "}", "|", "\", "^", and "`", in step 2 above.  If these
>   characters are found but are not converted, then the conversion
>   SHOULD fail.  Please note that the number sign ("#"), the percent
>   sign ("%"), and the square bracket characters ("[", "]") are not part
>   of the above list and MUST NOT be converted.  Protocols and formats
>   that have used earlier definitions of IRIs including these characters
>   MAY require percent-encoding of these characters as a preprocessing
>   step to extract the actual IRI from a given field.  This
>   preprocessing MAY also be used by applications allowing the user to
>   enter an IRI.
>
>When comparing the list of characters above with the previous list,
>please note that some of the differences (e.g. for '['/']') are
>due to changes in RFC 3986 (URI spec). The other difference is the
>absence of control characters; these were excluded because anything
>besides CR/LF/TAB is almost totally silly and even CR/LF/TAB are
>extremely prone to data loss (e.g. attribute value normalization)
>and misinterpretation (try putting a TAB on the side of a bus).
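
[Note: The comparison of the two lists can be checked mechanically. The
sketch below takes "the previous list" to mean the RFC 2396 excluded
characters quoted earlier, which is an assumption; the names used are
hypothetical. It splits the characters missing from the RFC 3987 list into
control characters and printable ones.]

```python
# ASCII control characters: 00-1F and 7F hexadecimal.
CONTROL = {chr(c) for c in range(0x20)} | {"\x7f"}

# Excluded characters of RFC 2396: control, space, delims, unwise.
RFC2396_EXCLUDED = CONTROL | set(' <>#%"') | set("{}|\\^[]`")

# Printable characters that RFC 3987 says systems MAY also deal with.
RFC3987_OPTIONAL = set('<>" {}|\\^`')


def list_difference():
    """Return (control, printable) characters that are in the RFC 2396
    excluded set but not in the RFC 3987 list."""
    missing = RFC2396_EXCLUDED - RFC3987_OPTIONAL
    return missing & CONTROL, missing - CONTROL


controls, printable = list_difference()
# Prints the number of control characters and the printable leftovers.
print(len(controls), sorted(printable))
```

The printable leftovers are exactly the characters that RFC 3987 says MUST
NOT be converted; everything else that is missing is a control character.
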
>
>
>So what we have got here is a mess, and we have to find a way out
>of it. Giving things a name is definitely a good idea. The question
>is where to do that, and what name to give it.
>
>Doing it as a separate draft is a good first step; that's the
>way the IETF works. Obviously, in that case, the draft should
>actually be submitted to the IETF asap, and not just put up at
>the W3C, so that it can be widely discussed.

[Note: The draft has been published in the meantime, see above.]

>I also suggest sending
>a note to public-iri@w3.org and other related mailing lists.
>
>However, if possible, this should be integrated into the IRI spec,
>because the IRI spec already contains the relevant provision, which
>just needs to be labeled and maybe tweaked (re. control characters).
>The IRI spec needs to be updated anyway in order to move to Draft
>Standard, and these changes shouldn't affect that move, because
>labeling something doesn't really change the spec, and clarifying
>exactly which characters are affected also shouldn't be a problem
>because on the XML side, we should have quite a few implementations
>anyway, and that's what counts for draft standard.
>
>
>As for the name, I'm rather sceptical about "Human Readable
>Resource Identifiers". My reasons are twofold:
>1) It contains a lot of judgement, somehow suggesting that
>   it's better than what we had up to now. "Why don't we use
>   Human Readable Resource Identifiers" will become difficult
>   to answer even if the technical reasons (such as not having
>   the luxury of the protective XML escaping/delimiting environment)
>   are totally obvious.
>2) I'm doubtful that it's actually accurate, and somewhat worried
>   that people may request even more. Adding spaces and a few
>   symbols in some cases can aid readability quite a bit, the
>   case in point being structured XPointers. On the other hand,
>   things such as "http://" are inherently incomprehensible to
>   a lot of people, and the rest of the URI/IRI/whatever structure
>   is also not common knowledge for the everyday user.
>   People may ask "On this occasion, why don't we allow translations
>   for scheme (protocol) names?", or "Http URIs often get used without
>   the scheme part, why don't we also allow that?", or "Well, I have
>   this idea for a much better, more humanly readable scheme/mechanism,
>   why don't we use that one?" and so on.
>
>Choosing a more neutral name (my working name today is
>"IRI with space", but I'm sure there are better names; another
>that I just came up with, which may at least serve as a good
>working acronym, is IRI++) may help discussions to move forward
>quite a bit more smoothly.
>
>
>Besides the two big issues of where to put this and how to name it,
>there are other important issues, namely, in no particular order
>and without any guarantee to be complete:
>1) What specs in the XML area currently use IRI++, and where
>   do they differ (they may not differ, but then again, they may,
>   in subtle or not so subtle ways).
>2) What other W3C specs currently use IRI++, in exactly what definition.
>   Definitely RDF does, but SPARQL on my last reading excluded spaces
>   and friends. What about HTML4? CSS?... What about other specs
>   (e.g. Atom,...)
>
>   One goal of the above two points is to make sure we don't
>   create a new spec with a new term only to realize later that
>   we needed two or more such terms.
>
>3) What issues are raised by the fact that namespaces and RDF
>   use character-by-character comparison, rather than a comparison
>   higher up on the ladder.
>4) What additional security issues are raised. The current draft
>   to me looks seriously incomplete e.g. on issues related to
>   control characters (imagine including a DEL or two at
>   crucial places, and a display infrastructure that didn't
>   show the DEL and the preceding character; this may be
>   a way in on a spoofing attack).
>5) What recommendations should the IRI spec (and the specs using
>   IRI++) make re. IRI++. It may be difficult to change
>   some of the XML specs to suddenly disallow arbitrary control
>   characters (in XML 1.0 at least we are mostly safe anyway,
>   but not in XML 1.1). But having some strong recommendations against
>   such use in the IRI spec (and if necessary also in the specs
>   that use IRI++, although ideally, the IRI spec should take
>   care of that) may be very appropriate for interoperability
>   and security reasons.
>
>   In that sense, the labeling of the current draft as a BCP
>   (Best Current Practice) seems questionable. Allowing spaces
>   and some symbols may be extremely convenient in some contexts
>   (e.g. hand-authoring XPointers), but it shouldn't be sold
>   as the ideal thing to do.
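
[Note: Points 3) and 4) in the list above can be made concrete with a small
Python sketch. All names and example identifiers are hypothetical.
escape_spaces is a deliberately tiny stand-in for full escaping, and
find_control_characters is just one possible check, not something taken
from the draft.]

```python
def escape_spaces(identifier: str) -> str:
    """Percent-escape spaces only (a stand-in for full escaping)."""
    return identifier.replace(" ", "%20")


def find_control_characters(identifier: str) -> list:
    """Return (index, code point) pairs for ASCII control characters,
    including DEL (7F hexadecimal)."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(identifier)
        if ord(ch) < 0x20 or ord(ch) == 0x7F
    ]


with_space = "http://example.org/my ns"
escaped = "http://example.org/my%20ns"

# Point 3: character-by-character comparison treats these as different,
# although escaping the space makes them the same string.
print(with_space == escaped)
print(escape_spaces(with_space) == escaped)

# Point 4: a DEL hidden in the host part is easy to detect mechanically,
# even if a display would not show it.
print(find_control_characters("http://example.com\x7fm/"))
```
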
>
>
>Hope this is enough material for today.       Regards,    Martin.
>
>#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
>#-#-#  http://www.sw.it.aoyama.ac.jp      mailto:duerst@it.aoyama.ac.jp    



Received on Wednesday, 30 May 2007 01:45:54 UTC