- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Wed, 30 May 2007 10:42:01 +0900
- To: public-iri@w3.org
- Cc: Richard Ishida <ishida@w3.org>, Felix Sasaki <fsasaki@w3.org>, Norman Walsh <Norman.Walsh@Sun.COM>, Richard Tobin <richard@inf.ed.ac.uk>, Michel Suignard <michelsu@microsoft.com>, Lisa Dusseault <lisa@osafoundation.org>, Chris Newman <Chris.Newman@Sun.COM>
[This email was a contribution to a private discussion,
but some participants have requested that it be posted
in a public location so that discussion can continue in
public. The participants have agreed that the email can
be made public with a few omissions.
I'm forwarding the email below to public-iri@w3.org,
because it is very related to IRIs.]
>Date: Wed, 02 May 2007 18:14:32 +0900
>To: "Richard Ishida" <ishida@w3.org>
>From: Martin Duerst <duerst@it.aoyama.ac.jp>
>Subject: Re: HRRIs, IRIs, etc
>Hello Richard,
>
>Many thanks for your mail.
[...]
>At 19:21 07/05/01, Richard Ishida wrote:
>>Hi Martin,
>>
>>There is a discussion about to take place
[...]
>>about how specs should define "protocol or format
>>elements ...which are to be interpreted as URI references". Part of that
>>discussion will include HRRIs[1]. Do you have any particular thoughts on
>>the usefulness or not or issues with HRRI's?
>>
>>Cheers,
>>RI
>>
>>[1] http://www.w3.org/XML/2007/04/hrri/draft-walsh-tobin-hrri-00.html
[Note: currently at http://www.ietf.org/internet-drafts/draft-walsh-tobin-hrri-01.txt]
>I think that there are definitely some issues that need to be addressed;
>I believe I already pointed that out several months ago (I can
>try to dig up the email if needed).
>
>I think the reasons for the current state are mainly two:
>a) Difficulty of coordinating across a very large range of
> organizations/groups/people (IETF, W3C with many groups and participants)
>b) Long overall timeline which led to ad-hoc solutions.
>
>First, to recap the history, things started out when approximate
>descriptions of what later were to be called IRIs were put into
>the XML spec. This originally came from HTML4, where it was worded
>as an error treatment:
>http://www.w3.org/TR/html401/appendix/notes.html#h-B.2.1
>(B.2.1 Non-ASCII characters in URI attribute values)
>
>
>Looking back at the first edition of the XML REC, it seems to me
>actually rather clear that originally, spaces were not considered:
>http://www.w3.org/TR/1998/REC-xml-19980210#sec-external-ent
>
> An XML processor should handle a non-ASCII character in a URI by
> representing the character in UTF-8 as one or more bytes, and then
> escaping these bytes with the URI escaping mechanism (i.e., by
> converting each byte to %HH, where HH is the hexadecimal notation
> of the byte value).
>
>Note the use of the term "non-ASCII", which I take to exclude spaces.
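[Note: a minimal Python sketch of the first-edition rule quoted above. The
function name and the example URI are hypothetical and chosen for
illustration only. It escapes non-ASCII characters and nothing else, so a
space passes through unchanged.]

```python
def escape_non_ascii(uri: str) -> str:
    """Escape only non-ASCII characters as UTF-8 bytes in %HH form."""
    out = []
    for ch in uri:
        if ord(ch) < 128:
            # Every ASCII character, including space, is left untouched.
            out.append(ch)
        else:
            # One %HH triplet per UTF-8 byte of the character.
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


print(escape_non_ascii("http://example.org/r\u00e9sum\u00e9 1.xml"))
# prints http://example.org/r%C3%A9sum%C3%A9 1.xml
```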
>
>The above was changed in the second edition of XML, as follows:
>http://www.w3.org/TR/2000/REC-xml-20001006#sec-external-ent
>
> URI references require encoding and escaping of certain characters.
> The disallowed characters include all non-ASCII characters, plus the
> excluded characters listed in Section 2.4 of [IETF RFC 2396], except
> for the number sign (#) and percent sign (%) characters and the square
> bracket characters re-allowed in [IETF RFC 2732]. Disallowed characters
> must be escaped as follows: ...
>
>For your reference, the excluded characters in RFC 2396 are as follows:
> control = <US-ASCII coded characters 00-1F and 7F hexadecimal>
> space = <US-ASCII coded character 20 hexadecimal>
> delims = "<" | ">" | "#" | "%" | <">
> unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
>
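[Note: a sketch of the second-edition character set, built from the
RFC 2396 list just quoted. The escaping step itself is elided in the
quotation above; this sketch assumes it is the same UTF-8 plus %HH
mechanism as in the first edition. All names are hypothetical.]

```python
# RFC 2396 excluded characters, as listed in the quoted grammar.
CONTROL = {chr(c) for c in range(0x00, 0x20)} | {"\x7f"}
SPACE = {" "}
DELIMS = set('<>#%"')
UNWISE = set("{}|\\^[]`")
EXCLUDED_2396 = CONTROL | SPACE | DELIMS | UNWISE

# XML second edition: excluded characters except '#', '%', '[' and ']'.
DISALLOWED_ASCII = EXCLUDED_2396 - set("#%[]")


def escape_xml2e(ref: str) -> str:
    """Escape non-ASCII and disallowed ASCII characters (assumed %HH form)."""
    out = []
    for ch in ref:
        if ord(ch) > 0x7F or ch in DISALLOWED_ASCII:
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


print(escape_xml2e("a b#c[1]%20<x>"))
# prints a%20b#c[1]%20%3Cx%3E
```

Unlike the first-edition rule, a space is now escaped, while '#', '%' and
the square brackets are left alone.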
>The erratum responsible for this change, at
>http://www.w3.org/XML/xml-19980210-errata#E78, points to
>http://www.w3.org/International/Group/issues/xml/#uris.percent.hh
>for motivation. There we find a reference to "equivalent text in
>the Character Model". A pointer to the Charmod WD around the time
>of this erratum is http://www.w3.org/TR/1999/WD-charmod-19991129/#URIs,
>where we indeed can find mention of characters excluded in URIs,
>as defined in RFC 2396.
>
>E78 overrides an earlier erratum:
>http://www.w3.org/XML/xml-19980210-errata#E49 which has a pointer
>to XML Syntax WG minutes at
>http://lists.w3.org/Archives/Member/w3c-xml-syntax-wg/1999May/0026.html
>for motivation; digging around, it looks like the correct minutes might
>be at
>http://lists.w3.org/Archives/Member/w3c-xml-syntax-wg/1999May/0020.html,
>but there, E49 seems to be about something else. There is a supporting
>note which is slightly more consistent, but doesn't provide much more info:
>http://lists.w3.org/Archives/Member/w3c-xml-syntax-wg/1999May/0021.html
>Anyway, the rationale here reads:
> Original only discussed non-ASCII characters,
> include case of reserved characters.
>
>We could probably dig deeper around that time to find out exactly
>what motivated the blanket inclusion of all these ASCII characters, but my
>recollection from that time (vague at best) is probably best summarized
>by "we thought it would be nice to have".
>
>
>What I also remember in this context is an extensive debate on
>the w3c-xml-plenary list (see start at
>http://lists.w3.org/Archives/Member/w3c-xml-plenary/2003Feb/0001.html),
>and a BOF at the IETF in San Francisco (apparently shortly after),
>see http://lists.w3.org/Archives/Public/uri/2003Mar/0043.html.
>The former didn't lead to any conclusion, the latter led to very
>clear input that spaces (at least in principle) should be avoided.
>After forming a dedicated mailing list (see announcement e.g.
>at http://lists.w3.org/Archives/Member/w3c-xml-plenary/2003Apr/),
>the issue of additional characters was the first issue listed
>(see http://www.w3.org/International/iri-edit/#iadditional-00).
>
>
>Please also note that the current IRI spec doesn't completely
>disallow IRIs with spaces and such; RFC 3987 contains the following
>paragraph:
> Systems accepting IRIs MAY also deal with the printable characters in
> US-ASCII that are not allowed in URIs, namely "<", ">", '"', space,
> "{", "}", "|", "\", "^", and "`", in step 2 above. If these
> characters are found but are not converted, then the conversion
> SHOULD fail. Please note that the number sign ("#"), the percent
> sign ("%"), and the square bracket characters ("[", "]") are not part
> of the above list and MUST NOT be converted. Protocols and formats
> that have used earlier definitions of IRIs including these characters
> MAY require percent-encoding of these characters as a preprocessing
> step to extract the actual IRI from a given field. This
> preprocessing MAY also be used by applications allowing the user to
> enter an IRI.
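[Note: a sketch of the preprocessing that the quoted paragraph permits.
The function name and the "convert" flag are hypothetical. With
convert=True the ten listed characters are percent-encoded; with
convert=False their presence makes the conversion fail. The number sign,
percent sign and square brackets are never touched.]

```python
# The printable US-ASCII characters named in the quoted paragraph.
EXTRA_PRINTABLE = set('<>" {}|\\^`')


def preprocess(field: str, convert: bool = True) -> str:
    """Percent-encode the ten listed characters, or fail if not converting."""
    out = []
    for ch in field:
        if ch in EXTRA_PRINTABLE:
            if not convert:
                # Found but not converted: the conversion should fail.
                raise ValueError(f"character {ch!r} is not allowed in a URI")
            out.append(f"%{ord(ch):02X}")
        else:
            # Includes '#', '%', '[' and ']', which must not be converted.
            out.append(ch)
    return "".join(out)


print(preprocess('xpointer(id("a b"))#frag[1]'))
# prints xpointer(id(%22a%20b%22))#frag[1]
```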
>
>When comparing the list of characters above with the previous list,
>please note that some of the differences (e.g. for '['/']') are
>due to changes in RFC 3986 (URI spec). The other difference is the
>absence of control characters; these were excluded because anything
>besides CR/LF/TAB is almost totally silly and even CR/LF/TAB are
>extremely prone to data loss (e.g. attribute value normalization)
>and misinterpretation (try putting a TAB on the side of a bus).
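[Note: the comparison of the two lists can be checked mechanically. This
sketch, with hypothetical names, subtracts the list in the RFC 3987
paragraph from the RFC 2396 excluded characters quoted earlier. What
remains is the control characters plus the four characters that are
handled separately in both texts.]

```python
CONTROL = {chr(c) for c in range(0x20)} | {"\x7f"}

# RFC 2396 excluded: control, space, delims and unwise.
EXCLUDED_2396 = CONTROL | set(' <>#%"{}|\\^[]`')

# Characters that RFC 3987 says systems MAY deal with.
MAY_CONVERT_3987 = set('<>" {}|\\^`')

only_in_2396 = EXCLUDED_2396 - MAY_CONVERT_3987
printable_diff = sorted(only_in_2396 - CONTROL)

print(printable_diff)                      # ['#', '%', '[', ']']
print(CONTROL <= only_in_2396)             # True: no control character carried over
print(MAY_CONVERT_3987 <= EXCLUDED_2396)   # True: nothing new was added
```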
>
>
>So what we have got here is a mess, and we have to find a way out
>of it. Giving things a name is definitely a good idea. The question
>is where to do that, and what name to give it.
>
>Doing it as a separate draft is a good first step; that's the
>way the IETF works. Obviously, in that case, the draft should
>actually be submitted to the IETF asap, and not just put up at
>the W3C, so that it can be widely discussed.
[Note: The draft has been published in the meantime, see above.]
>I also suggest sending
>a note to public-iri@w3.org and other related mailing lists.
>
>However, if possible, this should be integrated into the IRI spec,
>because the IRI spec already contains the relevant provision, which
>just needs to be labeled and maybe tweaked (re. control characters).
>The IRI spec needs to be updated anyway in order to move to Draft
>Standard, and these changes shouldn't affect that move, because
>labeling something doesn't really change the spec, and clarifying
>exactly which characters are affected also shouldn't be a problem
>because on the XML side, we should have quite a few implementations
>anyway, and that's what counts for draft standard.
>
>
>As for the name, I'm rather sceptical about "Human Readable
>Resource Identifiers". My reasons are twofold:
>1) It contains a lot of judgement, somehow suggesting that
> it's better than what we had up to now. "Why don't we use
> Human Readable Resource Identifiers" will become difficult
> to answer even if the technical reasons (such as not having
> the luxury of the protective XML escaping/delimiting environment)
> are totally obvious.
>2) I'm doubtful that it's actually accurate, and somewhat worried
> that people may request even more. Adding spaces and a few
> symbols in some cases can aid readability quite a bit, the
> case in point being structured XPointers. On the other hand,
> things such as "http://" are inherently incomprehensible to
> a lot of people, and the rest of the URI/IRI/whatever structure
> is also not common knowledge for the everyday user.
> People may ask "On this occasion, why don't we allow translations
> for scheme (protocol) names?", or "Http URIs often get used without
> the scheme part, why don't we also allow that?", or "Well, I have
> this idea for a much better, more humanly readable scheme/mechanism,
> why don't we use that one?" and so on.
>
>Choosing a more neutral name (my working name today is
>"IRI with space", but I'm sure there are better names, another
>that I just came up with, which may at least serve as a good
>working acronym, is IRI++) may help discussions to move forward
>quite a bit more smoothly.
>
>
>Besides the two big issues of where to put this and how to name it,
>there are other important issues, namely, in no particular order
>and without any guarantee to be complete:
>1) What specs in the XML area currently use IRI++, and where
> do they differ (they may not differ, but then again, they may,
> in subtle or not so subtle ways).
>2) What other W3C specs currently use IRI++, in exactly what definition.
> Definitely RDF does, but SPARQL on my last reading excluded spaces
> and friends. What about HTML4? CSS?... What about other specs
> (e.g. Atom,...)
>
> One goal of the above two points is to make sure we don't
> create a new spec with a new term only to realize later that
> we needed two or more such terms.
>
>3) What issues are raised by the fact that namespaces and RDF
> use character-by-character comparison, rather than a comparison
> higher up on the ladder.
>4) What additional security issues are raised. The current draft
> to me looks seriously incomplete e.g. on issues related to
> control characters (imagine including a DEL or two at
> crucial places, and a display infrastructure that didn't
> show the DEL and the preceding character; this may be
> a way in on a spoofing attack).
>5) What recommendations should the IRI spec (and the specs using
> IRI++) make re. IRI++. It may be difficult to change
> some of the XML specs to suddenly disallow arbitrary control
> characters (in XML 1.0 at least we are mostly safe anyway,
> but not in XML 1.1). But having some strong recommendations against
> such use in the IRI spec (and if necessary also in the specs
> that use IRI++, although ideally, the IRI spec should take
> care of that) may be very appropriate for interoperability
> and security reasons.
>
> In that sense, the labeling of the current draft as a BCP
> (Best Current Practice) seems questionable. Allowing spaces
> and some symbols may be extremely convenient in some contexts
> (e.g. hand-authoring XPointers), but it shouldn't be sold
> as the ideal thing to do.
>
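[Note: points 3 and 4 above can be illustrated with a small sketch. All
names and example identifiers are hypothetical. The first function is the
character-by-character comparison used for namespace names; the second
reports control characters (00-1F and 7F) that a display might hide.]

```python
def same_by_characters(a: str, b: str) -> bool:
    """Compare two identifiers character by character, with no normalization."""
    return a == b


def control_chars(s: str) -> list[str]:
    """List control characters (00-1F and 7F hexadecimal) found in s."""
    return [f"U+{ord(c):04X}" for c in s if ord(c) < 0x20 or ord(c) == 0x7F]


# A literal space and its percent-encoded form are different strings.
print(same_by_characters("http://example.org/my doc",
                         "http://example.org/my%20doc"))   # False

# So are two escapes that differ only in hex-digit case.
print(same_by_characters("http://example.org/%c3%a9",
                         "http://example.org/%C3%A9"))     # False

# Two DEL characters that a display might silently drop.
print(control_chars("http://example.org/login\x7f\x7f"))   # ['U+007F', 'U+007F']
```

Under character-by-character comparison, an identifier written with
spaces and the same identifier after percent-encoding name two different
things, so a spec using IRI++ has to say which form is compared.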
>
>Hope this is enough material for today. Regards, Martin.
>
>#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
>#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
Received on Wednesday, 30 May 2007 01:45:54 UTC