URI Generic Syntax doc (draft-fielding-uri-rfc2396bis-03) from by way of Martin Duerst on 2003-07-09 (uri@w3.org from July 2003)

From: by way of Martin Duerst <klensin@jck.com>
Date: Wed, 09 Jul 2003 15:04:43 -0400
To: uri@w3.org
Message-Id: <4.2.0.58.J.20030709150430.00a8d210@localhost>
Hi.

As a result of still being troubled by the "which characters really need to 
be escaped in MAILTO" problem which some of you may have heard discussed, 
I've made a careful pass through this document.  As a reader who has not 
been active in its development, I may be bringing a bit of a fresh look to it.

The document is a considerable improvement over RFC 2396, but I've ended up 
with two major problems and a few nits.

(1) There are a number of places in which the document seems to go to such 
efforts to be general and to avoid over-constraining particular URI schemes 
that it has achieved a level of abstraction indistinguishable from 
incomprehensibility and, occasionally, internal contradictions.   Examples 
below, but I think either some rewriting or _very_ careful consistency 
review is needed, if not both.

(2) The problem I think we got into with MAILTO, and perhaps with other URI 
schemes, is that it is tempting to refer to a generic URI document and say, 
about syntax and escaping, "do what it says there".  Unfortunately, what 
this says is very general and non-specific, and some of the terms don't 
mean quite what one would assume on casual reading.   I believe the 
document would benefit significantly from a short section titled, e.g., 
"Specification Requirements for URI schemes" and that would then include, 
in very specific terms, a list of things that a URI scheme description/ 
standard must specify.  I would expect that list to include an exact list 
of characters that must be escaped within the context of that scheme.

(3) Details...

(i) Section 2.1.  I understand, I think, the reasoning behind the "maybe it 
is ASCII and maybe it is not" language here.  But, if URI appears in 
machine-readable form, and the scheme name is not (or might not be) in 
ASCII (or any other pre-specified character set), how is a URI parser or 
other processor to recognize it?   Put differently, there is a 
bootstrapping problem: one must know the character set of the scheme name 
before one can figure out how to parse or process anything else. I might be 
wrong about this but, if I am, this section needs a bit more explanation.

(ii) Section 2.2.   Normally, "reserved" means "always", and "can't be used 
for anything else".  It isn't the meaning here (or actually, is partially 
the meaning).  Things would be much more clear if the production/definition 
were broken up into

  reserved = Subcomponent-Delimiter-Role /
             Other-Often-Reserved
  Subcomponent-Delimiter-Role = "/" / "?" / "#"
      (and colon (":") ???)
  Other-Often-Reserved = <the rest of the list>

Some small rearrangement of the paragraphs below would then make things 
much more clear.

Also, "URI's origin" should be precisely defined somewhere (it isn't in the 
index).  A naive reader could interpret the term as either "the definition 
of the URI type/scheme" or "the author/ process that produces some 
particular URI instance".  A similar comment applies to "URI creator" which 
appears at the end of section 3.2.2.

(iii) Section 2.5. As with "reserved", "excluded" doesn't seem to have its 
normal English meaning of absolutely forbidden.  I read this section to say 
that "excluded" characters should be avoided if possible and escaped 
otherwise.  If that is the intended meaning, it should appear in so many 
words.    But I'm not sure it is.  For example, the exclusion of characters 
outside the ASCII range would appear to prohibit UTF-8, even in %-encoded 
form, and, given other text in the document, that clearly is not the intent.

(iv) Section 3 introduction.  The first sentence lists a "path" 
component.  There is no "path" component in the syntax productions, 
although I assume that "hier-part" is more or less the same thing.  And 
then the next sentence says that the "path components" (not plural) is 
required.  Either more explanation is needed here or "hier-path" should be 
renamed to "path-component" or equivalent.  It would also be useful to 
explicitly note that the productions for scheme, authority, etc. are 
defined in subsequent subsections.

The last sentence of the second paragraph says "a non-hierarchical path 
will be treated as opaque data...".  But, from the productions, there 
appears to be no such thing as a non-hierarchical path".

(v) Sectin 3.2.2 and IPv6.  I don't know if there is a future version of IP 
beyond v6, but please don't dig us into a corner by having only the IPv4 
and IPv6 forms and no way to move beyond that.  Consider the RFC 2821 
solution, in which address literals for other than the (historical) IPv4 
ones must be explicitly identified by a protocol-specific keyword.

In the fourth paragraph, please insert "and should be followed by one" 
before "if it is necessary to distinguish".  I.e., the trailing "." is 
always permitted but, if there is any question about whether the domain is 
an FQDN or a fragment of some sort, it should (or even must) be present.

(vi) Section 3.3.  The distinction between "path" and "authority" needs to 
be more clearly drawn.  This section clearly defines mailto as using a 
path, but the authority discussion and syntax in 3.2 might be construed as 
having mailto consisting of an empty opaque path and an authority.  Since 
the syntax for net-path in 3 seems to suggest that, if it were an 
authority, mailto would have to be
  mailto://fred@example.com rather than
  mailto:fred@example.com
I think the text is correct and consistent.  But it is exceptionally confusing.

(vii) Section 4.5 on Suffix References.  We all know this practice is 
common.  We also know that it leads to trouble, especially when "heuristics 
change over time".   I think the section should be a bit more clear about 
the problems, and then clearly recommend avoiding these if possible, rather 
than circling around the issues.
It should also note that the "suffix" often contains only part of a DNS 
hostname ("foo" in the expectation that the processor will turn it into 
"http://www.foo.com" or something equivalent and that form is very 
high-risk behavior.  See RFC1535 and/or for discussion of some of the 
downsides of these games.

(viii) Section 7.5.  But this is where a careful distinction between 
"authority" and "path" becomes important, along with clarity about types of 
"reserved" characters.  An only slightly confused reader could conclude 
that it was possible to define a MIXER URI type in which
  mixer:/I=J/S=Linnimouth/GQ=5/@Marketing.Widget.COM
Was valid.  I don't think it is, but it takes much too long to prove this 
from 2396bis.

I have skimmed, but often not studied, sections of the document not 
referenced above, so this should not be taken as a comprehensive review.

regards,
    john
Received on Wednesday, 9 July 2003 15:07:41 UTC