Re: URI Generic Syntax doc (draft-fielding-uri-rfc2396bis-03) from Roy T. Fielding on 2004-02-07 (uri@w3.org from February 2004)

From: Roy T. Fielding <fielding@gbiv.com>
Date: Sat, 7 Feb 2004 01:08:48 -0800
To: John C Klensin <klensin@jck.com>
Cc: uri@w3.org
Message-Id: <3D4F25B4-594D-11D8-92BD-000393753936@gbiv.com>

On Wednesday, July 9, 2003, at 12:04  PM, John C Klensin wrote:
> The document is a considerable improvement over RFC 2396, but I've 
> ended up with two major problems and a few nits.
>
> (1) There are a number of places in which the document seems to go to 
> such efforts to be general and to avoid over-constraining particular 
> URI schemes that it has achieved a level of abstraction 
> indistinguishable from incomprehensibility and, occasionally, internal 
> contradictions.   Examples below, but I think either some rewriting or 
> _very_ careful consistency review is needed, if not both.

In almost all of the cases you note, this has been due to a tension
between wishing to remain consistent with prior UR* specifications
and yet reflect how these things have actually been implemented in
the real world.  This has been a particular problem with reserved
characters.  I agree that the balance isn't working, so I have rewritten
the sections such that all of the reserved characters are described
as delimiters (assigned or available to be assigned by schemes) and
all of the unreserved characters are always data.

> (2) The problem I think we got into with MAILTO, and perhaps with 
> other URI schemes, is that it is tempting to refer to a generic URI 
> document and say, about syntax and escaping, "do what it says there".  
> Unfortunately, what this says is very general and non-specific, and 
> some of the terms don't mean quite what one would assume on casual 
> reading.   I believe the document would benefit significantly from a 
> short section titled, e.g., "Specification Requirements for URI 
> schemes" and that would then include, in very specific terms, a list 
> of things that a URI scheme description/ standard must specify.  I 
> would expect that list to include an exact list of characters that 
> must be escaped within the context of that scheme.

I believe that is the role of RFC 2718 and BCP 35 (RFC 2717), which
will be revised as soon as I get this one off my plate.

> (3) Details...
>
> (i) Section 2.1.  I understand, I think, the reasoning behind the 
> "maybe it is ASCII and maybe it is not" language here.  But, if URI 
> appears in machine-readable form, and the scheme name is not (or might 
> not be) in ASCII (or any other pre-specified character set), how is a 
> URI parser or other processor to recognize it?   Put differently, 
> there is a bootstrapping problem: one must know the character set of 
> the scheme name before one can figure out how to parse or process 
> anything else. I might be wrong about this but, if I am, this section 
> needs a bit more explanation.

I think it is generally true that one must know the character encoding
of any document before one can process it (or at least have a defined
mechanism for discovering the character encoding prior to reaching the
content).
I have added: "When a URI appears in a protocol element, the character
encoding is defined by that protocol; absent such a definition, the
URI is assumed to be encoded in the same character encoding as the
surrounding text."
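
As a rough illustration of the bootstrapping point, the Python sketch
below encodes the same scheme prefix under three character encodings.
The sample string, the helper names, and the choice of codecs (ASCII,
UTF-16LE, and the EBCDIC code page cp037) are assumptions made for this
example and do not come from the draft.

```python
# Illustration only: the same five characters yield different octets
# depending on the character encoding of the surrounding text.
SCHEME_PREFIX = "http:"  # sample text chosen for this sketch


def octets(text: str, encoding: str) -> bytes:
    """Return the octets that represent `text` in `encoding`."""
    return text.encode(encoding)


def has_ascii_scheme(data: bytes) -> bool:
    """Naive check that only works if `data` is ASCII-compatible."""
    return data.startswith(b"http:")


ascii_octets = octets(SCHEME_PREFIX, "ascii")
utf16_octets = octets(SCHEME_PREFIX, "utf-16-le")
ebcdic_octets = octets(SCHEME_PREFIX, "cp037")

for name, data in [("ascii", ascii_octets),
                   ("utf-16-le", utf16_octets),
                   ("cp037", ebcdic_octets)]:
    # Print the codec name, the raw octets in hex, and the naive result.
    print(name, data.hex(), has_ascii_scheme(data))
```

Only the ASCII octets pass the naive check; the other two decode to the
same characters once the correct encoding is known.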

> (ii) Section 2.2.   Normally, "reserved" means "always", and "can't be 
> used for anything else".  It isn't the meaning here (or actually, is 
> partially the meaning).  Things would be much more clear if the 
> production/definition were broken up into
>
>  reserved = Subcomponent-Delimiter-Role /
>             Other-Often-Reserved
>  Subcomponent-Delimiter-Role = "/" / "?" / "#"
>      (and colon (":") ???)
>  Other-Often-Reserved = <the rest of the list>
>
> Some small rearrangement of the paragraphs below would then make 
> things much more clear.

Done, though I use the names gen-delims and sub-delims.  I have also
moved the unsafe mark characters to the reserved set, since that is
how they are used in practice and the source of most of the confusion.
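
For readers who want to see the split concretely, here is a small
Python sketch that classifies characters.  The character sets are the
ones in RFC 3986 (the document this draft eventually became), so they
may not match the draft revision discussed here exactly; the function
name classify is just a label for this example.

```python
import string

# Sets as listed in RFC 3986, sections 2.2 and 2.3.  Assumption: the
# draft revision under discussion may have differed in detail.
GEN_DELIMS = set(":/?#[]@")
# Includes "!", "*", "'", "(" and ")", which RFC 2396 listed as "mark"
# (unreserved) characters.
SUB_DELIMS = set("!$&'()*+,;=")
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def classify(ch: str) -> str:
    """Return the syntactic class of a single character."""
    if ch in GEN_DELIMS:
        return "gen-delim"
    if ch in SUB_DELIMS:
        return "sub-delim"
    if ch in UNRESERVED:
        return "unreserved"
    # Anything else has to be percent-encoded to appear in a URI.
    return "other"


for ch in "/a~b?x=1 ":
    print(repr(ch), classify(ch))
```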

> Also, "URI's origin" should be precisely defined somewhere (it isn't 
> in the index).  A naive reader could interpret the term as either "the 
> definition of the URI type/scheme" or "the author/ process that 
> produces some particular URI instance".  A similar comment applies to 
> "URI creator" which appears at the end of section 3.2.2.

Replaced with "the implementation-specific syntax of a URI's
dereferencing algorithm".

> (iii) Section 2.5. As with "reserved", "excluded" doesn't seem to have 
> its normal English meaning of absolutely forbidden.  I read this 
> section to say that "excluded" characters should be avoided if 
> possible and escaped otherwise.  If that is the intended meaning, it 
> should appear in so many words.    But I'm not sure it is.  For 
> example, the exclusion of characters outside the ASCII range would 
> appear to prohibit UTF-8, even in %-encoded form, and, given other 
> text in the document, that clearly is not the intent.

Yes, I'll work on a better way of describing excluded glyphs rather
than excluded characters.
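
To make the %-encoded UTF-8 case from the comment above concrete, a
minimal Python sketch follows.  It uses urllib.parse from the standard
library; the sample word and the choice of UTF-8 (taken from the quoted
comment) are assumptions for the example.

```python
from urllib.parse import quote, unquote

# Sample data containing characters outside the ASCII range.
original = "résumé"

# Percent-encode the UTF-8 octets of every character that is not
# unreserved.  safe="" means no reserved character is left unescaped.
encoded = quote(original, safe="", encoding="utf-8")

# The encoded form consists of ASCII characters only.
ascii_only = all(ord(c) < 128 for c in encoded)

# Decoding with the same encoding recovers the original text.
decoded = unquote(encoded, encoding="utf-8")

print(encoded, ascii_only, decoded == original)
```

The non-ASCII characters never appear literally in the result; only
their escaped octets do.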

> (iv) Section 3 introduction.  The first sentence lists a "path" 
> component.  There is no "path" component in the syntax productions, 
> although I assume that "hier-part" is more or less the same thing.  
> And then the next sentence says that the "path components" (not 
> plural) is required.  Either more explanation is needed here or 
> "hier-path" should be renamed to "path-component" or equivalent.  It 
> would also be useful to explicitly note that the productions for 
> scheme, authority, etc. are defined in subsequent subsections.
>
> The last sentence of the second paragraph says "a non-hierarchical 
> path will be treated as opaque data...".  But, from the productions, 
> there appears to be no such thing as a "non-hierarchical path".

These will be fixed before the next revision due to other comments
received.

> (v) Section 3.2.2 and IPv6.  I don't know if there is a future version 
> of IP beyond v6, but please don't dig us into a corner by having only 
> the IPv4 and IPv6 forms and no way to move beyond that.  Consider the 
> RFC 2821 solution, in which address literals for other than the 
> (historical) IPv4 ones must be explicitly identified by a 
> protocol-specific keyword.

Unfortunately, I have no control over the syntax supplied and
implemented by the IPv6 working groups, and it is difficult for me to
invent one and also provide the implementations necessary for a Draft
Standard status.

> In the fourth paragraph, please insert "and should be followed by one" 
> before "if it is necessary to distinguish".  I.e., the trailing "." is 
> always permitted but, if there is any question about whether the 
> domain is an FQDN or a fragment of some sort, it should (or even must) 
> be present.

Done.

> (vi) Section 3.3.  The distinction between "path" and "authority" 
> needs to be more clearly drawn.  This section clearly defines mailto 
> as using a path, but the authority discussion and syntax in 3.2 might 
> be construed as having mailto consisting of an empty opaque path and 
> an authority.  Since the syntax for net-path in 3 seems to suggest 
> that, if it were an authority, mailto would have to be
>  mailto://fred@example.com rather than
>  mailto:fred@example.com
> I think the text is correct and consistent.  But it is exceptionally 
> confusing.

I will try to make it less confusing with the next revision due to the
other changes with path.
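
In the meantime, the generic split can be checked mechanically.  The
Python sketch below applies the regular expression from appendix B of
RFC 2396 to the two mailto forms quoted above; the function name
split_uri and the dictionary layout are just choices made for this
example.

```python
import re

# Regular expression from RFC 2396, appendix B, for breaking a URI
# reference into its components.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri: str) -> dict:
    """Split `uri` into the five generic components.

    A component that is absent is reported as None; the path is always
    present but may be the empty string.
    """
    m = URI_RE.match(uri)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


for uri in ("mailto:fred@example.com", "mailto://fred@example.com"):
    print(uri, split_uri(uri))
```

Without the "//", the whole of fred@example.com lands in the path and
there is no authority; with it, the same text becomes the authority and
the path is empty.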

> (vii) Section 4.5 on Suffix References.  We all know this practice is 
> common.  We also know that it leads to trouble, especially when 
> "heuristics change over time".   I think the section should be a bit 
> more clear about the problems, and then clearly recommend avoiding 
> these if possible, rather than circling around the issues.
> It should also note that the "suffix" often contains only part of a 
> DNS hostname ("foo", in the expectation that the processor will turn
> it into "http://www.foo.com" or something equivalent), and that form
> is very high-risk behavior.  See RFC1535 and/or for discussion of some of
> the downsides of these games.

I have changed it to:

While this practice of using suffix references is common, it should
be avoided whenever possible and never used in situations where
long-term references are expected.  The heuristics noted above will
change over time, particularly when new URI schemes are introduced,
and are often incorrect when used out of context.  Furthermore, they
can lead to security issues along the lines of those described in
<xref target="RFC1535"/>.

> (viii) Section 7.5.  But this is where a careful distinction between 
> "authority" and "path" becomes important, along with clarity about 
> types of "reserved" characters.  An only slightly confused reader 
> could conclude that it was possible to define a MIXER URI type in which
>  mixer:/I=J/S=Linnimouth/GQ=5/@Marketing.Widget.COM
> was valid.  I don't think it is, but it takes much too long to prove 
> this from 2396bis.

That is a valid URI.  It is a weird one, but I've seen worse,
unfortunately.
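
One way to check a URI like this without reading the grammar by hand is
sketched below in Python.  The pattern follows the path-absolute and
pchar rules as published in RFC 3986, which may differ in detail from
the draft revision discussed here, and it only covers URIs with a
scheme, no authority, and no query or fragment.  The name
is_path_only_uri is hypothetical.

```python
import re

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
# (RFC 3986, section 3.3; assumption: the draft may differ slightly).
PCHAR = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})"

# scheme ":" path-absolute, where
# path-absolute = "/" [ segment-nz *( "/" segment ) ]
# The first segment must be non-empty, so "//" (an authority) is rejected.
PATH_ONLY_URI = re.compile(
    rf"[A-Za-z][A-Za-z0-9+.\-]*:/(?:{PCHAR}+(?:/{PCHAR}*)*)?"
)


def is_path_only_uri(uri: str) -> bool:
    """True if `uri` is a scheme followed by a valid absolute path."""
    return PATH_ONLY_URI.fullmatch(uri) is not None


mixer = "mixer:/I=J/S=Linnimouth/GQ=5/@Marketing.Widget.COM"
print(is_path_only_uri(mixer))
```

The "=" and "@" characters are both permitted inside path segments,
which is why the example passes; a literal space would not.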

Cheers,

Roy T. Fielding                            <http://roy.gbiv.com/>
Chief Scientist, Day Software              <http://www.day.com/>
Received on Saturday, 7 February 2004 04:09:04 UTC