Re: URI Generic Syntax doc (draft-fielding-uri-rfc2396bis-03) from by way of Martin Duerst on 2004-02-07 (uri@w3.org from February 2004)

From: by way of Martin Duerst <klensin@jck.com>
Date: Sat, 07 Feb 2004 17:31:23 -0500
To: uri@w3.org
Message-Id: <4.2.0.58.J.20040207173110.07787a60@localhost>

Roy,

Thanks.  I think this is very good progress, and I look forward to reading 
the new draft.   A few comments...

(1) If specifying what must be specified in specific URI definitions is 
"the role of RFC 2718 and BCP 35 (RFC 2717)...", which may be quite 
reasonable, I think those docs should be cross-referenced in 2396bis.  A 
sentence like "There are specific requirements that these per-scheme 
characteristics be defined for the schemes, see [RFC2718bis, RFC2717bis]" 
may be a forward pointer but is not normative, since it provides references 
for additional reading about a related topic, rather than information 
needed to understand/implement 2396bis.

(2) The IPv6 syntax issue.  RFC 2821 does not provide a variant/different 
IPv6 syntax (although changes by the IPv6 folks may require some slight 
retuning in 2821bis).  What it does is to specify that, if the addressing 
scheme isn't IPv4, then the scheme must be tagged with what it is, rather 
than deduced by heuristics on the address syntax chosen.  Think of it as a 
wrapper around whatever the IPv6 folks specify, rather than a different 
syntax.  On that basis, I think you can do exactly the same thing and that 
a "no heuristics" or at least "no heuristics if there is any possible 
alternative" principle will generally serve URIs and the web well. If, due 
to other pressures and considerations, you need to permit IPv6 addresses 
without any qualification, it seems to me that it would be useful to 
identify an identified/tagged form, make it normative, and then define the 
IPv6 address form without the tag as a permitted abbreviated syntax 
variation.  That at least puts the right future stake in the ground.  That 
stake is necessary if we see URIs outliving the Internet as we know it, and 
some of the other W3C-derived architecture documents and public 
presentations clearly anticipate that.
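
For illustration only, a minimal Python sketch of the "tagged, no 
heuristics" idea, modelled on the RFC 2821 address-literal forms 
"[192.0.2.1]" and "[IPv6:2001:db8::1]".  The function name, the tag 
registry and the error messages are assumptions made for this sketch, 
not anything specified in 2396bis or RFC 2821.

```python
import ipaddress

# Hypothetical registry of address tags; only "IPv6" is modelled here.
# A future address family would be added as a new entry, not guessed at.
TAGGED_PARSERS = {"ipv6": ipaddress.IPv6Address}


def parse_address_literal(literal):
    """Return (tag, address) for a bracketed address literal.

    An untagged literal is accepted only as the historical IPv4 form.
    Anything else must carry an explicit, recognised tag.
    """
    if not (literal.startswith("[") and literal.endswith("]")):
        raise ValueError("address literal must be enclosed in brackets")
    body = literal[1:-1]
    tag, sep, rest = body.partition(":")
    if not sep:
        # No colon at all: treat as the untagged IPv4 form.
        return ("IPv4", ipaddress.IPv4Address(body))
    parser = TAGGED_PARSERS.get(tag.lower())
    if parser is None:
        # Covers both unknown tags and untagged IPv6 such as "[2001:db8::1]".
        raise ValueError(f"unrecognised or missing address tag: {tag!r}")
    return (tag, parser(rest))


if __name__ == "__main__":
    print(parse_address_literal("[192.0.2.1]"))
    print(parse_address_literal("[IPv6:2001:db8::1]"))
```

In this sketch an untagged "[2001:db8::1]" is rejected rather than 
recognised by its shape, which is the point of the wrapper.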

regards,
   john


--On Saturday, 07 February, 2004 01:08 -0800 "Roy T. Fielding" 
<fielding@gbiv.com> wrote:

>On Wednesday, July 9, 2003, at 12:04  PM, John C Klensin wrote:
>>The document is a considerable improvement over RFC 2396, but
>>I've  ended up with two major problems and a few nits.
>>
>>(1) There are a number of places in which the document seems
>>to go to  such efforts to be general and to avoid
>>over-constraining particular  URI schemes that it has
>>achieved a level of abstraction  indistinguishable from
>>incomprehensibility and, occasionally, internal
>>contradictions.   Examples below, but I think either some
>>rewriting or  _very_ careful consistency review is needed, if
>>not both.
>
>In almost all of the cases you note, this has been due to a tension
>between wishing to remain consistent with prior UR* specifications
>and yet reflect how these things have actually been implemented in
>the real world.  This has been a particular problem with reserved
>characters.  I agree that the balance isn't working, so I have
>rewritten the sections such that all of the reserved characters are
>described as delimiters (assigned or available to be assigned by
>schemes) and all of the unreserved characters are always data.
>
>>(2) The problem I think we got into with MAILTO, and perhaps
>>with  other URI schemes, is that it is tempting to refer to a
>>generic URI  document and say, about syntax and escaping, "do
>>what it says there".   Unfortunately, what this says is very
>>general and non-specific, and  some of the terms don't mean
>>quite what one would assume on casual  reading.   I believe
>>the document would benefit significantly from a  short
>>section titled, e.g., "Specification Requirements for URI
>>schemes" and that would then include, in very specific terms,
>>a list  of things that a URI scheme description/ standard
>>must specify.  I  would expect that list to include an exact
>>list of characters that  must be escaped within the context
>>of that scheme.
>
>I believe that is the role of RFC 2718 and BCP 35 (RFC 2717), which
>will be revised as soon as I get this one off my plate.
>
>>(3) Details...
>>
>>(i) Section 2.1.  I understand, I think, the reasoning behind
>>the  "maybe it is ASCII and maybe it is not" language here.
>>But, if URI  appears in machine-readable form, and the scheme
>>name is not (or might  not be) in ASCII (or any other
>>pre-specified character set), how is a  URI parser or other
>>processor to recognize it?   Put differently,  there is a
>>bootstrapping problem: one must know the character set of
>>the scheme name before one can figure out how to parse or
>>process  anything else. I might be wrong about this but, if I
>>am, this section  needs a bit more explanation.
>
>I think it is generally true that one must know the character
>encoding of any document before one can process it (or at least a
>defined mechanism for discovering the character encoding prior to
>reaching the content).
>I have added: "When a URI appears in a protocol element, the
>character encoding is defined by that protocol; absent such a
>definition, the URI is assumed to be encoded in the same character
>encoding as the surrounding text."
>
>>(ii) Section 2.2.   Normally, "reserved" means "always", and
>>"can't be  used for anything else".  It isn't the meaning
>>here (or actually, is  partially the meaning).  Things would
>>be much more clear if the  production/definition were broken
>>up into
>>
>>  reserved = Subcomponent-Delimiter-Role /
>>             Other-Often-Reserved
>>  Subcomponent-Delimiter-Role = "/" / "?" / "#"
>>      (and colon (":") ???)
>>  Other-Often-Reserved = <the rest of the list>
>>
>>Some small rearrangement of the paragraphs below would then
>>make  things much more clear.
>
>Done, though I use the names gen-delims and sub-delims.  I have also
>moved the unsafe mark characters to the reserved set, since that is
>how they are used in practice and the source of most of the
>confusion.
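
For illustration only, a small Python sketch of the delimiter/data 
split described above.  The character sets are an assumption: they 
follow the gen-delims, sub-delims and unreserved sets as later 
published in RFC 3986 and may not match the draft under discussion 
character for character.  The classify() helper is hypothetical.

```python
import string

# Assumed sets, taken from RFC 3986 rather than from the draft itself.
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def classify(ch):
    """Label one character by its role in the generic syntax."""
    if ch in GEN_DELIMS:
        return "gen-delim"    # delimits the generic components
    if ch in SUB_DELIMS:
        return "sub-delim"    # available for schemes to assign as delimiters
    if ch in UNRESERVED:
        return "unreserved"   # always data
    return "other"            # must be percent-encoded to appear in a URI


if __name__ == "__main__":
    for ch in "/?#=;a~ ":
        print(repr(ch), classify(ch))
```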
>
>>Also, "URI's origin" should be precisely defined somewhere
>>(it isn't  in the index).  A naive reader could interpret the
>>term as either "the  definition of the URI type/scheme" or
>>"the author/ process that  produces some particular URI
>>instance".  A similar comment applies to  "URI creator" which
>>appears at the end of section 3.2.2.
>
>Replaced with "the implementation-specific syntax of a URI's
>dereferencing algorithm".
>
>>(iii) Section 2.5. As with "reserved", "excluded" doesn't
>>seem to have  its normal English meaning of absolutely
>>forbidden.  I read this  section to say that "excluded"
>>characters should be avoided if  possible and escaped
>>otherwise.  If that is the intended meaning, it  should
>>appear in so many words.    But I'm not sure it is.  For
>>example, the exclusion of characters outside the ASCII range
>>would  appear to prohibit UTF-8, even in %-encoded form, and,
>>given other  text in the document, that clearly is not the
>>intent.
>
>Yes, I'll work on a better way of describing excluded glyphs rather
>than excluded characters.
>
>>(iv) Section 3 introduction.  The first sentence lists a
>>"path"  component.  There is no "path" component in the
>>syntax productions,  although I assume that "hier-part" is
>>more or less the same thing.   And then the next sentence
>>says that the "path components" (note  plural) is required.
>>Either more explanation is needed here or  "hier-part" should
>>be renamed to "path-component" or equivalent.  It  would also
>>be useful to explicitly note that the productions for
>>scheme, authority, etc. are defined in subsequent subsections.
>>
>>The last sentence of the second paragraph says "a
>>non-hierarchical  path will be treated as opaque data...".
>>But, from the productions,  there appears to be no such thing
>>as a "non-hierarchical path".
>
>These will be fixed before the next revision due to other comments
>received.
>
>>(v) Section 3.2.2 and IPv6.  I don't know if there is a future
>>version  of IP beyond v6, but please don't dig us into a
>>corner by having only  the IPv4 and IPv6 forms and no way to
>>move beyond that.  Consider the  RFC 2821 solution, in which
>>address literals for other than the  (historical) IPv4 ones
>>must be explicitly identified by a  protocol-specific keyword.
>
>Unfortunately, I have no control over the syntax supplied and
>implemented by the IPv6 working groups, and it is difficult for me
>to invent one and also provide the implementations necessary for a
>Draft Standard status.
>
>>In the fourth paragraph, please insert "and should be
>>followed by one"  before "if it is necessary to distinguish".
>>I.e., the trailing "." is  always permitted but, if there is
>>any question about whether the  domain is an FQDN or a
>>fragment of some sort, it should (or even must)  be present.
>
>Done.
>
>>(vi) Section 3.3.  The distinction between "path" and
>>"authority"  needs to be more clearly drawn.  This section
>>clearly defines mailto  as using a path, but the authority
>>discussion and syntax in 3.2 might  be construed as having
>>mailto consisting of an empty opaque path and  an authority.
>>Since the syntax for net-path in 3 seems to suggest  that, if
>>it were an authority, mailto would have to be
>>  mailto://fred@example.com rather than
>>  mailto:fred@example.com
>>I think the text is correct and consistent.  But it is
>>exceptionally  confusing.
>
>I will try to make it less confusing with the next revision due to
>the other changes with path.
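
For illustration only, the path/authority distinction for the two 
mailto forms can be seen with Python's standard urllib.parse.urlsplit. 
This shows how one existing parser splits them; it is not a statement 
of what 2396bis requires, and the describe() helper is hypothetical.

```python
from urllib.parse import urlsplit


def describe(uri):
    """Return (scheme, authority, path) as split by urlsplit."""
    parts = urlsplit(uri)
    return (parts.scheme, parts.netloc, parts.path)


if __name__ == "__main__":
    # No "//" after the scheme: everything is path, the authority is empty.
    print(describe("mailto:fred@example.com"))
    # With "//": the same text is taken as an authority and the path is empty.
    print(describe("mailto://fred@example.com"))
```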
>
>>(vii) Section 4.5 on Suffix References.  We all know this
>>practice is  common.  We also know that it leads to trouble,
>>especially when  "heuristics change over time".   I think the
>>section should be a bit  more clear about the problems, and
>>then clearly recommend avoiding  these if possible, rather
>>than circling around the issues. It should also note that the
>>"suffix" often contains only part of a  DNS hostname ("foo"
>>in the expectation that the processor will turn it  into
>>"http://www.foo.com" or something equivalent) and that form is
>>very high-risk behavior.  See RFC1535 and/or for discussion
>>of some of  the downsides of these games.
>
>I have changed it to:
>
>While this practice of using suffix references is common, it should
>be avoided whenever possible and never used in situations where
>long-term references are expected.  The heuristics noted above will
>change over time, particularly when new URI schemes are introduced,
>and are often incorrect when used out of context.  Furthermore, they
>can lead to security issues along the lines of those described in
><xref target="RFC1535"/>.
>
>>(viii) Section 7.5.  But this is where a careful distinction
>>between  "authority" and "path" becomes important, along with
>>clarity about  types of "reserved" characters.  An only
>>slightly confused reader  could conclude that it was possible
>>to define a MIXER URI type in which
>>  mixer:/I=J/S=Linnimouth/GQ=5/@Marketing.Widget.COM
>>was valid.  I don't think it is, but it takes much too long
>>to prove  this from 2396bis.
>
>That is a valid URI.  It is a weird one, but I've seen worse,
>unfortunately.
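
For illustration only, the component-splitting regular expression 
from Appendix B of RFC 2396 can be applied to the mixer example in 
Python.  The regex only decomposes a reference and does not validate 
it; the split_uri() helper and its dictionary keys are hypothetical.

```python
import re

# Regular expression from RFC 2396, Appendix B.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri):
    """Split a URI reference into its five generic components."""
    m = URI_RE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),   # None when there is no "//"
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


if __name__ == "__main__":
    parts = split_uri("mixer:/I=J/S=Linnimouth/GQ=5/@Marketing.Widget.COM")
    for name, value in parts.items():
        print(f"{name:9} {value!r}")
```

Under this split the example has a scheme and a path but no 
authority, since a single "/" after the colon does not start one.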
>
>Cheers,
>
>Roy T. Fielding
><http://roy.gbiv.com/>
>Chief Scientist, Day Software
><http://www.day.com/>
Received on Sunday, 8 February 2004 10:09:12 UTC