- From: by way of Martin Duerst <klensin@jck.com>
- Date: Sat, 07 Feb 2004 17:31:23 -0500
- To: uri@w3.org
Roy,

Thanks. I think this is very good progress, and I look forward
to reading the new draft.

A few comments...

(1) If specifying what must be specified in specific URI
definitions is "the role of RFC 2718 and BCP 35 (RFC 2717)...",
which may be quite reasonable, I think those docs should be
cross-referenced in 2396bis. A sentence like "There are specific
requirements that these per-scheme characteristics be defined
for the schemes, see [RFC2718bis, RFC2717bis]" may be a forward
pointer but is not normative, since it provides references for
additional reading about a related topic, rather than
information needed to understand/implement 2396bis.

(2) The IPv6 syntax issue. RFC 2821 does not provide a
variant/different IPv6 syntax (although changes by the IPv6
folks may require some slight retuning in 2821bis). What it
does is to specify that, if the addressing scheme isn't IPv4,
then the scheme must be tagged with what it is, rather than
deduced by heuristics on the address syntax chosen. Think of it
as a wrapper around whatever the IPv6 folks specify, rather than
a different syntax. On that basis, I think you can do exactly
the same thing and that a "no heuristics" or at least "no
heuristics if there is any possible alternative" principle will
generally serve URIs and the web well.

If, due to other pressures and considerations, you need to
permit IPv6 addresses without any qualification, it seems to me
that it would be useful to specify an identified/tagged form,
make it normative, and then define the IPv6 address form without
the tag as a permitted abbreviated syntax variation. That at
least puts the right future stake in the ground. That stake is
necessary if we see URIs outliving the Internet as we know it,
and some of the other W3C-derived architecture documents and
public presentations clearly anticipate that.

regards,
john

--On Saturday, 07 February, 2004 01:08 -0800 "Roy T.
Fielding" <fielding@gbiv.com> wrote:

>On Wednesday, July 9, 2003, at 12:04 PM, John C Klensin wrote:
>>The document is a considerable improvement over RFC 2396, but
>>I've ended up with two major problems and a few nits.
>>
>>(1) There are a number of places in which the document seems
>>to go to such efforts to be general and to avoid
>>over-constraining particular URI schemes that it has
>>achieved a level of abstraction indistinguishable from
>>incomprehensibility and, occasionally, internal
>>contradictions. Examples below, but I think either some
>>rewriting or _very_ careful consistency review is needed, if
>>not both.
>
>In almost all of the cases you note, this has been due to a
>tension between wishing to remain consistent with prior UR*
>specifications and yet reflect how these things have actually
>been implemented in the real world. This has been a particular
>problem with reserved characters. I agree that the balance
>isn't working, so I have rewritten the sections such that all
>of the reserved characters are described as delimiters
>(assigned or available to be assigned by schemes) and all of
>the unreserved characters are always data.
>
>>(2) The problem I think we got into with MAILTO, and perhaps
>>with other URI schemes, is that it is tempting to refer to a
>>generic URI document and say, about syntax and escaping, "do
>>what it says there". Unfortunately, what this says is very
>>general and non-specific, and some of the terms don't mean
>>quite what one would assume on casual reading. I believe
>>the document would benefit significantly from a short
>>section titled, e.g., "Specification Requirements for URI
>>schemes" and that would then include, in very specific terms,
>>a list of things that a URI scheme description/standard
>>must specify. I would expect that list to include an exact
>>list of characters that must be escaped within the context
>>of that scheme.
>
>I believe that is the role of RFC 2718 and BCP 35 (RFC 2717),
>which will be revised as soon as I get this one off my plate.
>
>>(3) Details...
>>
>>(i) Section 2.1. I understand, I think, the reasoning behind
>>the "maybe it is ASCII and maybe it is not" language here.
>>But, if a URI appears in machine-readable form, and the scheme
>>name is not (or might not be) in ASCII (or any other
>>pre-specified character set), how is a URI parser or other
>>processor to recognize it? Put differently, there is a
>>bootstrapping problem: one must know the character set of
>>the scheme name before one can figure out how to parse or
>>process anything else. I might be wrong about this but, if I
>>am, this section needs a bit more explanation.
>
>I think it is generally true that one must know the character
>encoding of any document before one can process it (or at least
>a defined mechanism for discovering the character encoding
>prior to reaching the content). I have added: "When a URI
>appears in a protocol element, the character encoding is
>defined by that protocol; absent such a definition, the URI is
>assumed to be encoded in the same character encoding as the
>surrounding text."
>
>>(ii) Section 2.2. Normally, "reserved" means "always", and
>>"can't be used for anything else". It isn't the meaning
>>here (or actually, is partially the meaning). Things would
>>be much more clear if the production/definition were broken
>>up into
>>
>>   reserved = Subcomponent-Delimiter-Role /
>>              Other-Often-Reserved
>>   Subcomponent-Delimiter-Role = "/" / "?" / "#"
>>              (and colon (":") ???)
>>   Other-Often-Reserved = <the rest of the list>
>>
>>Some small rearrangement of the paragraphs below would then
>>make things much more clear.
>
>Done, though I use the names gen-delims and sub-delims. I have
>also moved the unsafe mark characters to the reserved set,
>since that is how they are used in practice and the source of
>most of the confusion.
>
>>Also, "URI's origin" should be precisely defined somewhere
>>(it isn't in the index). A naive reader could interpret the
>>term as either "the definition of the URI type/scheme" or
>>"the author/process that produces some particular URI
>>instance". A similar comment applies to "URI creator" which
>>appears at the end of section 3.2.2.
>
>Replaced with "the implementation-specific syntax of a URI's
>dereferencing algorithm".
>
>>(iii) Section 2.5. As with "reserved", "excluded" doesn't
>>seem to have its normal English meaning of absolutely
>>forbidden. I read this section to say that "excluded"
>>characters should be avoided if possible and escaped
>>otherwise. If that is the intended meaning, it should
>>appear in so many words. But I'm not sure it is. For
>>example, the exclusion of characters outside the ASCII range
>>would appear to prohibit UTF-8, even in %-encoded form, and,
>>given other text in the document, that clearly is not the
>>intent.
>
>Yes, I'll work on a better way of describing excluded glyphs
>rather than excluded characters.
>
>>(iv) Section 3 introduction. The first sentence lists a
>>"path" component. There is no "path" component in the
>>syntax productions, although I assume that "hier-part" is
>>more or less the same thing. And then the next sentence
>>says that the "path components" (not plural) is required.
>>Either more explanation is needed here or "hier-part" should
>>be renamed to "path-component" or equivalent. It would also
>>be useful to explicitly note that the productions for
>>scheme, authority, etc. are defined in subsequent subsections.
>>
>>The last sentence of the second paragraph says "a
>>non-hierarchical path will be treated as opaque data...".
>>But, from the productions, there appears to be no such thing
>>as a "non-hierarchical path".
>
>These will be fixed before the next revision due to other
>comments received.
>
>>(v) Section 3.2.2 and IPv6.
>>I don't know if there is a future
>>version of IP beyond v6, but please don't dig us into a
>>corner by having only the IPv4 and IPv6 forms and no way to
>>move beyond that. Consider the RFC 2821 solution, in which
>>address literals for other than the (historical) IPv4 ones
>>must be explicitly identified by a protocol-specific keyword.
>
>Unfortunately, I have no control over the syntax supplied and
>implemented by the IPv6 working groups, and it is difficult for
>me to invent one and also provide the implementations necessary
>for a Draft Standard status.
>
>>In the fourth paragraph, please insert "and should be
>>followed by one" before "if it is necessary to distinguish".
>>I.e., the trailing "." is always permitted but, if there is
>>any question about whether the domain is an FQDN or a
>>fragment of some sort, it should (or even must) be present.
>
>Done.
>
>>(vi) Section 3.3. The distinction between "path" and
>>"authority" needs to be more clearly drawn. This section
>>clearly defines mailto as using a path, but the authority
>>discussion and syntax in 3.2 might be construed as having
>>mailto consisting of an empty opaque path and an authority.
>>Since the syntax for net-path in 3 seems to suggest that, if
>>it were an authority, mailto would have to be
>>   mailto://fred@example.com
>>rather than
>>   mailto:fred@example.com
>>I think the text is correct and consistent. But it is
>>exceptionally confusing.
>
>I will try to make it less confusing with the next revision
>due to the other changes with path.
>
>>(vii) Section 4.5 on Suffix References. We all know this
>>practice is common. We also know that it leads to trouble,
>>especially when "heuristics change over time". I think the
>>section should be a bit more clear about the problems, and
>>then clearly recommend avoiding these if possible, rather
>>than circling around the issues.
>>It should also note that the
>>"suffix" often contains only part of a DNS hostname ("foo"
>>in the expectation that the processor will turn it into
>>"http://www.foo.com" or something equivalent) and that form is
>>very high-risk behavior. See RFC1535 and/or for discussion
>>of some of the downsides of these games.
>
>I have changed it to:
>
>While this practice of using suffix references is common, it
>should be avoided whenever possible and never used in
>situations where long-term references are expected. The
>heuristics noted above will change over time, particularly
>when new URI schemes are introduced, and are often incorrect
>when used out of context. Furthermore, they can lead to
>security issues along the lines of those described in
><xref target="RFC1535"/>.
>
>>(viii) Section 7.5. But this is where a careful distinction
>>between "authority" and "path" becomes important, along with
>>clarity about types of "reserved" characters. An only
>>slightly confused reader could conclude that it was possible
>>to define a MIXER URI type in which
>>   mixer:/I=J/S=Linnimouth/GQ=5/@Marketing.Widget.COM
>>was valid. I don't think it is, but it takes much too long
>>to prove this from 2396bis.
>
>That is a valid URI. It is a weird one, but I've seen worse
>unfortunately.
>
>Cheers,
>
>Roy T. Fielding                   <http://roy.gbiv.com/>
>Chief Scientist, Day Software     <http://www.day.com/>
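
The path-versus-authority question raised in points (vi) and (viii)
can be checked mechanically. The following is only a minimal sketch,
not part of the original exchange: it assumes the generic
component-splitting regular expression from Appendix B of RFC 2396
(which 2396bis may or may not retain unchanged), and the helper name
split_uri is hypothetical. It shows that an authority is recognized
only when "//" follows the scheme, so the mailto and mixer examples
above carry their data in the path.

```python
import re

# Component-splitting expression from RFC 2396, Appendix B.
# Assumption: 2396bis keeps an equivalent generic split.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri):
    """Return (scheme, authority, path) for a URI string.

    split_uri is a hypothetical helper for illustration only.
    authority is None when the URI has no "//" after the scheme.
    """
    match = URI_RE.match(uri)
    return match.group(2), match.group(4), match.group(5)


if __name__ == "__main__":
    for example in (
        "mailto:fred@example.com",
        "mailto://fred@example.com",
        "mixer:/I=J/S=Linnimouth/GQ=5/@Marketing.Widget.COM",
    ):
        print(example, "->", split_uri(example))
```

Under this split, the "@" in the mixer example never introduces an
authority because the path begins with a single "/", which is
consistent with the reply that the URI is valid, if weird.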
Received on Sunday, 8 February 2004 10:09:12 UTC