- From: by way of Martin Duerst <klensin@jck.com>
- Date: Wed, 09 Jul 2003 15:04:43 -0400
- To: uri@w3.org
Hi. As a result of still being troubled by the "which characters really need to be escaped in MAILTO" problem which some of you may have heard discussed, I've made a careful pass through this document. As a reader who has not been active in its development, I may be bringing a bit of a fresh look to it. The document is a considerable improvement over RFC 2396, but I've ended up with two major problems and a few nits. (1) There are a number of places in which the document seems to go to such efforts to be general and to avoid over-constraining particular URI schemes that it has achieved a level of abstraction indistinguishable from incomprehensibility and, occasionally, internal contradictions. Examples below, but I think either some rewriting or _very_ careful consistency review is needed, if not both. (2) The problem I think we got into with MAILTO, and perhaps with other URI schemes, is that it is tempting to refer to a generic URI document and say, about syntax and escaping, "do what it says there". Unfortunately, what this says is very general and non-specific, and some of the terms don't mean quite what one would assume on casual reading. I believe the document would benefit significantly from a short section titled, e.g., "Specification Requirements for URI schemes" and that would then include, in very specific terms, a list of things that a URI scheme description/ standard must specify. I would expect that list to include an exact list of characters that must be escaped within the context of that scheme. (3) Details... (i) Section 2.1. I understand, I think, the reasoning behind the "maybe it is ASCII and maybe it is not" language here. But, if URI appears in machine-readable form, and the scheme name is not (or might not be) in ASCII (or any other pre-specified character set), how is a URI parser or other processor to recognize it? Put differently, there is a bootstrapping problem: one must know the character set of the scheme name before one can figure out how to parse or process anything else. I might be wrong about this but, if I am, this section needs a bit more explanation. (ii) Section 2.2. Normally, "reserved" means "always", and "can't be used for anything else". It isn't the meaning here (or actually, is partially the meaning). Things would be much more clear if the production/definition were broken up into reserved = Subcomponent-Delimiter-Role / Other-Often-Reserved Subcomponent-Delimiter-Role = "/" / "?" / "#" (and colon (":") ???) Other-Often-Reserved = <the rest of the list> Some small rearrangement of the paragraphs below would then make things much more clear. Also, "URI's origin" should be precisely defined somewhere (it isn't in the index). A naive reader could interpret the term as either "the definition of the URI type/scheme" or "the author/ process that produces some particular URI instance". A similar comment applies to "URI creator" which appears at the end of section 3.2.2. (iii) Section 2.5. As with "reserved", "excluded" doesn't seem to have its normal English meaning of absolutely forbidden. I read this section to say that "excluded" characters should be avoided if possible and escaped otherwise. If that is the intended meaning, it should appear in so many words. But I'm not sure it is. For example, the exclusion of characters outside the ASCII range would appear to prohibit UTF-8, even in %-encoded form, and, given other text in the document, that clearly is not the intent. (iv) Section 3 introduction. The first sentence lists a "path" component. There is no "path" component in the syntax productions, although I assume that "hier-part" is more or less the same thing. And then the next sentence says that the "path components" (not plural) is required. Either more explanation is needed here or "hier-path" should be renamed to "path-component" or equivalent. It would also be useful to explicitly note that the productions for scheme, authority, etc. are defined in subsequent subsections. The last sentence of the second paragraph says "a non-hierarchical path will be treated as opaque data...". But, from the productions, there appears to be no such thing as a non-hierarchical path". (v) Sectin 3.2.2 and IPv6. I don't know if there is a future version of IP beyond v6, but please don't dig us into a corner by having only the IPv4 and IPv6 forms and no way to move beyond that. Consider the RFC 2821 solution, in which address literals for other than the (historical) IPv4 ones must be explicitly identified by a protocol-specific keyword. In the fourth paragraph, please insert "and should be followed by one" before "if it is necessary to distinguish". I.e., the trailing "." is always permitted but, if there is any question about whether the domain is an FQDN or a fragment of some sort, it should (or even must) be present. (vi) Section 3.3. The distinction between "path" and "authority" needs to be more clearly drawn. This section clearly defines mailto as using a path, but the authority discussion and syntax in 3.2 might be construed as having mailto consisting of an empty opaque path and an authority. Since the syntax for net-path in 3 seems to suggest that, if it were an authority, mailto would have to be mailto://fred@example.com rather than mailto:fred@example.com I think the text is correct and consistent. But it is exceptionally confusing. (vii) Section 4.5 on Suffix References. We all know this practice is common. We also know that it leads to trouble, especially when "heuristics change over time". I think the section should be a bit more clear about the problems, and then clearly recommend avoiding these if possible, rather than circling around the issues. It should also note that the "suffix" often contains only part of a DNS hostname ("foo" in the expectation that the processor will turn it into "http://www.foo.com" or something equivalent and that form is very high-risk behavior. See RFC1535 and/or for discussion of some of the downsides of these games. (viii) Section 7.5. But this is where a careful distinction between "authority" and "path" becomes important, along with clarity about types of "reserved" characters. An only slightly confused reader could conclude that it was possible to define a MIXER URI type in which mixer:/I=J/S=Linnimouth/GQ=5/@Marketing.Widget.COM Was valid. I don't think it is, but it takes much too long to prove this from 2396bis. I have skimmed, but often not studied, sections of the document not referenced above, so this should not be taken as a comprehensive review. regards, john
Received on Wednesday, 9 July 2003 15:07:41 UTC