- From: by way of Martin Duerst <klensin@jck.com>
- Date: Wed, 09 Jul 2003 15:04:43 -0400
- To: uri@w3.org
Hi.
As a result of still being troubled by the "which characters really need to
be escaped in MAILTO" problem which some of you may have heard discussed,
I've made a careful pass through this document. As a reader who has not
been active in its development, I may be bringing a bit of a fresh look to it.
The document is a considerable improvement over RFC 2396, but I've ended up
with two major problems and a few nits.
(1) There are a number of places in which the document seems to go to such
efforts to be general and to avoid over-constraining particular URI schemes
that it has achieved a level of abstraction indistinguishable from
incomprehensibility and, occasionally, internal contradictions. Examples
below, but I think either some rewriting or _very_ careful consistency
review is needed, if not both.
(2) The problem I think we got into with MAILTO, and perhaps with other URI
schemes, is that it is tempting to refer to a generic URI document and say,
about syntax and escaping, "do what it says there". Unfortunately, what
this says is very general and non-specific, and some of the terms don't
mean quite what one would assume on casual reading. I believe the
document would benefit significantly from a short section titled, e.g.,
"Specification Requirements for URI schemes" that would then include,
in very specific terms, a list of things that a URI scheme
description/standard must specify. I would expect that list to include an exact list
of characters that must be escaped within the context of that scheme.
(3) Details...
(i) Section 2.1. I understand, I think, the reasoning behind the "maybe it
is ASCII and maybe it is not" language here. But, if a URI appears in
machine-readable form, and the scheme name is not (or might not be) in
ASCII (or any other pre-specified character set), how is a URI parser or
other processor to recognize it? Put differently, there is a
bootstrapping problem: one must know the character set of the scheme name
before one can figure out how to parse or process anything else. I might be
wrong about this but, if I am, this section needs a bit more explanation.
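To make the bootstrapping concern concrete, here is a minimal Python sketch. It assumes a parser that finds the scheme by matching ASCII octets, and it uses EBCDIC (Python's cp037 codec) purely as an illustrative non-ASCII character set; the pattern name and the choice of codec are mine, not the draft's.

```python
import re

# Scheme syntax as in RFC 2396: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ),
# matched against raw octets on the assumption that they are ASCII.
ASCII_SCHEME = re.compile(rb"^[A-Za-z][A-Za-z0-9+.\-]*:")

uri = "mailto:fred@example.com"
as_ascii = uri.encode("ascii")
as_ebcdic = uri.encode("cp037")  # illustrative non-ASCII character set

# The same characters are recognized in one encoding and missed in the other.
found_in_ascii = ASCII_SCHEME.match(as_ascii) is not None
found_in_ebcdic = ASCII_SCHEME.match(as_ebcdic) is not None
```

Unless the processor already knows which character set the scheme name is in, it cannot even find the colon that ends it.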
(ii) Section 2.2. Normally, "reserved" means "always", and "can't be used
for anything else". That is not the meaning here (or, more precisely, it is
only partially the meaning). Things would be much more clear if the production/definition
were broken up into
    reserved = Subcomponent-Delimiter-Role /
               Other-Often-Reserved
    Subcomponent-Delimiter-Role = "/" / "?" / "#"
               (and colon (":") ???)
    Other-Often-Reserved = <the rest of the list>
Some small rearrangement of the paragraphs below would then make things
much more clear.
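A rough Python sketch of how the split might look to an implementer follows. The set names mirror the productions suggested above. As an assumption for illustration only, RFC 2396's reserved set stands in for "the rest of the list", since the draft's exact list is not quoted here; the function name and the returned labels are likewise hypothetical.

```python
# Proposed split of "reserved"; names follow the suggested productions.
SUBCOMPONENT_DELIMITER_ROLE = frozenset("/?#")

# Assumption: RFC 2396's reserved set stands in for "the rest of the list";
# the 2396bis draft may list different characters.
RFC2396_RESERVED = frozenset(";/?:@&=+$,")
OTHER_OFTEN_RESERVED = RFC2396_RESERVED - SUBCOMPONENT_DELIMITER_ROLE


def reserved_role(ch: str) -> str:
    """Classify a single character under the proposed two-way split."""
    if ch in SUBCOMPONENT_DELIMITER_ROLE:
        return "subcomponent delimiter"
    if ch in OTHER_OFTEN_RESERVED:
        return "often reserved"
    return "not reserved"
```

With two disjoint sets, a scheme specification only has to say which of the "often reserved" characters it actually reserves.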
Also, "URI's origin" should be precisely defined somewhere (it isn't in the
index). A naive reader could interpret the term as either "the definition
of the URI type/scheme" or "the author/process that produces some
particular URI instance". A similar comment applies to "URI creator" which
appears at the end of section 3.2.2.
(iii) Section 2.5. As with "reserved", "excluded" doesn't seem to have its
normal English meaning of absolutely forbidden. I read this section to say
that "excluded" characters should be avoided if possible and escaped
otherwise. If that is the intended meaning, it should appear in so many
words. But I'm not sure it is. For example, the exclusion of characters
outside the ASCII range would appear to prohibit UTF-8, even in %-encoded
form, and, given other text in the document, that clearly is not the intent.
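For what it is worth, the %-encoded form of a non-ASCII character contains only ASCII characters, which is why a flat exclusion of everything outside ASCII reads oddly. A small Python illustration, using the standard library's urllib.parse (which assumes UTF-8 by default; the variable names are mine):

```python
from urllib.parse import quote, unquote

original = "\u00e9"            # e with acute accent, outside the ASCII range
escaped = quote(original)      # percent-encodes the UTF-8 octets
round_trip = unquote(escaped)  # decodes back to the original character

# Every character in the escaped form is ASCII.
escaped_is_ascii = all(ord(c) < 128 for c in escaped)
```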
(iv) Section 3 introduction. The first sentence lists a "path"
component. There is no "path" component in the syntax productions,
although I assume that "hier-part" is more or less the same thing. And
then the next sentence says that the "path components" (note plural) is
required. Either more explanation is needed here or "hier-part" should be
renamed to "path-component" or equivalent. It would also be useful to
explicitly note that the productions for scheme, authority, etc. are
defined in subsequent subsections.
The last sentence of the second paragraph says "a non-hierarchical path
will be treated as opaque data...". But, from the productions, there
appears to be no such thing as a "non-hierarchical path".
(v) Section 3.2.2 and IPv6. I don't know if there is a future version of IP
beyond v6, but please don't dig us into a corner by having only the IPv4
and IPv6 forms and no way to move beyond that. Consider the RFC 2821
solution, in which address literals for other than the (historical) IPv4
ones must be explicitly identified by a protocol-specific keyword.
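As a sketch of what the RFC 2821 convention buys, here is a small Python function that splits a bracketed address literal into a protocol keyword and an address. Untagged literals are taken to be IPv4, and anything else must carry a "keyword:" prefix such as "IPv6:". The function name is hypothetical, and carrying this convention over to URI host syntax is my assumption, not something the draft specifies.

```python
from typing import Tuple


def classify_address_literal(literal: str) -> Tuple[str, str]:
    """Split a bracketed address literal into (protocol keyword, address).

    Untagged literals are treated as IPv4; all others need a "keyword:"
    prefix, so new address families do not require new syntax.
    """
    if not (literal.startswith("[") and literal.endswith("]")):
        raise ValueError("address literal must be enclosed in brackets")
    body = literal[1:-1]
    tag, sep, rest = body.partition(":")
    if not sep:
        return ("IPv4", body)  # historical untagged form
    if not tag or not rest:
        raise ValueError("tagged literal needs a keyword and an address")
    return (tag, rest)
```

A caller can then dispatch on the keyword, and a hypothetical future "IPv9:" literal would parse without any change to the generic syntax.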
In the fourth paragraph, please insert "and should be followed by one"
before "if it is necessary to distinguish". I.e., the trailing "." is
always permitted but, if there is any question about whether the domain is
an FQDN or a fragment of some sort, it should (or even must) be present.
(vi) Section 3.3. The distinction between "path" and "authority" needs to
be more clearly drawn. This section clearly defines mailto as using a
path, but the authority discussion and syntax in 3.2 might be construed as
having mailto consisting of an empty opaque path and an authority. Since
the syntax for net-path in section 3 seems to suggest that, if it were an
authority, mailto would have to be
    mailto://fred@example.com
rather than
    mailto:fred@example.com
I think the text is correct and consistent. But it is exceptionally confusing.
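One way to see the distinction is to ask a generic parser. The Python sketch below uses the standard library's urllib.parse.urlsplit, whose behavior is that library's reading of the generic syntax and not necessarily what 2396bis requires; the variable names are mine.

```python
from urllib.parse import urlsplit

# No "//" after the scheme: everything is path, there is no authority.
without_slashes = urlsplit("mailto:fred@example.com")

# With "//": the same text is taken as an authority and the path is empty.
with_slashes = urlsplit("mailto://fred@example.com")
```

Here without_slashes.netloc is empty and without_slashes.path is "fred@example.com", while with_slashes puts "fred@example.com" in netloc and leaves the path empty.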
(vii) Section 4.5 on Suffix References. We all know this practice is
common. We also know that it leads to trouble, especially when "heuristics
change over time". I think the section should be a bit more clear about
the problems, and then clearly recommend avoiding these if possible, rather
than circling around the issues.
It should also note that the "suffix" often contains only part of a DNS
hostname ("foo" in the expectation that the processor will turn it into
"http://www.foo.com" or something equivalent) and that form is very
high-risk behavior. See RFC1535 and/or for discussion of some of the
downsides of these games.
(viii) Section 7.5. This is where a careful distinction between
"authority" and "path" becomes important, along with clarity about types of
"reserved" characters. An only slightly confused reader could conclude
that it was possible to define a MIXER URI type in which
    mixer:/I=J/S=Linnimouth/GQ=5/@Marketing.Widget.COM
was valid. I don't think it is, but it takes much too long to prove this
from 2396bis.
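For comparison, this is how one generic parser reads that string. The Python sketch uses urllib.parse.urlsplit from the standard library; it shows that library's interpretation only and says nothing about whether the URI is valid under 2396bis. The variable names are mine.

```python
from urllib.parse import urlsplit

parts = urlsplit("mixer:/I=J/S=Linnimouth/GQ=5/@Marketing.Widget.COM")

# A single leading "/" means no authority; the "@..." text is just the
# last path segment, and "=" is ordinary data within each segment.
segments = parts.path.split("/")
```

The parser finds no authority at all, so whether "@" and "=" are acceptable in those positions depends entirely on which kind of "reserved" they are.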
I have skimmed, but often not studied, sections of the document not
referenced above, so this should not be taken as a comprehensive review.
regards,
john
Received on Wednesday, 9 July 2003 15:07:41 UTC