Re: registration guidelines draft 04 comments from Bruce Lilly on 2005-06-21 (uri@w3.org from June 2005)

From: Bruce Lilly <blilly@erols.com>
Date: Tue, 21 Jun 2005 09:31:06 -0400
To: uri@w3.org
Cc: Mike Brown <mike@skew.org>
Message-Id: <200506210931.09262.blilly@erols.com>

On Mon June 20 2005 22:38, Mike Brown wrote:
> 
> I was surprised that there was no mention of reserved characters in 
> draft-hansen-2717bis-2718bis-uri-guidelines-04.txt [1].

I haven't looked at those myself, nor do I have any immediate plans to do
so.
 
> Since the determination of whether a character must be percent-encoded is 
> partially dependent upon whether it has a reserved purpose in the URI 
> component in question, shouldn't a URI scheme spec be clear about which 
> reserved characters that might appear in the URI have a reserved purpose,
> above & beyond that specified by the generic syntax?
> 
> For example, in April, Bruce Lilly noticed that it's insufficient for RFC 2368 
> and the new mailto draft to just say 'Within mailto URLs, the characters "?", 
> "=", "&" are reserved' because those characters don't always have to be 
> treated specially.

To clarify, it's certainly possible to say that, but it's unnecessarily
restrictive.  The comment, incidentally, dates back to RFC 2368, which
used similar language (however, 2368 referred to "all URL reserved
characters", and that set has changed between RFCs 1738, 2396, and 3986).

> [2] He is right, I think, but he doesn't go far enough; the  
> new draft should adhere to the new terminology: a scheme cannot say that 
> characters X, Y, and Z are reserved; RFC 3986 has established the set of 
> characters that are reserved and this cannot be changed. Rather, the scheme 
> should clarify exactly when & where the characters designated as reserved must 
> be percent-encoded. It should also make it clear, or at least implicit, how to 
> interpret percent-encoded octets corresponding to those reserved characters.

While to some extent splitting individual URI scheme definitions from the
general syntax may simplify document maintenance, there is a danger that
the specifications will lose synchronicity and become difficult to
interpret, probably leading to interoperability problems.  This has already
happened to some extent; RFC 2368 (mailto) refers to 1738 for the generic
syntax, even though 2396 was in preparation concurrent with 2368.  1738
defines an "unsafe" set of characters, which is no longer defined in 2396 or
3986.  As mentioned above, the "reserved" set referenced by 2368 differs
between 1738, 2396 (added "+", "$", and ",") and 3986 (added "#", "[", "]",
"!", "'", "(", ")", and "*").  Worse, the "reserved" set has a very rubbery
semantic definition; "reserved" characters "may (or may not) be defined as
delimiters".  2396 defined a clear set of "excluded" characters with a
rationale for each character; that has been dropped from 3986 -- one has
to read between the lines to figure out that "%" has a special meaning!
Largely because of splitting individual URI scheme specifications into
separate documents, several things have happened and/or can be expected to
happen:

1. for clarity and to promote interoperability, each scheme really needs
   to specify its syntax in a complete and generic-URI-syntax-independent
   manner -- referring back to the generic syntax (as 2368's "URL reserved
   characters" does) doesn't work because the generic syntax changes
   unexpectedly (as has happened with that "reserved" set).
2. because schemes need to specify syntax completely and in a self-contained
   manner, there is little of value in the generic syntax document (arguably
   that has been the case since the split from 1738, since both 2396 and
   3986 ABNF-specified grammars are unimplementable in an efficient manner
   due to reduce/reduce conflicts in the grammars).
3. there is a good chance that schemes will be specified such that there
   effectively is no generic syntax; each scheme will require its own
   dedicated parser.  This has been the case to some extent all along; one
   can use a pattern-matching regular expression to extract a "path" from a
   mailto URI, but given the mailto URI semantics, "path" has little
   meaning.  I.e. "uniform" is a misnomer.
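
To illustrate item 3, here is a minimal sketch of the pattern-matching
approach, using the regular expression given in RFC 3986, Appendix B.  The
function name split_generic and the sample address are my own, chosen only
for illustration.

```python
import re

# Regular expression from RFC 3986, Appendix B, which splits a URI
# reference into its five generic components.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

def split_generic(uri):
    """Split a URI into generic components (hypothetical helper)."""
    # Every group in the pattern is optional, so match() always succeeds.
    m = URI_RE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

parts = split_generic("mailto:user@example.com?subject=hello")
print(parts["path"])   # the whole mail address ends up labelled "path"
```

The expression happily reports "user@example.com" as the "path", but
nothing in the generic syntax says what that string means; a mailto-aware
parser is still needed to interpret it.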

The problem with adhering to the new terminology lies in the rubbery
definition associated with 3986 terminology.  I agree that clarity is
desirable (I would say essential), and that the only way to achieve that
under the circumstances is to provide a syntax specification that does
not depend on rubbery definitions such as "reserved".
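
The drift in the "reserved" set mentioned above can be checked
mechanically.  The following is only a sketch: the three sets are
transcribed from the "reserved" productions of RFCs 1738, 2396 and 3986
(gen-delims plus sub-delims for the latter), and the names RESERVED and
added are mine.

```python
# "reserved" character sets as defined by each generic-syntax document.
RESERVED = {
    "RFC 1738": set(";/?:@=&"),
    "RFC 2396": set(";/?:@&=+$,"),
    # RFC 3986: gen-delims | sub-delims
    "RFC 3986": set(":/?#[]@") | set("!$&'()*+,;="),
}

def added(old, new):
    """Characters reserved in `new` that were not reserved in `old`."""
    return sorted(RESERVED[new] - RESERVED[old])

print(added("RFC 1738", "RFC 2396"))
print(added("RFC 2396", "RFC 3986"))
```

A scheme specification that says only "all URL reserved characters" thus
means something different depending on which of the three documents the
reader has in hand.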

> For example, are these equivalent?
> 
> mailto:mailinglist-return-user=host.com@lists.otherhost.com
> mailto:mailinglist-return-user%3Dhost.com@lists.otherhost.com

Slightly interesting, but "=" has always been "reserved" (as has "@").  It
appears in what a pattern-matching regular expression would identify as a
"path", and RFC 2396 says:

      path          = [ abs_path | opaque_part ]

      path_segments = segment *( "/" segment )
      segment       = *pchar *( ";" param )
      param         = *pchar

      pchar         = unreserved | escaped |
                      ":" | "@" | "&" | "=" | "+" | "$" | ","

which seems to indicate that an unencoded "=" is fine, but then goes on
to say:

   Within a path segment, the characters
   "/", ";", "=", and "?" are reserved.

Probably 2396 should have said that "=" is reserved within "param", not
"path segment".

In accordance with the robustness principle, encoding for generation
and handling unencoded "=" in path when parsing seems to be a safe choice.
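
A rough sketch of that generate-conservatively, accept-liberally approach,
using Python's urllib.parse.  The helper names are hypothetical, and the
sketch assumes that "=" has no delimiter role in the part being handled,
which is exactly the point a scheme specification would need to settle.

```python
from urllib.parse import quote, unquote

def encode_local_part(local):
    """Generation side (hypothetical helper): percent-encode everything
    that quote() does not treat as always safe."""
    return quote(local, safe="")

def decode_local_part(text):
    """Parsing side (hypothetical helper): accept both the raw and the
    percent-encoded form of a character."""
    return unquote(text)

generated = encode_local_part("mailinglist-return-user=host.com")
print(generated)

# Both spellings decode to the same string under this assumption.
same = (decode_local_part("mailinglist-return-user=host.com")
        == decode_local_part("mailinglist-return-user%3Dhost.com"))
print(same)
```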

More interesting would be:

 mailto:foo!bar%40example.edu
vs.
 mailto:foo%21bar%40example.edu

That's interesting because "!" has suddenly become "reserved", and is not
uncommon in mail (RFC 976).  Prior to RFC 3986, it was not reserved; in
2396 it is a "mark" which is included in "unreserved", which is included
in pchar and therefore (under 2396 rules) legal unencoded in "path".

> I think they are, but if "=" is always considered to have a reserved purpose 
> no matter where it appears in a mailto URI, then they are not.
> 
> Sorry I haven't time to come up with a very carefully worded phrase to add to 
> the guidelines for new URI schemes, but I hope someone can make a good 
> suggestion for how schemes should refer to and make recommendations for
> reserved characters.

I think that individual scheme specifications need to clearly indicate
precisely (i.e. in a list, not via a blanket term from the generic URI
syntax document) which characters need to be encoded, and if different
scheme URI components have different requirements in that regard, that
too needs to be clearly specified.  Scheme specifications can go beyond
that and make recommendations for encoding of some characters in some
contexts for maximum compatibility with legacy implementations (e.g. as
with the "reserved" "@" in mailto).
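
As a sketch of what such an explicit, per-component list might look like
in an implementation: the component names and character lists below are
invented for illustration and are not taken from any mailto specification.

```python
# Hypothetical per-component lists of characters that MUST be encoded.
# A real scheme specification would supply these lists itself.
MUST_ENCODE = {
    "address": set("%?#&="),
    "hvalue": set("%#&="),
}

def encode_component(component, text):
    """Percent-encode only what the component's explicit list requires,
    plus space, control characters and non-ASCII characters."""
    must = MUST_ENCODE[component]
    out = []
    for ch in text:
        if ch in must or ord(ch) < 0x21 or ord(ch) > 0x7E:
            # Encode each UTF-8 octet of the character as %XX.
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)

print(encode_component("address", "a?b"))
print(encode_component("hvalue", "a?b"))
```

With lists like these, the question of whether "?" needs encoding is
answered by the component it appears in rather than by a blanket term.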
Received on Tuesday, 21 June 2005 13:31:29 UTC