Ambiguity of Allowed/Recommended URI Syntax and Escaping

[ CC to Roy as I quote him ]


I[0] recently had cause to investigate the ins and outs of encoding of
characters in URIs in the context of checking the URI for "validity". The
conclusion appears to be that at least some specifics are ambiguous, which
echoes (at least to some extent) the conclusion drawn in
<http://lists.w3.org/Archives/Public/uri/2002May/0032.html>.


For instance, RFC2396 in Appendix A defines the query component thus:

  query      = *uric
  uric       = reserved | unreserved | escaped
  unreserved = alphanum | mark
  reserved   = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","
  mark       = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
  escaped    = "%" hex hex

This seems to say that the query component may contain any character from
the set "reserved", "unreserved", and "escaped"; even characters such as
the "?" (because it is part of the "reserved" production).
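To make that reading concrete, here is a minimal sketch of a checker that
transcribes the productions above into a regular expression. The names
QUERY_RE and is_valid_query are mine, not from the RFC, and the sketch
tests only the grammar; it says nothing about what any character means.

```python
import re

# Literal transcription of the RFC 2396 Appendix A productions quoted above:
#   query = *uric ; uric = reserved | unreserved | escaped
# The character class holds alphanum, then reserved, then mark.
# QUERY_RE and is_valid_query are illustrative names, not part of any spec.
QUERY_RE = re.compile(
    r"(?:[A-Za-z0-9;/?:@&=+$,\-_.!~*'()]|%[0-9A-Fa-f]{2})*"
)


def is_valid_query(query: str) -> bool:
    """Return True if the whole string matches the 'query' production."""
    return QUERY_RE.fullmatch(query) is not None


if __name__ == "__main__":
    for candidate in ["a=b?c/d", "a%20b", "a b", "a%2", "a#b"]:
        print(repr(candidate), is_valid_query(candidate))
```

Under this purely syntactic reading, a literal "?" or "/" inside the query
is accepted, while a space, a bare "%", or a "#" is not.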

Previously, I would have said with the utmost certainty that any "?" inside
the query component must be escaped. And the text of Section 2.2. "Reserved
Characters"...

   Many URI include components consisting of or delimited by, certain
   special characters.  These characters are called "reserved", since
   their usage within the URI component is limited to their reserved
   purpose.  If the data for a URI component would conflict with the
   reserved purpose, then the conflicting data must be escaped before
   forming the URI.

...did on first reading suggest that was correct. However, closer
examination reveals the exception "If the data for a URI component would
conflict with the reserved purpose [...]".

Examining the productions for the components that may precede the query
component reveals that the "?" is not allowed to appear in any of them...
and so the left-most literal "?" must be the delimiter for the query
component... which implies that any literal "?" appearing inside the query
component _cannot_ be said to "conflict" with its reserved meaning, and
thus need not be encoded(!).
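As a quick illustration of the "left-most ? is the delimiter" reading, here
is a small sketch using Python's standard urllib.parse.urlsplit. The
example URI and the variable names are made up for the purpose, and the
fact that one library behaves this way does not settle what the RFC
intends.

```python
from urllib.parse import urlsplit

# Hypothetical URI with a second, literal "?" inside the query component.
uri = "http://example.org/search?q=what?&path=/a/b"

parts = urlsplit(uri)

# Only the first "?" is treated as the delimiter; everything after it,
# including the later "?" and the "/" characters, ends up in the query.
print(parts.path)   # /search
print(parts.query)  # q=what?&path=/a/b
```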

But Section 3.4 "Query Component" says:

  Within a query component, the characters ";", "/", "?", ":", "@",
     "&", "=", "+", ",", and "$" are reserved.

Why go to the trouble of making the "?" be "reserved" in the _content_ of
the query component, when it _cannot_ appear as anything but data here?

One theory presented was that an implementation, such as an HTTP server,
is permitted to treat a "reserved" character differently depending on
whether or not it is encoded. This is based largely on the definition of
"reserved character":

   [...] In general, a character is reserved if the semantics of the
   URI changes if the character is replaced with its escaped US-ASCII
   encoding.

IOW, this appears to open up the possibility for implementations to treat
the escaped and unescaped forms of a character differently for the sole
reason that they appear in the set of "reserved" characters. Of course,
this also suggests to me that they were placed in the "reserved" set
_because_ implementations might treat them differently (which smells an
awful lot like circular reasoning to me).
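A small sketch of what "treated differently" looks like in practice, using
the form-style "&" and "=" conventions as implemented by Python's
urllib.parse.parse_qsl. Those conventions are an assumption on my part for
the sake of the example; RFC 2396 itself does not define them.

```python
from urllib.parse import parse_qsl

# The same data, once with a literal "&" and once with it escaped as "%26".
# parse_qsl applies form-style conventions ("&" separates pairs, "=" splits
# name from value); this is one implementation's choice, not an RFC 2396 rule.
literal = parse_qsl("k=a&b", keep_blank_values=True)
escaped = parse_qsl("k=a%26b", keep_blank_values=True)

print(literal)  # [('k', 'a'), ('b', '')]
print(escaped)  # [('k', 'a&b')]
```

Here the escaped and unescaped forms yield different results, which is the
behaviour the "reserved" definition appears to allow for.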


Now, from the context of a utility to check the syntax of a URI and warn a
user about errors and possible "Best Practices" issues, as well as creating
pre-escaped URIs for insertion into an HTML document, I'm left with a lot of
confusion and very little certainty.
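For reference, this is roughly the most permissive escaper the grammar
alone would seem to justify: it escapes only what the "uric" production
does not allow and leaves every reserved character untouched. It is a
sketch, not an answer to the questions below. The function name, the
choice of UTF-8 for non-ASCII input, and the policy of leaving an existing
"%HH" sequence alone are all my own assumptions; the RFC prescribes none
of them.

```python
import re

# Match either an existing escape ("%" hex hex) or any single character
# that is outside alphanum, reserved and mark as quoted from Appendix A.
_TOKEN_RE = re.compile(
    r"%[0-9A-Fa-f]{2}|[^A-Za-z0-9;/?:@&=+$,\-_.!~*'()]"
)


def escape_for_query(text: str) -> str:
    """Escape only what the 'uric' production does not allow (hypothetical)."""

    def repl(match: "re.Match[str]") -> str:
        token = match.group(0)
        if len(token) == 3:
            # Assumption: an existing %HH sequence is already an escape.
            return token
        # Assumption: non-ASCII characters are escaped as UTF-8 octets.
        return "".join("%{:02X}".format(octet) for octet in token.encode("utf-8"))

    return _TOKEN_RE.sub(repl, text)


if __name__ == "__main__":
    print(escape_for_query("a b?c"))  # a%20b?c
    print(escape_for_query("50%"))    # 50%25
```

Note that this leaves "?", "/", "&" and friends unescaped, which is exactly
the behaviour I am unsure about.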


What characters do in fact have to be encoded inside the query component?
Why? Apart from what RFC2396 actually spells out, what was the _intended_
behaviour in this regard? Why did this appear to change since RFC1738?

Why is the "?" reserved inside the query component when it cannot appear as
anything but data there? Why is the "/" reserved when it also cannot appear
as anything but data inside the query component?



More generally, it turns out that finding out exactly what must and must
not be encoded -- not to mention finding out what, in best practice,
/should/ be encoded -- is not an easy task with RFC2396. At least for now
I'm left guessing at what was the intent of the spec writers rather than
getting clear rules out of the RFC.

Speaking of which, I noticed
<http://www.apache.org/~fielding/uri/rev-2002/issues.html#013-query-slash>.
Quoth Roy T. Fielding on "/" in query component:

  This is not an error in the spec, though it could be useful as a
  note in future revisions.  The specification cannot disallow
  characters that commonly do appear in a URI query string, even if
  it is inadvisable for them to be used.  That is why they are
  listed as reserved in that context (i.e., should not be used
  unencoded except when the reserved meaning is intended).

This latter seems to indicate strongly that the intent is that all
characters in the "reserved" set for a component SHOULD be encoded when
appearing in, e.g., the query component, but the spec doesn't appear to
support this position. The spec specifically says that reserved characters
must be encoded _only_ when they would conflict with their reserved purpose.

Worse, since the spec says that encoding a reserved character actually
changes its meaning, you cannot encode a character from the reserved set
inside the query component without also changing the URI.


In case it's not obvious by now, I'm hopelessly confused and rapidly
developing a headache. Help..?






[0] - That is to say, there are several unindicted co-conspirators,
      but I'd rather they stay, uhm, "unindicted" unless they choose
      to `fess up of their own accord. Don't want them to take the
      blame for my confusion. :-)



-- 
"When you have no nails your hammer grows restless, and you begin to throw
 sideways glances at screws and pieces of string."    -- Jarkko Hietaniemi

Received on Sunday, 17 November 2002 22:34:20 UTC