- From: Terje Bless <link@pobox.com>
- Date: Mon, 18 Nov 2002 04:33:55 +0100
- To: URI List <uri@w3.org>
- cc: "Roy T. Fielding" <fielding@apache.org>
[ CC to Roy as I quote him ]

I[0] recently had cause to investigate the ins and outs of encoding of characters in URIs in the context of checking the URI for "validity". The conclusion appears to be that at least some specifics are ambiguous, which echoes (at least to some extent) the conclusion drawn in <http://lists.w3.org/Archives/Public/uri/2002May/0032.html>.

For instance, RFC2396 in Appendix A defines the query component thus:

    query      = *uric
    uric       = reserved | unreserved | escaped
    unreserved = alphanum | mark
    reserved   = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                 "$" | ","
    mark       = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
    escaped    = "%" hex hex

This seems to say that the query component may contain any character from the sets "reserved", "unreserved", and "escaped"; even characters such as the "?" (because it is part of the "reserved" production).

Previously, I would have said with the utmost certainty that any "?" inside the query component must be escaped. And the text of Section 2.2. "Reserved Characters"...

    Many URI include components consisting of or delimited by, certain
    special characters. These characters are called "reserved", since
    their usage within the URI component is limited to their reserved
    purpose. If the data for a URI component would conflict with the
    reserved purpose, then the conflicting data must be escaped before
    forming the URI.

...did on first reading suggest that was correct. However, closer examination reveals the exception "If the data for a URI component would conflict with the reserved purpose [...]".

Examining the productions for the components that may precede the query component reveals that the "?" is not allowed to appear in any of them... and so the left-most literal "?" must be the delimiter for the query component... which implies that any literal "?" appearing inside the query component _cannot_ be said to "conflict" with its reserved meaning, and thus need not be encoded(!).
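The reading above can be checked mechanically. What follows is only a minimal sketch in Python, assuming the Appendix A productions quoted above are transcribed faithfully; `split_query` and `is_valid_query` are hypothetical helper names, not part of any library.

```python
import re

# Character sets transcribed from the RFC 2396 Appendix A productions
# quoted above (an assumption of this sketch, not an official grammar).
RESERVED = ";/?:@&=+$,"
MARK = "-_.!~*'()"
ALPHANUM = "A-Za-z0-9"

# uric = reserved | unreserved | escaped
URIC = "(?:[" + ALPHANUM + re.escape(RESERVED + MARK) + "]|%[0-9A-Fa-f]{2})"
QUERY_RE = re.compile(URIC + "*")


def split_query(uri_ref):
    """Drop any fragment, then split at the left-most "?".

    Returns (part_before_query, query), with query None if there is no "?".
    """
    without_fragment = uri_ref.partition("#")[0]
    before, sep, query = without_fragment.partition("?")
    return before, (query if sep else None)


def is_valid_query(query):
    """True if the whole string matches the production query = *uric."""
    return QUERY_RE.fullmatch(query) is not None


before, query = split_query("http://example.org/path?a=b?c")
# The second "?" ends up inside the query and still matches *uric.
print(before, query, is_valid_query(query))
```

Under this transcription, a literal "?" inside the query is accepted, a space is rejected, and a "%" is accepted only when followed by two hex digits. The sketch checks syntax only; it says nothing about whether an unescaped "?" is advisable.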
But Section 3.4 "Query Component" says:

    Within a query component, the characters ";", "/", "?", ":", "@",
    "&", "=", "+", ",", and "$" are reserved.

Why go to the trouble of making the "?" be "reserved" in the _content_ of the query component, when it _cannot_ appear as anything but data here?

One theory presented was that it may be because an implementation, such as an HTTP server, is permitted to treat a "reserved" character differently depending on whether or not it is encoded. This is based largely on the definition of "reserved character":

    [...] In general, a character is reserved if the semantics of the
    URI changes if the character is replaced with its escaped US-ASCII
    encoding.

IOW, this appears to open up the possibility for implementations to treat the escaped and unescaped forms of a character differently for the sole reason that they appear in the set of "reserved" characters. Of course, this also suggests to me that they were placed in the "reserved" set _because_ implementations might treat them differently (which smells an awful lot like circular reasoning to me).

Now, from the context of a utility to check the syntax of a URI and warn a user about errors and possible "Best Practices" issues, as well as creating pre-escaped URIs for insertion into an HTML document, I'm left with a lot of confusion and very little certainty.

  - What characters do in fact have to be encoded inside the query component? Why?
  - Apart from what RFC2396 actually spells out, what was the _intended_ behaviour in this regard?
  - Why did this appear to change since RFC1738?
  - Why is the "?" reserved inside the query component when it cannot appear as anything but data there?
  - Why is the "/" reserved when it also cannot appear as anything but data inside the query component?

More generally, it turns out that finding out exactly what must and must not be encoded -- not to mention finding out what, in best practice, /should/ be encoded -- is not an easy task with RFC2396.
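One way to see where escaping a reserved character changes the meaning, and where it does not, is to look at what a single implementation does. The sketch below uses `urllib.parse.parse_qsl` from Python's standard library, which applies the HTML form convention of "&"-separated name=value pairs. That convention is an assumption of this example, not something RFC2396 defines, and `same_data` is a hypothetical helper.

```python
from urllib.parse import parse_qsl


def same_data(raw_query, escaped_query):
    """True if this parser recovers identical (name, value) pairs from both."""
    return parse_qsl(raw_query) == parse_qsl(escaped_query)


# "&" acts as a delimiter for this parser, so escaping it changes the result:
# two pairs in the raw form, one pair with value "b&c=d" in the escaped form.
amp_same = same_data("a=b&c=d", "a=b%26c=d")

# "?" and "/" are plain data to this parser, so escaping them changes nothing.
qmark_same = same_data("a=x?y", "a=x%3Fy")
slash_same = same_data("a=x/y", "a=x%2Fy")

print(amp_same, qmark_same, slash_same)
```

For this one parser, then, "&" behaves as reserved in the sense of the definition quoted above, while "?" and "/" do not. Other implementations are free to behave differently, which is exactly the uncertainty described here.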
At least for now I'm left guessing at what the intent of the spec writers was rather than getting clear rules out of the RFC.

Speaking of which, I noticed <http://www.apache.org/~fielding/uri/rev-2002/issues.html#013-query-slash>. Quoth Roy T. Fielding on "/" in query component:

    This is not an error in the spec, though it could be useful as a
    note in future revisions. The specification cannot disallow
    characters that commonly do appear in a URI query string, even if
    it is inadvisable for them to be used. That is why they are listed
    as reserved in that context (i.e., should not be used unencoded
    except when the reserved meaning is intended).

The latter seems to indicate strongly that the intent is that all characters in the "reserved" set for a component SHOULD be encoded when appearing in, e.g., the query component, but the spec doesn't appear to support this position. The spec specifically says that reserved characters must be encoded _only_ when they would conflict with their reserved purpose.

Worse, since the spec says that encoding a reserved character actually changes its meaning, you cannot encode a character from the reserved set inside the query component without also changing the URI.

In case it's not obvious by now, I'm hopelessly confused and rapidly developing a headache. Help..?

[0] - That is to say, there are several unindicted co-conspirators, but I'd rather they stay, uhm, "unindicted" unless they choose to `fess up of their own accord. Don't want them to take the blame for my confusion. :-)

-- 
"When you have no nails your hammer grows restless, and you begin to throw sideways glances at screws and pieces of string." -- Jarkko Hietaniemi
Received on Sunday, 17 November 2002 22:34:20 UTC