Re: Ambiguity of Allowed/Recommended URI Syntax and Escaping

Roy T. Fielding <fielding@apache.org> wrote:

>The specification requires things that are necessary for
>interoperability, and suggests things that are desirable for good
>practice/robustness.

Yes, I agree that this is the role of a Standards Track RFC. But I was not
able to understand what requirements this RFC imposes or what suggestions
it makes on the matter of when to escape reserved characters.


>None of the issues you mentioned are necessary for
>interoperability, and in fact you will find that implementations
>don't care one way or the other.  That is why "?" in query is
>reserved instead of disallowed, since it is more robust for parsers
>to expect it to be present (and treat it as data) rather than not
>expect it and think of it as an error.

The "Be liberal in what you accept." principle restated?


>URI generators, OTOH, are encouraged to use such characters only
>when they are used according to a reserved purpose, which in the
>case of "?" inside a query means never.

And instead they are encouraged, but not required, to escape the
characters; correct?


>That is not a contradiction, and certainly isn't ambiguous, since
>the specification both defines parsing and makes recommendations to
>URI generators.

It leaves the matter ambigious, subjectively, in the sense that I have been
unable, after reading the document, to determine the (presumably simple)
case of a "?" appearing inside the query component. Perhaps "Does not
explain it in a manner I was able to understand" is a better phrasing (but
it doesn't sound so catchy as a message subject ;D).

RFC1738 contains the text:

   If the character corresponding to an octet is
   reserved in a scheme, the octet must be encoded.

This is very clear and leaves no doubt about what is the expected behaviour
of an URI generator. A similar statement does not appear in RFC2396, and
neither do any statements making recommendation about "best practice" in
this area.

I take your statement on Issue #13[0] -- "That is why they are listed as
reserved in that context (i.e., should not be used unencoded except when
the reserved meaning is intended)." -- to mean that characters marked as
"reserved" in any particular context are intended to be escaped/encoded
regardless of whether they would "conflict" with it's reserved purpose.

Is that a correct assumption?

Can I generalize from this and assume, that if I were the agent responsible
for generating the URL

  http://example.com/redir?uri=http://example.com/elsewhere/

(as Jim gave as example), I would be expected to encode/escape it as

  http://example.com/redir?uri=http%3A%2F%2Fexample.com%2Felsewhere%2F

even if URI parsers are expected to accept the former as well as the
latter?

And that a similar generalization will apply to an example such as
"http://example.com/cgi.pl?uri=http://example.com/cgi-2.pl?para=meter" such
that the second "?" and "=", and all ":" and "/" characters following the
left-most "?" are expected/recommended to be encoded by URI generators?


Would it be correct to say that the characters found in the "unwise" and
"delims" productions MUST always be encoded, and any character in the
"reserved" set in any given context SHOULD be encoded within that context?

Would it be correct to say that any character in the "reserved" production
-- which differs from the set labelled as "reserved" for any particular
component -- SHOULD always be encoded unless used for it's reserved
purpose?




One thing is that I do not understand this, but I had the help of several
very smart people in attempting to determine what RFC2396 is actually
saying. If an update to RFC2396 is in progress, would it be possible to
take this into account and perhaps spell out requirements and
recommendations for various agents (generators vs. parsers) in greater
detail?

As an example, if my speculation above is correct, it might say something
along the lines of (paraphrasing RFC1738 and using RFC2119 conventions):

   If the character is reserved in a component, the character SHOULD
   be encoded by URI generators, but MUST be accepted unencoded by
   URI parsers.

Or similar language designed to make abundantly clear what the intended
behaviour of various agents are in the different contexts.


-- 
Interviewer: "In what language do you write your algorithms?"
    Abigail: English.
Interviewer: "What would you do if, say, Telnet didn't work?"
    Abigail: Look at the error message.

Received on Monday, 18 November 2002 05:14:52 UTC