ACTION-181 - escaping as defined by URI spec from Jack Jansen on 2010-11-01 (public-media-fragment@w3.org from November 2010)

From: Jack Jansen <Jack.Jansen@cwi.nl>
Date: Tue, 2 Nov 2010 00:44:42 +0100
To: Media Fragment <public-media-fragment@w3.org>
Message-Id: <10020F37-AE41-4BBD-B490-A560D59F458E@cwi.nl>

Finally, I got around to finding out what the URI spec says about escaping.
I've looked at rfc3986; that is still the most recent one, right?

The interesting bits of knowledge, from our point of view, are the following.

section 3.4, query:
query = *( pchar / "/" / "?" )
section 3.5, fragment:
fragment = *( pchar / "/" / "?" )
section 3.3:
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
section 2.2, reserved characters, especially the second paragraph (right after the ABNF).
section 2.4, when to encode and decode.

The query and fragment sections state that both consist of pchar (the unreserved characters, percent-encoded octets, sub-delims, ":" and "@") plus "/" and "?". Interestingly, you can have unescaped question marks in either, but the fragment cannot have an unescaped hash. The query section talks a bit about name=value pairs, but says nothing definitive.
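As a rough illustration of the grammar quoted above, here is a minimal Python sketch that checks a string against the query/fragment production. The names (PCHAR, is_valid_query_or_fragment) are mine, not from the RFC, and the character sets are transcribed from the unreserved and sub-delims rules of rfc3986.

```python
import re

# Character sets transcribed from rfc3986 (sections 2.2 and 2.3).
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR = rf"(?:[{UNRESERVED}{SUB_DELIMS}:@]|%[0-9A-Fa-f]{{2}})"

# query and fragment share the same production: *( pchar / "/" / "?" )
QUERY_OR_FRAGMENT = re.compile(rf"(?:{PCHAR}|[/?])*")


def is_valid_query_or_fragment(component):
    """Return True if the string matches *( pchar / "/" / "?" )."""
    return QUERY_OR_FRAGMENT.fullmatch(component) is not None
```

With this, "t=10,20" and "a?b/c" are accepted, while "a#b" is rejected because "#" is not in any of the allowed sets.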

The really interesting bit, IMO, is the second paragraph of 2.2 (especially when compared to the same paragraph of section 2.3, unreserved characters). In 2.2, it is explicitly stated that the interpretation of a sub-delim and the percent-encoded representation of that character are not necessarily identical. In 2.3 it is stated that the interpretation of an unreserved character and its percent-encoded representation is identical.
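To make the difference between 2.2 and 2.3 concrete, here is a small Python sketch of a normalizer that decodes only percent-encoded unreserved characters and leaves everything else encoded. The function name normalize_unreserved is hypothetical, and this is only my reading of the equivalence rules, not something the spec gives as code.

```python
import re
import string

# The unreserved set from rfc3986 section 2.3.
UNRESERVED_CHARS = set(string.ascii_letters + string.digits + "-._~")


def normalize_unreserved(component):
    """Decode %XX only where XX is an unreserved character.

    Other escapes (including sub-delims such as "=") are kept, with
    their hex digits uppercased, because decoding them may change
    the meaning of the component.
    """
    def replace(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED_CHARS:
            return char
        return "%" + match.group(1).upper()

    return re.sub(r"%([0-9A-Fa-f]{2})", replace, component)
```

Under this sketch "%41" normalizes to "A", but "a%3db" normalizes to "a%3Db" and so stays distinct from "a=b".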

So, where the spec dictates that A and %41 are equivalent, it states that = and %3d are different things. The spec says nothing about when to decode percent-escapes, except that you shouldn't do it before splitting the URL into its constituent parts (obviously). However, in my mind the referenced bits of sections 2.2 and 2.4 point towards doing percent-decoding as late as possible. In other words, if we put extra structure into the query and fragment parts, we should use delims and sub-delims there (as we already do) and do percent-decoding after parsing our structure.
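The late-decoding order described above could be sketched in Python as follows. The function name parse_pairs and the example fragment are made up for illustration; urllib.parse.unquote is the standard library's percent-decoder.

```python
from urllib.parse import unquote


def parse_pairs(component):
    """Split a query or fragment on "&" and "=", then percent-decode.

    Decoding happens last, so an escaped "&" (%26) or "=" (%3D)
    inside a name or value is never mistaken for a delimiter.
    """
    pairs = []
    for part in component.split("&"):
        if not part:
            continue  # skip empty segments such as a trailing "&"
        name, _, value = part.partition("=")
        pairs.append((unquote(name), unquote(value)))
    return pairs


def parse_pairs_decode_first(component):
    """The wrong order, for contrast: decode first, then split."""
    return [tuple(part.partition("=")[::2])
            for part in unquote(component).split("&") if part]
```

For the fragment "t=10,20&id=a%26b%3Dc", parse_pairs yields two pairs with the value "a&b=c" for id, whereas decoding first splits that value apart into a spurious third pair.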

So, what we do seems to be in line with the intention of rfc3986, but rfc3986 does not mandate that processing order. Therefore, I think we should state explicitly that percent-decoding is to happen after separation on & and =.
--
Jack Jansen, <Jack.Jansen@cwi.nl>, http://www.cwi.nl/~jack
If I can't dance I don't want to be part of your revolution -- Emma Goldman

Received on Monday, 1 November 2010 23:45:54 UTC