Re: [whatwg/url] Need an "unreserved" character set (and better define how to percent-encode arbitrary strings) (#369)

Update: Someone else coincidentally filed the registerProtocolHandler issue at whatwg/html#3377.

> This decoding can cause a reparse problem, see #87 (comment).

I see. This issue is more complex than I thought (mostly because of the nested-escape issue).

> It's problematic because alone % are still allowed. I think the problem can be solved by percent encoding % to %25 when it isn't part of valid percent encode sequence (see issue: #170), but this can be problematic as well

Yeah, we can't do that. It would break registerProtocolHandler, which is based on the (IMHO rather flimsy) assumption that "%s" parses as "%s" (despite a validation error). I was encouraged to rely on the same mechanism in WICG/web-share-target#31.

Having said that, I think we can solve it basically the way that Chrome solved it. Having proper equivalence defined for URLs is kind of important. I don't think we should jettison that concept because there is an edge case that causes trouble.

Here's what I'm proposing:

A "decodable percent sequence" is a "%" followed by two hex digits representing a byte in the unreserved set.

- When the URL Parser encounters a %, it consumes it, then looks ahead at the following characters.
- If it finds two hex digits, it consumes them. Then,
  - If those hex digits encode a byte value in the unreserved set, it emits that byte value's code point.
  - Else, it emits "%" followed by those hex digits, converted to uppercase.
- Else, if it finds a hex digit followed by a decodable percent sequence, a decodable percent sequence followed by a hex digit, or two decodable percent sequences: validation error, and it emits "%25" without consuming those tokens.
- Else, validation error, and it emits "%".

Test cases:

* "%61" -> "a"
* "%3d" -> "%3D"
* "%%361" -> "%2561" (with validation error)
* "%6%31" -> "%2561" (with validation error)
* "%6%3D" -> "%6%3D" (with validation error)
* "%%36%31" -> "%2561" (with validation error)
* "%6%%331" -> "%6%2531" (with validation error)
* "%6%2531" -> "%6%2531" (with validation error)
* "%s" -> "%s" (with validation error)
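The decoding steps above can be sketched in Python. This is a hypothetical sketch, not spec text: it assumes the unreserved set is RFC 3986's (ALPHA / DIGIT / "-" / "." / "_" / "~"), which is exactly the set this issue is asking the spec to define.

```python
# Assumed unreserved set (RFC 3986 section 2.3); the real set is what
# this proposal would add to the URL Standard.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
HEX = set("0123456789abcdefABCDEF")

def is_decodable(s: str, i: int) -> bool:
    """True if s[i:] starts with a "decodable percent sequence":
    '%' plus two hex digits whose byte value is in the unreserved set."""
    return (
        s[i : i + 1] == "%"
        and len(s) >= i + 3
        and s[i + 1] in HEX
        and s[i + 2] in HEX
        and chr(int(s[i + 1 : i + 3], 16)) in UNRESERVED
    )

def normalize(s: str) -> str:
    out = []
    i = 0
    while i < len(s):
        c = s[i]
        if c != "%":
            out.append(c)
            i += 1
            continue
        i += 1  # consume the '%', then look ahead
        rest = s[i:]
        if len(rest) >= 2 and rest[0] in HEX and rest[1] in HEX:
            byte = int(rest[:2], 16)
            if chr(byte) in UNRESERVED:
                out.append(chr(byte))                # decode, e.g. "%61" -> "a"
            else:
                out.append("%" + rest[:2].upper())   # keep, normalize case
            i += 2
        elif (
            (rest[:1] in HEX and is_decodable(s, i + 1))          # hex digit + decodable
            or (is_decodable(s, i) and s[i + 3 : i + 4] in HEX)   # decodable + hex digit
            or (is_decodable(s, i) and is_decodable(s, i + 3))    # two decodables
        ):
            out.append("%25")  # validation error; escape '%', don't consume lookahead
        else:
            out.append("%")    # validation error; lone '%' passes through
    return "".join(out)
```

Running the test cases above through `normalize` reproduces each expected output, including the nested-escape ones like `"%6%%331"` -> `"%6%2531"`.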

I think that covers it, including nested cases. What we gain from this is that we can define a set of characters that **must** be considered equivalent.

Let me further justify why we need this.

> @annevk (on #87)
> the mapping of an HTTP request to a resource on disk is not exactly governed by the URL Standard. How servers deal with URLs and what amount of normalization they apply is very much up to them.
> ...
> What I'm favor of is that clients do not normalize and treat your given examples as distinct. This follows from how we define the URL parser and then pass the URL to the networking subsystem, etc.
>
> A server however can still see those paths and treat them as equivalent. That would be up to the server library that maps paths to resources. I don't think we have to take a stance on whether such mapping takes place in the URL Standard.

I agree that servers (let's call them "URL processors" -- any application that breaks down a URL and uses its pieces, whether mapping it onto a file system, or otherwise) should be free to treat certain characters, such as '$', as equivalent to their encoded counterparts, or not, as they wish. What we're missing is a mandate that URL processors **must** treat other characters, such as 'a', as equivalent to their encoded counterparts.

Let's call these two character sets "reserved" and "unreserved". Encoding or decoding a reserved character **may or may not** change the meaning of the URL (depending on the processor). Encoding or decoding an unreserved character **does not** change the meaning of the URL. These sets impact rendering and encoding as follows:

* URL rendering should not be allowed to decode any character in the reserved set, because that could change the meaning of the URL and present the reader with an ambiguous string (one that could represent any of several URLs). Conversely, URL rendering can freely decode any character in the unreserved set.
* When encoding an arbitrary string to be inserted into a URL (**for example**: with registerProtocolHandler, but also any time this is done by an application), any character in the reserved set **must** be encoded, so that it surely represents the literal character, and not some syntactic component of either the URL syntax, or a quirk of some unknown URL processor.
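A minimal sketch of such an encoder, again assuming the unreserved set is RFC 3986's (the actual set is the open question here):

```python
# Assumed unreserved set (RFC 3986); the real set is what this issue would define.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)

def encode_component(s: str) -> str:
    """Percent-encode every byte outside the unreserved set, so nothing in
    the string can be mistaken for URL syntax or a processor quirk."""
    return "".join(
        chr(b) if chr(b) in UNRESERVED else "%%%02X" % b
        for b in s.encode("utf-8")
    )
```

With a defined unreserved set, `encode_component("Matt")` can stay `"Matt"`, while `encode_component("a=b&c")` becomes `"a%3Db%26c"` — reserved characters are escaped, unreserved ones are left readable.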

The current status quo is essentially that the "unreserved" set is the null set. This means that:

* URL rendering cannot decode any character, because even decoding "%61" to "a" could change the meaning of the URL.
* When encoding an arbitrary string, every character must be encoded. For example, when filling in the query template "name=%s" with the name "Matt", we have to produce "name=%4D%61%74%74"; otherwise some URL processor could treat the letter "a" specially, and encoding it as "%61" is the only way to be sure it's treated as a literal.

Putting those two together, a URL with "name=%4D%61%74%74" has to be rendered as "name=%4D%61%74%74", so all URLs are ugly and impossible for a human to read.
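Under that status quo, the only defensible encoder is one that escapes everything (a sketch of the degenerate case):

```python
def encode_all(s: str) -> str:
    """With an empty unreserved set, the only safe encoding of an
    arbitrary string escapes every single byte."""
    return "".join("%%%02X" % b for b in s.encode("utf-8"))
```

This is exactly what produces the unreadable `encode_all("Matt")` == `"%4D%61%74%74"` above.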

Now you may be saying: "Come on, don't be so pedantic. No URL processor is going to treat 'a' differently from '%61', so surely we don't need to encode it!" OK, but how do I choose which characters need to be encoded and which don't? How do I know which characters will be treated equivalently to their encoded versions, and which won't? I have no more faith that a URL processor will consider "a" and "%61" equivalent than I do for "=" and "%3D". To know which characters need to be encoded, the URL specification needs to explicitly state which characters a URL processor is allowed to treat specially (the reserved set) and which it isn't (the unreserved set).

**Corollary:** If we don't define an unreserved set, we still need to define some set of characters that registerProtocolHandler (and Web Share Target) should encode. What is that set? It can't be any of the existing percent-encode sets, since they don't encode enough characters. Sure, we could throw in a few more characters like '&' and '=', but what is the right answer to "which characters need to be encoded to ensure that the characters in the string aren't treated specially by the URL processor"? My search for an answer to this question led me to the conclusion that we need to bring back the RFC 3986 concept of an unreserved set.

-- 
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/url/issues/369#issuecomment-359336557

Received on Monday, 22 January 2018 06:32:13 UTC