[whatwg/url] Need an "unreserved" character set (and better define how to percent-encode arbitrary strings) (#369)

This is a bit of a jumble of related issues that all stem from one root problem: URL Standard (unlike [RFC 3986](https://www.ietf.org/rfc/rfc3986.txt)) does not have a concept of an "unreserved" character set. Apologies for the essay, but since these issues are all interrelated, I thought I would group them into one discussion.

## Why an unreserved set?

To give some background, [RFC 3986](https://www.ietf.org/rfc/rfc3986.txt)'s **unreserved** set (ASCII alphanumeric plus `-._~`) is the set of characters that are interchangeable in their percent-encoded and non-encoded forms: "URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent: they identify the same resource." (The earlier [RFC 2396](https://www.ietf.org/rfc/rfc2396.txt) defined a slightly larger unreserved set: ASCII alphanumeric plus `!'()*-._~`, which will be relevant later.)

In other words, the RFC divides the set of valid URL characters into two subsets: reserved and unreserved. Percent-encoding or percent-decoding a reserved character *may* change the meaning of the URL (e.g., "?abc=def" and "?abc%3Ddef" have different meanings). Percent-encoding or percent-decoding an unreserved character *does not* change the meaning of the URL (e.g., "/abc/" and "/%61%62%63/" should be considered equivalent, with "/abc/" being the normalized representation).

URL Standard does not have an equivalent concept, and this manifests as several problems (each of which could have its own bug, but I think it helps to group these together):

1. **URL equivalence is broken.** RFC 3986 considers "/abc/" and "/%61%62%63/" to be equivalent; URL Standard does not. URL Standard treats "=" / "%3D" the same way as "a" / "%61" --- in both cases, it does not consider these equivalent. It needs to recognise the equivalence of "a" and "%61". Furthermore, it doesn't even recognise equivalence of uppercase and lowercase percent-encoded bytes: "%3D" and "%3d" are not considered equivalent, because the URL Parser does not normalize lowercase percent-encoded bytes to uppercase. (Note: I like that URL Parser serves as a "normalization" pass, with equivalence being trivially defined as compare serialization for equality. Therefore, I would like to address this in the Parser, rather than modifying the equivalence algorithm itself.)
2. **URL rendering is broken.** Currently, the standard just says to percent decode all sequences ("unless that renders those sequences invisible"). This displays URLs ambiguously, because for example "?abc=def" and "?abc%3Ddef" will both be displayed as "?abc=def". It's impossible for the reader to know whether that "=" represents a syntactic "=" or a literal U+003D character represented by "%3D". It's only safe to percent decode characters when rendering if we're sure it doesn't change the semantics of the URL. (A good rule would be that `Parse(Serialize(url)) == Parse(Render(url))` should be true for all URLs.) Right now, there is essentially no percent-encoded sequence (a few exceptions aside, none of them the cases I'm talking about) that can be decoded without changing the way the URL parses.
3. **There is no well-defined algorithm or character set for safely encoding an arbitrary string into a URL.** Of course, there is [encodeURIComponent](https://www.ecma-international.org/ecma-262/6.0/#sec-encodeuricomponent-uricomponent) (defined in ECMAScript), but I'm not sure how to reference an ECMAScript API from a web standard. I think URL Standard itself should define how to safely encode a string for a URL. Case in point: [registerProtocolHandler](https://html.spec.whatwg.org/multipage/system-state.html#dom-navigator-registerprotocolhandler) is incorrectly specified as using the "default encode set", which is the old name for the path percent-encode set, which doesn't encode enough characters (in particular, '&' and '='). I'll file a separate bug on HTML, but the fix is quite difficult to define (other than "use encodeURIComponent") because URL Standard doesn't define an equivalent set of characters.

So how does adding an unreserved set help with these?

1. The URL Parser would be updated to *decode* any percent-encoded sequence that represents a character in the unreserved set. This fixes equivalence because "%61" would decode to "a", and thus "a" would be equivalent to "%61". (We should also fix it so it normalizes "%3d" to "%3D", but that's a separate issue.) A rough sketch of this normalization pass follows this list.
2. URL Rendering would be updated to only decode percent-encoded bytes above 0x7f (i.e., non-ASCII characters). This is because the parser would have already decoded all unreserved characters (the only ASCII characters that are safe to decode). The only job left for the renderer is to decode the non-ASCII characters (which behave like unreserved characters, but are deliberately left out of the unreserved set so that the parser keeps producing ASCII-only strings).
3. We would also define a "default encode set" as the complement of the unreserved set. Other specs (like [registerProtocolHandler](https://html.spec.whatwg.org/multipage/system-state.html#dom-navigator-registerprotocolhandler), and my draft [Web Share Target API](https://wicg.github.io/web-share-target)) would be encouraged to percent-encode all characters in this set.
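
To make point 1 concrete, here is a rough sketch (plain JavaScript, not spec language) of the normalization pass the parser could apply to percent-encoded sequences. The unreserved set used here is the RFC 2396 one I argue for below; the function name and structure are purely illustrative:

```js
// Illustrative only: the RFC 2396 unreserved set (ASCII alphanum plus !'()*-._~).
const UNRESERVED = /^[A-Za-z0-9!'()*\-._~]$/;

// Normalize every %XX sequence in a serialized component: decode it if the byte is
// an unreserved character, otherwise uppercase its hex digits.
function normalizePercentEncoding(component) {
  return component.replace(/%([0-9A-Fa-f]{2})/g, (match, hex) => {
    const ch = String.fromCharCode(parseInt(hex, 16));
    return UNRESERVED.test(ch) ? ch : "%" + hex.toUpperCase();
  });
}

normalizePercentEncoding("abc%3d%61%2a"); // "abc%3Da*"
```

With a pass like that in place, "a" / "%61" and "%3d" / "%3D" serialize identically, so the trivial compare-serializations equivalence keeps working.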

## What should be in the unreserved set?

So what characters should be in the unreserved set? I propose three alternatives (from largest to smallest):

1. ASCII alphanum plus `!$'()*,-.;_~`. This reserves the bare minimum set of characters. I compiled this list by carefully reading the URL standard and deciding whether each ASCII character has any special meaning. The above list of characters have no intrinsic meaning anywhere in the URL standard (note that '.' has special meaning in single- and double-dot path segments, but "." and "%2E" are already considered equivalent in that regard).
2. ASCII alphanum plus `!'()*-._~`. This matches [RFC 2396](https://www.ietf.org/rfc/rfc2396.txt), the older IETF standard.
3. ASCII alphanum plus `-._~`. This matches [RFC 3986](https://www.ietf.org/rfc/rfc3986.txt).

Of these, I prefer Option 2 (match RFC 2396). Option 1 is the most "logical" because it can be derived directly from reading the rest of the spec, but it doesn't leave any room for either this spec, or individual schemes, to attach special meaning to new characters in the future (which was the purpose of reserved characters in the first place). Option 3 matches the most recent IETF URL specification, which deliberately moved `!'()*` into the reserved set, but I don't think this move had much impact on implementations. For example, [encodeURIComponent](https://www.ecma-international.org/ecma-262/6.0/#sec-encodeuricomponent-uricomponent) still leaves the RFC 2396 unreserved characters unescaped; in fact, Option 2 exactly matches the encode set of [encodeURIComponent](https://www.ecma-international.org/ecma-262/6.0/#sec-encodeuricomponent-uricomponent). Furthermore, Option 2 more or less matches Chrome's current behaviour (though it differs from one context to another, as discussed below).
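
As a quick sanity check of that correspondence, these results (runnable in any JS console) follow from the ECMAScript definition of `encodeURIComponent`:

```js
encodeURIComponent("abc123-._~"); // "abc123-._~" (unreserved in both RFCs: untouched)
encodeURIComponent("!'()*");      // "!'()*" (dropped from unreserved by RFC 3986, yet still not encoded)
encodeURIComponent("=&?#/");      // "%3D%26%3F%23%2F" (reserved: always encoded)
```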

An open question is whether non-ASCII characters should appear in the unreserved set. This mostly doesn't matter, because all non-ASCII characters are in all of the percent-encode sets, so they always get normalized into encoded form. Technically, they act like unreserved characters, because a URL's semantics don't change as you encode or decode them. But I am leaving them out, because the unreserved set should contain only the characters that normalize to their decoded form.

## Study of current implementations

A WHATWG standard is supposed to describe how implementations actually behave. My experiments with Chrome 63 and Firefox 52 suggest that implementations do not follow the current URL standard at all, and are much closer to matching what I suggest above. *(Disclaimer: I work for Google on the Chrome team.)*

### URL equivalence

I can't find a good built-in way on the browser side to test URL equivalence (since the `URL` class has no equivalence method). But we can use this function to test equivalence of URL strings, based on the browser's implementation of URL parsing and serializing:

```js
// Parse both strings against the page's base URL and compare the serializations.
function urlStringsEquivalent(a, b) {
    return new URL(a, document.baseURI).href == new URL(b, document.baseURI).href;
}
```

Here, Chrome mostly matches RFC 3986's notion of syntax-based equivalence:

* `urlStringsEquivalent('a', 'a')`: true
* `urlStringsEquivalent('a', '%61')`: true (normalized to 'a')
* `urlStringsEquivalent('~', '%7E')`: true (normalized to '~')
* `urlStringsEquivalent('=', '%3D')`: false (not normalized)
* `urlStringsEquivalent('*', '%2A')`: false (not normalized)
* `urlStringsEquivalent('<', '%3C')`: true (normalized to '%3C')

Specifically, Chrome's URL parser decodes all characters in the RFC 3986 unreserved set: ASCII alphanum plus `-._~`.

But Chrome also fails to normalize case when it doesn't decode a percent-encoded sequence:

* `urlStringsEquivalent('%6e', '%6E')`: true (normalized to 'n')
* `urlStringsEquivalent('%3d', '%3D')`: false (not normalized)

Firefox, on the other hand, follows the current URL standard:

* `urlStringsEquivalent('a', 'a')`: true
* `urlStringsEquivalent('a', '%61')`: **false** (not normalized)
* `urlStringsEquivalent('~', '%7E')`: **false** (not normalized)
* `urlStringsEquivalent('=', '%3D')`: false (not normalized)
* `urlStringsEquivalent('*', '%2A')`: false (not normalized)
* `urlStringsEquivalent('<', '%3C')`: true (normalized to '%3C')
* `urlStringsEquivalent('%6e', '%6E')`: **false** (not normalized)
* `urlStringsEquivalent('%3d', '%3D')`: false (not normalized)

In my opinion, the spec and Firefox should change so that these "unreserved" characters (particularly alphanumerics) are equivalent to their percent-encoded counterparts, though instead of Chrome's set (RFC 3986) I think we should use RFC 2396's, for compatibility with encodeURIComponent (hence `urlStringsEquivalent('*', '%2A')` should also return true).

### URL rendering

Paste this URL in the address bar:

```
https://example.com/%20%21%22%23%24%25%26%27%28%29%2a%2b%2c%2d%2e%2f%30%31%32%33%34%35%36%37%38%39%3a%3b%3c%3d%3e%3f%40%41%42%43%44%45%46%47%48%49%4a%4b%4c%4d%4e%4f%50%51%52%53%54%55%56%57%58%59%5a%5b%5c%5d%5e%5f%60%61%62%63%64%65%66%67%68%69%6a%6b%6c%6d%6e%6f%70%71%72%73%74%75%76%77%78%79%7a%7b%7c%7d%7e%7f%ce%a9
```

Chrome decodes the following characters: ASCII alphanum, non-ASCII, and `"-.<>_~`. All other characters remain encoded. This is the RFC 3986 unreserved set, plus `"<>`, the printable characters in the intersection of the URL Standard fragment and query percent-encode sets (those three characters are always encoded by the parser, so, like unreserved characters, they have the same semantics whether encoded or not). `Parse(Serialize(url)) == Parse(Render(url))` is true for Chrome for all URLs.

Firefox decodes the following characters: ASCII alphanum, non-ASCII, backtick, and `!"'()*-.<>[\]^_{|}~`. All other characters remain encoded. This is the RFC 2396 unreserved set (which already includes `_`), plus backtick and `"<>[\]^{|}`. I'm not sure what the rationale behind Firefox's decode set is.

`Parse(Serialize(url)) == Parse(Render(url))` is not true for Firefox. For example, the URL "https://example.com/%2A": `Parse(Serialize(url))` gives "https://example.com/%2A", while `Parse(Render(url))` gives "https://example.com/*".

Clearly, neither of these implementations follows the standard, which says to decode **all** characters. Therefore, the spec should change to more closely match implementations, preferably using RFC 2396's unreserved set, for consistency. We could also throw in `"`, `<` and `>`, since these will be re-encoded upon parsing. Whatever is decided, it should be the case that `Parse(Serialize(url)) == Parse(Render(url))`.
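
For illustration only (not proposed spec text), here is a sketch of a rendering step with that property: it decodes only runs of percent-encoded bytes at or above 0x80, treated as UTF-8, and leaves every ASCII escape alone. Decoding `"`, `<` and `>` on top of that would be an independent, optional tweak:

```js
// Sketch only: decode maximal runs of %XX escapes whose bytes are all >= 0x80,
// interpreting each run as UTF-8. ASCII escapes (and invalid UTF-8) stay encoded.
function renderForDisplay(serializedURL) {
  return serializedURL.replace(/(?:%[89A-Fa-f][0-9A-Fa-f])+/g, (run) => {
    try {
      return decodeURIComponent(run);
    } catch (e) {
      return run; // not valid UTF-8: keep the run percent-encoded
    }
  });
}

renderForDisplay("https://example.com/%3D%ce%a9"); // "https://example.com/%3DΩ"
```

Since non-ASCII characters are re-encoded by the parser anyway, re-parsing the rendered string gives back the original serialization.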

### Encoding arbitrary strings

Let's take a look at [registerProtocolHandler](https://html.spec.whatwg.org/multipage/system-state.html#dom-navigator-registerprotocolhandler)'s escaping behaviour: a URL is escaped before being substituted into the "%s" template string. The spec says to escape it with the "default encode set", a name that no longer exists but now links to the path percent-encode set, which is: C0 control chars, space, backtick, non-ASCII, and `"#<>?{}`.

I'll test this by navigating to [httpbin](https://httpbin.org) and running this code in the Console:

```js
// Register httpbin.org's /get endpoint as a handler for mailto: links;
// the clicked mailto: URL gets substituted for "%s".
navigator.registerProtocolHandler("mailto", "/get?address=%s", "httpbin")
```

Now a malicious site can inject other query parameters by linking you to "[mailto:foo@example.com&launchmissiles=true](mailto:foo@example.com&launchmissiles=true)".

According to the spec, this is supposed to open [https://httpbin.org/get?address=mailto:foo@example.com&launchmissiles=true](https://httpbin.org/get?address=mailto:foo@example.com&launchmissiles=true). That's a query parameter injection attack. httpbin displays:

```json
  "args": {
    "address": "mailto:foo@example.com", 
    "launchmissiles": "true"
  },
```

Fortunately, Chrome and Firefox both encode many more characters. In both cases, they open [https://httpbin.org/get?address=mailto%3Afoo%40example.com%26launchmissiles%3Dtrue](https://httpbin.org/get?address=mailto%3Afoo%40example.com%26launchmissiles%3Dtrue), so the '&' and '=' are correctly interpreted as part of the email address, not separate arguments. httpbin displays:

```json
  "args": {
    "address": "mailto:foo@example.com&launchmissiles=true"
  },
```

Chrome encodes everything outside the RFC 2396 unreserved set, matching [encodeURIComponent](https://www.ecma-international.org/ecma-262/6.0/#sec-encodeuricomponent-uricomponent). Firefox leaves a few more characters unencoded: `<>[\]{|}` (but nothing important).

I think the correct fix is to change `registerProtocolHandler`'s spec (in HTML) to match `encodeURIComponent`. However, there isn't an easy way to do that, short of calling into the ECMAScript-defined `encodeURIComponent` method, or explicitly listing all characters. If we had an appropriate "reserved set" or "default encode set" (the complement of the unreserved set) in the URL Standard, then `registerProtocolHandler` could just use that.
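
To sketch what that would mean in practice (`buildHandlerURL` is an illustrative name, not anything defined in HTML or the URL Standard), the "%s" substitution would in effect be:

```js
// Illustrative only: escape the incoming URL with encodeURIComponent, i.e. encode
// everything outside the RFC 2396 unreserved set, before substituting it for "%s".
function buildHandlerURL(template, incomingURL) {
  return template.replace("%s", encodeURIComponent(incomingURL));
}

buildHandlerURL("https://httpbin.org/get?address=%s",
                "mailto:foo@example.com&launchmissiles=true");
// "https://httpbin.org/get?address=mailto%3Afoo%40example.com%26launchmissiles%3Dtrue"
```

That reproduces the behaviour Chrome and Firefox already exhibit, as shown above.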

Note that I am developing the [Web Share Target API](https://wicg.github.io/web-share-target) and need basically the same thing as `registerProtocolHandler`. At the moment, I've specified it to use the "userinfo percent-encode set", but that still doesn't cover all the characters I need (especially '&').

## Recommendations

Given all of the above, I would like to make the following changes to URL Standard:

1. Define an "unreserved set", probably as ASCII alphanumeric plus `!'()*-._~` (which matches RFC 2396, but the exact set is debatable).
2. Define a "reserved set" or "default encode set" as the complement of unreserved. (This set would include all C0 control chars, as well as all non-ASCII characters.) (If it's called "default encode set", then `registerProtocolHandler` is automatically fixed. Otherwise we have to update `registerProtocolHandler` to use the reserved set's name instead.)
3. Add a recommendation that other standards use the default encode set for sanitizing strings before inserting them into a URL. Note that this is equivalent to ECMAScript's `encodeURIComponent` function.
4. The URL Parser needs to be updated to normalize percent-encoded sequences: values in the unreserved set need to be decoded. Values not in the unreserved set need to have their hex digits uppercased. (This is actually fairly hard, due to the way the parser is written. Some refactoring required, but doable.) Note that this automatically fixes equivalence.
5. The URL Rendering algorithm needs to be updated. Instead of decoding all characters, only decode non-ASCII characters, and optionally, `"`, `<` and `>` (the intersection of the fragment and query encode sets). Note that this algorithm should satisfy `Parse(Serialize(url)) == Parse(Render(url))` for all URLs.

Doing so would solve a number of issues outlined above, and bring the spec much closer to existing implementations. It would then make sense to update implementations to match the new spec.

I am quite familiar with the URL Standard and am volunteering to make the required changes, if there is consensus. Also, I don't strictly need there to be a reserved / unreserved set. These three problems could be fixed individually. But it makes the most sense to conceptualize this as reserved vs unreserved, and then tie a bunch of other definitions off of those concepts.

Regards,

Matt
