[whatwg/url] Percent encoding does not round-trip (#523) from Karl on 2020-06-02 (public-webapps-github@w3.org from June 2020)

From: Karl <notifications@github.com>
Date: Tue, 02 Jun 2020 05:37:52 -0700
To: whatwg/url <url@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <whatwg/url/issues/523@github.com>

Hi!

I'm trying to implement the current version of the spec in Swift, for non-browser applications. The idea is that it's useful to have a URL type that behaves as your browser does and accepts/rejects the same things.

One issue that I've noticed is that the current definition of the ["UTF-8 percent-encode"](https://url.spec.whatwg.org/#percent-encoded-bytes) algorithm doesn't round-trip for strings which contain the percent character. For example, following the algorithm, the string "%100" (UTF-8: [37, 49, 48, 48]) is not changed at all when encoding (regardless of the percent-encode set; none of them contain the "%" character itself). However, decoding that same string using the decoding algorithm in the spec results in the UTF-8 sequence [16, 48]: the unprintable "data link escape" control character (ASCII 0x10) followed by "0".
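To make the failure concrete, here is a minimal sketch of the two algorithms as I read them. The helper names (`utf8PercentEncode`, `percentDecode`) and the `illustrativeSet` are my own assumptions for illustration; the set is a small stand-in, not any of the spec's actual percent-encode sets.

```javascript
// Hypothetical stand-in for a percent-encode set. Like the spec's sets, it does not contain "%".
const illustrativeSet = new Set([" ", "\"", "#", "<", ">", "?", "/", ":", ";", "=", "@"]);

// Sketch of the encode direction: bytes are escaped only if they are C0 controls,
// above 0x7E, or in the given set.
function utf8PercentEncode(input, encodeSet) {
  let out = "";
  for (const b of new TextEncoder().encode(input)) {
    const ch = String.fromCharCode(b);
    if (b <= 0x1f || b > 0x7e || encodeSet.has(ch)) {
      out += "%" + b.toString(16).toUpperCase().padStart(2, "0");
    } else {
      out += ch;
    }
  }
  return out;
}

// Sketch of the decode direction: "%" followed by two hex digits becomes one byte.
function percentDecode(input) {
  const bytes = new TextEncoder().encode(input);
  const out = [];
  for (let i = 0; i < bytes.length; i++) {
    const hex = String.fromCharCode(bytes[i + 1] ?? 0, bytes[i + 2] ?? 0);
    if (bytes[i] === 0x25 && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      out.push(parseInt(hex, 16));
      i += 2; // skip the two hex digits just consumed
    } else {
      out.push(bytes[i]);
    }
  }
  return out;
}

const encoded = utf8PercentEncode("%100", illustrativeSet);
const decoded = percentDecode(encoded);
console.log(encoded); // "%100" (unchanged, because "%" is not in the set)
console.log(decoded); // [ 16, 48 ], not the original [ 37, 49, 48, 48 ]
```
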

[RFC-3986](https://tools.ietf.org/html/rfc3986#section-2.4) warns about this:

> Because the percent ("%") character serves as the indicator for
   percent-encoded octets, it must be percent-encoded as "%25" for that
   octet to be used as data within a URI. 

Indeed, the `encodeURIComponent` JS function encodes the percent character:
```
> encodeURIComponent("%100")
"%25100"
```
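For comparison, a short sketch using only the built-in `encodeURIComponent` / `decodeURIComponent` pair shows the round trip working once "%" is escaped, and the misreading that happens when it is not:

```javascript
const original = "%100";

// Escaping "%" as "%25" makes the round trip lossless.
const escaped = encodeURIComponent(original);
const restored = decodeURIComponent(escaped);
console.log(escaped);               // "%25100"
console.log(restored === original); // true

// Decoding the unescaped string treats "%10" as an escape for byte 0x10.
const misread = decodeURIComponent(original);
console.log(JSON.stringify(misread));
```
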

I was ready to submit a PR to have the spec's algorithm also do this, but it appears there is an explicit test that percent characters are _not_ escaped (in this case, in the URL's username component): https://github.com/web-platform-tests/wpt/blob/master/url/resources/urltestdata.json#L2372

```
    "input": "http://%25DOMAIN:foobar@foodomain.com/",
    "base": "about:blank",
    "href": "http://%25DOMAIN:foobar@foodomain.com/",
    "origin": "http://foodomain.com",
    "protocol": "http:",
    "username": "%25DOMAIN",   // <--- I would expect this to be "%2525DOMAIN"
```
I'm not sure if this is correct. Usernames are required to be escaped (meaning they must be unescaped to recover their original value), as in the following example:

```
> new URL("http://;DOMAIN:foobar@foodomain.com/")
...
username: "%3BDOMAIN" // <--- Semicolon is escaped as %3B
```
However, unescaping the string "%25DOMAIN", as the test expects, would not recover the original value; it would instead produce the invalid result "%DOMAIN". Again, RFC-3986 warns about this:

> Implementations must not
   percent-encode or decode the same string more than once, as decoding
   an already decoded string might lead to misinterpreting a percent
   data octet as the beginning of a percent-encoding, or vice versa in
   the case of percent-encoding an already percent-encoded string.
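The two username cases above can be put side by side in a small sketch. It assumes a `URL` implementation that matches the web-platform-tests expectation quoted earlier; `decodeURIComponent` stands in for "unescaping" here, which is my choice for illustration and not something the spec prescribes.

```javascript
const withSemicolon = new URL("http://;DOMAIN:foobar@foodomain.com/");
const withPercent = new URL("http://%25DOMAIN:foobar@foodomain.com/");

// The semicolon is escaped by the parser, so unescaping recovers what was typed.
console.log(withSemicolon.username);                     // "%3BDOMAIN"
console.log(decodeURIComponent(withSemicolon.username)); // ";DOMAIN"

// The percent sign is left alone, so unescaping yields something that was never typed.
console.log(withPercent.username);                       // "%25DOMAIN"
console.log(decodeURIComponent(withPercent.username));   // "%DOMAIN"
```

A consumer looking only at `username` cannot tell whether "%25DOMAIN" was already escaped by the author of the URL or should be read literally.
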

Can anybody confirm the correct behaviour? I think we _should_ be escaping the % character, and that the test is incorrect. If that's not the case (and the current behaviour is correct), this could use a note or warning in the spec.


Received on Tuesday, 2 June 2020 12:38:06 UTC