- From: Karl <notifications@github.com>
- Date: Tue, 02 Jun 2020 05:37:52 -0700
- To: whatwg/url <url@noreply.github.com>
- Cc: Subscribed <subscribed@noreply.github.com>
- Message-ID: <whatwg/url/issues/523@github.com>
Hi!

I'm trying to implement the current version of the spec in Swift (for non-browser applications. The idea is that it's useful to have a URL type that behaves as your browser behaves and accepts/rejects the same things).

One issue that I've noticed is that the current definition of the ["UTF8 percent encode"](https://url.spec.whatwg.org/#percent-encoded-bytes) algorithm doesn't round-trip for strings which contain the percent character.

For example, following the algorithm, the string "%100" (UTF-8: [37, 49, 48, 48]) is not changed at all when encoding (regardless of the percent-encode set; none of them contain the "%" character itself). However, decoding that same string using the decoding algorithm in the spec results in the UTF-8 sequence [16, 48]: the unprintable "data link escape" character (ASCII 0x10) followed by "0".

[RFC-3986](https://tools.ietf.org/html/rfc3986#section-2.4) warns about this:

> Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI.

Indeed, the `encodeURIComponent` JS function encodes the percent character:

```
> encodeURIComponent("%100")
"%25100"
```

I was ready to submit a PR to have the spec's algorithm also do this, but it appears there is an explicit test that percent characters are _not_ escaped (in this case, in the URL's username component):

https://github.com/web-platform-tests/wpt/blob/master/url/resources/urltestdata.json#L2372

```
"input": "http://%25DOMAIN:foobar@foodomain.com/",
"base": "about:blank",
"href": "http://%25DOMAIN:foobar@foodomain.com/",
"origin": "http://foodomain.com",
"protocol": "http:",
"username": "%25DOMAIN", // <--- I would expect this to be "%2525DOMAIN"
```

I'm not sure if this is correct. Usernames are required to be escaped (meaning they must be unescaped to recover their original value), as in the following example:

```
> new URL("http://;DOMAIN:foobar@foodomain.com/")
...
username: "%3BDOMAIN" // <--- Semicolon is escaped as %3B
```

However, unescaping the string "%25DOMAIN" as the test expects would not recover the original; it would instead produce the invalid result "%DOMAIN". Again, RFC-3986 warns about this:

> Implementations must not percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string.

Can anybody confirm the correct behaviour? I think we _should_ be escaping the % character, and that the test is incorrect. If that's not the case (and the current behaviour is correct), this could use a note or warning in the spec.
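To make the round-trip problem concrete, here is a minimal JavaScript sketch of the two algorithms as described above. It is an approximation for illustration, not the spec's text: the function names `percentEncode` and `percentDecode` and the `illustrativeSet` array are hypothetical, and the set is a stand-in rather than one of the spec's actual percent-encode sets.

```javascript
// Sketch of a spec-style UTF-8 percent-encode: a byte is escaped only if it is
// a C0 control, above 0x7E, or listed in `encodeSet` (hypothetical parameter).
function percentEncode(input, encodeSet) {
  const bytes = new TextEncoder().encode(input);
  let out = "";
  for (const b of bytes) {
    const ch = String.fromCharCode(b);
    if (b < 0x20 || b > 0x7e || encodeSet.includes(ch)) {
      out += "%" + b.toString(16).toUpperCase().padStart(2, "0");
    } else {
      out += ch;
    }
  }
  return out;
}

// Sketch of percent-decode: "%" followed by two hex digits becomes one byte;
// anything else is copied through unchanged. Returns an array of byte values.
function percentDecode(input) {
  const bytes = new TextEncoder().encode(input);
  const out = [];
  for (let i = 0; i < bytes.length; i++) {
    const hasTwoMore = i + 2 < bytes.length;
    const hex = hasTwoMore
      ? String.fromCharCode(bytes[i + 1], bytes[i + 2])
      : "";
    if (bytes[i] === 0x25 && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      out.push(parseInt(hex, 16));
      i += 2; // skip the two hex digits just consumed
    } else {
      out.push(bytes[i]);
    }
  }
  return out;
}

// Stand-in set that, like the sets discussed above, does not contain "%".
const illustrativeSet = [";", ":", "@", "/"];

const encoded = percentEncode("%100", illustrativeSet);
const decoded = percentDecode(encoded);

// Same set with "%" added, which is the behaviour the message argues for.
const encodedEscaped = percentEncode("%100", [...illustrativeSet, "%"]);
const decodedEscaped = percentDecode(encodedEscaped);

console.log(encoded, decoded);               // "%100" and [16, 48]
console.log(encodedEscaped, decodedEscaped); // "%25100" and [37, 49, 48, 48]
```

With "%" left out of the set, the encoder passes "%100" through untouched and the decoder then reads "%10" as a single byte, so the original bytes are lost. Adding "%" to the set restores the round trip, which matches what `encodeURIComponent` does.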
Received on Tuesday, 2 June 2020 12:38:06 UTC