Re: [whatwg/url] Need an "unreserved" character set (and better define how to percent-encode arbitrary strings) (#369)

> @annevk:
> https://%6d%6D/%6d%6D?%6d%6D#%6d%6D yields https://mm/mm?%6d%6D#%6d%6D in Chrome, ... (or are you indeed proposing to normalize all and change Chrome as well?).

Right. I don't see any reason to not normalize in the query and fragment as well, and update Chrome to match.

Theoretically, it shouldn't matter whether any particular URL processor normalizes "%6D" to "m" or not, because "%6D" should be considered equivalent to "m". The only problem is a technicality: equivalence is **defined** by the URL parser, so we need the spec to say that "%6D" decodes to "m" in the parser, otherwise the two spellings are not considered equivalent.
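A quick illustration of that technicality, using Python's `urllib.parse` as a stand-in for a URL processor: once the parser decodes "%6D" (in either case) to "m", the two spellings become indistinguishable downstream.

```python
from urllib.parse import unquote

# "%6D" and "%6d" are both percent-encodings of "m" (code point 0x6D).
assert unquote("%6D") == "m"
assert unquote("%6d") == "m"

# A processor that decodes during parsing sees these as the same URL;
# one that doesn't has no spec-level reason to treat them as equivalent.
assert unquote("https://%6d%6D/") == unquote("https://mm/")
```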

> but it does not really address your larger point, which I'm not quite sure I fully understand.

I'll go from my most pragmatic concern to most ideological:

1. If you have an arbitrary string and need to insert it into a URL (as we do in registerProtocolHandler and Web Share Target), it is difficult to find the right set of characters to percent-encode. None of the encoding sets defined in the URL Standard are appropriate. This leads directly to whatwg/html#3377 (registerProtocolHandler's specification is wrong). Fixing the spec would basically mean changing it to say "encode all code points that are not in the RFC 3986 unreserved set". We could just do that, and stop there, but that feels unsatisfactory.
2. The reason it feels unsatisfactory is that there is no logical reason (based on the text of the URL Standard) why '$' should be encoded but 'a' should not. It "feels right" to encode '$' but not 'a', but that's because we're used to decades of both software and specs doing so. There's nothing in the URL Standard to differentiate these characters. Also, if you read the spec carefully, there is nothing special about '&' outside of application/x-www-form-urlencoded and the URLSearchParams JavaScript API. Nothing in the core definition of a URL (the parser, serializer, and equivalence logic) treats '&' any differently from '$' or 'a'. So I would have no reason for registerProtocolHandler to encode '&' other than "we all know that '&' delimits query parameters". I feel the URL Standard needs a much clearer policy on which characters must be encoded and which are safe to leave bare.
3. If there's nothing in the spec that explicitly says "a and %61 mean exactly the same thing", then theoretically I can't be sure that a URL processor won't use 'a' as some kind of delimiter, while "%61" is used to represent a literal 'a', just as URL processors commonly use '&' as a delimiter, while "%26" is used to represent a literal '&'. Even though it's unlikely that 'a' would be used as a syntax character, the spec allows it. Thus, theoretically, a general algorithm for encoding a string for insertion into a URL must encode *all* characters, just to be sure. *While this may be only a theoretical concern, the specification document for a core technology of the Internet should not allow for implementations that break everybody's expectations while still following the letter of the law.* By contrast, RFC 3986 gives us a defined list of characters that *do not need to be encoded* because the processor on the other end is not allowed to distinguish the encoded and non-encoded form. That's what we need.
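The pragmatic fix from point 1 can be sketched as a hypothetical encoder (the function name is mine, not from any spec) that percent-encodes every code point outside the RFC 3986 unreserved set, with non-ASCII code points UTF-8 encoded first:

```python
import string

# RFC 3986 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = frozenset(string.ascii_letters + string.digits + "-._~")

def encode_component(s: str) -> str:
    """Percent-encode every code point not in the unreserved set.

    Hypothetical helper for illustration: each character outside the
    unreserved set is UTF-8 encoded, then each byte is percent-encoded.
    """
    return "".join(
        c if c in UNRESERVED
        else "".join(f"%{b:02X}" for b in c.encode("utf-8"))
        for c in s
    )
```

For example, `encode_component("a&b=$1")` yields `"a%26b%3D%241"`: only the characters a processor is *not allowed* to treat specially are left bare.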

> @bsittler:
> is there any good reason to even allow %-encoding of ASCII alphanumerics? Is there actually enough legitimate usage or an otherwise-impossible scenario reliant on this feature to justify it?

If you're going to go down this path, I'd want the other unreserved characters (like '_' and '~') treated the same. Otherwise you create three classes of characters: reserved, unreserved, and non-encodable, recreating the same problem for a smaller set of characters.

> It seems to me like it's primarily allowing naïve filters to be bypassed, similar to overlong UTF-8 encodings -- which are thankfully banned on the web for reasons of security. Is there any reason we cannot likewise ban these?

I can't speak to whether this would mitigate any realistic security problems. My feeling is that it's been legal to encode unreserved characters for 20 years, and making it illegal now would break an enormous amount of software, especially since different encoders encode slightly different sets of characters (e.g., Python's [urllib.parse.quote](https://docs.python.org/3/library/urllib.parse.html#urllib.parse.quote) encodes '~', even though it's in the unreserved set, so if we made "%7E" illegal, URLs generated by Python would become illegal).
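To see how encoders disagree in practice, one can probe which printable ASCII characters a given encoder leaves bare (here Python's `quote`; note the exact set has varied across Python versions, e.g. 3.7 stopped encoding '~', which is exactly the interoperability hazard):

```python
from urllib.parse import quote

# Printable ASCII characters that quote() leaves unencoded
# even with safe="" (i.e., its built-in "never quote" set).
bare = {c for c in map(chr, range(0x21, 0x7F)) if quote(c, safe="") == c}

# Alphanumerics are always left bare; reserved delimiters never are.
assert "a" in bare and "Z" in bare and "5" in bare
assert "&" not in bare and "$" not in bare
# Whether "~" appears in `bare` depends on the Python version,
# illustrating that "the unreserved set" is not applied uniformly.
```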

(Also note that the UTF-8 standard itself banned overlong encodings [from 2003 onwards](https://en.wikipedia.org/wiki/UTF-8#History); this isn't a web-specific restriction.)

https://github.com/whatwg/url/issues/369#issuecomment-359614523

Received on Monday, 22 January 2018 23:52:30 UTC