Re: [whatwg/url] Need an "unreserved" character set (and better define how to percent-encode arbitrary strings) (#369)

Hmm, I didn't know all of those, @mnot. I thought Java at least was based on the old standard. It looks like Java is based on x-www-form-urlencoded, which is a different set again (it doesn't encode `*-._`). This means we should at least have `*` in the unreserved set: it is left bare by the x-www-form-urlencoded encoder but encoded by other encoders, so the two forms must be considered equivalent.
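To make the mismatch concrete, here is a rough Python sketch of how the same input comes out of encoders built on different sets. The `percent_encode` helper and the constant names are made up for illustration, not any library's API, and the sketch ignores the form encoder's habit of turning spaces into `+`.

```python
import string

ALNUM = string.ascii_letters + string.digits

# Marks each encoder family leaves unencoded, as described in this thread.
# (Names are hypothetical, chosen for this sketch.)
RFC_3986_MARKS = "-._~"
RFC_2396_MARKS = "!'()*-._~"
FORM_URLENCODED_MARKS = "*-._"


def percent_encode(text: str, bare_marks: str) -> str:
    """Percent-encode every UTF-8 byte that is not an ASCII letter,
    a digit, or one of the marks in bare_marks."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch in ALNUM or ch in bare_marks:
            out.append(ch)
        else:
            out.append(f"%{byte:02X}")
    return "".join(out)


# "*" stays bare under the form-urlencoded set but not under RFC 3986.
print(percent_encode("a*b", FORM_URLENCODED_MARKS))  # a*b
print(percent_encode("a*b", RFC_3986_MARKS))         # a%2Ab
```

Both outputs are meant to name the same string, which is why the parser needs to treat `*` and `%2A` as equivalent.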

The thing is, though, that it's safer to leave characters in the *unreserved* set, as long as they aren't used for any syntax. That way, encoders are free to encode them or not, as they choose, without changing the semantics. If we choose `-._~` (RFC 3986) as the unreserved set, then encoders based on RFC 3986 will be fine, but anything that encodes `!'()*` will potentially change the semantics of those characters (because the URL Standard will consider "`!`" and "`%21`" to be non-equivalent, for example).

If we choose `!'()*-._~` (RFC 2396) as the unreserved set, encoders based on RFC 3986 will be fine: maybe they encode "`!`", maybe they don't; either way the parser will view it exactly the same. Encoders based on RFC 2396 or x-www-form-urlencoded will also be fine. Basically, the larger the set, the better, as long as none of those symbols has an existing meaning in URL syntax (and I don't believe any of them do).
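A minimal Python sketch of the equivalence argument: decode a percent-escape only when it names a character in the chosen unreserved set, then compare. The function and constant names are hypothetical, and this is only one way a parser might define equivalence, not what the URL Standard specifies.

```python
import re
import string

ALNUM = string.ascii_letters + string.digits

# Candidate unreserved marks discussed in this thread (hypothetical names).
RFC_3986_MARKS = "-._~"
RFC_2396_MARKS = "!'()*-._~"


def decode_unreserved(component: str, unreserved_marks: str) -> str:
    """Decode %XX escapes only when they name an unreserved character;
    every other escape is left exactly as written."""
    unreserved = set(ALNUM + unreserved_marks)

    def replace(match: re.Match) -> str:
        ch = chr(int(match.group(1), 16))
        return ch if ch in unreserved else match.group(0)

    return re.sub(r"%([0-9A-Fa-f]{2})", replace, component)


def equivalent(a: str, b: str, unreserved_marks: str) -> bool:
    """Two components are equivalent if they match after decoding
    the escapes for unreserved characters."""
    return decode_unreserved(a, unreserved_marks) == decode_unreserved(b, unreserved_marks)


print(equivalent("a!b", "a%21b", RFC_2396_MARKS))  # True
print(equivalent("a!b", "a%21b", RFC_3986_MARKS))  # False
```

With the larger set, an encoder that escapes `!` and one that leaves it bare produce equivalent output; with the smaller set they do not.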

That's why I suggested a possibly even wider set: `!$'()*,-.;_~`, which is every character that has no special meaning in URL syntax. (This adds `$`, `,` and `;` to the RFC 2396 set.) Adding any other character potentially runs into trouble.
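As a quick sanity check on how the three candidate sets nest, here is a small Python sketch. The constant and function names are invented for illustration.

```python
# Candidate unreserved marks from this thread (hypothetical names).
RFC_3986_MARKS = set("-._~")
RFC_2396_MARKS = set("!'()*-._~")
PROPOSED_MARKS = set("!$'()*,-.;_~")


def added_by_proposal(baseline: set) -> str:
    """Return the marks in the proposed set that the baseline lacks,
    sorted by code point."""
    return "".join(sorted(PROPOSED_MARKS - baseline))


# Each set contains the previous one, so widening never un-reserves
# a character that a narrower encoder already leaves bare.
assert RFC_3986_MARKS <= RFC_2396_MARKS <= PROPOSED_MARKS

print(added_by_proposal(RFC_2396_MARKS))  # $,;
print(added_by_proposal(RFC_3986_MARKS))  # !$'()*,;
```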

-- 
https://github.com/whatwg/url/issues/369#issuecomment-380689453

Received on Thursday, 12 April 2018 06:07:59 UTC