Re: [whatwg/url] Should we unescape characters in path? (#606) from Karl on 2022-01-23 (public-webapps-github@w3.org from January 2022)

From: Karl <notifications@github.com>
Date: Sun, 23 Jan 2022 10:17:12 -0800
To: whatwg/url <url@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <whatwg/url/issues/606/1019538994@github.com>

I'm currently exploring implementing this in Swift, as over-encoding/removing over-encoding is an important feature for interop with our existing RFC 2396 URL type, as well as a generally useful feature. Having looked at the previous issues, I'm reasonably convinced this is possible. I'm not seeing any insurmountable challenges.

> Maybe? That really depends on whether the user knows what the parser will do.

I don't really find this very satisfying; the same argument could be made the other way. If the user is expected to have a deep and detailed understanding of the parser, any behaviour is reasonable and nothing needs to be justified. It's a kind of circular reasoning where things happen because they happen.

> If a server wants to treat `%61` and `a` differently, it can.

On the one hand, this is demonstrably true because - well, form-encoding 😔. A `+` and a `%2B` may certainly be different depending on how the query is interpreted.
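
To make the `+` versus `%2B` point concrete, here is a small sketch using Python's standard `urllib.parse`. The query string and its key names are made up for illustration; the only thing being shown is that the same bytes decode differently under form decoding and plain percent-decoding.

```python
from urllib.parse import parse_qs, unquote

# Hypothetical query string: one value uses a literal "+", the other "%2B".
query = "a=1+2&b=1%2B2"

# Form decoding (application/x-www-form-urlencoded) treats "+" as a space.
form_values = parse_qs(query)

# Plain percent-decoding leaves "+" alone and only decodes "%2B".
plain_a = unquote("1+2")
plain_b = unquote("1%2B2")

print(form_values)       # {'a': ['1 2'], 'b': ['1+2']}
print(plain_a, plain_b)  # 1+2 1+2
```

Under form decoding the two values differ; under plain percent-decoding they are the same string.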

On the other hand, at least for some characters in some components, that behaviour would not appear to be web compatible. Routers, caches and CDNs will sometimes decode these bytes, and expect that doing so does not change the meaning of the URL. The discussions in previous issues seem to indicate that many browsers very much do expect these to be equivalent.
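
As a rough sketch of the kind of decoding an intermediary might perform, the following removes over-encoding only for the RFC 3986 "unreserved" characters. The choice of that character set and the function name `remove_over_encoding` are assumptions for illustration; which characters are actually safe to decode in which component is exactly the question under discussion.

```python
import re

# Assumption: the RFC 3986 "unreserved" set is the set that is safe to decode.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")

def remove_over_encoding(component):
    """Decode %XX escapes only when they stand for an unreserved ASCII character."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        # Keep the escape as written unless it encodes an unreserved character.
        return char if char in UNRESERVED else match.group(0)
    return _ESCAPE.sub(replace, component)

print(remove_over_encoding("/p%61th/to%2Ffile%20name"))  # /path/to%2Ffile%20name
```

Here `%61` is decoded to `a`, while `%2F` and `%20` are left alone because decoding them could change how the path is split or interpreted.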

Such a server would serve different resources to different browsers for the same URL, which seems at odds with the idea of interoperability or the web as a platform. The evidence in this issue indicates that GitHub Pages is apparently behaving as you say it may, and it breaks Firefox's ability to navigate to certain websites hosted on that server. If GHP is indeed entitled to behave that way, it suggests that all browsers which successfully navigate to that URL are wrong - which again, does not seem to be a web-compatible position.

> There are corner cases beyond percent encoding. For example http://example.com/path/to//file (two slashes) and http://example.com/path/to/file (one slash) are essentially equivalent from the filesystem's point of view, but depending on the web server you're using, they might not be. While the URL parser could say that we should collapse the two paths, it's probably more important that we keep the processing to a minimal in order to not change the URL's initial form.

The difference, IMO, is that the URL parser does not add or remove empty path components (any more! It used to do that to file URLs). It does, however, add and remove percent-encoding, meaning there is already implicit acceptance that doing so does not change the meaning of the URL.
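
As a rough illustration of the empty-path-component case, here is a sketch in Python. Note that `urllib.parse` is not a WHATWG URL parser; it is used here only as an example of a URL splitter that leaves the path untouched, contrasted with a filesystem-style normalization.

```python
import posixpath
from urllib.parse import urlsplit

url = "http://example.com/path/to//file"

# Splitting the URL leaves the empty path component in place.
path = urlsplit(url).path

# A filesystem-style normalization collapses the doubled slash.
collapsed = posixpath.normpath(path)

print(path)       # /path/to//file
print(collapsed)  # /path/to/file
```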

By definition, if the parser does something (e.g. turning `http://ex%61mple.com` -> `http://example.com`), it **must** preserve meaning, as any attempt to utter the former as a URL record results in the latter, and URLs are records:

> A URL is a struct that represents a universal identifier. To disambiguate from a valid URL string it can also be referred to as a URL record.

We are forced to accept that the web's model of URLs, as defined by the various browser implementations over the decades, includes this assumption that percent-encoding may be safely added or removed in certain circumstances, and that a standard which attempts to describe that model must define that process and the circumstances where it applies.

-- 
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/url/issues/606#issuecomment-1019538994

Received on Sunday, 23 January 2022 18:17:24 UTC