[whatwg/url] How should "everything after the scheme" URLs work? (#385)

There are several URL types that are basically of the form `scheme:<some arbitrary data>`. For example, `data:`, `mailto:`, `javascript:`, and `urn:`.

The question is, how should software process these URLs? I see three main models:

1. Treat these as non-URLs: check if the string has a leading `scheme:`, then look at everything after that.
   - Nothing specced does this. (Although I suspect a decent amount of un-specced non-browser software might.)
   - This is probably not a good idea, if we want to call these things URLs at all. For example, it misses canonicalizations like percent-decoding and whitespace-stripping that are otherwise common to URLs.
2. [Parse](https://url.spec.whatwg.org/#concept-url-parser) the URL. Check if its scheme is the one you want. Then, [serialize](https://url.spec.whatwg.org/#concept-url-serializer) it, and strip the leading scheme. (Maybe also strip the fragment?) Now process that remaining set of code units.
   - This is how the relatively-new [`data:` URL processor](https://fetch.spec.whatwg.org/#data-url-processor) spec works.
   - This is how the very old [`javascript:` URL processing](https://html.spec.whatwg.org/#javascript-protocol) is specced (although I don't think we have extensive tests in that area)
3. [Parse](https://url.spec.whatwg.org/#concept-url-parser) the URL. Now, validate it according to some strict criteria, such as: no username, no password, no host, no port, maybe no query, maybe no fragment. Now, process the path, and optionally process the query or fragment, if those are allowed for your scheme.
   - Nothing specced does this, yet.
   - This might be better than (2), as it is stricter validation, and more in line with the traditional RFCs, which consider these "everything after the scheme" URLs as having paths only.
   - This model seems a bit weird in that if your `<some arbitrary data>` contains `?`s or `#`s, you have to model that as allowing queries and fragments, and then processing `${path}?${query}#${fragment}`. Whereas (2) just lets you process the whole string at once.
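
As a rough illustration of how (2) and (3) differ in code, here is a minimal sketch built on the `URL` class. The helper names `afterScheme` and `validatePathOnly` and the option names are hypothetical, not taken from any spec, and the `URL` API cannot distinguish a null host from an empty one (or a null query from an empty one), so this only approximates the validation described in (3).

```typescript
// Hypothetical helpers for illustration; not part of any spec.

function tryParse(input: string): URL | null {
  try {
    return new URL(input);
  } catch {
    return null; // not parseable as an absolute URL
  }
}

// Model (2): parse, check the scheme, serialize, strip the leading scheme,
// and optionally strip the fragment.
function afterScheme(input: string, scheme: string, stripFragment = false): string | null {
  const url = tryParse(input);
  if (url === null || url.protocol !== `${scheme}:`) return null;
  let rest = url.href.slice(url.protocol.length);
  // In a serialized URL the first "#" always starts the fragment.
  const hashIndex = rest.indexOf("#");
  if (stripFragment && hashIndex !== -1) rest = rest.slice(0, hashIndex);
  return rest;
}

// Model (3): parse, then reject anything with credentials, a host, or a port,
// and reject a query or fragment unless the caller allows them.
function validatePathOnly(
  input: string,
  scheme: string,
  options: { allowQuery?: boolean; allowFragment?: boolean } = {},
): { path: string; query: string; fragment: string } | null {
  const url = tryParse(input);
  if (url === null || url.protocol !== `${scheme}:`) return null;
  // Caveat: `host` is "" for both a null host and an empty host, so an empty
  // host (as in "scheme:///path") passes this check.
  if (url.username !== "" || url.password !== "" || url.host !== "" || url.port !== "") {
    return null;
  }
  if (!options.allowQuery && url.search !== "") return null;
  if (!options.allowFragment && url.hash !== "") return null;
  return { path: url.pathname, query: url.search, fragment: url.hash };
}

console.log(afterScheme("data:text/plain,hi#frag", "data", true)); // "text/plain,hi"
console.log(validatePathOnly("mailto:d@domenic.me", "mailto"));
```

Note how (2) hands back a single string, while (3) hands back separate components that the caller has to stitch together again if the scheme's data may contain `?` or `#`.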

An interesting example contrasting (2) and (3) is the following: `javascript://somehost/%0Aalert(1)`
- In (2), it would work and cause an alert, because the percent-decoded source string `//somehost/\nalert(1)` is interpreted as a line comment followed by an alert.
- In (3), it would fail, since we'd validate that hosts aren't present in `javascript:` URLs.
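
To make the contrast concrete, the following sketch inspects that URL with the `URL` class. The function name `inspectJavascriptUrl` is hypothetical, and `decodeURIComponent` is used only as a convenient stand-in for percent-decoding followed by UTF-8 decoding (unlike a spec-level percent-decode, it throws on malformed sequences).

```typescript
// Hypothetical helper for illustration; not part of any spec.
function inspectJavascriptUrl(input: string) {
  const url = new URL(input);

  // Model (2): serialize, strip the leading scheme, then percent-decode.
  const encodedSource = url.href.slice(url.protocol.length);
  const source = decodeURIComponent(encodedSource); // stand-in for percent-decode + UTF-8 decode

  return {
    sourceLines: source.split("\n"), // how a script engine would see the lines
    host: url.host, // what a model (3) validator would look at
    rejectedByModel3: url.host !== "",
  };
}

const result = inspectJavascriptUrl("javascript://somehost/%0Aalert(1)");
console.log(result.sourceLines); // [ "//somehost/", "alert(1)" ]
console.log(result.host, result.rejectedByModel3); // "somehost" true
```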

Another example is that `mailto:///d@domenic.me` is interpreted as containing a `<some arbitrary data>` of `///d@domenic.me` in (2) and a path of `/d@domenic.me` in (3).
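
The two readings can be compared side by side with the `URL` class. This is a sketch only; `mailtoViews` is a hypothetical name, and its property names simply mirror the model numbers used above.

```typescript
// Hypothetical helper for illustration; not part of any spec.
function mailtoViews(input: string): { model2: string; model3Path: string } {
  const url = new URL(input);
  return {
    // Model (2): the serialization minus the leading "mailto:".
    model2: url.href.slice(url.protocol.length),
    // Model (3): the parsed path only.
    model3Path: url.pathname,
  };
}

console.log(mailtoViews("mailto:///d@domenic.me"));
// { model2: "///d@domenic.me", model3Path: "/d@domenic.me" }

console.log(mailtoViews("mailto:d@domenic.me"));
// { model2: "d@domenic.me", model3Path: "d@domenic.me" }
```

For the ordinary `mailto:d@domenic.me` form, with no slashes, query, or fragment, the two models see the same string; they only diverge once the data after the scheme looks like an authority.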

There are probably more interesting examples of this sort.

---

The purpose of this thread is to gather community thoughts on these scenarios, with an eye toward setting a precedent for future such schemes, and providing recommendations for software that processes such URLs (including both the web's specced `data:` and `javascript:`, and other schemes like `mailto:` or `urn:`).

If we decide (2) is better, we should provide better spec support for it, including helper operations and an explicit recommendation to keep following this pattern. If we decide (3) is better, we should do the same, and we should either explicitly note `data:` and `javascript:`'s processing models as legacy, or we should try to change them (which might be possible if interop is bad).

/ccing some people who might have thoughts: @mnot @jasnell @sleevi @masinter

https://github.com/whatwg/url/issues/385

Received on Tuesday, 8 May 2018 16:12:29 UTC