[whatwg/url] Proposal for new version of parsing spec (Issue #778)

I think we need to extend the PEG to specify the lower-layer protocols explicitly (i.e., chain multiple schemes together), especially since HTTP can now also run over UDP and more and more software uses HTTP as a transport or tunneling protocol. The current parsing spec is not flexible enough for this. That adds redundancies and limitations which have already led to incompatible implementations (for example, support for UNIX domain sockets or the "http+socket://" notations). These differences can cause security vulnerabilities when one piece of software gets "glued" to another (different parsing between WAF, frontend, and backend servers).

I think the following syntax would cover all of these (new) challenges and provide the needed flexibility while staying backward compatible with the current one:

`LowestLayer+HigherLayer+EvenHigherLayer://[[username]:[password]@EvenHigherLayerEndpointIdentifier]:[HigherLayerEndpointIdentifier]:[LowerLayerEndPointIdentifier]/resource`
Notes on this syntax:

- The square brackets around each attribute are optional, and lower layers take default values if the URL does not specify them explicitly.
- Implementations should be encouraged to offer a strict parsing mode that does not try to guess anything: it only accepts URLs with square brackets around every attribute and with all data provided explicitly (no implied application ports, no implied lower-layer protocols, ...). This is mainly for security, future-proofing, and reliability when used by scripts and automation, as well as for debuggability by experts and prosumers.
- Multiple (chained) endpoint identifiers are only allowed in the verbose, bracketed form (to avoid parsing bugs and ambiguity), and the endpoint identifiers must match the specified layers 1:1, in reversed order.
- The current `username:password@` explicitly becomes part of the identifier for the HTTP endpoint, so that each layer can carry its own independent login information or other protocol-specific information. The parser just hands it to the protocol named by the scheme as an opaque blob.
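
To make the strict mode more concrete, here is a minimal Python sketch of how a parser for the fully bracketed form might look. It is only an illustration under my own assumptions: the names `Layer` and `parse_strict` are hypothetical, brackets may nest (as in the template above), and each endpoint identifier is kept as an opaque string rather than interpreted.

```python
from __future__ import annotations

from typing import NamedTuple


class Layer(NamedTuple):
    scheme: str    # e.g. "tcp", "tls", "http"
    endpoint: str  # opaque blob, handed to that protocol unparsed


def parse_strict(url: str) -> tuple[list[Layer], str]:
    """Parse the bracketed form; return (layers, lowest first) and the resource."""
    scheme_part, sep, rest = url.partition("://")
    if not sep:
        raise ValueError("missing '://'")
    schemes = scheme_part.split("+")
    if not all(schemes):
        raise ValueError("empty scheme in chain")

    endpoints = []
    pos = 0
    while rest.startswith("[", pos):
        # Find the ']' that closes this '[' (nested brackets are allowed).
        depth = 0
        for end in range(pos, len(rest)):
            if rest[end] == "[":
                depth += 1
            elif rest[end] == "]":
                depth -= 1
                if depth == 0:
                    break
        else:
            raise ValueError("unbalanced '['")
        endpoints.append(rest[pos + 1:end])
        pos = end + 1
        # Only ':' directly followed by '[' continues the identifier list.
        if not rest.startswith(":[", pos):
            break
        pos += 1

    resource = rest[pos:]
    if resource and not resource.startswith("/"):
        raise ValueError("unexpected text after endpoint identifiers")
    if len(endpoints) != len(schemes):
        raise ValueError("need exactly one endpoint identifier per scheme")

    # Identifiers are written highest layer first, schemes lowest layer first.
    layers = [Layer(s, e) for s, e in zip(schemes, reversed(endpoints))]
    return layers, resource
```

With this sketch, `parse_strict("tcp+http://[example.com]:[80]")` pairs `tcp` with `80` and `http` with `example.com`, while `tcp+http://[example.com]` is rejected because one identifier is missing. Since the brackets delimit each identifier, colons and slashes inside them (IPv6 addresses, socket paths, credentials) need no special handling.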


Examples:

| Stack | Example URI | Comment |
| --------- | ---------------------- | -- |
|TCP => HTTP|`tcp+http://[example.com]:[80]`|HTTP, but explicit|
|UDP => HTTP|`udp+http://[example.com]:[80]`|HTTP via UDP, but explicit (no probing and no fallback to e.g. TCP)|
|TCP => TLS => HTTP|`tcp+tls+http://[example.com]:[example.com]:[443]`|As a side effect, a separate scheme for the TLS-wrapped version of a protocol is no longer necessary; keeping the already specified ones for backward compatibility isn't an issue, though|
|TCP => TLS => HTTP|`tcp+https://[example.com]:[443]`|Same, but with HTTPS instead of "tls+http"|
|UDP => HTTP => TCP|`udp+http+tcp://[48569]:[example.com]:[80]`|Specifies a raw TCP stream that is tunneled through HTTP, which is itself served via a UDP connection|
|IP => TCP => TLS => HTTP|`ip+tcp+tls+http://[example.com]:[example2.com]:[443]:[2001:db8::1]/foo`|An IP connection to 2001:db8::1 is established that carries a TCP connection to port 443, which carries a TLS connection¹. HTTP runs inside it, with `example.com` as the HTTP host name|
|HTTP => HTTP|`http+http://[example.com]:[username:password@example2.com]`|Authenticates against example2.com to use it as an HTTP proxy for connecting to example.com; also avoids the current ambiguity of whether credentials are for the destination or the proxy|
|Unix Socket => HTTP|`socket+http://[example.com]:[/run/foo/bar.sock]/foobar`|Opens a Unix socket at /run/foo/bar.sock and sends example.com as the HTTP host name|
|FILE => HTTP|`file+http://[example.com]:[/run/foo/bar.sock]/foobar`|Same, but using `file` as the scheme for an even more generic approach, since other things that are accessed like files (e.g. a (virtual) serial device) could be used besides Unix sockets|

¹: With an explicitly specified hostname `example2.com` to use for certificate validation. Web browsers should raise an error if this differs from the HTTP host name, and that error should be disableable (in the options, not in the error message itself). That is application behavior and shouldn't be part of the PEG, though: for CLI tools, debugging and development, or web proxies like those universities use for off-campus access to online journals, being able to use differing names is very much desirable.



This extension (or, admittedly, proposal for a new version of the PEG) is my preferred improvement, as it does not break the independence of the different protocols and allows extensibility, debuggability, and clarity (no ambiguity and no security vulnerabilities from parsing "trickery"). But if breaking backward compatibility is not an issue (e.g. because we can detect the "parser spec version" easily), then I'd prefer this alternative:

`LowestLayer[[LowerLayerEndPointIdentifier]]+HigherLayer[[HigherLayerEndpointIdentifier]]+EvenHigherLayer[[username]:[password]@EndpointIdentifier]://resourcePath`

This changes the syntax completely to put each endpoint identifier right after its scheme. It is cleaner and simpler to implement a parser for, but it is a drastic, breaking change to the current syntax and is not used (not even in a similar fashion) by any available implementation I'm aware of.
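
As a rough illustration of why this form is simpler to parse, here is a minimal Python sketch. It rests on several assumptions of mine that the template does not settle: one pair of literal square brackets per layer (e.g. `tcp[443]+tls[example2.com]+http[example.com]://foo`), purely alphanumeric scheme names, and everything after `://` treated as the resource path. The function name `parse_alternative` is hypothetical.

```python
from __future__ import annotations


def parse_alternative(url: str) -> tuple[list[tuple[str, str]], str]:
    """Return ([(scheme, endpoint), ...] lowest layer first, resource path)."""
    layers = []
    pos = 0
    while True:
        start = url.find("[", pos)
        if start == -1:
            raise ValueError("expected '[' after scheme name")
        scheme = url[pos:start]
        if not scheme.isalnum():  # simplification: alphanumeric schemes only
            raise ValueError("invalid scheme name: " + repr(scheme))

        # Find the ']' that closes this '[' (nested brackets are allowed).
        depth = 0
        for end in range(start, len(url)):
            if url[end] == "[":
                depth += 1
            elif url[end] == "]":
                depth -= 1
                if depth == 0:
                    break
        else:
            raise ValueError("unbalanced '['")

        # The endpoint identifier stays an opaque blob for its protocol.
        layers.append((scheme, url[start + 1:end]))
        pos = end + 1

        if url.startswith("+", pos):    # another layer follows
            pos += 1
        elif url.startswith("://", pos):  # end of the layer chain
            return layers, url[pos + 3:]
        else:
            raise ValueError("expected '+' or '://' after ']'")
```

Each scheme and its identifier are read in a single left-to-right pass, so no count check or reverse-order pairing is needed, unlike in the backward-compatible form.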



Received on Monday, 10 July 2023 15:06:11 UTC