Re: [whatwg/url] Addressing HTTP servers over Unix domain sockets (#577)

> @agowa338 So standardizing the "+" delimiter this way would also allow us to do things like "https+ipv6://www.google.com".

The "+" delimiter is *already* a standard, in RFC 3986, Section 3.1, "Scheme", and there is nothing preventing anyone from defining new schemes using the "+" delimiter. It is just that RFC 8820, Section 2.1, "URI Schemes", seems to strongly discourage the introduction of new "schemes". Again, as above:
>[RFC 8820]
>A Specification that defines substructure for URI schemes overall (e.g., a prefix or suffix for URI scheme names) MUST do so by modifying [BCP35] (an exceptional circumstance).

And, I believe the comment there, "an exceptional circumstance", is well founded.

So two RFCs would have to be accepted: first, one that updates BCP35/RFC 7595 to define and include the new scheme, and then one that updates RFC 3986, Section 3.2, "Authority", to define how the "authority" component of the URI would be parsed to make sense of this new scheme.

But, there is no "compelling interest" justifying the definition of a new URI scheme in RFC 7595 that would pass the "Strict Scrutiny" test.  There are other ways to use only the existing URI schemes to convey the desired information, using the RFC 3986 "authority" component.

RFC 3986, Section 3.1, "Scheme", concludes with:
>Individual schemes are not specified by this document.  The process for registration of new URI schemes is defined separately by [BCP35]. The scheme registry maintains the mapping between scheme names and their specifications.  Advice for designers of new URI schemes can be found in [RFC2718].  URI scheme specifications must define their own syntax so that all strings matching their scheme-specific syntax will also match the <absolute-URI> grammar, as described in Section 4.3.
>
>When presented with a URI that violates one or more scheme-specific restrictions, the scheme-specific resolution process should flag the reference as an error rather than ignore the unused parts; doing so reduces the number of equivalent URIs and helps detect abuses of the generic syntax, which might indicate that the URI has been constructed to mislead the user (Section 7.6).

Of course, anyone is welcome to try introducing new "schemes" to the scheme registry, but really, you probably do not want to go there.
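To make the purely syntactic point concrete, here is a minimal Python sketch of the scheme grammar from RFC 3986, Section 3.1 (`scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )`). The helper name `is_valid_scheme` and the candidates other than "https+ipv6" are my own illustrations. Passing this check says nothing about whether a scheme is registered or would ever be accepted for registration.

```python
import re

# RFC 3986, Section 3.1:
#   scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def is_valid_scheme(name: str) -> bool:
    """Return True if name matches the RFC 3986 scheme grammar (syntax only)."""
    return SCHEME_RE.fullmatch(name) is not None


# "https+ipv6" comes from the thread; the other candidates are hypothetical.
for candidate in ("https", "https+ipv6", "+https", "http_unix"):
    print(candidate, is_valid_scheme(candidate))
```

So "https+ipv6" is already well-formed as a scheme name, while a leading "+" or an underscore is not.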

Essentially, RFC 3986 defines a set of Rules for URIs that enable different people to write compatible parsers for any standard URI "scheme". You may want to read through RFC 3986 Section 2.2, "Reserved Characters", to see what sorts of options are available for parsing various components of a URI.  In part, this says:
>[RFC 3986]
>
>      reserved    = gen-delims / sub-delims
>
>      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
>
>      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
>                  / "*" / "+" / "," / ";" / "="
>
>The purpose of reserved characters is to provide a set of delimiting characters that are distinguishable from other data within a URI.

It might be argued that using one of the "sub-delims" to terminate a unix domain socket path might make it easier to distinguish the "authority" component from the resource "path" component, or to designate an entirely new kind of "subcomponent" within the "authority" component, to specify a unix domain "socketpath", as distinct from a "port". But that seemed to me unnecessarily intrusive, as opposed to simply making use of the "gen-delims". In contrast, the historic use of ":" as a delimiter in the URI, and its use again in the textual form of an IPv6 address, has led to the awkward necessity of enclosing an IPv6 address in square brackets when it appears in a URI. Had any of the RFC 3986 "sub-delims" been defined in place of the URI ":" delimiter, the square brackets around an IPv6 address would not be necessary, and IPv6 addresses in URIs would be just a little quicker to type.
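The bracket requirement is easy to see with Python's standard `urllib.parse.urlsplit`. This is only an illustration of the ":" ambiguity, using a documentation-range IPv6 address that I picked as an example; it is not a statement about how any particular browser parses URLs.

```python
from urllib.parse import urlsplit

# With brackets, the host/port boundary is unambiguous.
bracketed = urlsplit("http://[2001:db8::1]:8080/index.html")
print(bracketed.hostname)  # the address, without the brackets
print(bracketed.port)      # the port, as an integer
print(bracketed.path)

# Without brackets, the first ":" is taken as the host/port delimiter,
# so the address is cut apart and the "port" is not a number.
unbracketed = urlsplit("http://2001:db8::1:8080/index.html")
print(unbracketed.hostname)
try:
    print(unbracketed.port)
except ValueError as exc:
    print("invalid port:", exc)
```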

Strictly speaking, as used in RFC 3986, a "delimiter" exists in a kind of "delimiter hierarchy", in which the delimiter must be a specific "allowed character".  There are delimiters for the URI itself:
>[RFC 3986]
>3.  Syntax Components
>The generic URI syntax consists of a hierarchical sequence of components referred to as the scheme, authority, path, query, and fragment.
>
>      URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
>
>      hier-part   = "//" authority path-abempty
>                  / path-absolute
>                  / path-rootless
>                  / path-empty
>

and another set of delimiters for each of those five "components" of the URI.  This is rather awkwardly, and confusingly, described subsequently in that Section 2.2:
>A component's ABNF syntax rule will not use the reserved or gen-delims rule names directly; instead, each syntax rule lists the characters allowed within that component (i.e., not delimiting it), and any of those characters that are also in the reserved set are "reserved" for use as subcomponent delimiters within the component.

Here, by my reading, the  "reserved or gen-delims rule names" and the "reserved set" are simply references to the set of "gen-delims" defined earlier.  And then, the authors make a distinction - badly - between the "allowed delimiting characters" in the rule being expressed and the "allowed component characters" of each non-delimiter "component" or "subcomponent" of that rule.

Presumably, this "backdoor" way of describing an ABNF syntax rule is for the benefit of anyone writing a parser for the URI generally, and for each component and subcomponent specifically.
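As a rough sketch of that two-level "delimiter hierarchy", the following Python first splits a URI into its five components using the regular expression given in RFC 3986, Appendix B, and then splits the "authority" component into its subcomponents. The function names `split_uri` and `split_authority` are mine, and the authority splitter is a simplification that does no validation of the host or port.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri(uri: str) -> dict:
    """First level: split on the URI-wide delimiters ':', '//', '?', '#'."""
    m = URI_RE.match(uri)  # every group is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


def split_authority(authority: str) -> dict:
    """Second level: authority = [ userinfo "@" ] host [ ":" port ]."""
    userinfo, at, hostport = authority.rpartition("@")
    if hostport.startswith("["):
        # Bracketed IP literal: the port delimiter is the ':' after ']'.
        host, _, rest = hostport.partition("]")
        host += "]"
        port = rest[1:] if rest.startswith(":") else None
    else:
        host, colon, port = hostport.partition(":")
        port = port if colon else None
    return {"userinfo": userinfo if at else None, "host": host, "port": port}


parts = split_uri("http://user@example.com:8042/over/there?name=ferret#nose")
print(parts)
print(split_authority(parts["authority"]))
print(split_authority("[2001:db8::1]:80"))
```

The same character, ":", is a delimiter at both levels, which is why the inner parser needs the bracket rule for IP literals.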

And, it's useful to have a general grasp of RFC 5234, *Augmented BNF for Syntax Specifications: ABNF*, https://www.rfc-editor.org/rfc/rfc5234.html, particularly Section 3, "Operators".

You may also want to look at RFC 6874, *Representing IPv6 Zone Identifiers in Address Literals and Uniform Resource Identifiers*, https://www.rfc-editor.org/rfc/rfc6874, for an example of an RFC which modifies and updates RFC 3986.

>@agowa338 But it still doesn't take into account features like SNI...

Server Name Indication is *already* standardized, in RFC 6066, *Transport Layer Security (TLS) Extensions: Extension Definitions*, https://datatracker.ietf.org/doc/html/rfc6066#section-3.

A better example of further extensions to RFC 3986 could be web browser support for other, non-IP address families and their socket protocols. For instance, it would be possible for a web browser to directly access a remote Bluetooth server through the local Bluetooth socket by defining, say, a "bluetooth" domain, prefixed with some "friendly Bluetooth name", as a "host" extension, or by extending the "host" address parser to distinguish a 48-bit colon-separated Bluetooth address from a 128-bit colon-separated IPv6 address, and then extending the "port" option to specify any of the many Bluetooth protocols. But then, Bluetooth already has the "Bluetooth Network Encapsulation Protocol", BNEP, allowing IP to be used instead. Again, just as someone would only choose to access HTTP over a unix socket specifically to avoid using IP, usually for security reasons, someone might choose to access HTTP directly over a Bluetooth socket with the same purpose, and possibly for the same reason. But I don't expect that there is anyone really motivated to do that. Still, there's always the possibility.
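Purely as a thought experiment, distinguishing the two colon-separated forms is not hard. The sketch below assumes a Bluetooth address is written as six two-digit hex octets separated by colons; the function name `classify_host_literal` and the sample addresses are hypothetical. Six groups with no "::" is not a valid IPv6 textual form, so the two do not overlap.

```python
import ipaddress
import re

# Assumed textual form of a 48-bit Bluetooth device address:
# six colon-separated two-digit hex octets, e.g. "00:1A:7D:DA:71:13".
BD_ADDR_RE = re.compile(r"[0-9A-Fa-f]{2}(:[0-9A-Fa-f]{2}){5}")


def classify_host_literal(text: str) -> str:
    """Classify a colon-separated literal as 'bluetooth', 'ipv6', or 'unknown'."""
    if BD_ADDR_RE.fullmatch(text):
        return "bluetooth"
    try:
        ipaddress.IPv6Address(text)
        return "ipv6"
    except ipaddress.AddressValueError:
        return "unknown"


for literal in ("00:1A:7D:DA:71:13", "2001:db8::1", "not-an-address"):
    print(literal, classify_host_literal(literal))
```

Nothing here is defined by any RFC or by the URL Standard; it only shows that the "host" parser extension described above would be mechanically feasible.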

-- 
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/url/issues/577#issuecomment-1185749024

Message ID: <whatwg/url/issues/577/1185749024@github.com>

Received on Friday, 15 July 2022 17:23:39 UTC