Re: [whatwg/url] Addressing HTTP servers over Unix domain sockets (#577)

>Network Working Group
RFC 3986                   URI Generic Syntax               January 2005
3.2.3.  Port  
The port subcomponent of authority is designated by an optional port number in decimal following the host and delimited from it by a single colon (":") character.
>
>      port        = *DIGIT
>
@avakar Thanks for introducing a specific reference.  Of course, we should keep in mind that the point of this conversation is to arrive at a recommendation for a specific *revision* to RFC 3986, https://www.rfc-editor.org/rfc/rfc3986.

It is interesting, though perhaps understandable given the "Network" orientation of the Working Group, that the group failed to consider applying the "URI Generic Syntax" to "AF_UNIX".  For the Linux kernel, we see:

>ADDRESS_FAMILIES(7)
...
DESCRIPTION
The  domain  argument  of the socket(2) specifies a communication domain; this selects the protocol family which will be used for communication.  These families are defined in <sys/socket.h>.  The formats currently understood by the Linux kernel include:
   ...

There are 41 different address families listed there, including AF_INET, AF_INET6, and AF_UNIX/AF_LOCAL.  The list also includes AF_BLUETOOTH, which lacks a standardized URI syntax as well, as far as I know.

In particular, though, RFC 8820, *URI Design and Ownership*, 2020, https://datatracker.ietf.org/doc/html/rfc8820, updates 3986 and addresses the issue of updates to the URI scheme:

>[RFC 8820]
>1.1.  Intended Audience
>This document's guidelines and requirements target the authors of specifications that constrain the syntax or structure of URIs or parts of them.  Two classes of such specifications are called out specifically:
> *  Protocol Extensions ("Extensions") - specifications that offer new
      capabilities that could apply to any identifier or to a large
      subset of possible identifiers, e.g., a new signature mechanism
      for "http" URIs, metadata for any URI, or a new format.
>...
>
>2\. Best Current Practices for Standardizing Structured URIs
This section updates [RFC3986] by advising Specifications how they should define structure and semantics within URIs. Best practices differ, depending on the URI component in question, as described below.
>
>2.1. URI Schemes
Applications and Extensions can require the use of one or more specific URI schemes; for example, it is perfectly acceptable to require that an Application support "http" and "https" URIs. However, Applications ought not preclude the use of other URI schemes in the future, unless they are clearly only usable with the nominated schemes.
>
> A Specification that defines substructure for URI schemes overall (e.g., a prefix or suffix for URI scheme names) MUST do so by modifying [BCP35] (an exceptional circumstance).
>

"BCP35" is also known as RFC 7595, *Guidelines and Registration Procedures for URI Schemes*, https://datatracker.ietf.org/doc/html/rfc7595.

>[RFC 8820 continued]
>2.2. URI Authorities
Scheme definitions define the presence, format, and semantics of an authority component in URIs; all other Specifications MUST NOT constrain or define the structure or the semantics for URI authorities, unless they update the scheme registration itself or the structures it relies upon (e.g., DNS name syntax, as defined in Section 3.5 of [RFC1034]).
>
>For example, an Extension or Application cannot say that the "foo" prefix in "https://foo_app.example.com" is meaningful or triggers special handling in URIs, unless they update either the "http" URI scheme or the DNS hostname syntax.
>
>Applications can nominate or constrain the port they use, when applicable. For example, BarApp could run over port nnnn (provided that it is properly registered).
...

As a reminder, Section 3.2 of RFC 3986, https://www.rfc-editor.org/rfc/rfc3986#section-3.2, defines the "URI Authority", referred to there in Section 2.2 of RFC 8820:
>[RFC 3986]
>3.2.  Authority
>...
The authority component is preceded by a double slash ("//") and is terminated by the next slash ("/"), question mark ("?"), or number sign ("#") character, or by the end of the URI.
>
>     authority   = [ userinfo "@" ] host [ ":" port ]
>
>URI producers and normalizers should omit the ":" delimiter that separates host from port if the port component is empty.  Some schemes do not allow the userinfo and/or port subcomponents.
>
>If a URI contains an authority component, then the path component must either be empty or begin with a slash ("/") character.  Non-validating parsers (those that merely separate a URI reference into its major components) will often ignore the subcomponent structure of authority, treating it as an opaque string from the double-slash to the first terminating delimiter, until such time as the URI is dereferenced.

We can see, then, that defining "a prefix or suffix for URI scheme names", which means modifying BCP35/RFC 7595, will be much less desirable than only modifying RFC 3986 itself.  Something to keep in mind.

On scheme names, specifically:
>[RFC 3986]
>3.1.  Scheme
>Each URI begins with a scheme name that refers to a specification for assigning identifiers within that scheme.  As such, the URI syntax is a federated and extensible naming system wherein each scheme's specification may further restrict the syntax and semantics of identifiers using that scheme.
>
>Scheme names consist of a sequence of characters beginning with a letter and followed by any combination of letters, digits, plus ("+"), period ("."), or hyphen ("-").  ...

So, while "http+unix:" is a valid RFC 3986 "scheme", the desired protocol there is simply "http", and there is no reason to define an entire new group of RFC 7595 "schemes" which have nothing to do with the protocol itself.  That would entail not just "http+unix:", but also "https+unix:", "ftp+unix:", "smtp+unix:", "submissions+unix:", etc., etc., and on and on.
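
The scheme grammar quoted above is small enough to check mechanically.  The following is only a minimal Python sketch; the helper name `is_valid_scheme` is made up for illustration and does not come from any RFC or library:

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )   (RFC 3986, Section 3.1)
_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def is_valid_scheme(name: str) -> bool:
    """Return True if `name` matches the RFC 3986 scheme grammar."""
    return _SCHEME.fullmatch(name) is not None


# "http+unix" is syntactically fine; the objection above is about
# multiplying registered schemes, not about syntax.
for candidate in ("http", "http+unix", "https+unix", "unix_http", "+unix"):
    print(candidate, is_valid_scheme(candidate))
```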

> @agowa338 ":/" would be a valid filename on jfs...

"/.../ht.socket:/path/to/resource.html" would be a valid filename in many file systems.  But note that RFC 3986 prohibits use of the ":", "as the first segment of a relative-path reference":

>4.1.  URI Reference
...
A URI-reference is either a URI or a relative reference.  If the URI-reference's prefix does not match the syntax of a scheme followed by its colon separator, then the URI-reference is a relative reference.
...
>4.2.  Relative Reference
...
A relative reference that begins with two slash characters is termed a network-path reference; such references are rarely used.  A relative reference that begins with a single slash character is termed an absolute-path reference.  A relative reference that does not begin with a slash character is termed a relative-path reference.
>
>A path segment that contains a colon character (e.g., "this:that") cannot be used as the first segment of a relative-path reference, as it would be mistaken for a scheme name.  Such a segment must be preceded by a dot-segment (e.g., "./this:that") to make a relative-path reference.

Generally, though, we are all talking about modifying RFC 3986, Section 3.2, "Authority" - preferably in the least intrusive or disruptive manner.  We may note that RFC 3986, Section 3.2.2, "Host", already requires that "The host subcomponent of authority is identified by an IP literal encapsulated within square brackets, ...".  As a courtesy to existing URI parsers, we should avoid a *completely different* use of square brackets, specifically, to surround a unix domain socket path, where an "IP-literal" should normally be expected, as described in Section 3.2.2, "Host".

In particular, we are talking about modifying RFC 3986, Section 3.2.3, "Port".  Again, remember that the "port" element of the "authority" component of the URI *already* makes use of the ":" delimiter:
`authority   = [ userinfo "@" ] host [ ":" port ]`
And remember, those square brackets there are not literal.  They are just indicating optional elements of the "authority".

We note that the ":" is already also used as a delimiter in the userinfo element of the authority:
`userinfo    = *( unreserved / pct-encoded / sub-delims / ":" )`

Section 3.2.1, "User Information", goes on to explain why nothing should follow the ":" and precede the "@" in the userinfo element: use of the "user:password" format is deprecated.

And we note the general description of the "authority" component from RFC 3986, Section 3.2:
>The authority component is preceded by a double slash ("//") and is terminated by the next slash ("/"), question mark ("?"), or number sign ("#") character, or by the end of the URI.

So rather than introducing some new type of delimiter to the "authority" component of the URI, Section 3.2 would instead be modified to say "The authority component ... is terminated by the next slash ("/") not preceded by a colon (":"), ...", and Section 3.2.3 would be modified to allow for a unix domain socket path, which is simply a path terminated by a ":", as:
```
port        = *DIGIT / socketpath
socketpath  = path ":"
```

"path" itself is defined in the subsequent Section 3.3, "Path".  And, remember the context here.  This is just the optional "port" element of the "authority", which must be preceded by a ":".  So we may say that "The port subcomponent of authority is identified by a path encapsulated within colons." This is simply `:path:`.  URI parsers already are *required* to know how to parse a URI "path".  Here, the parser just has to learn how to distinguish a path encapsulated within colons.

It is true that this approach would prohibit a socket path having a directory or file name ending with ":", but, as we have seen in Section 4.2, the path segment *already* has a limitation with respect to use of the ":".
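
As a rough illustration of the `:path:` idea, a parser might proceed as in the Python sketch below.  This is one possible reading, not a definitive implementation: it assumes that the socket path ends at the first ":" immediately followed by "/", "?", "#", or the end of the URI.  The function name `split_proposed` and the example paths are hypothetical, and userinfo and bracketed IP literals are left out for brevity:

```python
import re


def split_proposed(uri):
    """Split a URI under one reading of the proposed ':path:' port syntax.

    Assumption: the socket path ends at the first ":" that is immediately
    followed by "/", "?", "#", or the end of the URI.
    """
    m = re.match(r"([A-Za-z][A-Za-z0-9+.-]*)://([^:/?#]*)(.*)$", uri)
    if m is None:
        raise ValueError("not a URI with an authority")
    scheme, host, rest = m.groups()
    port = socket_path = None
    if rest.startswith(":/"):
        # Proposed form: host ":" socketpath, where socketpath = path ":"
        m2 = re.match(r":(/.*?):(?=[/?#]|$)(.*)$", rest)
        if m2 is None:
            raise ValueError("socket path is not terminated by ':'")
        socket_path, rest = m2.groups()
    elif rest.startswith(":"):
        # Existing form: host ":" *DIGIT
        port, rest = re.match(r":([0-9]*)(.*)$", rest).groups()
    if rest and rest[0] not in "/?#":
        raise ValueError(f"unexpected text after authority: {rest!r}")
    return {"scheme": scheme, "host": host, "port": port,
            "socket_path": socket_path, "rest": rest}


print(split_proposed("http://unix:/run/ht.socket:/path/to/resource.html"))
print(split_proposed("http://example.com:8080/index.html"))
# The limitation noted above: a directory name ending in ":" cuts the
# socket path short.
print(split_proposed("http://unix:/run/dir:/ht.socket:/index.html"))
```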

The only other issue to be addressed is with respect to the special "host" component of the "authority".  Guidance is provided in:
>[RFC 3986]
>1.1.  Overview of URIs
>...
>This specification does not place any limits on the nature of a resource, the reasons why an application might seek to refer to a resource, or the kinds of systems that might use URIs for the sake of identifying resources.  This specification does not require that a URI persists in identifying the same resource over time, though that is a common goal of all URI schemes.  Nevertheless, nothing in this specification prevents an application from limiting itself to particular types of resources, or to a subset of URIs that maintains characteristics desired by that application.

and also, in:
>[RFC 3986]
>3.2.2.  Host
The host subcomponent of authority is identified by an IP literal encapsulated within square brackets, an IPv4 address in dotted-decimal form, or a registered name.
...
>
>      host        = IP-literal / IPv4address / reg-name
>
Since we are specifically addressing http, and since web servers (and mail servers too, for that matter) already know how to serve from unix domain sockets, the practical issue here only has to do with web browsers properly handling a URI in the unix domain.  Web browsers are specific applications which, consistent with RFC 3986, are free to address a specific type of URI.  In this case, we would like that to include any URI in the "unix domain".  Since the unix domain is a local socket family, and not a network socket family, "resolving" it must be the responsibility of the browser application itself, and not something dependent upon the resolver libraries or the DNS protocol.  That's the whole point of this exercise: "local" IPC, not "network".  Thus, the web browser itself must recognize a "URI authority" having the "host registered name" *unix*, and then simply open a local unix socket, rather than annoyingly complaining about failing to find a network domain named "unix" (ERR_NAME_NOT_RESOLVED).
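
For what "simply open a local unix socket" can look like on the client side, here is a minimal Python sketch built on the standard `http.client` module.  The class name `UnixHTTPConnection` is hypothetical, and the "localhost" value used for the Host header is an arbitrary choice; what the Host header should carry is not settled anywhere in this thread:

```python
import http.client
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """Speak HTTP over an AF_UNIX stream socket instead of TCP."""

    def __init__(self, socket_path, timeout=10.0):
        # "localhost" only fills the Host header; no name lookup takes
        # place, because connect() is overridden below.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


# Usage, assuming some HTTP server is listening on the given socket path:
#
#     conn = UnixHTTPConnection("/run/ht.socket")
#     conn.request("GET", "/path/to/resource.html")
#     print(conn.getresponse().status)
```

No resolver library or DNS lookup is involved at any point, which is the behaviour argued for above.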

Thus, the ABNF rule for "host" would also be modified to allow for the "unix" domain, keeping in mind the "first-match-wins" algorithm:
` host        = IP-literal / IPv4address / "unix" / reg-name`
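
The ordering matters because "unix" already matches reg-name today.  A small Python sketch of first-match-wins classification follows; the function `classify_host` is hypothetical, and the contents of an IP-literal are not validated here:

```python
import re

# dec-octet and IPv4address as in RFC 3986, Section 3.2.2.
_DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"
_IPV4 = re.compile(rf"{_DEC_OCTET}(?:\.{_DEC_OCTET}){{3}}")


def classify_host(host):
    """Return the first alternative of the modified host rule that matches."""
    if host.startswith("[") and host.endswith("]"):
        return "IP-literal"      # bracket contents not checked in this sketch
    if _IPV4.fullmatch(host):
        return "IPv4address"
    if host.lower() == "unix":   # host is case-insensitive in RFC 3986
        return "unix"
    return "reg-name"            # simplified: anything else falls through


for h in ("[::1]", "192.0.2.1", "unix", "unix.example.com"):
    print(h, classify_host(h))
```

Only the exact host "unix" takes the new branch; a name such as "unix.example.com" still falls through to reg-name and ordinary resolution.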

Otherwise, it is best not to "reinvent the wheel", and necessitate complicated - pointlessly complicated - parsing schemes.


Received on Friday, 15 July 2022 09:12:37 UTC