Re: [whatwg/url] Addressing HTTP servers over Unix domain sockets (#577)

Given the ambiguity in addressing unix domain sockets, I am still inclined to fault the basic RFC 3986.  So, here is a brief review, several rants, and another suggestion for unix domain socket addressing, simply using the square bracket "hack".

Assuming the general concept of "Uniform Resource Identifier" from Section 1.1.3., the basic structure is defined in Section 3 as having 5 components: scheme, authority, path, query, and fragment.  First off, then, what type of URI component is a unix domain socket (UDS) address?

The original context here is "HTTP servers", and "http" is, itself, a type of "scheme".  So, UDS as "scheme" is not my first choice.

Now, RFC 3986 uses the term "resource" without much constraint, saying 'This specification does not limit the scope of what might be a resource; rather, the term "resource" is used in a general sense for whatever might be identified by a URI.'  Effectively, a "resource" is whatever the user wants it to be.  Is a UDS a "resource" itself?  For the purpose here, "no".  The "resource" implied by an HTTP server is some other specific data delivered using HTTP.

Then, is a UDS a type of "path", "query", or "fragment"?

From Section 3.3, "The path component contains data, usually organized in hierarchical form, that, along with data in the non-hierarchical query component (Section 3.4), serves to identify a resource within the scope of the URI's scheme and naming authority (if any)."  Since the UDS is *not* the "resource", and, since the "path" *identifies* a "resource", then the UDS cannot be a "path".

Similarly, from Sections 3.4. Query and 3.5. Fragment, both of these components are also references to the "resource".  So the UDS is neither a "query" nor a "fragment".

And that leads to the inference that the UDS must be a kind of "authority".  RFC 3986 actually subdivides the "authority" component itself into three parts, in Section 3.2.:
```
 authority   = [ userinfo "@" ] host [ ":" port ]
```
And here, the same analysis can be applied.  Is the UDS a type of "userinfo"?  Section 3.2.1. says, "The userinfo subcomponent may consist of a user name and, optionally, scheme-specific information about how to gain authorization to access the resource."  Hmm - "scheme-specific information about how to gain authorization to access the resource" - "how to gain authorization".  Does the UDS tell "how to gain authorization"?  Sort of - maybe - not really - I'd say "no".

Is the UDS a type of "host"?  From Section 3.2.2., "The host subcomponent of authority is identified by an IP literal encapsulated within square brackets, an IPv4 address in dotted-decimal form, or a registered name."  Is, then, the UDS a type of "IP literal", "IPv4 address", or a "registered name"?  Hmm - what is an "IP literal"?  Again, from Section 3.2.2.:
```
 IP-literal = "[" ( IPv6address / IPvFuture  ) "]"
```
Since a UDS is not any of an "IPv6address / IPvFuture", an "IPv4 address", or a "registered name", then "no", a UDS is also *not* any type of "host".

And then, using RFC 3986, there is only one interpretation remaining.  Is the UDS a type of "port"?  From Section 3.2.3. Port:
```
 The port subcomponent of authority is designated by an optional port number in decimal following the
 host and delimited from it by a single colon (":") character.

  port        = *DIGIT
```
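As a quick check of the `port = *DIGIT` rule quoted above, here is a minimal Python sketch.  The helper name `is_rfc3986_port` is made up for illustration and is not part of any library.

```python
import re

# RFC 3986, Section 3.2.3: port = *DIGIT, i.e. zero or more ASCII digits.
PORT = re.compile(r"[0-9]*")

def is_rfc3986_port(text: str) -> bool:
    """Return True if text satisfies the RFC 3986 port rule (hypothetical helper)."""
    return PORT.fullmatch(text) is not None

print(is_rfc3986_port("8080"))             # a conventional numeric port
print(is_rfc3986_port(""))                 # the empty port is allowed by *DIGIT
print(is_rfc3986_port("/path/to/socket"))  # a filesystem pathname is not
```

Note that the rule accepts an empty port, so a trailing colon with nothing after it is still syntactically valid.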
Well, clearly, and as has been mentioned previously in this discussion, the UDS is not a "DIGIT".  And here is where I find fault with RFC 3986, in its limited scope when defining "port".  Except that, Section 3.2.3. goes on to say, "The type of port designated by the port number (e.g., TCP, UDP, SCTP) is defined by the URI scheme."  And that statement suggests asking "What sort of Communication Protocol is UDS?"  Of course a UDS is not itself a kind of communication protocol, but the relationship should become apparent.  It may be more illuminating to ask the converse, "What sort of Sockets are TCP, UDP, and SCTP?"  And then, the Unix - in this case Linux - man pages offer some guidance.
```
 man 7 tcp:     tcp_socket = socket(AF_INET, SOCK_STREAM, 0);
 man 7 udp:     udp_socket = socket(AF_INET, SOCK_DGRAM, 0);
 man 7 sctp:    sctp_socket = socket(PF_INET, SOCK_STREAM, IPPROTO_SCTP);
                sctp_socket = socket(PF_INET, SOCK_SEQPACKET, IPPROTO_SCTP);
```
And generally, "What is a 'socket'"?  In part:
```
 man 2 socket:
        Name            Purpose                         Man page
        AF_UNIX         Local communication             unix(7)
        AF_LOCAL        Synonym for AF_UNIX
        AF_INET         IPv4 Internet protocols         ip(7)

 HISTORY
        The  manifest  constants  used under 4.x BSD for protocol families are PF_UNIX, PF_INET, and so
        on, while AF_UNIX, AF_INET, and so on are used for address families.  However, already the BSD
        man page promises: "The protocol family generally is the same as the address family", and
        subsequent standards use  AF_*  everywhere.
```
and then:
```
 man 7 unix:    unix_socket = socket(AF_UNIX, type, 0);
```
Here is my first rant about RFC 3986.  The "port" component of the defined URI has *presumed* an Address Family, here implying AF_INET *exclusively*, along with what is a merely incidental association with a port "number".  There is no explanation or justification given for this presumption.

Alternatively, it might be supposed that this presumption of an Address Family is an erroneous interpretation by the reader of RFC 3986.  It may instead be supposed that the "port" component of the URI is simply a *general* concept to be associated with *any* Address Family which might be included from the list given from man(2)socket.

And so, I believe that this is the interpretation that must be taken with RFC 3986, even though it is not yet "official".

Then, "What is the *'port' subcomponent of authority* of an Address Family AF_UNIX socket?"

Here, man(7)unix tells us, "Traditionally, UNIX domain sockets can  be  either unnamed, or bound to a filesystem pathname (marked as being of type socket)."  In our case, we are looking for a URI, so "unnamed" is not useful.  Instead, the man page offers "a filesystem pathname".  That seems clear enough.

Therefore, an RFC 3986 URI "port" for an AF_UNIX socket might also be interpreted as simply "a filesystem pathname", instead of exclusively as a number.
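To make the comparison concrete, here is a small Python sketch, assuming a POSIX system where the `socket` module exposes `AF_UNIX`.  The socket file name `demo.sock` and the use of a temporary directory are arbitrary choices for illustration.  It only shows what each Address Family uses as its bound address.

```python
import os
import socket
import tempfile

# AF_UNIX: the bound address is a filesystem pathname.
with tempfile.TemporaryDirectory() as tmp:
    sock_path = os.path.join(tmp, "demo.sock")  # arbitrary name for this sketch
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as uds:
        uds.bind(sock_path)
        uds_address = uds.getsockname()

# AF_INET: the bound address is a (host, port number) pair.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as tcp:
    tcp.bind(("127.0.0.1", 0))  # port 0 asks the kernel to pick a free port
    tcp_host, tcp_port = tcp.getsockname()

print("AF_UNIX address:", uds_address)
print("AF_INET address:", (tcp_host, tcp_port))
```

In both cases the program passes one "address" to `bind()`; only the Address Family decides whether that address contains a number or a pathname.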

Allowing that, then the remaining problem only involves appropriate delimiters, to allow correctly parsing the resulting URI for the AF_UNIX "port".

Referring again to Section 2.2.:
```
      reserved    = gen-delims / sub-delims

      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="
```
Incidentally, it may be noted that this RFC 3986 list of delimiters is missing the percent "%", from Section 2.1 Percent-Encoding, and the set of White Space characters generally.  The reader is now well into the realm of "inferring",  "guessing", and "interpreting", instead of specifically "defining".

Here is my second rant about RFC 3986, related to the use of delimiters.  The URI syntax in Section 3. explicitly defines the ":" as separating the "scheme" from the "authority".  Subsequently, in Section 3.2., it says 'The authority component is preceded by a double slash ("//") and is terminated by the next slash ("/"), question mark ("?"), or number sign ("#") character, or by the end of the URI.' Taken together, this double slash actually provides *no information* whatsoever in the URI and only serves to "poison" the parsing of the URI, by requiring the parser to distinguish potentially between "<scheme>:///...", "<scheme>://...", and "<scheme>:/...".  For instance, the "file" scheme, RFC 8089, supports optionally leaving out this useless "//" altogether.  RFC 3986 offers no explanation or justification for this use of the double slash "//".  The delimiter might as well have been defined explicitly as "://".  This makes any use of the slash "/" as a delimiter in the URI potentially problematic, since the slash is also an essential component of any unix "filesystem pathname", both when referring to the proposed UDS AF_UNIX "port" and, already, when referring to an actual "resource" by pathname.
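The slash problem can be seen with the component-splitting regular expression given in RFC 3986, Appendix B.  The following Python sketch applies it to a made-up URI in which a bare socket path follows the port colon; the function name `split_components` and the example URIs are assumptions for illustration only.

```python
import re

# Regular expression from RFC 3986, Appendix B, for splitting a URI reference.
APPENDIX_B = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_components(uri):
    """Return scheme, authority, and path as the Appendix B regex sees them."""
    m = APPENDIX_B.match(uri)
    return {"scheme": m.group(2), "authority": m.group(4), "path": m.group(5)}

# Hypothetical URI: an unbracketed socket path directly after the port colon.
naive = split_components("http://localhost:/path/to/socket/path/to/resource.html")
print(naive)

# The three slash variants a parser has to tell apart.
for uri in ("x:/a", "x://a", "x:///a"):
    print(uri, split_components(uri))
```

The authority ends at the first slash, so the socket path and the resource path run together in the "path" component and cannot be separated again without some extra delimiter.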

A third rant regards Section 3.2.2 Host, which says:
```
 A host identified by an Internet Protocol literal address, version 6 [RFC3513] or later, is
 distinguished by enclosing the IP literal within square brackets ("[" and "]").  This is the only place
 where square bracket characters are allowed in the URI syntax.
```
The *only* reason that these square brackets are needed is because of the repeated and overloaded use of the colon ":" as a delimiter in the "authority", in Section 3.2 preceding the "port", and in Section 3.2.1, potentially subdividing the "userinfo".  Considering that RFC 3513 defines the use of colon ":" as the field delimiter in an IPv6 address, this should have glaringly suggested that the same ":" would be a *bad* choice for a delimiter in the RFC 3986 "authority" component and subcomponents of the URI.  And there are *plenty* of alternative characters to choose, from the small ASCII character set, for use as delimiters in the "authority".

The use of the square brackets, then, is a "hack", a consequence of a bad choice for delimiter in the "authority" component of the URI.  Be that as it may, suppose that the prohibition "This is the only place where square bracket characters are allowed in the URI syntax", is ignored.  Then, this same "hack" can be applied equally to the unfortunate choice of the slash "/" as a delimiter within the URI syntax with respect to the "port" subcomponent of the "authority", as with the "host" subcomponent.

I propose now another alternative to addressing unix domain sockets.  By example, using the square bracket "hack", the result would allow, for instance, all of:
```
http://:[/path/to/socket]/path/to/resource.html?...#...
http://localhost:[/path/to/socket]/path/to/resource.html...
http://[::1]:[/path/to/socket]/path/to/resource.html...
http://user:password@[::1]:[/path/to/socket]/path/to/resource.html...
http://unix:[/path/to/socket]/path/to/resource.html...
```
All of these examples otherwise strictly follow the RFC 3986 URI syntax.
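As a rough illustration of how a parser might recognize this form, here is a Python sketch.  The pattern and the names `PROPOSED_URI` and `parse_proposed` are hypothetical and are not taken from RFC 3986 or from any existing URL library; the pattern is deliberately loose and does no validation of the host or userinfo beyond finding the delimiters.

```python
import re

# Hypothetical pattern for the bracketed-port proposal; NOT part of RFC 3986.
PROPOSED_URI = re.compile(
    r"^(?P<scheme>[A-Za-z][A-Za-z0-9+.-]*)://"
    r"(?:(?P<userinfo>[^@/?#\[\]]*)@)?"                    # optional userinfo "@"
    r"(?P<host>\[[^\]]*\]|[^:/?#\[\]]*)"                   # IP literal or plain host
    r"(?::(?:\[(?P<socket>/[^\]]*)\]|(?P<port>[0-9]*)))?"  # ":[/path]" or ":digits"
    r"(?P<rest>[/?#].*)?$"                                 # path, query, fragment
)

def parse_proposed(uri):
    """Split a URI of the sketched form into its named parts."""
    m = PROPOSED_URI.match(uri)
    if m is None:
        raise ValueError(f"not a URI of the sketched form: {uri!r}")
    return m.groupdict()

for uri in (
    "http://:[/path/to/socket]/path/to/resource.html",
    "http://localhost:[/path/to/socket]/path/to/resource.html",
    "http://user:password@[::1]:[/path/to/socket]/path/to/resource.html",
    "http://example.com:8080/index.html",
):
    print(parse_proposed(uri))
```

Because the socket path is enclosed in brackets, its slashes no longer terminate the authority, and an ordinary numeric port still parses as before.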

That is the least intrusive "hack" to UDS addressing and merely extends an existing URI "hack".  A "cleaner" revision to RFC 3986 would be to eliminate the use of either the colon ":" or the slash "/" as delimiters in the URI syntax delineating its components and subcomponents.  There are 11 other "sub-delims" defined in RFC 3986 that seem perfectly usable as delimiters in the URI "authority", which would obviate the need for using these square bracket "hacks" completely.

With reference to previous remarks about security issues, it may be noted that man(7)unix describes AF_UNIX as supporting communication "between processes on the same machine", so there would be no "remote access" possible, despite the http/https "scheme", if that constraint were followed.  And, since the UDS "port" is just a Unix "filesystem pathname", there are many existing security measures available.

On the other hand, this suggested UDS AF_UNIX "port" addressing clearly does lend itself to replacing "localhost" with "some-remote-host", to access some UDS on, literally, a remote host.  But then, any http/https "server" will be providing its own security measures, should it allow UDS addressing at all, so that's a different issue and not really a problem here. This does introduce another concept, access to a UDS by a local http/https *server*, as opposed to UDS access only by a local html display *client*.

There is still the question of whether the http/https schemes would need to be formally updated to acknowledge any kind of UDS AF_UNIX "port" addressing.  Reading RFC 9110, Sections 4.2.1. http URI Scheme and 4.2.2. https URI Scheme:
```
        The origin server for an "http[/https]" URI is identified by the authority component, which
        includes a host identifier ([URI], Section 3.2.2) and optional port number ([URI], Section
        3.2.3).
```
By my reading, "no".  The http/https schemes simply refer to the RFC 3986 URI "optional port number" definition, and would therefore follow any update to RFC 3986 itself.

The much more difficult issue remains with any html display client, which must be taught to recognize *any* kind of UDS AF_UNIX "port" addressing.  Again, strictly, that is a separate issue.  But this does point out that the proposal here implies that there are two distinct "solution" arenas to confront: first, RFC 3986 itself, and second, the various de facto standard html display clients extant.

The Node.js security issue mentioned by @randomstuff is - well - a Node.js security issue, as was mentioned.  It's not a server security issue and has nothing to do with UDS AF_UNIX "port" addressing per se.  Of course, that also doesn't mean that html display client security issues go away.  It's just a *separate* problem - though, it's still a problem.  It is interesting that this raises the question of security in the "reverse" direction, from a *remote* "server" potentially accessing a *local* "client resource", through a UDS.

That is not something inherent in the original concept of http client/server communication, but a consequence of allowing the "client" to potentially act, itself, as a kind of "server", using some client facility, as with javascript, to access a local resource.  The security model, then, requires simply that the client be smart enough not to do "anything stupid" at the behest of the server.  Ha!

-- 
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/url/issues/577#issuecomment-1850865447

Message ID: <whatwg/url/issues/577/1850865447@github.com>

Received on Monday, 11 December 2023 20:49:55 UTC