Opaque right hand sides (was: Re: revised "generic syntax" internet draft)

Edward Cherlin wrote:
>...
> If I am going to create an ftp: site, and I don't check what version of
> what ftp server I'm using, I'm a fool, and likewise for gopher: and telnet:
> and the others. If I put out an https: URL and I don't have a secure server
> to receive it, I'm a fool. If I intend to accept encoded UTF-8, I need to
> find out how my server can deal with it. If I don't intend to accept it, I
> can regard encoded UTF-8 in URLs as plain ASCII, without breaking any
> process that is not already broken.
>...

I think, with the help of the above, I've finally figured 
out what is going on here and why we have a seemingly
insurmountable communications disconnect.  In case others 
have been as confused as I have been and in the hope that 
this might help, let's step away from character sets for a 
moment and look at a broader question.

With email, we've made a very careful distinction between 
the "local-part" and the "domain-name".    The latter has 
to be resolvable by the DNS and must obey its rules, 
whatever they are.   The local part is defined as opaque to 
everything but the target system -- the one named in the 
"domain-name" (or indirected via an MX record, which has no 
analogy here).   There are some very low level syntax rules 
to which it must conform --e.g., seven bits and quoting if 
certain classes of characters appear-- but the 
oft-repeated, and very important, rule is that _nothing_ 
besides a delivery host gets to interpret or revise the 
local part.  So, for example, sometimes a percent sign 
denotes routing, and sometimes it is just part of an 
address, and, in principle, sometimes it might introduce 
encoding of something that, by prior agreement, sending and 
receiving MUAs (but not anything in the intermediate 
transport system) might construe as encoding for non-ASCII 
characters.  As long as the "don't mess with the local 
part" rules are strictly observed while the message is in 
the transport system, everything works fine.
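
As a minimal sketch of that division of labour (Python; the function
name split_address and the sample address are illustrative
assumptions, not anything defined by the mail standards), a
transport-level component would split at the last "@", normalize only
the domain, and pass the local part through exactly as received:

```python
def split_address(addr):
    """Split a mail address into (local_part, domain).

    Only the domain is touched (lowercased, since DNS names are
    case-insensitive). The local part is returned exactly as given:
    a "%" in it might mean routing, a literal character, or a
    privately agreed encoding, and only the delivery host can tell.
    """
    local, sep, domain = addr.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("not a local-part@domain address: %r" % addr)
    return local, domain.lower()


print(split_address("User%relay.example@Example.ORG"))
# ('User%relay.example', 'example.org')
```

Note that the sketch never inspects the "%" or the case of the local
part; that restraint is the whole rule.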

That is, more or less, the position I think Edward and 
others are taking -- we can safely treat all of the URL 
that follows the domain name as opaque and as something 
that will be interpreted, like the local-part, only by 
systems that --by prearrangement or good sense-- will know 
how to interpret it correctly.

FTP, for example, is actually pretty similar: the form and 
syntax of file names is that of the server and it is the 
responsibility of the client to figure out, out of band if 
necessary, what form the server uses and to adapt to it.  
The protocol was carefully designed so that the arguments 
to, e.g., RETR and STOR, could be treated as completely 
opaque.

Fortunately or unfortunately, URLs haven't been defined as
      <protocol>://<domain>/<opaque-part>
but, instead, with considerable syntax and semantics 
attached to the RHS (after the domain-part -- I am 
deliberately not using standard URI terminology here). 
Presumably that was done for a reason and lots of things 
interpret, slice, and dice the RHS -- things that may not 
be aware of the shared client/server conventions.  If that 
is really the case, then the issue about interpretation of 
the "characters" in the URL is not limited to the ultimate 
client and server but involves all intermediate systems, 
caches,  publishers of URLs, etc. -- anything that might 
take advantage of knowledge of the syntax to do something 
with, or to, the URL's structure.   Now it may be that 
unusual codings will still survive that process: the point 
is that we need to look at those "third-party" systems, not 
just the endpoints.
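
To make the third-party concern concrete, here is a hedged sketch in
Python.  The two forwarding functions and the example URL are
hypothetical; the assumption is an intermediary that "helpfully"
decodes and re-encodes the RHS while guessing Latin-1, when client and
server had privately agreed on UTF-8:

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def forward_opaque(url):
    """An intermediary that treats the RHS as opaque: no change."""
    return url


def forward_reencoding(url, assumed_charset="latin-1"):
    """An intermediary that decodes the path under its own charset
    assumption and then re-encodes it as UTF-8."""
    parts = urlsplit(url)
    decoded = unquote(parts.path, encoding=assumed_charset)
    recoded = quote(decoded, safe="/", encoding="utf-8")
    return urlunsplit(parts._replace(path=recoded))


original = "http://example.org/caf%C3%A9"  # endpoints agreed on UTF-8
print(forward_opaque(original))      # http://example.org/caf%C3%A9
print(forward_reencoding(original))  # http://example.org/caf%C3%83%C2%A9
```

The endpoints understood each other perfectly; the URL was damaged by
a system in the middle that knew the syntax but not the convention.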

However, we have systems out there that depend on 
digitally-signed URLs and similar things.   The fact that 
they work may imply that intermediate systems aren't, in 
fact, dependent on the ability to disassemble a URL.   Or 
it might prove nothing at all -- most, if not all, of those 
signed URLs are associated with objects that can't be 
cached.
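
A small sketch of why signed URLs are a useful probe (Python; an HMAC
stands in for whatever signature scheme a real system uses, and the
key, helper names, and URLs are all made up for illustration).  The
signature covers the exact character string, so any intermediary that
rewrites the RHS, even in a way a URL normalizer would consider
harmless, makes verification fail:

```python
import hashlib
import hmac

KEY = b"demo-key-not-a-real-secret"  # placeholder key for the sketch


def sign(url, key=KEY):
    """Sign the URL exactly as written, character for character."""
    return hmac.new(key, url.encode("ascii"), hashlib.sha256).hexdigest()


def verify(url, signature, key=KEY):
    """Check a signature against the URL as it arrived."""
    return hmac.compare_digest(sign(url, key), signature)


published = "http://example.org/docs/a%7Eb"
signature = sign(published)

# A hypothetical intermediary decodes %7E to "~" along the way.
rewritten = "http://example.org/docs/a~b"

print(verify(published, signature))  # True
print(verify(rewritten, signature))  # False
```

If such signatures survive in practice, either nothing in the path
rewrites the string or, as noted above, those URLs never pass through
the systems that would.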

In any event, I think the key question turns into whether, 
at this stage, it is plausible to agree to drastically 
revise the URL/URI definitions to define the RHS as opaque. 
It would make a number of other things easier -- there are 
proposals for URIs floating around that are made very 
clumsy because they have to conform to the current rules 
about the RHS.   And it would dramatically shorten the 
generic syntax document.   Is that an idea people are 
willing to discuss?  Could it --and the notion of a 

   <protocol>://<domain>/<not-very-constrained-string>

definition-- be deployed without doing violence to the 
installed base?

I believe that the current discussion about whether it is 
sufficient that clients and servers understand each other 
becomes meaningful only if the answer to both of those 
questions is "yes".

     john

Received on Thursday, 24 April 1997 08:28:36 UTC