URL syntax: Protocol autonomy

Martin J. Duerst (mduerst@ifi.unizh.ch)
Fri, 20 Dec 1996 17:32:58 +0100 (MET)

Date: Fri, 20 Dec 1996 17:32:58 +0100 (MET)
From: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
To: uri@bunyip.com
Subject: URL syntax: Protocol autonomy
Message-Id: <Pine.SUN.3.95.961220163940.245M-100000@enoshima>

The next issue I would like to address regarding the URL syntax
draft is protocol autonomy. It may also be called scheme
autonomy or mechanism autonomy. By this I mean the question
of whether protocols/schemes/mechanisms can define the form
of their URLs however they want, or whether they have to
follow certain restrictions.

It is clear that all URLs have to meet the syntactic
restrictions, i.e. <scheme>:<scheme-specific-part> for
opaque URLs and a few more things for generic URLs.
That's what the syntax draft is here for :-).

However, the draft currently contains language that imposes
more restrictions. These restrictions are already broken by
existing URLs, are not applicable to all URLs, may restrict
the creation of useful URLs in the future, are in conflict
(or at least seemingly in conflict) with other language in
the draft, and/or may seriously hamper any attempts at
getting more serious and consistent with respect to i18n.

> 2.3.1. Escaped Encoding

>    The 8-bit coded character set of the octet must be a superset of the
>    US-ASCII coded character set, such that the US-ASCII characters have
>    the same escaped encoding regardless of the larger octet character
>    set.

Apart from mixing up characters and octets heavily (discussed
in another message), this requirement seems much too strong
and unnecessary. A first case in point is the data: URL,
where we don't have any "character set" (on the represented side;
on the representing side, it can be on paper, anyway) at all.

Another case is an ftp URL to a machine using an ISO 646
character set. There you might have %7B (displayed as "{" in the
US) which actually represents the character &auml; in HTML
notation. These cases are the reason why "{" and friends
are excluded from URL characters; it makes no sense to
assume more for character encoding, a non-syntactic issue,
than for the syntax itself.
I don't mean to say that trying to have represented
characters look the same when they appear in URLs is a bad
idea; quite the contrary, I am delighted to interpret
this paragraph as a concession that ASCII==ASCII is not
just a coincidence (as it seemed from RFC 1738), but a
useful and desired property (more on that later).
But the fact that it is useful and desired doesn't mean
we can make it required.
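
To make the ISO 646 case concrete, here is a minimal sketch in Python.
The table ISO646_DE is a hand-written excerpt of the German national
variant (DIN 66003) and decode_octets is a hypothetical helper; neither
comes from the draft or from a standard library codec. It only shows
that the escape fixes the octet, not the character.

```python
from urllib.parse import unquote_to_bytes

# Illustrative excerpt of the German ISO 646 variant (DIN 66003): the
# positions US-ASCII gives to brackets and braces carry umlauts instead.
# Hand-written assumption for this sketch, not a standard-library codec.
ISO646_DE = {
    0x40: "§", 0x5B: "Ä", 0x5C: "Ö", 0x5D: "Ü",
    0x7B: "ä", 0x7C: "ö", 0x7D: "ü", 0x7E: "ß",
}

def decode_octets(octets, national=None):
    """Map each 7-bit octet to a character, applying national overrides."""
    national = national or {}
    return "".join(national.get(b, chr(b)) for b in octets)

# The escape triplet always yields the same octet, 0x7B ...
octets = unquote_to_bytes("B%7Br")

# ... but which character that octet stands for depends on the host.
as_us = decode_octets(octets)             # read as US-ASCII
as_de = decode_octets(octets, ISO646_DE)  # read as German ISO 646
```

Under these assumptions the same URL text reads as "B{r" on a US-ASCII
host and as "Bär" on the ISO 646 host.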

> The coded character set chosen must correspond to the character
>    set of the mechanism that will interpret the URL component in which
>    the escaped character is used.  A sequence of escape triplets are
>    used if the character is coded as a sequence of octets.

This, again, is too stringent a requirement, in particular
if the "mechanism" is assumed to be the wire or the instance
on the other side of the wire (which indeed finally interprets
the URL component by converting it to an entity). If the
"interpretation" is in the scheme-specific part of the client-side
URL machinery, that's not a problem. But this should be clarified.
Also, it should be noted, here or in the URL requirements document,
that schemes/mechanisms requiring a conversion from the octets
(don't want to use the term "characters" here) in URLs to those
used on the wire have to specify this in their specification.
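
As a small illustration of why the conversion has to be specified, the
following Python sketch escapes the same character under two different
encodings. The choice of ISO 8859-1 and UTF-8 is only an assumption for
the example; the draft does not name either.

```python
from urllib.parse import quote, unquote

ch = "ä"

# One octet in ISO 8859-1, hence one escape triplet.
latin1 = quote(ch, encoding="iso-8859-1")

# Two octets in UTF-8, hence a sequence of two escape triplets.
utf8 = quote(ch, encoding="utf-8")

# A receiver that assumes the wrong encoding recovers different characters.
misread = unquote(utf8, encoding="iso-8859-1")
```

Here latin1 is "%E4" and utf8 is "%C3%A4"; reading the UTF-8 triplets
as ISO 8859-1 gives two characters instead of one, so the two sides
have to agree on the mapping somewhere.
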
For some background on why I care about this point, please see

> 4. Generic URL Syntax

>    An absolute URL contains the name of the scheme being used (<scheme>)
>    followed by a colon (":") and then a string (the <scheme-specific-
>    part>) whose interpretation depends on the scheme.

The scheme autonomy stipulated here is in conflict with the
requirement cited earlier. I would prefer to keep "whose interpretation
depends on the scheme" and to change the earlier text.

I think it would be a good idea to make an itemised list
of some ways in which "interpretation depends on the scheme",
which covers the major cases we already have and those
that we think could appear. We can say whether these ways
are usual or exceptional, recommended or not, but we shouldn't
force anything.
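
A rough sketch of what "interpretation depends on the scheme" can look
like in code, in Python. The function names and the INTERPRETERS table
are hypothetical, and the data: handling is deliberately simplified
(it ignores the media type and any base64 marker). The only generic
step is the split at the first colon; everything after it is left to
the scheme.

```python
from urllib.parse import unquote_to_bytes

def split_scheme(url):
    """Generic syntax only: <scheme>:<scheme-specific-part>."""
    scheme, sep, rest = url.partition(":")
    if not sep or not scheme:
        raise ValueError("not an absolute URL: %r" % url)
    return scheme.lower(), rest

def interpret_data(rest):
    # Simplified: the payload after the first comma is raw octets,
    # with no character set implied at all.
    _media, _, payload = rest.partition(",")
    return unquote_to_bytes(payload)

def interpret_mailto(rest):
    # This scheme chooses to read the unescaped octets as ASCII text.
    return unquote_to_bytes(rest).decode("ascii")

# Hypothetical registry of scheme-specific interpreters.
INTERPRETERS = {"data": interpret_data, "mailto": interpret_mailto}

def interpret(url):
    scheme, rest = split_scheme(url)
    handler = INTERPRETERS.get(scheme)
    # Unknown schemes stay opaque: the string is passed through untouched.
    return scheme, (handler(rest) if handler else rest)
```

With this arrangement interpret("data:,%FF%00") yields raw octets,
interpret("mailto:uri@bunyip.com") yields text, and a scheme the
generic code does not know is not interpreted at all.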

> 4.2. Opaque and Hierarchical URLs
>    The URL syntax does not require that the scheme-specific-part have
>    any general structure or set of semantics which is common among all
>    URLs.

Again, this contrasts with the earlier ASCII==ASCII requirements.
Character interpretation/encoding should be treated as part of
semantics, not as syntax.

Regards,	Martin.