- From: Martin J. Duerst <mduerst@ifi.unizh.ch>
- Date: Fri, 20 Dec 1996 17:32:58 +0100 (MET)
- To: uri@bunyip.com
A next issue I would like to address regarding the URL syntax draft is protocol autonomy. It may also be called scheme autonomy or mechanism autonomy. By this I mean the question of whether protocols/schemes/mechanisms can do whatever they want to define how their URLs look, or whether they have to follow certain restrictions.

It is clear that all URLs have to meet the syntactic restrictions, i.e. <scheme>:<scheme-specific-part> for opaque URLs and a few more things for generic URLs. That's what the syntax draft is here for :-).

However, currently the draft contains language that imposes more restrictions. These restrictions are already broken by existing URLs, are not applicable to all URLs, may restrict the creation of useful URLs in the future, are in conflict (or at least seemingly in conflict) with other language in the draft, and/or may seriously hamper any attempts at getting more serious and consistent with respect to i18n.

> 2.3.1. Escaped Encoding
> The 8-bit coded character set of the octet must be a superset of the
> US-ASCII coded character set, such that the US-ASCII characters have
> the same escaped encoding regardless of the larger octet character
> set.

Apart from mixing up characters and octets heavily (discussed in another message), this requirement seems much too strong and unnecessary. A first case in point is the data: URL, where we don't have any "character set" (on the represented side; on the representing side, it can be on paper, anyway) at all.

Another case is an ftp URL to a machine using an ISO 646 character set. There you might have %7B (displayed as "{" in the US) which actually represents the character ä (&auml; in HTML notation). These cases are the reason why "{" and friends are excluded from URL characters; it makes no sense to assume more for character encoding, a non-syntactic issue, than for the syntax itself.
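
To make the ISO 646 case concrete, here is a minimal sketch in Python. The override table is my own partial rendering of the German national variant (only the positions relevant here), and decode_octets is a hypothetical helper, not anything defined by the draft.

```python
from urllib.parse import unquote_to_bytes

# Assumption: partial table for the German ISO 646 variant, listing only
# some of the code positions that differ from US-ASCII.
ISO646_DE_OVERRIDES = {
    0x5B: "Ä", 0x5C: "Ö", 0x5D: "Ü",
    0x7B: "ä", 0x7C: "ö", 0x7D: "ü", 0x7E: "ß",
}

def decode_octets(octets, overrides):
    """Map 7-bit octets to characters, applying national overrides first."""
    return "".join(overrides.get(b, chr(b)) for b in octets)

# The escape triplet only names an octet; it says nothing about characters.
octets = unquote_to_bytes("%7B")

as_us_ascii = decode_octets(octets, {})                    # US reading
as_iso646_de = decode_octets(octets, ISO646_DE_OVERRIDES)  # German reading

print(as_us_ascii, as_iso646_de)  # prints: { ä
```

The same escape triplet yields two different characters depending on which table the interpreting side applies, which is why the syntax alone cannot settle the question.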
With this, I don't want to say that trying to have represented characters appear looking the same when in URLs is a bad idea; quite the contrary, I am delightedly interpreting this paragraph as a concession that ASCII==ASCII is not just a coincidence (as it seemed from RFC 1738), but a useful and desired property (more on that later). But the fact that it is useful and desired doesn't mean we can make it required.

> The coded character set chosen must correspond to the character
> set of the mechanism that will interpret the URL component in which
> the escaped character is used. A sequence of escape triplets are
> used if the character is coded as a sequence of octets.

This, again, is too stringent a requirement, in particular if the "mechanism" is assumed to be the wire or the instance on the other side of the wire (which indeed finally interprets the URL component by converting it to an entity). If the "interpretation" is in the scheme-specific part of the client-side URL machinery, that's not a problem. But this should be clarified.

Also, it should be noted, here or in the URL requirements document, that schemes/mechanisms requiring a conversion from the octets (don't want to use the term "characters" here) in URLs to those used on the wire have to specify this in their specification. For some background on why I care about this point, please see draft-duerst-dns-i18n-00.txt.

> 4. Generic URL Syntax
> An absolute URL contains the name of the scheme being used (<scheme>)
> followed by a colon (":") and then a string (the <scheme-specific-
> part>) whose interpretation depends on the scheme.

The scheme autonomy stipulated here is in conflict with the requirement cited earlier. I would prefer to keep "whose interpretation depends on the scheme" and to change the earlier language.
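
As a rough illustration of "whose interpretation depends on the scheme", the following Python sketch splits an absolute URL at the first colon and hands the scheme-specific part to a per-scheme interpreter. The scheme names x-octets and x-utf8, the handler functions, and the registry are all hypothetical; they only show that the octet-to-character step can live in the scheme-specific machinery rather than in the generic syntax.

```python
from urllib.parse import unquote_to_bytes

def split_url(url):
    """Split an absolute URL into (scheme, scheme-specific-part)."""
    scheme, sep, rest = url.partition(":")
    if not sep or not scheme:
        raise ValueError("not an absolute URL: %r" % url)
    return scheme.lower(), rest

def interpret_as_octets(part):
    """Hypothetical scheme rule: escapes denote raw octets, no charset."""
    return unquote_to_bytes(part)

def interpret_as_utf8(part):
    """Hypothetical scheme rule: escape triplets form UTF-8 sequences."""
    return unquote_to_bytes(part).decode("utf-8")

# Assumption: a registry keyed by made-up scheme names.
INTERPRETERS = {
    "x-octets": interpret_as_octets,
    "x-utf8": interpret_as_utf8,
}

def interpret(url):
    """Apply the scheme's own rule; unknown schemes stay opaque octets."""
    scheme, part = split_url(url)
    handler = INTERPRETERS.get(scheme, interpret_as_octets)
    return scheme, handler(part)

print(interpret("x-utf8:caf%C3%A9"))
print(interpret("x-octets:%E4"))
```

The generic layer here only knows about the colon and the escape triplets; everything about characters is decided by the handler the scheme registers.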
I think it would be a good idea to make an itemised list of some ways in which "interpretation depends on the scheme", which covers the major cases we already have and those that we think could appear. We can say whether these ways are usual or exceptional, recommended or not, but we shouldn't force anything.

> 4.2. Opaque and Hierarchical URLs
>
> The URL syntax does not require that the scheme-specific-part have
> any general structure or set of semantics which is common among all
> URLs.

Again, this contrasts with the earlier ASCII==ASCII requirements. Character interpretation/encoding should be treated as part of semantics, not as syntax.

Regards,
Martin.
Received on Friday, 20 December 1996 11:34:50 UTC