- From: Martin J. Duerst <mduerst@ifi.unizh.ch>
- Date: Thu, 19 Dec 1996 20:21:54 +0100 (MET)
- To: uri@bunyip.com
Hello everybody,

In my series of mails regarding the URL syntax draft
(draft-fielding-url-syntax-02.txt), I come to the most serious issue:
as far as I see it, the issue that will determine how early (or late)
the draft can be advanced.

The draft writes:

> F.2. Modifications from both RFC 1738 and RFC 1808
>
> Confusion regarding the terms "character encoding", the URL
> "character set", and the escaping of characters with %<hex><hex>
> equivalents has (hopefully) been reduced.

Unfortunately, I have to say that these hopes are not met. The draft
uses the term "character" indiscriminately for both the characters
that represent a URL (on paper or wherever) and the characters that
are (may be!) represented by a URL. This is very annoying for somebody
who understands these things, and extremely confusing to somebody who
does not understand them.

Let's have a look at some examples. The introduction says:

> Unlike many specifications which use a BNF-like grammar to define the
> bytes (octets) allowed by a protocol, the URL grammar is defined in
> terms of characters. Each literal in the grammar corresponds to the
> character it represents, rather than to the octet encoding of that
> character in any particular coded character set.

One would assume that these are the characters A-Z, a-z, 0-9, and some
more, in particular the "%", as we will find them on that famous
napkin. But later, the draft says:

> The set of characters allowed for use within URLs can be described in
> three categories: reserved, unreserved, and escaped.
>
> urlchar = reserved | unreserved | escaped

and defines "escaped" as:

> 2.3.1. Escaped Encoding
>
> An escaped character is encoded as a character triplet, consisting of
> the percent character "%" followed by the two hexadecimal digits
> representing the character's octet code in an 8-bit coded character
> set. For example, "%20" is the escaped encoding for the space
> character.
>
> escaped = "%" hex hex

One wonders: on my napkin, is "%20" one character, or are these three
characters? The confusion is perfect. I will refrain from giving more
examples; I can only say that the draft is full of them.

What are the possible solutions?

(1) Make a clear distinction between URL characters (these would be
    the "%", the "2", and the "0", and not the "%20" as currently)
    and represented characters (these could also be called encoded
    characters, scheme characters, or something else), which may
    include SPACE, control characters, dangerous characters, and
    so on.

(2) Go back to the terminology of RFC 1738 and speak about *octets*
    encoded as characters. This has its advantages, but is rather too
    abstract and far from reality (where ASCII==ASCII).

(3) Work with three levels:

        represented characters
                  |
                  v
                octets
                  |
                  v
            URL characters

    This looks more complicated at first glance, but is in many cases
    closer to reality, and less confusing. It allows different aspects
    of the problem to be clearly separated.

(4) Add cautions: URLs do not always represent characters, and/or do
    not always represent octets that are encoded directly (with %HH).
    A classical example is the data: URL. It encodes raw octets; it
    does not ultimately represent characters. But the octets are not
    encoded as %HH; they are encoded with BASE64 into a set of
    characters/octets that don't need %HH.

In my opinion, the best solution would combine (1), (3), and (4).
I am willing to rewrite the text once the general direction for
solving these issues is found (and after I come back from vacation
over the holidays :-).

Regards,    Martin.

----
Dr.sc.  Martin J. Du"rst                    ' , . p y f g c R l / =
Institut fu"r Informatik                     a o e U i D h T n S -
der Universita"t Zu"rich                      ; q j k x b m w v z
Winterthurerstrasse  190                     (the Dvorak keyboard)
CH-8057   Zu"rich-Irchel   Tel: +41 1 257 43 16
 S w i t z e r l a n d     Fax: +41 1 363 00 35
Email: mduerst@ifi.unizh.ch
----
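To make the three levels of (3) concrete, here is a minimal Python sketch, not anything taken from the draft. The function names to_url_chars and from_url_chars are hypothetical, and the choice of coded character set is an assumption: the draft only says "an 8-bit coded character set" without naming one, so the sketch takes it as a parameter.

```python
from urllib.parse import quote, unquote_to_bytes


def to_url_chars(text: str, charset: str = "utf-8") -> str:
    """Represented characters -> octets -> URL characters (hypothetical helper)."""
    # Level 1 -> level 2: the charset is an assumption; the URL itself
    # does not record which one was used.
    octets = text.encode(charset)
    # Level 2 -> level 3: every octet outside the unreserved set becomes
    # the three URL characters "%", hex, hex.
    return quote(octets, safe="")


def from_url_chars(url_chars: str, charset: str = "utf-8") -> str:
    """URL characters -> octets -> represented characters (hypothetical helper)."""
    octets = unquote_to_bytes(url_chars)  # level 3 -> level 2
    return octets.decode(charset)         # level 2 -> level 1


if __name__ == "__main__":
    escaped = to_url_chars(" ")
    # One represented character (SPACE), three URL characters.
    print(escaped, len(escaped))            # %20 3
    # Same represented character, different octets, different URL characters.
    print(to_url_chars("ü", "latin-1"))     # %FC
    print(to_url_chars("ü", "utf-8"))       # %C3%BC
```

The last two lines show why the middle level matters: the same represented character yields different URL characters depending on which octets stand for it, and nothing in the URL characters says which mapping applies.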
Received on Thursday, 19 December 1996 14:22:05 UTC