URL syntax: characters and octets

Martin J. Duerst (mduerst@ifi.unizh.ch)
Thu, 19 Dec 1996 20:21:54 +0100 (MET)


Date: Thu, 19 Dec 1996 20:21:54 +0100 (MET)
From: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
To: uri@bunyip.com
Subject: URL syntax: characters and octets
Message-Id: <Pine.SUN.3.95.961219194325.245V-100000@enoshima>

Hello everybody,

In my series of mails regarding the URL syntax draft
(draft-fielding-url-syntax-02.txt), I now come to what I see
as the most serious issue, the one that will determine
how early (or late) the draft can be advanced.


The draft writes:

> F.2. Modifications from both RFC 1738 and RFC 1808
> 
>    Confusion regarding the terms "character encoding", the URL
>    "character set", and the escaping of characters with %<hex><hex>
>    equivalents has (hopefully) been reduced.

Unfortunately, I have to say that these hopes have not been met.
The draft uses the term "character" indiscriminately for both
the characters that represent a URL (on paper or wherever) and
the characters that are (may be!) represented by a URL.
This is very annoying for somebody who understands these things,
and extremely confusing for somebody who does not.


Let's have a look at some examples:

The introduction says:

>    Unlike many specifications which use a BNF-like grammar to define the
>    bytes (octets) allowed by a protocol, the URL grammar is defined in
>    terms of characters.  Each literal in the grammar corresponds to the
>    character it represents, rather than to the octet encoding of that
>    character in any particular coded character set.  

One would assume that these are the characters A-Z, a-z, 0-9, and some
more, in particular the "%", as we will find them on that famous napkin.

But later, the draft says:

>    The set of characters allowed for use within URLs can be described in
>    three categories: reserved, unreserved, and escaped.
> 
>       urlchar     = reserved | unreserved | escaped

and defines "escaped" as:

> 2.3.1. Escaped Encoding
> 
>    An escaped character is encoded as a character triplet, consisting of
>    the percent character "%" followed by the two hexadecimal digits
>    representing the character's octet code in an 8-bit coded character
>    set.  For example, "%20" is the escaped encoding for the space
>    character.
>    
>       escaped     = "%" hex hex

One wonders: on my napkin, is "%20" one character, or is it three
characters? The confusion is complete.
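
The napkin question can be made concrete with a small sketch. It is only
an illustration: the variable names are invented here, and Python's
standard urllib.parse module stands in for whatever software reads the
URL. It assumes the octet 0x20 is interpreted as ASCII.

```python
from urllib.parse import unquote_to_bytes

# What is written on the napkin: three URL characters.
url_text = "%20"
url_characters = list(url_text)

# What those three characters stand for: a single octet.
octets = unquote_to_bytes(url_text)

# What that octet may (or may not) represent: one character,
# here under the assumption that the octet is ASCII.
represented = octets.decode("ascii")

print(len(url_characters), len(octets), repr(represented))
```

Counted as URL characters there are three; counted as octets or as
represented characters there is one. The draft uses "character" for both
counts.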

I will refrain from listing more of these unnecessarily confusing
passages; I can only say that the draft is full of them. What are
the possible solutions?

(1) Make a clear distinction between URL characters (these would be
	the "%", the "2", and the "0", and not the "%20" as currently)
	and represented characters (could also be called encoded
	characters, scheme characters, or something else),
	which may include SPACE, control characters, dangerous
	characters, and so on.

(2) Go back to the terminology of RFC 1738 and speak about *octets*
	encoded as characters. This has its advantages, but is rather
	too abstract and far from reality (where ASCII==ASCII).

(3) Work with three levels:
                     represented characters
                                |
                                v
                              octets
                                |
                                v
                         URL characters

	This looks more complicated at first glance, but is in many cases
	closer to reality, and less confusing. It allows different aspects
	of the problem to be clearly separated.

(4) Add cautions: URLs do not always represent characters, and/or do
	not always represent octets that are encoded directly (with %HH).
	A classic example is the data: URL. It encodes raw octets;
	it does not ultimately represent characters. But the octets
	are not encoded as %HH; they are encoded with BASE64 into
	a set of characters/octets that don't need %HH.
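
The three levels in (3) and the caution in (4) can be sketched as
follows. This is a minimal illustration, not text proposed for the
draft: the helper name escape(), the sample character, the two coded
character sets, and the sample octets are all assumptions made here, and
Python's urllib.parse and base64 modules are used only for convenience.

```python
import base64
from urllib.parse import quote, unquote_to_bytes


def escape(text, charset):
    """Walk the three levels: represented characters -> octets -> URL characters."""
    octets = text.encode(charset)            # level 1 -> level 2
    url_characters = quote(octets, safe="")  # level 2 -> level 3 (%HH escaping)
    return octets, url_characters


# The same represented character yields different octets, and therefore
# different URL characters, depending on the coded character set assumed.
octets_latin1, url_latin1 = escape("\u00fc", "iso-8859-1")
octets_utf8, url_utf8 = escape("\u00fc", "utf-8")

# Going back from URL characters recovers only the octets (level 2);
# which characters they represent is not stated by the URL itself.
recovered = unquote_to_bytes(url_latin1)

# Caution (4): raw octets that represent no characters at all can be
# carried without any %HH, for example by BASE64 as in the data: URL.
raw_octets = bytes([0x00, 0xFF, 0x20])
base64_text = base64.b64encode(raw_octets).decode("ascii")

print(url_latin1, url_utf8, base64_text)
```

Keeping the octet level explicit is what shows that "%FC" and "%C3%BC"
can stand for the same represented character, while the BASE64 text
stands for octets that never were characters.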

In my opinion, the best solution would combine (1), (3), and (4).
I am willing to rewrite the text once the general direction for solving
these issues is found (and after I come back from vacation over
the holidays :-).

Regards,	Martin.

----
Dr.sc.  Martin J. Du"rst			    ' , . p y f g c R l / =
Institut fu"r Informatik			     a o e U i D h T n S -
der Universita"t Zu"rich			      ; q j k x b m w v z
Winterthurerstrasse  190			     (the Dvorak keyboard)
CH-8057   Zu"rich-Irchel   Tel: +41 1 257 43 16
 S w i t z e r l a n d	   Fax: +41 1 363 00 35   Email: mduerst@ifi.unizh.ch
----