Re: UTF-8 and URLs

Martin J. Duerst (mduerst@ifi.unizh.ch)
Sun, 27 Apr 1997 18:16:02 +0200 (MET DST)


Date: Sun, 27 Apr 1997 18:16:02 +0200 (MET DST)
From: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
To: Francois Yergeau <yergeau@alis.com>
Cc: uri@bunyip.com
Subject: Re: UTF-8 and URLs
In-Reply-To: <3.0.1.32.19970425102234.00d53550@genstar.alis.ca>
Message-Id: <Pine.SUN.3.96.970427175721.245P-100000@enoshima>

On Fri, 25 Apr 1997, Francois Yergeau wrote:

> At 00:25 25-04-97 -0500, Dan Connolly wrote:
> >> Let's see: we would have an i18n RFC that would allow URLs to contain most
> >> any characters, and a (possibly Draft) standard that would say "All URLs
> >> consist of a restricted set of characters..." (we know which): clear
> >> contradiction.
> >
> >Please don't cite out of context or paraphrase wildly. The _existing_
> >RFC limits the characters in URLs. In fact, the UTF-8-in-%XX encoding
> >proposal doesn't even change that: it just adds semantics to the syntax.
> 
> I'm sorry, but I see it differently: the UTF-8-in-%XX proposal doesn't add
> octet values on-the-wire, but it adds, and correctly maps, thousands of
> characters.

It can be seen in different ways. For some of the issues discussed
in the syntax draft, in particular everything about relative URL processing,
it is indeed just semantics and doesn't interfere. On the other hand,
the current draft contains many explanations about the relation between
represented characters, octets, and URL characters. Somebody studying
it will greatly benefit from being told about the limitations of the
model that the current draft assumes, and from being told the direction
that is being taken to change the model and eliminate the deficiencies.

Also, the UTF-8-in-%XX proposal, strictly requiring %XX, is indeed
just an addition of semantics. However, once it is clear how these
semantics are added, the next step, namely removing the %XX requirement
and extending the URL character set to most of the Universal Character
Set (excluding compatibility characters and the like), is obvious.
If URLs were closely similar to MIME headers, we could say that
this is a transparent user-interface issue. But URLs also appear
on paper, where we agree that transcribing long %XX sequences is
a great pain for those who know the actual characters, so the
situation is different.
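
To make the transcription point concrete, here is a minimal sketch
in Python of what UTF-8-in-%XX does to a short name. It is only an
illustration under my own assumptions: the helper names and the two
sample strings are hypothetical and come from neither draft, and the
standard urllib.parse functions stand in for whatever a real
implementation would use.

```python
from urllib.parse import quote, unquote


def utf8_percent_encode(text: str) -> str:
    """Encode text as UTF-8 octets, then write each octet outside the
    unreserved ASCII set as %XX (hypothetical helper name)."""
    # safe="" means that reserved characters such as "/" are escaped too.
    return quote(text, safe="", encoding="utf-8")


def utf8_percent_decode(escaped: str) -> str:
    """Reverse the mapping: %XX sequences -> octets -> UTF-8 characters."""
    # errors="strict" rejects octet sequences that are not valid UTF-8.
    return unquote(escaped, encoding="utf-8", errors="strict")


# Sample strings chosen purely for illustration.
for name in ("Dürst", "東京"):
    escaped = utf8_percent_encode(name)
    # The wire form uses only ASCII characters already allowed in URLs.
    assert escaped.isascii()
    # The original characters are recovered exactly.
    assert utf8_percent_decode(escaped) == name
    print(f"{name!r}: {len(name)} characters -> {escaped} ({len(escaped)} characters)")
```

The two-character name "東京" becomes the eighteen-character string
%E6%9D%B1%E4%BA%AC, which is exactly the kind of sequence nobody
wants to copy from paper.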

I originally proposed the addition of UTF-8-in-%XX to the current
draft as an important first step towards fully international URLs,
based on experience with the URN compromise. But UTF-8-in-%XX is
only the first step, and because we already know the next steps,
we definitely should tell this to the reader of the syntax draft,
whether in the form of a fully reworked draft or (probably
preferable) in the form of a note discussing future developments.


Regards,	Martin.