Re: UTF-8 URL for testing

Francois Yergeau (yergeau@alis.com)
Sun, 13 Apr 1997 22:59:37 -0400


Message-Id: <3.0.1.32.19970413225937.007819fc@genstar.alis.com>
Date: Sun, 13 Apr 1997 22:59:37 -0400
To: John C Klensin <klensin@mci.net>
From: Francois Yergeau <yergeau@alis.com>
Subject: Re: UTF-8 URL for testing
Cc: uri@bunyip.com
In-Reply-To: <SIMEON.9704121139.H@tp7.Jck.com>

At 11:41 12-04-97 -0400, John C Klensin wrote:
>While I'm very anxious to see a real solution that 
>addresses the underlying issues here, I'm forced to agree 
>with Larry.  We don't "make" things happen by standardizing 
>untested ideas, and arguments, however logical, that 
>things are easy to do don't move the discussion forward 
>much.

Yet this is exactly how HTTP/1.1 was made to happen.  Untested things were
discussed and put into drafts.  Some testing took place along the way, but
at some point the spec was declared a Proposed Standard, before there was a
single full implementation that embodied what you want here:

> ... a demonstration that it works 
>well, that it won't cause significant problems with 
>existing (unmodified) clients, servers, or users, etc.

By contrast, what we have now is a refusal to even take the first step, to
put things into the draft so that the issue can be addressed.

>  I don't think that timing of standards is much of 
>the issue here.

Indeed, it doesn't matter much if URL syntax becomes Draft Standard now or
6 months later.  But it does matter that an unsound spec doesn't make it to
DS.

URLs are written on paper (characters) and transmitted over the wire
(bytes).  Thus an unambiguous mapping between characters and bytes is
*required*.  This mapping currently exists for only a tiny fraction of
possible characters, namely ASCII.  Since Web forms are submitted using
URLs, and can contain almost any text, it is neither desirable nor possible
to restrict the repertoire of characters.  The current spec does not
recognize this and pretends that (section 2):

  "All URLs consist of a restricted set of characters, 
   primarily chosen to aid transcribability and usability 
   both in computer systems and in non-computer communications."

In other words, it places a purported transcribability requirement ahead of
the simple fact that current practice uses other characters all the time.
Oh, of course, these non-ASCII characters are escaped to ASCII using
%-encoding, but there is still no defined mapping from characters to bytes.
And there is no defined mapping from bytes to characters for half the
possible byte values, precluding any sensible display of URLs representing
non-ASCII characters.
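
To make the ambiguity concrete, here is a minimal sketch in Python.  The
word "café", the helper name escape_with and the two charsets compared
(ISO 8859-1 and UTF-8) are my own choices for illustration; the spec
itself names no charset at all, which is precisely the problem.

```python
from urllib.parse import quote, unquote

def escape_with(text, charset):
    """Map characters to bytes with the given charset, then %-encode the bytes."""
    # safe="" means every reserved character is escaped too (illustrative choice).
    return quote(text, safe="", encoding=charset)

word = "café"  # hypothetical form input containing one non-ASCII character

# The same characters yield two different URLs, depending on the charset
# the sender happened to use.
as_latin1 = escape_with(word, "latin-1")  # 'caf%E9'
as_utf8 = escape_with(word, "utf-8")      # 'caf%C3%A9'

# A receiver that guesses the wrong charset recovers the wrong characters.
misread = unquote(as_utf8, encoding="latin-1")  # 'cafÃ©'

print(as_latin1, as_utf8, misread)
```

Both escaped forms are pure ASCII on the wire, yet nothing in the URL says
which mapping produced them, so neither the server nor a display routine
can reliably get back to the original characters.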

In short, the current spec is technically unsound and broken, and needs
fixing not to extend it to new capabilities, but to bring it in line with
widespread current practice.

This discussion has been going on for months in various circles, lists and
conferences, with no resolution.  The reason, it seems to me, is the
continued failure to fully recognize that mapping only ASCII characters is
not a solution.  While it may be acceptable to restrict bytes over the wire
to 7 bits (but why?), it is not acceptable to limit the character repertoire
to a subset of ASCII.  URLs are widely put to uses where there is no such
limit.

>And, as I have said many times before, while I recognize 
>and accept the enthusiasm for UTF-8, especially among users 
>of languages with Latin-based alphabetic systems, I would 
>prefer that, when we make protocol decisions that are 
>expected to have very long lifetimes, we use systems that 
>don't penalize non-Roman language groups as severely as 
>UTF-8 tends to do.

This has also been discussed at length.  The trade-off is compatibility
with all of current practice (ASCII-based) vs this undeniable byte-count
penalty for non-Latin scripts.  For short strings such as URLs, I'm afraid
the technical choice is clear.
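
For a rough sense of the size of that penalty, the following Python sketch
counts bytes for three short strings.  The sample strings, the helper name
sizes and the use of UTF-16 as the point of comparison are assumptions of
mine, chosen only for illustration; no particular alternative encoding has
been proposed in this thread.

```python
from urllib.parse import quote

# Hypothetical sample strings, one per script.
samples = {"Latin": "Paris", "Cyrillic": "Москва", "Han": "東京"}

def sizes(text):
    """Return (characters, UTF-8 bytes, UTF-16 bytes, %-escaped UTF-8 length)."""
    utf8_bytes = len(text.encode("utf-8"))
    utf16_bytes = len(text.encode("utf-16-be"))  # big-endian, no byte-order mark
    # Each non-ASCII UTF-8 byte becomes three ASCII characters once escaped.
    escaped_len = len(quote(text, safe="", encoding="utf-8"))
    return len(text), utf8_bytes, utf16_bytes, escaped_len

for script, text in samples.items():
    print(script, sizes(text))
# Latin    (5, 5, 10, 5)
# Cyrillic (6, 12, 12, 36)
# Han      (2, 6, 4, 18)
```

ASCII text costs nothing extra in UTF-8, while the Han example grows from
4 bytes to 6.  Once %-encoding is applied, every escaped byte triples in
length whichever encoding sits underneath, so for strings this short the
difference amounts to a handful of bytes.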
-- 
François Yergeau <yergeau@alis.com>
Alis Technologies Inc., Montréal
Tél : +1 (514) 747-2547
Fax : +1 (514) 747-2561