Re: UTF-8 URL for testing

Martin J. Duerst (mduerst@ifi.unizh.ch)
Mon, 14 Apr 1997 21:09:16 +0200 (MET DST)


Date: Mon, 14 Apr 1997 21:09:16 +0200 (MET DST)
From: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
To: John C Klensin <klensin@mci.net>
Cc: masinter@parc.xerox.com, Francois Yergeau <yergeau@alis.com>,
Subject: Re: UTF-8 URL for testing
In-Reply-To: <SIMEON.9704121139.H@tp7.Jck.com>
Message-Id: <Pine.SUN.3.96.970414204206.245I-100000@enoshima>

On Sat, 12 Apr 1997, John C Klensin wrote:

> 
> On Fri, 11 Apr 1997 16:29:57 -0700 (PDT) Larry Masinter 
> <masinter@parc.xerox.com> wrote:
> 
> > Just because a problem is important doesn't
> > mean that we should recommend something that has not yet
> > been demonstrated to actually solve the problem.
> >...
> 
> Dan and Francois,
> 
> While I'm very anxious to see a real solution that 
> addresses the underlying issues here, I'm forced to agree 
> with Larry.  We don't "make" things happen by standardize 
> untested ideas and arguments, however logical, that 
> things are easy to do don't move the discussion forward 
> much.

Thanks for admitting that there is some logic behind what
we have been proposing.


> I don't think that timing of standards are much of 
> the issue here.  It is just that we have a large installed 
> base and I'd prefer to see a demonstration that it works 
> well, that it won't cause significant problems with 
> existing (unmodified) clients, servers, or users, etc.

I very much appreciate your concern. However, I have
great difficulty imagining what might actually go
wrong. For example, as long as we stay with %HH, nothing
can possibly go wrong, can it? And if it did, it would
not be UTF-8 that was to blame, but the implementation
that did not handle %HH correctly.
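
To illustrate why staying with %HH is safe, here is a minimal
sketch in Python. The language, the function name
utf8_percent_encode, and the sample path are assumptions made
for illustration only; nothing here is prescribed by the
proposal. The point is that the escaped URL contains only
ASCII, so unmodified software sees nothing unusual.

```python
from urllib.parse import quote, unquote_to_bytes


def utf8_percent_encode(path: str) -> str:
    """Encode a path as UTF-8 and escape each non-ASCII octet as %HH.

    Hypothetical helper for illustration; "/" is kept as a separator.
    """
    return quote(path, safe="/", encoding="utf-8")


# Sample path (an assumption): "/r\u00e9sum\u00e9.html" contains U+00E9 twice.
original = "/r\u00e9sum\u00e9.html"
escaped = utf8_percent_encode(original)
print(escaped)

# The escaped form is pure ASCII, so it passes through existing software.
assert escaped.isascii()

# Undoing the %HH escapes and decoding as UTF-8 recovers the characters.
assert unquote_to_bytes(escaped).decode("utf-8") == original
```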

If we start to remove %HH escapes and replace them with
8-bit octets, more things can go wrong. But they are exactly
the same things that can happen now when this is done
with a legacy encoding. They are mainly related to
the fact that transcoding preserves character identity,
whereas URLs assume octet identity. The recommendation
for UTF-8 will eventually remove these problems, but during
the transition period they will show up more strongly.
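
The difference between character identity and octet identity
can be shown with a short Python sketch. The sample string and
the choice of ISO-8859-1 as the legacy encoding are assumptions
made for illustration. The same characters yield different
octets, and therefore different URLs, depending on the encoding
in use when the URL is formed.

```python
from urllib.parse import quote

name = "caf\u00e9"  # four characters, the last one is U+00E9

# The same characters as octets in a legacy encoding and in UTF-8.
latin1_octets = name.encode("iso-8859-1")  # b'caf\xe9'
utf8_octets = name.encode("utf-8")         # b'caf\xc3\xa9'

# Escaping the raw octets gives two different URL components.
latin1_url = quote(latin1_octets)
utf8_url = quote(utf8_octets)
print(latin1_url, utf8_url)

# Character identity is preserved by transcoding ...
assert latin1_octets.decode("iso-8859-1") == utf8_octets.decode("utf-8")
# ... but octet identity, which URLs rely on, is not.
assert latin1_url != utf8_url
```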

The above applies as long as we don't look at the
exact characters encoded. If we do, we get problems
similar to the 0O0O0O problems with ASCII (digit zero
versus capital letter O). Again, nothing really new.
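
As a hedged illustration of what the analogue of 0O0O0O might
look like beyond ASCII, the Python sketch below takes three
capital letters that typically look alike in print. The choice
of these three letters is an assumption made for illustration.
They are distinct characters and so produce distinct escaped
forms, just as "0" and "O" are distinct octets in ASCII.

```python
import unicodedata
from urllib.parse import quote

# Latin, Greek, and Cyrillic capitals that usually render alike.
lookalikes = ["A", "\u0391", "\u0410"]

# Map each character's Unicode name to its UTF-8 %HH form.
escaped = {unicodedata.name(ch): quote(ch, encoding="utf-8")
           for ch in lookalikes}

for char_name, url_form in escaped.items():
    print(char_name, url_form)

# Visually similar, but all three URL forms differ.
assert len(set(escaped.values())) == 3
```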


When asked for implementations, I immediately made two
URLs with UTF-8 encoded characters. Francois made a few
more and included them in a web page. They are there for
anybody to test. We have tested them with the browsers we
have at hand. When asked to write some software to convert
URLs to UTF-8, Francois also wrote such software.
Everybody can use it and test it.

If you have any ideas about what else should be
tested, and how, please tell the list. Everybody
knows that it is hard to test one's own software or
ideas. It's much easier for other people to spot
problems.


Many thanks for your help,	Martin.