Re: UTF-8 URL for testing

Martin J. Duerst (mduerst@ifi.unizh.ch)
Mon, 14 Apr 1997 21:55:15 +0200 (MET DST)


Date: Mon, 14 Apr 1997 21:55:15 +0200 (MET DST)
From: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
To: Larry Masinter <masinter@parc.xerox.com>
Cc: Francois Yergeau <yergeau@alis.com>, uri@bunyip.com
Subject: Re: UTF-8 URL for testing
In-Reply-To: <334DADDC.5CBC@parc.xerox.com>
Message-Id: <Pine.SUN.3.96.970411161247.245p-100000@enoshima>

On Thu, 10 Apr 1997, Larry Masinter wrote:

> Is there any software anywhere in the world that actually generates
> URLs like these? All of the examples seem to be carefully
> hand-constructed.
> Since these URLs are compatible with existing browsers, as you say,
> there should not be any difficulty in people running their web servers
> this way. Do any web servers in Japan use hex-encoded UTF-8-encoded
> Unicode for URLs?

> The problem with recommending this method for "Draft Standard" is
> not the "six month delay" it takes in getting to draft standard,
> it's that we should not recommend something that people aren't
> actually going to do. This is not some kind of nit-picky technical
> objection, it's fundamental to the process of Internet standards.

Thanks for expressing your concerns so that they don't look like
procedural nit-picking.

Implementors can be split into browser side and server side.
On the server side, implementation is trivial for some systems
(Plan 9, Windows NT), and more difficult for others (Unix).
It is also difficult if you already have (strictly speaking
illegal) natively encoded URLs out on your pages or on
people's napkins. But for most cases, there are very
nice upgrade paths, and fast "workarounds" until
there is fully configurable support in a server.
For Jigsaw, one of the servers mentioned, implementation
should be really easy, as characters are Unicode anyway
in Java, and all the conversion support is there with
Java 1.1.

For the browser side, display is not really much of an
issue anymore. Both Netscape and MS Explorer can handle
UTF-8 in their newest version. They may have to streamline
their interfaces so that UTF-8 also works in the status
bar, but that should be a minor issue. Input is a little
bit more of a problem, as for example in the case of
Netscape, I have found that form fields and such usually
lag behind in internationalization by a browser version
or two.


My judgement of the situation, after several recent conferences
in the multilingual/internet/unicode field, is that we are in
some kind of oversaturated-fluid situation (my English for physics
is not very good, sorry). There are quite a few people, in particular
browser makers, who are very aware of the problems and would like
to know what the solution will be. They are waiting for that
small seed crystal to drop in. After that, change might
be slower or faster, but it will happen.

There is absolutely no question for me that once the Japanese
realize that they can use their characters in URLs, they
will do so. Currently, they can't, because they have to
deal with two or three encodings, and autodetect or
"let the user guess" doesn't work well on short strings.
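
To make the encoding problem concrete, here is a small Python sketch.
It is only an illustration: the sample name and the list of encodings
are my own assumptions, not taken from any particular server. It
escapes the same two characters under three legacy Japanese encodings
and under UTF-8, and each one gives a different URL string.

```python
from urllib.parse import quote

# Sample name chosen for illustration only: the two Kanji U+65E5 U+672C.
name = "\u65e5\u672c"

# Encodings a Japanese site might plausibly be using (an assumption).
encodings = ["shift_jis", "euc_jp", "iso2022_jp", "utf-8"]

# Hex-encode (%HH) the octets that each encoding produces for the same text.
escaped = {enc: quote(name.encode(enc), safe="") for enc in encodings}

for enc, url_part in escaped.items():
    print(f"{enc:12} {url_part}")
```

A receiver that sees only one of these short escaped strings has very
little material to guess the encoding from.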


> I am eager to actually support internationalization.

Many thanks for reassuring me of that.


> However,
> I think it is counter-productive to foist hex-encoded UTF-8-encoded
> URLs (12 bytes to represent one 16-bit Kanji)

It is *9* bytes for a 16-bit Kanji. But I agree that this is still
rather long. Up to the Unicode conference last September, I was
against using UTF-8 for exactly this reason, while many others
already pushed for UTF-8. I came to the conclusion that the
escaping problems with UTF-7 were more serious, and that the
length issue was to a large extent a temporary one, because when
using native encoding or 8-bit UTF-8, it would not be an issue.
Even for languages such as Georgian, which are suited for
native 8-bit encodings but need three bytes in UTF-8, representing
such a character with %HH as we do now and representing it with
three bytes in the future will use exactly the same amount of
space.
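
The arithmetic can be checked with a short Python sketch. The two
sample characters are my own choice for illustration: a Kanji takes
three octets in UTF-8, each octet is written as %HH, so the escaped
form has nine characters; a Georgian letter takes three octets in raw
UTF-8, the same length as one %HH escape of a native 8-bit octet.

```python
from urllib.parse import quote

kanji = "\u65e5"     # sample 16-bit Kanji (illustrative choice)
georgian = "\u10d0"  # sample Georgian letter (illustrative choice)

# Hex-encoded UTF-8: three octets, each written as %HH.
escaped_kanji = quote(kanji, safe="")
print(escaped_kanji, len(escaped_kanji))

# Raw 8-bit UTF-8 for the Georgian letter: three octets.
raw_georgian = georgian.encode("utf-8")
print(len(raw_georgian))

# One octet of a native 8-bit encoding, escaped as %HH, is three characters.
print(len("%HH"))
```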

I would of course have liked to push directly for 8-bit URLs
with UTF-8, but the requirement of having a lowest-common-
denominator form, the discussion on URNs, and also the discussion
on the URL and URI lists seemed to indicate that it was advisable
to first clearly establish the mapping mechanisms for 7-bit
URLs, and then go on and start to introduce native URLs from
the user interface side. If there should suddenly be consensus
for going all the way with one step, I definitely wouldn't mind.


> on the rest of the
> world merely because a western European and a Canadian like
> the idea. Surely we can find a site in Japan, China, Israel, or
> Russia that would support exporting their URLs with hex-encoded
> UTF-8-encoded URLs, before believing that this isn't yet
> another form of Unicode imperialism.

The proposal is explicitly formulated as a recommendation to
avoid any impression of imperialism. We don't want to force
anybody.


> Otherwise, we would just
> have a pretend solution to a real problem.

I understand that you are worrying about Unicode acceptance
in Japan. I have been studying this phenomenon, participating
in discussions, reading articles, collecting material, and so
on, since at least 1993. I have done all this both in Japanese
and in English. And I have discussed this with many others,
many of them more involved (with actual products) in Japan
than myself.

The bottom line of all of it is that:
- The "arguments" against Unicode are rumors that quickly
	cease in actual discussion.
- The Unicode opponents in Japan are very vocal, and use
	whatever tactics and contortions they can to spread
	their stuff.
- There are many people clearly in favor of Unicode, but
	they are less vocal.
- In serious technical discussion, for example in Japanese
	magazines oriented towards computer professionals,
	the merits of Unicode are clearly stressed.
- Unicode is already used by many Japanese users, without
	their being aware of it. The main word-processing packages
	(Ichitarou and MSWord) on the market use Unicode.
	They don't advertise with stickers saying "Unicode
	inside", so you may very well have the situation
	that somebody spreads rumors against Unicode while
	taking advantage of it :-).

What people in Japan and elsewhere will do is not
"exporting their URLs with hex-encoded UTF-8". In a few
months or years, webmasters will just configure their
servers to tell them which character encoding the local
filesystem or other resources are using. Users will just
type in their characters. They won't care whether it's
Unicode or SJIS or EUC or whatever, and they won't even
be surprised that it works, just as US users currently
are not surprised that they can use the characters
they know in URLs, and don't care whether they
use ASCII or EBCDIC or CPXYZ or whatever.
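
What such a server-side mapping might look like can be sketched in a
few lines of Python. This is only a sketch under stated assumptions:
the function name and the fs_encoding setting are hypothetical, and no
existing server is claimed to work this way. The webmaster names the
local encoding once; the server undoes the %HH escapes, reads the
octets as UTF-8, and converts to the local encoding for the lookup.

```python
from urllib.parse import unquote_to_bytes

def url_path_to_local_name(escaped_path: str, fs_encoding: str) -> bytes:
    """Map a hex-encoded UTF-8 URL path to octets in the local encoding.

    fs_encoding is a hypothetical per-server configuration value.
    """
    # Undo the %HH escapes, then interpret the octets as UTF-8.
    text = unquote_to_bytes(escaped_path).decode("utf-8")
    # Re-encode into whatever the local filesystem uses.
    return text.encode(fs_encoding)

# The same URL resolves on an EUC-JP host and on a Shift_JIS host.
url_path = "/%E6%97%A5%E6%9C%AC.html"
euc_name = url_path_to_local_name(url_path, "euc_jp")
sjis_name = url_path_to_local_name(url_path, "shift_jis")
print(euc_name, sjis_name)
```

The two hosts store different octets, but the URL handed out to users
is the same in both cases.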


Regards,	Martin.