Re: I18N Consensus - Generic Syntax Document

Martin J. Duerst (mduerst@ifi.unizh.ch)
Sat, 8 Mar 1997 16:02:58 +0100 (MET)


Date: Sat, 8 Mar 1997 16:02:58 +0100 (MET)
From: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
To: "Roy T. Fielding" <fielding@kiwi.ICS.UCI.EDU>
Cc: Rich Salz <rsalz@osf.org>, uri@bunyip.com
Subject: Re: I18N Consensus - Generic Syntax Document
In-Reply-To: <9703070657.aa00482@paris.ics.uci.edu>
Message-Id: <Pine.SUN.3.95q.970308153503.245Z-100000@enoshima>

On Fri, 7 Mar 1997, Roy T. Fielding wrote:

> >I don't know if you can just rule out filesystems just like that.
> >I can imagine networked filesystems that span hosts that would have,
> >or need to have, the locale stored at the mountpoint.
> 
> I am sure it is possible on some file systems to determine the charset.
> It just isn't possible on all of the file systems for which you can
> use an Apache server, nor is it possible for us to distribute code
> that maps from any possible filesystem charset into UTF-8 and back
> again,

Apache is a great server, and it is improved constantly by a
large group of people. You are one of the main contributors.
Apache also has an API and can be extended in many different
ways.
In an earlier version, Apache had no support for language or
charset negotiation; Dirk van Gulik explained in a recent
workshop how that works now. It is quite possible that, with
the increasing use of UTF-8 in Accept-Charset by browsers,
future versions of Apache will include some functionality
or hooks for converting document character encodings;
that would be a very valuable addition. At a later stage,
similar functionality might be added for file names/URLs.

Also, it is quite possible that somebody porting to a
specific file system will include such code. For example,
for an NT port, it would be trivial to add Unicode <-> UTF-8
conversion code. For a system running in Western Europe,
such as the ones Dan has described, it would be even easier
to write Latin-1 <-> UTF-8 code (no table needed!).
For Unix systems that use Unicode as their wchar_t type,
with support for the appropriate locale, it would also
be easy to implement by just using mbtowc and the like.
On other systems, users might simply start to use a UTF-8
locale, avoiding any implementation problems.

To summarize: While a single solution that works everywhere
is clearly still far off, there are many solutions that can
be deployed very quickly for a particular system or locale.
What this shows most clearly is that the current patchwork
of locales creates deployment problems, which would be
reduced if a single character <-> octet conversion were used
not only on the wire, but also locally.


> nor is it desirable for us to build a server that does it in
> the first place because, as I said in a message a while back, I don't
> think it is a good idea for http URLs to contain (or be displayed)
> as anything other than ASCII characters, regardless of the locale.

If you put in a hook that allows somebody to plug in his/her
own code and see how it works, that would be great. All the
rest will come with time.

As for the desirability, I have indeed read your earlier message
carefully. I answered it and explained in detail why, for URLs used
mainly locally, the burden placed on the 99.9...% of users who
would be forced to use ASCII only may not be worth the savings
gained for the 0.0...% of occasional users who cannot use
the local script.

If, in light of this and related arguments, you still think
that URLs should contain (and be displayed in!) ASCII and only
ASCII, I (and certainly others on this list) look forward
to reading your arguments.


Regards,	Martin.