Re: Charset support (was: Accept-Charset support) from Martin J. Duerst on 1996-12-13 (www-international@w3.org from October to December 1996)

From: Martin J. Duerst <mduerst@ifi.unizh.ch>
Date: Fri, 13 Dec 1996 22:13:33 +0100 (MET)
To: Klaus Weide <kweide@tezcat.com>
cc: Francois Yergeau <yergeau@alis.com>, www-international@w3.org
Message-ID: <Pine.SUN.3.95.961213213205.245X-100000@enoshima>
On Fri, 13 Dec 1996, Klaus Weide wrote:

> It is my understanding that the folks who would be using 8859-2 haven't
> yet agreed on whether to use that or windows-1250 or cp852 or ...,
> and those who might use 8859-5 are also split (KOI8-R, "alt", ...).
> There are several encodings in use for Japanese.  That Russian DOS box
> either already has learnt to speak several charsets, or it will not even
> be able to understand its neighbors in the same region.  

Yes, exactly. They didn't catch the good side of agreeing on
ISO-8859-1 for western Europe for HTML. The Japanese can be
excused a little bit, as in their case, the software usually
converts automatically, and they just expected this to happen
also for a web browser (and implemented it that way).

Anyway, I guess the only solution for Eastern Europe, Russia,
and so on will be Unicode.


> Even if it seems right now that nearly everybody around the world does
> pretty much what they want (let the client guess what we are sending,
> after all it works most of the time or we just don't know any better)
> --- there is a history in the drafts and specs of Web protocols that
> said "iso-8859-1 is default".  One would think the world joined the
> World-Wide Web under those conditions...

The others joined the WWW with one big wish: "It's so cool, we want
it too." Everything else was second to this. They didn't mind the
iso-8859-1 default, because they worked around it. They knew it
wouldn't work in all cases, but that was not of much importance.

One important thing you can learn here is: The more you think
that what you do might have a chance to become really cool, the
more you should care about serious i18n.

 
> From some responses it seemed the NC in everybody's hands is right
> around the corner, which would then make all-you-can-eat of fonts
> appear on the screen via some Java magic (presumably with negligible
> cost and delay)...  but I rather like your definition of "supporting 
> UTF-8".  There's nothing wrong with displaying _U1234_ if necessary,
> I suppose.

Not for HTML, anyway. We took care of that.

> I should clarify that above I am referring to charsets for entity bodies.
> The part of the HTTP draft about charsets in Warning headers seemed,
> uhmm, antiquated when I first read it (some months ago).  I can agree
> that iso-8859-1 in a special role doesn't seem to belong there.

"antiquated" is a very good word. It hasn't changed since, unfortunately.
I am glad you write this; somebody told me in private mail that
nobody would agree with me that iso-8859-1 does not need any
special place in warnings :-).

Anyway, with regard to warnings, I have a little proposal:

Let's collect a list of those six warnings in the draft, in many
languages, e.g. in the form:

en.10	Response is stale
de.14	Umwandlung angewendet

and so on. An initial file with English and German (not yet really
perfect) is available as:

ftp://www.ifi.unizh.ch/pub/multilingual/http.warnings.utf8

Any improvements or additions are highly welcome, just send
them to me and I'll integrate them. To start, you can get
the English-only file (pure ASCII, and therefore also valid UTF-8)
as:

ftp://www.ifi.unizh.ch/pub/multilingual/http.warnings.ascii

It's only six short messages at the moment, so translation
is done very quickly. For submissions, you don't need to
use UTF-8; I can handle quite a few encodings.
And of course I have an editor that can handle UTF-8 :-).

Then let's make this file (and a little bit of code to
extract the desired warning) available to implementors,
and let's ask them to just send the strings out as is,
silently dropping the antiquated ISO-8859-1 default
for warnings in favor of UTF-8.
Interestingly enough, with this solution, the server side
does not have to worry AT ALL about what encoding the
warnings are in, how to convert that encoding to something
allowed, how to implement RFC1522, or whatever.
I guess implementors will love this. Those that don't care
won't do anything other than English, anyway.
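
The "little bit of code to extract the desired warning" could look
roughly like the sketch below. It assumes the file layout is exactly
what the two sample lines above suggest (language tag, a dot, the
warning code, a tab, then the message), one entry per line, stored as
UTF-8. The function names and the fallback to English are illustrative
assumptions, not part of the proposal or of the actual file.

```python
# Sketch only: the layout "<lang>.<code><TAB><message>" is inferred from
# the two sample lines in this message; all names here are hypothetical.

SAMPLE = "en.10\tResponse is stale\nde.14\tUmwandlung angewendet\n"


def parse_warnings(text):
    """Build a {(lang, code): message} table from the file contents."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, tab, message = line.partition("\t")
        lang, dot, code = key.partition(".")
        if not tab or not dot or not code.isdigit():
            continue  # skip lines that do not match the assumed layout
        table[(lang, int(code))] = message
    return table


def warning_text(table, code, lang, fallback="en"):
    """Return the message for (lang, code), falling back to `fallback`.

    Returns None when neither language has an entry for the code.
    """
    message = table.get((lang, code))
    if message is None:
        message = table.get((fallback, code))
    return message


def warning_bytes(table, code, lang):
    """Return the message as UTF-8 bytes, ready to send out as is."""
    message = warning_text(table, code, lang)
    return None if message is None else message.encode("utf-8")


if __name__ == "__main__":
    warnings = parse_warnings(SAMPLE)
    print(warning_text(warnings, 14, "de"))
    print(warning_text(warnings, 10, "de"))  # no German entry, uses English
```

Because the table is decoded once from UTF-8 and encoded back to UTF-8
on the way out, the server never has to know which script a given
translation uses.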

Let's make the internet principles of implementation priority
and independent creativity work for decent and non-antiquated
internationalization.

Regards,	Martin.
Received on Friday, 13 December 1996 16:13:31 UTC