Re: Charset support (was: Accept-Charset support) from Martin J. Duerst on 1996-12-16 (www-international@w3.org from October to December 1996)

From: Martin J. Duerst <mduerst@ifi.unizh.ch>
Date: Mon, 16 Dec 1996 15:37:06 +0100 (MET)
To: Chris Lilley <Chris.Lilley@sophia.inria.fr>
cc: Klaus Weide <kweide@tezcat.com>, www-international@w3.org
Message-ID: <Pine.SUN.3.95.961216151909.242F-100000@enoshima>

On Mon, 16 Dec 1996, Chris Lilley wrote:

> On Dec 15,  1:49pm, Klaus Weide wrote:
> 
> > On Fri, 13 Dec 1996, Martin J. Duerst wrote:
> 
> > > Then let's make this file (and a little bit of code to
> > > extract the desired warning) available to implementors,
> 
> good so far
> 
> > > and let's ask them to just send the strings out as is,
> >       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > and just silently ignore the antiquated ISO-8859-1 default
> >       ^^^^^^^^^^^^^^^^^^^^
> > > for warnings, and silently change that to UTF-8.
> >                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> Is it possible for any Unicode character, when converted to
> UTF-8, to contain a byte which is 0D or 0A? I suspect, looking at
> the UTF-8 algorithm, that this is the case. How then should a
> compliant HTTP/1.1 implementation tell when the HTTP reason code is
> finished, if it can contain bytes that look like CR and LF?

Chris (and everybody who might have had any doubts or concerns
about this): UTF-8 leaves all octet values between 00 and 7F
untouched. Any ASCII character or C0 control character, converted
to UTF-8, looks exactly the same as before. And whatever exotic
character you take from ISO 10646, every octet that represents
it in UTF-8 has its high bit set (80 or above), so there is
never any chance that one of them may be mistaken for a C0
control or ASCII character such as CR or LF, even if your UTF-8
parser gets hopelessly out of sync. If we had such basic
problems, I would never have dared to suggest UTF-8 in the
first place.
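
For anyone who wants to check this rather than take it on trust,
here is a minimal sketch in Python. The function names are
illustrative only and come from no spec or library. It assumes the
encoder follows the standard UTF-8 algorithm and simply walks every
encodable code point:

```python
def utf8_is_ascii_transparent(limit=0x110000):
    """Return True if UTF-8 never reuses octets 00-7F for non-ASCII characters.

    Illustrative helper (hypothetical name): walks every code point
    below `limit` and inspects the octets of its UTF-8 encoding.
    """
    for cp in range(limit):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points cannot be encoded in UTF-8
        encoded = chr(cp).encode("utf-8")
        if cp < 0x80:
            # ASCII and C0 controls must encode to the identical single octet.
            if encoded != bytes([cp]):
                return False
        elif any(octet < 0x80 for octet in encoded):
            # Some octet of a multi-octet sequence fell into 00-7F.
            return False
    return True


def find_line_end(header_bytes):
    """Return the index of the first CR LF pair, or -1 if there is none."""
    return header_bytes.find(b"\r\n")


# A header value with non-ASCII text, terminated the usual way.
warning_text = "警告: Ünïcödé".encode("utf-8")
line = warning_text + b"\r\n"
```

Since no octet of a multi-octet sequence falls in 00-7F, a plain
byte search for CR LF such as find_line_end(line) can only match the
real terminator, wherever the parser happens to start looking.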

> > If you think that
> > part of it is really unacceptable, you should try to take that up with
> > the http-wg
> 
> Yes, there is always room for a compelling and well argued case. However,
> is this really the number one i18N issue? Response codes are for human
> debugging; it is probably more important to ensure that multilingual
> content can be delivered, multilingual response codes are really just
> icing.

I agree that it's not the number one i18n issue. That's probably why
it has not received serious attention.
On the other hand, I think that for i18n, we have to stop letting
bad or antiquated design go by unchallenged.
Also, the http warnings might be the first place where anything
except 7-bit is allowed *officially* in internet application protocol
headers. Having such a lopsided spec as "ISO-8859-1 or RFC1522",
at a place that is just made for UTF-8 (and for which UTF-8 was
made), creates a very bad precedent. Accepting the argument that
ISO-8859-1 was used for "consistency" also creates a very bad precedent.

Overall, I think that if it is a small issue, there should not
be much resistance to getting it right. There seems to be virtually
no installed base, and the current discussion has not shown
any good arguments for ISO-8859-1. The main issues seem to be
procedural concerns, on which I am open to any reasonable
solution whatsoever (be it a last-minute change to the RFC
on request of the wg, a separate RFC, a mutual understanding,
or whatever).

Regards,    Martin.

Received on Monday, 16 December 1996 09:37:38 UTC