Re: Accept-Charset support from Klaus Weide on 1996-12-10 (www-international@w3.org from October to December 1996)

From: Klaus Weide <kweide@tezcat.com>
Date: Mon, 9 Dec 1996 23:45:40 -0600 (CST)
To: Keld Jørn Simonsen <keld@dkuug.dk>
cc: www-international@w3.org, http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
Message-ID: <Pine.SUN.3.95.961209224611.22922L-100000@huitzilo.tezcat.com>

On Sun, 8 Dec 1996, Keld Jørn Simonsen wrote:
> Koen Holtman writes:
> 
> > But skimming the UTF-8 specification, I gather that UTF-8 is an encoding
> > mechanism, not a character set.
> 
> Well, no. UTF8 is an encoding of characters. It implies the character
                                               ^^^^^^^^^^^^^^^^^^^^^^^^
> repertoire of ISO 10646. So it is a charset in MIME sense, including
  ^^^^^^^^^^^^^^^^^^^^^^^
> the specific character definitions of 10646.

If that is taken seriously, then "Accept-Charset: utf-8" cannot be used
to just send information about what character encoding a client can
decode.  It implies that (at least when sent in the encoding of utf-8)
all characters from the 10646 repertoire are acceptable.

It seems predictable that e.g. "Accept-Charset: koi8-r,iso-8859-1,utf-8"
will be used to indicate "documents containing only characters that are
also in the koi8-r or latin-1 repertoires are acceptable in utf-8 encoding",
because there is currently no better way to express that (other than
maybe with language tags, which have other problems already mentioned:
e.g. transliteration/transcription, and languages that do not imply exactly
one character repertoire).
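
To make that narrow reading concrete, here is a minimal Python sketch of
the check a server would have to make under it. The function names are
made up for illustration, q-values are ignored, and nothing here is
specified by HTTP; it only shows what "utf-8 as a mere encoding" would
mean in practice.

```python
def parse_accept_charset(header):
    """Split an Accept-Charset value into lowercase names (q-values dropped)."""
    return [item.split(";")[0].strip().lower()
            for item in header.split(",") if item.strip()]


def encodable(ch, charset):
    """True if the single character ch exists in the given charset."""
    try:
        ch.encode(charset)
        return True
    except (UnicodeEncodeError, LookupError):
        # LookupError: charset name unknown to this Python installation.
        return False


def acceptable_as_utf8(text, header):
    """Narrow reading (an assumption, not protocol): utf-8 may be sent only
    if every character also exists in one of the other listed charsets."""
    names = parse_accept_charset(header)
    if "utf-8" not in names:
        return False
    others = [n for n in names if n != "utf-8"]
    if not others:
        # utf-8 listed alone: nothing restricts the repertoire.
        return True
    return all(any(encodable(ch, cs) for cs in others) for ch in text)


header = "koi8-r,iso-8859-1,utf-8"
print(acceptable_as_utf8("Привет, café", header))
print(acceptable_as_utf8("日本語", header))
```

Under the MIME sense of "utf-8" the second document would be acceptable
as well, which is exactly the difference between the two readings.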

If such an interpretation of "utf-8", i.e. effectively using it like another
Content-Transfer-Encoding or Content-Encoding, becomes widespread, the fact
that "utf-8" implies the full 10646 repertoire will be totally lost.

This is of course not specific to HTTP or the Web; protocols without
negotiation, like mail, need charset labelling too.  A simple MIME compliant
MUA should have sufficient information from message headers to dispatch 
to the appropriate viewer.  In the pre-UTF era this was reliably possible 
e.g. with metamail (given the correct charset parameter and availability of
the appropriate codepage).  With messages labelled "utf-8", heuristics have
to be involved.
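
A rough Python sketch of the difference, under stated assumptions: the
viewer table and its entries are hypothetical, and taking the first word
of the Unicode character name as the "script" is only a crude heuristic
of the kind meant above, not how metamail or any real MUA works.

```python
import unicodedata

# Hypothetical table: charset label -> display resource a simple MUA picks.
VIEWER_FOR_CHARSET = {
    "iso-8859-1": "latin1-font",
    "koi8-r": "cyrillic-font",
}


def scripts_used(body):
    """Heuristic: decode a utf-8 body and collect the first word of each
    letter's Unicode name (e.g. LATIN, CYRILLIC, CJK)."""
    scripts = set()
    for ch in body.decode("utf-8"):
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def choose_viewer(charset, body):
    """Legacy charsets dispatch on the label alone; utf-8 forces a look
    at the body to find out which scripts must be displayable."""
    charset = charset.lower()
    if charset in VIEWER_FOR_CHARSET:
        return [VIEWER_FOR_CHARSET[charset]]
    if charset == "utf-8":
        return sorted(scripts_used(body))
    raise ValueError("no viewer known for charset " + charset)


print(choose_viewer("koi8-r", b"\xf0\xd2\xc9\xd7\xc5\xd4"))
print(choose_viewer("utf-8", "Привет café".encode("utf-8")))
```

For the labelled legacy charsets the body is never examined; for "utf-8"
the header alone says nothing about which fonts are needed.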

  Klaus

Received on Tuesday, 10 December 1996 00:46:15 UTC