Re: Accept-Charset support from Martin J. Duerst on 1996-12-10 (www-international@w3.org from October to December 1996)

From: Martin J. Duerst <mduerst@ifi.unizh.ch>
Date: Tue, 10 Dec 1996 15:33:40 +0100 (MET)
To: Erik van der Poel <erik@netscape.com>
cc: Francois Yergeau <yergeau@alis.com>, www-international@w3.org, Klaus Weide <kweide@tezcat.com>
Message-ID: <Pine.SUN.3.95.961210141212.245E-100000@enoshima>

On Mon, 9 Dec 1996, Erik van der Poel wrote:

> > The choice between "UNICODE-1-1-UTF-8" and "UTF-8" has been debated at
> > length on the ISO10646 and Unicode lists, with the result that we have now:
> > "UTF-8".  The wise implementer, however, would be well advised to support
> > the longer tag as an ad hoc alias.
> 
> I'm not sure what you mean by "ad hoc alias", but the term "alias" is
> used in this context (Internet "charsets") to mean a synonym. Are
> "unicode-1-1-utf-8" and "utf-8" synonymous? If so, what is the name of
> UTF-8-encoded Unicode 2.0?
> 
> Unicode 1.1 and 2.0 are not the same. In particular, there was a big
> change in the Korean block. The Korean characters in the U+3400 to
> U+3D2D range were removed, and they were added again with some others in
> the U+AC00 to U+D7A3 range. A future version of the Unicode standard may
> re-use the U+3400 to U+3D2D range. If/when that happens, what does
> "utf-8" mean?

In my opinion, the fact that RFC 2044 refers to Unicode 1.1 is
an inconvenient historical coincidence. The RFC was submitted shortly
before Unicode 2.0 came out (which had been expected for a long time).
I guess the general consensus is that UTF-8 should denote
Unicode 2.0 rather than Unicode 1.1 in cases where it really matters.
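
To make the "ad hoc alias" idea concrete, here is a minimal sketch of how
an implementation might accept the longer tag while treating "utf-8" as
the canonical name. The table and the function names (CHARSET_ALIASES,
canonical_charset, hangul_block) are hypothetical and only for
illustration; the two code point ranges are the ones Erik quotes above.

```python
# Hypothetical alias table: only the two labels discussed in this thread.
CHARSET_ALIASES = {
    "unicode-1-1-utf-8": "utf-8",
}

# Korean ranges as quoted in this thread (inclusive bounds).
OLD_HANGUL = range(0x3400, 0x3D2D + 1)   # Unicode 1.1 positions
NEW_HANGUL = range(0xAC00, 0xD7A3 + 1)   # Unicode 2.0 positions


def canonical_charset(label):
    """Map a charset label to its canonical lowercase name.

    Labels are compared case-insensitively; unknown labels are
    returned lowercased and otherwise unchanged.
    """
    name = label.strip().lower()
    return CHARSET_ALIASES.get(name, name)


def hangul_block(code_point):
    """Say which of the two quoted Korean ranges a code point falls in.

    Returns "unicode-1.1", "unicode-2.0", or None if it is in neither.
    """
    if code_point in OLD_HANGUL:
        return "unicode-1.1"
    if code_point in NEW_HANGUL:
        return "unicode-2.0"
    return None


def decode_body(data, label):
    """Decode bytes using the canonical name for the given label."""
    return data.decode(canonical_charset(label))
```

With this sketch, a body labelled "UNICODE-1-1-UTF-8" is decoded exactly
like one labelled "utf-8". The alias says nothing about which Unicode
version is meant; hangul_block only shows where the two versions differ.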


> Without rehashing the whole debate that you say already took place on
> those other mailing lists (which I didn't follow), could you briefly
> explain the future plans for the charset name "utf-8"? I glanced at RFC
> 2044 but didn't immediately see anything about this.

Here is what I remember from that discussion:

- A short name, to show that it is important.
- No versioning to reduce the number of "charset" parameter values.
- No versioning because for most things (except Korean), it does
	not matter.
- No versioning to show that there is (basically) only one
	character set (UCS) that is encoded.
- No versioning to create pressure to avoid further shuffling
	of codepoints (a real stupidity).

Regards,	Martin.

Received on Tuesday, 10 December 1996 09:33:40 UTC