- From: Martin J. Duerst <mduerst@ifi.unizh.ch>
- Date: Tue, 10 Dec 1996 15:33:40 +0100 (MET)
- To: Erik van der Poel <erik@netscape.com>
- cc: Francois Yergeau <yergeau@alis.com>, www-international@w3.org, Klaus Weide <kweide@tezcat.com>
On Mon, 9 Dec 1996, Erik van der Poel wrote:

> > The choice between "UNICODE-1-1-UTF-8" and "UTF-8" has been debated at
> > length on the ISO10646 and Unicode lists, with the result that we have now:
> > "UTF-8". The wise implementer, however, would be well advised to support
> > the longer tag as an ad hoc alias.
>
> I'm not sure what you mean by "ad hoc alias", but the term "alias" is
> used in this context (Internet "charsets") to mean a synonym. Are
> "unicode-1-1-utf-8" and "utf-8" synonymous? If so, what is the name of
> UTF-8-encoded Unicode 2.0?
>
> Unicode 1.1 and 2.0 are not the same. In particular, there was a big
> change in the Korean block. The Korean characters in the U+3400 to
> U+3D2D range were removed, and they were added again with some others in
> the U+AC00 to U+D7A3 range. A future version of the Unicode standard may
> re-use the U+3400 to U+3D2D range. If/when that happens, what does
> "utf-8" mean?

In my opinion, the fact that RFC 2044 refers to Unicode 1.1 is an
inconvenient historical coincidence. The RFC was submitted shortly
before Unicode 2.0 came out (which had been expected for a long time).
I guess the general consensus is that UTF-8 should denote Unicode 2.0
rather than Unicode 1.1 in cases where it really matters.

> Without rehashing the whole debate that you say already took place on
> those other mailing lists (which I didn't follow), could you briefly
> explain the future plans for the charset name "utf-8"? I glanced at RFC
> 2044 but didn't immediately see anything about this.

Here is what I remember from that discussion:

- Shortness to show it is important.
- No versioning to reduce the number of "charset" parameter values.
- No versioning because for most things (except Korean), it does
  not matter.
- No versioning to show that there is (basically) only one
  character set (UCS) that is encoded.
- No versioning to create pressure to avoid further shuffling
  of codepoints (a real stupidity).

Regards, Martin.
Received on Tuesday, 10 December 1996 09:33:40 UTC