On Sun, 8 Dec 1996, Keld Jørn Simonsen wrote:

> Koen Holtman writes:
>
> > But skimming the UTF-8 specification, I gather that UTF-8 is an encoding
> > mechanism, not a character set.
>
> Well, no. UTF8 is an encoding of characters. It implies the character
                                               ^^^^^^^^^^^^^^^^^^^^^^^^
> repertoire of ISO 10646. So it is a charset in MIME sense, including
  ^^^^^^^^^^^^^^^^^^^^^^^
> the specific character definitions of 10646.

If that is taken seriously, then "Accept-Charset: utf-8" cannot be used
merely to state which character encoding a client can decode. It implies
that (at least when sent in the utf-8 encoding) all characters from the
10646 repertoire are acceptable.

It seems predictable that e.g. "Accept-Charset: koi8-r,iso-8859-1,utf-8"
will be used to indicate "documents containing only characters that are
also in the koi8-r or latin-1 repertoires are acceptable in utf-8
encoding", because there is currently no better way to express that
(other than maybe with language tags, which have other problems already
mentioned: e.g. transliteration/transcription, and languages that do not
imply exactly one character repertoire).

If such an interpretation of "utf-8", i.e. effectively using it like
another Content-Transfer-Encoding or C-E, becomes widespread, the fact
that "utf-8" implies the full 10646 repertoire will be totally lost.

This is of course not specific to HTTP or the Web; protocols without
negotiation, like mail, need charset labelling. A simple MIME-compliant
MUA should have sufficient information from the message headers to
dispatch to the appropriate viewer. In the pre-UTF era this was reliably
possible, e.g. with metamail (given the correct charset parameter and
the availability of an appropriate codepage). With messages labelled
"utf-8", heuristics have to be involved.

Klaus

Received on Tuesday, 10 December 1996 00:46:15 GMT
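
The two readings of "Accept-Charset: koi8-r,iso-8859-1,utf-8" contrasted in the message can be illustrated with a minimal Python sketch. It is not taken from any HTTP implementation: the function names (parse_accept_charset, in_repertoire, acceptable) are hypothetical, q-values are ignored, and Python's built-in codecs are assumed to be a good enough stand-in for each charset's character repertoire.

```python
def parse_accept_charset(header):
    """Split an Accept-Charset value into lowercase charset names.

    Simplification (assumption): q-values and other parameters are dropped.
    """
    names = []
    for item in header.split(","):
        name = item.split(";", 1)[0].strip().lower()
        if name:
            names.append(name)
    return names


def in_repertoire(ch, charset):
    """True if the character can be encoded by Python's codec for charset."""
    try:
        ch.encode(charset)
        return True
    except (UnicodeEncodeError, LookupError):
        # Not in the repertoire, or Python has no codec under that name.
        return False


def acceptable(text, header, strict):
    """Could text be sent to this client in utf-8?

    strict=True : "utf-8" is a charset in the MIME sense, so listing it
                  accepts the full 10646 repertoire.
    strict=False: "utf-8" is treated as a mere encoding; only characters
                  found in one of the other listed charsets are accepted.
    """
    charsets = parse_accept_charset(header)
    if "utf-8" not in charsets:
        return False  # this sketch only considers delivery in utf-8
    if strict:
        return True
    legacy = [c for c in charsets if c != "utf-8"]
    if not legacy:
        return True  # nothing listed that could restrict the repertoire
    return all(any(in_repertoire(ch, c) for c in legacy) for ch in text)
```

Under the loose reading, a document mixing Cyrillic and Latin-1 text passes for the header above while a Japanese one does not; under the strict reading both pass, which is exactly the information that would be lost if the loose reading became widespread.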