- From: Klaus Weide <kweide@tezcat.com>
- Date: Sat, 7 Dec 1996 03:24:11 -0600 (CST)
- To: Larry Masinter <masinter@parc.xerox.com>
- cc: Chris.Lilley@sophia.inria.fr, www-international@w3.org
On Thu, 5 Dec 1996, Larry Masinter wrote:
[snipped from a longer message:]
> I think the simple thing to do is to send:
>
>    accept-charset: utf-8,iso-8859-5
>
> if you're a browser and can display utf-8 and 8859-5 as well as
> 8859-1.

It seems more appropriate to say "...if you can decode utf-8 and display 8859-5". The problem is that "utf-8" doesn't carry any useful information about available character repertoire (whereas iso-8859-5 does) unless we assume that it will be normal for a browser (or other web client) to have _all_ of the 10646 characters available (in which case all discussion about Accept-Charset would be rather pointless).

Interpreting "accept-charset: utf-8,iso-8859-5" as meaning "I can accept utf-8 character encoding for all characters in the 8859-5 repertoire" is an obvious hack, one that will no doubt be employed. Another hack is to guess from Accept-Language headers (or other language information) which characters are really available. Both hacks would be using the Accept-* headers in a way that does not conform to their defined meaning, and would likely lead to another round of interoperability problems if no better way is clearly defined.

If there is a need for a client to express "I can understand UTF-8, but can only display some of the 10646 characters: ..." - and I definitely think there is such a need - I do not see a way to implement this cleanly. This is a limitation of the MIME charset model, which mixes character encoding and repertoire aspects ("charset considered harmful" etc...). Or rather, it is a limitation following from the fact that no more than a handful of "10646 sub-repertoire charsets" have been registered, for which the IANA registry file has reserved a range:

   "The second region (1000-1999) is for the Unicode and ISO/IEC 10646 coded character sets together with a specification of a (set of) sub-repetoires that may occur."
And none of those registered charsets is currently being considered for being among "the canonical few". (Most of them appear to be vendor specific.) As a result, the only straightforward way today to say "I accept Unicode for Arabic characters" appears to be "Accept-Charset: ISO-Unicode-IBM-1264". [I am just going by the IANA registry's description and don't know anything more about that specific Presentation Set.]

I don't know whether all this has already been discussed to death in previous iterations, but so far I have not found an RFC or I-D or similar with a clear answer to these issues.

I am left with the impression that this list already lives in the bright new future where all UCS2 (and UCS4) characters are available to everything that has a CPU or can print - but I don't see it just yet. And clients with restricted capabilities would be much more in need of usable {charset,language,*}-negotiation than those browsers-of-the-future that can do everything.

   Klaus
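
To make the first "hack" concrete, here is a minimal Python sketch of how a server might read "accept-charset: utf-8,iso-8859-5" as "UTF-8 is fine, but only for characters in the 8859-5 repertoire". This is an illustration of the non-conforming interpretation discussed in the message, not something any specification defines. All function and constant names are hypothetical, and treating ISO-8859-1 as implicitly acceptable is an assumption taken from the quoted "as well as 8859-1".

```python
# Hypothetical sketch of the "repertoire hack"; none of these names
# come from a specification or an existing library.

# Charsets that only name an encoding of ISO 10646 and say nothing
# about which characters the client can display (assumed list).
UNIVERSAL_ENCODINGS = {"utf-8"}

# Assumption: ISO-8859-1 is acceptable even when it is not listed.
IMPLICIT_CHARSETS = ["iso-8859-1"]


def parse_accept_charset(header_value):
    """Return the listed charset names, lowercased, ignoring parameters."""
    names = []
    for item in header_value.split(","):
        name = item.split(";", 1)[0].strip().lower()
        if name:
            names.append(name)
    return names


def in_repertoire(char, charset):
    """True if Python's codec for `charset` can encode `char`."""
    try:
        char.encode(charset)
    except (UnicodeEncodeError, LookupError):
        # Not in the repertoire, or a charset name Python does not know.
        return False
    return True


def utf8_ok_under_hack(text, header_value):
    """Apply the hack: UTF-8 must be listed, and every character must
    belong to the repertoire of some other listed (or implicit) charset."""
    listed = parse_accept_charset(header_value)
    if "utf-8" not in listed:
        return False
    repertoires = [n for n in listed if n not in UNIVERSAL_ENCODINGS]
    repertoires += IMPLICIT_CHARSETS
    return all(
        any(in_repertoire(ch, cs) for cs in repertoires) for ch in text
    )
```

Under this reading, utf8_ok_under_hack("Привет", "utf-8,iso-8859-5") accepts Cyrillic text, while the same call with Arabic text, or with only "utf-8" in the header, is rejected. The sketch also shows why the approach is fragile: the repertoire is inferred from charsets that were listed for a different purpose.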
Received on Saturday, 7 December 1996 04:24:04 UTC