Re: Accept-Charset support from Klaus Weide on 1996-12-07 (www-international@w3.org from October to December 1996)

From: Klaus Weide <kweide@tezcat.com>
Date: Sat, 7 Dec 1996 03:24:11 -0600 (CST)
To: Larry Masinter <masinter@parc.xerox.com>
cc: Chris.Lilley@sophia.inria.fr, www-international@w3.org
Message-ID: <Pine.SUN.3.95.961207022947.19063W-100000@xochi.tezcat.com>

On Thu, 5 Dec 1996, Larry Masinter wrote:

[snipped from a longer message:]
> I think the simple thing to do is to send:
> 
> 	accept-charset: utf-8,iso-8859-5
> 
> if you're a browser and can display utf-8 and 8859-5 as well as
> 8859-1.  

It seems more appropriate to say "...if you can decode utf-8 and display
8859-5".  The problem is that "utf-8" doesn't carry any useful information
about available character repertoire (whereas iso-8859-5 does) unless
we assume that it will be normal for a browser (or other web client)
to have _all_ of the 10646 characters available (in which case all 
discussion about Accept-Charset would be rather pointless).

Interpreting "accept-charset: utf-8,iso-8859-5" as meaning "I can
accept utf-8 character encoding for all characters in the 8859-5
repertoire" is an obvious hack, one that will no doubt be employed.
Another hack is to guess from Accept-Language headers (or other
language information) which characters are really available.  Both
hacks would be using the Accept-* headers in a way that does not
conform to their defined meaning, and would likely lead to another
round of interoperability problems if no better way is clearly defined.
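
To make the first hack concrete, here is a minimal sketch (in Python) of
what a server applying it might do. It is an illustration only, not
anything the HTTP drafts define. The function names, the UCS_ENCODINGS
set, and the rule "the displayable repertoire is the union of all
non-UCS charsets listed, plus iso-8859-1" are all assumptions made for
this example. It also only handles single-byte charsets and ignores
q-values.

```python
# Hypothetical sketch of the "repertoire guessing" hack described above.
# Nothing here is defined by HTTP; all names are illustrative.

# Charset names treated as pure encodings of 10646 (assumed list).
UCS_ENCODINGS = {"utf-8", "utf-7", "iso-10646-ucs-2", "iso-10646-ucs-4"}


def parse_accept_charset(value):
    """Split an Accept-Charset value into lowercase names, dropping q-values."""
    names = []
    for item in value.split(","):
        name = item.split(";", 1)[0].strip().lower()
        if name:
            names.append(name)
    return names


def repertoire(charset):
    """Characters reachable from single bytes in `charset` (single-byte sets only)."""
    chars = set()
    for b in range(256):
        try:
            chars.add(bytes([b]).decode(charset))
        except UnicodeDecodeError:
            continue        # byte undefined in this charset
        except LookupError:
            return set()    # charset unknown to this Python ("*" lands here too)
    return chars


def guessed_repertoire(header_value):
    """Guess what the client can display from the non-UCS charsets it lists."""
    displayable = repertoire("iso-8859-1")  # assumed always acceptable
    for name in parse_accept_charset(header_value):
        if name not in UCS_ENCODINGS:
            displayable |= repertoire(name)
    return displayable


def can_send_as_utf8(text, header_value):
    """True if utf-8 is listed and every character falls in the guessed repertoire."""
    if "utf-8" not in parse_accept_charset(header_value):
        return False
    return set(text) <= guessed_repertoire(header_value)
```

With "accept-charset: utf-8,iso-8859-5", such a server would send Cyrillic
text as UTF-8 but would refuse to send Arabic text that way, although
the header itself never says so. That is exactly the kind of meaning
the header is not defined to carry.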

If there is a need for a client to express "I can understand UTF-8,
but can only display some of the 10646 characters: ..." - and I 
definitely think there is such a need - I do not see a way to implement
this cleanly.  This is a limitation of the MIME charset model which
mixes character encoding and repertoire aspects ("charset considered
harmful" etc...).  Or rather it is a limitation following from the fact
that no more than a handful of "10646 sub-repertoire charsets" have
been registered, for which the IANA registry file has reserved a range:

 "The second region (1000-1999) is for the Unicode and
ISO/IEC 10646 coded character sets together with a specification of a
(set of) sub-repertoires that may occur."

And none of those registered charsets are currently being considered
for being among "the canonical few".  (Most of them appear to be
vendor specific.  As a result, the only straightforward way today to
say "I accept Unicode for Arabic characters" appears to be 
"Accept-Charset: ISO-Unicode-IBM-1264".  [I am just going by the IANA
registry's description and don't know anything more about that specific
Presentation Set].)

I don't know whether all this has already been discussed to death in
previous iterations, but so far I have not found an RFC or I-D or similar
with a clear answer to these issues.  I am left with the impression that
this list already lives in the bright new future where all UCS2 (and UCS4)
characters are available to everything that has a CPU or can print - but
I don't see it just yet.  And clients with restricted capabilities would
be much more in need of usable {charset,language,*}-negotiation than those
browsers-of-the-future that can do everything.

  Klaus

Received on Saturday, 7 December 1996 04:24:04 UTC