Re: Accept-Charset support




> # If there is a need for a client to express "I can understand UTF-8,
> # but can only display some of the 10646 characters: ..." - and I 
> # definitely think there is such a need - I do not see a way to implement
> # this cleanly.
 
On Sat, 7 Dec 1996, Larry Masinter wrote:
> I think this kind of communication is along the same lines as: "I can
> implement all of HTML 3.5 tables, except I don't know anything about
> the 'border' parameter".
> 
> That is, there may be a need to communicate special subset
> capabilities, but usually those limitations are transient and too
> fine-grained to actually matter in real communication.

That does not look like a fair comparison.  Whatever HTML 3.5 tables
are, understanding the border parameter looks like a minor thing, as
you say.  But not being able to say what characters I can understand
would matter a lot in real communication.

Saying "I can understand 10646" or "I can understand UTF-8" practically
just means that I can decode that character encoding.  That is on the
same level as saying "I can understand 8-bit character sets" without
specifying which.  If anything more detailed is too fine-grained to
really matter then I don't see why anybody should currently bother 
to use Accept-Charset: ISO-8859-2 etc.
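
To make the difference between "can decode" and "can display" concrete,
here is a rough sketch in Python.  The client, its font coverage
(DISPLAYABLE_RANGES) and the helper names are all made up for
illustration; they do not come from any spec or browser.

```python
# Hypothetical client: it can decode any UTF-8 byte stream, but we assume
# its fonts cover only Basic Latin, Latin-1 Supplement and Greek.
DISPLAYABLE_RANGES = [
    (0x0020, 0x007E),  # Basic Latin (printable part)
    (0x00A0, 0x00FF),  # Latin-1 Supplement
    (0x0370, 0x03FF),  # Greek and Coptic
]


def can_display(ch):
    """True if the assumed fonts have a glyph for this character."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in DISPLAYABLE_RANGES)


def undisplayable(text):
    """Characters that decode fine but that the client cannot show."""
    return [ch for ch in text if not can_display(ch)]


# Greek and Hebrew sample words, as they would arrive on the wire.
greek = "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac".encode("utf-8")
hebrew = "\u05e2\u05d1\u05e8\u05d9\u05ea".encode("utf-8")

for raw in (greek, hebrew):
    text = raw.decode("utf-8")  # decoding succeeds for both
    missing = undisplayable(text)
    print(len(text), "characters decoded,", len(missing), "not displayable")
```

Both byte strings decode without error, so "Accept-Charset: utf-8" is
truthful in either case; only the repertoire check tells the two apart.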
 
> In general, in the web, we've avoided catering to fine-grained
> differentiation of client capabilities. Yes, you can say "I speak
> postscript" or not, but there's no good way to say "I can take
> postscript files but don't give me any that won't look good on little
> pieces of paper".

But whether that text is readable for me or appears as complete garbage
(because I couldn't tell the server about my character repertoire)
is a bit more significant than whether something looks good or bad.

If I move from sending (say) Accept-Charset: iso-8859-3 to 
Accept-Charset: utf-8 (because my browser now understands that character
encoding), then I *lose* the capability to express what is more important
for the human user: what characters I can actually see.  And the
overloading of Accept-Language with character repertoire meaning seems
to show that there is a perceived need to express character repertoire
capabilities.
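
A small sketch of what gets lost, again in Python and under my own
assumptions (the helper names repertoire() and fits() are hypothetical,
and a real server would of course do more than this): a label like
iso-8859-3 implies a repertoire that a server can check a document
against, whereas utf-8 implies nothing of the kind.

```python
def repertoire(codec):
    """Set of characters reachable through a single-byte codec."""
    chars = set()
    for b in range(256):
        try:
            chars.add(bytes([b]).decode(codec))
        except UnicodeDecodeError:
            pass  # byte value is not assigned in this charset
    return chars


def fits(text, chars):
    """True if every character of text lies within the given repertoire."""
    return all(ch in chars for ch in text)


LATIN3 = repertoire("iso-8859-3")

maltese = "G\u0127a\u0121eb"                 # uses h-stroke and g-dot
hebrew = "\u05e2\u05d1\u05e8\u05d9\u05ea"

# "Accept-Charset: iso-8859-3" lets the server infer what I can see:
print(fits(maltese, LATIN3))
print(fits(hebrew, LATIN3))

# "Accept-Charset: utf-8" does not: both texts encode without error,
# so the label alone carries no repertoire information.
for text in (maltese, hebrew):
    text.encode("utf-8")
```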

Given the structure of the MIME "charset" parameter (and therefore of
the Accept-Charset header), the logical way to at least preserve what
can currently be expressed about repertoire would be to register lots
of additional charsets: we'd then have ISO-10646-Unicode-Latin2,
ISO-10646-Unicode-Latin3, ISO-10646-Unicode-Latin4, and so on.  Well,
I can see why that isn't very inviting; it looks like a big can of worms...
What I cannot understand is how the loss of existing expressive
capability for negotiation (of something *essential*) can be seen as
a step forward.

> There _is_ a proposal for allowing profiles of capabilities to be
> expressed and negotiated, and the proposal is elaborated in internet
> drafts:
> 	draft-holtman-http-negotiation-04.txt
> 	draft-ietf-http-feature-reg-00.txt
> and related topics in:
> 	draft-mutz-http-attributes-02.txt
> 	draft-goland-http-headers-00.txt
> from your nearby internet drafts directory. Perhaps 'support for
> particular subsets of ISO-10646' might fit into this category.

I am thinking of 'need for...' rather than 'support for...'.

Maybe that is the most practical way.  But no mechanism is in place yet,
while overloading the language header (and associated inventiveness with
new HTML tags) can be done now... 

Come to think of it, putting 'particular subsets of ISO-10646' under
feature tag registration wouldn't work.  Other protocols like mail
presumably will also need a way to say "these are Latin42 characters
encoded with UTF-8".  I don't think that an HTTP/HTML/Web-specific
feature tag registration can take over the IANA charset registry's
function.

BTW, it seems those drafts specifically exclude "MIME type, charset,
and language" from the new feature tags.  Probably because they are
too essential.  For all practical purposes Hebrew characters encoded
as UTF-8 (or raw 16-bit) *are* a different charset from Greek characters
encoded the same way.

  Klaus




