Re: Accept-Charset support from Martin J. Duerst on 1996-12-05 (www-international@w3.org from October to December 1996)

From: Martin J. Duerst <mduerst@ifi.unizh.ch>
Date: Thu, 5 Dec 1996 12:34:51 +0100 (MET)
To: Erik van der Poel <erik@netscape.com>
cc: Alan Barrett/DUB/Lotus <Alan_Barrett/DUB/Lotus.LOTUSINT@crd.lotus.com>, www-international <www-international@w3.org>, bobj <bobj@netscape.com>, wjs <wjs@netscape.com>, Chris Lilley <Chris.Lilley@sophia.inria.fr>, Ed Batutis/CAM /Lotus <Ed_Batutis/CAM/Lotus@crd.lotus.com>
Message-ID: <Pine.SUN.3.95.961205120900.279C-100000@enoshima>

On Wed, 4 Dec 1996, Erik van der Poel wrote:

> > Browser vendors are not keen to send a very long list of character sets
> > accepted due to the overhead.
> 
> Right. This is one concern that keeps coming up over here at Netscape.

> > What do people think about this suggestion? Will it work for servers? I am
> > really keen to give servers a chance to return UTF-8. How do servers today
> > return UTF-8 when Accept-Charset is not generally being sent to them?
> 
> Servers cannot send UTF-8 to clients unless they know that the client is
> capable of decoding it or there is a large critical mass of browsers in
> the installed base that is known to be capable of decoding UTF-8.

I think there are two ways to look at the Accept-Charset stuff.
One way to see it is to say "all 'charset's are equal", which means
that we have to send a long list in Accept-Charset, or long bit vectors,
or such.
The other is to realize that there is quite some structure in the
problem, which can help us realize what we really need.

The structure, as I see it, has three levels:

(1) UTF-8 as an encoding that covers pretty much everything, and that
	we want to help gain acceptance. This group might include
	some other encodings of Unicode/ISO 10646, but not too many.

(2) A list of well used and widely accepted encodings, ideally one for
	each "region" of the world. For Western Europe, this is
	iso-8859-1. We want servers to send this, and not something
	from the next category.

(3) All the special variants, alternative designations, and garbage
	"charset" parameters.

To help keep the net clean of an uncontrolled proliferation of
encodings, things in the last category definitely should not
be sent in Accept-Charset.

Now let's have a look at the server side. I can imagine three
kinds of servers:

- The old servers that ignore Accept-Charset. A long list does
	not help them.

- Servers that choose one of several already existing versions.
	I don't think that these will become really popular.

- Servers that translate on the fly. I can't imagine that they
	would have a document in a class (3) "charset" and not
	know how to convert it to class (2) or to UTF-8.
	So there is no need to send anything beyond class (2).
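
The third kind of server could be sketched roughly as follows. This is
only an illustration, not a description of any existing server: the
names CLASS_2, parse_accept_charset, choose_charset and transcode are
made up for the example, the selection of "regional" charsets is
arbitrary, and q-values in the header are simply ignored.

```python
# Illustrative class (2) list; the actual selection is an assumption.
CLASS_2 = ["iso-8859-1", "iso-8859-2", "iso-2022-jp", "koi8-r"]


def parse_accept_charset(header):
    """Return the lowercased charset names in an Accept-Charset value.

    Parameters such as ";q=0.8" are dropped, not interpreted.
    """
    names = []
    for item in header.split(","):
        name = item.split(";")[0].strip().lower()
        if name:
            names.append(name)
    return names


def choose_charset(header, default="iso-8859-1"):
    """Prefer UTF-8, then any accepted class (2) charset, else the default."""
    accepted = parse_accept_charset(header or "")
    if "utf-8" in accepted:
        return "utf-8"
    for name in accepted:
        if name in CLASS_2:
            return name
    # Old client (no header) or only class (3) names: fall back.
    return default


def transcode(body, source, target):
    """Convert a stored document from its source charset to the chosen one."""
    return body.decode(source).encode(target)
```

A client that sends only "Accept-Charset: utf-8" gets UTF-8, and a
client that sends nothing gets the default, exactly as with an old
server.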

Now let's analyse what indeed has to be sent of class (2).
In theory, if a server can convert to UTF-8, that's all you
need. The main problem with UTF-8 is that it may not be
as efficient as other encodings. However, for a general
Latin 1 text (where accented characters are rather sparse),
the difference between UTF-8 and iso-8859-1 is small.
Differences are larger elsewhere: for pure Japanese, the
overhead is about 50% (three bytes per character instead of
two), and for Indic scripts it is 200% (three bytes instead
of one). But then again, compression will reduce that
overhead very nicely.
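
These figures can be checked with a few lines of Python. A rough
sketch only: the sample strings and the helper names (codec_size,
overhead) are invented for the example, EUC-JP stands in for the
two-byte Japanese encodings, and since Python ships no ISCII codec,
the Indic case simply assumes one byte per character in the 8-bit
encoding.

```python
def codec_size(text, codec):
    """Number of bytes needed for text in the given legacy codec."""
    return len(text.encode(codec))


def overhead(text, legacy_size):
    """Relative size increase of UTF-8 over a legacy encoding."""
    utf8_size = len(text.encode("utf-8"))
    return (utf8_size - legacy_size) / legacy_size


latin = "caf\u00e9 au lait"                          # one accented letter in twelve
japanese = "\u65e5\u672c\u8a9e"                      # three kanji
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"       # six Devanagari characters

latin_overhead = overhead(latin, codec_size(latin, "iso-8859-1"))   # about 8%
japanese_overhead = overhead(japanese, codec_size(japanese, "euc_jp"))  # 0.5
# Assumption: one byte per character in an 8-bit Indic encoding.
hindi_overhead = overhead(hindi, len(hindi))                         # 2.0
```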

So in practice, I could see the following solutions for
Accept-Charset:

- Send UTF-8 if you can accept it, and nothing else.

- Send UTF-8 and/or a careful selection of class (2)
	"charset"s.

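On the browser side, both solutions amount to very little code. A
minimal sketch, assuming a hypothetical helper accept_charset and a
hand-picked list of class (2) names supplied by the caller:

```python
def accept_charset(supports_utf8, regional=()):
    """Build a short Accept-Charset header line, or None if nothing to send.

    regional is a small, carefully chosen list of class (2) charsets;
    class (3) names should never be passed in.
    """
    values = []
    if supports_utf8:
        values.append("utf-8")
    values.extend(regional)
    if not values:
        return None
    return "Accept-Charset: " + ", ".join(values)
```

With supports_utf8 set and no regional list, the header is just
"Accept-Charset: utf-8", which keeps the overhead that browser
vendors worry about to a minimum.
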
Regards,	Martin.

Received on Thursday, 5 December 1996 06:35:43 UTC