- From: Martin J. Duerst <mduerst@ifi.unizh.ch>
- Date: Thu, 5 Dec 1996 12:34:51 +0100 (MET)
- To: Erik van der Poel <erik@netscape.com>
- cc: Alan Barrett/DUB/Lotus <Alan_Barrett/DUB/Lotus.LOTUSINT@crd.lotus.com>, www-international <www-international@w3.org>, bobj <bobj@netscape.com>, wjs <wjs@netscape.com>, Chris Lilley <Chris.Lilley@sophia.inria.fr>, Ed Batutis/CAM /Lotus <Ed_Batutis/CAM/Lotus@crd.lotus.com>
On Wed, 4 Dec 1996, Erik van der Poel wrote:

> > Browser vendors are not keen to send a very long list of character sets
> > accepted due to the overhead.
>
> Right. This is one concern that keeps coming up over here at Netscape.
>
> > What do people think about this suggestion? Will it work for servers? I am
> > really keen to give servers a chance to return UTF-8. How do servers today
> > return UTF-8 when Accept-Charset is not generally being sent to them?
>
> Servers cannot send UTF-8 to clients unless they know that the client is
> capable of decoding it or there is a large critical mass of browsers in
> the installed base that is known to be capable of decoding UTF-8.

I think there are two ways to look at the Accept-Charset stuff. One way is to say "all 'charset's are equal", which means that we have to send a long list in Accept-Charset, or long bit vectors, or such. The other is to realize that there is quite some structure in the problem, which can help us see what we really need.

The structure, as I see it, has three levels:

(1) UTF-8 as an encoding that covers pretty much everything, and that we want to help gain acceptance. This group might include some other encodings of Unicode/ISO 10646, but not too many.

(2) A list of well-used and widely accepted encodings, ideally one for each "region" of the world. For Western Europe, this is iso-8859-1. We want servers to send this, and not something from the next category.

(3) All the special variants, alternative designations, and garbage "charset" parameters.

To help keep the net clean of an uncontrolled proliferation of encodings, things in the last category definitely should not be sent in Accept-Charset.

Now let's have a look at the server side. I can imagine three kinds of servers:

- The old servers that ignore Accept-Charset. A long list does not help them.
- Servers that choose one of several already existing versions. I don't think that these will become really popular.
- Servers that translate on the fly. I can't imagine that they have a document in a class (3) "charset" and don't know how to convert it to class (2) or UTF-8.

So there is no need to send anything beyond class (2).

Now let's analyse what indeed has to be sent of class (2). In theory, if a server can convert to UTF-8, that's all you need. The main problem with UTF-8 is that it may not be as efficient as other encodings. However, for a general Latin 1 text (where accented characters are rather sparse), the difference between UTF-8 and iso-8859-1 is small. Differences are larger elsewhere: for pure Japanese it's about a 50% overhead (three bytes per character instead of two), and for Indic scripts the overhead is 200% (three bytes instead of one). But then again, compression will reduce that overhead very nicely.

So in practice, I could see the following solutions for Accept-Charset:

- Send UTF-8 if you can accept it, and nothing else.
- Send UTF-8 and/or a careful selection of class (2) "charset"s.

Regards, Martin.
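The size figures above can be checked with a rough sketch like the following. It is only an illustration: the sample strings, the function names, and the choice of EUC-JP as the Japanese legacy encoding are assumptions of this example, not part of the message. Python's standard library has no ISCII codec, so the Indic case assumes a hypothetical 8-bit encoding that spends one byte per code point.

```python
def utf8_overhead(utf8_bytes: int, legacy_bytes: int) -> float:
    """Extra size of UTF-8 relative to a legacy encoding, as a fraction."""
    return (utf8_bytes - legacy_bytes) / legacy_bytes


def overhead_vs_codec(text: str, codec: str) -> float:
    """Compare UTF-8 against a legacy codec that Python can encode to."""
    return utf8_overhead(len(text.encode("utf-8")), len(text.encode(codec)))


def overhead_vs_one_byte_per_char(text: str) -> float:
    """Compare UTF-8 against a hypothetical 8-bit legacy encoding.

    ASSUMPTION: the legacy encoding uses exactly one byte per code point,
    so its size is simply len(text).
    """
    return utf8_overhead(len(text.encode("utf-8")), len(text))


# Illustrative samples (assumptions of this sketch).
LATIN1_SAMPLE = "The café serves a naïve but very good crème brûlée every single day."
JAPANESE_SAMPLE = "日本語のテキスト"
DEVANAGARI_SAMPLE = "नमस्ते"

results = {
    "Latin 1 vs iso-8859-1": overhead_vs_codec(LATIN1_SAMPLE, "iso-8859-1"),
    "Japanese vs EUC-JP": overhead_vs_codec(JAPANESE_SAMPLE, "euc_jp"),
    "Devanagari vs 8-bit": overhead_vs_one_byte_per_char(DEVANAGARI_SAMPLE),
}

for label, overhead in results.items():
    print(f"{label}: {overhead:.0%} overhead")
```

Real documents also contain ASCII markup, spaces, and digits, which take one byte in UTF-8, so the overhead for a whole HTML page would normally be lower than for pure script text.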
Received on Thursday, 5 December 1996 06:35:43 UTC