charset issues

     Prof J Larmouth,  University Director of Telematic Applications, 
     IT Institute,  University of Salford,  Salford M5 4WT,  England.

J.Larmouth @ ITI.SALFORD.AC.UK                Telephone: +44 161 745 5657
                                                    Fax: +44 161 745 8169
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
To:     www_international@w3.org                 
Subject:      charset issues
>(1) UTF-8 as an encoding that covers pretty much everything, and that
>        we want to help gain acceptance. This group might include
>        some other encodings of Unicode/ISO 10646, but not too many.

No others,  I think.  We really only want ONE de facto standard,  not even
two,  and UTF-8 scores well on most criteria.

>(2) A list of well used and widely accepted encodings, ideally one for
>        each "region" of the world. For Western Europe, this is
>        iso-8859-1. We want servers to send this, and not something
>        from the next category.

I agree.  The approach of saying "anything goes provided you label it" is
better than we have now,  but is NOT good enough for the long term. 
Pragmatically,  I would like to see the responsibility on server managers to
ensure that the stuff they store is strictly UTF-8 or 8859-x,  recognising
that such a requirement is NOT going to be easily satisfied (people often
FTP into server directories from a wide variety of work stations).  Of
course,  if some servers want to be more complex,  and store in Windows code
pages and translate to 8859 for delivery,  they can do that - it just costs
them CPU cycles.  But delivery should be in a "sensible" charset.
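
A minimal sketch of the "store in a Windows code page, translate to 8859
for delivery" case, assuming Python's standard codecs.  The function name
and the default charsets are illustrative assumptions, not anything a
particular server does.  It also shows the cost beyond CPU cycles:
characters with no 8859-1 equivalent have to be substituted.

```python
def transcode_for_delivery(raw: bytes,
                           stored: str = "cp1252",
                           delivered: str = "iso-8859-1") -> bytes:
    """Decode bytes from the storage charset and re-encode for delivery.

    Characters that the delivery charset cannot represent are replaced
    with '?' (errors="replace"), so the translation can lose information.
    """
    text = raw.decode(stored)
    return text.encode(delivered, errors="replace")


# Accented Latin letters survive unchanged.
print(transcode_for_delivery(b"caf\xe9"))

# Windows "smart quotes" (0x93, 0x94) do not exist in 8859-1.
print(transcode_for_delivery(b"\x93hi\x94"))
```

A server delivering UTF-8 instead would avoid the substitution,  since every
code page 1252 character has a Unicode equivalent.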

>(3) All the special variants, alternative designations, and garbage
>        "charset" parameters.
>To help keep the net clean of an uncontrolled proliferation of
>encodings, things in the last category definitely should not
>be sent in Accept-Charset.

Just a useless remark saying "I agree".  There is a case for allowing (e.g.)
8859-7 (and the other parts of 8859 - apart from 8859-1) for the indefinite
future because it allows a one octet per char representation of Greek where
UTF-8 needs two octets per char,  but there is NO case for the long-term use
of other encodings that are no more compact than UTF-8.
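
The octet counts can be checked with a short sketch,  assuming Python's
standard "iso8859_7" and "utf-8" codecs.  The sample phrase is an arbitrary
choice for illustration.

```python
greek = "Καλημέρα κόσμε"  # arbitrary sample: 13 Greek letters and one space

as_8859_7 = greek.encode("iso8859_7")
as_utf8 = greek.encode("utf-8")

# 8859-7 spends one octet on every character; UTF-8 spends two on each
# Greek letter and one on the ASCII space.
print(len(greek), "characters")
print(len(as_8859_7), "octets in 8859-7")
print(len(as_utf8), "octets in UTF-8")
```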

>Now let's have a look at the server side. I can imagine three
>kinds of servers:

> .....

>- Servers that translate on the fly. I can't imagine that they
>        have a document in a class (3) "charset", and don't
>        know how to convert it to class (2), or UTF-8.
>        So there is no need to send anything beyond class (2).

I think this comment ignores the problem of "document management" on the
server.  If multiple users are FTPing documents into the server from a
variety of desk-tops (which is the case for many sites),  the problem of the
server identifying the original charset is not trivial.  Heuristics and
spell-checkers???  Ugh!!!
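
To illustrate why identification is not trivial: the same octets are valid
in every 8859 part,  so decoding alone cannot tell them apart.  A minimal
sketch,  assuming Python's standard codecs; the helper name and the three
candidate charsets are illustrative assumptions.

```python
def candidate_decodings(raw: bytes, charsets) -> dict:
    """Return every charset under which `raw` decodes without error."""
    results = {}
    for charset in charsets:
        try:
            results[charset] = raw.decode(charset)
        except UnicodeDecodeError:
            pass  # not valid in this charset, so it can be ruled out
    return results


raw = bytes([0xC1, 0xE2, 0xE3])  # three octets uploaded with no label
decodings = candidate_decodings(raw, ["iso8859_1", "iso8859_5", "iso8859_7"])
for charset, text in decodings.items():
    print(charset, text)
```

All three decodings succeed and all three give different text (Latin,
Cyrillic and Greek letters),  so only knowledge of the language,  or a label
from the uploader,  can pick the right one.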

>Now let's analyse what indeed has to be sent of class (2).
>In theory, if a server can convert to UTF-8, that's all you
>need. The main problem with UTF-8 is that it may not be
>as efficient as other encodings. However, for a general
>Latin 1 text (where accented characters are rather sparse),
>the difference between UTF-8 and iso-8859-1 is small.

I agree.  This is an important point.  8859-1 should die a death.  But
8859-7,  for text which is simple Greek text,  is half the size of UTF-8.
Does this matter?  The Web bandwidth (storage and transfer) is dominated by
non-text documents,  so does a doubling of size of Greek text (for storage
and transfer) matter??   Only the Greeks can tell us. (And the Arabs,  and
the Israelis,  etc).  It would be nice if they said "It doesn't matter. 
Let's all agree on UTF-8 for storage and transfer."  Comments?
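
For the Latin 1 half of the point,  a rough sketch of how small the overhead
is when accented characters are sparse.  It assumes Python's standard
codecs; the helper name and the sample sentence are illustrative
assumptions,  and real text will of course vary.

```python
def utf8_overhead(text: str, legacy: str = "iso-8859-1") -> float:
    """Percentage by which the UTF-8 form exceeds the legacy-charset form."""
    legacy_size = len(text.encode(legacy))
    utf8_size = len(text.encode("utf-8"))
    return 100.0 * (utf8_size - legacy_size) / legacy_size


sample = "The café served crème brûlée to forty guests."
print(f"{utf8_overhead(sample):.1f}% larger in UTF-8")
```

Each accented letter costs one extra octet in UTF-8,  so the overhead is
simply the proportion of non-ASCII characters in the text.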

>Differences are larger for e.g. pure Japanese, it's about a
>50% overhead. For Indic scripts, the overhead is 200%.
>But then again, compression will reduce that overhead very

Again,  a very good (compression) point.  BUT ....  have I missed something? 
Is compression for HTTP transfers becoming a de facto standard?  (Or even a
technically agreed approach?)  I think not.

>So in practice, I could see the following solutions for

>- Send UTF-8 if you can accept it, and nothing else.

Yes.  And then the browser worries about the font problems.  

And WHY,  in all these discussions,  do we continually concentrate on
delivery of pages and ignore forms input?

>- Send UTF-8 and/or a careful selection of class (2)
>        "charset"s.

Yes.  You send the class (2) charsets for which you have fonts.
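
A minimal sketch of that browser-side rule,  assuming a hypothetical
allow-list of class (2) charsets (the four names below are placeholders,
not an agreed list) and a plain comma-separated Accept-Charset value.

```python
# Hypothetical class (2) list, for illustration only.
CLASS_2 = ("iso-8859-1", "iso-8859-2", "iso-8859-5", "iso-8859-7")


def accept_charset(charsets_with_fonts, utf8_ok: bool = True) -> str:
    """Build an Accept-Charset value: UTF-8 first, then only those
    class (2) charsets the browser has fonts for.  Anything outside
    the allow-list (class (3)) is never advertised."""
    have_fonts = {name.lower() for name in charsets_with_fonts}
    offered = ["utf-8"] if utf8_ok else []
    offered += [name for name in CLASS_2 if name in have_fonts]
    return ", ".join(offered)


print("Accept-Charset:", accept_charset(["ISO-8859-7", "x-mac-roman"]))
```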

John L