Re: character sets in HTTP: translation tables

Larry Masinter writes:

>Unicode is one of the answers. It is perfectly possible to do
>multi-lingual HTML without Unicode, for those who would prefer to do
>so. 

  I have never said that Unicode is "the only answer", and in fact, I
  specifically mention ways of *not* using it in my paper. Indeed,
  there is absolutely no reason why what you outline, and what I
  describe in my paper, cannot coexist. 

>Unicode may be the answer for you and for me, but it's become
>abundantly clear that it isn't the answer for some folks, and we don't
>actually need to force the issue this way.

  I have heard many objections to Unicode, and upon discussion, I have
  found that most are related to display, or unsupported language
  feature issues. If we have "some folks" out there, please raise your
  objections now: I would be very interested in hearing them, and even
  more interested to hear if some kind of "hint" is not sufficient to
  handle the objection.

>I've found the capabilities alluded to in <URL:http://www.nada.kth.se/

  Yes, this is interesting stuff, and can be used for part of what I
  talked about.

>Receivers might be expected to either (1) understand the character
>sets the sender is able to present or (2) call some proxy that is able
>to translate the document into a character set the client knows.
.....
[Some detail deleted]

>Current practice is that servers offer documents in the character sets
>they have, and clients either display them correctly or attempt to
>translate them.  If you browse the net looking for web sites by
>country, you'll come up with lots of examples.
....
[Some examples of charset specific browsers and servers deleted]

>In general, users with large collections of documents in national
>character sets will just make them available in those forms. 

  Because there is nothing else to make them available in?

>They're currently expecting the reader, if they're really interested,
>to obtain support for the character encoding used in the documents on
>those servers. A regime where the documents are properly labelled, and
>the client fetches the translation table the first time in order to
>properly display the documents:
>
> a) will work well

  This seems a rather premature claim.

> b) is efficient

  Hmm. When I think of 200,000 accesses from the US to a Japanese site
  by browsers that do not support SJIS, EUC, Arabic, and Thai, all of
  which are used within a single document (lord only knows how!),
  resulting in 200,000 requests for each of those translation tables,
  and compare this to a single server "wasting" cycles doing a
  conversion to UTF-8 (which would probably be cached after the first
  access), I have trouble imagining this to be more efficient.
  Requiring browsers to call a proxy is at least as inefficient as
  server conversion (actually more so, due to the redirect), and it
  relies upon the same (or probably more complicated) caching
  techniques for performance.
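
  To put rough numbers on that scenario, here is a small arithmetic
  sketch (every figure in it is an assumption invented purely for
  illustration, not a measurement):

      # Back-of-the-envelope only; all sizes below are assumptions.
      accesses   = 200000         # accesses by clients lacking the charsets
      tables     = 4              # SJIS, EUC, Arabic, Thai
      table_size = 128 * 1024     # assume ~128 KB per byte-to-16-bit table
      doc_size   = 20 * 1024      # assume a 20 KB document

      # Each client fetches the document plus all four tables, as above.
      client_side = accesses * (doc_size + tables * table_size)

      # The server converts once, caches the UTF-8 form, and serves that;
      # assume (pessimistically) that UTF-8 doubles the size of the text.
      server_side = accesses * (doc_size * 2)

      print(round(client_side / server_side))   # roughly 13x more bytes moved

  Under these (admittedly invented) assumptions, the table traffic
  dwarfs the cost of serving the cached UTF-8 conversion.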
   
  Worse, doing in-client translation will almost certainly require
  16 bits *somewhere* within the client... and what is everything
  being translated *into*? Unless you are translating into some
  virtual character set (which would need to be at least as large as
  the largest national character set: read 16 bits), or into Unicode,
  or transliterating, you will also need to reprogram the parsers for
  the different data formats you purport to support, so that they can
  recognise which class a given character belongs to for parsing
  purposes (think of SGML parsing, or even HTML parsing). In addition,
  your proposal generally assumes a single language and a single
  character set per document; mine assumes (in the "core" case) a
  single character set and multiple languages (though your proposal
  can indeed emulate mine via proxy).
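
  To make the "16 bits *somewhere*" point concrete: even the
  translation-table approach ends up holding values that do not fit in
  a byte. A small sketch (the table format is something I am inventing
  for illustration; the byte values come from the Shift-JIS codec
  itself):

      # A deliberately tiny, invented table: two-byte SJIS sequences
      # mapped to their target code points.
      sjis_table = {u'漢'.encode('shift_jis'): ord(u'漢'),
                    u'字'.encode('shift_jis'): ord(u'字')}

      def translate(data, table):
          # Walk the byte string, consuming two-byte sequences the table
          # knows and falling back to single bytes otherwise.
          out, i = [], 0
          while i < len(data):
              pair = data[i:i+2]
              if pair in table:
                  out.append(table[pair]); i += 2
              else:
                  out.append(data[i]); i += 1
          return out   # a list of code points

      cps = translate(u'漢字'.encode('shift_jis'), sjis_table)
      print([hex(c) for c in cps])   # ['0x6f22', '0x5b57']: neither fits in 8 bits

  Whatever the table maps *into*, the client must carry characters
  wider than 8 bits internally from that point on.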

  My proposal requires people to update clients once (to achieve basic
  functionality). Your proposal requires all clients to also be changed
  at least once, to add support for translation tables (unless they are
  to be forever stuck with transliteration, and the overhead thereof).
  My proposal requires changes to servers; so does yours.

> c) has a reasonable transition from current practice.

  True. So does my proposal. In the worst case, current practice (or
  something close to it) is supported.

>The servers don't have the computational resources to translate the
>documents on-the-fly to Unicode, nor is such translation particularly
>efficient in terms of network bandwidth.

  I don't buy this, especially if the server uses good caching
  techniques. 
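
  By "good caching techniques" I mean nothing more exotic than
  converting a document the first time it is requested in a given
  charset and serving the cached result thereafter. A minimal sketch
  (file names and charset names are illustrative only):

      _cache = {}   # (path, target charset) -> converted bytes

      def serve(path, source_charset, target_charset='utf-8'):
          # Convert the stored document once; every later request for
          # the same (path, target) pair is served from the cache.
          key = (path, target_charset)
          if key not in _cache:
              with open(path, 'rb') as f:
                  text = f.read().decode(source_charset)
              _cache[key] = text.encode(target_charset)
          return _cache[key]

      # The first request pays for the conversion; the rest do not:
      #   body = serve('/docs/paper.html.euc', 'euc_jp')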

  The biggest problem I have with this idea is that in the worst case
  (and indeed, I would argue in many cases), the client and server
  will not be able to communicate without help. Additionally, it
  encourages the creation of browsers with widely disparate
  capabilities and version numbers (read: bug lists), and it breaks
  the web into "villages" with their own culture, own language, own
  GUI, etc.

  We need a common base, a lingua franca, so that in the worst case
  (no external help available), we can communicate. Unicode is the
  only thing I know of that offers a reasonable many->one->many
  path. Additionally, it promises to offer a *single* processing model
  for character data of *mixed* languages (at least until it needs to
  be displayed), and most (all?) parsers could use Unicode as a base
  for character classification, and hence the parsing tables would not
  need to be reprogrammed. Eventually, free tools and libraries *will*
  appear to aid in processing and display.
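
  To show what I mean by a single processing model: once the bytes have
  been decoded into Unicode, one character-classification routine can
  serve a parser no matter what encoding the document arrived in. A
  sketch (the name-character rule below is deliberately simplified, not
  the real SGML one):

      def is_name_char(cp):
          # Simplified name-character test on a Unicode code point; the
          # point is that it is written once, not once per charset.
          return (0x30 <= cp <= 0x39 or 0x41 <= cp <= 0x5A or
                  0x61 <= cp <= 0x7A or cp > 0x7F)

      def scan_name(text):
          # Collect the leading run of name characters from decoded text.
          i = 0
          while i < len(text) and is_name_char(ord(text[i])):
              i += 1
          return text[:i]

      # The same routine serves Latin and Japanese text alike, provided
      # the bytes were decoded to Unicode first:
      print(scan_name(u'grüße>'))                                   # grüße
      print(scan_name(u'漢字>'.encode('euc_jp').decode('euc_jp')))  # 漢字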

> accept-parameter: charset=unicode-1-1-utf7

  OK. This looks fine to me. (Perhaps utf8 or ucs-2??)
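
  For concreteness, I read that as the client sending something along
  these lines (your header name, verbatim; whether the token ends up
  being utf7, utf8, or ucs-2 is the open question):

      GET /docs/paper.html HTTP/1.0
      accept-parameter: charset=unicode-1-1-utf7

  and the server, if it can oblige, labelling the response with the
  same charset in its Content-Type.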

>The MIME draft standard makes no such claims. There is a document
>being circulated by the mail extensions working group which makes

  Thank you for clarifying this.

BTW. I tend to believe that unless a program was specifically designed
to manipulate charsets and/or encodings, the actual processing within
it can be abstracted away from concrete codes. If a program/system
relies upon concrete codes, it is BAD (Broken As Designed) in my
books, though certainly, using Unicode codes gives a greater range of
flexibility than other code sets.
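
A contrived example of the distinction (the byte ranges below are
ISO 8859-1 specific, which is precisely the problem):

    # BAD: hard-wires the concrete codes of one encoding (ISO 8859-1),
    # misclassifies 0xD7 and 0xF7, and breaks outright on SJIS bytes.
    def is_letter_bad(byte):
        return 0x41 <= byte <= 0x5A or 0x61 <= byte <= 0x7A or byte >= 0xC0

    # Abstracted: decode first, then ask about the character itself.
    def is_letter(ch):
        return ch.isalpha()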

Received on Thursday, 29 December 1994 05:19:59 UTC