character sets in HTTP: translation tables

(The conversation about character sets seems to be intertwined on a
couple of mailing lists.)

> 3. Is Unicode the answer?

Unicode is one of the answers. It is perfectly possible to do
multi-lingual HTML without Unicode, for those who would prefer to do
so. Unicode may be the answer for you and for me, but it's become
abundantly clear that it isn't the answer for some folks, and we don't
actually need to force the issue this way.

I've found the capabilities alluded to in
<URL:http://www.nada.kth.se/i18n/c3/> intriguing as part of "the
answer", in particular the ability to translate richer character sets
into legible US ASCII when necessary (e.g., for Lynx and VT100
terminals with limited font repertoires).
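
To make "legible US ASCII" concrete, here is a minimal sketch in
Python. It is not how C3 does it; it assumes the text has already been
brought into Unicode internally, and it leans on Unicode decomposition
plus a small FALLBACK table that I made up for illustration.

```python
import unicodedata

# Hypothetical fallback table for characters that have no ASCII
# decomposition; a real transliteration table would be far larger.
FALLBACK = {"ß": "ss", "æ": "ae", "Æ": "AE", "ø": "o", "Ø": "O"}


def to_legible_ascii(text, unknown="?"):
    """Transliterate text into US-ASCII, keeping it readable where possible."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)                      # already ASCII
        elif ch in FALLBACK:
            out.append(FALLBACK[ch])            # hand-written replacement
        else:
            # Split into base character + combining marks, keep ASCII parts.
            decomposed = unicodedata.normalize("NFKD", ch)
            ascii_part = "".join(c for c in decomposed if ord(c) < 128)
            out.append(ascii_part or unknown)   # nothing legible: placeholder
    return "".join(out)
```

With this sketch, to_legible_ascii("Malmö") gives "Malmo", and a
character with no ASCII relative at all, such as a Greek capital omega,
comes out as "?".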

Receivers might be expected to either (1) understand the character
sets the sender is able to present or (2) call some proxy that is able
to translate the document into a character set the client knows.

Of course, those proxies and translators may well use Unicode
internally (as does C3), but you don't have to declare any particular
character set as the 'canonical' one.

I'm thinking of a regime where:

* conforming servers use only registered character encodings ("charset")
* character encoding registration requires supplying translation
  tables (making sure there are widely available replicated tables &
  services) that allow translation (transliteration, etc.) of the
  character encoding given to the other registered character
  encodings, either directly or via some intermediate form.
* senders that wish to send documents in a non-standard character
  encoding may do so, but only if they are also willing to send the
  translation table for that character encoding, or a pointer to a
  receiver-accessible copy of it.

In this way:

* conforming document senders are required to supply documents in a
  registered character encoding, but may choose whichever encodings
  they want to use.
* conforming document receivers may choose the character encodings
  they're able to accept.
* if a conforming sender and a conforming receiver don't have a common
  character encoding, they may find a translation table to do the
  mapping.
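
The regime above can be sketched as follows. This is only an
illustration under stated assumptions: Python's bundled single-byte
codecs stand in for registered translation tables, Unicode characters
serve as the intermediate form, and REGISTRY, translate and receive are
names I invented here. KOI8-R is used as a stand-in for a KOI8 charset.

```python
def build_table(codec_name):
    """Build a byte -> character table for a single-byte encoding.

    Assumption: a Python codec plays the role of a registered table.
    """
    table = {}
    for byte in range(256):
        try:
            table[byte] = bytes([byte]).decode(codec_name)
        except UnicodeDecodeError:
            continue  # byte is unassigned in this encoding
    return table


# Hypothetical registry: charset label -> translation table.
REGISTRY = {
    "ISO-8859-5": build_table("iso8859_5"),
    "KOI8-R": build_table("koi8_r"),
}


def translate(data, source, target, registry=REGISTRY, unknown=b"?"):
    """Translate bytes from source to target via the intermediate form."""
    to_chars = registry[source]
    from_chars = {char: byte for byte, char in registry[target].items()}
    out = bytearray()
    for byte in data:
        char = to_chars.get(byte)
        if char is not None and char in from_chars:
            out.append(from_chars[char])
        else:
            out += unknown  # no equivalent in the target encoding
    return bytes(out)


def receive(data, sent_as, receiver_accepts, registry=REGISTRY):
    """Return (charset, bytes) the receiver can use, or None to refuse."""
    if sent_as in receiver_accepts:
        return sent_as, data  # common encoding, nothing to do
    for target in receiver_accepts:
        if sent_as in registry and target in registry:
            return target, translate(data, sent_as, target, registry)
    return None  # no table available: do not display the document
```

A receiver that accepts only ISO-8859-5 and is handed a KOI8-R document
gets back the same text re-encoded as ISO-8859-5; a receiver with no
table for either side gets None and should not display the document.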

Within the context of HTTP, I think it is most reasonable to say that
senders *may* do translation, but receivers *must* do translation (or
else not display/accept the sender's document).

Fortunately, code translation tables are small enough that we might
actually expect receivers to be able to dynamically fetch them.
Character set translation tables seem like a good application for URN
replication services.

Clients used while disconnected from the network will of course need
built-in translation tables for the encodings used in the documents
they're likely to encounter.
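
A client-side table store along these lines might look like the sketch
below. The TableCache class and its fetch callable are hypothetical;
fetch stands for whatever retrieval mechanism (a URN resolver, a plain
HTTP GET) the client actually has, and passing None models a client
with no network.

```python
class TableCache:
    """Hold built-in translation tables and fetch missing ones once."""

    def __init__(self, fetch=None, builtin=None):
        # fetch: callable taking a charset name and returning its table,
        #        or None for a client used without a network connection.
        # builtin: tables shipped with the client (charset name -> table).
        self._fetch = fetch
        self._tables = {name.lower(): table
                        for name, table in (builtin or {}).items()}

    def get(self, charset):
        key = charset.lower()
        if key not in self._tables:
            if self._fetch is None:
                raise LookupError("no translation table for " + charset)
            # Fetched only the first time; later lookups hit the cache.
            self._tables[key] = self._fetch(key)
        return self._tables[key]
```

The point of the design is that the cost of fetching a table is paid
once per character encoding, not once per document.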

I don't think that it is necessary or actually possible to require
servers to translate the character sets of their documents. It might
be useful to ask that HTTP servers that supply documents in
non-standard (or infrequently used) character sets also be able to
supply the translation tables for those character sets.  But in
general isn't possible really to 'require' servers to do more work
than they're prepared to do.

Current practice is that servers offer documents in the character sets
they have, and clients either display them correctly or attempt to
translate them.  If you browse the net looking for web sites by
country, you'll come up with lots of examples.

For example, <URL:http://www.ariadne-t.gr/apodimoi/index.html> says
"The information is in Greek using the ELOT 928 character set. It
will display nicely in Mosaic for MS Windows IF you have installed the
Greek version of Windows 3.1 or WfW."

If you look at <URL:http://www.free.net/Docs/cyrillic/notes.en.html>,
it notes that the servers support the ISO-8859-5 Cyrillic character
set, the KOI8 charset, and two DOS and MS-Windows charsets.

If you look at <URL:http://www.huji.ac.il/WWW_DIR/default.html>,
you'll see HTML that 'assumes a VT terminal with hebrew characters'.

In general, users with large collections of documents in national
character sets will just make them available in those forms. The
servers don't have the computational resources to translate the
documents on-the-fly to Unicode, nor is such translation particularly
efficient in terms of network bandwidth.

They're currently expecting the reader, if they're really interested,
to obtain support for the character encoding used in the documents on
those servers. A regime where the documents are properly labelled, and
the client fetches the translation table the first time in order to
properly display the documents:

 a) will work well
 b) is efficient
 c) has a reasonable transition from current practice.

================================================================
>   3.2.3 Accept-charset

As for the HTTP protocol element, I think we might be better off with

   accept-parameter: charset=unicode-1-1-utf7

than 

   accept-charset: unicode-1-1-utf7

For example, imagine that we may want to extend image/* types to have
a 'colors' and a 'width' and 'height' parameter, and to allow

   accept: image/gif
   accept: image/jpeg
   accept: image/tiff
   accept-parameter: width<=640
   accept-parameter: height<=480
   accept-parameter: colors<=256

In general, accept-parameter could be defined as "indicates
acceptable parameter values for those media types that take those
parameters". Unlike "accept", I think it should be within the protocol
spec for the server to ignore accept-parameter and supply what it
has if it cannot translate.
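
To show how a server might treat such a header, here is a rough sketch.
accept-parameter is only a proposal in this message, so the syntax
handled here (name=value and name<=number) and the function names are
my assumptions, not part of any spec.

```python
import re

# Assumed syntax: "name=value" or "name<=integer".
_CONSTRAINT = re.compile(r"\s*([A-Za-z][A-Za-z0-9-]*)\s*(<=|=)\s*(\S+)\s*")


def parse_accept_parameter(value):
    """Parse one accept-parameter value into (name, op, wanted)."""
    match = _CONSTRAINT.fullmatch(value)
    if match is None:
        raise ValueError("unrecognized accept-parameter: " + value)
    name, op, raw = match.groups()
    wanted = int(raw) if op == "<=" else raw.lower()
    return name.lower(), op, wanted


def satisfies(variant, constraints):
    """True if a variant meets every constraint that applies to it."""
    for name, op, wanted in constraints:
        if name not in variant:
            continue  # this media type does not take the parameter
        have = variant[name]
        if op == "<=" and not have <= wanted:
            return False
        if op == "=" and str(have).lower() != wanted:
            return False
    return True


def choose_variant(variants, header_values):
    """Pick the first acceptable variant, else supply what the server has."""
    constraints = [parse_accept_parameter(v) for v in header_values]
    for variant in variants:
        if satisfies(variant, constraints):
            return variant
    return variants[0]  # server ignores accept-parameter
```

Given the three image constraints above, a server holding a 1024x768
and a 640x480 rendition would pick the smaller one; a server holding
only the larger one would simply send it.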

================================================================
As a final note, re:

>  In addition, the MIME specification states that for the text/* data
>  types, all line breaks must be indicated by a CRLF pair. This implies
>  that certain encodings cannot be used within the text/* data types if
>  the WWW is to be strictly MIME conformant.

The MIME draft standard makes no such claims. There is a document
being circulated by the mail extensions working group which makes
stronger claims about text/* data types, but that document is not yet
even a proposed standard.

Received on Wednesday, 28 December 1994 12:28:29 UTC