- From: Larry Masinter <masinter@parc.xerox.com>
- Date: Wed, 28 Dec 1994 12:26:00 PST
- To: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com, www@unicode.org
(The conversation about character sets seems to be intertwined on a couple of mailing lists.)

> 3. Is Unicode the answer?

Unicode is one of the answers. It is perfectly possible to do multi-lingual HTML without Unicode, for those who would prefer to do so. Unicode may be the answer for you and for me, but it's become abundantly clear that it isn't the answer for some folks, and we don't actually need to force the issue this way.

I've found the capabilities alluded to in <URL:http://www.nada.kth.se/i18n/c3/> intriguing as part of "the answer", in particular, the ability to translate richer character sets into legible US ASCII when necessary (e.g., Lynx and VT100 terminals with limited font repertoires).

Receivers might be expected to either (1) understand the character sets the sender is able to present or (2) call some proxy that is able to translate the document into a character set the client knows. Of course, those proxies and translators may well use Unicode internally (as does C3), but you don't have to declare any particular character set as the 'canonical' one.

I'm thinking of a regime where:

* conforming servers use only registered character encodings ("charset")

* character encoding registration requires supplying translation tables (making sure there are widely available replicated tables & services) that allow translation (transliteration, etc.) of the character encoding given to the other registered character encodings, either directly or via some intermediate form.

* Senders that wish to send documents in a non-standard character encoding may do so, but only if they also (are willing to) send (a pointer to a receiver-accessible copy of) the translation table for that character encoding.

In this way:

* conforming document senders are required to supply documents in a registered character encoding, but may choose whichever encodings they want to use.

* conforming document receivers may choose the character encodings they're able to accept.

* if a conforming sender and a conforming receiver don't have a common character encoding, they may find a translation table to do the mapping.

Within the context of HTTP, I think it is most reasonable to say that senders *may* do translation, but receivers *must* do translation (or else not display/accept the sender's document.)

Fortunately, code translation tables are small enough that we might actually expect receivers to be able to dynamically fetch them. Character set translation tables seem like a good application for URN replication services. Clients used while disconnected from the network will of course need built-in translation tables for the encodings used in the documents they're likely to encounter.

I don't think that it is necessary or actually possible to require servers to translate the character sets of their documents. It might be useful to ask that HTTP servers that supply documents in non-standard (or infrequently used) character sets also be able to supply the translation tables for those character sets. But in general it isn't really possible to 'require' servers to do more work than they're prepared to do.

Current practice is that servers offer documents in the character sets they have, and clients either display them correctly or attempt to translate them. If you browse the net looking for web sites by country, you'll come up with lots of examples.

For example, <URL:http://www.ariadne-t.gr/apodimoi/index.html> says "The information is in Greek using the ELOT 928 character set. It will display nicely in Mosaic for MS Windows IF you have installed the Greek version of Windows 3.1 or WfW."

If you look at <URL:http://www.free.net/Docs/cyrillic/notes.en.html>, it notes that the servers support the ISO-8859-5 Cyrillic character set and the KOI8 charset, as well as two DOS and MS-Windows charsets.

If you look at <URL:http://www.huji.ac.il/WWW_DIR/default.html>, you'll see HTML that 'assumes a VT terminal with hebrew characters'.

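The regime proposed above (the receiver fetches a translation table when it shares no encoding with the sender) might be sketched roughly as follows. This is only an illustration under stated assumptions: the registry, `fetch_table`, and `receive` are hypothetical names, the tables are built from Python's bundled codecs instead of being fetched from a replicated network service, and Unicode serves as the intermediate form purely for convenience.

```python
# Hypothetical sketch of receiver-side translation via registered tables.
# Charset names here are Python codec names, chosen only for illustration.

def build_table(codec_name):
    """Map each byte value to a Unicode character (the intermediate form)."""
    table = {}
    for b in range(256):
        try:
            table[b] = bytes([b]).decode(codec_name)
        except UnicodeDecodeError:
            pass  # this byte is undefined in the charset
    return table

# Stand-in for a registry of translation tables, one per registered charset.
REGISTRY = {name: build_table(name) for name in ("koi8_r", "iso8859_5", "ascii")}

def fetch_table(charset):
    """Stand-in for dynamically fetching a table; raises KeyError if unregistered."""
    return REGISTRY[charset]

def receive(body, sender_charset, accepted):
    """Return (charset, bytes) the receiver can display, translating if needed."""
    if sender_charset in accepted:
        return sender_charset, body  # common encoding, nothing to do
    target = accepted[0]
    to_unicode = fetch_table(sender_charset)
    from_unicode = {ch: b for b, ch in fetch_table(target).items()}
    out = bytearray()
    for b in body:
        ch = to_unicode.get(b)
        # Characters with no counterpart in the target become "?".
        out.append(from_unicode.get(ch, ord("?")))
    return target, bytes(out)

if __name__ == "__main__":
    doc = "Привет".encode("koi8_r")
    charset, data = receive(doc, "koi8_r", ["iso8859_5"])
    print(charset, data.decode(charset))
```

A real translator would presumably transliterate instead of substituting "?", as the C3 work cited earlier suggests; the substitution here only keeps the sketch short.
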
In general, users with large collections of documents in national character sets will just make them available in those forms. The servers don't have the computational resources to translate the documents on-the-fly to Unicode, nor is such translation particularly efficient in terms of network bandwidth. They're currently expecting the reader, if they're really interested, to obtain support for the character encoding used in the documents on those servers.

A regime where the documents are properly labelled, and the client fetches the translation table the first time in order to properly display the documents:

a) will work well
b) is efficient
c) has a reasonable transition from current practice.

================================================================

> 3.2.3 Accept-charset

As for the HTTP protocol element, I think we might be better off with

    accept-parameter: charset=unicode-1-1-utf7

than

    accept-charset: unicode-1-1-utf7

For example, imagine that we may want to extend image/* types to have a 'colors' and a 'width' and 'height' parameter, and to allow

    accept: image/gif
    accept: image/jpg
    accept: image/tiff
    accept-parameter: width<=640
    accept-parameter: height<=480
    accept-parameter: colors<=256

In general, accept-parameter could be defined to be "indicate acceptable parameter values for those media types that take those parameters".

Unlike "accept", I think it should be within the protocol spec for the server to ignore accept-parameter and supply what it has if it cannot translate.

================================================================

As a final note, re:

> In addition, the MIME specification states that for the text/* data
> types, all line breaks must be indicated by a CRLF pair. This implies
> that certain encodings cannot be used within the text/* data types if
> the WWW is to be strictly MIME conformant.

The MIME draft standard makes no such claims.
There is a document being circulated by the mail extensions working group which makes stronger claims about text/* data types, but that document is not yet even a proposed standard.
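
Returning to the accept-parameter suggestion earlier in this message, here is a rough sketch of how a server might interpret such headers. The grammar is an assumption drawn only from the examples given (a name, then `=` or `<=`, then a value); `parse_accept_parameter`, `satisfies`, and `choose` are hypothetical names and not part of any HTTP specification.

```python
import re

# Assumed syntax, inferred from the examples only: name, "=" or "<=", value.
_CONSTRAINT = re.compile(r"^\s*([A-Za-z0-9-]+)\s*(<=|=)\s*(\S+)\s*$")

def parse_accept_parameter(value):
    """Split one accept-parameter value into (name, operator, value)."""
    m = _CONSTRAINT.match(value)
    if m is None:
        raise ValueError("unrecognized accept-parameter: %r" % value)
    return m.group(1).lower(), m.group(2), m.group(3)

def satisfies(params, constraints):
    """True if a variant's parameters meet every applicable constraint."""
    for name, op, wanted in constraints:
        if name not in params:
            continue  # this media type does not take the parameter
        have = params[name]
        if op == "<=":
            if int(have) > int(wanted):
                return False
        elif str(have).lower() != wanted.lower():
            return False
    return True

def choose(variants, constraints):
    """Pick the first acceptable variant, else fall back to what the server has."""
    for variant in variants:
        if satisfies(variant, constraints):
            return variant
    return variants[0]  # the server may ignore accept-parameter

if __name__ == "__main__":
    headers = ["width<=640", "height<=480", "colors<=256"]
    constraints = [parse_accept_parameter(h) for h in headers]
    variants = [
        {"type": "image/gif", "width": 1024, "height": 768, "colors": 256},
        {"type": "image/gif", "width": 640, "height": 480, "colors": 256},
    ]
    print(choose(variants, constraints))
```

The fallback in `choose` reflects the point made above that, unlike "accept", the server should be allowed to ignore accept-parameter and supply what it has.
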
Received on Wednesday, 28 December 1994 12:28:29 UTC