- From: Albert Lunde <Albert-Lunde@nwu.edu>
- Date: Sun, 15 Jan 1995 15:30:36 -0600
- To: html-wg@oclc.org, http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
(note cross-posting)

At 1:00 PM 1/15/95, Gavin Nicol wrote on html-wg:

>>>Also, do we really want to get into the business of multi-charsets w/in 1
>>>document??
>>
>>Emphatically yes!
>
>Well, even if we wanted to, we cannot. SGML does not have any way of
>defining that a given bit combination belongs to more than one
>character class. In other words, documents containing multiple
>character sets must be "normalised" *before* the parser sees the
>data.
>
>In my earlier paper I pointed this out, and it is one reason for using
>Unicode. As Larry noted, multilingual documents can be written using a
>coded character set that includes codes for the desired language, and
>in no other way.
>
>We emphatically *do* want multilingual capabilities, so we must not
>restrict ourselves to US-ASCII or ISO-8859-1, but we most certainly do
>not want multiple character sets per document: that path is a long
>road leading to madness.
>
>>>I hope not, otherwise all the discussion on a header line with the
>>>desired charset for negotiating on a preferred format is for
>>>nothing. (I ask for a document in EUC but it has JIS or SJIS
>>>intermixed; how could I grok those parts?)
>>
>>First thing, the different charsets have to be identifiable, and that
>>means tagging.
>
>No. As I said before, SGML has no (working) way of handling this. The
>data *must* be normalised. Dan has spent a long time making HTML a
>conforming application of SGML, and this would invalidate all that
>effort (as well as making it *very* difficult to write generic SGML
>viewers that could also handle HTML).
>
>Say "yes" to Accept-Charset:
>Say "NO" to multiple character sets.

I think allowing documents to be in a single character set drawn from
ISO-8859-X, for the same values of "X" allowed in MIME, is a fairly
non-controversial extension to HTML/HTTP. (Not for HTML 2.0, but HTML
2.x.) A rough sketch of the sort of request/response exchange I have in
mind is appended below.

Can we cite some outside source for additional character set names that
will include Unicode and a reasonable assortment of other national
character encodings not covered by ISO-8859-X, like ISO-2022-JP, so we
don't have to act as the body that picks allowed character sets and wind
up with yet another WWW-specific variation? It's more important to pick
a well-defined name space than to have all browsers support everything.

I'm not totally convinced that transferring a whole document in a single
encoding, a la Unicode, is the _only_ way to handle multi-lingual
documents, though I'm not an SGML expert and could use some discussion
on this. At least the characters used in tagging need to be mapped into
a single character set before parsing. (This would seem easier in codes
that have US-ASCII as a proper subset.)

Another possibility would be to define a meta-encoding for multiple
character sets, where the escape codes to shift character sets would not
be represented in _any_ of the character sets. It would then be up to a
multi-lingual HTML implementer to provide a pre-processor to get this
information into a form an SGML parser could deal with (maybe by
normalizing to a combined character set, maybe by adding extra markup);
a rough sketch of such a pre-processor is also appended below. This does
sound less elegant than Unicode, but I'd like to hear more about why it
won't work before ruling it out.

---
Albert Lunde                      Albert-Lunde@nwu.edu
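For concreteness, here is a rough sketch of the negotiation I have in
mind, written in Python purely as illustration: the host, path, and
charset labels are made up, and the exact header syntax is whatever the
HTTP spec finally settles on. The client asks for a document in a
particular charset and reads back the single charset the server labels
the reply with.

    # Rough sketch of Accept-Charset negotiation.  The host, path,
    # and charset labels below are illustrative only.
    import http.client

    conn = http.client.HTTPConnection("www.example.org")
    conn.request("GET", "/doc.html",
                 headers={"Accept-Charset": "iso-8859-2, unicode-1-1"})
    resp = conn.getresponse()

    # A server honouring the request might label its reply with, e.g.
    #   Content-Type: text/html; charset=iso-8859-2
    print(resp.getheader("Content-Type"))

    # Decode the entity body in whatever single charset was negotiated.
    charset = resp.headers.get_content_charset("us-ascii")
    text = resp.read().decode(charset)
    conn.close()

The point is only that one coded character set labels the whole entity
body; the client never has to guess at shifts in mid-document.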
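And here is a minimal sketch of the pre-processor idea, assuming a
made-up shift sequence of the form ESC { charset-name } that is
guaranteed not to occur inside any of the component character sets. The
shift syntax, the charset labels, and the choice of UTF-8 as the
normalization target are all assumptions for the sake of illustration,
not a proposal in themselves.

    # Minimal sketch of the "meta-encoding" pre-processor idea.
    # A hypothetical shift sequence of the form  ESC { charset-name }
    # marks a change of character set and is assumed not to occur
    # inside any of the component character sets.  Everything is
    # normalized to a single target encoding before the SGML/HTML
    # parser ever sees the data.
    import re

    SHIFT = re.compile(rb'\x1b\{([A-Za-z0-9._-]+)\}')

    def normalize(data, default='us-ascii', target='utf-8'):
        out = []
        charset = default
        pos = 0
        for m in SHIFT.finditer(data):
            # Decode the run preceding the shift in the current charset,
            # then re-encode it in the single target charset.
            out.append(data[pos:m.start()].decode(charset).encode(target))
            charset = m.group(1).decode('ascii')   # switch character sets
            pos = m.end()
        out.append(data[pos:].decode(charset).encode(target))
        return b''.join(out)

    # Example: US-ASCII markup with an ISO-8859-2 run embedded in it.
    doc = (b'<P>plain text, then Polish: '
           b'\x1b{iso-8859-2}\xb3\xf3dka\x1b{us-ascii} and back.</P>')
    print(normalize(doc).decode('utf-8'))

Note that the markup itself stays in the US-ASCII-compatible default
throughout, which is why I suspect this is only tractable for codes that
have US-ASCII as a proper subset. This is meant to make the idea
concrete, not to argue that it is the right design.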
Received on Sunday, 15 January 1995 13:34:23 UTC