- From: Daniel W. Connolly <connolly@hal.com>
- Date: Mon, 20 Jun 1994 09:35:11 -0500
- To: "Vitaly Motyakov, IHEP, Protvino, Russia" <motyakov@mx.ihep.su>
- Cc: www-html@www0.cern.ch
In message <009800B2.5849B14D.25266@mx.ihep.su>, "Vitaly Motyakov, IHEP, Protvino, Russia" writes:
> Dear Daniel,
>
> Several weeks ago I posted a message to cern.www.talk and
> comp.infosystems.www where I asked about the possibility to
> have different character sets in one document.

[Since you already posted to www-talk, I hope you don't mind my
copying www-html on this.]

> I think that
> it is essential for multilingual documents where character
> sets might be changed even on the same line.

I agree...

> I browsed through HTML+ specification and your new HTML 2.0
> specification but unfortunately I did not find an answer to
> my question.

The HTML 2.0 spec is simply an effort to specify current practice;
that is, to publish a document that says how HTML works today.
Today, there is no widely deployed working code or consensus on how
to combine multiple character sets into one document. Hence, we
cannot specify how it works.

The current document has this to say about character sets:

    Character set option (proposed)

    The SGML declaration specifies ISO 8859/1 Latin alphabet No. 1
    as the base character set. The charset parameter is reserved
    for future use. Its intended significance is to override the
    base character set of the SGML declaration. Support of
    character sets other than ISO 8859/1 Latin alphabet No. 1 is
    not a requirement for conformance with this specification.

> Also, the MIME charset option could be used, but I am not
> sure that character sets could be changed on the same line of
> a document.

The character set specified using the charset="xxx" parameter could
include several graphic character sets, with escape codes to switch
between them. I think there are some mechanisms in place to do this
sort of thing: ISO 2022 comes to mind, but I'm not certain.

There has been some work on this subject in various parts of the
IETF. From an internet draft index
(ftp://ds.internic.net/internet-drafts/1id-index.txt), I see...

    "Characters and character sets for various languages",
    02/02/1994, <draft-alvestrand-lang-char-01.txt>

I'm not sure this particular draft is relevant, but I think you
would find it useful to browse the internet drafts and RFCs to see
what work in this area has been done there.

This discussion (how to combine character sets in a document) comes
up on the USENET newsgroup comp.mail.mime periodically as well.

> May be, it would be useful to introduce new CHARSET tag or
> attribute to HTML. What is your opinion?

I've seen proposals for CHARSET and LANG attributes in HTML.

I don't like the idea of a CHARSET attribute, as it may lead folks
to believe that they can use multibyte character sets or switch
graphic character sets in a document whose SGML declaration has no
provision for doing that.

This could open up bad interactions with the parser. Consider, for
example:

    [ESC]<abc

If an SGML parser knows that ESC is an escape character (i.e. if the
SGML declaration for the document includes a character set with such
escape sequences), then it knows that the '<' that follows is part
of the escape sequence. Otherwise, it will see "<abc" and treat it
as markup.

With a CHARSET attribute, folks might get the impression that they
can introduce new character encodings with an attribute, when in
fact, this will not change the parser's idea of the character
encoding.

I like the idea of a LANG attribute, which specifies a NOTATION for
an element. It doesn't change the character set, but it may change
the interpretation and/or display of those characters. In other
words, it has no interactions with the parser -- only the rendering
application.

Dan
Received on Monday, 20 June 1994 16:35:21 UTC
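
The [ESC]<abc case in the message above can be illustrated with a
rough sketch. It is not taken from any SGML parser; the function
names are hypothetical, and the escape-sequence shape assumed here is
the general ISO 2022 form (ESC, then zero or more intermediate bytes
0x20-0x2F, then one final byte 0x30-0x7E). It only shows how the same
byte stream yields different markup positions depending on whether
the scanner knows about escapes.

```python
ESC = 0x1B
LT = ord("<")


def naive_markup_starts(data: bytes) -> list[int]:
    """Offsets of every '<' byte, as seen by a scanner that knows
    nothing about escape sequences."""
    return [i for i, b in enumerate(data) if b == LT]


def escape_aware_markup_starts(data: bytes) -> list[int]:
    """Offsets of '<' bytes that are not part of an escape sequence.

    Assumption: escape sequences have the general ISO 2022 shape
    ESC, intermediates (0x20-0x2F)*, final (0x30-0x7E).
    """
    starts = []
    i = 0
    while i < len(data):
        if data[i] == ESC:
            i += 1
            # Skip any intermediate bytes.
            while i < len(data) and 0x20 <= data[i] <= 0x2F:
                i += 1
            # Skip the final byte, which may itself be '<' (0x3C).
            if i < len(data) and 0x30 <= data[i] <= 0x7E:
                i += 1
            continue
        if data[i] == LT:
            starts.append(i)
        i += 1
    return starts


sample = b"\x1b<abc <p>"
print(naive_markup_starts(sample))         # [1, 6]
print(escape_aware_markup_starts(sample))  # [6]
```

The naive scanner reports a tag starting at offset 1, inside the
escape sequence; the escape-aware one reports only the real <p> at
offset 6. An attribute in the document cannot switch a parser from
the first behaviour to the second, because the attribute is only seen
after the bytes have already been tokenized.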