Re: Different character sets in one HTML document

Daniel W. Connolly (connolly@hal.com)
Mon, 20 Jun 1994 09:35:11 -0500


Message-Id: <9406201435.AA05858@ulua.hal.com>
To: "Vitaly Motyakov, IHEP, Protvino, Russia" <motyakov@mx.ihep.su>
Cc: www-html@www0.cern.ch
Subject: Re: Different character sets in one HTML document 
In-Reply-To: Your message of "Thu, 16 Jun 1994 15:10:44 GMT."
             <009800B2.5849B14D.25266@mx.ihep.su> 
Date: Mon, 20 Jun 1994 09:35:11 -0500
From: "Daniel W. Connolly" <connolly@hal.com>

In message <009800B2.5849B14D.25266@mx.ihep.su>, "Vitaly Motyakov, IHEP,
Protvino, Russia" writes:
>        Dear Daniel,
>
>   Several weeks ago I posted a message to cern.www.talk and
>comp.infosystems.www where I asked about the possibility to
>have different character sets in one document.

[Since you already posted to www-talk, I hope you don't mind
my copying www-html on this.]

> I think that
>it is essential for multilingual documents where character 
>sets might be changed even on the same line.

I agree...

>   I browsed through HTML+ specification and your new HTML 2.0
>specification but unfortunately I did not find an answer to
>my question.

The HTML 2.0 spec is simply an effort to specify current practice;
that is, to publish a document that says how HTML works today. Today,
there is no widely deployed working code or consensus on how to
combine multiple character sets into one document. Hence, we
cannot specify how it works.

The current document has this to say about character sets:

	Character set option (proposed)

	The SGML declaration specifies ISO 8859/1 Latin alphabet No. 1
	as the base character set. The charset parameter is reserved
	for future use. Its intended significance is to override the
	base character set of the SGML declaration. Support of
	character sets other than ISO 8859/1 Latin alphabet No. 1 is
	not a requirement for conformance with this specification.

>   Also, the MIME charset option could be used, but I am not
>sure that character sets could be changed on the same line of
>a document.

The character set specified using the charset="xxx" parameter could
include several graphic character sets, with escape codes to switch
between them. I think there are some mechanisms in place to do
this sort of thing: ISO 2022 comes to mind, but I'm not certain.
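
To make the charset parameter concrete, here is a minimal sketch of how a
client might read it from a MIME Content-Type header. It uses Python's
standard email package; the function name charset_of, the sample header
values, and the fallback to ISO 8859/1 when the parameter is absent are my
assumptions for illustration, not anything the HTML 2.0 draft specifies.

```python
from email.message import Message


def charset_of(content_type, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type value, lowercased.

    The default mirrors the base character set of the SGML declaration;
    treating it as a fallback here is an assumption of this sketch.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the parameter lowercased, or the
    # fallback object when the header carries no charset parameter.
    return msg.get_content_charset(default)


if __name__ == "__main__":
    print(charset_of('text/html; charset="ISO-2022-JP"'))
    print(charset_of("text/html"))
```

Note that the parameter applies to the whole entity; any switching between
graphic character sets within a line would have to come from the encoding
named by the parameter, not from the parameter itself.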

There has been some work on this subject in various parts of the IETF.
From an internet draft index
(ftp://ds.internic.net/internet-drafts/1id-index.txt), I see...

  "Characters and character sets for various languages",02/02/1994, 
  <draft-alvestrand-lang-char-01.txt>                                      

I'm not sure this particular draft is relevant, but I think you would
find it useful to browse the internet drafts and RFCs to see what work
in this area has been done there. This discussion (how to combine
character sets in a document) comes up on the USENET newsgroup
comp.mail.mime periodically as well.

>   May be, it would be useful to introduce new CHARSET tag or
>attribute to HTML. What is your opinion?

I've seen proposals for CHARSET and LANG attributes in HTML. I don't
like the idea of a CHARSET attribute, as it may lead folks to believe
that they can use multibyte character sets or switch graphic character
sets in a document whose SGML declaration has no provision for doing
that. This could open up bad interactions with the parser. Consider,
for example:

	[ESC]<abc

If an SGML parser knows that ESC is an escape character (i.e. if the
SGML declaration for the document includes a character set with such
escape sequences), then it knows that the '<' that follows is part of
the escape sequence. Otherwise, it will see "<abc" and treat it as
markup.
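
A closely related hazard can be sketched in a few lines of Python. The
example assumes ISO-2022-JP as the encoding (my choice for illustration; it
is one ISO 2022 style encoding, not something the HTML spec names) and uses
two hypothetical scanners. Here the '<' byte turns up inside a two-byte
character rather than directly after ESC, but the effect is the same: a
scanner that does not track the escape sequences mistakes character data
for the start of markup.

```python
ESC = 0x1B
LT = ord("<")


def naive_markup_starts(data):
    """Offsets of every '<' byte, ignoring escape sequences entirely."""
    return [i for i, b in enumerate(data) if b == LT]


def aware_markup_starts(data):
    """Offsets of '<' bytes seen while the single-byte set is active.

    Simplified sketch: handles only the three-byte ISO-2022-JP escapes
    ESC $ B / ESC $ @ (two-byte set) and ESC ( B / ESC ( J (single-byte
    set), and assumes well-formed input.
    """
    starts = []
    two_byte = False
    i = 0
    while i < len(data):
        if data[i] == ESC:
            seq = data[i + 1:i + 3]
            if seq in (b"$B", b"$@"):
                two_byte = True
            elif seq in (b"(B", b"(J"):
                two_byte = False
            i += 3  # skip the escape sequence itself
        elif two_byte:
            i += 2  # one character occupies two bytes; neither is markup
        else:
            if data[i] == LT:
                starts.append(i)
            i += 1
    return starts


if __name__ == "__main__":
    # U+305C (hiragana ZE) has a '<' byte in its two-byte encoding.
    data = "<p>\u305c</p>".encode("iso2022_jp")
    print(data)
    print("naive:", naive_markup_starts(data))
    print("aware:", aware_markup_starts(data))
```

The naive scanner reports one more '<' than the escape-aware one. That
extra hit is what a parser would treat as the start of a tag if its
declared character set had no provision for these escape sequences.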

With a CHARSET attribute, folks might get the impression that they can
introduce new character encodings with an attribute, when in fact, this
will not change the parser's idea of the character encoding.

I like the idea of a LANG attribute, which specifies a NOTATION for an
element. It doesn't change the character set, but it may change the
interpretation and/or display of those characters. In other words, it
has no interactions with the parser -- only the rendering application.

Dan