Re: Different character sets in one HTML document

Daniel W. Connolly (connolly@hal.com)
Fri, 24 Jun 1994 13:26:43 -0500


Message-Id: <9406241826.AA18054@ulua.hal.com>
To: Multiple recipients of list <www-html@www0.cern.ch>
Cc: piet@hpcvusm.cv.hp.com
Cc: Erik Naggum <SGML@ifi.uio.no>
Subject: Re: Different character sets in one HTML document 
In-Reply-To: Your message of "Thu, 23 Jun 1994 22:11:58 +0200."
             <9406231309.ZM1164@hpcvusm.cv.hp.com> 
Date: Fri, 24 Jun 1994 13:26:43 -0500
From: "Daniel W. Connolly" <connolly@hal.com>


  Multi-Language HTML Documents
===============================

  First, I'd like to thank Mr. van Zee for teaching me a whole bunch
  of stuff about character sets that I've been curious about for some
  time.

  HTML and SGML
---------------

  Second, I'll preface my response to his proposals with the
  underlying assumption that it is a requirement of HTML that

	An HTML document shall be a conforming SGML document.
		(as per ISO 8879, definition 4.51 and section 15.1)

  I just spent some time wading through the (excellent) comp.text.sgml
  archive. For mucky details about SGML and character sets, see, for
  example, Erik Naggum's Q&A on Character Sets_:

	Newsgroups: comp.text.sgml
	From: Erik Naggum <erik@naggum.no>
	Message-ID: <23160A@erik.naggum.no>
	Date: 22 May 1992 07:06:51 UT
	Subject: Character Sets Q&A, Part 2

  I've also been reading the TEI Guidelines_, including their stuff on
  writing systems and interchange problems with character sets.

  I've come across so much stuff that I don't know that I finally
  picked up _Understanding_Japanese_Information_Processing_ by Ken
  Lunde (O'Reilly & Associates, ISBN 1-56592-043-0).


  About the Proposals
---------------------

  Now, about Mr. van Zee's proposals:


In message <9406231309.ZM1164@hpcvusm.cv.hp.com>, "Pieter van Zee" writes:
>My objective is to support multi-lingual content, i.e. to move
>away from the assumption that the entire content of an HTML file
>is in a single charset.

  Well, as far as SGML is concerned, the entire content of an "HTML
  file" _must_ be in a single charset. That charset might support
  multiple languages through ISO2022 style escape mechanisms to
  different graphic code sets, but it's still one character set, in
  SGML terminology.
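
  To make that concrete, here is a minimal sketch in Python. It
  assumes Python's built-in "iso2022_jp" codec as the example charset
  (my choice for illustration, not something taken from the
  proposals); encode_mixed and sample are hypothetical names. It
  shows English and Japanese text travelling in one byte stream, with
  escape sequences switching between graphic code sets.

```python
# Sketch: English and Japanese text in ONE coded character set.
# Assumption: Python's built-in "iso2022_jp" codec stands in for
# "an ISO 2022 style charset"; the names below are illustrative only.

ESC = b"\x1b"

def encode_mixed(text):
    """Encode text as a single ISO-2022-JP byte stream."""
    return text.encode("iso2022_jp")

sample = "HTML \u65e5\u672c\u8a9e"   # "HTML " followed by three kanji
stream = encode_mixed(sample)

# ESC $ B designates JIS X 0208 for the kanji; ESC ( B returns to ASCII.
print(stream)
print(ESC + b"$B" in stream, ESC + b"(B" in stream)

# Every byte in the stream is a 7-bit value.
print(all(byte < 0x80 for byte in stream))
```

  The language changes in mid-stream, but the declared character set
  never does.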

>Assuming we agree that HTML documents need to support
>multi-lingual content, let's discuss how this might occur.  I ran
>the following by our i18n guru to verify my comments.
>
>The phrase "specifying the ISO 2022 mechanism at the MIME level"
>isn't exactly clear to me.  I'll take it to mean that whenever a
>HTML document is encapsulated as a MIME object for transport, the
>document must use ISO 2022 encoding for its content.
>
>Let's generalize and call this:
>
>  Strategy (a): a HTML document has only ISO 2022-encoded content.

  Well, let me attempt to clarify, and suggest this wrinkle on
  strategy (a):

  HTML and MIME for the Western European Writing System
-------------------------------------------------------

  Currently, there is an implicit SGML declaration_ shared by all HTML
  documents that specifies ISO8859-1 as the document character set. So
  currently, the HTML spec "specifies ISO 8859-1 at the MIME level";
  that is, the conventional HTTP header:

	Content-Type: text/html

  might be considered short for:

	Content-Type: text/html; charset="ISO-8859-1"
	Content-Transfer-Encoding: binary

  In addition, there are entities for each of the ISO8859 characters
  that are not part of ISO646, so most HTML documents _can_ be written
  in 7bit characters. So when most folks send HTML via mail, if they
  write:

	Content-Type: text/html

  they are using the US-ASCII character set (the ISO646 subset of
  ISO8859-1), as in:

	Content-Type: text/html; charset="US-ASCII"
	Content-Transfer-Encoding: 7bit

	<!DOCTYPE HTML PUBLIC "-//W3O//DTD WWW HTML 2.0//EN">
	<title>german names in 7bit html</title>
	<h1>Kurt G&ouml;del</h1>

  (Note that while individual 8bit characters in HTML can be converted
  to 7bit representations, some HTML idioms can't be represented
  within the 72 character limit of the 7bit encoding: very long
  words, very wide PRE lines, and long URLs.)
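
  The 8bit-to-7bit conversion can be sketched in a few lines of
  Python. This is only an illustration: to_7bit_html is a
  hypothetical helper, and it relies on the entity table in Python's
  html.entities module, which is a later and larger set than the
  ISO8859-1 entities discussed here. Characters with no named entity
  fall back to numeric character references.

```python
from html.entities import codepoint2name

def to_7bit_html(text):
    """Replace each non-ASCII character with an entity reference."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if code < 0x80:
            pieces.append(ch)                             # already ISO646
        elif code in codepoint2name:
            pieces.append("&%s;" % codepoint2name[code])  # named entity
        else:
            pieces.append("&#%d;" % code)                 # numeric fallback
    return "".join(pieces)

print(to_7bit_html("<h1>Kurt G\u00f6del</h1>"))   # <h1>Kurt G&ouml;del</h1>
```

  The sketch does nothing about line length, so the idioms listed in
  the note above remain a problem.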


  HTML and MIME for Other Writing Systems
-----------------------------------------

  It would make sense to me, then, to interpret these HTTP headers:

	Content-Type: text/html; charset="ISO-2022-JP"
	Content-Transfer-Encoding: binary

  as meaning "the SGML declaration for this document specifies
  ISO-2022-JP, rather than ISO-8859-1 as its document character set."
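
  A receiving program might apply that interpretation roughly as
  follows. This is a sketch under assumptions, not a description of
  any existing browser: document_charset and decode_body are
  hypothetical names, and the header parsing is borrowed from
  Python's standard email package.

```python
from email.message import Message

# Assumption: a missing charset parameter means the implicit default.
DEFAULT_CHARSET = "iso-8859-1"

def document_charset(content_type):
    """Return the charset named in a Content-Type value, or the default."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(DEFAULT_CHARSET)

def decode_body(content_type, body):
    """Decode the raw bytes of an HTML body using that charset."""
    return body.decode(document_charset(content_type))

print(document_charset("text/html"))                         # iso-8859-1
print(document_charset('text/html; charset="ISO-2022-JP"'))  # iso-2022-jp
```

  The charset is fixed once, before any markup is parsed, which is
  the point of treating it as the document character set.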

  I don't yet know whether this strategy is sufficient to represent,
  for example, a single HTML document containing English, Japanese,
  Cyrillic, and Hebrew text. I'm guessing something like:

	Content-Type: text/html; charset="ISO-10646"
	Content-Transfer-Encoding: binary

  might be sufficient, but a parser for such a document would be
  vastly different from an ISO8859-1 based HTML parser, since even the
  markup characters would be 2-byte characters. And I don't even know
  if ISO-10646 (aka UNICODE) can be expressed in an SGML declaration.
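
  A small Python sketch of why the parser would differ. It assumes
  the "utf-16-be" codec as a stand-in for the 2-byte form of
  ISO-10646 (for characters like these the bytes are the same); the
  variable names are illustrative.

```python
markup = "<title>"

latin1 = markup.encode("iso-8859-1")
# Assumption: big-endian UTF-16 stands in for 2-byte ISO 10646 here.
two_byte = markup.encode("utf-16-be")

print(latin1)     # b'<title>'
print(two_byte)   # b'\x00<\x00t\x00i\x00t\x00l\x00e\x00>'

# A byte-oriented scanner looking for "<" at the start of a tag
# finds it in the first stream but not in the second.
print(latin1.startswith(b"<"))     # True
print(two_byte.startswith(b"<"))   # False
```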


  SGML Can't Represent Per-Element Character Sets
-------------------------------------------------

>And my proposal is:
>
>  Strategy (b): every HTML element has optional LANG and CHARSET
>  attributes which specify the locale of the element's data.
>
>  In other words...A HTML document uses 7-bit ASCII for
>  markup but may use any charset for content, and charset is
>  specified in two ways: (i) an optional default charset for the
>  document, and (ii) an optional charset attribute on every
>  element that overrides the document default.
>
>What are the relative merits and pitfalls?

  The major pitfall that I see is that changing character sets on a
  per-element basis is not expressible in a conforming SGML document.

  It's possible to hack things where the parser reports ISO8859-1
  characters all the time, and we use NOTATIONs to represent other
  writing systems. I'm not sure how that would work just yet, but I
  believe the TEI Guidelines include a technique for doing this.


  Development Tools for Multilanguage Text
------------------------------------------

>Basically, with strategy (a), every program must know how to
>parse a ISO-2022 byte stream and map that to something meaningful
>on their platform.
...
>Also, although the ISO-2022 mechanism supports baseline charset
>specifications, it does not support higher-level specifications
>that combine two or more baseline charsets.  These aggregate
>charsets, such as Japanese SJIS and EUC, are the charsets that
>users are exposed to and which have OS infrastructure support.

  This is a compelling argument that I'm looking into. I'm interested
  to know if there's an "over-the-wire" representation of
  multi-language text that's widely supported by development tools.

  For example, I checked the ANSI C standard, and while they specify
  interfaces to translate between multibyte character and wide
  character encodings, they don't specify either of the actual
  encodings! So as far as ANSI C goes, there is no portable wide
  character or multibyte character encoding.

  I've been rooting around the Modula-3 documentation and trying to
  find out how other distributed computing platforms do multi-language
  text. I used to develop DCE software, and I can't remember their
  approach.

  It certainly would be a shame if the SGML standard conflicted with
  the predominant technique for multilanguage text representation
  supported by distributed applications tools.

  ...

  Ok... after reading more of the O'Reilly book, it appears that there
  are three predominant encodings of multilanguage text supported by
  development tools:

* EUC (Extended Unix Code): specified in ISO2022. Supported by OSF,
  Unix International, and USL.

* Shift-JIS: supported by Microsoft and Apple

* Unicode: specified by the Unicode Consortium, coincides with parts of ISO
  10646. Supported by AT&T Plan 9 and PenPoint

  My vote for the over-the-wire representation of HTML is EUC, with
  support for ISO-2022-JP for 7bit transmission. Now: how do we spell
  EUC in an SGML declaration?
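
  For comparison, here is a rough Python sketch of how the same three
  kanji look in these encodings. It assumes the Japanese variants of
  each (Python's "euc_jp", "shift_jis", and "iso2022_jp" codecs) and
  is only meant to show why EUC and ISO-2022-JP pair up naturally;
  the variable names are illustrative.

```python
text = "\u65e5\u672c\u8a9e"   # three kanji

jis = text.encode("iso2022_jp")
euc = text.encode("euc_jp")
sjis = text.encode("shift_jis")

# Drop the leading ESC $ B and trailing ESC ( B to get the bare
# JIS X 0208 bytes carried by the ISO-2022-JP stream.
jis_payload = jis[3:-3]

# EUC carries the same codes with the high bit of each byte set.
print(bytes(b | 0x80 for b in jis_payload) == euc)   # True

print(all(b < 0x80 for b in jis))    # True: safe for 7bit transport
print(all(b >= 0x80 for b in euc))   # True: needs an 8bit channel
print(sjis != euc)                   # True: Shift-JIS is a different mapping
```

  Converting between the 8bit and 7bit forms is then a matter of
  toggling one bit and adding or removing the escape sequences.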

.. _Sets ftp://ftp.ifi.uio.no/pub/SGML/comp.text.sgml/19920522/070651.Naggum
.. _Guidelines http://etext.virginia.edu/TEI.html
.. _declaration http://www.hal.com/%7Econnolly/html-spec/html.decl