Re: Different character sets in one HTML document

Erik Naggum (erik@naggum.no)
25 Jun 1994 09:36:54 UT


Date: 25 Jun 1994 09:36:54 UT
From: Erik Naggum <erik@naggum.no>
Message-Id: <19940625.3247@naggum.no>
To: Daniel W. Connolly <connolly@hal.com>
Cc: www-html@www0.cern.ch, piet@hpcvusm.cv.hp.com
In-Reply-To: <9406241826.AA18054@ulua.hal.com>
Subject: Re: Different character sets in one HTML document

thanks for copying me in on this.

|   In message <9406231309.ZM1164@hpcvusm.cv.hp.com>, "Pieter van Zee" writes:
|   >My objective is to support multi-lingual content, i.e. to move
|   >away from the assumption that the entire content of an HTML file
|   >is in a single charset.
|
|     Well, as far as SGML is concerned, the entire content of an "HTML
|     file" _must_ be in a single charset. That charset might support
|     multiple languages through ISO2022 style escape mechanisms to
|     different graphic code sets, but it's still one character set, in
|     SGML terminology.

well, yes and no.  the only actual requirement is that the parser be able
to distinguish data from markup, even in the presence of various character
sets (there is no such thing as a "graphic code set" apart from a character
set).  ISO 8879 predates ISO 10646, and while it does not predate ISO 2022,
it does no more than pay lip service to the existence of this standard.
like so many other things in SGML, I'm sorry to say, character set support
breaks under the scrutiny required of actual implementation.

|     I don't yet know whether this strategy is sufficient to represent,
|     for example, a single HTML document containing English, Japanese,
|     Cyrillic, and Hebrew text. I'm guessing something like:
|
|   	Content-Type: text/html; charset="ISO-10646"
|   	Content-Transfer-Encoding: binary
|
|     might be sufficient, but a parser for such a document would be
|     vastly different from an ISO8859-1 based HTML parser, since even the
|     markup characters would be 2-byte characters. And I don't even know
|     if ISO-10646 (aka UNICODE) can be expressed in an SGML declaration.

as long as you stick to data characters outside of the 0..255 range, i.e.,
no significant SGML characters outside that range, all is well.  you just
need an SGML parser that is capable of dealing with more-than-eight-bit
characters.  most aren't.  mine is.  if you need characters outside the
basic range, you will need a parser that is a bit more forgiving than those
that only accept the present idiotic syntax for additional significant SGML
characters.
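to make the point concrete, here is a minimal sketch (not my parser, and not real SGML parsing) of markup recognition over integer character codes. it assumes, purely for illustration, that the only delimiters are "<" and ">", both inside the 0..255 range; the function name split_markup is hypothetical. data characters with larger code values pass through untouched.

```python
def split_markup(code_points):
    """Split a sequence of integer character codes into (kind, text) runs.

    Assumption for illustration only: '<' opens markup and '>' closes it.
    Real SGML delimiter recognition is considerably more involved.
    """
    LT, GT = ord("<"), ord(">")
    runs, current, in_markup = [], [], False
    for cp in code_points:
        if not in_markup and cp == LT:
            # A delimiter ends the pending data run and starts a markup run.
            if current:
                runs.append(("data", "".join(map(chr, current))))
            current, in_markup = [cp], True
        elif in_markup and cp == GT:
            current.append(cp)
            runs.append(("markup", "".join(map(chr, current))))
            current, in_markup = [], False
        else:
            # Any other code, of any width, is just accumulated.
            current.append(cp)
    if current:
        runs.append(("markup" if in_markup else "data", "".join(map(chr, current))))
    return runs


# Japanese and Hebrew data characters, all far outside 0..255.
japanese = "\u65e5\u672c\u8a9e"
hebrew = "\u05e2\u05d1\u05e8\u05d9\u05ea"
doc = [ord(c) for c in "<p>" + japanese + " and " + hebrew + "</p>"]
runs = split_markup(doc)
```

the only thing the loop ever compares against is the two delimiter codes, so the width of the data characters never matters to it.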

|     SGML can't represent per-element Character Sets

not sure what you mean here.  there may be a danger of falsely recognized
markup, but there is nothing in SGML that requires the data in an element
to be of a particular kind.  NOTATION declarations can be used to
explicitly denote that the content is not regular character data.  in some
applications, Greek is found in special elements using a transcription
code, to take just one example.

|     The major pitfall that I see is that changing character sets on a
|     per-element basis is not expressible in a conforming SGML document.

not true.

|     It's possible to hack things where the parser reports ISO8859-1
|     characters all the time, and we use NOTATIONs to represent other
|     writing systems. I'm not sure how that would work just yet, but I
|     believe the TEI Guidelines include a technique for doing this.

didn't understand this.  it would indeed be useful if the parser could
shield the application from the actual coding of the data characters, and
instead provide it with ISO 10646 characters.  Charles F. Goldfarb has
vehemently opposed such an interpretation, so it will not likely be part of
the revised standard (if and when _that_ is published).  however, I think
this is vital to SGML's success in a multi-lingual environment, and have
worked out a proposal and an implementation to support it which treats the
SGML declaration's idea of character sets, primitive as it is, as an
agreement between the entity manager and the parser, not as descriptive of
the actual document and the character codes in it.  this is required if we
are to live in a world of multiple character sets, such as we will across a
heterogeneous, networked environment.  (WWW is just one example of such.)
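
a rough sketch of that division of labour, under stated assumptions: this is not the implementation mentioned above, the name entity_manager is hypothetical, and Python's built-in codecs stand in for the character encoding interpreters. the entity manager decodes whatever arrives on the wire and hands the parser a uniform sequence of character codes.

```python
def entity_manager(raw, encoding):
    """Decode an entity's bytes into a uniform list of integer character codes.

    Hypothetical helper: the parser downstream only ever sees these integers,
    never the wire encoding.
    """
    return [ord(ch) for ch in raw.decode(encoding)]


# The same document text, shipped in three different wire encodings.
text = "<p>\u65e5\u672c\u8a9e</p>"
from_euc = entity_manager(text.encode("euc_jp"), "euc_jp")
from_2022 = entity_manager(text.encode("iso2022_jp"), "iso2022_jp")
from_utf8 = entity_manager(text.encode("utf_8"), "utf_8")

# The wire forms differ in length, but the parser input is identical.
same_for_parser = from_euc == from_2022 == from_utf8
```

the byte strings differ (variable-length codes, escape sequences), yet all three calls produce the same ten character codes, which is all the parser needs to agree on with the entity manager.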

|     It certainly would be a shame if the SGML standard conflicted with
|     the predominant technique for multilanguage text representation
|     supported by distributed applications tools.

it doesn't.  SGML per se is oblivious to the meaning of the characters that
are not used to recognize markup, i.e., the data characters in a document.

|     Ok... after reading more of the O'Reilly book, it appears that there
|     are three predominant encodings of multilanguage text supported by
|     development tools:
|
|   * EUC (Extended Unix Code): specified in ISO2022. Supported by OSF,
|     Unix International, and USL.
|
|   * Shift-JIS: supported by Microsoft and Apple
|
|   * Unicode: specified by the Unicode Consortium, coincides with parts of ISO
|     10646. supported by AT&T Plan 9 and PenPoint

note that ISO 10646 is not a unique encoding.  it is only a code space, and
because it is broken at the core (thanks to the Unicode crowd), it needed
various encodings to become part of what we call real life.  one of them is
UCS-2, which is a 16-bit encoding, another is UCS-4, a full 32-bit
encoding, another UCS-2+ or whatever it will be called, which is a 16-bit
encoding with 1024 high and low 16-bit codes to represent up to 1112064
characters (65536 - (2 * 1024) + (1024 * 1024)), then there's UTF-1 which
is the silly encoding invented by ISO, and UTF-2, FSS-UTF or UTF-8 as it is
variously called, which is what Plan 9 uses, and then there's a new
silly idea called UTF-7, which is supposedly an extension to MIME's already
butt-ugly quoted-printable ("quoted-unreadable").
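
the arithmetic above, and the difference between these encodings, can be checked with a short sketch. assumptions, since nothing above fixes them: the high and low code ranges start at 0xD800 and 0xDC00 (the values later standardized for the 16-bit scheme with paired codes), and Python's "utf-16-be" and "utf-32-be" codecs stand in for UCS-2 and UCS-4. the helper name to_pair is hypothetical.

```python
# Capacity of a 16-bit encoding that reserves 1024 high and 1024 low codes.
BASIC = 65536
HIGH = LOW = 1024
capacity = BASIC - (HIGH + LOW) + HIGH * LOW


def to_pair(code):
    """Split a character code at or above 0x10000 into a (high, low) pair.

    Assumes the reserved ranges begin at 0xD800 (high) and 0xDC00 (low).
    """
    offset = code - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


first_pair = to_pair(0x10000)
last_pair = to_pair(0x10FFFF)

# One character, several encodings, different byte counts.
ch = "\u00e5"  # LATIN SMALL LETTER A WITH RING ABOVE
sizes = {name: len(ch.encode(name)) for name in ("utf-16-be", "utf-32-be", "utf-8")}
```

here capacity comes out to 1112064, matching the figure above, and the same character occupies two, four, and two bytes in the three encodings respectively.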

|     My vote for the over-the-wire representation of HTML is EUC, with
|     support for ISO-2022-JP for 7bit transmission. Now: how do we spell
|     EUC in an SGML declaration?

this is one case where the SGML declaration breaks down.  it cannot talk
about variable-length encodings, or code-value-dependent action.  this must
be handled in the entity manager, and the parser must see a reasonably
consistent character encoding to be able to deal with this stuff at all.
SGML as it is currently specified cannot do anything useful with EUC-
encoded data, or UTF-x, or much of anything outside ASCII.  if you think this
is more than just stupid, find out who is your country's representative to
ISO/IEC JTC 1, and/or its SC 18, and/or its WG 8, present the situation,
and require that

    unless ISO 8879 is revised to be able to be independent of character
    encoding, it should not be reconfirmed as an ISO standard in 1996.

do this as soon as possible.  remember that in ISO, the U.S. has only one
vote, whereas all the other small congregations of people in the world also
have one vote.  your country's voice (and vote) will make a difference.
while this may seem to go overboard for a small detail, the extreme lack of
progress in the review of the SGML standard makes it very important to
voice your concern with specific issues that SGML does not cover today, and
which will require a revised standard.  if you can make the case that SGML
would be utterly irrelevant to the future of computing unless it is made
able to handle other character encodings, so much the better.

the WWW community is becoming a force in the SGML world, and it should use
this force to obtain necessary goals for its continued existence.  without
proper character set support, SGML is unsuitable, and WWW will have to make
some incredibly ugly hack like MIME did on top of ASCII.  you do not want
this to happen!

private responses to this are encouraged.

as an aside, my entity manager, which is prepared for character encoding
interpreters, will be released in early July.  a conversion utility between
various character sets will be released shortly thereafter.  if you would
like to beta-test it, please drop me a line.

best regards,
</Erik>

--
Erik Naggum <erik@naggum.no> <SGML@ifi.uio.no>       |  memento, terrigena
ISO 8652 Ada/ISO 8879 SGML/ISO 9899 C/ISO 10646 UCS  |  memento, vita brevis

ftp://ftp.ifi.uio.no/pub/SGML           wais://ftp.ifi.uio.no/comp.text.sgml