Re: National characters...again

>One would have expected that HTML, from its very early days, would
>have provided the construct
>       <CHARSET="XXX"> ...any_8_bit_characters... </CHARSET>
>where XXX could be Latin-2, Latin-3, Latin-4,...

SGML has no mechanism for doing this, so the word I keep hearing is
that we should strangle HTML with the same restrictions.

The fallacy I often hear is that if we can put Unicode into the MIME
header as the charset, then we avoid having to define a CHARSET tag
(since Unicode encompasses most national characters).  But this way of
thinking is WRONG.  Unicode doesn't provide a mechanism for varying
sort order and other things that vary according to locale and
language.  THE UNICODE STANDARD ITSELF SAYS THAT ADDITIONAL TAGS ARE
NECESSARY for this sort of thing.
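
To make the sort-order point concrete, here is a minimal sketch in
Python.  The two key functions are hypothetical, hand-written
stand-ins for real collation tables (they are not taken from any
standard or library), and they only handle lowercase words.  The
point is just that the same three strings, made of the same code
points, come out in a different order depending on which language's
rules you apply.

```python
# Hypothetical, simplified collation rules for illustration only.
# Real systems use full locale-specific collation tables.

def german_key(word):
    # German dictionary ordering treats umlauts like their base vowels.
    return word.lower().translate(str.maketrans("äöü", "aou"))

SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"

def swedish_key(word):
    # Swedish treats å, ä, ö as separate letters that follow z.
    # Raises ValueError for characters outside SWEDISH_ALPHABET.
    return [SWEDISH_ALPHABET.index(ch) for ch in word.lower()]

words = ["zebra", "ärger", "apfel"]

german_order = sorted(words, key=german_key)
swedish_order = sorted(words, key=swedish_key)

print(german_order)   # ä sorts with a, so "ärger" lands before "zebra"
print(swedish_order)  # ä sorts after z, so "ärger" lands last
```

Nothing in the strings themselves says which ordering is wanted; that
information has to come from outside the character data.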

So although offering Unicode or UTF-8 as a default charset is a good
idea, it does not do away with the need for LANG and CHARSET tags.

Just to do away with one other fallacy:  You can't get by with just
LANG tags or just CHARSET tags.  You need both.  You can have two
different charsets in a single document (e.g., Shift-JIS and ISO
8859-1), and you can have two different languages within the same
charset (e.g., English and German for ISO 8859-1; Urdu, Persian, and
Arabic for Unicode, since they all use the same Unicode pages).
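
Both halves of that claim can be sketched in a few lines of Python.
This is only an illustration, using Python's standard codecs and
unicodedata module; the sample strings are my own choice.  The first
part shows that a run of bytes means nothing until you know its
charset.  The second shows that a code point tells you the script but
not the language.

```python
import unicodedata

# Part 1: the same bytes read under two different charsets.
raw = "日本語".encode("shift_jis")

as_sjis = raw.decode("shift_jis")      # the intended Japanese text
as_latin1 = raw.decode("iso-8859-1")   # never fails: every byte maps
                                       # to some character, but the
                                       # result is gibberish

print(as_sjis == "日本語")
print(as_latin1 == as_sjis)

# Part 2: one code point shared by several languages.
beh = "\u0628"
name = unicodedata.name(beh)

# The name identifies the letter and its script.  It cannot say
# whether the surrounding text is Arabic, Persian, or Urdu.
print(name)
```

So a charset label without a language label leaves the second question
open, and a language label without a charset label leaves the first.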

It may not make sense for all clients to allow all possible
combinations, but this is something they can negotiate with servers.
It is not a reason to cripple HTML.

If I'm misunderstanding the Unicode standards, HTML, or SGML, someone
please let me know.  I'm doing my best to keep up :-).

Richard Goerwitz
goer@midway.uchicago.edu

Received on Thursday, 2 February 1995 11:25:46 UTC