Re: LANG= for character-mapping from Albert Lunde on 1996-07-23 (www-international@w3.org from July to September 1996)

From: Albert Lunde <Albert-Lunde@nwu.edu>
Date: Tue, 23 Jul 1996 17:30:41 -0500
To: Hans van Mourik <MOURIK@rullet.LeidenUniv.nl>, www-international@w3.org
Message-Id: <v02140b08ae1af98027f7@[129.105.110.129]>

At 8:09 PM 7/23/96, Hans van Mourik wrote:
>Hello to you internationalisationisers,
>
>I would like to know how the HTML LANG-attribute should be linked
>up to a particular character-set. In fact what I'm looking for is an
>HTML-equivalent for the TEI ``writing system declarations''.
>Are there any thoughts about such a thing?

It is my impression that the intention of the various HTML and HTTP drafts
that have addressed this is that "language" and character encoding (a.k.a.
MIME charset) are, so to speak, "independent variables". In the general
case, neither determines the other. There are different HTTP headers for
charset and language.
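
As a minimal sketch of that independence (not taken from any draft), the following Python reads the two values from separate headers. The function name `charset_and_language` and the sample header values are assumptions made for illustration; only the standard library `email.message.Message` class is used.

```python
from email.message import Message


def charset_and_language(headers):
    """Return (charset, language) from a list of (name, value) header pairs.

    The charset comes from the Content-Type parameter and the language
    from Content-Language; neither value is derived from the other.
    """
    msg = Message()
    for name, value in headers:
        msg[name] = value
    charset = msg.get_content_charset()   # lowercased, or None if absent
    language = msg.get("Content-Language")  # None if the header is absent
    return charset, language


# Illustrative header sets: the same language with two different charsets.
japanese_euc = [("Content-Type", "text/html; charset=EUC-JP"),
                ("Content-Language", "ja")]
japanese_utf8 = [("Content-Type", "text/html; charset=UTF-8"),
                 ("Content-Language", "ja")]
```

Here both header sets report the language "ja" while the charset differs, which is the "independent variables" point in miniature.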

The thrust of the HTML internationalization draft is to define the SGML
stuff in terms of an SGML "document character set" of ISO-10646. However,
this _does not_ determine the character encoding used to send documents
"over the wire", and within broad limits, any reasonable encoding can be
used.

The significance of the use of ISO-10646 is to define a consistent
framework for interpreting numeric character references and other aspects
of SGML document parsing that isn't tied too closely to a particular
encoding.

For example, Japanese text might be encoded with a JIS or EUC encoding (I
don't remember the precise charset names). It might also be encoded in
something stranger like US-ASCII or EBCDIC using ISO-10646-based numeric
character references (though you'd have trouble finding support for this
today, I think).
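
A rough sketch of the same idea in Python follows. The codec names `euc_jp` and `iso2022_jp` are Python's own names for one EUC and one JIS-style encoding, chosen here as assumptions since the precise charset names are not given above; the sample string and variable names are likewise only illustrative.

```python
import html

text = "\u65e5\u672c\u8a9e"  # the three characters of "Japanese language"

# Three different "over the wire" byte sequences for the same characters.
as_euc = text.encode("euc_jp")
as_jis = text.encode("iso2022_jp")
# US-ASCII cannot carry these characters directly, so each one is written
# as a numeric character reference to its ISO-10646 code point.
as_ascii_ncr = text.encode("ascii", errors="xmlcharrefreplace")
# as_ascii_ncr is b"&#26085;&#26412;&#35486;"

# Decoding each form recovers the same document characters.
from_euc = as_euc.decode("euc_jp")
from_jis = as_jis.decode("iso2022_jp")
from_ncr = html.unescape(as_ascii_ncr.decode("ascii"))
```

The numbers in the references are code points in the document character set, so they stay the same whichever encoding carries the document.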

I think the idea was that LANG would be used to indicate aspects of
presentation or user agent behavior which might _not_ be indicated by the
character encoding (and clearly would _not_ be indicated if some encoding
of Unicode like UTF-8 were used). Examples cited in discussion were
spelling and hyphenation dictionaries, and the exact rendering of kanji
characters or quoted text.
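
To make the quoted-text case concrete, here is a small hypothetical sketch of how a user agent might choose quotation marks from a language tag alone. The `QUOTES` table and `render_quote` function are invented for illustration and are not taken from the draft; a real user agent would need a much fuller table.

```python
# Hypothetical table: opening and closing quotation marks per primary
# language tag. The encoding of the document plays no part in the choice.
QUOTES = {
    "en": ("\u201c", "\u201d"),              # English double quotes
    "de": ("\u201e", "\u201c"),              # German low-high quotes
    "fr": ("\u00ab\u00a0", "\u00a0\u00bb"),  # French guillemets with no-break spaces
    "ja": ("\u300c", "\u300d"),              # Japanese corner brackets
}


def render_quote(text, lang):
    """Wrap text in the quotation marks conventional for the given language tag."""
    primary = lang.lower().split("-")[0]  # "en-US" -> "en"
    open_q, close_q = QUOTES.get(primary, ('"', '"'))  # plain quotes as a fallback
    return f"{open_q}{text}{close_q}"
```

The same UTF-8 bytes for the quoted text would be rendered differently for `de` and `en`, which is information the encoding cannot supply.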

I think the HTML internationalization draft tries to specify this a bit
more rigorously. See the section "The LANG attribute".

It's been a while since I read the TEI documents.

Taking a look at them, it appears that the "writing system declaration"
specifies:
(1) the language
(2) the writing system (script, alphabet, syllabary) used to write the language
(3) the coded character set, entity names, or transliteration scheme used
to represent the graphic characters of the writing system.

There is stuff defined in HTML and HTTP specs that addresses (1) and (3)
independently, but not much is said about (2) or the combination of the
three together.

Perhaps someone wiser than me about the TEI can say more.

---
    Albert Lunde                      Albert-Lunde@nwu.edu

Received on Tuesday, 23 July 1996 18:31:15 UTC