- From: Albert Lunde <Albert-Lunde@nwu.edu>
- Date: Tue, 23 Jul 1996 17:30:41 -0500
- To: Hans van Mourik <MOURIK@rullet.LeidenUniv.nl>, www-international@w3.org
At 8:09 PM 7/23/96, Hans van Mourik wrote:
>Hello to you internationalisationisers,
>
>I would like to know how the HTML LANG-attribute should be linked
>up to a particular character-set. In fact what I'm looking for is an
>HTML-equivalent for the TEI ``writing system declarations''.
>Are there any thoughts about such a thing?

It is my impression that the intention of the various HTML and HTTP drafts that have addressed this is that "language" and character encoding (a.k.a. MIME charset) are, so to speak, "independent variables". In the general case, neither determines the other. There are different HTTP headers for charset and language.

The thrust of the HTML internationalization draft is to define the SGML stuff in terms of an SGML "document character set" of ISO-10646. However, this _does not_ determine the character encoding used to send documents "over the wire", and within broad limits, any reasonable encoding can be used. The significance of the use of ISO-10646 is to define a consistent framework for interpreting numeric character references and other aspects of SGML document parsing that isn't tied too closely to a particular encoding.

For example, Japanese text might be encoded with a JIS or EUC encoding (I don't remember the precise charset names). It might also be encoded as something stranger, like US-ASCII or EBCDIC using ISO-10646-based numeric character references (though you'd have trouble finding support for this today, I think.)

I think the idea was that LANG would be used to indicate aspects of presentation or user agent behavior which might _not_ be indicated by the character encoding (and clearly would _not_ be indicated if some encoding of Unicode like UTF-8 were used.) Examples cited in discussion were spelling and hyphenation dictionaries, and the exact rendering of kanji characters or quoted text.

I think the HTML internationalization draft tries to specify this a bit more rigorously. See the section "The LANG attribute".
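To make the "independent variables" point concrete, here is a minimal Python sketch (not taken from any of the drafts). It takes one short Japanese string and serializes it under several encodings, including plain US-ASCII with ISO-10646-based numeric character references. The sample text, the variable names, and the codec names (Python's own aliases, not necessarily the registered MIME charset names) are assumptions chosen for illustration.

```python
import html

# Three kanji meaning "Japanese language", written as ISO-10646 code points
# so that this source file itself stays pure ASCII.
text = "\u65e5\u672c\u8a9e"

# The language tag stays the same no matter how the bytes are encoded.
lang = "ja"

# Same characters, different "over the wire" encodings (Python codec names).
encodings = ["euc_jp", "iso2022_jp", "shift_jis", "utf-8"]
wire_forms = {name: text.encode(name) for name in encodings}

# US-ASCII with numeric character references: every non-ASCII character
# becomes &#N;, where N is its decimal ISO-10646 code point.
ascii_ncr = text.encode("ascii", errors="xmlcharrefreplace")

for name, data in wire_forms.items():
    print(f"lang={lang} charset={name}: {data!r}")
print(f"lang={lang} charset=us-ascii: {ascii_ncr!r}")

# Each byte sequence decodes back to the same characters.
for name, data in wire_forms.items():
    assert data.decode(name) == text
assert html.unescape(ascii_ncr.decode("ascii")) == text
```

Every form differs on the wire but decodes to the same three characters, and none of them says anything about the language; that has to be carried separately (by LANG in the document, or by a language header alongside the charset parameter in HTTP).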

It's been a while since I read the TEI documents. Taking a look at them, it appears that the "writing system declaration" specifies:

(1) the language
(2) the writing system (script, alphabet, syllabary) used to write the language
(3) the coded character set, entity names, or transliteration scheme used to represent the graphic characters of the writing system.

There is stuff defined in the HTML and HTTP specs that addresses (1) and (3) independently, but not much is said about (2) or the combination of the three together.

Perhaps someone wiser than me about the TEI can say more.

---
Albert Lunde
Albert-Lunde@nwu.edu
Received on Tuesday, 23 July 1996 18:31:15 UTC