Re: LANG= for character-mapping

>Hello to you internationalisationisers,
>I would like to know how the HTML LANG-attribute should be linked
>up to a particular character-set. In fact what I'm looking for is an
>HTML-equivalent for the TEI ``writing system declarations''.
>Are there any thoughts about such a thing?

I don't know about TEI "writing system declarations", but definitely
the LANG attribute should have no connection whatsoever to
character encoding issues.

LANG is real content markup, and is important for indexing,
text-to-speech, hyphenation, high-quality display, and so on.
Please always mark up your documents with this information
to allow such applications to develop.
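
To illustrate how an indexer might use this markup, here is a minimal
Python sketch. It assumes Python's standard html.parser module; the
LangCollector class and the sample markup are hypothetical, made up for
this illustration, and the sketch ignores details such as empty elements
and unclosed tags.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Collect text runs tagged with the nearest enclosing LANG value."""

    def __init__(self):
        super().__init__()
        self.stack = [None]   # language in effect; None = not declared
        self.runs = []        # list of (language, text) pairs

    def handle_starttag(self, tag, attrs):
        # An element without LANG inherits the language of its parent.
        lang = dict(attrs).get("lang", self.stack[-1])
        self.stack.append(lang)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.runs.append((self.stack[-1], text))


# Hypothetical sample: French paragraph with German and Russian spans.
doc = ('<p lang="fr">Bonjour <span lang="de">Guten Tag</span> et '
       '<span lang="ru">&#1055;&#1088;&#1080;&#1074;&#1077;&#1090;</span></p>')

collector = LangCollector()
collector.feed(doc)
collector.close()
runs = collector.runs

for lang, text in runs:
    print(lang, text)
```

Each run of text comes out labelled with its language, without the
program ever having to know which character encoding the document was
transmitted in.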

On the other hand, character encoding is a technicality that is
handled outside actual HTML or SGML. SGML has the concept
of a document character set, represented as a set of positive
integers. For HTML, the document character set was ISO 8859-1
and is becoming ISO 10646 (which means that it is just
extended, nothing has to be changed).
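
A small Python sketch of why nothing has to be changed. It assumes that
Python's "latin-1" codec stands in for ISO 8859-1 and that Python string
code points stand in for ISO 10646 positions; the variable names are
mine.

```python
# Every ISO 8859-1 octet value, 0 through 255.
latin1_positions = list(range(256))

# Decode each octet on its own and look at the resulting code point.
decoded = [bytes([n]).decode("latin-1") for n in latin1_positions]

# True if every ISO 8859-1 position keeps the same number in ISO 10646.
unchanged = all(ord(ch) == n for n, ch in zip(latin1_positions, decoded))
print(unchanged)

# e-acute keeps its old position 233; Cyrillic lives above 255.
e_acute_position = ord("\u00e9")
cyrillic_be_position = ord("\u0411")
print(e_acute_position, cyrillic_be_position)
```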

>What we (NHDA) would like to do is to serve documents containing
>*multiple languages*. We're not so much interested in serving a
>directory with multiple translations of the same instance.

Nice. This is what many people would like to do, and what some
already do.

>Consider a document containing both French, German and Russian.
>HTML 3.2 offers us the possibility of marking divisions, paragraphs,
><span>'s and so on with lang="ru" | lang="fr" | lang="de".
>But then what?

This information is important for the reasons put forward above. There
is no "but then what?" here!

>Now suppose the HTTP Charset-header is set to some Russian character-
>encoding (Ms. codepage 1251, KOI-8R or ISO 8859-5 -- you may pick your

Don't "pick your choice". Help streamline the character encoding mess
to improve interoperability. The web did a lot in this respect for Western
Europe by sticking to ISO 8859-1. It would be nice if the same happened
for Cyrillic and ISO 8859-5 (which is the one mentioned in all relevant
RFCs and such).

>What happens to entities like &eacute; and &ouml;? Browsers like
>Navigator, Explorer and Mosaic will map them blindly to #233 and #246.
>And so they'll appear as arbitrary Russian characters.

This is incorrect behaviour due to the laziness of the software makers
involved. They know this, but as so often, i18n is not the first item on
their list of priorities. Please do complain to them about this buggy behaviour.

Not only should it be clear that &eacute; must always mean what it says,
but &#233; must also always be displayed as e-acute. If you want to
include a Cyrillic character with a numeric character reference, you have
to use e.g. &#1041; which is the decimal representation of U+0411,
the Cyrillic capital letter BE.
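
The difference between the correct and the buggy interpretation can be
sketched in Python. This assumes the standard html.unescape function as
a stand-in for a correct browser, and the "iso8859_5" codec to mimic a
browser that wrongly treats the number as an octet in the transmission
encoding; the variable names are only for illustration.

```python
from html import unescape

# Correct: references resolve against the document character set,
# whatever encoding the document was transmitted in.
e_acute_named = unescape("&eacute;")
e_acute_numeric = unescape("&#233;")
cyrillic_be = unescape("&#1041;")

print(e_acute_named == e_acute_numeric)    # both are e-acute
print(hex(ord(cyrillic_be)))               # position of Cyrillic BE

# Buggy: treat 233 as an octet of the ISO 8859-5 encoding instead.
wrong = bytes([233]).decode("iso8859_5")
print(wrong == e_acute_numeric)            # not e-acute any more
```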

>  (How about Arena/Amaya -- I haven't checked that one).

At least the browsers from Alis and from Accent should do it correctly.

>Do we have to publish it in Unicode instead then? -- ie. let most
>browsers just break and wait for the *perfect browser* to come along.

No, you don't. You can use ISO 8859-5 as indicated above. The document
length in octets may be shorter. And many browsers will correctly display
at least the Cyrillic part, even if they mess up the French. On the other
hand, those browsers that do everything correctly will usually also
accept Unicode.
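
As a rough sketch of what this looks like in practice, the Python below
encodes a mixed French/Russian string in ISO 8859-5, writing the one
character that encoding lacks as a numeric character reference. The
sample text is made up, and I use Python's "utf-8" and "utf-16-be"
codecs as two possible Unicode serialisations for the size comparison;
which one a given browser accepts is a separate question.

```python
from html import unescape

# Hypothetical sample text: "cafe" with e-acute, then a Russian word.
text = "caf\u00e9 \u041f\u0440\u0438\u0432\u0435\u0442"

# ISO 8859-5 has no e-acute, so it goes out as the reference &#233;
as_8859_5 = text.encode("iso8859_5", errors="xmlcharrefreplace")
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-be")

print(as_8859_5[:10])
print(len(as_8859_5), len(as_utf8), len(as_utf16))

# A correct receiver decodes the octets, then resolves the reference.
recovered = unescape(as_8859_5.decode("iso8859_5"))
print(recovered == text)
```

For text that is mostly Cyrillic, the ISO 8859-5 form comes out
shortest, and a receiver that handles numeric character references
correctly gets the French back intact.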

>I would say the LANG attribute is very appropriate (amongst others)
>to indicate a specific character mapping. (ie. "8-bits to Unicode")
>I may be wrong, but I haven't seen very much about this attribute
>lately. I thought it actually appeared in earlier versions of the
>CSS-draft. It doesn't any more.

LANG is DEFINITELY not appropriate for this. For each language, there
are quite a lot of possible character encodings, and each encoding
can represent many languages. For example, Russian can be written
in the standard Japanese, Korean, or Chinese encodings!
Also, LANG is content markup, whereas character encodings are
transmission details, turning up mostly in HTTP and not in HTML.
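
A quick Python sketch of this many-to-many relationship, assuming the
codec names that ship with Python's standard library; the sample strings
and variable names are made up for the illustration.

```python
# "Russkij" (the word for "Russian") in Cyrillic letters.
russian = "\u0420\u0443\u0441\u0441\u043a\u0438\u0439"

# One language, many encodings: three common Cyrillic encodings
# produce three different octet sequences for the same text.
cyrillic_codecs = ["koi8_r", "cp1251", "iso8859_5"]
distinct_octets = {russian.encode(name) for name in cyrillic_codecs}
print(len(distinct_octets))

# Russian also survives a round trip through East Asian encodings.
east_asian_codecs = ["euc_jp", "shift_jis", "euc_kr", "gb2312"]
round_trips = {
    name: russian.encode(name).decode(name) == russian
    for name in east_asian_codecs
}
print(round_trips)

# One encoding, many languages: French and German both fit ISO 8859-1.
french_and_german = "\u00e9t\u00e9 / sch\u00f6n"
latin1_octets = french_and_german.encode("latin-1")
print(len(latin1_octets) == len(french_and_german))
```

Knowing that a passage is Russian therefore says nothing about which of
these octet sequences was sent, and knowing the encoding says nothing
about the language.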

As for CSS, I don't know the latest draft. But a combination of LANG
and styles definitely makes sense, e.g. to indicate that all French should
be displayed in a given font, or on the other hand to say that all
document parts of a given CLASS are in a given language.

>So, How about some IDREF-linking to make things work?

I don't know IDREF-linking. But it is very easy to guess that implementing
this in the browsers that currently don't support full internationalization
would be at least as difficult as the solution I described above, which
has now been discussed for about a year and is very well accepted.

Hope this helps,	Martin.

Dr.sc.  Martin J. Du"rst			    ' , . p y f g c R l / =
Institut fu"r Informatik			     a o e U i D h T n S -
der Universita"t Zu"rich			      ; q j k x b m w v z
Winterthurerstrasse  190			     (the Dvorak keyboard)
CH-8057   Zu"rich-Irchel   Tel: +41 1 257 43 16
 S w i t z e r l a n d	   Fax: +41 1 363 00 35   Email: mduerst@ifi.unizh.ch