- From: Martin J Duerst <mduerst@ifi.unizh.ch>
- Date: Wed, 24 Jul 1996 10:49:25 +0200 (MET DST)
- To: MOURIK@rullet.LeidenUniv.nl (Hans van Mourik)
- Cc: www-international@w3.org
>
>Hello to you internationalisationisers,
>
>I would like to know how the HTML LANG-attribute should be linked
>up to a particular character-set. In fact what I'm looking for is an
>HTML-equivalent for the TEI ``writing system declarations''.
>Are there any thoughts about such a thing?

I don't know about TEI "writing system declarations", but the LANG
attribute should definitely have no connection whatsoever to character
encoding issues. LANG is real content markup, and is important for
indexing, text-to-speech, hyphenation, high-quality display, and so on.
Please always mark up your documents with this information so that
such applications can develop.

Character encoding, on the other hand, is a technicality that is
handled outside the actual HTML or SGML. SGML has the concept of a
document character set, represented as a set of positive integers.
For HTML, the document character set was ISO 8859-1 and is becoming
ISO 10646 (which means that it is just extended; nothing has to be
changed).

>What we (NHDA) would like to do is to serve documents containing
>*multiple languages*. We're not so much interested in serving a
>directory with multiple translations of the same instance.

Nice. This is what many people would like to do, and what some
already do.

>Consider a document containing both French, German and Russian.
>HTML 3.2 offers us the possibility of marking divisions, paragraphs,
><span>'s and so on with lang="ru" | lang="fr" | lang="de".
>But then what?

This information is important for the reasons put forward above.
There is no "so what?" here!

>Now suppose the HTTP Charset-header is set to some Russian character-
>encoding (Ms. codepage 1251, KOI-8R or ISO 8859-5 -- you may pick your
>choice).

Don't "pick your choice". Help streamline the character encoding
mess to improve interoperability. The web did a lot in this respect
for Western Europe by sticking to ISO 8859-1.
It would be nice if the same happened for Cyrillic and ISO 8859-5
(which is the one mentioned in all relevant RFCs and such).

>What happens to entities like &eacute; and &ouml;? Browsers like
>Navigator, Explorer and Mosaic will map them blindly to #233 and #246.
>And so they'll appear as arbitrary Russian characters.

This is incorrect behaviour due to the laziness of the software makers
involved. They know this, but as so often, i18n is not the first item
on their list of priorities. Please do complain to them about this
buggy behaviour. Not only should it be clear that &eacute; always has
to be what it says, but &#233; also always has to be displayed as
e-acute. If you want to include a Cyrillic character with a numeric
character reference, you have to use e.g. &#1041;, which is the decimal
representation of U+0411, Cyrillic capital letter Be.

> (How about Arena/Amaya -- I haven't checked that one).

At least the browsers from Alis and from Accent should do it correctly.

>Do we have to publish it in Unicode instead then? -- ie. let most
>browsers just break and wait for the *perfect browser* to come along.

No, you don't. You can use ISO 8859-5 as indicated above. The document
length in octets may be shorter. And many browsers will correctly
display at least the Cyrillic part, even if they mess up the French.
On the other hand, those browsers that do everything correctly will
usually also accept Unicode.

>I would say the LANG attribute is very appropriate (amongst others)
>to indicate a specific character mapping. (ie. "8-bits to Unicode")
>I may be wrong, but I haven't seen very much about this attribute
>lately. I thought it actually appeared in earlier versions of the
>CSS-draft. It doesn't any more.

LANG is DEFINITELY not appropriate for this. For each language, there
are quite a lot of possible character encodings, and each encoding can
represent many languages. For example, Russian can be written in the
standard Japanese, Korean, or Chinese encodings!
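
The many-to-many relation between languages and encodings can be checked with a minimal Python sketch. It assumes that the standard-library codecs `shift_jis`, `euc_kr` and `gb2312` are fair stand-ins for "the standard Japanese, Korean, or Chinese encodings"; the sample text and the helper name `encode_everywhere` are made up for illustration.

```python
# Hypothetical sample text; any plain Russian string would do.
RUSSIAN = "Русский текст"

# Three Cyrillic encodings named in the thread, plus one Japanese,
# one Korean and one Chinese codec from the Python standard library.
ENCODINGS = ["iso8859_5", "koi8_r", "cp1251", "shift_jis", "euc_kr", "gb2312"]


def encode_everywhere(text, encodings=ENCODINGS):
    """Return a dict mapping each encoding name to the encoded octets."""
    return {name: text.encode(name) for name in encodings}


encoded = encode_everywhere(RUSSIAN)
for name, raw in encoded.items():
    # Same language, different octets; each encoding round-trips the text.
    assert raw.decode(name) == RUSSIAN
    print(f"{name:10} {len(raw):2d} octets")
```

Since one Russian string survives a round trip through all six encodings, the language alone cannot tell a browser which encoding the octets are in.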
Also, LANG is content markup, whereas character encodings are
transmission details, turning up mostly in HTTP and not in HTML.

As for CSS, I don't know the latest draft. But a combination of LANG
and styles definitely makes sense, e.g. to indicate that all French
should be displayed in a given font, or on the other hand to say that
all document parts of a given CLASS are in a given language.

>So, How about some IDREF-linking to make things work?

I don't know IDREF-linking. But it is very easy to guess that
implementing this in the browsers that currently don't support full
internationalization would be at least as difficult as the solution I
described above, which has now been discussed for about a year and is
very well accepted.

Hope this helps,	Martin.

----
Dr.sc.  Martin J. Du"rst                ' , . p y f g c R l / =
Institut fu"r Informatik                 a o e U i D h T n S -
der Universita"t Zu"rich                  ; q j k x b m w v z
Winterthurerstrasse  190                 (the Dvorak keyboard)
CH-8057   Zu"rich-Irchel   Tel: +41 1 257 43 16
 S w i t z e r l a n d     Fax: +41 1 363 00 35   Email: mduerst@ifi.unizh.ch
----
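
The rule stated in the message, that a character reference always names a position in the document character set and never an octet of the transfer encoding, can be sketched in Python. This is an illustration under assumptions: the standard-library `html.unescape` stands in for a conforming browser, the helper name `render` and the sample document are hypothetical, and the "buggy" line only imitates the behaviour described for the browsers of the time.

```python
import html


def render(octets, charset):
    """Decode the transferred octets first, then resolve references.

    Character references are resolved against Unicode (ISO 10646),
    independently of the charset used for transmission.
    """
    return html.unescape(octets.decode(charset))


# Correct behaviour: references resolve to fixed code points.
e_acute_named = html.unescape("&eacute;")   # U+00E9
e_acute_numeric = html.unescape("&#233;")   # U+00E9
cyrillic_be = html.unescape("&#1041;")      # U+0411

# Imitation of the buggy behaviour: 233 is treated as an octet of
# the page's encoding and so turns into a Cyrillic letter.
buggy = bytes([233]).decode("iso8859_5")

# A mixed Russian/French document sent as ISO 8859-5: the Cyrillic
# travels as raw octets, the e-acute as a character reference.
# (Hypothetical sample text.)
document = "Привет, caf&eacute;".encode("iso8859_5")
print(render(document, "iso8859_5"))
```

Resolving references only after decoding is what lets a single ISO 8859-5 document carry both the Cyrillic text and the accented French letters.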
Received on Wednesday, 24 July 1996 04:49:24 UTC