- From: Martin J. Duerst <mduerst@ifi.unizh.ch>
- Date: Sat, 8 Mar 1997 11:27:08 +0100 (MET)
- To: lee@sq.com
- cc: unicode@unicode.org, www-international@w3.org
On Sat, 8 Mar 1997 lee@sq.com wrote:

> M.T. Carrasco Benitez <carrasco@innet.lu> wrote:
>
> > There is a need to indicate monolingual docs. <HTML LANG=...> look like
> > the right place as the meaning is "if I do not indicate otherwise, the
> > text in this document is in language xx". So, it should expect that the
> > bulk of the language be the one indicate in <HTML LANG...>.
>
> This seems reasonable to me...

According to the definition in RFC 2070, the exact meaning of <HTML LANG=xxx> is that everything not marked as being in another language is in xxx. This can range from the whole document being in xxx to a document that contains not a single word in xxx. The latter case does not make much sense in practical terms, but it is perfectly legal according to RFC 2070.

> > For the document you mentioned, it would probably be better not to
> > indicate the language in the <HTML LANG...> and to mark the English like
> > the other languages as the doc is clearly multilingual.
>
> Is this a document with parallel translations? If so, footnotes may
> be in one language (say), or one language may be Old Church Slavonic
> and the other Old English, in which case you are probably right.

The document in question is multilingual. Much less than 50% of the text is in English. Nevertheless, it makes sense to have <HTML LANG=en>, because the main language of the document is English. If you understand English, you will understand what the document is all about, what its structure is, and so on. It's an English multilingual document. All the nice texts in the many languages are just displays, in the same way that another document might contain many images (and would still be considered an English document).

> But I would expect that HTML editing software would always by default
> put the author's editing locale's language in the HTML LANG attribute
> unless it was specifically overridden. It's hard for software to
> detect an author's intent.

Getting software to help authors put in correct tagging is indeed an important and difficult problem. It's difficult to decide to what extent heuristics like the above will work, and what percentage of wrong tagging (as compared to the always correct but useless non-tagging) can be tolerated.

> I think explicit rules are needed on what counts the majority
> language. Example rules might include
> [1] you can't understand enough of this to make sense unless
>     you're fluent in Japanese and Old Frisian
> [2] 51% or more of the text characters in this document correspond
>     to Hindi, so that's the majority language
> [3] 51% or more of the glyphs ....
> [4] 51% or more of the pixels set at 100 dpi... :-)
>
> If such rules are already in place, we can stop this discussion.
> If not, it seems they're needed. Number [2] seems easiest to compute
> automatically, and number [1] seems the most useful but can't be
> set automatically.

Such bean counting is quite useless, in my opinion. What counts is structure. A scientific paper in English about some Sanskrit poem may be about as easy or difficult for the average English reader to understand as a paper on quantum physics, independent of whether it contains 17% or 59% Sanskrit. Structurally, all these papers are English papers. If you had to translate them into French, you would translate the English, not the Sanskrit and not the physics.

A general comment: As we have seen in this discussion up to now, there are many different needs for language information about documents. Proposing one specific interpretation of an already well-defined way to indicate language in an HTML document, in order to satisfy one specific information need that appeared in one place, is not a long-lasting approach to solving the information needs we have. I would suggest attacking the problem in a wider frame, e.g. by looking at metadata (DC or other) and seeing how it can be used to satisfy the various needs already expressed and the many more that will appear in the future.

Regards,
Martin.
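
The RFC 2070 inheritance rule and the character-counting rule [2] discussed in this message can be compared on a small example. What follows is only a minimal sketch, assuming Python's standard html.parser module; the names LangTracker, language_shares and SAMPLE are hypothetical and are not taken from RFC 2070 or from any tool mentioned in the thread. Each text run gets the LANG of its nearest enclosing element that sets one, and the share of non-space characters per language is then tallied.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements without an end tag; they are not pushed on the stack.
VOID_ELEMENTS = {"br", "hr", "img", "meta", "link", "input"}


class LangTracker(HTMLParser):
    """Record each text run together with the language it inherits."""

    def __init__(self):
        super().__init__()
        self.stack = []  # (tag, effective language) for each open element
        self.runs = []   # (effective language or None, text)

    def current_lang(self):
        return self.stack[-1][1] if self.stack else None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        # An element's own LANG wins; otherwise it inherits from its parent.
        lang = dict(attrs).get("lang") or self.current_lang()
        self.stack.append((tag, lang))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag (tolerates omitted end tags).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.runs.append((self.current_lang(), text))


def language_shares(html):
    """Return {language: fraction of non-space characters}."""
    parser = LangTracker()
    parser.feed(html)
    parser.close()
    counts = Counter()
    for lang, text in parser.runs:
        counts[lang] += sum(1 for ch in text if not ch.isspace())
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()} if total else {}


# Hypothetical document: English frame, French and German exhibits.
SAMPLE = """
<HTML LANG=en>
<BODY>
<P>Greetings:</P>
<P LANG=fr>Bonjour tout le monde</P>
<P LANG=de>Guten Morgen zusammen</P>
<P>End.</P>
</BODY>
</HTML>
"""

if __name__ == "__main__":
    for lang, share in sorted(language_shares(SAMPLE).items()):
        print(f"{lang}: {share:.0%}")
```

For this sample, English comes out as a minority of the characters even though <HTML LANG=en> is a perfectly legal declaration. The count says nothing about which language carries the structure of the document, which is the point made above about bean counting.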
Received on Saturday, 8 March 1997 05:26:28 UTC