Re: Natural language marking in HTML

On Sat, 8 Mar 1997 lee@sq.com wrote:

> M.T. Carrasco Benitez <carrasco@innet.lu> wrote:
> 
> > There is a need to indicate monolingual docs. <HTML LANG=...> looks like
> > the right place as the meaning is "if I do not indicate otherwise, the
> > text in this document is in language xx".  So, one should expect that the
> > bulk of the text be in the language indicated in <HTML LANG...>.
> 
> This seems reasonable to me...

By the definition in RFC 2070, the exact meaning of <HTML LANG=xxx>
is that everything not marked as being in another language is in xxx.
This can range from the whole document being in xxx to a document
that contains not a single word in xxx. The latter case does not
make much sense in practical terms, but is perfectly legal
according to RFC 2070.
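
To make that reading concrete, here is a minimal sketch of the
inheritance rule, not a conforming implementation. It assumes
well-formed markup with explicit end tags, and the names
LangResolver, resolve and SAMPLE are hypothetical, invented for
this illustration. SAMPLE is the extreme case: LANG on the root
says "en", yet no text run actually resolves to English.

```python
from html.parser import HTMLParser

# Elements without end tags; they must not be pushed on the stack.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class LangResolver(HTMLParser):
    """Record the effective language of every non-empty text run."""

    def __init__(self):
        super().__init__()
        self.stack = []  # effective language of each open element
        self.runs = []   # (effective language, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        inherited = self.stack[-1] if self.stack else None
        own = dict(attrs).get("lang")
        # An element's own LANG wins; otherwise it inherits.
        self.stack.append(own or inherited)

    def handle_endtag(self, tag):
        if tag not in VOID and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            lang = self.stack[-1] if self.stack else None
            self.runs.append((lang, text))


def resolve(html):
    parser = LangResolver()
    parser.feed(html)
    parser.close()
    return parser.runs


SAMPLE = (
    '<html lang="en"><body>'
    '<p lang="fr">Bonjour</p>'
    '<p lang="de">Guten Tag</p>'
    '</body></html>'
)

print(resolve(SAMPLE))
# [('fr', 'Bonjour'), ('de', 'Guten Tag')]
```

Text that carries no marking of its own, as in
resolve('<html lang="en"><p>Hello</p></html>'), comes out as
[('en', 'Hello')].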


> > For the document you mentioned, it would probably be better not to
> > indicate the language in the <HTML LANG...> and to mark the English like
> > the other languages as the doc is clearly multilingual.
> 
> Is this a document with parallel translations?  If so, footnotes may
> be in one language (say), or one language may be Old Church Slavonic
> and the other Old English, in which case you are probably right.

The document in question is multilingual. Much less than 50%
of the text is in English. Nevertheless, it makes sense to
have <HTML LANG=en>, because the main language of the document
is English. If you understand English, you will understand
what the document is all about, what its structure is, and
so on. It's an English multilingual document. All the nice
texts in the many languages are just displays, much as another
document might contain many images (and would still be considered
an English document).



> But I would expect that HTML editing software would always by default
> put the author's editing locale's language in the HTML LANG attribute
> unless it was specifically overridden.  It's hard for software to
> detect an author's intent.

Making software help authors put in correct tagging is
an important and difficult problem indeed. It's difficult to
decide to what extent heuristics like the above will work,
and what percentage of wrong tagging (as compared to
non-tagging, which is never wrong but is useless) can be
tolerated.


> I think explicit rules are needed on what counts as the majority
> language.  Example rules might include
>     [1] you can't understand enough of this to make sense unless
>         you're fluent in Japanese and Old Frisian
>     [2] 51% or more of the text characters in this document correspond
>         to Hindi, so that's the majority language
>     [3] 51% or more of the glyphs ....
>     [4] 51% or more of the pixels set at 100 dpi... :-)
> 
> If such rules are already in place, we can stop this discussion.
> If not, it seems they're needed.  Number [2] seems easiest to compute
> automatically, and number [1] seems the most useful but can't be
> set automatically.

Such bean counting is quite useless, in my opinion.
What counts is structure. A scientific paper about some Sanskrit
poem written in English may be about as easy or difficult
for the average English reader to understand as a paper on
quantum physics, regardless of whether it contains 17% or 59%
Sanskrit. Structurally, all these papers are English papers.
If you had to translate them into French, you would translate
the English, not the Sanskrit and not the physics.
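
For illustration only, here is roughly what rule [2] computes,
under the assumption that the text has already been split into
language-tagged runs. The names char_share and majority are
hypothetical, and the runs are placeholder strings chosen to hit
the 59% figure above, not real text.

```python
from collections import Counter


def char_share(runs):
    """Fraction of non-whitespace characters per language tag."""
    counts = Counter()
    for lang, text in runs:
        counts[lang] += sum(1 for ch in text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: n / total for lang, n in counts.items()}


def majority(runs):
    """Language with the largest character share, or None."""
    shares = char_share(runs)
    return max(shares, key=shares.get) if shares else None


# Placeholder paper: 41 characters of English framing,
# 59 characters of quoted Sanskrit.
paper = [("en", "x" * 41), ("sa", "y" * 59)]

print(majority(paper))
# sa
```

The count answers "sa", although everything a translator would
work on in such a paper is the English.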


A general comment:

As we have seen in this discussion up to now, there are many
different needs for language information about documents.

Proposals for one specific interpretation of one already
well-defined way to indicate language in an HTML document,
made to satisfy one specific information need that appeared in
one place, are not a long-lasting approach to solving the
information needs we have.

I would suggest attacking the problem in a wider frame,
e.g. looking at metadata (Dublin Core or other) to see how it
can be used to satisfy the various needs already expressed
and the many more that will appear in the future.

Regards,	Martin.

Received on Saturday, 8 March 1997 05:26:28 UTC