Re: Natural language marking in HTML

From: <lee@sq.com>
Date: Sat, 8 Mar 97 04:47:08 EST
Message-Id: <9703080947.AA11920@sqrex.sq.com>
To: unicode@unicode.org, www-international@w3.org

M.T. Carrasco Benitez <carrasco@innet.lu> wrote:

> There is a need to indicate monolingual docs. <HTML LANG=...> looks like
> the right place, as the meaning is "if I do not indicate otherwise, the
> text in this document is in language xx".  So one should expect that the
> bulk of the text be in the language indicated in <HTML LANG...>.

This seems reasonable to me...

> For the document you mentioned, it would probably be better not to
> indicate the language in the <HTML LANG...> and to mark the English like
> the other languages as the doc is clearly multilingual.

Is this a document with parallel translations?  If so, the footnotes may
be in one language (say), or one language may be Old Church Slavonic
and the other Old English, in which case you are probably right.

But I would expect that HTML editing software would, by default, always
put the language of the author's editing locale in the HTML LANG
attribute unless it was specifically overridden.  It's hard for software
to detect an author's intent.
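
To make that default concrete, here is a rough sketch of what such an
editor might do.  The function name, the locale-string format and the
handling of the "C"/"POSIX" locales are all my own assumptions for
illustration, not the behaviour of any particular product.

```python
def html_open_tag(editing_locale, override=None):
    """Build an <HTML> start tag for a new document.

    Hypothetical helper: editing_locale is assumed to be a POSIX-style
    locale name such as "fr_CA.UTF-8"; override is a language tag the
    author chose explicitly.
    """
    if override:
        # The author said what they meant, so believe them.
        lang = override
    else:
        # "fr_CA.UTF-8@euro" -> "fr_CA" -> "fr-CA"
        base = editing_locale.split(".")[0].split("@")[0]
        if base in ("", "C", "POSIX"):
            # These locales name no language, so emit no LANG at all.
            return "<HTML>"
        lang = base.replace("_", "-")
    return f'<HTML LANG="{lang}">'


print(html_open_tag("fr_CA.UTF-8"))           # locale default
print(html_open_tag("en_US", override="ja"))  # explicit override wins
```

The sketch only shows the default; it does nothing to check that the
text actually typed is in that language.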

I think explicit rules are needed on what counts as the majority
language.  Example rules might include
    [1] you can't understand enough of this to make sense unless
        you're fluent in Japanese and Old Frisian
    [2] 51% or more of the text characters in this document correspond
        to Hindi, so that's the majority language
    [3] 51% or more of the glyphs ....
    [4] 51% or more of the pixels set at 100 dpi... :-)

If such rules are already in place, we can stop this discussion.
If not, it seems they're needed.  Number [2] seems easiest to compute
automatically, and number [1] seems the most useful but can't be
set automatically.
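
As a rough illustration of how rule [2] could be computed, here is a
minimal sketch.  It makes several assumptions that are mine and not part
of any spec: it counts only letters, it approximates a character's
script by the first word of its Unicode character name, and it treats
"majority" as strictly more than half.  Note also that a script is not a
language: a Devanagari majority is consistent with Hindi but equally
with Marathi or Nepali, so this can only hint at the LANG value.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Approximate a character's script by the first word of its name.

    For example "LATIN SMALL LETTER A" gives "LATIN".  This is a
    heuristic, not the real Unicode Script property.
    """
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        # Character has no name in the Unicode database.
        return "UNKNOWN"


def majority_script(text, threshold=0.5):
    """Return the script used by more than `threshold` of the letters,
    or None if there are no letters or no script exceeds the threshold.
    """
    # Only letters are counted; digits, spaces, punctuation and
    # combining marks are ignored.
    counts = Counter(script_of(ch) for ch in text if ch.isalpha())
    total = sum(counts.values())
    if total == 0:
        return None
    script, n = counts.most_common(1)[0]
    return script if n / total > threshold else None


print(majority_script("\u0915\u0916\u0917 ab"))  # 3 Devanagari letters, 2 Latin
print(majority_script("ab\u0915\u0916"))         # an even split: no majority
```

An even split returns None, which is the case where an editor would
presumably have to leave LANG off or ask the author.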

Received on Saturday, 8 March 1997 04:47:08 UTC
