Re: Natural language marking in HTML from M.T. Carrasco Benitez on 1997-03-08 (www-international@w3.org from January to March 1997)

From: M.T. Carrasco Benitez <carrasco@innet.lu>
Date: Sat, 8 Mar 1997 11:17:09 +0100 (MET)
To: lee@sq.com
cc: unicode@unicode.org, www-international@w3.org
Message-ID: <Pine.LNX.3.95.970308110429.27794A-100000@localhost>

> > For the document you mentioned, it would probably be better not to
> > indicate the language in the <HTML LANG...> and to mark the English like
> > the other languages as the doc is clearly multilingual.
> 
> Is this a document with parallel translations?

Yes.

> If so, footnotes may
> be in one language (say), or one language may be Old Church Slavonic
> and the other Old English, in which case you are probably right.

It is a document divided into sections, each one in a single language.

> But I would expect that HTML editing software would always by default
> put the author's editing locale's language in the HTML LANG attribute
> unless it was specifically overridden.  It's hard for software to
> detect an author's intent.

This could be a behaviour, though it is not defined in the draft.

> I think explicit rules are needed on what counts as the majority
> language.  Example rules might include
>     [1] you can't understand enough of this to make sense unless
>         you're fluent in Japanese and Old Frisian
>     [2] 51% or more of the text characters in this document correspond
>         to Hindi, so that's the majority language
>     [3] 51% or more of the glyphs ....
>     [4] 51% or more of the pixels set at 100 dpi... :-)

The draft recommends marking the language explicitly, not computing the
language.  If the author wants to use certain tools, it is up to him;
this is not covered by the draft.
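
As a rough illustration of explicit marking (a sketch only, not text from
the draft): the sample document below leaves LANG off the HTML element and
marks each section instead.  The use of DIV as the section container, the
sample sentences and the LangCollector helper are assumptions made for this
example; the parser is Python's standard html.parser, which reports tag and
attribute names in lower case.

```python
from html.parser import HTMLParser

# Hypothetical multilingual document: no LANG on <HTML>,
# each section carries its own LANG attribute.
DOC = """<HTML>
<BODY>
<DIV LANG="en"><P>Good morning.</P></DIV>
<DIV LANG="fr"><P>Bonjour.</P></DIV>
<DIV LANG="es"><P>Buenos dias.</P></DIV>
</BODY>
</HTML>"""


class LangCollector(HTMLParser):
    """Collect (tag, language) pairs for every element that has LANG."""

    def __init__(self):
        super().__init__()
        self.langs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names.
        for name, value in attrs:
            if name == "lang":
                self.langs.append((tag, value))


collector = LangCollector()
collector.feed(DOC)
section_langs = collector.langs
print(section_langs)
```

Nothing here is computed from the text itself; the tool only reads what the
author marked.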

> 
> If such rules are already in place, we can stop this discussion.
> If not, it seems they're needed.

The rules are not in place and the present draft is aiming much
lower; i.e., at minimal recommendations regarding the marking of natural
languages in HTML docs.

 http://www.crpht.lu/~carrasco/winter/lama.html 

> Number [2] seems easiest to compute automatically, and number [1] seems
> the most useful but can't be set automatically.

Not really.  One cannot trust the characters to indicate the language, as
one character repertoire can be used by more than one language.
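
A small sketch of why character counting falls short, assuming one
approximates the "repertoire" of a string by the first word of each letter's
Unicode character name (a simplification chosen for this example, not a
method from the draft).  The scripts_used helper and the sample phrases are
hypothetical.

```python
import unicodedata


def scripts_used(text):
    """Return the set of script names found among the letters of text.

    The script is approximated by the first word of the Unicode
    character name, e.g. 'LATIN CAPITAL LETTER G' -> 'LATIN'.
    """
    return {
        unicodedata.name(ch).split()[0]
        for ch in text
        if ch.isalpha()
    }


english = scripts_used("Good morning")
french = scripts_used("Bonjour")
spanish = scripts_used("Buenos dias")

# All three phrases report the same repertoire, so the characters
# alone cannot say which language each phrase is written in.
print(english, french, spanish)
```

The three phrases are in three different languages, yet each yields the
same single script, which is the reason for marking the language rather
than deriving it from the characters.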

Tomas

Received on Saturday, 8 March 1997 05:12:23 UTC