Re: Problem with LANG keyword

Reuven Nisser <rnisser@ofek-liyladenu.org.il>:
>
> <body lang="en,he,ar" dir="ltr">
> <p>The following are two letters in Hebrew,
> &05D0; &05D1;
> while these are three Arabic letters,
> &0644; &0647; &062C;.
>
> You can still "know" automatically which part is Arabic, which is Hebrew
> and which is English.

Actually, I only recognize the English text and two sets of characters from
different alphabets; I cannot tell whether they form actual words. I can (up
to a point) distinguish between several alphabets, and a computer is even
better at that, but without further information neither I nor a computer can
know which language the characters belong to (except in a few cases). With
the information from 'body lang="en,he,ar"' I could go on to relate
characters to languages, but that only works in a quite limited way, i.e.
when each of the languages uses its own script. Imagine how many languages
use the Latin, Greek or Cyrillic scripts, which share some letter shapes
(e.g. uppercase Latin H, Greek Eta and Cyrillic En look the same) and are
thus harder to distinguish than the scripts in your example. Explicitly
marking up the smaller parts that are in a language other than the
document's main one is surely the better and more computer-friendly
solution:

 <body lang="en"><p>
 The following are two letters in Hebrew,
 <samp lang="he">&#x05D0; &#x05D1;</samp>
 while these are three Arabic letters,
 <samp lang="ar">&#x0644; &#x0647; &#x062C;</samp>.
 </p></body>
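
To illustrate what a computer can find out on its own, here is a minimal
Python sketch. It guesses the script of a character from the first word of
its Unicode character name. The helper name script_of is my own invention,
and taking the first word of the name is only a rough heuristic, not a real
script lookup. Note that it yields a script, never a language.

```python
import unicodedata

def script_of(ch):
    """Rough guess (assumption): the first word of the Unicode character
    name, e.g. 'HEBREW LETTER ALEF' gives 'HEBREW'."""
    return unicodedata.name(ch).split()[0]

# The letters from the example: two Hebrew, three Arabic.
for ch in "\u05D0\u05D1\u0644\u0647\u062C":
    print(f"U+{ord(ch):04X}", script_of(ch))

# Latin H, Greek Eta and Cyrillic En look alike on screen,
# but they are three different code points in three scripts.
lookalikes = [script_of(ch) for ch in "H\u0397\u041D"]
print(lookalikes)
```

So the look-alike letters can be told apart by code point, but whether a run
of Latin letters is English, German or Polish is still unknown without
lang markup.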

> So, marking the whole text as English, Hebrew and Arabic is enough.

In this special case maybe, but in general you can't distinguish languages
by the scripts they use. Your idea already fails with English + Hebrew +
Yiddish, because Yiddish is written in the Hebrew script too.
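
A small Python sketch of that failure case, under the same assumption as
before that the first word of the Unicode character name is a good enough
stand-in for the script. The helper name scripts_in and the two sample words
are my own choice. Yiddish is written in Hebrew letters, so both words
report the same script:

```python
import unicodedata

def scripts_in(text):
    """Set of rough script guesses for the letters in text
    (first word of each Unicode character name; an assumption)."""
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}

hebrew_word = "\u05D8\u05D5\u05D1"   # "tov", Hebrew for "good"
yiddish_word = "\u05D2\u05D5\u05D8"  # "gut", Yiddish for "good"

print(scripts_in(hebrew_word), scripts_in(yiddish_word))
```

A script-based guess therefore cannot decide between "he" and "yi" here;
only explicit lang markup on the element can.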

Received on Tuesday, 23 September 2003 19:38:16 UTC