- From: Christoph Päper <christoph.paeper@tu-clausthal.de>
- Date: Wed, 24 Sep 2003 01:38:13 +0200
- To: <www-html@w3.org>
Reuven Nisser <rnisser@ofek-liyladenu.org.il>:
>
> <body lang="en,he,ar" dir="ltr">
> <p>The following are two letters in Hebrew,
> &05D0; &05D1;
> while these are three Arabic letters,
> &0644; &0647; &062C;.
>
> You can still "know" automatically which part is Arabic, which is Hebrew and
> which is English.

Actually, I only recognize the English text and two sets of characters from different alphabets; I cannot tell whether they form actual words. I can (up to a certain level) distinguish between several alphabets. A computer is even better at that, but neither I nor a computer can know, without further information, which language a run of characters belongs to (except in a few cases).

With the additional information from 'body lang="en,he,ar"' I could further relate characters to languages, but that only works in a quite limited way, i.e. when each of the languages usually uses its own script. Imagine how many languages use the Latin, Greek or Cyrillic scripts, which share some letter shapes (e.g. uppercase H, Eta and En look the same) and are thus harder to distinguish than those in your example.

Explicitly marking up the smaller parts that are in a language other than the document's main one is surely the better and more computer-friendly solution:

  <body lang="en"><p>
    The following are two letters in Hebrew,
    <samp lang="he">&#x05D0; &#x05D1;</samp>
    while these are three Arabic letters,
    <samp lang="ar">&#x0644; &#x0647; &#x062C;</samp>.
  </p></body>

> So, marking the whole text as English, Hebrew and Arabic is enough.

In this special case maybe, but in general you can't distinguish languages by the scripts used. Your idea even fails with English + Hebrew + Yiddish, since Yiddish is written in the Hebrew script.
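To illustrate the point about scripts versus languages, here is a rough Python sketch. The helper name scripts_in is hypothetical, and it assumes that the first word of a character's Unicode name is a usable stand-in for its script, which is only an approximation. It shows that a program can separate Hebrew from Arabic letters, but sees Hebrew and Yiddish text as the same script, and sees three look-alike capitals as three different scripts.

```python
import unicodedata


def scripts_in(text):
    """Return the set of scripts guessed for the letters in text.

    Assumption: the first word of the Unicode character name
    (e.g. 'HEBREW' in 'HEBREW LETTER ALEF') approximates the script.
    """
    found = set()
    for ch in text:
        # Skip spaces, punctuation and combining marks; keep letters only.
        if ch.isalpha():
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found


hebrew_letters = "\u05D0 \u05D1"                    # alef, bet
arabic_letters = "\u0644 \u0647 \u062C"             # lam, heh, jeem
yiddish_word = "\u05D9\u05D9\u05B4\u05D3\u05D9\u05E9"  # "Yiddish" in Yiddish
lookalikes = "H\u0397\u041D"                        # Latin H, Greek Eta, Cyrillic En

for sample in (hebrew_letters, arabic_letters, yiddish_word, lookalikes):
    print(sorted(scripts_in(sample)))
```

The Hebrew letters and the Yiddish word both come out as the same single script, so the script alone cannot say which of the two languages a passage is in; only explicit lang markup carries that information.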
Received on Tuesday, 23 September 2003 19:38:16 UTC