RE: Problem with LANG keyword

From: Reuven Nisser <rnisser@ofek-liyladenu.org.il>
Date: Tue, 23 Sep 2003 23:53:06 +0300
To: "BIGELOW,JIM (HP-Boise,ex1)" <jim.bigelow@hp.com>
Cc: <www-html@w3.org>, "'shaula haitner'" <shaula@shaula.co.il>, "'Yuval Rabinovich'" <yuval@faz.co.il>, "'Gertel Hasson'" <gilagh@netvision.net.il>
Message-ID: <EOEHIKCGOKGNIEEKJHEKIEJDDGAA.rnisser@ofek-liyladenu.org.il>

Hello,
It does not matter whether I use Unicode or encode the text your way. See the
following markup:

<body lang="en,he,ar" dir="ltr">
<p>The following are two letters in Hebrew,
&#x05D0; &#x05D1;
while these are three Arabic letters,
&#x0644; &#x0647; &#x062C;.
The letter forms both evolved from the ancient
Aramaic alphabet.
</p>
</body>

You can still "know" automatically which part is Arabic, which is Hebrew and
which is English. So, marking the whole text as English, Hebrew and Arabic
is enough.
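
As a rough illustration of the "automatic" detection described above, here is
a minimal Python sketch that labels each character by the script named in its
Unicode character name and groups the text into runs. The helper names
script_of and split_runs are hypothetical. The sketch assumes that script
implies language (Hebrew letters mean "he", Arabic letters mean "ar", Latin
letters mean "en"), which is an approximation and does not hold in general.

```python
import unicodedata

def script_of(ch):
    """Guess a language label from the Unicode character name (hypothetical helper)."""
    name = unicodedata.name(ch, "")
    if name.startswith("HEBREW"):
        return "he"
    if name.startswith("ARABIC"):
        return "ar"
    if name.startswith("LATIN"):
        return "en"  # assumption: Latin script is treated as English
    return None  # spaces, digits, punctuation: no script of their own

def split_runs(text):
    """Group consecutive characters that share a label into (label, text) runs."""
    runs = []
    for ch in text:
        label = script_of(ch)
        if label is None:
            # Neutral characters join the current run, if there is one.
            if runs:
                runs[-1][1] += ch
            continue
        if runs and runs[-1][0] == label:
            runs[-1][1] += ch
        else:
            runs.append([label, ch])
    return [(label, chunk.strip()) for label, chunk in runs]

for label, chunk in split_runs("Hebrew \u05D0 \u05D1 and Arabic \u0644 \u0647 \u062C."):
    print(label, chunk)
```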

Now, using 8 bit mode:

<body lang="en,he" dir="ltr">
<p>The following are two letters in Hebrew,
&#224; &#225;
</p>
</body>

Or using a text created with Notepad on Hebrew Windows:

<body lang="en,he" dir="ltr">
<p>The following are two letters in Hebrew,
א ב
</p>
</body>

The same holds here: you know automatically which part is Hebrew and which is
English.
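
A small Python sketch of the 8-bit case, for illustration only. It assumes the
text was saved as Windows-1255 (Python codec "cp1255"), which is an assumption
about what Notepad on Hebrew Windows would write. Under that assumption the
byte values 224 and 225 decode to two Hebrew letters, while the same bytes
read as Latin-1 are accented Latin letters. A numeric reference such as
&#224; is a separate case: it is resolved as a Unicode code point, not
through the page encoding.

```python
import html

raw = bytes([224, 225])  # the two byte values from the 8-bit example

# Decoded with a Hebrew code page (assumed: Windows-1255), the bytes are alef and bet.
hebrew = raw.decode("cp1255")
print([hex(ord(c)) for c in hebrew])   # ['0x5d0', '0x5d1']

# Decoded as Latin-1, the same bytes are accented Latin letters instead.
latin = raw.decode("latin-1")
print([hex(ord(c)) for c in latin])    # ['0xe0', '0xe1']

# Numeric character references are resolved against Unicode, not the page encoding.
print(hex(ord(html.unescape("&#224;"))))  # 0xe0
```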

Regards,
Reuven Nisser
Ofek Liyladenu

-----Original Message-----
From: BIGELOW,JIM (HP-Boise,ex1) [mailto:jim.bigelow@hp.com]
Sent: Tuesday, September 23, 2003 8:21 PM
To: Reuven Nisser
Subject: RE: Problem with LANG keyword



Reuven Nisser wrote:
> ...
> This is especially true when using Unicode. There one can mix
> Hebrew, Arabic and English in the same text without any conflict.
> ...

The report <cite>Unicode in XML and other Markup Languages</cite> [1]
discusses the many situations where markup is preferred over Unicode
characters for encoding information about structure and presentation.  See
Section 3.9 Language Tag Characters [2].

Therefore, I think that use of the language attribute in elements that
enclose spans of text from a given language is preferred over discovering
the language based on the Unicode character.  For example:

<body lang="en" dir="ltr">
<p>The following are two letters in Hebrew,
<q lang="he" dir="rtl">&#x05D0; &#x05D1;</q>
while these are three Arabic letters,
<q lang="ar" dir="rtl">&#x0644; &#x0647; &#x062C;</q>.
The letter forms both evolved from the ancient
Aramaic alphabet.
</p>
</body>
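
To illustrate how such markup carries the language explicitly, the following
Python sketch walks a document with the standard html.parser module and
reports the language in effect for each text run, taking the nearest enclosing
lang attribute. The class name LangTracker is hypothetical, and the sketch
assumes properly nested tags with no unclosed elements.

```python
from html.parser import HTMLParser

class LangTracker(HTMLParser):
    """Record (lang, text) pairs using the nearest enclosing lang attribute."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = [None]   # effective language for each open element
        self.runs = []

    def handle_starttag(self, tag, attrs):
        # An element without lang inherits the language of its parent.
        self.stack.append(dict(attrs).get("lang", self.stack[-1]))

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.runs.append((self.stack[-1], data.strip()))

doc = (
    '<body lang="en"><p>Two letters in Hebrew, '
    '<q lang="he" dir="rtl">&#x05D0; &#x05D1;</q> '
    'and three Arabic letters, '
    '<q lang="ar" dir="rtl">&#x0644; &#x0647; &#x062C;</q>.</p></body>'
)
tracker = LangTracker()
tracker.feed(doc)
tracker.close()
for lang, text in tracker.runs:
    print(lang, text)
```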

Jim Bigelow

[1] http://www.w3.org/TR/unicode-xml/
[2] http://www.w3.org/TR/unicode-xml/#Language
Received on Tuesday, 23 September 2003 16:53:22 GMT
