RE: Problem with LANG keyword from Reuven Nisser on 2003-09-24 (www-html@w3.org from September 2003)

From: Reuven Nisser <rnisser@ofek-liyladenu.org.il>
Date: Wed, 24 Sep 2003 10:13:43 +0200
To: "David Woolley" <david@djwhome.demon.co.uk>, <www-html@w3.org>
Message-ID: <EOEHIKCGOKGNIEEKJHEKEEHADCAA.rnisser@ofek-liyladenu.org.il>

Hello David,
I am not saying that language information should not be used. What I am
saying is that more than one language should be allowed to be active at
the same time.

If it were possible to write:
<HTML LANG="HE,AR,EN">
then when you get to Hebrew characters you would know exactly that we are
speaking about Hebrew and not Yiddish, Ladino or Aramaic. If you get to
Arabic characters you would know exactly that we are speaking about Arabic
and not Turkish, and if you get to Latin characters you would know that you
are using English and not Dutch.

The problem is that the LANG keyword (according to the W3C standard) is not
allowed to receive more than one value. This is why I need to use:
<META http-equiv="Content-Language" CONTENT="HE,EN">

Thank you,
Reuven Nisser
Ofek Liyladenu

-----Original Message-----
From: www-html-request@w3.org [mailto:www-html-request@w3.org]On Behalf Of
David Woolley
Sent: Tuesday, September 23, 2003 10:52 PM
To: www-html@w3.org
Subject: Re: Problem with LANG keyword



[ Can't find the original...]
> Reuven Nisser <rnisser@ofek-liyladenu.org.il>:
> >
> > However, there are times where the change of language is "known" by the
> > character set used in the HTML. For example, English is using Ansi 7 bit

Leaving aside the obvious confusion between the HTML document character
set (ISO 10646, slightly subsetted) and the character sets that might be
used to transfer pages to the browser, and the bogus "Ansi" set,
except to note that a page may legitimately be converted between transfer
character sets, using numeric entities to fill any gaps....
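
A minimal sketch of that conversion, assuming Python 3 and a made-up
English/Hebrew fragment (the sample text and variable names are
illustrative, not taken from any real page). It transfers the same text
once as plain ASCII, with numeric character references filling the gaps,
and once in ISO-8859-8, and checks that both give back the same document
text:

```python
import html

# Hypothetical sample: "Hello" followed by a Hebrew word, written with
# escapes so the logical character order is unambiguous in the source.
text = "Hello \u05e9\u05dc\u05d5\u05dd"

# Transfer as plain ASCII: characters that ASCII lacks are replaced by
# numeric character references, which name ISO 10646 code points.
as_ascii = text.encode("ascii", errors="xmlcharrefreplace")

# Transfer the same fragment in ISO-8859-8, which has the Hebrew letters.
as_hebrew = text.encode("iso-8859-8")

# The bytes on the wire differ, but the document text is the same.
from_ascii = html.unescape(as_ascii.decode("ascii"))
from_hebrew = as_hebrew.decode("iso-8859-8")

print(as_ascii)  # b'Hello &#1513;&#1500;&#1493;&#1501;'
print(from_ascii == from_hebrew == text)  # True
```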

> > characters but Hebrew & Arabic occupy the upper 128-255. [...]

In ISO 10646 they are actually well above 255.  However, more importantly,
Hebrew characters could be Yiddish or Ladino, and, as the Hebrew script is
derived from the Aramaic script, they might be used for Aramaic as well.
Arabic script is used for many languages, including Farsi (Persian), Urdu,
Bengali, Pushtu, Malay, and others.  (On the other hand, en-gb is likely to
contain ISO 10646 code point 163, the pound sign.)
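
The code points can be checked with a short sketch, assuming Python 3 and
its standard unicodedata module (the choice of sample characters is
illustrative). Note that the letter peh, which Farsi and Urdu use but
Arabic itself does not, still belongs to the Arabic script, so the script
alone does not identify the language:

```python
import unicodedata

# Illustrative sample characters, written as escapes.
samples = {
    "hebrew_alef": "\u05d0",
    "arabic_alef": "\u0627",
    "persian_peh": "\u067e",   # used for Farsi and Urdu, not for Arabic
    "pound_sign": "\u00a3",
}

# Map each sample to its (code point, character name) pair.
info = {key: (ord(ch), unicodedata.name(ch)) for key, ch in samples.items()}

for key, (code_point, name) in info.items():
    print(f"{key}: {code_point} {name}")
```

The Hebrew and Arabic letters come out above 1400, and the pound sign
comes out at 163.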

Where people are using fixed-length, 8-bit character sets that are
supersets of ISO 646 to transfer documents (true of most current 8-bit
sets except EBCDIC, and basically the same rules as those under which
meta...charset works), using language codes in the document also
avoids the need to know the details of lots of possible character sets,
which will help search engines to index by language without any deep
understanding.
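
A rough sketch of that idea, assuming Python 3 (the regular expression,
the function name declared_language and the two sample pages are all
hypothetical, and a real indexer would need a proper HTML parser). Because
both transfer character sets are supersets of ISO 646, the LANG attribute
can be read straight from the raw bytes without decoding the body text:

```python
import re

# Look for a LANG attribute on the HTML element using ASCII bytes only.
LANG_ATTR = re.compile(
    rb'<html[^>]*\blang\s*=\s*["\']?([A-Za-z-]+)', re.IGNORECASE
)

def declared_language(raw: bytes):
    """Return the declared language code in lower case, or None."""
    match = LANG_ATTR.search(raw)
    return match.group(1).decode("ascii").lower() if match else None

# Hypothetical pages in two different 8-bit transfer character sets.
hebrew_page = '<HTML LANG="he"><BODY>\u05e9\u05dc\u05d5\u05dd</BODY></HTML>'.encode("iso-8859-8")
arabic_page = '<HTML LANG="ar"><BODY>\u0633\u0644\u0627\u0645</BODY></HTML>'.encode("iso-8859-6")
untagged_page = b"<HTML><BODY>Hello</BODY></HTML>"

print(declared_language(hebrew_page))    # he
print(declared_language(arabic_page))    # ar
print(declared_language(untagged_page))  # None
```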

Received on Wednesday, 24 September 2003 03:11:59 UTC