Re: Problem with LANG keyword from David Woolley on 2003-09-23 (www-html@w3.org from September 2003)

From: David Woolley <david@djwhome.demon.co.uk>
Date: Tue, 23 Sep 2003 21:52:06 +0100 (BST)
To: www-html@w3.org
Message-Id: <200309232052.h8NKq6P12844@djwhome.demon.co.uk>

[ Can't find the original...]
> Reuven Nisser <rnisser@ofek-liyladenu.org.il>:
> >
> > However, there are times where the change of language is "known" by the
> > character set used in the HTML. For example, English is using Ansi 7 bit

I'll leave aside the obvious confusion between the HTML document
character set (ISO 10646, slightly subsetted) and the character sets
that might be used to transfer pages to the browser, and also the bogus
"Ansi" set, except to note that a page may legitimately be converted
between transfer character sets, using numeric character references to
fill any gaps....
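
A minimal sketch of that conversion, in Python (my choice of language;
the sample word and the variable names are purely illustrative).  The
same Hebrew text is sent once in ISO 8859-8, which has the letters, and
once in US-ASCII, which does not, so numeric character references fill
the gaps:

```python
import html

# Sample text: the Hebrew word "shalom" (illustrative only).
text = "\u05e9\u05dc\u05d5\u05dd"

# ISO 8859-8 contains these letters, so each travels as one byte.
as_8859_8 = text.encode("iso-8859-8")

# US-ASCII does not; numeric character references fill the gaps.
as_ascii = text.encode("ascii", errors="xmlcharrefreplace")

print(as_8859_8)
print(as_ascii)

# Both forms decode back to the same document characters.
assert as_8859_8.decode("iso-8859-8") == text
assert html.unescape(as_ascii.decode("ascii")) == text
```

Either byte stream represents the same document; only the transfer
character set differs.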

> > characters but Hebrew & Arabic occupy the upper 128-255. [...]

In ISO 10646 they are actually well above 255 (the Hebrew block starts
at U+0590 and the Arabic block at U+0600).  More importantly, text in
Hebrew script could be Yiddish or Ladino, and, as the script is derived
from the Aramaic script, it might be used for Aramaic as well.  Arabic
script is used for many languages, including Farsi (Persian), Urdu,
Pashto, Malay, and others.  (On the other hand, en-gb is likely to
contain ISO 10646 code point 163, the pound sign.)
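
To make the numbers concrete, here is a small Python sketch (the
variable names are mine).  It prints the ISO 10646 code points of one
Hebrew and one Arabic letter and of the pound sign, and then shows that
the letters only fall in the 128-255 range as bytes of one particular
8-bit transfer character set:

```python
# Positions in the document character set (ISO 10646 code points).
alef = "\u05d0"    # HEBREW LETTER ALEF
alif = "\u0627"    # ARABIC LETTER ALEF
pound = "\u00a3"   # POUND SIGN

print(ord(alef), ord(alif), ord(pound))   # 1488 1575 163

# Only as bytes of a specific 8-bit transfer character set do the
# letters land in the 128-255 range.
print(alef.encode("iso-8859-8"))   # b'\xe0'
print(alif.encode("iso-8859-6"))   # b'\xc7'
```

Note that the byte values tell you which script is present at best,
never which language.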

Where people are using fixed-length, 8-bit character sets which are
supersets of ISO 646 to transfer documents (true of most current 8-bit
sets except EBCDIC, and basically the same condition as that under which
meta...charset works), the markup, including any language codes, can be
read without knowing which set is in use.  Using language codes in the
document therefore also avoids the need to know the details of lots of
possible character sets, which will help search engines to index by
language without any deep understanding.
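
As a rough sketch of what such an indexer could do (Python again; the
function name, the regular expression and the sample markup are all
hypothetical, and a real indexer would use a proper parser), the
language codes can be pulled out of the raw bytes without ever decoding
the page:

```python
import re

# Matches lang="xx" or lang='xx-YY' using ASCII bytes only, so it works
# on any transfer character set that is a superset of ISO 646.
LANG_ATTR = re.compile(
    rb"""\blang\s*=\s*["']?([A-Za-z][A-Za-z0-9-]*)""", re.IGNORECASE
)

def declared_languages(raw):
    """Return the language codes declared in raw, undecoded markup."""
    return [m.decode("ascii").lower() for m in LANG_ATTR.findall(raw)]

# The same Hebrew-script word, marked once as Hebrew and once as Yiddish.
page = (
    '<p lang="he">\u05e9\u05dc\u05d5\u05dd</p>'
    '<p lang="yi">\u05e9\u05dc\u05d5\u05dd</p>'
)

for charset in ("iso-8859-8", "cp1255"):
    print(charset, declared_languages(page.encode(charset)))
```

The scan never needs to be told which 8-bit set was used, and it can
tell the Hebrew paragraph from the Yiddish one, which the bytes of the
text alone cannot do.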

Received on Wednesday, 24 September 2003 02:16:27 UTC