Re: Problem with LANG keyword

From: David Woolley <david@djwhome.demon.co.uk>
Date: Tue, 23 Sep 2003 21:52:06 +0100 (BST)
Message-Id: <200309232052.h8NKq6P12844@djwhome.demon.co.uk>
To: www-html@w3.org

[ Can't find the original...]
> Reuven Nisser <rnisser@ofek-liyladenu.org.il>:
> >
> > However, there are times where the change of language is "known" by the
> > character set used in the HTML. For example, English is using Ansi 7 bit

Leaving aside the obvious confusion between the HTML character set and
the ones that might be used to transfer pages to the browser (the former
is ISO 10646, slightly subsetted) and the bogus "Ansi" set,
except to note that a page may legitimately be converted between transfer
character sets, using numeric character references to fill any gaps....
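As a rough illustration of that gap-filling, here is a minimal Python sketch. The sample string is made up for the example, and the standard "xmlcharrefreplace" error handler stands in for whatever a real converter would do:

```python
# Sample text: "cafe" with e-acute (U+00E9), then Hebrew letter alef (U+05D0).
# The string itself is only an illustrative assumption.
text = "caf\u00e9 \u05d0"

# Converting to a transfer character set that lacks some characters:
# anything missing is written as a decimal numeric character reference.
latin1 = text.encode("iso-8859-1", errors="xmlcharrefreplace")
ascii_ = text.encode("ascii", errors="xmlcharrefreplace")

print(latin1)  # b'caf\xe9 &#1488;'
print(ascii_)  # b'caf&#233; &#1488;'
```

Both byte strings describe the same sequence of ISO 10646 characters once the references are resolved, which is why the transfer character set says little about the language of the page.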

> > characters but Hebrew & Arabic occupy the upper 128-255. [...]

They are actually well above 255.  However, more importantly, Hebrew
characters could be Yiddish or Ladino, and, as the Hebrew script is derived
from the Aramaic script, might be used for Aramaic as well.  Arabic script is
used for many languages, including Farsi (Persian), Urdu, Pushtu,
Malay, and others.  (On the other hand, en-gb is likely to contain
ISO 10646 code point 163, the pound sign.)
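
The code point claims above can be checked with a small Python sketch. The sample words and the helper script_of are assumptions made for this example, and taking the first word of the Unicode character name is only a crude stand-in for a proper script lookup:

```python
import unicodedata

def script_of(text):
    """Crude script guess: first word of each letter's Unicode name.
    (Hypothetical helper for illustration, not a real script property.)"""
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}

hebrew_word = "\u05e9\u05dc\u05d5\u05dd"   # a Hebrew word (shalom)
yiddish_word = "\u05d2\u05d5\u05d8"        # a Yiddish word (gut)
arabic_script = "\u0633\u0644\u0627\u0645"  # "salam", spelled alike in Arabic and Persian

print(ord("\u05d0"), ord("\u0627"))  # Hebrew alef, Arabic alef: 1488 1575
print(ord("\u00a3"))                 # pound sign: 163
print(script_of(hebrew_word), script_of(yiddish_word))  # {'HEBREW'} {'HEBREW'}
print(script_of(arabic_script))      # {'ARABIC'}
```

The script comes out the same for the Hebrew and Yiddish words, so the characters alone cannot tell the two languages apart; an explicit language code can.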

Where people are using fixed length, 8 bit character sets which are
supersets of ISO 646 to transfer documents (true of most current 8 bit
sets except EBCDIC, and basically the same rules as those under which
meta...charset works), using language codes in the document also
avoids the need to know the details of lots of possible character sets,
which will help search engines to index by language without any deep
Received on Wednesday, 24 September 2003 02:16:27 UTC
