- From: David Woolley <david@djwhome.demon.co.uk>
- Date: Tue, 23 Sep 2003 21:52:06 +0100 (BST)
- To: www-html@w3.org
[ Can't find the original...]

> Reuven Nisser <rnisser@ofek-liyladenu.org.il>:
> >
> > However, there are times where the change of language is "known" by the
> > character set used in the HTML. For example, English is using Ansi 7 bit

Leaving aside the obvious confusion between the HTML character set and
the ones that might be used to transfer pages to the browser (the former
is ISO 10646, slightly subsetted) and the bogus "Ansi" set, except to
note that a page may legitimately be converted between transfer
character sets, using numeric entities to fill any gaps....

> > characters but Hebrew & Arabic occupy the upper 128-255. [...]

They are actually well above 255. However, more importantly, Hebrew
characters could be Yiddish or Ladino, and, as it's derived from the
Aramaic script, might be used for that as well. Arabic script is used
for many languages, including Farsi (Persian), Urdu, Bengali, Pushtu,
Malay, and others. (On the other hand, en-gb is likely to contain
ISO 10646 code point 163.)

Where people are using fixed length, 8 bit character sets which are
supersets of ISO 646 to transfer documents (true of most current 8 bit
sets except EBCDIC, and basically the same rules as those under which
meta...charset works), using language codes in the document also avoids
the need to know the details of lots of possible character sets, which
will help search engines to index by language without any deep
understanding.
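
The distinction between document code points and transfer bytes can be checked with a few lines of Python. This is only an illustrative sketch, not part of the original message: the helper name `script_of` is hypothetical, and ISO-8859-8 is assumed here as one example of an 8 bit transfer character set for Hebrew.

```python
import unicodedata

HEBREW_ALEF = "\u05d0"   # HEBREW LETTER ALEF
ARABIC_ALEF = "\u0627"   # ARABIC LETTER ALEF
POUND_SIGN = "\u00a3"    # POUND SIGN, plausible in en-gb text


def script_of(ch: str) -> str:
    """Hypothetical helper: first word of the Unicode character name."""
    return unicodedata.name(ch).split()[0]


# Document character set (ISO 10646 / Unicode) code points.
hebrew_cp = ord(HEBREW_ALEF)
arabic_cp = ord(ARABIC_ALEF)
pound_cp = ord(POUND_SIGN)

# The same Hebrew letter as a single byte in one 8 bit transfer encoding
# (ISO-8859-8 is an assumption chosen for illustration).
hebrew_byte = HEBREW_ALEF.encode("iso8859_8")

# Converting to a transfer set that lacks the character: the gap is
# filled with a numeric character reference.
hebrew_ascii = HEBREW_ALEF.encode("ascii", "xmlcharrefreplace")

# The script is recoverable from the character, the language is not:
# a Hebrew letter could belong to Hebrew, Yiddish or Ladino text.
hebrew_script = script_of(HEBREW_ALEF)
arabic_script = script_of(ARABIC_ALEF)

print(hebrew_cp, arabic_cp, pound_cp)
print(hebrew_byte, hebrew_ascii)
print(hebrew_script, arabic_script)
```

The byte value falls in the 128-255 range only inside that particular transfer encoding; the code point in the document character set stays the same whichever encoding carries it, and neither value says which language the text is in.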
Received on Wednesday, 24 September 2003 02:16:27 UTC