I18N robots & searching

From: Andrew Daviel <andrew@andrew.triumf.ca>
Date: Fri, 8 Sep 2000 10:09:33 -0700 (PDT)
To: Robots list <robots@MCCMEDIA.COM>
cc: WWW-HTML List <www-html@w3.org>
Message-ID: <Pine.LNX.4.21.0009080926090.18436-100000@andrew.triumf.ca>

Richard Chuang recently mentioned that he is working on
I18N, at least with Big5.

I often get asked questions about META tags in HTML
(e.g. "keywords") and sometimes about I18N aspects. I was wondering
what the current and future realities are of "how do I get listed
in search engines" for non-English users.

AFAIK, the following applies. Please correct me if I'm wrong.

- The default HTML charset is ISO-8859-1 (Western European)
- The Internet and most computers are 8-bit safe
- One can expect to put "château" in a document, and
  <meta name="keywords" content="château"> (or maybe
  <meta name="keywords" lang="fr" content="château">)
  and a search engine will find it under "château".
(if anyone doesn't see the accent, I wrote "ch<a-circumflex>teau")
- Many search engines will lose the accent so that a search for
  "chateau" will also find it.

- Escaped characters such as &#233; and &eacute; should be
  translated to the 8-bit value é and should also match, so that
  "renée" entered in a search engine should find "ren&#233;e" in
  a document.
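
To make the matching described above concrete, here is a minimal sketch in
Python. It is only an illustration of the idea, not a description of how any
particular search engine works. The function name normalize_term and its
fold_accents flag are made up for this example.

```python
import html
import unicodedata


def normalize_term(text: str, fold_accents: bool = False) -> str:
    """Decode HTML character references, then optionally drop accents.

    Hypothetical helper for illustration only.
    """
    # "&#233;" and "&eacute;" both decode to the single character "é".
    decoded = html.unescape(text)
    # Use one canonical composed form so equal-looking strings compare equal.
    decoded = unicodedata.normalize("NFC", decoded)
    if fold_accents:
        # Decompose "é" into "e" plus a combining accent, then drop the accent.
        decomposed = unicodedata.normalize("NFD", decoded)
        decoded = "".join(
            ch for ch in decomposed if not unicodedata.combining(ch)
        )
    return decoded.lower()


# The escaped and literal spellings compare equal once decoded.
same = normalize_term("ren&#233;e") == normalize_term("ren&eacute;e")
# With accent folding, "château" and "chateau" compare equal too.
folded = normalize_term("château", fold_accents=True)
```

An indexer that applied something like this to both the document text and the
query would treat the escaped, literal and (with folding on) unaccented
spellings as the same term.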

Things I am not so sure about:

- what happens when the document charset is not ISO-8859-1
  but ISO-8859-5 or KOI8-R or Windows-1251? Does the
  search engine just try to match the 8-bit values from the user's
  keyboard, or does it try to be clever with HTTP_ACCEPT_CHARSET
  and map across to the alternate charset(s) used in the documents?
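
As a rough sketch of what "mapping across" could mean, the Python below decodes
both the query and the document to Unicode using their declared charsets before
comparing. This is an assumption about one possible approach, not a claim about
what existing engines do. The sample word and the helper names encode_all and
matches are hypothetical.

```python
# The Russian word for "search", held as a Unicode string.
WORD = "поиск"

# The three Cyrillic charsets mentioned above, by their Python codec names.
CHARSETS = ["iso-8859-5", "koi8-r", "windows-1251"]


def encode_all(word: str) -> dict:
    """Return the raw 8-bit bytes for the word in each charset."""
    return {name: word.encode(name) for name in CHARSETS}


def matches(query: bytes, query_charset: str,
            doc: bytes, doc_charset: str) -> bool:
    """Compare in Unicode, after decoding each side with its own charset."""
    return query.decode(query_charset) in doc.decode(doc_charset)


encoded = encode_all(WORD)
# The same word is a different byte sequence in each charset, so a naive
# byte-for-byte comparison across charsets fails.
naive_hit = encoded["windows-1251"] in encoded["koi8-r"]
# Decoding first makes the comparison independent of the charset.
mapped_hit = matches(encoded["windows-1251"], "windows-1251",
                     encoded["koi8-r"], "koi8-r")
```

This only works if the charset of both the query and the document is known,
which is exactly the open question.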

As I understand it, at least with Netscape, the browser will automatically
switch charsets when a page is loaded if a charset parameter is given
with Content-Type, provided that the font is available, so that
Russian, Chinese etc. pages composed on Unix, Windows etc.
may all be viewed correctly. But presumably the user's keyboard is mapped
to a single charset.

- what happens when a user enters data in a form on a page
  written in Windows-1251, but their keyboard is set to ISO-8859-5?
- are there any special rules for 16-bit charsets?
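
To illustrate the form question, here is a small Python sketch of the failure
case I have in mind: bytes produced under one charset and interpreted under
another. It assumes the receiving side simply trusts the page's charset. The
sample word and variable names are made up for the example.

```python
# What the user means to type, as a Unicode string.
INTENDED = "поиск"

# Bytes as they would be produced with the keyboard mapped to ISO-8859-5.
typed_bytes = INTENDED.encode("iso-8859-5")

# A receiver that assumes the page's charset (Windows-1251) decodes the
# same bytes without error, but gets different Cyrillic letters.
misread = typed_bytes.decode("windows-1251")

# Decoding with the charset that was actually used recovers the word.
recovered = typed_bytes.decode("iso-8859-5")
```

No error is raised in the mismatched case, so the receiver has no direct way
to notice that the text is wrong.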

Back to the meta tag question, if a user wants to be found as
"renée", should they put
  <meta name="keywords" content="renée">
  <meta name="keywords" content="ren&#233;e">
  <meta name="keywords" content="ren&eacute;e">
  <meta name="keywords" content="renee">
or all four?

Andrew Daviel, TRIUMF, Canada
Received on Friday, 8 September 2000 13:10:06 GMT
