RE: Multilingual search resources?

Hi, Justin:

The one document I am really familiar with has very little on this: 
"Internationalization Best Practices:  Specifying Languages in XHTML and 
HTML Content":

http://www.w3.org/TR/i18n-html-tech-lang/#ri20050208.091505539

It does contain a few relevant things, although the authors acknowledge that 
the main user agent that uses data about language right now is the browser; 
they anticipate that other user agents will do so as well.

What it covers is how to declare language in a document so that user agents 
can access it; this includes how to declare language in multilingual 
documents and how to declare language so that search engines can use it.

The HTTP Content-Language header and the meta tags in the HTML or XHTML 
document headers are the two places to specify the language of the targeted 
audience.  Audiences speaking multiple languages (such as English students 
studying French) or multiple audiences speaking varying languages may be 
targeted here.

The language of the targeted audience is the language that search engines 
should be concerned with, rather than the text-processing language 
(though in some cases I bet search engines take an interest in the overall 
default text-processing language too).
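
To make the multiple-language case concrete, here is a minimal Python sketch 
(my own illustration, not taken from the document) of how a user agent such 
as a search engine might read the audience languages out of a 
Content-Language value. The function name parse_content_language is 
hypothetical, and the sketch assumes the value is a plain comma-separated 
list of language tags.

```python
def parse_content_language(header_value):
    """Split a Content-Language value into a list of language tags.

    Assumes a plain comma-separated list such as "en, fr"; the tags
    themselves are not validated here.
    """
    tags = []
    for part in header_value.split(","):
        tag = part.strip()
        if tag:  # skip empty items, e.g. from a trailing comma
            tags.append(tag)
    return tags


# A page aimed at English students studying French: two audience languages.
print(parse_content_language("en, fr"))
# A page aimed at a single audience.
print(parse_content_language("de"))
```

The same splitting would apply to the content value of a meta element that 
carries the audience language, since both places accept a list.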

Anyway, below are excerpts from the "Best Practices" document, together with 
the numbers of the sections they are taken from.

Not sure if this is what you are looking for!

Hope it helps anyway!

Sincerely,

C. E. Whitehead
cewcathar@hotmail.com

2.
"Applications for language information are found in such things as authoring 
tools, translation tools, accessibility, font selection, page rendering, 
search, and scripting."

3.1.
"Metadata about the language of the intended audience is about the document 
as a whole. Such metadata may be used for searching, serving the right 
language version, classification, etc. It is not specific enough to indicate 
the language of a particular run of text in the document for text-processing - 
for example, in a way that would be needed for the application of 
text-to-speech, styling, automatic font assignment, etc."

"The language of the intended audience does not include every language used 
in a document. Many documents on the Web contain embedded fragments of 
content in different languages, whereas the page is clearly aimed at 
speakers of one particular language. For example, a German city-guide for 
Beijing may contain useful phrases in Chinese, but it is aimed at a 
German-speaking audience, not a Chinese one.

"It is also possible to imagine a situation where a document contains the 
same or parallel content in more than one language. For example, a Web page 
may welcome Canadian readers with French content in the left column, and the 
same content in English in the right-hand column. Here the document is 
equally targeted at speakers of both languages, so there are two audience 
languages. This situation is not as common on the Web as in printed material 
since it is easy to link to separate pages on the Web for different 
audiences, but it does occur where there are multilingual communities. 
Another use case is a blog or a news page aimed at a multilingual community, 
where some articles on a page are in one language and some in another."

"Metadata about the language of the intended audience is usually best 
declared outside the document in the HTTP Content-Language header, although 
there may be situations where an internal declaration using the meta element 
is appropriate."

4.2

"There is generally a lot of confusion about the difference between 
declaring language information using the Content-Language field in the HTTP 
header or meta elements, and using a language attribute on the html element. 
In particular, much of the informal advice on the Web about how to declare 
the language of a document tells you to use the meta tag to declare the 
language of the document. At least one popular authoring tool automatically 
inserts language information that you declare in the page properties dialog 
box into a meta element.

"Best practices in this document recommend that HTTP and the meta element be 
used for describing metadata about the language of the intended audience 
only, and that attributes be used for describing the default text-processing 
language of the document.

"Reasons for making this distinction include:

   1. "HTTP and meta declarations allow you to specify more than one language 
value. This is inappropriate for labelling the text-processing language, 
which must be done one language at a time. On the other hand, multiple 
language values are appropriate when declaring language for documents that 
are aimed at speakers of more than one language. Attribute-based language 
declarations can only specify one language at a time, so they are less 
appropriate for specifying the language of the intended audience, but they 
are perfect for labelling the text-processing language for text."

"There are still some unknowns surrounding the use of HTTP headers or meta 
elements to declare the language of the intended audience, due to the 
currently low level of exploitation of this information. This may change in 
the future, particularly if libraries and similar users take an increasing 
interest in language metadata.

"When it comes to choosing between the HTTP header or the meta element for 
expressing information about the intended audience, there is also a lack of 
information on which to base any advice. In some ways the meta element may 
appeal, because it is an in-document declaration. This avoids potential 
issues if authors cannot access server settings, particularly if dealing 
with an ISP, or if the document is to be read from a CD or other non-HTTP 
source. Until more practical use cases arise, however, this is just theory.

"If, in the future, we see systematic use of in-document declarations of 
audience language using the meta element. It may also become acceptable to 
infer the language of the intended audience from the language attribute on 
the html element for documents with a monolingual audience. Discussion 
amongst various stakeholders needs to take place, however, before this can 
be decided.

"In the meantime, we recommend that you use HTTP headers and meta elements 
to provide document metadata about the language of the intended audience(s), 
and language attributes on the html tag to indicate the default 
text-processing language. Furthermore, we recommend that you always declare 
the default text-processing language."
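
To make that recommendation concrete, here is a minimal Python sketch (my own 
illustration, not code from the document) that produces both declarations for 
one page: a Content-Language header and matching meta element listing the 
audience language(s), and a single lang attribute on the html tag for the 
default text-processing language. The function name build_page and its 
parameters are hypothetical, and the http-equiv form of the meta element 
reflects my reading of the document.

```python
from html import escape


def build_page(text_lang, audience_langs, title, body_html):
    """Return (http_headers, html_text) for one page.

    text_lang      -- the single default text-processing language, e.g. "en"
    audience_langs -- list of one or more audience languages, e.g. ["en", "fr"]
    body_html      -- markup for the body, inserted as-is
    """
    # Audience metadata may list several languages.
    audience = ", ".join(audience_langs)
    headers = {"Content-Language": audience}
    html_text = (
        # The lang attribute takes exactly one language.
        f'<html lang="{escape(text_lang)}">\n'
        "<head>\n"
        f'<meta http-equiv="Content-Language" content="{escape(audience)}">\n'
        f"<title>{escape(title)}</title>\n"
        "</head>\n"
        f"<body>\n{body_html}\n</body>\n"
        "</html>\n"
    )
    return headers, html_text


# A bilingual welcome page: two audience languages, English as the default
# text-processing language, and the French run marked with its own lang.
headers, page = build_page(
    "en",
    ["en", "fr"],
    "Welcome / Bienvenue",
    '<p>Welcome!</p>\n<p lang="fr">Bienvenue !</p>',
)
print(headers)
print(page)
```

For the German city-guide example quoted above, the call would pass "de" as 
the text-processing language and ["de"] as the audience, with lang="zh" only 
on the embedded Chinese phrases.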

>From: "Justin Thorp" <juth@loc.gov>
>To: <www-international@w3.org>
>Subject: Multilingual search resources?
>Date: Thu, 02 Nov 2006 11:39:31 -0500
>
>
>I am doing research on issues regarding multilingual web search.  Are there 
>any resources that someone can point me to?
>
>thanks,
>Justin Thorp
>
>******************
>Justin Thorp
>Web Services - Office of Strategic Initiatives
>Library of Congress
>e - juth@loc.gov
>p - 202/707-9541
>
>


Received on Tuesday, 7 November 2006 21:47:20 UTC