RE: Multilingual search resources?

From: CE Whitehead <cewcathar@hotmail.com>
Date: Wed, 08 Nov 2006 19:10:37 -0500
Message-ID: <BAY114-F23C633F002501A8E0A1679B3F00@phx.gbl>
To: juth@loc.gov
Cc: www-international@w3.org

I sent you another reply. The only thing I can think of that you need to be
aware of is how search engines identify the language of a page.
A search engine may identify the primary text-processing language of a
page, but the declarations are set up so that it can instead use the
meta tag about language or the HTTP Content-Language header, both of which
are meant to identify the language[s] of the targeted audience[s]. So what a
search engine identifies depends on how someone encoded language information
in his/her pages (the W3C just makes recommendations as to how to do it),
and also on the search engine and how it checks for language: does it
use the meta content tag or what?  It could vary with the search engine.
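
To make the distinction concrete, here is a minimal Python sketch of how a
crawler might read the declarations discussed above: the lang attribute on
the html element (the default text-processing language) and the
Content-Language values from a meta tag or an HTTP header (the audience
languages). It is an illustration only and does not describe how any
particular search engine works. The names LanguageDeclarations,
split_language_list and declared_languages are made up for this example, and
the HTTP headers are assumed to arrive as a plain dict.

```python
from html.parser import HTMLParser


def split_language_list(value):
    """Split a comma-separated Content-Language value into language tags."""
    return [tag.strip() for tag in value.split(",") if tag.strip()]


class LanguageDeclarations(HTMLParser):
    """Collect the language declarations found in one HTML page."""

    def __init__(self):
        super().__init__()
        self.text_language = None      # lang attribute on the <html> element
        self.audience_languages = []   # values from a Content-Language meta tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html" and self.text_language is None:
            # XHTML pages may carry xml:lang instead of (or alongside) lang.
            self.text_language = attrs.get("lang") or attrs.get("xml:lang")
        elif tag == "meta":
            if (attrs.get("http-equiv") or "").lower() == "content-language":
                self.audience_languages += split_language_list(
                    attrs.get("content") or ""
                )


def declared_languages(html_text, http_headers=None):
    """Return (text_language, audience_languages) for one page.

    http_headers is a plain dict of response headers (an assumption of this
    sketch; a real crawler would take them from its HTTP client).
    """
    parser = LanguageDeclarations()
    parser.feed(html_text)
    audience = list(parser.audience_languages)
    for name, value in (http_headers or {}).items():
        if name.lower() == "content-language":
            # Merge header values, skipping tags the meta tag already gave.
            for tag in split_language_list(value):
                if tag not in audience:
                    audience.append(tag)
    return parser.text_language, audience
```

For the German city guide to Beijing mentioned in the quoted document, a
page declaring lang="de" on the html element and "de" as Content-Language
would come back as text language "de" with a single audience language "de",
even though the page contains Chinese phrases.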

Because those are only recommendations, some of them are being used more
than others. Most programs that create pages specify the default language
in different ways, though according to that document
the meta content tag is getting popular.

And as you may know, every search engine is a little different;
it would be great to know exactly what language information each search 
engine looks for;
they look for the primary language of the page of course.
You can specify what languages you will accept for the primary language of 
pages; that is all I know.
I am really not an expert.

I hope I've answered your question in some way.

If you get back information in languages you are not familiar with, you
might try Google's online translation service, if the language is found
there; I've found it to be a fairly good translation service, though
sometimes it misses a beat or two.

(Google uses, I think, something like parallel distributed processing, which
means that it keeps databases of pages that are translations of each other
and compares the translations with the originals to learn how to translate
various phrases; thus it learns automatically rather than having to be
'taught' by humans to parse sentences, as far as I know.
I think, though, that such automatic learning of grammars for translating
could be simplified and improved with the implementation of a simple
topic-comment grammar like the one Halliday used to analyze Darwin, maybe;
that's a personal opinion of course; I do not work in that field, though
I'd like to.
But I've been fairly impressed with its translations; I checked some French
text in it and it worked well, avoiding many mistakes that other translators
have made translating text; maybe they are all getting better!)

--C. E. Whitehead

>From: "Justin Thorp" <juth@loc.gov>
>To: <cewcathar@hotmail.com>
>Subject: RE: Multilingual search resources?
>Date: Wed, 08 Nov 2006 15:06:03 -0500
>Thank you so much for the references to the Internationalization Best 
>Practices Documents.  They really are great resources and have been good 
>introductory material.
>What I am inquiring about is searching a repository of metadata that could
>potentially be in multiple languages.  When I am trying to return
>relevant search results to someone, are there certain things I need to be
>aware of when it comes to different languages?
>Justin Thorp
>Web Services - Office of Strategic Initiatives
>Library of Congress
>e - juth@loc.gov
>p - 202/707-9541
> >>> "CE Whitehead" <cewcathar@hotmail.com> 11/07/06 4:47 PM >>>
>Hi, Justin:
>There is very little in the document I am really familiar with,
>"Internationalization Best Practices:  Specifying Languages in XHTML and
>HTML Content:"
>But there are a few things here, although the authors acknowledge that the
>main user agent that uses data about language right now is the browser;
>however it's anticipated that other user agents will do so.
>What is here regards how to declare language in a document so that it can
>be accessed by user agents; this includes how to declare language in
>multilingual documents and how to declare language that can be accessed by
>search engines.
>The HTTP Content-Language Header and the meta tags in the html or xhtml
>document headers are the two places to specify the language of the targeted
>audience.  Audiences speaking multiple languages (such as English students
>studying French) or multiple audiences speaking varying languages may be
>targeted here.
>The language of the targeted audience is the language that search engines
>should be concerned with, rather than with the text-processing language
>(though in some cases I bet search engines take an interest in the overall
>default text-processing language too).
>Anyway, below are excerpts from the "Best Practices" document together with
>the section numbers from the document where these excerpts are taken from!
>Not sure if this is what you are looking for!
>Hope it helps anyway!
>C. E. Whitehead
>"Applications for language information are found in such things as authoring
>tools, translation tools, accessibility, font selection, page rendering,
>search, and scripting."
>"Metadata about the language of the intended audience is about the document
>as a whole. Such metadata may be used for searching, serving the right
>language version, classification, etc. It is not specific enough to indicate
>the language of a particular run of text in the document for text-processing
>- for example, in a way that would be needed for the application of
>text-to-speech, styling, automatic font assignment, etc."
>"The language of the intended audience does not include every language used
>in a document. Many documents on the Web contain embedded fragments of
>content in different languages, whereas the page is clearly aimed at
>speakers of one particular language. For example, a German city-guide for
>Beijing may contain useful phrases in Chinese, but it is aimed at a
>German-speaking audience, not a Chinese one.
>"It is also possible to imagine a situation where a document contains the
>same or parallel content in more than one language. For example, a Web page
>may welcome Canadian readers with French content in the left column, and the
>same content in English in the right-hand column. Here the document is
>equally targeted at speakers of both languages, so there are two audience
>languages. This situation is not as common on the Web as in printed material,
>since it is easy to link to separate pages on the Web for different
>audiences, but it does occur where there are multilingual communities.
>Another use case is a blog or a news page aimed at a multilingual community,
>where some articles on a page are in one language and some in another. "
>"Metadata about the language of the intended audience is usually best
>declared outside the document in the HTTP Content-Language header, although
>there may be situations where an internal declaration using the meta element
>is appropriate."
>"There is generally a lot of confusion about the difference between
>declaring language information using the Content-Language field in the HTTP
>header or meta elements, and using a language attribute on the html tag.
>In particular, much of the informal advice on the Web about how to declare
>the language of a document tells you to use the meta tag to declare the
>language of the document. At least one popular authoring tool automatically
>inserts language information that you declare in the page properties dialog
>box into a meta element.
>"Best practices in this document recommend that HTTP and the meta element be
>used for describing metadata about the language of the intended audience
>only, and that attributes be used for describing the default text-processing
>language of the document.
>"Reasons for making this distinction include:
>    1. "HTTP and meta declarations allow you to specify more than one language
>value. This is inappropriate for labelling the text-processing language,
>which must be done one language at a time. On the other hand, multiple
>language values are appropriate when declaring language for documents that
>are aimed at speakers of more than one language. Attribute-based language
>declarations can only specify one language at a time, so they are less
>appropriate for specifying the language of the intended audience, but they
>are perfect for labelling the text-processing language for text."
>"There are still some unknowns surrounding the use of HTTP headers or meta
>elements to declare the language of the intended audience, due to the
>currently low level of exploitation of this information. This may change in
>the future, particularly if libraries and similar users take an increasing
>interest in language metadata.
>When it comes to choosing between the HTTP header or the meta element for
>expressing information about the intended audience, there is also a lack of
>information on which to base any advice. In some ways the meta element may
>appeal, because it is an in-document declaration. This avoids potential
>issues if authors cannot access server settings, particularly if dealing
>with an ISP, or if the document is to be read from a CD or other non-HTTP
>source. Until more practical use cases arise, however, this is just theory.
>"If, in the future, we see systematic use of in-document declarations of
>audience language using the meta element. It may also become acceptable to
>infer the language of the intended audience from the language attribute on
>the html element for documents with a monolingual audience. Discussion
>amongst various stakeholders needs to take place, however, before this can
>be decided.
>"In the meantime, we recommend that you use HTTP headers and meta elements
>to provide document metadata about the language of the intended audience,
>and language attributes on the html tag to indicate the default
>text-processing language. Furthermore, we recommend that you always declare
>the default text-processing language."
> >From: "Justin Thorp" <juth@loc.gov>
> >To: <www-international@w3.org>
> >Subject: Multilingual search resources?
> >Date: Thu, 02 Nov 2006 11:39:31 -0500
> >
> >
> >I am doing research on issues regarding multilingual web search.  Are
> >there any resources that someone can point me to?
> >
> >thanks,
> >Justin Thorp
> >
> >******************
> >Justin Thorp
> >Web Services - Office of Strategic Initiatives
> >Library of Congress
> >e - juth@loc.gov
> >p - 202/707-9541
> >
> >
Received on Thursday, 9 November 2006 00:10:58 UTC

This archive was generated by hypermail 2.4.0 : Friday, 17 January 2020 22:40:53 UTC