RE: Question on text search

From: Phillips, Addison <addison@lab126.com>
Date: Wed, 2 Jun 2010 10:52:58 -0400
To: Robin Berjon <robin@robineko.com>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
CC: "public-device-apis@w3.org" <public-device-apis@w3.org>
Message-ID: <C7A5719F1E562149BA9171F58BEE2CA4129E3A9FCC@EX-IAD6-B.ant.amazon.com>
(chair hat on)

Hi, I've added this to our agenda to discuss.

(personal response)

Full-text search is a fairly complex topic, and the right behavior varies by language. Preprocessing the indexed data and the search terms involves steps such as "stemming" (finding the base word, such as converting "dogs" to "dog" or "running" to "run"), normalization, and so forth. There are open source libraries (cf. Lucene [1]) that provide specific processing for various languages. There are also choices to be made about how this is implemented or tuned for each individual language: a strategy that is valid for one language may not be applicable to another language or culture. So it may not be easy to fully specify the right behavior here.
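
To make the "normalization" step concrete, here is a minimal sketch in Python of the kind of loose matching asked about in the quoted message (case-insensitive, accent-insensitive, partial). It is an illustration under stated assumptions, not a recommended algorithm: the helper names `fold` and `loose_match` are hypothetical, and stripping every combining mark is a simplification that is wrong for languages where accented letters are distinct letters (Swedish "å" versus "a", for example). It also does no stemming.

```python
import unicodedata


def fold(text: str) -> str:
    """Return a case-folded copy of text with combining marks removed.

    Hypothetical helper: decomposes with NFKD, drops combining marks
    (accents), then applies Unicode case folding.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def loose_match(query: str, candidate: str) -> bool:
    """True if the folded query occurs anywhere in the folded candidate."""
    return fold(query) in fold(candidate)


if __name__ == "__main__":
    # An accented, mixed-case name still matches a plain lowercase prefix.
    print(loose_match("hazae", "Haza\u00ebl"))
    # No stemming is performed, so "dogs" does not match "dog".
    print(loose_match("dogs", "dog"))
```

Even this small sketch embeds language-specific choices (which marks to ignore, how case is folded), which is the reason the behavior is hard to pin down in a specification.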

(restores chair hat to head)

Looking at the Contacts API draft quickly I notice many interesting internationalization issues that may not be fully addressed (handling of personal names; handling of postal addresses; enumerated types which need to consider the needs of other cultures; etc.)

Would it be appropriate for us to review this work now?

Addison

[1] http://lucene.apache.org/

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N, IETF IRI WGs)

Internationalization is not a feature.
It is an architecture.


> -----Original Message-----
> From: public-i18n-core-request@w3.org [mailto:public-i18n-core-request@w3.org] On Behalf Of Robin Berjon
> Sent: Wednesday, June 02, 2010 6:17 AM
> To: public-i18n-core@w3.org
> Cc: public-device-apis@w3.org
> Subject: Question on text search
> 
> Dear I18N WG,
> 
> as part of our Contacts API[0] we have text search (e.g. matching
> names against input). We would like it to be loose, so that for
> instance "hazae" would possibly match "Hazaél" (so case-insensitive,
> partial, and I forget what matching "é" for "e" is called but that
> one too).
> 
> We were wondering if this ought to be left up to implementations
> entirely or in part, or if it can be clearly defined. If the latter,
> has someone else done it so that we can steal it, and if the former
> is there any advice that we can at least give to implementations?
> We'd very much appreciate any information that you may have.
> 
> [0]http://dev.w3.org/2009/dap/contacts/
> 
> This is for ACTION-122.
> 
> --
> Robin Berjon
>   robineko - hired gun, higher standards
>   http://robineko.com/
> 
> 
> 
> 
Received on Wednesday, 2 June 2010 14:53:33 GMT
