Re: Question on text search from Jerry Carter on 2010-06-02 (public-device-apis@w3.org from June 2010)

From: Jerry Carter <jerry@jerrycarter.org>
Date: Wed, 2 Jun 2010 12:19:21 -0400
To: Robin Berjon <robin@robineko.com>
Cc: public-i18n-core@w3.org, public-device-apis@w3.org
Message-Id: <40C36D4B-BCA9-45C3-987E-6013FFBB267E@jerrycarter.org>
To quote a former mentor, "Don't let the best be the enemy of the good."  It is possible to define rules that work for the common cases, whereas trying to build the perfect solution leads to frustration and (most likely) eventual failure.  In this case, there are two features working in our favor.  First, the user will presumably be presented with several potential matches, so the penalty for displaying a few 'wrong' choices is minor.  Second, the user can always edit the records in the address book to enable better matching with whatever algorithms the device uses.

On Jun 2, 2010, at 12:01 PM, Robin Berjon wrote:

> On Jun 2, 2010, at 17:41 , Jerry Carter wrote:
>> For this domain, the key language is that of the owner of the device.  The behavior of text search should match the expectations and customs of the owner.  We should not be surprised to find that the same string of characters matches differently depending on the language/country and preferences of the user.
> 
> I thought of this as a heuristic as well, but it's not at all clear to me how it would work. Its first issue is that it assumes that a user has a single language, which I find to be a very common mistake in I18N architecture. I don't have numbers, but I would be surprised if a large plurality of the world's users weren't multilingual. The fact that my phone's OS is in English won't help you much match the vast amount of French data that I have, not to mention all those contact entries I've received from people all around the world. It provides some minimal amount of help in understanding my entry (and even then, not really) but doesn't help with the data. I don't see how you can use this to get a match.

Certainly, but whatever data appears in your address book must be interpretable by you.  For contacts residing in Japan, Korea, China, Taiwan, Qatar, etc., you presumably have a Latin-1 string for each, as is appropriate for your cultural expectations.  Conversely, a native of Japan will likely have 'Robin Berjon' in Latin-1 characters but use non-Latin-1 characters for the Japanese entries in the address book -- as is appropriate to his/her cultural expectations.

>> Take nicknames for instance.  A US English speaker may find 'Manny' to be an appropriate match for 'Manuel' whereas a Mexican Spanish speaker may find this inappropriate.
> 
> That's certainly true but I don't think that we're looking at nickname stemming :) And it would still have the issue of how to interpret a query from a user who speaks English, Spanish, and Spanglish — something that's not uncommon in the US. Or even simply, even if I were monolingual, and if the system supported nickname stemming, I would want "Manny" to match an American friend called Manuel, but "Manu" to match the French "Emmanuel". The information for that would have to be attached to their names, not come from me (and using their country of residence doesn't help).

It is inadvisable to consider text search without some form of synonym mapping.  Whether the task is handling nicknames (e.g. 'NYC' ~ 'New York City') or spelling variants ('Southborough' ~ 'Southboro' in the geographic context or 'Jasmine' ~ 'Jazmin' in names), humans expect some intelligence from their text engines.
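
As a rough illustration only, synonym mapping can be sketched as a lookup that reduces each term to a representative of its group before comparing.  The table contents and the names SYNONYM_GROUPS, canonical, and matches are hypothetical, not part of any device API:

```python
# Hypothetical synonym groups; a real device would ship far larger tables.
SYNONYM_GROUPS = [
    {"nyc", "new york city"},
    {"southborough", "southboro"},
    {"jasmine", "jazmin"},
]


def canonical(term: str) -> str:
    """Map a term to a stable representative of its synonym group."""
    key = term.strip().casefold()
    for group in SYNONYM_GROUPS:
        if key in group:
            # Any deterministic choice works; min() picks the same
            # representative for every member of the group.
            return min(group)
    return key


def matches(query: str, field: str) -> bool:
    """True if the query and the stored field are equal or synonyms."""
    return canonical(query) == canonical(field)
```

With these tables, matches("NYC", "New York City") and matches("Southboro", "Southborough") both hold, while unrelated names still fail to match.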

I believe that you can use only the language/country + preferences of the user to control the behavior.  There is little value in using other information such as the country of residence of individual contacts.
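
To make that concrete, here is a minimal sketch in which the owner's locale selects a nickname table and the owner's own preferences extend it.  Everything here is an assumption for illustration: the NICKNAMES table, the search function, the locale tags, and the contact names are invented, and matching is by exact name token only:

```python
# Hypothetical per-locale nickname tables, keyed by the device owner's locale.
NICKNAMES = {
    "en-US": {"manny": {"manuel"}},
    "es-MX": {},  # assumption: this owner would not expect 'Manny' ~ 'Manuel'
}


def search(query, contacts, owner_locale, extra=None):
    """Return contacts with a name token matching the query or its nicknames.

    `extra` holds owner-defined preferences (nickname -> set of names) that
    are merged on top of the table chosen by the owner's locale.
    """
    table = dict(NICKNAMES.get(owner_locale, {}))
    for nick, names in (extra or {}).items():
        table[nick] = table.get(nick, set()) | names
    q = query.casefold()
    targets = {q} | table.get(q, set())
    return [
        c for c in contacts
        if any(token in targets for token in c.casefold().split())
    ]
```

For a hypothetical address book ["Manuel Ortega", "Emmanuel Dupont"], the query "Manny" finds Manuel under "en-US" and nothing under "es-MX", and an owner who adds the preference {"manu": {"emmanuel"}} gets Emmanuel for "Manu".  No per-contact country information is consulted.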

-=- Jerry
Received on Wednesday, 2 June 2010 16:19:51 UTC