Re: Question on text search from Jerry Carter on 2010-06-02 (public-i18n-core@w3.org from April to June 2010)

From: Jerry Carter <jerry@jerrycarter.org>
Date: Wed, 2 Jun 2010 11:41:41 -0400
To: Robin Berjon <robin@robineko.com>
Cc: public-i18n-core@w3.org, public-device-apis@w3.org
Message-Id: <982F14AE-95A7-46D0-9ED7-E09FCC78E089@jerrycarter.org>

For this domain, the key language is that of the owner of the device.  The behavior of text search should match the expectations and customs of the owner.  We should not be surprised to find that the same string of characters matches differently depending on the language/country and preferences of the user.

Take nicknames, for instance.  A US English speaker may find 'Manny' to be an appropriate match for 'Manuel', whereas a Mexican Spanish speaker may find this inappropriate.
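
As a rough illustration only (a sketch, not a proposal for the spec), owner-dependent matching could look something like the following. The nickname table, its entries, the locale tags used as keys, and the function name `matches` are all hypothetical, chosen to mirror the 'Manny'/'Manuel' example.

```python
# Hypothetical per-locale nickname tables. The entries are illustrative
# assumptions, not data taken from any specification or real address book.
NICKNAMES = {
    "en-US": {"manny": {"manuel", "emmanuel"}},
    "es-MX": {},  # assumption: this owner does not expect 'Manny' to match
}


def matches(query, name, owner_locale):
    """Return True if `query` should match `name` for the device owner's locale."""
    q = query.casefold()
    n = name.casefold()
    # An exact, case-insensitive match is accepted for every locale.
    if q == n:
        return True
    # Otherwise consult the nickname table of the owner's locale, if any.
    table = NICKNAMES.get(owner_locale, {})
    return n in table.get(q, set())


print(matches("Manny", "Manuel", "en-US"))  # True
print(matches("Manny", "Manuel", "es-MX"))  # False
```

The point is that the same pair of strings gives a different answer depending on the owner's locale, and that an unknown locale falls back to plain case-insensitive comparison.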

-=- Jerry


On Jun 2, 2010, at 11:26 AM, Robin Berjon wrote:

> Hi Jerry,
> 
> On Jun 2, 2010, at 16:57 , Jerry Carter wrote:
>> Although not a member of the I18N group, I can share my experiences from working with the W3C grammar, speech synthesis, and lexicon specifications (SRGS, SSML, PLS).  These suggest that trying to mandate specific behavior across all languages is inappropriate.  That stated, I would not want to leave matching entirely up to implementations as it makes the behavior untestable.
> 
> Agreed, in principle at least.
> 
>> I recommend instead requiring specific behavior in specific languages (i.e. MUST level) and then leaving other choices to the implementations as in "other implementations MAY apply other matching logic as appropriate for meeting the expectations of specific languages and countries."
> 
> The problem here is that in 99% of cases (and I'm being conservative) we simply won't know what language is being processed. We're dealing with address book data, not a properly language-contextualised corpus. We have to handle existing contacts databases that won't have that information, and I don't think that we can hope to mandate that UIs expose a language identifier on fields for which it makes sense (and if they did, users would still enter it wrong most of the time).
> 
> I'm trying to think of heuristics but can't seem to find any. You could try using the country to guess the language of the address fields but some countries have several languages and even though it's in the UK I still might have entered "Londres". Guessing what language to apply to names based on that will just be random.
> 
> So while in theory I agree that search needs to be language specific, we don't have that information. We don't even have enough data to guess. This leaves us with a language-agnostic approach unless there's a smart trick I haven't thought of.
> 
> --
> Robin Berjon
>  robineko — hired gun, higher standards
>  http://robineko.com/
> 

Received on Wednesday, 2 June 2010 15:42:14 UTC