Re: Question on text search from Jerry Carter on 2010-06-02 (public-device-apis@w3.org from June 2010)

From: Jerry Carter <jerry@jerrycarter.org>
Date: Wed, 2 Jun 2010 10:57:24 -0400
To: Robin Berjon <robin@robineko.com>
Cc: public-i18n-core@w3.org, public-device-apis@w3.org
Message-Id: <763BCC03-796C-411D-9283-5F35369D71A0@jerrycarter.org>

Although not a member of the I18N group, I can share my experiences from working with the W3C grammar, speech synthesis, and lexicon specifications (SRGS, SSML, PLS).  These suggest that trying to mandate specific behavior across all languages is inappropriate.  That said, I would not want to leave matching entirely up to implementations, because that makes the behavior untestable.

I recommend instead requiring specific behavior in specific languages (at the MUST level) and leaving other choices to the implementations, as in "other implementations MAY apply other matching logic as appropriate for meeting the expectations of specific languages and countries."

There are some considerations that would seem appropriate for MUST level:
* case equivalence for European languages (e.g. 'z' matches 'Z')
* accented character equivalence for European languages (e.g. 'e' matches 'é')
* compound character equivalence for European languages (e.g. 'æ' matches 'ae')
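
As a rough illustration of how these three equivalences could be made testable, here is a minimal Python sketch. The names fold, loose_match, and _EXPANSIONS are hypothetical and appear in no specification; the expansion table is deliberately tiny, and stripping combining marks after NFKD decomposition is only one possible way to approximate accent equivalence.

```python
import unicodedata

# Hypothetical table for compound characters that Unicode normalization
# does not decompose on its own. Illustrative only, far from complete.
_EXPANSIONS = {"æ": "ae", "œ": "oe"}


def fold(text: str) -> str:
    """Reduce text to a loose matching key (case, accents, compounds)."""
    # Case equivalence: 'Z' and 'z' fold to the same form.
    text = text.casefold()
    # Compound character equivalence: 'æ' becomes 'ae'.
    text = "".join(_EXPANSIONS.get(ch, ch) for ch in text)
    # Accent equivalence: decompose, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def loose_match(query: str, candidate: str) -> bool:
    """Partial match: the folded query occurs inside the folded candidate."""
    return fold(query) in fold(candidate)
```

With this sketch, loose_match("hazae", "Hazaël") holds, which is the behavior asked about in the original question. A fixed key function like this is easy to test, but it is not right for every language: some European orthographies treat an accented letter as distinct from its base letter, which is one reason to scope any MUST to a specific language+country.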

There are others that are better left to MAY:
* Pinyin to ideograph equivalence for Chinese
* Hiragana to Kanji equivalence for Japanese
* Omitted vowel annotations for Arabic & Hebrew

To be a MUST, there must be broad consensus within the working group that the requirements are appropriate for a specific language+country and solid commitment from multiple parties to implement said behavior.  I further recommend that the specification explicitly state that further requirements for specific language+country combinations may be added in future versions based on external feedback and implementation experience.

-=- Jerry

On Jun 2, 2010, at 9:17 AM, Robin Berjon wrote:

> Dear I18N WG,
> 
> as part of our Contacts API[0] we have text search (e.g. matching names against input). We would like it to be loose, so that for instance "hazae" would possibly match "Hazaël" (so case-insensitive, partial, and I forget what matching "ë" for "e" is called but that one too).
> 
> We were wondering if this ought to be left up to implementations entirely or in part, or if it can be clearly defined. If the latter, has someone else done it so that we can steal it, and if the former is there any advice that we can at least give to implementations? We'd very much appreciate any information that you may have.
> 
> [0]http://dev.w3.org/2009/dap/contacts/
> 
> This is for ACTION-122.
> 
> --
> Robin Berjon
>  robineko — hired gun, higher standards
>  http://robineko.com/
> 

Received on Wednesday, 2 June 2010 14:57:56 UTC