
Re: Question on text search

From: Jerry Carter <jerry@jerrycarter.org>
Date: Wed, 2 Jun 2010 10:57:24 -0400
Cc: public-i18n-core@w3.org, public-device-apis@w3.org
Message-Id: <763BCC03-796C-411D-9283-5F35369D71A0@jerrycarter.org>
To: Robin Berjon <robin@robineko.com>
Although not a member of the I18N group, I can share my experience from working with the W3C grammar, speech synthesis, and lexicon specifications (SRGS, SSML, PLS).  That experience suggests that trying to mandate specific behavior across all languages is inappropriate.  That said, I would not want to leave matching entirely up to implementations, as that makes the behavior untestable.

I recommend instead requiring specific behavior in specific languages (i.e. MUST level) and then leaving other choices to the implementations as in "other implementations MAY apply other matching logic as appropriate for meeting the expectations of specific languages and countries."

There are some considerations that would seem appropriate for MUST level:
* case equivalence for European languages (e.g. 'z' matches 'Z')
* accented character equivalence (e.g. 'e' matches 'é') for European languages
* compound character equivalence (e.g. 'æ' matches 'ae') for European languages
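
As a rough illustration of the three equivalences listed above, here is a minimal Python sketch. It is only one possible approach, not something taken from the Contacts API draft or any I18N recommendation: the function names fold and loose_match and the COMPOUND_EXPANSIONS table are hypothetical, and the table is deliberately incomplete. It relies on the standard unicodedata module and str.casefold().

```python
import unicodedata

# Hypothetical, incomplete table of compound characters that have no
# Unicode decomposition and so must be expanded by hand (an assumption
# made for illustration, not a list taken from any specification).
COMPOUND_EXPANSIONS = {"æ": "ae", "œ": "oe"}


def fold(text: str) -> str:
    """Return a loose-matching key for text.

    The key is case-folded, has compound characters expanded, and has
    combining accents removed.
    """
    # Case equivalence: casefold() maps 'Z' to 'z' and 'Æ' to 'æ'.
    folded = text.casefold()
    # Compound character equivalence: expand 'æ' to 'ae', and so on.
    for compound, expansion in COMPOUND_EXPANSIONS.items():
        folded = folded.replace(compound, expansion)
    # Accented character equivalence: decompose 'é' into 'e' plus a
    # combining mark (NFD), then drop every combining mark.
    decomposed = unicodedata.normalize("NFD", folded)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def loose_match(query: str, candidate: str) -> bool:
    """Return True if the folded query occurs anywhere in the folded candidate."""
    return fold(query) in fold(candidate)
```

With this sketch, loose_match("jose", "José María") and loose_match("aeth", "Æthelred") both return True, while loose_match("xyz", "José") returns False. Stripping every combining mark is a blunt choice that suits the European cases named in the list; it may well be wrong for languages in which the marks are significant, which is one reason such rules would need to be scoped per language.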

There are others that are better left to MAY:
* Pinyin to ideograph equivalence for Chinese
* Hiragana to Kanji equivalence for Japanese
* Omitted vowel annotations for Arabic & Hebrew

To be a MUST, there must be broad consensus within the working group that the requirements are appropriate for a specific language+country, and a solid commitment from multiple parties to implement that behavior.  I further recommend that the specification explicitly state that further requirements for specific language+country pairs may be added in future versions based on external feedback and implementation experience.

-=- Jerry



On Jun 2, 2010, at 9:17 AM, Robin Berjon wrote:

> Dear I18N WG,
> 
> as part of our Contacts API[0] we have text search (e.g. matching names against input). We would like it to be loose, so that for instance "hazae" would possibly match "Hazaël" (so case-insensitive, partial, and I forget what matching "ë" for "e" is called but that one too).
> 
> We were wondering if this ought to be left up to implementations entirely or in part, or if it can be clearly defined. If the latter, has someone else done it so that we can steal it, and if the former is there any advice that we can at least give to implementations? We'd very much appreciate any information that you may have.
> 
> [0]http://dev.w3.org/2009/dap/contacts/
> 
> This is for ACTION-122.
> 
> --
> Robin Berjon
>  robineko  hired gun, higher standards
>  http://robineko.com/
> 
> 
> 
> 
> 
Received on Wednesday, 2 June 2010 14:57:56 GMT