Re: I18N-ISSUE-66: find() method sensitivity to Unicode normalization [Contacts API]

Dear I18N WG,

this is just a personal comment.

On Jul 4, 2011, at 21:38 , Internationalization Core Working Group Issue Tracker wrote:
> As with I18N-ISSUE-65, the find() method and search processing do not clearly define the details of "match". When processing a search, we feel that it should be clear if Unicode Normalization has been applied to the arguments and/or contacts being searched.
> 
> In our WG's opinion, Unicode normalization is desirable when searching, since many keyboards or user-agents generate non-normalized search strings (for example, Vietnamese keyboards vary by vendor). As a result, search strings entered by the user might not match content that uses a different normalization form. For example, a user might enter the pre-composed character U+1EA6 (LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND GRAVE)--or they might enter U+00C2 U+0300 instead. [They could also technically use U+0041 U+0302 U+0300, although this is less likely.] Ensuring that searches are done in a normalized manner will improve interoperability, since a collection of contacts may have been entered on a variety of devices and into a variety of systems.
> 
> The I18N WG recommends requiring comparisons be done in a Unicode normalized manner. We note that this is currently an issue raised before the TAG and guidance here is subject to change. If TAG were to decide that normalization is undesirable, a health warning would be warranted. 

Normalisation is indeed an issue. But, as with case-sensitivity, we can only do so much at our end. I believe that we could specify that strings be normalised before the search is performed, but we can't have much influence over what the contacts backend decides to do. If it has its entries stored without normalisation then results may be unpredictable. In other words the problem is about normalising late (before the string is fed to the black box that does the search) or very late (in the black box itself). I would sort of hope that said black box is doing its own normalisation, but it's outside our control.
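
To make the late versus very late distinction concrete, here is a minimal sketch in Python. It is only an illustration, not something the spec would mandate: `normalise_query` and `black_box_find` are hypothetical names, NFC is an arbitrary choice of form, and the "black box" is assumed to compare code points exactly, which a real backend may or may not do.

```python
import unicodedata

def normalise_query(query: str, form: str = "NFC") -> str:
    """Normalise 'late': just before the string is handed to the backend."""
    return unicodedata.normalize(form, query)

def black_box_find(stored_names, query):
    """Hypothetical backend that matches by exact code point comparison
    and performs no normalisation of its own (an assumption)."""
    return [name for name in stored_names if name == query]

# Three canonically equivalent spellings of the same letter.
precomposed = "\u1EA6"            # single pre-composed character
partly = "\u00C2\u0300"           # A WITH CIRCUMFLEX + COMBINING GRAVE
decomposed = "\u0041\u0302\u0300" # A + COMBINING CIRCUMFLEX + COMBINING GRAVE

# If the backend happens to store NFC, normalising the query is enough.
nfc_store = [precomposed]
found = black_box_find(nfc_store, normalise_query(decomposed))

# If the backend stores a different form and does not normalise,
# a normalised query still misses the entry.
raw_store = [decomposed]
missed = black_box_find(raw_store, normalise_query(partly))
```

In this sketch `found` contains the stored entry while `missed` is empty, which is the case where normalising at our end does not help and only the black box could fix the match.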

Either way, it's not something that we can reliably test for. A given browser might use a different contacts DB depending on the platform (e.g. Address Book on the Mac, Outlook on Windows, LDAP in a corporate environment) and may in fact be wrapping multiple contacts backends (the local address book, several sources from online social networks). A good implementation might therefore know that it needs a specific normalisation form for a given backend, but doesn't need any for another that does its own.
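
As a rough sketch of what such an implementation might do, assuming it knows (or has been configured with) the form each backend expects: the backend names and the `BACKEND_FORMS` table below are invented for illustration, and nothing here reflects how any particular address book actually stores its data.

```python
import unicodedata
from typing import Dict, Optional

# Hypothetical per-backend policy: the normalisation form to apply before
# querying, or None when the backend is assumed to normalise on its own.
BACKEND_FORMS: Dict[str, Optional[str]] = {
    "local-address-book": "NFC",
    "legacy-directory": "NFD",
    "online-service": None,
}

def prepare_query(backend: str, query: str) -> str:
    """Return the query in the form the given backend is assumed to expect.

    Backends that are unknown, or mapped to None, receive the string as-is.
    """
    form = BACKEND_FORMS.get(backend)
    if form is None:
        return query
    return unicodedata.normalize(form, query)
```

A user agent wrapping several sources would call `prepare_query` once per backend with the same user input, so each source sees the string in the form it is believed to need.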

I am aware that this is being discussed in the TAG, but I would rather not wait until a decision is made there. Based on the above, what do you think of noting that normalisation can be an issue and that implementers should be aware of it when they interface to specific backends?

-- 
Robin Berjon - http://berjon.com/ - @robinberjon

Received on Tuesday, 5 July 2011 12:47:34 UTC