I18N-ISSUE-66: find() method sensitivity to Unicode normalization [Contacts API]

Raised by: Koji Ishii
On product: Contacts API

Section 4.2.1 find method

Section 5

WG Approved: Yes

As with I18N-ISSUE-65, the find() method and search processing do not clearly define what constitutes a "match". When processing a search, we feel that it should be clear whether Unicode normalization has been applied to the arguments and/or the contacts being searched.

In our WG's opinion, Unicode normalization is desirable when searching, since different keyboards and user agents can generate different character sequences for the same text (for example, Vietnamese keyboards vary by vendor). As a result, search strings entered by the user might not match content that uses a different normalization form. For example, a user might enter the pre-composed character U+1EA6 (LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND GRAVE), or they might enter U+00C2 U+0300 instead. [They could also technically use the fully decomposed sequence U+0041 U+0302 U+0300, although this is less likely.] Ensuring that searches are done in a normalized manner will improve interoperability, since a collection of contacts may have been entered on a variety of devices and into a variety of systems.
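
To illustrate the mismatch described above, here is a minimal Python sketch using the standard unicodedata module. It is not part of the Contacts API: the helper name normalized_contains, the choice of NFC, and the sample contact name are assumptions made purely for illustration.

```python
import unicodedata


def normalized_contains(haystack: str, needle: str, form: str = "NFC") -> bool:
    """Hypothetical helper: substring test after normalizing both strings.

    The normalization form (NFC by default) is an assumption; the issue
    text does not prescribe a particular form.
    """
    return unicodedata.normalize(form, needle) in unicodedata.normalize(form, haystack)


# Three canonically equivalent spellings of the same letter.
precomposed = "\u1EA6"          # LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND GRAVE
partial = "\u00C2\u0300"        # A WITH CIRCUMFLEX + COMBINING GRAVE ACCENT
decomposed = "A\u0302\u0300"    # A + COMBINING CIRCUMFLEX + COMBINING GRAVE

# Hypothetical stored contact name, entered in pre-composed form.
contact_name = precomposed + "n"

# A naive code-point comparison misses the equivalent spellings.
naive_partial = partial in contact_name
naive_decomposed = decomposed in contact_name

# Normalizing both sides first finds the match in every case.
matches = [normalized_contains(contact_name, s) for s in (precomposed, partial, decomposed)]
```

Here the naive substring tests fail for the two decomposed spellings, while the normalized comparison succeeds for all three.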

The I18N WG recommends requiring that comparisons be done in a Unicode normalized manner. We note that this is currently an issue raised before the TAG and guidance here is subject to change. If the TAG were to decide that normalization is undesirable, a health warning would be warranted.

For more information on normalization see:

Unicode Standard Annex 15 http://www.unicode.org/reports/tr15/
Character Model-Normalization http://www.w3.org/TR/charmod-norm

Received on Monday, 4 July 2011 19:38:06 UTC