- From: asmusf via GitHub <sysbot+gh@w3.org>
- Date: Sat, 20 Feb 2016 21:57:09 +0000
- To: public-i18n-archive@w3.org
I think the general statement is that there is loose matching and (more or less) literal matching. Loose matching can be of many kinds. For example, for the DNS root zone we are working on a project that defines which simplified Chinese strings "match" which traditional Chinese strings. (The actual lookup in the DNS is literal, but the registration would be for bundles of matching labels, achieving an effect similar to loose matching.)

In the context of charmod, I think the statement should be in its most general terms. Loose matching can be required for some applications, but it can be difficult to formulate a single, general solution that is satisfactory for all users (let alone all types of applications). In the Arabic case, for the root zone, the project decided not to support short vowels in top-level domain names. For general text, and searches on general text, that solution isn't adequate.

One important consideration is that there are equivalences that fall outside Unicode normalization (even NFKx). The Danish/Norwegian O with slash (U+00D8) is functionally equivalent to Swedish O with diaeresis (U+00D6), but O with slash has no decomposition.

There are similar loose matching rules that might apply within the alphabet for a language; sometimes certain letters are pronounced the same way, and a loose matching that is "phonetic" might be needed. Sometimes it's possible to fold all diacritics on the same base letter; sometimes a language uses a few diacritics to generate new "letters" (rather than "new forms" of letters). In those cases, in that language's context, you'd not want to fold away all diacritics (only the "optional" ones, usually of foreign origin). And so on.

The Arabic/Hebrew case is just one example - a useful one, but only if presented in context, not if it is the only example of folding that is not derived from normalization.
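To make the point about normalization concrete, here is a minimal Python sketch using only the standard `unicodedata` module. It is an illustration, not a recommended algorithm: the function names, the `EXTRA_FOLDS` table and the `keep` parameter are all hypothetical, and the table holds just the one O-with-slash case mentioned above. It shows that folding derived from NFKD strips the diaeresis from U+00F6 but leaves U+00F8 alone, so any such equivalence has to come from separate, hand-maintained data, and that a language context may want to exempt some letters from folding.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Normalization-derived folding: decompose (NFKD), then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical, illustrative table of equivalences that normalization does not
# supply. A real table would be much larger and language-dependent.
EXTRA_FOLDS = {"\u00f8": "o", "\u00d8": "O"}  # o with slash, O with slash


def loose_fold(text: str, keep: frozenset = frozenset()) -> str:
    """Fold diacritics, except for precomposed letters listed in `keep`.

    `keep` stands in for the letters that a given language treats as
    distinct letters rather than as decorated forms of a base letter.
    """
    # Compose first so that `keep` can be checked against precomposed letters.
    composed = unicodedata.normalize("NFC", text)
    out = []
    for ch in composed:
        if ch in keep:
            out.append(ch)
        else:
            out.append(EXTRA_FOLDS.get(ch, fold_diacritics(ch)))
    return "".join(out)


if __name__ == "__main__":
    # O with diaeresis has a canonical decomposition; O with slash has none.
    print(repr(unicodedata.decomposition("\u00d6")))
    print(repr(unicodedata.decomposition("\u00d8")))
    print(fold_diacritics("\u00f6\u00f8\u00f3\u00f1"))
    print(loose_fold("\u00f6\u00f8\u00f3\u00f1"))
    print(loose_fold("\u00f6\u00f8\u00f3\u00f1", keep=frozenset("\u00f1")))
```

Passing a `keep` set is only one possible way to model a language context; an equivalence between two non-ASCII letters (such as U+00D8 and U+00D6) would need a mapping between those letters rather than a fold to the base letter.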
For identifiers, the concept of using a non-folded lookup with detailed rules on how to bundle "variants" (which then all resolve to the same target so as to emulate loose matching) should be mentioned as an alternative to folding. (A specification of how to set up the rules for that is found here: http://www.ietf.org/id/draft-ietf-lager-specification-08.txt).

A./

PS: from today's posts on the Unicode list, something very much on topic:

> Hello Unicode,
>
> I have been involved in a rather long discussion on the Emacs-devel
> mailing list[1] concerning the right way to do character folding and
> we've reached a point where input from Unicode experts would be welcome.
>
> The problem is the implementation of equivalence when searching for
> characters. For example, if I have a buffer containing the following
> characters (both using the precomposed and canonical forms):
>
> o ö ø ó n ñ
>
> The character folding feature in Emacs allows a search for "o" to match
> some or even all of these characters. The discussion on the mailing
> list has circulated around both the fact that the correct behaviour
> here is locale-dependent, and also on the correct way to implement
> this matching absent any locale-specific exceptions.
>
> An English speaker would probably expect a search for "o" to match the
> first 4 characters and a search for "n" to match the latter two.
>
> A Spanish speaker would expect that n and ñ be different but otherwise
> have the same behaviour as the English user.
>
> A Swedish user would definitely expect o and ö to compare differently,
> but ö and ø to compare the same.
>
> I have been reading the materials on unicode.org <http://unicode.org>
> trying to see if this has been specifically addressed anywhere by the
> Unicode Consortium, but my results are inconclusive at best.
>
> What is the "correct" way to do this from Unicode's perspective? There
> is clearly an aspect of locale-dependence here, but how far can the
> Unicode data help?
>
> In particular, as far as I can see there is no way that the Unicode
> charts can allow me to write an algorithm where o and ø are seen as
> similar (as would be expected by an English user).
>
> [1] https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg00506.html

--
GitHub Notification of comment by asmusf

Please view or discuss this issue at https://github.com/w3c/charmod-norm/issues/78#issuecomment-186689215 using your GitHub account
Received on Saturday, 20 February 2016 21:57:12 UTC