Re: [charmod-norm] Arabic short vowels and Hebrew pointing in string search from asmusf via GitHub on 2016-02-20 (public-i18n-archive@w3.org from January to March 2016)

From: asmusf via GitHub <sysbot+gh@w3.org>
Date: Sat, 20 Feb 2016 21:57:09 +0000
To: public-i18n-archive@w3.org
Message-ID: <issue_comment.created-186689215-1456005429-sysbot+gh@w3.org>
I think the general statement is that there is loose matching and
(more or less) literal matching. Loose matching can be of many kinds.
For example, for the DNS root zone we are working on a project that
defines which simplified Chinese strings "match" which traditional
Chinese strings. (The actual lookup in the DNS is literal, but the
registration would be for bundles of matching labels, achieving an
effect similar to loose matching.)

In the context of charmod, I think the statement should be in its most
general terms.

Loose matching can be required for some applications but it can be
difficult to formulate a single, general solution that is satisfactory
for all users (let alone all types of applications).

In the Arabic case, for the root zone, the project decided to not 
support short vowels in top-level domain names. For general text, and 
searches on general text, that solution isn't adequate.

One important consideration is that there are equivalences that fall 
outside the Unicode normalization (even NFKx). The Danish/Norwegian O 
with slash (U+00D8) is functionally equivalent to Swedish O with 
diaeresis (U+00D6), but O with slash has no decomposition.
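
As a quick illustration of that point, here is a minimal check using
Python's standard unicodedata module. It is only a sketch; the constant
names are mine, and the results depend on the Unicode data shipped with
the interpreter.

```python
import unicodedata

O_SLASH = "\u00d8"      # LATIN CAPITAL LETTER O WITH STROKE
O_DIAERESIS = "\u00d6"  # LATIN CAPITAL LETTER O WITH DIAERESIS

# O with diaeresis decomposes canonically into O + COMBINING DIAERESIS.
diaeresis_mapping = unicodedata.decomposition(O_DIAERESIS)

# O with slash has no decomposition mapping, so the result is empty.
slash_mapping = unicodedata.decomposition(O_SLASH)

# None of the four normalization forms changes O with slash.
unchanged = all(
    unicodedata.normalize(form, O_SLASH) == O_SLASH
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)

print(repr(diaeresis_mapping), repr(slash_mapping), unchanged)
```

So any matching that treats the two letters as equivalent has to come
from rules outside normalization.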

There are similar loose matching rules that might apply within the
alphabet for a language; sometimes certain letters are pronounced the
same way, and a loose matching that is "phonetic" might be needed.

Sometimes it's possible to fold all diacritics on the same base
letter; sometimes a language uses a few diacritics to generate new
"letters" (rather than "new forms" of letters). In those cases, in
that language's context, you'd not want to fold away all diacritics
(only the "optional" ones, usually of foreign origin). And so on.
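
A rough sketch of what such selective folding could look like, in
Python. The function name, the "keep" parameter and the Swedish letter
set are assumptions made for illustration, not taken from any
specification.

```python
import unicodedata

def fold_diacritics(text, keep=frozenset()):
    """Strip combining marks, except on characters listed in `keep`."""
    out = []
    # Compose first so that protected letters appear as single characters.
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)  # a separate letter in this language
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed
                           if not unicodedata.combining(c)))
    return "".join(out)

# Assumed set of letters a Swedish context would protect: a-ring,
# a-diaeresis, o-diaeresis, in both cases.
SWEDISH_LETTERS = frozenset("\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6")

text = "Malm\u00f6 caf\u00e9"
print(fold_diacritics(text))                        # folds everything
print(fold_diacritics(text, keep=SWEDISH_LETTERS))  # keeps o-diaeresis
```

The protected set is exactly the part that cannot be derived from the
character data alone; it has to be supplied per language.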

The Arabic/Hebrew case is just one example - a useful one, but only if
presented in context, not if this is the only example of
non-normalization-derived folding.

For identifiers, the concept of using a non-folded lookup with
detailed rules on how to bundle "variants" (which then all resolve to
the same target so as to emulate loose matching) should be mentioned
as an alternative to folding. (A specification of how to set up the
rules for that is found here:
http://www.ietf.org/id/draft-ietf-lager-specification-08.txt).
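
To make the bundling idea concrete, here is a toy sketch in Python. The
class and method names are hypothetical, the variant pair is just an
illustrative simplified/traditional example, and nothing here attempts
to follow the rule format defined in the draft linked above.

```python
class VariantRegistry:
    """Literal lookup; variants are registered together as one bundle."""

    def __init__(self):
        self._targets = {}

    def register(self, primary, variants, target):
        bundle = {primary, *variants}
        # Refuse the whole bundle if any member is already taken.
        taken = sorted(label for label in bundle if label in self._targets)
        if taken:
            raise ValueError(f"already registered: {taken}")
        for label in bundle:
            self._targets[label] = target

    def lookup(self, label):
        # No folding at lookup time: exact string match only.
        return self._targets.get(label)


registry = VariantRegistry()
# U+56FD (simplified) and U+570B (traditional) as one bundle.
registry.register("\u56fd", ["\u570b"], target="example-target")

print(registry.lookup("\u56fd"))  # resolves via the primary label
print(registry.lookup("\u570b"))  # resolves via the variant label
print(registry.lookup("\u56ed"))  # unrelated label: no match
```

The lookup stays literal; the effect of loose matching comes entirely
from which labels were placed in the same bundle at registration time.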

A./

PS: from today's posts on the Unicode list, something very much on 
topic:

> Hello Unicode,
>
> I have been involved in a rather long discussion on the Emacs-devel
> mailing list[1] concerning the right way to do character folding and
> we've reached a point where input from Unicode experts would be
> welcome.
>
> The problem is the implementation of equivalence when searching for
> characters. For example, if I have a buffer containing the following
> characters (both using the precomposed and canonical forms):
>
>     o ö ø ó n ñ
>
> The character folding feature in Emacs allows a search for "o" to
> match some or even all of these characters. The discussion on the
> mailing list has circulated around both the fact that the correct
> behaviour here is locale-dependent, and also on the correct way to
> implement this matching absent any locale-specific exceptions.
>
> An English speaker would probably expect a search for "o" to match
> the first 4 characters and a search for "n" to match the latter two.
>
> A Spanish speaker would expect that n and ñ be different but
> otherwise have the same behaviour as the English user.
>
> A Swedish user would definitely expect o and ö to compare
> differently, but ö and ø to compare the same.
>
> I have been reading the materials on unicode.org trying to see if
> this has been specifically addressed anywhere by the Unicode
> Consortium, but my results are inconclusive at best.
>
> What is the "correct" way to do this from Unicode's perspective?
> There is clearly an aspect of locale-dependence here, but how far
> can the Unicode data help?
>
> In particular, as far as I can see there is no way that the Unicode 
> charts can allow me to write an algorithm where o and ø are seen as 
> similar (as would be expected by an English user).
>
> [1] https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg00506.html
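
The expectations described in the quoted post can be sketched as
per-locale tailorings on top of generic mark stripping. The table
below is an assumption built only from the three examples in the post
(it is not Unicode or CLDR data), and all names are hypothetical.

```python
import unicodedata

O_SLASH = "\u00f8"      # o with stroke: has no decomposition
O_DIAERESIS = "\u00f6"  # o with diaeresis
N_TILDE = "\u00f1"      # n with tilde

# keep:  letters that must not be folded in this locale
# extra: equivalences that the decomposition data cannot provide
TAILORINGS = {
    "en": {"keep": frozenset(), "extra": {O_SLASH: "o"}},
    "es": {"keep": frozenset(N_TILDE), "extra": {O_SLASH: "o"}},
    "sv": {"keep": frozenset(O_DIAERESIS), "extra": {O_SLASH: O_DIAERESIS}},
}

def search_key(text, locale):
    """Return a key such that strings with equal keys 'match'."""
    rules = TAILORINGS[locale]
    out = []
    for ch in unicodedata.normalize("NFC", text.lower()):
        ch = rules["extra"].get(ch, ch)  # locale-specific equivalence
        if ch in rules["keep"]:
            out.append(ch)               # distinct letter in this locale
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)

# The six sample characters: o, o-diaeresis, o-stroke, o-acute, n, n-tilde
sample = ["o", O_DIAERESIS, O_SLASH, "\u00f3", "n", N_TILDE]
for locale in ("en", "es", "sv"):
    print(locale, [search_key(ch, locale) for ch in sample])
```

Note that the o-with-stroke entries are needed in every locale: that
equivalence has to be supplied by hand because the character has no
decomposition.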

-- 
GitHub Notification of comment by asmusf
Please view or discuss this issue at 
https://github.com/w3c/charmod-norm/issues/78#issuecomment-186689215 
using your GitHub account
Received on Saturday, 20 February 2016 21:57:12 UTC