[charmod-norm] Normalisation for Case-Insensitive Comparison

Richard57 has just created a new issue for https://github.com/w3c/charmod-norm:

== Normalisation for Case-Insensitive Comparison ==
In Section 3.1, Step 3 ('Normalisation'), do we really want the result of a case-insensitive comparison under Unicode full case folding of "ᾳ͙" <U+03B1 GREEK SMALL LETTER ALPHA, U+0359 COMBINING ASTERISK BELOW, U+0345 COMBINING GREEK YPOGEGRAMMENI> and "α͙ι" <U+03B1 GREEK SMALL LETTER ALPHA, U+0359 COMBINING ASTERISK BELOW, U+03B9 GREEK SMALL LETTER IOTA> to depend on the choice of normalisation form? NF(K)C yields 'different', while NF(K)D yields 'identical'. (The combining mark U+0359 was added to support the retranscription of deteriorated Greek manuscripts.)
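The dependence on the normalisation form can be reproduced with a minimal sketch. It assumes Python's `unicodedata.normalize` as a stand-in for Step 3 and `str.casefold` (which applies full case folding) as a stand-in for Step 4; the helper name `normalise_then_fold` and the variable names are purely illustrative.

```python
import unicodedata

# The two strings from the example, written as escapes so that copying
# this snippet cannot silently renormalise them.
with_ypogegrammeni = "\u03b1\u0359\u0345"  # alpha, asterisk below, ypogegrammeni
with_iota = "\u03b1\u0359\u03b9"           # alpha, asterisk below, iota


def normalise_then_fold(form: str, text: str) -> str:
    """Hypothetical helper: normalise (Step 3), then full case fold (Step 4)."""
    return unicodedata.normalize(form, text).casefold()


# For each normalisation form, do the two strings compare equal?
results = {
    form: normalise_then_fold(form, with_ypogegrammeni)
    == normalise_then_fold(form, with_iota)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
print(results)
```

Under NF(K)C, U+03B1 and U+0345 compose to U+1FB3 across the intervening U+0359, and folding U+1FB3 then places the iota ahead of the asterisk, so the code point sequences differ.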

The sequence of Step 4 ("case-folding") followed by Step 6 ("compare code points") does not work properly, because case-folding does not preserve normalisation form. For example, in the comparison of the NFC strings "sś" <U+0073, U+015B> and "ß́" <U+00DF LATIN SMALL LETTER SHARP S, U+0301>, default case-folding yields the strings <U+0073, U+015B> and <U+0073, U+0073, U+0301>. However, converting to NFD and then case-folding would yield <U+0073, U+0073, U+0301> for both strings.
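The sharp-s example can be checked the same way. This is a sketch under the assumption that Python's `str.casefold` matches the default (full) case folding meant here; all variable names are illustrative.

```python
import unicodedata

s_sacute = "s\u015b"            # <U+0073, U+015B>
sharp_s_acute = "\u00df\u0301"  # <U+00DF, U+0301>

# Both inputs are already in NFC, so Step 3 with NFC changes nothing.
both_nfc = all(
    unicodedata.normalize("NFC", s) == s for s in (s_sacute, sharp_s_acute)
)

# Case-folding the NFC strings directly leaves them different.
fold_only_equal = s_sacute.casefold() == sharp_s_acute.casefold()

# Decomposing first and then folding makes them identical.
nfd_then_fold_equal = (
    unicodedata.normalize("NFD", s_sacute).casefold()
    == unicodedata.normalize("NFD", sharp_s_acute).casefold()
)

# Code points of the folded sharp-s string, for inspection.
folded_sharp_s = [f"U+{ord(c):04X}" for c in sharp_s_acute.casefold()]

print(both_nfc, fold_only_equal, nfd_then_fold_equal, folded_sharp_s)
```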

Normalisation is required after case-folding. By contrast, except for strings that contain U+0345 when fully decomposed, canonical normalisation (NFC or NFD) is not required before case-folding. However, compatibility decomposition, if applied, must be performed before case-folding.
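One possible ordering that satisfies these constraints is sketched below: decompose, case-fold, then normalise again. The function names `canonical_caseless_key` and `fold_then_normalise` are hypothetical, NFD is an arbitrary choice for the final normalisation, and U+1D400 MATHEMATICAL BOLD CAPITAL A is an example picked here (not taken from the specification) to illustrate the compatibility case.

```python
import unicodedata


def canonical_caseless_key(text: str) -> str:
    """Hypothetical key: NFD, then full case fold, then NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed.casefold())


def fold_then_normalise(text: str) -> str:
    """Hypothetical key that skips normalisation before folding."""
    return unicodedata.normalize("NFD", text.casefold())


latin_a, latin_b = "s\u015b", "\u00df\u0301"
greek_a, greek_b = "\u1fb3\u0359", "\u03b1\u0359\u03b9"  # greek_a is in NFC

# Normalising after folding is enough for the Latin pair ...
latin_ok = fold_then_normalise(latin_a) == fold_then_normalise(latin_b)
# ... but not for the pair that contains U+0345 when fully decomposed.
greek_without_pre = fold_then_normalise(greek_a) == fold_then_normalise(greek_b)
greek_with_pre = canonical_caseless_key(greek_a) == canonical_caseless_key(greek_b)

# Compatibility decomposition has to precede folding: U+1D400 has no case
# folding of its own, but its compatibility decomposition "A" does.
bold_a = "\U0001d400"
fold_then_nfkd = unicodedata.normalize("NFKD", bold_a.casefold())
nfkd_then_fold = unicodedata.normalize("NFKD", bold_a).casefold()

print(latin_ok, greek_without_pre, greek_with_pre, fold_then_nfkd, nfkd_then_fold)
```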


Please view or discuss this issue at https://github.com/w3c/charmod-norm/issues/172 using your GitHub account

Received on Friday, 11 May 2018 19:30:41 UTC