Re: [charmod-norm] Not all precomposed characters are reachable by NFC (#190)

> What NFC gives the user is a string that can be compared to other NFC strings for equality with the minimum number of combining marks for that purpose.

Hmm. That's not actually true either, since precomposed characters exist for nukta combinations in scripts like Devanagari and Bengali (e.g. U+0958 DEVANAGARI LETTER QA), but NFC doesn't use them. So NFC doesn't give you the minimum number of combining marks in that case, because you could have zero by using the precomposed character.
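
For anyone who wants to check this, here is a minimal sketch using Python's standard `unicodedata` module. The two sample characters and the helper name `nfc_codepoints` are my own choices for illustration, not anything from the spec text.

```python
import unicodedata

# Precomposed nukta letters chosen for illustration:
# U+0958 DEVANAGARI LETTER QA and U+09DC BENGALI LETTER RRA.
samples = {"Devanagari QA": "\u0958", "Bengali RRA": "\u09DC"}

def nfc_codepoints(s):
    """Return the code points of the NFC form of s as U+XXXX strings."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", s)]

for name, ch in samples.items():
    # Each precomposed letter comes back as base letter + nukta.
    print(name, nfc_codepoints(ch))
```

Both characters come out of NFC as two code points (base letter plus nukta), even though a single precomposed code point went in.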

Actually, I think that the rationale behind NFC has more to do with a nominal compatibility with legacy standards. So I guess I should try to suggest something. How about this (changes signalled using bold):
--
These two types of Unicode-defined equivalence are then grouped by another pair of variations: "decomposition" and "composition". In "decomposition", separable logical parts of a visual character are broken out into a sequence of base characters and combining marks and the resulting code points are put into a fixed, canonical order. In "composition", the decomposition is performed and then combining marks are recombined **according to certain rules** with their base characters.
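
As a rough illustration of the decomposition, canonical ordering and recomposition steps described above, here is a small Python sketch using the standard `unicodedata` module. The sample string and the helper name `codepoints` are assumptions made for this example only.

```python
import unicodedata

def codepoints(s):
    """Format a string as a list of U+XXXX code points."""
    return [f"U+{ord(c):04X}" for c in s]

# "s" followed by dot above (U+0307) and then dot below (U+0323),
# i.e. with the combining marks not in canonical order.
text = "s\u0307\u0323"

nfd = unicodedata.normalize("NFD", text)  # decompose and reorder the marks
nfc = unicodedata.normalize("NFC", text)  # decompose, reorder, then recompose

print(codepoints(nfd))
print(codepoints(nfc))
```

In NFD the dot below is moved ahead of the dot above, and in NFC the whole sequence is then recombined into the single code point U+1E69.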

Roughly speaking, <abbr title="Normalization Form C">NFC</abbr> is defined such that each combining character sequence (a base character followed by one or more combining characters) is replaced, as far as possible, by a canonically equivalent precomposed character.

It is rather important to notice what this does <strong>not</strong> mean. The resulting character sequence can still contain combining marks, since not all character sequences have a precomposed equivalent. **Indeed, as we've seen, many scripts offer no alternative to the use of combining marks: for example, the Devanagari vowels in <a href="#graphemeExample">this example</a>. In other cases, a given base character and combining mark are not replaced with a precomposed character because the normalization rules prevent the composition. For example, due to composition exclusion rules, certain sequences of base plus diacritic in some Indic scripts are not composed, even though a matching precomposed character exists. Composition may also be blocked by another combining mark between the two characters that would otherwise combine.**
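
The blocking case in the last sentence can be checked with a short Python sketch using the standard `unicodedata` module. The sample sequences (with U+0305 COMBINING OVERLINE and U+0316 COMBINING GRAVE ACCENT BELOW as the intervening marks) and the helper name `nfc_hex` are chosen purely for illustration.

```python
import unicodedata

def nfc_hex(s):
    """Return the NFC form of s as a list of U+XXXX code points."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", s)]

# Plain case: "a" + acute accent composes to U+00E1.
plain = "a\u0301"

# Blocked: the overline (U+0305) has the same combining class as the
# acute accent (230), so it sits between them and blocks the composition.
blocked = "a\u0305\u0301"

# Not blocked: grave accent below (U+0316) has a lower combining class
# (220), so the acute accent can still combine with the "a".
not_blocked = "a\u0316\u0301"

for label, s in [("plain", plain), ("blocked", blocked), ("not blocked", not_blocked)]:
    print(label, nfc_hex(s))
```

Only the middle sequence stays fully decomposed. In the third, the acute accent composes with the base and the mark below is left trailing after the precomposed letter.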

I'd just omit the paragraph you quoted in the previous comment. "What NFC gives the user..."

-- 
GitHub Notification of comment by r12a
Please view or discuss this issue at https://github.com/w3c/charmod-norm/issues/190#issuecomment-456891284 using your GitHub account

Received on Wednesday, 23 January 2019 17:24:29 UTC