Re: [charmod-norm] Limitations of Normalization - Confusion

(1) You can use "homograph", which also covers cases that involve 
more than a single "glyph". It is a generic term that simply means 
"written alike". It may not need the bold face, but it's a useful 
term to introduce.
(2) The more I look at it, the more I think the non-decomposable 
problem may be something of a red herring. The latest example I have 
come across: in Khmer, the two sequences U+17D2 U+178F (COENG + TA) 
and U+17D2 U+178A (COENG + DA) display absolutely identically (while 
the two letters, standing alone, differ significantly in appearance). 
This is not covered by the non-decomposable issue, because these are 
not composed vs. decomposed sequences. And it occurs within the same 
language, using the same keyboard.
(Aside: my best guess is that some constraint in the language doesn't 
allow an actual minimal pair, that is, two words that are identical 
except for that sequence. The writing system can therefore get away 
with re-using a form, but when typing, people want to type the letter 
(DA or TA) that corresponds to the actual sound. For identifiers, 
this opens the door to spoofing, unless some steps are taken to 
prevent the use of a minimal pair.) 
(3) I agree. An example of a non-decomposable character (for example 
0781), the Khmer example I just gave, an example of a digraph, and 
the example of Latin turned e (I think that's the name) would 
together be good for mapping out the nature of the problem.
(4) Besides an additional transformation, there are other steps that 
can be taken, depending on the protocols involved. Where identifiers 
are registered, registering one can be made to block the other (its 
homograph) from registration. 
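
To make (2) and (4) concrete, here is a rough sketch in Python. It 
first checks that none of the four normalization forms unifies the 
two Khmer sequences, then shows one possible way a registry could 
block a homograph once its counterpart is registered. HOMOGRAPH_MAP, 
blocking_key and Registry are hypothetical names made up for this 
illustration, and the one-entry table is an assumption, not a 
complete list of Khmer homographs.

```python
import unicodedata

COENG = "\u17D2"  # KHMER SIGN COENG
TA = "\u178F"     # KHMER LETTER TA
DA = "\u178A"     # KHMER LETTER DA

COENG_TA = COENG + TA
COENG_DA = COENG + DA


def normalization_unifies(a, b):
    """Return True if any Unicode normalization form maps a and b
    to the same string."""
    return any(
        unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    )


# Hypothetical table: each homograph sequence is mapped to one
# arbitrarily chosen representative. Only the Khmer pair from (2)
# is listed; a real table would need many more entries.
HOMOGRAPH_MAP = {COENG_DA: COENG_TA}


def blocking_key(identifier):
    """Fold an identifier to a key shared by all of its homographs."""
    key = unicodedata.normalize("NFC", identifier)
    for variant, representative in HOMOGRAPH_MAP.items():
        key = key.replace(variant, representative)
    return key


class Registry:
    """Toy registry: the first registration of a key blocks every
    later identifier that folds to the same key."""

    def __init__(self):
        self._by_key = {}

    def register(self, identifier):
        key = blocking_key(identifier)
        if key in self._by_key:
            return False  # blocked by an earlier homograph
        self._by_key[key] = identifier
        return True
```

With this sketch, normalization_unifies(COENG_TA, COENG_DA) is False, 
so normalization alone does not help. Registering "\u1780" + COENG_TA 
succeeds, and a later attempt to register "\u1780" + COENG_DA is 
refused. The identifiers themselves are stored unchanged; only the 
lookup key is folded, so this blocks the homograph without adding a 
transformation step to the identifier.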

-- 
GitHub Notification of comment by asmusf
Please view or discuss this issue at 
https://github.com/w3c/charmod-norm/issues/88#issuecomment-206437324 
using your GitHub account

Received on Wednesday, 6 April 2016 15:48:17 UTC