[charmod-norm] Limitations of Normalization - Confusion

klensin has just created a new issue for 
https://github.com/w3c/charmod-norm:

== Limitations of Normalization - Confusion ==
In addition to other issues covered by Richard's notes,...

(1) "Homoglyph" is not a generally-recognized term.  Perhaps "called a
 homoglyph in Unicode documents" would be better.

(2) The so-called non-decomposable problem probably deserves mention. 
It concerns characters that can be formed by combining sequences but 
that do not have decompositions to at least one such sequence, where 
all of the code points involved that are associated with particular 
scripts are members of the same script as each other and as the 
composite character.  AFAICT, UTS39 does not cover that set of cases. 
They can set a trap when users try to input characters of a script 
without quite the right keyboard, and they interact with language 
preferences in ways that I, at least, still don't fully understand.

(3) The paragraph starting "Similar examples of identical 
appearance..." at least needs an example or two, whether the above is 
incorporated or not.  As it is, it reads like hand-waving, especially 
for the cases UTS39 does not address.

(4) In the last paragraph, starting "Finally, note that Unicode 
Normalization, even..." you might want to note that some systems do 
equate these characters by add-on steps to Normalization.  IDNA2003 
definitely did so.  I haven't checked whether UTS46 still does, but 
doing so would be consistent with its apparent principle of preserving
 everything that "worked in" IDNA2003.
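
A minimal sketch of the kind of add-on step point (4) describes. The 
email does not name the characters involved, so this assumes the 
paragraph is about full stops such as U+002E and U+3002. The mapping 
table follows the label separators listed in RFC 3490 section 3.1; the 
names IDNA2003_DOTS and fold_dots are hypothetical, and this is not a 
full IDNA2003 or UTS46 implementation.

```python
import unicodedata

# Dots that RFC 3490 section 3.1 says must be recognized as label
# separators, besides U+002E itself (table name is hypothetical).
IDNA2003_DOTS = {
    "\u3002": ".",  # IDEOGRAPHIC FULL STOP
    "\uff0e": ".",  # FULLWIDTH FULL STOP
    "\uff61": ".",  # HALFWIDTH IDEOGRAPHIC FULL STOP
}


def fold_dots(text: str) -> str:
    """Apply NFKC, then map any remaining dot variants to U+002E."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(IDNA2003_DOTS.get(ch, ch) for ch in normalized)


label = "example\u3002com"

# Normalization alone leaves U+3002 in place, so the strings differ.
print(unicodedata.normalize("NFKC", label) == "example.com")  # False

# The add-on mapping step is what equates the two.
print(fold_dots(label) == "example.com")  # True
```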

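Point (2) can be illustrated with a short check, offered as a sketch 
rather than as the cases the issue had in mind. The two pairs below 
are assumptions chosen for illustration: U+00F8 versus "o" followed by 
U+0338, and U+08A1 versus U+0628 followed by U+0654. In both, the 
precomposed character has no canonical decomposition, so no 
normalization form equates it with the combining sequence. The helper 
name canonically_equivalent is hypothetical.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """True when two strings are identical after NFD normalization."""
    nfd_a = unicodedata.normalize("NFD", a)
    nfd_b = unicodedata.normalize("NFD", b)
    return nfd_a == nfd_b


# (precomposed character, similar-looking combining sequence)
PAIRS = [
    ("\u00f8", "o\u0338"),       # Latin o with stroke vs o + long solidus overlay
    ("\u08a1", "\u0628\u0654"),  # Arabic beh with hamza above vs beh + hamza above
    ("\u00e9", "e\u0301"),       # control: e with acute does decompose
]

for composite, sequence in PAIRS:
    print(
        "U+%04X" % ord(composite),
        "decomposition:", repr(unicodedata.decomposition(composite)),
        "equivalent to sequence:", canonically_equivalent(composite, sequence),
    )
```

Only the control pair compares as equivalent; the other two remain 
distinct strings under NFC and NFD alike.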

See https://github.com/w3c/charmod-norm/issues/88

Received on Wednesday, 6 April 2016 14:47:59 UTC