Re: Unicode Normalization

Hi Ambrose,

On Feb 6, 2009, at 4:40 PM, Ambrose Li wrote:

> 2009/2/6 Robert J Burns <rob@robburns.com>:
>> Another singleton example is:
>>
>> 1) 慈 (U+2F8A6) [non-normalized]
>> 2) 慈 (U+6148) [NFC and NFD]
>>
>> I note the font HiraKakuProN-W3 on my system presents these with
>> slightly different glyphs, which, as I said before, should be
>> considered a bug (but, like
>
> I disagree here. The whole point of the U+2Fxxx block of
> "compatibility ideographs" is to allow one to specify a particular
> form when the form actually matters (e.g., when dealing with ancient
> texts). I ran into U+2F999 just a week ago. (I have to look through
> the charts to pick out the correct character. This had to be
> contrasted with U+831D which is the normalized form, and the content
> that I had to mark up actually says something to the effect of "U+831D
> is probably an erroneous form of U+2F999…". This would make no sense
> if the two glyphs show up the same). Therefore the fonts MUST display
> the two differently; I would consider it a bug if U+2F999 looks the
> same as U+831D.

This is exactly the point I address in the rest of my email message
regarding singleton versus non-singleton canonical equivalence (see
"As I go on to say in the very next paragraph:" below). I too say the
situation is a mess for canonical singletons. Unicode has been very
inconsistent on this, and the lack of guidance from the very beginning
to font makers and input-system creators (and, through them, to
authors) is a big part of the problem. The problem we face is that
non-singleton canonical equivalence is indispensable for proper text
processing, while the singletons are a spill-over of debates within
the standardization process. It would have been better to include the
compatibility-equivalent positional forms and ligatures in canonical
equivalence than to include these often-disputed singletons, which
really muck up the situation (e.g., it is probably much less
controversial to say that "ffl" and "ffl" (U+FB04) are equivalent
strings than it is to say that "慈" (U+2F8A6) and "慈" (U+6148) are
equivalent strings). Similarly for Arabic positional forms: in terms
of plain text, the distinction between an initial-form Beh and a
final-form Beh is inconsequential. It only matters in rendering, and
even then only for legacy rendering systems (in other words, in terms
of plain text, some compatibility decompositions behave more like
canonical equivalents, except for legacy rendering support).
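To make the asymmetry concrete, here is a quick sketch of mine in
Python using the standard unicodedata module (my illustration only,
not anything from the thread): NFC silently folds the singleton
ideograph away, while the ffl ligature survives anything short of
compatibility (NFKC) normalization.

import unicodedata

# Singleton canonical decomposition: NFC maps the compatibility
# ideograph U+2F8A6 to U+6148, so the distinction is lost even
# under "canonical" normalization.
print(unicodedata.normalize("NFC", "\U0002F8A6") == "\u6148")  # True

# The ligature U+FB04 has only a compatibility decomposition:
# NFC leaves it alone; only NFKC folds it to the three letters "ffl".
print(unicodedata.normalize("NFC", "\uFB04"))   # U+FB04, unchanged
print(unicodedata.normalize("NFKC", "\uFB04"))  # 'ffl' (f, f, l)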

I will say, though, that the dispute you raise is part of the ongoing
controversies within Unicode, now spilled over into this debate. I
don't have a horse in that race. But what I want most of all is for
Unicode and the W3C to find a solution we can all live with. If
singleton canonical equivalence is controversial, let's jettison it
and focus on combining marks, which I do not think should be
controversial at all, and which are essential for string comparison
in a way that singleton canonical string matching is not.

> My personal opinion regarding CJK unification is that it's an
> inconsistent mess. But that'd be off-topic here.

I think this is only tangentially related to Han Unification (which I
agree has its own separate problems). These singleton canonical
equivalents are typically included because of duplicates (perceived
correctly or not) within a particular language, and sometimes even
within the same language-and-encoding combination. I don't know much
of anything about Han, so I can't say, nor do I want to get involved
in those debates. But I think we can agree that the underlying order
of combining marks that places a dot below and a circumflex above is
not a meaningful difference we want to preserve for string comparison.
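For instance (again a sketch of mine in Python, not from the thread):
two strings that differ only in the stored order of a dot below
(U+0323) and a circumflex (U+0302) compare unequal code point by code
point, but canonical reordering under NFC (or NFD) makes them match.

import unicodedata

a = "q\u0323\u0302"  # q + combining dot below + combining circumflex
b = "q\u0302\u0323"  # q + combining circumflex + combining dot below

print(a == b)  # False: the raw code point sequences differ
print(unicodedata.normalize("NFC", a) ==
      unicodedata.normalize("NFC", b))  # True: marks in canonical order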

>> input systems, font makers really have not gotten clear norms about
>> this). At least in the case of the name of this character ("CJK
>> COMPATIBILITY IDEOGRAPH-2F8A6"), the name provides some indication
>> of discouraged use (which may be all an author encounters when using
>> a character input system).

As I go on to say in the very next paragraph:

>>
>> My feeling is that singletons are an ill-conceived part of NFC and
>> NFD normalization (closer to compatibility decompositions than
>> canonical decompositions),

In other words, these singleton canonical decompositions are different
enough from the other canonical equivalents that they should have had
a compatibility keyword (such as "singleton") attached to them so they
would not be treated as canonical equivalents. Perhaps this could have
been avoided if, from the beginning of Unicode, clear agreement had
been reached on the use of these characters; over fifteen years later,
we have some serious problems.
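The distinction is already visible in the character database: as a
quick Python illustration of my own, compatibility decompositions
carry a formatting keyword while singleton canonical decompositions
carry none at all.

import unicodedata

# Compatibility decompositions are tagged (here "<compat>");
# singleton canonical decompositions are bare code points.
print(unicodedata.decomposition("\uFB04"))      # '<compat> 0066 0066 006C'
print(unicodedata.decomposition("\U0002F8A6"))  # '6148' (no keyword)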

>> but that the non-singleton parts of normalization are essential to
>> proper text handling (and I don't see how Unicode could have avoided
>> or could avoid in the future such non-singleton canonical
>> normalization).

So if the singletons are a problem, we may need an entirely new
normalization form, one that can be defined with existing Unicode
properties (since it is quite easy to distinguish a canonically
decomposable singleton character from a non-singleton).
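As a rough sketch of what I mean (mine alone, in Python, with function
names I made up), existing properties suffice both to detect
singletons and to build a normalization that leaves them untouched:

import unicodedata

def is_canonical_singleton(ch: str) -> bool:
    """True if ch canonically decomposes to exactly one other
    character, e.g. U+2F8A6 -> U+6148 (no compatibility keyword)."""
    d = unicodedata.decomposition(ch)
    return bool(d) and not d.startswith("<") and len(d.split()) == 1

def nfc_preserving_singletons(s: str) -> str:
    """Hypothetical variant of NFC: normalize everything except
    singleton characters, which pass through unchanged."""
    out, run = [], []
    for ch in s:
        if is_canonical_singleton(ch):
            out.append(unicodedata.normalize("NFC", "".join(run)))
            out.append(ch)
            run = []
        else:
            run.append(ch)
    out.append(unicodedata.normalize("NFC", "".join(run)))
    return "".join(out)

print(is_canonical_singleton("\U0002F8A6"))     # True
print(is_canonical_singleton("\u00E9"))         # False (two-character decomposition)
print(nfc_preserving_singletons("\U0002F8A6"))  # U+2F8A6 is preserved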

It is also possible that the disputes over singletons simply would not
apply to markup normalization, so markup could be normalized according
to NFC while content normalization would require a more nuanced
approach that avoided singleton normalization (singletons, like all
compatibility-decomposable characters, could be prohibited from
markup, including attribute values).
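A checker for such a prohibition could be as simple as the following
Python sketch (again my own illustration, with a made-up name),
rejecting any character that is either compatibility decomposable or
a canonical singleton:

import unicodedata

def allowed_in_markup(s: str) -> bool:
    """Reject compatibility-decomposable characters and
    canonical singletons; allow everything else."""
    for ch in s:
        d = unicodedata.decomposition(ch)
        if d.startswith("<"):          # compatibility decomposable
            return False
        if d and len(d.split()) == 1:  # canonical singleton
            return False
    return True

print(allowed_in_markup("title"))        # True
print(allowed_in_markup("\uFB04"))       # False (ligature ffl)
print(allowed_in_markup("\U0002F8A6"))   # False (singleton ideograph)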

Take care,
Rob

Received on Friday, 6 February 2009 23:34:33 UTC