- From: Robert J Burns <rob@robburns.com>
- Date: Fri, 6 Feb 2009 17:33:44 -0600
- To: Ambrose Li <ambrose.li@gmail.com>
- Cc: public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
Hi Ambrose,

On Feb 6, 2009, at 4:40 PM, Ambrose Li wrote:

> 2009/2/6 Robert J Burns <rob@robburns.com>:
>> Another singleton example is:
>>
>> 1) 慈 (U+2F8A6) [non-normalized]
>> 2) 慈 (U+6148) [NFC and NFD]
>>
>> I note the font HiraKakuProN-W3 on my system presents these with
>> slightly different glyphs, which as I said before should be
>> considered a bug (but like
>
> I disagree here. The whole point of the U+2Fxxx block of
> "compatibility ideographs" is to allow one to specify a particular
> form when the form actually matters (e.g., when dealing with ancient
> texts). I ran into U+2F999 just a week ago. (I had to look through
> the charts to pick out the correct character. This had to be
> contrasted with U+831D, which is the normalized form, and the content
> that I had to mark up actually says something to the effect of
> "U+831D is probably an erroneous form of U+2F999…". This would make
> no sense if the two glyphs showed up the same.) Therefore the fonts
> MUST display the two differently; I would consider it a bug if
> U+2F999 looks the same as U+831D.

This is exactly the point I address in the rest of my email message
regarding singleton versus non-singleton canonical equivalence (see
"As I go on to say in the very next paragraph:" below). I too say it
is a mess for canonical singletons. Unicode has been very inconsistent
on this, and the lack of guidance from the very beginning to font
makers and input-system creators (and through them to authors) is a
big part of the problem.

The problem we face is that non-singleton canonical equivalence is
indispensable for proper text processing, while the singletons are a
spill-over of debates within the standardization process. It would
have been better to include the compatibility-equivalent positional
forms and ligatures in canonical equivalence than to include these
often-disputed singletons, which really muck up the situation (e.g.,
it is probably much less controversial to say that "ffl" and "ffl"
(U+FB04) are equivalent strings than it is to say that "慈" (U+2F8A6)
and "慈" (U+6148) are equivalent strings). Similarly for Arabic
positional forms: in terms of plain text, the distinction between an
initial-form Beh and a final-form Beh is inconsequential. It only
matters in rendering, and even there it is only consequential for
legacy rendering systems (in other words, in terms of plain text, some
compatibility decompositions behave more like canonical equivalents,
except for legacy rendering support).

I will say, though, that the dispute you raise is part of the ongoing
controversies in Unicode, now spilled over into this debate. I don't
have a horse in that race. But most of all I want Unicode and the W3C
to find a solution that we can all live with. If singleton canonical
equivalence is controversial, let's jettison that and focus on
combining marks, which I do not think should be controversial at all,
while at the same time being essential for string comparison in a way
that canonical-singleton string matching is not.

> My personal opinion regarding CJK unification is that it's an
> inconsistent mess. But that'd be off-topic here.

I think this is only tangentially related to Han unification (which I
agree also has separate problems). These singleton canonical
equivalents are typically included due to duplicates (correctly
perceived or not) within a particular language, and sometimes even
within the same language/encoding combination. I don't know much of
anything about Han, so I can't say, nor do I want to get involved in
those debates.
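As a concrete aside, the equivalences above are easy to check
mechanically. A small illustrative sketch in Python, using the
standard unicodedata module purely as one implementation of the
normalization forms (the specific characters are the ones discussed
above):

    import unicodedata

    # Singleton canonical equivalence: NFC (and NFD) fold the
    # compatibility ideograph U+2F8A6 to U+6148, erasing the distinction.
    assert unicodedata.normalize('NFC', '\U0002F8A6') == '\u6148'
    assert unicodedata.normalize('NFD', '\U0002F8A6') == '\u6148'

    # Compatibility equivalence: the ffl ligature U+FB04 survives NFC
    # untouched; only NFKC/NFKD fold it to the plain letters "ffl".
    assert unicodedata.normalize('NFC',  '\uFB04') == '\uFB04'
    assert unicodedata.normalize('NFKC', '\uFB04') == 'ffl'

    # Likewise an Arabic positional form: BEH INITIAL FORM (U+FE91) is
    # left alone by NFC but folded to plain BEH (U+0628) by NFKC.
    assert unicodedata.normalize('NFC',  '\uFE91') == '\uFE91'
    assert unicodedata.normalize('NFKC', '\uFE91') == '\u0628'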
But I think we can agree that the underlying order of combining marks
that places a dot below and a circumflex above is not a meaningful
difference we want to preserve for string comparison.

>> input systems, font makers really have not gotten clear norms about
>> this). At least in the case of the name of this character ("CJK
>> COMPATIBILITY IDEOGRAPH-2F8A6"), the name provides some indication
>> of discouraged use (which may be all an author encounters when using
>> a character input system).

As I go on to say in the very next paragraph:

>> My feeling is that singletons are an ill-conceived part of NFC and
>> NFD normalization (closer to compatibility decompositions than
>> canonical decompositions),

In other words, these singleton canonical decompositions are different
enough from the other canonical equivalents that they should have had
a compatibility keyword (such as "singleton") attached to them, so
they aren't treated as canonical equivalents. Perhaps this could have
been avoided if clear agreement on the use of these characters had
been reached from the beginning of Unicode; over 15 years later, we
have some serious problems.

>> but that the non-singleton parts of normalization are essential to
>> proper text handling (and I don't see how Unicode could have avoided
>> or could avoid in the future such non-singleton canonical
>> normalization).

So if the singletons are a problem, we may need an entirely new
normalization form, one that can be defined with existing Unicode
properties (since it is quite easy to distinguish a canonically
decomposable singleton character from a non-singleton). It is also
possible that the disputes over singletons would simply not apply to
markup normalization, so markup normalization could occur according to
NFC while content normalization would require a more nuanced approach
that avoided singleton normalization (singletons, like all
compatibility decomposable characters, could be prohibited from
markup, including attribute values).

Take care,
Rob
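P.S. For anyone who wants to see the combining-mark point in practice,
a quick sketch (again Python's unicodedata, purely for illustration):
typing a dot below and a circumflex in either underlying order yields
strings that differ byte-for-byte but are identical after canonical
reordering.

    import unicodedata

    # The same visual result, entered with combining marks in two
    # different underlying orders: dot below (U+0323), circumflex (U+0302).
    a = 'q\u0323\u0302'   # q + dot below + circumflex
    b = 'q\u0302\u0323'   # q + circumflex + dot below

    # The raw strings differ, but canonical reordering makes them
    # identical under every normalization form.
    assert a != b
    assert unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)
    assert unicodedata.normalize('NFD', a) == unicodedata.normalize('NFD', b)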
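And on the claim that singletons are easy to pick out with existing
properties, one possible test (my own sketch, not anything blessed by
the Unicode consortium): unicodedata.decomposition() returns a
space-separated hex string, with compatibility mappings flagged by a
leading "<tag>", so a canonical singleton is simply an unflagged
mapping of length one.

    import unicodedata

    def is_canonical_singleton(ch):
        """True if ch canonically decomposes to exactly one character."""
        d = unicodedata.decomposition(ch)
        return bool(d) and not d.startswith('<') and len(d.split()) == 1

    assert is_canonical_singleton('\U0002F8A6')  # compatibility ideograph
    assert is_canonical_singleton('\u212B')      # ANGSTROM SIGN -> U+00C5
    assert not is_canonical_singleton('\u00C5')  # two-character decomposition
    assert not is_canonical_singleton('\uFB04')  # ffl: compatibility, not canonical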
Received on Friday, 6 February 2009 23:34:33 UTC