- From: Robert J Burns <rob@robburns.com>
- Date: Sat, 7 Feb 2009 17:03:58 -0600
- To: public-i18n-core@w3.org
- Cc: W3C Style List <www-style@w3.org>
> Per L. David Baron's suggestion, I started a wiki page on the esw
> wiki. If there's a better place for this, feel free to move it there.
>
> <http://esw.w3.org/topic/I18N/CanonicalNormalization>

I've also added a companion page:

<http://esw.w3.org/topic/I18N/CanonicalNormalizationIssues>

In retrospect I think I should have dropped the word "Canonical" from this page's name, since I ended up delving into non-canonical normalization, compatibility equivalence, and even general case folding. While I think the topic here should be squarely on canonical string comparison and how to achieve that, I think the discussion of problems with canonical singletons in this thread and related threads points to a need to think even more narrowly about normalization than the canonical forms do.

So as you work through this wiki page, you'll find that it moves from topics of canonical (and even narrower) normalization to increasingly broader foldings of text. The remarkable thing is that, for all sorts of reasons, when looked at in this manner, the canonical singletons we've discussed (such as 慈 (U+2F8A6) [non-normalized] and 慈 (U+6148) [NFC and NFD]) fall much more squarely in the area of compatibility decompositions than in canonical decompositions.

That implies to me that unless normalization is handled late (later than parsing), a new normalized form might be needed. A new normalized form would, however, do two things:

1) make normalization completely lossless; and
2) make normalization performance even better (since the characters that would cause a normalization branch would be that much fewer: only a couple hundred or fewer).

On this page, I'm calling this new normalized form "NFW3C", though I think it is a normalized form that would be suitable for any and all Unicode applications.

Take care,
Rob
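
For readers who want to see the singleton behavior described above, here is a minimal Python sketch. It assumes only the standard-library `unicodedata` module; the variable and function names are illustrative and do not come from the wiki page. It applies the four existing normalization forms to U+2F8A6 and checks whether the distinction from U+6148 survives. The proposed "NFW3C" form is not implemented anywhere in this sketch.

```python
import unicodedata

# The canonical singleton pair discussed in the message.
compat = "\U0002F8A6"   # CJK COMPATIBILITY IDEOGRAPH-2F8A6
unified = "\u6148"      # CJK unified ideograph U+6148

def normalize_all(text):
    """Return the result of each standard normalization form for text."""
    return {
        form: unicodedata.normalize(form, text)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }

forms = normalize_all(compat)
for form, result in forms.items():
    # Print code points rather than glyphs, since the two look alike.
    print(form, [f"U+{ord(ch):04X}" for ch in result])

# True when every existing form replaces the singleton with U+6148,
# i.e. the original code point cannot be recovered after normalizing.
distinction_lost = all(result == unified for result in forms.values())
print("distinction lost:", distinction_lost)
```

Because the mapping is a canonical singleton decomposition, both the canonical forms and the compatibility forms fold U+2F8A6 into U+6148, which is the loss of information the message is concerned with.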
Received on Saturday, 7 February 2009 23:04:40 UTC