Re: Unicode Normalization thread should slow down; summary needed from Robert J Burns on 2009-02-07 (www-style@w3.org from February 2009)

From: Robert J Burns <rob@robburns.com>
Date: Sat, 7 Feb 2009 17:03:58 -0600
To: public-i18n-core@w3.org
Cc: W3C Style List <www-style@w3.org>
Message-Id: <7B52936E-CB4C-4BD4-B20D-DF657315CABD@robburns.com>

> Per L. David Baron's suggestion, I started a wiki page on the esw  
> wiki. If there's a better place for this, feel free to move it there.
>
> <http://esw.w3.org/topic/I18N/CanonicalNormalization>
>

I've also added a companion page:

<http://esw.w3.org/topic/I18N/CanonicalNormalizationIssues>

In retrospect I think I should have dropped the word "Canonical" from  
this page's name, since I ended up delving into non-canonical  
normalization, compatibility equivalence, and even general case folding.  
While I think the topic here should be squarely on canonical string  
comparison and how to achieve that, I think the discussion of problems  
with canonical singletons in this thread and related threads points to  
a need to think even more narrowly about normalization than even the  
canonical forms.
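
For readers who want something concrete, here is a minimal sketch of
canonical string comparison using Python's standard unicodedata module.
The helper name canonically_equal is only illustrative (it is not from
the wiki page), and normalizing both operands at comparison time is
just one possible approach, not a proposal for where normalization
should happen.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings under canonical equivalence.

    Both operands are normalized to NFD before comparing; NFC would
    work equally well as long as both sides use the same form.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# A plain code point comparison treats these as different strings,
# while the canonical comparison treats them as the same text.
print(precomposed == decomposed)
print(canonically_equal(precomposed, decomposed))
```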

So as you work through this wiki page, you'll find that it  
moves from topics of canonical (and even narrower) normalization to  
increasingly broader foldings of text. The remarkable thing about this  
is that I think due to all sorts of reasons, when looked at in this  
manner, the canonical singletons we've discussed (such as 慈 (U+2F8A6)  
[non-normalized] and 慈 (U+6148) [NFC and NFD]) fall much more  
squarely in the area of compatibility decompositions than in canonical  
decompositions.
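
To illustrate what a canonical singleton does in practice, here is a
small sketch using Python's unicodedata module. The helper
is_canonical_singleton is a hypothetical name of my own, and its test
(a single-code-point decomposition with no compatibility tag) is an
assumption about how one might detect such characters, not a
definition taken from the wiki page.

```python
import unicodedata

compat_ideograph = "\U0002F8A6"   # CJK COMPATIBILITY IDEOGRAPH-2F8A6
unified_ideograph = "\u6148"      # CJK UNIFIED IDEOGRAPH-6148

def is_canonical_singleton(ch: str) -> bool:
    """Return True if ch canonically decomposes to exactly one code point.

    unicodedata.decomposition() returns a space-separated list of hex
    code points, prefixed with a tag such as "<compat>" when the
    mapping is a compatibility one, or "" when there is no mapping.
    """
    decomp = unicodedata.decomposition(ch)
    if not decomp or decomp.startswith("<"):
        return False
    return len(decomp.split()) == 1

# Both canonical forms replace the compatibility ideograph with the
# unified one, so the original code point cannot be recovered afterwards.
for form in ("NFC", "NFD"):
    normalized = unicodedata.normalize(form, compat_ideograph)
    print(form, normalized == unified_ideograph)

print(is_canonical_singleton(compat_ideograph))
```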

That implies to me that unless normalization is handled late (later  
than parsing) a new normalized form might be needed. A new normalized  
form would however do two things: 1) make normalization completely  
lossless; and 2) make normalization performance even better (since the  
characters that would cause a normalization branch are that much fewer:  
only a couple hundred or fewer). On this page, I'm calling this new  
normalized form "NFW3C", though I think it is a normalized form that  
would be suitable for any and all Unicode applications.

Take care,
Rob

Received on Saturday, 7 February 2009 23:04:40 UTC