Unicode issue on normalization from Felix Sasaki on 2006-01-25 (public-i18n-core@w3.org from January to March 2006)

From: Felix Sasaki <fsasaki@w3.org>
Date: Wed, 25 Jan 2006 14:36:24 +0900
To: "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Message-ID: <op.s3w5uyiix1753t@ibm-60d333fc0ec.mag.keio.ac.jp>

There is an issue with the Unicode normalization forms; see
http://www.unicode.org/review/pr-29.html for details.

The definitions of NFC and NFKC as they stand contain a contradiction:
there are some cases where toNFC(toNFC(x)) gives a different result than
toNFC(x), so normalization as defined is not idempotent.
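
As a rough illustration of the property at stake, the check below tests
whether applying NFC twice gives the same result as applying it once. It is
only a sketch: the helper name nfc_is_stable is made up for this example,
and it relies on Python's standard unicodedata module, whose behaviour
depends on the Unicode version bundled with the interpreter.

```python
import unicodedata


def nfc_is_stable(text: str) -> bool:
    """Return True if toNFC(toNFC(text)) equals toNFC(text).

    Hypothetical helper for illustration; it tests whichever NFC
    implementation ships with the running Python interpreter.
    """
    once = unicodedata.normalize("NFC", text)
    twice = unicodedata.normalize("NFC", once)
    return once == twice
```

A conforming implementation should return True for every input string.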

To fix these cases, the proposal is to change the definition

D2. In any character sequence beginning with a starter S, a character C is  
blocked from S if and only if there is some character B between S and C,  
and either B is a starter or it has the same combining class as C.

to

D2'. In any character sequence beginning with a starter S, a character C  
is blocked from S if and only if there is some character B between S and  
C, and either B is a starter or it has the same or higher combining class  
as C.

This definition is only to be applied to strings that are already  
canonically decomposed.
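
The difference between D2 and D2' can be sketched in a few lines. This is
a minimal illustration, not the normative algorithm: the function name
is_blocked and its original_d2 flag are invented for this example, the
input is assumed to be already canonically decomposed and to begin with
the starter S, and the sample sequence (U+0B47, U+0300, U+0B3E) is an
illustrative choice that does not come from this message.

```python
import unicodedata


def is_blocked(seq: str, c_index: int, original_d2: bool = False) -> bool:
    """Return True if seq[c_index] (C) is blocked from the starter seq[0] (S).

    Assumes seq is canonically decomposed and seq[0] is a starter.
    original_d2=True applies D2 (same combining class only);
    the default applies D2' (same or higher combining class).
    """
    ccc_c = unicodedata.combining(seq[c_index])
    # Examine every character B strictly between S and C.
    for b in seq[1:c_index]:
        ccc_b = unicodedata.combining(b)
        if ccc_b == 0:
            # B is a starter, which blocks C under both definitions.
            return True
        if original_d2 and ccc_b == ccc_c:
            return True
        if not original_d2 and ccc_b >= ccc_c:
            return True
    return False


# U+0B47 (class 0), U+0300 (class 230), U+0B3E (class 0)
sample = "\u0B47\u0300\u0B3E"
print(is_blocked(sample, 2, original_d2=True))  # D2
print(is_blocked(sample, 2))                    # D2'
```

Under D2 the intervening U+0300 is neither a starter nor of the same class
as U+0B3E, so U+0B3E is not blocked; under D2' its higher combining class
is enough to block it.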

When B blocks C, changing the order of B and C would result in a character
sequence that is not canonically equivalent to the original. See Section
3.11, Canonical Ordering Behavior, in the Unicode Standard, 4.0.

The report says that this will not have an impact on real data found in  
practice (with the possible exception of test cases for the algorithm  
itself), because the affected sequences do not constitute well-formed text  
in any language.

If you have any comments on this, please send them by 30 January via
http://www.unicode.org/reporting.html .

Regards, Felix

Received on Wednesday, 25 January 2006 05:36:31 UTC