Re: Unicode issue on normalization from Felix Sasaki on 2006-01-26 (public-i18n-core@w3.org from January to March 2006)

From: Felix Sasaki <fsasaki@w3.org>
Date: Thu, 26 Jan 2006 22:14:14 +0900
To: "Mark Davis" <mark.davis@icu-project.org>
Cc: "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Message-ID: <op.s3zlp0h4x1753t@ibm-60d333fc0ec>

On Thu, 26 Jan 2006 02:08:19 +0900, Mark Davis  
<mark.davis@icu-project.org> wrote:

> There is a misunderstanding. The change to D2 was already made in
> Unicode 4.1, released about a year ago (see
> http://www.unicode.org/reports/tr15/#D2). Only the material marked in
> yellow is new for this version.

Sorry, I mixed the versions up.

Felix

>
> Mark
>
> Felix Sasaki wrote:
>
>>
>> There is an issue with the Unicode normalization forms, see   
>> http://www.unicode.org/review/pr-29.html .
>>
>> The definitions of NFC and NFKC as they stand contain a contradiction.   
>> There are some cases where a transformation toNFC(toNFC(x)) has a   
>> different result than toNFC(x).
>>
>> To fix these cases, there is the proposal to change
>>
>> D2. In any character sequence beginning with a starter S, a character C
>> is blocked from S if and only if there is some character B between S
>> and C, and either B is a starter or it has the same combining class as
>> C.
>>
>> to
>>
>> D2'. In any character sequence beginning with a starter S, a character
>> C is blocked from S if and only if there is some character B between S
>> and C, and either B is a starter or it has the same or higher
>> combining class as C.
>>
>> This definition is only to be applied to strings that are already   
>> canonically decomposed.
>>
>> When B blocks C, changing the order of B and C would result in a
>> character sequence that is not canonically equivalent to the
>> original. See Section 3.11, Canonical Ordering Behavior, in the Unicode
>> Standard, 4.0.
>>
>> The report says that this will not have an impact on real data found
>> in practice (with the possible exception of test cases for the
>> algorithm itself), because the affected sequences do not constitute
>> well-formed text in any language.
>>
>> If you have any comments on this, please send them in by 30 January
>> at http://www.unicode.org/reporting.html .
>>
>> Regards, Felix
>>
>>
>>
>>
>
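
The quoted message says there are cases where toNFC(toNFC(x)) differs from toNFC(x). A minimal sketch of how one might probe that property, assuming Python's standard unicodedata module as the normalizer; the helper name is_idempotent_nfc is hypothetical and not part of any specification. A current library is expected to implement the corrected definition, so this checks a given implementation and does not reproduce the historical problem.

```python
import unicodedata


def is_idempotent_nfc(text):
    """Return True if normalizing to NFC twice gives the same result as once."""
    once = unicodedata.normalize("NFC", text)
    twice = unicodedata.normalize("NFC", once)
    return once == twice


if __name__ == "__main__":
    # Sample inputs chosen for illustration only (not taken from the message).
    samples = [
        "e\u0301",             # e + combining acute accent
        "\u1100\u0300\u1161",  # Hangul L + combining grave accent + Hangul V
    ]
    for sample in samples:
        code_points = " ".join(f"U+{ord(ch):04X}" for ch in sample)
        print(code_points, is_idempotent_nfc(sample))
```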

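The difference between D2 and D2' quoted above can be sketched as a small predicate. This is an illustration under stated assumptions, not the reference algorithm: the function name is_blocked and the flag allow_higher are hypothetical, combining classes are taken from Python's unicodedata.combining, and the input is assumed to be canonically decomposed and to begin with the starter S.

```python
import unicodedata


def is_blocked(seq, c_index, allow_higher):
    """Return True if seq[c_index] (C) is blocked from the starter seq[0] (S).

    allow_higher=False follows the D2 wording (B has the same combining
    class as C); allow_higher=True follows D2' (same or higher class).
    """
    ccc_c = unicodedata.combining(seq[c_index])
    # Examine every character B strictly between S and C.
    for b in seq[1:c_index]:
        ccc_b = unicodedata.combining(b)
        if ccc_b == 0:
            return True  # B is a starter
        if ccc_b == ccc_c:
            return True  # same combining class as C
        if allow_higher and ccc_b > ccc_c:
            return True  # higher combining class (D2' only)
    return False


if __name__ == "__main__":
    # Illustrative sequence: Hangul L, combining grave accent (class 230),
    # Hangul V (class 0). Here B = U+0300 and C = U+1161.
    seq = "\u1100\u0300\u1161"
    print(is_blocked(seq, 2, allow_higher=False))  # D2 wording
    print(is_blocked(seq, 2, allow_higher=True))   # D2' wording
```

For this sequence the D2 wording does not block C, because B is neither a starter nor of the same class as C, while the D2' wording does block it, because 230 is higher than 0.
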
Received on Thursday, 26 January 2006 13:14:24 UTC