Re: Questions/feedback on character normalization

Hi Addison,

At the XACML TC meeting yesterday we decided to adopt the text which you 
are proposing. Thank you very much to the I18N WG for helping us get 
this issue straight in the XACML 3.0 spec.

Best regards,
Erik

Phillips, Addison wrote:
> Hi Erik,
>
> Sorry about the delay in responding. Note that this response is on behalf of the I18N Core WG.
>
> <quot>
> That clears up my confusion. But it also means that the current draft 
> does not satisfy the requirements of XACML. Instead I will write like 
> this in a section on Unicode issues:
> </quot>
>
> That's right, but your proposed text isn't quite correct yet. Here's your proposal:
>
> --8<--
> In Unicode it is possible to represent some letters by different 
> character sequences. The process of converting Unicode strings into 
> canonical character sequences is called normalization. An operation is 
> normalization-sensitive if its output(s) are different depending on the 
> state of normalization of the input(s); if the output(s) are textual, 
> they are deemed different only if they would remain different were they 
> to be normalized. (Quoted from 
> [http://www.w3.org/TR/2005/WD-charmod-norm-20051027])
>
> For more information on normalization see 
> [http://www.w3.org/TR/2005/WD-charmod-norm-20051027].
>
> An XACML implementation MUST NOT perform any normalization-sensitive 
> operations unless it has ensured that the inputs are normalized. An 
> XACML implementation MUST behave as if each normalization-sensitive 
> operation normalizes the string into Unicode normalization form NFC. An 
> implementation MAY use some other form of internal processing as long as 
> the externally visible results are identical to this specification.
> --8<--
>
> I would propose something more like:
>
> --
> In Unicode, some equivalent characters can be represented by more than one different Unicode character sequence. See [http://www.w3.org/TR/CharMod]. The process of converting Unicode strings into equivalent character sequences is called "normalization" [http://www.unicode.org/reports/tr15]. Some operations, such as string comparison, are sensitive to normalization. An operation is normalization-sensitive if its output(s) are different depending on the state of normalization of the input(s); if the output(s) are textual, they are deemed different only if they would remain different were they to be normalized.
>
> For more information on normalization see 
> [http://www.w3.org/TR/2005/WD-charmod-norm-20051027].
>
> An XACML implementation MUST behave as if each normalization-sensitive operation normalizes input strings into Unicode Normalization Form C ("NFC"). An implementation MAY use some other form of internal processing (such as using a non-Unicode, "legacy" character encoding) as long as the externally visible results are identical to this specification.
> --
>
> Later you note:
>
> <quot>
> For our string equality function I will write "The two strings are 
> equal, if they result in identical binary sequences when encoded into a 
> common Unicode encoding form".
> </quot>
>
> This isn't complete. It should probably say instead something like:
>
> --
> The two strings are equal, if they are composed of identical code point sequences when normalized to Unicode Normalization Form C.
> --
>
> It is true that two binary-identical strings in the same encoding are equal. However, some strings that are not binary identical are still 'equal', even when the same character encoding is used (that's what normalizing the strings exposes).
>
> <quot>
> One question though: Is it possible that there are some strings which 
> cannot be normalized into NFC? If so, we need to define error behavior 
> where this can occur.
> </quot>
>
> This can never occur. All strings can be normalized to NFC.
>
> I hope this helps. Please don't hesitate to contact our WG or me for more feedback or information.
>
> Best Regards,
>
> Addison
>
> Addison Phillips
> Globalization Architect -- Lab126
> Chair -- W3C Internationalization Core WG
>
> Internationalization is not a feature.
> It is an architecture.
>   
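
As an illustration of the string-equality point discussed above, here is a minimal sketch in Python, which is not part of the XACML text. It assumes the standard library's unicodedata module; the helper name nfc_equal is hypothetical and chosen only for this example. It shows two strings that render as the same letter, differ in both code points and encoded bytes, and compare equal once both are normalized to NFC.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare code point sequences after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Two representations of the same letter (e with acute accent).
precomposed = "\u00e9"   # single code point U+00E9
decomposed = "e\u0301"   # U+0065 followed by combining acute accent U+0301

# Raw comparison: the code point sequences differ, and so do the UTF-8 bytes.
raw_equal = precomposed == decomposed
bytes_equal = precomposed.encode("utf-8") == decomposed.encode("utf-8")

# Comparison after normalizing both inputs to NFC.
normalized_equal = nfc_equal(precomposed, decomposed)

print(raw_equal, bytes_equal, normalized_equal)
```

A comparison based only on binary identity in a common encoding would treat these two strings as different, which is the gap the revised wording is meant to close. The sketch normalizes on every comparison for simplicity; an implementation could equally normalize once on input, as long as the externally visible result is the same.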

Received on Friday, 21 November 2008 08:26:22 UTC