RE: Questions/feedback on character normalization

Hi Erik,

Sorry about the delay in responding. Note that this response is on behalf of the I18N Core WG.

<quot>
That clears up my confusion. But it also means that the current draft 
does not satisfy the requirements of XACML. Instead I will write like 
this in a section on unicode issues:
</quot>

That's correct. However, your proposed text isn't quite right yet. Here's your proposal:

--8<--
In Unicode it is possible to represent some letters by different 
character sequences. The process of converting Unicode strings into 
canonical character sequences is called normalization. An operation is 
normalization-sensitive if its output(s) are different depending on the 
state of normalization of the input(s); if the output(s) are textual, 
they are deemed different only if they would remain different were they 
to be normalized. (Quoted from 
[http://www.w3.org/TR/2005/WD-charmod-norm-20051027])

For more information on normalization see 
[http://www.w3.org/TR/2005/WD-charmod-norm-20051027].

An XACML implementation MUST NOT perform any normalization-sensitive 
operations unless it has ensured that the inputs are normalized. An 
XACML implementation MUST behave as if each normalization-sensitive 
operation normalizes the string into Unicode normalization form NFC. An 
implementation MAY use some other form of internal processing as long as 
the externally visible results are identical to this specification.
--8<--

I would propose something more like:

--
In Unicode, some equivalent characters can be represented by more than one different Unicode character sequence. See [http://www.w3.org/TR/CharMod]. The process of converting Unicode strings into equivalent character sequences is called "normalization" [http://www.unicode.org/reports/tr15]. Some operations, such as string comparison, are sensitive to normalization. An operation is normalization-sensitive if its output(s) are different depending on the state of normalization of the input(s); if the output(s) are textual, they are deemed different only if they would remain different were they to be normalized.

For more information on normalization see 
[http://www.w3.org/TR/2005/WD-charmod-norm-20051027].

An XACML implementation MUST behave as if each normalization-sensitive 
operation normalizes input strings into Unicode Normalization Form C ("NFC"). An 
implementation MAY use some other form of internal processing (such as using a non-Unicode, "legacy" character encoding) as long as 
the externally visible results are identical to this specification.
--
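
To make the "behave as if" wording concrete, here is a rough sketch in Python using the standard unicodedata module. The helper names (nfc, nfc_contains) are invented for this illustration and are not part of XACML or any specification; substring search is used only as one example of a normalization-sensitive operation.

```python
import unicodedata

def nfc(s: str) -> str:
    """Return s converted to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", s)

def nfc_contains(haystack: str, needle: str) -> bool:
    """Hypothetical helper: substring test that behaves as if both
    inputs had been normalized to NFC before the search."""
    return nfc(needle) in nfc(haystack)

composed = "caf\u00e9"            # e-acute as the single code point U+00E9
decomposed = "un cafe\u0301"      # "e" followed by U+0301 COMBINING ACUTE ACCENT

# The raw search depends on how the inputs happen to be normalized...
raw_result = composed in decomposed

# ...while the NFC-aware search does not.
normalized_result = nfc_contains(decomposed, composed)

print(raw_result, normalized_result)
```

An implementation is free to reach the same externally visible result some other way (for instance, by normalizing once at input time rather than inside every operation).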

Later you note:

<quot>
For our string equality function I will write "The two strings are 
equal, if they result in identical binary sequences when encoded into a 
common Unicode encoding form".
</quot>

This isn't complete. It should probably say instead something like:

--
The two strings are equal, if they are composed of identical code point sequences when normalized to Unicode Normalization Form C.
--

It is true that two binary-identical strings in the same encoding are equal. However, some strings that are not binary identical are still 'equal', even when the same character encoding is used (that's what normalizing the strings exposes).
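
As an illustration of that difference, here is a minimal sketch in Python (standard unicodedata module). The function name nfc_equal is made up for this example; it is one possible reading of the suggested wording, not a definitive implementation.

```python
import unicodedata

def nfc_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare the code point sequences of a and b
    after normalizing both to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "\u00c5"     # LATIN CAPITAL LETTER A WITH RING ABOVE
combining = "A\u030a"      # "A" followed by COMBINING RING ABOVE
angstrom = "\u212b"        # ANGSTROM SIGN, canonically equivalent to U+00C5

# Same encoding (UTF-8), yet the byte sequences differ.
same_bytes = precomposed.encode("utf-8") == combining.encode("utf-8")

print(same_bytes)                            # False
print(nfc_equal(precomposed, combining))     # True
print(nfc_equal(precomposed, angstrom))      # True
print(nfc_equal(precomposed, "A"))           # False
```

A purely binary comparison would report the first three strings as different from one another, even though all three represent the same text.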

<quot>
One question though: Is it possible that there are some strings which 
cannot be normalized into NFC? If so, we need to define error behavior 
where this can occur.
</quot>

This can never occur. All strings can be normalized to NFC.

I hope this helps. Please don't hesitate to contact our WG or me for more feedback or information.

Best Regards,

Addison

Addison Phillips
Globalization Architect -- Lab126
Chair -- W3C Internationalization Core WG

Internationalization is not a feature.
It is an architecture.

Received on Wednesday, 19 November 2008 23:22:28 UTC