
RE: Questions/feedback on character normalization

From: Texin, Tex <Tex.Texin@netapp.com>
Date: Wed, 24 Dec 2008 11:36:35 -0800
Message-ID: <9D520119F7D4FA4AA449862666410D9002663D3B@SACMVEXC3-PRD.hq.netapp.com>
To: "Phillips, Addison" <addison@amazon.com>, "Erik Rissanen" <erik@axiomatics.com>, <www-international@w3.org>
Cc: <public-i18n-core@w3.org>

Hi,
I was just going thru older emails and noted this.
One comment on this paragraph:

'An XACML implementation MUST behave as if each normalization-sensitive operation normalizes input strings into Unicode Normalization Form C ("NFC"). An implementation MAY use some other form of internal processing (such as using a non-Unicode, "legacy" character encoding) as long as the externally visible results are identical to this specification.'

"Externally visible results" could be misconstrued to mean visually similar.
It probably should say something about using a normalizing transcoder and getting the identical NFC string.
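
To make the "normalizing transcoder" idea concrete, here is a minimal Python sketch. The function name normalizing_decode is hypothetical (it does not come from XACML or any W3C text); it only assumes the standard library's codecs and unicodedata module. It decodes bytes from whatever encoding is in use and then normalizes the result to NFC, so that different external representations of the same text end up as the identical NFC string.

```python
import unicodedata

def normalizing_decode(data: bytes, encoding: str) -> str:
    """Hypothetical 'normalizing transcoder': decode, then normalize to NFC."""
    text = data.decode(encoding)
    return unicodedata.normalize("NFC", text)

# The same word arriving in two different external forms:
legacy_bytes = b"caf\xe9"                        # Latin-1, precomposed e-acute
utf8_decomposed = "cafe\u0301".encode("utf-8")   # UTF-8, e + combining acute

a = normalizing_decode(legacy_bytes, "latin-1")
b = normalizing_decode(utf8_decomposed, "utf-8")

# Both inputs yield the identical NFC code point sequence.
assert a == b == "caf\u00e9"
```

The comparison at the end is on code point sequences, not on how the strings look when rendered, which is the distinction the spec wording should make clear.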


Tex (who is hoping Santa will bring him a normalizing transcoder ring!)


-----Original Message-----
From: Phillips, Addison [mailto:addison@amazon.com] 
Sent: Wednesday, November 19, 2008 3:22 PM
To: Erik Rissanen; www-international@w3.org
Cc: public-i18n-core@w3.org
Subject: RE: Questions/feedback on character normalization

Hi Erik,

Sorry about the delay in responding. Note that this response is on behalf of the I18N Core WG.

<quot>
That clears up my confusion. But it also means that the current draft does not satisfy the requirements of XACML. Instead I will write like this in a section on unicode issues:
</quot>

That's correct. Your proposed text isn't quite right yet, though. Here's your proposal:

--8<--
In Unicode it is possible to represent some letters by different character sequences. The process of converting Unicode strings into canonical character sequences is called normalization. An operation is normalization-sensitive if its output(s) are different depending on the state of normalization of the input(s); if the output(s) are textual, they are deemed different only if they would remain different were they to be normalized. (Quoted from [http://www.w3.org/TR/2005/WD-charmod-norm-20051027])

For more information on normalization see [http://www.w3.org/TR/2005/WD-charmod-norm-20051027].

An XACML implementation MUST NOT perform any normalization-sensitive operations unless it has ensured that the inputs are normalized. An XACML implementation MUST behave as if each normalization-sensitive operation normalizes the string into Unicode normalization form NFC. An implementation MAY use some other form of internal processing as long as the externally visible results are identical to this specification.
--8<--

I would propose something more like:

--
In Unicode, some equivalent characters can be represented by more than one different Unicode character sequence. See [http://www.w3.org/TR/CharMod]. The process of converting Unicode strings into equivalent character sequences is called "normalization" [http://www.unicode.org/reports/tr15]. Some operations, such as string comparison, are sensitive to normalization. An operation is normalization-sensitive if its output(s) are different depending on the state of normalization of the input(s); if the output(s) are textual, they are deemed different only if they would remain different were they to be normalized.

For more information on normalization see [http://www.w3.org/TR/2005/WD-charmod-norm-20051027].

An XACML implementation MUST behave as if each normalization-sensitive operation normalizes input strings into Unicode Normalization Form C ("NFC"). An implementation MAY use some other form of internal processing (such as using a non-Unicode, "legacy" character encoding) as long as the externally visible results are identical to this specification.
--

Later you note:

<quot>
For our string equality function I will write "The two strings are equal, if they result in identical binary sequences when encoded into a common Unicode encoding form".
</quot>

This isn't complete. It should probably say instead something like:

--
The two strings are equal, if they are composed of identical code point sequences when normalized to Unicode Normalization Form C.
--

It is true that two binary-identical strings in the same encoding are equal. However, some strings that are not binary identical are still 'equal', even when the same character encoding is used (that's what normalizing the strings exposes).
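
As an illustration only, the difference between the two definitions can be sketched in Python. The helper name nfc_equal is an assumption made for this example; unicodedata.normalize is the standard library call.

```python
import unicodedata

def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings by their code point sequences after NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "\u00e9"      # e-acute as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Same encoding form (UTF-8), yet the byte sequences differ,
# so a purely binary comparison reports "not equal".
assert composed.encode("utf-8") != decomposed.encode("utf-8")

# Comparing after normalization to NFC reports "equal".
assert nfc_equal(composed, decomposed)

# Strings that really are different stay different.
assert not nfc_equal("e", composed)
```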

<quot>
One question though: Is it possible that there are some strings which cannot be normalized into NFC? If so, we need to define error behavior where this can occur.
</quot>

This can never occur. All strings can be normalized to NFC.
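
A small Python sketch of what that means in practice, assuming the standard unicodedata module (the wrapper name to_nfc is made up for the example): normalization is defined for every sequence of code points, including odd ones, so there is no error case to specify at this level. Ill-formed byte input is a decoding problem and would be rejected before normalization is ever reached.

```python
import unicodedata

def to_nfc(s: str) -> str:
    """Hypothetical wrapper around the standard library's NFC normalization."""
    return unicodedata.normalize("NFC", s)

samples = [
    "",               # empty string
    "\u0301",         # a lone combining acute accent with no base letter
    "e\u0301",        # e + combining acute
    "\u1100\u1161",   # a pair of Hangul jamo
    "\U0010FFFF",     # a noncharacter code point
]

for s in samples:
    once = to_nfc(s)
    # Normalizing an already normalized string changes nothing.
    assert to_nfc(once) == once
```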

I hope this helps. Please don't hesitate to contact our WG or me for more feedback or information.

Best Regards,

Addison

Addison Phillips
Globalization Architect -- Lab126
Chair -- W3C Internationalization Core WG

Internationalization is not a feature.
It is an architecture.
Received on Wednesday, 24 December 2008 19:38:07 GMT
