[action item] FW: Questions/feedback on character normalization from Phillips, Addison on 2008-11-19 (public-i18n-core@w3.org from October to December 2008)

From: Phillips, Addison <addison@amazon.com>
Date: Tue, 18 Nov 2008 16:36:59 -0800
To: "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Message-ID: <4D25F22093241741BC1D0EEBC2DBB1DA01527D6A69@EX-SEA5-D.ant.amazon.com>

All,

Following is my proposed response on CharMod-Norm. Comments please.

Addison

<quot>
That clears up my confusion. But it also means that the current draft
does not satisfy the requirements of XACML. Instead I will write like
this in a section on unicode issues:
</quot>

That's correct. However, your text isn't quite right yet. Here's your proposal:

--8<--
In Unicode it is possible to represent some letters by different
character sequences. The process of converting Unicode strings into
canonical character sequences is called normalization. An operation is
normalization-sensitive if its output(s) are different depending on the
state of normalization of the input(s); if the output(s) are textual,
they are deemed different only if they would remain different were they
to be normalized. (Quoted from
[http://www.w3.org/TR/2005/WD-charmod-norm-20051027])

For more information on normalization see
[http://www.w3.org/TR/2005/WD-charmod-norm-20051027].

An XACML implementation MUST NOT perform any normalization-sensitive
operations unless it has ensured that the inputs are normalized. An
XACML implementation MUST behave as if each normalization-sensitive
operation normalizes the string into Unicode normalization form NFC. An
implementation MAY use some other form of internal processing as long as
the externally visible results are identical to this specification.
--8<--

I would propose something more like:

--
In Unicode, some equivalent characters can be represented by more than
one Unicode character sequence. See [http://www.w3.org/TR/CharMod]. The
process of converting Unicode strings into equivalent character
sequences is called "normalization"
[http://www.unicode.org/reports/tr15]. Some operations, such as string
comparison, are sensitive to normalization. An operation is
normalization-sensitive if its output(s) are different depending on the
state of normalization of the input(s); if the output(s) are textual,
they are deemed different only if they would remain different were they
to be normalized.

For more information on normalization see
[http://www.w3.org/TR/2005/WD-charmod-norm-20051027].

An XACML implementation MUST behave as if each normalization-sensitive
operation normalizes input strings into Unicode Normalization Form C
("NFC"). An implementation MAY use some other form of internal
processing (such as using a non-Unicode, "legacy" character encoding) as
long as the externally visible results are identical to this
specification.
--
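
To make the "behave as if" requirement concrete, here is a minimal sketch in Python. It is only an illustration, not proposed specification text. The helper names `nfc` and `contains` are hypothetical; the only library call used is `unicodedata.normalize` from the Python standard library. Substring search stands in for any normalization-sensitive operation.

```python
import unicodedata


def nfc(s: str) -> str:
    """Return s in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", s)


def contains(haystack: str, needle: str) -> bool:
    """Substring test that behaves as if both inputs were NFC-normalized.

    Hypothetical helper for illustration; not an XACML-defined function.
    """
    return nfc(needle) in nfc(haystack)


decomposed = "Zoe\u0308"  # "e" followed by U+0308 COMBINING DIAERESIS
composed = "Zo\u00eb"     # precomposed U+00EB

print(composed in decomposed)          # False: raw search depends on the input form
print(contains(decomposed, composed))  # True: same answer whatever the input form
```

An implementation is free to get the same externally visible result some other way, for example by normalizing once at input time instead of inside every operation.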

Later you note:

<quot>
For our string equality function I will write "The two strings are
equal, if they result in identical binary sequences when encoded into a
common Unicode encoding form".
</quot>

This isn't complete. It should probably say instead:

--
The two strings are equal, if they are composed of identical code point sequences when normalized to Unicode Normalization Form C.
--

It is true that two binary-identical strings in the same encoding are equal. However, some strings that are not binary identical are still 'equal', even when the same character encoding is used.
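
A short Python sketch of the difference, offered as an illustration only. The function name `strings_equal` is hypothetical; `unicodedata.normalize` is the standard library call. The three sample strings are canonically equivalent ways of writing the same letter.

```python
import unicodedata


def strings_equal(a: str, b: str) -> bool:
    """Equal if the code point sequences match after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "\u00c5"  # LATIN CAPITAL LETTER A WITH RING ABOVE
combining = "A\u030a"   # "A" followed by COMBINING RING ABOVE
angstrom = "\u212b"     # ANGSTROM SIGN, canonically equivalent to U+00C5

# Same encoding form (UTF-8), yet the byte sequences differ.
print(precomposed.encode("utf-8") == combining.encode("utf-8"))  # False

# After normalization to NFC, all three compare equal.
print(strings_equal(precomposed, combining))  # True
print(strings_equal(precomposed, angstrom))   # True
```

A binary comparison in a common encoding form would report these strings as different, which is why the definition needs to mention normalization.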

<quot>
One question though: Is it possible that there are some strings which
cannot be normalized into NFC? If so, we need to define error behavior
where this can occur.
</quot>

This can never occur. All strings can be normalized to NFC.
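
As a small sanity check rather than a proof, the Python sketch below runs a few unusual inputs through NFC. The helper name `normalizes_cleanly` and the sample list are my own, chosen for illustration; the code assumes Python's standard `unicodedata` module. Each input yields a result without raising an error, and normalizing that result again leaves it unchanged.

```python
import unicodedata


def normalizes_cleanly(s: str) -> bool:
    """Normalize to NFC and confirm that a second pass changes nothing."""
    once = unicodedata.normalize("NFC", s)
    twice = unicodedata.normalize("NFC", once)
    return once == twice


samples = [
    "",               # empty string
    "\u0301",         # lone combining mark with no base character
    "a\u0301\u0301",  # base character with stacked combining marks
    "\u0378",         # code point that is unassigned at the time of writing
    "plain ASCII",
]

for s in samples:
    print(repr(s), normalizes_cleanly(s))
```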

I hope this helps. Please don't hesitate to contact our WG or me for more feedback or information.

Best Regards,

Addison

Addison Phillips
Globalization Architect -- Lab126
Chair -- W3C Internationalization Core WG

Internationalization is not a feature.
It is an architecture.

Received on Wednesday, 19 November 2008 00:37:43 UTC