RE: Questions/feedback on character normalization from Phillips, Addison on 2008-11-21 (public-i18n-core@w3.org from October to December 2008)

From: Phillips, Addison <addison@amazon.com>
Date: Fri, 21 Nov 2008 09:31:48 -0800
To: Erik Rissanen <erik@axiomatics.com>
CC: "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Message-ID: <4D25F22093241741BC1D0EEBC2DBB1DA0152957C0D@EX-SEA5-D.ant.amazon.com>

Hi Erik,

No problem. Both I and the working group are glad to be of help.

Addison

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.

> -----Original Message-----
> From: Erik Rissanen [mailto:erik@axiomatics.com]
> Sent: Friday, November 21, 2008 12:26 AM
> To: Phillips, Addison
> Cc: www-international@w3.org; public-i18n-core@w3.org
> Subject: Re: Questions/feedback on character normalization
> 
> Hi Addison,
> 
> At the XACML TC meeting yesterday we decided to adopt the text which
> you are proposing. Thank you very much to the I18N WG for helping us
> get this issue straight in the XACML 3.0 spec.
> 
> Best regards,
> Erik
> 
> Phillips, Addison wrote:
> > Hi Erik,
> >
> > Sorry about the delay in responding. Note that this response is on
> > behalf of the I18N Core WG.
> >
> > <quot>
> > That clears up my confusion. But it also means that the current draft
> > does not satisfy the requirements of XACML. Instead I will write like
> > this in a section on unicode issues:
> > </quot>
> >
> > That's correct, but your proposed text isn't quite right yet. Here's
> > your proposal:
> >
> > --8<--
> > In Unicode it is possible to represent some letters by different
> > character sequences. The process of converting Unicode strings into
> > canonical character sequences is called normalization. An operation is
> > normalization-sensitive if its output(s) are different depending on the
> > state of normalization of the input(s); if the output(s) are textual,
> > they are deemed different only if they would remain different were they
> > to be normalized. (Quoted from
> > [http://www.w3.org/TR/2005/WD-charmod-norm-20051027])
> >
> > For more information on normalization see
> > [http://www.w3.org/TR/2005/WD-charmod-norm-20051027].
> >
> > An XACML implementation MUST NOT perform any normalization-sensitive
> > operations unless it has ensured that the inputs are normalized. An
> > XACML implementation MUST behave as if each normalization-sensitive
> > operation normalizes the string into Unicode normalization form NFC. An
> > implementation MAY use some other form of internal processing as long as
> > the externally visible results are identical to this specification.
> > --8<--
> >
> > I would propose something more like:
> >
> > --
> > In Unicode, some equivalent characters can be represented by more than
> > one different Unicode character sequence. See
> > [http://www.w3.org/TR/CharMod]. The process of converting Unicode
> > strings into equivalent character sequences is called "normalization"
> > [http://www.unicode.org/reports/tr15]. Some operations, such as string
> > comparison, are sensitive to normalization. An operation is
> > normalization-sensitive if its output(s) are different depending on the
> > state of normalization of the input(s); if the output(s) are textual,
> > they are deemed different only if they would remain different were they
> > to be normalized.
> >
> > For more information on normalization see
> > [http://www.w3.org/TR/2005/WD-charmod-norm-20051027].
> >
> > An XACML implementation MUST behave as if each normalization-sensitive
> > operation normalizes input strings into Unicode Normalization Form C
> > ("NFC"). An implementation MAY use some other form of internal
> > processing (such as using a non-Unicode, "legacy" character encoding) as
> > long as the externally visible results are identical to this
> > specification.
> > --
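
To make "more than one different Unicode character sequence" concrete, here is a minimal Python sketch using the standard `unicodedata` module. The helper name `code_points` and the sample strings are illustrative assumptions, not part of the proposed XACML text. It shows the same letter in two encodings, and that an operation such as taking the length is normalization-sensitive until the input is brought into NFC.

```python
import unicodedata


def code_points(s):
    """Return the code points of s in U+XXXX notation (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in s]


# The letter e-acute as one precomposed code point ...
composed = "\u00E9"
# ... and as a base letter followed by a combining acute accent.
decomposed = "e\u0301"

# NFC composes the decomposed sequence into the precomposed form.
nfc_of_decomposed = unicodedata.normalize("NFC", decomposed)

print(code_points(composed))           # one code point
print(code_points(decomposed))         # two code points
print(code_points(nfc_of_decomposed))  # one code point again

# len() is normalization-sensitive: its result depends on which
# representation was passed in, unless the input is normalized first.
print(len(composed), len(decomposed), len(nfc_of_decomposed))
```

An implementation that "behaves as if" it normalizes would apply `unicodedata.normalize("NFC", ...)`, or an equivalent, to each input before any such operation.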
> >
> > Later you note:
> >
> > <quot>
> > For our string equality function I will write "The two strings are
> > equal, if they result in identical binary sequences when encoded into a
> > common Unicode encoding form".
> > </quot>
> >
> > This isn't complete. It should probably say instead something like:
> >
> > --
> > The two strings are equal, if they are composed of identical code
> > point sequences when normalized to Unicode Normalization Form C.
> > --
> >
> > It is true that two binary-identical strings in the same encoding are
> > equal. However, some strings that are not binary identical are still
> > 'equal', even when the same character encoding is used (that's what
> > normalizing the strings exposes).
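
The difference between the two equality definitions can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `binary_equal` and `nfc_equal` are hypothetical, UTF-8 stands in for "a common Unicode encoding form", and nothing here is taken from the XACML specification itself.

```python
import unicodedata


def binary_equal(a, b, encoding="utf-8"):
    """Equality as originally drafted: identical bytes in a common encoding."""
    return a.encode(encoding) == b.encode(encoding)


def nfc_equal(a, b):
    """Equality as suggested: identical code point sequences after NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "caf\u00E9"  # ends in U+00E9
combining = "cafe\u0301"   # ends in U+0065 U+0301

# Same text, same encoding, different bytes: the binary test misses the match.
print(binary_equal(precomposed, combining))
# After normalizing both sides to NFC the code point sequences are identical.
print(nfc_equal(precomposed, combining))
# Genuinely different text stays different under either definition.
print(nfc_equal("cafe", precomposed))
```

Python compares `str` values code point by code point, so comparing the two normalized strings with `==` matches the "identical code point sequences" wording.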
> >
> > <quot>
> > One question though: Is it possible that there are some strings which
> > cannot be normalized into NFC? If so, we need to define error behavior
> > where this can occur.
> > </quot>
> >
> > This can never occur. All strings can be normalized to NFC.
> >
> > I hope this helps. Please don't hesitate to contact our WG or me for
> > more feedback or information.
> >
> > Best Regards,
> >
> > Addison
> >
> > Addison Phillips
> > Globalization Architect -- Lab126
> > Chair -- W3C Internationalization Core WG
> >
> > Internationalization is not a feature.
> > It is an architecture.
> >
Received on Friday, 21 November 2008 17:32:26 UTC