RE: Questions/feedback on character normalization from Phillips, Addison on 2008-10-30 (www-international@w3.org from October to December 2008)

From: Phillips, Addison <addison@amazon.com>
Date: Thu, 30 Oct 2008 11:03:55 -0700
To: Erik Rissanen <erik@axiomatics.com>, "www-international@w3.org" <www-international@w3.org>
Message-ID: <4D25F22093241741BC1D0EEBC2DBB1DA0151CE38F8@EX-SEA5-D.ant.amazon.com>

Hello Erik,

Thanks for this note.

Charmod-Norm (the document you reference) is being worked on again by the Internationalization Core WG after a lengthy hiatus. The draft you reference represents the first steps toward a compromise in the document. Previous versions *required* early uniform normalization ("EUN") of text by string producers. This version is the first to begin addressing the fact that early uniform normalization cannot be relied upon. The algorithm you refer to still relies heavily on EUN (by the producers of your string input).

>
> I have found the document at
> http://www.w3.org/TR/2005/WD-charmod-norm-20051027/ very useful,
> but there are some things I don't understand in it.
> 
> 1. For string identity matching, in section C312: Why must the
> normalization be done by the producers of the strings to be
> compared?

The requirement that producers of the strings generate only normalized text allows the algorithm to assume that its input is already in a normalization form. Non-normalized text that "sneaks in" to such a system will not produce matches. The algorithm as written is appropriate for cases where EUN can be assumed, but not for a "real world" in which you cannot assume it.
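
To make the failure mode concrete, here is a minimal Python sketch. It is only an illustration, not text from Charmod-Norm: the helper name identical_assuming_eun and the sample strings are made up for this example, and NFC is assumed as the normalization form.

```python
import unicodedata

# Two canonically equivalent spellings of the same word (illustrative data).
precomposed = "fa\u00e7ade"   # U+00E7 LATIN SMALL LETTER C WITH CEDILLA
decomposed = "fac\u0327ade"   # 'c' followed by U+0327 COMBINING CEDILLA


def identical_assuming_eun(a: str, b: str) -> bool:
    """Hypothetical matcher that trusts producers to have normalized already.

    It compares code points only, so it is correct only if early uniform
    normalization really happened upstream.
    """
    return a == b


# The decomposed form "sneaked in" without normalization: no match.
print(identical_assuming_eun(precomposed, decomposed))

# Once both sides are in the same form (NFC assumed here), they match.
print(identical_assuming_eun(
    unicodedata.normalize("NFC", precomposed),
    unicodedata.normalize("NFC", decomposed),
))
```

The first comparison fails even though a reader would see the same text, which is the situation a component faces when its strings come from producers it does not control.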


> For XACML, this is difficult, since the strings are produced by
> components outside the XACML specification scope, such as LDAP
> servers for instance. Maybe I don't understand what is meant.

No, I think you understand the problem.

> 
> 2. Also, isn't it possible that step 3 of the algorithm for string
> identity mapping results in a non-normalized string? 

No. The key is in understanding what "3.2.4 Fully-normalized text" in step 1 means. It means that the escape in your example (&#x327; or &#807;, representing a Unicode combining mark) would already have been normalized away.

If you cannot assume that normalization has been performed by the producer, then you must perform normalization yourself before comparing strings. That algorithm is not currently described in the document, but it can be derived by looking at section 3.2.4.
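
One possible shape for such a consumer-side comparison is sketched below in Python. This is an assumption-laden illustration and not an algorithm defined by Charmod-Norm: it assumes NFC as the target form, assumes the only escapes present are HTML/XML-style character references (expanded here with the standard library's html.unescape), and uses a made-up function name, matches_late_normalized.

```python
import html
import unicodedata


def matches_late_normalized(a: str, b: str) -> bool:
    """Hypothetical identity match that does not rely on the producer.

    Steps (all assumptions of this sketch):
      1. expand character references such as &#x327; or &#807;
      2. normalize the expanded text to NFC
      3. compare the results code point by code point
    """
    def prepare(s: str) -> str:
        expanded = html.unescape(s)            # "&#x327;" becomes U+0327
        return unicodedata.normalize("NFC", expanded)

    return prepare(a) == prepare(b)


# The escaped combining cedilla is expanded first, then composed with 'c'.
print(matches_late_normalized("fa\u00e7ade", "fac&#x327;ade"))
print(matches_late_normalized("fac&#807;ade", "fac\u0327ade"))
```

Normalizing after escape expansion matters here: expanding an escape for a combining mark can itself produce a non-normalized sequence, so the order of the two steps is the point of the sketch.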

I should add that the Internationalization Core WG reviewed the current document at the recent TPAC meeting and discussed the results at our last teleconference (yesterday) [1]. The basic conclusion of the group was that we need to do extensive work to address precisely the problem that you now face.

Best Regards,

Addison

[1] http://www.w3.org/2008/10/29-core-minutes.html#item06


Addison Phillips
Globalization Architect -- Lab126
Chair -- W3C Internationalization Core WG

Internationalization is not a feature.
It is an architecture.

Received on Thursday, 30 October 2008 18:04:35 UTC