Re: Questions/feedback on character normalization from Erik Rissanen on 2008-11-01 (www-international@w3.org from October to December 2008)

From: Erik Rissanen <erik@axiomatics.com>
Date: Sat, 01 Nov 2008 15:24:49 +0100
To: "www-international@w3.org" <www-international@w3.org>
Message-ID: <490C66B1.5020909@axiomatics.com>
Thank you Addison,

That clears up my confusion. But it also means that the current draft 
does not satisfy the requirements of XACML. Instead, I will write the 
following in a section on Unicode issues:

--8<--
In Unicode it is possible to represent some letters by different 
character sequences. The process of converting Unicode strings into 
canonical character sequences is called normalization. An operation is 
normalization-sensitive if its output(s) are different depending on the 
state of normalization of the input(s); if the output(s) are textual, 
they are deemed different only if they would remain different were they 
to be normalized. (Quoted from 
[http://www.w3.org/TR/2005/WD-charmod-norm-20051027])

For more information on normalization see 
[http://www.w3.org/TR/2005/WD-charmod-norm-20051027].

An XACML implementation MUST NOT perform any normalization-sensitive 
operations unless it has ensured that the inputs are normalized. An 
XACML implementation MUST behave as if each normalization-sensitive 
operation normalizes the string into Unicode normalization form NFC. An 
implementation MAY use some other form of internal processing as long as 
the externally visible results are identical to this specification.
--8<--

If I understand the issues correctly, this should be sufficient for our 
needs.
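
To make the draft text above concrete, here is a minimal sketch, assuming 
Python and its standard unicodedata module. It is an illustration only, not 
proposed specification text; the sample strings and the helper name is_nfc 
are my own assumptions.

```python
import unicodedata

# Two representations of the same letter "ç" (illustrative sample data):
precomposed = "\u00e7"   # LATIN SMALL LETTER C WITH CEDILLA
decomposed = "c\u0327"   # "c" followed by COMBINING CEDILLA (&#x327;)

def is_nfc(s: str) -> bool:
    """Return True if s is already in Unicode normalization form NFC."""
    return unicodedata.normalize("NFC", s) == s

# A plain code point comparison is normalization-sensitive: its result
# depends on whether the inputs happen to be normalized.
raw_equal = precomposed == decomposed

# Normalizing both inputs to NFC first removes that dependency.
nfc_equal = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)
```

Here raw_equal is False while nfc_equal is True, which is the kind of 
difference the MUST NOT sentence is meant to rule out.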

For our string equality function I will write "The two strings are 
equal if they result in identical binary sequences when encoded into a 
common Unicode encoding form".
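
A minimal sketch of how such a function could behave, assuming Python, the 
standard unicodedata module, and that the NFC requirement from the draft 
section applies before encoding. The function name xacml_string_equal and 
the default encoding are hypothetical choices for illustration, not part of 
XACML.

```python
import unicodedata

def xacml_string_equal(a: str, b: str, encoding: str = "utf-8") -> bool:
    """Compare two strings by their encoded bytes after NFC normalization.

    Hypothetical helper: the name and the default encoding are assumptions.
    """
    # Behave as if the operation normalizes each input to NFC.
    a_bytes = unicodedata.normalize("NFC", a).encode(encoding)
    b_bytes = unicodedata.normalize("NFC", b).encode(encoding)
    # Equal only if the binary sequences in the common encoding are identical.
    return a_bytes == b_bytes
```

For example, xacml_string_equal("fa\u00e7ade", "fac\u0327ade") returns True, 
while xacml_string_equal("facade", "fa\u00e7ade") returns False. Any 
encoding form works as long as both strings use the same one.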

One question though: Is it possible that there are some strings which 
cannot be normalized into NFC? If so, we need to define error behavior 
where this can occur.

Best regards,
Erik

Phillips, Addison wrote:
> Hello Erik,
>
> Thanks for this note.
>
> Charmod-Norm (the document you reference) is being worked on again by the Internationalization Core WG (after a lengthy hiatus). The draft you're referencing represents the first steps in a compromise taking place in the document. Previous versions *required* early uniform normalization ("EUN") of text by string producers. This version is the first to begin trying to address the fact that early uniform normalization cannot be relied upon. The algorithm you refer to is still very much reliant on EUN (by the producers of your string input).
>   
>> I have found the document at
>> http://www.w3.org/TR/2005/WD-charmod-norm-20051027/ very useful,
>> but there are some things I don't understand in it.
>>
>> 1. For string identity matching, in section C312: Why must the
>> normalization be done by the producers of the strings to be
>> compared?
>>     
>
> The requirement that producers of the strings generate only normalized text allows the algorithm to assume a normalization form is extant already. Non-normalized text that "sneaks in" to such a system will not produce matches. The algorithm as written is appropriate for cases where EUN can be assumed but is not appropriate for a "real world" in which you cannot assume it.
>
>
>   
>> For XACML, this is difficult, since the strings are produced by
>> components outside the XACML specification scope, such as LDAP
>> servers for instance. Maybe I don't understand what is meant.
>>     
>
> No, I think you understand the problem.
>
>   
>> 2. Also, isn't it possible that step 3 of the algorithm for string
>> identity mapping results in a non-normalized string? 
>>     
>
> No. The key is in understanding what "3.2.4 Fully-normalized text" in step 1 means. It means that the escape in your example (&#x327; or &#807;, representing a Unicode combining mark) would already have been normalized away.
>
> If you cannot assume normalization has been performed by the producer, then you must assume that normalization must be performed prior to comparing strings. This algorithm is not currently described in the document, but can be derived by looking at section 3.2.4.
>
> I should add that the Internationalization Core WG reviewed the current document at the recent TPAC meeting and discussed the results at our last teleconference (yesterday) [1]. The basic conclusion of the group was that we need to do extensive work to address precisely the problem that you now face.
>
> Best Regards,
>
> Addison
>
> [1] http://www.w3.org/2008/10/29-core-minutes.html#item06
>
> Addison Phillips
> Globalization Architect -- Lab126
> Chair -- W3C Internationalization Core WG
>
> Internationalization is not a feature.
> It is an architecture.
>
Received on Saturday, 1 November 2008 14:25:27 UTC