Re: Update on Bug 3245 from Felix Sasaki on 2007-10-17 (public-i18n-core@w3.org from October to December 2007)

From: Felix Sasaki <fsasaki@w3.org>
Date: Wed, 17 Oct 2007 15:02:00 +0900
To: Mark Davis <mark.davis@icu-project.org>
CC: public-i18n-core@w3.org
Message-ID: <4715A558.9090202@w3.org>
Many thanks for pointing this out, Mark. I added your comment as 
http://www.w3.org/Bugs/Public/show_bug.cgi?id=3245#c10 .

Felix

Mark Davis wrote:
> There is a factual problem in the example.
>
> The normalized form of <space, combining umlaut> is <space, combining 
> umlaut> in all cases; it does not change under normalization. The 
> normalized form of <00a8> remains the same (00a8) under NFC and NFD: 
> it only changes in the compatibility forms to <space, combining 
> umlaut>. So if normalization form C is being discussed, then the 
> example needs to be changed.
>
> If you have any questions about particular normalizations, the icu 
> browser is helpful.
>
> http://demo.icu-project.org/icu-bin/nbrowser
>
> Mark
>
> On 10/16/07, Felix Sasaki <fsasaki@w3.org> wrote:
>
>
>     FYI, in case you did not see this.
>
>     Felix
>
>     http://www.w3.org/Bugs/Public/show_bug.cgi?id=3245
>
>     ------- Comment #9 from cmsmcq@w3.org  2007-10-14 19:56 -------
>     The WG discussed this issue both with Query and XSL, and then among
>     ourselves, at the October 2007 ftf meetings in Redmond.  See also bug
>     3222, which is closely related in practice.
>
>     We discussed several proposals for defining equality conditions for
>     strings which might depend on normalization and/or collation
>     information.  Eventually, we converged on a proposal to add a
>     Unicode-normalization facet applicable to xs:string.  Its value will
>     be an identifier denoting a specific Unicode normalization form
>     (e.g. 'c').
>     To begin with, the only legal values will be the identifier for
>     normalization form C and ABSENT.  The default value will be ABSENT,
>     which means the unnormalized form is used.  Once specified, the facet
>     cannot be changed (it's effectively fixed from the time of first use).
>     The meaning of the facet is that the lexical form is prepared by
>     calculating the named normalization form for the 'normalized value'
>     in the input infoset, and then performing whitespace normalization
>     to calculate the candidate lexical form.
>
>     We noted that it does matter that Unicode normalization be done first:
>     For the string s = x, y, z, space, space, combining umlaut, x, y, z,
>     it's clear that norm(ws(s)) = x, y, z, non-combining umlaut, x, y, z,
>     while ws(norm(s)) = x, y, z, space, non-combining umlaut, x, y, z.  We
>     thought that in this case the double space in the original seems a
>     clear signal that two tokens are intended, not one.
>
>     After the meeting, it occurred to some WG members that it might be
>     good to have an explicit identifier for no-normalization, so that the
>     value of the facet can be fixed that way if desired.  (This would
>     entail reformulating the rule about changing the facet:  the value
>     might change from no-normalization to some normalization form, but
>     not from any specified normalization form to any other value.)
>
>
>
>
>
> -- 
> Mark
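
For anyone who wants to check the normalization behaviour Mark describes
without the ICU browser, here is a minimal sketch using Python's standard
unicodedata module. It assumes that "combining umlaut" means U+0308 and
that "non-combining umlaut" means U+00A8. The ws_collapse helper and the
other names are hypothetical, made up for this illustration; ws_collapse
is only a rough stand-in for whitespace normalization, not the schema
spec's definition. Results reflect the Unicode data bundled with the
Python interpreter in use.

```python
import re
import unicodedata

SPACE = "\u0020"
COMBINING_UMLAUT = "\u0308"  # COMBINING DIAERESIS (assumed meaning of "combining umlaut")
SPACING_UMLAUT = "\u00a8"    # DIAERESIS (assumed meaning of "non-combining umlaut")

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def code_points(s):
    """Render a string as U+XXXX code points so the output is unambiguous."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


def ws_collapse(s):
    """Hypothetical stand-in for whitespace normalization: collapse runs of
    space, tab, LF and CR to one space and trim both ends."""
    return re.sub(r"[ \t\n\r]+", " ", s).strip(" ")


def norm_c(s):
    """Apply normalization form C."""
    return unicodedata.normalize("NFC", s)


# Part 1: the two sequences from Mark's message under all four forms.
for form in FORMS:
    seq = unicodedata.normalize(form, SPACE + COMBINING_UMLAUT)
    single = unicodedata.normalize(form, SPACING_UMLAUT)
    print(form,
          "| <space, combining umlaut> ->", code_points(seq),
          "| <00a8> ->", code_points(single))

# Part 2: the string from comment #9, with both orders of processing.
s = "xyz" + SPACE + SPACE + COMBINING_UMLAUT + "xyz"
norm_then_ws = ws_collapse(norm_c(s))
ws_then_norm = norm_c(ws_collapse(s))
print("ws(norm(s)):", code_points(norm_then_ws))
print("norm(ws(s)):", code_points(ws_then_norm))
print("same result under NFC:", norm_then_ws == ws_then_norm)
```

Under these assumptions, the first loop shows <space, combining umlaut>
unchanged in every form and <00a8> changing only in the compatibility
forms, and the two processing orders give the same string under NFC for
this particular s. That is consistent with Mark's remark that the example
needs to be changed if normalization form C is meant.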
Received on Wednesday, 17 October 2007 06:02:20 UTC