Re: Update on Bug 3245 from Mark Davis on 2007-10-17 (public-i18n-core@w3.org from October to December 2007)

From: Mark Davis <mark.davis@icu-project.org>
Date: Tue, 16 Oct 2007 22:40:55 -0700
To: "Felix Sasaki" <fsasaki@w3.org>
Cc: public-i18n-core@w3.org
Message-ID: <30b660a20710162240y68e8bb3te121429b81dd6a17@mail.gmail.com>

There is a factual problem in the example.

The normalized form of <space, combining umlaut> is <space, combining
umlaut> in every normalization form; it does not change under normalization.
The normalized form of <00a8> remains <00a8> under NFC and NFD; it changes to
<space, combining umlaut> only in the compatibility forms (NFKC and NFKD).
So if normalization form C is being discussed, then the example needs to be
changed.
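
The claims above can be checked locally. What follows is a minimal sketch
using Python's standard unicodedata module, taking "combining umlaut" to
mean U+0308. The helper names (forms, collapse_ws, codepoints) are made up
for illustration, and collapse_ws is only a rough stand-in for schema
whitespace normalization, not the exact rule from the spec.

```python
import re
import unicodedata

SPACE_UMLAUT = "\u0020\u0308"   # <space, combining umlaut>, assuming U+0308
SPACING_UMLAUT = "\u00a8"       # <00a8>

def forms(s):
    """Return the four Unicode normalization forms of s, keyed by form name."""
    return {f: unicodedata.normalize(f, s) for f in ("NFC", "NFD", "NFKC", "NFKD")}

def collapse_ws(s):
    """Hypothetical stand-in for whitespace normalization:
    collapse runs of space/tab/CR/LF to one space and trim the ends."""
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")

def codepoints(s):
    """Render a string as hex code points so invisible characters show up."""
    return " ".join(f"{ord(c):04X}" for c in s)

# Print every form of both sequences for comparison.
for name, text in (("space + umlaut", SPACE_UMLAUT), ("00a8", SPACING_UMLAUT)):
    for form, value in forms(text).items():
        print(f"{name:15} {form:5} {codepoints(value)}")

# The string from the bug comment: x, y, z, space, space, umlaut, x, y, z.
s = "xyz  \u0308xyz"
nfc_then_ws = collapse_ws(unicodedata.normalize("NFC", s))
ws_then_nfc = unicodedata.normalize("NFC", collapse_ws(s))
print(codepoints(nfc_then_ws))
print(codepoints(ws_then_nfc))
```

With form C, both orders of operation give the same result for this
string, because NFC never composes <space, combining umlaut> into <00a8>.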

If you have any questions about particular normalizations, the ICU
normalization browser is helpful.

http://demo.icu-project.org/icu-bin/nbrowser

Mark

On 10/16/07, Felix Sasaki <fsasaki@w3.org> wrote:
>
>
> FYI, in case you did not see this.
>
> Felix
>
> http://www.w3.org/Bugs/Public/show_bug.cgi?id=3245
>
> ------- Comment #9 from cmsmcq@w3.org  2007-10-14 19:56 -------
> The WG discussed this issue both with Query and XSL, and then among
> ourselves, at the October 2007 ftf meetings in Redmond.  See also bug
> 3222, which is closely related in practice.
>
> We discussed several proposals for defining equality conditions for
> strings which might depend on normalization and/or collation
> information.  Eventually, we converged on a proposal to add a
> Unicode-normalization facet applicable to xs:string. Its value will be
> an identifier denoting a specific Unicode normalization form (e.g. 'c').
> To begin with, the only legal values will be the identifier for
> normalization form C and ABSENT.  The default value will be ABSENT,
> which means the unnormalized form is used.  Once specified, the facet
> cannot be changed (it's effectively fixed from the time of first use).
> The meaning of the facet is that the lexical form is prepared by
> calculating the named normalization form for the 'normalized value' in
> the input infoset, and then performing whitespace normalization to
> calculate the candidate lexical form.
>
> We noted that it does matter that Unicode normalization be done first:
> For the string s = x, y, z, space, space, combining umlaut, x, y, z,
> it's clear that norm(ws(s)) = x, y, z, non-combining umlaut, x, y, z,
> while ws(norm(s)) = x, y, z, space, non-combining umlaut, x, y, z.  We
> thought that in this case the double space in the original seems a
> clear signal that two tokens are intended, not one.
>
> After the meeting, it occurred to some WG members that it might be
> good to have an explicit identifier for no-normalization, so that the
> value of the facet can be fixed that way if desired.  (This would
> entail reformulating the rule about changing the facet:  the value
> might change from no-normalization to some normalization form, but
> not from any specified normalization form to any other value.)
>
>
>


-- 
Mark

Received on Wednesday, 17 October 2007 05:41:31 UTC