[Bug 3245] Equality of strings

http://www.w3.org/Bugs/Public/show_bug.cgi?id=3245

------- Comment #10 from fsasaki@w3.org  2007-10-17 06:00 -------
(In reply to comment #9)
> The WG discussed this issue both with Query and XSL, and then among
> ourselves, at the October 2007 ftf meetings in Redmond.  See also bug
> 3222, which is closely related in practice.
> 
> We discussed several proposals for defining equality conditions for
> strings which might depend on normalization and/or collation
> information.  Eventually, we converged on a proposal to add a
> Unicode-normalization facet applicable to xs:string. Its value will be
> an identifier denoting a specific Unicode normalization form (e.g. 'c').
> To begin with, the only legal values will be the identifier for
> normalization form C and ABSENT.  The default value will be ABSENT,
> which means the unnormalized form is used.  Once specified, the facet
> cannot be changed (it's effectively fixed from the time of first use).
> The meaning of the facet is that the lexical form is prepared by
> calculating the named normalization form for the 'normalized value' in
> the input infoset, and then performing whitespace normalization to
> calculate the candidate lexical form.
> 
> We noted that it does matter that Unicode normalization be done first:
> For the string s = x, y, z, space, space, combining umlaut, x, y, z,
> it's clear that norm(ws(s)) = x, y, z, non-combining umlaut, x, y, z,
> while ws(norm(s)) = x, y, z, space, non-combining umlaut, x, y, z.  We
> thought that in this case the double space in the original seems a
> clear signal that two tokens are intended, not one.
> 
> After the meeting, it occurred to some WG members that it might be
> good to have an explicit identifier for no-normalization, so that the
> value of the facet can be fixed that way if desired.  (This would
> entail reformulating the rule about changing the facet:  the value
> might change from no-normalization to some normalization form, but
> not from any specified normalization form to any other value.)
> 

The following is not my comment, but a copy from
http://lists.w3.org/Archives/Public/public-i18n-core/2007OctDec/0005.html .

Felix

There is a factual problem in the example.

The normalized form of <space, combining umlaut> is <space, combining umlaut>
in all cases; it does not change under normalization. The normalized form of
<00a8> remains the same (00a8) under NFC and NFD: it only changes in the
compatibility forms to <space, combining umlaut>. So if normalization form C is
being discussed, then the example needs to be changed.
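
The point above can be checked directly. The following is a minimal sketch,
assuming Python's standard unicodedata module and a simplified version of the
XSD "collapse" whitespace rule; the helper names (normal_forms,
collapse_whitespace, codepoints) are made up for illustration and do not come
from the bug report or from any spec text.

```python
import re
import unicodedata

SPACE_UMLAUT = "\u0020\u0308"  # <space, combining diaeresis>
DIAERESIS = "\u00a8"           # spacing (non-combining) diaeresis


def normal_forms(text):
    """Return the text under each of the four Unicode normalization forms."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return {form: unicodedata.normalize(form, text) for form in forms}


def collapse_whitespace(text):
    """Simplified XSD 'collapse': squeeze whitespace runs, trim both ends."""
    return re.sub(r"[ \t\n\r]+", " ", text).strip(" ")


def codepoints(text):
    """Render a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


# The string from the quoted example: x, y, z, space, space,
# combining umlaut, x, y, z.
s = "xyz  \u0308xyz"

# Apply the two steps in both orders, using normalization form C.
norm_then_ws = collapse_whitespace(unicodedata.normalize("NFC", s))
ws_then_norm = unicodedata.normalize("NFC", collapse_whitespace(s))

if __name__ == "__main__":
    for label, text in (("space+0308", SPACE_UMLAUT), ("00A8", DIAERESIS)):
        for form, value in normal_forms(text).items():
            print(label, form, codepoints(value))
    print("norm then ws:", codepoints(norm_then_ws))
    print("ws then norm:", codepoints(ws_then_norm))
```

Under these assumptions, both orders give the same result for this particular
string when form C is used, because <space, combining umlaut> is never
composed into 00a8. A string that actually distinguishes the two orders would
need a different combination of characters, or a compatibility form.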

If you have any questions about particular normalizations, the ICU
normalization browser is helpful.

http://demo.icu-project.org/icu-bin/nbrowser

Mark

Received on Wednesday, 17 October 2007 06:01:06 UTC