Update on Bug 3245

FYI, in case you did not see this.

Felix

http://www.w3.org/Bugs/Public/show_bug.cgi?id=3245

------- Comment #9 from cmsmcq@w3.org  2007-10-14 19:56 -------
The WG discussed this issue both with Query and XSL, and then among
ourselves, at the October 2007 ftf meetings in Redmond.  See also bug
3222, which is closely related in practice.

We discussed several proposals for defining equality conditions for
strings which might depend on normalization and/or collation
information.  Eventually, we converged on a proposal to add a
Unicode-normalization facet applicable to xs:string. Its value will be
an identifier denoting a specific Unicode normalization form (e.g. 'c').
To begin with, the only legal values will be the identifier for
normalization form C and ABSENT.  The default value will be ABSENT,
which means the unnormalized form is used.  Once specified, the facet
cannot be changed (it's effectively fixed from the time of first use).
The meaning of the facet is that the lexical form is prepared by
calculating the named normalization form for the 'normalized value' in
the input infoset, and then performing whitespace normalization to
calculate the candidate lexical form.
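
The two-step preparation described above can be sketched roughly as
follows.  This is only an illustration, not the WG's specification
text: the function names, the use of None to model ABSENT, and the
choice of XSD-style "collapse" as the whitespace normalization are all
assumptions made for the example.

```python
import re
import unicodedata

# Assumption: None models the ABSENT facet value; "C" models form C.
ABSENT = None


def collapse_whitespace(text):
    """XSD-style 'collapse': merge runs of space/tab/LF/CR, trim the ends."""
    return re.sub(r"[ \t\n\r]+", " ", text).strip(" ")


def candidate_lexical_form(normalized_value, normalization=ABSENT):
    """Apply Unicode normalization first (if the facet is set), then
    whitespace normalization, in the order described in the comment."""
    if normalization is not ABSENT:
        # "C" -> "NFC"; unicodedata.normalize is the standard-library call.
        normalized_value = unicodedata.normalize(
            "NF" + normalization, normalized_value
        )
    return collapse_whitespace(normalized_value)
```

With the facet set to "C", an input containing "e" followed by a
combining acute accent comes out with the precomposed character; with
the facet ABSENT, only the whitespace step is applied.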

We noted that the order of the two steps matters, and that Unicode
normalization should be done first:  For the string
s = x, y, z, space, space, combining umlaut, x, y, z,
it's clear that norm(ws(s)) = x, y, z, non-combining umlaut, x, y, z,
while ws(norm(s)) = x, y, z, space, non-combining umlaut, x, y, z.  We
thought that in this case the double space in the original is a clear
signal that two tokens are intended, not one, and ws(norm(s)) is the
result that keeps them separate.

After the meeting, it occurred to some WG members that it might be
good to have an explicit identifier for no-normalization, so that the
value of the facet can be fixed that way if desired.  (This would
entail reformulating the rule about changing the facet:  the value
might change from no-normalization to some normalization form, but
not from any specified normalization form to any other value.)
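
One possible reading of the reformulated rule, as a small sketch.  The
identifier "none" for no-normalization is hypothetical (the comment
does not name one), and treating an unspecified facet in the derived
type as simply inheriting the base value is likewise an assumption.

```python
# Assumption: None models "facet not specified" (ABSENT).
ABSENT = None
# Hypothetical identifier for explicit no-normalization.
NO_NORMALIZATION = "none"


def may_derive(base, derived):
    """Return True if a derived type may specify `derived` for the
    facet when its base type has `base`."""
    # Not specifying the facet, or repeating the base value, changes nothing.
    if derived is ABSENT or derived == base:
        return True
    # From ABSENT or explicit no-normalization, a new value may be set.
    if base is ABSENT or base == NO_NORMALIZATION:
        return True
    # A specified normalization form may not change to any other value.
    return False
```

Under this reading, may_derive("none", "C") holds, while
may_derive("C", "none") does not.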

Received on Wednesday, 17 October 2007 05:05:45 UTC