- From: <bugzilla@wiggum.w3.org>
- Date: Wed, 17 Oct 2007 06:01:00 +0000
- To: www-xml-schema-comments@w3.org
- CC:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=3245

------- Comment #10 from fsasaki@w3.org  2007-10-17 06:00 -------
(In reply to comment #9)
> The WG discussed this issue both with Query and XSL, and then among
> ourselves, at the October 2007 ftf meetings in Redmond. See also bug
> 3222, which is closely related in practice.
>
> We discussed several proposals for defining equality conditions for
> string which might depend on normalization and/or collation
> information. Eventually, we converged on a proposal to add a
> Unicode-normalization facet applicable to xs:string. Its value will be
> an identifier denoting a specific Unicode normalization form (e.g. 'c').
> To begin with, the only legal values will be the identifier for
> normalization form C and ABSENT. The default value will be ABSENT,
> which means the unnormalized form is used. Once specified, the facet
> cannot be changed (it's effectively fixed from the time of first use).
> The meaning of the facet is that the lexical form is prepared by
> calculating the named normalization form for the 'normalized value' in
> the input infoset, and then performing whitespace normalization to
> calculate the candidate lexical form.
>
> We noted that it does matter that Unicode normalization be done first:
> For the string s = x, y, z, space, space, combining umlaut, x, y, z,
> it's clear that norm(ws(s)) = x, y, z, non-combining umlaut, x, y, z,
> while ws(norm(s)) = x, y, z, space, non-combining umlaut, x, y, z. We
> thought that in this case the double space in the original seems a
> clear signal that two tokens are intended, not one.
>
> After the meeting, it occurred to some WG members that it might be
> good to have an explicit identifier for no-normalization, so that the
> value of the facet can be fixed that way if desired. (This would
> entail reformulating the rule about changing the facet: the value
> might change from no-normalization to some normalization form, but
> not from any specified normalization form to any other value.)
>

The following is not my comment, but a copy from
http://lists.w3.org/Archives/Public/public-i18n-core/2007OctDec/0005.html .

Felix

There is a factual problem in the example. The normalized form of
<space, combining umlaut> is <space, combining umlaut> in all cases; it
does not change under normalization. The normalized form of <00a8>
remains the same (00a8) under NFC and NFD: it only changes in the
compatibility forms (NFKC and NFKD), where it becomes
<space, combining umlaut>.

So if normalization form C is being discussed, then the example needs to
be changed.

If you have any questions about particular normalizations, the ICU
browser is helpful.
http://demo.icu-project.org/icu-bin/nbrowser

Mark
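
The points above can be checked with a short script. What follows is only a minimal sketch in Python, using the standard unicodedata module. The helper names norm and ws mirror the notation in comment #9 but are otherwise hypothetical, and ws is an approximation of the XML Schema whiteSpace="collapse" rule, not the WG's definition.

```python
import re
import unicodedata

SPACE = "\u0020"
COMBINING_DIAERESIS = "\u0308"  # the "combining umlaut" in the example
SPACING_DIAERESIS = "\u00a8"    # the "non-combining umlaut" in the example


def norm(text, form="NFC"):
    """Apply the named Unicode normalization form."""
    return unicodedata.normalize(form, text)


def ws(text):
    """Approximate whiteSpace="collapse": map tab, LF and CR to a space,
    collapse runs of spaces, and trim leading and trailing spaces."""
    text = re.sub(r"[\t\n\r]", " ", text)
    return re.sub(r" {2,}", " ", text).strip(" ")


def code_points(text):
    """Render a string as a list of code points, e.g. '0020 0308'."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


# The string from comment #9: x, y, z, space, space, combining umlaut, x, y, z
s = "xyz" + SPACE + SPACE + COMBINING_DIAERESIS + "xyz"

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form,
          "| <space, 0308> ->", code_points(norm(SPACE + COMBINING_DIAERESIS, form)),
          "| <00A8> ->", code_points(norm(SPACING_DIAERESIS, form)))

print("norm(ws(s)):", code_points(norm(ws(s))))
print("ws(norm(s)):", code_points(ws(norm(s))))
```

Under these assumptions the two orders of application give the same result for s under NFC, since <space, combining umlaut> is never composed into 00A8. An example that really depends on the order would need a different input string.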
Received on Wednesday, 17 October 2007 06:01:06 UTC