- From: Mark Davis <mark.davis@icu-project.org>
- Date: Tue, 16 Oct 2007 22:40:55 -0700
- To: "Felix Sasaki" <fsasaki@w3.org>
- Cc: public-i18n-core@w3.org
- Message-ID: <30b660a20710162240y68e8bb3te121429b81dd6a17@mail.gmail.com>
There is a factual problem in the example.

The normalized form of <space, combining umlaut> is <space, combining umlaut> in all cases; it does not change under normalization. The normalized form of <00a8> remains the same (00a8) under NFC and NFD: it only changes in the compatibility forms to <space, combining umlaut>.

So if normalization form C is being discussed, then the example needs to be changed.

If you have any questions about particular normalizations, the icu browser is helpful.

http://demo.icu-project.org/icu-bin/nbrowser

Mark

On 10/16/07, Felix Sasaki <fsasaki@w3.org> wrote:
>
>
> FYI, in case you did not see this.
>
> Felix
>
> http://www.w3.org/Bugs/Public/show_bug.cgi?id=3245
>
> ------- Comment #9 from cmsmcq@w3.org 2007-10-14 19:56 -------
> The WG discussed this issue both with Query and XSL, and then among
> ourselves, at the October 2007 ftf meetings in Redmond. See also bug
> 3222, which is closely related in practice.
>
> We discussed several proposals for defining equality conditions for
> string which might depend on normalization and/or collation
> information. Eventually, we converged on a proposal to add a
> Unicode-normalization facet applicable to xs:string. Its value will be
> an identifier denoting a specific Unicode collation form (e.g. 'c').
> To begin with, the only legal values will be the identifier for
> normalization form C and ABSENT. The default value will be ABSENT,
> which means the unnormalized form is used. Once specified, the facet
> cannot be changed (it's effectively fixed from the time of first use).
> The meaning of the facet is that the lexical form is prepared by
> calculating the named normalization form for the 'normalized value' in
> the input infoset, and then performing whitespace normalization to
> calculate the candidate lexical form.
>
> We noted that it does matter that Unicode normalization be done first:
> For the string s = x, y, z, space, space, combining umlaut, x, y, z,
> it's clear that norm(ws(s)) = x, y, z, non-combining umlaut, x, y, z,
> while ws(norm(s)) = x, y, z, space, non-combining umlaut, x, y, z. We
> thought that in this case the double space in the original seems a
> clear signal that two tokens are intended, not one.
>
> After the meeting, it occurred to some WG members that it might be
> good to have an explicit identifier for no-normalization, so that the
> value of the facet can be fixed that way if desired. (This would
> entail reformulating the rule about changing the facet: the value
> might change from no-normalization to some normalization form, but
> not from any specified normalization form to any other value.)
>
>

--
Mark
Received on Wednesday, 17 October 2007 05:41:31 UTC
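
The normalization behaviour described in the message above can be checked locally with Python's standard `unicodedata` module. What follows is only a minimal sketch and is not part of the original thread: `collapse_ws` is a hypothetical stand-in for XML Schema whitespace collapsing, and the test string is the WG's "x, y, z, space, space, combining umlaut, x, y, z" example.

```python
import re
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

DIAERESIS = "\u00a8"         # U+00A8 DIAERESIS (the spacing, "non-combining" umlaut)
SPACE_COMBINING = " \u0308"  # U+0020 SPACE followed by U+0308 COMBINING DIAERESIS


def collapse_ws(text):
    """Hypothetical stand-in for XML Schema whiteSpace="collapse":
    runs of space, tab, LF and CR become one space; the ends are trimmed."""
    return re.sub(r"[ \t\n\r]+", " ", text).strip(" ")


def codepoints(text):
    """Render a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


# Each sequence under each of the four normalization forms.
forms_of_diaeresis = {f: unicodedata.normalize(f, DIAERESIS) for f in FORMS}
forms_of_space_combining = {
    f: unicodedata.normalize(f, SPACE_COMBINING) for f in FORMS
}

# The WG example string: x, y, z, space, space, combining umlaut, x, y, z.
s = "xyz" + "  " + "\u0308" + "xyz"

# For each form, does norm(ws(s)) differ from ws(norm(s))?
order_matters = {
    f: unicodedata.normalize(f, collapse_ws(s))
    != collapse_ws(unicodedata.normalize(f, s))
    for f in FORMS
}

for f in FORMS:
    print(
        f,
        "| 00A8 ->", codepoints(forms_of_diaeresis[f]),
        "| 0020 0308 ->", codepoints(forms_of_space_combining[f]),
        "| order matters:", order_matters[f],
    )
```

With the Unicode data bundled in CPython, U+00A8 is unchanged under NFC and NFD and becomes U+0020 U+0308 only under NFKC and NFKD, while <U+0020, U+0308> is unchanged under all four forms. For this particular string the two orders of operation therefore give the same result, which is consistent with the point that the example needs to be changed.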