- From: Felix Sasaki <fsasaki@w3.org>
- Date: Wed, 17 Oct 2007 15:02:00 +0900
- To: Mark Davis <mark.davis@icu-project.org>
- CC: public-i18n-core@w3.org
Many thanks for pointing this out, Mark. I added your comment as
http://www.w3.org/Bugs/Public/show_bug.cgi?id=3245#c10 .

Felix

Mark Davis wrote:
> There is a factual problem in the example.
>
> The normalized form of <space, combining umlaut> is <space, combining
> umlaut> in all cases; it does not change under normalization. The
> normalized form of <00a8> remains the same (00a8) under NFC and NFD:
> it only changes in the compatibility forms to <space, combining
> umlaut>. So if normalization form C is being discussed, then the
> example needs to be changed.
>
> If you have any questions about particular normalizations, the icu
> browser is helpful.
>
> http://demo.icu-project.org/icu-bin/nbrowser
>
> Mark
>
> On 10/16/07, Felix Sasaki <fsasaki@w3.org> wrote:
>
>     FYI, in case you did not see this.
>
>     Felix
>
>     http://www.w3.org/Bugs/Public/show_bug.cgi?id=3245
>
>     ------- Comment #9 from cmsmcq@w3.org 2007-10-14 19:56 -------
>     The WG discussed this issue both with Query and XSL, and then among
>     ourselves, at the October 2007 ftf meetings in Redmond. See also bug
>     3222, which is closely related in practice.
>
>     We discussed several proposals for defining equality conditions for
>     string which might depend on normalization and/or collation
>     information. Eventually, we converged on a proposal to add a
>     Unicode-normalization facet applicable to xs:string. Its value will
>     be an identifier denoting a specific Unicode collation form (e.g.
>     'c'). To begin with, the only legal values will be the identifier
>     for normalization form C and ABSENT. The default value will be
>     ABSENT, which means the unnormalized form is used. Once specified,
>     the facet cannot be changed (it's effectively fixed from the time of
>     first use). The meaning of the facet is that the lexical form is
>     prepared by calculating the named normalization form for the
>     'normalized value' in the input infoset, and then performing
>     whitespace normalization to calculate the candidate lexical form.
>
>     We noted that it does matter that Unicode normalization be done
>     first: For the string s = x, y, z, space, space, combining umlaut,
>     x, y, z, it's clear that norm(ws(s)) = x, y, z, non-combining
>     umlaut, x, y, z, while ws(norm(s)) = x, y, z, space, non-combining
>     umlaut, x, y, z. We thought that in this case the double space in
>     the original seems a clear signal that two tokens are intended, not
>     one.
>
>     After the meeting, it occurred to some WG members that it might be
>     good to have an explicit identifier for no-normalization, so that
>     the value of the facet can be fixed that way if desired. (This
>     would entail reformulating the rule about changing the facet: the
>     value might change from no-normalization to some normalization
>     form, but not from any specified normalization form to any other
>     value.)
>
> --
> Mark
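
For anyone who wants to check Mark's point without the ICU browser, here is a minimal sketch using Python's standard unicodedata module. The helper names (nfc, collapse_ws, all_forms) are made up for illustration, and collapse_ws is only a rough stand-in for XML Schema whitespace collapsing, not the schema processor's actual behaviour. It suggests that under NFC the order of the two steps makes no difference for the string in comment #9, since <space, combining umlaut> never composes to <00a8>.

```python
import re
import unicodedata

SPACE_UMLAUT = " \u0308"  # <space, combining diaeresis>
DIAERESIS = "\u00a8"      # spacing (non-combining) diaeresis


def all_forms(text):
    """Return text under each of the four Unicode normalization forms."""
    return {form: unicodedata.normalize(form, text)
            for form in ("NFC", "NFD", "NFKC", "NFKD")}


def nfc(text):
    """Normalization form C, the only form proposed for the facet."""
    return unicodedata.normalize("NFC", text)


def collapse_ws(text):
    """Hypothetical approximation of whitespace collapsing: runs of
    space, tab, LF and CR become one space, then the ends are trimmed."""
    return re.sub(r"[ \t\n\r]+", " ", text).strip(" ")


def show(text):
    """Render a string as a list of code points for easy comparison."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# The string from comment #9: x, y, z, space, space, combining umlaut, x, y, z
s = "xyz  \u0308xyz"

for form, value in all_forms(DIAERESIS).items():
    print(form, "of U+00A8:", show(value))
print("norm(ws(s)):", show(nfc(collapse_ws(s))))
print("ws(norm(s)):", show(collapse_ws(nfc(s))))
```

Swapping "NFC" for "NFKC" inside nfc is an easy way to see where the compatibility forms start to differ for input that contains <00a8>.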
Received on Wednesday, 17 October 2007 06:02:20 UTC