- From: Felix Sasaki <fsasaki@w3.org>
- Date: Mon, 29 Oct 2007 20:58:54 +0900
- To: Mark Davis <mark.davis@icu-project.org>, "C. M. Sperberg-McQueen" <cmsmcq@acm.org>
- CC: public-i18n-core@w3.org
Hello Mark,

thank you very much for your explanation!

Mark Davis wrote:
> Nice to hear from you. Yes, NFKC and NFKD don't commute with the XML
> ws function. I believe that NFC and NFD do, but will have to write a
> little program to verify that.

Would it be possible to see the script or (even better) the algorithm
needed for the verification? Or could you point me and others to the
relevant part of the normalization specification?

>
> If it is the case, that would only be guaranteed for the current
> version of Unicode. While I wouldn't foresee that changing in the
> future, if you really want a guarantee that it will continue to work
> that way in future versions of Unicode, we'd have to propose that that
> be one of the Unicode stability policies.... See
> http://www.unicode.org/standard/stability_policy.html for examples.

As I understand it, this might be helpful for the resolution Michael
proposed for the XML Schema 1.1 issue at
http://www.w3.org/Bugs/Public/show_bug.cgi?id=3245#c9 :
having a facet that creates NFC and performs whitespace normalization
afterwards. Michael, am I right?

Felix

>
> NFKC and NFKD are mostly useful for loose matching, since they lose
> formatting information.
>
> Mark
>
> On 10/24/07, C. M. Sperberg-McQueen <cmsmcq@acm.org> wrote:
>
> On 16 October, Mark Davis wrote:
>
> > There is a factual problem in the example.
> >
> > The normalized form of <space, combining umlaut> is <space,
> > combining umlaut> in all cases; it does not change under
> > normalization. The normalized form of <00a8> remains the same
> > (00a8) under NFC and NFD: it only changes in the compatibility
> > forms to <space, combining umlaut>. So if normalization form C is
> > being discussed, then the example needs to be changed.
> >
> > If you have any questions about particular normalizations, the icu
> > browser is helpful.
> >
> > http://demo.icu-project.org/icu-bin/nbrowser
>
> Thanks for the correction. I've noted the error in the minutes of the
> meeting where the example was concocted, and will try not to propagate
> it further.
>
> The crucial question, which the ICU browser might help me answer if I
> had more background knowledge (but not in my current state of
> innocence), is: if the process of Unicode normalization is represented
> by functions nfc, nfd, nfkc, and nfkd, and the process of whitespace
> normalization as specified in XML is represented by function ws, then
> which of the following are true and which are false?
>
> (1) for all Unicode strings s, nfc(ws(s)) = ws(nfc(s))
> (2) for all Unicode strings s, nfd(ws(s)) = ws(nfd(s))
> (3) for all Unicode strings s, nfkc(ws(s)) = ws(nfkc(s))
> (4) for all Unicode strings s, nfkd(ws(s)) = ws(nfkd(s))
>
> I believe your comment allows me to infer that (3) and (4) are false,
> because for s = x, blank, spacing umlaut, x (U+0078, U+0020, U+00A8,
> U+0078), we have (and the ICU browser does help, thanks a million):
>
> ws(nfkc(s)) = ws(x, blank, blank, combining umlaut, x)
>             = ws(0078 0020 0020 0308 0078)
>             = 0078 0020 0308 0078
>             = x, blank, combining umlaut, x
> nfkc(ws(s)) = nfkc(s)
>             = 0078 0020 0020 0308 0078
>             = x, blank, blank, combining umlaut, x
>
> So ws(nfkc(s)) != nfkc(ws(s))
>
> and ditto for nfkd.
>
> For this trivial example, nfc and nfd don't affect the amount of
> whitespace in the string and thus don't interact with whitespace
> normalization. But I don't know -- are there transformations in
> nfc or nfd which do affect whitespace? If so, (1) and (2) are
> also false; if not, they are true.
>
> In a way, the question is academic: if the XML Schema WG decides
> to add a Unicode normalization facet, we can (and presumably will)
> specify that Unicode normalization and whitespace normalization occur
> in a particular order; if an implementation believes it can achieve
> some advantage by transposing the operations, it's going to be the
> implementor's responsibility to figure out whether the transposition
> is safe (guaranteed to produce the same results) or not.
>
> But, well, you know, it would be nice to know, even if the knowledge
> doesn't turn up in the XSDL spec. (But, I guess, not SO nice to
> know that I'm willing to work through Standard Annex #15 and the
> relevant parts of Unicode myself, to figure out the answer. Hence
> my desire to consult an oracle.)
>
> Thanks again.
>
> Michael Sperberg-McQueen
>
> --
> Mark
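
The commutation question in (1) to (4) can be explored with a short script. What follows is only a minimal sketch, not the verification program mentioned in the thread (which is not shown here). It assumes that "ws" means the XML Schema whiteSpace "collapse" behaviour (tab, LF and CR replaced by a space, runs of spaces collapsed, leading and trailing spaces removed). The helper names `ws`, `commutes` and `whitespace_changing_code_points` are hypothetical, and the results depend on the Unicode version bundled with the Python build that runs it.

```python
import unicodedata

# Assumption: "ws" is XML Schema whiteSpace=collapse over #x9, #xA, #xD, #x20.
XML_WHITESPACE = "\t\n\r "
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def ws(s: str) -> str:
    """Replace tab/LF/CR with a space, collapse runs of spaces, trim the ends."""
    replaced = s.translate({0x09: " ", 0x0A: " ", 0x0D: " "})
    # Split on U+0020 only; str.split() without arguments would also split
    # on non-XML whitespace such as U+00A0 or U+2003.
    return " ".join(part for part in replaced.split(" ") if part)


def commutes(form: str, s: str) -> bool:
    """True if normalizing then collapsing equals collapsing then normalizing."""
    return unicodedata.normalize(form, ws(s)) == ws(unicodedata.normalize(form, s))


def whitespace_changing_code_points(form: str) -> list[int]:
    """Code points whose normalized form gains or loses XML whitespace."""
    hits = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:  # skip surrogates
            continue
        ch = chr(cp)
        normalized = unicodedata.normalize(form, ch)
        had_ws = ch in XML_WHITESPACE
        has_ws = any(c in XML_WHITESPACE for c in normalized)
        if had_ws != has_ws:
            hits.append(cp)
    return hits


# The example from the quoted message: x, blank, spacing umlaut, x.
sample = "x \u00a8x"
sample_results = {form: commutes(form, sample) for form in FORMS}
scan_results = {form: whitespace_changing_code_points(form) for form in FORMS}

for form in FORMS:
    print(
        form,
        "| commutes on sample:", sample_results[form],
        "| code points that change XML whitespace:", len(scan_results[form]),
    )
```

On the sample string the check should fail for NFKC and NFKD and succeed for NFC and NFD, in line with the worked example in the quoted message. The per-code-point scan is evidence rather than proof: it only looks for single characters whose normalized form introduces or removes XML whitespace, and it says nothing about future Unicode versions.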
Received on Monday, 29 October 2007 11:59:16 UTC