- From: Mark Davis <mark.davis@icu-project.org>
- Date: Sat, 27 Oct 2007 14:36:28 -0700
- To: "C. M. Sperberg-McQueen" <cmsmcq@acm.org>
- Cc: public-i18n-core@w3.org
- Message-ID: <30b660a20710271436u3e319587l65acaf520ff644c5@mail.gmail.com>
Nice to hear from you.

Yes, NFKC and NFKD don't commute with the XML ws function. I believe that NFC and NFD do, but will have to write a little program to verify that. If it is the case, that would only be guaranteed for the current version of Unicode. While I wouldn't foresee that changing in the future, if you really want a guarantee that it will continue to work that way in future versions of Unicode, we'd have to propose that that be one of the Unicode stability policies.... See http://www.unicode.org/standard/stability_policy.html for examples.

NFKC and NFKD are mostly useful for loose matching, since they lose formatting information.

Mark

On 10/24/07, C. M. Sperberg-McQueen <cmsmcq@acm.org> wrote:
>
> On 16 October, Mark Davis wrote:
>
> > There is a factual problem in the example.
> >
> > The normalized form of <space, combining umlaut> is <space, combining
> > umlaut> in all cases; it does not change under normalization. The
> > normalized form of <00a8> remains the same (00a8) under NFC and NFD:
> > it only changes in the compatibility forms to <space, combining
> > umlaut>. So if normalization form C is being discussed, then the
> > example needs to be changed.
> >
> > If you have any questions about particular normalizations, the icu
> > browser is helpful.
> >
> > http://demo.icu-project.org/icu-bin/nbrowser
>
> Thanks for the correction. I've noted the error in the minutes of the
> meeting where the example was concocted, and will try not to propagate
> it further.
>
> The crucial question, which the ICU browser might help me answer if I
> had more background knowledge (but not in my current state of
> innocence), is: if the process of Unicode normalization is represented
> by functions nfc, nfd, nfkc, and nfkd, and the process of whitespace
> normalization as specified in XML is represented by function ws, then
> which of the following are true and which are false?
>
> (1) for all Unicode strings s, nfc(ws(s)) = ws(nfc(s))
> (2) for all Unicode strings s, nfd(ws(s)) = ws(nfd(s))
> (3) for all Unicode strings s, nfkc(ws(s)) = ws(nfkc(s))
> (4) for all Unicode strings s, nfkd(ws(s)) = ws(nfkd(s))
>
> I believe your comment allows me to infer that (3) and (4) are false,
> because for s = x, blank, spacing umlaut, x (U+0078, U+0020, U+00A8,
> U+0078), we have (and the ICU browser does help, thanks a million):
>
>   ws(nfkc(s)) = ws(x, blank, blank, combining umlaut, x)
>               = ws(0078 0020 0020 0308 0078)
>               = 0078 0020 0308 0078
>               = x, blank, combining umlaut, x
>   nfkc(ws(s)) = nfkc(s)
>               = 0078 0020 0020 0308 0078
>               = x, blank, blank, combining umlaut, x
>
> So ws(nfkc(s)) != nfkc(ws(s))
>
> and ditto for nfkd.
>
> For this trivial example, nfc and nfd don't affect the amount of
> whitespace in the string and thus don't interact with whitespace
> normalization. But I don't know -- are there transformations in
> nfc or nfd which do affect whitespace? If so, (1) and (2) are
> also false; if not, they are true.
>
> In a way, the question is academic: if the XML Schema WG decides
> to add a Unicode normalization facet, we can (and presumably will)
> specify that Unicode normalization and whitespace normalization occur
> in a particular order; if an implementation believes it can achieve
> some advantage by transposing the operations, it's going to be the
> implementor's responsibility to figure out whether the transposition
> is safe (guaranteed to produce the same results) or not.
>
> But, well, you know, it would be nice to know, even if the knowledge
> doesn't turn up in the XSDL spec. (But, I guess, not SO nice to
> know that I'm willing to work through Standard Annex #15 and the
> relevant parts of Unicode myself, to figure out the answer. Hence
> my desire to consult an oracle.)
>
> Thanks again.
>
> Michael Sperberg-McQueen
>

--
Mark
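
For anyone who wants to replay the counterexample quoted above, here is a minimal sketch in Python. It is not the "little program" mentioned in the reply. It assumes that ws means the XML Schema whiteSpace "collapse" behaviour (tab, LF and CR become spaces, runs of spaces collapse, leading and trailing spaces are dropped); the names ws and commutes are illustrative only, and the results reflect whatever Unicode version the local Python ships with.

```python
import re
import unicodedata


def ws(s: str) -> str:
    """Assumed model of XML Schema whiteSpace=collapse."""
    s = re.sub(r"[\t\n\r]", " ", s)  # replace tab, LF, CR with a space
    s = re.sub(r" +", " ", s)        # collapse runs of spaces
    return s.strip(" ")              # drop leading and trailing spaces


def commutes(form: str, s: str) -> bool:
    """True if normalizing then collapsing equals collapsing then normalizing."""
    def norm(t: str) -> str:
        return unicodedata.normalize(form, t)
    return norm(ws(s)) == ws(norm(s))


# The string from the quoted message: x, blank, spacing umlaut (U+00A8), x
s = "x \u00a8x"

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, commutes(form, s))
# NFC True
# NFD True
# NFKC False
# NFKD False
```

A single string can only refute commutation, as it does for NFKC and NFKD here. The True results for NFC and NFD show nothing beyond this one input.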
Received on Saturday, 27 October 2007 21:36:38 UTC