- From: Mark Davis <mark.davis@icu-project.org>
- Date: Mon, 29 Oct 2007 08:30:33 -0700
- To: "Felix Sasaki" <fsasaki@w3.org>
- Cc: "C. M. Sperberg-McQueen" <cmsmcq@acm.org>, public-i18n-core@w3.org
- Message-ID: <30b660a20710290830m248af6abgb78412d850f7ae38@mail.gmail.com>
The algorithm would be simple: check each Unicode character C and get its
normalization N (for whichever form you are testing). If C != N, see
whether there is whitespace (that would be affected by XML) in C and not
N, or in N and not C.

Mark

On 10/29/07, Felix Sasaki <fsasaki@w3.org> wrote:
>
> Hello Mark,
>
> thank you very much for your explanation!
>
> Mark Davis wrote:
> > Nice to hear from you. Yes, NFKC and NFKD don't commute with the XML
> > ws function. I believe that NFC and NFD do, but will have to write a
> > little program to verify that.
>
> would it be possible to see the script or (even better) the algorithm
> which is necessary for verification? Or could you point me / others to
> the relevant part in the normalization specification?
>
> >
> > If it is the case, that would only be guaranteed for the current
> > version of Unicode. While I wouldn't foresee that changing in the
> > future, if you really want a guarantee that it will continue to work
> > that way in future versions of Unicode, we'd have to propose that that
> > be one of the Unicode stability policies.... See
> > http://www.unicode.org/standard/stability_policy.html for examples.
>
> From my understanding, this might be helpful for the resolution Michael
> proposed for the XML Schema 1.1 issue
> http://www.w3.org/Bugs/Public/show_bug.cgi?id=3245#c9 :
>
> having a facet for creating NFC and doing whitespace normalization
> afterwards. Michael, am I right?
>
> Felix
>
> >
> > NFKC and NFKD are mostly useful for loose matching, since they lose
> > formatting information.
> >
> > Mark
> >
> > On 10/24/07, C. M. Sperberg-McQueen <cmsmcq@acm.org> wrote:
> >
> >     On 16 October, Mark Davis wrote:
> >
> >     > There is a factual problem in the example.
> >     >
> >     > The normalized form of <space, combining umlaut> is <space,
> >     > combining umlaut> in all cases; it does not change under
> >     > normalization. The normalized form of <00a8> remains the same
> >     > (00a8) under NFC and NFD: it only changes in the compatibility
> >     > forms to <space, combining umlaut>. So if normalization form C
> >     > is being discussed, then the example needs to be changed.
> >     >
> >     > If you have any questions about particular normalizations, the
> >     > icu browser is helpful.
> >     >
> >     > http://demo.icu-project.org/icu-bin/nbrowser
> >
> >     Thanks for the correction. I've noted the error in the minutes of
> >     the meeting where the example was concocted, and will try not to
> >     propagate it further.
> >
> >     The crucial question, which the ICU browser might help me answer
> >     if I had more background knowledge (but not in my current state of
> >     innocence), is: if the process of Unicode normalization is
> >     represented by functions nfc, nfd, nfkc, and nfkd, and the process
> >     of whitespace normalization as specified in XML is represented by
> >     function ws, then which of the following are true and which are
> >     false?
> >
> >       (1) for all Unicode strings s, nfc(ws(s)) = ws(nfc(s))
> >       (2) for all Unicode strings s, nfd(ws(s)) = ws(nfd(s))
> >       (3) for all Unicode strings s, nfkc(ws(s)) = ws(nfkc(s))
> >       (4) for all Unicode strings s, nfkd(ws(s)) = ws(nfkd(s))
> >
> >     I believe your comment allows me to infer that (3) and (4) are
> >     false, because for s = x, blank, spacing umlaut, x (U+0078,
> >     U+0020, U+00A8, U+0078), we have (and the ICU browser does help,
> >     thanks a million):
> >
> >       ws(nfkc(s)) = ws(x, blank, blank, combining umlaut, x)
> >                   = ws(0078 0020 0020 0308 0078)
> >                   = 0078 0020 0308 0078
> >                   = x, blank, combining umlaut, x
> >       nfkc(ws(s)) = nfkc(s)
> >                   = 0078 0020 0020 0308 0078
> >                   = x, blank, blank, combining umlaut, x
> >
> >     So ws(nfkc(s)) != nfkc(ws(s))
> >
> >     and ditto for nfkd.
> >
> >     For this trivial example, nfc and nfd don't affect the amount of
> >     whitespace in the string and thus don't interact with whitespace
> >     normalization. But I don't know -- are there transformations in
> >     nfc or nfd which do affect whitespace? If so, (1) and (2) are
> >     also false; if not, they are true.
> >
> >     In a way, the question is academic: if the XML Schema WG decides
> >     to add a Unicode normalization facet, we can (and presumably will)
> >     specify that Unicode normalization and whitespace normalization
> >     occur in a particular order; if an implementation believes it can
> >     achieve some advantage by transposing the operations, it's going
> >     to be the implementor's responsibility to figure out whether the
> >     transposition is safe (guaranteed to produce the same results) or
> >     not.
> >
> >     But, well, you know, it would be nice to know, even if the
> >     knowledge doesn't turn up in the XSDL spec. (But, I guess, not SO
> >     nice to know that I'm willing to work through Standard Annex #15
> >     and the relevant parts of Unicode myself, to figure out the
> >     answer. Hence my desire to consult an oracle.)
> >
> >     Thanks again.
> >
> >     Michael Sperberg-McQueen
> >
> >
> > --
> > Mark
>

--
Mark
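
A minimal sketch of the per-character check described at the top of this message, written in Python with the standard unicodedata module. It is not the program mentioned in the thread. The names XML_WS, ws, and whitespace_changers are hypothetical, ws assumes the "collapse" behaviour (map tab, LF, CR to space, collapse runs, trim), and the results depend on the Unicode version bundled with the Python build. It only reports characters C whose normalization N differs from C in XML whitespace content; it does not by itself prove anything about arbitrary strings.

```python
import re
import unicodedata

# Assumption: "whitespace that would be affected by XML" means the four
# characters of the XML 1.0 production S: space, tab, LF, CR.
XML_WS = {"\u0020", "\u0009", "\u000A", "\u000D"}


def ws(s: str) -> str:
    """Hypothetical helper: collapse XML whitespace runs to one space
    and trim leading/trailing spaces."""
    return re.sub(r"[\u0020\u0009\u000A\u000D]+", " ", s).strip(" ")


def whitespace_changers(form: str) -> list:
    """Return every character C whose normalization N (under `form`,
    e.g. "NFC" or "NFKC") differs from C and contains a different set
    of XML whitespace characters than C does."""
    hits = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points are not characters
        c = chr(cp)
        n = unicodedata.normalize(form, c)
        # Whitespace in C and not N, or in N and not C.
        if c != n and (XML_WS & set(c)) != (XML_WS & set(n)):
            hits.append(c)
    return hits


if __name__ == "__main__":
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        hits = whitespace_changers(form)
        print(form, len(hits), [f"U+{ord(c):04X}" for c in hits[:10]])
```

The same helpers can replay the quoted counterexample: with s = "x \u00A8x", ws(unicodedata.normalize("NFKC", s)) and unicodedata.normalize("NFKC", ws(s)) can be compared directly.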
Received on Monday, 29 October 2007 15:30:55 UTC