- From: C. M. Sperberg-McQueen <cmsmcq@acm.org>
- Date: Wed, 24 Oct 2007 20:15:35 -0600
- To: Mark Davis <mark.davis@icu-project.org>
- Cc: "C. M. Sperberg-McQueen" <cmsmcq@acm.org>, public-i18n-core@w3.org

On 16 October, Mark Davis wrote:

> There is a factual problem in the example.
>
> The normalized form of <space, combining umlaut> is <space, combining
> umlaut> in all cases; it does not change under normalization. The
> normalized form of <00a8> remains the same (00a8) under NFC and NFD:
> it only changes in the compatibility forms to <space, combining
> umlaut>. So if normalization form C is being discussed, then the
> example needs to be changed.
>
> If you have any questions about particular normalizations, the icu
> browser is helpful.
>
> http://demo.icu-project.org/icu-bin/nbrowser

Thanks for the correction. I've noted the error in the minutes of the meeting where the example was concocted, and will try not to propagate it further.

The crucial question, which the ICU browser might help me answer if I had more background knowledge (but not in my current state of innocence), is: if the process of Unicode normalization is represented by functions nfc, nfd, nfkc, and nfkd, and the process of whitespace normalization as specified in XML is represented by function ws, then which of the following are true and which are false?

  (1) for all Unicode strings s, nfc(ws(s)) = ws(nfc(s))
  (2) for all Unicode strings s, nfd(ws(s)) = ws(nfd(s))
  (3) for all Unicode strings s, nfkc(ws(s)) = ws(nfkc(s))
  (4) for all Unicode strings s, nfkd(ws(s)) = ws(nfkd(s))

I believe your comment allows me to infer that (3) and (4) are false, because for s = x, blank, spacing umlaut, x (U+0078, U+0020, U+00A8, U+0078), we have (and the ICU browser does help, thanks a million):

  ws(nfkc(s)) = ws(x, blank, blank, combining umlaut, x)
              = ws(0078 0020 0020 0308 0078)
              = 0078 0020 0308 0078
              = x, blank, combining umlaut, x

  nfkc(ws(s)) = nfkc(s)
              = 0078 0020 0020 0308 0078
              = x, blank, blank, combining umlaut, x

So ws(nfkc(s)) != nfkc(ws(s)) and ditto for nfkd.
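
The counterexample can also be checked mechanically. What follows is only a minimal sketch in Python, under the assumption that ws means the XML Schema whiteSpace "collapse" behaviour (tab, line feed and carriage return replaced by a blank, runs of blanks collapsed, leading and trailing blanks removed). The function names ws and commutes are illustrative and do not come from any specification.

```python
import re
import unicodedata


def ws(s: str) -> str:
    """Whitespace normalization, assumed here to be XML Schema 'collapse'."""
    s = re.sub(r"[\t\n\r]", " ", s)   # replace #x9, #xA, #xD with #x20
    s = re.sub(r" +", " ", s)         # collapse runs of #x20 to one
    return s.strip(" ")               # drop leading and trailing #x20


def commutes(form: str, s: str) -> bool:
    """True if Unicode normalization `form` and ws give the same result
    in either order for this one string s."""
    return unicodedata.normalize(form, ws(s)) == ws(unicodedata.normalize(form, s))


# s = x, blank, spacing umlaut, x (U+0078, U+0020, U+00A8, U+0078)
s = "x \u00a8x"

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, commutes(form, s))
```

For this particular s the check reports that NFC and NFD commute with ws, while NFKC and NFKD do not. A single string of course settles nothing about (1) and (2) in general.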

For this trivial example, nfc and nfd don't affect the amount of whitespace in the string and thus don't interact with whitespace normalization. But I don't know -- are there transformations in nfc or nfd which do affect whitespace? If so, (1) and (2) are also false; if not, they are true.

In a way, the question is academic: if the XML Schema WG decides to add a Unicode normalization facet, we can (and presumably will) specify that Unicode normalization and whitespace normalization occur in a particular order; if an implementation believes it can achieve some advantage by transposing the operations, it's going to be the implementor's responsibility to figure out whether the transposition is safe (guaranteed to produce the same results) or not.

But, well, you know, it would be nice to know, even if the knowledge doesn't turn up in the XSDL spec. (But, I guess, not SO nice to know that I'm willing to work through Standard Annex #15 and the relevant parts of Unicode myself, to figure out the answer. Hence my desire to consult an oracle.)

Thanks again.

Michael Sperberg-McQueen

Received on Thursday, 25 October 2007 02:15:46 UTC