Re: Unicode normalization and whitespace normalization from Felix Sasaki on 2007-10-29 (public-i18n-core@w3.org from October to December 2007)

From: Felix Sasaki <fsasaki@w3.org>
Date: Mon, 29 Oct 2007 20:58:54 +0900
To: Mark Davis <mark.davis@icu-project.org>, "C. M. Sperberg-McQueen" <cmsmcq@acm.org>
CC: public-i18n-core@w3.org
Message-ID: <4725CAFE.805@w3.org>

Hello Mark,

Thank you very much for your explanation!

Mark Davis wrote:
> Nice to hear from you. Yes, NFKC and NFKD don't commute with the XML 
> ws function. I believe that NFC and NFD do, but will have to write a 
> little program to verify that.

Would it be possible to see the script or (even better) the algorithm 
needed for the verification? Or could you point me and others to 
the relevant part of the normalization specification?
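
For illustration, here is a minimal sketch of one such check. It is not 
Mark's program. It assumes Python with its standard unicodedata module, 
and the helper name touches_whitespace is made up here. It only reports 
code points whose normalized form adds or removes one of the four XML 
whitespace characters, which is a necessary condition rather than a 
proof, and the result reflects only the Unicode version bundled with the 
Python runtime in use.

```python
import sys
import unicodedata

# The four characters XML treats as whitespace: space, tab, LF, CR.
XML_WS = {"\u0020", "\u0009", "\u000A", "\u000D"}


def touches_whitespace(form):
    """Return the code points whose `form` normalization changes them and
    either starts from or produces an XML whitespace character."""
    hits = []
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not Unicode scalar values
        ch = chr(cp)
        out = unicodedata.normalize(form, ch)
        if out != ch and (ch in XML_WS or XML_WS & set(out)):
            hits.append(cp)
    return hits


if __name__ == "__main__":
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        hits = touches_whitespace(form)
        sample = ", ".join(f"U+{cp:04X}" for cp in hits[:5])
        print(f"{form}: {len(hits)} code point(s) touch XML whitespace {sample}")
```

An empty list for NFC and NFD would be consistent with Mark's belief 
above; a non-empty list for NFKC and NFKD (U+00A8 among them) matches 
the counterexample quoted below.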

>
> If it is the case, that would only be guaranteed for the current 
> version of Unicode. While I wouldn't foresee that changing in the 
> future, if you really want a guarantee that it will continue to work 
> that way in future versions of Unicode, we'd have to propose that that 
> be one of the Unicode stability policies.... See 
> http://www.unicode.org/standard/stability_policy.html for examples.

As I understand it, this might be helpful for the resolution Michael 
proposed for the XML Schema 1.1 issue 
http://www.w3.org/Bugs/Public/show_bug.cgi?id=3245#c9 :

having a facet that normalizes to NFC, with whitespace normalization 
applied afterwards. Michael, am I right?
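
To make the ordering concrete, here is a small sketch, again assuming 
Python and its standard unicodedata module. The function names ws, 
commutes and nfc_then_ws are made up for this illustration, and ws is 
my reading of the "collapse" whitespace behaviour, not text taken from 
the spec. It replays the counterexample from Michael's mail quoted 
below and then applies NFC followed by whitespace normalization.

```python
import re
import unicodedata


def ws(s):
    """Whitespace 'collapse' as I read it: runs of space, tab, LF and CR
    become one space, and leading/trailing spaces are dropped."""
    return re.sub(r"[\u0020\u0009\u000A\u000D]+", " ", s).strip(" ")


def commutes(form, s):
    """True if normalizing then collapsing equals collapsing then normalizing."""
    def norm(t):
        return unicodedata.normalize(form, t)
    return norm(ws(s)) == ws(norm(s))


def nfc_then_ws(s):
    """The order discussed above: NFC first, whitespace normalization after."""
    return ws(unicodedata.normalize("NFC", s))


if __name__ == "__main__":
    # x, blank, spacing umlaut (U+00A8), x
    s = "x \u00A8x"
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        print(form, commutes(form, s))
    # prints: NFC True, NFD True, NFKC False, NFKD False (one per line)
    print(repr(nfc_then_ws("  x\t\u00A8 x ")))
```

A single string only shows that NFKC and NFKD fail to commute with ws; 
it says nothing general about NFC and NFD.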

Felix

>
> NFKC and NFKD are mostly useful for loose matching, since they lose 
> formatting information.
>
> Mark
>
> On 10/24/07, C. M. Sperberg-McQueen <cmsmcq@acm.org> wrote:
>
>     On 16 October, Mark Davis wrote:
>
>     > There is a factual problem in the example.
>     >
>     > The normalized form of <space, combining umlaut> is <space, combining
>     > umlaut> in all cases; it does not change under normalization. The
>     > normalized form of <00a8> remains the same (00a8) under NFC and NFD:
>     > it only changes in the compatibility forms to <space, combining
>     > umlaut>. So if normalization form C is being discussed, then the
>     > example needs to be changed.
>     >
>     > If you have any questions about particular normalizations, the icu
>     > browser is helpful.
>     >
>     > http://demo.icu-project.org/icu-bin/nbrowser
>
>     Thanks for the correction.  I've noted the error in the minutes of the
>     meeting where the example was concocted, and will try not to propagate
>     it further.
>
>     The crucial question, which the ICU browser might help me answer if I
>     had more background knowledge (but not in my current state of
>     innocence), is: if the process of Unicode normalization is represented
>     by functions nfc, nfd, nfkc, and nfkd, and the process of whitespace
>     normalization as specified in XML is represented by function ws, then
>     which of the following are true and which are false?
>
>        (1) for all Unicode strings s, nfc(ws(s)) = ws(nfc(s))
>        (2) for all Unicode strings s, nfd(ws(s)) = ws(nfd(s))
>        (3) for all Unicode strings s, nfkc(ws(s)) = ws(nfkc(s))
>        (4) for all Unicode strings s, nfkd(ws(s)) = ws(nfkd(s))
>
>     I believe your comment allows me to infer that (3) and (4) are false,
>     because for s = x, blank, spacing umlaut, x (U+0078, U+0020, U+00A8,
>     U+0078), we have (and the ICU browser does help, thanks a million):
>
>        ws(nfkc(s)) = ws(x, blank, blank, combining umlaut, x)
>                    = ws(0078 0020 0020 0308 0078)
>                    = 0078 0020 0308 0078
>                    = x, blank, combining umlaut, x
>        nfkc(ws(s)) = nfkc(s)
>                    = 0078 0020 0020 0308 0078
>                    = x, blank, blank, combining umlaut, x
>
>        So ws(nfkc(s)) != nfkc(ws(s))
>
>     and ditto for nfkd.
>
>     For this trivial example, nfc and nfd don't affect the amount of
>     whitespace in the string and thus don't interact with whitespace
>     normalization.  But I don't know -- are there transformations in
>     nfc or nfd which do affect whitespace?  If so, (1) and (2) are
>     also false; if not, they are true.
>
>     In a way, the question is academic:  if the XML Schema WG decides
>     to add a Unicode normalization facet, we can (and presumably will)
>     specify that Unicode normalization and whitespace normalization occur
>     in a particular order; if an implementation believes it can achieve
>     some advantage by transposing the operations, it's going to be the
>     implementor's responsibility to figure out whether the transposition
>     is safe (guaranteed to produce the same results) or not.
>
>     But, well, you know, it would be nice to know, even if the knowledge
>     doesn't turn up in the XSDL spec.  (But, I guess, not SO nice to
>     know that I'm willing to work through Standard Annex #15 and the
>     relevant parts of Unicode myself, to figure out the answer.  Hence
>     my desire to consult an oracle.)
>
>     Thanks again.
>
>     Michael Sperberg-McQueen
>
>
>
>
> -- 
> Mark
Received on Monday, 29 October 2007 11:59:16 UTC