Unicode normalization and whitespace normalization (was: Re: Update on Bug 3245) from C. M. Sperberg-McQueen on 2007-10-25 (public-i18n-core@w3.org from October to December 2007)

From: C. M. Sperberg-McQueen <cmsmcq@acm.org>
Date: Wed, 24 Oct 2007 20:15:35 -0600
To: Mark Davis <mark.davis@icu-project.org>
Cc: "C. M. Sperberg-McQueen" <cmsmcq@acm.org>, public-i18n-core@w3.org
Message-Id: <7BEB5C24-E917-4259-87A4-955AD01EF8FC@acm.org>

On 16 October, Mark Davis wrote:

> There is a factual problem in the example.
>
> The normalized form of <space, combining umlaut> is <space, combining
> umlaut> in all cases; it does not change under normalization. The
> normalized form of <00a8> remains the same (00a8) under NFC and NFD: it
> only changes in the compatibility forms to <space, combining umlaut>. So
> if normalization form C is being discussed, then the example needs to be
> changed.
>
> If you have any questions about particular normalizations, the icu
> browser is helpful.
>
> http://demo.icu-project.org/icu-bin/nbrowser

Thanks for the correction.  I've noted the error in the minutes of the
meeting where the example was concocted, and will try not to propagate
it further.

The crucial question, which the ICU browser might help me answer if I
had more background knowledge (but not in my current state of
innocence), is: if the process of Unicode normalization is represented
by functions nfc, nfd, nfkc, and nfkd, and the process of whitespace
normalization as specified in XML is represented by function ws, then
which of the following are true and which are false?

   (1) for all Unicode strings s, nfc(ws(s)) = ws(nfc(s))
   (2) for all Unicode strings s, nfd(ws(s)) = ws(nfd(s))
   (3) for all Unicode strings s, nfkc(ws(s)) = ws(nfkc(s))
   (4) for all Unicode strings s, nfkd(ws(s)) = ws(nfkd(s))

I believe your comment allows me to infer that (3) and (4) are false,
because for s = x, blank, spacing umlaut, x (U+0078, U+0020, U+00A8,
U+0078), we have (and the ICU browser does help, thanks a million):

   ws(nfkc(s)) = ws(x, blank, blank, combining umlaut, x)
               = ws(0078 0020 0020 0308 0078)
               = 0078 0020 0308 0078
               = x, blank, combining umlaut, x
   nfkc(ws(s)) = nfkc(s)     (ws(s) = s, since s has only a single blank)
               = 0078 0020 0020 0308 0078
               = x, blank, blank, combining umlaut, x

   So ws(nfkc(s)) != nfkc(ws(s))

and ditto for nfkd.
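
The counterexample above can be checked mechanically.  What follows is
a minimal sketch, not a definitive implementation: it assumes that ws
means the "collapse" behaviour (tab, LF and CR become blanks, runs of
blanks shrink to one, leading and trailing blanks are dropped), and it
uses Python's standard unicodedata module.  The names ws, nfc, nfkc,
nfkd and hexes are illustrative helpers, not anything from a spec.

```python
import re
import unicodedata

# XML whitespace is only U+0020, U+0009, U+000A, U+000D (assumption:
# ws is the "collapse" flavour of whitespace normalization).
_XML_WS = re.compile(r"[\x20\x09\x0A\x0D]+")

def ws(s):
    """Collapse runs of XML whitespace to one blank and trim the ends."""
    return _XML_WS.sub(" ", s).strip(" ")

def nfc(s):
    return unicodedata.normalize("NFC", s)

def nfkc(s):
    return unicodedata.normalize("NFKC", s)

def nfkd(s):
    return unicodedata.normalize("NFKD", s)

def hexes(s):
    """Render a string as space-separated four-digit code points."""
    return " ".join(f"{ord(c):04X}" for c in s)

# s = x, blank, spacing umlaut (U+00A8), x
s = "x \u00A8x"

print(hexes(ws(nfkc(s))))   # 0078 0020 0308 0078
print(hexes(nfkc(ws(s))))   # 0078 0020 0020 0308 0078
print(ws(nfkc(s)) == nfkc(ws(s)))   # False
print(ws(nfc(s)) == nfc(ws(s)))     # True, for this one string only
```

The last line says nothing about (1) in general; it only confirms that
this particular s does not distinguish the two orders under nfc.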

For this trivial example, nfc and nfd don't affect the amount of
whitespace in the string and thus don't interact with whitespace
normalization.  But I don't know -- are there transformations in
nfc or nfd which do affect whitespace?  If so, (1) and (2) are
also false; if not, they are true.
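
One way to start probing that question is to scan the character
database for decomposition mappings that involve the four XML
whitespace characters.  The sketch below is exploratory only: the
function name mappings_touching_ws is hypothetical, the result depends
on the Unicode version bundled with the Python in use, and a scan of
decomposition mappings does not by itself settle (1) and (2), since
canonical reordering and composition would also have to be considered.

```python
import sys
import unicodedata

# The four characters XML treats as whitespace.
XML_WS = {0x20, 0x09, 0x0A, 0x0D}

def mappings_touching_ws(canonical_only=True):
    """List (code point, mapping) pairs whose decomposition mapping
    starts from, or produces, an XML whitespace character.

    With canonical_only=True, tagged (compatibility) mappings such as
    "<compat> 0020 0308" are skipped, leaving only the mappings that
    NFC and NFD apply.
    """
    hits = []
    for cp in range(sys.maxunicode + 1):
        mapping = unicodedata.decomposition(chr(cp))
        if not mapping:
            continue                      # no decomposition at all
        is_compat = mapping.startswith("<")
        if canonical_only and is_compat:
            continue
        # Drop the "<tag>" field, if any, and parse the hex targets.
        fields = mapping.split()
        if is_compat:
            fields = fields[1:]
        targets = {int(f, 16) for f in fields}
        if cp in XML_WS or targets & XML_WS:
            hits.append((cp, mapping))
    return hits

canonical_hits = mappings_touching_ws(canonical_only=True)
all_hits = mappings_touching_ws(canonical_only=False)
print(len(canonical_hits), "canonical mappings involve XML whitespace")
print(len(all_hits), "mappings of any kind involve XML whitespace")
```

Any entry in canonical_hits would be a place to look for a
counterexample to (1) or (2); the compatibility entries in all_hits
(U+00A8 among them) are the ones behind the failure of (3) and (4).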

In a way, the question is academic:  if the XML Schema WG decides
to add a Unicode normalization facet, we can (and presumably will)
specify that Unicode normalization and whitespace normalization occur
in a particular order; if an implementation believes it can achieve
some advantage by transposing the operations, it's going to be the
implementor's responsibility to figure out whether the transposition
is safe (guaranteed to produce the same results) or not.

But, well, you know, it would be nice to know, even if the knowledge
doesn't turn up in the XSDL spec.  (But, I guess, not SO nice to
know that I'm willing to work through Unicode Standard Annex #15 and
the relevant parts of the Unicode Standard myself, to figure out the
answer.  Hence
my desire to consult an oracle.)

Thanks again.

Michael Sperberg-McQueen

Received on Thursday, 25 October 2007 02:15:46 UTC