Re: Unicode normalization and whitespace normalization (was: Re: Update on Bug 3245) from Mark Davis on 2007-10-27 (public-i18n-core@w3.org from October to December 2007)

From: Mark Davis <mark.davis@icu-project.org>
Date: Sat, 27 Oct 2007 14:36:28 -0700
To: "C. M. Sperberg-McQueen" <cmsmcq@acm.org>
Cc: public-i18n-core@w3.org
Message-ID: <30b660a20710271436u3e319587l65acaf520ff644c5@mail.gmail.com>

Nice to hear from you. Yes, NFKC and NFKD don't commute with the XML ws
function. I believe that NFC and NFD do, but will have to write a little
program to verify that.
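
One way such a check might be sketched in Python (an illustration only, not
the program referred to above). It assumes that "ws" means the collapse-style
whitespace normalization: tab, LF and CR become spaces, runs of spaces shrink
to one, and leading and trailing spaces are dropped. The names ws, commutes
and whitespace_sources are hypothetical helpers, and the results depend on the
Unicode version bundled with the Python build. The scan over single code
points is only evidence, not a proof for all strings.

```python
import re
import unicodedata

XML_WS = set(" \t\n\r")  # the four XML whitespace characters


def ws(s: str) -> str:
    """Assumed model of XML whitespace normalization (collapse)."""
    # Replace every run of XML whitespace with one space, then trim the ends.
    return re.sub(r"[ \t\n\r]+", " ", s).strip(" ")


def commutes(form: str, s: str) -> bool:
    """True if normalizing and whitespace-collapsing s agree in either order."""
    return unicodedata.normalize(form, ws(s)) == ws(unicodedata.normalize(form, s))


def whitespace_sources(form: str) -> list:
    """Non-whitespace code points whose normalized form contains XML whitespace."""
    hits = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:  # skip surrogates
            continue
        ch = chr(cp)
        if ch in XML_WS:
            continue
        if XML_WS & set(unicodedata.normalize(form, ch)):
            hits.append(cp)
    return hits


# The string from the quoted message: x, blank, spacing umlaut, x.
s = "x \u00a8x"
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, "commutes with ws on s:", commutes(form, s))
```

Calling whitespace_sources("NFD") and whitespace_sources("NFKD") lists the
code points that introduce whitespace under each form, which is the property
that breaks commutation for the compatibility forms.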

If it is the case, that would only be guaranteed for the current version of
Unicode. While I wouldn't foresee that changing in the future, if you really
want a guarantee that it will continue to work that way in future versions
of Unicode, we'd have to propose that it be one of the Unicode stability
policies. See http://www.unicode.org/standard/stability_policy.html for
examples.

NFKC and NFKD are mostly useful for loose matching, since they lose
formatting information.
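
A small illustration of that point, as a sketch in Python. The helper name
loose_equal is hypothetical, and NFKC is used here only as one possible basis
for loose matching:

```python
import unicodedata


def loose_equal(a: str, b: str) -> bool:
    """Compare two strings after NFKC, ignoring compatibility distinctions."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


# U+FB01 (fi ligature) and U+00B2 (superscript two) carry formatting
# information that NFKC discards and NFC keeps.
print(loose_equal("\ufb01le", "file"))                       # ligature vs. plain letters
print(loose_equal("x\u00b2", "x2"))                          # superscript vs. plain digit
print(unicodedata.normalize("NFC", "x\u00b2") == "x2")       # NFC keeps the superscript
```

The second comparison shows why the compatibility forms are a poor fit for
anything other than matching: "x squared" and "x2" become indistinguishable.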

Mark

On 10/24/07, C. M. Sperberg-McQueen <cmsmcq@acm.org> wrote:
>
> On 16 October, Mark Davis wrote:
>
> > There is a factual problem in the example.
> >
> > The normalized form of <space, combining umlaut> is <space, combining
> > umlaut> in all cases; it does not change under normalization. The
> > normalized form of <00a8> remains the same (00a8) under NFC and NFD: it
> > only changes in the compatibility forms to <space, combining umlaut>. So
> > if normalization form C is being discussed, then the example needs to be
> > changed.
> >
> > If you have any questions about particular normalizations, the icu
> > browser is helpful.
> >
> > http://demo.icu-project.org/icu-bin/nbrowser
>
> Thanks for the correction.  I've noted the error in the minutes of the
> meeting where the example was concocted, and will try not to propagate
> it further.
>
> The crucial question, which the ICU browser might help me answer if I
> had more background knowledge (but not in my current state of
> innocence), is: if the process of Unicode normalization is represented
> by functions nfc, nfd, nfkc, and nfkd, and the process of whitespace
> normalization as specified in XML is represented by function ws, then
> which of the following are true and which are false?
>
>    (1) for all Unicode strings s, nfc(ws(s)) = ws(nfc(s))
>    (2) for all Unicode strings s, nfd(ws(s)) = ws(nfd(s))
>    (3) for all Unicode strings s, nfkc(ws(s)) = ws(nfkc(s))
>    (4) for all Unicode strings s, nfkd(ws(s)) = ws(nfkd(s))
>
> I believe your comment allows me to infer that (3) and (4) are false,
> because for s = x, blank, spacing umlaut, x (U+0078, U+0020, U+00A8,
> U+0078), we have (and the ICU browser does help, thanks a million):
>
>    ws(nfkc(s)) = ws(x, blank, blank, combining umlaut, x)
>                = ws(0078 0020 0020 0308 0078)
>                = 0078 0020 0308 0078
>                = x, blank, combining umlaut, x
>    nfkc(ws(s)) = nfkc(s)
>                = 0078 0020 0020 0308 0078
>                = x, blank, blank, combining umlaut, x
>
>    So ws(nfkc(s)) != nfkc(ws(s))
>
> and ditto for nfkd.
>
> For this trivial example, nfc and nfd don't affect the amount of
> whitespace in the string and thus don't interact with whitespace
> normalization.  But I don't know -- are there transformations in
> nfc or nfd which do affect whitespace?  If so, (1) and (2) are
> also false; if not, they are true.
>
> In a way, the question is academic:  if the XML Schema WG decides
> to add a Unicode normalization facet, we can (and presumably will)
> specify that Unicode normalization and whitespace normalization occur
> in a particular order; if an implementation believes it can achieve
> some advantage by transposing the operations, it's going to be the
> implementor's responsibility to figure out whether the transposition
> is safe (guaranteed to produce the same results) or not.
>
> But, well, you know, it would be nice to know, even if the knowledge
> doesn't turn up in the XSDL spec.  (But, I guess, not SO nice to
> know that I'm willing to work through Standard Annex #15 and the
> relevant parts of Unicode myself, to figure out the answer.  Hence
> my desire to consult an oracle.)
>
> Thanks again.
>
> Michael Sperberg-McQueen
>
>


-- 
Mark

Received on Saturday, 27 October 2007 21:36:38 UTC