Re: Unicode normalization and whitespace normalization from Mark Davis on 2007-10-29 (public-i18n-core@w3.org from October to December 2007)

From: Mark Davis <mark.davis@icu-project.org>
Date: Mon, 29 Oct 2007 08:30:33 -0700
To: "Felix Sasaki" <fsasaki@w3.org>
Cc: "C. M. Sperberg-McQueen" <cmsmcq@acm.org>, public-i18n-core@w3.org
Message-ID: <30b660a20710290830m248af6abgb78412d850f7ae38@mail.gmail.com>
The algorithm would be simple: check each Unicode character C and get its
normalization N (for whichever form you are testing). If C != N, see whether
there is whitespace (of the kind XML whitespace handling affects) in C and
not in N, or in N and not in C.
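
A minimal sketch of the check described above, in Python. It assumes that
"whitespace that would be affected by XML" means the four XML 1.0 whitespace
characters (U+0020, U+0009, U+000D, U+000A); the names XML_WS and
whitespace_changing_chars are made up for this illustration, and the result
reflects whatever Unicode version the interpreter's unicodedata module ships
(see unicodedata.unidata_version).

```python
import sys
import unicodedata

# Assumption: XML 1.0 whitespace is space, tab, carriage return, line feed.
XML_WS = frozenset("\u0020\u0009\u000D\u000A")


def whitespace_changing_chars(form):
    """Return the code points C whose normalization N under `form`
    contains XML whitespace that C lacks, or lacks whitespace that C has."""
    hits = []
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points are not characters
        c = chr(cp)
        n = unicodedata.normalize(form, c)
        if c == n:
            continue  # unchanged by normalization, nothing to compare
        ws_in_c = XML_WS & set(c)
        ws_in_n = XML_WS & set(n)
        if ws_in_c != ws_in_n:  # whitespace in C and not N, or N and not C
            hits.append(cp)
    return hits


if __name__ == "__main__":
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        hits = whitespace_changing_chars(form)
        print(form, len(hits), [f"U+{cp:04X}" for cp in hits[:10]])
```

An empty list for a given form means no single character gains or loses XML
whitespace under that form in the Unicode version tested; it says nothing
about future versions.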

Mark

On 10/29/07, Felix Sasaki <fsasaki@w3.org> wrote:
>
> Hello Mark,
>
> thank you very much for your explanation!
>
> Mark Davis wrote:
> > Nice to hear from you. Yes, NFKC and NFKD don't commute with the XML
> > ws function. I believe that NFC and NFD do, but will have to write a
> > little program to verify that.
>
> would it be possible to see the script or (even better) the algorithm
> which is necessary for verification? Or could you point me / others to
> the relevant part in the normalization specification?
>
> >
> > If it is the case, that would only be guaranteed for the current
> > > version of Unicode. While I wouldn't foresee that changing in the
> > future, if you really want a guarantee that it will continue to work
> > that way in future versions of Unicode, we'd have to propose that that
> > be one of the Unicode stability policies.... See
> > http://www.unicode.org/standard/stability_policy.html for examples.
>
> From my understanding, this might be helpful for the resolution Michael
> proposed for the XML Schema 1.1 issue
> http://www.w3.org/Bugs/Public/show_bug.cgi?id=3245#c9 :
>
> having a facet for creating NFC and doing whitespace normalization
> afterwards. Michael, am I right?
>
> Felix
>
> >
> > NFKC and NFKD are mostly useful for loose matching, since they lose
> > formatting information.
> >
> > Mark
> >
> > On 10/24/07, C. M. Sperberg-McQueen <cmsmcq@acm.org> wrote:
> >
> >     On 16 October, Mark Davis wrote:
> >
> >     > There is a factual problem in the example.
> >     >
> >     > The normalized form of <space, combining umlaut> is <space,
> >     > combining umlaut> in all cases; it does not change under
> >     > normalization. The normalized form of <00a8> remains the same
> >     > (00a8) under NFC and NFD: it only changes in the compatibility
> >     > forms to <space, combining umlaut>. So if normalization form C
> >     > is being discussed, then the example needs to be changed.
> >     >
> >     > If you have any questions about particular normalizations, the
> >     > icu browser is helpful.
> >     >
> >     > http://demo.icu-project.org/icu-bin/nbrowser
> >
> >     Thanks for the correction.  I've noted the error in the minutes of
> >     the meeting where the example was concocted, and will try not to
> >     propagate it further.
> >
> >     The crucial question, which the ICU browser might help me answer if
> >     I had more background knowledge (but not in my current state of
> >     innocence), is: if the process of Unicode normalization is
> >     represented by functions nfc, nfd, nfkc, and nfkd, and the process
> >     of whitespace normalization as specified in XML is represented by
> >     function ws, then which of the following are true and which are
> >     false?
> >
> >        (1) for all Unicode strings s, nfc(ws(s)) = ws(nfc(s))
> >        (2) for all Unicode strings s, nfd(ws(s)) = ws(nfd(s))
> >        (3) for all Unicode strings s, nfkc(ws(s)) = ws(nfkc(s))
> >        (4) for all Unicode strings s, nfkd(ws(s)) = ws(nfkd(s))
> >
> >     I believe your comment allows me to infer that (3) and (4) are
> >     false, because for s = x, blank, spacing umlaut, x (U+0078, U+0020,
> >     U+00A8, U+0078), we have (and the ICU browser does help, thanks a
> >     million):
> >
> >        ws(nfkc(s)) = ws(x, blank, blank, combining umlaut, x)
> >                    = ws(0078 0020 0020 0308 0078)
> >                    = 0078 0020 0308 0078
> >                    = x, blank, combining umlaut, x
> >        nfkc(ws(s)) = nfkc(s)
> >                    = 0078 0020 0020 0308 0078
> >                    = x, blank, blank, combining umlaut, x
> >
> >        So ws(nfkc(s)) != nfkc(ws(s))
> >
> >     and ditto for nfkd.
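
The counterexample above can be replayed with Python's unicodedata module.
This is only a sketch: the ws function below is an assumed stand-in for XML
Schema's "collapse" whitespace handling (map tab, CR and LF to space,
collapse runs of spaces, trim both ends), and the helper name commutes is
invented for this illustration.

```python
import re
import unicodedata


def ws(s):
    # Assumed model of whitespace normalization: collapse each run of
    # XML whitespace to a single space and trim leading/trailing spaces.
    return re.sub(r"[\x20\x09\x0D\x0A]+", " ", s).strip(" ")


def commutes(form, s):
    """True if normalizing then collapsing equals collapsing then normalizing."""
    return ws(unicodedata.normalize(form, s)) == unicodedata.normalize(form, ws(s))


# s = x, blank, spacing umlaut (U+00A8), x
s = "\u0078\u0020\u00A8\u0078"

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    left = ws(unicodedata.normalize(form, s))
    right = unicodedata.normalize(form, ws(s))
    print(form, commutes(form, s),
          " ".join(f"{ord(ch):04X}" for ch in left), "|",
          " ".join(f"{ord(ch):04X}" for ch in right))
```

A single string can only disprove commutation for a form; it cannot prove it
for the forms where the two sides happen to agree.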
> >
> >     For this trivial example, nfc and nfd don't affect the amount of
> >     whitespace in the string and thus don't interact with whitespace
> >     normalization.  But I don't know -- are there transformations in
> >     nfc or nfd which do affect whitespace?  If so, (1) and (2) are
> >     also false; if not, they are true.
> >
> >     In a way, the question is academic:  if the XML Schema WG decides
> >     to add a Unicode normalization facet, we can (and presumably will)
> >     specify that Unicode normalization and whitespace normalization
> >     occur in a particular order; if an implementation believes it can
> >     achieve some advantage by transposing the operations, it's going to
> >     be the implementor's responsibility to figure out whether the
> >     transposition is safe (guaranteed to produce the same results) or
> >     not.
> >
> >     But, well, you know, it would be nice to know, even if the knowledge
> >     doesn't turn up in the XSDL spec.  (But, I guess, not SO nice to
> >     know that I'm willing to work through Standard Annex #15 and the
> >     relevant parts of Unicode myself, to figure out the answer.  Hence
> >     my desire to consult an oracle.)
> >
> >     Thanks again.
> >
> >     Michael Sperberg-McQueen
> >
> >
> >
> >
> > --
> > Mark
>
>
>


-- 
Mark
Received on Monday, 29 October 2007 15:30:55 UTC