- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Wed, 18 Feb 2009 22:05:34 +0100
- To: www-archive@w3.org
Hi, The following is a proof that the sets of strings in NFC, NFD, NFKC, and NFKD are regular languages, at least as of the current version of Unicode. This is not entirely obvious from the specification because it contains no statement to this effect and because the forms as well as the Hangul composition and decomposition are defined as algorithms. The main implications are that deciding whether a string is in one of these normalization forms takes linear time and constant memory, and it is possible to reason about the forms using formal language theory. First we'll see that a Unicode string is in Normalization Form C iff it: * contains no NFC_QuickCheck = No code point and * contains no exchangable pair and * contains no disappearing NFC_QuickCheck = Maybe code point The UAX #15 specification contains a quickCheck() function that tell us whether a string is, is not, or may be in Normalization Form C. It will return "NO" if and only if a string meets one of the first two criteria. Assuming the correctness of this function, the first two criteria are necessary for a string to be in NFC. For the third criterion we first observe that a code point with the NFC_ QuickCheck property set to Maybe doesn't have a canonical decomposition. That means it is not removed during canonical decomposition or canonical reordering. We say that such a code point disappears if it is removed in the process of canonical recomposition. It is then obvious that the third criterion is also necessary: if the code point disappears the resulting string cannot be identical to the input string. So we only have to show that, given a string that meets all three criteria, it must be in NFC. Suppose we have a sequence of code points a b c d e ... We write a(n) for the n-th code point in the canonical decomposition of the code point a. Then the decomposed string will look like this: a(1) a(2) a(3) ... b(1) b(2) ... c(1) ... d(1) ... e(1) ... Recall that for all code points c with NFC_QuickCheck = Yes or Maybe it holds that NFC(c) is identical to c. Therefore, if we immediately re- combine the code points without canonically reordering them, we would obtain a string that is identical to the original. So we only have to consider the effect of canonical reordering, in par- ticular, only the case where reordering results in a code point blocking two others from recombining that were combined in the original. Then we have two cases, one is that reordering results in a string containing p(i) q(j) p(k) where q(j) blocks p(k) from recombining with p(i). That would imply that either q(j) is a starter or that q(j) has the same canonical combining class as p(k). However, that would mean the pair <p(k), q(j)> is not an exchangable pair and thus we would never have put them into the order above. The other possibility is that the reordered string has r(l) s(m) r(n) where s(m) primary combines with r(l) but r(n) does not combine with the combined code point. Suppose the code point `s` had NFC_QC = Yes. Then it would be a starter and thus never move before the r(n). So it must be NFC_QC = Maybe. However, then `s` would have no canonical decomposition as pointed out above, so s(m) is identical to s, and as s combines with r(l) it would be a code point in the original string that disappears, which would contradict our third criterion. It follows that taken together the three criteria are both necessary and sufficient. Let us recall them: A Unicode string is in NFC iff it * contains no NFC_QuickCheck = No code point and * contains no exchangable pair and * contains no disappearing NFC_QuickCheck = Maybe code point That is the intersection of three languages. The first two are simple substring searches meaning they are regular. The third looks for strings of the form <p q* r> where - p is a starter with NFC_QuickCheck = Yes or Maybe - q are the code points that do not block r from combining with p - r has NFC_QuickCheck = Maybe such that with d = NFD(p) while the last code point in d has a combining class > ccc(r) remove that code point from d d = NFC(d) # after this d is a single code point as explained above d = NFC(concat(d, r)) ... the string `d` is a single code point, or in other words, where r is a code point that disappears. So we have a finite set of <p q* r> sub- strings, so the language for the third criterion is also a regular language. We can conclude that the language of NFC strings is a regular language. The NFD_QC and NFKD_QC properties are binary properties so NFD and NFKD are regular languages aswell. Strings in NFKC are NFC strings that do not contain particular characters, namely those with NFKC_QC = No. That implies NFKC is likewise a regular language. regards, -- Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de 25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Wednesday, 18 February 2009 21:06:09 UTC