NFC, NFD, NFKC, NFKD are regular languages from Bjoern Hoehrmann on 2009-02-18 (www-archive@w3.org from February 2009)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Wed, 18 Feb 2009 22:05:34 +0100
To: www-archive@w3.org
Message-ID: <jmqop4h4c7ddpopo3oggfgn91mcl8jnh0h@hive.bjoern.hoehrmann.de>
Hi,

  The following is a proof that the sets of strings in NFC, NFD, NFKC,
and NFKD are regular languages, at least as of the current version of
Unicode. This is not entirely obvious from the specification because it
contains no statement to this effect and because the forms as well as
the Hangul composition and decomposition are defined as algorithms.

The main implications are that deciding whether a string is in one of
these normalization forms takes linear time and constant memory, and
it is possible to reason about the forms using formal language theory.
First we'll see that a Unicode string is in Normalization Form C iff it:

  * contains no NFC_QuickCheck = No code point and
  * contains no exchangable pair and
  * contains no disappearing NFC_QuickCheck = Maybe code point

The UAX #15 specification contains a quickCheck() function that tell us
whether a string is, is not, or may be in Normalization Form C. It will
return "NO" if and only if a string meets one of the first two criteria.
Assuming the correctness of this function, the first two criteria are
necessary for a string to be in NFC.

For the third criterion we first observe that a code point with the NFC_
QuickCheck property set to Maybe doesn't have a canonical decomposition.
That means it is not removed during canonical decomposition or canonical
reordering. We say that such a code point disappears if it is removed in
the process of canonical recomposition.

It is then obvious that the third criterion is also necessary: if the
code point disappears the resulting string cannot be identical to the
input string. So we only have to show that, given a string that meets
all three criteria, it must be in NFC. Suppose we have a sequence of
code points

  a b c d e ...

We write a(n) for the n-th code point in the canonical decomposition
of the code point a. Then the decomposed string will look like this:

  a(1) a(2) a(3) ... b(1) b(2) ... c(1) ... d(1) ... e(1) ...

Recall that for all code points c with NFC_QuickCheck = Yes or Maybe it
holds that NFC(c) is identical to c. Therefore, if we immediately re-
combine the code points without canonically reordering them, we would
obtain a string that is identical to the original.

So we only have to consider the effect of canonical reordering, in par-
ticular, only the case where reordering results in a code point blocking
two others from recombining that were combined in the original. Then we
have two cases, one is that reordering results in a string containing

  p(i) q(j) p(k)

where q(j) blocks p(k) from recombining with p(i). That would imply that
either q(j) is a starter or that q(j) has the same canonical combining
class as p(k). However, that would mean the pair

  <p(k), q(j)>

is not an exchangable pair and thus we would never have put them into
the order above. The other possibility is that the reordered string has

  r(l) s(m) r(n)

where s(m) primary combines with r(l) but r(n) does not combine with the
combined code point. Suppose the code point `s` had NFC_QC = Yes. Then
it would be a starter and thus never move before the r(n).

So it must be NFC_QC = Maybe. However, then `s` would have no canonical
decomposition as pointed out above, so s(m) is identical to s, and as s
combines with r(l) it would be a code point in the original string that
disappears, which would contradict our third criterion.

It follows that taken together the three criteria are both necessary and
sufficient. Let us recall them: A Unicode string is in NFC iff it

  * contains no NFC_QuickCheck = No code point and
  * contains no exchangable pair and
  * contains no disappearing NFC_QuickCheck = Maybe code point

That is the intersection of three languages. The first two are simple
substring searches meaning they are regular. The third looks for strings
of the form <p q* r> where

  - p is a starter with NFC_QuickCheck = Yes or Maybe
  - q are the code points that do not block r from combining with p
  - r has NFC_QuickCheck = Maybe

such that with

  d = NFD(p)

  while the last code point in d has a combining class > ccc(r)
    remove that code point from d

  d = NFC(d) # after this d is a single code point as explained above

  d = NFC(concat(d, r))

... the string `d` is a single code point, or in other words, where r is
a code point that disappears. So we have a finite set of <p q* r> sub-
strings, so the language for the third criterion is also a regular
language. We can conclude that the language of NFC strings is a regular
language.

The NFD_QC and NFKD_QC properties are binary properties so NFD and NFKD
are regular languages aswell. Strings in NFKC are NFC strings that do
not contain particular characters, namely those with NFKC_QC = No. That
implies NFKC is likewise a regular language.

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Wednesday, 18 February 2009 21:06:09 UTC