Details of Unicode normalization algorithm (non-normative) from John Cowan on 2000-01-18 (www-xml-canonicalization-comments@w3.org from January 2000)

From: John Cowan <jcowan@reutershealth.com>
Date: Tue, 18 Jan 2000 16:32:23 -0500
To: www-xml-canonicalization-comments@w3.org
Message-ID: <3884DBE7.84B653EF@reutershealth.com>
The process of Unicode Form C normalization (which is the kind that
the character model WD recommends) involves three steps.  For background,
Unicode has three kinds of characters (for this purpose): base characters,
combining marks, and precomposed characters (which are a base
character with one or more combining marks attached).  Each combining
mark has a non-zero number associated with it called its "class".

1)      Each precomposed character (for example, LATIN SMALL LETTER A
        WITH ACUTE) is replaced by its decomposed form (in this case,
        LATIN SMALL LETTER A followed by COMBINING ACUTE ACCENT). If the
        first character of the result is still precomposed, it
        needs to be decomposed recursively.  At this point, the
        "deprecated" characters (those which are listed in the Unicode
        database as canonically equivalent to a single character) are
        replaced by their equivalent counterparts (for example, ANGSTROM
        SIGN becomes LATIN CAPITAL LETTER A WITH RING).

2)      Sequences of combining marks are reordered so that marks of
        lower-numbered classes appear before marks of higher-numbered
        classes.  For example, COMBINING ACUTE ACCENT has class 230, whereas
        COMBINING GRAVE ACCENT BELOW has class 220.  The sequence of
        COMBINING ACUTE ACCENT followed by COMBINING GRAVE ACCENT BELOW
        will need to be reordered so that the BELOW accent appears first.
        Since sequences of combining characters longer than 2 are
        exceedingly rare, and since combining marks introduced by step 1
        are already in the correct order, bubble-sort is a good algorithm here.

3)      Combining marks are then rejoined to base characters wherever
        possible to re-create precomposed characters, both for
        compactness and because precomposed characters more often can be
        translated 1-to-1 into non-Unicode character sets.  The algorithm
        for doing this is a little trickier: for each combining mark,
        we examine whether it can be combined with the previous base
        or precomposed character to form a precomposed character.  If so,
        the previous base/precomposed character is replaced by the
        new precomposed character, and the combining mark is removed.
        However, this is *not* done for a combining mark which is preceded
        by another combining mark of the same class, and is not done
        for certain precomposed characters specified by Unicode.

This overall process is essentially O(N) unless there are an immense
and unlikely number of combining marks.

The table size required is approximately 8K bytes, as follows:

        276 combining characters at 3 bytes each (code and class);
        988 precomposed characters at 6 bytes each
		(original code, base code, mark code),
                plus a bit to say if this character should participate in
                step 3 or not;
        322 "deprecated" characters at 4 bytes each (bad code, good code).

This assumes linear or binary searches.  Faster searches can be provided by
different data structures, trading space for time as usual.

(Note: This explanation excludes the processing of Hangul jamo (Korean
characters), which are treated purely algorithmically and require no
tables.  Their processing is generally analogous to the above model.)

-- 

Schlingt dreifach einen Kreis vom dies! || John Cowan <jcowan@reutershealth.com>
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,           || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.            -- Coleridge (tr. Politzer)
Received on Tuesday, 18 January 2000 16:23:26 UTC