- From: John Cowan <jcowan@reutershealth.com>
- Date: Tue, 18 Jan 2000 16:32:23 -0500
- To: www-xml-canonicalization-comments@w3.org
The process of Unicode Form C normalization (which is the kind that the character model WD recommends) involves three steps. For background, Unicode has three kinds of characters (for this purpose): base characters, combining marks, and precomposed characters (a base character with one or more combining marks attached). Each combining mark has a non-zero number associated with it called its "class".

1) Each precomposed character (for example, LATIN SMALL LETTER A WITH ACUTE) is replaced by its decomposed form (in this case, LATIN SMALL LETTER A followed by COMBINING ACUTE ACCENT). If the first character of the result is still precomposed, it is decomposed recursively. At this point, the "deprecated" characters (those listed in the Unicode database as canonically equivalent to a single other character) are replaced by their equivalents (for example, ANGSTROM SIGN becomes LATIN CAPITAL LETTER A WITH RING ABOVE).

2) Sequences of combining marks are reordered so that marks of lower-numbered classes appear before marks of higher-numbered classes. For example, COMBINING ACUTE ACCENT has class 230, whereas COMBINING GRAVE ACCENT BELOW has class 220, so the sequence COMBINING ACUTE ACCENT followed by COMBINING GRAVE ACCENT BELOW must be reordered so that the BELOW accent appears first. Since sequences of more than two combining marks are exceedingly rare, and since combining marks introduced by step 1 are already in the correct order, bubble sort is a good algorithm here.

3) Combining marks are then rejoined to base characters wherever possible to re-create precomposed characters, both for compactness and because precomposed characters can more often be translated 1-to-1 into non-Unicode character sets. The algorithm for this is a little trickier: for each combining mark, we check whether it can combine with the previous base or precomposed character to form a precomposed character.
If so, the previous base/precomposed character is replaced by the new precomposed character, and the combining mark is removed. However, this is *not* done for a combining mark that is preceded by another combining mark of the same class, and it is not done for certain precomposed characters specified by Unicode.

This overall process is essentially O(N) unless there is an immense and unlikely number of combining marks. The table size required is approximately 8K bytes, as follows:

- 276 combining characters at 3 bytes each (code and class);
- 988 precomposed characters at 6 bytes each (original code, base code, mark code), plus a bit saying whether the character participates in step 3;
- 322 "deprecated" characters at 4 bytes each (bad code, good code).

This assumes linear or binary searches. Faster searches can be had from different data structures, trading space for time as usual.

(Note: This explanation excludes the processing of Hangul jamo (Korean characters), which are handled purely algorithmically and require no tables. Their processing is broadly analogous to the model above.)
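The three steps above can be sketched in Python, whose `unicodedata` module exposes the combining classes and normalization forms described here. The `canonical_reorder` function is a hand-written illustration of the step 2 bubble sort (not how the library implements it), checked against the library's own normalization:

```python
import unicodedata

def canonical_reorder(s):
    """Step 2 sketch: bubble-sort combining marks so that marks of
    lower-numbered classes come first.  Base characters (class 0)
    are never moved and act as barriers between mark sequences."""
    chars = list(s)
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(chars) - 1):
            a = unicodedata.combining(chars[i])
            b = unicodedata.combining(chars[i + 1])
            if a > b > 0:  # two combining marks, out of order
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                swapped = True
    return "".join(chars)

# Step 1: ANGSTROM SIGN (U+212B) is one of the "deprecated" singletons;
# full decomposition yields LATIN CAPITAL LETTER A + COMBINING RING ABOVE.
assert unicodedata.normalize("NFD", "\u212B") == "A\u030A"

# Step 2: COMBINING ACUTE ACCENT (U+0301, class 230) written before
# COMBINING GRAVE ACCENT BELOW (U+0316, class 220) must be reordered;
# the hand-rolled bubble sort agrees with the library's NFD.
assert unicodedata.combining("\u0301") == 230
assert unicodedata.combining("\u0316") == 220
marks = "a\u0301\u0316"
assert canonical_reorder(marks) == "a\u0316\u0301"
assert unicodedata.normalize("NFD", marks) == "a\u0316\u0301"

# Step 3: recomposition.  NFC decomposes, reorders, then recomposes:
# a + COMBINING ACUTE ACCENT becomes LATIN SMALL LETTER A WITH ACUTE.
assert unicodedata.normalize("NFC", "a\u0301") == "\u00E1"
```

Note that `unicodedata.normalize("NFC", ...)` performs all three steps at once; the assertions merely pick each step apart for illustration.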
Received on Tuesday, 18 January 2000 16:23:26 UTC