- From: John Cowan <jcowan@reutershealth.com>
- Date: Tue, 18 Jan 2000 16:32:23 -0500
- To: www-xml-canonicalization-comments@w3.org
The process of Unicode Form C normalization (which is the kind that
the character model WD recommends) involves three steps. For background,
Unicode has three kinds of characters (for this purpose): base characters,
combining marks, and precomposed characters (which are a base
character with one or more combining marks attached). Each combining
mark has a non-zero number associated with it called its "class"
(base characters all have class 0).
1) Each precomposed character (for example, LATIN SMALL LETTER A
WITH ACUTE) is replaced by its decomposed form (in this case,
LATIN SMALL LETTER A followed by COMBINING ACUTE ACCENT). If the
first character of the result is still precomposed, it
needs to be decomposed recursively. At this point, the
"deprecated" characters (those which are listed in the Unicode
database as canonically equivalent to a single character) are
replaced by their equivalent counterparts (for example, ANGSTROM
SIGN becomes LATIN CAPITAL LETTER A WITH RING ABOVE).
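Step 1 can be sketched in Python on top of the standard unicodedata
module, which exposes the character database's decomposition mappings
(entries tagged "<...>" are compatibility mappings, which Form C
leaves alone):

```python
import unicodedata

def canonical_decompose(s):
    # Step 1: replace each character by its canonical decomposition,
    # recursing because the first character of a decomposition may
    # itself be precomposed (or a singleton, like ANGSTROM SIGN ->
    # A WITH RING ABOVE -> A + COMBINING RING ABOVE).
    out = []
    for ch in s:
        d = unicodedata.decomposition(ch)
        if d and not d.startswith('<'):  # '<...>' marks a compatibility mapping
            decomposed = ''.join(chr(int(cp, 16)) for cp in d.split())
            out.append(canonical_decompose(decomposed))
        else:
            out.append(ch)
    return ''.join(out)
```

This is a sketch of the decomposition step only; it does not reorder
the resulting marks (that is step 2).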
2) Sequences of combining marks are reordered so that marks of
lower-numbered classes appear before marks of higher-numbered
classes. For example, COMBINING ACUTE ACCENT has class 230, whereas
COMBINING GRAVE ACCENT BELOW has class 220. The sequence of
COMBINING ACUTE ACCENT followed by COMBINING GRAVE ACCENT BELOW
will need to be reordered so that the BELOW accent appears first.
Since sequences of more than two combining marks are exceedingly
rare, and since the combining marks introduced by step 1 are
already in the correct order, bubble sort is a good algorithm here.
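The reordering in step 2 can be written with exactly that bubble sort;
unicodedata.combining() returns a character's class (0 for base
characters), and swapping only on a strict inequality leaves
equal-class marks in their original relative order, as required:

```python
import unicodedata

def canonical_order(s):
    # Step 2: bubble-sort runs of combining marks so lower-numbered
    # classes come first. Swapping only when the left class is strictly
    # greater preserves the order of equal-class marks, and class-0
    # (base) characters are never moved.
    chars = list(s)
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(chars) - 1):
            left = unicodedata.combining(chars[i])
            right = unicodedata.combining(chars[i + 1])
            if left > right > 0:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                swapped = True
    return ''.join(chars)
```

With the example above, ACUTE (class 230) followed by GRAVE ACCENT
BELOW (class 220) comes out with the BELOW mark first.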
3) Combining marks are then rejoined to base characters wherever
possible to re-create precomposed characters, both for
compactness and because precomposed characters can more often be
translated one-to-one into non-Unicode character sets. The algorithm
for doing this is a little trickier: for each combining mark,
we examine whether it can be combined with the previous base
or precomposed character to form a precomposed character. If so,
the previous base/precomposed character is replaced by the
new precomposed character, and the combining mark is removed.
However, this is *not* done for a combining mark which is preceded
by another combining mark of the same class, and is not done
for certain precomposed characters specified by Unicode.
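Step 3 might look like the sketch below, assuming the input is already
decomposed and canonically ordered. Python has no pair-composition
primitive, so the hypothetical compose_pair helper stands in for the
real table lookup by asking whether NFC collapses the pair to a single
character (which also honors the exclusions mentioned above):

```python
import unicodedata

def compose_pair(base, mark):
    # Hypothetical stand-in for a primary-composition table lookup:
    # does this base/mark pair canonically compose to one character?
    c = unicodedata.normalize('NFC', base + mark)
    return c if len(c) == 1 else None

def recompose(s):
    # Step 3: fold each combining mark back into the most recent
    # base/precomposed character, unless an earlier uncombined mark of
    # the same class blocks it. (After step 2 the classes are
    # non-decreasing, so "same class" is the only possible block.)
    out = []
    starter = None    # index in out of the last base/precomposed character
    prev_class = -1   # class of the last mark that failed to combine
    for ch in s:
        cls = unicodedata.combining(ch)
        if cls == 0:
            out.append(ch)
            starter, prev_class = len(out) - 1, -1
        else:
            composed = None
            if starter is not None and cls != prev_class:
                composed = compose_pair(out[starter], ch)
            if composed:
                out[starter] = composed   # mark absorbed into the starter
            else:
                out.append(ch)
                prev_class = cls          # this mark now blocks its class
    return ''.join(out)
```

Note how a mark that does combine is simply removed, so it never
blocks a later mark of the same class.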
This overall process is essentially O(N) unless some base character
carries an immense and unlikely number of combining marks (the
bubble sort in step 2 is quadratic only in the length of each run
of marks).
The table size required is approximately 8K bytes, as follows:
276 combining characters at 3 bytes each (code and class);
988 precomposed characters at 6 bytes each
(original code, base code, mark code),
plus a bit to say if this character should participate in
step 3 or not;
322 "deprecated" characters at 4 bytes each (bad code, good code).
This assumes linear or binary searches. Faster searches can be provided by
different data structures, trading space for time as usual.
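For illustration, the precomposed-character table can be kept sorted
on its (base code, mark code) key so the linear scan becomes a binary
search; the three rows below are samples only, not the full 988-entry
table:

```python
import bisect

# Illustrative rows only: ((base code, mark code), precomposed code).
COMPOSITIONS = sorted([
    ((0x0041, 0x0301), 0x00C1),  # A + combining acute -> A WITH ACUTE
    ((0x0041, 0x030A), 0x00C5),  # A + combining ring  -> A WITH RING ABOVE
    ((0x0061, 0x0301), 0x00E1),  # a + combining acute -> a WITH ACUTE
])
KEYS = [key for key, _ in COMPOSITIONS]

def lookup(base, mark):
    # Binary search on the sorted key column; None means the pair
    # has no precomposed form.
    i = bisect.bisect_left(KEYS, (base, mark))
    if i < len(KEYS) and KEYS[i] == (base, mark):
        return COMPOSITIONS[i][1]
    return None
```

A hash table or a two-level trie would trade more space for O(1)
lookups, as the paragraph above notes.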
(Note: This explanation excludes the processing of Hangul jamo (Korean
characters), which are treated purely algorithmically and require no
tables. Their processing is generally analogous to the above model.)
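The Hangul case really is pure arithmetic: composing a leading
consonant (L), vowel (V), and optional trailing consonant (T) jamo
into a precomposed syllable needs only the constants given in the
Unicode standard, and no tables:

```python
# Hangul composition constants from the Unicode standard.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28

def compose_hangul(l, v, t=None):
    # Map L + V (+ optional T) jamo to one precomposed syllable
    # by arithmetic on the code points alone.
    l_index = ord(l) - L_BASE
    v_index = ord(v) - V_BASE
    t_index = ord(t) - T_BASE if t else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)
```

Decomposition runs the same arithmetic in reverse, dividing the
syllable's offset from S_BASE back into L, V, and T indices.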
--
Schlingt dreifach einen Kreis vom dies! || John Cowan <jcowan@reutershealth.com>
Schliesst euer Aug vor heiliger Schau, || http://www.reutershealth.com
Denn er genoss vom Honig-Tau, || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies. -- Coleridge (tr. Politzer)
Received on Tuesday, 18 January 2000 16:23:26 UTC