Rationale for Unicode normalization as part of c14n

By vote, but not by consensus, the XML Core WG decided that the
product of XML canonicalization (c14n) should always be in
Unicode normalized form as defined in http://www.w3.org/TR/charmod .

Because of the principle of early normalization, text data should in
most cases already be normalized.  (Representing the original
document in almost any non-Unicode character set, including
all the commonly used ones, ensures normalization.)  But if
the same document turns up in two versions, one with a character
in normalized form and the other with the same character
in unnormalized form, then only the first should be
able to call itself 'canonicalized'.
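
To make the two-version case concrete, here is a minimal sketch in
Python.  It assumes the normalized form in question is Unicode
Normalization Form C (NFC) and uses the standard library's unicodedata
module; the helper name is_nfc and the sample strings are chosen
purely for illustration.

```python
import unicodedata

def is_nfc(text: str) -> bool:
    """Return True if text is already in NFC (assumed to be the target form)."""
    return unicodedata.normalize("NFC", text) == text

# The same visible text, with e-acute encoded two different ways.
precomposed = "caf\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(is_nfc(precomposed))  # True: this version may call itself canonicalized
print(is_nfc(decomposed))   # False: this version may not
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

Under that assumption, only the precomposed version passes the check,
and normalizing the decomposed version yields the precomposed one.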

The overhead of normalization is not large in code space, data space, or
time.  I have provided a non-normative explanation of the algorithm at
http://lists.w3.org/Archives/Public/www-xml-canonicalization-comments/2000Jan/0002.html
In general, the tables take about 8K bytes, and the process
is O(N) except on pathological data.

A few words about stability.  The tables will need to be extended after
Unicode 3.0 in order to accommodate new characters.  Old implementations
exposed to newly defined characters *might* fail to produce results
equivalent to those of new implementations.  About the only plausible
case is text which mixes new and old combining marks on a single base
character.
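
The mixed-marks case can be sketched as follows.  This is only an
illustration under stated assumptions: reorder_marks is a simplified
stand-in for the canonical reordering step, and old_table is a
hypothetical outdated table that pretends U+0323 COMBINING DOT BELOW
is not yet defined and so gives it combining class 0.  (U+0323 is in
fact a long-standing character; it merely plays the role of a "new"
mark here.)

```python
import unicodedata

def reorder_marks(text, combining_class):
    """Stably sort each run of non-zero-class marks by combining class."""
    out, run = [], []
    for ch in text:
        if combining_class(ch) == 0:
            # A class-0 character ends the current run of marks.
            out.extend(sorted(run, key=combining_class))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=combining_class))
    return "".join(out)

def new_table(ch):
    """Combining classes from the Unicode tables bundled with Python."""
    return unicodedata.combining(ch)

def old_table(ch):
    """Hypothetical old table: U+0323 is unknown, so it is treated as class 0."""
    return 0 if ch == "\u0323" else unicodedata.combining(ch)

# Two orderings of the same marks on one base character.
a = "a\u0301\u0323"   # acute (class 230), then dot below (class 220)
b = "a\u0323\u0301"   # dot below, then acute

print(reorder_marks(a, new_table) == reorder_marks(b, new_table))  # True
print(reorder_marks(a, old_table) == reorder_marks(b, old_table))  # False
```

With the complete table, both orderings collapse to one sequence.  With
the outdated table the unknown mark blocks reordering, so the two
inputs stay distinct.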

However, *anything* that is normalized relative to Unicode 3.0 is
guaranteed to be normalized relative to *any* later version of Unicode
as well.  So if an XML document is canonicalized by a current
canonicalizer, then it will *still* be canonical according to
later canonicalizers with updated Unicode tables.  This is guaranteed.
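
A rough way to spot-check this property, sketched in Python: the
standard library exposes a frozen Unicode 3.2 database as
unicodedata.ucd_3_2_0, which is used here as a stand-in for an "old"
canonicalizer (3.2 rather than 3.0, so this is an approximation), and
NFC is assumed to be the normalized form.  The helper name and the
sample strings are illustrative only.

```python
import unicodedata

# Frozen Unicode 3.2 tables, standing in for an older implementation.
old = unicodedata.ucd_3_2_0

def still_normalized(text: str) -> bool:
    """Normalize with the old tables, then check that the current
    tables leave the result unchanged."""
    canon_old = old.normalize("NFC", text)
    return unicodedata.normalize("NFC", canon_old) == canon_old

samples = [
    "cafe\u0301",      # base letter plus combining acute
    "a\u0301\u0323",   # two combining marks in non-canonical order
    "\u212b",          # ANGSTROM SIGN, a singleton decomposition
]

for s in samples:
    print(still_normalized(s))  # True for each sample
```

A check like this samples the guarantee; it does not prove it.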

-- 

Schlingt dreifach einen Kreis vom dies! || John Cowan <jcowan@reutershealth.com>
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,           || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.            -- Coleridge (tr. Politzer)

Received on Tuesday, 18 January 2000 16:22:54 UTC