- From: John Cowan <jcowan@reutershealth.com>
- Date: Tue, 18 Jan 2000 16:31:50 -0500
- To: www-xml-canonicalization-comments@w3.org
By vote, but not by consensus, the XML Core WG decided that the product of XML canonicalization (c14n) should always be in Unicode normalized form as defined in http://www.w3.org/TR/charmod .

Because of the principle of early normalization, text data in most cases should already be normalized. (Representing the original document using almost any non-Unicode character set, including all the commonly used ones, ensures normalization.) But in case the same document turns up in two versions, one with a character in normalized form and the other with the same character in unnormalized form, then only the first one should be able to call itself 'canonicalized'.

The overhead of normalization is not large in code space or data space or time. I have provided a non-normative explanation of the algorithm at
http://lists.w3.org/Archives/Public/www-xml-canonicalization-comments/2000Jan/0002.html
In general, a table space of about 8K bytes is involved, and the process is O(N) except on pathological data.

A few words about stability. The tables will need to be extended after Unicode 3.0 in order to accommodate new characters. Old implementations exposed to newly defined characters *might* fail to produce results equivalent to new implementations. About the only plausible case is text which mixes new and old combining marks on a single base character.

However, *anything* that is normalized relative to Unicode 3.0 is guaranteed to be normalized relative to *any* later version of Unicode as well. So if an XML document is canonicalized by a current canonicalizer, then it will *still* be canonical according to later canonicalizers with updated Unicode tables. This is guaranteed.

-- 
Schlingt dreifach einen Kreis vom dies! || John Cowan <jcowan@reutershealth.com>
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,           || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.            -- Coleridge (tr. Politzer)
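
[Illustration] The "two versions of the same document" case in the message can be sketched in a few lines of Python. This is only a rough sketch under stated assumptions: it assumes the normalized form in question is Unicode Normalization Form C, it uses the standard library's unicodedata module rather than the table-driven algorithm described at the linked explanation, and is_normalized is a hypothetical helper name introduced here.

```python
import unicodedata


def is_normalized(text: str, form: str = "NFC") -> bool:
    """Hypothetical helper: true if `text` is already in the given form.

    Assumption: Normalization Form C is the form the message refers to.
    """
    return unicodedata.normalize(form, text) == text


# The same character in two versions of a document:
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Normalizing the unnormalized version yields the normalized one.
normalized = unicodedata.normalize("NFC", decomposed)

# Normalizing already-normalized text leaves it unchanged.
renormalized = unicodedata.normalize("NFC", normalized)
```

Under these assumptions, only the version containing `composed` would pass an is_normalized check, which corresponds to the message's point that only that version could call itself 'canonicalized'.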
Received on Tuesday, 18 January 2000 16:22:54 UTC