XML Schema and the necessity for canonical representations

Having a canonical form of an entity is very important for comparison and
digital signature purposes.

XML is sufficiently rich that canonicalization needs to be considered at several
levels.  For example, the character set used in two XML documents needs to be
converted to a standard if they are to be usefully compared for many purposes.
There are also canonicalization considerations related to white space, namespace
prefixes, etc., which are being considered by the XML Syntax WG.  Similarly, I
believe that canonicalization of datatype representation must be considered and
the Schema WG seems like the place to do it.

I think the need for datatypes to have a designated canonical lexical form
should be fairly clear for comparison purposes.  It relieves the comparator of
the burden of having to parse every form of every datatype and convert
it to a canonical form the comparator has selected.
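
As a rough illustration of the comparison point, here is a minimal Python
sketch.  The function name canonical_decimal and the particular canonical form
it picks (no leading "+", no redundant zeros, always at least one digit on each
side of the point) are my own assumptions for the example, not anything taken
from the Schema Datatypes draft.

```python
from decimal import Decimal


def canonical_decimal(lexical: str) -> str:
    """Map any accepted lexical form of a decimal to one designated form.

    The canonical form chosen here is an assumption for illustration:
    optional "-", no leading zeros, no trailing zeros, and at least one
    digit on each side of the decimal point.
    """
    value = Decimal(lexical.strip())
    text = format(value, "f")          # fixed-point notation, no exponent
    negative = text.startswith("-")
    text = text.lstrip("+-")
    whole, _, frac = text.partition(".")
    whole = whole.lstrip("0") or "0"   # drop leading zeros, keep one digit
    frac = frac.rstrip("0") or "0"     # drop trailing zeros, keep one digit
    result = f"{whole}.{frac}"
    if negative and result != "0.0":   # negative zero collapses to "0.0"
        result = "-" + result
    return result


# Three lexical forms of the same value differ as strings...
forms = ["+1.50", "01.5", "1.5000"]
# ...but share a single canonical form, so plain string equality suffices.
canonical_forms = {canonical_decimal(f) for f in forms}
```

If the canonical form is designated once, a comparator only has to compare
strings; it does not need its own parser for each datatype.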

The need may not be as immediately obvious in the digital signature arena,
depending on your mental picture of the "typical" digital signature application.
If your picture is very document/object oriented, you might wonder what all the
fuss is about since any lump of bits can be signed and, if faithfully
transmitted, this signature can be verified later on the same lump of bits.  On
the other hand, if you have a transactional/protocol point of view, where pieces
of messages are being signed, data is processed and forwarded by intermediate
parties, and the signature verified by later recipients, etc., canonicalization
is essential.

I have been involved with too many systems where people thought that all they
were doing was verifying signatures on unchanged data being sent through
multi-party but faithful transmission channels only to find that there was some
circumstance where a signed object had to be partly or fully re-constituted or
some transmission channel was not as faithful as they thought.  As a result,
some incredibly stupid thing like capitalization, padding, line ending character
sequences, etc., etc., at least temporarily derailed their entire effort as, on
a crash basis, they designed and painfully retrofitted canonicalization into
their system.  Also witness the diddly little lack of canonicalization in the
original ASN.1 time and date format: as soon as there was substantial real-world
use of this, a new, almost identical fundamental data type had to be added to
ASN.1, with significant disruption and confusion, just to squeeze out the last
case of alternative representations of the same date and time.
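
To make the failure mode concrete, here is a small Python sketch.  It uses a
SHA-256 digest as a stand-in for a signature (a real signature is computed over
such a digest, so it breaks in the same way).  The message layout and the
canonicalize rules below (normalize line endings, strip trailing padding, end
with exactly one newline) are assumptions made up for this example.

```python
import hashlib


def canonicalize(message: str) -> bytes:
    """Reduce a message to one agreed byte sequence before hashing.

    Hypothetical rules: LF line endings, no trailing spaces or tabs on any
    line, no trailing blank lines, exactly one final newline, UTF-8.
    """
    lines = message.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    lines = [line.rstrip(" \t") for line in lines]
    while lines and lines[-1] == "":
        lines.pop()
    return ("\n".join(lines) + "\n").encode("utf-8")


def digest_raw(message: str) -> str:
    """Digest of the bytes exactly as received."""
    return hashlib.sha256(message.encode("utf-8")).hexdigest()


def digest_canonical(message: str) -> str:
    """Digest of the canonicalized bytes."""
    return hashlib.sha256(canonicalize(message)).hexdigest()


# What the originator signed.
signed = "amount: 1.50\npayee: Alice\n"
# What arrives after an intermediary rewrote line endings and added padding.
forwarded = "amount: 1.50  \r\npayee: Alice\r\n"

raw_matches = digest_raw(signed) == digest_raw(forwarded)
canonical_matches = digest_canonical(signed) == digest_canonical(forwarded)
```

Here raw_matches is False while canonical_matches is True: the content is
unchanged in every way that matters, but only the canonicalized digest
survives the not-quite-faithful channel.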

There is no problem with the Schema Datatypes document providing multiple
lexical representations as long as exactly one form is designated as the
canonical form.
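
A sketch of what "multiple lexical representations, exactly one canonical"
could look like for a single datatype, again in Python.  The boolean forms
accepted here ("true", "false", "1", "0") and the choice of "true"/"false" as
canonical are assumptions for illustration only.

```python
# Hypothetical set of accepted lexical forms and the values they denote.
ACCEPTED_BOOLEAN_FORMS = {"true": True, "1": True, "false": False, "0": False}


def parse_boolean(lexical: str) -> bool:
    """Accept any of the permitted lexical forms."""
    try:
        return ACCEPTED_BOOLEAN_FORMS[lexical.strip()]
    except KeyError:
        raise ValueError(f"not a boolean lexical form: {lexical!r}") from None


def canonical_boolean(lexical: str) -> str:
    """Emit the single designated canonical form for the parsed value."""
    return "true" if parse_boolean(lexical) else "false"


# Every accepted form maps to a canonical form that is itself accepted,
# and canonicalizing twice changes nothing.
all_canonical = {canonical_boolean(form) for form in ACCEPTED_BOOLEAN_FORMS}
```

The two properties worth checking for any datatype are that the canonical form
is one of the accepted forms and that canonicalization is idempotent.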

I believe that the XML Schema Datatypes document should be changed to do this
and perhaps this should be added to the XML Schema requirements document.

Thanks,
Donald

Donald E. Eastlake, 3rd
17 Skyline Drive, Hawthorne, NY 10532 USA
dee3@us.ibm.com   tel: 1-914-784-7913, fax: 1-914-784-3833

home: 65 Shindegan Hill Road, RR#1, Carmel, NY 10512 USA
dee3@torque.pothole.com   tel: 1-914-276-2668

Received on Friday, 21 May 1999 17:10:18 UTC