Canonical MicroXML from James Clark on 2012-10-02 (public-microxml@w3.org from October 2012)

From: James Clark <jjc@jclark.com>
Date: Tue, 2 Oct 2012 19:05:48 +0700
To: James Fuller <jim@webcomposite.com>
Cc: "public-microxml@w3.org" <public-microxml@w3.org>
Message-ID: <CANz3_EZqxnvD1AGRzpeB3-_Jzx6fAqR9eoUz0G+7VS0dyO_g=g@mail.gmail.com>

On Tue, Oct 2, 2012 at 4:37 PM, James Fuller <jim@webcomposite.com> wrote:

> I use xml canonisation all the time for precise diff calcs that have
> nothing to do with security (for example genetic algorithm fitness,
> which must characterise precisely differences between 2 files) …


I hear you. I believe the first version of XML Canonicalization was
actually defined by me for the purposes of parser testing:

http://www.jclark.com/xml/canonxml.html

The C14N specs make incredibly heavy weather of defining something that is
very simple.

We could add an Appendix that defines it very succinctly as follows.

The Canonical MicroXML for a document is the unique MicroXML document that

a) has the same data model as that document
b) matches the grammar below (productions not defined below are as defined
in the body of the spec)
c) has the attributes of each element in lexicographic (Unicode code point)
order of attribute name

document ::= element #xA
element ::= startTag content endTag
startTag ::= '<' name attributeList '>'
endTag ::= '</' name '>'
content ::= (element | dataChar | charRef)*
attributeList ::= (space attribute)*
attribute ::= attributeName  '='  attributeValue
attributeValue ::= '"' ((attributeValueChar - '"') | attributeValueCharRef)* '"'
attributeValueCharRef ::= charRef | '&quot;'
charRef ::= '&lt;' | '&amp;' | '&gt;'
space ::= #x20
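
To make the grammar concrete, here is a minimal Python sketch of a serializer that emits this form. It is an illustration under stated assumptions, not part of the proposal: the tuple representation of the data model (name, attribute dict, list of children) and the function names are hypothetical. It also writes '>' as '&gt;' inside attribute values, which the grammar permits via charRef; whether a literal '>' would also be allowed there depends on how attributeValueChar is defined in the body of the spec.

```python
def escape_text(s):
    # charRef ::= '&lt;' | '&amp;' | '&gt;'  ('&' must be replaced first)
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def escape_attr(s):
    # attributeValueCharRef ::= charRef | '&quot;'
    # Assumption: '>' is escaped in attribute values as well as in content.
    return escape_text(s).replace('"', "&quot;")


def canonical_element(node):
    # Hypothetical data model: node = (name, attrs, children), where attrs is
    # a dict of str -> str and each child is either a str or another node.
    name, attrs, children = node
    parts = ["<", name]
    # Python compares str values by code point, so sorted() gives the
    # lexicographic (Unicode code point) order required by rule (c).
    for key in sorted(attrs):
        parts.append(f' {key}="{escape_attr(attrs[key])}"')
    parts.append(">")
    for child in children:
        if isinstance(child, str):
            parts.append(escape_text(child))
        else:
            parts.append(canonical_element(child))
    # The grammar has no empty-element tag, so an end tag is always written.
    parts.append(f"</{name}>")
    return "".join(parts)


def canonical_document(root):
    # document ::= element #xA
    return canonical_element(root) + "\n"


doc = ("a", {"z": "1", "b": 'x"y'}, ["t<", ("c", {}, []), "&"])
print(canonical_document(doc), end="")
# prints: <a b="x&quot;y" z="1">t&lt;<c></c>&amp;</a>
```

Two documents with the same data model then serialize to identical strings, so a plain string comparison (or a textual diff) is enough to tell whether they differ.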

Is this worth including in the spec?

James

Received on Tuesday, 2 October 2012 12:06:37 UTC