Canonical MicroXML

The current MicroXML draft says that if you want a canonical form for a
MicroXML document, you can apply XML Canonicalization (RFC 3076), but the
result is not necessarily well-formed MicroXML.  So I thought I would write
down a reasonable definition of MicroXML Canonicalization.

To canonicalize a MicroXML document, take the following actions:

Normalize all line breaks to #xA.

Convert all attribute values wrapped in single quotes to be in double
quotes, converting any embedded quotation marks into &quot;.

Convert all numeric character references in character content and attribute
values to single characters, except that &#x26; &#x3C; &#x3E; become &amp;
&lt; &gt; respectively, and (in attribute values only) &#x27; becomes &apos;.

Convert empty elements to start-end tag pairs.

Remove all whitespace outside the document element.

Remove all whitespace within start-tags except for a single space
separating the element name from the first attribute (if there is one) and
preceding each additional attribute (if any).

Remove all whitespace within end-tags.

Sort the attributes of each element in lexicographical order by Unicode
code points.
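
As a rough illustration, the steps above can be sketched in Python by
parsing the document and re-serializing the tree.  This is only a sketch
under stated assumptions, not a reference implementation: it uses the
standard library's xml.etree.ElementTree as a stand-in parser (which accepts
full XML rather than only MicroXML and silently drops comments), the names
escape_text, escape_attr, serialize and canonicalize are made up for this
example, and it writes every double quote and apostrophe in an attribute
value as an entity reference, which is one reading of the rules above.  The
parser already takes care of line-break normalization, character
references, quote style and whitespace outside the document element.

```python
import xml.etree.ElementTree as ET


def escape_text(s):
    # Character content: & < > are written as entity references.
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def escape_attr(s):
    # Attribute values: as for text, plus both kinds of quote.
    # Escaping every apostrophe is an assumption of this sketch.
    return escape_text(s).replace('"', "&quot;").replace("'", "&apos;")


def serialize(elem):
    # Attributes sorted by name; Python compares strings by code point.
    attrs = "".join(
        f' {name}="{escape_attr(value)}"'
        for name, value in sorted(elem.attrib.items())
    )
    # Always emit a start-end tag pair, even for empty elements.
    parts = [f"<{elem.tag}{attrs}>", escape_text(elem.text or "")]
    for child in elem:
        parts.append(serialize(child))
        parts.append(escape_text(child.tail or ""))
    parts.append(f"</{elem.tag}>")
    return "".join(parts)


def canonicalize(source):
    # Hypothetical entry point: parse, then write the canonical form.
    return serialize(ET.fromstring(source))


print(canonicalize("<doc b='2' a='1'><e/></doc>"))
# <doc a="1" b="2"><e></e></doc>
```

Working from the parsed tree rather than editing the text in place means
that inputs differing only in quoting, attribute order or tag whitespace
come out identical, and running canonicalize on its own output leaves it
unchanged.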

The result is not Canonical XML, because > has been escaped as &gt; in
attribute values, which Canonical XML doesn't allow.  But it is functionally
equivalent.
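
One way to see the equivalence, sketched here with Python's
xml.etree.ElementTree standing in for an arbitrary parser (the sample
documents are invented for illustration): the escaped and unescaped
spellings of > in an attribute value parse to the same string.

```python
import xml.etree.ElementTree as ET

# The same attribute value, with > escaped and with > written literally.
escaped = ET.fromstring('<doc a="1 &gt; 0"></doc>')
literal = ET.fromstring('<doc a="1 > 0"></doc>')

# Both spellings yield the same attribute value after parsing.
assert escaped.attrib == literal.attrib == {"a": "1 > 0"}
```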

Comments?

-- 
John Cowan          http://vrici.lojban.org/~cowan        cowan@ccil.org
If a soldier is asked why he kills people who have done him no harm, or a
terrorist why he kills innocent people with his bombs, they can always
reply that war has been declared, and there are no innocent people in an
enemy country in wartime.  The answer is psychotic, but it is the answer
that humanity has given to every act of aggression in history.  --Northrop
Frye

Received on Saturday, 22 July 2017 14:21:24 UTC