ACTION NW xmlChunk-44: Chunk of XML - Canonicalization and equality from Norman Walsh on 2004-06-28 (www-tag@w3.org from June 2004)

From: Norman Walsh <Norman.Walsh@Sun.COM>
Date: Mon, 28 Jun 2004 10:58:36 -0400
To: www-tag@w3.org
Message-id: <87smcflof7.fsf@nwalsh.com>
From http://www.w3.org/2001/tag/actions_owner.html#NW

  xmlChunk-44: Chunk of XML - Canonicalization and equality

  Write up a named equivalence function based on today's discussion
  (e.g., based on infoset, augmented with xml:lang/xml:base, not
  requiring prefixes, etc.).

    * accepted on 12 May 2004

I submit that the following proposal completes this action:

Open questions:

1. There is no normative description of how to build an infoset. It follows
   that infoset equality does not guarantee equality of the underlying
   serialized form in any absolute sense.
2. Should processing instructions be significant?
3. Should comments be significant?
4. Should the document type declaration be significant?

I've taken a conservative approach in this description, answering
"yes" to points 2-4.

General notes:

- Ordered lists (such as the [children] property) are compared
  pairwise and in order. In other words, two ordered lists "A" and "B"
  are the same if and only if the first item of "A" is the same as the
  first item of "B", the second item of "A" is the same as the second
  item of "B", etc. It follows that they can only be the same if they
  are the same length.

- Unordered lists (such as the [attributes] property) are compared
  pairwise and without respect to order. In other words, two unordered
  lists "A" and "B" are the same if and only if there exists a set of
  pairs of items, one from each list, such that the two items in each
  pair are equal and no item from "A" or "B" appears in more than one
  pair. It follows that they can only be the same if they are the same
  length.

- XML Base. If the infosets being compared were constructed by an
  application that claims conformance to the XML Base recommendation,
  then the xml:base attribute is excluded from attribute comparisons.

- Natural Language. The xml:lang attribute is not treated specially in
  the Infoset but is intended to have a scoped effect much like the
  base URI. This proposal finesses that point by requiring that
  elements and attributes must be in the same language.

  If the infosets being compared were constructed by an application
  that provides application semantics for xml:lang, then the
  application must be able to determine whether or not two elements or
  attributes have the same language.

  If the infosets being compared were constructed by an application
  that does not provide special semantics for xml:lang, then two
  elements or attributes have the same language if they have the same
  inherited value for xml:lang.

  The inherited value for xml:lang is the value of xml:lang on the
  element in question or the value from the closest ancestor. In XPath
  terms: (ancestor-or-self::*/@xml:lang)[last()]

  Languages are compared case insensitively.

- When two information items are compared:

  - Properties with the value "no value" are equal.
  - Properties with the value "unknown" are not equal.
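
As a rough illustration of the general notes above, here is a minimal
Python sketch of the three comparison rules. The names NO_VALUE, UNKNOWN,
props_equal, ordered_equal and unordered_equal are hypothetical; the
Infoset defines no API. The sketch also assumes that "no value" compared
against an actual value is unequal, which the notes do not state
explicitly, and that the item comparison is symmetric and transitive, so
greedy pairing is sufficient for unordered lists.

```python
# Hypothetical sentinels for the two special property states.
NO_VALUE = object()
UNKNOWN = object()


def props_equal(a, b):
    """Compare two property values under the rules in the notes."""
    if a is UNKNOWN or b is UNKNOWN:
        return False      # "unknown" is never equal, even to itself
    if a is NO_VALUE or b is NO_VALUE:
        return a is b     # assumption: equal only if both are "no value"
    return a == b


def ordered_equal(xs, ys, eq=props_equal):
    """Pairwise, in-order comparison; lengths must match."""
    return len(xs) == len(ys) and all(eq(x, y) for x, y in zip(xs, ys))


def unordered_equal(xs, ys, eq=props_equal):
    """Pair each item of xs with a distinct equal item of ys."""
    if len(xs) != len(ys):
        return False
    remaining = list(ys)
    for x in xs:
        for i, y in enumerate(remaining):
            if eq(x, y):
                del remaining[i]   # each item may appear in only one pair
                break
        else:
            return False           # no partner left for x
    return True
```

The unordered comparison is quadratic in the list length, which seems
acceptable for attribute lists; a hashed representation would be faster
but needs a canonical key for each item.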

0. Infosets

Two infosets are equal if their Document Information Items are equal.

1. Document Information Items

Two document information items are equal if the following properties
are equal:

 - [children]
 - [document element]
 - [all declarations processed]
 - [base uri]

2. Element Information Items

Two element information items are equal if they have the same language
and the following properties are equal:

 - [namespace name]
 - [local name]
 - [children]
 - [attributes], exclusive of xml:lang
 - [base uri]
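
A sketch of the element rule, using Python's xml.etree.ElementTree as a
stand-in for an infoset. This is only an approximation: ElementTree
models character children as text/tail strings, drops comments and
processing instructions by default, and exposes no [base uri], so that
property is not compared here. Because ElementTree does not implement
XML Base, xml:base is left in as an ordinary attribute. The function
name elements_equal and its parameters are hypothetical.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def elements_equal(a, b, lang_a=None, lang_b=None):
    """Compare two elements; lang_a/lang_b carry the inherited xml:lang."""
    # Inherited language: the element's own xml:lang, else the ancestor's.
    lang_a = a.get(XML_LANG, lang_a)
    lang_b = b.get(XML_LANG, lang_b)
    # Languages are compared case insensitively.
    if (lang_a or "").lower() != (lang_b or "").lower():
        return False
    # The tag is "{namespace name}local name"; prefixes play no part.
    if a.tag != b.tag:
        return False
    # [attributes] is unordered and excludes xml:lang.
    attrs_a = {k: v for k, v in a.attrib.items() if k != XML_LANG}
    attrs_b = {k: v for k, v in b.attrib.items() if k != XML_LANG}
    if attrs_a != attrs_b:
        return False
    # Character data: text before the first child, tail after the element.
    if (a.text or "") != (b.text or "") or (a.tail or "") != (b.tail or ""):
        return False
    # [children] is ordered: same length, compared pairwise.
    if len(a) != len(b):
        return False
    return all(elements_equal(x, y, lang_a, lang_b) for x, y in zip(a, b))
```

Under this sketch, `<p:t xmlns:p="urn:x"/>` and `<t xmlns="urn:x"/>`
compare equal, and an element that inherits xml:lang="EN" compares equal
to one that declares xml:lang="en" itself.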

3. Attribute Information Items

Two attribute information items are equal if they have the same
language and the following properties are equal:

 - [namespace name]
 - [local name]
 - [normalized value]
 - [attribute type]

4. Processing Instruction Information Items

Two processing instruction information items are equal if the
following properties are equal:

 - [target]
 - [content]
 - [base uri]

5. Unexpanded Entity Reference Information Items

Two unexpanded entity reference information items are equal if the
following properties are equal:

  - [name]
  - [system identifier]
  - [public identifier]

6. Character Information Items

Two character information items are equal if the following properties
are equal:

 - [character code]
 - [element content whitespace]

7. Comment Information Items

Two comment information items are equal if the following properties
are equal:

 - [content]

8. The Document Type Declaration Information Item

Two document type declaration information items are equal if the
following properties are equal:

 - [system identifier]
 - [public identifier]
 - [children]

9. Unparsed Entity Information Items

Two unparsed entity information items are equal if the following
properties are equal:

 - [name]
 - [system identifier]
 - [public identifier]
 - [notation name]
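
The simpler items above (those whose listed properties are all plain
values) lend themselves to a table-driven comparison. The following is
a sketch only: it assumes each information item is represented as a
Python dict keyed by property name, which is a hypothetical
representation, and the names PROPERTIES and items_equal are invented
for illustration. "No value" could be represented as None here;
"unknown" is not modeled.

```python
# Property lists copied from sections 4-7 and 9 of the proposal.
PROPERTIES = {
    "processing instruction": ("target", "content", "base uri"),
    "unexpanded entity reference": (
        "name", "system identifier", "public identifier"),
    "character": ("character code", "element content whitespace"),
    "comment": ("content",),
    "unparsed entity": (
        "name", "system identifier", "public identifier", "notation name"),
}


def items_equal(kind, a, b):
    """Two items of the same kind are equal if every listed property is."""
    return all(a.get(p) == b.get(p) for p in PROPERTIES[kind])
```

Properties that are not listed for a kind are ignored, so two
processing instructions that differ only in, say, their [parent] still
compare equal.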

                                        Be seeing you,
                                          norm

-- 
Norman.Walsh@Sun.COM / XML Standards Architect / Sun Microsystems, Inc.
Received on Monday, 28 June 2004 10:59:26 UTC