Towards a repair for our override problems

Henry S. Thompson
11 Mar 2011

Table of Contents

1. Managing circularity: a graph-based sketch of an approach
  1.1. Algorithm I: include only
  1.2. Algorithm O: override only
2. Worked example

1. Managing circularity: a graph-based sketch of an approach

Our goal is to construct a tree of include and override connections with no duplicate leaves. However, we will speak as if we're dealing with a graph, because the fact that we're building a tree is something that eventually will need to be proved, not just assumed/asserted. The nodes of the graph (which we'll call G) are schema documents, identified by their base URIs, which we take to be the URIs of the schema documents as retrieved after all redirection has occurred.

Notation:

  Node(u)       a node for the schema document retrieved from URI u
  n.uri         the (post-redirection) base URI of node n
  n.sd          the schema document identified by n.uri
  n.markers     the set of markers attached to n (initially empty)
  Edge(f,t,l)   an edge from node f to node t, labelled l
  e.to          the target node of edge e
  n.in, n.out   the sets of edges into and out of node n respectively
  tns(sd)       the target namespace of schema document sd
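To make that notation concrete, here is how it might be rendered as Python data structures (Python only because XSV is written in Python; the class names, the Graph container and the in_edges/out_edges spellings are mine, invented for this sketch and not anything in XSV):

  # Illustrative rendering of the notation above; attribute names follow the
  # algorithms where they can, everything else is just for this sketch.
  from dataclasses import dataclass, field

  @dataclass(eq=False)          # eq=False: nodes are compared by identity
  class Node:
      uri: str                  # base URI as retrieved, after all redirection
      sd: object = None         # the schema document itself (representation left open)
      markers: set = field(default_factory=set)       # URIs (Algorithm I) or 4-tuples (Algorithm O)
      in_edges: list = field(default_factory=list)    # written n.in in the algorithms
      out_edges: list = field(default_factory=list)   # written n.out in the algorithms

  @dataclass(eq=False)          # edges likewise compared by identity
  class Edge:
      frm: Node                 # the including/overriding node ('from' is reserved in Python)
      to: Node                  # the included/overridden node
      label: str                # "include", "import" or "override"

  @dataclass
  class Graph:                  # G in the algorithms
      nodes: list = field(default_factory=list)
      edges: list = field(default_factory=list)

  def add_edge(g, e):
      """Register edge e in graph g and on both its endpoints."""
      g.edges.append(e)
      e.frm.out_edges.append(e)
      e.to.in_edges.append(e)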

1.1. Algorithm I: include only

This algorithm reconstructs XSV's handling of re-entrant and circular includes.

Without loss of generality, assume we start with a single 'root' schema document whose URI as retrieved is root
  1. Add r = Node(root) to G
  2. Initialise a set of nodes to be processed Q = {r} and a set of nodes already processed P = {}
  3. If Q is empty, we're done
  4. Otherwise let n be a node removed from Q
  5. If there exists m in P such that m.uri = n.uri and m.markers = n.markers, then
    a. Remove n from G
    b. For all edges e such that e.to = n, remove e from G
  6. Otherwise
    a. Add n to P
    b. For each include EII or import EII with a schemaLocation attribute (call that EII i) in n.sd.[children]
      i. Add m = Node(i/@schemaLocation) to G
      ii. If i is an include, add all the members of n.markers to m.markers
      iii. If i is an include and its inclusion of m.sd is a chameleon inclusion, then add tns(n.sd) to m.markers
      iv. Add m to Q
      v. Add Edge(n,m,"include" or "import" as appropriate) to G
  7. Go to step 3

There is slightly more mechanism here than is needed (edge labels are not used, 6.b.ii and 6.b.iii are mutually exclusive and the marker set of a node is always either empty or consists of a single URI). This is to prepare the way for handling override as well.

On the other hand, the treatment of import is oversimplified, and redefine is not mentioned at all.

I think it's reasonably obvious that the result of the above is a tree rooted at r and that the leaves are unique. Representation checking and schema construction can then proceed bottom-up with respect to that tree. That doesn't mean that a given schema document will only be processed once: if it is chameleon-included with multiple including namespaces, or is both chameleon and non-chameleon included, it will occur at multiple leaves (with different values for its markers) and be processed multiple times, but necessarily so, and with different results in each case.

(In XSV (vintage Schema 1.0), the actual algorithm dispenses with Q, and simply recurses at 6.b.iv. This means that once it gets to the end of the include [children] in step 6, it can check and process the rest of the [children] and add the corresponding components to the schema component for the appropriate namespace, and no spurious duplicates will ever be encountered.)
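For concreteness, here is a Python sketch of Algorithm I, reusing the Node, Edge, Graph and add_edge definitions sketched under Notation above. The load function and the includes_and_imports and target_namespace accessors are stand-ins I've invented for the sketch, not XSV's actual interfaces:

  def algorithm_i(root_uri, load):
      """Sketch of Algorithm I (include and import only).

      load(uri) is assumed to return a schema-document object with
        .target_namespace       -- its target namespace, or None if it has none
        .includes_and_imports   -- (kind, schemaLocation) pairs, kind being
                                   "include" or "import", for the [children]
                                   with a schemaLocation attribute, in order
      Both accessors are invented for this sketch.
      """
      g = Graph()
      r = Node(root_uri, load(root_uri))                        # step 1
      g.nodes.append(r)
      q, p = [r], []                                            # step 2
      while q:                                                  # step 3
          n = q.pop()                                           # step 4
          if any(m.uri == n.uri and m.markers == n.markers      # step 5
                 for m in p):
              g.nodes.remove(n)                                 # 5.a
              for e in list(n.in_edges):                        # 5.b
                  g.edges.remove(e)
          else:                                                 # step 6
              p.append(n)                                       # 6.a
              for kind, loc in n.sd.includes_and_imports:       # 6.b
                  m = Node(loc, load(loc))                      # 6.b.i
                  g.nodes.append(m)
                  if kind == "include":
                      m.markers |= n.markers                    # 6.b.ii
                      if m.sd.target_namespace is None:         # chameleon inclusion
                          m.markers.add(n.sd.target_namespace)  # 6.b.iii
                  q.append(m)                                   # 6.b.iv
                  add_edge(g, Edge(n, m, kind))                 # 6.b.v
      return g, r                                               # a tree rooted at r

The order in which nodes are removed from Q doesn't appear to matter here, since a duplicate is dropped before it is expanded; for Algorithm O it does (see below).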

1.2. Algorithm O: override only

Ignoring include for the time being (and import and redefine).

  1. Add r = Node(root) to G
  2. Initialise a set of nodes to be processed Q = {r}, a set of nodes already processed P = {}
  3. If Q is empty, we're done
  4. Otherwise let n be a node removed from Q
  5. If there exists m in P such that m.uri = n.uri and there exist nn, tt, ii, jj, uu and vv such that <nn,tt,uu,ii> ∈ n.markers and <nn,tt,vv,jj> ∈ m.markers and it's not the case that both uu = vv and ii = jj, then there are conflicting overrides with respect to the component named nn of type tt; signal the relevant error and abandon override processing
    (Recovery to enable further error detection is undoubtedly possible.)
  6. If there exists m in P such that m.uri = n.uri and n.markers ⊆ m.markers, then
    a. Remove n from G
    b. For all edges e such that e.to = n, remove e from G
  7. Otherwise
    a. For each m in P such that m.uri = n.uri and m.markers ⊂ n.markers, remove the subtree rooted at m from G, P and/or Q, that is
      i. Remove m from G, and from P or Q
      ii. For all edges e ∈ m.in, remove e from G
      iii. For all edges e ∈ m.out, remove e from G, remove e.to from G and from P or Q as well, and all the edges in e.to.out, and so on, but without removing n if it is encountered at any point in this recursion
    b. Add n to P
    c. For each override EII (call it o) in n.sd.[children]
      i. Add m = Node(o/@schemaLocation) to G
      ii. Add all the members of n.markers to m.markers
      iii. For each declaration or definition EII in o.[children] (call it d), if n.markers does not contain a tuple of the form <d/@name,local-name(d),...>, then add a tuple of the form <d/@name,local-name(d),n.uri,o/position()> to m.markers
      iv. Add m to Q
      v. Add Edge(n,m,"override") to G
  8. Go to step 3

If the node removed from Q at step 4 is always among those with the largest marker set, I believe wasted traversals (those removed at 7.a) will be kept to a minimum.
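Again purely as a sketch, here is the corresponding Python, reusing the classes above. The overrides accessor, which I assume yields (position, schemaLocation, [(name, kind), ...]) triples for the override [children] of a schema document, is invented for the sketch, as is the rest of the surrounding machinery:

  class OverrideConflict(Exception):
      """Raised at step 5 when conflicting overrides are detected."""

  def conflicting(a_markers, b_markers):
      """Step 5: two marker sets conflict if they override the same (name, type)
      component from different places, i.e. with different (uri, position)."""
      sources = {(nn, tt): (uu, ii) for (nn, tt, uu, ii) in a_markers}
      return any((nn, tt) in sources and sources[(nn, tt)] != (vv, jj)
                 for (nn, tt, vv, jj) in b_markers)

  def algorithm_o(root_uri, load):
      """Sketch of Algorithm O (override only); load is as in algorithm_i,
      except that schema documents are assumed to have an .overrides accessor
      yielding (position, schemaLocation, [(name, kind), ...]) triples."""
      g = Graph()
      r = Node(root_uri, load(root_uri))                        # step 1
      g.nodes.append(r)
      q, p = [r], []                                            # step 2
      while q:                                                  # step 3
          q.sort(key=lambda x: len(x.markers))                  # prefer large marker sets
          n = q.pop()                                           # step 4
          for m in p:                                           # step 5
              if m.uri == n.uri and conflicting(n.markers, m.markers):
                  raise OverrideConflict(n.uri)
          if any(m.uri == n.uri and n.markers <= m.markers for m in p):
              g.nodes.remove(n)                                 # 6.a
              for e in list(n.in_edges):                        # 6.b
                  g.edges.remove(e)
              continue
          for m in [m for m in p                                # 7.a
                    if m.uri == n.uri and m.markers < n.markers]:
              prune_subtree(g, p, q, m, keep=n)
          p.append(n)                                           # 7.b
          for pos, loc, components in n.sd.overrides:           # 7.c
              m = Node(loc, load(loc))                          # 7.c.i
              g.nodes.append(m)
              m.markers |= n.markers                            # 7.c.ii
              for name, kind in components:                     # 7.c.iii
                  if not any(nn == name and tt == kind
                             for (nn, tt, _, _) in n.markers):
                      m.markers.add((name, kind, n.uri, pos))
              q.append(m)                                       # 7.c.iv
              add_edge(g, Edge(n, m, "override"))               # 7.c.v
      return g, r

  def prune_subtree(g, p, q, m, keep):
      """Step 7.a: remove the subtree rooted at m, but never the node keep."""
      if m is keep:
          return
      for collection in (g.nodes, p, q):                        # 7.a.i
          if m in collection:
              collection.remove(m)
      for e in list(m.in_edges) + list(m.out_edges):            # 7.a.ii and 7.a.iii
          if e in g.edges:
              g.edges.remove(e)
      for e in list(m.out_edges):                               # recurse, sparing keep (= n)
          prune_subtree(g, p, q, e.to, keep)

Sorting Q by marker-set size before each removal is one simple way of realising the preference just described for nodes with the largest marker sets.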

If this algorithm completes without error, once again we have a tree with uniquely labelled leaves. To avoid duplication of effort when there are multiple compatible overrides of the same schema document (that is, leaf nodes with the same uri but markers neither of which is a subset of the other), I think the following suffices:

  1. For all leaf nodes for the same schema document (that is with equal uris), take the union of all their markers and transform the schema document once, using the definitions and declarations identified by markers in that union to override local ones, to produce a new schema document
    Note that since these are leaf nodes, even if they contain override elements, these are not processed, since their targets must have been pruned at some point in O
  2. include the result in each node in the tree which has an edge ending at one of the leaf nodes and erase that edge
  3. Repeat until only r is left, at which point schema(r) can be constructed.

As specified, this algorithm doesn't handle include, but modifying it to treat <include schemaLocation="..."/> as if it were <override schemaLocation="..."/> should be straightforward.
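Here, under the same caveats as before, is a sketch of that bottom-up consolidation; transform and splice are invented stand-ins for "apply the overrides identified by these markers to this schema document" and "inline the result in place of the corresponding child" respectively:

  def consolidate(g, r, transform, splice):
      """Bottom-up consolidation of the tree produced by Algorithm O.

      transform(sd, markers) -- assumed to return a new schema document in which
          the declarations/definitions identified by markers override local ones
      splice(parent_sd, new_sd) -- assumed to inline new_sd in the parent in
          place of the corresponding override (or include) child
      Both are invented for this sketch.
      """
      while g.nodes != [r]:                                     # step 3: stop when only r is left
          leaves = [n for n in g.nodes
                    if not any(e.frm is n for e in g.edges)]    # no outgoing edges remain
          by_uri = {}                                           # step 1: group leaves by uri
          for n in leaves:
              by_uri.setdefault(n.uri, []).append(n)
          for uri, ns in by_uri.items():
              union = set().union(*(n.markers for n in ns))     # union of all their markers
              new_sd = transform(ns[0].sd, union)               # transform the document once
              for n in ns:                                      # step 2
                  for e in [e for e in g.edges if e.to is n]:   # each node with an edge to n
                      splice(e.frm.sd, new_sd)                  # include the result there
                      g.edges.remove(e)                         # and erase that edge
                  g.nodes.remove(n)
      return r                                                  # schema(r) can now be constructed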

2. Worked example

Consider the simple example offered by Mike Kay. The tree resulting from the application of Algorithm O is as follows:

["P.xsd",{}]
    |
["Q.xsd",{<'doc','element',"P.xsd",1>}]
    |
["P.xsd",{<'doc','element',"P.xsd",1>}]

And the associated documents will (schematically) look like this:

{doc as xs:date}
    |
{doc as xs:date}
    |
   {}

This, I take it, is the result Mike Kay wanted.

What happens if we start at Q?

["Q.xsd",{}]
    |
["P.xsd",{}]
    |
["Q.xsd",{<'doc','element',"P.xsd",1>}]
    |
["P.xsd",{<'doc','element',"P.xsd",1>}]

This may look as if it's the same as the previous case, but the algorithm doesn't think so:

{doc as anon, doc as xs:date}
    |
{doc as xs:date}
    |
{doc as xs:date}
    |
   {}

And that is not a conformant schema!

Mike, is that what you expected? Is it what the current resolution calls for? Is it right?