How to scope the note about D and override(E,D) (was Re: [Bug 12184] Circularity in xs:override) from C. M. Sperberg-McQueen on 2011-03-11 (www-xml-schema-comments@w3.org from January to March 2011)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Fri, 11 Mar 2011 14:22:11 -0700
To: ht@inf.ed.ac.uk (Henry S. Thompson)
Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, www-xml-schema-comments@w3.org
Message-Id: <E172B924-02D0-408E-AB97-4A7447C424CD@blackmesatech.com>
On Mar 11, 2011, at 10:47 AM, Henry S. Thompson wrote:
>> ...
> 
> In this connection, rereading the status quo section 4.2.5 [1], I am
> struck by a apparent (?) contradiction between the newly added text,
> beginning "If the above definition is naively translated. . ."  (now
> the first real Note after the override element tableau) and the _last_
> (third) Note after the definition of *target set*, which appears to
> _also_ be intended to address to the problem of circular chains of
> overrides.

This is the third in a series of three replies to Henry's message.  It
takes up a tangential issue, namely whether the note Henry mentions is
inappropriately narrow in scope.

On re-reading the note carefully today, I was struck by the narrow
formulation of the conditional in the second paragraph.  The note says
 
    If applying the override transformation ... to D and E results in
    a schema document equivalent to D ..., then the effect is the same
    as for a cyclic set of <include> references, or as for multiple
    inclusions of the same document (as described in ... §4.2.3).

But in the general case, if we have a document cycle D1, D2, ... Dn,
D1 where D1 contains an override element E1, and we are trying to
calculate schema(D1), it will often happen that the target set of E1
will include not override(E1, D1) but override(E', D1).  This will
happen if any of the override elements in D2 ... Dn have children
distinct from every child of E1.  E' will be the result of overlaying
E2 with E1, overlaying E3 with that result, overlaying E4 with that,
etc. until you have overlaid the override element of Dn with the
result of all the previous overlays.

Under some conditions, as illustrated in test 23 (starting at schema
document A), I think this situation is harmless, should be allowed by
our spec, and is in fact allowed by the status quo. In other cases
(e.g. test 23 starting at schema document B) it won't be harmless,
should not be allowed, and is not allowed.  It might be nice if the
note were a little clearer on the subject, but given the difficulty we
are having explaining things to each other in the WG, that may prove
hard to do.

On the other hand, there may be a good reason for the note to call out
the edge case where the target set of E includes override(D,E),
instead of discussing the general case where it includes
override(E',D): it is precisely the case which arises whenever an
override element E points into a set of schema documents with one or
more include cycles.  To take a concrete example, assume we have
schema documents D1, D2, D3, D4.

  D1 overrides D2 with override element E.
  D2 includes D3.
  D3 includes D4.
  D4 includes D3.

The result of applying the override transform here will be a cycle of
the form described in the note: 

  D1 overrides D2 with override element E.
  D2 overrides D3 with override element E.
  D3 overrides D4 with override element E.
  D4 overrides D3 with override element E.

So the target set of E in schema document D1 is:

  override(E, D2)
  override(E, D3)
  override(E, D4)

Each override element in the cycle (i.e. the versions of E in
override(E,D3) and override(E,D4) will have the same children (for
some definition of sameness), and once a naive queuing or stacking
implementation has been around the cycle once, there is no need to go
around more times.

At the risk of flogging a dead horse, here is a more procedural
account of how a simple document-queuing implementation might handle
this case.  (To skip the example, look for the end-tag.)

<flogging>

Our implementation will keep track of what schema documents it has to
handle, and what documents it has already handled, by means of sets
called QUEUE and FINISHED, and it will keep track of what it's reading
right now with a variable called CURRENT.  For concreteness, we'll
assume that these variables all use 'schema document designators',
defined as follows.

  - A URI is a schema document designator.
  - If D is a schema document designator and E is an override element,
    then "override(E,D)" is a schema document designator.
  - If D is a schema document designator and N is a namespace name,
    then "chameleon(N,D)" is a schema document designator.

1 We start by trying to calculate schema(D1).  So

  CURRENT = null
  QUEUE = { D1 }
  FINISHED = {}

2 We process D1 by taking the builtins, and the components defined by
D1 itself, and queuing any other schema documents D1 refers to.  D1
has just the override of D2 by element E, so as soon as we read that
override element, we add "override(E,D2)" to the queue, so we have

  CURRENT = D1
  QUEUE = { override(E,D2) }
  FINISHED = {}

Algebraically:

  schema(D1) = b + immed(D1) + schema(override(E,D2))

3 When we're done with D1, we can add D1 to the list of schema
documents we've already processed.

  CURRENT = null
  QUEUE = { override(E,D2) }
  FINISHED = { D1 }

4 Now we turn to the next item in the queue, namely override(E,D2).

Its corresponding schema includes the builtins, but we already have
those.  It includes the components described in the document, which
are unproblematic as long as they don't conflict with anything we've
already got.  And since D2 has an include of D3, override(E,D2) has an
override element with the same children as E, but pointing at D3.

So when we start processing the schema document, we have

  CURRENT = override(E,D2) 
  QUEUE = {}
  FINISHED = { D1 }

When we see the override element, we add it to QUEUE and when we're
done with the document, we have

  CURRENT = null
  QUEUE = { override(E,D3) }
  FINISHED = { D1, override(E,D2)  }

Algebraically:

  schema(D1) = b + immed(D1) + schema(override(E,D2))
             = b + immed(D1) + b + immed(override(E,D2)) + schema(override(E,D3))
             = b + immed(D1) + immed(override(E,D2)) + schema(override(E,D3))

5 The next item in the queue (the override of D3 by E) is similar.
When we have finished processing it, we have

  CURRENT = null
  QUEUE = { override(E,D4) }
  FINISHED = { D1, override(E,D2), override(E,D3)  }

Algebraically:

  schema(D1) = b + immed(D1) + schema(override(E,D2))
             = b + immed(D1) + b + immed(override(E,D2)) 
               + schema(override(E,D3))
             = b + immed(D1) + immed(override(E,D2)) 
               + b + immed(override(E,D3)) + schema(override(E,D4))

6 Now we process the next item in the queue, the override of D4 by E.
Recall that D4 includes D3, so override(E,D4) will include an override
element pointing to D3 with the same children as E,
i.e. override(E,D3).

Instead of adding override(E, D3) to the QUEUE, however, the processor
notices that we have already included it (it's in the FINISHED set).
So we don't need to add it to the queue.  When we finish processing
the override of D4, therefore, the variables have this state:

  CURRENT = null
  QUEUE = {}
  FINISHED = { D1, override(E,D2), override(E,D3), override(E,D4) }

The queue is empty, so we're done.

Algebraically:

  schema(D1) = b + immed(D1) + schema(override(E,D2))
             = b + immed(D1) + b + immed(override(E,D2)) 
               + schema(override(E,D3))
             = b + immed(D1) + immed(override(E,D2)) 
               + b + immed(override(E,D3)) 
               + schema(override(E,D4))
             = b + immed(D1) + immed(override(E,D2)) 
               + b + immed(override(E,D3)) 
               + b + immed(override(E,D4)) + schema(override(E,D3))
             = b + immed(D1) + immed(override(E,D2)) 
               + immed(override(E,D3)) 
               + immed(override(E,D4)) 

The last step (elimination of duplicate b and of the trailing item
schema(override(E,D3))) is justified in the usual way: we know from
step 2 in the development that schema(D1) includes
schema(override(E,D3)), and therefore it's already taken care of.
This is exactly the same logic required to eliminate the trailing term
in the algebraic development of a cyclic include.  See, for example,
section 3.8 in the posting

  http://lists.w3.org/Archives/Member/w3c-archive/2009Jan/0053.html

of 12 January 2009.  If it makes sense as a legal move for cyclic
includes, it must necessarily make sense as a legal move for cyclic
overrides. And the WG agreed some time ago that the spec should make
it explicit that it's a legal move for cyclic includes.  

The note Henry points us to simply observes that the move is also
legal for cyclic redefines.

</flogging>

In our discussions of cyclic overrides, we have several times observed
that it would simplify things (for us, if not for users) to forbid
cycles of overrides.  But we have shied away from that because most
members of the working group do not wish to forbid cycles of includes.
Override cycles and include cycles need to be treated in parallel,
though, for two reasons: first, they present essentially the same
logical problem (as illustrated by the algebraic treatments of
parallel examples) and treating them differently would be a horrible
design errors.  And second our handling of 'scoped transformation'
turns cycles of includes into cycles of overrides.

('Scoped transformation', it may be helpful to recall, is the term we
invented to describe our rule that the user of an override should not
have to point to the specific document containing the declaration to
be overridden, but should be able to point to a top-level driver that
includes it, possibly at a remove.)

In our recent teleconferences, I have sometimes thought it might be
helpful to clone the discussion about cyclic includes at the end of
4.2.3 and modify it slightly to cover the case of cyclic includes
which are turned into cyclic overrides in a scoped transformation.
What I discover, looking carefully at the note Henry points us to, is
that we already have just such a note: the narrow case covered by this
note is precisely the case which will be generated by an override
pointing into a cluster of schema documents with a cyclic include.
(As illustrated, the immediate target of the starting override [D2 in
the example] doesn't have to be part of the cycle.)

So perhaps the note's focus on the edge case is justified after all.

As for the more general case, I think it is covered by the general
rule specified in 4.2.3 concerning 'include' elements:

    If [two include elements] specify different schema locations, 
    then they refer to different schema documents, unless the 
    implementation is able to determine that the two URIs are 
    references to the same resource.

Here the spec explicitly gives permission to a processor to determine
that two schema-document references are references to the same schema
document. I think it follows automatically that a processor is allowed
to detect a situation where (a) a schema document D contains an
override element E, and the target set of E includes not override(E,D)
but override(E',D), and (b) the resource denoted by override(E',D) is
"the same resource" as D.  (At this point, the ineffable mysteries of
web architecture block us from further reasoning.)

And once I have detected that D and override(E', D) are "the same
resource", I am allowed to snip the cycle, by (a) the rules governing
include and (b) the statement in the spec that the meaning of an
override of a schema document SD is the same as that of an include for
a different schema document override(E,SD).

By the same token, a processor written to be less aggressive about
detecting resource identity is allowed to fail to detect the identity.
Unfortunately, there is as you know no agreement in the WG about what
they are allowed or required to do then: some read the XML mapping
rules and the spec's remarks about forestalling "the need to establish
identity component by component" as making clear that the processor
ought to detect, when processing override(E',D) that it is not seeing
new components but only redundant descriptions of components already
known.  Others tell me that they read the spec as allowing processors
to avoid "the need to establish identity component by component" just
by closing their eyes and clicking their heels three times.  I believe
that processors built along those lines will often object to
override(E',D) on the grounds that it provides duplicate declarations
for top-level items.

I conclude that the note Henry points to is not in fact unduly narrow
in scope: it deals with an important special case, and the general
case is already covered by the rules for include.

It may be possible to make the note clearer; having a better
understanding of where people are having trouble understanding it
might help.

I hope this helps.


-- 
****************************************************************
* C. M. Sperberg-McQueen, Black Mesa Technologies LLC
* http://www.blackmesatech.com 
* http://cmsmcq.com/mib                 
* http://balisage.net
****************************************************************
Received on Friday, 11 March 2011 21:22:43 UTC