RE: XML Schema WG Comments on XInclude from Jonathan Marsh on 2002-01-09 (www-xml-xinclude-comments@w3.org from January 2002)

From: Jonathan Marsh <jmarsh@microsoft.com>
Date: Wed, 9 Jan 2002 10:27:28 -0800
To: <w3c-xml-schema-wg@w3.org>
Cc: <www-xml-xinclude-comments@w3.org>, "XML Core WG" <w3c-xml-core-wg@w3.org>
Message-ID: <330564469BFEC046B84E591EB3D4D59C049AC227@red-msg-08.redmond.corp.microsoft.com>
> -----Original Message-----
> From: Mary Holstege <holstege@mathling.com>
> Message-Id: <15151.59251.704000.128392@gargle.gargle.HOWL>
> Date: Tue, 19 Jun 2001 16:59:47 -0700
> To: www-xml-xinclude-comments@w3.org, w3c-xml-core-wg@w3.org
> Subject: XML Schema WG Comments on XInclude
> 
> 
> XML Schema Working Group comments on XML Inclusions Last Call Working
> Draft

Thank you for these comments.  We apologize for the long response time.  XInclude has at times taken a back seat to other work in the Core WG, in part because of dependencies on other specs like XPointer.  On top of that, your comments prompted long discussions in the group.

The resolutions are documented at http://www.w3.org/XML/Group/2002/01/xinclude-comments.html and reflected in the latest internal draft at http://www.w3.org/XML/Group/2002/01/WD-xinclude-20020107.

If any of our resolutions below prove not to be acceptable to the Schema WG, please let us know as soon as possible, so we can resolve the issue prior to issuing a CR draft.

> We believe that the XInclude specification defines a foundation
> specification that has to be harmonized carefully with the other
> foundation specifications. The following points outline our concerns
> with the specification as it stands;
> 
> (1) While the Infoset specification countenances synthetic infosets
>     that do not maintain the normal consistency relations of infosets
>     created directly by parsing XML, we consider it a poor idea in
>     general to take advantage of that laxity, particularly in the case
>     of such a foundational specification.
> 
>     We believe the XInclude specification must be crystal clear which
>     Infoset properties are adjusted, and how, and further that it
>     should specify rules so that core invariants are maintained. Since
>     a downstream application has no markers in an Infoset that the
>     XInclude process has occurred, it is unacceptable to create an
>     Infoset that cannot be processed in the normal way. The XInclude
>     specification itself highlights several situations that call out
>     for special processing. We call on the specification not to
>     satisfy itself with highlighting the problems, but to solve
>     them. Among these are:
> 
>     namespace handling

We have agreed that [namespace declarations] and [in-scope namespaces] should always be synchronized: given one, you can calculate the other.  To achieve this, we now require additional [namespace declarations] to be added when a fragment is included.  We also require that the [in-scope namespaces] of a fragment be augmented to reflect the new scope of namespace declarations in the resulting document.  As a consequence, the result document can be serialized and reparsed with no effect on the values of either [namespace declarations] or [in-scope namespaces].
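As a rough illustration of the synchronization described above (a sketch under stated assumptions, not text from the draft): the function names, the dict representation of the two properties, and the handling of the default namespace via xmlns="" are all assumptions made for this example.

```python
def fix_up_namespaces(fragment_scope, target_scope, fragment_decls):
    """Hypothetical helper: return (declarations, in_scope) for an included element.

    Every argument maps prefix -> namespace name; "" stands for the default
    namespace. fragment_scope is what was in scope in the source document,
    target_scope is what is in scope at the point of inclusion.
    """
    # Bindings of the including document stay visible unless the fragment rebinds them.
    in_scope = {**target_scope, **fragment_scope}
    declarations = dict(fragment_decls)
    # Declare every binding the fragment relies on that the new context lacks.
    for prefix, uri in fragment_scope.items():
        if target_scope.get(prefix) != uri:
            declarations[prefix] = uri
    # Assumption: a fragment without a default namespace must not inherit one.
    if "" in target_scope and "" not in fragment_scope:
        declarations[""] = ""  # would serialize as xmlns=""
        del in_scope[""]
    return declarations, in_scope


def scope_after_reparse(target_scope, declarations):
    """Recompute the in-scope namespaces a parser would see after serialization."""
    scope = dict(target_scope)
    for prefix, uri in declarations.items():
        if uri == "":
            scope.pop(prefix, None)  # xmlns="" removes the default namespace
        else:
            scope[prefix] = uri
    return scope


decls, scope = fix_up_namespaces(
    fragment_scope={"a": "urn:a"},
    target_scope={"b": "urn:b", "": "urn:default"},
    fragment_decls={},
)
```

The second helper checks the round-trip property: recomputing the scope from the added declarations yields the same in-scope set that the fix-up produced.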

>     base handling

We have agreed to add xml:base attributes to record changes in the [base URI] property.  We still encourage users to rely on the [base URI] property as definitive, because there is no standard serialization for processing instructions with a different base URI than their parent.
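A minimal sketch of recording a changed base URI, assuming an ElementTree representation of the included element. The function name record_base and the example.org URIs are hypothetical; the draft itself defines the behavior in terms of infoset properties, not this API.

```python
import xml.etree.ElementTree as ET

# Clark-notation name of the xml:base attribute.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"


def record_base(included, included_base, parent_base):
    """Add xml:base when the included element's base URI differs from its new parent's."""
    if included_base != parent_base:
        included.set(XML_BASE, included_base)
    return included


chapter = ET.fromstring("<chapter/>")
record_base(
    chapter,
    included_base="http://example.org/parts/ch1.xml",  # hypothetical URI
    parent_base="http://example.org/book.xml",         # hypothetical URI
)
serialized = ET.tostring(chapter, encoding="unicode")
```

Note that this only covers element information items; as the paragraph above says, a processing instruction with a different base URI has no comparable serialization.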

>     name collisions on notations and entities

The options available for normalizing the document are to rely on the [references] property to disambiguate clashing notations and entities, or to rename notations and entities, which may constitute "corruption" of the infoset itself.  In light of the lack of use cases for this feature, we opted to make it an error to include documents with clashing notations or entities.  Thus the result is guaranteed to be serializable as XML, or you will get no result at all.
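The error rule could be sketched roughly as follows. This assumes that "clashing" means the same name bound to different identifiers, and that exact duplicates are harmless; both are readings of the paragraph above rather than statements from the draft. InclusionError and merge_unparsed_entities are hypothetical names.

```python
class InclusionError(Exception):
    """Hypothetical error type: inclusion fails and no result infoset is produced."""


def merge_unparsed_entities(existing, incoming):
    """Merge two maps of name -> (system_id, public_id, notation_name).

    Assumption: an identical duplicate is tolerated, while the same name
    with a different definition is treated as a clash.
    """
    merged = dict(existing)
    for name, definition in incoming.items():
        if name in merged and merged[name] != definition:
            raise InclusionError(f"clashing unparsed entity: {name!r}")
        merged[name] = definition
    return merged


doc_entities = {"logo": ("logo.gif", None, "gif")}
ok = merge_unparsed_entities(doc_entities, {"photo": ("photo.jpg", None, "jpeg")})

try:
    merge_unparsed_entities(doc_entities, {"logo": ("other.gif", None, "gif")})
    clash_detected = False
except InclusionError:
    clash_detected = True
```

The same shape would apply to notations, with the tuple holding the notation's system and public identifiers.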

>     PSVI properties
> 
>         We find the statement that PSVI properties be carried across
>         untouched particularly troubling: this decision makes it
>         impossible to build reliable type-aware applications in an
>         environment where XInclude processing may occur.

The result infoset has a different structure than the input documents, and it is unreasonable to expect that in general the same schema can govern both the pre-inclusion and post-inclusion structure.  Thus we decided by default to remove PSVI and other unknown properties from the result and allow a clean reapplication of a schema to govern the typing of the result.  However, we also recognize the value of keeping properties intact when that is possible.  So we have provided a user option to preserve properties "correctly" although what that means is outside the scope of this specification.  This would, for instance, allow the Schema group or some other body to come up with a specification for PSVI property fix-up that could be used in conjunction with XInclude.

>    We respectfully dissent and request that:
>     (a) the specification enumerate precisely which infoset properties are
>         affected by the inclusion operation, and how they are affected;
>     (b) the specification require that infoset consistency be preserved;
>     (c) most particularly that PSVI properties be either carried across
>         so as to maintain consistency or not be carried across at all.

We believe we have fulfilled these requests.

> (2) We are deeply concerned that the XInclude specification interferes
>     with meaningful type-aware processing. Some of the arguments are
>     similar to those of
>     http://tigger.uic.edu/~cmsmcq/tech/xml/munging.html. By raising
>     an infrastructural process to the same architectural level as an
>     application process, an ambiguity arises. Since it is not possible
>     to know whether XInclude will be applied before or after validation
>     it becomes difficult to write schemas (and/or DTDs) that correctly
>     describe instances that use XInclude, requiring the schema to either
>     use an overabundance of disjunctions (xml:include | myElement)
>     throughout the schema or "lie" at some point in the processing about the logical
>     structure of the instances. Ubiquitous disjunctions are non-trivial
>     to implement and may substantially harm the logical model of a schema.
> 
>     Some of our members have suggested that replacing the magic
>     element with a magic type or a magic wildcard (any) would smooth
>     the integration, but we have no consensus or concrete proposals at
>     this time.
> 
>     In general, there are architectural questions raised by the
>     ambiguities inherent in combining, for example, XInclude with
>     type-aware XPath. We believe these questions must be carefully
>     considered and resolved. We recognize that resolving these
>     questions should not fall solely on the XInclude specification
>     alone: they are larger questions.
> 
>     We hope to work with the Core WG to help resolve these important
>     architectural questions, which we believe must be resolved, and look
>     forward to the Processing Model Workshop as a forum for progress on
>     these issues.

XInclude is inherently a transformational process.  A schema describing the pre-inclusion document will not in general describe a post-inclusion document.  It is outside the scope of XInclude to describe how to associate a schema with a document, whether this document is the source of an XInclude transformation, or the result.  This exactly parallels XSLT.  While we welcome efforts to make progress on the larger issues, we don't think XInclude itself should attempt a solution, nor be held up awaiting such a solution.
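To make the pre-inclusion versus post-inclusion difference concrete, here is a small sketch using Python's standard xml.etree.ElementInclude module with an in-memory loader. The RESOURCES table, the file name chapter1.xml, and the book/chapter vocabulary are invented for illustration.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

XI = "http://www.w3.org/2001/XInclude"

# Hypothetical in-memory "resources", keyed by href.
RESOURCES = {
    "chapter1.xml": "<chapter><title>One</title></chapter>",
}


def loader(href, parse, encoding=None):
    """Resolve href from RESOURCES instead of fetching anything."""
    data = RESOURCES[href]
    if parse == "xml":
        return ET.fromstring(data)
    return data


def child_tags(elem):
    """List the element names of the direct children."""
    return [child.tag for child in elem]


source = ET.fromstring(
    f'<book xmlns:xi="{XI}"><xi:include href="chapter1.xml"/></book>'
)
before = child_tags(source)   # the content model a pre-inclusion schema must allow
ElementInclude.include(source, loader=loader)
after = child_tags(source)    # the content model a post-inclusion schema must allow
```

Before inclusion the only child of book is the xi:include element; afterwards it is a chapter element, so a content model written for one form does not, in general, accept the other.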

> (3) We consider it a mistake to erase all record that XInclude
>     processing has occurred. This damages the usability of this
>     specification for many applications, such as distributed editing,
>     document packaging, and so on. Leaving a trace may well be part of a
>     solution to (2) above. We do not find the fact that the current
>     Infoset specification does not mandate properties recording a trace of
>     external entities a reason for XInclude to do likewise for two
>     reasons:
>     (1) some feel that that decision for Infoset was not a wise one, and
>     (2) XInclude processing, unlike external entity resolution, is not
>     guaranteed to occur before parsing and validation (and indeed that is
>     the point of using an XML syntax for inclusion!). The preponderance of
>     the opinion in the Schemas WG was that this is a very important issue
>     that must be addressed, although a minority felt it was less crucial.

We had such a record in earlier drafts, but removed it in the interests of simplicity.  Since the Infoset is not a definitive list of information contained in an XML document or a record of the processing history of the information, nothing we are doing precludes such information from being carried along as infoset extensions or by other mechanisms.  Conversely, adding such a capability to the spec would not require XInclude processors to make such information available to downstream applications, or require those applications to interpret such information in any particular way.
 
> (4) We wonder why the decision was made to specifically violate the
>     RFCs for how fragment identifiers should be interpreted, in favour
>     of a mandated interpretation. We do not consider it wise, in
>     general, to run counter to the relevant IETF specifications. We do
>     not see the rationale of forbidding, say, a schema-specific
>     pointing syntax defined at the logical component model level being
>     used with XInclude to compose schema documents. We raise this as a
>     general architectural question and ask for clarification of the
>     rationale.

By "violating the RFCs" for fragment identifiers, we take you to mean that we assume the fragment syntax is that of text/xml instead of whatever the particular media type says it might be.  This choice is based on practical interoperability.

First of all, XInclude operates on infosets.  The resource fetched from a URI must be converted into an infoset.  We provide two mechanisms to accomplish this.  The resource can be parsed as an XML document (parse="xml"), or converted to a list of character information items (parse="text").  These two mechanisms for generating an infoset can be specified precisely, reasonably implemented, and tested for compliance.  This would not hold if we allowed arbitrary fragment syntaxes to be converted to infosets in a manner not specified by this or other W3C specs.  Fragment syntax extensibility is in conflict with interoperability in this case, and we saw no value in providing such extensibility for version 1.0.

An alternative would be to fail any resource not returned as text/xml or application/xml when parse="xml".  This would ensure that the fragment syntax and the media type are always in sync, but would preclude many useful scenarios, such as image/svg and application/*+xml.  XML well-formedness already weeds out attempts to include non-XML resources such as image/jpeg.  Furthermore, if XPointer is adopted, we expect the XPointer extensibility mechanism to allow additional fragment syntax and semantics to be added to XML-based media types without hindering the interpretation of such fragments as pure XPointers by applications such as XInclude.
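The two conversions, and the way well-formedness rejects non-XML input regardless of media type, can be sketched as follows. The function to_infoset and its encoding parameter are assumptions for this example; an ElementTree element and a Python list of characters stand in for the infoset and the character information items.

```python
import xml.etree.ElementTree as ET


def to_infoset(data, parse="xml", encoding="utf-8"):
    """Hypothetical helper: convert fetched bytes according to the parse attribute."""
    if parse == "xml":
        # Raises ET.ParseError if the resource is not well-formed XML.
        return ET.fromstring(data)
    if parse == "text":
        # One list entry per character stands in for a character information item.
        return list(data.decode(encoding))
    raise ValueError(f"unsupported parse value: {parse!r}")


# An SVG resource parses as XML whatever media type it was served with.
svg = to_infoset(b'<svg xmlns="http://www.w3.org/2000/svg"/>')

# The same kind of bytes included as text yield plain characters.
chars = to_infoset(b"<a/>", parse="text")

# The leading bytes of a JPEG file are not well-formed XML, so inclusion fails.
try:
    to_infoset(b"\xff\xd8\xff\xe0")
    rejected = False
except ET.ParseError:
    rejected = True
```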

We also note that we specifically cleared our plan to "cast" resources into xml or text and apply the corresponding fragment syntax with TimBL, since he has often commented on such matters.  If you can more accurately describe the abuse you perceive and the consequences of it perhaps we can revisit this issue.


> (5) The included XML Schema fragment does not quite capture the
>     expressed constraints. We suggest that the attribute 'parse'
>     should be defined with use='default' and value='xml' and that the
>     anyAttribute be defined with namespace='##other'. Also the DTD
>     specifies that the include element must be empty while the schema
>     specifies that the include element can have character information
>     item children.
> 
>     We suggest the schema should be;
> 
> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
>            xmlns:xi="http://www.w3.org/2001/XInclude"
>            targetNamespace="http://www.w3.org/2001/XInclude">
> 
>   <xs:element name="include">
>     <xs:complexType>
>       <xs:attribute name="href" type="xs:anyURI" use="required" />
>       <xs:attribute name="parse" use="optional" default="xml" >
>         <xs:simpleType>
>           <xs:restriction base="xs:string">
>             <xs:enumeration value="xml"/>
>             <xs:enumeration value="text"/>
>           </xs:restriction>
>         </xs:simpleType>
>       </xs:attribute>
>       <xs:attribute name="encoding" use="optional" type="xs:string" />
>       <xs:anyAttribute namespace="##other" />
>     </xs:complexType>
>   </xs:element>
> 
> </xs:schema>

This is an improvement.  A further improvement might be to indicate in this non-normative sample that the element content is wide open.  I believe this requires us to retain the mixed="true" attribute, and to add <xs:any> as follows:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.w3.org/2001/XInclude">

  <xs:element name="include">
    <xs:complexType mixed="true">
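      <!-- xs:any has to appear inside a model group; the occurrence bounds and
           processContents value are assumed here to express "wide open" content -->
      <xs:sequence>
        <xs:any minOccurs="0" maxOccurs="unbounded" processContents="lax"/>
      </xs:sequence>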
      <xs:attribute name="href" type="xs:anyURI" use="required" />
      <xs:attribute name="parse" use="optional" default="xml" >
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="xml"/>
            <xs:enumeration value="text"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
      <xs:attribute name="encoding" use="optional" type="xs:string" />
      <xs:anyAttribute namespace="##other" />
    </xs:complexType>
  </xs:element>

</xs:schema>

In addition, we need to extend this schema definition in the draft to accommodate the fallback mechanism added at the request of the HTML WG.  Please review the latest working draft and see if you have additional suggestions.

> (6) We are doubtful whether it is appropriate to mandate normalized
>     characters in all circumstances. We reiterate our comments on the
>     Character Model for the Web:
> 
>     "Early uniform normalization appears to have a laudable goal, but
>     it is not clear that it is a reliable way, let alone the best way,
>     to achieve that goal. It places a heavy burden on
>     footprint-constrained software, and (as defined in this document)
>     leaves downstream users more or less at the mercy of upstream
>     software over which they have no control. We believe serious
>     attention should be given to other normalization forms for Unicode
>     (e.g. the decomposed normal form) and to other regimes for
>     deciding who should normalize when."
> 
>     We raise this as a general important architectural question, and
>     suggest that if the Character Model specification backs off from
>     requiring early normalization, the XInclude specification do likewise.

If the Character Model backs off from early uniform normalization (their latest document does not do so), we will gladly remove this requirement.  Our motivation for providing it is simply to be good W3C citizens in this matter.  Although not reflected in the Jan 7th draft, we plan to weaken this language a bit and simply reference the Character Model.

Thank you again for your comments.

- Jonathan Marsh
Received on Wednesday, 9 January 2002 13:49:34 UTC