XML Literals, Fragments, and Canonicalization

Author: Jeremy Carroll

This memo considers the problem of XML literals in RDF (rdfms-literal-is-xml-structure). We consider the relationship with XML Fragment Interchange and Canonical XML.
The underlying problem is that an XML fragment need not be self-contained, but may use namespace prefixes and/or references that refer to other parts of the XML document.

The Issue

M&S specifies the following:
P33 - Statements
The object of a statement (i.e., the property value) can be another resource or it can be a literal; i.e., a resource (specified by a URI) or a simple string or other primitive datatype defined by XML. In RDF terms, a literal may have content that is XML markup but is not further evaluated by the RDF processor.
parseLiteral
[6.32] parseLiteral   ::= ' parseType="Literal"'
literal
[6.34] literal        ::= (any well-formed XML)
P202
if parseType="Literal" is specified in the start tag of E then v is the content of E (a literal).
P203
The value 'Literal' specifies that the element content is to be treated as an RDF/XML literal; that is, the content must not be interpreted by an RDF processor.
P203 (subpara)
The RDF Model and Syntax Working Group acknowledges that the parseType='Literal' mechanism is a minimum-level solution to the requirement to express an RDF statement with a value that has XML markup. Additional complexities of XML such as canonicalization of whitespace are not yet well defined. Future work of the W3C is expected to resolve such issues in a uniform manner for all applications based on XML. Future versions of RDF will inherit this work and may extend it as we gain insight from further application experience.
P212
or if parseType="Literal" is specified in the start tag of E then v is the content of E (a literal).
P214
The attribute parseType="Literal" specifies that the element content is an RDF literal. Any markup that is part of this content is included as part of the literal and not interpreted by RDF
P220
This specification does not state a mechanism for determining equivalence between literals that contain markup, nor whether such a mechanism is guaranteed to exist.
Values Containing Markup
The precise representation of the resulting value is not specified here.
The section also gives a MathML example in which the default namespace is significant but is not duplicated in the literal.
P282
The content of a literal is not interpreted by RDF itself and may contain additional XML markup. Literals are distinguished from Resources in that the RDF model does not permit literals to be the subject of a statement.
Implementations that treat an XML literal value simply as the string of characters actually present in the source document seem to me to follow the spirit of M&S.
However, it is also desirable to allow implementations that are truly on top of the Infoset to be conformant, and these cannot recreate things that are not in the Infoset, in particular the original string.
P220 is an escape clause for this problem.
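To make the point about the original string concrete, here is a minimal sketch in Python. It uses the standard library's ElementTree as a stand-in for a processor that works from the parsed tree; ElementTree is not an Infoset implementation, and the two source strings are my own invented examples.

```python
import xml.etree.ElementTree as ET

# Two source strings that differ in quoting, spacing, and the use of a
# character reference, but that describe the same parsed content.
source_a = "<a  b='1'>&#65;</a>"
source_b = '<a b="1">A</a>'


def round_trip(text: str) -> str:
    """Parse the text and serialize the resulting tree again."""
    return ET.tostring(ET.fromstring(text), encoding="unicode")


# After parsing, nothing records which of the two spellings was used,
# so both sources serialize to the same string.
print(round_trip(source_a))
print(round_trip(source_b))
```

A processor in this position can compare the parsed content of two literals, but it cannot hand back the characters that were in the source document.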
A simple string pulled out of an XML file is not necessarily well-formed, and may have a different meaning when put back into another XML file. For example, it can make use of namespaces that are not specified in the new context, or are specified differently; it can make use of entity and character references, which perhaps should be expanded before extraction; and its meaning may depend on XML attributes that are in scope but not included in the fragment, e.g. xml:space="preserve" or xml:lang="rom" or xml:future="not yet defined".
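The following sketch illustrates the first two of these cases, assuming Python's standard ElementTree parser. The document, the namespace URI http://example.org/math, and the helper name parses_alone are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# A small document in which a prefix and an entity are declared once,
# outside the pieces we are about to extract.
document = (
    '<!DOCTYPE doc [<!ENTITY author "me">]>'
    '<doc xmlns:m="http://example.org/math" xml:lang="en">'
    '<m:expr>x</m:expr><p>Written by &author;.</p>'
    '</doc>'
)

# The whole document parses, because the prefix and entity are declared.
ET.fromstring(document)


def parses_alone(fragment: str) -> bool:
    """Report whether the fragment is well-formed with no surrounding context."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Both strings occur verbatim in the document above.
prefixed = '<m:expr>x</m:expr>'              # uses the undeclared prefix "m"
with_entity = '<p>Written by &author;.</p>'  # uses the undeclared entity "author"

print(parses_alone(prefixed))
print(parses_alone(with_entity))
```

Neither extracted string parses on its own. The xml:lang case is quieter: a fragment taken from inside the document would still parse, but would silently lose the language information.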

XML Fragments

The XML Fragment Interchange spec shows how to take a fragment out of an XML document, together with the context of that fragment, so that the pair (the fragment and the context) allows a full understanding of the fragment. Hence, references are not expanded in this process, but are included unexpanded. The spec does not specify a mechanism for relating the pair, but gives some non-normative examples. In the following MIME example, taken from XML Fragment Interchange section C.2, the fragment (part3) is understood in terms of the context (the fourth part), which refers to the first and second parts of the message. Thus the reference &author; in the fragment can be expanded to "me".
 
And here is an example of MIME packaging used to transmit the fragment context specification, the fragment body, the internal subset, and the external entity within a single stream such as a mail message:
Content-Type: multipart/related; boundary="/04w6evG8XlLl3ft";type="text/xml"

--/04w6evG8XlLl3ft
Content-Type: text/xml; charset=us-ascii
Content-ID: <part1>
Content-Disposition: attachment; filename="mybook.decls"

<!ENTITY title "My Book">
<!ENTITY author "me">
<!ENTITY try SYSTEM "cid:part2" NDATA CGM-BINARY>

--/04w6evG8XlLl3ft
Content-Type: image/cgm
Content-ID: <part2>
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="try.cgm"

ACEAABAiAAEQXwBEQyJTb3VyY2U6IEhTSSAvV01GLXRvLUNHTSBmaWx0ZXIg
LyBWZXJzaW9uIDEuMzUgIiAiRGF0ZTogMTk5OS0wMS0xNyIRZgAB//8AARBi
AAAQpgAAAAkAFxFGAAAA////EYQwIgAQEYogyAAAAAB//3//AAARvwC3C1RJ
TUVTX1JPTUFODFRJTUVTX0lUQUxJQwpUSU1FU19CT0xEEVRJTUVTX0JPTERf
SVRBTElDCUhFTFZFVElDQRFIRUxWRVRJQ0FfT0JMSVFVRQ5IRUxWRVRJQ0Ff
Qk9MRBZIRUxWRVRJQ0FfQk9MRF9PQkxJUVVFB0NPVVJJRVIOQ09VUklFUl9J
VEFMSUMMQ09VUklFUl9CT0xEE0NPVVJJRVJfQk9MRF9JVEFMSUMGU1lNQk9M
ABHOAAABQgABAUEABAMqLToR4gABAGEAACAmAAE9NJ9IIEIAASBiAAAgggAA
IKIAACDI95D0wAhqCzoAAACAQWj5cAa5/TEJikGGAogCUQGQUGIACEAo+dD/
+v7g+TpRYgACUkwAAQAEAAAAAAAAAABRgBxUggAAABkAGQAAFKCAAJAkAEg/
MoAAQlTb21lIFRleHQAoABA

--/04w6evG8XlLl3ft
Content-Type: text/xml; charset=us-ascii
Content-ID: <part3>
Content-Disposition: attachment; filename="chapter3.xml"

      <p>This is a paragraph within the third chapter within
the first part of a Docbook <quote>book</quote> document.</p>
      <p>And this is a succeeding paragraph.</p>
      <p>And an internal text entity reference &author;.</p>
      <p>And a reference to an unparsed entity (a CGM graphic):
         <graphic entityref="try"></graphic></p>

--/04w6evG8XlLl3ft
Content-Type: text/xml; charset=us-ascii

<?xml version="1.0"?>
<f:fcs xmlns:f="http://www.w3.org/2001/02/xml-fragment"
       xmlns="http://www.oasis-open.org/docbook/docbook/3.0/docbook.dtd"
       extref="http://www.oasis-open.org/docbook/docbook/3.0/docbook.dtd"
       intref="cid:part1">
    <book>
      <part>
        <chapter type="intro"/>
        <chapter/>
        <chapter>
            <f:fragbody fragbodyref="cid:part3"/>
        </chapter>
      </part>
    </book>
</f:fcs>

--/04w6evG8XlLl3ft--
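
As a rough sketch of how the context makes the fragment intelligible, the entity declarations from part1 can be combined with a paragraph from part3 so that &author; expands to "me". This assumes Python's standard ElementTree parser. Wrapping the fragment in a DOCTYPE is my own shortcut for illustration, not the mechanism of XML Fragment Interchange, and the function name expand_with_context is hypothetical.

```python
import xml.etree.ElementTree as ET

# Text entity declarations corresponding to part1 of the message.
declarations = '<!ENTITY title "My Book">\n<!ENTITY author "me">'

# One paragraph of the fragment body (part3).
fragment = "<p>And an internal text entity reference &author;.</p>"


def expand_with_context(fragment: str, declarations: str) -> str:
    """Wrap the fragment in a DOCTYPE carrying the declarations, parse it,
    and return the element text with entity references expanded."""
    wrapped = f"<!DOCTYPE p [\n{declarations}\n]>\n{fragment}"
    return ET.fromstring(wrapped).text


print(expand_with_context(fragment, declarations))
```

Without the declarations the same paragraph is not even well-formed, which is why the fragment is shipped with its context rather than alone.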

A packaging of a (different) pair as a single XML document might look like the following, from XML Fragment Interchange section C.1:
 
<?xml version="1.0"?>
<p:package xmlns:p="http://www.w3.org/2001/02/xml-package">
  <p:fcs xmlns:f="http://www.w3.org/2001/02/xml-fragment"
         sourcelocn="http://acme.com/trans1234#root().child(1,purchase).child(2,book)">
    <transaction>
      <purchase>
        <book/>
    <p:fragbody/>
      </purchase>
    </transaction>
  </p:fcs>

  <p:body>
    <book>
      <Author>J. R. R. Tolkien</Author>
      <Title>The Book of Lost Tales (The History of Middle-Earth)</Title>
      <Edition>Mass Market Paperback Reprint edition (June 1992)</Edition>
      <ISBN>0345375211</ISBN>
      <Price currency="USD">4.79</Price>
      <Quantity>1</Quantity>
    </book>
  </p:body>
</p:package>

The principal conclusion is that XML Fragment Interchange suggests using something substantially more complicated than the string of the fragment.

XML Canonicalization

Canonical XML specifies a mechanism for turning an XML document into a canonical form.
The Document Subsets section explicitly discusses how to canonicalize a subset of the document.
This specifies one of the many different forms of an XML document to be the canonical one.
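
A small sketch of what "one of the many different forms" means in practice, using the canonicalize function from Python's standard library (Python 3.8 or later). Note the assumption: that function implements Canonical XML 2.0, a later version than the one discussed in this memo, so it only illustrates the general idea. The three documents are invented.

```python
from xml.etree.ElementTree import canonicalize

# Two spellings of the same document: different attribute order, quoting,
# spacing inside the tag, and empty-element syntax.
doc_a = "<doc b='2' a=\"1\"><e/></doc>"
doc_b = '<doc a="1"   b="2"><e></e></doc>'

# A document with genuinely different content (b is 3, not 2).
doc_c = '<doc a="1" b="3"><e></e></doc>'

canon_a = canonicalize(doc_a)
canon_b = canonicalize(doc_b)
canon_c = canonicalize(doc_c)

# Plain string comparison now separates "same document, different
# spelling" from "different document".
print(canon_a == canon_b)
print(canon_a == canon_c)
```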
This canonical form is defined in terms of an XPath node-set rather than an Infoset. I haven't completely understood the difference. I think a node-set is a subset of the Infoset, and that references have been expanded. A key quote may be:
There are seven types of node:
The context, as far as entities etc. go, is hence expanded before canonicalization. The context, as far as namespaces and the xml:lang, xml:base, and xml:space attributes go, is added on to the top-level elements in a document subset as additional attributes (if needed). All aspects of the resulting string are determined, so that a string equality test will indicate whether the XPath node-sets are identical.
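The idea of adding the context on to the top-level element of a subset can be sketched as follows, using Python's standard minidom. This is a simplified illustration under my own assumptions, not an implementation of Canonical XML: it copies every namespace declaration and xml:* attribute from the ancestors, and the helper names inherited_context and extract_with_context are hypothetical.

```python
from xml.dom import minidom


def inherited_context(elem):
    """Collect namespace declarations and xml:* attributes from ancestors.
    The nearest ancestor wins when a name is declared more than once."""
    context = {}
    node = elem.parentNode
    while node is not None and node.nodeType == node.ELEMENT_NODE:
        for name, value in node.attributes.items():
            is_context = name == "xmlns" or name.startswith(("xmlns:", "xml:"))
            if is_context and name not in context:
                context[name] = value
        node = node.parentNode
    return context


def extract_with_context(elem):
    """Serialize a copy of elem with the inherited context added to it,
    unless elem already declares the same name itself."""
    clone = elem.cloneNode(True)
    for name, value in inherited_context(elem).items():
        if not clone.hasAttribute(name):
            clone.setAttribute(name, value)
    return clone.toxml()


doc = minidom.parseString(
    '<doc xmlns:m="http://example.org/math" xml:lang="en">'
    '<m:expr>x</m:expr></doc>'
)
expr = doc.getElementsByTagName("m:expr")[0]

print(expr.toxml())                # the bare string, prefix "m" undeclared
print(extract_with_context(expr))  # the same element carrying its context
```

The second string can be parsed on its own, which the first cannot.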
The actual process of canonicalization is boring:

XML Canonicalization, first Last Call

I found it instructive to look at an earlier version of XML Canonicalization, which was very different (different editors). It was defined over the Infoset, not the XPath node-set, and renamed the namespace prefixes. This version was rejected, I think most importantly because the namespace renaming breaks schemas, amongst other things. From the archive:
 c14n messes up qnames in attribute values
the canonical form of:
<aDoc xmlns:aPrefix="http://example.com/">
<anElt anAttr="aPrefix:anNCName">
</aDoc>
is:
<n1:aDoc xmlns:n1="http://example.com/">
<n1:anElt xmlns:n1="http://example.com/" anAttr="aPrefix:anNCName">
</n1:aDoc>
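The same failure is easy to reproduce with any tool that renames prefixes. As an illustration only: the serializer in Python's standard ElementTree happens to generate its own prefixes (ns0, ns1, ...) when writing a parsed tree. This is not the algorithm of the rejected draft, and the document below is my own variant of the example above.

```python
import xml.etree.ElementTree as ET

source = (
    '<p:doc xmlns:p="http://example.com/">'
    '<p:elt attr="p:name"/>'
    '</p:doc>'
)

# Parsing keeps namespace URIs but not the prefixes; serializing then
# invents fresh prefixes for the element names.
rewritten = ET.tostring(ET.fromstring(source), encoding="unicode")
print(rewritten)

# The declaration of "p" is gone from the output ...
print('xmlns:p=' in rewritten)
# ... but the attribute value, which is just a string to the serializer,
# still refers to it.
print('attr="p:name"' in rewritten)
```

The element names survive the renaming because the tool knows they are QNames. The attribute value does not, because nothing tells the tool that it contains one.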
 XML Schema WG response to the C14N Last Call WD
The Schema WG has serious concerns with the provisions in Section 5.9 "Namespaces" that require namespace prefixes to be rewritten [...] the prefix rewriting problem is not just a schema concern but will have potential impact on many other namespace aware instances.
Relationship of Canonical XML to the InfoSet
A simplistic approach would say "if it's in the core Infoset, it's present in Canonical XML, if it isn't, it isn't".
Instead we seem to have a pick-and-choose approach.

Breakdown of Choices

My view is that answering some of the questions above will make it clearer whether XML Fragment Interchange or XML Canonicalization (or neither) is the better way forward for parseType="Literal".