XML Literals, fragments, and canonicalization
Author: Jeremy Carroll
This memo considers the problem of XML literals in RDF (rdfms-literal-is-xml-structure).
We consider the relationship with XML Fragment Interchange and Canonical XML.
The underlying problem is that an XML fragment need not be self-contained,
but may use namespace prefixes and/or references that refer to other parts
of the XML document.
Implementations that treat an XML literal value simply as the string of
characters actually present in the source document seem to me to follow
the spirit of M&S. M&S specifies the following:
The object of a statement (i.e., the property value) can be another resource
or it can be a literal; i.e., a resource (specified by a URI) or a simple
string or other primitive datatype defined by XML. In RDF terms, a literal
may have content that is XML markup but is not further evaluated by the
RDF processor.
[6.32] parseLiteral ::= ' parseType="Literal"'
[6.34] literal ::= (any well-formed XML)
if parseType="Literal" is specified in the start tag of E then v is the
content of E (a literal).
The value 'Literal' specifies that the element content is to be treated
as an RDF/XML literal; that is, the content must not be interpreted by
an RDF processor.
The RDF Model and Syntax Working Group acknowledges that the parseType='Literal'
mechanism is a minimum-level solution to the requirement to express an
RDF statement with a value that has XML markup. Additional complexities
of XML such as canonicalization of whitespace are not yet well defined.
Future work of the W3C is expected to resolve such issues in a uniform
manner for all applications based on XML. Future versions of RDF will inherit
this work and may extend it as we gain insight from further application
experience.
or if parseType="Literal" is specified in the start tag of E then v is
the content of E (a literal).
The attribute parseType="Literal" specifies that the element content is
an RDF literal. Any markup that is part of this content is included as
part of the literal and not interpreted by RDF
This specification does not state a mechanism for determining equivalence
between literals that contain markup, nor whether such a mechanism is guaranteed
to exist.
The precise representation of the resulting value is not specified here.
There is a MathML example in which the default namespace is significant, and not
duplicated in the literal.
The content of a literal is not interpreted by RDF itself and may contain
additional XML markup. Literals are distinguished from Resources in that
the RDF model does not permit literals to be the subject of a statement.
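For concreteness, the mechanism these quotes describe looks like this in RDF/XML (a sketch only; the dc:title property and the URIs are illustrative, not taken from M&S):

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.com/doc">
    <!-- the element content below is the literal; the <em> markup
         is part of the literal and not interpreted by the RDF processor -->
    <dc:title parseType="Literal">A title with <em>emphasis</em></dc:title>
  </rdf:Description>
</rdf:RDF>
```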
However, it is also desirable to allow implementations that are built truly
on top of the Infoset to be conformant, and these cannot recreate things
that are not in the Infoset, in particular the original string.
P220 is an escape clause for this problem.
A simple string, pulled out of an XML file, is not necessarily well-formed,
and may have a different meaning when put back into another XML file.
E.g. it can make use of namespaces that are not specified in the new context,
or specified differently; it can make use of entity and character references,
which perhaps should be expanded before extraction; its meaning may depend
on xml attributes that are in scope but not included in the fragment, e.g.
xml:space="preserve" or xml:lang="rom" or xml:future="not yet defined".
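A minimal sketch of the problem, using Python's standard XML parser (the dc: prefix and the &author; entity are assumed to have been declared elsewhere in the source document):

```python
import xml.etree.ElementTree as ET

# A string extracted verbatim from a larger XML document. It relies on
# context from that document: the dc: prefix binding and the &author;
# entity declaration.
fragment = '<dc:title>Written by &author;</dc:title>'

# On its own the string is not a well-formed XML document, so a
# conforming parser rejects it (unbound prefix / undefined entity).
try:
    ET.fromstring(fragment)
    well_formed = True
except ET.ParseError:
    well_formed = False

print(well_formed)  # prints False
```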
The XML Fragments spec is trying to show how to take a fragment out of
an XML document, together with the context of that fragment, so that the
pair (the fragment and the context) allows a full understanding of the fragment.
Hence, references are not expanded in this process, but are included
unexpanded. They do not specify a mechanism for relating the pair but give
some non-normative examples. In the following MIME example, taken from XML
Fragment Interchange section C.2, we see that the fragment (part3) is understood
in terms of the context (the fourth part), which refers to the first
and second parts of the message. Thus the reference &author; in
the fragment can be expanded to "me".
And here is an example of MIME packaging used to transmit the
fragment context specification, the fragment body, the internal subset,
and the external entity within a single stream such as a mail message:
Content-Type: multipart/related; boundary="/04w6evG8XlLl3ft"; type="text/xml"

--/04w6evG8XlLl3ft
Content-Type: text/xml; charset=us-ascii
Content-Disposition: attachment; filename="mybook.decls"

<!ENTITY title "My Book">
<!ENTITY author "me">
<!ENTITY try SYSTEM "cid:part2" NDATA CGM-BINARY>

--/04w6evG8XlLl3ft
Content-Disposition: attachment; filename="try.cgm"

--/04w6evG8XlLl3ft
Content-Type: text/xml; charset=us-ascii
Content-Disposition: attachment; filename="chapter3.xml"

<p>This is a paragraph within the third chapter within
the first part of a Docbook <quote>book</quote> document.</p>
<p>And this is a succeeding paragraph.</p>
<p>And an internal text entity reference &author;.</p>
<p>And a reference to an unparsed entity (a CGM graphic):

--/04w6evG8XlLl3ft
Content-Type: text/xml; charset=us-ascii
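A sketch of what "understanding the fragment in terms of its context" amounts to, using Python's standard parser. The wrapper element r and the inlined entity declaration stand in for the real fragment-context mechanism, which is more elaborate:

```python
import xml.etree.ElementTree as ET

# The bare fragment from chapter3.xml (abbreviated).
fragment = '<p>And an internal text entity reference &author;.</p>'

# Alone, the undefined entity makes the fragment unparseable.
try:
    ET.fromstring(fragment)
    parsed_alone = True
except ET.ParseError:
    parsed_alone = False

# Supplying the context -- here, the declaration from mybook.decls,
# inlined as an internal DTD subset around a wrapper element -- lets
# the reference be expanded.
wrapped = ('<!DOCTYPE r [<!ENTITY author "me">]>'
           '<r>' + fragment + '</r>')
p = ET.fromstring(wrapped).find('p')
print(parsed_alone, p.text)
```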
A packaging of (a different) pair as a single XML document might be
like the following, from XML Fragment Interchange section C.1:
<Author>J. R. R. Tolkien</Author>
<Title>The Book of Lost Tales (The History of Middle-Earth)</Title>
<Edition>Mass Market Paperback Reprint edition (June 1992)</Edition>
The principal conclusion is that XML Fragment Interchange suggests using
something substantially more complicated than the string of the fragment.
Canonical XML specifies
a mechanism for turning an XML document into a canonical form.
Its document subsets section explicitly discusses how to canonicalise a subset of
an XML document.
This specifies one of the many different forms of an XML document to
be the canonical one.
This canonical form is defined in terms of an XPath
node-set rather than an Infoset. I haven't completely understood the
difference. I think a node-set is a subset of the Infoset, and that references
have been expanded. A key quote may be:
There are seven types of node: root nodes, element nodes, text nodes,
attribute nodes, namespace nodes, processing instruction nodes, and comment nodes.
The context, as far as entities etc. go, is hence expanded before canonicalization.
The context in the form of namespace declarations and the xml:lang, xml:base and
xml:space attributes is added onto the top-level elements in a document subset
as additional attributes (if needed). All aspects of the resulting string are
determined, so that a string equality test will indicate whether two XPath
node-sets are equivalent.
The actual process
of canonicalization is boring:
The document is encoded in UTF-8
Line breaks normalized to #xA on input, before parsing
Attribute values are normalized, as if by a validating processor
Character and parsed entity references are replaced
CDATA sections are replaced with their character content
The XML declaration and document type declaration (DTD) are removed
Empty elements are converted to start-end tag pairs
Whitespace outside of the document element and within start and end tags is normalized
All whitespace in character content is retained (excluding characters removed
during line feed normalization)
Attribute value delimiters are set to quotation marks (double quotes)
Special characters in attribute values and character content are replaced
by character references
Superfluous namespace declarations are removed from each element
Default attributes are added to each element
Lexicographic order is imposed on the namespace declarations and attributes
of each element
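Several of the steps above can be observed with the canonicalization function in Python's standard library (xml.etree.ElementTree.canonicalize, Python 3.8+; it implements the later C14N 2.0 revision rather than the spec quoted here, but the steps shown coincide):

```python
import xml.etree.ElementTree as ET  # canonicalize() requires Python 3.8+

# Two different serializations of the same document: single vs double
# attribute quotes, an empty-element tag vs a start-end tag pair, and a
# superfluous repeated namespace declaration on the child element.
doc1 = "<a xmlns:x='http://example.com/ns'><b x:attr='1'/></a>"
doc2 = ('<a xmlns:x="http://example.com/ns">'
        '<b xmlns:x="http://example.com/ns" x:attr="1"></b></a>')

c1 = ET.canonicalize(doc1)
c2 = ET.canonicalize(doc2)
print(c1)
assert c1 == c2  # a string equality test now suffices
```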
XML Canonicalization, first Last Call
I found it instructive to look at an earlier
version of XML Canonicalization, which was very different (different
editors). It was over the Infoset, not the XPath node-set, and renamed the namespaces.
This version was rejected, I think most importantly because of the namespace
renaming, which breaks schemas amongst other things. From the archive:
messes up qnames in attribute values
the canonical form of:
<n1:anElt xmlns:n1="http://example.com/" anAttr="aPrefix:anNCName">
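The complaint can still be reproduced today, because prefix rewriting survived as an opt-in mode of C14N 2.0. A sketch with Python's ET.canonicalize, adapting the names from the example above (without being told via qname_aware_attrs that anAttr holds a qname, the rewriter cannot touch the attribute value):

```python
import xml.etree.ElementTree as ET  # canonicalize() requires Python 3.8+

doc = '<n1:anElt xmlns:n1="http://example.com/" anAttr="n1:anNCName"/>'

# rewrite_prefixes renames n1 to a canonical prefix (n0), but the qname
# hidden in the attribute value is just a string to the canonicalizer,
# so it still says "n1" -- which now binds to nothing.
out = ET.canonicalize(doc, rewrite_prefixes=True)
print(out)
```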
Schema WG response to the C14N Last Call WD
The Schema WG has serious concerns with the provisions in Section 5.9 "Namespaces"
that require namespace prefixes to be rewritten [...] the prefix rewriting
problem is not just a schema concern but will have potential impact on
many other namespace aware instances.
Relationship of Canonical XML to the InfoSet
A simplistic approach would say "if it's in the core Infoset, it's present
in Canonical XML, if it isn't, it isn't".
Instead we seem to have a pick-and-choose approach.
Breakdown of Choices
My view is that answering some of the questions above will make it clearer
whether XML Fragment Interchange or XML Canonicalization (or neither) is
the better way forward for parseType="Literal".
M&S P203 suggests future RDF working groups may try and improve the
current spec. Has the future arrived?
Are naive RDF processors that simply take the string conformant?
Are XML infoset RDF processors that cannot reproduce the string conformant?
If the answers to the previous two questions are both yes, what are the
words that permit this?
Is the representation of an XML Literal in the model a string or something
closer to Infoset?
Which of the following parts of infoset are processed before forming the
literal or present in the literal:
(I have marked the ones I think most pertinent; with each of these, marked
and unmarked, one can go to a deeper level in Infoset, to ask which attributes
of these infoset items are represented in the literal)
2.1 The Document Information Item
2.2 Element Information Items
2.3 Attribute Information Items
2.4 Processing Instruction Information Items
2.5 Unexpanded Entity Reference Information Items
2.6 Character Information Items
2.7 Comment Information Items
2.8 The Document Type Declaration Information Item
2.9 Unparsed Entity Information Items
**** from DTD
2.10 Notation Information Items
**** from DTD
2.11 Namespace Information Items