XML Literals, fragments, and canonicalization
Author: Jeremy Carroll
This memo considers the problem of XML literals in RDF (rdfms-literal-is-xml-structure).
We consider the relationship with XML Fragment Interchange and Canonical XML.
The underlying problem is that an XML fragment need not be self-contained,
but may use namespace prefixes and/or references that refer to other parts
of the XML document.
Implementations that treat an XML literal value simply as the string of
characters actually present in the source document seem to me to follow
the spirit of M&S. M&S specifies the following:
The object of a statement (i.e., the property value) can be another resource
or it can be a literal; i.e., a resource (specified by a URI) or a simple
string or other primitive datatype defined by XML. In RDF terms, a literal
may have content that is XML markup but is not further evaluated by the
RDF processor.
[6.32] parseLiteral ::= ' parseType="Literal"'
[6.34] literal ::= (any well-formed XML)
if parseType="Literal" is specified in the start tag of E then v is the
content of E (a literal).
The value 'Literal' specifies that the element content is to be treated
as an RDF/XML literal; that is, the content must not be interpreted by
an RDF processor.
The RDF Model and Syntax Working Group acknowledges that the parseType='Literal'
mechanism is a minimum-level solution to the requirement to express an
RDF statement with a value that has XML markup. Additional complexities
of XML such as canonicalization of whitespace are not yet well defined.
Future work of the W3C is expected to resolve such issues in a uniform
manner for all applications based on XML. Future versions of RDF will inherit
this work and may extend it as we gain insight from further application
experience.
or if parseType="Literal" is specified in the start tag of E then v is
the content of E (a literal).
The attribute parseType="Literal" specifies that the element content is
an RDF literal. Any markup that is part of this content is included as
part of the literal and not interpreted by RDF
This specification does not state a mechanism for determining equivalence
between literals that contain markup, nor whether such a mechanism is guaranteed
to exist.
The precise representation of the resulting value is not specified here.
There is a MathML example in which the default namespace is significant, and not
duplicated in the literal.
The content of a literal is not interpreted by RDF itself and may contain
additional XML markup. Literals are distinguished from Resources in that
the RDF model does not permit literals to be the subject of a statement.
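For concreteness, the mechanism these quotes describe looks like this in RDF/XML (a sketch only; the dc:title property and the URIs are illustrative, not taken from M&S):

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.com/doc">
    <!-- the element content below is the literal; the <em> markup
         is part of the literal and not interpreted by the RDF processor -->
    <dc:title parseType="Literal">A title with <em>emphasis</em></dc:title>
  </rdf:Description>
</rdf:RDF>
```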
However, it is also desirable to allow implementations that are built truly
on top of the Infoset to be conformant, and these cannot recreate things
that are not in the Infoset, in particular the original string.
P220 is an escape clause for this problem.
A simple string, pulled out of an XML file, is not necessarily well-formed,
and may have a different meaning when put back into another XML file.
E.g. it can make use of namespaces that are not specified in the new context,
or specified differently; it can make use of entity and character references,
which perhaps should be expanded before extraction; its meaning may depend
on xml attributes that are in scope but not included in the fragment, e.g.
xml:space="preserve" or xml:lang="rom" or xml:future="not yet defined".
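A minimal sketch of the problem, using Python's standard XML parser (the dc: prefix and the &author; entity are assumed to have been declared elsewhere in the source document):

```python
import xml.etree.ElementTree as ET

# A string extracted verbatim from a larger XML document. It relies on
# context from that document: the dc: prefix binding and the &author;
# entity declaration.
fragment = '<dc:title>Written by &author;</dc:title>'

# On its own the string is not a well-formed XML document, so a
# conforming parser rejects it (unbound prefix / undefined entity).
try:
    ET.fromstring(fragment)
    well_formed = True
except ET.ParseError:
    well_formed = False

print(well_formed)  # prints False
```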
The XML Fragments spec is trying to show how to take a fragment out of
an XML document, together with the context of that fragment, so that the
pair (the fragment and the context) allows a full understanding of the fragment.
Hence, references are not expanded in this process, but are included
unexpanded. They do not specify a mechanism for relating the pair but give
some non-normative examples. In the following MIME example, taken from XML
Fragment Interchange section C.2, we see that the fragment (part3) is understood
in terms of the context (the fourth part), which refers to the first
and second parts of the message. Thus the reference &author; in
the fragment can be expanded to "me".
And here is an example of MIME packaging used to transmit the
fragment context specification, the fragment body, the internal subset,
and the external entity within a single stream such as a mail message:
Content-Type: multipart/related; boundary="/04w6evG8XlLl3ft"; type="text/xml"

--/04w6evG8XlLl3ft
Content-Type: text/xml; charset=us-ascii
Content-Disposition: attachment; filename="mybook.decls"

<!ENTITY title "My Book">
<!ENTITY author "me">
<!ENTITY try SYSTEM "cid:part2" NDATA CGM-BINARY>

--/04w6evG8XlLl3ft
Content-Disposition: attachment; filename="try.cgm"

--/04w6evG8XlLl3ft
Content-Type: text/xml; charset=us-ascii
Content-Disposition: attachment; filename="chapter3.xml"

<p>This is a paragraph within the third chapter within
the first part of a Docbook <quote>book</quote> document.</p>
<p>And this is a succeeding paragraph.</p>
<p>And an internal text entity reference &author;.</p>
<p>And a reference to an unparsed entity (a CGM graphic):

--/04w6evG8XlLl3ft
Content-Type: text/xml; charset=us-ascii
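A sketch of what "understanding the fragment in terms of its context" amounts to, using Python's standard parser. The wrapper element r and the inlined entity declaration stand in for the real fragment-context mechanism, which is more elaborate:

```python
import xml.etree.ElementTree as ET

# The bare fragment from chapter3.xml (abbreviated).
fragment = '<p>And an internal text entity reference &author;.</p>'

# Alone, the undefined entity makes the fragment unparseable.
try:
    ET.fromstring(fragment)
    parsed_alone = True
except ET.ParseError:
    parsed_alone = False

# Supplying the context -- here, the declaration from mybook.decls,
# inlined as an internal DTD subset around a wrapper element -- lets
# the reference be expanded.
wrapped = ('<!DOCTYPE r [<!ENTITY author "me">]>'
           '<r>' + fragment + '</r>')
p = ET.fromstring(wrapped).find('p')
print(parsed_alone, p.text)
```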
A packaging of (a different) pair as a single XML document might be
like the following, from XML Fragment Interchange section C.1:
<Author>J. R. R. Tolkien</Author>
<Title>The Book of Lost Tales (The History of Middle-Earth)</Title>
<Edition>Mass Market Paperback Reprint edition (June 1992)</Edition>
The principal conclusion is that XML Fragment Interchange suggests using
something substantially more complicated than the string of the fragment.
Canonical XML specifies
a mechanism for turning an XML document into a canonical form.
Its document subsets section explicitly discusses how to canonicalise a subset of
an XML document.
This specifies one of the many different forms of an XML document to
be the canonical one.
This canonical form is defined in terms of an XPath
node-set rather than an Infoset. I haven't completely understood the
difference. I think a node-set is a subset of the Infoset, and that references
have been expanded. A key quote may be:
There are seven types of node: root nodes, element nodes, text nodes,
attribute nodes, namespace nodes, processing instruction nodes, and comment nodes.
The context, as far as entities etc. go, is hence expanded before canonicalization.
The context in the form of namespace declarations and the xml:lang, xml:base and
xml:space attributes is added onto the top-level elements in a document subset
as additional attributes (if needed). All aspects of the resulting string are
determined, so that a string equality test will indicate whether two XPath
node-sets are equivalent.
The actual process
of canonicalization is boring:
The document is encoded in UTF-8
Line breaks normalized to #xA on input, before parsing
Attribute values are normalized, as if by a validating processor
Character and parsed entity references are replaced
CDATA sections are replaced with their character content
The XML declaration and document type declaration (DTD) are removed
Empty elements are converted to start-end tag pairs
Whitespace outside of the document element and within start and end tags is normalized
All whitespace in character content is retained (excluding characters removed
during line feed normalization)
Attribute value delimiters are set to quotation marks (double quotes)
Special characters in attribute values and character content are replaced
by character references
Superfluous namespace declarations are removed from each element
Default attributes are added to each element
Lexicographic order is imposed on the namespace declarations and attributes
of each element
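Several of the steps above can be observed with the canonicalization function in Python's standard library (xml.etree.ElementTree.canonicalize, Python 3.8+; it implements the later C14N 2.0 revision rather than the spec quoted here, but the steps shown coincide):

```python
import xml.etree.ElementTree as ET  # canonicalize() requires Python 3.8+

# Two different serializations of the same document: single vs double
# attribute quotes, an empty-element tag vs a start-end tag pair, and a
# superfluous repeated namespace declaration on the child element.
doc1 = "<a xmlns:x='http://example.com/ns'><b x:attr='1'/></a>"
doc2 = ('<a xmlns:x="http://example.com/ns">'
        '<b xmlns:x="http://example.com/ns" x:attr="1"></b></a>')

c1 = ET.canonicalize(doc1)
c2 = ET.canonicalize(doc2)
print(c1)
assert c1 == c2  # a string equality test now suffices
```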
XML Canonicalization, first Last Call
I found it instructive to look at an earlier
version of XML Canonicalization, which was very different (different
editors). It was over the Infoset, not the XPath node-set, and renamed the namespaces.
This version was rejected, I think most importantly because of the namespace
renaming, which breaks schemas amongst other things. From the archive:
messes up qnames in attribute values
the canonical form of:
<n1:anElt xmlns:n1="http://example.com/" anAttr="aPrefix:anNCName">
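The complaint can still be reproduced today, because prefix rewriting survived as an opt-in mode of C14N 2.0. A sketch with Python's ET.canonicalize, adapting the names from the example above (without being told via qname_aware_attrs that anAttr holds a qname, the rewriter cannot touch the attribute value):

```python
import xml.etree.ElementTree as ET  # canonicalize() requires Python 3.8+

doc = '<n1:anElt xmlns:n1="http://example.com/" anAttr="n1:anNCName"/>'

# rewrite_prefixes renames n1 to a canonical prefix (n0), but the qname
# hidden in the attribute value is just a string to the canonicalizer,
# so it still says "n1" -- which now binds to nothing.
out = ET.canonicalize(doc, rewrite_prefixes=True)
print(out)
```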
Schema WG response to the C14N Last Call WD
The Schema WG has serious concerns with the provisions in Section 5.9 "Namespaces"
that require namespace prefixes to be rewritten [...] the prefix rewriting
problem is not just a schema concern but will have potential impact on
many other namespace aware instances.
Relationship of Canonical XML to the InfoSet
A simplistic approach would say "if it's in the core Infoset, it's present
in Canonical XML, if it isn't, it isn't".
Instead we seem to have a pick-and-choose approach.
Breakdown of Choices
My view is that answering some of the questions above will make it clearer
whether XML Fragment Interchange or XML Canonicalization (or neither) is
the better way forward for parseType="Literal".
M&S P203 suggests future RDF working groups may try and improve the
current spec. Has the future arrived?
Are naive RDF processors that simply take the string conformant?
Are XML infoset RDF processors that cannot reproduce the string conformant?
If the answers to the previous two questions are both yes, what are the
words that permit this?
Is the representation of an XML Literal in the model a string or something
closer to Infoset?
Which of the following parts of infoset are processed before forming the
literal or present in the literal:
(I have marked the ones I think most pertinent; with each of these, marked
and unmarked, one can go to a deeper level in Infoset, to ask which attributes
of these infoset items are represented in the literal)
2.1 The Document Information Item
2.2 Element Information Items
2.3 Attribute Information Items
2.4 Processing Instruction Information Items
2.5 Unexpanded Entity Reference Information Items
2.6 Character Information Items
2.7 Comment Information Items
2.8 The Document Type Declaration Information Item
2.9 Unparsed Entity Information Items
**** from DTD
2.10 Notation Information Items
**** from DTD
2.11 Namespace Information Items