XML Literals, fragments, and canonicalization
Author: Jeremy Carroll
This memo considers the problem of XML literals in RDF (rdfms-literal-is-xml-structure).
We consider the relationship with XML Fragment Interchange and Canonical XML.
The underlying problem is that an XML fragment need not be self-contained,
but may use namespace prefixes and/or references that refer to other parts
of the XML document.
The Issue
M&S specifies the following:
-
P33
- Statements
-
The object of a statement (i.e., the property value) can be another resource
or it can be a literal; i.e., a resource (specified by a URI) or a simple
string or other primitive datatype defined by XML. In RDF terms, a literal
may have content that is XML markup but is not further evaluated by the
RDF processor.
-
parseLiteral
-
[6.32] parseLiteral ::= ' parseType="Literal"'
-
literal
-
[6.34] literal ::= (any well-formed
XML)
-
P202
-
if parseType="Literal" is specified in the start tag of E then v is the
content of E (a literal).
-
P203
-
The value 'Literal' specifies that the element content is to be treated
as an RDF/XML literal; that is, the content must not be interpreted by
an RDF processor.
-
P203
(subpara)
-
The RDF Model and Syntax Working Group acknowledges that the parseType='Literal'
mechanism is a minimum-level solution to the requirement to express an
RDF statement with a value that has XML markup. Additional complexities
of XML such as canonicalization of whitespace are not yet well defined.
Future work of the W3C is expected to resolve such issues in a uniform
manner for all applications based on XML. Future versions of RDF will inherit
this work and may extend it as we gain insight from further application
experience.
-
P212
-
or if parseType="Literal" is specified in the start tag of E then v is
the content of E (a literal).
-
P214
-
The attribute parseType="Literal" specifies that the element content is
an RDF literal. Any markup that is part of this content is included as
part of the literal and not interpreted by RDF
-
P220
-
This specification does not state a mechanism for determining equivalence
between literals that contain markup, nor whether such a mechanism is guaranteed
to exist.
-
Values
Containing Markup
-
The precise representation of the resulting value is not specified here.
and
a MathML example in which the default namespace is significant, and not
duplicated in the literal.
-
P282
-
The content of a literal is not interpreted by RDF itself and may contain
additional XML markup. Literals are distinguished from Resources in that
the RDF model does not permit literals to be the subject of a statement.
Implementations that treat an XML literal value simply as the string of
characters actually present in the source document seem to me to follow
the spirit of M&S.
However, it is also desirable to allow implementations that are truly
built on top of the Infoset to be conformant, and these cannot recreate things
that are not in the Infoset, in particular the original string.
P220 is an escape clause for this problem.
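To make the point about the original string concrete, here is a minimal Python sketch. It assumes Python's standard xml.etree.ElementTree as a stand-in for an Infoset-style processor, and the literal content is a hypothetical example, not taken from M&S. The parsed data is intact, but the serializer has no access to the source spelling (quote style, extra whitespace inside the tag, the form of the character reference).

```python
import xml.etree.ElementTree as ET

# Hypothetical literal content as it might appear in a source document.
source = "<b  class='x'>A&#38;B</b>"

# Parse into a tree, then serialize again. Only the parsed information
# survives; lexical details of the source string are not kept.
element = ET.fromstring(source)
round_tripped = ET.tostring(element, encoding="unicode")

print("source:       ", source)
print("round-tripped:", round_tripped)
print("same string?  ", round_tripped == source)
```

A processor of this kind can only hand back some serialization of the literal, which is the situation the escape clause in P220 seems to cover.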
A simple string pulled out of an XML file is not necessarily well-formed,
and may have a different meaning when put back into another XML file.
For example, it can make use of namespaces that are not specified in the new context,
or are specified differently; it can make use of entity and character references,
which perhaps should be expanded before extraction; its meaning may depend
on XML attributes that are in scope but not included in the fragment, e.g.
xml:space="preserve" or xml:lang="rom" or xml:future="not yet defined".
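The namespace case can be sketched in a few lines of Python. The RDF/XML document below is a hypothetical example (the ex: vocabulary and the resource URI are assumptions made for illustration), and the "naive" extraction is simply a substring of the source. The substring is fine inside its document but cannot be parsed on its own, because the m: prefix is declared on an ancestor element.

```python
import xml.etree.ElementTree as ET

# Hypothetical RDF/XML; the ex: namespace and the resource URI are made up.
document = (
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"'
    ' xmlns:ex="http://example.org/terms#"'
    ' xmlns:m="http://www.w3.org/1998/Math/MathML">'
    '<rdf:Description rdf:about="http://example.org/doc">'
    '<ex:formula rdf:parseType="Literal"><m:mi>x</m:mi></ex:formula>'
    '</rdf:Description></rdf:RDF>'
)

# The whole document parses without complaint.
ET.fromstring(document)

# Naive extraction: the characters between the property element's tags.
start = document.index('<m:mi>')
end = document.index('</ex:formula>')
literal = document[start:end]

# Parsed in isolation, the string has lost its namespace context.
try:
    ET.fromstring(literal)
    standalone_ok = True
except ET.ParseError as error:
    standalone_ok = False
    print("not well-formed on its own:", error)
```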
XML Fragments
The XML Fragment Interchange spec shows how to take a fragment out of
an XML document together with the context of that fragment, so that the
pair (the fragment and the context) allows a full understanding of the fragment.
Hence, references are not expanded in this process, but are included
unexpanded. The spec does not specify a mechanism for relating the pair but gives
some non-normative examples. In the following MIME example, taken from XML
Fragment Interchange section C.2, the fragment (part3) is understood
in terms of the context (the fourth part), which refers to the first
and the second parts of the message. Thus the reference &author; in
the fragment can be expanded to "me".
And here is an example of MIME packaging used to transmit the
fragment context specification, the fragment body, the internal subset,
and the external entity within a single stream such as a mail message:
Content-Type: multipart/related; boundary="/04w6evG8XlLl3ft";type="text/xml"
--/04w6evG8XlLl3ft
Content-Type: text/xml; charset=us-ascii
Content-ID: <part1>
Content-Disposition: attachment; filename="mybook.decls"
<!ENTITY title "My Book">
<!ENTITY author "me">
<!ENTITY try SYSTEM "cid:part2" NDATA CGM-BINARY>
--/04w6evG8XlLl3ft
Content-Type: image/cgm
Content-ID: <part2>
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="try.cgm"
ACEAABAiAAEQXwBEQyJTb3VyY2U6IEhTSSAvV01GLXRvLUNHTSBmaWx0ZXIg
LyBWZXJzaW9uIDEuMzUgIiAiRGF0ZTogMTk5OS0wMS0xNyIRZgAB//8AARBi
AAAQpgAAAAkAFxFGAAAA////EYQwIgAQEYogyAAAAAB//3//AAARvwC3C1RJ
TUVTX1JPTUFODFRJTUVTX0lUQUxJQwpUSU1FU19CT0xEEVRJTUVTX0JPTERf
SVRBTElDCUhFTFZFVElDQRFIRUxWRVRJQ0FfT0JMSVFVRQ5IRUxWRVRJQ0Ff
Qk9MRBZIRUxWRVRJQ0FfQk9MRF9PQkxJUVVFB0NPVVJJRVIOQ09VUklFUl9J
VEFMSUMMQ09VUklFUl9CT0xEE0NPVVJJRVJfQk9MRF9JVEFMSUMGU1lNQk9M
ABHOAAABQgABAUEABAMqLToR4gABAGEAACAmAAE9NJ9IIEIAASBiAAAgggAA
IKIAACDI95D0wAhqCzoAAACAQWj5cAa5/TEJikGGAogCUQGQUGIACEAo+dD/
+v7g+TpRYgACUkwAAQAEAAAAAAAAAABRgBxUggAAABkAGQAAFKCAAJAkAEg/
MoAAQlTb21lIFRleHQAoABA
--/04w6evG8XlLl3ft
Content-Type: text/xml; charset=us-ascii
Content-ID: <part3>
Content-Disposition: attachment; filename="chapter3.xml"
<p>This is a paragraph within the third chapter within
the first part of a Docbook <quote>book</quote> document.</p>
<p>And this is a succeeding paragraph.</p>
<p>And an internal text entity reference &author;.</p>
<p>And a reference to an unparsed entity (a CGM graphic):
<graphic entityref="try"></graphic></p>
--/04w6evG8XlLl3ft
Content-Type: text/xml; charset=us-ascii
<?xml version="1.0"?>
<f:fcs xmlns:f="http://www.w3.org/2001/02/xml-fragment"
xmlns="http://www.oasis-open.org/docbook/docbook/3.0/docbook.dtd"
extref="http://www.oasis-open.org/docbook/docbook/3.0/docbook.dtd"
intref="cid:part1">
<book>
<part>
<chapter type="intro"/>
<chapter/>
<chapter>
<f:fragbody fragbodyref="cid:part3"/>
</chapter>
</part>
</book>
</f:fcs>
--/04w6evG8XlLl3ft--
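The effect of pairing the fragment with its context can be imitated in Python. This is only a sketch under assumptions: it is not the mechanism of XML Fragment Interchange, it ignores the fcs document and the unparsed entity, and it simply places the two text entity declarations from part1 in an internal subset in front of one paragraph from part3.

```python
import xml.etree.ElementTree as ET

# One paragraph of the fragment body (part3) and the text entity
# declarations from part1. The NDATA declaration is left out here.
fragment = "<p>And an internal text entity reference &author;.</p>"
declarations = '<!ENTITY title "My Book">\n<!ENTITY author "me">'

# Without its context the fragment cannot be parsed: &author; is undefined.
try:
    ET.fromstring(fragment)
    alone_ok = True
except ET.ParseError:
    alone_ok = False

# With the declarations supplied as an internal subset, the reference resolves.
with_context = "<!DOCTYPE p [\n" + declarations + "\n]>\n" + fragment
paragraph = ET.fromstring(with_context)

print("parses alone:", alone_ok)
print("with context:", paragraph.text)
```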
A packaging of a different pair as a single XML document might look like
this (XML Fragment Interchange section C.1):
<?xml version="1.0"?>
<p:package xmlns:p="http://www.w3.org/2001/02/xml-package">
<p:fcs xmlns:f="http://www.w3.org/2001/02/xml-fragment"
sourcelocn="http://acme.com/trans1234#root().child(1,purchase).child(2,book)">
<transaction>
<purchase>
<book/>
<p:fragbody/>
</purchase>
</transaction>
</p:fcs>
<p:body>
<book>
<Author>J. R. R. Tolkien</Author>
<Title>The Book of Lost Tales (The History of Middle-Earth)</Title>
<Edition>Mass Market Paperback Reprint edition (June 1992)</Edition>
<ISBN>0345375211</ISBN>
<Price currency="USD">4.79</Price>
<Quantity>1</Quantity>
</book>
</p:body>
</p:package>
The principal conclusion is that XML Fragment Interchange suggests using
something substantially more complicated than the string of the fragment.
XML Canonicalization
Canonical XML specifies
a mechanism for turning an XML document into a canonical form.
The Document Subsets section explicitly discusses how to canonicalize
a subset of the document.
This specifies one of the many different forms of an XML document to
be the canonical one.
This canonical form is defined in terms of an XPath
node-set rather than an Infoset. I haven't completely understood the
difference. I think a node-set is a subset of the Infoset in which references
have been expanded. A key quote may be:
There are seven types of node:
-
root nodes
-
element nodes
-
text nodes
-
attribute nodes
-
namespace nodes
-
processing instruction nodes
-
comment nodes
Hence the context, as far as entities etc. go, is expanded before canonicalization.
The context, as far as namespaces and the xml:lang, xml:base and xml:space
attributes go, is added to the top-level elements of a document subset as additional
attributes (if needed). All aspects of the resulting string are determined,
so that a string equality test will indicate whether the XPath node-sets
are identical.
The actual process
of canonicalization is boring:
-
The document is encoded in UTF-8
-
Line breaks normalized to #xA on input, before parsing
-
Attribute values are normalized, as if by a validating processor
-
Character and parsed entity references are replaced
-
CDATA sections are replaced with their character content
-
The XML declaration and document type declaration (DTD) are removed
-
Empty elements are converted to start-end tag pairs
-
Whitespace outside of the document element and within start and end tags
is normalized
-
All whitespace in character content is retained (excluding characters removed
during line feed normalization)
-
Attribute value delimiters are set to quotation marks (double quotes)
-
Special characters in attribute values and character content are replaced
by character references
-
Superfluous namespace declarations are removed from each element
-
Default attributes are added to each element
-
Lexicographic order is imposed on the namespace declarations and attributes
of each element
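A few of the steps listed above can be observed with Python's standard library. One caveat, stated loudly: xml.etree.ElementTree.canonicalize (Python 3.8 and later) implements C14N 2.0, not the Canonical XML version discussed in this memo, so this is an illustration of the idea rather than of that exact spec. The two input documents are hypothetical and differ only lexically (XML declaration, attribute order and quoting, empty-element syntax, CDATA versus escaped text).

```python
from xml.etree.ElementTree import canonicalize

# Two lexically different spellings of the same content (made-up example).
first = (
    '<?xml version="1.0"?>'
    "<doc b='2' a=\"1\"><e/><![CDATA[x < y]]></doc>"
)
second = '<doc a="1" b="2"><e></e>x &lt; y</doc>'

# Note: this is C14N 2.0, an assumption that it is close enough to show
# the shared steps (attribute ordering, empty elements, CDATA, declaration).
canonical_first = canonicalize(first)
canonical_second = canonicalize(second)

print(canonical_first)
print("equal after canonicalization:", canonical_first == canonical_second)
```

Once both documents are in canonical form, a plain string comparison is enough to test them for equality, which is the property that matters for literals.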
XML Canonicalization, first Last Call
I found it instructive to look at an earlier
version of XML Canonicalization, which was very different (and had different
editors). It was defined over the Infoset, not the XPath node-set, and it renamed
namespace prefixes. This version was rejected, I think most importantly because
the prefix renaming breaks schemas, amongst other things. From the archive:
-
c14n
messes up qnames in attribute values
-
the canonical form of:
-
<aDoc xmlns:aPrefix="http://example.com/">
-
<anElt anAttr="aPrefix:anNCName">
-
</aDoc>
-
is:
-
<n1:aDoc xmlns:n1="http://example.com/">
-
<n1:anElt xmlns:n1="http://example.com/" anAttr="aPrefix:anNCName">
-
</n1:aDoc>
-
XML
Schema WG response to the C14N Last Call WD
-
The Schema WG has serious concerns with the provisions in Section 5.9 "Namespaces"
that require namespace prefixes to be rewritten [...] the prefix rewriting
problem is not just a schema concern but will have potential impact on
many other namespace aware instances.
-
Relationship
of Canonical XML to the InfoSet
-
A simplistic approach would say "if it's in the core Infoset, it's present
in Canonical XML, if it isn't, it isn't".
-
Instead we seem to have a pick-and-choose approach.
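The QName problem quoted from the archive above can be reproduced with any serializer that rewrites prefixes. The sketch below uses Python's xml.etree.ElementTree, which assigns its own prefixes on output; it is an analogy only, and makes no claim to reproduce the algorithm of the rejected Last Call draft. The input is a variant of the archive example in which the elements carry the prefix.

```python
import xml.etree.ElementTree as ET

# Variant of the archive example: the attribute value is a QName that
# relies on the aPrefix declaration.
source = (
    '<aPrefix:aDoc xmlns:aPrefix="http://example.com/">'
    '<aPrefix:anElt anAttr="aPrefix:anNCName"/>'
    '</aPrefix:aDoc>'
)

# ElementTree keeps namespace names but not the original prefixes, so the
# element names are rewritten on output. Attribute values are plain strings
# to the serializer and are copied through untouched.
rewritten = ET.tostring(ET.fromstring(source), encoding="unicode")
print(rewritten)

prefix_still_declared = "xmlns:aPrefix=" in rewritten
qname_value_unchanged = 'anAttr="aPrefix:anNCName"' in rewritten
print("aPrefix still declared:", prefix_still_declared)
print("attribute value unchanged:", qname_value_unchanged)
```

The attribute value still says aPrefix:anNCName, but nothing in the output binds aPrefix any more, so the QName can no longer be resolved.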
Breakdown of Choices
-
M&S P203 suggests future RDF working groups may try and improve the
current spec. Has the future arrived?
-
Are naive RDF processors that simply take the string conformant?
-
Are XML infoset RDF processors that cannot reproduce the string conformant?
-
If the answers to the previous two questions are both yes, what are the
words that permit this?
-
Is the representation of an XML Literal in the model a string or something
closer to Infoset?
-
Which of the following parts of infoset are processed before forming the
literal or present in the literal:
2.1 The Document Information Item
2.2 Element Information Items
2.3 Attribute Information Items
2.4 Processing Instruction Information Items
2.5 Unexpanded Entity Reference Information Items ****
2.6 Character Information Items
2.7 Comment Information Items ****
2.8 The Document Type Declaration Information Item ****
2.9 Unparsed Entity Information Items **** from DTD
2.10 Notation Information Items **** from DTD
2.11 Namespace Information Items ****
(I have marked with **** the ones I think most pertinent. With each of these,
marked or unmarked, one can go to a deeper level in Infoset and ask which
properties of these information items are represented in the literal.)
My view is that answering some of the questions above will make it clearer
whether XML Fragment Interchange or XML Canonicalization (or neither) is
the better way forward for parseType="Literal".