Returning un-escaped XML literals in SPARQL XML Results

Howdy!

(I tried sending this before and it didn't seem to go through.  Apologies if you get multiple copies.)

In September 2006 I posted a question about literal XML inclusion in SPARQL results to the jena-dev list,
and Andy suggested that I post the issue here.  Sorry it's taken me so long to do that.  I don't see
anywhere on this list that the subject has come up in the meantime, so perhaps it's still germane.

When literal XML is stored inside an RDF model, it is in some cases desirable to fetch that content
as part of a SPARQL XML result stream *without escaping*.   For example, consider the storage of
XHTML content within RDF literals.  It seems reasonable (and works fine) to assert a triple like this:

s = m:someDocument
p = m:hasContent
o = <xh:p xmlns:xh="http://www.w3.org/1999/xhtml">Contents of <xh:em>THE</xh:em> paragraph</xh:p>^^rdf:XMLLiteral

Note that the datatype of the object node is rdf:XMLLiteral.
Also note that I am only using XHTML as an example, and the literal block could be of any XML type.

What I would like is to query a model containing this triple, and receive the results as SPARQL-XML,
with the literal's contents simply included into the result stream as XML, so we would see output like this:

<binding name="o">
         <literal datatype="http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral">
                  <xh:p xmlns:xh="http://www.w3.org/1999/xhtml">Contents of <xh:em>THE</xh:em> paragraph</xh:p>
         </literal>
</binding>

(I have hacked up my copy of ARQ to do this, and it works great for my purposes!)

This is not the normal behavior of ARQ, however.  ARQ will always
escape the XML tag characters, turning angle brackets into entity references, and so on.
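
To make the escaped form concrete, here is a minimal Python sketch that approximates what ends up inside the <literal> element today. It is not ARQ's code: the helper name escaped_literal_body is made up for illustration, and ARQ's own writer may differ in details such as quote handling.

```python
from xml.sax.saxutils import escape

# Lexical form of the XMLLiteral from the example triple above.
XML_LITERAL = (
    '<xh:p xmlns:xh="http://www.w3.org/1999/xhtml">'
    'Contents of <xh:em>THE</xh:em> paragraph</xh:p>'
)


def escaped_literal_body(lexical_form: str) -> str:
    """Approximate the escaped text a serializer writes inside <literal>.

    Hypothetical helper for illustration only, not an ARQ function.
    """
    # escape() replaces &, < and > with entity references.
    return escape(lexical_form)


if __name__ == "__main__":
    print(escaped_literal_body(XML_LITERAL))
```

The printed text contains no raw angle brackets, so an XML consumer sees the whole paragraph as one opaque text node.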

I can see how some users might want that escaping, but it also seems reasonable to NOT want it.
Turning the escaped tags back into parsed XML requires a consumer (who shares my assumptions
and preferences) to serialize the result set document into a buffer and re-parse it from text, which
is not fun or fast.  For my use case (embedding RDF technology within an established content
management application based on the Cocoon XML-pipeline framework), it is very nice to have the
result set available as a single unbroken XML tree, which is immediately ready for downstream
processing using XSLT.  With this feature available, the embedding of small fragments of XML
content within RDF models becomes quite attractive in some situations.
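
The difference for a consumer can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not anything from ARQ or Cocoon: the helper names (result_document, em_text_from_escaped, em_text_from_unescaped) are hypothetical, and the result documents are hand-built to resemble the escaped and unescaped forms discussed above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

SPARQL_NS = "http://www.w3.org/2005/sparql-results#"
XHTML_NS = "http://www.w3.org/1999/xhtml"
XMLLITERAL = "http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral"

FRAGMENT = (
    f'<xh:p xmlns:xh="{XHTML_NS}">'
    'Contents of <xh:em>THE</xh:em> paragraph</xh:p>'
)


def result_document(literal_body: str) -> str:
    """Build a tiny SPARQL XML result with one binding (illustrative only)."""
    return (
        f'<sparql xmlns="{SPARQL_NS}">'
        '<head><variable name="o"/></head>'
        '<results><result><binding name="o">'
        f'<literal datatype="{XMLLITERAL}">{literal_body}</literal>'
        '</binding></result></results></sparql>'
    )


ESCAPED_RESULT = result_document(escape(FRAGMENT))  # escaped literal text
UNESCAPED_RESULT = result_document(FRAGMENT)        # the form proposed here


def em_text_from_escaped(doc: str) -> list:
    """Escaped form: parse the document, then re-parse each literal's text."""
    found = []
    for literal in ET.fromstring(doc).iter(f"{{{SPARQL_NS}}}literal"):
        fragment = ET.fromstring(literal.text)  # second parse, from text
        found.extend(em.text for em in fragment.iter(f"{{{XHTML_NS}}}em"))
    return found


def em_text_from_unescaped(doc: str) -> list:
    """Unescaped form: one parse yields a single tree, literal included."""
    root = ET.fromstring(doc)
    return [em.text for em in root.iter(f"{{{XHTML_NS}}}em")]


if __name__ == "__main__":
    print(em_text_from_escaped(ESCAPED_RESULT))
    print(em_text_from_unescaped(UNESCAPED_RESULT))
```

Both functions find the same emphasized text, but only the second works on a single tree that could be handed straight to an XSLT step.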

To me it would be reasonable to control this behaviour ("to escape or not to escape") at the SPARQL
query engine API level, probably by setting a flag on the ResultSet object.    I proposed this on the
jena-dev list (with the simple implementation that I had hacked up for my own use), and Andy gave a
very comprehensive and thoughtful response indicating how this serialization issue relates to
the design of the XML Schema for SPARQL results, the defined lexical form of the results
in the spec, and concerns about reparsing the literals in downstream processes:

http://tech.groups.yahoo.com/group/jena-dev/message/25395

I understand the desire to keep schemas tight and not have gratuitous xsd:any wildcards flying around.
But, on the other hand, it seems to me that RDF+SPARQL users who choose to use the XMLLiteral
datatype are essentially choosing to store arbitrary XML within their RDF, and they are tagging
it as such.  So, if we want to support that use case, allowing the <literal> return block
to contain XML, and using the xsd:any wildcard to implement the facility, seems appropriate.
I do see that there are some choices which would need to be made in the face of the limited
expressiveness of the XML-schema standard, and I won't launch into a discussion of those
details unless/until others are interested.

I can also see that some apparent collision issues could arise if literal content uses the
default namespace, but these don't appear insurmountable.  Again, I won't
launch into examples until others have had a chance to respond in general terms.
(Perhaps this whole issue was already debated somewhere before).
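
For readers who want something concrete in the meantime, here is one way such a collision could show up, sketched in Python. This is only my reading of the problem, and it rests on two assumptions: the result document declares the SPARQL results namespace as its default namespace, and the literal's elements are in no namespace at all. The <note> element and the helper name literal_child_tags are made up for illustration.

```python
import xml.etree.ElementTree as ET

SPARQL_NS = "http://www.w3.org/2005/sparql-results#"


def wrap(literal_body: str) -> str:
    """Embed a fragment verbatim in a minimal result document (illustrative)."""
    return (
        f'<sparql xmlns="{SPARQL_NS}"><results><result>'
        f'<binding name="o"><literal>{literal_body}</literal></binding>'
        '</result></results></sparql>'
    )


# Fragment pasted as-is: it inherits the enclosing default namespace.
NAIVE_RESULT = wrap("<note>hi</note>")

# Same fragment with the default namespace reset on its root element.
RESET_RESULT = wrap('<note xmlns="">hi</note>')


def literal_child_tags(doc: str) -> list:
    """Return the expanded names of the element children of <literal>."""
    root = ET.fromstring(doc)
    literal = next(root.iter(f"{{{SPARQL_NS}}}literal"))
    return [child.tag for child in literal]


if __name__ == "__main__":
    print(literal_child_tags(NAIVE_RESULT))
    print(literal_child_tags(RESET_RESULT))
```

In the naive case the <note> element is silently pulled into the SPARQL results namespace; resetting the default namespace on the fragment root keeps it unqualified, which is one reason the problem doesn't look insurmountable.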

In summary:  I think that if there were a way for SPARQL engines to (optionally) return
XMLLiterals without escaping the tags (and preferably, without violating applicable
standards), that would be peachy.

sincerely,

Stu Baurmann

Received on Thursday, 9 August 2007 11:50:45 UTC