Re: Unescaped XML in the SPARQL XML Result Format and Tuesday's agenda from Seaborne, Andy on 2007-09-22 (public-rdf-dawg@w3.org from July to September 2007)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Sat, 22 Sep 2007 19:00:58 +0100
To: Ivan Mikhailov <imikhailov@openlinksw.com>
Cc: Lee Feigenbaum <lee@thefigtrees.net>, 'RDF Data Access Working Group' <public-rdf-dawg@w3.org>
Message-ID: <46F5585A.5080409@hp.com>
Ivan,

I think it's more complicated than your message suggests.  Simple embedding of 
XML in SPARQL results does not work and a full solution is quite intricate and 
makes at least one significant trade-off (loss of schema validation).

The SPARQL Query Results XML Format might start like this example from 
Virtuoso which sends literals just as the lexical form:

<sparql
  xmlns="http://www.w3.org/2005/sparql-results#"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/sw/DataAccess/rf1/result2.xsd">

so the default namespace is http://www.w3.org/2005/sparql-results#

Later on in the results, I have found examples such as this:

     <binding name="z"><literal><a
href="http://woodcutters.meard.org/gallery/v/IMG_3999-woodcutter10.jpg.html"><img
border="0"
src="http://woodcutters.meard.org/gallery/d/41-2/IMG_3999-woodcutter10.jpg"
width="100" height="150" /></a></literal></binding>

which is supposed, I guess, to be embedded HTML in the results format.


1: Namespaces

The <a> is in the default namespace of 
"http://www.w3.org/2005/sparql-results#" so it's
   {http://www.w3.org/2005/sparql-results#}a
which is not HTML nor XHTML.

Similarly, xml:lang would be passed into the subtree.

This affects namespace-aware XSLT processing.
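
The namespace point can be checked with any namespace-aware parser. Below is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the document is a trimmed version of the result above, and the example.org URLs are placeholders rather than the original data.

```python
import xml.etree.ElementTree as ET

# Trimmed result document: only the default namespace declaration and one
# binding are kept. The example.org URLs are placeholders.
DOC = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <results>
    <result>
      <binding name="z"><literal><a href="http://example.org/"><img src="http://example.org/x.jpg" /></a></literal></binding>
    </result>
  </results>
</sparql>"""

SR = "http://www.w3.org/2005/sparql-results#"
XHTML = "http://www.w3.org/1999/xhtml"

root = ET.fromstring(DOC)
literal = root.find(f".//{{{SR}}}literal")
embedded = literal[0]  # the nested <a> element

# ElementTree reports tags as "{namespace-uri}local-name".
print(embedded.tag)

in_results_ns = embedded.tag == f"{{{SR}}}a"   # inherited default namespace
in_xhtml_ns = embedded.tag == f"{{{XHTML}}}a"  # what an XHTML consumer would look for
```

A stylesheet or query matching XHTML's a element would therefore not match this element unless the embedded markup redeclared its own namespace.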

2: HTML need not be well-formed XML

One motivation is the use with HTML but HTML can't simply be placed textually 
because it isn't necessarily well-formed XML.

That "<a>...</a>" was a plain string, no datatype in the <literal> so it has 
to be escaped anyway.  Even if we allowed nested XML, it still would not be 
valid unless we go beyond the XML literal style and allowed mixed text and 
element content.

It might have been "Here is a <a href="">link</a>."
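
To illustrate, here is a small sketch, again assuming Python's standard library. The wrap and parses helpers are hypothetical names introduced only for this example, and the trailing unclosed <br> is an added assumption to stand in for HTML that is legal but not well-formed XML.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

SR = "http://www.w3.org/2005/sparql-results#"

def wrap(content: str) -> str:
    """Place content textually inside a <literal> element (hypothetical helper)."""
    return f'<literal xmlns="{SR}">{content}</literal>'

def parses(doc: str) -> bool:
    """Return True if doc is well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

# Legal HTML, but the <br> is never closed, so it is not well-formed XML.
html = "Here is a <a href=''>link</a>.<br>"

raw_ok = parses(wrap(html))              # pasted in as-is
escaped_ok = parses(wrap(escape(html)))  # escaped as a plain string

# The escaped form gives the receiver back the original string unchanged.
round_trip = ET.fromstring(wrap(escape(html))).text
```

Pasting the HTML in directly makes the whole result document unparseable, while the escaped form parses and preserves the string exactly.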

3: It can't be validated any more.

This was one of the original goals of the design - to have valid XML documents 
with respect to a fixed schema.  Then, receiving systems could validate the 
XML in the usual way for an XML processing pipeline.

This would be a significant loss as evidenced by the XML syntax for RDF.

Going for a modular XML design would take a long time to finish and I don't 
see how it would not break deployed systems.  It would add significant 
complexity to the parsing process.

4: Escaping in lexical forms is still needed

Suppose we have a literal which is "x<y".  The < needs encoding as &lt; 
regardless of allowing unescaped XML.  A writer already needs to check for 
escapes.
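
As a sketch of what a writer already has to do, assuming Python's xml.sax.saxutils.escape as the escaping routine (literal_element is a hypothetical helper, not part of any SPARQL library):

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def literal_element(lexical_form: str) -> str:
    """Serialize a plain literal, escaping &, < and > (hypothetical helper)."""
    return f"<literal>{escape(lexical_form)}</literal>"

serialized = literal_element("x<y")
print(serialized)  # <literal>x&lt;y</literal>

# A receiver parsing the element gets the original lexical form back.
recovered = ET.fromstring(serialized).text

# Without the escaping step the element is not well-formed at all.
try:
    ET.fromstring("<literal>x<y</literal>")
    unescaped_parses = True
except ET.ParseError:
    unescaped_parses = False
```

The check for characters needing escapes is needed in either design, so allowing unescaped XML would not remove it from the writer.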


I do agree about readability though. :-)

I do not see any new information and do not support reopening the design of 
the SPARQL Query Results XML Format.

	Andy


Ivan Mikhailov wrote:
> Hi everyone,
> 
> I vote for support of unescaped XML texts in the ..Results XML...
> because it could be convenient for XSLT and similar tools that may be
> used to transform result sets of different formats into each other. It
> is also definitely more readable. It also resembles RDF/XML decision.
> 
> There should be an attribute that will indicate the difference between a
> string and an XML entity that consists of a string. I'm in doubt whether
> we should support generic entities there or just XML trees, so  probably
> we should repeat RDF/XML decision.
> 
> I understand that unescaped XML texts may add problems for some
> lightweight parsers of the format but these problems are minor and not
> common for all implementations whereas convenient report format is a
> worth thing for everybody.
> 
> I also understand that this will 'relax' XML Schema of the document but
> I don't care :)
> 
> Best Regards,
> Ivan Mikhailov.
> 
> On Sat, 2007-09-22 at 00:58 -0400, Lee Feigenbaum wrote:
>> Hi everyone,
>>
>> Eric is at risk for Tuesday; Orri and Ivan M can't make it, and I have 
>> schedule-crunch. We're still doing well towards a decision to move to 
>> PR, but I think we might shorten up this Tuesday's teleconf and push the 
>> meat of our work to a week from Tuesday.
>>
>> We do have one issue that we need to tackle ASAP:
>>
>> On August 2, we received a comment from Stu Baurmann:
>>
>> http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2007Aug/0005.html
>>
>> The message brings up the possibility of including unescaped XML literal 
>> values in the SPARQL Query Results XML Format. Although Richard Newman 
>> responded with some technical concerns about the suggestion, the Working 
>> Group never responded. We owe Stu a response before publishing a CR 
>> version of the XML results format.
>>
>> I'd like to know if there is anyone on the working group who would like 
>> to consider this suggestion and propose a design for it. I know that 
>> Andy had some technical concerns about it and there are also, of course, 
>> schedule concerns, but in the interest of due diligence I wanted to give 
>> working group members who might support this comment a chance to speak up.
>>
>> So please register your support or active lack of support on the mailing 
>> list if you can, and we'll attempt to dispose of the comment on 
>> Tuesday's teleconference.
>>
>> For Tuesday, I'm picturing taking up this issue and then going over 
>> where we stand in terms of advancing all three of our specifications to 
>> PR, and seeing who has what actions on the critical path between here 
>> and there. I'm hoping to keep the call to 30 minutes.
>>
>> The flip side is that I'm expecting a somewhat lengthy call the week 
>> after -- probably on the order of 90 minutes. Please let us know as soon 
>> as you can if you cannot make our call on Oct 2.
>>
>> Lee
>>
> 
>
Received on Saturday, 22 September 2007 18:01:53 UTC