Re: allow implicitly unbound variables in SPARQL results?

Jeen, thanks for the summary of the situation. 

I'd like to add a bit on Elias's and my objection to choice 'c'.
Basically, the results format as it currently stands explicitly allows
two ways of processing it: by variable name or by column position. The
second is possible because the format mandates that the order of the
bindings in each row of the resultset match the order of the variable
names in the header of the result document. Option 'c' eliminates the
index-based approach to parsing the results. At the very least, I feel
that if the Working Group decides on option 'c' then the requirement
that variable name order be preserved within the resultset should be
removed.
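
To make the concern concrete, here is a minimal Python sketch. It
assumes a simplified, namespace-free rendering of the collapsed format
in which the binding for an unbound variable is simply omitted. The
sample document, the variable names, and the helpers header,
row_by_name and row_by_position are all hypothetical; none of them are
taken from the results format draft or from CVS.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified "collapsed" result: ?mbox is unbound, so its
# binding is left out entirely. Element names are assumptions.
COLLAPSED = """
<sparql>
  <head>
    <variable name="name"/>
    <variable name="mbox"/>
    <variable name="homepage"/>
  </head>
  <results>
    <result>
      <binding name="name"><literal>Alice</literal></binding>
      <binding name="homepage"><uri>http://example.org/alice</uri></binding>
    </result>
  </results>
</sparql>
"""


def header(root):
    """Return the variable names in header order."""
    return [v.get("name") for v in root.find("head").findall("variable")]


def row_by_name(result, variables):
    """Name-based processing: look each header variable up by name."""
    bound = {b.get("name"): b[0].text for b in result.findall("binding")}
    return [bound.get(v) for v in variables]


def row_by_position(result, variables):
    """Index-based processing: pair the i-th binding with the i-th variable."""
    values = [b[0].text for b in result.findall("binding")]
    return dict(zip(variables, values))


root = ET.fromstring(COLLAPSED)
variables = header(root)
result = root.find("results").find("result")

by_name = row_by_name(result, variables)
by_position = row_by_position(result, variables)
```

With this sample, the name-based lookup leaves mbox empty, while the
positional pairing assigns the homepage URI to mbox and loses homepage
altogether.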

I must say that I, at least, am not particularly swayed by our
requirement for a bandwidth-efficient protocol. Yes, the collapsed
version is more efficient, but only for a particular class of queries
(those featuring unbound variables). I would much rather publish a
less-efficient yet fully functional XML design, along with an
informative pointer to a much more efficient binary design, than
attempt what I see as a strange compromise. (The flippant side of me
wants to point out that the results format would be more efficient if
it only used one-letter element names, as well.)

Regarding the complexity of the XSLT for the collapsed version: while I
appreciate Jeen's attempts to demonstrate that it is not particularly
difficult to write XSLT for the collapsed result format, I have two
concerns:

1) The technique that Jeen uses, remembering positions and then
comparing current positions with expected names, is not particularly
elegant. Obviously, it is not incumbent upon us to produce a design
which allows for elegant processing, but this does bother me.

2) More importantly, the XSLT as currently in CVS does not work
correctly. I am attaching a sample input result format that produces an
incorrect and malformed table when used with this XSLT. I would argue
that this is due to the added complexity of the collapsed result format.

Finally, from a procedural point of view, I'd like to point out that
4.7 is a design objective and NOT a requirement.


thanks,
Lee

Jeen Broekstra <jeen.broekstra@aduna.biz> 
Sent by: public-rdf-dawg-request@w3.org
12/14/2005 04:55 AM

To
RDF Data Access Working Group <public-rdf-dawg@w3.org>
cc
Dan Connolly <connolly@w3.org>
Subject
Re: allow implicitly unbound variables in SPARQL results?

After yesterday's telcon it seems that we have for now eliminated going
for the 'stripped' variation, but unfortunately we got more or less
deadlocked on deciding between sticking with the LC design (option a)
and going for the 'collapsed' variation (option c). Most people seemed
slightly in favor of the LC design, but there are some vehement protests
against that option as well. Likewise, most people are ok with the
collapsed version, but some people have serious objections to it
too.

After giving the whole thing some additional thought I think I am
starting to come down in favor of going for option c, and I'll lay out
the main reasons for it here. I invite anyone to disagree with me and to
convince me to stick with option a instead :)

A lot of this is simply a repeat of what has been said earlier, but I'm
hoping to use this as a springboard for reaching a WG decision as soon
as possible.

The major argument against choosing option a seems to be that in our use
cases and requirements[1] there is a requirement for bandwidth efficiency:

  [[4.7 Bandwidth-efficient Protocol

    The access protocol design shall address bandwidth utilization
    issues; that is, it shall allow for at least one result format that
    does not make excessive use of network bandwidth for a given
    collection of results.

    Status: Accepted.]]

Whether or not the LC design meets this requirement is subjective I
guess (what is "excessive", exactly?), however it has been shown that
more bandwidth-efficient variations are not only possible, but workable.

Option c has been shown to be definitely more compact, and
bandwidth-efficient. It has also been shown that processing it through
simple SAX parsing in a language like Java is no more complex than the
LC design. Moreover, IMHO it feels conceptually more correct: the whole
point of XML is that it caters for semi-structuredness by not enforcing
rigid schema adherence. When certain information is not available (for
example, the binding for a particular variable), you simply leave it out.

The main problem people seem to have with the format is the perceived
complexity of processing it with XSLT. This arises from the fact that
you have to align values to the correct column when doing table
rendering (in other words, you have to detect the 'missing' information
by matching the name of a binding to the column header, and inserting an
empty table cell if it does not match). I have shown that it is
possible[2] to do this by producing an XSLT sheet for rendering the
collapsed format as an HTML table[3]. Moreover, it was pointed out
during the telcon that for many XSLT tasks, this column-matching is not
even necessary: many types of processing may simply involve picking the
right variable with a particular mentioned value and are not concerned
with rendering a table column at all.
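
The column alignment described above can be sketched outside XSLT as
follows. This is a minimal Python illustration, not the stylesheet
referenced in [3]: it assumes each result row has already been parsed
into a mapping from variable name to value, with unbound variables
simply absent, and the function name render_table is hypothetical.

```python
from html import escape


def render_table(variables, rows):
    """Render rows as an HTML table, one column per header variable.

    variables: variable names in header order.
    rows: list of dicts mapping variable name -> string value; a variable
          that is unbound in a row is simply missing from that dict.
    """
    lines = ["<table>"]
    lines.append(
        "<tr>" + "".join(f"<th>{escape(v)}</th>" for v in variables) + "</tr>"
    )
    for row in rows:
        # Match each column header against the row's bindings by name and
        # emit an empty cell when no binding with that name is present.
        cells = "".join(f"<td>{escape(row.get(v, ''))}</td>" for v in variables)
        lines.append("<tr>" + cells + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


table = render_table(["x", "y"], [{"x": "a<b"}, {"y": "2"}])
```

Because every cell is looked up by name, the order of the bindings
within a row does not matter in this sketch; only the header order
determines the column layout.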

For these reasons, I feel that currently we cannot really claim that
XSLT processing is made so much harder by going for this design, and
therefore I think option c is the right way to go.

If others disagree with this, I'd like them to outline an example
processing task for which writing the XSLT is so much more complex that
it warrants sticking with a less efficient design.

Binary vs. XML
==============

Quite separately from this, there is the issue of having an *XML*-based
result format in the first place. It has been shown that for purposes of
bandwidth efficiency, the choice of XML is a limiting factor, and
a dedicated (binary) format is much better.

To illustrate this, here are some figures showing the relative
differences between the options we have (obtained with Sesame running on
desktop hardware on the julie-dump data set):

Parse results from file on local disk:
--------------------------------------
variation    time(s)  indexed time
last call      9.3         100
stripped       6.8          73
collapsed      4.0          43
binary         0.6           6

File sizes & (de)compression times:
-----------------------------------
variation    uncompressed(kB)   gzipped(kB)   gzip(s)   gunzip(s)
last call       123,709           3,570         79         7
stripped         83,607           3,128         55         4
collapsed        55,338           2,985         37         2
binary           11,280           2,530          9         1

As one can see, the performance gain on practically all fronts by using
a binary format completely dwarfs any performance gain by optimizing the
XML format. So a separate question is whether or not the WG wants to
sanction (informally?) a specification of such a binary format (I know
that Andy and I are at least interested in submitting such a format to
W3C).
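
The "indexed time" column above is each parse time expressed relative
to the last call design (= 100). The same normalisation can be applied
to the file sizes, as in this minimal Python sketch. The figures are
copied from the tables above; the function name indexed and the
dictionary names are hypothetical.

```python
# Figures copied from the tables above.
parse_seconds = {"last call": 9.3, "stripped": 6.8, "collapsed": 4.0, "binary": 0.6}
uncompressed_kb = {"last call": 123709, "stripped": 83607, "collapsed": 55338, "binary": 11280}
gzipped_kb = {"last call": 3570, "stripped": 3128, "collapsed": 2985, "binary": 2530}


def indexed(values, baseline="last call"):
    """Express every value as a rounded percentage of the baseline value."""
    base = values[baseline]
    return {name: round(100 * value / base) for name, value in values.items()}


indexed_parse = indexed(parse_seconds)
indexed_uncompressed = indexed(uncompressed_kb)
indexed_gzipped = indexed(gzipped_kb)
```

Applied to the parse times, this reproduces the indexed column (100,
73, 43, 6). Applied to the gzipped sizes, it gives 100, 88, 84 and 71,
so the variations differ far less in size once compressed than they do
uncompressed.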

If we do decide to do this, it takes away part of the reason for
changing the XML format, but of course both options will still be open.
I do, however, believe that if we decide to stick with the LC design, we
will _have_ to sanction this additional binary format, because otherwise
we have not sufficiently provided for requirement 4.7.

Jeen

[1] http://www.w3.org/TR/rdf-dawg-uc/
[2] Well, duh, of course it is *possible*: XSLT is Turing-complete. The
     real question is whether it is easy enough. I think, in the end,
     that it is, given that I do not have very much practical XSLT
     experience and still managed to do this (more-or-less correctly) in
     an afternoon.
[3] http://www.w3.org/2001/sw/DataAccess/rf1/sparql-collapsed-to-html.xsl
-- 
Jeen Broekstra          Aduna BV
Knowledge Engineer      Julianaplein 14b, 3817 CS Amersfoort
http://aduna.biz        The Netherlands
tel. +31 33 46599877

Received on Wednesday, 14 December 2005 16:13:54 UTC