Re: allow implicitly unbound variables in SPARQL results? from Jeen Broekstra on 2005-12-14 (public-rdf-dawg@w3.org from October to December 2005)

From: Jeen Broekstra <jeen.broekstra@aduna.biz>
Date: Wed, 14 Dec 2005 10:55:30 +0100
To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
CC: Dan Connolly <connolly@w3.org>
Message-ID: <439FEC12.8030906@aduna.biz>
After yesterday's telcon it seems that we have for now eliminated going
for the 'stripped' variation, but unfortunately we got more or less
deadlocked on deciding between sticking with the LC design (option a)
and going for the 'collapsed' variation (option c). Most people seemed
slightly in favor of the LC design, but there are some vehement protests
against that option as well. Likewise, most people are ok with the
collapsed version as well, but there some people have serious objections
as well.

After giving the whole thing some additional thought I think I am
starting to come down in favor of going for option c, and I'll lay out
the main reasons for it here. I invite anyone to disagree with me and to
convince me to stick with option a instead :)

A lot of this is simply a repeat of what has been said earlier, but I'm
hoping to use this as a springboard for reaching a WG decision as soon
as possible.

The major argument against choosing option a seems to be that in our use
cases and requirements[1] there is a requirement for bandwidth efficiency:

  [[4.7 Bandwidth-efficient Protocol

    The access protocol design shall address bandwidth utilization
    issues; that is, it shall allow for at least one result format that
    does not make excessive use of network bandwidth for a given
    collection of results.

    Status: Accepted.]]

Whether or not the LC design meets this requirement is subjective I
guess (what is "excessive", exactly?), however it has been shown that
more bandwidth-efficient variations are not only possible, but workable.

Option c has been shown to be definitely more compact, and
bandwidth-efficient. It has also been shown that processing it through
simple SAX parsing in a language like Java is no more complex than the
LC design. Moreover, IMHO it feels conceptually more correct: the whole
point of XML is that it caters for semi-structuredness by not enforcing
rigid schema adherance. When certain information is not available, (for
example, the binding for a particulary variable), you simply leave it out.

The main problem people seem to have with the format is the perceived
complexity of processing it with XSLT. This arises from the fact that
you have to align values to the correct column when doing table
rendering (in other words, you have to detect the 'missing' information
by matching the name of a binding to the column header, and inserting an
empty table cell if it does not match). I have shown that it is
possible[2] to do this by producing an XSLT sheet for rendering the
collapsed format as an HTML table[3]. Moreover, it was pointed out
during the telcon that for many XSLT tasks, this column-matching is not
even necessary: many types of processing may simply involve picking the
right variable with a particular mentioned value and are not concerned
with rendering a table column at all.

For these reasons, I feel that currently we can not really claim that
XSLT processing is made so much harder by going for this design, and
therefore I think option c is the right way to go.

If others disagree with this, I'd like them to outline an example
processing task for which writing the XSLT is so much more complex that
it warrants sticking with a less efficient design.

Binary vs. XML
==============

Quite seperately from this, there is the issue of having an *XML*-based
result format in the first place. It has been shown that for purposes of
bandwidth efficiency, the choice of XML in the is a limiting factor, and
a dedicated (binary) format is much better.

To illustrate this, here are some figures to illustrate the relative
differences between the options we have (obtained with Sesame running on
desktop hardware on the julie-dump data set):

Parse results from file on local disk:
--------------------------------------
variation    time(s)  indexed time
last call      9.3         100
stripped       6.8          73
collapsed      4.0          43
binary         0.6           6

File sizes & (de)compression times:
-----------------------------------
variation    uncompressed(kB)   gzipped(kB)   gzip(s)   gunzip(s)
last call       123,709           3,570         79         7
stripped         83,607           3,128         55         4
collapsed        55,338           2,985         37         2
binary           11,280           2,530          9         1

As one can see, the performance gain on practically all fronts by using
a binary format completely dwarfs any performance gain by optimizing the
XML format. So a separate question is whether or not the WG wants to
sanction (informally?) a specification of such a binary format (I know
that Andy and I are at least interested in submitting such a format to
W3C).

If we do decide to do this, at takes away part of the reason for
changing the XML format, but of course still both options will be open.
I do, however, believe that if we decide to stick with the LC design, we
will _have_ to sanction this additional binary format, because otherwise
we have not sufficiently provided for requirement 4.7.

Jeen

[1] http://www.w3.org/TR/rdf-dawg-uc/
[2] Well, duh, of course it is *possible*: XSLT is Turing-complete. The
     real question is whether it is easy enough. I think, in the end,
     that it is, given that I do not have very much practical XSLT
     experience and still managed to do this (more-or-less correctly) in
     an afternoon.
[3] http://www.w3.org/2001/sw/DataAccess/rf1/sparql-collapsed-to-html.xsl
-- 
Jeen Broekstra          Aduna BV
Knowledge Engineer      Julianaplein 14b, 3817 CS Amersfoort
http://aduna.biz        The Netherlands
tel. +31 33 46599877
Received on Wednesday, 14 December 2005 09:56:12 UTC