Re: allow implicitly unbound variables in SPARQL results? from Jeen Broekstra on 2005-12-09 (public-rdf-dawg@w3.org from October to December 2005)

From: Jeen Broekstra <jeen.broekstra@aduna.biz>
Date: Fri, 09 Dec 2005 16:29:55 +0100
To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
CC: kendall@monkeyfist.com, Dan Connolly <connolly@w3.org>
Message-ID: <4399A2F3.2020805@aduna.biz>
Following up on myself here. This discussion is of relevance to Ron
Alford's comments to the WG regarding unbound variables in the SPARQL
query results XML format (see
http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2005Aug/0043 ).

Jeen Broekstra wrote:

> Kendall Clark wrote:

[snip]

>> I don't believe the change to <binding/> makes processing that much
>>  more difficult.
> 
> 
> To clarify: I think my major concern (with both stripping and 
> collapsing, though mostly with collapsing) is not so much that it is 
> more difficult but that it is more costly in terms of processor 
> performance. Note that Ron's tests only give performance figures for
> the case where you actually _have_ a significant size reduction
> (because it contains a lot of unbounds).
> 
> I'd like to see some figures on how comparative processor performance
> is for result sets that contain no unbounds (I'll see if I can come
> up with some figures on this myself tomorrow, or perhaps Ron will do
> this, you/Bijan mentioned he'd do some extra tests). That would give
> us a more complete picture of the consequences of either design.

Well, it's a bit later than tomorrow, but I've finally found the time to
do some tests with the different proposed formats and see how they
compare, performance-wise.

As input data I've taken the julie-dump.xml files from Ron's experiment
(see [1]) in their three versions: last call, stripped, and collapsed.
I've also produced a variant of the dataset, called
julie-dump-nonulls.xml, in which I've replaced all 'unbound' entries
with a dummy literal (file available on request).

The first test I did was basically a repeat of Ron's experiment, but
with different tools. Where Ron used XSLT processing, I've used Sesame's
SPARQLResultsReader, which is basically a Java SAX handler (using
Xerces-J's SAX driver), and created two variations of the original, one
for stripped, and one for collapsed.

The first surprise (to me personally) was that it took me virtually no
time to create these variations, so my fear that this is significantly
harder to implement was unfounded. All in all, for the collapsed
version, I had to add about 20 lines of Java code, compared to the
version based on the last call design.
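To illustrate why the change is small, here is a minimal sketch of a SAX handler that accepts both an explicit <unbound/> marker and a binding that is simply left out. It is not Sesame's SPARQLResultsReader: the class name TolerantResultsHandler and its parse helper are hypothetical, and the element names (variable, result, binding, unbound, uri, literal, bnode) are assumed from the last call results format. The collapsed variation is not covered here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, for illustration only (not the Sesame implementation).
public class TolerantResultsHandler extends DefaultHandler {
    private final List<String> variables = new ArrayList<>();
    private final List<Map<String, String>> results = new ArrayList<>();
    private Map<String, String> current; // bindings seen in the current <result>
    private String bindingName;          // name attribute of the open <binding>
    private StringBuilder text;          // non-null only while inside a value element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        switch (qName) {
            case "variable": variables.add(atts.getValue("name")); break;
            case "result":   current = new LinkedHashMap<>(); break;
            case "binding":  bindingName = atts.getValue("name"); break;
            case "unbound":  break; // explicit marker: record nothing, same as an omitted binding
            case "uri": case "literal": case "bnode": text = new StringBuilder(); break;
            default: break;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        switch (qName) {
            case "uri": case "literal": case "bnode":
                current.put(bindingName, text.toString());
                text = null;
                break;
            case "result":
                // Build the row in header order; variables without a value map to null.
                Map<String, String> row = new LinkedHashMap<>();
                for (String v : variables) row.put(v, current.get(v));
                results.add(row);
                break;
            default: break;
        }
    }

    // Parses a results document held in a string and returns one map per <result>.
    public static List<Map<String, String>> parse(String xml) {
        TolerantResultsHandler handler = new TolerantResultsHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse results document", e);
        }
        return handler.results;
    }
}
```

With this approach the last call and stripped forms of the same result produce identical rows, since both an <unbound/> child and a missing <binding> leave the variable mapped to null.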

In any case, I ran 6 parsing runs for each variation, and averaged the
results (with lowest and highest removed):

  julie-dump parsing

  variation   time (s)   perf. gain (%)
  last call   12.4       -
  stripped     9.3       25
  collapsed    5.1       58

So this corroborates Ron's findings of the performance gain (and of
course was to be expected). (Testing platform: Java 5, run from Eclipse
3.1 on a Dell Latitude D600 laptop with 512MB)
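The averaging described above (drop the lowest and the highest run, average the rest) and the gain column can be sketched as follows. This is an illustration only: the class and method names are hypothetical, and the individual run timings are not included because only the averages are reported here.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only.
public class TimingStats {

    /** Mean of the runs after discarding the single lowest and single highest value. */
    public static double trimmedMean(double[] runs) {
        if (runs.length < 3) {
            throw new IllegalArgumentException("need at least 3 runs");
        }
        double[] sorted = runs.clone();
        Arrays.sort(sorted);
        double sum = 0;
        for (int i = 1; i < sorted.length - 1; i++) {
            sum += sorted[i];
        }
        return sum / (sorted.length - 2);
    }

    /** Relative time saving of a variation against the baseline, in percent. */
    public static double gainPercent(double baseline, double variation) {
        return 100.0 * (baseline - variation) / baseline;
    }

    public static void main(String[] args) {
        // Averages taken from the table: last call 12.4 s, stripped 9.3 s.
        System.out.printf("stripped gain: %.1f%%%n", gainPercent(12.4, 9.3));
    }
}
```

With six runs per variation, trimmedMean averages the four middle values.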

I also did a test with each variation of the parser on a large result
that contains no unbounds at all, and therefore is of equal size for
each parser variation. The file is an adaptation of the julie-dump in
which each unbound is replaced with a literal value.

  julie-dump-nonulls parsing

  variation   time (s)
  last call   13.7
  stripped    13.5
  collapsed   13.9

As you can see, the difference is negligible across all three cases,
showing that even on large result sets (the test result set contains
234,661 results), performance of the 'more complicated' parser is not
significantly worse.

Given these results it seems that my original concerns about the
complexity of the collapsed variation were unfounded.

There are still a number of possible reasons I can think of to stick
with the last call design, however:

  1. It has been implemented already and changing it will break
     existing tools (but hey, we're a working draft, what do you
     expect?).
  2. The collapsed form esp. will make the data structure more
     irregular and therefore possibly harder to understand, esp. for
     people with table/SQL background.

I personally find neither reason particularly compelling, but I'd like
to hear the WG's opinion on this, if any.

Do we have enough data here to perhaps put it on next telcon's agenda
and make a decision one way or the other?

Cheers,

Jeen

[1]
http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2005Aug/0043
-- 
Jeen Broekstra          Aduna BV
Knowledge Engineer      Julianaplein 14b, 3817 CS Amersfoort
http://aduna.biz        The Netherlands
tel. +31 33 46599877
Received on Friday, 9 December 2005 15:33:44 UTC