- From: Jeen Broekstra <jeen.broekstra@aduna.biz>
- Date: Fri, 09 Dec 2005 16:29:55 +0100
- To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
- CC: kendall@monkeyfist.com, Dan Connolly <connolly@w3.org>
Following up on myself here. This discussion is of relevance to Ron Alford's comments to the WG regarding unbound variables in the SPARQL query results XML format (see http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2005Aug/0043 ).

Jeen Broekstra wrote:
> Kendall Clark wrote:
[snip]
>> I don't believe the change to <binding/> makes processing that much
>> more difficult.
>
> To clarify: I think my major concern (with both stripping and
> collapsing, though mostly with collapsing) is not so much that it is
> more difficult but that it is more costly in terms of processor
> performance. Note that Ron's tests only give performance figures for
> the case where you actually _have_ a significant size reduction
> (because it contains a lot of unbounds).
>
> I'd like to see some figures on how comparative processor performance
> is for result sets that contain no unbounds (I'll see if I can come
> up with some figures on this myself tomorrow, or perhaps Ron will do
> this, you/Bijan mentioned he'd do some extra tests). That would give
> us a more complete picture of the consequences of either design.

Well, it's a bit later than tomorrow, but I've finally found the time to do some tests with the different proposed formats and see how they compare, performance-wise.

I've taken the julie-dump.xml files from Ron's experiment (see [1]), the last call version, the stripped version, and the collapsed version as input data for the experiments. I've also produced a variant of the dataset, called julie-dump-nonulls.xml, in which I've replaced all 'unbound' entries with a dummy literal (file available on request).

The first test I did was basically a repeat of Ron's experiment, but with different tools. Where Ron used XSLT processing, I've used Sesame's SPARQLResultsReader, which is basically a Java SAX handler (using Xerces-J's SAX driver), and created two variations of the original, one for stripped, and one for collapsed.
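For illustration only, here is a minimal sketch of a SAX handler that reads result rows while tolerating several ways of signalling an unbound variable. It is not Sesame's SPARQLResultsReader. The class name TolerantResultsHandler and the parse helper are hypothetical, and the three encodings it accepts (an <unbound/> child, an empty <binding/>, or no <binding> at all) are assumed readings of the variants under discussion, which this message does not spell out. It uses the JDK's default SAX parser rather than Xerces-J directly.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects each <result> as a map from variable name
// to value. A variable that is unbound simply has no entry in the map.
class TolerantResultsHandler extends DefaultHandler {
    private final List<Map<String, String>> rows = new ArrayList<>();
    private Map<String, String> row;   // row currently being read
    private String var;                // name of the current <binding>
    private StringBuilder text;        // non-null only inside a value element

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        switch (qName) {
            case "result":  row = new LinkedHashMap<>(); break;
            case "binding": var = atts.getValue("name"); break;
            case "uri": case "literal": case "bnode":
                text = new StringBuilder(); break;
            default: break; // <unbound/> and other elements record nothing
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        switch (qName) {
            case "uri": case "literal": case "bnode":
                row.put(var, text.toString()); text = null; break;
            case "binding": var = null; break; // empty binding: nothing stored
            case "result":  rows.add(row); row = null; break;
            default: break;
        }
    }

    // Convenience wrapper (an assumption of this sketch) that parses a string.
    static List<Map<String, String>> parse(String xml) {
        try {
            TolerantResultsHandler h = new TolerantResultsHandler();
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), h);
            return h.rows;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because an unbound variable is represented by the absence of a map entry, the same handler covers all three assumed encodings without a separate code path for each.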
The first surprise (to me personally) was that it took me virtually no time to create these variations, so my fear that this is significantly harder to implement was unfounded. All in all, for the collapsed version, I had to add about 20 lines of Java code, compared to the version based on the last call design.

In any case, I ran 6 parsing runs for each variation, and averaged the results (with the lowest and highest removed):

julie-dump parsing

variation    time (s)   perf. gain (%)
last call      12.4          -
stripped        9.3         25
collapsed       5.1         58

So this corroborates Ron's findings of the performance gain (and of course was to be expected).

(Testing platform: Java 5, run from Eclipse 3.1 on a Dell Latitude D600 laptop with 512MB)

I also did a test with each variation of the parser on a large result that contains no unbounds at all, and therefore is of equal size for each parser variation. The file is an adaptation of the julie-dump in which each unbound is replaced with a literal value.

julie-dump-nonull parsing

variation    time (s)
last call      13.7
stripped       13.5
collapsed      13.9

As you can see, the differences between the three cases are negligible, showing that even on large result sets (the test result set contains 234,661 results), performance of the 'more complicated' parser is not significantly worse.

Given these results it seems that my original concerns about the complexity of the collapsed variation were unfounded. There are still a number of possible reasons I can think of to stick with the last call design however:

1. It has been implemented already and changing it will break existing tools (but hey, we're a working draft, what do you expect?).

2. The collapsed form esp. will make the data structure more irregular and therefore possibly harder to understand, esp. for people with a table/SQL background.

I personally find neither reason particularly compelling, but I'd like to hear the WG's opinion on this, if any.
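For anyone wanting to repeat the measurements, the averaging described above (several runs, lowest and highest dropped) and the gain percentage can be sketched as follows. This is a minimal sketch, not the harness actually used for these tests; the class TrimmedMean and its method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical timing helpers, not part of Sesame or of the original tests.
class TrimmedMean {
    // Mean of the samples after discarding one lowest and one highest value.
    static double trimmedMean(double[] samples) {
        if (samples.length < 3) {
            throw new IllegalArgumentException("need at least 3 samples");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        double sum = 0;
        for (int i = 1; i < sorted.length - 1; i++) {
            sum += sorted[i];
        }
        return sum / (sorted.length - 2);
    }

    // Relative gain, in percent, of a candidate time over a baseline time.
    static double gainPercent(double baseline, double candidate) {
        return 100.0 * (baseline - candidate) / baseline;
    }

    // Runs the task the given number of times and returns the trimmed mean
    // of the wall-clock durations in seconds.
    static double time(Runnable task, int runs) {
        double[] seconds = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            seconds[i] = (System.nanoTime() - start) / 1e9;
        }
        return trimmedMean(seconds);
    }
}
```

Applied to the rounded times in the first table, gainPercent(12.4, 9.3) gives 25 and gainPercent(12.4, 5.1) gives roughly 58.9.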
Do we have enough data here to perhaps put it on next telcon's agenda and make a decision one way or the other?

Cheers,

Jeen

[1] http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2005Aug/0043

-- 
Jeen Broekstra          Aduna BV
Knowledge Engineer      Julianaplein 14b, 3817 CS Amersfoort
http://aduna.biz        The Netherlands
tel. +31 33 46599877
Received on Friday, 9 December 2005 15:33:44 UTC