- From: Jeen Broekstra <jeen.broekstra@aduna.biz>
- Date: Wed, 14 Dec 2005 10:55:30 +0100
- To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
- CC: Dan Connolly <connolly@w3.org>
After yesterday's telcon it seems that we have for now eliminated the 'stripped' variation, but unfortunately we got more or less deadlocked on deciding between sticking with the LC design (option a) and going for the 'collapsed' variation (option c). Most people seemed slightly in favor of the LC design, but there are some vehement protests against that option as well. Likewise, most people are ok with the collapsed version, but some people have serious objections to it too.

After giving the whole thing some additional thought I think I am starting to come down in favor of option c, and I'll lay out the main reasons for it here. I invite anyone to disagree with me and to convince me to stick with option a instead :) A lot of this is simply a repeat of what has been said earlier, but I'm hoping to use this as a springboard for reaching a WG decision as soon as possible.

The major argument against choosing option a seems to be that in our use cases and requirements[1] there is a requirement for bandwidth efficiency:

[[
4.7 Bandwidth-efficient Protocol

The access protocol design shall address bandwidth utilization issues; that is, it shall allow for at least one result format that does not make excessive use of network bandwidth for a given collection of results.

Status: Accepted.
]]

Whether or not the LC design meets this requirement is subjective, I guess (what is "excessive", exactly?). However, it has been shown that more bandwidth-efficient variations are not only possible, but workable. Option c has been shown to be definitely more compact and bandwidth-efficient. It has also been shown that processing it through simple SAX parsing in a language like Java is no more complex than processing the LC design.

Moreover, IMHO it feels conceptually more correct: the whole point of XML is that it caters for semi-structuredness by not enforcing rigid schema adherence.
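
A minimal sketch of the kind of SAX-style processing referred to above, written with Python's standard xml.sax module rather than Java. The sample document and its element names (head, variable, result, binding, uri, literal, bnode) are assumptions modeled on the later SPARQL Query Results XML Format, not taken from this message; parse_collapsed and to_table are hypothetical helper names. It only illustrates that a binding which is absent from a result can be handled by a lookup against the header.

```python
import xml.sax

# Hypothetical sample in a "collapsed" style: a variable without a value
# is simply omitted from the result. Element names are assumptions.
SAMPLE = """<sparql>
  <head>
    <variable name="name"/>
    <variable name="mbox"/>
  </head>
  <results>
    <result>
      <binding name="name"><literal>Alice</literal></binding>
      <binding name="mbox"><uri>mailto:alice@example.org</uri></binding>
    </result>
    <result>
      <binding name="name"><literal>Bob</literal></binding>
    </result>
  </results>
</sparql>"""

VALUE_TAGS = ("uri", "literal", "bnode")


class CollapsedHandler(xml.sax.ContentHandler):
    """Collects the declared variable names and one dict per result."""

    def __init__(self):
        super().__init__()
        self.variables = []   # column order, from the header
        self.rows = []        # one {variable: value} dict per result
        self._row = None      # result currently being read
        self._name = None     # variable name of the current binding
        self._text = None     # character buffer while inside a value

    def startElement(self, tag, attrs):
        if tag == "variable":
            self.variables.append(attrs["name"])
        elif tag == "result":
            self._row = {}
        elif tag == "binding":
            self._name = attrs["name"]
        elif tag in VALUE_TAGS:
            self._text = []

    def characters(self, content):
        # Only buffer text that sits inside a value element.
        if self._text is not None:
            self._text.append(content)

    def endElement(self, tag):
        if tag in VALUE_TAGS:
            self._row[self._name] = "".join(self._text)
            self._text = None
        elif tag == "result":
            self.rows.append(self._row)
            self._row = None


def parse_collapsed(xml_text):
    """Parse the sample format and return (variables, rows)."""
    handler = CollapsedHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.variables, handler.rows


def to_table(variables, rows):
    """Align every row to the header; a missing binding becomes ''."""
    return [[row.get(var, "") for var in variables] for row in rows]
```

Calling to_table(*parse_collapsed(SAMPLE)) yields one list per result in header order, with an empty string where a binding was left out.
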
When certain information is not available (for example, the binding for a particular variable), you simply leave it out.

The main problem people seem to have with the format is the perceived complexity of processing it with XSLT. This arises from the fact that you have to align values to the correct column when doing table rendering (in other words, you have to detect the 'missing' information by matching the name of a binding to the column header, and inserting an empty table cell if it does not match). I have shown that it is possible[2] to do this by producing an XSLT sheet for rendering the collapsed format as an HTML table[3]. Moreover, it was pointed out during the telcon that for many XSLT tasks, this column-matching is not even necessary: many types of processing may simply involve picking the right variable with a particular mentioned value and are not concerned with rendering a table column at all.

For these reasons, I feel that currently we cannot really claim that XSLT processing is made so much harder by going for this design, and therefore I think option c is the right way to go. If others disagree with this, I'd like them to outline an example processing task for which writing the XSLT is so much more complex that it warrants sticking with a less efficient design.

Binary vs. XML
==============

Quite separately from this, there is the issue of having an *XML*-based result format in the first place. It has been shown that for purposes of bandwidth efficiency, the choice of XML is a limiting factor, and a dedicated (binary) format is much better.
To illustrate this, here are some figures showing the relative differences between the options we have (obtained with Sesame running on desktop hardware on the julie-dump data set):

Parse results from file on local disk:
--------------------------------------
variation   time(s)   indexed time (last call = 100)
last call       9.3            100
stripped        6.8             73
collapsed       4.0             43
binary          0.6              6

File sizes & (de)compression times:
-----------------------------------
variation   uncompressed(kB)   gzipped(kB)   gzip(s)   gunzip(s)
last call            123,709         3,570        79           7
stripped              83,607         3,128        55           4
collapsed             55,338         2,985        37           2
binary                11,280         2,530         9           1

As one can see, the performance gain on practically all fronts from using a binary format completely dwarfs any performance gain from optimizing the XML format. So a separate question is whether or not the WG wants to sanction (informally?) a specification of such a binary format (I know that Andy and I are at least interested in submitting such a format to W3C). If we do decide to do this, it takes away part of the reason for changing the XML format, but of course both options will still be open.

I do, however, believe that if we decide to stick with the LC design, we will _have_ to sanction this additional binary format, because otherwise we have not sufficiently provided for requirement 4.7.

Jeen

[1] http://www.w3.org/TR/rdf-dawg-uc/
[2] Well, duh, of course it is *possible*: XSLT is Turing-complete. The real question is whether it is easy enough. I think, in the end, that it is, given that I do not have very much practical XSLT experience and still managed to do this (more-or-less correctly) in an afternoon.
[3] http://www.w3.org/2001/sw/DataAccess/rf1/sparql-collapsed-to-html.xsl

--
Jeen Broekstra          Aduna BV
Knowledge Engineer      Julianaplein 14b, 3817 CS Amersfoort
http://aduna.biz        The Netherlands
                        tel. +31 33 46599877
Received on Wednesday, 14 December 2005 09:56:12 UTC