- From: Lee Feigenbaum <feigenbl@us.ibm.com>
- Date: Wed, 14 Dec 2005 11:27:45 -0500
- To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
- Message-ID: <OF76720136.A1D4F1E9-ON852570D7.005A2DE9-852570D7.005A6C7E@us.ibm.com>
Sorry for the poor formatting. You'd think I'd eventually get the hang of this whole "email" thing. A fixed copy is here:

Jeen, thanks for the summary of the situation. I'd like to add a bit on Elias's and my objection to choice 'c'.

Basically, the results format as it currently stands explicitly makes possible multiple ways of processing the results -- either by variable name or by column position -- given that the format mandates that the order of the variable names in each row of the resultset must match the order in the header of the result document. Option 'c' eliminates the index-based approach to parsing the results. At the very least, I feel that if the Working Group decides on option 'c' then the requirement that variable name order be preserved within the resultset should be removed.

I must say that I, at least, am not particularly swayed by our requirement for a bandwidth-efficient protocol. Yes, the collapsed version is more efficient, but only for a particular class of queries (those featuring unbound variables). I would much rather publish a less-efficient yet fully functional XML design along with an informative pointer to a much more efficient binary design than attempt what I see as a strange compromise. (The flippant side of me wants to point out that the results format would be more efficient if it only used one-letter element names, as well.)

Regarding the complexity of the XSLT for the collapsed version: I appreciate Jeen's attempts to demonstrate that it is not particularly difficult to write XSLT for the collapsed result format, but:

1) The technique that Jeen uses, remembering positions and then comparing current positions with expected names, is not particularly elegant. Obviously, it is not incumbent upon us to produce a design which allows for elegant processing, but this does bother me.

2) More importantly, the XSLT as currently in CVS does not work correctly.
I am attaching a sample input in the result format that produces an incorrect and malformed table when used with this XSLT. I would argue that this is due to the added complexity of the collapsed result format.

Finally, from a procedural point of view, I'd like to point out that 4.7 is a design objective and NOT a requirement.

thanks,
Lee

Jeen Broekstra <jeen.broekstra@aduna.biz>
Sent by: public-rdf-dawg-request@w3.org
12/14/2005 04:55 AM

To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
cc: Dan Connolly <connolly@w3.org>
Subject: Re: allow implicitly unbound variables in SPARQL results?

After yesterday's telcon it seems that we have for now eliminated going for the 'stripped' variation, but unfortunately we got more or less deadlocked on deciding between sticking with the LC design (option a) and going for the 'collapsed' variation (option c). Most people seemed slightly in favor of the LC design, but there are some vehement protests against that option as well. Likewise, most people are ok with the collapsed version, but some people have serious objections to it.

After giving the whole thing some additional thought I think I am starting to come down in favor of going for option c, and I'll lay out the main reasons for it here. I invite anyone to disagree with me and to convince me to stick with option a instead :)

A lot of this is simply a repeat of what has been said earlier, but I'm hoping to use this as a springboard for reaching a WG decision as soon as possible.

The major argument against choosing option a seems to be that in our use cases and requirements[1] there is a requirement for bandwidth efficiency:

[[
4.7 Bandwidth-efficient Protocol

The access protocol design shall address bandwidth utilization issues; that is, it shall allow for at least one result format that does not make excessive use of network bandwidth for a given collection of results.

Status: Accepted.
]]

Whether or not the LC design meets this requirement is subjective, I guess (what is "excessive", exactly?); however, it has been shown that more bandwidth-efficient variations are not only possible, but workable.

Option c has been shown to be definitely more compact and bandwidth-efficient. It has also been shown that processing it through simple SAX parsing in a language like Java is no more complex than processing the LC design. Moreover, IMHO it feels conceptually more correct: the whole point of XML is that it caters for semi-structuredness by not enforcing rigid schema adherence. When certain information is not available (for example, the binding for a particular variable), you simply leave it out.

The main problem people seem to have with the format is the perceived complexity of processing it with XSLT. This arises from the fact that you have to align values to the correct column when doing table rendering (in other words, you have to detect the 'missing' information by matching the name of a binding to the column header, and inserting an empty table cell if it does not match).

I have shown that it is possible[2] to do this by producing an XSLT sheet for rendering the collapsed format as an HTML table[3]. Moreover, it was pointed out during the telcon that for many XSLT tasks, this column-matching is not even necessary: many types of processing may simply involve picking the right variable with a particular mentioned value and are not concerned with rendering a table column at all.

For these reasons, I feel that currently we cannot really claim that XSLT processing is made so much harder by going for this design, and therefore I think option c is the right way to go. If others disagree with this, I'd like them to outline an example processing task for which writing the XSLT is so much more complex that it warrants sticking with a less efficient design.

Binary vs. XML
==============

Quite separately from this, there is the issue of having an *XML*-based result format in the first place. It has been shown that for purposes of bandwidth efficiency, the choice of XML is a limiting factor, and a dedicated (binary) format is much better. Here are some figures showing the relative differences between the options we have (obtained with Sesame running on desktop hardware on the julie-dump data set):

Parse results from file on local disk:
--------------------------------------
variation    time(s)   indexed time (last call = 100)
last call    9.3       100
stripped     6.8       73
collapsed    4.0       43
binary       0.6       6

File sizes & (de)compression times:
-----------------------------------
variation    uncompressed(kB)   gzipped(kB)   gzip(s)   gunzip(s)
last call    123,709            3,570         79        7
stripped     83,607             3,128         55        4
collapsed    55,338             2,985         37        2
binary       11,280             2,530         9         1

As one can see, the performance gain on practically all fronts from using a binary format completely dwarfs any performance gain from optimizing the XML format.

So a separate question is whether or not the WG wants to sanction (informally?) a specification of such a binary format (I know that Andy and I are at least interested in submitting such a format to W3C). If we do decide to do this, it takes away part of the reason for changing the XML format, but of course both options will still be open. I do, however, believe that if we decide to stick with the LC design, we will _have_ to sanction this additional binary format, because otherwise we have not sufficiently provided for requirement 4.7.

Jeen

[1] http://www.w3.org/TR/rdf-dawg-uc/
[2] Well, duh, of course it is *possible*: XSLT is Turing-complete. The real question is whether it is easy enough. I think, in the end, that it is, given that I do not have very much practical XSLT experience and still managed to do this (more-or-less correctly) in an afternoon.
[3] http://www.w3.org/2001/sw/DataAccess/rf1/sparql-collapsed-to-html.xsl

--
Jeen Broekstra          Aduna BV
Knowledge Engineer      Julianaplein 14b, 3817 CS Amersfoort
http://aduna.biz        The Netherlands
tel. +31 33 46599877
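
The column alignment discussed in the thread above (matching each binding's name against the header, and leaving a cell empty when a variable has no binding) can be sketched outside XSLT as well. The Python below is only an illustration under stated assumptions: the sample document, the variable names `name` and `mbox`, and the function `rows_by_column` are all hypothetical, the namespace is assumed to be http://www.w3.org/2005/sparql-results#, and this is neither the stylesheet in CVS nor the attached output-collapsed.srx.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the results format; adjust if the document differs.
NS = {"sr": "http://www.w3.org/2005/sparql-results#"}

# Hypothetical collapsed result: the second row has no binding for "name".
SAMPLE = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="name"/>
    <variable name="mbox"/>
  </head>
  <results>
    <result>
      <binding name="name"><literal>Alice</literal></binding>
      <binding name="mbox"><uri>mailto:alice@example.org</uri></binding>
    </result>
    <result>
      <binding name="mbox"><uri>mailto:bob@example.org</uri></binding>
    </result>
  </results>
</sparql>"""


def rows_by_column(xml_text):
    """Return (header, rows); every row has one cell per header variable."""
    root = ET.fromstring(xml_text)
    header = [v.get("name") for v in root.findall("sr:head/sr:variable", NS)]
    rows = []
    for result in root.findall("sr:results/sr:result", NS):
        # Index the bindings by variable name, so a missing variable is
        # detected by lookup rather than by its position in the row.
        bound = {}
        for binding in result.findall("sr:binding", NS):
            # The value is the first child element (uri, literal, ...).
            bound[binding.get("name")] = binding[0].text if len(binding) else None
        # Unbound variables become None, i.e. an empty table cell.
        rows.append([bound.get(var) for var in header])
    return header, rows


if __name__ == "__main__":
    header, rows = rows_by_column(SAMPLE)
    print(header)
    for row in rows:
        print(row)
```

With name-based lookup the second row keeps its `mbox` value in the second column; reading the same row purely by position would place it under `name`, which is the index-based processing that option c gives up.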
Attachments
- application/octet-stream attachment: output-collapsed.srx
Received on Wednesday, 14 December 2005 16:27:59 UTC