- From: Richard Newman <r.newman@reading.ac.uk>
- Date: Wed, 10 Aug 2005 09:34:33 -0700
- To: public-rdf-dawg-comments@w3.org
Hello all,

As SPARQL is in last call, I've been asked to provide some feedback based on my experiences of implementing twinql[1]. (I'm re-sending this, because it didn't seem to arrive.) I'm not sure what structure this should take, so I'll just ramble. Bear with me.

twinql is a SPARQL query implementation for the Wilbur RDF toolkit[2], written in ANSI Common Lisp. As documented on my blog[3], I took a typically Lispy two-stage DSL approach to implementing SPARQL, writing an executor for an s-expression syntax, then a parser to convert query text into s-expressions. This was a very useful technique, allowing me to implement pre-compilation and direct execution of s-expressions independently of the parser, and leaving open the possibility of later integrating a better parser.

The s-expression syntax is deliberately roughly isomorphic to the query language:

  (sparql :select :distinct t
          :vars '(name mbox)
          :from "http://www.holygoat.co.uk/foaf.rdf"
          :where '(graph-pattern
                   (triple x !foaf:knows !rich:RichardNewman)
                   (triple x !foaf:name name)
                   (triple x !foaf:mbox_sha1sum mbox)))

(The parser does the job of producing Wilbur nodes (indicated by !prefix:suffix) from the PREFIX and BASE declarations, so they do not appear here.) This s-expression can be executed directly at the Lisp toplevel -- if you have a Lisp system to hand, I invite you to ASDF-install twinql, switch to the :twinql package, and paste it in.

The executor uses hash tables to represent bindings and touched lists when resolving graph patterns. Calculating nested graph patterns, unions, and optionals was almost zero effort, thanks to the design of the executor -- I just provide different bindings or touched lists to the execution function. Thus, on the topic of unions, I would disagree that they are frustrating for implementers: it took me literally minutes to get them working, and I sketch the idea below. (Whether the implementation is correct is another matter!)

The parser was a little less exciting to develop. I used CL-YACC, a yacc-like parser generator for Common Lisp. This required:

• Translating the EBNF grammar into s-expression BNF productions
• Writing a lexer
• Wiring actions into the productions to output the s-expressions.

Translating the grammar was no fun at all. Firstly, the EBNF had to be manually translated into BNF, requiring the creation of a number of intermediate productions. Secondly, the grammar is beyond the reach of a simple parser generator, requiring lookahead -- not a difficulty for the Java compiler tool being used, but too much for CL-YACC. Thus, implementing the parser needed continual switching back to the lexer, shifting the complicated "smarts" into the generation of tokens. The lexer is therefore quite complex: it applies various regular expressions, produces symbols, resolves URIs, and builds literals.

One unavoidable issue was the optional dot after the last query element in each graph pattern. This made the grammar ambiguous, so twinql currently makes those dots compulsory. I'm looking for a better solution.

I'm also unhappy about my implementation of comment removal. URIs and literals can contain hashes, so it's not enough to strip from any hash to the end of the line. I extract URIs and literals first to avoid this, but that means comments containing URI or literal patterns could mess up your query. I'll have to fix this some time; the lexer mostly does the job, but I could do better. (A character-scanning alternative is sketched below.)

Thanks to my existing implementation of CBDs and serialisation, DESCRIBE and CONSTRUCT were straightforward.
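To make the claim about unions concrete, here is a minimal, self-contained sketch of the idea. It is a hypothetical stand-in, not twinql's actual code: bindings are hash tables, and eval-fn stands for whatever function evaluates one graph pattern against one set of bindings.

  ;; Toy sketch (not twinql's code): UNION evaluates each arm
  ;; against a private copy of the incoming bindings.
  (defun copy-bindings (bindings)
    "Return a fresh hash table with the same variable -> value entries."
    (let ((new (make-hash-table :test #'eq)))
      (maphash (lambda (var val) (setf (gethash var new) val)) bindings)
      new))

  (defun eval-union (arms bindings eval-fn)
    "Evaluate each ARM of a UNION against its own copy of BINDINGS.
  EVAL-FN maps a pattern and a bindings table to a list of extended
  bindings tables; the per-arm result sets are simply appended."
    (loop for arm in arms
          append (funcall eval-fn arm (copy-bindings bindings))))

No new machinery is needed: each branch sees the same incoming bindings, and the branch results are concatenated into one result set.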
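And here is the kind of character-scanning comment stripper I have in mind as a replacement -- again a simplified sketch rather than twinql's current lexer; it ignores escapes and SPARQL's single-quoted and long literal forms.

  ;; Hypothetical sketch: strip #-comments in one pass by tracking
  ;; whether we are inside a <...> URI reference or a "..." literal.
  (defun strip-comments (query)
    (with-output-to-string (out)
      (let ((state :code))
        (loop for ch across query
              do (ecase state
                   (:code
                    (case ch
                      (#\< (setf state :uri))
                      (#\" (setf state :literal))
                      (#\# (setf state :comment)))
                    (unless (eq state :comment)
                      (write-char ch out)))
                   (:uri
                    (when (char= ch #\>) (setf state :code))
                    (write-char ch out))
                   (:literal
                    (when (char= ch #\") (setf state :code))
                    (write-char ch out))
                   (:comment
                    (when (char= ch #\Newline)
                      (setf state :code)
                      (write-char ch out))))))))

A hash then only starts a comment when the scanner is outside a URI reference or literal, so nothing needs to be extracted and re-inserted.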
Serialising binding hashes to the XML results format was similarly easy.

The features of SPARQL that I have not yet fully implemented are (off the top of my head):

• Filter functions
• Ordering, limits, etc.

Regexes and other filter function calls have had effort expended on them, but are not yet usable. These are obviously vital, and I can't think of a better way to integrate them into the syntax. Implementation-wise, I obviously need to map between function URIs and functions, which should be straightforward; a dispatch-table sketch appears below. Sorting (especially with datatypes) is less easy, but is mostly drudge work. Limit would require result caching -- computing the whole result set, but serialising it in parts -- and has not yet been attempted.

Named graphs were actually trivially easy to implement on top of Wilbur's sourced quads. If any FROM NAMED terms exist in the query, we retrieve every triple with the specified source from the FROM source, and add them to a new store. (Now, this may not be in line with the actual meaning of the source field -- which might not be implementing named graphs -- but I wouldn't do it any other way.) I don't think I've implemented graph name variables, because I couldn't see an elegant way to do it.
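For the URI-to-function mapping, the obvious approach is a dispatch table. A minimal sketch -- the names *filter-functions*, register-filter, and call-filter are invented for illustration, and the registered URI is just an example:

  ;; Hypothetical sketch: dispatch filter calls by function URI.
  (defvar *filter-functions* (make-hash-table :test #'equal)
    "Maps function URI strings to Lisp functions.")

  (defun register-filter (uri fn)
    (setf (gethash uri *filter-functions*) fn))

  (defun call-filter (uri &rest args)
    (let ((fn (gethash uri *filter-functions*)))
      (unless fn
        (error "Unknown filter function: ~A" uri))
      (apply fn args)))

  ;; Example registration under an illustrative URI:
  (register-filter "http://example.org/fn#string-length"
                   (lambda (str) (length str)))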
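As for the named-graph trick above, the essence is just filtering quads by their source and pouring the survivors into a fresh store. A toy illustration over a plain quad structure -- not Wilbur's actual API:

  ;; Toy illustration (not Wilbur's API): a named graph is the set
  ;; of quads whose source field names that graph.
  (defstruct quad subject predicate object source)

  (defun named-graph-quads (quads graph-uri)
    "Collect the quads whose source matches GRAPH-URI, ready to be
  added to a fresh store."
    (remove-if-not (lambda (q) (equal (quad-source q) graph-uri))
                   quads))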
Blank nodes in the query have been implemented, but it's a while since I made sure they worked correctly. Certainly they complicate matters, but not unduly. Referencing bnodes in the results has already been the topic of discussion on various lists, and is still a sticking point. My particular opinion is that either a URL-based scheme (referencing binding labels for bnodes in the URL) or a filter function would be the best choice.

The definitions of string literals I found to be too complex, so I simplified a little. This is non-conformant, but I didn't want to get bogged down fiddling with regular expressions.

So, what are my last call recommendations? I'm not sure. Certainly parts of the grammar weren't designed to be easily parsed; it has a number of features that are clearly user-focused, such as optional dots. (If it were designed to be easily parsed, it would have been s-expressions from the start -- i.e., already a parse tree -- or would be stricter.) If I were evolving a query language, I expect I would have done some things a little differently.

I have seen objections made to UNION on the grounds that implementation is difficult. I'd like to nip that one in the bud: I wrote twinql in a few weeks, not full-time, inventing algorithms as I went, and UNION wasn't difficult.

Filter functions are complex, but necessarily so. I can't produce a recommendation to simplify them.

As I write, it occurs to me that some way of indicating which description method is being used (server -> client), or should be used (client -> server), for DESCRIBE would be desirable. I'd like my clients to know that they're getting CBDs, and clients might wish to ask for a certain kind of description.

This being last call, it is not the time to be discussing core decisions, such as features or support for inference, so I'll leave those and finish here. Any interrogation from WG members or other interested parties is welcome.

Regards,
-Richard Newman

[1] <http://www.holygoat.co.uk/projects/twinql/>
[2] <http://wilbur-rdf.sourceforge.net/>
[3] <http://www.holygoat.co.uk/blog/entry/2005-07-12-3>

Received on Wednesday, 10 August 2005 16:34:38 UTC