Re: twinql Retrospective from Dan Connolly on 2005-11-22 (public-rdf-dawg-comments@w3.org from November 2005)

From: Dan Connolly <connolly@w3.org>
Date: Mon, 21 Nov 2005 20:35:21 -0600
To: Richard Newman <r.newman@reading.ac.uk>
Cc: public-rdf-dawg-comments@w3.org
Message-Id: <1132626921.26034.136.camel@dirk>
On Wed, 2005-08-10 at 09:34 -0700, Richard Newman wrote:
> Hello all,
>    As SPARQL is in last call, I've been asked to provide some  
> feedback based on my experiences of implementing twinql[1].

Hi. I think we have now addressed all your comments.
Details below.

Please let us know whether this reply addresses your comments
to your satisfaction.

If you're in a particularly helpful mood, you can put [closed]
in the subject of your reply to indicate to some automted
tools that yes, you are satisfied.


> 
>    (I'm re-sending this, because it didn't seem to arrive.)
> 
>    I'm not sure what structure this should take, so I'll just ramble.  
> Bear with me.
> 
>    twinql is a SPARQL query implementation for the Wilbur RDF toolkit 
> [2], written in ANSI Common Lisp. As documented on my blog[3], I took  
> a typically Lispy two-stage DSL approach to implementing SPARQL,  
> writing an executor for an s-expression syntax, then a parser to  
> convert text into s-expressions. This was a very useful technique,  
> allowing me to implement pre-compilation and direct execution of s- 
> expressions independently of the parser, and leaving open the  
> possibility of later integrating a better parser.
> 
>    The s-expression syntax is deliberately roughly isomorphic to the  
> query language:
> 
> (sparql :select :distinct t :vars '(name mbox)
>                  :from "http://www.holygoat.co.uk/foaf.rdf"
>                  :where
>                  '(graph-pattern
>                     (triple x !foaf:knows !rich:RichardNewman)
>                     (triple x !foaf:name name)
>                     (triple x !foaf:mbox_sha1sum mbox)))
> 
> (The parser does the job of producing Wilbur nodes (indicated by ! 
> prefix:suffix) from the PREFIX and BASE declarations, so they do not  
> appear.)
> 
>    This s-expression can be directly executed in the Lisp toplevel --  
> if you have a Lisp system to hand, I invite you to ASDF-install  
> twinql, switch to the :twinql package, and paste it in. The executor  
> uses hash tables to represent bindings and touched lists in order to  
> do the resolution of graph patterns. Calculation of nested graph  
> patterns, unions, and optionals were almost zero effort, thanks to  
> the design of the executor -- I just provide different bindings or  
> touched lists to the execution function. Thus, on the topic of  
> unions, I would disagree that they are frustrating for implementers.  
> It took me literally minutes to get them working. (Whether the  
> implementation is correct is another matter!)
> 
>    The parser was a little less exciting to develop. I used CL-YACC,  
> a yacc-like parser generator for Common Lisp. This required:
> 
> • Translating the EBNF grammar into s-expression BNF productions
> • Writing a lexer
> • Wiring actions into the productions to output the s-expressions.
> 
>    Translating the grammar was no fun at all. Firstly, the EBNF had  
> to be manually translated into BNF, requiring the creation of a  
> number of intermediate productions. Secondly, the grammar is beyond  
> the reach of a simple parser generator, requiring lookahead -- not a  
> difficulty for the Java compiler tool being used, but too much for CL- 
> YACC. Thus, implementing the parser needed continual switching back  
> to the lexer, shifting the complicated "smarts" into the generation  
> of tokens. The lexer is therefore quite complex, applying various  
> regular expressions, producing symbols, resolving URIs, and building  
> literals. One unavoidable issue was the optional dot after the last  
> query element in each graph pattern. This made the grammar ambiguous,  
> so twinql currently makes them compulsory. I'm looking for a better  
> solution.

We have re-worked the grammar. It's now LL(1).

http://www.w3.org/2001/sw/DataAccess/rq23/#grammar

Regarding manually translating to BNF, we have a number of
machine-generated parsers built from our grammars. There's
a "@@Links to generated parsers" TODO item in the editor's
draft now, but some of those generated forms should be linked
for your convenience in an upcoming draft.


>    I'm also unhappy about my implementation of comment removal. URIs  
> and literals can contain hashes, so it's not enough to strip from any  
> hash to the end of the line. I first extract URIs and literals, to  
> avoid this, but this means that comments containing URI or literal  
> patterns could mess up your query. I'll have to fix this some time;  
> the lexer mostly does the job, but I could do better.
> 
>    Thanks to my existing implementation of CBDs and serialisation,  
> DESCRIBE and CONSTRUCT were straightforward. Similarly, serialising  
> binding hashes to the XML results format was easy.
> 
>    The features of SPARQL that I have not yet fully implemented are  
> (off the top of my head):
> 
> • Filter functions
> • Order, limits, etc.
> 
>    Regexes and other filter function calls have had effort expended  
> on them, but are not yet usable. These are obviously vital, and I  
> can't think of a better way to integrate them into the syntax.  
> Implementation-wise, I obviously need to map between function URIs  
> and functions, which should be straightforward.
> 
>    Sorting (especially with datatypes) is less easy, but is mostly  
> drudge work. Limit would require result caching -- computing the  
> whole result set, but serialising in parts -- and has not yet been  
> attempted.
> 
>    Named graphs were actually trivially easy to implement on top of  
> Wilbur's sourced quads. If any FROM NAMED terms exist in the query,  
> we retrieve every triple with the specified source from the FROM  
> source, and add them into a new store. (Now, this may not be in line  
> with the actual meaning of the source field -- which might not be  
> implementing named graphs -- but I wouldn't do it any other way.)  I  
> don't think I've implemented graph name variables, because I couldn't  
> see an elegant way to do it.
> 
>    bnodes in the query have been implemented, but it's a while since  
> I made sure they worked correctly. Certainly they complicate matters,  
> but not unduly. Referencing bnodes in the results has already been  
> the topic of discussion on various lists, and is still a sticking  
> point. My particular opinion is that either a URL-based scheme  
> (reference binding labels for bnodes in the URL), or a filter  
> function, would be the best choice.
> 
>    The definitions of string literals I found to be too complex, so I  
> simplified a little. This is non-conformant, but I didn't want to get  
> bogged down with fiddling with regular expressions.
> 
>    So, what are my last call recommendations? I'm not sure. Certainly  
> parts of the grammar weren't designed to be easily parsed; it has a  
> number of features that are clearly user-focused, such as optional  
> dots. (If it were designed to be easily parsed, it would be s- 
> expressions from the start -- i.e., already a parse tree -- or be  
> stricter.)  If I were evolving a query language, I expect I would  
> have done some things a little differently.
> 
>    I have seen objections made to UNION, on the grounds that  
> implementation is difficult. I'd like to nip that one in the bud; I  
> wrote twinql in a few weeks, not full-time, inventing algorithms as I  
> went, and UNION wasn't difficult.

Noted.


>    Filter functions are complex, but necessarily so. I can't produce  
> a recommendation to simplify them.
> 
>    As I write, it occurs to me that some way to say which method is  
> being (server -> client), or should be (client -> server), used for  
> DESCRIBE would be desirable -- I'd like my clients to know that  
> they're getting CBDs, or the clients might wish to ask for a certain  
> kind of description.

We considered that a bit; see
http://lists.w3.org/Archives/Public/public-rdf-dawg/2005JulSep/0517.html
and following...

This didn't lead to a critical mass of support for anything
other than confirming our decision to leave this
service description area in the realm of experiementation, perhaps
for later standardization.
  http://www.w3.org/2001/sw/DataAccess/issues#serviceDescription



>    This being last call, it is not the time to be discussing core  
> decisions, such as features, support for inference, etc., so I'll  
> leave those, and finish here.

Your message of 17 Aug
http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2005Aug/0065
seems to say that you're aware that it's not completely too late
to suggest changes to our requirements, and you're not requesting
any such changes just now.

>    Any interrogation from WG members or other interested parties is  
> welcome.
> 
>    Regards,
> -Richard Newman
> 
> [1] <http://www.holygoat.co.uk/projects/twinql/>
> [2] <http://wilbur-rdf.sourceforge.net/>
> [3] <http://www.holygoat.co.uk/blog/entry/2005-07-12-3>
-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/
D3C2 887B 0F92 6005 C541  0875 0F91 96DE 6E52 C29E
Received on Tuesday, 22 November 2005 02:35:28 UTC