- From: Eric Prud'hommeaux <eric@w3.org>
- Date: Fri, 8 Sep 2006 16:31:58 +0200
- To: Jiri Dokulil <dokulil@gmail.com>
- Cc: "Seaborne, Andy" <andy.seaborne@hp.com>, public-rdf-dawg-comments@w3.org
On Fri, Aug 18, 2006 at 09:15:04PM +0200, Jiri Dokulil wrote:
>
> On 8/18/06, Seaborne, Andy <andy.seaborne@hp.com> wrote:
> >
> >
> >Jiri Dokulil wrote:
> >>
> >> I am not sure how should scanner for SPARQL determine whether '<'
> >> character it encountered is beginning of an IRI or a comparison
> >> operator.
> >>
> >> Consider these queries:
> >>
> >> SELECT * WHERE { ?a ?b ?c, ?d . FILTER(?a<?b && ?c>?d) }
> >> SELECT * WHERE { ?a ?b ?c, ?d . FILTER(?a<?b&&?c>?d) }
> >>
> >> Yacker validator results look troubling to me:
> >>
> >http://www.w3.org/2005/01/yacker/uploads/SPARQL?markup=html&lang=perl&text=SELECT+*+WHERE+%7B+%3Fa+%3Fb+%3Fc%2C+%3Fd+.+FILTER%28%3Fa%3C%3Fb+%26%26+%3Fc%3E%3Fd%29+%7D&action=validate+text
> >>
> >>
> >http://www.w3.org/2005/01/yacker/uploads/SPARQL?markup=html&lang=perl&text=SELECT+*+WHERE+%7B+%3Fa+%3Fb+%3Fc%2C+%3Fd+.+FILTER%28%3Fa%3C%3Fb%26%26%3Fc%3E%3Fd%29+%7D%0D%0A&action=validate+text

Yeah, that "shift (Q_IRI_REF, <?b&&?c>)" at the bottom of the trace is
perfectly legal (by the grammar), but unfortunate. I argue below that,
given historical delimiters for URIs, this is still the best of all
possible worlds.

> >> The first query validates, the other does not.
> >> My guess is that the validator uses some flex-like scanner, that
> >> prefers the longest tokens. In the first case "<?b && ?c>" can't be
> >> parsed as IRI because of the spaces, so the scanner falls back and
> >> 'less than' rule is picked.
> >> On the other hand, "<?b&&?c>" is a valid (according to the grammar)
> >> IRI. But 'variable iri variable' is not a valid FILTER condition and
> >> the parser rejects the query.
> >>
> >> The problem is more obvious for scanners with one character
> >> look-ahead, because they are completely unable to distinguish these
> >> two cases.
> >> They also have the same problem with () and [] tokens (NIL and ANON
> >> terminals) but that can easily be solved by going from LL(1) to LL(2).
> >>
> >> Jiri Dokulil
> >
> >Because the characters < and > are overloaded for IRIs and for comparison
> >operators there is a potential ambiguity. The SPARQL grammar handles IRI
> >in two ways - the general grammar rule that is simple and covers any IRI
> >scheme, but then relies on further validating by an IRI parser.
>
> No objection about the simple rule. I don't expect the grammar to
> provide advanced checks.
>
> >
> >For the http: scheme, <?b> is a valid IRI, as is <?b&&1>. ? and & are
> >legal in an HTTP URL.
> >
> >For example:
> >
> >
> >BASE <http://example/page>
> >PREFIX : <http://example/ns#>
> >
> >ASK { <?b> :p <?b&&1> }
> >
> >
> >
> > BASE    <http://example/page>
> > PREFIX  :     <http://example/ns#>
> >
> > ASK
> > WHERE
> >   { <http://example/page?b>
> >               :p  <http://example/page?b&&1> .
> >   }
> >
> ><?b> is a relative URL relative to base <http://example/page>
> >That is <http://example/page?b>
> >
> >The rule "longest token wins" resolves the tokenizing problem (and is
> >common practice in lexers because it also means 123 is a single number,
> >not 3 individual one digit numbers) although it moves the problem to the
> >grammar.
> >
> >It could be disambiguated but it needs more than changes to the lexer. It
> >needs a context sensitive lexer (< and an IRI can't occur in the same
> >place in a valid expression, after ?a seeing < must be a comparison in a
> >legal expression). The WG has chosen to cover the wider range of parser
> >toolkits, rather than choose the more complicated context sensitive
> >approach.
>
> Again, no objection here. In fact, the '<' is an issue for me because
> the lexer I used is too weak to handle even this.

What lexer are you using? Can you use it twice, once to look for URIs?

> >
> >I'll look at adding an editorial note that highlights this better.
> >It does already say:
> >
> >http://www.w3.org/TR/rdf-sparql-query/#whitespace
> >"""
> >White space (production WS) is used to separate two terminals which would
> >otherwise be (mis-)recognized as one terminal.
> >"""
> >which already covers this case.
> >
> >I hope that this message addresses your comment. If it does, please let
> >us know - if you put [CLOSED] in the subject line, it will help scripts
> >that help manage this list.
>
> Thanks for the explanation. It certainly clarified the way SPARQL
> queries should be parsed.
> Still, I'm not happy with this solution because it makes the
> -otherwise simple- language complicated and somewhat tricky. Using an
> operator as a string delimiter seems highly unusual to me.

RFC3986 Appendix C. Delimiting a URI in Context [3986C] recommends <>
as delimiters. (The text comes from RFC2396 and dates back before
August 1998.) Other RDF languages (ntriples, turtle, n3) use <>s in
this way. It is very familiar to the RDF community. There are also
one or two languages that use an infix '<' operator to indicate
less-than. I think this is as good as we can do.

[3986C] http://ietfreport.isoc.org/idref/rfc3986/#page-51

> Unfortunately it is obviously way too late to do anything about this,
> so I'll have to cope with the problem (I'm creating a SPARQL
> implementation to experiment with).
> Thanks again for the explanation.
>
> Jiri Dokulil

-- 
-eric

home-office: +1.617.395.1213 (usually 900-2300 CET)
             +33.1.45.35.62.14
cell:        +33.6.73.84.87.26

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
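
The "longest token wins" behaviour discussed in this thread, and the context-sensitive alternative Andy mentions, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Yacker validator's scanner: the terminal set is a small subset chosen for the two FILTER expressions above, the IRI_REF pattern is modeled on the shape of the grammar's rule rather than copied from it, and the names `PATTERNS`, `OPERANDS`, `tokenize` and the `context_sensitive` flag are hypothetical.

```python
import re

# Assumed subset of terminals, enough for the FILTER expressions in this thread.
# IRI_REF is modeled on the grammar's shape: '<', then any run of characters
# other than <>"{}|^`\ and #x00-#x20 (so no spaces), then '>'.
PATTERNS = [
    ("IRI_REF", re.compile(r'<[^<>"{}|^`\\\x00-\x20]*>')),
    ("VAR", re.compile(r'[?$][A-Za-z_][A-Za-z0-9_]*')),
    ("AND", re.compile(r'&&')),
    ("LT", re.compile(r'<')),
    ("GT", re.compile(r'>')),
    ("LPAREN", re.compile(r'\(')),
    ("RPAREN", re.compile(r'\)')),
]

# Token kinds that end an operand. Inside an expression, a '<' right after
# one of these can only be a comparison (assumption: the input is a FILTER
# expression, not a triple pattern, where "?a <iri> ?c" is legal).
OPERANDS = {"VAR", "IRI_REF", "RPAREN"}


def tokenize(expr, context_sensitive=False):
    """Split expr into (kind, text) pairs, preferring the longest match."""
    tokens, pos = [], 0
    while pos < len(expr):
        if expr[pos].isspace():
            pos += 1  # white space only separates terminals
            continue
        after_operand = bool(tokens) and tokens[-1][0] in OPERANDS
        best_name, best_text = None, ""
        for name, rx in PATTERNS:
            if context_sensitive and after_operand and name == "IRI_REF":
                continue  # an IRI cannot follow an operand in an expression
            m = rx.match(expr, pos)
            if m and len(m.group()) > len(best_text):
                best_name, best_text = name, m.group()
        if best_name is None:
            raise ValueError(f"no token matches at offset {pos}: {expr[pos:]!r}")
        tokens.append((best_name, best_text))
        pos += len(best_text)
    return tokens


if __name__ == "__main__":
    for text in ("?a<?b && ?c>?d", "?a<?b&&?c>?d"):
        print(text, "->", [kind for kind, _ in tokenize(text)])
    print("context-sensitive:",
          [kind for kind, _ in tokenize("?a<?b&&?c>?d", context_sensitive=True)])
```

With plain longest-match, the spaced expression tokenizes as VAR LT VAR AND VAR GT VAR, while the unspaced one collapses to VAR IRI_REF VAR, which is the 'variable iri variable' sequence the parser rejects. With the hypothetical `context_sensitive` flag set, the unspaced expression yields the same seven tokens as the spaced one, at the cost of the lexer having to know it is inside an expression.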
Received on Friday, 8 September 2006 14:31:00 UTC