- From: Eric Prud'hommeaux <eric@w3.org>
- Date: Fri, 8 Sep 2006 16:31:58 +0200
- To: Jiri Dokulil <dokulil@gmail.com>
- Cc: "Seaborne, Andy" <andy.seaborne@hp.com>, public-rdf-dawg-comments@w3.org
On Fri, Aug 18, 2006 at 09:15:04PM +0200, Jiri Dokulil wrote:
>
> On 8/18/06, Seaborne, Andy <andy.seaborne@hp.com> wrote:
> >
> >
> >Jiri Dokulil wrote:
> >>
> >> I am not sure how a scanner for SPARQL should determine whether a '<'
> >> character it encounters is the beginning of an IRI or a comparison
> >> operator.
> >>
> >> Consider these queries:
> >>
> >> SELECT * WHERE { ?a ?b ?c, ?d . FILTER(?a<?b && ?c>?d) }
> >> SELECT * WHERE { ?a ?b ?c, ?d . FILTER(?a<?b&&?c>?d) }
> >>
> >> Yacker validator results look troubling to me:
> >>
> >http://www.w3.org/2005/01/yacker/uploads/SPARQL?markup=html&lang=perl&text=SELECT+*+WHERE+%7B+%3Fa+%3Fb+%3Fc%2C+%3Fd+.+FILTER%28%3Fa%3C%3Fb+%26%26+%3Fc%3E%3Fd%29+%7D&action=validate+text
> >>
> >>
> >http://www.w3.org/2005/01/yacker/uploads/SPARQL?markup=html&lang=perl&text=SELECT+*+WHERE+%7B+%3Fa+%3Fb+%3Fc%2C+%3Fd+.+FILTER%28%3Fa%3C%3Fb%26%26%3Fc%3E%3Fd%29+%7D%0D%0A&action=validate+text
Yeah, that "shift (Q_IRI_REF, <?b&&?c>)" at the bottom of the trace is
perfectly legal (by the grammar), but unfortunate. I argue below that,
given historical delimiters for URIs, this is still the best of all
possible worlds.
> >> The first query validates, the other does not.
> >> My guess is that the validator uses some flex-like scanner that
> >> prefers the longest token. In the first case "<?b && ?c>" can't be
> >> parsed as an IRI because of the spaces, so the scanner falls back and
> >> the 'less than' rule is picked.
> >> On the other hand, "<?b&&?c>" is a valid (according to the grammar)
> >> IRI. But 'variable iri variable' is not a valid FILTER condition and
> >> the parser rejects the query.
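To make the longest-match behaviour described here concrete, the following is a minimal Python sketch of a flex-style scanner. It is only an illustration under stated assumptions: the rule set is a small subset chosen for these two FILTERs, the Q_IRI_REF pattern approximates the grammar's general IRI rule, and the names TOKEN_RULES and tokenize are hypothetical, not taken from Yacker or any SPARQL implementation.

```python
import re

# Simplified token rules (assumption: a tiny subset of the SPARQL terminals).
# Q_IRI_REF approximates the general grammar rule: '<', then any characters
# other than <>"{}|^`\ and #x00-#x20, then '>'.
TOKEN_RULES = [
    ("Q_IRI_REF", r'<[^<>"{}|^`\\\x00-\x20]*>'),
    ("VAR", r"\?[A-Za-z_][A-Za-z0-9_]*"),
    ("AND", r"&&"),
    ("LT", r"<"),
    ("GT", r">"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in TOKEN_RULES]


def tokenize(text):
    """Scan text, always taking the longest token that matches at each position."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # white space only separates tokens
            continue
        best = None
        for name, regex in COMPILED:
            match = regex.match(text, pos)
            # Keep the candidate that consumes the most input.
            if match and (best is None or match.end() > best[1].end()):
                best = (name, match)
        if best is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        tokens.append((best[0], best[1].group()))
        pos = best[1].end()
    return tokens


if __name__ == "__main__":
    print(tokenize("?a<?b && ?c>?d"))
    print(tokenize("?a<?b&&?c>?d"))
```

With these rules, "?a<?b && ?c>?d" scans as VAR LT VAR AND VAR GT VAR, while "?a<?b&&?c>?d" scans as VAR Q_IRI_REF VAR, which is the 'variable iri variable' sequence the parser rejects.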
> >>
> >> The problem is more obvious for scanners with one character
> >> look-ahead, because they are completely unable to distinguish these
> >> two cases.
> >> They also have the same problem with () and [] tokens (NIL and ANON
> >> terminals) but that can easily be solved by going from LL(1) to LL(2).
> >>
> >> Jiri Dokulil
> >
> >Because the characters < and > are overloaded for IRIs and for comparison
> >operators there is a potential ambiguity. The SPARQL grammar handles IRIs
> >in two ways - the general grammar rule that is simple and covers any IRI
> >scheme, but then relies on further validation by an IRI parser.
>
> No objection about the simple rule. I don't expect the grammar to
> provide advanced checks.
>
> >
> >For the http: scheme, <?b> is a valid IRI, as is <?b&&1>. ? and & are legal
> >in an HTTP URL.
> >
> >For example:
> >
> >
> >BASE <http://example/page>
> >PREFIX : <http://example/ns#>
> >
> >ASK { <?b> :p <?b&&1> }
> >
> >
> >
> > BASE <http://example/page>
> > PREFIX : <http://example/ns#>
> >
> > ASK
> > WHERE
> >   { <http://example/page?b>
> >       :p <http://example/page?b&&1> .
> >   }
> >
> ><?b> is a relative URL, resolved against the base <http://example/page>.
> >That is, <http://example/page?b>
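The resolution step in this example can be checked with a short Python sketch. It uses the standard library's urljoin as a stand-in for an IRI resolver, which is an assumption: urljoin works on URLs rather than full IRIs, and the helper name resolve is hypothetical.

```python
from urllib.parse import urljoin

# Base taken from the BASE declaration in the example query.
BASE = "http://example/page"


def resolve(reference, base=BASE):
    """Resolve a relative reference (the text between < and >) against the base."""
    return urljoin(base, reference)


if __name__ == "__main__":
    print(resolve("?b"))     # http://example/page?b
    print(resolve("?b&&1"))  # http://example/page?b&&1
```

A query-only reference such as ?b keeps the base path and replaces only the query part, which is why both terms in the ASK pattern expand to IRIs under http://example/page.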
> >
> >The rule "longest token wins" resolves the tokenizing problem (and is common
> >practice in lexers because it also means 123 is a single number, not three
> >individual one-digit numbers), although it moves the problem to the grammar.
> >
> >It could be disambiguated but it needs more than changes to the lexer. It
> >needs a context-sensitive lexer (< and an IRI can't occur in the same
> >place in a valid expression; after ?a, seeing < must be a comparison in a
> >legal expression). The WG has chosen to cover the wider range of parser
> >toolkits, rather than choose the more complicated context-sensitive
> >approach.
>
> Again, no objection here. In fact, the '<' is an issue for me because
> the lexer I used is too weak to handle even this.
What lexer are you using? Can you use it twice, once to look for URIs?
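For comparison, here is a rough Python sketch of the context-sensitive idea Andy describes above: only try the IRI rule where an IRI could legally start, that is, not directly after an operand. This is not what the grammar specifies and not a full SPARQL lexer; the token subset, the OPERANDS set and the name tokenize_with_context are all assumptions made for illustration.

```python
import re

# Same approximation of the general IRI rule as in the grammar:
# '<', then no <>"{}|^`\ or #x00-#x20 characters, then '>'.
IRI_REF = re.compile(r'<[^<>"{}|^`\\\x00-\x20]*>')
# Remaining tokens (assumption: a tiny subset, enough for simple FILTERs).
OTHER = re.compile(
    r"(?P<VAR>\?[A-Za-z_][A-Za-z0-9_]*)|(?P<AND>&&)|(?P<LT><)|(?P<GT>>)"
)
# After one of these, '<' can only be a comparison operator.
OPERANDS = {"VAR", "Q_IRI_REF"}


def tokenize_with_context(text):
    """Scan text, trying the IRI rule only where an operand may start."""
    tokens = []
    pos = 0
    previous = None
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        match = None
        name = None
        if previous not in OPERANDS:
            match = IRI_REF.match(text, pos)
            name = "Q_IRI_REF"
        if match is None:
            match = OTHER.match(text, pos)
            if match is None:
                raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
            name = match.lastgroup  # name of the alternative that matched
        tokens.append((name, match.group()))
        previous = name
        pos = match.end()
    return tokens


if __name__ == "__main__":
    print(tokenize_with_context("?a<?b&&?c>?d"))
    print(tokenize_with_context("?a < <http://example/x>"))
```

Under these assumptions the unspaced FILTER scans as VAR LT VAR AND VAR GT VAR, and an IRI is still recognized after an operator. The cost is that the lexer has to carry parser state, which is the complication the WG chose not to require.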
> >
> >I'll look at adding an editorial note that highlights this better. It does
> >already say:
> >
> >http://www.w3.org/TR/rdf-sparql-query/#whitespace
> >"""
> >White space (production WS) is used to separate two terminals which would
> >otherwise be (mis-)recognized as one terminal.
> >"""
> >which already covers this case.
> >
> >I hope that this message addresses your comment. If it does, please let us
> >know - if you put [CLOSED] in the subject line, it will help scripts that
> >help manage this list.
>
> Thanks for the explanation. It certainly clarified the way SPARQL
> queries should be parsed.
> Still, I'm not happy with this solution because it makes the
> -otherwise simple- language complicated and somewhat tricky. Using an
> operator as a string delimiter seems highly unusual to me.
RFC3986 Appendix C. Delimiting a URI in Context [3986C] recommends <>
as delimiters. (The text comes from RFC2396 and dates back before
August 1998.) Other RDF languages (ntriples, turtle, n3) use <>s in
this way. It is very familiar to the RDF community.
There are also one or two languages that use an infix '<' operator to
indicate less-than. I think this is as good as we can do.
[3986C] http://ietfreport.isoc.org/idref/rfc3986/#page-51
> Unfortunately it is obviously way too late to do anything about this,
> so I'll have to cope with the problem (I'm creating a SPARQL
> implementation to experiment with).
> Thanks again for the explanation.
>
> Jiri Dokulil
--
-eric
home-office: +1.617.395.1213 (usually 900-2300 CET)
+33.1.45.35.62.14
cell: +33.6.73.84.87.26
(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
Received on Friday, 8 September 2006 14:31:00 UTC