Re: Determining whether '<' is a beginning of IRI or 'less than' operator [CLOSED]

On 8/18/06, Seaborne, Andy <andy.seaborne@hp.com> wrote:
>
>
> Jiri Dokulil wrote:
> >
> > I am not sure how a scanner for SPARQL should determine whether a '<'
> > character it encounters is the beginning of an IRI or a comparison
> > operator.
> >
> > Consider these queries:
> >
> > SELECT * WHERE { ?a ?b ?c, ?d . FILTER(?a<?b && ?c>?d) }
> > SELECT * WHERE { ?a ?b ?c, ?d . FILTER(?a<?b&&?c>?d) }
> >
> > Yacker validator results look troubling to me:
> > http://www.w3.org/2005/01/yacker/uploads/SPARQL?markup=html&lang=perl&text=SELECT+*+WHERE+%7B+%3Fa+%3Fb+%3Fc%2C+%3Fd+.+FILTER%28%3Fa%3C%3Fb+%26%26+%3Fc%3E%3Fd%29+%7D&action=validate+text
> >
> > http://www.w3.org/2005/01/yacker/uploads/SPARQL?markup=html&lang=perl&text=SELECT+*+WHERE+%7B+%3Fa+%3Fb+%3Fc%2C+%3Fd+.+FILTER%28%3Fa%3C%3Fb%26%26%3Fc%3E%3Fd%29+%7D%0D%0A&action=validate+text
> >
> >
> > The first query validates, the other does not.
> > My guess is that the validator uses some flex-like scanner that
> > prefers the longest token. In the first case "<?b && ?c>" can't be
> > scanned as an IRI because of the spaces, so the scanner falls back and
> > the 'less than' rule is picked.
> > On the other hand, "<?b&&?c>" is a valid (according to the grammar)
> > IRI. But 'variable iri variable' is not a valid FILTER condition, so
> > the parser rejects the query.
> >
> > The problem is more obvious for scanners with one character
> > look-ahead, because they are completely unable to distinguish these
> > two cases.
> > They also have the same problem with () and [] tokens (NIL and ANON
> > terminals) but that can easily be solved by going from LL(1) to LL(2).
> >
> > Jiri Dokulil
>
> Because the characters < and > are overloaded for IRIs and for comparison
> operators, there is a potential ambiguity.  The SPARQL grammar handles IRIs in
> two stages: a general grammar rule that is simple and covers any IRI scheme,
> followed by further validation by an IRI parser.

No objection about the simple rule. I don't expect the grammar to
provide advanced checks.
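
To make the effect of the simple rule concrete, here is a minimal Python
sketch. The regular expression is my own approximation of the IRI terminal
(any characters between '<' and '>' except a handful of delimiters and
anything at or below U+0020); the names IRI_REF and is_iri_token are
hypothetical and not taken from the spec.

```python
import re

# Approximation of the IRI terminal (an assumption, not copied from the spec):
# '<', then any run of characters other than < > " { } | ^ ` \ and the
# control/space range U+0000-U+0020, then '>'.
IRI_REF = re.compile(r'<[^<>"{}|^`\\\x00-\x20]*>')


def is_iri_token(text):
    """Return True if the whole string matches the simplified IRI rule."""
    return IRI_REF.fullmatch(text) is not None


print(is_iri_token("<?b&&?c>"))    # no spaces inside, so the rule accepts it
print(is_iri_token("<?b && ?c>"))  # spaces are excluded, so the rule rejects it
```

Under this approximation the only thing separating the two FILTER examples
is the white space inside the angle brackets.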

>
> For the http: scheme, <?b> is a valid IRI, as is <?b&&1>. ? and & are legal in
> an HTTP URL.
>
> For example:
>
>
> BASE     <http://example/page>
> PREFIX : <http://example/ns#>
>
> ASK { <?b> :p <?b&&1> }
>
>
>
> BASE    <http://example/page>
> PREFIX  : <http://example/ns#>
>
> ASK
> WHERE
>   { <http://example/page?b>
>               :p  <http://example/page?b&&1> .
>   }
>
> <?b> is a relative URL, resolved against the base <http://example/page>;
> that is, <http://example/page?b>.
>
> The rule "longest token wins" resolves the tokenizing problem (and is common
> practice in lexers because it also means 123 is a single number, not three
> individual one-digit numbers), although it moves the problem to the grammar.
>
> It could be disambiguated, but that needs more than changes to the lexer.  It
> needs a context-sensitive lexer (< and an IRI can't occur in the same place in
> a valid expression; after ?a, a < must be a comparison in a legal
> expression).  The WG has chosen to cover the wider range of parser toolkits
> rather than choose the more complicated context-sensitive approach.

Again, no objection here. In fact, the '<' is an issue for me because
the lexer I used is too weak to handle even this.
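
For illustration, the difference between the two approaches can be sketched
in Python as follows. Everything here is an assumption made for the example:
the token rules are heavily simplified stand-ins for the real SPARQL
terminals, and the context_sensitive flag is only a crude approximation of a
context-sensitive lexer (it refuses to start an IRI directly after an
operand), not how any particular implementation works.

```python
import re

# Hypothetical, heavily simplified token rules; not the real SPARQL terminals.
TOKEN_RULES = [
    ("IRI", re.compile(r'<[^<>"{}|^`\\\x00-\x20]*>')),
    ("VAR", re.compile(r'\?[A-Za-z_][A-Za-z0-9_]*')),
    ("AND", re.compile(r'&&')),
    ("LT", re.compile(r'<')),
    ("GT", re.compile(r'>')),
    ("WS", re.compile(r'\s+')),
]


def tokenize(text, context_sensitive=False):
    """Split text into (name, lexeme) pairs using the longest-match rule."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in TOKEN_RULES:
            # Crude context check: an IRI cannot directly follow an operand,
            # so in that position '<' has to be the comparison operator.
            follows_operand = bool(tokens) and tokens[-1][0] in ("VAR", "IRI")
            if context_sensitive and name == "IRI" and follows_operand:
                continue
            match = pattern.match(text, pos)
            if match and match.end() > best_end:
                best_name, best_end = name, match.end()  # longest match wins
        if best_name is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if best_name != "WS":  # white space only separates tokens
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


def names(tokens):
    """Return just the token names, which is what the parser would see."""
    return [name for name, _ in tokens]


print(names(tokenize("?a<?b && ?c>?d")))
# ['VAR', 'LT', 'VAR', 'AND', 'VAR', 'GT', 'VAR']
print(names(tokenize("?a<?b&&?c>?d")))
# ['VAR', 'IRI', 'VAR']
print(names(tokenize("?a<?b&&?c>?d", context_sensitive=True)))
# ['VAR', 'LT', 'VAR', 'AND', 'VAR', 'GT', 'VAR']
```

With plain longest match, the query without spaces yields 'variable iri
variable', which the parser then rejects. With the context check the same
text yields the comparison tokens, but only because the lexer is told what
the previous token was.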

>
> I'll look at adding an editorial note that highlights this better. It does
> already say:
>
> http://www.w3.org/TR/rdf-sparql-query/#whitespace
> """
> White space (production WS) is used to separate two terminals which would
> otherwise be (mis-)recognized as one terminal.
> """
> which already covers this case.
>
> I hope that this message addresses your comment. If it does, please let us
> know - if you put [CLOSED] in the subject line, it will help scripts that help
> manage this list.

Thanks for the explanation. It certainly clarified the way SPARQL
queries should be parsed.
Still, I'm not happy with this solution because it makes the
otherwise simple language complicated and somewhat tricky. Using an
operator as a string delimiter seems highly unusual to me.
Unfortunately it is obviously way too late to do anything about this,
so I'll have to cope with the problem (I'm creating a SPARQL
implementation to experiment with).
Thanks again for the explanation.

Jiri Dokulil

Received on Friday, 18 August 2006 19:15:21 UTC