Re: Determining whether '<' is a beginning of IRI or 'less than' operator [CLOSED] from Eric Prud'hommeaux on 2006-09-08 (public-rdf-dawg-comments@w3.org from September 2006)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Fri, 8 Sep 2006 16:31:58 +0200
To: Jiri Dokulil <dokulil@gmail.com>
Cc: "Seaborne, Andy" <andy.seaborne@hp.com>, public-rdf-dawg-comments@w3.org
Message-ID: <20060908143158.GB6888@w3.org>

On Fri, Aug 18, 2006 at 09:15:04PM +0200, Jiri Dokulil wrote:
> 
> On 8/18/06, Seaborne, Andy <andy.seaborne@hp.com> wrote:
> >
> >
> >Jiri Dokulil wrote:
> >>
> >> I am not sure how a scanner for SPARQL should determine whether a '<'
> >> character it encounters is the beginning of an IRI or a comparison
> >> operator.
> >>
> >> Consider these queries:
> >>
> >> SELECT * WHERE { ?a ?b ?c, ?d . FILTER(?a<?b && ?c>?d) }
> >> SELECT * WHERE { ?a ?b ?c, ?d . FILTER(?a<?b&&?c>?d) }
> >>
> >> Yacker validator results look troubling to me:
> >> http://www.w3.org/2005/01/yacker/uploads/SPARQL?markup=html&lang=perl&text=SELECT+*+WHERE+%7B+%3Fa+%3Fb+%3Fc%2C+%3Fd+.+FILTER%28%3Fa%3C%3Fb+%26%26+%3Fc%3E%3Fd%29+%7D&action=validate+text
> >>
> >> http://www.w3.org/2005/01/yacker/uploads/SPARQL?markup=html&lang=perl&text=SELECT+*+WHERE+%7B+%3Fa+%3Fb+%3Fc%2C+%3Fd+.+FILTER%28%3Fa%3C%3Fb%26%26%3Fc%3E%3Fd%29+%7D%0D%0A&action=validate+text

Yeah, that "shift (Q_IRI_REF, <?b&&?c>)" at the bottom of the trace is
perfectly legal (by the grammar), but unfortunate. I argue below that,
given historical delimiters for URIs, this is still the best of all
possible worlds.
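
To make the longest-token behaviour concrete, here is a minimal sketch in
Python of a scanner that always takes the longest match. The token names and
patterns are simplified assumptions for illustration (IRI_REF only
approximates the grammar's rule); this is not yacker's actual lexer.

```python
import re

# Simplified, hypothetical token patterns (not the full SPARQL grammar).
# IRI_REF approximates the grammar's rule: '<', then any characters other
# than space, control characters and <>"{}|^`\ , then '>'.
TOKEN_PATTERNS = [
    ("IRI_REF", r'<[^<>"{}|^`\\\x00-\x20]*>'),
    ("VAR", r"\?[A-Za-z_][A-Za-z0-9_]*"),
    ("AND", r"&&"),
    ("LT", r"<"),
    ("GT", r">"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in TOKEN_PATTERNS]


def tokenize(text):
    """Longest-match scanner; on a tie the earlier pattern wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best = None
        for name, regex in COMPILED:
            m = regex.match(text, pos)
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append(best)
        pos += len(best[1])
    return tokens


for expr in ("?a<?b && ?c>?d", "?a<?b&&?c>?d"):
    print(expr, "->", [name for name, _ in tokenize(expr)])
```

With these rules the first expression scans as VAR LT VAR AND VAR GT VAR,
because the space stops the IRI_REF pattern from matching. The second scans
as VAR IRI_REF VAR, since "<?b&&?c>" is longer than "<" alone, and that is
the sequence the parser then rejects.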

> >> The first query validates, the other does not.
> >> My guess is that the validator uses some flex-like scanner, that
> >> prefers the longest tokens. In the first case "<?b && ?c>" can't be
> >> parsed as IRI because of the spaces, so the scanner falls back and
> >> 'less than' rule is picked.
> >> On the other hand, "<?b&&?c>" is a valid (according to the grammar)
> >> IRI. But 'variable iri variable' is not a valid FILTER condition and
> >> the parser rejects the query.
> >>
> >> The problem is more obvious for scanners with one character
> >> look-ahead, because they are completely unable to distinguish these
> >> two cases.
> >> They also have the same problem with () and [] tokens (NIL and ANON
> >> terminals) but that can easily be solved by going from LL(1) to LL(2).
> >>
> >> Jiri Dokulil
> >
> >Because the characters < and > are overloaded for IRIs and for comparison
> >operators there is a potential ambiguity.  The SPARQL grammar handles IRIs
> >in two ways - the general grammar rule that is simple and covers any IRI
> >scheme, but then relies on further validation by an IRI parser.
> 
> No objection about the simple rule. I don't expect the grammar to
> provide advanced checks.
> 
> >
> >For the http: scheme, <?b> is a valid IRI, as is <?b&&1>. ? and & are
> >legal in an HTTP URL.
> >
> >For example:
> >
> >
> >BASE     <http://example/page>
> >PREFIX : <http://example/ns#>
> >
> >ASK { <?b> :p <?b&&1> }
> >
> >
> >
> >BASE    <http://example/page>
> >PREFIX  : <http://example/ns#>
> >
> >ASK
> >WHERE
> >  { <http://example/page?b>
> >              :p  <http://example/page?b&&1> .
> >  }
> >
> ><?b> is a relative URL, resolved against the base <http://example/page>.
> >That is, <http://example/page?b>
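
The resolution step quoted above can be checked with a short Python sketch.
It uses the standard library's urljoin, which works on URIs rather than full
IRIs, so treat it as an approximation of what a SPARQL processor does; the
helper name resolve is made up for this example.

```python
from urllib.parse import urljoin

# Base IRI taken from the BASE declaration in the quoted query.
BASE = "http://example/page"


def resolve(relative_iri):
    """Resolve a relative reference against BASE (hypothetical helper)."""
    return urljoin(BASE, relative_iri)


for ref in ("?b", "?b&&1"):
    print(f"<{ref}> -> <{resolve(ref)}>")
```

A query-only reference such as ?b keeps the base path and replaces only the
query part, which is why both terms in the ASK query expand to IRIs under
http://example/page.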
> >
> >The rule "longest token wins" resolves the tokenizing problem (and is
> >common practice in lexers because it also means 123 is a single number,
> >not 3 individual one-digit numbers) although it moves the problem to the
> >grammar.
> >
> >It could be disambiguated but it needs more than changes to the lexer.  It
> >needs a context-sensitive lexer (< and an IRI can't occur in the same
> >place in a valid expression; after ?a, seeing < must be a comparison in a
> >legal expression).  The WG has chosen to cover the wider range of parser
> >toolkits, rather than choose the more complicated context-sensitive
> >approach.
> 
> Again, no objection here. In fact, the '<' is an issue for me because
> the lexer I used is too weak to handle even this.

What lexer are you using? Can you use it twice, once to look for URIs?
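
If the toolkit allows hand-written scanning, the context-sensitive idea
quoted above (after ?a, a '<' must be a comparison) could be sketched roughly
as follows in Python. This is only an illustration under stated assumptions:
it handles FILTER-style expressions only, the token kinds VAR, IRI_REF and OP
are made up for the example, and the operator set is deliberately tiny.

```python
import re

# Approximation of the grammar's IRI rule, as an assumption for this sketch.
IRI_REF = re.compile(r'<[^<>"{}|^`\\\x00-\x20]*>')
VAR = re.compile(r"\?[A-Za-z_][A-Za-z0-9_]*")
OPERATORS = ("&&", "<", ">", "(", ")")


def tokenize_expression(text):
    """Scan an expression, using the previous token to classify '<'."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        # An operand has just ended if the previous token is a variable,
        # an IRI, or a closing parenthesis; '<' is then a comparison.
        after_operand = bool(tokens) and (
            tokens[-1][0] in ("VAR", "IRI_REF") or tokens[-1] == ("OP", ")")
        )
        if text[pos] == "<" and not after_operand:
            m = IRI_REF.match(text, pos)
            if m:
                tokens.append(("IRI_REF", m.group()))
                pos = m.end()
                continue
        m = VAR.match(text, pos)
        if m:
            tokens.append(("VAR", m.group()))
            pos = m.end()
            continue
        for op in OPERATORS:
            if text.startswith(op, pos):
                tokens.append(("OP", op))
                pos += len(op)
                break
        else:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
    return tokens


print(tokenize_expression("?a<?b&&?c>?d"))
print(tokenize_expression("?a < <http://example/x>"))
```

The rule is only sound inside expressions. In a triple pattern one IRI may
directly follow another, so a real lexer would also need to know whether it
is currently inside a FILTER, which is part of why this approach is harder
to fit into simple lexer generators.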

> >
> >I'll look at adding an editorial note that highlights this better. It does
> >already say:
> >
> >http://www.w3.org/TR/rdf-sparql-query/#whitespace
> >"""
> >White space (production WS) is used to separate two terminals which would
> >otherwise be (mis-)recognized as one terminal.
> >"""
> >which already covers this case.
> >
> >I hope that this message addresses your comment. If it does, please let
> >us know - if you put [CLOSED] in the subject line, it will help scripts
> >that help manage this list.
> 
> Thanks for the explanation. It certainly clarified the way SPARQL
> queries should be parsed.
> Still, I'm not happy with this solution because it makes the
> -otherwise simple- language complicated and somewhat tricky. Using an
> operator as a string delimiter seems highly unusual to me.

RFC3986 Appendix C.  Delimiting a URI in Context [3986C] recommends <>
as delimiters. (The text comes from RFC2396 and dates back before
August 1998.) Other RDF languages (ntriples, turtle, n3) use <>s in
this way. It is very familiar to the RDF community.

There are also one or two languages that use an infix '<' operator to
indicate less-than. I think this is as good as we can do.

[3986C] http://ietfreport.isoc.org/idref/rfc3986/#page-51

> Unfortunately it is obviously way too late to do anything about this,
> so I'll have to cope with the problem (I'm creating a SPARQL
> implementation to experiment with).
> Thanks again for the explanation.
> 
> Jiri Dokulil

-- 
-eric

home-office: +1.617.395.1213 (usually 900-2300 CET)
	    +33.1.45.35.62.14
cell:       +33.6.73.84.87.26

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
Received on Friday, 8 September 2006 14:31:00 UTC