Re: Restpark - Minimal RESTful API for querying RDF triples

Hi Kingsley,

I fear we're not getting anywhere, but I want to try to wrap up my side 
of the argument with this mail.

On 16/04/2013 22:38, Kingsley Idehen wrote:
> On 4/16/13 5:22 PM, Aidan Hogan wrote:
<snip>
>> Anyways, as per my previous reply ...
>>
>> With respect to this SPARQL query service:
>>
>>     http://lod.openlinksw.com/sparql
>>
>> I would like a response compliant with the SPARQL standard for either
>> of the following two SPARQL queries:
>>
>>     SELECT * WHERE
>>     {?s foaf:knows ?o}
>>
>> or
>>
>>     SELECT * WHERE
>>     {?s foaf:knows ?o . ?o foaf:knows ?o2 .}
>>
>>
>>
> Did you perform a count on either? If so, why no LIMIT in the query ? If
> you want no LIMIT into what bucket are you placing the result? Would you
> dare send the following to a decently sized RDBMS and use it as the
> basis for assessing scale:

I am very much impressed that Virtuoso can compute the COUNT on the 
second query, but that was not the query I asked for.

As for not putting a LIMIT on the query, I don't see where the SPARQL 
standard says that LIMIT is mandatory. Nor do I have to put them in a 
"bucket". Maybe I want to download the results to a CSV file and scan 
through them in the scripting language of my choice, maybe building a 
map of foaf:knows relations across different sites?
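For what it's worth, here is a minimal sketch in Python of the workflow I 
mean, assuming the endpoint accepts a query parameter via GET and honours 
an Accept: text/csv header as the SPARQL 1.1 Protocol describes (the 
function names are mine, and of course the endpoint's 100,000-row cap 
will truncate the output, which is exactly the problem):

```python
# Sketch (standard library only): fetch foaf:knows pairs as CSV from a
# SPARQL endpoint, then scan them in a script, e.g. tallying edges per
# site. Assumes SPARQL 1.1 Protocol GET + "Accept: text/csv".
import csv
import io
import urllib.parse
import urllib.request

ENDPOINT = "http://lod.openlinksw.com/sparql"

QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT * WHERE { ?s foaf:knows ?o }
"""

def build_request(endpoint, query):
    """Build a GET request asking the endpoint for CSV results."""
    params = urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        endpoint + "?" + params,
        headers={"Accept": "text/csv"},
    )

def count_sites(csv_text):
    """Tally foaf:knows edges by the host of the subject IRI."""
    counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        host = urllib.parse.urlparse(row["s"]).netloc
        counts[host] = counts.get(host, 0) + 1
    return counts

# To actually run the download (needs network access):
#   with urllib.request.urlopen(build_request(ENDPOINT, QUERY)) as resp:
#       print(count_sites(resp.read().decode("utf-8")))
```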

My query is a valid SPARQL query but I'm not getting a valid SPARQL 
response.

But okay, for argument's sake, let's say I want to load them into an 
Excel spreadsheet (a bucket), which has a maximum of 1,048,576 rows.

SELECT * WHERE
{?s foaf:knows ?o}
LIMIT 1048576

Still won't work. I'll only get 100,000 results.

And I completely understand why it doesn't work, and I completely 
understand why Virtuoso and other SPARQL admins commonly enforce such 
limits: because we can both agree that it is not practical to answer 
even such simple SPARQL queries at large scale for many users without 
cut-offs.
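(For completeness: the usual client-side workaround is to page the 
pattern with LIMIT/OFFSET, one cap-sized slice at a time, which a short 
Python sketch makes concrete. But note it is the cut-offs doing the real 
work here, which is rather my point, and the standard gives no stable 
order between pages unless you add ORDER BY, so the slices are not 
guaranteed to be disjoint or complete.)

```python
# Sketch of the common LIMIT/OFFSET paging workaround for endpoint
# result caps. The 100,000 page size matches the cap discussed above;
# without ORDER BY, pages are best-effort, not guaranteed consistent.
def paged_queries(pattern, total, page=100_000):
    """Yield one SELECT per page over `pattern`, up to `total` rows."""
    offset = 0
    while offset < total:
        limit = min(page, total - offset)
        yield (f"SELECT * WHERE {{ {pattern} }} "
               f"LIMIT {limit} OFFSET {offset}")
        offset += limit

# The Excel-sized target of 1,048,576 rows comes out as 11 queries:
queries = list(paged_queries("?s foaf:knows ?o", 1_048_576))
```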

But even aside from cut-offs, we can increase the hops in the foaf:knows 
network arbitrarily for a SPARQL query and see where that leads, or we 
can start counting n-cliques of arbitrary size in the social network, or 
maybe run a six-degrees-of-separation query across the data. These types 
of interesting SPARQL queries are obviously not scalable to evaluate.
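For concreteness, here is one such query in SPARQL 1.1 property-path 
syntax (the two IRIs are hypothetical; a fixed six-hop bound would 
instead chain foaf:knows/foaf:knows/... six times, since bounded-length 
paths didn't make the final spec):

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
# Is there any chain of foaf:knows links from alice to bob?
ASK {
  <http://example.org/alice> foaf:knows+ <http://example.org/bob>
}
```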

<snip>
>>> In which context? I don't know if it answers your question, but XPath
>>> 1.0 is PTime and is parallelisable.
>>
>> So is SPARQL. That and when I mention vectorized execution I am
>> referring to doing this in a very fine-grained manner such that threads
>> (which execute queries in parallel) are scoped to CPU cores

SPARQL is not in PTime: query evaluation is PSPACE-complete in combined 
complexity. Only parts of evaluating a SPARQL query can be effectively 
parallelised. (Answering multiple SPARQL queries at once is obviously 
parallelisable.)

XPath deals with trees. SPARQL deals with graphs. The former is much 
"easier", hence more scalable. (This by no means makes XPath scalable in 
itself or somehow better than SPARQL or anything like that.)

And I am delighted that Virtuoso is a SPARQL engine, not an XPath 
engine. And I'm not sure of the relevance of this line of discussion. :)

>> No, not if your scalability yardstick boils down to cursor-less patterns
>> against massive datasets of the form:

My yardstick simply involves reliably servicing valid SPARQL queries 
with a valid response over valid large inputs. Can you do useful things 
with SPARQL at scale? Yes. Can you support a fully compliant SPARQL 
engine at scale? No. (Again, "I give up" or "here's a partial response" 
are not valid SPARQL responses.)

>>> My core point is that *one cannot make blanket guarantees for
>>> scalability with respect to something like SPARQL*. I hope we could
>>> agree on that point.
>>
>> We can, and that's my point. We can make that claim [1] and defend it.

"150 Billion Triple dataset hosted on the LOD2 Knowledge Store Cluster" 
sounds like a great engineering achievement (congrats!), and I'm sure 
lots of useful queries can be answered very quickly. But if you want to 
defend the blanket guarantee, can we start by shifting down five orders 
of magnitude and getting those million foaf:knows relations out of the 
public endpoint and into CSV/Excel, for a start?

Service:

	http://lod.openlinksw.com/sparql

SPARQL Query:

	SELECT * WHERE
	{?s foaf:knows ?o}
	LIMIT 1048576

This should, by definition, be covered under the "blanket guarantee". If 
we get that working (probably a param in the admin interface to turn off 
the artificial 100,000 limit), we can then explore the wide variety of 
other things that cannot be covered by this blanket guarantee.

Anyways, I'm sure you have a lot of disagreeing to do ;), but I'll exit 
the discussion on my side by suggesting a possible compromise:

"SPARQL scales (except for those parts that don't)".

The part in parentheses is mandatory. :)


Cheers,
Aidan

Received on Wednesday, 17 April 2013 17:26:27 UTC