Re: Restpark - Minimal RESTful API for querying RDF triples

On 4/17/13 1:25 PM, Aidan Hogan wrote:
> Hi Kingsley,
>
> I fear we're not getting anywhere but I want to try wrap up my side of 
> the argument with this mail.
>
> On 16/04/2013 22:38, Kingsley Idehen wrote:
>> On 4/16/13 5:22 PM, Aidan Hogan wrote:
> <snip>
>>> Anyways, as per my previous reply ...
>>>
>>> With respect to this SPARQL query service:
>>>
>>>     http://lod.openlinksw.com/sparql
>>>
>>> I would like a response compliant with the SPARQL standard for either
>>> of the following two SPARQL queries:
>>>
>>>     SELECT * WHERE
>>>     {?s foaf:knows ?o}
>>>
>>> or
>>>
>>>     SELECT * WHERE
>>>     {?s foaf:knows ?o . ?o foaf:knows ?o2 .}
>>>
>>>
>>>
>> Did you perform a count on either? If so, why no LIMIT in the query ? If
>> you want no LIMIT into what bucket are you placing the result? Would you
>> dare send the following to a decently sized RDBMS and use it as the
>> basis for assessing scale:
>
> I am very much impressed that Virtuoso can compute the COUNT on the 
> second query, but that was not the query I asked for.

And the query you asked for was what?

If you mean these queries:

    SELECT * WHERE
     {?s foaf:knows ?o}

or

     SELECT * WHERE
     {?s foaf:knows ?o . ?o foaf:knows ?o2 .}

Run unbounded against 50+ billion triples? If that is supposed to be a 
yardstick or meaningful basis for assessing scale, you have utterly lost me.


>
> As for not putting a LIMIT on the query, I don't see where in the 
> SPARQL standard it says that LIMIT is mandatory?

LIMIT and OFFSET are part of the SPARQL query language. They exist so 
that you can use SPARQL to page through data using a cursor-like 
mechanism [1][2].
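
To make the cursor pattern concrete, here is a sketch in Python. The 
`run_query` callable is a stand-in for whatever HTTP client you use 
against the endpoint, and the page size is illustrative:

```python
PAGE_SIZE = 10000

# SPARQL query template; LIMIT/OFFSET give the cursor-like paging.
QUERY_TEMPLATE = """
SELECT * WHERE { ?s foaf:knows ?o }
LIMIT %(limit)d OFFSET %(offset)d
"""

def page_through(run_query, page_size=PAGE_SIZE):
    """Yield successive result pages until the endpoint returns fewer
    rows than requested (i.e., the cursor is exhausted)."""
    offset = 0
    while True:
        rows = run_query(QUERY_TEMPLATE % {"limit": page_size,
                                           "offset": offset})
        if not rows:
            return
        yield rows
        if len(rows) < page_size:  # last, partial page
            return
        offset += page_size
```

Strictly speaking, stable paging also wants an ORDER BY so the solution 
sequence is deterministic across requests, but the chunking mechanics 
are the same.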

> Nor do I have to put them in a "bucket". 

By "bucket" I mean: you pass a query to a server requesting a solution. 
The server produces the solution and returns it to you. You are the 
driver of some client app, e.g., a Web browser or some other user agent. 
Whatever it is that you are driving, there has to be storage for the 
results that comprise the solution.

> Maybe I want to download the results to a CSV file and scan through 
> them in the scripting language of my choice, maybe building a map of 
> foaf:knows relations across different sites?

Yes, you want to use SPARQL to download 2 billion+ records. Well, if you 
want to do that, then use OFFSET and LIMIT, and you will eventually get 
all the data into your local CSV document.
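
As a sketch, the CSV download loop might look like the following in 
Python. The `run_query` helper is hypothetical (it stands in for your 
HTTP client and is assumed to return a list of (s, o) pairs for a given 
SPARQL string):

```python
import csv

def dump_to_csv(run_query, path, page_size=100000):
    """Fetch ?s foaf:knows ?o bindings page by page via OFFSET/LIMIT
    and append each chunk to a local CSV file."""
    template = ("SELECT ?s ?o WHERE { ?s foaf:knows ?o } "
                "LIMIT %d OFFSET %d")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["s", "o"])  # header row
        offset = 0
        while True:
            rows = run_query(template % (page_size, offset))
            if not rows:
                break
            writer.writerows(rows)
            if len(rows) < page_size:  # final, partial page
                break
            offset += page_size
```

Slow? Certainly, for billions of rows. But it terminates, it is fair to 
other users of the endpoint, and it gets you your CSV.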

>
> My query is a valid SPARQL query but I'm not getting a valid SPARQL 
> response.

See my comment above.

>
> But okay, for arguments sake, let's say I want to load them into an 
> Excel spreadsheet (a bucket), which has a max of 1048576 rows.
>
> SELECT * WHERE
> {?s foaf:knows ?o}
> LIMIT 1048576
>
> Still won't work. I'll only get 100,000 results.

Again: that isn't how you do it, and it has nothing to do with scale. 
You fetch your data in chunks. Also note, our public instances are 
configured to force user agents to work this way, since the instances 
are for public use, i.e., they are not there for anyone to come along 
and hog the system.

If you want to play around with hypothetical queries like that, there is 
a simple solution: install your own Virtuoso instance, in your own 
setup, and play to your heart's content.

A server on the Web has to be configurable for its desired usage 
pattern. Our instances (be it DBpedia, the LOD Cloud cache, etc.) are 
configured to allow the public to perform queries in a manner that 
ensures everyone gets some degree of fair use.
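
For context, the caps at play here are ordinary server configuration. 
In Virtuoso they live in the [SPARQL] section of virtuoso.ini; the 
fragment below is illustrative only (parameter names and values from 
memory, so check the product documentation for your version):

```ini
[SPARQL]
; Cap on the number of rows returned per result set to SPARQL clients
ResultSetMaxRows      = 100000
; Abort queries whose execution time exceeds this bound (seconds)
MaxQueryExecutionTime = 60
```

On your own instance you can raise or drop these caps entirely; on a 
shared public endpoint they are what keeps the service usable for everyone.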

>
> And I completely understand why it doesn't work, and I completely 
> understand why Virtuoso and other SPARQL admins commonly enforce such 
> limits: because we can both agree that it is not practical to answer 
> even such simple SPARQL queries at large scale for many users without 
> cut-offs.

But that doesn't mean that SPARQL doesn't scale. How does XQuery/XPath 
solve that issue? How does any query language solve that issue?

>
> But even aside from cut-offs, we can increase the hops in the 
> foaf:knows network arbitrarily for a SPARQL query and see where that 
> leads, or we start counting n-cliques of arbitrary size in the social 
> network or maybe do a six-degrees-of-separation query across the data. 

You can do that as long as you are also ready to factor in OFFSET and 
LIMIT as the basis for working with chunks of data. Likewise, if you are 
attempting a transitive closure, the same applies: work through the data 
in chunks rather than via one unbounded request.

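
To make the point concrete: something like a six-degrees-of-separation 
check can be run client-side over foaf:knows edges fetched in chunks. A 
sketch in Python (the in-memory edge list and function name are 
illustrative, not anything Virtuoso ships):

```python
from collections import deque

def degrees_of_separation(edges, start, target, max_hops=6):
    """Breadth-first search over foaf:knows edges held locally (e.g.,
    downloaded in OFFSET/LIMIT chunks). Returns the number of hops
    from `start` to `target`, or None if unreachable within `max_hops`."""
    adjacency = {}
    for s, o in edges:
        adjacency.setdefault(s, []).append(o)
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if node == target:
            return depth
        if depth == max_hops:
            continue  # do not expand beyond the hop budget
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return None
```

(SPARQL 1.1 property paths, e.g. `?s foaf:knows+ ?o`, express the 
closure declaratively, but the same chunking discipline applies when 
you materialize the results.)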
> These types of interesting SPARQL queries are obviously not scalable 
> to evaluate.
>
> <snip>
>>>> In which context? I don't know if it answers your question, but XPath
>>>> 1.0 is PTime and is parallelisable.
>>>
>>> So is SPARQL. That and when I mention vectorized execution I am
>>> referring to doing this in a very fine-grained manner such that threads
>>> (which execute queries in parallel) are scoped to CPU cores
>
> SPARQL is not in PTime. Only parts of evaluating a SPARQL query can be 
> effectively parallelised. (Answering multiple SPARQL queries is 
> obviously parallelisable.)
>
> XPath deals with trees. SPARQL deals with graphs. The former is much 
> "easier", hence more scalable. (This by no means makes XPath scalable 
> in itself or somehow better than SPARQL or anything like that.)

You are speculating. Can you please provide an XPath query, comparable 
to the SPARQL above, that demonstrates your point? This is the Web; a 
URL is all you need to prove your point.

>
> And I am delighted that Virtuoso is a SPARQL engine, not an XPath 
> engine. And I'm not sure of the relevance of this line of discussion. :)

You made the SPARQL and XPath engine juxtaposition above in the context 
of scale. Your claim is speculative until you provide proof.

>
>>> No, not if your scalability yardstick boils down to cursor-less 
>>> patterns
>>> against massive datasets of the form:
>
> My yardstick simply involves reliably servicing valid SPARQL queries 
> with a valid response over valid large inputs. 

Proof via a URL. Otherwise you are just speculating, as I've stated already.

> Can you do useful things with SPARQL at scale? Yes. Can you support a 
> fully compliant SPARQL engine at scale? No. (Again, "I give up" or 
> "here's a partial response" are not valid SPARQL responses.)
>
>>>> My core point is that *one cannot make blanket guarantees for
>>>> scalability with respect to something like SPARQL*. I hope we could
>>>> agree on that point.
>>>
>>> We can, and that's my point. We can make that claim [1] and defend it.
>
> "150 Billion Triple dataset hosted on the LOD2 Knowledge Store 
> Cluster" sounds like a great engineering achievement (congrats!), and 
> I'm sure lots of useful queries can be answered very quickly, but if 
> you want to defend the blanket guarantee, can we start by shifting 
> down five orders of magnitude and looking to get those million 
> foaf:knows relations out of the public endpoint and into CSV/Excel to 
> start?

If you can provide an endpoint that will do the same with XPath, or are 
ready to commission some servers in your data center, then it's game on, 
with immense pleasure :-)

>
> Service:
>
>     http://lod.openlinksw.com/sparql
>
> SPARQL Query:
>
>     SELECT * WHERE
>     {?s foaf:knows ?o}
>     LIMIT 1048576
>
> This should, by definition, be covered under the "blanket guarantee".

Of course not.

I assume you understand what scrollable cursors are and why they exist 
in DBMS technology?

> If we get that working (probably a param in the admin interface to 
> turn off the artificial 100,000 limit), we can then explore the wide 
> variety of other things that cannot be covered by this blanket guarantee.

I don't know where "blanket guarantee" came from.

BTW -- are you implying your "blanket guarantee" is something XPath can 
deliver over this volume of data, i.e., from a public endpoint to a 
local CSV? Note, said endpoint would have to be accessible to the 
public, etc. If that's possible, I wouldn't expect the endpoint in 
question to be a secret.

>
> Anyways, I'm sure you have a lot of disagreeing to do ;), but I'll 
> exit the discussion on my side by suggesting a possible compromise:
>
> "SPARQL scales (except for those parts that don't)".

That's a strawman, i.e., you are now responding to a totally different 
question. Remember, we started off with SPARQL being more scalable than 
an ever-evolving collection of divergent RESTful patterns for accessing 
data. My claim was that REST patterns (for this sort of job) will die 
out over time, because there will be a zillion of them, as opposed to a 
single declarative query language like SPARQL, which includes RESTful 
data interaction via the SPARQL Protocol.

Web 2.0 APIs (SOAP or those that offer RESTful interactions) demonstrate 
the claim I make above, every other day of the week.

>
> The part in parentheses is mandatory. :)

Strawman :-)

Links:

1. 
http://msdn.microsoft.com/en-us/library/windows/desktop/ms710292(v=vs.85).aspx 
-- Scrollable Cursors
2. http://en.wikipedia.org/wiki/Cursor_(databases)#Scrollable_cursors -- 
ditto
3. http://bit.ly/Yv8nAK -- DESCRIBE based paging via OFFSET and LIMIT 
(query definition)
4. http://bit.ly/XGqqXo -- ditto (keep refreshing the URL and notice the 
data change per refresh)
5. http://bit.ly/13iP198 -- same thing (results URL) based on your query.

>
>
> Cheers,
> Aidan
>
>
>


-- 

Regards,

Kingsley Idehen	
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca handle: @kidehen
Google+ Profile: https://plus.google.com/112399767740508618350/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen

Received on Wednesday, 17 April 2013 18:37:07 UTC