- From: Kingsley Idehen <kidehen@openlinksw.com>
- Date: Wed, 17 Apr 2013 14:36:43 -0400
- To: public-lod@w3.org
- Message-ID: <516EEBBB.2080200@openlinksw.com>
On 4/17/13 1:25 PM, Aidan Hogan wrote:
> Hi Kingsley,
>
> I fear we're not getting anywhere but I want to try to wrap up my side of
> the argument with this mail.
>
> On 16/04/2013 22:38, Kingsley Idehen wrote:
>> On 4/16/13 5:22 PM, Aidan Hogan wrote:
> <snip>
>>> Anyways, as per my previous reply ...
>>>
>>> With respect to this SPARQL query service:
>>>
>>> http://lod.openlinksw.com/sparql
>>>
>>> I would like a response compliant with the SPARQL standard for either
>>> of the following two SPARQL queries:
>>>
>>> SELECT * WHERE
>>> {?s foaf:knows ?o}
>>>
>>> or
>>>
>>> SELECT * WHERE
>>> {?s foaf:knows ?o . ?o foaf:knows ?o2 .}
>>>
>> Did you perform a count on either? If so, why no LIMIT in the query? If
>> you want no LIMIT, into what bucket are you placing the result? Would you
>> dare send the following to a decently sized RDBMS and use it as the
>> basis for assessing scale:
>
> I am very much impressed that Virtuoso can compute the COUNT on the
> second query, but that was not the query I asked for.

And the query you asked for was what? If you mean these queries:

SELECT * WHERE
{?s foaf:knows ?o}

or

SELECT * WHERE
{?s foaf:knows ?o . ?o foaf:knows ?o2 .}

against 50+ billion triples, how is that somehow a yardstick or meaningful basis for scale? You have utterly lost me.

> As for not putting a LIMIT on the query, I don't see where in the
> SPARQL standard it says that LIMIT is mandatory?

LIMIT and OFFSET are part of the SPARQL query language. They exist so that you can use SPARQL to page through data using a cursor-like mechanism [1][2].

> Nor do I have to put them in a "bucket".

Bucket means: you are going to pass a query to a server requesting the solution to that query. The server will produce a solution and return it to you. You are the driver of some client app, e.g., a Web browser or some other user agent. Whatever it is that you are driving, there will be storage for the results that comprise the solution.
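The cursor-like paging mechanism described above can be sketched as follows; this is an illustrative helper (the function name and parameters are mine, not from the thread) that wraps a SPARQL graph pattern in ascending OFFSET/LIMIT windows:

```python
def paged_queries(pattern, page_size, pages):
    """Yield SPARQL SELECT queries that walk a result set in
    fixed-size windows, emulating a scrollable cursor via
    OFFSET/LIMIT. ORDER BY keeps the windows stable across
    requests, which paging depends on."""
    for page in range(pages):
        yield (
            "SELECT * WHERE { %s } "
            "ORDER BY ?s "
            "LIMIT %d OFFSET %d" % (pattern, page_size, page * page_size)
        )

# First three windows over the foaf:knows pattern from the thread:
for q in paged_queries("?s foaf:knows ?o", 100000, 3):
    print(q)
```

Each yielded query stays under a 100,000-row page, so it remains within the kind of per-request limit a public endpoint enforces.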
> Maybe I want to download the results to a CSV file and scan through
> them in the scripting language of my choice, maybe building a map of
> foaf:knows relations across different sites?

Yes, you want to use SPARQL to download 2 billion+ records. Well, if you want to do that, then use OFFSET and LIMIT and you will eventually get all the data into your local CSV document.

> My query is a valid SPARQL query but I'm not getting a valid SPARQL
> response.

See my comment above.

> But okay, for argument's sake, let's say I want to load them into an
> Excel spreadsheet (a bucket), which has a max of 1048576 rows.
>
> SELECT * WHERE
> {?s foaf:knows ?o}
> LIMIT 1048576
>
> Still won't work. I'll only get 100,000 results.

Again, that isn't what you do, and it has nothing to do with scale. You fetch your data in chunks. Also note, our public instances are configured to force user agents to work this way since the instances are for public use, i.e., not for anyone to come along and hog the system. If you want to play around with hypothetical queries like that, you have a simple solution: install your own Virtuoso instance, in your own setup, and play to your heart's content.

A server on the Web has to be configurable for its desired usage pattern. Our instances (be it DBpedia, the LOD Cloud cache, etc.) are configured to allow the public to perform queries in a manner that ensures everyone has some degree of fair use.

> And I completely understand why it doesn't work, and I completely
> understand why Virtuoso and other SPARQL admins commonly enforce such
> limits: because we can both agree that it is not practical to answer
> even such simple SPARQL queries at large scale for many users without
> cut-offs. But that doesn't mean that SPARQL doesn't scale.

How does XQuery/XPath solve that issue? How does any query language solve that issue?
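A minimal sketch of the chunked download loop suggested above, assuming a `fetch_page(offset, limit)` callable that returns one page of result rows (the names are hypothetical; a real client would issue each page as a SPARQL Protocol request asking for `text/csv`):

```python
import csv
import io

def download_all(fetch_page, page_size=100000):
    """Accumulate every row of a result set by walking it in
    OFFSET/LIMIT pages until a short (or empty) page signals
    that the end has been reached."""
    rows, offset = [], 0
    while True:
        page = fetch_page(offset, page_size)
        rows.extend(page)
        if len(page) < page_size:  # last page reached
            return rows
        offset += page_size

def to_csv(rows, header=("s", "o")):
    """Serialise the accumulated rows to CSV text, e.g. for
    loading into Excel or a scripting language of choice."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()
```

With a stub fetcher slicing an in-memory list, `download_all` reassembles all rows across page boundaries; against a live endpoint, the per-page limit keeps each request within the server's fair-use cut-off.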
> But even aside from cut-offs, we can increase the hops in the
> foaf:knows network arbitrarily for a SPARQL query and see where that
> leads, or we start counting n-cliques of arbitrary size in the social
> network or maybe do a six-degrees-of-separation query across the data.

You can do that as long as you are also ready to factor in OFFSET and LIMIT as the basis for working with chunks of data. Likewise, if you are attempting a transitive closure, the same applies.

> These types of interesting SPARQL queries are obviously not scalable
> to evaluate.
>
> <snip>
>>>> In which context? I don't know if it answers your question, but XPath
>>>> 1.0 is PTime and is parallelisable.
>>>
>>> So is SPARQL. That and when I mention vectorized execution I am
>>> referring to doing this in a very fine-grained manner such that threads
>>> (which execute queries in parallel) are scoped to CPU cores
>
> SPARQL is not in PTime. Only parts of evaluating a SPARQL query can be
> effectively parallelised. (Answering multiple SPARQL queries is
> obviously parallelisable.)
>
> XPath deals with trees. SPARQL deals with graphs. The former is much
> "easier", hence more scalable. (This by no means makes XPath scalable
> in itself or somehow better than SPARQL or anything like that.)

You are speculating. Can you please provide an XPath query that's comparable to SPARQL and that demonstrates your point? This is the Web; a URL is all you need to prove your point.

> And I am delighted that Virtuoso is a SPARQL engine, not an XPath
> engine. And I'm not sure of the relevance of this line of discussion. :)

You made the SPARQL and XPath engine juxtaposition above in the context of scale. Your claim is speculative until you provide proof.

>>> No, not if your scalability yardstick boils down to cursor-less
>>> patterns against massive datasets of the form:
>
> My yardstick simply involves reliably servicing valid SPARQL queries
> with a valid response over valid large inputs.

Proof via a URL.
Otherwise you are just speculating, as I've stated already.

> Can you do useful things with SPARQL at scale? Yes. Can you support a
> fully compliant SPARQL engine at scale? No. (Again, "I give up" or
> "here's a partial response" are not valid SPARQL responses.)
>
>>>> My core point is that *one cannot make blanket guarantees for
>>>> scalability with respect to something like SPARQL*. I hope we could
>>>> agree on that point.
>>>
>>> We can, and that's my point. We can make that claim [1] and defend it.
>
> "150 Billion Triple dataset hosted on the LOD2 Knowledge Store
> Cluster" sounds like a great engineering achievement (congrats!), and
> I'm sure lots of useful queries can be answered very quickly, but if
> you want to defend the blanket guarantee, can we start by shifting
> down five orders of magnitude and looking to get those million
> foaf:knows relations out of the public endpoint and into CSV/Excel to
> start?

If you can provide an endpoint that will do the same with XPath, or be ready to commission some servers in your data center, it's game on, with immense pleasure :-)

> Service:
>
> http://lod.openlinksw.com/sparql
>
> SPARQL Query:
>
> SELECT * WHERE
> {?s foaf:knows ?o}
> LIMIT 1048576
>
> This should, by definition, be covered under the "blanket guarantee".

Of course not. I assume you understand what scrollable cursors are and why they exist re. DBMS technology?

> If we get that working (probably a param in the admin interface to
> turn off the artificial 100,000 limit), we can then explore the wide
> variety of other things that cannot be covered by this blanket guarantee.

I don't know where "blanket guarantee" came from. BTW -- are you implying your "blanket guarantee" is something XPath can deliver over this volume of data, i.e., from a public endpoint to a local CSV? Note, said endpoint will be accessible to the public etc. If that's possible, I wouldn't expect the endpoint in question to be a secret.
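For concreteness, moving data from a public endpoint to a local CSV happens over the SPARQL Protocol as plain HTTP; a sketch of building one such paged request against the endpoint named in the thread (the helper name is mine; the result format would normally be negotiated via the HTTP Accept header, e.g. `Accept: text/csv`):

```python
from urllib.parse import urlencode, unquote_plus

ENDPOINT = "http://lod.openlinksw.com/sparql"

def protocol_url(query):
    """Build a SPARQL Protocol GET URL carrying the given query
    string, URL-encoded in the standard 'query' parameter."""
    return ENDPOINT + "?" + urlencode({"query": query})

# One page of the foaf:knows result set, as a plain HTTP GET target:
url = protocol_url("SELECT * WHERE {?s foaf:knows ?o} "
                   "ORDER BY ?s LIMIT 100000 OFFSET 0")
print(url)
```

Any HTTP client (a browser, curl, a script) can then walk the result set page by page simply by incrementing the OFFSET in successive requests.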
> Anyways, I'm sure you have a lot of disagreeing to do ;), but I'll
> exit the discussion on my side by suggesting a possible compromise:
>
> "SPARQL scales (except for those parts that don't)".

That's a strawman, i.e., you are now responding to a totally different question. Remember, we started off with SPARQL being more scalable than an ever-evolving collection of divergent RESTful patterns for accessing data. My claim was that REST patterns (for this sort of job) will die out over time because there will be a zillion of them, as opposed to a single declarative query language like SPARQL, which includes RESTful data interaction via the SPARQL Protocol. Web 2.0 APIs (SOAP-based or those that offer RESTful interactions) demonstrate the claim I make above, every other day of the week.

> The part in parentheses is mandatory. :)

Strawman :-)

Links:

1. http://msdn.microsoft.com/en-us/library/windows/desktop/ms710292(v=vs.85).aspx -- Scrollable Cursors
2. http://en.wikipedia.org/wiki/Cursor_(databases)#Scrollable_cursors -- ditto
3. http://bit.ly/Yv8nAK -- DESCRIBE based paging via OFFSET and LIMIT (query definition)
4. http://bit.ly/XGqqXo -- ditto (keep refreshing the URL and notice the data change per refresh)
5. http://bit.ly/13iP198 -- same thing (results URL) based on your query

> Cheers,
> Aidan

--
Regards,

Kingsley Idehen
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca handle: @kidehen
Google+ Profile: https://plus.google.com/112399767740508618350/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen
Attachments
- application/pkcs7-signature attachment: S/MIME Cryptographic Signature
Received on Wednesday, 17 April 2013 18:37:07 UTC