Re: Restpark - Minimal RESTful API for querying RDF triples

Hi Aidan,

If the existence of admin resource limits means that something does not
scale, then Google and Facebook do not scale. By my common-sense definition
of scale, that seems bizarre.

Send enough requests to Facebook or Google and they will stop responding in
HTTP-spec-compliant ways. Does HTTP not scale?

I think an argument can be made that the current user management tools in
standard SPARQL leave something to be desired. But that is a different
discussion altogether.

regards,
Jerven

On Apr 17, 2013 7:30 PM, "Aidan Hogan" <aidan.hogan@deri.org> wrote:
>
> Hi Kingsley,
>
> I fear we're not getting anywhere, but I want to try to wrap up my side of
the argument with this mail.
>
>
> On 16/04/2013 22:38, Kingsley Idehen wrote:
>>
>> On 4/16/13 5:22 PM, Aidan Hogan wrote:
>
> <snip>
>>>
>>> Anyways, as per my previous reply ...
>>>
>>> With respect to this SPARQL query service:
>>>
>>>     http://lod.openlinksw.com/sparql
>>>
>>> I would like a response compliant with the SPARQL standard for either
>>> of the following two SPARQL queries:
>>>
>>>     SELECT * WHERE
>>>     {?s foaf:knows ?o}
>>>
>>> or
>>>
>>>     SELECT * WHERE
>>>     {?s foaf:knows ?o . ?o foaf:knows ?o2 .}
>>>
>>>
>>>
>> Did you perform a count on either? If so, why no LIMIT in the query? If
>> you want no LIMIT, into what bucket are you placing the result? Would you
>> dare send the following to a decently sized RDBMS and use it as the
>> basis for assessing scale:
>
>
> I am very much impressed that Virtuoso can compute the COUNT on the
second query, but that was not the query I asked for.
>
> As for not putting a LIMIT on the query, I don't see where the SPARQL
standard says that LIMIT is mandatory. Nor do I have to put them in a
"bucket". Maybe I want to download the results to a CSV file and scan
through them in the scripting language of my choice, maybe building a map
of foaf:knows relations across different sites?
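> (As a concrete sketch of that workflow: the SPARQL 1.1 Protocol lets you
request SELECT results as CSV over plain HTTP with an Accept header. The
endpoint URL and query below are the ones under discussion; whether the full
result actually comes back is, of course, entirely up to the server's
configured limits. A minimal Python sketch:)

```python
import csv
import io
import urllib.parse
import urllib.request

# The endpoint under discussion.
ENDPOINT = "http://lod.openlinksw.com/sparql"

QUERY = """\
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT * WHERE { ?s foaf:knows ?o }
"""

def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a GET request asking for SPARQL 1.1 CSV results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(url, headers={"Accept": "text/csv"})

def parse_csv_results(text: str) -> list[list[str]]:
    """Parse a CSV result body into rows; row 0 holds the variable names."""
    return list(csv.reader(io.StringIO(text)))

# Actually fetching is a single call; how many rows come back depends
# entirely on the server's admin-configured result limits:
#   with urllib.request.urlopen(build_request(ENDPOINT, QUERY)) as resp:
#       rows = parse_csv_results(resp.read().decode("utf-8"))
```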
>
> My query is a valid SPARQL query but I'm not getting a valid SPARQL
response.
>
> But okay, for argument's sake, let's say I want to load them into an Excel
spreadsheet (a bucket), which has a max of 1048576 rows.
>
>
> SELECT * WHERE
> {?s foaf:knows ?o}
> LIMIT 1048576
>
> Still won't work. I'll only get 100,000 results.
>
> And I completely understand why it doesn't work, and I completely
understand why Virtuoso and other SPARQL admins commonly enforce such
limits: because we can both agree that it is not practical to answer even
such simple SPARQL queries at large scale for many users without cut-offs.
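> (The usual practical workaround for such a cap is to page through the
result with LIMIT/OFFSET, as sketched below in Python. Note this does not
rescue the compliance point: without an ORDER BY, SPARQL gives no guarantee
that pages are disjoint or complete, so a paged harvest is a heuristic, not
a single valid response to the original query.)

```python
def paged_queries(pattern: str, page_size: int, total: int):
    """Yield LIMIT/OFFSET page queries for a SELECT over `pattern`.

    Heuristic only: absent ORDER BY, the spec does not guarantee a
    stable ordering between requests, so pages may overlap or miss rows.
    """
    offset = 0
    while offset < total:
        limit = min(page_size, total - offset)
        yield f"SELECT * WHERE {{ {pattern} }} LIMIT {limit} OFFSET {offset}"
        offset += limit

# Fetching 1,048,576 rows under a 100,000-row cap takes 11 requests.
pages = list(paged_queries("?s foaf:knows ?o", 100000, 1048576))
```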
>
> But even aside from cut-offs, we can increase the hops in the foaf:knows
network arbitrarily for a SPARQL query and see where that leads, or we
start counting n-cliques of arbitrary size in the social network or maybe
do a six-degrees-of-separation query across the data. These types of
interesting SPARQL queries are obviously not scalable to evaluate.
>
> <snip>
>
>>>> In which context? I don't know if it answers your question, but XPath
>>>> 1.0 is PTime and is parallelisable.
>>>
>>>
>>> So is SPARQL. That and when I mention vectorized execution I am
>>> referring to doing this in a very fine-grained manner such that threads
>>> (which execute queries in parallel) are scoped to CPU cores
>
>
> SPARQL is not in PTime. Only parts of evaluating a SPARQL query can be
effectively parallelised. (Answering multiple SPARQL queries is obviously
parallelisable.)
>
> XPath deals with trees. SPARQL deals with graphs. The former is much
"easier", hence more scalable. (This by no means makes XPath scalable in
itself or somehow better than SPARQL or anything like that.)
>
> And I am delighted that Virtuoso is a SPARQL engine, not an XPath engine.
And I'm not sure of the relevance of this line of discussion. :)
>
>
>>> No, not if your scalability yardstick boils down to cursor-less patterns
>>> against massive datasets of the form:
>
>
> My yardstick simply involves reliably servicing valid SPARQL queries with
a valid response over valid large inputs. Can you do useful things with
SPARQL at scale? Yes. Can you support a fully compliant SPARQL engine at
scale? No. (Again, "I give up" or "here's a partial response" are not valid
SPARQL responses.)
>
>
>>>> My core point is that *one cannot make blanket guarantees for
>>>> scalability with respect to something like SPARQL*. I hope we could
>>>> agree on that point.
>>>
>>>
>>> We can, and that's my point. We can make that claim [1] and defend it.
>
>
> "150 Billion Triple dataset hosted on the LOD2 Knowledge Store Cluster"
sounds like a great engineering achievement (congrats!), and I'm sure lots
of useful queries can be answered very quickly, but if you want to defend
the blanket guarantee, can we start by shifting down five orders of
magnitude and looking to get those million foaf:knows relations out of the
public endpoint and into CSV/Excel to start?
>
> Service:
>
>         http://lod.openlinksw.com/sparql
>
> SPARQL Query:
>
>
>         SELECT * WHERE
>         {?s foaf:knows ?o}
>         LIMIT 1048576
>
> This should, by definition, be covered under the "blanket guarantee". If
we get that working (probably a param in the admin interface to turn off
the artificial 100,000 limit), we can then explore the wide variety of
other things that cannot be covered by this blanket guarantee.
>
> Anyways, I'm sure you have a lot of disagreeing to do ;), but I'll exit
the discussion on my side by suggesting a possible compromise:
>
> "SPARQL scales (except for those parts that don't)".
>
> The part in parentheses is mandatory. :)
>
>
> Cheers,
> Aidan
>

Received on Wednesday, 17 April 2013 18:01:13 UTC