Re: comparing to OWL and SPIN from Sandro Hawke on 2014-07-21 (public-rdf-shapes@w3.org from July 2014)

From: Sandro Hawke <sandro@w3.org>
Date: Mon, 21 Jul 2014 17:14:20 -0400
To: Jerven Bolleman <jerven.bolleman@isb-sib.ch>
CC: Kendall Clark <kendall@clarkparsia.com>, "Peter F. Patel-Schneider" <pfpschneider@gmail.com>, "Dam, Jesse van" <jesse.vandam@wur.nl>, "public-rdf-shapes@w3.org" <public-rdf-shapes@w3.org>
Message-ID: <53CD82AC.3000105@w3.org>
On 07/21/2014 04:34 PM, Jerven Bolleman wrote:
> On 21 Jul 2014, at 21:07, Sandro Hawke <sandro@w3.org> wrote:
>
>> On 07/21/2014 02:50 PM, Jerven Bolleman wrote:
>>> On 21 Jul 2014, at 20:16, Sandro Hawke <sandro@w3.org> wrote:
>>>
>>>> On 07/21/2014 01:54 PM, Kendall Clark wrote:
>>>>> On Mon, Jul 21, 2014 at 1:49 PM, Sandro Hawke <sandro@w3.org> wrote:
>>>>> On 07/21/2014 08:09 AM, Peter F. Patel-Schneider wrote:
>>>>> It could be that the Regular Expression derivatives algorithm, although much less expressive than OWL, is outperforming the OWL reasoners.  Only some research and testing will give a useful answer, but certainly something nice to consider and test.
>>>>>
>>>>> Yes, this could be tested.  I expect that Stardog ICV will perform very well, as it works by translation into SPARQL queries.
>>>>>
>>>>> It looks to me like ShEx could validate a graph serialization in linear time (with the size of the serialization), with no need for storing the graph.  That's appealing to me when we're talking
>>>>> about validating messages that are being sent between systems.
>>>>>   No need to store the graph unless its size exceeds available memory, right? That does happen from time to time.
>>>>>
>>>> When I said "store", I meant in RAM.  :)   I was thinking it would be nice to have validation as part of a streaming serializer and streaming parser.  It's nice to have those things not buffer the whole input/output before moving it on.
>>> You can only do that if you know the order of triples you are going to get, i.e. bounded messages. And in any case you will have to validate on a sliding window of a number of triples; this is no different between ShEx and SPARQL. So you need an in-memory buffer, on which you can execute SPARQL. At this size you most likely don’t need indexes because you can build your binding sets on the fly.
>>>>> SPARQL based solutions require storing and searching the graph, which is exponential (and likely slow unless properly indexed), but that's probably fine if you're just validating data that you need to keep in a SPARQL system anyway.
>>>>>
>>>>> Actually Stardog ICV does both; either transactionally for data under storage or in-memory for message passing and middleware contexts.
>>>>>
>>>>> Also, the complexity of SPARQL query answering is well understood and it's not EXP.
>>>>>
>>>> Interesting, this is what I get for stretching myself too thin across too many technologies.   I would have thought executing a query with a graph pattern like { <s> <p1> ?v1.  ?v1 <p2> ?v2. ...    ?v(n-1) <pn> ?vn } would take time proportional to k^n.   With sufficient indexing, k might be very close to 1, but without indexing, I'd think k would be the mean cardinality of p1...pn.   And of course indexing takes time.
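A minimal sketch of the chain pattern { <s> <p1> ?v1 . ?v1 <p2> ?v2 . ... } discussed above, assuming triples are plain Python tuples and that a hash index keyed on (subject, predicate) is available. The function names and the data layout are illustrative assumptions, not any SPARQL engine's API. It only counts solutions, so it illustrates how the number of bindings can grow as k^n when every node has k matching successors; it says nothing about the cost of a cleverer evaluation strategy.

```python
from collections import defaultdict

def build_index(triples):
    """Index objects by (subject, predicate) so each join step is a hash lookup."""
    index = defaultdict(list)
    for s, p, o in triples:
        index[(s, p)].append(o)
    return index

def count_chain_bindings(index, start, predicates):
    """Count solutions of { start p1 ?v1 . ?v1 p2 ?v2 . ... } by expanding a frontier.

    Each entry in the frontier stands for one partial solution, so duplicates
    are kept on purpose (one entry per distinct path).
    """
    frontier = [start]
    for p in predicates:
        frontier = [o for node in frontier for o in index.get((node, p), [])]
    return len(frontier)
```

With k successors per node and a chain of length n, the frontier ends up holding k^n entries, which is the growth the paragraph above describes.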
>>> In a sliding window you can build any triple pattern as it comes in. e.g. you only need to materialise the joins. Let's work that out with an example.
>>> Assume messages of the type
>>>
>>> example:person1 a :Intelligence ;
>>>                  a :Human .
>>>
>>> ASK
>>> {
>>> 	?wrong a :Intelligence .
>>>          MINUS {?wrong a :Human } # Assume a world where AI does not exist and is not allowed.
>>> }
>>>
>>> If the query returns true the validation failed.
>>>
>>> As the data comes in you can direct the BGP patterns into different FIFO queues (e.g. what is normally filled from disk in a method like getStatements in Sesame [1])
>>> and the execution is merely a straight filter between two FIFO queues.
>>>
>>> This of course means you need to know the queries in advance, but that is the same for ShEx.
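The queue-based evaluation described above might be sketched as follows, assuming one bounded message arrives as an iterable of (subject, predicate, object) tuples and the query is known in advance. The constant names and the `validation_fails` function are hypothetical, and the term strings are simplified stand-ins for real IRIs; this is not how Sesame or any other engine implements it.

```python
from collections import deque

# Simplified stand-ins for the terms used in the example query (assumptions).
RDF_TYPE = "a"
INTELLIGENCE = ":Intelligence"
HUMAN = ":Human"

def validation_fails(message):
    """Return True when the ASK query would return true, i.e. validation failed."""
    intelligence_queue = deque()  # subjects matching { ?wrong a :Intelligence }
    human_queue = deque()         # subjects matching { ?wrong a :Human }
    # Route each incoming triple to the queue of the pattern it matches.
    for s, p, o in message:
        if p == RDF_TYPE and o == INTELLIGENCE:
            intelligence_queue.append(s)
        elif p == RDF_TYPE and o == HUMAN:
            human_queue.append(s)
    # MINUS: keep subjects from the first queue that never appear in the second.
    humans = set(human_queue)
    return any(s not in humans for s in intelligence_queue)
```

The filter runs only once the whole message has been read, because a matching :Human triple may arrive after the :Intelligence triple; that is why the message has to be bounded.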
>> Thanks for the explanation, which I realize is somewhat off-topic.    I can see how for ShEx patterns, SPARQL would be as fast.   Do you know offhand how this kind of streaming technique behaves for patterns that are outside of what ShEx can do?
> Well first of all I don’t think that is an important consideration at this time anyway. But assume you do something like this.
> ASK
> {
> 	?wrong a :Intelligence .
>          SERVICE <http://realyfar.away/and/slow/sparql/system>{
>          	MINUS {?wrong a :Human } # Assume a world where AI does not exist and is not allowed.
>          }
> }
> Could easily give extremely bad performance. But that does not matter so much, because most real world programmers will
> just try coding something and see if it's fast enough for their business needs. I.e. most programmers pay attention to
> economic speed, not so much to absolute speed. What drives programmers up the wall is when their tool can’t do trivial things
> that it should. e.g. ShEx can validate that a letter is addressed correctly (has postcode, country, state etc…) SPIN can also
> check that the country, postcode and state exist and that they all match (possibly asking a remote endpoint for that information).
> I think ICV can do that too in theory but have not investigated it (assuming Stardog implements SERVICE which currently I believe it does not.)

You make a strong case.

I think the main counter argument I've heard is that it's easy to write 
SPARQL queries which constrain a graph in ways which are much harder to 
understand than ShExC.

Thoughts on that?

Maybe we could evolve a style/formatting for SPARQL that made easy 
patterns easy to see?

> SPIN has named graph support built in, I don’t know if ShEx can do that at this time.
>
> Regards,
> Jerven
>
> PS. If you don’t believe economic speed vs absolute speed, explain Ruby, Perl and Java to me

Not everyone has the same priorities?   I've no doubt that some people 
would prefer each of these options, doing their own cost/benefit 
analysis, with their own weights on the factors.

Standards tend to be shaped largely by who actually shows up week after 
week, month after month, to do the work.     It's not a perfect 
solution.    Hopefully the people doing the work take into account 
everyone else's needs to a reasonable extent.   (One challenge in 
standards is that there's still work to do after the decisions are made, 
and there's less incentive to stick around for that part.)

So, I'm thinking all this debate should be taken to the WG, and the people 
who show up (especially at the face to face meetings) can sort this out 
among themselves.

Back to the point that started this particular subthread, I'm reminded 
of a claim I heard once (in an LDP WG meeting) that Google, as a matter 
of policy, doesn't build on NP systems.  Reportedly this rules out SQL 
and SPARQL.   Alas, I haven't been able to substantiate this allegation 
(although I haven't actually tried contacts at Google - it didn't seem 
worth bothering them about).   I expect some of the motivation for the  
NoSQL movement is this kind of thinking.

      -- Sandro

>>      - Sandro
>>
>>
>>
>>> Regards,
>>> Jerven
>>>
>>>
>>> [1]
>>> http://openrdf.callimachus.net/sesame/2.7/apidocs/org/openrdf/query/algebra/evaluation/TripleSource.html#getStatements(org.openrdf.model.Resource, org.openrdf.model.URI, org.openrdf.model.Value, org.openrdf.model.Resource...)
>>>
>>>
>>>
>>>
>>>
>>>>        -- Sandro
>>>>
>>>>> Cheers,
>>>>> Kendall
>>> -------------------------------------------------------------------
>>> Jerven Bolleman                        Jerven.Bolleman@isb-sib.ch
>>> SIB Swiss Institute of Bioinformatics      Tel: +41 (0)22 379 58 85
>>> CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
>>> 1211 Geneve 4,
>>> Switzerland     www.isb-sib.ch - www.uniprot.org
>>> Follow us at https://twitter.com/#!/uniprot
>>> -------------------------------------------------------------------
>>>
>>>
Received on Monday, 21 July 2014 21:14:31 UTC