Re: comparing to OWL and SPIN from Jerven Bolleman on 2014-07-21 (public-rdf-shapes@w3.org from July 2014)

From: Jerven Bolleman <jerven.bolleman@isb-sib.ch>
Date: Mon, 21 Jul 2014 22:34:07 +0200
To: Sandro Hawke <sandro@w3.org>
Cc: Kendall Clark <kendall@clarkparsia.com>, "Peter F. Patel-Schneider" <pfpschneider@gmail.com>, "Dam, Jesse van" <jesse.vandam@wur.nl>, "public-rdf-shapes@w3.org" <public-rdf-shapes@w3.org>
Message-Id: <4AA529E5-C9D5-490C-A805-B3B1FDF0FC08@isb-sib.ch>
On 21 Jul 2014, at 21:07, Sandro Hawke <sandro@w3.org> wrote:

> On 07/21/2014 02:50 PM, Jerven Bolleman wrote:
>> On 21 Jul 2014, at 20:16, Sandro Hawke <sandro@w3.org> wrote:
>> 
>>> On 07/21/2014 01:54 PM, Kendall Clark wrote:
>>>> On Mon, Jul 21, 2014 at 1:49 PM, Sandro Hawke <sandro@w3.org> wrote:
>>>> On 07/21/2014 08:09 AM, Peter F. Patel-Schneider wrote:
>>>> It could be that the Regular Expression derivatives algorithm, although much less expressive than OWL, is outperforming the OWL reasoners.  Only some research and testing will give a useful answer, but it is certainly something nice to consider and test.
>>>> 
>>>> Yes, this could be tested.  I expect that Stardog ICV will perform very well, as it works by translation into SPARQL queries.
>>>> 
>>>> It looks to me like ShEx could validate a graph serialization in linear time (with the size of the serialization), with no need for storing the graph.  That's appealing to me when we're talking about validating messages that are being sent between systems.
>>>> No need to store the graph unless its size exceeds available memory, right? That does happen from time to time.
>>>> 
>>> When I said "store", I meant in RAM.  :)   I was thinking it would be nice to have validation as part of a streaming serializer and streaming parser.  It's nice to have those things not buffer the whole input/output before moving it on.
>> You can only do that if you know the order of triples you are going to get, i.e. bounded messages. And in any case you will have to validate on a sliding window of a number of triples; this is no different between ShEx and SPARQL. So you need an in-memory buffer on which you can execute SPARQL. At this size you most likely don’t need indexes, because you can build your binding sets on the fly.
>>>> SPARQL-based solutions require storing and searching the graph, which is exponential (and likely slow unless properly indexed), but that's probably fine if you're just validating data that you need to keep in a SPARQL system anyway.
>>>> 
>>>> Actually Stardog ICV does both; either transactionally for data under storage or in-memory for message passing and middleware contexts.
>>>> 
>>>> Also, the complexity of SPARQL query answering is well understood and it's not EXP.
>>>> 
>>> Interesting, this is what I get for stretching myself too thin across too many technologies.   I would have thought executing a query with a graph pattern like { <s> <p1> ?v1.  ?v1 <p2> ?v2. ...    ?v(n-1) <pn> ?vn } would take time proportional to k^n.   With sufficient indexing, k might be very close to 1, but without indexing, I'd think k would be the mean cardinality of p1...pn.   And of course indexing takes time.
>> In a sliding window you can build any triple pattern as it comes in, i.e. you only need to materialise the joins. Let's work that out with an example.
>> Assume messages of the type:
>> 
>> example:person1 a :Intelligence ;
>>                 a :Human .
>> 
>> ASK
>> {
>>     ?wrong a :Intelligence .
>>     MINUS { ?wrong a :Human } # Assume a world where AI does not exist and is not allowed.
>> }
>> 
>> If the query returns true the validation failed.
>> 
>> As the data comes in you can direct the triples matching each BGP pattern into different FIFO queues (i.e. what is normally filled from disk by a method like getStatements in Sesame [1]),
>> and the execution is merely a straight filter between two FIFO queues.
>> 
>> This of course means you need to know the queries in advance, but that is the same for ShEx.
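
A rough Python sketch of the two-queue evaluation quoted above, in case it helps. It is only an illustration under my own assumptions, not how Sesame or any SPARQL engine implements it: triples are plain (subject, predicate, object) tuples of strings, the message is bounded so the whole window fits in memory, and the names route and ask_wrong are made up.

```python
from collections import deque

RDF_TYPE = "rdf:type"  # stands in for the "a" keyword in the query


def route(triples):
    """Direct each incoming triple into the FIFO queue of the pattern it matches."""
    intelligence, human = deque(), deque()
    for s, p, o in triples:
        if p == RDF_TYPE and o == ":Intelligence":
            intelligence.append(s)
        elif p == RDF_TYPE and o == ":Human":
            human.append(s)
    return intelligence, human


def ask_wrong(triples):
    """Mirror the ASK ... MINUS query: True means the validation failed."""
    intelligence, human = route(triples)
    humans = set(human)  # bindings removed by the MINUS branch
    # A straight filter between the two queues: keep subjects that are
    # typed :Intelligence but never typed :Human in this window.
    return any(s not in humans for s in intelligence)
```

The answer is only known once the window is complete, which is why the message has to be bounded.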
> 
> Thanks for the explanation, which I realize is somewhat off-topic.    I can see how for ShEx patterns, SPARQL would be as fast.   Do you know offhand how this kind of streaming technique behaves for patterns that are outside of what ShEx can do?
Well, first of all, I don’t think that is an important consideration at this time anyway. But assume you do something like this:
ASK
{
    ?wrong a :Intelligence .
    # MINUS wraps SERVICE so that the remote bindings share ?wrong with the left-hand side.
    MINUS {
        SERVICE <http://realyfar.away/and/slow/sparql/system> {
            ?wrong a :Human # Assume a world where AI does not exist and is not allowed.
        }
    }
}
This could easily give extremely bad performance. But that does not matter so much, because most real-world programmers will
just try coding something and see if it’s fast enough for their business needs, i.e. most programmers pay attention to
economic speed, not so much to absolute speed. What drives programmers up the wall is when their tool can’t do trivial things
that it should. E.g. ShEx can validate that a letter is addressed correctly (has postcode, country, state, etc.). SPIN can also
check that the country, postcode and state exist and that they all match (possibly asking a remote endpoint for that information).
I think ICV can do that too in theory, but I have not investigated it (and it assumes Stardog implements SERVICE, which I believe it currently does not).
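
To make the difference between the two kinds of check concrete, here is a rough Python sketch. It is neither ShEx nor SPIN, just an analogy: the field names, the function names and the tiny reference table are all made up for illustration, and a real check would consult a proper gazetteer or a remote endpoint instead.

```python
REQUIRED = ("postcode", "state", "country")

# Hypothetical reference data: (country, state) -> known postcodes.
KNOWN = {
    ("CH", "GE"): {"1211"},
    ("US", "MA"): {"02139"},
}


def has_shape(address):
    """Structural check: every required field is present and non-empty."""
    return all(address.get(field) for field in REQUIRED)


def is_consistent(address):
    """Cross-check: the fields exist in the reference data and match each other."""
    if not has_shape(address):
        return False
    postcodes = KNOWN.get((address["country"], address["state"]), set())
    return address["postcode"] in postcodes
```

An address such as {"postcode": "02139", "state": "GE", "country": "CH"} passes the first check and fails the second, which is exactly the case the structural check alone cannot catch.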

SPIN has named graph support built in; I don’t know if ShEx can do that at this time.

Regards,
Jerven

PS. If you don’t believe in economic speed vs. absolute speed, explain Ruby, Perl and Java to me.
> 
>     - Sandro
> 
> 
> 
>> Regards,
>> Jerven
>> 
>> 
>> [1]
>> http://openrdf.callimachus.net/sesame/2.7/apidocs/org/openrdf/query/algebra/evaluation/TripleSource.html#getStatements(org.openrdf.model.Resource, org.openrdf.model.URI, org.openrdf.model.Value, org.openrdf.model.Resource...)
>> 
>> 
>> 
>> 
>> 
>>>       -- Sandro
>>> 
>>>> Cheers,
>>>> Kendall
>> -------------------------------------------------------------------
>> Jerven Bolleman                        Jerven.Bolleman@isb-sib.ch
>> SIB Swiss Institute of Bioinformatics      Tel: +41 (0)22 379 58 85
>> CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
>> 1211 Geneve 4,
>> Switzerland     www.isb-sib.ch - www.uniprot.org
>> Follow us at https://twitter.com/#!/uniprot
>> -------------------------------------------------------------------
>> 
>> 
> 

Received on Monday, 21 July 2014 20:34:56 UTC