RE: Prototype SPARQL engine based on Datalog and relation of SPARQL to the Rules Layer from Bob MacGregor on 2006-12-12 (public-rdf-dawg-comments@w3.org from December 2006)

From: Bob MacGregor <bmacgregor@siderean.com>
Date: Tue, 12 Dec 2006 11:06:36 -0800
To: "Bijan Parsia" <bparsia@cs.man.ac.uk>
CC: "Eric Prud'hommeaux" <eric@w3.org> , "axel@polleres.net" <axel@polleres.net> , "public-rdf-dawg-comments@w3.org" <public-rdf-dawg-comments@w3.org> , "public-rdf-dawg@w3.org" <public-rdf-dawg@w3.org>
Message-ID: <20061212110636347.00000005004@bmacgregor1>

Dear Bijan,

Thanks for responding.  Since I only *outlined* my arguments against SPARQL, it is easy to misinterpret them.  I will try to expand on them a bit.

A good query language should has the following attributes (among others):

(1) It should be expressive
(2) Its operators should be composable, and minimal (except for sugar-coating)
(3) Its should be pleasantly readable

For the uses we employ in a logic language, we add a couple more:
(4) It should support aggregation
(5) It should scale to very large datasets

SQL is an example of a language that meets all of these criteria.
Datalog satisfies all except that my impression is that it doesn't handle
aggregation.  So we have proof of concept.

I would be happy to use either of these with RDF.  Instead, we have SPARQL.
SPARQL violates all but the last.  Violating any of 1, 2, or 3 is grounds for divorce.

Expressivity:  

In our RDF applications, we routinely use AND, OR and NOT(UNSAID) operators, with
arbitrary nesting.  Even excluding UNSAID, we can't express our queries in SPARQL, due to the artificial limitations it places on the OR (UNION) operator.  We frequently nest AND
inside of UNSAID, and less often use OR and UNSAID.  So, SPARQL doesn't even come close to satifying our expressivity requirements.

Composable Operators:

"Composable" means that operators can be combined to produce useful expressions.
Only the AND is composable in SPARQL. The UNION operator has artificial syntactic
restrictions that don't belong.  It doesn't have NOT/UNSAID, so composability can't be 
discussed there.  Instead of a few composable operators, SPARQL has a kitchen sink
of non-composable ones.

Pleasantly Readable

Turtle syntax.  'nuff said.

Now I will address a few specific comments:

>> I totally fail to see why having *convenience syntax* for expressivity ...

I assume you are considering the UNSAID operator to be "convenience syntax".
If we ignore nested AND/OR/UNSAID within UNSAID, then it is indeed a convenience syntax.  That assumes a SPARQL mind-set, which I am not assuming.  I believe its not the case that
OPTIONAL/UNBOUND can simulate UNSAID expressions with compound arguments.

>> Even with the qualifier "Almost" I don't believe this is true, and if it happened to be 
>> true, it's not the case that UNA and CWA necessarily help the scale of reasoning.

They help if the alternative is using millions of  owl:minCardinality  and owl:disjointWith statements to accomplish the same thing.  No one I know does that, instead they either assume UNA/CWA or they don't use aggregation.  Also, they either have millions of  owl:maxCardinality  statements or they aren't using  owl:allValuesFrom.

>>RDF and RDFS has neither and Steve Harris has reported a store holding over 1.5 billion 
>> triples and responding in user sufficient real time to various classes of SPARQL queries.

My guess is that they aren't computing any aggregate expressions.

>>This is a different issue. If you want UNA and CWA it is certainly pragmatically a non-answer 
>> to say,"Hey, you can add it yourself!", at least, IMHO.

I'm not sure if you are saying "its OK to add UNA and CWA yourself".  If you are, then you are 
abandoning basic precepts of SPARQL and the "semantic layer cake".  Plus, you are inventing your
own semantics.  However, WE are adding UNA and CWA ourselves, because we have no choice.  I observed Mike Dean using UNA in a large-scale application that we worked on together, and he was not apologetic, but he did acknowledge that doing so inherently contradicts OWL semantics (since it is non-monotonic).  It would be much preferable if we could use UNA and CWA in combination with RDF and have Pat's blessing.

From the IBM paper you reference:
>>We apply the filtering criteria described in Section 3.2 to a canonical summary
>>Abox AŒ. For correctness with respect to cardinality restrictions, we need to
>>augment the canonical summary Abox with role assertion statistics, since role
>>assertions are merged by the summary Abox transformation. For each role R
>>that is part of a cardinality restriction, we associated with R the maximum
>>number of R-neighbors that any individual a has in A.

I can't be sure, but it sounds like they are asserting  maxCardinality restrictions
(a few hundred thousand?) into the ABox.  If so, they have frozen the ABox.  If their
application has no updates, then that's valid.  However, every application we work with
allows for introduction of new data (updates).  In that case, the operation of adding  maxCardinality assertions would have to be done and undone over and over.  We find that assertion/deletion is an expensive operation.  So, I would claim that IBM's results are valid for non-static applications only if the time to make and unmake the maxCardinality assertions is factored into their timings.  They of course didn't do that, because they were looking for low numbers, rather than realistic ones.

>>Finally, current SPARQL has Logspace datacomplexity

The only complexity that matters is average case complexity.  Anything with average-case complexity of N-squared or greater is non-scalable.  Logspace complexity doesn't meet that criteria, so is irrelevant.

Later on you quote some empirical results, which reflect average case complexity, so those are
indeed relevant.  However, I acknowledge that SPARQL can scale; its OWL that can't scale, unless
you ignore its core operators, or freeze your Abox (a bit of a contractiction with the notion
of "large scale").

Why is OWL relevant to SPARQL?  Well, Pat was using the "semantic layer cake" as an excuse for requiring open world semantics.  Given that the upper, open-world layers do not scale, we would prefer to see adoption of upper closed-world layers that do scale. 

Cheers, Bob

-----Original Message-----
From: Bijan Parsia [mailto:bparsia@cs.man.ac.uk] 
Sent: Tuesday, December 12, 2006 08:30
To: Bob MacGregor
Cc: Eric Prud'hommeaux; axel@polleres.net; public-rdf-dawg-comments@w3.org; public-rdf-dawg@w3.org
Subject: Re: Prototype SPARQL engine based on Datalog and relation of SPARQL to the Rules Layer

On Dec 7, 2006, at 5:07 PM, Bob MacGregor wrote:

> I realize that the UNSAID issue is closed.  However, it shouldn't have 
> been.

I agree with this, but not for the reasons given below. As I pointed out on behalf of Axel,

> Here's what Pat Hayes said about it:
>
>
>   If SPARQL contains UNSAID then it will be inconsistent with any
>   account of meaning which is based on the RDF/RDFS/OWL normative
>   semantics. This will not render SPARQL unusable, but it will place 
> it
>   outside the 'semantic web layer cake' and probably lead to the
>   eventual construction of a different, and rival, query language for
>   use by Web reasoners.

Let me just interject that this argument does nothing for me. I totally fail to see why having *convenience syntax* for expressivity
*already* in the language (via unbound) magically breaks the architecture. (Oh, and I'm unconvincable on this point, Pat, so please don't bother :)) Generally, it's much easier and safer to shove funky expressivity into query languages than it is to do so in the representation formalism itself.

And frankly, if I were going to bolt and try to set up a SPARQL rival, "UNSAID" wouldn't weigh it *at all*. Turtle syntax is way more likely to send *me* running :)

> Unfortunately, Pat has things exactly backwards.  The omission of 
> UNSAID INCREASES  the odds that a rival query language will be 
> constructed for use by Web reasoners, for the simple reason that 
> UNSAID is useful for a great many things.

Expressively useful, yes.

>   Almost all LARGE SCALE reasoning assumes both unique name assumption 
> and closed world assumption.

Even with the qualifier "Almost" I don't believe this is true, and if it happened to be true, it's not the case that UNA and CWA necessarily help the scale of reasoning.

RDF and RDFS has neither and Steve Harris has reported a store holding over 1.5 billion triples and responding in user sufficient real time to various classes of SPARQL queries.

Pellet has a UNA mode which can actually slow down reasoning :) (It's funny, sometimes adding negation, e.g., disjoints, can radically speed up reasoning (hey! the clash is obvious!) or it can slow things down (damn, I have to try a bunch of node merges that fail))

> The numbers simply
> make it impractical to assert  owl:differentFrom   or
> owl:maxCardinality
> over and over.

This is a different issue. If you want UNA and CWA it is certainly pragmatically a non-answer to say, "Hey, you can add it yourself!", at least, IMHO.

> The "semantic web layer cake" that Pat refers to is currently designed 
> so that it only applies to relatively small knowledge sets (say, below 
> 1 million triples).

I have no idea what the grounds for this are, but it certainly seems to me to be false in many points.

I believe Pellet, Racer, FaCT++ and KAON2 can scale, given reasonable server hardware, to more than a "million triples" (but this is meaningless anyway; much depends on the *nature* of the axioms encoded by those triples). If they cannot, it's *really* hard to see how adding UNA or some sort of *non-mon* (non-mon makes things
harder!!!) would magically help. Even restricting to a CWA, it's hard to see how it would help. (And I'd need to hear in considerable more detail exactly what sort of CWA you wish to add to SHOIN to make it "scale better".)

Finally, current SPARQL has Logspace datacomplexity, which indicates that SPARQL query answering is in relationalland. There are a number of fragments of OWL that either have logspace datacomplexity (e.g., DL Lite, RDFS(DL)) or PTime data complexity (e.g., EL++, DLP, and
hornSHIQ) at least for various key inference services. (E.g., EL++ is PTIme complete in the size of data for consistency, subsumption, concept satisfiability, and instance checking (not surprisingly, given their interreducibility) but for conjunctive query, it's only known that it's PTime-hard.

None of these impose UNA or CWA. Of course, some of them are rather weak on equality, which helps.

This is, of course, ignoring new research. To pick just one newer one, we see from IBM:
 <http://iswc2006.semanticweb.org/items/Kershenbaum2006qo.pdf>

I direct your attention to table 2. A 6 million role assertion ontology (with many type assertions as well) can be checked for consistency in 485 seconds. While they list the expressivity as SHIN, I think it's problably hornSHIQ, so if you were tuned for that, you might do much better. I'll also note that the times as you scale up the LUBM in that chart is roughly linear.

This doesn't mean that you can do useful arbitrary conjunctive query on SHIN or hornSHIQ kbs, but it's highly suggestive. No UNA or CWA.
I'll also point out that this is preliminary work, and using a version of Pellet that I *know* is not nearly pushing the boundaries of want moderately good engineering, much less novel techniques, can do.

This is a complex topic with a lot of parameters left underspecified at the moment. But I believe the considerations I've raised as sufficient to block your reasoning. I'd be very interested in a survey of large scale, in your sense, reasoning that goes beyond database expressivity (and beyond datalog expressivity) and was shown to be hopeless without CWA and UNA and when these were added in, they became all peanuts and good cheer. I would be interested in just one example, actually.

> That means that rival standards that
> CAN handle larger datasets are certain to emerge.

I fail to see how lack of an explicit construct for negation as failure will change the scalability of SPARQL one whit. I do think that the lack is unfortunate for users who could put it to good use, as I've argued before. I dislike having such a feature accessible in a contorted way. But eh. Let the vendors support an alternative.
There's not so many we couldn't get agreement, especially with a simple rewriter adding compatibility. If we had a nice XML syntax, this would be an easy XSLT.

My recommendation to the DAWG is that if you are not moved by the user considerations to reopen the issue (which is reasonable), then there is no "scalability" argument of any sort that counts as new information. And frankly, if I were you all, I'd dismiss scare tactic arguments *either way* out of hand. Though, aherm, perhaps with a bit of that thing they call "tact". :)

Cheers,
Bijan.
Received on Tuesday, 12 December 2006 19:09:48 UTC