
Re: Prototype SPARQL engine based on Datalog and relation of SPARQL to the Rules Layer

From: Bijan Parsia <bparsia@cs.man.ac.uk>
Date: Tue, 12 Dec 2006 20:33:59 +0000
Message-Id: <CD68EF1C-943E-4E80-AD09-751ACD84469F@cs.man.ac.uk>
Cc: "Eric Prud'hommeaux" <eric@w3.org>, "axel@polleres.net" <axel@polleres.net>, "public-rdf-dawg-comments@w3.org" <public-rdf-dawg-comments@w3.org>, "public-rdf-dawg@w3.org" <public-rdf-dawg@w3.org>
To: "Bob MacGregor" <bmacgregor@siderean.com>

On 12 Dec 2006, at 19:06, Bob MacGregor wrote:

>
> Dear Bijan,
>
> Thanks for responding.  Since I only *outlined*

Well, not by any understanding of "outlined" that I have, but I  
accept your point that your email contained an enthymeme.

> my arguments against SPARQL, it is easy to misinterpret them.

I interpreted your email as written. Better than that I cannot do.

> I will try to expand on them a bit.
>
> A good query language should have the following attributes (among  
> others):
>
> (1) It should be expressive
> (2) Its operators should be composable, and minimal (except for  
> sugar-coating)
> (3) It should be pleasantly readable
>
> For the uses we employ in a logic language, we add a couple more:
> (4) It should support aggregation
> (5) It should scale to very large datasets
>
> SQL is an example of a language that meets all of these criteria.

I guess 3 is in the eye of the beholder.

> Datalog satisfies all except that my impression is that it doesn't  
> handle
> aggregation.

Depends on the query language. You could use SQL like queries against  
a Datalog KB (depending on the exact flavor).

>   So we have proof of concept.
>
> I would be happy to use either of these with RDF.  Instead, we have  
> SPARQL.
> SPARQL violates all but the last.

Unless you mean something very specific, it's hard to argue that  
SPARQL is not expressive. It is, of course, easy to argue that it  
lacks expressivity that is vital to you.

> Violating any of 1, 2, or 3 is grounds for divorce.
>
> Expressivity:
>
> In our RDF applications, we routinely use AND, OR and NOT(UNSAID)  
> operators, with
> arbitrary nesting.  Even excluding UNSAID, we can't express our  
> queries in SPARQL, due to the artificial limitations it places on  
> the OR (UNION) operator.  We frequently nest AND
> inside of UNSAID, and less often use OR and UNSAID.  So, SPARQL  
> doesn't even come close to satisfying our expressivity requirements.

Don't see what this has to do with UNA and CWA and scaling. But I  
accept that SPARQL does not have the expressivity you need. Sorry  
about that.

> Composable Operators:
>
> "Composable" means that operators can be combined to produce useful  
> expressions.
> Only the AND is composable in SPARQL. The UNION operator has  
> artificial syntactic
> restrictions that don't belong.  It doesn't have NOT/UNSAID, so  
> composability can't be
> discussed there.  Instead of a few composable operators, SPARQL has  
> a kitchen sink
> of non-composable ones.

Please see the new algebra based on the excellent work in:
	http://iswc2006.semanticweb.org/items/Arenas2006bv.pdf

> Pleasantly Readable
>
> Turtle syntax.  'nuff said.

Not my favorite either. But it's workable enough for simple things.

Ok, that's one hell of an enthymeme.

> Now I will address a few specific comments:
>
>>> I totally fail to see why having *convenience syntax* for  
>>> expressivity ...
>
> I assume you are considering the UNSAID operator to be "convenience  
> syntax".

Yes.

> If we ignore nested AND/OR/UNSAID within UNSAID, then it is indeed  
> a convenience syntax.  That assumes a SPARQL mind-set, which I am  
> not assuming.

I don't even understand that. Hmm. Ok, I see. In this context, I  
believe we were discussing *even* an UNSAID that was just an  
abbreviation for UNBOUND (plus some other machinery). My point  
against Pat is that once UNBOUND is in there, it hardly seems worse  
to give it convenience syntax. Additional expressivity has to be  
evaluated on its own terms, e.g., complexity, implementation impact,  
use cases, etc.
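
To be concrete, here is a sketch of the sugar I have in mind (the  
vocabulary and the UNSAID keyword itself are invented for  
illustration; only the OPTIONAL/!bound form is actual current SPARQL):

```sparql
# Hypothetical convenience syntax -- UNSAID is not in the draft:
SELECT ?x
WHERE {
  ?x a :Person .
  UNSAID { ?x :email ?e }
}

# ...read as an abbreviation for the familiar OPTIONAL/!bound idiom:
SELECT ?x
WHERE {
  ?x a :Person .
  OPTIONAL { ?x :email ?e }
  FILTER ( !bound(?e) )
}
```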

Since we are on a list discussing SPARQL, if you have other  
assumptions I think you need to make them clear.

>   I believe it's not the case that
> OPTIONAL/UNBOUND can simulate UNSAID expressions with compound  
> arguments.

That's possible. I'd need to know more about the particular  
expressions you have in mind.

>>> Even with the qualifier "Almost" I don't believe this is true,  
>>> and if it happened to be
>>> true, it's not the case that UNA and CWA necessarily help the  
>>> scale of reasoning.
>
> They help if the alternative is using millions of   
> owl:minCardinality  and owl:disjointWith statements to accomplish  
> the same thing.

Not clear. Actually, you seem to be rather randomly spraying  
constructs around. minCardinality? disjointWith? Don't you mean  
differentFrom?

What I didn't get was how you separated reasoning in general (and  
how it scales) from reasoning where the intent is to have UNA and  
CWA. These are different things.

> No one I know does that, instead they either assume UNA/CWA or they  
> don't use aggregation.

Aggregation is a separate issue, though I see now that it was central  
to your thinking when you wrote your prior email.

Aggregation is, of course, semantically tricky for the languages we  
might query with a SPARQL variant. For example, many people think  
that BNodes in results are countable (and we do have a way of  
counting them...but they cannot merely denote distinct objects,  
obviously, so they have to be counting something else, e.g.,  
"answers"; of course, how users will interpret that is a different  
issue; see the archives for discussion of DISTINCT and  
REALLYREALLYDISTINCT and friends).
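
To illustrate with a sketch (COUNT here is a hypothetical aggregate  
extension, and the data is invented; SPARQL as drafted has no  
aggregates):

```sparql
# Data (Turtle):
#   _:b1 :worksFor :W3C .
#   _:b2 :worksFor :W3C .

# Hypothetical aggregate extension -- not in the current draft:
SELECT COUNT(?who)
WHERE { ?who :worksFor :W3C }

# Counting *answers* gives 2; counting *individuals* gives only
# "at least 1", since nothing prevents _:b1 and _:b2 from denoting
# the same person. Without the UNA only the first reading is well
# behaved.
```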

>   Also, they either have millions of  owl:maxCardinality   
> statements or they aren't using  owl:allValuesFrom.

Well, our experiences obviously differ. I cannot say how valid yours  
are, or how general, since there isn't any data for me to look at.

>>> RDF and RDFS has neither and Steve Harris has reported a store  
>>> holding over 1.5 billion
>>> triples and responding in user sufficient real time to various  
>>> classes of SPARQL queries.
>
> My guess is that they aren't computing any aggregate expressions.

Certainly true, but that's not because of the difficulty per se. He  
*is* treating the data more or less in a database fashion, I believe.

>>> This is a different issue. If you want UNA and CWA it is  
>>> certainly pragmatically a non-answer
>>> to say,"Hey, you can add it yourself!", at least, IMHO.
>
> I'm not sure if you are saying "its OK to add UNA and CWA yourself".

What? No. I'm saying that if you want your KB to be governed by the  
UNA and CWA, i.e., it's part of your modeling, it's a total nonanswer  
to say "Oh, just add a bunch of inequalities and use nominals to  
close your domain, and the roles, and the concepts". Just as I think  
it's rather a non-answer to say, "Oh, no UNSAID of any sort for you!  
Just simulate it with UNBOUND" even for the cases where it's possible.

However, I also want the group to finish :) And they do too. UNSAID  
can be added as an extension and standardized later.

In any case this (that it's bloody insane to have to manually impose  
the UNA when you intend the UNA) is a vastly different claim than  
"reasoning won't scale without UNA and CWA" which is, in fact, what  
you claimed.

> If you are, then you are
> abandoning basic precepts of SPARQL and the "semantic layer cake".

? Well, frankly, I don't care about the semantic layer cake a whit. I  
don't know what the "basic precepts" of SPARQL are. There is a problem
if you want these things in RDF or OWL as the specs don't support  
them. However, as I said, Pellet has a mode that allows the UNA and  
we've extended it in a number of ways with nonmon constructs. I tried  
to force some of the issues to a head in this group, since I think we  
do have an implicit UNA that covers BNodes as well. (Note: I didn't  
plan to go down that route...I was working it out as we went along as  
is too often the case.) And I think that's the right choice, given  
what implementations and users actually do. But I wish the W3C would  
face that fact and try to come up with a plan for dealing with it.

> Plus, you are inventing your
> own semantics.  However, WE are adding UNA and CWA ourselves,  
> because we have no choice.  I observed Mike Dean using UNA in a  
> large-scale application that we worked on together, and he was not  
> apologetic, but he did acknowledge that doing so inherently  
> contradicts OWL semantics (since it is non-monotonic).  It would be  
> much preferable if we could use UNA and CWA in combination with RDF  
> and have Pat's blessing.

I fail to see what Pat's blessing has to do with it. However, if you  
want to interoperate, it would be a good idea to write down in a  
somewhat specy form what you are doing. Make it a W3C submission if  
you'd like, which would give it visibility.

> From the IBM paper you reference:
>>> We apply the filtering criteria described in Section 3.2 to a  
>>> canonical summary
>>> Abox A'. For correctness with respect to cardinality  
>>> restrictions, we need to
>>> augment the canonical summary Abox with role assertion  
>>> statistics, since role
>>> assertions are merged by the summary Abox transformation. For  
>>> each role R
>>> that is part of a cardinality restriction, we associated with R  
>>> the maximum
>>> number of R-neighbors that any individual a has in A.
>
> I can't be sure,

Why not?

> but it sounds like they are asserting  maxCardinality restrictions

No.

> (a few hundred thousand?) into the ABox.

No.

>   If so, they have frozen the ABox.

Not really. (And this is the *summary* not the actual ABox.)

>   If their
> application has no updates, then that's valid.

This is "inside" a reasoner, roughly speaking. It's not done at the  
user level. The process would update if there were a change to the  
ontology, just the way a tableau reasoner would have to update its  
completion graph.

>   However, every application we work with
> allows for introduction of new data (updates).  In that case, the  
> operation of adding  maxCardinality assertions would have to be  
> done and undone over and over.  We find that assertion/deletion is  
> an expensive operation.

Well, it depends. They are trying to be very careful about making  
these internal operations relatively cheap and a lot could be done  
with a maintenance approach (as we've applied to completion graphs).

>   So, I would claim that IBM's results are valid for non-static  
> applications only if the time to make and unmake the maxCardinality  
> assertions is factored into their timings.  They of course didn't  
> do that, because they were looking for low numbers, rather than  
> realistic ones.

Before you libel people, why not try to understand the work?

At this point, I am continuing in the expectation that you will  
recognize that this was offensive and apologize.

>>> Finally, current SPARQL has Logspace datacomplexity
>
> The only complexity that matters is average case complexity.

I suspect you don't actually mean "average case complexity" in the  
technical sense, e.g.:
	http://www.uncg.edu/mat/avg.html

Since it's perfectly possible to have a good average case complexity  
with respect to some distribution and have *all the real problems  
people want to solve* be in the worst case (or in a worse one).

>   Anything with average-case complexity of N-squared or greater is  
> non-scalable.

An interesting opinion. For what distribution?

> Logspace complexity doesn't meet that criterion, so is irrelevant.

So relational databases aren't scalable? Interesting. Datalog isn't  
scalable? Interesting.

> Later on you quote some empirical results, which reflect average  
> case complexity,

I see, you do mean "common user case" complexity, not average case  
complexity. (After all, the samples weren't randomly generated, so  
how can they reflect average case complexity!) I think you didn't  
really understand the data complexity argument. My point was that if  
you have logspace *data* (worst case) complexity, then you are  
roughly in the same boat as relational databases. (Of course, you may  
have a different average case complexity, or "real case" complexity,  
etc. etc. etc.). My point was that there are clearly KR languages  
without CWA and UNA which plausibly scale. Now, they may give up  
other things that you want, like aggregation, but since your email  
didn't *mention* aggregation, I don't think I'm at fault for not  
getting this.

And there's an issue about what goes into the query language vs. what  
goes into the representation language.

> so those are
> indeed relevant.  However, I acknowledge that SPARQL can scale; its  
> OWL that can't scale,

That's what I've shown *not* to be the case, at least for OWL DL  
(well, minus nominals in the IBM case; but in any case, reasoners  
like Pellet, KAON2, FaCT++, and Racer do better than you seem to  
expect, and could do better still).

> unless
> you ignore its core operators,

I suggest you examine EL++, which is indeed a fragment of OWL DL, but  
a very expressive one (it does give up allValuesFrom, but includes  
nominals and GCIs of various sorts). I don't know if allValuesFrom is  
a "core" operator, since I don't know there's a technical meaning for  
"core" (EL++ is generally accepted as a very expressive, if  
subboolean, description logic). If you fit into EL++ great. If you  
don't, then a lot depends on your particular problem.
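
For a flavor of what does and doesn't fit (toy axioms, invented for  
illustration, not drawn from any particular ontology):

```latex
% Fine in EL++: conjunction, existential restrictions, nominals, GCIs
Person \sqcap \exists hasChild.Person \sqsubseteq Parent
\exists citizenOf.\{USA\} \sqsubseteq USCitizen
% Not in EL++: value restrictions (allValuesFrom), e.g.
%   Parent \sqsubseteq \forall hasChild.Person
```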

> or freeze your Abox (a bit of a contradiction with the notion
> of "large scale").

Well, since this was based on your misunderstanding, I don't see it  
bolsters your case at all.

> Why is OWL relevant to SPARQL?  Well, Pat was using the "semantic  
> layer cake" as an excuse for requiring open world semantics.  Given  
> that the upper, open-world layers do not scale, we would prefer to  
> see adoption of upper closed-world layers that do scale.

Well, again, your assertion is unsupported to say the least, at least  
for the general case. Again, I totally concede that *if you want CWA  
and UNA in your KB* for modeling purposes, then it's really mean to  
say, "Oh, encode it yourself". But if you say, "The reason I want CWA  
and UNA is so that it'll SCALE" then you have not made the case in  
the least. Even for aggregation (though there are surely issues with  
what aggregation *means* in the case of OWL, it's hardly sensible to  
claim that a little CWA and UNA sprinkled haphazardly will clarify  
everything. After all, nonmon generally makes reasoning *harder*,  
although <http://logcom.oxfordjournals.org/cgi/content/abstract/6/2/295>  
is interesting!) Before replying I would suggest that you  
read my initial email a little more carefully, and check out some of  
the pointers too with a bit more care. (Note, of course I don't mean  
to suggest that existing reasoners definitely scale to your problem!  
I don't know what your problem is! It would require some examination  
to determine what's going on.) And I would appreciate more precision  
in the problem statements. (I.e., if this is about aggregation, let's  
stick to aggregation.)

But, this is not the appropriate list. public-owl-dev might be more  
appropriate as we are discussing a possible next OWL working group. A  
UNA flag, at least, would be quite an easy and reasonable addition.  
At OWLED 2006 we discussed some variants of constraints (a la the K  
and A operators) which might help you. I doubt that these would get  
into the very next version of OWL as things are still unclear at this  
stage, but they are certainly something for which we might be able  
to get de facto support across the board in tools.

But coming back to *this* list, I don't see that the group has to do  
much in reply to your general problems with SPARQL since it's  
unlikely that such changes will be made in this iteration. UNSAID is  
only reasonably considered since there *is* a version which is just  
syntactic sugar and thus, in a sense, doesn't require a large revision  
to the language. But even that probably won't get in now, so the  
sensible move is a W3C note and lobbying the implementors.

Aggregation is in a bit better shape for "no reasoning" SPARQL, as  
Jorge Perez et al. have done a very nice job of specifying how many answers  
to expect (for graph matching SPARQL). Obviously, there are a host of  
issues when you pop this up to OWL, but for that, I suggest OWLED as  
the appropriate venue. I'll be trying to get together a "SPARQL/DL"  
document for it, so there's interest and attention.

Cheers,
Bijan.
Received on Tuesday, 12 December 2006 20:35:02 GMT
