Re: [TF-ENT] RDFS entailment regime proposal from Birte Glimm on 2009-09-28 (public-rdf-dawg@w3.org from July to September 2009)

From: Birte Glimm <birte.glimm@comlab.ox.ac.uk>
Date: Mon, 28 Sep 2009 22:08:32 +0100
To: SPARQL Working Group <public-rdf-dawg@w3.org>
Message-ID: <492f2b0b0909281408p3ff8a407s1eff31e74cf36d83@mail.gmail.com>

Andy,
scalability is important, but it is not the only driving factor for
me. I am still hesitant to have MAY instead of MUST because we would
then specify a system behavior that tolerates violating the RDFS
entailment lemma from the RDF spec under the RDFS entailment regime. It
can give better performance under an RDFS entailment regime, but
interpreting blank nodes as normal names would also give you much
better performance in many cases, and nevertheless that is not what is
done, nor what should be done.
I want to understand the consequences of such a change, and since
it can violate very basic underlying principles, such as the RDFS
entailment lemma, I think one should be very careful with such a
change.

Apart from scalability, consistent behavior of SPARQL engines under
an RDFS entailment regime is also important to me. What is not good
from an interoperability point of view is that one system gives you
answers A and another gives you answers B or, in this case, one system
answers the query and another says the data is inconsistent. Which
system is correct? Both, because the one that gave an answer just
didn't see the inconsistency? If you query the same data twice with
the same query, can it happen that you get an answer the first time,
then the system answers some other query, maybe from another
user, which makes it recognize the inconsistency, and when I then ask
the same query again I get an inconsistency message? I would not find
that nice behavior.
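
To make that scenario concrete, here is a minimal sketch. It assumes a
hypothetical engine that checks consistency lazily, only for the part of
the data a query touches, and remembers any inconsistency it has seen.
The class, the partition names, and the error type are invented for
illustration and are not taken from any SPARQL implementation.

```python
class InconsistentGraphError(Exception):
    """Hypothetical error signalling that the queried graph is inconsistent."""


class LazyEngine:
    """Toy engine that checks only the partitions a query touches (assumption)."""

    def __init__(self, partitions):
        # partitions: name -> (answers, is_consistent)
        self.partitions = partitions
        self.known_inconsistent = False

    def query(self, partition):
        answers, is_consistent = self.partitions[partition]
        if not is_consistent:
            # The inconsistency is discovered only when this part is touched.
            self.known_inconsistent = True
        if self.known_inconsistent:
            raise InconsistentGraphError("graph is inconsistent")
        return answers


def run(engine, partition):
    """Return the answers, or the string 'inconsistent' if an error is raised."""
    try:
        return engine.query(partition)
    except InconsistentGraphError:
        return "inconsistent"


engine = LazyEngine({
    "people": (["ex:alice"], True),   # part of the data without problems
    "documents": ([], False),         # part containing the inconsistency
})

first = run(engine, "people")      # answered: the bad part was never touched
other = run(engine, "documents")   # another user's query hits the bad part
second = run(engine, "people")     # same query as `first`, different outcome
print(first, other, second)
```

In this sketch the outcome of a query depends on which queries happened
to be answered before it, which is exactly the behavior I am worried about.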

It is definitely something we should discuss in the telcon if we have
the time, and if not, I would like to have some more opinions on that
and some more explanations of the effects that such a change would
have.

Birte

2009/9/28 Seaborne, Andy <andy.seaborne@hp.com>:
>
>
>> -----Original Message-----
>> From: b.glimm@googlemail.com [mailto:b.glimm@googlemail.com] On Behalf Of
>> Birte Glimm
>> Sent: 28 September 2009 16:55
>> To: Seaborne, Andy
>> Cc: SPARQL Working Group
>> Subject: Re: [TF-ENT] RDFS entailment regime proposal
>>
>> [snip]
>>
>> >> Well, but under RDFS semantics you have to check consistency first
>> >> anyway since an inconsistent graph entails all tuples. Bad lexical
>> >> forms are not causing an inconsistency, only when combined with an
>> >> assertion that the range of the used property/predicate is
>> >> rdfs:Literal or rdf:XMLLiteral. Thus, if you parse a data set and find
>> >> a literal that has a bad lexical form, you better check consistency
>> >> anyway and after that you know whether your data is legal or not.
>> >> Also, if a user asks
>> >> SLEECT ?x WHERE { ?x <ex:b> <ex:c> . }
>> >> I would expect an error because I wrote SLEECT instead of SELECT and I
>> >> should be told that the query is not a legal query. Similarly
>> >> SELECT ?x WHERE { ?x <ex:b> <ex:c> <ex:forthInATriple> . }
>> >> should give me an error, right?
>> >
>> > Yes, it's a syntax error but I don't see how it is connected.  It can be
>> > determined statically from the query string.
>>
>> Well, it is only connected in that I wanted to establish whether you
>> think that an illegal, mal-formed query should result in an error or
>> not. That is clear now, so we disagree about illegal data.
>
> And this is a general issue, not just RDFS: D-entailment, rules.
>
> The concern is scalability but I see no mention of this below.
>
>> > Strictly, it's not a SPARQL query string and what a service does with
>> > that is outside the spec because the spec only defines what happens with
>> > query strings that match the grammar and says nothing about non-matching
>> > strings.  The SPARQL protocol error exists because the restriction is that
>> > it is a SPARQL query string.
>> >
>> > But in the RDFS entailment case it's the data at issue. For scalability,
>> > I'd like to see a processor that can process the query and get the answers
>> > be able to return them.  As proposed it's an error - it's not now outside
>> > the spec; it's covered by the spec and explicitly wrong.  But if a
>> > processor can perform BGP matching without needing to touch the whole
>> > graph, then I think that should be allowed.  Similarly, if it can start
>> > generating answers and then finds a problem, a required error (and no
>> > results) means the processor can't stream and has to buffer all results
>> > before it sends any, which is a potentially huge cost.
>> >
>>
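
The streaming cost described in the quoted paragraph above can be
illustrated with a minimal sketch. It assumes a hypothetical matcher
that produces solutions one at a time and raises an error when it hits
an inconsistency; all names are invented for illustration.

```python
class InconsistentGraphError(Exception):
    """Hypothetical error for an inconsistency found during matching."""


def match(solutions, bad_index=None):
    """Yield solutions one by one; raise when reaching position `bad_index`."""
    for i, solution in enumerate(solutions):
        if i == bad_index:
            raise InconsistentGraphError("inconsistency found while matching")
        yield solution


def answer_streaming(solutions, bad_index=None):
    """Send each solution as soon as it is found."""
    sent = []
    try:
        for solution in match(solutions, bad_index):
            sent.append(solution)  # already delivered, cannot be retracted
    except InconsistentGraphError:
        return sent, "error after partial results"
    return sent, "ok"


def answer_buffered(solutions, bad_index=None):
    """Hold every solution until matching has finished without an error."""
    try:
        buffered = list(match(solutions, bad_index))  # memory grows with results
    except InconsistentGraphError:
        return [], "error, no results"
    return buffered, "ok"


data = ["ex:a", "ex:b", "ex:c"]
streamed = answer_streaming(data, bad_index=2)
buffered = answer_buffered(data, bad_index=2)
print(streamed)
print(buffered)
```

A required error with no results corresponds to `answer_buffered`: nothing
can be sent until the last solution has been checked. The streaming variant
avoids the buffer but may already have delivered answers when the problem
is found.
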
>> Again, there can be illegal graphs due to inconsistencies or due to
>> just mal-formed RDF. I think you do want a different behavior for
>> inconsistent graphs. If I have mal-formed RDF, I don't see why any
>> system should just silently swallow that, see again data such as
>> <ex:a> <ex:b> <ex:c> <ex:d> .
>> That just is no RDF graph and I would want my system to tell me that I
>> wrote mal-formed RDF and I think you do as well. Thus, we can discuss
>> whether inconsistent graphs should be illegal or not I assume. You
>> propose, you read/load the data
>
> No.  Maybe the data is loaded, maybe it's partly loaded and an on-demand scheme is used.
>
>> and then, when you get a query, you
>> start finding answers, apply some (entailment) rules while you do that
>> (because after all we do want some entailments under an RDFS
>> entailment regime) and happily keep finding answers and return them
>> until you come to a point where you apply a rule and detect an
>> inconsistency. At that point you want to stop or you would simply
>> continue? What would you tell the user? Would you say anything?
>
> My point is why should the spec tell me that I have to do things one particular way.  For small systems, the provider might want to provide an exception but a system for large scale data may be unwilling to generate an error unless it is encountered.
>
> e.g.
>
> ASK { ?x :p :z }
>
> Or even
>
> ASK { :x :p :z }
>
>> Give a
>> warning that actually what you said before is still valid, but the
>> user should please be aware of the inconsistency?
>> What could also happen is that you know from some analysis that you
>> only need to look at a certain part of the graph and that part is fine
>> and you answer a query by only touching that part. But now another
>> query touches another part, and that part actually contains an
>> inconsistency that you could discover while you try to find the
>> answers to the query, right?
>> In that case, the answers to your first
>> query are wrong because an inconsistent graph entails everything and
>> not just the answers that you returned.
>
> They may be wrong, they may not.  See above.
>
> A query that just requires only part of an entailment regime to be answered completely should be in scope for optimization.  The requirement to make a global determination has a scalability implication.
>
> Do you recognize that scalability is a concern some systems might have?  Or are you saying that scalability is not a primary issue and should not be considered a requirement for entailment regime designs?
>
> (noting the data may also be offered up under different entailment regimes on different endpoints) (/me avoids mentioning mixed entailment on different BGPs in the same query)
>
>> I am against this. Under RDFS, inconsistencies arise only due to
>> illegal XMLLiterals, so, yes, when you load your data,
>
> IF you load the data.
>
> The processor may not touch the literals.  Maybe it does the entailment by simple rules during query execution.
>
>> you have to
>> parse the xml and not just take it for a string. Usually that XML
>> should parse fine (after all users usually do not intend to produce
>> inconsistencies) and you can do what you suggest to do. You are
>> guaranteed not to have any inconsistencies. In case you find
>> mal-formed XML, you should better do a consistency check first and
>> only then answer queries. You might want to give a warning anyway. I
>> prefer this to having a kind of undefined behavior where you might
>> later change your mind about answers that you gave to previous
>> queries. You can do that, but I personally would not call it RDFS
>> entailment.
>>
>>
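
As a rough illustration of the load-time check described in the quoted
text above, the following sketch parses each rdf:XMLLiteral and reports
a clash only when the literal is ill-formed and the range of the
property used is asserted to be rdfs:Literal or rdf:XMLLiteral. Several
things here are assumptions: triples are plain Python tuples, typed
literals are (lexical form, datatype) pairs, the helper names are
hypothetical, well-formedness is approximated by wrapping the lexical
form in a dummy root element, and canonicalisation requirements are
ignored.

```python
import xml.etree.ElementTree as ET

XML_LITERAL = "rdf:XMLLiteral"
RANGE = "rdfs:range"
LITERAL_CLASSES = {"rdfs:Literal", "rdf:XMLLiteral"}


def is_well_formed(lexical_form):
    """Approximate check: the form parses as XML content inside a dummy root."""
    try:
        ET.fromstring("<root>" + lexical_form + "</root>")
        return True
    except ET.ParseError:
        return False


def find_clashes(triples):
    """Return triples whose ill-formed XML literal meets a literal range."""
    literal_ranged = {s for s, p, o in triples
                      if p == RANGE and o in LITERAL_CLASSES}
    clashes = []
    for s, p, o in triples:
        # Typed literals are modelled as (lexical form, datatype) pairs.
        if isinstance(o, tuple) and o[1] == XML_LITERAL:
            if not is_well_formed(o[0]) and p in literal_ranged:
                clashes.append((s, p, o))
    return clashes


bad = ("<b>unclosed", XML_LITERAL)
data = [
    ("ex:a", "ex:p", bad),            # ex:p has a literal range: clash
    ("ex:a", "ex:q", bad),            # no range asserted for ex:q: no clash
    ("ex:p", RANGE, "rdfs:Literal"),
]
clashes = find_clashes(data)
print(clashes)
```

If every XML literal parses, `find_clashes` returns an empty list and no
further consistency check is needed in this simplified model.
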
>> > The entailment doc does not specify what an error is - what had you in
>> > mind?  If it's going to be relatively undefined, then we can just say that if
>> > the data is illegal, then all bets are off i.e. it's not matching for RDFS
>> > entailment if you get any answers.
>> >
>> Well, but the point still is: Do we tell the user and at which point,
>> that all bets are off? Or can it happen that we answer some queries
>> and then suddenly say "Actually, dear user, all bets are off. I just
>> found an inconsistency." I had in mind an error (with or without
>> error number) that tells the user that the queried graph is
>> inconsistent, that we do not return any answers, but that an
>> inconsistent graph would entail all statements. If you are nice, you
>> even tell the user what caused the inconsistency.
>>
>> Birte
>
> There is no recognition here that scalability, and the related issue of streaming results are significant.
>
> Do you accept these are concerns?
>
> Infinite numbers of statements don't preclude useful answers.  I am proposing that instead of a design where an error MUST be signalled, which has scaling issues (streaming, global check of the data), the design is that it is outside the spec and an error MAY be signalled and MUST be if it affects the answers.
>
> This really is a small change and might even be argued to be there already, because if it's not a legal graph then it is outside the entailment regime anyway, isn't it?
>
> However, the wording is too categorical for me and it expresses an intent of a particular outcome.  I can see cases where the answers are what is required but the graph is illegal, where the inconsistency is somewhere that the engine need not touch.
>
>        Andy
>
>>
>> > I'm assuming "error" means something like the errors we have in FILTER
>> > evaluation, i.e. no answers at best, or the notion of "error" in other
>> > systems where it means return an error code but no answers.  A situation
>> > where an error code and answers are returned is harder to design over HTTP
>> > and may have problems with streaming (the return code is sent before the
>> > body).
>> >
>> >        Andy
>> >
>> >>
>> >> I can see your point for simple entailment, but for RDFS entailment I
>> >> would think that illegal data or query are best treated by an error.
>> >>
>> >> Birte
>> >>
>> >>
>> >> --
>> >> Dr. Birte Glimm, Room 306
>> >> Computing Laboratory
>> >> Parks Road
>> >> Oxford
>> >> OX1 3QD
>> >> United Kingdom
>> >> +44 (0)1865 283529
>> >
>



-- 
Dr. Birte Glimm, Room 306
Computing Laboratory
Parks Road
Oxford
OX1 3QD
United Kingdom
+44 (0)1865 283529
Received on Monday, 28 September 2009 21:09:07 UTC