Re: fed review from Gregory Williams on 2011-07-18 (public-rdf-dawg@w3.org from July to September 2011)

From: Gregory Williams <greg@evilfunhouse.com>
Date: Mon, 18 Jul 2011 16:35:48 -0400
To: Carlos Buil Aranda <cbuil@fi.upm.es>
Cc: SPARQL Working Group <public-rdf-dawg@w3.org>
Message-Id: <44942D3D-0CDC-41B7-9F70-98E183B059DC@evilfunhouse.com>
I'm very concerned by the issues I've highlighted regarding the evaluation semantics and the conditions surrounding the variable-endpoint form of federation. Details below, along with some followup to the other issues.


On Jul 18, 2011, at 11:12 AM, Carlos Buil Aranda wrote:

> Thanks Greg for your comments and sorry for the delay in answering them.
> 
> On 05/07/2011 10:35, Gregory Williams wrote:
>> Section 2.2
>> 
>> I think it would be a much simpler example if the SERVICE blocks weren't nested.
> I agree, but the original idea was to allow them, and since we can have subqueries in which SERVICE is allowed I do not know why nested SERVICE couldn't be there. Besides, this is an optional extension of SPARQL 1.1 and implementors should warn of what they implement in their systems. So, I prefer to keep it.

You already note that this example "requires the first SERVICE to be a federated query processor in order to be executed". Would you consider adding a reference to the service description document as a way to determine if this condition is in fact true? The SD document defines a feature for this exact situation: http://www.w3.org/TR/sparql11-service-description/#sd-basicfederatedquery

>> Section 3.1

I just noticed a typo in 3.1: s/addtion/addition/

>> 
>> "Let G := Join(G, Service(VAR, G, Transform(P), SilentOp))"
>>  I don't think this works, as the evaluation semantics should try to evaluate the Service() pattern without access to the results of evaluating the G pattern (which are needed to bind VAR)
> I do not completely understand what you mean.

What I mean here is that to execute the Service() part of this expression in a bottom-up fashion, it must be possible to evaluate it without data from G. The bottom-up semantics are defined by Query 1.1 section 18.5 ("Evaluation of Join(P1, P2)") and by Federation 1.1 section 3.1 ("Definition: Evaluation of a Service Pattern"). In this case, the left-hand side of the join (P1) is the pattern before the SERVICE block (P2). I don't think you can evaluate Service(VAR, G, Transform(P), SilentOp), because you can't invoke a service operation on a variable. You need to replace VAR with an actual URL, but the URLs that need to be substituted are produced in an entirely separate evaluation (eval(D(G), P1)).

I think this is a big problem, and it relates to another of my comments:

>> "foreach i in Ω(?var->i)"
>>  Where does Ω come from in this definition? I think it's meant to refer to results from a join that is outside the scope of this operation.
> yes, I will make that explicit.

I don't think making it explicit will help, because as currently defined it simply can't work with the existing Join() evaluation semantics. Have I misunderstood something?
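
To make the concern concrete, here is a minimal Python sketch of bottom-up evaluation. It assumes a toy algebra in which solutions are dicts and each operand of a join is evaluated on its own before the results are merged. The names `Var`, `eval_join`, `eval_service`, `p1`, `fake_remote` and `run`, and all of the data, are hypothetical and do not come from either spec.

```python
# Toy bottom-up evaluator: each operand of Join is evaluated independently.
# All names and data here are illustrative assumptions, not spec text.

class Var:
    """A SPARQL variable such as ?service."""
    def __init__(self, name):
        self.name = name


def compatible(m1, m2):
    """Two solutions are compatible if they agree on every shared variable."""
    return all(m2[k] == v for k, v in m1.items() if k in m2)


def eval_join(left, right):
    """Join(P1, P2): evaluate both sides separately, then merge compatible solutions."""
    omega1 = left()    # plays the role of eval(D(G), P1)
    omega2 = right()   # plays the role of eval(D(G), P2); it never sees omega1
    return [{**m1, **m2} for m1 in omega1 for m2 in omega2 if compatible(m1, m2)]


def eval_service(endpoint, remote):
    """Service(endpoint, ...): a concrete IRI is needed to know what to invoke."""
    if isinstance(endpoint, Var):
        raise ValueError(f"cannot invoke a service on unbound ?{endpoint.name}")
    return remote(endpoint)


def p1():
    # Stand-in for the pattern before the SERVICE block; it binds ?service.
    return [{"service": "http://example.org/sparql"}]


def fake_remote(iri):
    # Stand-in for a call to a remote endpoint.
    return [{"s": "http://example.org/a"}]


def run(endpoint):
    return eval_join(p1, lambda: eval_service(endpoint, fake_remote))
```

With a constant endpoint, `run("http://example.org/sparql")` yields one merged solution. With `run(Var("service"))` the right operand raises, because it is evaluated without the bindings that the left operand produced.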

>> "if IRI is a service URL"
>> "if IRI is a SPARQL service"
>>  How do I know if it's a SPARQL service URL or just some other URL?
> You can't, users are responsible for knowing what they query in the same way users should know what data they want to query in a SELECT

Users being responsible for knowing that is irrelevant if the spec is using language like "if IRI is a service URL" and "if IRI is a SPARQL service". Where is the spec language defining what happens if the IRI *isn't* a SPARQL service?


>> "eval(D(G), Service(IRI,G,P,SilentOp)) = Invocation( IRI, vars, P, Bindings(G, vars), SilentOp )"
>>  Where does 'vars' come from here?
> it comes from the definition header:
> Definition: Evaluation of a Service Pattern
> 
>    if IRI is a service URL and vars is the set of variables in-scope in pattern P, Ω0 a solution set with one empty solution.

OK. I missed that.

>> "with no default-graph-uri or named-graph-uri"
>>  Why aren't these allowable in the service IRI?
> Because the idea of SERVICE is to query remote SPARQL endpoints, not named graphs.

Yes, I understand that. What I'm asking is why we shouldn't allow users to specify named and default graphs in the service URL for use during the remote service invocation. Something like:

SERVICE <http://example.org/sparql?default-graph-uri=http%3A%2F%2Fwww.other.example%2Fbooks> {
 ?s ?p ?o
}
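
As a rough illustration of how such a service IRI could be built and later taken apart, here is a Python sketch that uses only the standard library. The helper names `service_iri` and `dataset_params` are hypothetical; `default-graph-uri` and `named-graph-uri` are the parameter names quoted above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def service_iri(endpoint, default_graphs=(), named_graphs=()):
    """Append dataset parameters to an endpoint URL (hypothetical helper)."""
    params = [("default-graph-uri", g) for g in default_graphs]
    params += [("named-graph-uri", g) for g in named_graphs]
    return f"{endpoint}?{urlencode(params)}" if params else endpoint


def dataset_params(iri):
    """Recover the (default, named) graph lists carried in the IRI's query string."""
    query = parse_qs(urlsplit(iri).query)
    return query.get("default-graph-uri", []), query.get("named-graph-uri", [])


iri = service_iri("http://example.org/sparql",
                  default_graphs=["http://www.other.example/books"])
print(iri)
```

This prints the same IRI as in the SERVICE example above, and `dataset_params(iri)` gives back the decoded default graph, which is what an implementation would need in order to pass the dataset along in the remote invocation.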


>> "Definition: Strongly bound variable"
>>  I think there's missing clauses for BIND and property paths (e.g. ?s :p{0} ?o should result in both ?s and ?o being strongly bound).
>> 
>> "P = SELECT E1 ... En WHERE { P1 } and ?X is strongly bound in P1 and ?X = Ei"
>>  Should include the required 'AS ?var' syntax for expressions that aren't variables
>>  Should include the option for the select expression to strongly bind the variable: (if "(Ej AS ?X)" is one of the select expressions)
>>  This ignores the possibility of ?X being strongly bound in GROUP BY or HAVING clauses
>> 
>> "P = P1 GROUP BY E1 ... En such that either there is an Ei of the form ?X or ?X is strongly bound in P1"
>>  Needs to also consider grouping expressions that are aliased ("GROUP BY (Ej AS ?X)")
>> 
>> "P = P1 HAVING ( E1 ) and ?X is strongly bound within P1"
>>  This ignores the possibility of ?X being strongly bound in a GROUP BY clause.
>> 
> yes, you are right, I will add all the suggestions to the boundedness definition.

This seemed like it was missing a lot of conditions. Are we sure we've got them all now? Can somebody with fresh eyes look over this, please?

>> "UNBOUND is not a possible value for ?Xi in BindingValues"
>>  I don't know what "not a possible value" means. "?Xi is not unbound in BindingValues"?
> it is related to the issue you noticed in the service04.arq test. I will fix that.

I'm not sure this is connected to the service04 issue. My concern was with the use of "possible" in the description. I would think UNBOUND is always a "possible" value, it just might not actually be present in "BindingValues". This might just be me being pedantic, but I'd prefer a different wording that made more explicit that the condition here is that UNBOUND can't appear in the BindingValues clause for the ?Xi variable.

>> Section 4.1
>> 
>> "It is considered a syntax error to use a variable as the first argument of a ServiceGraphPattern if that variable is not bound (at least optionally) before the execution of the SERVICE pattern"
>>  How is a query writer supposed to know in what order evaluation takes place? Asserting a syntax error based on evaluation order seems overly confusing.
> the boundedness condition makes it possible to check whether a variable is going to be bound or not, it implicitly determines the execution order, so it would be possible to throw a syntax error. Maybe a syntax error is not the best error to use there, any idea?

My main concern here is the reference to the actual "execution". The wording here would seem to imply that implementations cannot re-order joins, for example.

Also, I think there are a lot more cases than described where it's simply not possible to tell if the variable is bound at the (syntactic) point in the query where the SERVICE is used. The combination of BIND, RAND, IF, EXISTS, select expressions, extension functions, etc. make it impossible to know if a variable is going to be bound ahead of time, and these cases aren't mentioned. The definition of "strongly bound" seems intentionally conservative, so maybe these are all cases meant to be an error. If that's the case, I think this needs to be pointed out explicitly.
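
As a small illustration of why this can't be decided from the query text alone, here is a Python sketch that mimics a pattern along the lines of `BIND(IF(RAND() < 0.5, <http://example.org/sparql>, ?unbound) AS ?service)`. The function name `bind_service` and the endpoint IRI are assumptions made for the example; the sketch relies on BIND leaving the variable unbound when its expression raises an error.

```python
import random


def bind_service(rng):
    """Mimic BIND(IF(RAND() < 0.5, <iri>, ?unbound) AS ?service) for one solution."""
    solution = {}
    if rng.random() < 0.5:
        # IF picked the IRI branch, so ?service gets a value.
        solution["service"] = "http://example.org/sparql"
    # Otherwise the expression errors on ?unbound and ?service stays unbound.
    return solution


rng = random.Random(0)  # fixed seed so the run is repeatable
outcomes = ["service" in bind_service(rng) for _ in range(100)]
print(any(outcomes), not all(outcomes))
```

The same query text produces solutions where ?service is bound and solutions where it is not, so no purely syntactic check can say which will happen for a given solution.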

The discussion of "service-safeness" and "boundedness" (which elsewhere is actually 'strong boundedness') in section 2.4 seems rather disconnected from the rest of the text. These two things are defined at the end of section 3.1, but there isn't any text in 3.1 that refers to them. After these definitions are included, only in section 4.1 is "service safeness" mentioned, and then only weakly ("The Service Safeness definition ***suggests the use*** of a specific order in the execution", emphasis mine). MUST a conforming implementation execute patterns in an order suggested by the "service safeness" definition? I think this either needs much stronger definitions and normative text, or we should consider dropping the variable-endpoint form of federation entirely (punting it until next time, I suppose).



Re-reading the text about service-safeness, I notice a few more issues:

"A variable ?X is strongly bound within a graph pattern P if ... P = SERVICE t { P1 } then ?X is not strongly bound in P1 (It is not possible to guarantee that a variable will be bounded after a SERVICE execution)."

This isn't worded correctly. The whole list is introduced as a set of conditions under which ?X is strongly bound, but this list item turns right around and says ?X is not strongly bound. Moreover, I'm not sure why this wouldn't guarantee that ?X is strongly bound if it is strongly bound in P1. Without the use of SILENT, either 1) there are going to be no results, 2) there are results where ?X is bound, or 3) the entire query evaluation fails. Is that correct?

"* P is a list group graph patterns P1 ... Pn and ?X is strongly bound within some Pi."

I'm not sure where the phrase "list group graph pattern" comes from (are P1 ... Pn graph patterns contained by a single group graph pattern?). Intuitively I understand what's going on here, but I don't think it's well-defined. If you're talking about a list of graph patterns, that seems like a syntax-level thing, but you also talk about graph patterns like "P1 FILTER ( E1 )" which I don't think should be understood at the syntax-level, because something like "P1 FILTER(E1) P2" (where P1 and P2 are triple patterns) is really like "P1 . P2 . FILTER(E1)". I don't think the actual evaluation will be hurt by this (since the "list group graph pattern" condition will end up getting the right answer), but the intermediate steps end up being confusing and/or wrong with respect to what variables are actually strongly bound.
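
The FILTER point can be sketched in Python, assuming a group is represented as a flat list of tagged elements and that FILTERs are collected and applied to the group as a whole. The function `translate_group` and the tuple shapes are hypothetical, chosen only for this illustration.

```python
def translate_group(elements):
    """Translate a group's elements; every FILTER applies to the whole group."""
    patterns = tuple(e for e in elements if e[0] != "filter")
    filters = tuple(e[1] for e in elements if e[0] == "filter")
    joined = ("join", patterns)
    # Filters wrap the joined patterns regardless of where they were written.
    return ("filter", filters, joined) if filters else joined


# "P1 FILTER(E1) P2" as written in the query text
a = translate_group([("triple", "P1"), ("filter", "E1"), ("triple", "P2")])
# "P1 . P2 . FILTER(E1)"
b = translate_group([("triple", "P1"), ("triple", "P2"), ("filter", "E1")])
print(a == b)
```

Both orderings give the same structure, which is why treating "P1 FILTER ( E1 )" as a syntax-level prefix of the group gives misleading intermediate answers about which variables are strongly bound.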



Section 3.1 "Definition: Evaluation of a Service Pattern" says "Execution failures cause the query to fail." Section 4.1 says "If a solution does not bind the variable, or binds it to something which cannot resolve to a SPARQL service, that solution is eliminated." How does "execution failure" differ from not being able to "resolve [a URL] to a SPARQL service"? If you get back an HTTP 400 or 500 (or, I guess, any other response code without a valid protocol response body), how is an implementation supposed to determine if this is an "execution failure" or a situation where the endpoint URL being used failed to "resolve to a SPARQL service"?





thanks,
.greg
Received on Monday, 18 July 2011 20:36:43 UTC