Re: Formal defintion of nested optionals

From: Jorge Pérez <jperez@utalca.cl>
Date: Thu, 13 Apr 2006 13:20:36 -0400
Message-ID: <1144948836.443e88643a36c@webmail.utalca.cl>

```
Hi!

> > But when there is a nested optional pattern it appears to be a confusion,
> > for:
> >
> > {
> >   A .
> >   OPTIONAL {
> >     B .
> >     OPTIONAL { C }
> >   }
> > }
> >
> > What is the right answer? I think that there are at least two
> > posibilities:
> >
> > 1) one is to think in the composition of two binary operations, a
> > binary operation between B and C, and a binary operation between
> > A and the result of the previous operation, something like
> > OPT(A, OPT(B, C)). The formal defintion in this case is clear
> > followind the doc: S is a solution of OPT(A, OPT(B,C)) if S is a
> > solution of A and of OPT(B, C) otherwise if S is a solution to A
> > but not to A and OPT(B, C).
>
> Yes - it's opt(A, opt(B,C)).
>
> and   A OPTIONAL B OPTIONAL C is opt(opt(A,B), C)
>
> >
> > 2) the other posibility is to think of the application of one single
> > ternary operation, i.e. somethign like OPT(A, B, C). The formal
> > definition here would be: S is a solution of OPT(A, B, C) if S is a
> > solution of A and B and C, otherwise if S is a solution of A and B but
> > not of C, otherwise if S is a solution to A but not to B.
>
> Surely OPT(A, B, C) == OPT(A, OPT(B, C)) by that definition anyway?

No, it is not. It is exactly the problematic case.
I include a counter example bellow

>
> >
> > In the last draft it seems that when defining formally the optional
> > matching the editors agree with 1) for nesting when thay say
> > "a combination of two patterns...". But when they are informally
> > reading the examples of nested optional patterns I think they agree
> > with 2). When looking in some implementations (SPARQLer for example)
> > they seems to use 2. for evaluation...
>
> SPARQLer (or rather ARQ, for that is the query engine being used) is doing
> (1)
> almost exactly: ARQ builds soltuion up, rather than consider all
> possibilities
> and narrow them down.
>
> The in-memory algorithm is roughly:
>
> 1/ Calculate A to get an iterator of solutions to A
>
> 2/ For each solution to A,
>     attempt to do OPTIONAL(Z)
>     where Z is B OPTIONAL C
>       either output the results of that or the original input solution from
> A.
>
> 3/ In doing Z, it's the same, take the input solution, solve B,
>     attempt to solve C given the solution of A and B
>     Output either the solution from A fed into B fed into C
>     or just A fed into B
>

This algorithm gives a different result to doing
somethig like OPT(A, OPT(B, C)) as a composition of binary operations.
The above algorithm is doing exactly what I've named a ternary
operation OPT(A, B, C): first get a solution for A, then try
to solve B (for each solutions of A) and then try to solve C
(for each solution of B). Note that the last solutions of C
are heavily involved with the solutions of A, and then the
operation OPT is not binary between B and C.

Doing something like OPT(A, OPT(B, C)) would result in the
following algorithm: first solve B and then try to solve C
(for each solution of B), now solve A and find the one that
are "compatibles" with the previous solutions (the solutions
for A and for OPT(B,C) or for A and not for OPT(B,C)).

A simple example shows that there are cases in which the
two process leads to different results:

Consider the data

++++++++++++++++++++++++++++++++++++++++++++++++++++++
@prefix : <http://ing.utalca.cl/~jperez/research/rdf/> .

_:b1 :name    "paul" .
_:b1 :phone   "777-3426" .

_:b2 :name    "john" .
_:b2 :email   "john@acd.edu" .

_:b3 :name    "george" .
_:b3 :webPage "www.george.edu" .

_:b4 :name    "ringo" .
_:b4 :email   "ringo@acd.edu" .
_:b4 :webPage "www.starr.edu" .
_:b4 :phone   "888-4537" .
+++++++++++++++++++++++++++++++++++++++++++++++++++++++

and the query

PREFIX : <http://ing.utalca.cl/~jperez/research/rdf/>

SELECT *
FROM :sample1.rdf
WHERE {
?A :email ?E
OPTIONAL {
?A :name ?N
OPTIONAL { ?A :webPage ?E }
}
}

Here A is ?A :email ?E, B is ?A :name ?N, and
C is ?A :webPage ?E. Note the using of the same
variable in A and C this is intentionally to cause
a conflict. The following result is obtained in SPARQLer
(and other implementations)

+-------+-----------------+---------+
+  A    |       E         |    N    |
+-------+-----------------+---------+
+ _:b4  | "ringo@acd.edu" | "ringo" |
+-------+-----------------+---------+
+ _:b2  | "john@acd.edu"  | "john"  |
+-----------------------------------+

It is a natural answer (at least te one that
I would spect), and correspond to the algorithm that you
mention (in my words a ternary operation).
But if the process is someting like OPT(A, OPT(B, C)),
i.e. using OPT as a binary operation then the result is

+-------+-----------------+---------+
+  A    |       E         |    N    |
+-------+-----------------+---------+
+ _:b4  | "ringo@acd.edu" |         |
+-------+-----------------+---------+
+ _:b2  | "john@acd.edu"  | "john"  |
+-----------------------------------+

because when evaluating the most internal optional OPT(B,C)

?A :name ?N
OPTIONAL { ?A :webPage ?E }

in the case of the first mapping ?E is bounded to
"www.starr.edu", ?A bounded to "_:b4", and ?N bounded
to "ringo". Then when evaluating the external optional
for A (in the case of the first mapping), ?A is bounded
to "_:b4" and ?E to "ringo@acd.edu". Now we have to make
OPT(A, OPT(B, C)) resulting in the mapping that bound
?A to _:b4 and ?E to "ringo@acd.edu" and letting ?N unbounded.

I think that the ambiguity problem of nested optionals remains
and is an important issue.
For one side, the evaluation process that you mention
is incompatible with binary operations, and for the other side
the last release states clearly that the OPT operator is binary....
What is the right formal definition then?

In other context, I've found other problematic queries that
gives strange results in SPARQLer, like changing the order
of subqueries in group graph patterns leading to different
results (!!). Is SPARQLer a confiable implementation?

Thaks for the time.

--------
Jorge Pérez R.
http://ing.utalca.cl/~jperez

-------------------------------------------------
Este mensaje fue enviado por: http://webmail.utalca.cl
```
Received on Thursday, 13 April 2006 17:27:02 UTC

This archive was generated by hypermail 2.3.1 : Tuesday, 6 January 2015 20:52:07 UTC