Re: complete vs partial graph semantics

On Wed, 2012-04-11 at 18:40 +0100, William Waites wrote:
> On Wed, 11 Apr 2012 10:37:22 -0400, Sandro Hawke <> said:
>     sandro> Put differently, as a test case:
>     sandro>
>     sandro> Trig Document 1 (D1): <u> { <a> <b> 1 }
>     sandro>
>     sandro> Trig Document 2 (D2): <u> { <a> <b> 2 }
>     sandro>
>     sandro> What is the merge/union of D1 and D2?
>     sandro>
>     sandro> It's not defined, when asked like this.  We use
>     sandro> something Trig-Like but different:
>     sandro>
>     sandro>     D1A <u> {+ <a> <b> 1 } D2A <u> {+ <a> <b> 2 }
>     sandro>
>     sandro> in which case the merge is:
>     sandro>
>     sandro>     D3A <u> {+ <a> <b> 1,2 }
>     sandro>
>     sandro>         ==or==
>     sandro>
>     sandro>     D1B <u> {= <a> <b> 1 } D2B <u> {= <a> <b> 2 }
>     sandro>
>     sandro> in which case there is no merge; they are inconsistent.
> Reading some of the background discussion, talking about crawler dumps
> and such, it seems to me there is quite a bit more information we
> might want to carry around in the "header" of a trig document.

In the 6.1 proposal, you can say whatever you want in the default graph.
It can be used like a "header" that way.   The key point here is that
the default graph is asserted.

> For example, if D1 was downloaded at time t1 and D2 at t2, one could
> reasonably conclude that even with the + notation it is inappropriate
> to merge them, D2 having superseded D1.

It seems to me this kind of logic requires named DATASETS.   You're
reasoning about D1 and D2.

I suggest we try to design things so we only need to worry about named

All the designs on
do okay on this front.  In every case, if you can conjoin the datasets, 
you'll be able to do whatever reasoning you want about cache
expirations, etc, just by looking at data in the dataset that came out
of the conjunction.

I say "if" because only designs 1 and 3 are guaranteed to allow
conjoining.  With design 2, an attempt to conjoin will fail if one of
the data sources returns different contents during the different crawls.
But you still won't get incorrect results.
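
As a rough illustration only (it is not part of any of the proposals), the
conjoin behaviour of the {+ and {= markers from the test case quoted above
can be sketched in Python. The names NamedGraph, conjoin and
InconsistentGraphs are hypothetical, triples are plain tuples rather than
real RDF terms, and the handling of a mixed +/= pair is an assumption that
the thread does not settle.

```python
from dataclasses import dataclass

PARTIAL = "+"   # the listed triples are some of the graph's contents
COMPLETE = "="  # the listed triples are exactly the graph's contents


class InconsistentGraphs(Exception):
    """Raised when two descriptions of one named graph cannot both hold."""


@dataclass(frozen=True)
class NamedGraph:
    name: str
    mode: str            # PARTIAL or COMPLETE
    triples: frozenset   # (subject, predicate, object) tuples


def conjoin(g1, g2):
    """Combine two descriptions of the same named graph, or fail."""
    if g1.name != g2.name:
        raise ValueError("descriptions refer to different graphs")
    if g1.mode == PARTIAL and g2.mode == PARTIAL:
        # Two partial views: the merge is simply the union of the triples.
        return NamedGraph(g1.name, PARTIAL, g1.triples | g2.triples)
    if g1.mode == COMPLETE and g2.mode == COMPLETE:
        # Two complete views must agree exactly; otherwise there is no merge.
        if g1.triples != g2.triples:
            raise InconsistentGraphs(g1.name)
        return g1
    # Assumption: a partial view is compatible with a complete view only
    # if it lists nothing beyond what the complete view contains.
    complete, partial = (g1, g2) if g1.mode == COMPLETE else (g2, g1)
    if not partial.triples <= complete.triples:
        raise InconsistentGraphs(g1.name)
    return complete


# D1A and D2A from the test case: the merge holds both values.
d1a = NamedGraph("u", PARTIAL, frozenset({("a", "b", 1)}))
d2a = NamedGraph("u", PARTIAL, frozenset({("a", "b", 2)}))
d3a = conjoin(d1a, d2a)

# D1B and D2B from the test case: conjoining raises InconsistentGraphs.
d1b = NamedGraph("u", COMPLETE, frozenset({("a", "b", 1)}))
d2b = NamedGraph("u", COMPLETE, frozenset({("a", "b", 2)}))
```

The point of failing loudly in the "=" case is the one made above: a
conjunction that cannot be formed is refused, so nothing incorrect is
ever concluded from it.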

> Or perhaps D1 comes from a reliable source and D2 comes from someone
> whose data I'll use if I don't have anything better but otherwise I
> wouldn't trust. So when combining the information I'll throw out the
> second version. But perhaps I would nevertheless keep it around and do
> a straight additive merge if I know the cardinality of <b> to be
> greater than 1.
> My point is that combining data from different sources, or the same
> source at different times, is likely to need to take into account more
> than just the +/= hints. Some of this information can be in-band
> (e.g. time, source) and some must necessarily be out of band (e.g. how
> much I trust that source).

I'd like that kind of reasoning to happen within datasets rather than
across datasets.   I think that's much of why we want datasets, so we
can reason about trust and change in graphs, in a distributed way.  If
we just push the problem up to reasoning about datasets, we probably
haven't gained anything.

     -- Sandro

Received on Thursday, 12 April 2012 15:34:59 UTC