Re: PROV-ISSUE-474 (instances-and-bundles): Bundles and valid instances [prov-dm-constraints] from Ivan Herman on 2012-08-10 (public-prov-wg@w3.org from August 2012)

From: Ivan Herman <ivan@w3.org>
Date: Fri, 10 Aug 2012 13:44:47 +0200
To: James Cheney <jcheney@inf.ed.ac.uk>
Cc: Simon Miles <simon.miles@kcl.ac.uk>, Provenance Working Group <public-prov-wg@w3.org>
Message-Id: <530753BF-22AE-4681-885A-2B2879639EB9@w3.org>
On Aug 10, 2012, at 13:39 , James Cheney wrote:

> 
> On Aug 10, 2012, at 12:20 PM, Ivan Herman wrote:
> 
>> Just a side-issue:
>> 
>> On Aug 10, 2012, at 09:40 , James Cheney wrote:
>>> [snip]
>> 
>> 
>>> 
>>> Part of Simon's point was that what I was calling a toplevel-bundle is not really a bundle, just a set of statements.  What prov-n is calling a toplevel-bundle is also not really a bundle: it might have multiple bundles or none, along with an unnamed set of statements.  So the terminology is confusing (I agree).
>>> 
>>> So I expect Simon might suggest that we avoid the use of toplevel-bundle in prov-n too; if he doesn't, I will.  Calling it a dataset would be fine.
>>> 
>> 
>> That would lead to a possible confusion. The term 'dataset' is used in the SW world, namely in SPARQL. It *may* be the term adopted by RDF 1.1 for a collection of named graphs and, actually, it *may* be the right abstraction for Prov, too, but... we are not yet sure. And if we end up using the same term but with a different meaning then, well, all hell breaks loose:-)
>> 
> 
> OK. I had the impression that in RDF terms, a PROV instance would literally correspond to a graph, a bundle would roughly correspond to a named graph,

That is what I would expect, too.

> and so saying "toplevel bundle" seemed odd since it doesn't have a name.  Then a "PROV dataset" would, if represented in RDF, literally be (an example of) an RDF dataset.
> 

Again, that is what I would expect, too. 


> But that is a fair point.  Can we make this conditional on staying aligned with RDF terminology?  

I am afraid that would be a dangerous path to follow:-( It is really unfortunate, but I simply cannot assure you that the RDF WG will be final on these terms before the final recommendation.

Of course, we could refer to the SPARQL usage of the term, but that is a bit more convoluted.

Note that Pat Hayes, one of the RDF WG members, is discussing a general comment on Prov as part of the RDF WG comments, and I know that his issue is really around the bundles. So we can expect some comments from that corner; it might be a good idea not to make a final decision on this before...


> I'd rather not invent three new terms for things that directly align with existing terminology, but I appreciate the concern that we not use the same terminology for slightly different things.  So any suggestions for a better term than "dataset" would be welcome.
> 
> 
>> B.t.w., if I use the RDF datasets as an analogy: that consists of (G, (n1,G1),....,(ni,Gi)), where (ni,Gi) is, to use the current terminology, a named graph (that is the term used in SPARQL) and G is the 'default graph'. As an analogy, what about 'default bundle' ?
>> 
> 
> Default bundle might be better than toplevel bundle, but for us, "bundle" has generally meant "named set of statements".
> 

I am not sure I understand the issue. We can call a bundle a 'set of statements' (yes, it is pretty much like an RDF graph being a set of triples...), and we have then named bundles and one default bundle.
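
To make that reading concrete, here is a rough sketch in Python. It is only an illustration under my own assumptions: the names Instance, default_bundle, named_bundles and add_named are hypothetical and come from no PROV or RDF document, and statements are kept as opaque strings.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Iterator, Optional, Tuple

Statement = str                # assumption: a statement is an opaque string here
Bundle = FrozenSet[Statement]  # a bundle is just a set of statements


@dataclass
class Instance:
    """The whole thing: one default (unnamed) bundle plus named bundles."""
    default_bundle: Bundle = frozenset()
    named_bundles: Dict[str, Bundle] = field(default_factory=dict)

    def add_named(self, name: str, bundle: Bundle) -> None:
        # Bundle names are kept distinct, like graph names in a dataset.
        if name in self.named_bundles:
            raise ValueError(f"duplicate bundle name: {name}")
        self.named_bundles[name] = bundle

    def all_bundles(self) -> Iterator[Tuple[Optional[str], Bundle]]:
        """Yield (name, bundle) pairs; the default bundle has name None."""
        yield None, self.default_bundle
        yield from self.named_bundles.items()


inst = Instance(default_bundle=frozenset({"entity(ex:e1)"}))
inst.add_named("ex:b1", frozenset({"activity(ex:a1)"}))
names = [name for name, _ in inst.all_bundles()]
```

The shape mirrors (G, (n1,G1),....,(ni,Gi)): the default bundle plays the role of G and each named bundle the role of a pair (ni,Gi).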

Ivan


> So another terminology could be:
> 
> - "instance" - the whole thing (toplevel bundle + named bundles)
> - "bundle" - any set of statements (named or unnamed)
> - "named bundle" - 
> - "default bundle" - the unnamed set of statements at the toplevel (what we were calling "toplevel bundle" or "toplevel instance" in my recent revision).
> 
> Simon, would that be acceptable instead of "dataset", "instance", "bundle", "toplevel instance"?  Are there lots of places where we say that bundles are named, that would have to change to draw this distinction?
> 
> One advantage of this would be that prov-n doesn't need to change (except maybe renaming "toplevel" to "default").  
> 
> I think we can probably keep this change independent of technical content, so that we can align (or not) with RDF 1.1 later, in any case.
> 
> --James
> 
> 
> 
>> Ivan
>> 
>> 
>>>> 
>>>> 
>>>> 
>>>> Isn't it the case that an instance (which is a prov-constraint concept and not a prov-n concept)
>>>> is a set of statements, or a bundle, or a toplevel-bundle/dataset?
>>> 
>>> I am now proposing that we use "instance" solely for "set of statements".  If this term is only used in this sense in prov-constraints, then it seems that we are free to redefine it, within reason.  Most of the document concerns instances, so the number of changes was small. For cohesion, if we talk about sets of statements elsewhere it might be sensible to call them "instances", but I don't insist on it, nor do I insist on the use of "dataset" elsewhere.
>>> 
>>> --James
>>> 
>>> 
>>>> 
>>>> Luc
>>>> 
>>>> 
>>>> On 09/08/12 18:03, James Cheney wrote:
>>>>> OK.  I have done a quick pass to use the term "PROV dataset" and changed all occurrences of "toplevel bundle" to "toplevel instance".  I think it's a lot better this way!
>>>>> 
>>>>> instance = set of statements.  (Excluding "bundle" constructs, which are not statements.)
>>>>> bundle = named set of statements ~= named graph of PROV-O (hopefully!)
>>>>> dataset = an instance and zero or more bundles (with distinct names).
>>>>> toplevel instance = the set of statements at the toplevel of a dataset
>>>>> 
>>>>> Modulo typos/snags, does this look OK?  If so I will close.
>>>>> 
>>>>> Perhaps this terminology would be useful in other documents (Luc pointed out PROV-N uses "toplevel bundle" too...).
>>>>> 
>>>>> --James
>>>>> 
>>>>> On Aug 9, 2012, at 5:41 PM, Miles, Simon wrote:
>>>>> 
>>>>>> Hello James,
>>>>>> 
>>>>>> I strongly agree with the suggested general solution. I have no objection to "dataset" as a term. If you do still need to talk about bundles at all in PROV-Constraints, I think it should be made clear that the "toplevel" does not need to be named (does not need to be a bundle) to avoid confusion of concepts for different purposes.
>>>>>> 
>>>>>> As said on the IRC, I don't think this is a blocking issue, just a matter of text clarification.
>>>>>> 
>>>>>> thanks,
>>>>>> Simon
>>>>>> 
>>>>>> Dr Simon Miles
>>>>>> Senior Lecturer, Department of Informatics
>>>>>> King's College London, WC2R 2LS, UK
>>>>>> +44 (0)20 7848 1166
>>>>>> 
>>>>>> Evolutionary Testing of Autonomous Software Agents:
>>>>>> http://eprints.dcs.kcl.ac.uk/1370/
>>>>>> ________________________________________
>>>>>> From: James Cheney [jcheney@inf.ed.ac.uk]
>>>>>> Sent: 09 August 2012 17:21
>>>>>> To: Provenance Working Group
>>>>>> Subject: Re: PROV-ISSUE-474 (instances-and-bundles): Bundles and valid instances [prov-dm-constraints]
>>>>>> 
>>>>>> We discussed this in the teleconference and it sounded like it would be appropriate to find better terminology for the following three things, which are currently not clearly distinguished:
>>>>>> 
>>>>>> - "the whole PROV instance, including set of toplevel statements and bundles"
>>>>>> - "a particular set of statements, either the toplevel one or one within a bundle"
>>>>>> - bundle = "a named set of provenance statements"
>>>>>> 
>>>>>> My initial proposal is "PROV dataset", "PROV instance", and "bundle".  I believe "PROV dataset" is roughly analogous to what people call "dataset" in the context of SPARQL; if anyone knows different (or has objections or better suggestions), let me know.
>>>>>> 
>>>>>> I'll send another message on this when this is ready for review.
>>>>>> 
>>>>>> --James
>>>>>> 
>>>>>> On Aug 9, 2012, at 3:45 PM, Provenance Working Group Issue Tracker wrote:
>>>>>> 
>>>>>>> PROV-ISSUE-474 (instances-and-bundles): Bundles and valid instances [prov-dm-constraints]
>>>>>>> 
>>>>>>> http://www.w3.org/2011/prov/track/issues/474
>>>>>>> 
>>>>>>> Raised by: Simon Miles
>>>>>>> On product: prov-dm-constraints
>>>>>>> 
>>>>>>> As requested, I'm submitting an issue where I feel a PROV-Constraints review comment of mine is not completely answered.
>>>>>>> 
>>>>>>> My original comment:
>>>>>>>> Bundles
>>>>>>>> -------
>>>>>>>> F. Section 6.1 seems a bit out of the blue. "The definitions
>>>>>>>> [etc.]... assume a PROV instance with exactly one bundle", and then
>>>>>>>> multiple bundles are handled as exactly the same number of
>>>>>>>> instances. Why? Why is there a connection between number of instances
>>>>>>>> and number of bundles? Why would a bundle be considered to be only one
>>>>>>>> instance? I thought a bundle was an identified set of statements,
>>>>>>>> allowing for provenance of provenance, which seems a distinct matter
>>>>>>>> from whether a set of statements are valid. It seems fine for a user
>>>>>>>> to treat one bundle as one instance if they want to, but there's no
>>>>>>>> reason given why this is the general case.
>>>>>>> Response from editors:
>>>>>>>> I am not sure I understand this comment.  However, I have rewritten
>>>>>>>> slightly the intro of section 6.1.
>>>>>>>> 
>>>>>>>> "The definitions, inferences, and constraints, and the resulting notions of normalization, validity and equivalence, assume a PROV instance that consists of exactly one bundle, the toplevel bundle, containing all PROV statements in the top level of the bundle (that is, not enclosed in a named bundle). In this section, we describe how to deal with PROV instances consisting of multiple named bundles. Briefly, each bundle is handled independently; there is no interaction between bundles from the perspective of applying definitions, inferences, or constraints, computing normal forms, or checking validity or equivalence."
>>>>>>> I agree this is clearer, but I don't feel it answers the key questions in my comment. To put my comment another way: you have explained checking validity where an instance consists of one bundle and of multiple bundles. The two other possibilities I see are:
>>>>>>> (a) A bundle containing multiple instances;
>>>>>>> (b) An instance that is a collection of PROV descriptions with no identifier and so is not a bundle, e.g. a provenance service query result.
>>>>>>> 
>>>>>>> How do we deal with each of these cases? Or, if they cannot occur, why not?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Simon
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> The University of Edinburgh is a charitable body, registered in
>>>>>> Scotland, with registration number SC005336.
>>>>>> 
>>>>> 
>>>> 
>>>> -- 
>>>> Professor Luc Moreau
>>>> Electronics and Computer Science   tel:   +44 23 8059 4487
>>>> University of Southampton          fax:   +44 23 8059 2865
>>>> Southampton SO17 1BJ               email: l.moreau@ecs.soton.ac.uk
>>>> United Kingdom                     http://www.ecs.soton.ac.uk/~lavm
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> ----
>> Ivan Herman, W3C Semantic Web Activity Lead
>> Home: http://www.w3.org/People/Ivan/
>> mobile: +31-641044153
>> FOAF: http://www.ivan-herman.net/foaf.rdf
>> 
> 
> 
> 


Received on Friday, 10 August 2012 11:45:13 UTC