Re: PROV-ISSUE-311 (clarify-optionals): Clarify optional arguments in DM [prov-dm] from James Cheney on 2012-04-19 (public-prov-wg@w3.org from April 2012)

From: James Cheney <jcheney@inf.ed.ac.uk>
Date: Thu, 19 Apr 2012 12:40:49 +0100
To: Luc Moreau <L.Moreau@ecs.soton.ac.uk>
Cc: public-prov-wg@w3.org
Message-Id: <02CFA5B0-53DC-4CAD-8F89-3981E6F25CD4@inf.ed.ac.uk>
On Apr 19, 2012, at 10:30 AM, Luc Moreau wrote:

> Hi James,
> 
> For some reason, your cut and paste does not seem to include some '?' symbols,
> 
> Here is the rule in my antlr parser, where each identifier was given a name.
> 
>    :    'wasDerivedFrom' '(' ((id0=identifier | '-') ',')? id2=identifier ',' id1=identifier (',' (a=identifier | '-') ',' (g2=identifier  | '-') ',' (u1=identifier | '-') )?    optionalAttributeValuePairs ')'
> 

Yes, you're right.  For some reason copying and pasting from the HTML in PROV-N omitted some symbols and I was basing my comment on the pasted version.  Some of what I wrote was nonsense as a result.

Sorry about that!  Negligent of me to respond before coffee.

> Notice that either:
> - you have *three* arguments + optional attributes following id1
> - or you don't have these arguments but simply optional attributes.
> 
> So, the following are examples of valid expressions:
> wasDerivedFrom(id2,id1,a,g2,u1)
> wasDerivedFrom(id2,id1,a,-,u1)
> wasDerivedFrom(id2,id1,-,-,-)
> wasDerivedFrom(id2,id1)
> 
> If id0 is present, then, likewise:
> wasDerivedFrom(id0, id2,id1,a,g2,u1)
> wasDerivedFrom(id0, id2,id1,a,-,u1)
> wasDerivedFrom(id0, id2,id1,-,-,-)
> wasDerivedFrom(id0, id2,id1)
> 
> Note that in the above, '-' appears explicitly in the textual representation.
> 
> So, to me,
> 
> wasDerivedFrom(e2, e1,x)
> 
> can only be parsed in a single way:
> wasDerivedFrom(id0, id2,id1)
> 

OK.  That's fine, and it's actually what I was suggesting we do by the "all or nothing" behavior.  In the current grammar, wasDerivedFrom, activity, wasQuotedFrom already have the "all or nothing" behavior, which is great. 
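
To make the "all or nothing" reading concrete, here is a minimal Python sketch of the wasDerivedFrom rule quoted above. It assumes the attribute list has already been stripped off and the remaining arguments arrive as a list of strings. The function name parse_derivation is hypothetical, and the dictionary keys are simply the labels from Luc's rule; none of this is part of PROV-N itself.

```python
def parse_derivation(args):
    """Parse the positional arguments of wasDerivedFrom(...).

    `args` is a list of strings: identifiers or the marker '-'.
    Attribute-value pairs are assumed to have been removed already.
    """
    n = len(args)
    # Only four arities are possible: 2 or 5 arguments without an id,
    # 3 or 6 arguments with one.
    if n not in (2, 3, 5, 6):
        raise ValueError(f"wasDerivedFrom cannot take {n} positional arguments")
    rest = list(args)
    id0 = rest.pop(0) if n in (3, 6) else None
    id2, id1 = rest[0], rest[1]
    if "-" in (id2, id1):
        raise ValueError("the two entity identifiers are mandatory")
    # The optional block is all or nothing: three slots, or none at all.
    a, g2, u1 = rest[2:] if len(rest) == 5 else (None, None, None)
    return {"id0": id0, "id2": id2, "id1": id1, "a": a, "g2": g2, "u1": u1}
```

Under these assumptions the number of arguments alone fixes the reading: parse_derivation(["e2", "e1", "x"]) can only treat e2 as id0, and a four-argument call is rejected outright.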

I do think this way of doing things (where the id is just another optional argument) puts an unnecessary burden on readers to understand the grammar in advance.  Since we claim human readability as a goal, saying "you should be able to figure out which types each identifier has by parsing in your head and counting the arguments" seems suboptimal.  Especially since we claim in various places that the position matters.

I think there are still ambiguity problems, though:

wasGeneratedBy, wasStartedBy and wasEndedBy have independent optional id, activity, time and attribute arguments, along with a constraint that one of activity, time and attrs must be present.  So how do I parse:

wasGeneratedBy(x,y,attrs)

where both x and y are identifiers?  It could mean 

x is a generation id and y is an entity (the generated entity)
or
x is an entity and y is an activity (that generated x)

Still seems ambiguous.  LL parsing will take the first parse, which means that if the id is omitted we always have to say so explicitly:

wasGeneratedBy(-,e,a,attrs)

which seems suboptimal to me, and there are many examples where we don't do this for the short form of generation.  I suppose we could patch this by saying that if attrs is present then the id also has to be.

Association has this problem too, but worse:

wasAssociatedWith(x,y)

could be interpreted as
x is an id and y is an activity
x is an activity and y is an agent
x is an activity and y is an entity
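
Both ambiguities (generation and association) can be reproduced by brute force. The following Python sketch enumerates every way of assigning the written identifiers, in order, to the argument slots when each optional slot may be skipped independently. The slot lists are my reading of the grammar as described in this message, not a copy of the PROV-N productions; the time slot is left out because x and y are both identifiers, and attrs are ignored. The name readings is hypothetical.

```python
def readings(tokens, slots):
    """Return every order-preserving assignment of tokens to slots.

    `slots` is a list of (name, required) pairs. Each token fills exactly
    one slot; optional slots may be skipped, required ones may not.
    """
    if not slots:
        return [{}] if not tokens else []
    (name, required), rest = slots[0], slots[1:]
    found = []
    if tokens:
        # Option 1: the next token fills this slot.
        for r in readings(tokens[1:], rest):
            found.append({name: tokens[0], **r})
    if not required:
        # Option 2: this optional slot is left out.
        found.extend(readings(tokens, rest))
    return found


# Assumed slot layouts (identifier-valued slots only).
GENERATION = [("id", False), ("entity", True), ("activity", False)]
ASSOCIATION = [("id", False), ("activity", True), ("agent", False), ("entity", False)]

for reading in readings(["x", "y"], GENERATION):
    print("wasGeneratedBy:", reading)
for reading in readings(["x", "y"], ASSOCIATION):
    print("wasAssociatedWith:", reading)
```

With these slot lists the enumeration finds two readings for generation and three for association, which are the cases listed above. An LL parser simply commits to the first one.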


So I still advocate having simple, orthogonal rules for optional arguments:

- id is at the beginning and followed by a different symbol (say, semicolon) if present, to make it trivial to see whether there's an id present;
- attrs are in brackets if present (which is already fine);
- other optional attributes are either all omitted (short form) or all given, with missing ones as '-'
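
As a rough illustration of how little a reader (or parser) would need to know under these rules, here is a Python sketch. It assumes the semicolon suggested above as the id separator, takes the text between the parentheses with any bracketed attrs already removed, and is told how many required and optional positions the relation has. The name parse_call and its parameters are hypothetical.

```python
def parse_call(text, n_required, n_optional):
    """Split 'id; a1, a2, ...' into (id, required_args, optional_args).

    optional_args is None for the short form, or a list in which '-'
    marks a position that is written but not specified.
    """
    ident = None
    if ";" in text:
        # Rule 1: an id, if present, is whatever precedes the semicolon.
        head, text = text.split(";", 1)
        ident = head.strip()
    args = [t.strip() for t in text.split(",")]
    # Rule 3: the optional positions are all omitted or all written.
    if len(args) == n_required:
        return ident, args, None
    if len(args) == n_required + n_optional:
        return ident, args[:n_required], args[n_required:]
    raise ValueError(f"expected {n_required} or {n_required + n_optional} "
                     f"arguments, got {len(args)}")
```

For a generation-like relation with one required and two optional positions, parse_call("g1; e, a, -", 1, 2) and parse_call("e", 1, 2) each have exactly one reading, while "e, a" is rejected instead of being silently guessed at.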




> 
> As far as the unknown/absent discussion is concerned, I am not trying to argue for the cases
> I enumerated. I am just saying that 'unknown' as you suggested is not clear. Unknown by whom?
> what is unknown?

I don't want to get into philosophy.  I meant unknown purely in the sense that when translating to RDF, we generate a fresh edge and id; whereas absent means we don't.  It would be good to explain this purely in PROV-DM terms.

The original issue raised by Stian here was how the different meanings of "missing" affect the translation to RDF.  So that is the distinction that matters to me.  I think it would be beneficial if there's a clear alignment between the way you write optional things and the way they're treated (absence vs. there-but-not-specified, whatever you want to call it).
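
To show the kind of alignment I have in mind, here is a small Python sketch. It assumes, purely for illustration, that '-' is read as there-but-not-specified and that leaving the argument out (None here) means absent. Triples are plain tuples, and the property name ex:generatedBy and the function names are made up for the example; they are not the actual RDF vocabulary.

```python
import itertools


def fresh_nodes():
    """Yield blank-node labels _:b1, _:b2, ..."""
    return (f"_:b{i}" for i in itertools.count(1))


def generation_edges(entity, activity, fresh):
    """Return the triples for one generation.

    `activity` is an identifier, '-' (there but not specified),
    or None (absent).
    """
    if activity is None:
        return []                   # absent: no edge is generated
    if activity == "-":
        activity = next(fresh)      # unspecified: fresh node, fresh edge
    return [(entity, "ex:generatedBy", activity)]


fresh = fresh_nodes()
print(generation_edges("e", "a", fresh))    # named activity
print(generation_edges("e", "-", fresh))    # fresh blank node
print(generation_edges("e", None, fresh))   # nothing
```

Whatever terminology we settle on, the point is that the written form alone should tell an implementer which of these three branches applies.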

--James

> 
> Luc
> 
> 
> On 04/19/2012 10:00 AM, James Cheney wrote:
>> On Apr 19, 2012, at 5:35 AM, Luc Moreau wrote:
>> 
>>   
>>> Hi James,
>>> 
>>> I don't think your description of the problem is accurate.
>>> The production [1] is not ambiguous (LL grammar), it definitely does not
>>> require multiple passes over the document to recognise types.
>>> 
>>>     
>> Sorry, I don't understand how a grammar containing the rule [1] can *possibly* be unambiguous.
>> 
>> derivationExpression ::= wasDerivedFrom ( ( identifier | - ) , eIdentifier , eIdentifier , ( aIdentifier | - ) , ( gIdentifier | - ) , ( uIdentifier | - ) optional-attribute-values )
>> 
>> By "unambiguous", I mean what people normally mean: for each string there is at most one parse tree (not that one can find *some* parse without backtracking.)  There are three parse trees for:
>> 
>> wasDerivedFrom(e2, e1,x)
>> - one where x is parsed as an aIdentifier,
>> - one where x is parsed as a gIdentifier,
>> - one where x is parsed as a uIdentifier.
>> 
>> The grammar may be LL, but an LL parser will always pick the leftmost derivation, i.e. the aIdentifier one.  This is *not* the same as unambiguity.
>> 
>> If this is the *required* way to disambiguate then the grammar spec should say so, and the rule "you have to use - for the first few omitted arguments" should be made explicit.  This seems at least as complicated as my alternative suggestion.
>> 
>> I haven't seen the current version of PROV-N so maybe this is explained better there, but it should also be explained in PROV-DM(-CONSTRAINTS).
>> 
>> 
>> 
>>   
>>> I think the confusion may have come from the description of the grammar but Paolo has reworked it.
>>> 
>>> As far as the reading of - is concerned, I would even say that we have the following cases:
>>> - value exists and is known but not expressed (say, because not deemed important)
>>> - value existence is known but actual value is unknown
>>> - value does not exist
>>> - value existence is not known
>>> So, your suggested split absent/unknown may not be the clearest.
>>> 
>>> I believe your Proposal 0 is implemented in the grammar.
>>> 
>>> I considered variants of Proposal 1 but ruled them out because the grammar was not ambiguous.
>>> 
>>>     
>> I would argue that the proliferation of different cases above is a strong motivation for cutting down on the number of cases.  Even if the grammar happens to be unambiguous (though I can't see how it can be), we are currently asking a lot of readers especially since the grammar is the last of the three documents they'll see.
>> 
>> In an open world setting (I think!) we shouldn't distinguish between "value does not exist" and "value existence is not known".  Combining provenance records could fill in unknown values.  In any case, we currently have no way to express this distinction - and we don't say anywhere what should happen if we somehow learn the value of a "value that does not exist".
>> 
>> I also see no reason to distinguish between "value exists and is known but not expressed" and "value existence is known but actual value is unknown" - from the point of view of a consumer of provenance, what would I do differently?  In any case there is no way for the producer to express this difference.
>> 
>> At the end of the day, what matters is what people will implement, and it's unclear to me what someone should actually implement when doing inference/validation/equivalence checking on provenance descriptions.
>> 
>> If the consensus is that the existing way is fine, at least it should be explained clearly; especially we should explain how the "short" forms of expressions expand into the long forms.  Right now, this is not explained clearly anywhere.   I plan, at least, to expand all of the expressions used in constraints so that there is no ambiguity.
>> 
>> --James
>> 
>>   
>>> [1] http://dvcs.w3.org/hg/prov/raw-file/default/model/prov-n.html#Derivation-Relation
>>> 
>>> Professor Luc Moreau
>>> Electronics and Computer Science
>>> University of Southampton
>>> Southampton SO17 1BJ
>>> United Kingdom
>>> 
>>> On 19 Apr 2012, at 00:33, "James Cheney"<jcheney@inf.ed.ac.uk>  wrote:
>>> 
>>>     
>>>> OK, I've posted my thoughts on this, and a proposal, at:
>>>> 
>>>> http://www.w3.org/2011/prov/wiki/Optional_arguments
>>>> 
>>>> (Sorry this is a bit long, but I think it is worth being a little pedantic here).
>>>> 
>>>> I'd like to keep this open for discussion, but don't think it's a blocking issue.
>>>> 
>>>> --James
>>>> 
>>>> On Apr 18, 2012, at 10:43 AM, James Cheney wrote:
>>>> 
>>>>       
>>>>> Hi,
>>>>> 
>>>>> I have been working on the optional arguments in part 2, and I am still not sure what to write based on what is in part 1 now.  I am trying to formulate a proposal to see if I am on the right track.  So I think this should be kept open for now (maybe it should be reassigned to prov-dm-constraints).
>>>>> 
>>>>> --James
>>>>> 
>>>>> 
>>>>> On Apr 18, 2012, at 7:51 AM, Luc Moreau wrote:
>>>>> 
>>>>>         
>>>>>> Hi Stian,
>>>>>> Can we close this issue now?
>>>>>> Regards,
>>>>>> Luc
>>>>>> 
>>>>>> On 04/02/2012 03:58 PM, Luc Moreau wrote:
>>>>>>           
>>>>>>> Hi Stian,
>>>>>>> 
>>>>>>> If you follow [1] below, you will now find our proposed answer to optional arguments.
>>>>>>> It contains explicit links to prov-dm part 2.
>>>>>>> 
>>>>>>> I propose to close this issue pending your review.
>>>>>>> Regards,
>>>>>>> Luc
>>>>>>> 
>>>>>>> 
>>>>>>> On 03/30/2012 04:12 PM, Luc Moreau wrote:
>>>>>>>             
>>>>>>>> Hi Stian,
>>>>>>>> 
>>>>>>>> I have been thinking about your suggestion on optional arguments.
>>>>>>>> I looked at all the optional arguments [1] in prov-dm.
>>>>>>>> 
>>>>>>>> Most of them, I believe, imply  existential quantification.
>>>>>>>> 
>>>>>>>> It would be nice to have this confirmed, and then we can write it up in part 2.
>>>>>>>> 
>>>>>>>> Luc
>>>>>>>> 
>>>>>>>> [1] http://dvcs.w3.org/hg/prov/raw-file/default/model/optional.html
>>>>>>>> 
>>>>>>>> On 13/03/2012 11:05, Provenance Working Group Issue Tracker wrote:
>>>>>>>>               
>>>>>>>>> PROV-ISSUE-311 (clarify-optionals): Clarify optional arguments in DM [prov-dm]
>>>>>>>>> 
>>>>>>>>> http://www.w3.org/2011/prov/track/issues/311
>>>>>>>>> 
>>>>>>>>> Raised by: Stian Soiland-Reyes
>>>>>>>>> On product: prov-dm
>>>>>>>>> 
>>>>>>>>> There seems to be some confusion over any of the 'optional' arguments in
>>>>>>>>> PROV-DM/PROV-N.
>>>>>>>>> 
>>>>>>>>> It is unclear if this means that the argument is *implied* (ie.
>>>>>>>>> existential quantification/bnodes in OWL/RDF) or not applicable/not present (NIL).
>>>>>>>>> 
>>>>>>>>> It might be good to go through all of the optionals in PROV-DM and make sure they make that clear.
>>>>>>>>> 
>>>>>>>>> For instance:
>>>>>>>>>                 
>>>>>>>>>> Generation, written wasGeneratedBy(id,e,a,t,attrs) in PROV-N, has the following components:
>>>>>>>>>> id: an optional identifier for a generation;
>>>>>>>>>> entity: an identifier for a created entity;
>>>>>>>>>> activity: an optional identifier for the activity that creates the entity;
>>>>>>>>>> time: an optional "generation time", the time at which the entity was completely created;
>>>>>>>>>> attributes: an optional set of attribute-value pairs that describes the modalities of generation of this entity by this activity.
>>>>>>>>>>                   
>>>>>>>>> Change to:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>                 
>>>>>>>>>> Generation, written wasGeneratedBy(id,e,a,t,attrs) in PROV-N, has the following components:
>>>>>>>>>> id: an optional identifier for a generation, if unspecified the identifier is not known;
>>>>>>>>>> entity: an identifier for a created entity;
>>>>>>>>>> activity: an optional identifier for the activity that creates the entity, if unspecified activity is still implied, but unknown;
>>>>>>>>>> time: an optional "generation time", the time at which the entity was completely created, if unspecified the time is unknown or not applicable;
>>>>>>>>>> attributes: an optional set of attribute-value pairs that describes the modalities of generation of this entity by this activity, if unspecified an empty set is implied.
>>>>>>>>>>                   
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>                 
>>>>>>>>               
>>>>>>>             
>>>>>> -- 
>>>>>> Professor Luc Moreau
>>>>>> Electronics and Computer Science   tel:   +44 23 8059 4487
>>>>>> University of Southampton          fax:   +44 23 8059 2865
>>>>>> Southampton SO17 1BJ               email: l.moreau@ecs.soton.ac.uk
>>>>>> United Kingdom                     http://www.ecs.soton.ac.uk/~lavm
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>           
>>>>> 
>>>>> -- 
>>>>> The University of Edinburgh is a charitable body, registered in
>>>>> Scotland, with registration number SC005336.
>>>>> 
>>>>> 
>>>>> 
>>>>>         
>>>> 
>>>> 
>>>>       
>>>     
>> 
>>   
> 
> 
> 


Received on Thursday, 19 April 2012 11:41:24 UTC