Re: Prov-DM, entity records and ASN

Hi Graham,

I just raised ISSUE-125 related to your email.

Quick answers are interleaved below; a fuller exposition of the issues 
is in my previous email.

On 01/12/2012 11:36 PM, Graham Klyne wrote:
> Hi,
> Short version:
> * URIs should not be used to identify entity records in Prov ASN.
> * Preferably (in my view) entity records should be self-identifying, 
> and don't need separate identifiers.  If separate identifiers are used 
> for these, don't use URIs.
We should not just look at entity records, but also at usage records 
(mentioned in derivation records) and account records.
We should consider their specific requirements regarding identifiers.
Hopefully, we can come up with a solution that applies to all forms of 
records (entity records and other kinds).

Identifiers are defined in the DM as qualified names (which can be 
mapped to URIs).
Assuming we want to have identifiers for records, I believe qualified 
names are a good approach, since they allow us to name records in the 
context of a namespace, which we would use to identify an application.
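
As a minimal sketch of the mapping (the prefixes, the namespace URIs and 
the expand function are made up for illustration and are not taken from 
the DM), a qualified name could be turned into a URI by concatenating 
the namespace with the local name:

```python
# Hypothetical prefix table: each prefix stands for a namespace that
# identifies an application. Both URIs are invented for illustration.
NAMESPACES = {
    "app1": "http://example.org/app1/",
    "app2": "http://example.org/app2/",
}


def expand(qname: str) -> str:
    """Map a qualified name 'prefix:local' to a URI by concatenation."""
    prefix, sep, local = qname.partition(":")
    if not sep or prefix not in NAMESPACES:
        raise ValueError(f"unknown or missing prefix in {qname!r}")
    return NAMESPACES[prefix] + local


# The same local name in two namespaces yields two distinct identifiers,
# so each application can name its own records without clashing.
assert expand("app1:e0") == "http://example.org/app1/e0"
assert expand("app1:e0") != expand("app2:e0")
```

Whether the resulting URI should then be exposed as a record identifier 
is exactly the question raised in this thread; the sketch only shows 
that the mapping itself is mechanical.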

> * We should distinguish between Provenance ASN as an *abstract* 
> syntax, and any concrete (hence parseable) realization of this 
> syntax.  I think we need to take care to avoid confusing the 
> provenance language(s) with the provenance information that is being 
> represented.
We never said whether the ASN is parsable or not. But this is beside 
the point: the issue is about identifiers in the PROV-DM model.

> * Maybe relax the constraints that require scoping of names in the 
> ASN, or don't use URIs for such names.

Can you discuss this suggestion in the light of the example?


> ...
> I came in rather late today on the telecon discussion, and picked up 
> on the end of a discussion about entity names and entity record 
> names.  From what I heard, I think there are a couple of areas of 
> confusion:
> (a) abstract syntax vs concrete syntax.  My original understanding was 
> that the Provenance ASN was intended to be an abstract syntax and 
> associated notation for talking about provenance assertions, not a 
> concrete syntax for machine processing of provenance.  Recent PROV-DM 
> specs say something similar [1].   As such, I felt the discussion of 
> concrete representations of provenance records (I heard mention of 
> relational representation) was adding to rather than clearing away the 
> complexities that users of provenance information are being asked to 
> address.
> [1] "This specification also introduces PROV-ASN, an abstract syntax 
> that is primarily aimed at human consumption. PROV-ASN allows 
> serializations of PROV-DM instances to be written in a technology 
> independent manner, it facilitates its mapping to concrete syntax, and 
> it is used as the basis for a formal semantics. This specification 
> uses instances of provenance written in PROV-ASN to illustrate the 
> data model." -- 
> My expectation is that the abstract syntax captures a structure, 
> abstractly, that can be mapped isomorphically (more or less) to some 
> concrete syntax.  This allows us to talk about the structures 
> involved, and their essential properties, separately from the 
> artifacts of dealing with concrete representations.  John McCarthy's 
> original (I think) introduction of abstract syntax notions [2] used 
> logical expressions over (abstract) classes of entities to describe 
> structural relationships independently of any particular 
> serialization.  This provides just enough structure to attach formal 
> semantic information without getting ensnared by considerations of 
> representation.  I thought this was the intention of provenance ASN, 
> but we still seem to be discussing it in terms of concrete 
> representations, which I don't think is helping us to be clear about 
> the issues.
> [2] 
> (b) naming things in the domain of discourse vs things in the 
> representation language.  This conflation seems to be at the heart of 
> much of the discussion about identifiers for entities vs identifiers 
> for entity records.  To my mind, entity records DO NOT exist in the 
> domain of discourse, hence any form of identification used for these 
> must be distinct from names of things in  the domain of discourse 
> (entities, etc.).  We use URIs to name things in the domain of 
> discourse.  I think it is a mistake to use URIs to identify things in 
> the language (i.e. entity records), if any separate identification is 
> needed. Personally, I think that (entire) entity records can serve as 
> their own identifiers, and that no further identification is needed.  
> But I recognize concise identifiers are sometimes useful as a 
> convenience to stand in for more complex structures, so if entity 
> record identifiers are to be introduced I think they should NOT be 
> URIs, and should not carry through to other serializations.
> There is a parallel here with the RDF abstract syntax.  In RDF 
> abstract syntax there are graph nodes that denote things in the 
> domain of discourse. Some of these nodes *are* URIs (NOTE: not 
> *labelled* with URIs).  Other nodes are literals - i.e. arbitrary 
> strings of characters with optional language and datatyping.  And other 
> nodes are blank nodes.  URI nodes are names, which denote things in 
> the domain of discourse according to some interpretation - a mapping 
> from URIs to things.  Literals denote things in the domain of 
> discourse according to datatype-defined mappings.  Plain literals are 
> self-denoting; i.e. the datatype mapping is the identity mapping.  
> Blank nodes denote things as existential variables: there is no fixed 
> mapping, but whatever they denote is constrained by the semantics of 
> the expressions in which they appear.  This is all the abstract syntax 
> has to say about nodes; no more is needed.
> But, when RDF graphs are serialized, some device is needed to encode 
> when different blank nodes in a serialization are actually the same 
> node.  RDF/XML and other serializations of RDF use the notion of a 
> blank node identifier.  The node identifiers do not themselves denote 
> anything in the domain of discourse; they are simply a device to 
> represent the structure of an RDF graph, and the relationships between 
> nodes that do denote things.  They are simply an artifact of a 
> serialization.  (In a purely graphical representation of RDF, it is 
> the position of a node on a page that distinguishes it from other 
> nodes - no label is needed.)
> In the case of provenance ASN, I think entity records are roughly like 
> RDF nodes, and entity record identifiers (if they are used) are merely 
> devices for talking about the structure of provenance ASN expressions 
> and do not of themselves represent anything in the domain of 
> discourse.  Crucially, I think that entity record identifiers should 
> not appear in different serializations of the provenance model, such 
> as RDF.  Thus, we shouldn't be using URIs for entity records, if we 
> use any form of identifier for them.
> ...
> There's a particular issue I haven't addressed so far: naming of 
> entities that appear in different accounts - i.e. scoping of names to 
> provenance accounts.
> As far as I'm aware, a main concern is that if the same entity appears 
> in different accounts, then one cannot apply constraints like saying 
> that an entity is generated by at most one process execution.  
> Personally, in the face of multiple accounts, I think this is a 
> pointless and unhelpful constraint - i.e. by relaxing the constraint 
> then one reason for scoping names to accounts is put aside.  Are there 
> others?
> A possible approach would be to not use URIs to name things in the 
> ASN, and use other mechanisms to assign appropriate URI names when 
> mapping to RDF.  Lacking a concrete use case, I can't expand further 
> on this notion.
> #g
> -- 

Professor Luc Moreau
Electronics and Computer Science   tel:   +44 23 8059 4487
University of Southampton          fax:   +44 23 8059 2865
Southampton SO17 1BJ               email:
United Kingdom           

Received on Friday, 13 January 2012 10:50:41 UTC