Prov-DM, entity records and ASN

Hi,

Short version:
* URIs should not be used to identify entity records in Prov ASN.
* Preferably (in my view) entity records should be self-identifying, and don't 
need separate identifiers.  If separate identifiers are used for these, don't 
use URIs.
* We should distinguish between Provenance ASN as an *abstract* syntax, and any 
concrete (hence parseable) realization of this syntax.  I think we need to take 
care to avoid confusing the provenance language(s) with the provenance 
information that is being represented.
* Maybe relax the constraints that require scoping of names in the ASN, or don't 
use URIs for such names.

...

I came in rather late today on the telecon discussion, and picked up on the end 
of a discussion about entity names and entity record names.  From what I heard, 
I think there are a couple of areas of confusion:

(a) abstract syntax vs concrete syntax.  My original understanding was that the 
Provenance ASN was intended to be an abstract syntax and associated notation for 
talking about provenance assertions, not a concrete syntax for machine 
processing of provenance.  Recent PROV-DM specs say something similar [1].   As 
such, I felt the discussion of concrete representations of provenance records (I 
heard mention of relational representation) was adding to rather than clearing 
away the complexities that users of provenance information are being asked to 
address.

[1] "This specification also introduces PROV-ASN, an abstract syntax that is 
primarily aimed at human consumption. PROV-ASN allows serializations of PROV-DM 
instances to be written in a technology independent manner, it facilitates its 
mapping to concrete syntax, and it is used as the basis for a formal semantics. 
This specification uses instances of provenance written in PROV-ASN to 
illustrate the data model." -- 
http://www.w3.org/TR/2011/WD-prov-dm-20111215/#introduction

My expectation is that the abstract syntax captures a structure, abstractly, 
that can be mapped isomorphically (more or less) to some concrete syntax.  This 
allows us to talk about the structures involved, and their essential properties, 
separately from the artifacts of dealing with concrete representations.  John 
McCarthy's original (I think) introduction of abstract syntax notions [2] used 
logical expressions over (abstract) classes of entities to describe structural 
relationships independently of any particular serialization.  This provides just 
enough structure to attach formal semantic information without getting ensnared 
by considerations of representation.  I thought this was the intention of 
provenance ASN, but we still seem to be discussing it in terms of concrete 
representations, which I don't think is helping us to be clear about the issues.

[2] 
http://www-formal.stanford.edu/jmc/towards/node12.html#SECTION000120000000000000000
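
To make the abstract/concrete distinction tangible, here is a minimal sketch in 
Python.  Everything in it is invented for illustration (the EntityRecord class, 
the ex: names, and both output notations); none of it is taken from the PROV 
drafts.  The point is only that one abstract structure can be written down in 
more than one concrete form without losing anything structural.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class EntityRecord:
    """Abstract structure only: what a record contains, not how it is written."""
    entity: str        # name of the entity being described (hypothetical)
    attributes: tuple  # sorted (name, value) pairs

def to_functional(rec):
    """One concrete form: a functional-style notation meant for human readers."""
    attrs = ", ".join(f'{name}="{value}"' for name, value in rec.attributes)
    return f"entity({rec.entity}, [{attrs}])"

def to_json(rec):
    """Another concrete form: JSON text meant for machines."""
    return json.dumps({"entity": rec.entity, "attributes": dict(rec.attributes)})

def from_json(text):
    """Recover the abstract structure from the JSON form."""
    data = json.loads(text)
    return EntityRecord(data["entity"], tuple(sorted(data["attributes"].items())))

rec = EntityRecord("ex:report", (("ex:format", "pdf"), ("ex:version", "2")))

print(to_functional(rec))
print(to_json(rec))

# The mapping is (more or less) isomorphic: the round trip loses nothing.
assert from_json(to_json(rec)) == rec
```

Any statement about the essential properties of a record can be made against 
EntityRecord alone, without mentioning either notation.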


(b) naming things in the domain of discourse vs things in the representation 
language.  This conflation seems to be at the heart of much of the discussion 
about identifiers for entities vs identifiers for entity records.  To my mind, 
entity records DO NOT exist in the domain of discourse, hence any form of 
identification used for these must be distinct from names of things in the 
domain of discourse (entities, etc.).  We use URIs to name things in the domain 
of discourse.  I think it is a mistake to use URIs to identify things in the 
language (i.e. entity records), if any separate identification is needed. 
Personally, I think that (entire) entity records can serve as their own 
identifiers, and that no further identification is needed.  But I recognize 
concise identifiers are sometimes useful as a convenience to stand in for more 
complex structures, so if entity record identifiers are to be introduced I think 
they should NOT be URIs, and should not carry through to other serializations.
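
As a rough illustration of what I mean by a record serving as its own 
identifier, here is a small Python sketch.  The EntityRecord class, the record() 
helper and the example.org URI are all assumptions made up for this message. 
Records are compared by their content, so two records that say the same thing 
are the same record, and no separate name is needed to establish that.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EntityRecord:
    entity: str            # URI naming a thing in the domain of discourse
    attributes: frozenset  # (name, value) pairs asserted about that thing

def record(entity, **attrs):
    """Hypothetical helper: build a record from keyword attributes."""
    return EntityRecord(entity, frozenset(attrs.items()))

r1 = record("http://example.org/report", version="2", format="pdf")
r2 = record("http://example.org/report", format="pdf", version="2")
r3 = record("http://example.org/report", version="3", format="pdf")

# Same content, so the same record; only two distinct records exist here.
assert r1 == r2
assert len({r1, r2, r3}) == 2

# A record can be referred to by using the record itself, e.g. as a key.
notes = {r1: "asserted by the author"}
assert notes[r2] == "asserted by the author"
```

Note that the only URI in sight names the entity (the report), not the record 
that describes it.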

There is a parallel here with the RDF abstract syntax.  In RDF abstract syntax 
there are graph nodes that denote things in the domain of discourse. 
Some of these nodes *are* URIs (NOTE: not *labelled* with URIs).  Other nodes 
are literals - i.e. arbitrary strings of characters with optional language and 
datatyping.  And other nodes are blank nodes.  URI nodes are names, which denote 
things in the domain of discourse according to some interpretation - a mapping 
from URIs to things.  Literals denote things in the domain of discourse 
according to datatype-defined mappings.  Plain literals are self-denoting; i.e. 
the datatype mapping is the identity mapping.  Blank nodes denote things as 
existential variables: there is no fixed mapping, but whatever they denote is 
constrained by the semantics of the expressions in which they appear.  This is 
all the abstract syntax has to say about nodes; no more is needed.

But, when RDF graphs are serialized, some device is needed to encode when 
different blank nodes in a serialization are actually the same node.  RDF/XML 
and other serializations of RDF use the notion of a blank node identifier.  The 
node identifiers do not themselves denote anything in the domain of discourse; 
they are simply a device to represent the structure of an RDF graph, and the 
relationships between nodes that do denote things.  They are simply an artifact 
of a serialization.  (In a purely graphical representation of RDF, it is the 
position of a node on a page that distinguishes it from other nodes - no label 
is needed.)
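
A small Python sketch of this point, using a simplified one-triple-per-line 
notation that I have made up for the purpose (it is not real N-Triples, and the 
ex: names are invented).  Two documents that differ only in their choice of 
blank node labels describe the same graph; the brute-force comparison below is 
only meant for tiny examples.

```python
from itertools import permutations

def parse(lines):
    """Parse 'subject predicate object' lines into a set of triples."""
    return {tuple(line.split()) for line in lines}

def is_blank(term):
    return term.startswith("_:")

def blank_labels(graph):
    return sorted({t for triple in graph for t in triple if is_blank(t)})

def same_graph(a, b):
    """True if some one-to-one renaming of blank node labels turns a into b."""
    labels_a, labels_b = blank_labels(a), blank_labels(b)
    if len(labels_a) != len(labels_b):
        return False
    for perm in permutations(labels_b):
        rename = dict(zip(labels_a, perm))
        renamed = {tuple(rename.get(t, t) for t in triple) for triple in a}
        if renamed == b:
            return True
    return False

doc1 = ["_:x ex:name ex:Alice", "_:x ex:knows _:y", "_:y ex:name ex:Bob"]
doc2 = ["_:b1 ex:knows _:b0", "_:b0 ex:name ex:Bob", "_:b1 ex:name ex:Alice"]
doc3 = ["_:x ex:name ex:Alice", "_:x ex:knows _:x", "_:x ex:name ex:Bob"]

assert same_graph(parse(doc1), parse(doc2))      # labels differ, graph does not
assert not same_graph(parse(doc1), parse(doc3))  # structure differs
```

The labels _:x, _:y, _:b0 and _:b1 carry no meaning of their own; only the 
pattern of sharing they encode is part of the graph.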

In the case of provenance ASN, I think entity records are roughly like RDF 
nodes, and entity record identifiers (if they are used) are merely devices for 
talking about the structure of provenance ASN expressions and do not of 
themselves represent anything in the domain of discourse.  Crucially, I think 
that entity record identifiers should not appear in different serializations of 
the provenance model, such as RDF.  Thus, we shouldn't be using URIs for entity 
records, if we use any form of identifier for them.
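
Here is a sketch of what I have in mind, again in Python with invented names 
(to_triples, the record labels, and the example.org URI are all assumptions, 
and the triples are plain tuples rather than any real RDF API).  Records carry 
local, non-URI labels for convenience within one expression; the mapping to 
triples uses only the entity URIs and discards the labels.

```python
def to_triples(records):
    """Map labelled records to triples about the entities they describe.

    `records` maps a local record label to (entity_uri, {attribute: value}).
    """
    triples = set()
    for _label, (entity, attrs) in records.items():  # label deliberately unused
        for name, value in attrs.items():
            triples.add((entity, name, value))
    return triples

# The same record, written with two different local labels.
records_a = {"r1": ("http://example.org/report", {"ex:version": "2"})}
records_b = {"rec-42": ("http://example.org/report", {"ex:version": "2"})}

# The choice of label leaves no trace in the other serialization.
assert to_triples(records_a) == to_triples(records_b)
assert to_triples(records_a) == {("http://example.org/report", "ex:version", "2")}
```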

...

There's a particular issue I haven't addressed so far: naming of entities that 
appear in different accounts - i.e. scoping of names to provenance accounts.

As far as I'm aware, a main concern is that if the same entity appears in 
different accounts, then one cannot apply constraints like saying that an entity 
is generated by at most one process execution.  Personally, in the face of 
multiple accounts, I think this is a pointless and unhelpful constraint - i.e. 
if the constraint is relaxed, then one reason for scoping names to accounts is put 
aside.  Are there others?

A possible approach would be to not use URIs to name things in the ASN, and use 
other mechanisms to assign appropriate URI names when mapping to RDF.  Lacking a 
concrete use case, I can't expand further on this notion.

#g
--

Received on Thursday, 12 January 2012 23:39:56 UTC