- From: Graham Klyne <graham.klyne@zoo.ox.ac.uk>
- Date: Thu, 23 Feb 2012 13:16:20 +0000
- To: W3C provenance WG <public-prov-wg@w3.org>
Reviewing: http://dvcs.w3.org/hg/prov/raw-file/7aadc6332722/model/ProvenanceModel.html

Summary:

I'm sorry to say that I don't think the document even starts to bring in the kind of simplification discussed at the F2F meeting, which is required if this spec is to gain traction with web developers. I find the document is still difficult to read, and in a full morning of reviewing it I've only got as far as section 5. I think further *radical* simplification is required for the data model description, and I think it's possible without losing any essential information about the model.

...

(Nit: when I load this document from a local copy of the repository, I get an error reported indicating a problem with fetching the CSS. It loads OK from the above URI. Is there a problematic relative URI reference in the source document?)

...

I thought we'd agreed at F2F to provide a simple "scruffy" introduction to the DM (part 1), then introduce the requirement and refinements for more formally tractable provenance expressions that can be used to build accurate historical records over multiple related artifacts (part 2).

The document I'm reading does very little that I can see to make the prov-dm more approachable, as was indicated that we need to do at the F2F. As far as I can tell, the only thing that has been done in this direction is to *add* a new section on interpretation. This, of itself, does nothing to simplify the DM description.

I think we should be placing far more emphasis on making it as simple as we possibly can for information providers to publish provenance. Consider that the primary beneficiaries of provenance information are the *consumers* of published information, not the *publishers*, so if we make life unnecessarily hard for publishers we're shooting ourselves in the collective foot.
From this, I think the initial introduction to the DM needs to be radically simplified to the extent that a developer can spend 10-15 minutes glancing at it and think "oh yes, I can easily add this to my output data". If necessary, we push some of the work of understanding what needs to be done to harmonize the data to make it more suitable for building a historical record towards the consumer.

...

With this in mind:

Section 2:

The introductory material in section 2.1 is unhelpful, and I propose it be removed from the introduction. Most of this material is not important until we come to consider the more formal aspects of the DM, with the exception of 2.1.2.1 about events, which I think should be introduced in the PROV-DM core model section.

Similarly sections 2.2 and 2.3 (maybe moving the two introductory sentences of 2.2 into section 2.4).

Thus section 2 would become just a very brief intro to the notation used for describing ASN, and maybe this could be moved into the PROV-DM core section (sect 5).

Section 3 looks generally useful. But it still mentions an "account record", which I understood was being dropped.

It also mentions "alternateOf" and "specializationOf" which are not necessary for a "scruffy" introduction to provenance, so I suggest mention of these is dropped from here.

I suggest dropping the sentence about core and common relations - it's just noise.

With the removal of accounts, I think the whole purpose of notes/annotation records *as part of the provenance model* has become moot, and suggest that these be dropped from the spec. There's nothing to prevent annotations being added to the provenance data as rdfs:comment or rdfs:label values.

I suggest dropping the mention of extensibility points: again, it's just noise at this point.

Section 4: to my mind, this example section adds no useful information and doesn't help understanding of the model (on account of being harder to follow than the ASN model description), and I suggest that it be dropped.
Alternatively, I suggest moving it to an appendix.

Section 5: this is the vital core of this document. Section 3 provides a very useful high-level overview, so this section can just get down to describing the constructs.

I note that ASN is mis-named: it's not really an *abstract* syntax notation; it's quite concrete, so it's more like a (technology-neutral) functional syntax notation. @@raise separate issue for this?

Section 5.1:

prov-dm is a data model, not an implementation, right? So why do we need to introduce "housekeeping constructs ... to facilitate their interchange"? Suggest dropping most of the discussion of "record container", and simply introduce the "recordContainer" and "namespaceDeclaration" productions along with production for "record".

Section 5.2.1: Entity record

Suggest drop "In PROV-DM, " - it's redundant.

Suggest the examples focus more on web documents, with "car" as more of an afterthought. Primary use will probably be to describe web documents, so let's keep this at front-of-mind?

Suggest dropping all mentions of "asserters viewpoint" and "situation in the world" - these don't matter for the "scruffy" view of provenance.

Suggest dropping the idea that the attributes somehow define the entity ("whose situation in the world is represented by the attribute-value pairs"). They're just there to provide information about the entity, and as hooks for interoperability. (I argued previously for dropping attributes completely, but was persuaded otherwise by the interoperability argument from the provenance challenges - don't try to make more of them.)

Suggest drop issue mentioning "characterization interval" - I think it's now a non-issue.

I think the issue of uniqueness of identifiers should be dealt with in the introduction to ASN, not under the individual elements.

Under "further considerations", suggest dropping all but 3rd and 6th bullets.
In the 6th bullet, I don't understand the stuff about "a namespace also declares the number of occurrences...". I have deep concern about what this might be trying to say. In any case, shouldn't this be covered under a description of the namespace, if needed?

I think the material about "activities" and "plans" really doesn't belong in this section.

Section 5.2.2 Activity record

Suggest drop "In PROV-DM, " - it's redundant.

Didn't we discuss replacing the start, end times by events? I don't recall the outcome - I'm just mentioning this in case it's been missed.

For the example, I suggest leading on something to do with information on the web.

It was a surprise to me to learn that PROV-DM has reserved attributes. If attributes are in the model to support interoperability with other provenance frameworks (which is my understanding from previous discussions), this feels like a poor design choice. Maybe it should be a separate parameter? In any case, I think the intent of this "subtyping" needs to be explained.

If this is to be a "scruffy" introduction, I think the reference to start-view-end is not needed here. In any case, the cross-reference is almost impossible to locate in a printed copy of the spec.

I think the issue of uniqueness of identifiers should be dealt with in the introduction to ASN, not under the individual elements.

Suggest dropping the "further considerations" bullets. Did we not agree that activities *would* be allowable as entities (especially if entities are just stuff that can be identified)?

Section 5.2.3, Agent record

Having introduced a framework for subtyping for activities, why not use the same approach for different types of agents ... especially considering that two major agent types are defined by reference to existing foaf definitions?

I suggest not asserting the claim that the agent types are mutually exclusive.

Suggest drop reference to "situation in the world".
Suggest drop discussion of inferences of agent records - if needed, they should come later along with a more formal ("non-scruffy") treatment of the data model.

Section 5.2.4, Note record

I think this should be dropped from the data model. I don't see that it serves any needed *provenance* function. "extra information" can be added by format-specific extensions. As such, this record type only adds noise to the specification.

Section 5.3.1.1 generation record

I believe the ASN syntax given verges on being ambiguous, and is unnecessarily tricky to parse by a human or machine consumer; e.g. consider:

  wasGeneratedBy(a,b)
  wasGeneratedBy(a,b,)

The presence of the trailing comma in the second example completely changes the parse tree productions associated with a and b. I think it would be much easier if ASN simply required a dummy activity identifier to be provided; i.e. don't make aidentifier optional. Indeed, rather than allowing optional identifiers anywhere in the ASN, one might use a placeholder (e.g. '_') for any unspecified identifier, which would make the overall syntax much more regular.

Since the id is used only for annotations, I suggest dropping it (see section 5.2.4 comment above).

If this is to be a "scruffy" introduction, I think the reference to generation-within-activity is not needed here. In any case, the cross-reference is almost impossible to locate in a printed copy of the spec. Suggest drop this.

Similarly, suggest dropping the structural constraint here.

Section 5.3.1.2 Usage record

Suggest drop "In PROV-DM, " - it's redundant.

Why is there an identifier for a usage record?

Suggest lead with example of consuming a web resource.

Suggest drop reference to annotation record (see above note about 5.2.4).

Suggest drop reference to interpretation here.

Section 5.3.2.1 Association record

Para 3: Suggest drop first sentence, and simplify; i.e. just say: "Activities may reflect the execution of a plan..."
Para 4: there's quite a bit of redundancy redundancy here. Suggest:
[[
A plan is the description of a set of actions or steps intended by one or more agents to achieve some goal. PROV-DM is not prescriptive about the nature of plans, their representation, the actions and steps they consist of, and their intended goals. A plan can be a workflow for a scientific experiment, a recipe for a cooking activity, or a list of instructions for a micro-processor execution. Plans are entities, which may have associated provenance. An activity may be associated with multiple plans, allowing for descriptions of activities initially associated with a plan, which was changed, on the fly, as the activity progresses. Plans can be successfully executed or they can fail. We expect applications to exploit PROV-DM extensibility mechanisms to capture the rich nature of plans and associations between activities and plans.
]]

Para 5: I see no value in cross-referencing the responsibility record here. Suggest dropping this paragraph.

Why is there an identifier for an association record?

Section 5.3.2.2 Start and End records

This seems to overlap with start, end parameters on an activity. It's not immediately clear how they play together.

Should this record not describe an "event"? Then the id should identify the start/end event, not the record.

cf. Issue 207. Identifiers should denote activities and agents, *not records*.

Section 5.3.3.1 Responsibility record

Suggest drop "To promote take-up... " and instead lead with a simple introduction of what the record describes.

Para 3: It seems to me that the responsibility record should stand independently of any association record. Suggest drop "Given an activity association record... (...)"

Why is there an identifier for a responsibility record?

Section 5.3.3.2 Derivation record

Suggest drop "In PROV-DM, "

This whole section seems way too complicated.
My understanding is that the "Common relations" section is intended to cover those useful short-cut expressions that can be expressed with less convenience in the core model. As such, I think the derivation record should be a "common" rather than a "core" relation.

Aside from that, I really don't see the utility of all this stuff about precise and imprecise derivations. I think there is just one useful relation to define, roughly corresponding to "imprecise n-derivation record" here:

- I note that the "imprecise 1-derivation record" and "imprecise n-derivation record" are not syntactically distinguishable, so there's no point in discussing the difference.

- the "precise 1-derivation record" can be expressed using an activity, usage and generation record: I'm not convinced this alternative syntax is really buying anything worthwhile.

Suggest radical simplification along these lines, and move to section 6. Don't introduce all the formal stuff until a later section handling more formal treatments.

Section 5.3.3.3 Alternate and Specialization records

In considering a "scruffy" view of provenance, these relations aren't really needed. However, they do underpin a more formal treatment in the face of dynamic resources. I would give serious consideration to introducing these later, when the more formal treatment of dynamic resources is considered.

Section 5.3.4. Annotation record

I think this serves no needed purpose, and should be dropped. (See earlier comments for section 5.2.4.)

Section 5.4.1 Account record

I understood we'd agreed to drop this.

Section 5.4.2 Record container

I think this is mainly an artifact of the ASN syntax, and should be introduced more briefly in the introductory section 5.1 (see previous comments).

Section 5.5.1 Attribute

I think the "optional-attribute-value" productions covered in section 5.2.1 (Entity) should be covered here since they apply to multiple record types.
I would prefer to see attribute names presented as being IRIs in the data model, with the namespace-qualified CURIE syntax available as a convenience in the ASN presentation.

I think the predefined attribute names should be dealt with in a separate section. I'm actually not convinced this is the best design choice for properties with DM-defined meaning, as opposed to (say) using separate record parameters, but that's more of a style issue than a fundamental objection.

As indicated earlier, I think the whole discussion of derivation steps is too much detail, and I don't see the value, and would suggest dropping the prov:steps attribute.

For attribute prov:label: why not just use rdfs:label?

Section 5.5.2 Identifiers

The text says they are *qualified* names, but in most of the examples they are not. Also, some identifiers are described as having local scope: this is not compatible with using *qualified* names which are essentially IRIs.

The text describes identifiers as denoting *records* (e.g. entity record) - I think this is wrong, and in any case is inconsistent with text elsewhere in the document. They should denote "entity", "activity", "agent", etc.

Section 5.5.3 Literal

"A PROV-DM Literal represents a value whose interpretation is outside the scope of PROV-DM."

What a Terrible Failure... the whole point of languages introducing literals is precisely that their interpretation *is* defined by the language. If not, they might as well be names. I think the intent is that their interpretation is defined by reference to the corresponding xsd datatype definition, or some other datatype definition, that is effectively incorporated by reference.

I'd suggest that an interpretation of literals is provided by:
- http://www.w3.org/TR/rdf-mt/#gddenot
- http://www.w3.org/TR/rdf-mt/#DTYPEINTERP

Section 5.5.4 Time

No syntax production provided or indicated.

I think it's unnecessary and inappropriate to indicate where time is used.
It's just something to go wrong as the document evolves.

Section 5.5.5 Asserter

Do we really still need this (now accounts are gone)? Suggest dropping.

Section 5.5.6 Namespace

I'd suggest covering this with the introduction of the record container syntax production.

Section 5.5.7 Location

Do we have any explicit use of this? If not, I'd suggest dropping it.

...

I'm out of time and stopping my review here. There's a general pattern here that I'd also apply to section 6.

I'd then take section 7 and (probably) expand it into several sections ("Part 2") introducing and describing a more formal treatment of provenance that can be used to bridge from and refine the "scruffy" view to something that can be assembled and processed according to inferences that flow from the formal semantics. A key point to introduce here would be that it is possible to create provenance statements that cannot possibly satisfy the formal semantics, and to indicate what additional constraints and disciplines should be applied to ensure that they can (and hence to make the inferences that flow from those semantics valid).

#g
--
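
Illustration for the section 5.3.1.1 comment on optional identifiers: a minimal sketch, assuming a simplified record form name(arg, arg, ...) in which every position is always present and '_' stands for an unspecified identifier. The names parse_record and RECORD, and the argument order in the usage example, are hypothetical and are not taken from the PROV-DM ASN grammar; the point is only that a regular syntax needs no look-ahead at trailing commas.

```python
import re

# Hypothetical simplified record form: name(arg, arg, ...).
# This is NOT the PROV-DM ASN grammar, just an illustration of the
# "placeholder instead of optional identifier" idea.
RECORD = re.compile(r"^\s*(\w+)\s*\(([^()]*)\)\s*$")


def parse_record(text):
    """Parse 'name(a,b,_)' into (name, [args]); a '_' argument becomes None."""
    match = RECORD.match(text)
    if match is None:
        raise ValueError(f"not a record: {text!r}")
    name, body = match.groups()
    args = [part.strip() for part in body.split(",")]
    # An empty position (e.g. a trailing comma) is rejected outright, so the
    # meaning of each argument never depends on what follows it.
    if any(part == "" for part in args):
        raise ValueError(f"empty argument position in {text!r}; use '_' instead")
    return name, [None if part == "_" else part for part in args]
```

With this form, parse_record("wasGeneratedBy(_,e1,a1)") yields the name plus three positions with the first one unspecified, while "wasGeneratedBy(a,b,)" is rejected rather than silently reinterpreting a and b.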
Received on Thursday, 23 February 2012 13:18:01 UTC