Re: PROV-ISSUE-613 (prov-aq-draft-review): Review paq for release as last call working draft [Accessing and Querying Provenance] from Stian Soiland-Reyes on 2013-01-28 (public-prov-wg@w3.org from January 2013)

From: Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>
Date: Mon, 28 Jan 2013 00:36:01 +0000
To: Provenance Working Group <public-prov-wg@w3.org>
Message-ID: <CAPRnXtkm2w2i93aK+gchvCGt4JhiMVB9+uaVeahGeKsty+C72A@mail.gmail.com>
On Thu, Jan 17, 2013 at 11:35 AM, Stian Soiland-Reyes
<soiland-reyes@cs.manchester.ac.uk> wrote:

> Here is my partial review of the above document PROV-AQ.
> Due to travelling and sick days I have not been able to review section
> 4, 5, 6, nor appendices.


I am aware this is a bit late, so the comments below should only be
considered advisory, as I know the next draft of AQ is still being
edited. I have therefore not raised any blockers.


Please find below my remaining review of sections 4, 5, 6 and the
appendices of the PROV-AQ editor's draft
https://dvcs.w3.org/hg/prov/raw-file/b3f397c7b15c/paq/prov-aq.html



Summary:
========

PROV-AQ is a very interesting document, because it describes how to
connect provenance to the world, or more specifically to resources on
the Internet. For my own domain of scientific workflow preservation,
there is a particular need for this kind of standardization as
currently there is no recognized mechanism for a service to provide
provenance data in any form.

The core concepts of PROV-AQ are very easy to understand, simple to
use and clearly scoped. The document is however at times heavy to
read, as edge cases are often explored in detail before introducing
the main concepts and how a functionality is to be used.

The terminology is a bit odd compared to the rest of the PROV
documents. I particularly wonder why the authors are using the term
target-URI rather than entity-URI; however, I understand this is
careful treading, as in this particular document there is necessarily
a lot of talk about *resources*.

It is unclear as to whether PROV-AQ can and should be used for finding
non-PROV provenance descriptions, such as alternative models (OPM,
DCTerms), application-specific resources (logfiles, commit logs), and
human-readable documents (HTML, Word). My view: "PROV-AQ MAY be used
for such purposes, but PROV-AQ provenance descriptions SHOULD be
available as PROV. PROV SHOULD be represented as PROV-O RDF, and MAY
be represented in other W3C-specified PROV serializations."

I find that the section about the pingback service is out of scope for
a PROV-AQ service, and therefore suggest below (point 56) an alternative
approach where the pingback service simply receives links that a
provenance service may later return or include in its store. I don't
distinguish between 'forward' and 'backward' provenance, so for me
"has provenance" means I will find some provenance data where this
entity ("target-URI") is present - but the WG might have a different
view and could want to distinguish between the two directions, as
popular resources could accumulate a lot of forward traces.




Detailed review - numbering continues from previous email:
=======


4. Provenance query service

35) "the naming authority associated with the target-URI is not the
same as the service offering provenance descriptions" - why is this a
problem?
"multiple services have provenance descriptions about the same
resource" - why is this a problem?
Neither of these seems like a problem given the earlier parts of this
specification. Section 3 specifically allows multiple provenance-URIs
and doesn't require these to be hosted at the same "naming authority".

I think what you are trying to say in these two is something like:

* "third-party providers of provenance descriptions who can't use the
mechanisms of Section 3 because the target-URI is outside their
control"


36) "the service associated with the target-URI is not accessible for
adding additional information when handling retrieval requests"
I don't know what this means.  Which service? Adding on retrieval? Not
accessible?


37) "query services may provide additional control over what
provenance is returned"
perhaps change "control" to "filters" - make it sound like a good
thing when there is too much provenance!


38) I suggest adding this consideration:
"query services may support more complex queries such as "which
entities were derived from entities attributed to agent X""


39) "such usage is not described here" -> ".. not described here"


40) "use the information obtained to query for required provenance."
...  add "according to the specified query mechanism"

41) "Dereferencing a provenance query service URI" --> "... service-URI"


42) "this specification does not preclude the use of non-RDF formats"
JSON-LD <http://json-ld.org/spec/latest/json-ld-syntax/> is growing in
popularity; should we perhaps propose a JSON-LD context? I think it
would be quite straightforward, and I actually managed to do it in
about 15 minutes (including learning the syntax).

If you try the JSON from https://gist.github.com/4565822 on
http://json-ld.org/playground/

( Obviously the "@context" here should be extracted and provided by us. )

You get the example from 4.1.3:

<http://example.com/service#direct>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://www.w3.org/ns/prov#DirectQueryService> .
<http://example.com/service#direct>
<http://www.w3.org/ns/prov#provenanceUriTemplate> "?target={+uri}" .
<http://example.com/service#sparql>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://www.w3.org/ns/sparql-service-description#Service> .
<http://example.com/service#sparql>
<http://www.w3.org/ns/sparql-service-description#endpoint>
<http://example.com/service/sparql/> .
<http://example.com/service#sparql>
<http://www.w3.org/ns/sparql-service-description#resultFormat>
<http://www.w3.org/ns/formats/RDF_XML> .
<http://example.com/service#sparql>
<http://www.w3.org/ns/sparql-service-description#resultFormat>
<http://www.w3.org/ns/formats/SPARQL_Results_CSV> .
<http://example.com/service#sparql>
<http://www.w3.org/ns/sparql-service-description#resultFormat>
<http://www.w3.org/ns/formats/SPARQL_Results_JSON> .
<http://example.com/service#sparql>
<http://www.w3.org/ns/sparql-service-description#resultFormat>
<http://www.w3.org/ns/formats/SPARQL_Results_TSV> .
<http://example.com/service#sparql>
<http://www.w3.org/ns/sparql-service-description#resultFormat>
<http://www.w3.org/ns/formats/SPARQL_Results_XML> .
<http://example.com/service#sparql>
<http://www.w3.org/ns/sparql-service-description#resultFormat>
<http://www.w3.org/ns/formats/Turtle> .
<http://example.com/service#sparql>
<http://www.w3.org/ns/sparql-service-description#supportedLanguage>
<http://www.w3.org/ns/sparql-service-description#SPARQL11Query> .
<http://example.com/service>
<http://www.w3.org/ns/prov#describesService>
<http://example.com/service#direct> .
<http://example.com/service>
<http://www.w3.org/ns/prov#describesService>
<http://example.com/service#sparql> .


Without the embedded @context the actual description can become very
tiny, and perhaps even skip some "@id"s to say:

{
  "@context": "http://www.w3.org/ns/prov-aq.jsonld",
  "service": [
     { "@type": "direct",
       "uritemplate": "?target={+uri}"
     } ,
     { "@type": "sparql",
       "endpoint": "http://example.com/service/sparql/"
     }
  ]
}

Note that above I've added a prov:describesService relation to relate
the services with the description, and so that I may make nested JSON.
(see #43 below)
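
To illustrate how little a client would need in order to read such a
compact description, here is a minimal Python sketch. It treats the
document as plain JSON and does not perform any JSON-LD expansion; the
context URI is the hypothetical one used above, and the function name
services_by_type is made up for this example.

```python
import json

# The compact description from above, treated as plain JSON.
# The "@context" URI is hypothetical and is never dereferenced here.
description = json.loads("""
{
  "@context": "http://www.w3.org/ns/prov-aq.jsonld",
  "service": [
     { "@type": "direct",
       "uritemplate": "?target={+uri}"
     } ,
     { "@type": "sparql",
       "endpoint": "http://example.com/service/sparql/"
     }
  ]
}
""")

def services_by_type(desc):
    """Group the nested service entries by their "@type" value."""
    grouped = {}
    for entry in desc.get("service", []):
        grouped.setdefault(entry["@type"], []).append(entry)
    return grouped

services = services_by_type(description)
print(services["direct"][0]["uritemplate"])
print(services["sparql"][0]["endpoint"])
```

A real JSON-LD processor would instead expand the short terms to full
prov: and sd: URIs using the context, which is what makes the nested
form equivalent to the triples listed earlier.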



43) As shown in the complete example in 4.1.3, the
ProvenanceQueryService is not connected to the DirectQueryService or
sd:Service. Given that services don't have a general name, it would be
difficult for implementers to know if a node in the graph is a service
or just happens to be further/additional data (for instance details
about the publisher of the service). It also means I can't mention a
service at all without implying that I am somehow providing it as part
of my service description.

I therefore suggest that the ProvenanceQueryService should link to the
services using a term like prov:describesService - see modified
example:


@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix sd: <http://www.w3.org/ns/sparql-service-description#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<> a prov:ProvenanceQueryService ;
    prov:describesService <#direct>, <#sparql> ;
    dcterms:publisher <#us> .

<#us> a foaf:Organization ;
   foaf:name "and not a service!" .

<#direct> a prov:DirectQueryService ;
  prov:provenanceUriTemplate "?target={+uri}"
  .
<#sparql> a sd:Service ;
    sd:endpoint </sparql/> ;
    sd:supportedLanguage sd:SPARQL11Query .


The added advantage of this is that you can use the bnode shorthand
when you don't quite know or care what to call your service
entries:

<> a prov:ProvenanceQueryService ;
    prov:describesService
      [ a prov:DirectQueryService ;
        prov:provenanceUriTemplate "?target={+uri}" ] ,
      [ a sd:Service ;
        sd:endpoint </sparql/> ;
        sd:supportedLanguage sd:SPARQL11Query ] .


44) I suggest renaming the verbose prov:ProvenanceQueryService to
prov:ServiceDescription. We don't need to say Provenance because of
the namespace. It's also not a service itself, just descriptions. This
avoids confusion whether the DirectQueryService is a
ProvenanceQueryService. Combined with the prov:describesService from
above, the distinction should be clear.


45) This protocol typically combines the target-URI with the
service-URI to formulate an HTTP GET request, according to the
following convention:

Typically..? Is this not meant to *define* the protocol? Remove "typically".


46) "provenance description for the resource-URI"
 - while I like "resource-URI" over "target-URI" (and perhaps
entity-URI even more) - I think this is a typo.  --> target-URI


47) "Any server that implements this protocol and receives a request
URI in this form SHOULD return a provenance description for the
resource-URI embedded in the query component, where that URI is the
result of percent-decoding the value associated with the
provenance-resource key" - a rather heavy and cryptic sentence. What is
"the value associated with the provenance-resource key"?


48) "If the supplied resource-URI includes a fragment identifier, the
'#' MUST be %-encoded as %23 when constructing the provenance-URI
value; similarly, any '&' character in the resource-URI must be
%-encoded as %26 [[RFC3986]]."  - I am a bit uncertain about this -
are you implying that only those characters need to be escaped? What
about "%"? It should be clearly specified if a URL like
http://example.com/with%20spaces should be sent along as-is with %20,
or double-encoded as %2520.  I agree that it's very important to
highlight that # and & must be %-encoded as they would otherwise fall
out - but the regular encoding should also be clearly stated here.
As this is getting a bit long, perhaps split it into a second paragraph
which is only about encoding. (I.e. the first paragraph says what is to
be returned, etc.; the second paragraph only details the URI encoding.)
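
To make the encoding question concrete, here is a minimal Python sketch
of one possible reading, in which the whole target-URI is percent-encoded
as a single query parameter value, so that an existing %20 is
double-encoded as %2520. The function name, the "target" parameter name
and the service URI are assumptions for illustration only; the draft
does not define them this way.

```python
from urllib.parse import quote, urlsplit, parse_qs

def provenance_uri(service_uri, target_uri, key="target"):
    """Embed target_uri as a single query parameter value.

    safe="" makes quote() escape every reserved character, including
    '#', '&', '/' and '%' itself (so '%20' becomes '%2520').
    """
    return f"{service_uri}?{key}={quote(target_uri, safe='')}"

# Hypothetical target-URI containing '%', '&' and '#'.
target = "http://example.com/with%20spaces?a=1&b=2#frag"
uri = provenance_uri("http://example.com/service", target)
print(uri)

# A server reverses exactly one level of percent-encoding.
recovered = parse_qs(urlsplit(uri).query)["target"][0]
print(recovered == target)
```

Under this reading the round trip is lossless, because the server
decodes exactly once. If instead %20 were sent along as-is, the server
could not tell whether the original target-URI contained a space or a
literal "%20".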



49) "If the provenance described by the request does not exist in the
server, a 404 Not Found response code SHOULD be returned."

This section does not define other error conditions, like what the
server should do if access is restricted. Obviously the regular HTTP
status codes apply, but it might be worth pointing out that the server
is not required to make such responses public - so it might for
instance require authentication with 401, or 'hide' the existence of a
response with 404. " This status code is commonly used when the server
does not wish to reveal exactly why the request has been refused, or
when no other response is applicable.".

This is probably out of scope, but I was thinking that it could be
useful if the server could return 403 Forbidden, for instance because
it refuses to give provenance details for resources that are not its
own (not under example.com, for instance). It could return a
text/uri-list of the base URIs which the server will support.
(This is a slight abuse of text/uri-list because there might be no
resource with that particular URI - more appropriate would be a list
of URI templates, but there is no media type for that.)


50) "does not exist in the server"  --> change to "is unknown to the
server" - as there is no requirement that the provenance resource is
on the same server. (and neither should there be!)


51) "should be capable of returning RDF using the vocabulary defined
by [PROV-O], in any standard RDF serialization (e.g. RDF/XML), or any
other standard serialization of the Provenance Model specification
[PROV-DM]."  - change both "any" to "a" - only one of them is needed,
not all - which "any" might imply!


52) "other standard serialization (..) PROV-DM"  - Is this something
we've defined somewhere? How would you know if say PROV JSON is a
standard serialization?


53) "A provenance query service SHOULD  be capable of returning RDF
... , or any other standard serialization of the Provenance Model
specification"
- it is unclear if second part is covered by the SHOULD or not.   I
can see 4 interpretations:


a) Service SHOULD return PROV-O RDF, and MAY return other PROV serializations

b) Service SHOULD return ( either PROV-O RDF or other PROV serialization )

c) Service SHOULD return at least one of ( PROV-O RDF, other PROV
serialization)  (ie.  simply "one of the PROV serializations")

d) Service SHOULD return PROV-O RDF.   Other PROV serializations could
be used. (no MAY/SHOULD).


I would recommend a) above - as then the clients would have some
reasonable expectation about what is generally supported, rather than
having to build in support for PROVXML, PROV-N, etc. just because they
are all covered by the same SHOULD of b).



54) "Previously, section 3. Locating provenance descriptions has
described use of HTTP Link: header fields and HTML <link> elements to
indicate provenance query services. Beyond that, this specification
does not define any specific mechanism for discovering query services.
"  - this forgot about section 3.3 Resource represented as RDF.


> 5. Forward provenance

>   S: Link: <http://acme.example.org/pingback/super-widget>;
>           rel=http://www.w3.org/ns/prov#provPingback

55) I would rename this to just "pingback" - why the double "prov"?

>           rel=http://www.w3.org/ns/prov#pingback



>  A consumer of the resource, or some other system, may perform an HTTP POST operation to the pingback URI where the POST request body contains provenance in one of the recognized provenance description formats. For interoperability, a ping-back receiving service should be able to accept at least PROV-O provenance presented as RDF/XML or Turtle.


56) I think this kind of "provenance posting" (and hence intended
provenance-URI creation) sounds out of scope for a pingback service
and probably also for this whole document. There are many existing
protocols on how to manage and create resources, such as AtomPub,
WebDav (uggh..), SFTP, etc. I don't think we need to go into that area
and define yet another way to create HTTP resources.


I would not expect to have to post my actual provenance to the
service, which implies that the service should then keep this and
present it willy-nilly to others as its own.  This document also does
not say much about what the server is or is not expected to do with
this, or how it can refuse provenance which it does not like or permit.


I would rather think that a pingback service should work like
pingbacks in blogs, where the pingback simply gives the blog a URI of
a third-party site which talks about a given blog post at the pingback
host.


So I should just be posting the URI of a provenance-resource that has
the resource as a target-URI.  The service would then be able to respond
to queries about that target-URI, giving links to the posted
provenance-URI. Then at my provenance-URI I can have provenance that is
as big and as wrong as I like, without affecting the provenance service.

Here is my alternative proposal for this section:


(Same blurb about finding prov:pingback)

  C: HEAD http://acme.example.org/super-widget HTTP/1.1

  S: 200 OK
  S: Link: <http://acme.example.org/pingback/super-widget>;
           rel=http://www.w3.org/ns/prov#pingback
   :
(as before - each resource MAY also have a prov:pingback relation)


A client MAY post a pingback request to any of the returned prov:pingback URIs:


  C: POST http://acme.example.org/pingback/super-widget HTTP/1.1
(I prefer this style as we're doing HTTP/1.1, btw)
  C: Content-Type: text/uri-list
  C:
  C: http://wile-e.example.org/contraption/provenance
  C: http://wile-e.example.org/another/provenance

  S: 204 No Content
  S: Link: <http://wile-e.example.org/contraption/provenance>;
           rel=http://www.w3.org/ns/prov#hasProvenance;
           anchor=http://acme.example.org/super-widget
  S: Link: <http://wile-e.example.org/another/provenance>;
           rel=http://www.w3.org/ns/prov#hasProvenance;
           anchor=http://acme.example.org/super-widget
  S: Link: <http://acme.example.org/pingback/super-widget>;
           rel=http://www.w3.org/ns/prov#pingback;
           anchor=http://acme.example.org/super-widget


The client request above indicates that the two provenance-URIs at
wile-e.example.org contain provenance mentioning
http://acme.example.org/super-widget or its target-URI (-- either
forward or backwards, we don't know).
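
As an illustration of the client side of this proposal, here is a
minimal Python sketch that builds (but does not send) such a pingback
request. The function name build_pingback_request is made up for this
example; the body follows the text/uri-list format, with one URI per
line and CRLF line endings.

```python
from urllib.request import Request

def build_pingback_request(pingback_uri, provenance_uris):
    """Build a POST carrying provenance-URIs as a text/uri-list body."""
    # text/uri-list: one URI per line, each line terminated by CRLF.
    body = "".join(u + "\r\n" for u in provenance_uris).encode("ascii")
    return Request(
        pingback_uri,
        data=body,
        method="POST",
        headers={"Content-Type": "text/uri-list"},
    )

# Nothing is sent over the network; the request is only constructed.
req = build_pingback_request(
    "http://acme.example.org/pingback/super-widget",
    [
        "http://wile-e.example.org/contraption/provenance",
        "http://wile-e.example.org/another/provenance",
    ],
)
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

Passing an empty list gives an empty body, which matches the case where
a client only wants to check the self-referential prov:pingback header
without submitting anything.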

The client MAY include provenance query services which can describe
the target-URI by including the corresponding {prov:hasQueryService}
Link headers. The anchor MUST be included, and SHOULD be the
target-URI of the resource which this pingback service belongs to,
unless the submitted query service would expect a different target-URI
to describe the given resource.

  C: POST http://acme.example.org/pingback/super-widget HTTP/1.1
  C: Link: <http://wile-e.example.org/sparql>;
rel="http://www.w3.org/ns/prov#hasQueryService";
anchor="http://acme.example.org/pingback/super-widget"
  C: Content-Type: text/uri-list
  C: Content-Length: 0
  C:

In the above example, the client did not submit any provenance-URIs
and the URI list is therefore empty.

The client MAY similarly include {prov:hasProvenance} Link headers to
specify a different anchor. The provenance-URIs of those headers MUST
also be included in the content if the POSTed Content-Type is
{text/uri-list}.



The pingback service MAY resolve the submitted URIs to validate and
check the provenance data, however reasonable care should be taken to
prevent malicious use of the pingback service for attacks such as
distributed denial of service (DDoS) and cross-site request forgery
(CSRF).

The server MAY, immediately or at a later time, include the submitted
*provenance-URI*s in responses to subsequent requests to the provenance
service for the target-URI. (insert usual blurb about not trusting
such provenance)

The server SHOULD include a self-referential prov:pingback Link
header, which MUST include the anchor for the target-URI this pingback
service corresponds to. This lets the client verify that it has
submitted a pingback to the correct service, in case it has followed
an untrusted prov:pingback Link header. For this purpose the client
MAY POST an empty text/uri-list to avoid side effects.


The server SHOULD indicate immediate acceptance by including the
corresponding {prov:hasProvenance} {Link} headers for the accepted
*provenance-URI*s. If all submitted provenance-URIs have been
immediately accepted, the server SHOULD respond with HTTP status {200
OK} or {204 No Content}.



If server acceptance is pending for any of the submitted URIs, for
instance because the provenance-URIs are being validated or due to be
approved by a moderator, the server SHOULD respond with HTTP status
{202 Accepted}, and only include corresponding {prov:hasProvenance}
{Link} headers for those provenance-URIs that have been immediately
accepted.

The server MAY respond with {401 Unauthorized} and standard
{{WWW-Authenticate}} headers if authentication is needed. The server
SHOULD respond with {403 Forbidden} if for any reason it refuses to
accept one or more of the submitted provenance-URIs or
provenance-service-URIs. If some URIs were accepted, but others were
refused, the server SHOULD respond with {403 Forbidden} and include
generated prov:hasProvenance and prov:hasQueryService Link headers for
the immediately accepted URIs.
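
The status code rules above can be sketched as a small Python function.
This is only an illustration of the proposal, not a definitive
implementation: the names Outcome and pingback_status are made up, 204
is picked arbitrarily from the allowed {200 OK} / {204 No Content} pair,
and letting 403 take precedence over 202 when some URIs are refused and
others are pending is my assumption, as the text above does not cover
that combination.

```python
from enum import Enum

class Outcome(Enum):
    ACCEPTED = "accepted"   # immediately accepted
    PENDING = "pending"     # awaiting validation or moderation
    REFUSED = "refused"     # server refuses to accept the URI

def pingback_status(outcomes):
    """Map per-URI outcomes to (HTTP status, URIs to echo as Link headers).

    outcomes is a dict from submitted URI to Outcome.
    """
    accepted = [u for u, o in outcomes.items() if o is Outcome.ACCEPTED]
    seen = set(outcomes.values())
    if Outcome.REFUSED in seen:
        status = 403  # assumption: refusal wins over pending
    elif Outcome.PENDING in seen:
        status = 202
    else:
        status = 204  # all accepted (or an empty submission)
    return status, accepted

print(pingback_status({
    "http://wile-e.example.org/contraption/provenance": Outcome.ACCEPTED,
    "http://wile-e.example.org/another/provenance": Outcome.PENDING,
}))
```

In every case only the immediately accepted URIs are returned for the
generated Link headers, which is what lets a client tell which of its
submissions went through.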


(The above needs to be cleaned up so that it talks equally about
prov:hasProvenance and prov:hasQueryService in the error handling -
and also to separate better the protocol and the example).


> 6. Security considerations

>  When retrieving a provenance URI from a document, steps should be taken to ensure the document itself is an accurate copy of the original whose author is being trusted (e.g. signature checking, or use of a trusted secure web service).

57) What is "document" above? Should this refer to section 3.2?


58) A paragraph should be added about cross-site request forgery and
distributed denial of service attacks, similar to my blurb above:

When clients and servers retrieve submitted URIs such as provenance
descriptions, or follow or register links, reasonable care should be
taken to prevent malicious use such as distributed denial of service
(DDoS) attacks, cross-site request forgery (CSRF), spamming and
hosting of inappropriate material. Reasonable precautions might
include same-origin policy, HTTP authorization, SSL, rate-limiting,
spam filters, moderation queues, user acknowledgements and
validation. It is out of scope for this document to specify how such
mechanisms work and should be applied.


> Provenance descriptions may provide a route for leakage of privacy-related information

59) We should also add something obvious like:

Accessing provenance services might reveal to the service and
third-parties information which is considered private, including which
resources a client has taken interest in. For instance, a browser
extension which collects all provenance data for a resource which is
being saved to the local disk, could be revealing user interest in a
sensitive resource to a third-party site listed by prov:hasProvenance
or prov:hasQueryService relation. A detailed query submitted to a
third-party provenance query service might be revealing personal
information such as social security numbers.


> B. Names added to prov: namespace

60) Broken definition links: DirectQueryService, provenanceURITemplate

61) Where can I download the OWL for the additional relations?

62) After table, add a note like "In addition, PROV-AQ reuses these
terms from the SPARQL service description vocabulary: sd:AA sd: BB"


> It is is tempting to think of prov:DirectQueryService as a particular kind of prov:ProvenanceQueryService (..)

63) This section can be deleted if you follow my previous suggestion
to rename the latter to prov:ServiceDescription and add
prov:describesService relation. (See 43/44 above)



> C. References

I have NOT checked the validity or correctness of most of these links.

Should not SPARQL-SD and URI-template be given as normative
references, as this specification depends on them?



-- 
Stian Soiland-Reyes, myGrid team
School of Computer Science
The University of Manchester
Received on Monday, 28 January 2013 00:36:50 UTC