- From: Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>
- Date: Mon, 28 Jan 2013 00:36:01 +0000
- To: Provenance Working Group <public-prov-wg@w3.org>
On Thu, Jan 17, 2013 at 11:35 AM, Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk> wrote:

> Here is my partial review of the above document PROV-AQ.
> Due to travelling and sick days I have not been able to review section
> 4, 5, 6, nor appendices.

I am aware this is a bit late - so the below should only be considered advisory, as I know the next draft of AQ is still being edited. I have therefore not raised any blockers.

Please find below my remaining review of sections 4, 5, 6 and the appendices of the PROV-AQ editor's draft:

https://dvcs.w3.org/hg/prov/raw-file/b3f397c7b15c/paq/prov-aq.html

Summary:
========

PROV-AQ is a very interesting document, because it describes how to connect provenance to the world, or more specifically to resources on the Internet. For my own domain of scientific workflow preservation, there is a particular need for this kind of standardization, as currently there is no recognized mechanism for a service to provide provenance data in any form.

The core concepts of PROV-AQ are very easy to understand, simple to use and clearly scoped. The document is however at times heavy to read, as edge cases are often explored in detail before introducing the main concepts and how a functionality is to be used.

The terminology is a bit odd compared to the rest of the PROV documents. I particularly wonder why the authors are using the term target-URI rather than entity-URI; however I understand this is careful treading, as in this particular document there is necessarily a lot of talk about *resources*.

It is unclear whether PROV-AQ can and should be used for finding non-PROV provenance descriptions, such as alternative models (OPM, DCTerms), application-specific resources (logfiles, commit logs), and human-readable documents (HTML, Word).

My view: "PROV-AQ MAY be used for such purposes, but PROV-AQ provenance descriptions SHOULD be available as PROV.
PROV SHOULD be represented as PROV-O RDF, and MAY be represented in other W3C specified PROV serializations."

I find that the section about the pingback service is out of scope for a PROV-AQ service, and therefore below (point 56) suggest an alternative approach where the pingback service simply receives links that a provenance service may later return or include in its store.

I don't distinguish between 'forward' and 'backward' provenance, so for me "has provenance" means I will find some provenance data where this entity ("target-URI") is present - but the WG might have a different view and could want to distinguish between the two directions, as popular resources could accumulate a lot of forward traces.

Detailed review - numbering continues from previous email:
=======

4. Provenance query service

35) "the naming authority associated with the target-URI is not the same as the service offering provenance descriptions" - why is this a problem?

"multiple services have provenance descriptions about the same resource" - why is this a problem?

Neither of these seems like a problem given the earlier parts of this specification. Section 3 specifically allows multiple provenance-URIs and doesn't require these to be hosted at the same "naming authority".

I think what you are trying to say in these two is something like:

* "third-party providers of provenance descriptions who can't use the mechanisms of Section 3 because the target-URI is outside their control"

36) "the service associated with the target-URI is not accessible for adding additional information when handling retrieval requests"

I don't know what this means. Which service? Adding on retrieval? Not accessible?

37) "query services may provide additional control over what provenance is returned"

Perhaps change "control" to "filters" - make it sound like a good thing when there is too much provenance!
38) I suggest adding a consideration: "query services may support more complex queries such as "which entities were derived from entities attributed to agent X""

39) "such usage is not described here" -> ".. not described here"

40) "use the information obtained to query for required provenance." ... add "according to the specified query mechanism"

41) "Dereferencing a provenance query service URI" --> "... service-URI"

42) "this specification does not preclude the use of non-RDF formats"

JSON-LD <http://json-ld.org/spec/latest/json-ld-syntax/> is growing in popularity; should we perhaps propose a JSON-LD context? I think it would be quite straightforward, and I actually managed to do it in about 15 minutes (including learning the syntax).

If you try the JSON from https://gist.github.com/4565822 on http://json-ld.org/playground/ (obviously the "@context" here should be extracted and provided by us), you get the example from 4.1.3:

    <http://example.com/service#direct> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#DirectQueryService> .
    <http://example.com/service#direct> <http://www.w3.org/ns/prov#provenanceUriTemplate> "?target={+uri}" .
    <http://example.com/service#sparql> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/sparql-service-description#Service> .
    <http://example.com/service#sparql> <http://www.w3.org/ns/sparql-service-description#endpoint> <http://example.com/service/sparql/> .
    <http://example.com/service#sparql> <http://www.w3.org/ns/sparql-service-description#resultFormat> <http://www.w3.org/ns/formats/RDF_XML> .
    <http://example.com/service#sparql> <http://www.w3.org/ns/sparql-service-description#resultFormat> <http://www.w3.org/ns/formats/SPARQL_Results_CSV> .
    <http://example.com/service#sparql> <http://www.w3.org/ns/sparql-service-description#resultFormat> <http://www.w3.org/ns/formats/SPARQL_Results_JSON> .
    <http://example.com/service#sparql> <http://www.w3.org/ns/sparql-service-description#resultFormat> <http://www.w3.org/ns/formats/SPARQL_Results_TSV> .
    <http://example.com/service#sparql> <http://www.w3.org/ns/sparql-service-description#resultFormat> <http://www.w3.org/ns/formats/SPARQL_Results_XML> .
    <http://example.com/service#sparql> <http://www.w3.org/ns/sparql-service-description#resultFormat> <http://www.w3.org/ns/formats/Turtle> .
    <http://example.com/service#sparql> <http://www.w3.org/ns/sparql-service-description#supportedLanguage> <http://www.w3.org/ns/sparql-service-description#SPARQL11Query> .
    <http://example.com/service> <http://www.w3.org/ns/prov#describesService> <http://example.com/service#direct> .
    <http://example.com/service> <http://www.w3.org/ns/prov#describesService> <http://example.com/service#sparql> .

Without the embedded @context the actual description can become very tiny, and perhaps even skip some "@id"s to say:

    {
      "@context": "http://www.w3.org/ns/prov-aq.jsonld",
      "service": [
        { "@type": "direct",
          "uritemplate": "?target={+uri}" },
        { "@type": "sparql",
          "endpoint": "http://example.com/service/sparql/" }
      ]
    }

Note that above I've added a prov:describesService relation to relate the services to the description, and so that I can write nested JSON (see #43 below).

43) As shown in the complete example in 4.1.3, the ProvenanceQueryService is not connected to the DirectQueryService or sd:Service. Given that services don't have a general name, it would be difficult for implementers to know whether a node in the graph is a service or just happens to be additional data (for instance details about the publisher of the service). It also means I can't mention a service at all without implying that I am somehow providing it as part of my service description.
I therefore suggest that the ProvenanceQueryService should link to the services using a term like prov:describesService - see modified example:

    @prefix prov: <http://www.w3.org/ns/prov#> .
    @prefix sd: <http://www.w3.org/ns/sparql-service-description#> .
    @prefix dcterms: <http://purl.org/dc/terms/> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .

    <> a prov:ProvenanceQueryService ;
        prov:describesService <#direct>, <#sparql> ;
        dcterms:publisher <#us> .

    <#us> a foaf:Organization ;
        foaf:name "and not a service!" .

    <#direct> a prov:DirectQueryService ;
        prov:provenanceUriTemplate "?target={+uri}" .

    <#sparql> a sd:Service ;
        sd:endpoint </sparql/> ;
        sd:supportedLanguage sd:SPARQL11Query .

The added advantage of this is that you can use the bnode shorthand when you don't quite know or care what to call your service entries:

    <> a prov:ProvenanceQueryService ;
        prov:describesService
            [ a prov:DirectQueryService ;
              prov:provenanceUriTemplate "?target={+uri}" ],
            [ a sd:Service ;
              sd:endpoint </sparql/> ;
              sd:supportedLanguage sd:SPARQL11Query ] .

44) I suggest renaming the verbose prov:ProvenanceQueryService to prov:ServiceDescription. We don't need to say Provenance because of the namespace. It's also not a service itself, just a description. This avoids confusion over whether the DirectQueryService is a ProvenanceQueryService. Combined with the prov:describesService from above, the distinction should be clear.

45) "This protocol typically combines the target-URI with the service-URI to formulate an HTTP GET request, according to the following convention:"

Typically..? Is this not meant to *define* the protocol? Remove "typically".

46) "provenance description for the resource-URI" - while I like "resource-URI" over "target-URI" (and perhaps entity-URI even more) - I think this is a typo.
--> target-URI

47) "Any server that implements this protocol and receives a request URI in this form SHOULD return a provenance description for the resource-URI embedded in the query component, where that URI is the result of percent-decoding the value associated with the provenance-resource key" - a rather heavy and cryptic sentence. What is "the value associated with the provenance-resource key"?

48) "If the supplied resource-URI includes a fragment identifier, the '#' MUST be %-encoded as %23 when constructing the provenance-URI value; similarly, any '&' character in the resource-URI must be %-encoded as %26 [[RFC3986]]." - I am a bit uncertain about this - are you implying that only those characters need to be escaped? What about "%"? It should be clearly specified whether a URL like http://example.com/with%20spaces should be sent along as-is with %20, or double-encoded as %2520.

I agree that it's very important to highlight that # and & must be %-encoded, as they would otherwise fall out - but the regular encoding should also be clearly indicated here. As this is getting a bit long, perhaps split it into a second paragraph which is only about encoding. (I.e. the first paragraph says what is to be returned, etc.; the second paragraph just details the URI encoding.)

49) "If the provenance described by the request does not exist in the server, a 404 Not Found response code SHOULD be returned."

This section does not define other error conditions, like what the server should do if access is restricted. Obviously the regular HTTP status codes apply, but it might be worth pointing out that the server is not required to make such responses public - so it might for instance require authentication with 401, or 'hide' the existence of a response with 404. As RFC 2616 says of 404: "This status code is commonly used when the server does not wish to reveal exactly why the request has been refused, or when no other response is applicable."
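To make the encoding question in 48 concrete, here is a minimal Python sketch of one possible reading: the client percent-encodes every reserved character of the target-URI, including "%", and the server decodes the query value exactly once. The service-URI, the `target` query key (borrowed from the 4.1.3 template) and both helper names are assumptions for illustration only, not something the draft defines.

```python
from urllib.parse import quote, urlsplit, parse_qs

# Assumed for illustration: neither the service-URI nor the query key
# is fixed by the draft under review.
SERVICE_URI = "http://example.com/service"
QUERY_KEY = "target"


def build_provenance_uri(target_uri: str) -> str:
    """Client side: encode the whole target-URI.

    safe="" means no reserved character is left as-is, so '#', '&'
    and '%' all get escaped (an existing %20 becomes %2520).
    """
    return f"{SERVICE_URI}?{QUERY_KEY}={quote(target_uri, safe='')}"


def extract_target_uri(provenance_uri: str) -> str:
    """Server side: percent-decode the query value exactly once."""
    query = urlsplit(provenance_uri).query
    return parse_qs(query, strict_parsing=True)[QUERY_KEY][0]


if __name__ == "__main__":
    for target in (
        "http://example.com/with%20spaces",
        "http://example.com/doc?a=1&b=2#section",
    ):
        provenance_uri = build_provenance_uri(target)
        print(provenance_uri)
        # The round trip must give back the target-URI unchanged.
        assert extract_target_uri(provenance_uri) == target
```

Under this reading http://example.com/with%20spaces is sent as ...%2520spaces and arrives back as %20 after one decode, so the target-URI survives unchanged. Sending the %20 as-is would instead leave the server unable to tell an encoded space from a literal one.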
Probably this is out of scope - but I was thinking that it could be useful if the server could return 403 Forbidden, for instance because it refuses to give provenance details for resources that are not 'its own' (not under example.com, for instance). It could return a text/uri-list of the base URIs which the server will support. (This is a slight abuse of text/uri-list because there might be no resource with that particular URI - more appropriate would be a list of URI templates, but there is no media type for that.)

50) "does not exist in the server" --> change to "is unknown to the server" - as there is no requirement that the provenance resource is on the same server. (And neither should there be!)

51) "should be capable of returning RDF using the vocabulary defined by [PROV-O], in any standard RDF serialization (e.g. RDF/XML), or any other standard serialization of the Provenance Model specification [PROV-DM]." - change both "any" to "a" - only one of them is needed, not all - which 'any' might imply!

52) "other standard serialization (..) PROV-DM" - Is this something we've defined somewhere? How would you know if, say, PROV JSON is a standard serialization?

53) "A provenance query service SHOULD be capable of returning RDF ... , or any other standard serialization of the Provenance Model specification" - it is unclear whether the second part is covered by the SHOULD or not. I can see 4 interpretations:

a) Service SHOULD return PROV-O RDF, and MAY return other PROV serializations
b) Service SHOULD return ( either PROV-O RDF or other PROV serialization )
c) Service SHOULD return at least one of ( PROV-O RDF, other PROV serialization ) (i.e. simply "one of the PROV serializations")
d) Service SHOULD return PROV-O RDF. Other PROV serializations could be used. (no MAY/SHOULD)

I would recommend a) above - as then the clients would have some reasonable expectation about what is generally supported, rather than having to build in support for PROVXML, PROV-N, etc.
just because they are all covered by the same SHOULD of b).

54) "Previously, section 3. Locating provenance descriptions has described use of HTTP Link: header fields and HTML <link> elements to indicate provenance query services. Beyond that, this specification does not define any specific mechanism for discovering query services." - this forgets about section 3.3 Resource represented as RDF.

> 5. Forward provenance

> S: Link: <http://acme.example.org/pingback/super-widget>;
> rel=http://www.w3.org/ns/prov#provPingback

55) I would rename this to just "pingback" - why the double "prov"?

> rel=http://www.w3.org/ns/prov#pingback

> A consumer of the resource, or some other system, may perform an HTTP POST operation to the pingback URI where the POST request body contains provenance in one of the recognized provenance description formats. For interoperability, a ping-back receiving service should be able to accept at least PROV-O provenance presented as RDF/XML or Turtle.

56) I think this kind of "provenance posting" (and hence intended provenance-URI creation) sounds out of scope for a pingback service and probably also for this whole document. There are many existing protocols for managing and creating resources, such as AtomPub, WebDAV (uggh..), SFTP, etc. I don't think we need to go into that area to define yet another way to create HTTP resources.

I would not expect to have to post my actual provenance to the service, which implies that the service should then keep this and present it willy-nilly to others as its own. This document also does not say much about what the server is expected (or not expected) to do with it, or how it can refuse provenance which it does not like or permit.

I would rather think that a pingback service should work like pingbacks in blogs, where the pingback simply gives the blog a URI of a third-party site which talks about a given blog post at the pingback host.
So I should just be posting the URI of a provenance resource that has the resource as a target-URI. The service would then be able to respond to queries about that target-URI, giving links to the posted provenance-URI. Then at my provenance-URI I can have provenance as big and as wrong as I like, without affecting the provenance service.

Here is my alternative proposal for this section:

(Same blurb about finding prov:pingback)

    C: HEAD http://acme.example.org/super-widget HTTP/1.1

    S: 200 OK
    S: Link: <http://acme.example.org/pingback/super-widget>; rel=http://www.w3.org/ns/prov#pingback
     :

(as before - each resource MAY also have a prov:pingback relation)

A client MAY post a pingback request to any of the returned prov:pingback URIs:

    C: POST http://acme.example.org/pingback/super-widget HTTP/1.1
    C: Content-Type: text/uri-list
    C:
    C: http://wile-e.example.org/contraption/provenance
    C: http://wile-e.example.org/another/provenance

    S: 204 No Content
    S: Link: <http://wile-e.example.org/contraption/provenance>; rel=http://www.w3.org/ns/prov#hasProvenance; anchor=http://acme.example.org/super-widget
    S: Link: <http://wile-e.example.org/another/provenance>; rel=http://www.w3.org/ns/prov#hasProvenance; anchor=http://acme.example.org/super-widget
    S: Link: <http://acme.example.org/pingback/super-widget>; rel=http://www.w3.org/ns/prov#pingback; anchor=http://acme.example.org/super-widget

(I prefer this style of request line as we're doing HTTP/1.1, btw)

The client request above indicates that the two provenance-URIs at wile-e.example.org contain provenance mentioning http://acme.example.org/super-widget or its target-URI (either forward or backward, we don't know).

The client MAY include provenance query services which can describe the target-URI by including the corresponding {prov:hasQueryService} Link headers.
The anchor MUST be included, and SHOULD be the target-URI of the resource which this pingback service belongs to, unless the submitted query service would expect a different target-URI to describe the given resource.

    C: POST http://acme.example.org/pingback/super-widget HTTP/1.1
    C: Link: <http://wile-e.example.org/sparql>; rel="http://www.w3.org/ns/prov#hasQueryService"; anchor="http://acme.example.org/pingback/super-widget"
    C: Content-Type: text/uri-list
    C: Content-Length: 0
    C:

In the above example, the client did not submit any provenance-URIs and the URI list is therefore empty.

The client MAY similarly include {prov:hasProvenance} Link headers to specify a different anchor. The provenance-URIs of those headers MUST also be included in the content if the POSTed Content-Type is {text/uri-list}.

The pingback service MAY resolve the submitted URIs to validate and check the provenance data; however, reasonable care should be taken to prevent malicious use of the pingback service for attacks such as distributed denial of service (DDoS) and cross-site request forgery (CSRF).

The server MAY, immediately or at a later time, include the submitted *provenance-URI*s in responses to subsequent requests to the provenance service for the target-URI. (insert usual blurb about not trusting such provenance)

The server SHOULD include a self-referential prov:pingback Link header, which MUST include the anchor for the target-URI this pingback service corresponds to. This allows the client to verify that it has submitted a pingback to the correct service, in case it has followed an untrusted prov:pingback Link header. The client MAY for this purpose POST an empty text/uri-list to avoid side effects.

The server SHOULD indicate immediate acceptance by including the corresponding {prov:hasProvenance} {Link} headers for the accepted *provenance-URI*s.
If all submitted provenance-URIs have been immediately accepted, the server SHOULD respond with HTTP status {200 OK} or {204 No Content}.

If server acceptance is pending for any of the submitted URIs, for instance because the provenance-URIs are being validated or are due to be approved by a moderator, the server SHOULD respond with HTTP status {202 Accepted}, and only include corresponding {prov:hasProvenance} {Link} headers for those provenance-URIs that have been immediately accepted.

The server MAY respond with {401 Unauthorized} and standard {WWW-Authenticate} headers if authentication is needed.

The server SHOULD respond with {403 Forbidden} if for any reason it refuses to accept one or more of the submitted provenance-URIs or provenance-service-URIs. If some URIs were accepted, but others were refused, the server SHOULD respond with {403 Forbidden} and include generated prov:hasProvenance and prov:hasQueryService Link headers for the immediately accepted URIs.

(The above needs to be cleaned up so that it talks equally about prov:hasProvenance and prov:hasQueryService in the error handling - and also to better separate the protocol from the example.)

> 6. Security considerations

> When retrieving a provenance URI from a document, steps should be taken to ensure the document itself is an accurate copy of the original whose author is being trusted (e.g. signature checking, or use of a trusted secure web service).

57) What is "document" above? Should this refer to section 3.2?

58) A paragraph should be added about cross-site request forgery and distributed denial of service attacks, similar to my blurb above:

When clients and servers are retrieving submitted URIs such as provenance descriptions and following or registering links, reasonable care should be taken to prevent malicious use such as distributed denial of service attacks (DDoS), cross-site request forgery (CSRF), spamming and hosting of inappropriate materials.
Reasonable precautions might include same-origin policy, HTTP authorization, SSL, rate-limiting, spam filters, moderation queues, user acknowledgements and validation. It is out of scope for this document to specify how such mechanisms work and should be applied.

> Provenance descriptions may provide a route for leakage of privacy-related information

59) We should also add something obvious like:

Accessing provenance services might reveal to the service and third parties information which is considered private, including which resources a client has taken an interest in. For instance, a browser extension which collects all provenance data for a resource that is being saved to the local disk could be revealing user interest in a sensitive resource to a third-party site listed by a prov:hasProvenance or prov:hasQueryService relation. A detailed query submitted to a third-party provenance query service might reveal personal information such as social security numbers.

> B. Names added to prov: namespace

60) Broken definition links: DirectQueryService, provenanceURITemplate

61) Where can I download the OWL for the additional relations?

62) After the table, add a note like "In addition, PROV-AQ reuses these terms from the SPARQL service description vocabulary: sd:AA sd: BB"

> It is is tempting to think of prov:DirectQueryService as a particular kind of prov:ProvenanceQueryService (..)

63) This section can be deleted if you follow my previous suggestion to rename the latter to prov:ServiceDescription and add the prov:describesService relation. (See 43/44 above)

> C. References

I have NOT checked the validity or correctness of most of these links.

Should SPARQL-SD and URI-template not be given as normative references, as this specification depends on them?

--
Stian Soiland-Reyes, myGrid team
School of Computer Science
The University of Manchester
Received on Monday, 28 January 2013 00:36:50 UTC