Re: Provenance Model from Stian Soiland-Reyes on 2015-09-30 (public-annotation@w3.org from September 2015)

From: Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>
Date: Wed, 30 Sep 2015 15:01:19 +0100
To: Robert Sanderson <azaroth42@gmail.com>
Cc: Web Annotation <public-annotation@w3.org>
Message-ID: <CAPRnXtnD5aHrjAvc4LPSb2DL6Mw9ySqSeZgdcvbYcw0z+sLZqg@mail.gmail.com>
What is the scope of "creating an annotation"? Does this include the
creation of the body?

dct:creator is (perhaps deliberately) vague about this - in that you
never quite know if it's the creator of the digital resource (uploader
or serializer of the file), its semantic content (structuring in its
current form) or its abstract knowledge (e.g. the statements that are
conveyed).



All of these statements could be seen as valid with dct:creator:

# The person that cropped and uploaded the JPEG
<https://commons.wikimedia.org/wiki/File:Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg>
dct:creator <https://commons.wikimedia.org/wiki/User:Dcoetzee> .

# The agency that took the photo in the gallery
<https://commons.wikimedia.org/wiki/File:Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg>
dct:creator <http://www.technologies.c2rmf.fr/> .

# The actual painter
<https://commons.wikimedia.org/wiki/File:Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg>
dct:creator <http://dbpedia.org/resource/Leonardo_da_Vinci> .



But depending on which one you go for, you will get quite a different
provenance trail. In PROV you can say these are all
prov:wasAttributedTo - and use a prov:wasDerivedFrom chain (and
possibly prov:specializationOf) to show the detailed provenance.
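
As a rough sketch of what such a chain could look like (the ex: IRIs
for the intermediate photo and the painting are made up for
illustration, and this is only one possible way to split the
attributions):

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix ex:   <http://example.org/> .   # hypothetical namespace, for illustration only

# The retouched JPEG: attributed to the uploader, derived from the photo
<https://commons.wikimedia.org/wiki/File:Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg>
    prov:wasAttributedTo <https://commons.wikimedia.org/wiki/User:Dcoetzee> ;
    prov:wasDerivedFrom ex:c2rmf-photo .

# The photo taken in the gallery: attributed to the agency, derived from the painting
ex:c2rmf-photo
    prov:wasAttributedTo <http://www.technologies.c2rmf.fr/> ;
    prov:wasDerivedFrom ex:mona-lisa-painting .

# The painting itself: attributed to the painter
ex:mona-lisa-painting
    prov:wasAttributedTo <http://dbpedia.org/resource/Leonardo_da_Vinci> .
```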

But it is hard to single out any one of them as 'the
creator' except in the basic case where they are all the same - e.g.
someone typed their own words into a web form and pushed
the Annotate button.




To make this more up to date, think of a YouTube re-upload, say "Mr
Politician MP says something embarrassing again", where we have:

a) The politician (who said something embarrassing, no matter which upload)
b) The audience member who filmed him in public
c) The (re)uploader of the video (after the first upload obviously got deleted)


Who is the 'creator' here? Computers will almost always tell you c).
Humans will tell you a).   People like me who care about attribution
will think about b) who was the brave one.  We should let annotation
systems provide you with a) and hopefully also a bit of b) and not
just be stuck with c).


Now to me, this means that dct:creator does not tell me much, because
different applications have widely different interpretations of
which one of these kinds of creator is meant.  To me it thus just says
"was somewhat involved with making some part of this resource" - which
is more of a contribution than a creation.


In the PAV ontology ( http://purl.org/pav/html ) we tried to clear
this up for normal bibliographic usage on the web by introducing:

- pav:createdBy who made the digital file - e.g. the bytes in the JPEG
if you like (in this case the wikimedia user Dcoetzee)

- pav:authoredBy for who made the "knowledge" that is somewhat
captured - (Leonardo da Vinci the painter).

- pav:curatedBy - someone who helped form the knowledge into its
current form, e.g. the c2rmf photographer

- pav:contributedBy - any other kind of "knowledge" contributions
(including author/curator above) - e.g. someone who made a hole in the
canvas [1]

[1] http://www.theguardian.com/world/2015/aug/25/boy-trips-in-museum-and-punches-hole-through-million-dollar-painting


All of these map to prov:wasAttributedTo and to dcterms:creator /
dcterms:contributor

See http://jbiomedsem.biomedcentral.com/articles/10.1186/2041-1480-4-37#Sec19
for discussion on issues with DC Terms for provenance :)
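
Applied to the Mona Lisa file from earlier, the first three PAV
properties could be used roughly like this (a sketch only; it reuses
the same IRIs as the dct:creator statements above):

```turtle
@prefix pav: <http://purl.org/pav/> .

<https://commons.wikimedia.org/wiki/File:Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg>
    # who made the digital file (the bytes of the JPEG)
    pav:createdBy  <https://commons.wikimedia.org/wiki/User:Dcoetzee> ;
    # who made the "knowledge" that is captured
    pav:authoredBy <http://dbpedia.org/resource/Leonardo_da_Vinci> ;
    # who helped form it into its current form
    pav:curatedBy  <http://www.technologies.c2rmf.fr/> .
```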




So for annotations I get similar questions. It is clear in the case of
say tagging that the creator of the annotation didn't necessarily
"create" the tag word itself - but primarily made the link between the
target and the body - this is particularly the case for semantic tags
from a controlled vocabulary.


If an annotation links two standalone resources, e.g. a blog
entry and a YouTube video, then the annotation creator again might not
have made either the blog entry or the video, but just found that the
(body) blog entry is about the (target) YouTube video.

The body and the target might therefore have their own creators -
which might have been stated elsewhere.
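
A minimal sketch of that situation, assuming made-up ex: IRIs for the
annotation, the resources and the agents:

```turtle
@prefix oa:  <http://www.w3.org/ns/oa#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix ex:  <http://example.org/> .   # hypothetical namespace, for illustration only

# The annotator only asserts the link between body and target
ex:anno1 a oa:Annotation ;
    oa:hasBody ex:blog-entry ;
    oa:hasTarget ex:video ;
    dct:creator ex:annotator .

# Body and target have their own creators, possibly stated elsewhere
ex:blog-entry dct:creator ex:blogger .
ex:video dct:creator ex:uploader .
```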


Then there are the more compound annotations that in JSON-LD would be
a larger object - like if there's a SpecificResource or an embedded
textual body - in this case the creator of the annotation is most
likely also the author of the textual body, and is the one who made
the selection of the SpecificResource.  I don't think we normally want
to attach provenance to each of those - so it would be good if the
'creator' of the annotation was flexible enough to also apply to
these cases.


However on the Semantic Web we have this boring Open World Assumption
- so we can't do rules like "If a dct:creator is set on the annotation
but not on the body, then the annotation creator is also the body
creator" - as the body resource might have other views about who its
creator is.


dct:creator does have some of the ambiguity that is perhaps needed
here - but I don't think that ambiguity would be too helpful.

So this hints to me that we should get the annotation system to tell
us instead - pav:authoredBy if it knows the agent also made the
'content of the annotation' - which we can say includes things like
embedded body text or a specific resource, or just the super-property
pav:contributedBy (or its superprop prov:wasAttributedTo) if the
user's role is more ambiguous.

pav:createdBy can be used for the actual serialization and is usually
a computer system - it is basically almost like the existing
oa:serializedBy which I never saw quite the need for in the first
place. :)
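
Roughly, and only as a sketch: the ex: IRIs are made up, and the
embedded body is shown as a bare blank node with rdf:value rather
than with any particular body class from the model:

```turtle
@prefix oa:  <http://www.w3.org/ns/oa#> .
@prefix pav: <http://purl.org/pav/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex:  <http://example.org/> .   # hypothetical namespace, for illustration only

# The user typed the embedded body text, so the system can say authoredBy;
# the software that serialized the annotation is the createdBy agent
ex:anno2 a oa:Annotation ;
    oa:hasTarget <https://commons.wikimedia.org/wiki/File:Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg> ;
    oa:hasBody [ rdf:value "The smile looks retouched." ] ;
    pav:authoredBy ex:alice ;
    pav:createdBy ex:annotation-client .

# The user's role is less clear, so only the weaker contributedBy is stated
ex:anno3 a oa:Annotation ;
    oa:hasTarget ex:video ;
    oa:hasBody ex:blog-entry ;
    pav:contributedBy ex:bob .
```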



On 28 September 2015 at 21:54, Robert Sanderson <azaroth42@gmail.com> wrote:
>
> With the focus on making the model as approachable as possible, I'd like to
> propose that we revise the provenance model somewhat.  In particular, while
> the distinction between creator and annotator is useful from an academic
> perspective, it seems to me to be firmly in the 0.1% of use cases.
>
> Proposal:
>
> * Replace oa:annotatedBy with dcterms:creator  [creator]
> * Replace oa:annotatedAt with dcterms:created  [created]
>
> * Replace oa:serializedBy with prov:generatedBy  [generator]
> * Replace oa:serializedAt with prov:generated  [generated]
>
> Rationale:
>
> * It's simpler, and doesn't invent new terms unnecessarily.
>
> * It solves Luc's issue with the Prov constraints as the annotator is no
> longer a generator of the annotation.
>
> * It also allows us to say that creator and created SHOULD be used with
> embedded textual bodies, rather than hand-waving like we currently do.
>
> * It avoids the "serialization" issue of whether the client that created the
> annotation is the serializer, or the service that makes it available.  The
> activity that generates the annotation is clearly the user creating it,
> rather than the server serializing a graph into a particular format.
>
>
> Thoughts?
>
> Rob
>
> --
> Rob Sanderson
> Information Standards Advocate
> Digital Library Systems and Services
> Stanford, CA 94305



-- 
Stian Soiland-Reyes, eScience Lab
School of Computer Science
The University of Manchester
http://soiland-reyes.com/stian/work/    http://orcid.org/0000-0001-9842-9718
Received on Wednesday, 30 September 2015 14:02:09 UTC