Re: OA and provenance from Robert Sanderson on 2013-08-19 (public-openannotation@w3.org from August 2013)

From: Robert Sanderson <azaroth42@gmail.com>
Date: Mon, 19 Aug 2013 10:45:15 -0600
To: Leyla Jael García Castro <leylajael@gmail.com>
Cc: Paolo Ciccarese <paolo.ciccarese@gmail.com>, public-openannotation <public-openannotation@w3.org>
Message-ID: <CABevsUG-6gYKdSBSO=SApPv_swKn6CgJ6FWxoYC6B0KTjSLdUg@mail.gmail.com>
I agree with Paolo that there's not much benefit in having two
annotations in this case.

Agent A doesn't really do anything useful in the provenance chain
other than act as a workflow director, or at most the agent that does
the serialization of the Annotation.

So:

<anno1> a oa:Annotation ;
  oa:hasTarget <target1> ;
  oa:hasBody <ontologyTerm> ;
  oa:annotatedBy <agentE> ;
  oa:serializedBy <agentA> ;
  oa:motivatedBy oa:identifying, oa:tagging .

<target1> a oa:SpecificResource ;
  oa:hasSource <resourceR> ;
  oa:hasSelector <someSelectorGeneratedByAgentA> .

Rob


On Mon, Aug 19, 2013 at 9:38 AM, Leyla Jael García Castro
<leylajael@gmail.com> wrote:
> Hi Robert, all,
>
> Would you also recommend to have two annotations if the annotators are
> software agents?
>
> Let me describe the scenario. An agent A takes a portion of text from
> resource R, and sends it to an entity recognition tool E so E will identify
> some terms and will associate them to a concept in an ontology. At the end A
> parses what is retrieved from E and serializes the annotation(s).
>
> Using PAV, I ended up with something similar to what Paolo proposed for
> Darwin's case, <annotation> pav:authoredBy <E>, and <annotation>
> pav:createdBy <A>. Using OA, would two annotations be the way? If possible,
> I would rather have only one annotation.
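>
> A minimal sketch of the PAV form described above, using the placeholder
> identifiers <annotation>, <E>, and <A> from the scenario. The target and
> body resources (<target1>, <ontologyTerm>) and the prefix IRIs are
> assumptions added so that the snippet parses on its own:
>
> ```turtle
> @prefix oa:  <http://www.w3.org/ns/oa#> .
> @prefix pav: <http://purl.org/pav/> .
>
> <annotation> a oa:Annotation ;
>   oa:hasTarget <target1> ;       # assumed: the text portion taken from R
>   oa:hasBody <ontologyTerm> ;    # assumed: the concept that E identified
>   pav:authoredBy <E> ;           # the entity recognition tool
>   pav:createdBy <A> .            # the agent that serializes the annotation
> ```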
>
> Thanks,
> Leyla
>
>
>
> On Mon, Aug 19, 2013 at 4:10 PM, Robert Sanderson <azaroth42@gmail.com>
> wrote:
>>
>> Sorry for jumping in late, I was on vacation last week and offline.
>>
>> To quickly re-express the requirement:  There is a physical object
>> with some text (by Author A), and an annotation written on the object
>> about that text (by Darwin). That physical annotation is transcribed
>> as a digital annotation (by Student 1). Maintaining all of the actors
>> and objects is important.
>>
>> To me this is multiple annotations, but slightly different from the
>> ones that Stian proposes.
>>
>> Actors: AuthorA, Darwin, Student1
>> Objects: PhysicalTextWrittenByAuthorA, PhysicalTextWrittenByDarwin,
>> DigitalTextTranscribedByStudent1, (and potentially the physical page
>> on which the physical texts were written)
>>
>> Annotation 1 records that there is some text of Author A, and some
>> text of Darwin, with a link between the two (the Annotation).
>>
>> <anno1> a oa:Annotation ;
>>   oa:hasBody <uuid1> ;    # PhysicalTextWrittenByDarwin
>>   oa:hasTarget <uuid2> ;  # PhysicalTextWrittenByAuthorA
>>   oa:motivatedBy oa:commenting ;
>>   oa:annotatedBy <darwin> .
>>
>> <uuid1> a xxx:PhysicalText ;
>>   dc:creator <darwin> .
>>
>> <uuid2> a xxx:PhysicalText ;
>>   dc:creator <authorA> .
>>
>> This is the model of the real world physical object.  Darwin wrote
>> some text about something that AuthorA wrote, and by the act of
>> writing it on the object it's an Annotation thus Darwin is the
>> annotator and the motivation is commenting (or similar).  However
>> these are /physical/ things, not the digital transcription.  As with
>> any RDF description of real world objects or concepts, there's a
>> disconnect between the description and the thing itself.
>>
>> And thus we need the transcription as a separate digital annotation:
>>
>> <anno2> a oa:Annotation ;
>>   oa:hasBody <transcription.txt> ;  # DigitalTextTranscribedByStudent1
>>   oa:hasTarget <uuid1> ;
>>   oa:motivatedBy domeo:transcribing ;
>>   oa:annotatedBy <student1> .
>>
>> <transcription.txt> a cnt:ContentAsText, dcterms:Text ;
>>   cnt:chars "... Darwin's text here ..." .
>>   (Doesn't really matter who the dc:creator is for this content as all
>> the actors are above)
>>
>>
>> If you wanted to express it in terms of Shared Canvas, then you would
>> introduce a Canvas to explicitly represent the physical page rather
>> than just an identifier for the text itself, and the uuids would
>> become segments of it.  The only other difference would be the
>> motivation of <anno2> would be sc:painting.  Then you would associate
>> the digitized image with the Canvas as a digital representation of the
>> physical page, using another Annotation also with motivation
>> sc:painting.
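>>
>> A rough sketch of that Shared Canvas variant. The identifiers <canvas1>,
>> <anno3>, and <pageImage.jpg>, the xywh fragment values, and the sc:
>> prefix IRI are assumptions for illustration only:
>>
>> ```turtle
>> @prefix oa: <http://www.w3.org/ns/oa#> .
>> @prefix sc: <http://www.shared-canvas.org/ns/> .
>>
>> <canvas1> a sc:Canvas .            # stands for the physical page
>>
>> <anno2> a oa:Annotation ;
>>   oa:hasBody <transcription.txt> ;
>>   oa:hasTarget <canvas1#xywh=100,200,300,50> ;  # segment with Darwin's text
>>   oa:motivatedBy sc:painting .
>>
>> <anno3> a oa:Annotation ;
>>   oa:hasBody <pageImage.jpg> ;     # digitized image of the page
>>   oa:hasTarget <canvas1> ;
>>   oa:motivatedBy sc:painting .
>> ```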
>>
>> Hope that helps,
>>
>> Rob
>>
>>
>> On Thu, Aug 15, 2013 at 3:44 AM, Stian Soiland-Reyes
>> <soiland-reyes@cs.manchester.ac.uk> wrote:
>> > With my provenance hat on, I think this all depends on what is the
>> > scope of an oa:Annotation and its creation.
>> >
>> > We have the same challenge with provenance of entities and documents
>> > in general - if I write a letter in Word on Monday, and you (Paolo)
>> > print it out on paper on Tuesday, and then on Wednesday Robert puts it
>> > in an envelope and mails it, then who 'created' that thing that pops
>> > in through the mailbox at the recipient?
>> >
>> > Well it depends what you consider that thing to be - as an envelope
>> > with something inside, Robert made it, on Wednesday. As a printed
>> > letter (which happens to have an envelope in transit), Paolo made it on
>> > Tuesday, and as a conceptual letter, I wrote it on Monday. In a PROV
>> > setting, we recommend everyone to think carefully about the extent of
>> > their entity, in a way determining their life-span and what
>> > aspects/attributes can be considered mutable or fixed. If more than
>> > one kind of characterization is deemed necessary, then PROV has the
>> > concepts of specialization and alternates to relate them to
>> > each other: http://www.w3.org/TR/prov-dm/#component5
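>> >
>> > As a sketch, the three views of the letter could be related as follows.
>> > All resource names are made up for illustration, and treating the
>> > enveloped letter as a specialization of the printed one is one possible
>> > reading rather than the only one:
>> >
>> > ```turtle
>> > @prefix prov: <http://www.w3.org/ns/prov#> .
>> >
>> > <conceptualLetter> a prov:Entity ;
>> >   prov:wasAttributedTo <Stian> .       # written on Monday
>> >
>> > <printedLetter> a prov:Entity ;
>> >   prov:specializationOf <conceptualLetter> ;
>> >   prov:wasAttributedTo <Paolo> .       # printed on Tuesday
>> >
>> > <envelopedLetter> a prov:Entity ;
>> >   prov:specializationOf <printedLetter> ;
>> >   prov:wasAttributedTo <Robert> .      # enveloped and mailed on Wednesday
>> > ```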
>> >
>> > Now at first glance I think this sounds like one of those use cases
>> > where you would need multiple characterizations to model the
>> > provenance correctly. A quick go:
>> >
>> > <origAnno1> a oa:Annotation ;
>> >   oa:annotatedBy <OriginalAuthor> ;
>> >   oa:hasTarget <somebook> .
>> >
>> > <anno1> a oa:Annotation ;
>> >   oa:annotatedBy <Paolo> ;
>> >   prov:specializationOf <origAnno1> ;
>> >   oa:hasTarget <somebook> .
>> >
>> > This does seem like a bit of duplication - and also a bit strange
>> > considering both <origAnno1> and <anno1> are expressed as
>> > oa:Annotations. This kind of split-up of the annotation could however
>> > make sense in cases where the body/target are also at different
>> > specialization levels:
>> >
>> > <conceptualAnno1> a oa:Annotation ;
>> >   oa:annotatedBy <OriginalAuthor> ;
>> >   oa:hasBody <note.txt> ;
>> >   oa:hasTarget <isbn:0-85131-041-9> .
>> >
>> > <instanceAnno1> a oa:Annotation ;
>> >   oa:annotatedBy <MrLibrarian> ;
>> >   oa:hasBody <scannedNote.jpeg> ;
>> >   oa:hasTarget <redBookOnShelf5> ;
>> >   prov:specializationOf <conceptualAnno1> .
>> >
>> > <note.txt> prov:alternateOf <scannedNote.jpeg> ;
>> >     prov:wasDerivedFrom <scannedNote.jpeg> .
>> >
>> > <redBookOnShelf5> prov:specializationOf <isbn:0-85131-041-9> .
>> >
>> >
>> > (This could be expanded with the full FRBR model or equivalent)
>> >
>> >
>> > We have discussed conceptual vs representational oa:Annotations earlier:
>> >
>> >
>> > http://lists.w3.org/Archives/Public/public-openannotation/2013Jan/0051.html
>> >
>> > http://lists.w3.org/Archives/Public/public-openannotation/2013Jan/0027.html
>> >
>> > and the conclusion seemed to have been that it is simpler to merge the
>> > conceptual annotation with the formalized annotation as a
>> > datastructure.
>> >
>> > However, the discussion then did not delve into the provenance aspects
>> > - what we still need to keep somewhat clear is what the two provenance
>> > aspects we do provide cover: annotatedBy/At and serializedBy/At.
>> > We have a PROV unrolling of these at
>> > http://www.openannotation.org/spec/core/appendices.html#ProvMapping:
>> >
>> >>  There are two Entities in the Open Annotation model, which for
>> >> expediency and simplicity are collapsed into just oa:Annotation. These are
>> >> the Annotation document, and the concept that the Annotation embodies or
>> >> describes. This is the distinction between oa:annotatedBy and
>> >> oa:annotatedAt, versus oa:serializedBy and oa:serializedAt.
>> >
>> > OK - the wording order here is wrong (annotation/document and
>> > concept/serialized) - perhaps something to fix! But basically it says
>> > that annotated* is who created it conceptually - so in your case:
>> >
>> >   <ann1>  oa:annotatedBy <OriginalAuthor> ;
>> >           oa:serializedBy <Domeo> .
>> >
>> > The reasoning being that it was OriginalAuthor who created the
>> > relation between the body (his note) and the book (where he wrote his
>> > note) - we consider the oa:Annotation as a conceptual entity that was
>> > formed all those years ago, long before RDF was invented.
>> >
>> > To record the digital formation of the oa:Annotation data structure as
>> > distinct from its 'authorship', then you would need to use other
>> > provenance properties - pav:curatedBy and pav:createdBy sound like
>> > good matches. I would not put <Paolo> as the serializer, unless he
>> > more directly typed in the RDF.
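>> >
>> > A sketch of that combination, reusing the names from this thread. The
>> > prefix IRIs and the pav:createdWith line are assumptions added for
>> > illustration:
>> >
>> > ```turtle
>> > @prefix oa:  <http://www.w3.org/ns/oa#> .
>> > @prefix pav: <http://purl.org/pav/> .
>> >
>> > <ann1> a oa:Annotation ;
>> >   oa:annotatedBy <OriginalAuthor> ;   # formed the body/target relation
>> >   oa:serializedBy <Domeo> ;           # produced the RDF serialization
>> >   pav:createdBy <Paolo> ;             # encoded the digital annotation
>> >   pav:createdWith <Domeo> .           # tool used by the creator
>> > ```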
>> >
>> > (Another practical consideration - I would side with Antoine here and
>> > keep oa:serializedBy at RDF Graph level, so even if Paolo typed in
>> > Turtle and Domeo put out RDF/XML, then it would still be serializedBy
>> > <Paolo>.)
>> >
>> >
>> > This said - there should not be anything in OA that prevents my
>> > expanded form with specialization - but of course then you have to be
>> > much more careful. You might wonder what this would mean for
>> > interoperability - well, an annotation means different things in
>> > different systems and domains. For instance in my application, Wf4Ever
>> > research objects, we even have annotations where the body is just an
>> > RDF graph to declare the rdf:type of a resource - we needed something
>> > like OA to structure this, because such statements could be made by a
>> > user in the UI (and thus error-prone but more authoritative), or
>> > inferred by automatic scripts (which might be guessing wrongly).
>> >
>> >
>> >
>> > On 14 August 2013 15:00, Paolo Ciccarese <paolo.ciccarese@gmail.com>
>> > wrote:
>> >> Dear all,
>> >> I would like to share a solution that I am currently implementing in
>> >> Domeo in relation to provenance and a question related to it.
>> >> Apologies in advance for the length of the email.
>> >>
>> >> Use Case: I am dealing with an existing annotation that is written on
>> >> paper. The author of the annotation can be the author of the original
>> >> manuscript or a third party (let's assume the latter for this
>> >> example). The annotation is anchored in a specific location of the
>> >> original text. My user is transforming that annotation into an OA
>> >> annotation. It is very similar to the Darwin annotation in the specs
>> >> [1], but I came to a slightly different conclusion.
>> >>
>> >> I would like to keep track of:
>> >> - the agent that creates the OA annotation
>> >> - the application the agent used to create the annotation (could be
>> >>   different than the application that serialized the annotation)
>> >> - the author of the body of the annotation (third party)
>> >> - the author of the original association of the annotation with the
>> >>   original text
>> >>
>> >> In Domeo I use PAV (Provenance Authoring and Versioning ontology)
>> >> [2][3] and I append to the oa:Annotation the following properties:
>> >>
>> >> 1) pav:createdBy -> Domeo user
>> >> An agent primarily responsible for encoding the digital artifact or
>> >> resource representation. This creation is distinct from forming the
>> >> content, which is indicated with pav:contributedBy or its
>> >> subproperties. It is more specific than dct:creator - which might or
>> >> might not be interpreted to also cover the creation of the content of
>> >> the artifact.
>> >>
>> >> 2) pav:createdOn -> When the Domeo user created the digital object
>> >> The date of creation of the digital artifact or resource
>> >> representation. The agents responsible can be indicated with
>> >> pav:createdBy.
>> >>
>> >> 3) pav:createdAt -> Where the user created the digital object
>> >> The geo-location of the agent that created the annotation.
>> >>
>> >> 4) pav:createdWith -> In my case the Domeo tool
>> >> The software/tool used by the creator (pav:createdBy) when making the
>> >> digital resource, for instance a word processor or an annotation tool.
>> >> A more independent software agent that creates the resource without
>> >> direct interactions by a human creator should instead be indicated
>> >> using pav:createdBy.
>> >>
>> >> 5) pav:authoredBy -> The author of the original annotation on paper
>> >> Indicates an agent that originated or gave existence to the work that
>> >> is expressed by the digital resource. The author of the content of a
>> >> resource may be different from the creator of that resource
>> >> representation (pav:createdBy), although they are often the same. The
>> >> author is usually not a software agent (which would be indicated with
>> >> pav:createdWith, pav:createdBy or pav:importedBy), unless the software
>> >> actually authored the content itself; for instance an artificial
>> >> intelligence algorithm which authored a piece of music or a machine
>> >> learning algorithm that authored a classification of a tumor sample.
>> >>
>> >> 6) pav:authoredOn -> The date of the original annotation
>> >> Indicates the date this resource was authored by the agents given by
>> >> pav:authoredBy. Note that pav:authoredOn is different from
>> >> pav:createdOn, although their values are often the same.
>> >>
>> >> In summary I have something like:
>> >>
>> >> <ann1> a oa:Annotation
>> >>    pav:createdBy -Paolo-
>> >>    pav:createdOn -today-
>> >>    pav:createdWith -Domeo-
>> >>    pav:createdAt -Boston location-
>> >>    pav:authoredBy -Annotation’s author-
>> >>    pav:authoredOn -Date of the original annotation-
>> >>
>> >> In other words, using PAV I can keep the distinction between the
>> >> creator of the digital artifact and the author of the original
>> >> content/association.
>> >>
>> >> However, there are possibly a couple of overlaps with the current OA
>> >> properties. As I would like to provide the OA provenance as well, I am
>> >> wondering which of the following applies:
>> >> <ann1> a oa:Annotation ;
>> >>     oa:annotatedBy <Paolo> .
>> >> or
>> >> <ann1> a oa:Annotation ;
>> >>     oa:annotatedBy <OriginalAuthor> .
>> >>
>> >> Or compared to PAV:
>> >> - pav:createdBy =? oa:annotatedBy --or--
>> >> - pav:authoredBy =? oa:annotatedBy
>> >>
>> >> Looking at the Darwin example in the specs, if the student is
>> >> digitizing a note from Darwin on his own content I would say:
>> >> <ann2> a oa:Annotation
>> >>    pav:createdBy -Student-
>> >>    pav:createdOn -2013-
>> >>    pav:createdWith -Domeo-
>> >>    pav:createdAt -Boston location-
>> >>    pav:authoredBy -Darwin-
>> >>    pav:authoredOn -Date of the original annotation-
>> >>
>> >> Then of course the ‘body’ of the annotation can be also authored by the
>> >> original author of the annotation. But, as pointed out above, it is
>> >> important for me to attribute also the association of body and target
>> >> to the original author as that represents the historical provenance of
>> >> it.
>> >>
>> >> What this comes down to is basically what an oa:Annotation really is:
>> >> “an Annotation expresses the relationship between two or more
>> >> resources, and their metadata, using an RDF graph”. We talked about
>> >> this before - my question here becomes whether oa:annotatedBy indicates
>> >> who formed the relationship (the ‘author’ of the conceptual
>> >> annotation), or the person who (using some OA aware tools) formalized
>> >> this as an oa:Annotation data structure (the RDF structure)?
>> >>
>> >> Best,
>> >> Paolo
>> >>
>> >>
>> >> [1] http://www.openannotation.org/spec/core/core.html#Provenance
>> >> [2] http://arxiv.org/abs/1304.7224
>> >> [3] http://code.google.com/p/pav-ontology/
>> >>
>> >>
>> >> --
>> >> Dr. Paolo Ciccarese
>> >> http://www.paolociccarese.info/
>> >> Biomedical Informatics Research & Development
>> >> Instructor of Neurology at Harvard Medical School
>> >> Assistant in Neuroscience at Mass General Hospital
>> >> Member of the MGH Biomedical Informatics Core
>> >> +1-857-366-1524 (mobile)   +1-617-768-8744 (office)
>> >
>> >
>> >
>> > --
>> > Stian Soiland-Reyes, myGrid team
>> > School of Computer Science
>> > The University of Manchester
>> > http://soiland-reyes.com/stian/work/
>> > http://orcid.org/0000-0001-9842-9718
>> >
>>
>
Received on Monday, 19 August 2013 16:45:45 UTC