[ACTION-81] consider consolidation of author, revisionAgent and translationAgent from Dave Lewis on 2012-05-08 (public-multilingualweb-lt@w3.org from May 2012)

From: Dave Lewis <dave.lewis@cs.tcd.ie>
Date: Tue, 08 May 2012 23:15:20 +0100
To: "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>
Message-ID: <4FA99AF8.4090307@cs.tcd.ie>
Dear all,
Here are some notes on how we might consolidate author, revisionAgent 
and translationAgent by alignment with the work of the W3C Provenance WG 
http://www.w3.org/2011/prov/wiki/Main_Page.

The model of this working group is most simply summarised by figure 
http://www.w3.org/TR/prov-dm/#prov-dm-overview in the core Data Model 
specification. Essentially the provenance model is intended to allow 
recording of how *entities* were used and generated by *activities* 
which are conducted through the action of *agents*. There are then a 
bunch of relations that tie these together, such as 'wasGeneratedBy', 
'wasDerivedFrom' and 'used'; see: 
http://www.w3.org/TR/prov-dm/#prov-dm-types-and-relations

The model specifies the format of provenance records, but unlike most 
ITS tags, the intent is for these records to be maintained in a 
dedicated store. Like ITS, it defines an abstract notation (PROV-DM) for 
such records, and then defines different implementations, namely a 
text-file format (PROV-N), an ontology version that can be mapped into 
RDF (PROV-O), a RESTful access and query mechanism returning records as 
HTML (PROV-AQ), and an XML binding (PROV-XML). These specs are currently 
going through the W3C process, with the aim of reaching 
Recommendation status by Jan'13.

Briefly, the most direct mapping to ITS would be some sort of binding 
between host documents (and their elements) and the entities recorded in 
provenance records. The binding will depend on the implementation used 
for the provenance, e.g. just a URL, an XPointer, or a file URL plus an 
entity record ID within that file. Using the last of these we could 
imagine:

<span its-prov-ref="http://www.eg.org/prov-ex1.txt" its-prov-ent="e1">My 
hovercraft is full of eels.</span>
<span its-prov-ref="http://www.eg.org/prov-ex1.txt" 
its-prov-ent="e2">Mon aéroglisseur est plein d'anguilles.</span>

where http://www.eg.org/prov-ex1.txt would contain something like:

entity(e1)
entity(e2)

which in turn could be referenced by an activity a1:

wasGeneratedBy(e1, a1, 2011-11-16T16:05:30) -- specifies that an entity 
was generated by an activity at a specific time

activity(a1, 2011-11-16T16:05:00, 2011-11-16T16:06:00, 
[its-prov-process-type="authorContent", its-source-lang="en"] ) -- 
identifies an activity, its start and stop time and other relevant 
attributes

-- similarly we can define that e2 was generated by a machine 
translation process
wasGeneratedBy(e2, a2, 2011-11-16T16:07:30)
activity(a2, 2011-11-16T16:07:00, 2011-11-16T16:08:00, 
[its-prov-process-type="mTranslate"] )

-- then we can define agents associated with these activities
agent(Trevor, [ prov:type="Person", its-prov-agent-type="author" ] )
agent(matrex-eng1234, [ prov:type="SoftwareAgent", 
its-prov-agent-type="smt", its-prov-src-lang="en", 
its-prov-tgt-lang="fr" ] )

wasAssociatedWith(a1, Trevor)
wasAssociatedWith(a2, matrex-eng1234)
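
To give a feel for how a consumer might use such records, here is a rough Python sketch that holds the statements above as plain data and answers "which process and which agents produced this entity?". The dictionary layout and the function name agents_for_entity are assumptions made purely for illustration; this is not a PROV-N parser and nothing here is defined by the PROV specs. The machine translation activity is taken to be a2, as referenced by wasGeneratedBy(e2, a2, ...).

```python
# Illustrative only: the example records held as plain Python data.
# The layout and the function name are assumptions, not PROV-defined.

# wasGeneratedBy(entity, activity, time)
was_generated_by = {
    "e1": ("a1", "2011-11-16T16:05:30"),
    "e2": ("a2", "2011-11-16T16:07:30"),
}

# activity(id, start, end, [attributes]); only the attributes are kept here
activities = {
    "a1": {"its-prov-process-type": "authorContent", "its-source-lang": "en"},
    "a2": {"its-prov-process-type": "mTranslate"},
}

# wasAssociatedWith(activity, agent)
was_associated_with = [("a1", "Trevor"), ("a2", "matrex-eng1234")]


def agents_for_entity(entity_id):
    """Return (process type, [agent ids]) for the activity that generated entity_id."""
    activity_id, _time = was_generated_by[entity_id]
    process_type = activities[activity_id]["its-prov-process-type"]
    agent_ids = [agent for act, agent in was_associated_with if act == activity_id]
    return process_type, agent_ids


print(agents_for_entity("e1"))
print(agents_for_entity("e2"))
```

The point is only that the host markup carries an entity ID, and everything else is found by following the relations in the stand-off records.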

So you can see that this stand-off metadata approach based on the PROV 
model means we can also record things like the suggested qualityError 
data category 
(http://www.w3.org/International/multilingualweb/lt/wiki/Requirements#qualityError):

entity(e3, [its-ent-type="qa-error-report", its-qa-err-severity="0.5", 
its-qa-err-note="suspect terminology"] ) -- actually PROV has an annotation 
structure that could be used instead of its-qa-err-note

wasGeneratedBy(g1, e3, a3, 2011-11-16T16:08:30)
wasDerivedFrom(e3, e2, a3, g1)
activity(a3, 2011-11-16T16:08:00, 2011-11-16T16:09:00, 
[its-prov-process-type="translateQA", its-prov-qa-ruleset="LISAQA"] )
wasAssociatedWith(a3, Pierre)
agent(Pierre, [ prov:type="Person", 
its-prov-agent-type="trans-QA-checker" ] )
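
As a sketch of how the QA example could be queried, the Python below finds the QA error reports derived from a given entity by following wasDerivedFrom. Again the data layout and the function name qa_reports_for are hypothetical, chosen just for this illustration, and only the derived/source pair of each wasDerivedFrom record is kept.

```python
# Illustrative only: entity attributes and derivations from the QA example.
# The layout and the function name are assumptions, not PROV-defined.

entities = {
    "e1": {},
    "e2": {},
    "e3": {
        "its-ent-type": "qa-error-report",
        "its-qa-err-severity": "0.5",
        "its-qa-err-note": "suspect terminology",
    },
}

# wasDerivedFrom(derived, source, ...) reduced to (derived, source) pairs
was_derived_from = [("e3", "e2")]


def qa_reports_for(entity_id):
    """Return (report id, attributes) for each QA report derived from entity_id."""
    return [
        (derived, entities[derived])
        for derived, source in was_derived_from
        if source == entity_id
        and entities[derived].get("its-ent-type") == "qa-error-report"
    ]


for report_id, attrs in qa_reports_for("e2"):
    print(report_id, attrs["its-qa-err-severity"], attrs["its-qa-err-note"])
```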

This approach makes it easy to have several different provenance 
entities associated with any particular doc, element or span, and heads 
off the likely high level of ITS markup overhead that may occur if  
several provenance records are applied.

What is required in terms of specification is the set of additional 
attributes we want to use and their values. Essentially this would be a 
profile of the PROV specs. We may need to liaise with that working group 
on how to do this since I can't see that they have addressed this yet.

Note that its-prov-process-type should result from the consideration 
given to section 
http://www.w3.org/International/multilingualweb/lt/wiki/Requirements#Process_Model 


In this way we can replace the author, revisionAgent, translationAgent 
and perhaps also the quality data categories by the 'its-prov-ent' data 
category to reference the entity representing the doc/element/span and 
then through profiling let the PROV spec do the rest.
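
On the host document side, a processor would only need to collect the bindings. A minimal Python sketch, assuming the its-prov-ref and its-prov-ent attribute names suggested above (they are a proposal in this note, not existing ITS attributes) and using the standard library html.parser module; the class and function names are made up for the example:

```python
from html.parser import HTMLParser


class ProvBindingParser(HTMLParser):
    """Collects (tag, provenance file URL, entity id) for annotated elements."""

    def __init__(self):
        super().__init__()
        self.bindings = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        ref = attr_map.get("its-prov-ref")
        ent = attr_map.get("its-prov-ent")
        # Only elements carrying both attributes count as bound to a record.
        if ref is not None and ent is not None:
            self.bindings.append((tag, ref, ent))


def extract_bindings(markup):
    """Return the provenance bindings found in an HTML fragment."""
    parser = ProvBindingParser()
    parser.feed(markup)
    parser.close()
    return parser.bindings


sample = (
    '<span its-prov-ref="http://www.eg.org/prov-ex1.txt" its-prov-ent="e1">'
    "My hovercraft is full of eels.</span>"
)
print(extract_bindings(sample))
```

Each binding then gives the file to fetch and the entity ID to look up in it.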

All comments welcome; we can discuss this more on Thursday's call.
cheers,
Dave
Received on Tuesday, 8 May 2012 22:15:48 UTC