[ISSUE-22] Provenance and agents from Dave Lewis on 2012-06-28 (public-multilingualweb-lt@w3.org from June 2012)

From: Dave Lewis <dave.lewis@cs.tcd.ie>
Date: Thu, 28 Jun 2012 13:46:50 +0100
To: Multilingual Web LT Public List <public-multilingualweb-lt@w3.org>
Message-ID: <4FEC523A.2010406@cs.tcd.ie>
Hi all,
We have merged the discussions on the provenance and agent related data 
categories under this issue.

description
----------------
In general, provenance relates to the logging of processes or activities 
that have been performed on content that is the subject of ITS tags. 
Provenance data recorded for a specific process could include: 
the person or software involved in the process; the time over which the 
process occurred; and any process-specific attributes.

The level and granularity of detail recorded depends on the use case of 
those interested in collecting and analysing provenance data. This could 
range from basic translation workflow monitoring, through the details 
needed for end-to-end internationalization and localization process 
optimization, to using translation and QA provenance to guide the 
collection of parallel text for MT training.

To keep the requirements on ITS data categories simple, the suggested 
data categories essentially support two levels of provenance tracking:
i) simple in-document identification of agents for common tasks
ii) a link to more complex, standoff markup conforming to the 
recommendation being drafted by the W3C PROV group

Provenance data categories can be applied to a whole document or to 
individual elements. In practice, however, provenance detail will often 
need to be recorded for individual elements, since process analysts 
often need to correlate differences in process with differences in 
process outcomes, e.g. translation style variations. The data category 
must therefore be able to assign more than one provenance record to an 
element.
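
As a rough illustration of the requirement above, here is a minimal sketch of a data model in which one element can carry several provenance records. The class names, field names and the element identifier "p1" are all hypothetical and are not part of any ITS or PROV proposal; the fields simply mirror the agent, time span and process-specific attributes listed in the description.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class ProvenanceRecord:
    """One process performed on a piece of content (hypothetical model)."""
    process: str                        # e.g. "translation"
    agent: str                          # person or software involved
    started: Optional[datetime] = None  # start of the process, if known
    ended: Optional[datetime] = None    # end of the process, if known
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class ProvenanceLog:
    """Maps an element identifier to every record logged against it."""
    records: Dict[str, List[ProvenanceRecord]] = field(default_factory=dict)

    def add(self, element_id: str, record: ProvenanceRecord) -> None:
        # Append rather than overwrite, so an element keeps all its records.
        self.records.setdefault(element_id, []).append(record)

    def for_element(self, element_id: str) -> List[ProvenanceRecord]:
        # Return a copy so callers cannot mutate the log by accident.
        return list(self.records.get(element_id, []))


log = ProvenanceLog()
log.add("p1", ProvenanceRecord("translation", "smt"))
log.add("p1", ProvenanceRecord("translation-revision", "reviser-a"))
print([r.process for r in log.for_element("p1")])
```

Keeping a list per element is only one possible design; it is meant to show that "more than one provenance record per element" rules out a single-valued field.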

Some provenance information already appears in the ITS2.0 requirements:

'author' agent, as proposed in: 
http://www.w3.org/TR/2012/WD-its2req-20120524/#author overlaps with 
dc:author or dc:creator, so I suggest we do *not* address source content 
authorship agent in ITS2.0

quality profile and error: these do not impact the provenance data 
categories proposed here, but may represent some redundancy in ITS 
data categories - to be resolved under ACTION-113. I will pursue this 
separately from this mail.

As a consequence, the following data categories are proposed here for 
discussion.

*For simple agent provenance:*
-----------------------------------------
The data category only records the agent involved in a specific 
process. Two have been suggested: one for translation and one for 
translation revision. I propose we specify these as:

*translation agent* expressed as one of:

  * its-translation-agent: value is a string giving the name of the
    translation agent, or
  * its-translation-agent-ref: value is a URI representing the translation
    agent

*translation revision agent* expressed as one of:

  * its-trans-revision-agent: value is a string giving the name of the
    translation revision agent, or
  * its-trans-revision-agent-ref: value is a URI representing the
    translation revision agent
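
To make the proposal concrete, the following is a minimal sketch of how the four attributes might appear in HTML and how a consumer could collect them. It assumes the spelling its-translation-agent for the first attribute and that the attributes are written directly on elements; the sample markup, the example.com URIs and the AgentCollector class are hypothetical.

```python
from html.parser import HTMLParser

# Attribute names as proposed in this mail; "-ref" variants carry a URI.
AGENT_ATTRS = (
    "its-translation-agent",
    "its-translation-agent-ref",
    "its-trans-revision-agent",
    "its-trans-revision-agent-ref",
)


class AgentCollector(HTMLParser):
    """Collects (tag, attribute, value) for each agent attribute found."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs in document order.
        for name, value in attrs:
            if name in AGENT_ATTRS and value:
                self.found.append((tag, name, value))


# Hypothetical sample; agent values are user-defined strings or URIs.
SAMPLE = """
<p its-translation-agent="smt"
   its-trans-revision-agent-ref="http://example.com/agents/reviser-1">
  Translated paragraph.
</p>
<span its-translation-agent-ref="http://example.com/agents/mt-engine">text</span>
"""

collector = AgentCollector()
collector.feed(SAMPLE)
collector.close()
for tag, name, value in collector.found:
    print(f"<{tag}> {name} = {value}")
```

The sketch treats the name and "-ref" forms as independent attributes; whether both forms may appear on the same element for the same agent is not settled in this proposal.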

This definition says nothing about the values used for the agent names. 
They could be very generic, e.g. type level such as 'smt', or more 
specific, e.g. 'moses v1.2 trained on legal bi-text'. My view is that 
provenance granularity and data values are specific to individual 
organizations and service provision contracts, so this may be the best 
approach.

Question: Is it sufficiently useful therefore to leave the agent values 
to be user defined?

Another question is whether we want to be able to point to an agent name 
within the document, similar to 'locNotePointer' ?

*standoff provenance*
-----------------------------
This was discussed at the Dublin workshop, see:
http://www.w3.org/International/multilingualweb/lt/wiki/images/8/85/LEWIS-DAVE_2012-06-13.pdf

Since then we have confirmed that the W3C PROV group is far enough 
advanced that it is safe to reference their standard. I have also been 
in contact with them about how to reference provenance records, and this 
indicates that we need only one URI to reference any record, not 
separate document and element reference as I proposed in Dublin.

I therefore propose the following data category for this:

*its-prov-ref*: the value of which is a URL pointing to an entity record 
conforming to the W3C PROV specification. The document or element to 
which this attribute applies should correspond to the resource that the 
provenance entity record represents.
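
A minimal sketch of how a consumer might gather its-prov-ref values from a document follows. The sample markup, the example.com URLs and the ProvRefCollector class are hypothetical, and resolving relative values against the document URL is an assumption; the proposal does not say whether relative references are allowed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ProvRefCollector(HTMLParser):
    """Collects (tag, absolute URL) for every its-prov-ref attribute."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.refs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "its-prov-ref" and value:
                # Assumption: relative values resolve against the document URL.
                self.refs.append((tag, urljoin(self.base_url, value)))


# Hypothetical document with one document-level and one element-level reference.
DOC = """
<html its-prov-ref="http://example.com/prov/doc-42">
  <body>
    <p its-prov-ref="/prov/doc-42/p1">First paragraph.</p>
    <p>No provenance recorded here.</p>
  </body>
</html>
"""

prov = ProvRefCollector("http://example.com/docs/doc-42.html")
prov.feed(DOC)
prov.close()
for tag, url in prov.refs:
    print(f"<{tag}> -> {url}")
```

Each collected URL would then be dereferenced to obtain the PROV entity record; that step is left out because it depends on how the records are published.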

There are several possible configurations for dealing with:

  * multiple provenance records for a document or element
  * recording provenance records for document parts without a reference
  * managing the relationship between document level and element or
    fragment level provenance records

However, I think these can all be addressed by PROV usage profiles, and 
can therefore be recorded as non-normative best practice in relation to 
ITS2.0 rather than having to be supported in the normative ITS2.0 spec.

Does anyone have any queries or questions about this approach?
Regards,
Dave
Received on Thursday, 28 June 2012 12:48:27 UTC