- From: Dave Lewis <dave.lewis@cs.tcd.ie>
- Date: Thu, 28 Jun 2012 13:46:50 +0100
- To: Multilingual Web LT Public List <public-multilingualweb-lt@w3.org>
- Message-ID: <4FEC523A.2010406@cs.tcd.ie>
Hi all,
We have merged the discussions on the provenance and agent related data
categories under this issue.
description
----------------
In general, provenance relates to the logging of processes or activities
that have been performed on content that is the subject of ITS tags.
Provenance data recorded for a specific process could include: the
person or software involved in the process; the time over which the
process occurred; and any process-specific attributes.
The level and granularity of detail recorded depends on the use case of
those interested in collecting and analysing provenance data. This could
span from basic translation workflow monitoring, through details needed
for end-to-end internationalization and localization process
optimization to using translation and QA provenance to guide the
collection of parallel text for MT training.
To keep the requirements on ITS data categories simple, the suggested
data categories essentially support two levels of provenance tracking:
i) simple in-document identification of agents for common tasks
ii) a link to more complex, standoff markup conforming to the
recommendation being drafted by the W3C PROV group
Provenance data categories can be applied to a whole document or to
individual elements. In practice, however, provenance detail will often
need to be recorded for individual elements, since process analysts
often need to correlate differences in process with differences in
process outcomes, e.g. translation style variations. The data category
must therefore be able to assign more than one provenance record to an
element.
Some provenance-related items in the ITS2.0 requirements are handled
separately:
* 'author' agent, as proposed in:
http://www.w3.org/TR/2012/WD-its2req-20120524/#author
This overlaps with dc:author or dc:creator, so I suggest we do *not*
address source content authorship agent in ITS2.0.
* quality profile and error: this doesn't impact the provenance data
categories proposed here, but may represent some redundancy in ITS data
categories, to be resolved under ACTION-113. I will pursue this
separately from this mail.
As a consequence, the following data categories are proposed here for
discussion.
*For simple agent provenance:*
-----------------------------------------
The data category only records the agent involved for a specific
process. Two have been suggested: one for translation and one for
translation revision. I propose we specify these as:
*translation agent* expressed as one of:
* its-translation-agent: value is a string giving the name of the
translation agent, or
* its-translation-agent-ref: value is a URI representing the translation
agent
*translation revision agent* expressed as one of:
* its-trans-revision-agent: value is a string giving the name of the
translation revision agent, or
* its-trans-revision-agent-ref: value is a URI representing the
translation revision agent
This definition says nothing about the values used for the agent names.
A value could be very generic, e.g. a type-level name such as 'smt', or
more specific, e.g. 'moses v1.2 trained on legal bi-text'. My view is
that provenance granularity and data values are specific to individual
organizations and service provision contracts, so this may be the best
approach.
Question: Is it sufficiently useful therefore to leave the agent values
to be user defined?
Another question is whether we want to be able to point to an agent name
within the document, similar to 'locNotePointer'?
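To make the proposal concrete, here is a minimal Python sketch of how a consumer might read the four proposed attributes from HTML. It is only an illustration, not part of the proposal: the attribute names are those suggested in this mail (assuming the spelling its-translation-agent, consistent with its-translation-agent-ref), and the sample markup, agent values, and the helper name collect_agents are hypothetical.

```python
from html.parser import HTMLParser

# Attribute names as proposed in this mail; a proposal, not a final spec.
AGENT_ATTRS = (
    "its-translation-agent",
    "its-translation-agent-ref",
    "its-trans-revision-agent",
    "its-trans-revision-agent-ref",
)


class AgentCollector(HTMLParser):
    """Collect (tag, attribute, value) triples for the proposed agent attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with names lowercased.
        for name, value in attrs:
            if name in AGENT_ATTRS:
                self.found.append((tag, name, value))


# Hypothetical sample markup; the agent values are invented for illustration.
SAMPLE = (
    '<p its-translation-agent="moses v1.2 trained on legal bi-text" '
    'its-trans-revision-agent-ref="http://example.com/agents/reviewer-7">'
    "Texte traduit</p>"
)


def collect_agents(markup):
    """Return every agent attribute found in the markup, in document order."""
    parser = AgentCollector()
    parser.feed(markup)
    parser.close()
    return parser.found


if __name__ == "__main__":
    for tag, name, value in collect_agents(SAMPLE):
        print(f"<{tag}> {name} = {value}")
```

Because the values are treated as opaque strings or URIs, the sketch works the same whether an organization uses generic or very specific agent names.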
*standoff provenance*
-----------------------------
This was discussed in the Dublin workshop, see:
http://www.w3.org/International/multilingualweb/lt/wiki/images/8/85/LEWIS-DAVE_2012-06-13.pdf
Since then we have confirmed that the W3C PROV group is far enough
advanced that it is safe to reference their standard. I have also been
in contact with them about how to reference provenance records, and this
indicates that we need only one URI to reference any record, not
separate document and element references as I proposed in Dublin.
I therefore propose the following data category for this:
*its-prov-ref*: the value of which is a URL pointing to an entity record
conforming to the W3C PROV specification. The document or element to
which this attribute applies should correspond to the resource that the
provenance entity record represents.
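As a rough illustration of the single-URI approach, the following Python sketch gathers its-prov-ref values from a document, keeping track of the element each one is attached to. The sample markup, the example.com URLs, the helper names, and the check that each value is an absolute URL are all assumptions made for the sketch; nothing here resolves or validates the referenced PROV records.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ProvRefCollector(HTMLParser):
    """Collect (tag, url) pairs for elements carrying its-prov-ref."""

    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        value = dict(attrs).get("its-prov-ref")
        if value is not None:
            self.refs.append((tag, value))


def is_absolute_url(value):
    """Assumed sanity check: the reference has both a scheme and a host."""
    parts = urlparse(value)
    return bool(parts.scheme and parts.netloc)


def collect_prov_refs(markup):
    """Return every its-prov-ref found in the markup, in document order."""
    parser = ProvRefCollector()
    parser.feed(markup)
    parser.close()
    return parser.refs


# Hypothetical markup: one document-level and one element-level reference.
SAMPLE = (
    '<html its-prov-ref="http://example.com/prov/doc-1">'
    '<body><p its-prov-ref="http://example.com/prov/doc-1#para-3">Bonjour</p>'
    "<p>Untracked</p></body></html>"
)

if __name__ == "__main__":
    for tag, url in collect_prov_refs(SAMPLE):
        print(f"<{tag}> -> {url} (absolute: {is_absolute_url(url)})")
```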
There are several different configurations of how to deal with:
* multiple provenance records for a document or element
* recording provenance records for document parts without a reference
* managing the relationship between document level and element or
fragment level provenance records
However, I think these can all be addressed by PROV usage profiles, and
can therefore be recorded as non-normative best practice in relation to
ITS2.0 rather than having to be supported in the normative ITS2.0 spec.
Does anyone have any queries or questions about this approach?
Regards,
Dave
Received on Thursday, 28 June 2012 12:48:27 UTC