- From: Dave Lewis <dave.lewis@cs.tcd.ie>
- Date: Thu, 28 Jun 2012 13:46:50 +0100
- To: Multilingual Web LT Public List <public-multilingualweb-lt@w3.org>
- Message-ID: <4FEC523A.2010406@cs.tcd.ie>
Hi all,

We have merged the discussions on the provenance and agent related data categories under this issue.

description
----------------
In general, provenance relates to the logging of processes or activities that have been performed on content that is the subject of ITS tags. Provenance data recorded for a specific process could include: the person or software involved in the process; the time over which the process occurred; and any process-specific attributes.

The level and granularity of detail recorded depends on the use case of those interested in collecting and analysing provenance data. This could span from basic translation workflow monitoring, through details needed for end-to-end internationalization and localization process optimization, to using translation and QA provenance to guide the collection of parallel text for MT training.

To keep the requirements on ITS data categories simple, the suggested data categories essentially support two levels of provenance tracking:
i) simple in-document identification of agents for common tasks
ii) a link to more complex, standoff markup conforming to the recommendation being drafted by the W3C PROV group

Provenance data categories can be applied to a whole document or to individual elements. By its nature, however, provenance detail will often need to be recorded for individual elements, since process analysts often need to correlate differences in process with differences in process outcomes, e.g. translation style variations. The data category must therefore be able to assign more than one provenance record to an element.
Some provenance information in the ITS2.0 requirements:
* 'author' agent, as proposed in: http://www.w3.org/TR/2012/WD-its2req-20120524/#author overlaps with dc:author or dc:creator, so I suggest we do *not* address source content authorship agent in ITS2.0
* quality profile and error: this doesn't impact the provenance data categories proposed here, but may represent some redundancy in ITS data categories - to be resolved under ACTION-113. I will pursue this separately from this mail.

As a consequence, the following data categories are proposed here for discussion.

*For simple agent provenance:*
-----------------------------------------
The data category only records the agent involved for a specific process. Two have been suggested, one for translation and one for translation revision. I propose we specify these as:

*translation agent* expressed as one of:
* its-translation-agent: value is a string giving the name of the translation agent, or
* its-translation-agent-ref: value is a URI representing the translation agent

*translation revision agent* expressed as one of:
* its-trans-revision-agent: value is a string giving the name of the translation revision agent, or
* its-trans-revision-agent-ref: value is a URI representing the translation revision agent

This definition says nothing about the values used for the agent names; they could be very generic, e.g. type level such as 'smt', or more specific, e.g. 'moses v1.2 trained on legal bi-text'. My view is that provenance granularity and data values are specific to individual organizations and service provision contracts, so this may be the best approach.

Question: Is it sufficiently useful therefore to leave the agent values to be user defined?

Another question is whether we want to be able to point to an agent name within the document, similar to 'locNotePointer'?
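To make the simple agent proposal concrete, here is a rough sketch in Python of how a consumer might pull the four proposed attributes out of HTML content. It is only an illustration: the attribute names are the ones proposed in this mail and may change, and the sample markup, the example.com URI and the AgentCollector class are made up for the example.

```python
from html.parser import HTMLParser

# Attribute names as proposed in this mail; a proposal, not a final spec.
AGENT_ATTRS = (
    "its-translation-agent",
    "its-translation-agent-ref",
    "its-trans-revision-agent",
    "its-trans-revision-agent-ref",
)


class AgentCollector(HTMLParser):
    """Collect (tag, attribute, value) triples for the proposed agent attributes."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names.
        for name, value in attrs:
            if name in AGENT_ATTRS:
                self.records.append((tag, name, value))


# Hypothetical sample content; agent values are user-defined strings or URIs.
SAMPLE = """
<p its-translation-agent="smt">Machine-translated text.</p>
<p its-translation-agent-ref="http://example.com/agents/moses-legal"
   its-trans-revision-agent="Jane Reviewer">Translated, then revised text.</p>
"""

collector = AgentCollector()
collector.feed(SAMPLE)
for record in collector.records:
    print(record)
```

Since the agent values are left user defined, the sketch only collects them and makes no attempt to interpret or validate them.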
*standoff provenance*
-----------------------------
This was discussed in the Dublin workshop, see: http://www.w3.org/International/multilingualweb/lt/wiki/images/8/85/LEWIS-DAVE_2012-06-13.pdf

Since then we have confirmed that the W3C PROV group is far enough advanced that it is safe to reference their standard. I have also been in contact with them about how to reference provenance records, and this indicates that we need only one URI to reference any record, not separate document and element references as I proposed in Dublin.

I therefore propose the following data category for this:

*its-prov-ref*: the value of which is a URL pointing to an entity record conforming to the W3C PROV specification. The document or element to which this attribute applies should correspond to the resource that the provenance entity record represents.

There are several different configurations of how to deal with:
* multiple provenance records for a document or element
* recording provenance records for document parts without a reference
* managing the relationship between document level and element or fragment level provenance records

However, I think these can all be addressed by PROV usage profiles, and therefore can be recorded as non-normative best practice in relation to ITS2.0 rather than having to be supported in the normative ITS2.0 spec.

Does anyone have any queries or questions about this approach?

Regards,
Dave
Received on Thursday, 28 June 2012 12:48:27 UTC