- From: Dave Lewis <dave.lewis@cs.tcd.ie>
- Date: Tue, 08 May 2012 23:15:20 +0100
- To: "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>
- Message-ID: <4FA99AF8.4090307@cs.tcd.ie>
Dear all,

Here are some notes on how we might consolidate author, revisionAgent and translationAgent by aligning with the work of the W3C Provenance WG: http://www.w3.org/2011/prov/wiki/Main_Page

The model of this working group is most simply summarised by the figure at http://www.w3.org/TR/prov-dm/#prov-dm-overview in the core Data Model specification. Essentially, the provenance model is intended to allow recording of how *entities* were used and generated by *activities*, which are conducted through the action of *agents*. A set of relations then ties these together, such as 'wasGeneratedBy', 'wasDerivedFrom' and 'used'; see: http://www.w3.org/TR/prov-dm/#prov-dm-types-and-relations

The model specifies the format of provenance records but, unlike most ITS tags, the intent is for these records to be maintained in a dedicated store. Like ITS, it defines an abstract notation (PROV-DM) for such records, and then defines different implementations, namely a text-file format (PROV-N), an ontology version that can be mapped into RDF (PROV-O), a RESTful access and query mechanism returning records as HTML (PROV-AQ), and an XML binding (PROV-XML). These specs are currently going through the W3C process, with the aim of reaching Recommendation status by Jan '13.

Briefly, the most direct mapping to ITS would be some sort of binding between host documents (and their elements) and the entities recorded in provenance records. The binding will depend on the implementation used for the provenance, e.g. just a URL, an XPointer, or a file URL and an entity record ID within that file.

Using the last of these, we could imagine:

    <span its-prov-ref="http://www.eg.org/prov-ex1.txt" its-prov-ent="e1">My hovercraft is full of eels.</span>
    <span its-prov-ref="http://www.eg.org/prov-ex1.txt" its-prov-ent="e2">Mon aéroglisseur est plein d'anguilles.</span>

where http://www.eg.org/prov-ex1.txt would contain something like:

    entity(e1)
    entity(e2)

which in turn could be referenced by activities such as a1:

    wasGeneratedBy(e1, a1, 2011-11-16T16:05:30)
    -- specifies that an entity was generated by an activity at a specific time

    activity(a1, 2011-11-16T16:05:00, 2011-11-16T16:06:00, [its-prov-process-type="authorContent", its-source-lang="en"] )
    -- identifies an activity, its start and stop time and other relevant attributes

    -- similarly we can define that e2 was generated by a machine translation process
    wasGeneratedBy(e2, a2, 2011-11-16T16:07:30)
    activity(a2, 2011-11-16T16:07:00, 2011-11-16T16:08:00, [its-prov-process-type="mTranslate"] )

    -- then we can define agents associated with these activities
    agent(Trevor, [ prov:type="Person", its-prov-agent-type="author" ] )
    agent(matrex-eng1234, [ prov:type="SoftwareAgent", its-prov-agent-type="smt", its-prov-src-lang="en", its-prov-tgt-lang="fr" ] )
    wasAssociatedWith(a1, Trevor)
    wasAssociatedWith(a2, matrex-eng1234)

So you can see that this stand-off metadata approach based on the PROV model means we can also record things like the suggested qualityError data category (http://www.w3.org/International/multilingualweb/lt/wiki/Requirements#qualityError):

    entity(e3, [its-ent-type="qa-error-report", its-qa-err-severity="0.5", its-qa-err-note="suspect terminology"] )
    -- actually PROV has an annotation structure that could be used instead of its-qa-err-note
    wasGeneratedBy(g1, e3, a3, 2011-11-16T16:08:30)
    wasDerivedFrom(e3, e2, a3, g1)
    activity(a3, 2011-11-16T16:08:00, 2011-11-16T16:09:00, [its-prov-process-type="translateQA", its-prov-qa-ruleset="LISAQA"] )
    wasAssociatedWith(a3, Pierre)
    agent(Pierre, [ prov:type="Person", its-prov-agent-type="trans-QA-checker" ] )

This approach makes it easy to have several different provenance entities associated with any particular doc, element or span, and avoids the heavy ITS markup overhead that would likely arise if several provenance records were applied inline.

What is required in terms of specification is the set of additional attributes we want to use and their values. Essentially this would be a profile of the PROV specs. We may need to liaise with that working group on how to do this, since I can't see that they have addressed this yet. Note that its-prov-process-type should result from the consideration given to the section http://www.w3.org/International/multilingualweb/lt/wiki/Requirements#Process_Model

In this way we can replace the author, revisionAgent, translationAgent and perhaps also the quality data categories by the 'its-prov-ent' data category, which references the entity representing the doc/element/span, and then through profiling let the PROV spec do the rest.

All comments welcome; we can discuss this more on Thursday's call.

Cheers,
Dave
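
To make the stand-off binding described above a little more concrete, here is a minimal Python sketch of how a consumer might follow an its-prov-ent reference from a span to the agent responsible for it. It is an illustration under stated assumptions, not part of any spec: the its-prov-* attribute names and the record syntax are the ones proposed in this mail (the final PROV-N grammar may differ), the helper names are hypothetical, the records are embedded as a string rather than fetched from the example URL, and only the three-argument wasGeneratedBy(entity, activity, time) form is handled.

```python
import re
from html.parser import HTMLParser

# Stand-off records in the notation used in this mail (an assumption; the
# final PROV-N syntax may differ). In practice they would be retrieved from
# the URL given in its-prov-ref.
PROV_TEXT = """
entity(e1)
entity(e2)
wasGeneratedBy(e1, a1, 2011-11-16T16:05:30)
wasGeneratedBy(e2, a2, 2011-11-16T16:07:30)
wasAssociatedWith(a1, Trevor)
wasAssociatedWith(a2, matrex-eng1234)
"""

HTML_TEXT = (
    '<span its-prov-ref="http://www.eg.org/prov-ex1.txt" its-prov-ent="e1">'
    "My hovercraft is full of eels.</span>"
    '<span its-prov-ref="http://www.eg.org/prov-ex1.txt" its-prov-ent="e2">'
    "Mon aéroglisseur est plein d'anguilles.</span>"
)


class ProvBindingCollector(HTMLParser):
    """Collects (provenance file URL, entity ID) pairs from its-prov-* attributes."""

    def __init__(self):
        super().__init__()
        self.bindings = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if "its-prov-ref" in attr_map and "its-prov-ent" in attr_map:
            self.bindings.append(
                (attr_map["its-prov-ref"], attr_map["its-prov-ent"])
            )


def parse_relation(text, name):
    """Maps the first argument to the second for each 'name(first, second, ...)' record."""
    pattern = re.compile(
        r"^\s*" + re.escape(name) + r"\(\s*([^,\s)]+)\s*,\s*([^,\s)]+)",
        re.MULTILINE,
    )
    return {m.group(1): m.group(2) for m in pattern.finditer(text)}


def agent_for_entity(prov_text, entity_id):
    """Follows entity -> generating activity -> associated agent (None if a link is missing)."""
    generated_by = parse_relation(prov_text, "wasGeneratedBy")
    associated_with = parse_relation(prov_text, "wasAssociatedWith")
    activity = generated_by.get(entity_id)
    return associated_with.get(activity)


collector = ProvBindingCollector()
collector.feed(HTML_TEXT)
agents = [agent_for_entity(PROV_TEXT, ent) for _ref, ent in collector.bindings]
```

The point of the sketch is that the host document carries only the two reference attributes per span; everything else (who or what produced the text, and when) is looked up in the separate record store, so adding further provenance records does not add markup to the document.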
Received on Tuesday, 8 May 2012 22:15:48 UTC