Re: [ACTION-81] consider consolidation of author, revisionAgent and translationAgent from Dave Lewis on 2012-05-10 (public-multilingualweb-lt@w3.org from May 2012)

From: Dave Lewis <dave.lewis@cs.tcd.ie>
Date: Thu, 10 May 2012 01:50:45 +0100
To: Felix Sasaki <fsasaki@w3.org>
CC: "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>
Message-ID: <4FAB10E5.8070108@cs.tcd.ie>
Hi Felix,
Yes, I'll take an action to contact the provenance group - I have some 
other queries that emerged in considering this. I took that date from an 
updated schedule on their wiki, but it's not clear if there has been an 
accompanying charter update. This is obviously a major consideration in 
any alignment.

And yes, consolidating data categories should only occur when doing so 
does not significantly complicate the implementation of either of the 
originals, which I agree would not be the case in this circumstance. I 
just wanted to point out that provenance is an appropriate 
general-purpose QA recording mechanism that could be easily applied to 
this need. But it would probably only be useful if you were already 
using provenance for other purposes (including different types of QA, 
such as source QA, monolingual target QA etc.) and were therefore ready 
to absorb the additional complexity costs. But that's not a good reason 
for dropping the quality data categories, I agree.

cheers,
Dave

On 09/05/2012 07:55, Felix Sasaki wrote:
> Hi Dave,
>
> 2012/5/9 Dave Lewis <dave.lewis@cs.tcd.ie>
>
>     Dear all,
>     Here are some notes on how we might consolidate author,
>     revisionAgent and translationAgent by alignment with the work of
>     the W3C Provenance WG http://www.w3.org/2011/prov/wiki/Main_Page.
>
>     The model of this working group is most simply summarised by
>     figure http://www.w3.org/TR/prov-dm/#prov-dm-overview in the core
>     Data Model specification. Essentially the provenance model is
>     intended to allow recording of how *entities* were used and
>     generated by *activities* which are conducted through the action
>     of *agents*. There are then a number of relations that tie these
>     together, such as 'wasGeneratedBy', 'wasDerivedFrom' and 'used'; see:
>     http://www.w3.org/TR/prov-dm/#prov-dm-types-and-relations
>
>     The model specifies the format of provenance records, but unlike
>     most ITS tags, the intent is for these records to be maintained in
>     a dedicated store. Like ITS, it defines an abstract notation
>     (PROV-DM) for such records, and then defines different
>     implementations, namely a text-file format (PROV-N), an ontology
>     version that can be mapped into RDF (PROV-O), a RESTful access and
>     query mechanism returning records as HTML (PROV-AQ), and an XML
>     binding (PROV-XML). These specs are currently going through the W3C
>     process, with the aim of reaching Recommendation status by January 2013.
>
>
> Currently this group is chartered only until October this year
> https://www.w3.org/Member/Mail/
> Again I may have missed things, but are you sure about the progress of 
> this or could you take an action to talk to the provenance co-chairs 
> about their timeline?
>
>
>     Briefly, the most direct mapping to ITS would be some sort of
>     binding between host documents and their elements, and entities as
>     recorded in provenance records. The binding will depend on the
>     implementation used for the provenance, e.g. just a URL, an
>     XPointer, or a file URL and an entity record ID within that file.
>     Using the last of these we could imagine:
>
>     <span its-prov-ref="http://www.eg.org/prov-ex1.txt"
>     its-prov-ent="e1">My hovercraft is full of eels.</span>
>     <span its-prov-ref="http://www.eg.org/prov-ex1.txt"
>     its-prov-ent="e2">Mon aéroglisseur est plein d'anguilles.</span>
>
>     where http://www.eg.org/prov-ex1.txt would contain something like:
>
>     entity(e1)
>     entity(e2)
>
>     which in turn could be referenced by an activity a1:
>
>     wasGeneratedBy(e1, a1, 2011-11-16T16:05:30) -- specifies that an
>     entity was generated by an activity at a specific time
>
>     activity(a1, 2011-11-16T16:05:00, 2011-11-16T16:06:00,
>     [its-prov-process-type="authorContent", its-source-lang="en"] ) --
>     identifies an activity, its start and stop time and other relevant
>     attributes
>
>     -- similarly we can define that e2 was generated by a machine
>     translation process
>     wasGeneratedBy(e2, a2, 2011-11-16T16:07:30)
>     activity(a2, 2011-11-16T16:07:00, 2011-11-16T16:08:00,
>     [its-prov-process-type="mTranslate"] )
>
>     -- then we can define agents associated with these activities
>     agent(Trevor, [ prov:type="Person", its-prov-agent-type="author" ] )
>     agent(matrex-eng1234, [ prov:type="SoftwareAgent",
>     its-prov-agent-type="smt", its-prov-src-lang="en",
>     its-prov-tgt-lang="fr" ] )
>
>     wasAssociatedWith(a1, Trevor)
>     wasAssociatedWith(a2, matrex-eng1234)
>
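A minimal sketch of how a tool might consume the records above, assuming they have already been parsed into plain Python dictionaries. The dictionary layout and the helper agents_for are hypothetical and not part of PROV or ITS; the identifiers and attribute values are copied from the records above, with the machine translation activity taken to be a2 so that it matches wasGeneratedBy(e2, a2, ...).

```python
# Hypothetical in-memory form of the PROV-style records quoted above.
# Assumption: the machine translation activity is a2, as referenced by
# wasGeneratedBy(e2, a2, ...) and wasAssociatedWith(a2, matrex-eng1234).

activities = {
    "a1": {"its-prov-process-type": "authorContent", "its-source-lang": "en"},
    "a2": {"its-prov-process-type": "mTranslate"},
}

agents = {
    "Trevor": {"prov:type": "Person", "its-prov-agent-type": "author"},
    "matrex-eng1234": {
        "prov:type": "SoftwareAgent",
        "its-prov-agent-type": "smt",
        "its-prov-src-lang": "en",
        "its-prov-tgt-lang": "fr",
    },
}

# entity id -> id of the activity that generated it
was_generated_by = {"e1": "a1", "e2": "a2"}

# activity id -> ids of the agents associated with it
was_associated_with = {"a1": ["Trevor"], "a2": ["matrex-eng1234"]}


def agents_for(entity_id):
    """Return (agent id, attributes) pairs for the activity that generated entity_id."""
    activity_id = was_generated_by.get(entity_id)
    if activity_id is None:
        return []
    return [(a, agents[a]) for a in was_associated_with.get(activity_id, [])]


if __name__ == "__main__":
    # Resolve the value of its-prov-ent on each span to the responsible agents.
    for entity_id in ("e1", "e2"):
        print(entity_id, agents_for(entity_id))
```

The point of the sketch is that the host document only needs to carry the entity ID; the author or translation agent is then found by following the generation and association records.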
>     So you can see that this stand-off metadata approach based on the
>     PROV model means we can also record things like the suggested
>     qualityError data category
>     (http://www.w3.org/International/multilingualweb/lt/wiki/Requirements#qualityError)
>
>
> This worries me a bit - we agreed not to conflate data categories, and 
> what you suggest creates a dependency between qualityError and 
> provenance. Wouldn't it be better to keep them separate?
>
>
>     entity(e3, [its-ent-type="qa-error-report",
>     its-qa-err-severity="0.5", its-qa-err-note="suspect terminology"] )
>     --actually PROV has an annotation structure that could be used
>     instead of its-qa-err-note
>
>     wasGeneratedBy(g1, e3, a3, 2011-11-16T16:08:30)
>     wasDerivedFrom(e3, e2, a3, g1)
>     activity(a3, 2011-11-16T16:08:00, 2011-11-16T16:09:00,
>     [its-prov-process-type="translateQA", its-prov-qa-ruleset="LISAQA"] )
>     wasAssociatedWith(a3, Pierre)
>     agent(Pierre, [ prov:type="Person",
>     its-prov-agent-type="trans-QA-checker" ] )
>
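A minimal sketch of how the QA records above might be queried, again assuming they have been parsed into plain Python structures. The layout and the helper qa_reports_for are hypothetical; only the identifiers and attribute values come from the records above.

```python
# Hypothetical in-memory form of the QA-related records quoted above.

entities = {
    "e2": {},
    "e3": {
        "its-ent-type": "qa-error-report",
        "its-qa-err-severity": "0.5",
        "its-qa-err-note": "suspect terminology",
    },
}

# (derived entity, source entity, activity), following wasDerivedFrom(e3, e2, a3, g1)
was_derived_from = [("e3", "e2", "a3")]

# activity id -> ids of the agents associated with it
was_associated_with = {"a3": ["Pierre"]}


def qa_reports_for(entity_id):
    """Return the QA error reports that were derived from entity_id."""
    reports = []
    for derived, source, activity in was_derived_from:
        attrs = entities.get(derived, {})
        if source == entity_id and attrs.get("its-ent-type") == "qa-error-report":
            reports.append({
                "report": derived,
                "severity": float(attrs["its-qa-err-severity"]),
                "note": attrs["its-qa-err-note"],
                "checkers": was_associated_with.get(activity, []),
            })
    return reports


if __name__ == "__main__":
    # e2 is the French span; its QA report is found without any extra markup on the span.
    print(qa_reports_for("e2"))
```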
>     This approach makes it easy to have several different provenance
>     entities associated with any particular doc, element or span, and
>     heads off the likely high level of ITS markup overhead that may
>     occur if several provenance records are applied.
>
>     What is required in terms of specification is the set of
>     additional attributes we want to use and their values. Essentially
>     this would be a profile of the PROV specs. We may need to liaise
>     with that working group on how to do this since I can't see that
>     they have addressed this yet.
>
>     Note that its-prov-process-type should result from the
>     consideration given to section
>     http://www.w3.org/International/multilingualweb/lt/wiki/Requirements#Process_Model
>
>
>     In this way we can replace the author, revisionAgent,
>     translationAgent and perhaps also the quality data categories by
>     the 'its-prov-ent' data category to reference the entity
>     representing the doc/element/span and then through profiling let
>     the PROV spec do the rest.
>
>
>
> Like above, I am worried by combining data categories. I assume that 
> you see a benefit in merging them, but it may create a lot of 
> complexity for people not interested in provenance.
>
> Felix
>
>
>     All comments welcome; we can discuss this more on Thursday's call.
>     cheers,
>     Dave
>
>
>
>
> -- 
> Felix Sasaki
> DFKI / W3C Fellow
>
Received on Thursday, 10 May 2012 00:51:11 UTC