Re: FW: [Action-126] David to come up with a proposal for mtConfidence from Dr. David Filip on 2012-08-02 (public-multilingualweb-lt@w3.org from August 2012)

From: Dr. David Filip <David.Filip@ul.ie>
Date: Thu, 2 Aug 2012 17:14:24 +0100
To: Jan Nelson <Jan.Nelson@microsoft.com>
Cc: "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>
Message-ID: <CANw5LK=x5FR4b6mWUvCy8KN0hDXU1vBupOQZuQ7ePmCYppx8Kw@mail.gmail.com>
Jan, I really appreciate the quick feedback from MS Research, and the WG
discussion today showed that we want to continue, at least in the 'first
wave', with the unambiguous part (mtConfidence) that you supported.

The mt*Metrics categories can be worked on later (if we are good
timewise) but should be put on the back burner for now, so as not to distract
us from making progress on the categories we have already agreed to move forward.

Cheers and thanks again
dF

Dr. David Filip
=======================
LRC | CNGL | LT-Web | CSIS
University of Limerick, Ireland
telephone: +353-6120-2781
*cellphone: +353-86-0222-158*
facsimile: +353-6120-2734
mailto: david.filip@ul.ie



On Thu, Aug 2, 2012 at 5:02 PM, Jan Nelson <Jan.Nelson@microsoft.com> wrote:

>  I hope to get more feedback on the broader spec over the next week from
> other key MS stakeholders, but saw an opportunity to align us on this
> particular topic as quickly as possible with the MS Translator folks.
>
>
>
> Jan
>  ------------------------------
> *From:* Dr. David Filip [David.Filip@ul.ie]
> *Sent:* Thursday, August 02, 2012 5:11 AM
> *To:* Jan Nelson
> *Cc:* public-multilingualweb-lt@w3.org
> *Subject:* Re: FW: [Action-126] David to come up with a proposal for
> mtConfidence
>
>   Jan, thanks for checking with Chris and I am glad that you support the
> mtConfidence part.
>
>  Regarding the mt*Metrics parts: I see what Chris means, but I need to
> clarify and qualify.
>
>  First, I think the comment is not valid for the mtHumanMetrics part. What I
> mean are the simple score-based human usability assessments that are
> increasingly used, although unfortunately the scales are not standardized.
> AFAIK scales of 4-5 values are in use; I would tend to promote 4 values
> as best practice, with 0 [publishable without changes] being the best and 3
> [complete retranslation needed] the worst.
>
>  Regarding the mtAutomatedMetrics part, I agree that you need the
> pointers mentioned by Chris to PERFORM these. But that was NOT the goal of
> this part of the proposal. I simply assumed that the MT producer has the
> metrics service set up (internally or as a third-party service) and is able
> to provide the scores at runtime. I should have said so. The
> metrics-producing service obviously must have all the mentioned pointers to
> compute the metrics, but IMHO these are not worth passing down.
>
>  Finally, I agree that the mtConfidence part seems the most stable and the
> best candidate for first-wave standardization. If the other parts do not
> have support I am happy to drop them; I just wanted to chart the area in a
> general way.
>
>  Cheers
> dF
>
>  Dr. David Filip
>  =======================
> LRC | CNGL | LT-Web | CSIS
> University of Limerick, Ireland
> telephone: +353-6120-2781
> *cellphone: +353-86-0222-158*
> facsimile: +353-6120-2734
> mailto: david.filip@ul.ie
>
>
>
> On Thu, Aug 2, 2012 at 7:36 AM, Jan Nelson <Jan.Nelson@microsoft.com> wrote:
>
>>  Feedback from Chris Wendt from our Microsoft Translator team on the
>> section below:
>>
>>>
>>>
>>> -mtQuality
>>>
>>> --mtConfidence
>>>
>>> ---mtProducer [string identifying producer Bing, DCU-Matrex etc.]
>>>
>>> ----mtEngine [string identifying the engine on one of the above
>>> platforms, can be potentially quite structured, pair domain etc.]
>>>
>>> -----mtConfidenceScore [0-100% or interval 0-1]
>>>
>>>
>>>
>>> All of the above makes total sense. I’d vote for it as proposed. Must be
>>> at sentence level, not below.
>>>
>>>
>>>
>>> The following does not make sense without significant enhancements:
>>>
>>>
>>>
>>> --mtAutomatedMetrics
>>>
>>> ---mtScoreType [METEOR, TER, BLEU, Levenshtein distance etc.]
>>>
>>> ----mtAutomatedMetricsScore [0-100% or interval 0-1]
>>>
>>> --mtHumanMetrics
>>>
>>> ---mtHumanMetricsScale [{4,3,2,1,0},{0,1,2,3,4},{3,2,1,0} etc.]
>>>
>>> ----mtHumanMetricsValue [one of the above values depending on scale]
>>>
>>>
>>>
>>>
>>>
>>> All of the scoring needs a pointer to one or more reference translations
>>> AND a pointer to the source document. In addition, there may not be the
>>> same number of references per element. You’d need to either add syntax for
>>> specifying the source and reference, or embed it all in the document.
>>> That’ll be messy.
>>>
>>> These numbers are relevant only for MT system evaluations, which I
>>> consider a niche, and I would not try to standardize here in ITS.
>>>
>>>
>>>
>>> Thanks,
>>>
>>>
>>>
>>> Jan
>>>
>>>
>>>
>>> ------------------------------
>>>
>>>   *From:* Felix Sasaki [fsasaki@w3.org]
>>> *Sent:* Wednesday, August 01, 2012 12:50 AM
>>> *To:* Dr. David Filip
>>> *Cc:* public-multilingualweb-lt@w3.org
>>>
>>> *Subject:* Re: [Action-126] David to come up with a proposal for
>>> mtConfidence
>>>
>>>   Hi David, all,
>>>
>>> 2012/7/31 Dr. David Filip <David.Filip@ul.ie>
>>>
>>> Hi all, I was trying to engage a PhD student here at LRC to produce a
>>> proposal for this data category but I failed.
>>>
>>>
>>>
>>> Nevertheless, here is my thinking on the category, so that maybe someone
>>> else (Declan?) could take it to the call-for-consensus stage.
>>>
>>>
>>>
>>> co-chair hat on: If there is no strong support for this, I would propose
>>> to put this on hold until we have finished all other data categories. As
>>> you wrote in your agenda,
>>>
>>>
>>>
>>>
>>> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Jul/0311.html
>>>
>>>
>>>
>>> we have various data categories proposals on the table that are not
>>> finished: special requirements, named entity, quality, ... I will send a
>>> proposal for the time until last call later today, which will show that we
>>> need to finish these and the various "ed. notes" in
>>>
>>>
>>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html
>>>
>>> I think we need the time and your input to work on these.
>>>
>>>
>>>
>>> I very much hope for your understanding - let's discuss this also during
>>> the call on Thursday,
>>>
>>>
>>>
>>> Felix
>>>
>>>
>>>
>>>
>>>
>>> I believe that mtConfidence is being produced in some form or other by
>>> all major current MT systems. As discussed in Dublin, the issue is that
>>> these confidence scores are not really comparable between engines: not
>>> only between Bing and Google, or Matrex, but not even between different
>>> language-pair engines or domain-specific engines based on the same
>>> general technology.
>>>
>>>
>>>
>>> Nevertheless there are prospects for standardizing based on cognitive
>>> effort on post-editing etc. Even knowing that the usability of confidence
>>> scores is limited, there are valid production-consumption scenarios in the
>>> content lifecycle.
>>>
>>> If a client/service provider/translator/reviewer repeatedly works with
>>> the same engine, they will find even the engine's self-evaluation useful.
>>>
>>>
>>>
>>> Further to this, there is potential for connecting this with automated
>>> and human MT evaluation scores, so I'd propose to generalize as mtQuality
>>> [meaning raw MT quality, NOT talking about levels of PE] that would subsume
>>> mtConfidence etc. as seen below
>>>
>>>
>>>
>>> My proposal of the data model based on the above
>>>
>>>
>>>
>>> -mtQuality
>>>
>>> --mtConfidence
>>>
>>> ---mtProducer [string identifying producer Bing, DCU-Matrex etc.]
>>>
>>> ----mtEngine [string identifying the engine on one of the above
>>> platforms, can be potentially quite structured, pair domain etc.]
>>>
>>> -----mtConfidenceScore [0-100% or interval 0-1]
>>>
>>> --mtAutomatedMetrics
>>>
>>> ---mtScoreType [METEOR, TER, BLEU, Levenshtein distance etc.]
>>>
>>> ----mtAutomatedMetricsScore [0-100% or interval 0-1]
>>>
>>> --mtHumanMetrics
>>>
>>> ---mtHumanMetricsScale [{4,3,2,1,0},{0,1,2,3,4},{3,2,1,0} etc.]
>>>
>>> ----mtHumanMetricsValue [one of the above values depending on scale]
>>>
>>>
>>>
>>> mtQuality is an optional attribute of a machine text segment (as in
>>> Unicode or localization segmentations). I do not think this is useful on
>>> higher or lower levels.
>>>
>>>
>>>
>>> mtQuality must be specified as mtConfidence XOR mtAutomatedMetrics
>>> XOR mtHumanMetrics
>>>
>>>
>>>
>>> Then comes the compulsory specification of the actual value (eventually
>>> preceded by value change if more options exist).
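
The proposed model and its "exactly one sub-category" rule can be made concrete with a minimal Python sketch. This is only an illustration under stated assumptions, not part of the proposal: the class names, the field names, the normalize_score helper and the sample engine string are hypothetical, scores are assumed to be stored on the 0-1 interval, and the three-way XOR is read as "exactly one of the three is present".

```python
from dataclasses import dataclass
from typing import Optional, Tuple


def normalize_score(raw: float, percent: bool = False) -> float:
    """Return a score on the 0-1 interval; percent=True means raw is 0-100."""
    value = raw / 100.0 if percent else float(raw)
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"score out of range: {raw!r}")
    return value


@dataclass(frozen=True)
class MtConfidence:
    producer: str  # mtProducer, e.g. "Bing"
    engine: str    # mtEngine, free-form string
    score: float   # mtConfidenceScore, assumed already on 0-1

    def __post_init__(self) -> None:
        normalize_score(self.score)  # raises ValueError if outside 0-1


@dataclass(frozen=True)
class MtAutomatedMetrics:
    score_type: str  # mtScoreType, e.g. "BLEU"
    score: float     # mtAutomatedMetricsScore, assumed already on 0-1

    def __post_init__(self) -> None:
        normalize_score(self.score)


@dataclass(frozen=True)
class MtHumanMetrics:
    scale: Tuple[int, ...]  # mtHumanMetricsScale, e.g. (0, 1, 2, 3)
    value: int              # mtHumanMetricsValue, must be on the scale

    def __post_init__(self) -> None:
        if self.value not in self.scale:
            raise ValueError(f"{self.value} is not on scale {self.scale}")


@dataclass(frozen=True)
class MtQuality:
    """Optional per-segment annotation carrying exactly one sub-category."""
    confidence: Optional[MtConfidence] = None
    automated: Optional[MtAutomatedMetrics] = None
    human: Optional[MtHumanMetrics] = None

    def __post_init__(self) -> None:
        parts = (self.confidence, self.automated, self.human)
        if sum(part is not None for part in parts) != 1:
            raise ValueError(
                "mtQuality needs exactly one of mtConfidence, "
                "mtAutomatedMetrics, mtHumanMetrics"
            )


# Hypothetical segment annotation: an engine reporting 87% confidence.
example = MtQuality(
    confidence=MtConfidence(
        producer="Bing",
        engine="en-de/general",  # made-up engine identifier
        score=normalize_score(87, percent=True),
    )
)
```

Storing everything on 0-1 and converting percentages at the boundary is just one way to handle the "0-100% or interval 0-1" alternative; a serialization could equally keep the original notation.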
>>>
>>>
>>>
>>> Cheers
>>>
>>> dF
>>>
>>>
>>>
>>>
>>>   Dr. David Filip
>>>
>>> =======================
>>>
>>> LRC | CNGL | LT-Web | CSIS
>>>
>>> University of Limerick, Ireland
>>>
>>> telephone: +353-6120-2781
>>>
>>> *cellphone: +353-86-0222-158*
>>>
>>> facsimile: +353-6120-2734
>>>
>>> mailto: david.filip@ul.ie
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Felix Sasaki
>>>
>>> DFKI / W3C Fellow
>>>
>>>
>>>
>>
>>
>>
>>  --
>> Felix Sasaki
>> DFKI / W3C Fellow
>>
>>
>
Received on Thursday, 2 August 2012 16:15:32 UTC