Re: FW: [Action-126] David to come up with a proposal for mtConfidence

Jan, thanks for checking with Chris, and I am glad that you support the
mtConfidence part.

Regarding the mt*Metrics parts: I see what Chris means, but I need to
clarify and qualify.

First, I think the comment is not valid for the mtHumanMetrics part. What I
mean are the simple score-based human usability assessments that are
increasingly used; unfortunately, the scales are not being standardized.
AFAIK 4- and 5-value scales are in use. I would tend to promote 4 values as
best practice, with 0 [publishable without changes] being the best and 3
[complete retranslation needed] the worst.
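To make the proposed best-practice scale concrete, here is a minimal
illustrative sketch in Python. Note that only the endpoint labels (0 and 3)
come from the proposal above; the wording for the intermediate values 1 and
2 is my assumption.

```python
# Hypothetical sketch of the proposed 4-value human usability scale.
# Only the endpoint labels (0 and 3) come from the proposal;
# the labels for 1 and 2 are illustrative assumptions.
MT_HUMAN_METRICS_SCALE = {
    0: "publishable without changes",    # best
    1: "minor post-editing needed",      # assumed intermediate label
    2: "major post-editing needed",      # assumed intermediate label
    3: "complete retranslation needed",  # worst
}

def describe_human_score(value: int) -> str:
    """Map a human usability score to its description."""
    if value not in MT_HUMAN_METRICS_SCALE:
        raise ValueError(f"score must be one of {sorted(MT_HUMAN_METRICS_SCALE)}")
    return MT_HUMAN_METRICS_SCALE[value]
```

The point of fixing a single ordered scale is that scores from different
reviewers and tools become directly comparable.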

Regarding the mtAutomatedMetrics part, I agree that you need the pointers
mentioned by Chris to PERFORM these metrics. But that was NOT the goal of
this part of the proposal. I simply assumed that the MT producer has the
metrics service set up (internally or as a third-party service) and is able
to provide the scores at runtime. I should have said so... The
metrics-producing service obviously must have all the mentioned pointers to
perform the metrics, but IMHO these are not worth passing down.

Finally, I agree that the mtConfidence part seems the most stable and the
best candidate for first-wave standardization. If the other parts do not
have support, I am happy to drop them; I just wanted to chart the area in a
general way.
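For illustration, a producer could attach the mtConfidence information to a
translated segment at runtime along these lines. This is a minimal sketch:
the field names mirror the proposed hierarchy (mtProducer, mtEngine,
mtConfidenceScore), but the dict-based serialization and the function itself
are illustrative assumptions, not part of the proposal.

```python
# Hypothetical sketch: an MT producer attaching mtConfidence metadata
# to a translated sentence-level segment at runtime. Field names mirror
# the proposed hierarchy; the serialization is an illustrative assumption.
def annotate_segment(text: str, producer: str, engine: str, score: float) -> dict:
    """Bundle a translated segment with its mtConfidence annotation."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("mtConfidenceScore must lie in the interval 0-1")
    return {
        "text": text,
        "mtQuality": {
            "mtConfidence": {
                "mtProducer": producer,
                "mtEngine": engine,
                "mtConfidenceScore": score,
            }
        },
    }

seg = annotate_segment("Bonjour le monde", "Bing", "en-fr.general", 0.87)
```

As Chris notes, this annotation is self-contained: unlike the metrics parts,
it needs no pointers to reference translations or source documents.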

Cheers
dF

Dr. David Filip
=======================
LRC | CNGL | LT-Web | CSIS
University of Limerick, Ireland
telephone: +353-6120-2781
*cellphone: +353-86-0222-158*
facsimile: +353-6120-2734
mailto: david.filip@ul.ie



On Thu, Aug 2, 2012 at 7:36 AM, Jan Nelson <Jan.Nelson@microsoft.com> wrote:

>  Feedback from Chris Wendt from our Microsoft Translator team on the
> section below:
>
>>
>>
>> -mtQuality
>>
>> --mtConfidence
>>
>> ---mtProducer [string identifying producer Bing, DCU-Matrex etc.]
>>
>> ----mtEngine [string identifying the engine on one of the above
>> platforms; can potentially be quite structured: language pair, domain, etc.]
>>
>> -----mtConfidenceScore [0-100% or interval 0-1]
>>
>>
>>
>> All of the above makes total sense. I’d vote for it as proposed. Must be
>> at sentence level, not below.
>>
>>
>>
>> The following does not make sense without significant enhancements:
>>
>>
>>
>> --mtAutomatedMetrics
>>
>> ---mtScoreType [METEOR, TER, BLEU, Levenshtein distance etc.]
>>
>> ----mtAutomatedMetricsScore [0-100% or interval 0-1]
>>
>> --mtHumanMetrics
>>
>> ---mtHumanMetricsScale [{4,3,2,1,0}, {0,1,2,3,4}, {3,2,1,0} etc.]
>>
>> ----mtHumanMetricsValue [one of the above values depending on scale]
>>
>>
>>
>>
>>
>> All of the scoring needs a pointer to one or more reference translations
>> AND a pointer to the source document. In addition, there may not be the
>> exact same number of references per element. You'd need to either add
>> syntax for specifying the source and reference, or embed it all in the
>> document. That'll be messy.
>>
>> These numbers are relevant only for MT system evaluations, which I
>> consider a niche, and I would not try to standardize here in ITS.
>>
>>
>>
>> Thanks,
>>
>>
>>
>> Jan
>>
>>
>>
>> ------------------------------
>>
>>   *From:* Felix Sasaki [fsasaki@w3.org]
>> *Sent:* Wednesday, August 01, 2012 12:50 AM
>> *To:* Dr. David Filip
>> *Cc:* public-multilingualweb-lt@w3.org
>>
>> *Subject:* Re: [Action-126] David to come up with a proposal for
>> mtConfidence
>>
>>  Hi David, all,
>>
>> 2012/7/31 Dr. David Filip <David.Filip@ul.ie>
>>
>> Hi all, I was trying to engage a PhD student here at LRC to produce a
>> proposal for this data category, but I failed.
>>
>>
>>
>> Nevertheless, here is my thinking on the category; maybe someone else
>> (Declan?) could take it to the call-for-consensus stage.
>>
>>
>>
>> co-chair hat on: If there is no strong support for this, I would propose
>> to put this on hold until we have finished all other data categories. As
>> you wrote in your agenda,
>>
>>
>>
>>
>> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Jul/0311.html
>>
>>
>>
>> we have various data categories proposals on the table that are not
>> finished: special requirements, named entity, quality, ... I will send a
>> proposal for the time until last call later today, which will show that we
>> need to finish these and the various "ed. notes" in
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html
>>
>> I think we need the time and your input to work on these.
>>
>>
>>
>> I very much hope for your understanding - let's discuss this also during
>> the call on Thursday,
>>
>>
>>
>> Felix
>>
>>
>>
>>
>>
>> I believe that mtConfidence is being produced in some form or other by
>> all major current MT systems. As discussed in Dublin, the issue is that
>> these confidence scores are not really comparable between engines; I mean
>> not only between Bing and Google, or Matrex, but not even between
>> different language-pair engines, or specific domain-trained engines based
>> on the same general technology.
>>
>>
>>
>> Nevertheless, there are prospects for standardization based on cognitive
>> effort in post-editing etc. Even knowing that the usability of confidence
>> scores is limited, there are valid production-consumption scenarios in the
>> content lifecycle.
>>
>> If a client, service provider, translator, or reviewer repeatedly works
>> with the same engine, they will find even the engine's self-evaluation
>> useful.
>>
>>
>>
>> Further to this, there is potential for connecting this with automated
>> and human MT evaluation scores, so I'd propose to generalize this as
>> mtQuality [meaning raw MT quality, NOT talking about levels of PE], which
>> would subsume mtConfidence etc., as seen below.
>>
>>
>>
>> My proposal for the data model, based on the above:
>>
>>
>>
>> -mtQuality
>>
>> --mtConfidence
>>
>> ---mtProducer [string identifying producer Bing, DCU-Matrex etc.]
>>
>> ----mtEngine [string identifying the engine on one of the above
>> platforms; can potentially be quite structured: language pair, domain, etc.]
>>
>> -----mtConfidenceScore [0-100% or interval 0-1]
>>
>> --mtAutomatedMetrics
>>
>> ---mtScoreType [METEOR, TER, BLEU, Levenshtein distance etc.]
>>
>> ----mtAutomatedMetricsScore [0-100% or interval 0-1]
>>
>> --mtHumanMetrics
>>
>> ---mtHumanMetricsScale [{4,3,2,1,0}, {0,1,2,3,4}, {3,2,1,0} etc.]
>>
>> ----mtHumanMetricsValue [one of the above values depending on scale]
>>
>>
>>
>> mtQuality is an optional attribute of a machine-translated text segment
>> (as in Unicode or localization segmentations). I do not think this is
>> useful at higher or lower levels.
>>
>>
>>
>> mtQuality must be specified as mtConfidence XOR mtAutomatedMetrics
>> XOR mtHumanMetrics
>>
>>
>>
>> Then comes the compulsory specification of the actual value (eventually
>> preceded by the scale choice if more options exist).
>>
>>
>>
>> Cheers
>>
>> dF
>>
>>
>>
>>
>>   Dr. David Filip
>>
>> =======================
>>
>> LRC | CNGL | LT-Web | CSIS
>>
>> University of Limerick, Ireland
>>
>> telephone: +353-6120-2781
>>
>> *cellphone: +353-86-0222-158*
>>
>> facsimile: +353-6120-2734
>>
>> mailto: david.filip@ul.ie
>>
>>
>>
>>
>>
>>
>>
>> --
>> Felix Sasaki
>>
>> DFKI / W3C Fellow
>>
>>
>>
>
>
>
>  --
> Felix Sasaki
> DFKI / W3C Fellow
>
>

Received on Thursday, 2 August 2012 12:12:20 UTC