Re: MT Confidence definition [ACTION-556]

Hi all,
I managed to talk to Declan Groves about this yesterday. His view was 
that the original use case was to enable the confidence scores that all 
statistical MT engines already generate when selecting the final output 
to be propagated in an open way. So using other methods is a change (a 
broadening) of the use case.

He also saw the danger of confusion among users/implementors if 
something labelled as a 'confidence score' (which has a specific meaning 
in NLP circles) were used to convey quality estimation (QE), which, 
depending on how it is done, has a different sort of significance.

We did discuss the option of MT Confidence being used to convey the 
output of an automated metric (e.g. BLEU) that had been integrated into 
an MT engine. This would be reasonable in use cases where MT engines are 
being dynamically retrained, but it would require relaxing the wording.
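For concreteness, in the HTML serialisation of ITS 2.0 such a score is 
carried by the its-mt-confidence attribute, with its-annotators-ref 
identifying the producer. A minimal sketch (the engine IRI below is 
invented for illustration):

```html
<!-- Hypothetical example: the engine IRI is a made-up placeholder. -->
<!-- its-mt-confidence takes a value in the interval [0,1]. -->
<p its-annotators-ref="mt-confidence|http://example.com/engines/smt-v1"
   its-mt-confidence="0.8982">The machine-translated sentence.</p>
```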

I also asked some QE researchers in CNGL and got some interesting 
clarifications. Certainly QE is being used to provide scores for MT 
output (I was mistaken about that on the call), often trained on human 
annotations of the quality of previous translations, correlated to the 
current translation and perhaps other metadata (including the 
self-reported confidence scores) from the MT engine. Certainly there are 
also occasions where QE operates in a very similar fashion to that 
intended for non-conformance in LQI, so I think that remains an option 
also.

So, Yves, you are right that the current definition is limiting with 
respect to other possible 'scores' representing confidence in the 
translation being a 'good' one, beyond just the MT-engine-generated 
scores.

At the same time I have the impression that the technologies for this 
are still emerging from the lab and don't have the benefit of the widely 
used common platforms and industrial experience that SMT does. Overall 
this makes it difficult to make any hard and fast statements about what 
should and should not be used to generate MT Confidence scores right now.

So softening that limitation as Yves suggests may be useful in 
accommodating innovations in this area, but it may also open the door to 
confusion among users, which could negatively impact the business 
benefits of interoperation, e.g. a translation client receives a score 
that they think has a certain significance when in fact it has another.

So, if we were to make the changes suggested by Yves, we should 
accompany them with some best practice work suggesting how the 
annotatorsRef value could be used to indicate the particular method used 
to generate the MT Confidence score, including some classification 
encodings, explanations of the different methods, and the significance 
that can be placed on the resulting scores in different situations. My 
general feeling, perhaps incorrect, is that the current IG membership 
probably doesn't have the breadth of expertise to provide this best 
practice. Arle, could this be something that QT-Launchpad could take on?
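To sketch what I mean by using annotatorsRef to classify the scoring 
method (all IRIs below are invented placeholders, not a proposal for 
actual values):

```html
<!-- Hypothetical best-practice sketch: the method IRIs are placeholders. -->
<!-- A self-reported score from the decoder of an SMT engine: -->
<p its-annotators-ref="mt-confidence|http://example.com/methods/smt-self-reported"
   its-mt-confidence="0.8982">...</p>

<!-- A score produced by a separate QE system over the same MT output: -->
<p its-annotators-ref="mt-confidence|http://example.com/methods/qe-post-hoc"
   its-mt-confidence="0.71">...</p>
```

A consumer could then dereference or look up the IRI to learn which 
method produced the score and what significance to attach to it.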

To sum up:
1) The text proposed by Yves may relax the limits on what can produce an 
MT Confidence score in a useful way by accommodating different 
techniques, but it also has the potential to cause confusion about the 
significance of scores produced by different methods. Some of these 
could anyway be conveyed via non-conformance in LQI, but not all.

2) It seems very difficult to formulate wording that would constrain the 
range of methods in any usable way between the current text and what 
Yves suggests. So let's restrict ourselves to these two options.

3) If we relax the wording as Yves suggests, expertise would be needed 
to form best practice on the use of the annotatorsRef value to provide a 
way of classifying the different scoring methods that is useful for 
users.

Apologies for the long email, but unfortunately I could not find any 
clear pointers one way or the other. Personally, I'm fairly neutral on 
the proposal. I also don't know whether we could categorize this as a 
minor clarification or not.

Please voice your views on the list, and let's try to reach consensus 
before the call next week. Note that I'm not available for the call, and 
I think Felix is away also.

But we need to form a consensus quickly if we are to avoid delaying the 
PR stage further.

Regards,
Dave


On 17/07/2013 11:35, Yves Savourel wrote:
> Hi Dave,
>
> In the case of QuEst, for the scenario I have in mind, one would for example perform the MT part with MS Hub, then pass that
> information to QuEst and get back a score that indicates a level of confidence for that translation candidate. So that's a step after
> MT and before any human looks at it.
>
> I may be wrong, but "MT Confidence" seems to be a good place to put that information.
>
> Even if QuEst is a wrong example, having MT Confidence restricted to *self-reported* values seems very limiting. But maybe I'm
> misinterpreting the initial aim of the data category.
>
> Cheers,
> -ys
>
> -----Original Message-----
> From: Dave Lewis [mailto:dave.lewis@cs.tcd.ie]
> Sent: Wednesday, July 17, 2013 12:25 PM
> To: public-multilingualweb-lt@w3.org
> Subject: Re: MT Confidence definition
>
> Hi Yves,
> I don't necessarily agree with this based on the example you give in relation to quality estimation in QuEst.
>
> Is not the goal of quality estimation to predict the quality of a translation of a given source string for a given MT engine,
> training corpus and training regime _prior_ to actually performing the translation?
>
> In which case it would be an annotation not of a translation but of a _source_, with reference to an existing or planned MT engine
> (which you rightly say in response to Sergey can be resolved via the annotatorsRef).
>
> So while the basic data structure of mtConfidence would work for the use case, the name and wording don't, I think, match the use of
> MT QE.
>
> Declan, Ankit, could you comment? I'm not really an expert here, and not up to speed on the different applications of MT QE.
>
> cheers,
> Dave
>
>
> On 17/07/2013 08:29, Yves Savourel wrote:
>> Hi all,
>>
>> I've noticed a minor text issue in the specification:
>>
>> For the MT Confidence data category we say:
>>
>> "The MT Confidence data category is used to communicate the
>> self-reported confidence score from a machine translation engine of the accuracy of a translation it has provided."
>>
>> This is very limiting.
>>
>> I think it should say:
>>
>> "The MT Confidence data category is used to communicate the confidence
>> score of the accuracy of a translation provided by a machine translation."
>>
>> (and later: "the self-reported confidence score" should be "the reported confidence score").
>>
>> There could be cases where the confidence score is provided by a
>> system other than the one that provided the MT candidate. The QuEst
>> project is an example of this
>> (http://staffwww.dcs.shef.ac.uk/people/L.Specia/projects/quest.html)
>>
>> Cheers,
>> -ys
>>
>>
>>
>

Received on Friday, 19 July 2013 17:38:38 UTC