- From: Felix Sasaki <fsasaki@w3.org>
- Date: Sun, 21 Jul 2013 14:10:45 +0200
- To: Yves Savourel <ysavourel@enlaso.com>
- CC: 'Dave Lewis' <dave.lewis@cs.tcd.ie>, public-multilingualweb-lt@w3.org
- Message-ID: <51EBCFC5.6070903@w3.org>
Hi Dave, Yves, all,

One piece of information about the "proposed recommendation": we don't have to delay it. The topic that we are discussing does not influence implementations of ITS 2.0. As said in this thread, it is rather a best practice for producing machine translation confidence information and for working with annotatorsRef. So this won't influence any of the conformance testing relevant to the Proposed Recommendation, see
http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20-implementation-report.html#MTConfidenceconformance-overview

As for the discussion about MT confidence, one comment on the design of mtConfidence we got was from Microsoft:
http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Aug/0040.html

Citing the relevant part here:

[
-mtQuality
--mtConfidence
---mtProducer [string identifying producer Bing, DCU-Matrex etc.]
----mtEngine [string identifying the engine on one of the above platforms, can be potentially quite structured, pair domain etc.]
-----mtConfidenceScore [0-100% or interval 0-1]
]

To me this looks like another example of mtConfidence bound to the producer / engine. Also, the original requirement
http://www.w3.org/TR/2012/WD-its2req-20120524/#mtConfidence
"used by MT systems to indicate their confidence in the provided translation"
sounds like a restriction to self-reporting.

On 20.07.13 22:37, Yves Savourel wrote:
> Hi Dave, all,
>
> If MT Confidence has been designed to hold only a self-reported score, then maybe it should stay that way. I just didn't know the reasoning behind the origin of the data category. But IMO it becomes a data category that is going to be very rarely used, except for research tools; production tools rarely have access to such measurements as far as I see. But maybe it's a question of time.
>
> This said, in the case of QuEst, while I may be wrong, my understanding is that the type of score you get is very comparable to a self-reported confidence.
> You will note that I didn't ask to change the meaning of what MT Confidence is reporting, only that we didn't restrict the tool that generates that score to the MT system itself.

Currently we say in the definition of mtConfidence: "It is not intended to provide a score that is comparable between machine translation engines and platforms." It seems that Yves' proposal would provide a path towards having that comparability. But given this thread's emphasis on the "researchy state" of MT confidence information, and given the current definition, I am not sure whether we want to create such expectations. On the other hand, in the past we restricted ourselves in other areas and even had to do a re-chartering for that (remember RDFa). So I am not sure what the best solution here would be.

Best,
Felix

> The other option would be to use LQI/non-conformance? But I have to say that despite the description that sort of backs up that notion, the type name and the data category sound rather off to an end-user like me: Localization Quality *Issues* are about reporting problems, and I would imagine a (non-)conformance type is about aggregating data and types of errors to come up with an overall score that is more a composite measurement than something close to an MT Confidence.
>
> Would Localization Quality Rating be better? It is a rating of the quality of the translation with a rather vague definition.
>
> Cheers,
> -yves
>
>
> -----Original Message-----
> From: Dave Lewis [mailto:dave.lewis@cs.tcd.ie]
> Sent: Friday, July 19, 2013 7:39 PM
> To: Yves Savourel; public-multilingualweb-lt@w3.org
> Subject: Re: MT Confidence definition [ACTION-556]
>
> Hi all,
> I managed to talk to Declan Groves about this yesterday. His view was that the original use case was to enable the confidence score that all statistical MT engines already generate in selecting the final output to be propagated in an open way. So using another method is some change (a broadening) of the use case.
>
> He also saw the danger of confusion among users/implementers if something labelled as a 'confidence score' (which has a certain meaning in NLP circles) might be used to convey quality estimation (QE), which, depending on how it is done, has a different sort of significance.
>
> We did discuss the option of mtConfidence being used to convey the output of an automated score (e.g. BLEU) that had been integrated into an MT engine. This would be reasonable in use cases where MT engines are being dynamically retrained, but would require relaxing the wording.
>
> I also asked questions of some QE researchers in CNGL and got some interesting clarifications. Certainly QE is being used to provide scores of MT output (I was mistaken about that on the call), often trained on some human annotation collected on the quality of previous translations correlated to the current translation and perhaps other metadata (including the self-reported confidence scores) from the MT engine.
> Certainly there are also occasions where QE operates in a very similar fashion to that intended for non-conformance in LQI, so I think that remains an option also.
>
> So, Yves, you are right that the current definition is limiting with respect to other possible 'scores' representing confidence in the translation being a 'good' one, beyond just the MT-engine-generated scores.
>
> At the same time I have the impression that the technologies for this are still emerging from the lab and don't have the benefit of the widely used common platforms and industrial experience that SMT does. Overall this makes it difficult to make any hard and fast statements about what should and should not be used to generate mtConfidence scores right now.
>
> So softening that limitation as Yves suggests may be useful in accommodating innovations in this area, but may also open the door to some confusion by users that may impact negatively on the business benefits of interoperation, e.g.
> a translation client gets a score that they think has a certain significance when in fact it has another.
>
> So, if we were to make the changes suggested by Yves, we should accompany them with some best practice work to suggest how the annotatorsRef value could be used to inform on the particular method used to generate the mtConfidence score, including some classification encodings, explanations of the different methods, and the significance that can be placed on the resulting scores in different situations. My general feeling, perhaps incorrect, is that the current IG membership probably doesn't have the breadth of expertise to provide this best practice. Arle, could this be something that QT-Launchpad could take on?
>
> To sum up:
> 1) The text proposed by Yves may relax the limits on what can produce an mtConfidence score in a useful way by accommodating different techniques, but also has the potential to cause confusion about the significance of scores produced by different methods. Some of these could anyway be conveyed via non-conformance in LQI, but not all.
>
> 2) It seems very difficult to formulate wording that would constrain the range of methods in any usable way between the current text and what Yves suggests. So let's restrict ourselves to these two options.
>
> 3) If we relax the wording as Yves suggests, expertise would be needed to form best practice on the use of the annotatorsRef value to provide a way of classifying the different scoring methods in a way that's useful for users.
>
> Apologies for the long email, but unfortunately I could not find any clear pointers one way or another. Personally, I'm more neutral on the proposal.
> But I also don't know whether we could categorize this as a minor clarification or not.
>
> Please voice your views on the list, and let's try to get consensus before the call next week. Note I'm not available for the call, and I think Felix is away also.
>
> But we need to form a consensus quickly if we are to avoid delaying the PR stage further.
>
> Regards,
> Dave
>
>
> On 17/07/2013 11:35, Yves Savourel wrote:
>> Hi Dave,
>>
>> In the case of QuEst, for the scenario I have in mind, one would for example perform the MT part with MS Hub, then pass that information to QuEst and get back a score that indicates a level of confidence for that translation candidate. So that's a step after MT and before any human looks at it.
>> I may be wrong, but "MT Confidence" seems to be a good place to put that information.
>>
>> Even if QuEst is a wrong example, having MT Confidence restricted to a *self-reported* value seems very limiting. But maybe I'm misinterpreting the initial aim of the data category.
>>
>> Cheers,
>> -ys
>>
>>
>> -----Original Message-----
>> From: Dave Lewis [mailto:dave.lewis@cs.tcd.ie]
>> Sent: Wednesday, July 17, 2013 12:25 PM
>> To: public-multilingualweb-lt@w3.org
>> Subject: Re: MT Confidence definition
>>
>> Hi Yves,
>> I don't necessarily agree with this, based on the example you give in relation to quality estimation in QuEst.
>>
>> Is not the goal of quality estimation to predict the quality of a translation of a given source string for a given MT engine, training corpora, and training regime _prior_ to actually performing the translation?
>> In which case it would be an annotation not of a translation but of a _source_, with reference to an existing or planned MT engine (which, as you rightly say in response to Sergey, can be resolved via the annotatorsRef).
>> So while the basic data structure of mtConfidence would work for the use case, the name and wording don't, I think, match the use of MT QE.
>>
>> Declan, Ankit, could you comment? I'm not really an expert here, and not up to speed on the different applications of MT QE.
>>
>> cheers,
>> Dave
>>
>>
>> On 17/07/2013 08:29, Yves Savourel wrote:
>>> Hi all,
>>>
>>> I've noticed a minor text issue in the specification:
>>>
>>> For the MT Confidence data category we say:
>>>
>>> "The MT Confidence data category is used to communicate the self-reported confidence score from a machine translation engine of the accuracy of a translation it has provided."
>>>
>>> This is very limiting.
>>>
>>> I think it should say:
>>>
>>> "The MT Confidence data category is used to communicate the confidence score of the accuracy of a translation provided by a machine translation."
>>>
>>> (And later: "the self-reported confidence score" should be "the reported confidence score".)
>>>
>>> There could be cases where the confidence score is provided by another system than the one that provided the MT candidate. The QuEst project is an example of this:
>>> http://staffwww.dcs.shef.ac.uk/people/L.Specia/projects/quest.html
>>>
>>> Cheers,
>>> -ys
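For readers following the annotatorsRef discussion above, a minimal sketch of how such an annotation might look in ITS 2.0 markup. The tool IRI and the score value here are invented for illustration; the point is that annotatorsRef names the tool that produced the score, which is where a best practice distinguishing self-reported engine scores from external QE scores could hook in:

```xml
<!-- Illustrative sketch only: the tool IRI and score are invented.
     annotatorsRef pairs the data category name "mt-confidence" with
     the IRI of the annotating tool, so a consumer can tell whether
     the score came from the MT engine itself or from a separate
     quality-estimation step. -->
<text xmlns:its="http://www.w3.org/2005/11/its" its:version="2.0"
      its:annotatorsRef="mt-confidence|http://example.com/QuEstService">
  <p its:mtConfidence="0.78">This sentence was machine translated.</p>
</text>
```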
Received on Sunday, 21 July 2013 12:11:13 UTC