Re: MT Confidence definition [ACTION-556]

Hi Declan, all,

I tried to implement this in section
http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#mtconfidence-definition
and explain it with a dedicated note
http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#mt-confidence-score-generation-tools
This should resolve the "we need to explain this" part. Can you have a 
look before the Wednesday call? With regard to examples, I propose to 
wait until after the Proposed Recommendation - since these are just 
examples. If people want to postpone this change until after the PR 
publication to have more time to review it, I am fine with that too - 
please let me know and I will revert it.
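In case it helps the review, here is a minimal sketch of what local MT 
Confidence markup with annotatorsRef looks like; note that the engine 
IRI and the score value below are illustrative placeholders, not taken 
from the draft:

```xml
<!-- Illustrative sketch only: the engine IRI and the score are
     made-up values. -->
<doc xmlns:its="http://www.w3.org/2005/11/its" its:version="2.0"
     its:annotatorsRef="mt-confidence|http://example.com/mt-engine">
  <!-- mtConfidenceScore is a rational value in the interval [0,1];
       the annotatorsRef above identifies the tool that produced it. -->
  <p its:mtConfidenceScore="0.8982">Das ist ein Beispiel.</p>
</doc>
```

The annotatorsRef value is what would let a consumer distinguish an 
engine-internal score from one produced by an external tool such as 
QuEst, which is the point under discussion below.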

Best,

Felix

On 22.07.13 19:32, Declan Groves wrote:
> Hi all,
>
> I think Yves makes a good point.
>
> In my view, having reviewed the discussions, MT Confidence as it 
> stands can be used to represent two different types of "confidence" 
> scores. They are very closely related, but still quite different.
>
> It is worth remembering that the original motivation behind the MT 
> Confidence category is to provide an automatically generated value 
> which offers some information on the perceived quality of a 
> translation produced by an MT engine. This value can then be used in 
> subsequent processes, e.g. during post-editing or during additional, 
> more sophisticated quality estimation.
>
>  1. The quality score of the translation as produced by an MT engine
>     (for the most part this type of score is only produced by
>     statistics-based engines and usually equates to the probability of
>     that translation, given the specific models used by the engine).
>  2. The quality estimation score (such as provided by the QuEst tool
>     or by some additional process).
>
> Both are dependent on the MT engine. The first is produced directly by 
> the MT engine. The second uses both MT-system-internal features 
> (including features extracted from internal MT translation and 
> language models, as well as the final translation probability as 
> produced by the MT engine) and additional external features. This is 
> the reason why MT Confidence needs to additionally provide information 
> about the engine (and perhaps, in the case of #2, any additional tools 
> that were used in deriving the MT confidence); otherwise the number on 
> its own is hard to interpret and to reuse.
>
> Based on this, I think we can safely remove the self-referential part 
> of the description of MT Confidence to allow it to be used to capture 
> both #1 and #2 above, but, following Dave's point, we would need to 
> clarify it with examples of best practices for both instances to make 
> it clear for implementers. It is not the intention of the category to 
> define how the score is calculated, so I also think it's a good idea 
> to use annotatorRef to provide further details on the tools and 
> methods used to generate the MT Confidence, if required.
>
>
> Declan
>
>
>
> On 21 July 2013 15:26, Jörg Schütz <joerg@bioloom.de> wrote:
>
>     Hi Yves, Dave, and all,
>
>     So far, the definition of MT Confidence restricts its use case
>     to a score internally generated by the employed MT engine. If we
>     allowed the scoring tool to be specified, then this data category
>     could easily be extended to cover a score generated by an
>     external tool, for example, the QuEst application for Moses-based
>     MT engines. Probably, such an extension would need further
>     information elements, such as the models and data that have been
>     used in the scoring process.
>
>     IMO LQI/non-conformance would be less appropriate for a
>     "confidence" measure, given that the list of possible "quality
>     issues" is more linguistically oriented. Even if we aggregated
>     the different result types with a certain weighting (penalty),
>     what we would get is an approximated quality rating, which we
>     already have with LQR (on the document level), but not a
>     confidence measure in the above sense.
>
>     This is an interesting and forward-looking discussion which we
>     should continue for future versions of ITS.
>
>     Cheers -- Jörg
>
>
>     On July 20, 2013 at 22:37 (CEST), Yves Savourel wrote:
>
>         Hi Dave, all,
>
>         If MT Confidence has been designed to hold only a self-reported
>         score, then maybe it should stay that way. I just didn't know
>         the reasoning behind the origin of the data category. But IMO
>         it becomes a data category that is going to be used very
>         rarely, except by research tools; production tools rarely have
>         access to such measurements as far as I can see. But maybe
>         it's a question of time.
>
>         This said, in the case of QuEst, while I may be wrong, my
>         understanding is that the type of score you get is very
>         comparable to a self-reported confidence. You will note that I
>         didn't ask to change the meaning of what MT Confidence is
>         reporting, only that we don't restrict the tool that generates
>         that score to the MT system itself.
>
>         The other option would be to use LQI/non-conformance? But I
>         have to say that despite the description, which sort of backs
>         up that notion, the type name and the data category sound
>         rather off to an end-user like me: Localization Quality
>         *Issue* is about reporting problems, and I would imagine a
>         (non-)conformance type is about aggregating data and types of
>         errors to come up with an overall score that is more a
>         composite measurement than something close to an MT
>         Confidence.
>
>         Would Localization Quality Rating be better? It is a rating of
>         the quality of the translation, with a rather vague definition.
>
>         Cheers,
>         -yves
>
>
>         -----Original Message-----
>         From: Dave Lewis [mailto:dave.lewis@cs.tcd.ie]
>         Sent: Friday, July 19, 2013 7:39 PM
>         To: Yves Savourel; public-multilingualweb-lt@w3.org
>         Subject: Re: MT Confidence definition [ACTION-556]
>
>         Hi all
>         I managed to talk to Declan Groves about this yesterday. His
>         view was that the original use case was to enable the
>         confidence score that all statistical MT engines already
>         generate when selecting the final output to be propagated in
>         an open way. So using other methods is something of a change
>         (a broadening) of the use case.
>
>         He also saw the danger of confusion among users/implementors
>         if something labelled as a 'confidence score' (which has a
>         certain meaning in NLP circles) might be used to convey
>         quality estimation (QE), which, depending on how it's done,
>         has a different sort of significance.
>
>         We did discuss the option of mtconfidence being used to convey
>         the output of an automated score (e.g. BLEU) that had been
>         integrated
>         into an MT engine. This would be reasonable in use cases where
>         MT engines are being dynamically retrained, but would require
>         relaxing the wording.
>
>         I also asked questions of some QE researchers in CNGL and got
>         some interesting clarifications. Certainly QE is being used to
>         provide scores of MT output (I was mistaken about that on the
>         call), often trained on human annotations collected on the
>         quality of previous translations correlated to the current
>         translation, and perhaps other metadata (including the
>         self-reported confidence scores) from the MT engine.
>         Certainly there are also occasions where QE operates in a very
>         similar fashion to that intended for non-conformance in  LQI, so I
>         think that remains an option also.
>
>         So, Yves, you are right that the current definition is
>         limiting with respect to other possible 'scores' representing
>         a confidence in the translation being a 'good' one, beyond
>         just the MT-engine-generated scores.
>
>         At the same time I have the impression that the technologies
>         for this are still emerging from the lab and don't have the
>         benefit of
>         widely used common platforms and industrial experience that
>         SMT does. Overall this makes it difficult to make any hard and
>         fast
>         statements about what should and should not be used to
>         generate MtConfidence scores right now.
>
>         So softening that limitation as Yves suggests may be useful in
>         accommodating innovations in this area, but may also open the
>         door to some confusion among users, which may impact
>         negatively on the business benefits of interoperation, e.g. a
>         translation client gets a score that they think has a certain
>         significance when in fact it has another.
>
>         So, if we were to make the changes suggested by Yves, we
>         should accompany them with some best-practice work to suggest
>         how the annotatorRef value could be used to inform on the
>         particular method used to generate the mtconfidence score,
>         including some classification encodings, explanations of the
>         different methods, and the significance that can be placed on
>         the resulting scores in different situations. My general
>         feeling, perhaps incorrect, is that the current IG membership
>         probably doesn't have the breadth of expertise to provide this
>         best practice. Arle, could this be something that QT-Launchpad
>         could take on?
>
>         To sum up:
>         1) the text proposed by Yves may relax the limits on what can
>         produce an mtconfidence score in a useful way by accommodating
>         different techniques, but also has the potential to cause
>         confusion about the significance of scores produced by
>         different methods. Some of these could anyway be conveyed via
>         non-conformance in LQI, but not all.
>
>         2) it seems very difficult to formulate wording that would
>         constrain the range of methods in any usable way between the
>         current text and what Yves suggests. So let's restrict
>         ourselves to these two options.
>
>         3) If we relax the wording as Yves suggests, expertise would
>         be needed to form best practice on the use of the
>         annotatorsRef value to classify the different scoring methods
>         in a way that's useful for users.
>
>         Apologies for the long email, but unfortunately I couldn't
>         find any clear pointers one way or the other. Personally, I'm
>         fairly neutral on the proposal.
>         But I also don't know whether we could categorize this as a
>         minor clarification or not.
>
>         Please voice your views on the list, and let's try to get
>         consensus before the call next week. Note I'm not available
>         for the call, and I think Felix is away also.
>
>         But we need to form a consensus quickly if we are to avoid
>         delaying the PR stage further.
>
>         Regards,
>         Dave
>
>
>         On 17/07/2013 11:35, Yves Savourel wrote:
>
>             Hi Dave,
>
>             In the case of QuEst, for the scenario I have in mind, one
>             would for example perform the MT part with MS Hub, then
>             pass that information to QuEst and get back a score that
>             indicates a level of confidence for that translation
>             candidate. So that's a step after MT and before any human
>             looks at it.
>
>
>             I may be wrong, but "MT Confidence" seems to be a good
>             place to put that information.
>
>             Even if QuEst is the wrong example, having MT Confidence
>             restricted to a *self-reported* value seems very limiting.
>             But maybe I'm misinterpreting the initial aim of the data
>             category.
>
>             Cheers,
>             -ys
>
>             -----Original Message-----
>             From: Dave Lewis [mailto:dave.lewis@cs.tcd.ie]
>             Sent: Wednesday, July 17, 2013 12:25 PM
>             To: public-multilingualweb-lt@w3.org
>             Subject: Re: MT Confidence definition
>
>             Hi Yves,
>             I don't necessarily agree with this, based on the example
>             you give in relation to quality estimation in QuEst.
>
>             Is not the goal of quality estimation to predict the
>             quality of a translation of a given source string, for a
>             given MT engine, training corpus, and training regime,
>             _prior_ to actually performing the translation?
>
>
>             In which case it would be an annotation not of a
>             translation but of a _source_, with reference to an
>             existing or planned MT engine (which, as you rightly say
>             in response to Sergey, can be resolved via the
>             annotatorsRef).
>
>
>             So while the basic data structure of mtConfidence would
>             work for the use case, the name and wording don't, I
>             think, match the use of MT QE.
>
>             Declan, Ankit, could you comment? I'm not really an expert
>             here, and not up to speed on the different applications of
>             MT QE.
>
>             cheers,
>             Dave
>
>
>             On 17/07/2013 08:29, Yves Savourel wrote:
>
>                 Hi all,
>
>                 I've noticed a minor text issue in the specification:
>
>                 For the MT Confidence data category we say:
>
>                 "The MT Confidence data category is used to
>                 communicate the
>                 self-reported confidence score from a machine
>                 translation engine of the accuracy of a translation it
>                 has provided."
>
>                 This is very limiting.
>
>                 I think it should say:
>
>                 "The MT Confidence data category is used to
>                 communicate the confidence score of the accuracy of a
>                 translation provided by a machine translation system."
>
>                 (and later: "the self-reported confidence score"
>                 should be "the reported confidence score").
>
>                 There could be cases where the confidence score is
>                 provided by a system other than the one that provided
>                 the MT candidate. The QuEst project is an example of
>                 this
>                 (http://staffwww.dcs.shef.ac.uk/people/L.Specia/projects/quest.html).
>
>                 Cheers,
>                 -ys
>
>
>
>
>
> -- 
> Dr. Declan Groves
> Applied Research and Development Coordinator
> Centre for Next Generation Localisation (CNGL)
> Dublin City University
>
> email: dgroves@computing.dcu.ie
> phone: +353 (0)1 700 6906

Received on Tuesday, 23 July 2013 00:42:26 UTC