- From: Felix Sasaki <fsasaki@w3.org>
- Date: Tue, 23 Jul 2013 02:41:53 +0200
- To: Declan Groves <dgroves@computing.dcu.ie>
- CC: public-multilingualweb-lt@w3.org, Yves Savourel <ysavourel@enlaso.com>, Dave Lewis <dave.lewis@cs.tcd.ie>
- Message-ID: <51EDD151.8030102@w3.org>
Hi Declan, all,

I tried to implement this in section
http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#mtconfidence-definition
and explain it with a dedicated note
http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#mt-confidence-score-generation-tools

This should resolve the "we need to explain this" part. Can you have a look before the Wednesday call?

With regard to examples, I propose to wait until after the Proposed Recommendation, since these are just examples.

If people would rather hold this change until after the PR publication to have more time to review it, I am fine with that too; please let me know and I will revert it.

Best,

Felix

On 22.07.13 19:32, Declan Groves wrote:
> Hi all,
>
> I think Yves makes a good point.
>
> In my view, on reviewing the discussions, as it stands MT Confidence can be used to represent two different types of "confidence" scores. They are very closely related, but still quite different.
>
> It is worth remembering that the original motivation behind the MT Confidence category is to provide an automatically generated value which offers some information on the perceived quality of a translation produced by an MT engine. This value can then be used in subsequent processes, e.g. during post-editing or during additional, more sophisticated quality estimation processes.
>
> 1. The quality score of the translation as produced by an MT engine (for the most part this type of score is only produced by statistical engines and usually equates to the probability of that translation, given the specific models used by the engine).
> 2. The quality estimation score (such as provided by the QuEst tool or by some additional process).
>
> Both are dependent on the MT engine. The first is produced directly by the MT engine.
> The second uses both MT-system-internal features (including features extracted from internal MT translation and language models as well as the final translation probability as produced by the MT engine) and additional external features. This is the reason why MT Confidence needs to additionally provide information about the engine (and perhaps, in the case of #2, any additional tools that were used in deriving the MT confidence); otherwise the number on its own is hard to interpret and to reuse.
>
> Based on this, I think we can safely remove the self-referential part of the description of MT Confidence to allow it to be used to capture both #1 and #2 above but, following Dave's point, we would need to clarify it with examples of best practices for both instances to make it clear for implementers. It is not the intention of the category to define how the score is calculated, so I also think it's a good idea to use annotatorsRef to provide further details on the tools and methods used to generate the MT Confidence, if required.
>
> Declan
>
> On 21 July 2013 15:26, Jörg Schütz <joerg@bioloom.de> wrote:
>
> Hi Yves, Dave, and all,
>
> As of yet, the definition of MT Confidence restricts its use case to a score internally generated by the employed MT engine. If we allowed for the specification of the scoring tool, then this data category could easily be extended to the score generated by an external tool, for example the QuEst application for Moses-based MT engines. Probably, such an extension would need further information elements, like the models and data that have been used in the scoring process.
>
> IMO LQI/non-conformance would be less appropriate for a "confidence" measure given the list of possible "quality issues", which are more linguistically oriented.
> Even if we aggregated the different result types with a certain weighting (penalty), what we would get is an approximated quality rating, which we have with LQR (on the document level), but not a confidence measure in the above sense.
>
> This is an interesting and forward-looking discussion which we should continue for future versions of ITS.
>
> Cheers -- Jörg
>
> On July 20, 2013 at 22:37 (CEST), Yves Savourel wrote:
>
> Hi Dave, all,
>
> If MT Confidence has been designed to hold only a self-reported score, then maybe it should stay that way. I just didn't know the reasoning behind the origin of the data category. But IMO it becomes a data category that is going to be used very rarely, except by research tools; production tools rarely have access to such measurements as far as I can see. But maybe it's a question of time.
>
> This said, in the case of QuEst, while I may be wrong, my understanding is that the type of score you get is very comparable to a self-reported confidence. You will note that I didn't ask to change the meaning of what MT Confidence is reporting, only that we not restrict the tool that generates that score to the MT system itself.
>
> The other option would be to use LQI/non-conformance? But I have to say that, although the description sort of backs up that notion, the type name and the data category sound rather off to an end-user like me: Localization Quality *Issue* is about reporting problems, and I would imagine a (non-)conformance type is about aggregating data and types of errors to come up with an overall score that is more a composite measurement than something close to an MT Confidence.
>
> Would Localization Quality Rating be better? It is a rating of the quality of the translation with a rather vague definition.
>
> Cheers,
> -yves
>
> -----Original Message-----
> From: Dave Lewis [mailto:dave.lewis@cs.tcd.ie]
> Sent: Friday, July 19, 2013 7:39 PM
> To: Yves Savourel; public-multilingualweb-lt@w3.org
> Subject: Re: MT Confidence definition [ACTION-556]
>
> Hi all
> I managed to talk to Declan Groves about this yesterday. His view was that the original use case was to enable the confidence score that all statistical MT engines already generate in selecting the final output to be propagated in an open way. So using another method is some change (a broadening) of the use case.
>
> He also saw the danger of confusion among users/implementors if something labelled as a 'confidence score' (which has a certain meaning in NLP circles) might be used to convey quality estimation (QE), which, depending on how it's done, has a different sort of significance.
>
> We did discuss the option of mtconfidence being used to convey the output of an automated score (e.g. BLEU) that had been integrated into an MT engine. This would be reasonable in use cases where MT engines are being dynamically retrained, but would require relaxing the wording.
>
> I also asked questions of some QE researchers in CNGL and got some interesting clarifications. Certainly QE is being used to provide scores of MT output (I was mistaken about that on the call), often trained on some human annotation collected on the quality of previous translations correlated to the current translation and perhaps other metadata (including the self-reported confidence scores) from the MT engine.
> Certainly there are also occasions where QE operates in a very similar fashion to that intended for non-conformance in LQI, so I think that remains an option also.
>
> So, Yves, you are right that the current definition is limiting for other possible 'scores' representing a confidence in the translation being a 'good' one, beyond just the MT-engine-generated scores.
>
> At the same time I have the impression that the technologies for this are still emerging from the lab and don't have the benefit of the widely used common platforms and industrial experience that SMT does. Overall this makes it difficult to make any hard and fast statements about what should and should not be used to generate MtConfidence scores right now.
>
> So softening that limitation as Yves suggests may be useful in accommodating innovations in this area, but may also open the door to some confusion among users that may impact negatively on the business benefits of interoperation, e.g. a translation client gets a score that they think has a certain significance when in fact it has another.
>
> So, if we were to make the changes suggested by Yves, we should accompany them with some best practice work to suggest how the annotatorsRef value could be used to inform on the particular method used to generate the mtconfidence score, including some classification encodings, explanations of the different methods and the significance that can be placed on the resulting scores in different situations. My general feeling, perhaps incorrect, is that the current IG membership probably doesn't have the breadth of expertise to provide this best practice. Arle, could this be something that QT-Launchpad could take on?
>
> To sum up:
> 1) the text proposed by Yves may relax the limits on what can produce an mtconfidence score in a useful way by accommodating different techniques, but also has the potential to cause confusion about the significance of scores produced by different methods. Some of these could anyway be conveyed in the non-conformance type in LQI, but not all.
>
> 2) it seems very difficult to formulate wording that would constrain the range of methods in any usable way between the current text and what Yves suggests. So let's restrict ourselves to these two options.
>
> 3) If we relax the wording as Yves suggests, expertise would be needed to form best practice on the use of the annotatorsRef value to provide a way of classifying the different scoring methods in a way that's useful for users.
>
> Apologies for the long email, but unfortunately I couldn't find any clear pointers one way or another. Personally, I'm more neutral on the proposal.
> But also I don't know if we could categorize this as a minor clarification or not either.
>
> Please voice your views on the list, and let's try to get consensus before the call next week. Note I'm not available for the call and I think Felix is away also.
>
> But we need to form a consensus quickly if we are to avoid delaying the PR stage further.
>
> Regards,
> Dave
>
> On 17/07/2013 11:35, Yves Savourel wrote:
>
> Hi Dave,
>
> In the case of QuEst, for the scenario I have in mind, one would for example perform the MT part with MS Hub, then pass that information to QuEst and get back a score that indicates a level of confidence for that translation candidate. So that's a step after MT and before any human looks at it.
>
> I may be wrong, but "MT Confidence" seems to be a good place to put that information.
>
> Even if QuEst is a wrong example, having MT Confidence restricted to a *self-reported* value seems very limiting. But maybe I'm misinterpreting the initial aim of the data category.
>
> Cheers,
> -ys
>
> -----Original Message-----
> From: Dave Lewis [mailto:dave.lewis@cs.tcd.ie]
> Sent: Wednesday, July 17, 2013 12:25 PM
> To: public-multilingualweb-lt@w3.org
> Subject: Re: MT Confidence definition
>
> Hi Yves,
> I don't necessarily agree with this based on the example you give in relation to quality estimation in QuEst.
>
> Is not the goal of quality estimation to predict the quality of a translation of a given source string for a given MT engine training corpora and training regime _prior_ to actually performing the translation?
>
> In which case it would not be an annotation of a translation but of a _source_ with reference to an existing or planned MT engine (which you rightly say in response to Sergey can be resolved via the annotatorsRef).
>
> So while the basic data structure of mtConfidence would work for it, the use case, name and wording don't, I think, match the use of MT QE.
>
> Declan, Ankit, could you comment? I'm not really an expert here, and not up to speed on the different applications of MT QE.
>
> cheers,
> Dave
>
> On 17/07/2013 08:29, Yves Savourel wrote:
>
> Hi all,
>
> I've noticed a minor text issue in the specification:
>
> For the MT Confidence data category we say:
>
> "The MT Confidence data category is used to communicate the self-reported confidence score from a machine translation engine of the accuracy of a translation it has provided."
>
> This is very limiting.
>
> I think it should say:
>
> "The MT Confidence data category is used to communicate the confidence score of the accuracy of a translation provided by a machine translation."
>
> (and later: "the self-reported confidence score" should be "the reported confidence score").
>
> There could be cases where the confidence score is provided by another system than the one that provided the MT candidate.
> The QuEst project is an example of this (http://staffwww.dcs.shef.ac.uk/people/L.Specia/projects/quest.html)
>
> Cheers,
> -ys
>
> --
> Dr. Declan Groves
> Applied Research and Development Coordinator
> Centre for Next Generation Localisation (CNGL)
> Dublin City University
>
> email: dgroves@computing.dcu.ie
> phone: +353 (0)1 700 6906
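
As a rough illustration of the annotatorsRef idea discussed in this thread, here is a minimal Python sketch that reads MT Confidence scores from an HTML fragment and pairs each score with the tool named by the annotators reference in scope. The attribute names its-mt-confidence and its-annotators-ref are assumed to follow the ITS 2.0 HTML syntax; the sample markup, the tools.xml#T1 and tools.xml#QE1 identifiers, and all helper names are hypothetical placeholders, and the inheritance handling is simplified rather than a full implementation of the specification.

```python
from html.parser import HTMLParser

# Hypothetical sample: the body names an MT engine (T1) as the score producer;
# the second span overrides it with a separate quality estimation tool (QE1).
SAMPLE = """<!DOCTYPE html>
<html lang="en">
<head><meta charset="utf-8"><title>MT Confidence sketch</title></head>
<body its-annotators-ref="mt-confidence|file:///tools.xml#T1">
<p>
<span its-mt-confidence="0.8982">Dublin is the capital of Ireland.</span>
<span its-mt-confidence="0.8536"
      its-annotators-ref="mt-confidence|file:///tools.xml#QE1">The capital of the Czech Republic is Prague.</span>
</p>
</body>
</html>"""

# Elements without an end tag; they never open a new scope.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def parse_annotators_ref(value):
    """Split 'category|IRI category|IRI' into a {category: IRI} dict."""
    refs = {}
    for pair in value.split():
        category, _, iri = pair.partition("|")
        refs[category] = iri
    return refs


class MtConfidenceCollector(HTMLParser):
    """Collects (score, tool, text) for every element carrying a score."""

    def __init__(self):
        super().__init__()
        self.scope = [{}]   # stack of annotator refs inherited so far
        self.open = []      # per open element: index into results, or None
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        # Start from the inherited refs, then let local values override them.
        refs = dict(self.scope[-1])
        refs.update(parse_annotators_ref(attrs.get("its-annotators-ref") or ""))
        self.scope.append(refs)
        if "its-mt-confidence" in attrs:
            self.results.append({
                "score": float(attrs["its-mt-confidence"]),
                "tool": refs.get("mt-confidence"),
                "text": "",
            })
            self.open.append(len(self.results) - 1)
        else:
            self.open.append(None)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or len(self.scope) == 1:
            return
        self.scope.pop()
        self.open.pop()

    def handle_data(self, data):
        # Text belongs to every scored element that is currently open.
        for index in self.open:
            if index is not None:
                self.results[index]["text"] += data


def collect_mt_confidence(html):
    collector = MtConfidenceCollector()
    collector.feed(html)
    collector.close()
    return collector.results


if __name__ == "__main__":
    for item in collect_mt_confidence(SAMPLE):
        print(item["score"], item["tool"], item["text"])
```

The point of the sketch is only that the score is never read on its own: a consumer looks up the producing tool at the same time, which is what lets it tell an engine-reported probability apart from a score produced by an external quality estimation step.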
Received on Tuesday, 23 July 2013 00:42:26 UTC