- From: Declan Groves <dgroves@computing.dcu.ie>
- Date: Tue, 23 Jul 2013 10:23:59 +0100
- To: Felix Sasaki <fsasaki@w3.org>
- Cc: public-multilingualweb-lt@w3.org, Dave Lewis <dave.lewis@cs.tcd.ie>, Yves Savourel <ysavourel@enlaso.com>
- Message-ID: <CAOi_1PbVXDV6++yONoAMfPe3b+=zMRCRiAt2YGfCqKxmhOWUMw@mail.gmail.com>
Hi Felix,

It looks good to me too.

Small typo: "MT confidence scores can be displayed ... by simple web-based translation editors or *by* Computer Aided Translation (CAT) tools"

Declan

On 23 July 2013 05:06, Yves Savourel <ysavourel@enlaso.com> wrote:

> Hi Felix,
>
> Looks fine to me.
>
> Typo: "...the score on it's own is..." should be "...the score on its own is..."
>
> -ys
>
> From: Felix Sasaki [mailto:fsasaki@w3.org]
> Sent: Tuesday, July 23, 2013 2:42 AM
> To: Declan Groves
> Cc: public-multilingualweb-lt@w3.org; Yves Savourel; Dave Lewis
> Subject: Re: MT Confidence definition [ACTION-556]
>
> Hi Declan, all,
>
> I tried to implement this in section
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#mtconfidence-definition
> and explain it with a dedicated note
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#mt-confidence-score-generation-tools
> This should resolve the "we need to explain this" part. Can you have a look before the Wednesday call? With regard to examples, I propose to wait until after the Proposed Recommendation - since these are just examples. If people want to wait with this change until after the PR publication, to have more time to review it, I am fine with that too - please let me know and I will revert it.
>
> Best,
>
> Felix
>
> On 22.07.13 19:32, Declan Groves wrote:
>
> Hi all,
>
> I think Yves makes a good point.
>
> In my view, on reviewing the discussions, as it stands MT Confidence can be used to represent two different types of "confidence" scores. They are very closely related, but still quite different.
>
> It is worth remembering that the original motivation behind the MT Confidence category is to provide an automatically generated value which offers some information on the perceived quality of a translation produced by an MT engine. This value can then be used in subsequent processes, e.g. during post-editing, during additional, more sophisticated quality estimation processes, etc.
>
> 1. The quality score of the translation as produced by an MT engine (for the most part, this type of score is usually only produced by statistically-based engines and usually equates to the probability of that translation, given the specific models used by the engine).
> 2. The quality estimation score (such as provided by the QuEst tool or by some additional process).
>
> Both are dependent on the MT engine. The first is produced directly by the MT engine. The second uses both MT-system-internal features (including features extracted from internal MT translation and language models, as well as the final translation probability produced by the MT engine) and additional external features. This is the reason why MT Confidence needs to additionally provide information about the engine (and perhaps, in the case of #2, any additional tools that were used in deriving the MT confidence); otherwise the number on its own is hard to interpret and to reuse.
>
> Based on this, I think we can therefore safely remove the self-referential part of the description of MT Confidence to allow it to be used to capture both #1 and #2 above, but, following Dave's point, we would need to clarify it with examples of best practices for both instances to make it clear for implementers.
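To make the two instances concrete, here is a minimal markup sketch using the ITS 2.0 local attributes (its-mt-confidence plus an its-annotators-ref entry for the mt-confidence data category); the tool IRIs are invented placeholders, and nothing beyond the annotator reference distinguishes the two cases:

```html
<!-- Sketch only: the tool IRIs are hypothetical. -->
<body>
  <!-- Instance #1: score self-reported by the MT engine that produced the translation -->
  <p its-annotators-ref="mt-confidence|http://example.com/smt-engine"
     its-mt-confidence="0.8982">Dublin is the capital of Ireland.</p>

  <!-- Instance #2: score produced afterwards by an external quality estimation tool -->
  <p its-annotators-ref="mt-confidence|http://example.com/qe-tool"
     its-mt-confidence="0.7214">Dublin is the capital city of Ireland.</p>
</body>
```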
> It is not the intention of the category to define how the score is calculated, so I also think it's a good idea to use annotatorsRef to provide further details on the tools and methods used to generate the MT Confidence, if required.
>
> Declan
>
> On 21 July 2013 15:26, Jörg Schütz <joerg@bioloom.de> wrote:
>
> Hi Yves, Dave, and all,
>
> As of yet, the definition of MT Confidence restricts its use case to a score internally generated by the employed MT engine. If we allowed for the specification of the scoring tool, then this data category could easily be extended to cover a score generated by an external tool, for example the QuEst application for Moses-based MT engines. Probably, such an extension would need further information elements, like the models and data that have been used in the scoring process.
>
> IMO LQI/non-conformance would be less appropriate for a "confidence" measure, given that the list of possible "quality issues" is more linguistically oriented. Even if we aggregated the different result types with a certain weighting (penalty), what we would get is an approximated quality rating - which we have with LQR (on the document level) - but not a confidence measure in the above sense.
>
> This is an interesting and forward-looking discussion which we should continue for future versions of ITS.
>
> Cheers -- Jörg
>
> On July 20, 2013 at 22:37 (CEST), Yves Savourel wrote:
>
> Hi Dave, all,
>
> If MT Confidence has been designed to hold only a self-reported score, then maybe it should stay that way. I just didn't know the reasoning behind the origin of the data category. But IMO it then becomes a data category that is going to be very rarely used, except by research tools; production tools rarely have access to such measurements, as far as I can see. But maybe it's a question of time.
>
> This said, in the case of QuEst, while I may be wrong, my understanding is that the type of score you get is very comparable to a self-reported confidence. You will note that I didn't ask to change the meaning of what MT Confidence is reporting, only that we not restrict the tool that generates that score to the MT system itself.
>
> The other option would be to use LQI/non-conformance? But I have to say that despite the description, which sort of backs up that notion, the type name and the data category sound rather off to an end-user like me: Localization Quality *Issue* is about reporting problems, and I would imagine a (non-)conformance type is about aggregating data and types of errors to come up with an overall score that is more a composite measurement than something close to an MT Confidence.
>
> Would Localization Quality Rating be better? It is a rating of the quality of the translation, with a rather vague definition.
>
> Cheers,
> -yves
>
> -----Original Message-----
> From: Dave Lewis [mailto:dave.lewis@cs.tcd.ie]
> Sent: Friday, July 19, 2013 7:39 PM
> To: Yves Savourel; public-multilingualweb-lt@w3.org
> Subject: Re: MT Confidence definition [ACTION-556]
>
> Hi all,
>
> I managed to talk to Declan Groves about this yesterday. His view was that the original use case was to enable the confidence score that all statistical MT engines already generate in selecting the final output to be propagated in an open way. So using other methods is some change (a broadening) of the use case.
> He also saw the danger of confusion among users/implementers if something labelled as a 'confidence score' (which has a certain meaning in NLP circles) might be used to convey quality estimation (QE), which, depending on how it's done, has a different sort of significance.
>
> We did discuss the option of mtconfidence being used to convey the output of an automated score (e.g. BLEU) that had been integrated into an MT engine. This would be reasonable in use cases where MT engines are being dynamically retrained, but would require relaxing the wording.
>
> I also asked questions of some QE researchers in CNGL and got some interesting clarifications. Certainly QE is being used to provide scores of MT output (I was mistaken about that on the call), often trained on human annotations of the quality of previous translations, correlated with the current translation and perhaps other metadata (including the self-reported confidence scores) from the MT engine. Certainly there are also occasions where QE operates in a very similar fashion to that intended for non-conformance in LQI, so I think that remains an option also.
>
> So, Yves, you are right that the current definition is limiting with respect to other possible 'scores' representing a confidence in the translation being a 'good' one, beyond just the MT-engine-generated scores.
>
> At the same time, I have the impression that the technologies for this are still emerging from the lab and don't have the benefit of the widely used common platforms and industrial experience that SMT does. Overall this makes it difficult to make any hard and fast statements about what should and should not be used to generate MT Confidence scores right now.
>
> So softening that limitation as Yves suggests may be useful in accommodating innovations in this area, but it may also open the door to some confusion among users that may impact negatively on the business benefits of interoperation, e.g. a translation client gets a score that they think has a certain significance when in fact it has another.
>
> So, if we were to make the changes suggested by Yves, we should accompany them with some best practice work to suggest how the annotatorsRef value could be used to inform on the particular method used to generate the mtconfidence score, including some classification encodings, explanations of the different methods, and the significance that can be placed on the resulting scores in different situations. My general feeling, perhaps incorrect, is that the current IG membership probably doesn't have the breadth of expertise to provide this best practice. Arle, could this be something that QT-Launchpad could take on?
>
> To sum up:
>
> 1) The text proposed by Yves may relax the limits on what can produce an mtconfidence score in a useful way, by accommodating different techniques, but it also has the potential to cause confusion about the significance of scores produced by different methods. Some of these could anyway be conveyed via non-conformance in LQI, but not all.
>
> 2) It seems very difficult to formulate wording that would constrain the range of methods in any usable way between the current text and what Yves suggests. So let's restrict ourselves to these two options.
>
> 3) If we relax the wording as Yves suggests, expertise would be needed to form best practice on the use of the annotatorsRef value, to provide a way of classifying the different scoring methods that's useful for users.
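One conceivable shape for such a best practice - purely illustrative, since neither the specification nor this thread defines any such classification - would be to encode the scoring method in the annotatorsRef IRI itself, e.g. as a fragment identifier that consumers could use to decide how much significance to attach to a score:

```html
<!-- Hypothetical convention: the #self-reported and #qe-estimate fragments
     are invented for illustration; ITS 2.0 attaches no meaning to them. -->
<p its-annotators-ref="mt-confidence|http://example.com/tools/moses#self-reported"
   its-mt-confidence="0.8982">Dublin is the capital of Ireland.</p>

<p its-annotators-ref="mt-confidence|http://example.com/tools/quest#qe-estimate"
   its-mt-confidence="0.6500">Dublin is the capital city of Ireland.</p>
```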
> Apologies for the long email, but unfortunately I couldn't find any clear pointers one way or another. Personally, I'm more neutral on the proposal. But I also don't know whether we could categorize this as a minor clarification or not.
>
> Please voice your views on the list, and let's try and get consensus before the call next week. Note I'm not available for the call, and I think Felix is away also.
>
> But we need to form a consensus quickly if we are to avoid delaying the PR stage further.
>
> Regards,
> Dave
>
> On 17/07/2013 11:35, Yves Savourel wrote:
>
> Hi Dave,
>
> In the case of QuEst, for the scenario I have in mind, one would for example perform the MT part with MS Hub, then pass that information to QuEst and get back a score that indicates a level of confidence for that translation candidate. So that's a step after MT and before any human looks at it.
>
> I may be wrong, but "MT Confidence" seems to be a good place to put that information.
>
> Even if QuEst is a wrong example, having MT Confidence restricted to a *self-reported* value seems very limiting. But maybe I'm misinterpreting the initial aim of the data category.
>
> Cheers,
> -ys
>
> -----Original Message-----
> From: Dave Lewis [mailto:dave.lewis@cs.tcd.ie]
> Sent: Wednesday, July 17, 2013 12:25 PM
> To: public-multilingualweb-lt@w3.org
> Subject: Re: MT Confidence definition
>
> Hi Yves,
>
> I don't necessarily agree with this, based on the example you give in relation to quality estimation in QuEst.
>
> Is not the goal of quality estimation to predict the quality of a translation of a given source string, for a given MT engine, training corpora and training regime, _prior_ to actually performing the translation?
>
> In which case it would be an annotation not of a translation but of a _source_, with reference to an existing or planned MT engine (which, as you rightly say in response to Sergey, can be resolved via the annotatorsRef).
>
> So while the basic data structure of mtConfidence would work for the use case, the name and wording don't, I think, match the use of MT QE.
>
> Declan, Ankit, could you comment? I'm not really an expert here, and not up to speed on the different applications of MT QE.
>
> cheers,
> Dave
>
> On 17/07/2013 08:29, Yves Savourel wrote:
>
> Hi all,
>
> I've noticed a minor text issue in the specification.
>
> For the MT Confidence data category we say:
>
> "The MT Confidence data category is used to communicate the self-reported confidence score from a machine translation engine of the accuracy of a translation it has provided."
>
> This is very limiting.
>
> I think it should say:
>
> "The MT Confidence data category is used to communicate the confidence score of the accuracy of a translation provided by a machine translation."
>
> (And later, "the self-reported confidence score" should be "the reported confidence score".)
>
> There could be cases where the confidence score is provided by a system other than the one that provided the MT candidate. The QuEst project is an example of this:
> http://staffwww.dcs.shef.ac.uk/people/L.Specia/projects/quest.html
>
> Cheers,
> -ys
>
> --
> Dr. Declan Groves
> Applied Research and Development Coordinator
> Centre for Next Generation Localisation (CNGL)
> Dublin City University
>
> email: dgroves@computing.dcu.ie
> phone: +353 (0)1 700 6906

--
Dr. Declan Groves
Applied Research and Development Coordinator
Centre for Next Generation Localisation (CNGL)
Dublin City University

email: dgroves@computing.dcu.ie
phone: +353 (0)1 700 6906
Received on Tuesday, 23 July 2013 09:24:29 UTC