- From: Felix Sasaki <fsasaki@w3.org>
- Date: Sun, 21 Jul 2013 14:10:45 +0200
- To: Yves Savourel <ysavourel@enlaso.com>
- CC: 'Dave Lewis' <dave.lewis@cs.tcd.ie>, public-multilingualweb-lt@w3.org
- Message-ID: <51EBCFC5.6070903@w3.org>
Hi Dave, Yves, all,

One piece of information about the "proposed recommendation": we don't have to delay it. The topic that we are discussing does not influence implementations of ITS 2.0. As said in this thread, it is rather a best practice for producing machine translation confidence information and for working with annotatorsRef. So this won't influence any of the conformance testing relevant to the Proposed Recommendation, see
http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20-implementation-report.html#MTConfidenceconformance-overview

As for the discussion about MT confidence, one comment on the design of mtConfidence we got was from Microsoft:
http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Aug/0040.html

Citing the relevant part here:

[
-mtQuality
--mtConfidence
---mtProducer [string identifying producer Bing, DCU-Matrex etc.]
----mtEngine [string identifying the engine on one of the above platforms, can be potentially quite structured, pair domain etc.]
-----mtConfidenceScore [0-100% or interval 0-1]
]

To me this looks like another example of mtConfidence bound to the producer / engine. Also, the original requirement
http://www.w3.org/TR/2012/WD-its2req-20120524/#mtConfidence
"used by MT systems to indicate their confidence in the provided translation"
sounds like a restriction to self-reporting.

On 20.07.13 22:37, Yves Savourel wrote:
> Hi Dave, all,
>
> If MT Confidence has been designed to hold only a self-reported score, then maybe it should stay that way. I just didn't know the reasoning behind the origin of the data category. But IMO it becomes a data category that is going to be very rarely used, except for research tools; production tools rarely have access to such measurements as far as I see. But maybe it's a question of time.
>
> This said, in the case of QuEst, while I may be wrong, my understanding is that the type of score you get is very comparable to a self-reported confidence.
> You will note that I didn't ask to change the meaning of what MT Confidence is reporting, only that we didn't restrict the tool that generates that score to the MT system itself.

Currently we say in the definition of mtConfidence: "It is not intended to provide a score that is comparable between machine translation engines and platforms." It seems that Yves' proposal would provide a path towards having that comparability. But given this thread's emphasis on the "researchy state" of MT confidence information, and given the current definition, I am not sure whether we want to create such expectations. On the other hand, in the past we restricted ourselves in other areas and even had to do a re-chartering for that (remember RDFa). So I am not sure what the best solution here would be.

Best,
Felix

> The other option would be to use LQI/non-conformance? But I have to say that despite the description that sort of backs up that notion, the type name and the data category sound rather off to an end-user like me: Localization Quality *Issues* are about reporting problems, and I would imagine a (non-)conformance type is about aggregating data and types of errors to come up with an overall score that is more a composite measurement than something close to an MT Confidence.
>
> Would Localization Quality Rating be better? It is a rating of the quality of the translation with a rather vague definition.
>
> Cheers,
> -yves
>
>
> -----Original Message-----
> From: Dave Lewis [mailto:dave.lewis@cs.tcd.ie]
> Sent: Friday, July 19, 2013 7:39 PM
> To: Yves Savourel; public-multilingualweb-lt@w3.org
> Subject: Re: MT Confidence definition [ACTION-556]
>
> Hi all,
> I managed to talk to Declan Groves about this yesterday. His view was that the original use case was to enable the confidence score that all statistical MT engines already generate in selecting the final output to be propagated in an open way. So using another method is some change (a broadening) of the use case.
>
> He also saw the danger of confusion among users/implementers if something labelled as a 'confidence score' (which has a certain meaning in NLP circles) might be used to convey quality estimation (QE), which, depending on how it is done, has a different sort of significance.
>
> We did discuss the option of mtConfidence being used to convey the output of an automated score (e.g. BLEU) that had been integrated into an MT engine. This would be reasonable in use cases where MT engines are being dynamically retrained, but would require relaxing the wording.
>
> I also asked questions of some QE researchers in CNGL and got some interesting clarifications. Certainly QE is being used to provide scores of MT output (I was mistaken about that on the call), often trained on some human annotation collected on the quality of previous translations correlated to the current translation and perhaps other metadata (including the self-reported confidence scores) from the MT engine.
> Certainly there are also occasions where QE operates in a very similar fashion to that intended for non-conformance in LQI, so I think that remains an option also.
>
> So, Yves, you are right that the current definition is limiting with respect to other possible 'scores' representing confidence in the translation being a 'good' one, beyond just the MT-engine-generated scores.
>
> At the same time I have the impression that the technologies for this are still emerging from the lab and don't have the benefit of the widely used common platforms and industrial experience that SMT does. Overall this makes it difficult to make any hard and fast statements about what should and should not be used to generate mtConfidence scores right now.
>
> So softening that limitation as Yves suggests may be useful in accommodating innovations in this area, but may also open the door to some confusion by users that may impact negatively on the business benefits of interoperation, e.g.
> a translation client gets a score that they think has a certain significance when in fact it has another.
>
> So, if we were to make the changes suggested by Yves, we should accompany them with some best practice work to suggest how the annotatorsRef value could be used to inform on the particular method used to generate the mtConfidence score, including some classification encodings, explanations of the different methods, and the significance that can be placed on the resulting scores in different situations. My general feeling, perhaps incorrect, is that the current IG membership probably doesn't have the breadth of expertise to provide this best practice. Arle, could this be something that QT-Launchpad could take on?
>
> To sum up:
> 1) The text proposed by Yves may relax the limits on what can produce an mtConfidence score in a useful way by accommodating different techniques, but also has the potential to cause confusion about the significance of scores produced by different methods. Some of these could anyway be conveyed via non-conformance in LQI, but not all.
>
> 2) It seems very difficult to formulate wording that would constrain the range of methods in any usable way between the current text and what Yves suggests. So let's restrict ourselves to these two options.
>
> 3) If we relax the wording as Yves suggests, expertise would be needed to form best practice on the use of the annotatorsRef value to provide a way of classifying the different scoring methods in a way that's useful for users.
>
> Apologies for the long email, but unfortunately I could not find any clear pointers one way or another. Personally, I'm more neutral on the proposal.
> But I also don't know whether we could categorize this as a minor clarification or not.
>
> Please voice your views on the list, and let's try to get consensus before the call next week. Note I'm not available for the call, and I think Felix is away also.
>
> But we need to form a consensus quickly if we are to avoid delaying the PR stage further.
>
> Regards,
> Dave
>
>
> On 17/07/2013 11:35, Yves Savourel wrote:
>> Hi Dave,
>>
>> In the case of QuEst, for the scenario I have in mind, one would for example perform the MT part with MS Hub, then pass that information to QuEst and get back a score that indicates a level of confidence for that translation candidate. So that's a step after MT and before any human looks at it.
>> I may be wrong, but "MT Confidence" seems to be a good place to put that information.
>>
>> Even if QuEst is a wrong example, having MT Confidence restricted to a *self-reported* value seems very limiting. But maybe I'm misinterpreting the initial aim of the data category.
>>
>> Cheers,
>> -ys
>>
>>
>> -----Original Message-----
>> From: Dave Lewis [mailto:dave.lewis@cs.tcd.ie]
>> Sent: Wednesday, July 17, 2013 12:25 PM
>> To: public-multilingualweb-lt@w3.org
>> Subject: Re: MT Confidence definition
>>
>> Hi Yves,
>> I don't necessarily agree with this, based on the example you give in relation to quality estimation in QuEst.
>>
>> Is not the goal of quality estimation to predict the quality of a translation of a given source string for a given MT engine, training corpora, and training regime _prior_ to actually performing the translation?
>> In which case it would be an annotation not of a translation but of a _source_, with reference to an existing or planned MT engine (which, as you rightly say in response to Sergey, can be resolved via the annotatorsRef).
>> So while the basic data structure of mtConfidence would work for the use case, the name and wording don't, I think, match the use of MT QE.
>>
>> Declan, Ankit, could you comment? I'm not really an expert here, and not up to speed on the different applications of MT QE.
>>
>> cheers,
>> Dave
>>
>>
>> On 17/07/2013 08:29, Yves Savourel wrote:
>>> Hi all,
>>>
>>> I've noticed a minor text issue in the specification:
>>>
>>> For the MT Confidence data category we say:
>>>
>>> "The MT Confidence data category is used to communicate the self-reported confidence score from a machine translation engine of the accuracy of a translation it has provided."
>>>
>>> This is very limiting.
>>>
>>> I think it should say:
>>>
>>> "The MT Confidence data category is used to communicate the confidence score of the accuracy of a translation provided by a machine translation."
>>>
>>> (And later: "the self-reported confidence score" should be "the reported confidence score".)
>>>
>>> There could be cases where the confidence score is provided by another system than the one that provided the MT candidate. The QuEst project is an example of this:
>>> http://staffwww.dcs.shef.ac.uk/people/L.Specia/projects/quest.html
>>>
>>> Cheers,
>>> -ys
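For readers following the annotatorsRef discussion above, a minimal sketch of how such an annotation might look in ITS 2.0 markup. The tool IRI and the score value here are invented for illustration; the point is that annotatorsRef names the tool that produced the score, which is where a best practice distinguishing self-reported engine scores from external QE scores could hook in:

```xml
<!-- Illustrative sketch only: the tool IRI and score are invented.
     annotatorsRef pairs the data category name "mt-confidence" with
     the IRI of the annotating tool, so a consumer can tell whether
     the score came from the MT engine itself or from a separate
     quality-estimation step. -->
<text xmlns:its="http://www.w3.org/2005/11/its" its:version="2.0"
      its:annotatorsRef="mt-confidence|http://example.com/QuEstService">
  <p its:mtConfidence="0.78">This sentence was machine translated.</p>
</text>
```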
Received on Sunday, 21 July 2013 12:11:13 UTC