Re: MT Confidence definition [ACTION-556]

Hi Felix,

It looks good to me too.

small typo: "MT confidence scores can be displayed.....by simple web-based
translation editors or *by* Computer Aided Translation (CAT) tools"


Declan


On 23 July 2013 05:06, Yves Savourel <ysavourel@enlaso.com> wrote:

> Hi Felix,
>
> Looks fine to me.
>
> Typo: "...the score on it's own is..." should be "...the score on its own
> is..."
>
> -ys
>
> From: Felix Sasaki [mailto:fsasaki@w3.org]
> Sent: Tuesday, July 23, 2013 2:42 AM
> To: Declan Groves
> Cc: public-multilingualweb-lt@w3.org; Yves Savourel; Dave Lewis
> Subject: Re: MT Confidence definition [ACTION-556]
>
> Hi Declan, all,
>
> I tried to implement this in section
>
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#mtconfidence-definition
> and explain it with a dedicated note
>
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#mt-confidence-score-generation-tools
> This should resolve the "we need to explain this" part. Can you have a
> look before the Wednesday call? With regard to examples, I
> propose to wait until after the Proposed Recommendation, since these
> are just examples. If people want to hold this change
> until after the PR publication to have more time to review it, I am fine with
> that too - please let me know and I will revert it.
>
> Best,
>
> Felix
>
> On 22.07.13 19:32, Declan Groves wrote:
> Hi all,
>
> I think Yves makes a good point.
>
> In my view, on reviewing the discussions, as it stands MT Confidence can
> be used to represent two different types of "confidence"
> scores. They are very closely related, but still quite different.
>
> It is worth remembering that the original motivation behind the MT
> Confidence category is to provide an automatically-generated
> value which offers some information on the perceived quality of a
> translation produced by an MT engine. This value can then be used
> in subsequent processes, e.g. during post-editing or during additional,
> more sophisticated quality estimation processes.
>
> 1. The quality score of the translation as produced by an MT engine (for
> the most part this type of score is only produced by statistical
> engines and usually equates to the probability of that
> translation, given the specific models used by the engine).
> 2. The quality estimation score (such as provided by the QuEst tool or by
> some additional process).
>
> Both are dependent on the MT engine. The first is produced directly by the
> MT engine. The second uses both MT-system-internal
> features (including features extracted from internal MT translation and
> language models as well as the final translation probability
> as produced by the MT engine) and additional external features. This is
> the reason why MT Confidence needs to additionally provide
> information about the engine (and perhaps in the case of #2 any additional
> tools that were used in deriving the MT Confidence);
> otherwise the number on its own is hard to interpret and to reuse.
> Based on this, I think we can safely remove the
> self-referential part of the description of MT Confidence to allow it to be
> used to capture both #1 and #2 above, but, following Dave's point, we
> would need to clarify it with examples of best practices for
> both instances to make it clear for implementers. It is not the intention
> of the category to define how the score is calculated, so
> I also think it's a good idea to use annotatorsRef to provide further
> details on the tools and methods used to generate the MT
> Confidence, if required.
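
To make the annotatorsRef idea concrete, here is a minimal sketch of how
the same translation could carry either kind of score, with the tool
reference telling the consumer which one it is. It assumes the ITS 2.0
local HTML attributes its-mt-confidence and its-annotators-ref; the
MtAnnotation class, the to_html_span helper, the example.com IRIs and the
score values are all made up for illustration and come from no real tool.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class MtAnnotation:
    """Hypothetical container for one MT Confidence annotation."""
    text: str          # the MT output being annotated
    confidence: float  # score in the interval [0, 1]
    tool_iri: str      # placeholder IRI identifying the tool that produced the score


def to_html_span(ann: MtAnnotation) -> str:
    """Serialize the annotation as a span with ITS 2.0 local markup."""
    if not 0.0 <= ann.confidence <= 1.0:
        raise ValueError("MT Confidence must lie in the interval [0, 1]")
    tool = escape(ann.tool_iri, quote=True)
    return (
        f'<span its-annotators-ref="mt-confidence|{tool}" '
        f'its-mt-confidence="{ann.confidence:.4f}">{escape(ann.text)}</span>'
    )


sentence = "Dublin is the capital of Ireland."

# Case #1: score reported by the MT engine itself (illustrative value).
engine_score = MtAnnotation(sentence, 0.8982, "http://example.com/mt-engine")

# Case #2: score from a separate quality estimation tool (illustrative value).
qe_score = MtAnnotation(sentence, 0.7345, "http://example.com/qe-tool")

print(to_html_span(engine_score))
print(to_html_span(qe_score))
```

The markup is identical in both cases; only the IRI after
"mt-confidence|" differs, so a consumer that needs to interpret the number
has to look up what that tool is and how it scores.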
>
> Declan
>
> On 21 July 2013 15:26, Jörg Schütz <joerg@bioloom.de> wrote:
> Hi Yves, Dave, and all,
>
> As of yet, the definition of MT Confidence restricts its use case to a
> score internally generated by the employed MT engine. If we
> would allow for the specification of the scoring tool then this data
> category could be easily extended to the score generated by an
> external tool, for example, the QuEst application for Moses based MT
> engines. Probably, such an extension would need further
> information elements like the models and data that have been used in the
> scoring process.
>
> IMO LQI/non-conformance would be less appropriate for a "confidence"
> measure given the list of possible "quality issues" which are
> more linguistically oriented. Even if we would aggregate the different
> result types with a certain weighting (penalty), what we
> would get is an approximated quality rating, which we have with LQR (on
> the document level), but not a confidence measure in the
> above sense.
>
> This is an interesting and forward looking discussion which we should
> continue for future versions of ITS.
>
> Cheers -- Jörg
>
>
> On July 20, 2013 at 22:37 (CEST), Yves Savourel wrote:
> Hi Dave, all,
>
> If MT Confidence has been designed to hold only a self-reported score, then
> maybe it should stay that way. I just didn't know the
> reasoning behind the origin of the data category. But IMO it becomes a
> data category that is going to be very rarely used, except by
> research tools; production tools rarely have access to such measurements as
> far as I can see. But maybe it's a question of time.
>
> This said, in the case of QuEst, while I may be wrong, my understanding is
> that the type of score you get is very comparable to a
> self-reported confidence. You will note that I didn't ask to change the
> meaning of what MT Confidence is reporting, only that we
> not restrict the tool that generates that score to the MT system itself.
>
> The other option would be to use LQI/non-conformance? But I have to say
> that despite the description that sort of backs up that
> notion, the type name and the data category sound rather off to an
> end-user like me: Localization Quality *Issue* is about
> reporting problems, and I would imagine a (non-)conformance type is about
> aggregating data and types of errors to come up with an
> overall score that is more a composite measurement than something close to
> an MT Confidence.
>
> Would Localization Quality Rating be better? It is a rating of the quality
> of the translation with a rather vague definition.
>
> Cheers,
> -yves
>
>
> -----Original Message-----
> From: Dave Lewis [mailto:dave.lewis@cs.tcd.ie]
> Sent: Friday, July 19, 2013 7:39 PM
> To: Yves Savourel; public-multilingualweb-lt@w3.org
> Subject: Re: MT Confidence definition [ACTION-556]
>
> Hi all
> I managed to talk to Declan Groves about this yesterday. His view was that
> the original use case was to enable the confidence score
> that all statistical MT engines already generate in selecting the final
> output to be propagated in an open way. So using other methods is
> a change (a broadening) of the use case.
>
> He also saw the danger of confusion by users/implementors if something
> labelled as a 'confidence score' (which has a certain meaning
> in NLP circles) might be used to convey quality estimation (QE), which,
> depending on how it's done, has a different sort of significance.
>
> We did discuss the option of mtconfidence being used to convey the output
> of an automated score (e.g. BLEU) that had been integrated
> into an MT engine. This would be reasonable in use cases where MT engines
> are being dynamically retrained, but would require
> relaxing the wording.
>
> I also asked questions of some QE researchers in CNGL and got some
> interesting clarifications. Certainly QE is being used to
> provide scores of MT output (I was mistaken about that on the call), often
> trained on some human annotation collected on the quality
> of previous translations correlated to the current translation and perhaps
> other metadata (including the self-reported confidence
> scores) from the MT engine.
> Certainly there are also occasions where QE operates in a very similar
> fashion to that intended for non-conformance in LQI, so I
> think that remains an option also.
>
> So, Yves, you are right that the current definition is limiting to other
> possible 'scores' representing a confidence in the
> translation being a 'good' one, beyond just the MT-engine generated
> scores.
>
> At the same time I have the impression that the technologies for this are
> still emerging from the lab and don't have the benefit of
> widely used common platforms and industrial experience that SMT does.
> Overall this makes it difficult to make any hard and fast
> statements about what should and should not be used to generate
> MtConfidence scores right now.
>
> So softening that limitation as Yves suggests may be useful in
> accommodating innovations in this area, but may also open the door to
> some confusion by users that may impact negatively on the business
> benefits of interoperation, e.g. a translation client gets a
> score that they think has a certain significance when in fact it has
> another.
>
> So, if we were to make the changes suggested by Yves, we should accompany
> it with some best practice work to suggest how the
> annotatorsRef value could be used to inform on the particular method used
> to generate the mtconfidence score, including some
> classification encodings, explanations of the different methods and the
> significance that can be placed on the resulting scores in
> different situations. My general feeling, perhaps incorrect, is that the
> current IG membership probably doesn't have the breadth of
> expertise to provide this best practice. Arle, could this be something
> that QT-Launchpad could take on?
>
> To sum up:
> 1) the text proposed by Yves may relax limits on what can produce an
> mtconfidence score in a useful way by accommodating different
> techniques, but also has the potential to cause confusion about the
> significance of scores produced by different methods. Some of
> these could anyway be conveyed in the non-conformance type in LQI, but not all.
>
> 2) it seems very difficult to formulate wording that would constrain the
> range of methods in any usable way between the current text
> and what Yves suggests. So let's restrict ourselves to these two options.
>
> 3) If we relax the wording as Yves suggests, expertise would be needed to
> form best practice on the use of the annotatorsRef value
> to provide a way of classifying the different scoring methods in a way
> that's useful for users.
>
> Apologies for the long email, but unfortunately I couldn't find any clear
> pointers one way or another. Personally, I'm more neutral
> on the proposal.
> But also I don't know if we could categorize this as a minor clarification
> or not.
>
> Please voice your views on the list, and let's try to get consensus before
> the call next week. Note I'm not available for the call
> and I think Felix is away also.
>
> But we need to form a consensus quickly if we are to avoid delaying the PR
> stage further.
>
> Regards,
> Dave
>
>
> On 17/07/2013 11:35, Yves Savourel wrote:
> Hi Dave,
>
> In the case of QuEst, for the scenario I have in mind, one would for
> example perform the MT part with MS Hub, then pass that information to
> QuEst and get back a score that indicates a level of confidence for that
> translation candidate. So that's a step after MT and
> before any human looks at it.
>
> I may be wrong, but "MT Confidence" seems to be a good place to put that
> information.
>
> Even if QuEst is the wrong example, having MT Confidence restricted to a
> *self-reported* value seems very limiting. But maybe I'm misinterpreting
> the initial aim of the data category.
>
> Cheers,
> -ys
>
> -----Original Message-----
> From: Dave Lewis [mailto:dave.lewis@cs.tcd.ie]
> Sent: Wednesday, July 17, 2013 12:25 PM
> To: public-multilingualweb-lt@w3.org
> Subject: Re: MT Confidence definition
>
> Hi Yves,
> I don't necessarily agree with this based on the example you give in
> relation to quality estimation in QuEst.
>
> Is not the goal of quality estimation to predict the quality of a
> translation of a given source string for a given MT engine training
> corpora and training regime _prior_ to actually performing the
> translation?
>
> In which case it would be an annotation not of a translation but of a
> _source_ with reference to an existing or planned MT engine (which you
> rightly say in response to Sergey can be resolved via the
> annotatorsRef).
>
> So while the basic data structure of mtConfidence would work for it, the
> use case, name and wording don't, I think, match the use of MT QE.
>
> Declan, Ankit could you comment - I'm not really an expert here, and not
> up to speed on the different applications of MT QE.
>
> cheers,
> Dave
>
>
> On 17/07/2013 08:29, Yves Savourel wrote:
> Hi all,
>
> I've noticed a minor text issue in the specification:
>
> For the MT Confidence data category we say:
>
> "The MT Confidence data category is used to communicate the
> self-reported confidence score from a machine translation engine of the
> accuracy of a translation it has provided."
>
> This is very limiting.
>
> I think it should say:
>
> "The MT Confidence data category is used to communicate the
> confidence score of the accuracy of a translation provided by a machine
> translation."
>
> (and later: "the self-reported confidence score" should be "the reported
> confidence score").
>
> There could be cases where the confidence score is provided by
> a system other than the one that provided the MT candidate. The QuEst
> project is an example of this
> (http://staffwww.dcs.shef.ac.uk/people/L.Specia/projects/quest.html).
>
> Cheers,
> -ys
>
>
>
>
> --
> Dr. Declan Groves
> Applied Research and Development Coordinator
> Centre for Next Generation Localisation (CNGL)
> Dublin City University
>
> email: dgroves@computing.dcu.ie
> phone: +353 (0)1 700 6906
>
>
>


-- 
Dr. Declan Groves
Applied Research and Development Coordinator
Centre for Next Generation Localisation (CNGL)
Dublin City University

email: dgroves@computing.dcu.ie
phone: +353 (0)1 700 6906

Received on Tuesday, 23 July 2013 09:24:29 UTC