RE: MT Confidence definition [ACTION-556]

Hi Felix,

Looks fine to me.

Typo: "...the score on it's own is..." should be "...the score on its own is..."

-ys

From: Felix Sasaki [mailto:fsasaki@w3.org] 
Sent: Tuesday, July 23, 2013 2:42 AM
To: Declan Groves
Cc: public-multilingualweb-lt@w3.org; Yves Savourel; Dave Lewis
Subject: Re: MT Confidence definition [ACTION-556]

Hi Declan, all,

I tried to implement this in section
http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#mtconfidence-definition
and explain it with a dedicated note
http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#mt-confidence-score-generation-tools
This should resolve the "we need to explain this" part. Can you have a look before the Wednesday call? With regard to examples, I
propose to wait until after the Proposed Recommendation, since these are just examples. If people want to hold this change
until after the PR publication to have more time to review it, I am fine with that too; please let me know and I will revert it.

Best,

Felix

On 22.07.13 19:32, Declan Groves wrote:
Hi all,

I think Yves makes a good point.

In my view, on reviewing the discussions, as it stands MT Confidence can be used to represent two different types of "confidence"
scores. They are very closely related, but still quite different.

It is worth remembering that the original motivation behind the MT Confidence category is to provide an automatically-generated
value which offers some information on the perceived quality of a translation produced by an MT engine. This value can then be used
in subsequent processes e.g. during post-editing processes, during additional more sophisticated quality estimation processes etc.

1. The quality score of the translation as produced by an MT engine (for the most part this type of score is only produced by
statistical engines and usually equates to the probability of that translation, given the specific models used by the engine).
2. The quality estimation score (such as provided by the QuEst tool or by some additional process).

Both are dependent on the MT engine. The first is produced directly by the MT engine. The second uses both MT-system-internal
features (including features extracted from internal MT translation and language models as well as the final translation probability
as produced by the MT engine) and additional external features. This is the reason why MT Confidence needs to additionally provide
information about the engine (and perhaps, in the case of #2, any additional tools that were used in deriving the MT confidence);
otherwise the number on its own is hard to interpret and to reuse.

Based on this, I think we can safely remove the self-referential part of the description of MT Confidence to allow it to be
used to capture both #1 and #2 above but, following Dave's point, we would need to clarify it with examples of best practices for
both instances to make it clear for implementers. It is not the intention of the category to define how the score is calculated, so
I also think it's a good idea to use annotatorsRef to provide further details on the tools and methods used to generate the MT
Confidence, if required.
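
As a rough illustration of the point above (a sketch only, not text from the specification): the same score field can carry either type of value, provided that a tool reference travels with it. The attribute names follow the ITS 2.0 HTML syntax as I understand it (its-mt-confidence, its-annotators-ref); the class, the function and the example.com IRIs are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class MtConfidence:
    """An MT Confidence score plus a reference to the tool that produced it."""
    score: float        # expected in the interval [0, 1]
    annotator_iri: str  # hypothetical IRI identifying the engine or QE tool

    def __post_init__(self) -> None:
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score out of range: {self.score}")


def annotate(text: str, conf: MtConfidence) -> str:
    """Wrap translated text in a span carrying the score and its tool reference."""
    ref = escape(f"mt-confidence|{conf.annotator_iri}", quote=True)
    return (
        f'<span its-annotators-ref="{ref}" '
        f'its-mt-confidence="{conf.score:g}">{escape(text)}</span>'
    )


# Case #1: score reported by the MT engine itself (made-up value and IRI).
engine_score = MtConfidence(0.82, "http://example.com/tools#smt-engine")
# Case #2: score produced afterwards by a separate quality estimation tool.
qe_score = MtConfidence(0.64, "http://example.com/tools#qe-tool")

print(annotate("Dublin is the capital of Ireland.", engine_score))
print(annotate("Dublin is the capital of Ireland.", qe_score))
```

Without the tool reference the two spans would differ only in the number, which is exactly the interpretation problem described above.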

Declan

On 21 July 2013 15:26, Jörg Schütz <joerg@bioloom.de> wrote:
Hi Yves, Dave, and all,

As of yet, the definition of MT Confidence restricts its use case to a score internally generated by the employed MT engine. If we
allowed for the specification of the scoring tool, then this data category could easily be extended to the score generated by an
external tool, for example, the QuEst application for Moses-based MT engines. Such an extension would probably need further
information elements, like the models and data that have been used in the scoring process.

IMO LQI/non-conformance would be less appropriate for a "confidence" measure given the list of possible "quality issues", which are
more linguistically oriented. Even if we aggregated the different result types with a certain weighting (penalty), what we
would get is an approximated quality rating, which we have with LQR (on the document level), but not a confidence measure in the
above sense.

This is an interesting and forward-looking discussion, which we should continue for future versions of ITS.

Cheers -- Jörg 


On July 20, 2013 at 22:37 (CEST), Yves Savourel wrote:
Hi Dave, all,

If MT Confidence has been designed to hold only a self-reported score, then maybe it should stay that way. I just didn't know the
reasoning behind the origin of the data category. But IMO it becomes a data category that is going to be very rarely used, except in
research tools; production tools rarely have access to such measurements, as far as I can see. But maybe it's a question of time.

This said, in the case of QuEst, while I may be wrong, my understanding is that the type of score you get is very comparable to a
self-reported confidence. You will note that I didn't ask to change the meaning of what MT Confidence is reporting, only that we
not restrict the tool that generates that score to the MT system itself.

The other option would be to use LQI/non-conformance? But I have to say that, despite the description that sort of backs up that
notion, the type name and the data category sound rather off to an end-user like me: Localization Quality *Issue* is about
reporting problems, and I would imagine a (non-)conformance type is about aggregating data and types of errors to come up with an
overall score that is more a composite measurement than something close to an MT Confidence.

Would Localization Quality Rating be better? It is a rating of the quality of the translation with a rather vague definition.

Cheers,
-yves


-----Original Message-----
From: Dave Lewis [mailto:dave.lewis@cs.tcd.ie]
Sent: Friday, July 19, 2013 7:39 PM
To: Yves Savourel; public-multilingualweb-lt@w3.org
Subject: Re: MT Confidence definition [ACTION-556]

Hi all
I managed to talk to Declan Groves about this yesterday. His view was that the original use case was to enable the confidence score
that all statistical MT engines already generate when selecting the final output to be propagated in an open way. So using other
methods is some change (a broadening) of the use case.

He also saw the danger of confusion by users/implementors if something labelled as a 'confidence score' (which has a certain meaning
in NLP circles) might be used to convey quality estimation (QE), which, depending on how it's done, has a different sort of
significance.

We did discuss the option of mtconfidence being used to convey the output of an automated score (e.g. BLEU) that had been integrated
into an MT engine. This would be reasonable in use cases where MT engines are being dynamically retrained, but would require
relaxing the wording.

I also asked questions of some QE researchers in CNGL and got some interesting clarifications. Certainly QE is being used to
provide scores of MT output (I was mistaken about that on the call), often trained on some human annotation collected on the quality
of previous translations correlated to the current translation and perhaps other metadata (including the self-reported confidence
scores) from the MT engine.
Certainly there are also occasions where QE operates in a very similar fashion to that intended for non-conformance in LQI, so I
think that remains an option also.

So, Yves, you are right that the current definition is limiting with respect to other possible 'scores' representing a confidence
in the translation being a 'good' one, beyond just the MT-engine generated scores.

At the same time I have the impression that the technologies for this are still emerging from the lab and don't have the benefit of
widely used common platforms and industrial experience that SMT does. Overall this makes it difficult to make any hard and fast
statements about what should and should not be used to generate MtConfidence scores right now.

So softening that limitation as Yves suggests may be useful in accommodating innovations in this area, but may also open the door to
some confusion by users that may impact negatively on the business benefits of interoperation, e.g. a translation client gets a
score that they think has a certain significance when in fact it has another.

So, if we were to make the changes suggested by Yves, we should accompany them with some best practice work to suggest how the
annotatorsRef value could be used to inform on the particular method used to generate the mtconfidence score, including some
classification encodings, explanations of the different methods and the significance that can be placed on the resulting scores in
different situations. My general feeling, perhaps incorrect, is that the current IG membership probably doesn't have the breadth of
expertise to provide this best practice. Arle, could this be something that QT-Launchpad could take on?

To sum up:
1) the text proposed by Yves may relax the limits on what can produce an mtconfidence score in a useful way by accommodating different
techniques, but also has the potential to cause confusion about the significance of scores produced by different methods. Some of
these could anyway be conveyed in the non-conformance type in LQI, but not all.

2) it seems very difficult to formulate wording that would constrain the range of methods in any usable way between the current text
and what Yves suggests. So let's restrict ourselves to these two options.

3) If we relax the wording as Yves suggests, expertise would be needed to form best practice on the use of the annotatorsRef value
to provide a way of classifying the different scoring methods in a way that's useful for users.

Apologies for the long email, but unfortunately I couldn't find any clear pointers one way or another. Personally, I'm more neutral
on the proposal.
But I also don't know whether we could categorize this as a minor clarification or not.

Please voice your views on the list, and let's try and get consensus before the call next week. Note I'm not available for the call
and I think Felix is away also.

But we need to form a consensus quickly if we are to avoid delaying the PR stage further.

Regards,
Dave


On 17/07/2013 11:35, Yves Savourel wrote:
Hi Dave,

In the case of QuEst, for the scenario I have in mind, one would for
example perform the MT part with MS Hub, then pass that information to
QuEst and get back a score that indicates a level of confidence for that translation candidate. So that's a step after MT and
before any human looks at it.
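
To make that scenario concrete, here is a minimal sketch of the two-step flow, assuming nothing about the real MS Hub or QuEst interfaces: the `translate` and `estimate` callables, the stand-in functions and the tool name are all hypothetical placeholders.

```python
from typing import Callable, NamedTuple


class ScoredCandidate(NamedTuple):
    source: str
    target: str
    confidence: float  # score attached after MT, before any human review
    scored_by: str     # name of the tool that produced the score


def translate_then_score(
    source: str,
    translate: Callable[[str], str],
    estimate: Callable[[str, str], float],
    scorer_name: str,
) -> ScoredCandidate:
    """Run MT first, then ask a separate tool to score the candidate."""
    target = translate(source)
    confidence = estimate(source, target)
    return ScoredCandidate(source, target, confidence, scorer_name)


# Stand-ins only: a real setup would call the MT service and the QE tool here.
def fake_mt(source: str) -> str:
    return source.upper()


def fake_qe(source: str, target: str) -> float:
    return 1.0 if len(source) == len(target) else 0.5


result = translate_then_score("hello", fake_mt, fake_qe, "external-qe-tool")
print(result)
```

The point of the sketch is only that the score and the name of the tool that produced it are recorded together, separately from the engine that produced the translation.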

I may be wrong, but "MT Confidence" seems to be a good place to put that information.

Even if QuEst is the wrong example, having MT Confidence restricted to a
*self-reported* value seems very limiting. But maybe I'm misinterpreting the initial aim of the data category.

Cheers,
-ys

-----Original Message-----
From: Dave Lewis [mailto:dave.lewis@cs.tcd.ie]
Sent: Wednesday, July 17, 2013 12:25 PM
To: public-multilingualweb-lt@w3.org
Subject: Re: MT Confidence definition

Hi Yves,
I don't necessarily agree with this, based on the example you give in relation to quality estimation in QuEst.

Is not the goal of quality estimation to predict the quality of a
translation of a given source string for a given MT engine training corpora and training regime _prior_ to actually performing the
translation?

In which case it would be an annotation not of a translation but of a
_source_ with reference to an existing or planned MT engine (which, as you rightly say in response to Sergey, can be resolved via the
annotatorsRef).

So while the basic data structure of mtConfidence would work for it, the
use case, name and wording don't, I think, match the use of MT QE.

Declan, Ankit, could you comment? I'm not really an expert here, and not up to speed on the different applications of MT QE.

cheers,
Dave


On 17/07/2013 08:29, Yves Savourel wrote:
Hi all,

I've noticed a minor text issue in the specification:

For the MT Confidence data category we say:

"The MT Confidence data category is used to communicate the
self-reported confidence score from a machine translation engine of the accuracy of a translation it has provided."

This is very limiting.

I think it should say:

"The MT Confidence data category is used to communicate the
confidence score of the accuracy of a translation provided by a machine translation."

(and later: "the self-reported confidence score" should be "the reported confidence score").

There could be cases where the confidence score is provided by
a system other than the one that provided the MT candidate. The QuEst
project is an example of this
(http://staffwww.dcs.shef.ac.uk/people/L.Specia/projects/quest.html)

Cheers,
-ys




-- 
Dr. Declan Groves
Applied Research and Development Coordinator
Centre for Next Generation Localisation (CNGL)
Dublin City University

email: dgroves@computing.dcu.ie
phone: +353 (0)1 700 6906 

Received on Tuesday, 23 July 2013 04:06:47 UTC