- From: Dr. David Filip <David.Filip@ul.ie>
- Date: Thu, 2 Aug 2012 13:11:07 +0100
- To: Jan Nelson <Jan.Nelson@microsoft.com>
- Cc: "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>
- Message-ID: <CANw5LKmWxd7HP1Y+raWeC8RFpGwvagEBBhoRkmRiK7sDJ2bVrA@mail.gmail.com>
Jan, thanks for checking with Chris; I am glad that you support the mtConfidence part.

Regarding the mt*Metrics parts, I see what Chris means, but I need to clarify and qualify. First, I think the comment does not apply to the mtHumanMetrics part. What I mean are the simple score-based human usability assessments that are increasingly used, although the scales are unfortunately not standardized. AFAIK 4- and 5-value scales are in use; I would tend to promote 4 values as best practice, with 0 [publishable without changes] being the best and 3 [complete retranslation needed] the worst.

Regarding the mtAutomatedMetrics part, I agree that you need the pointers mentioned by Chris to PERFORM these metrics, but that was NOT the goal of this part of the proposal. I simply assumed that the MT producer has the metrics service set up (internally or as a third-party service) and is able to provide it at runtime. I should have said so. The metrics-producing service obviously must have all the mentioned pointers to perform the metrics, but IMHO these are not worth passing down.

Finally, I agree that the mtConfidence part seems the most stable and the best candidate for first-wave standardization. If the other parts do not have support, I am happy to drop them; I just wanted to chart the area in a general way.

Cheers
dF

Dr. David Filip
=======================
LRC | CNGL | LT-Web | CSIS
University of Limerick, Ireland
telephone: +353-6120-2781
*cellphone: +353-86-0222-158*
facsimile: +353-6120-2734
mailto: david.filip@ul.ie

On Thu, Aug 2, 2012 at 7:36 AM, Jan Nelson <Jan.Nelson@microsoft.com> wrote:

> Feedback from Chris Wendt from our Microsoft Translator team on the
> section below:
>
>> -mtQuality
>> --mtConfidence
>> ---mtProducer [string identifying producer Bing, DCU-Matrex etc.]
>> ----mtEngine [string identifying the engine on one of the above
>> platforms, can be potentially quite structured, pair, domain etc.]
>> -----mtConfidenceScore [0-100% or interval 0-1]
>>
>> All of the above makes total sense. I'd vote for it as proposed. Must be
>> at sentence level, not below.
>>
>> The following does not make sense without significant enhancements:
>>
>> --mtAutomatedMetrics
>> ---mtScoreType [METEOR, TER, BLEU, Levenshtein distance etc.]
>> ----mtAutomatedMetricsScore [0-100% or interval 0-1]
>> --mtHumanMetrics
>> ---mtHumanMetricsScale [{4,3,2,1,0}, {0,1,2,3,4}, {3,2,1,0} etc.]
>> ----mtHumanMetricsValue [one of the above values depending on scale]
>>
>> All of the scoring needs a pointer to one or more reference translations
>> AND a pointer to the source document. In addition, there may not be the
>> exact same number of references per element. You'd need either to add
>> syntax for specifying the source and references, or to embed it all in
>> the document. That'll be messy.
>>
>> These numbers are relevant only for MT system evaluations, which I
>> consider a niche, and I would not try to standardize here in ITS.
>>
>> Thanks,
>>
>> Jan
>>
>> ------------------------------
>>
>> *From:* Felix Sasaki [fsasaki@w3.org]
>> *Sent:* Wednesday, August 01, 2012 12:50 AM
>> *To:* Dr. David Filip
>> *Cc:* public-multilingualweb-lt@w3.org
>> *Subject:* Re: [Action-126] David to come up with a proposal for
>> mtConfidence
>>
>> Hi David, all,
>>
>> 2012/7/31 Dr. David Filip <David.Filip@ul.ie>
>>
>> Hi all, I was trying to engage a PhD student here at LRC to produce a
>> proposal for this data category, but I failed.
>>
>> Nevertheless, here is my thinking on the category, so that maybe someone
>> else (Declan?) could take it to the call-for-consensus stage.
>>
>> Co-chair hat on: if there is no strong support for this, I would propose
>> to put this on hold until we have finished all other data categories.
>> As you wrote in your agenda,
>>
>> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Jul/0311.html
>>
>> we have various data category proposals on the table that are not
>> finished: special requirements, named entity, quality, ... I will send a
>> proposal for the time until Last Call later today, which will show that
>> we need to finish these and the various "ed. notes" in
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html
>>
>> I think we need the time and your input to work on these.
>>
>> I very much hope for your understanding - let's also discuss this during
>> the call on Thursday,
>>
>> Felix
>>
>> I believe that mtConfidence is being produced in some form or other by
>> all major current MT systems. As discussed in Dublin, the issue is that
>> these confidence scores are not really comparable between engines: not
>> only between Bing and Google, or Matrex, but not even between different
>> language-pair engines, or between specific domain-trained engines based
>> on the same general technology.
>>
>> Nevertheless, there are prospects for standardizing based on cognitive
>> effort in post-editing etc. Even knowing that the usability of
>> confidence scores is limited, there are valid production-consumption
>> scenarios in the content lifecycle. If a client, service provider,
>> translator, or reviewer repeatedly works with the same engine, they will
>> find even the engine's self-evaluation useful.
>>
>> Further to this, there is potential for connecting this with automated
>> and human MT evaluation scores, so I'd propose to generalize as
>> mtQuality [meaning raw MT quality, NOT talking about levels of PE],
>> which would subsume mtConfidence etc. as seen below.
>>
>> My proposal of the data model based on the above:
>>
>> -mtQuality
>> --mtConfidence
>> ---mtProducer [string identifying producer Bing, DCU-Matrex etc.]
>> ----mtEngine [string identifying the engine on one of the above
>> platforms, can be potentially quite structured, pair, domain etc.]
>> -----mtConfidenceScore [0-100% or interval 0-1]
>> --mtAutomatedMetrics
>> ---mtScoreType [METEOR, TER, BLEU, Levenshtein distance etc.]
>> ----mtAutomatedMetricsScore [0-100% or interval 0-1]
>> --mtHumanMetrics
>> ---mtHumanMetricsScale [{4,3,2,1,0}, {0,1,2,3,4}, {3,2,1,0} etc.]
>> ----mtHumanMetricsValue [one of the above values depending on scale]
>>
>> mtQuality is an optional attribute of a machine-translated text segment
>> (as in Unicode or localization segmentations). I do not think this is
>> useful at higher or lower levels.
>>
>> mtQuality must be specified as mtConfidence XOR mtAutomatedMetrics XOR
>> mtHumanMetrics.
>>
>> Then comes the compulsory specification of the actual value (eventually
>> preceded by the choice of scale, if more options exist).
>>
>> Cheers
>> dF
>>
>> Dr. David Filip
>> =======================
>> LRC | CNGL | LT-Web | CSIS
>> University of Limerick, Ireland
>> telephone: +353-6120-2781
>> *cellphone: +353-86-0222-158*
>> facsimile: +353-6120-2734
>> mailto: david.filip@ul.ie
>>
>> --
>> Felix Sasaki
>> DFKI / W3C Fellow
>
> --
> Felix Sasaki
> DFKI / W3C Fellow
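[Editorial note: the thread above never fixes a concrete syntax for the proposed mtQuality data model. As a minimal sketch only, the following Python snippet models David's outline as a nested mapping and enforces his two stated constraints: the mtConfidence XOR mtAutomatedMetrics XOR mtHumanMetrics rule, and scores in the 0-1 interval. The key names come from the proposal; the dictionary shape and the engine identifier are illustrative assumptions, not part of any specification.]

```python
def validate_mt_quality(meta):
    """Check an mtQuality record against the constraints in the proposal.

    Exactly one of mtConfidence, mtAutomatedMetrics, mtHumanMetrics
    must be present (the XOR rule); confidence and automated-metric
    scores must fall in the [0, 1] interval; a human-metric value must
    be one of the values of its declared scale.
    """
    branches = [k for k in ("mtConfidence", "mtAutomatedMetrics", "mtHumanMetrics")
                if k in meta]
    if len(branches) != 1:
        raise ValueError("mtQuality requires exactly one metrics branch, got %r" % branches)
    name = branches[0]
    branch = meta[name]
    if name == "mtConfidence":
        if not 0.0 <= branch["mtConfidenceScore"] <= 1.0:
            raise ValueError("mtConfidenceScore must lie in [0, 1]")
    elif name == "mtAutomatedMetrics":
        if not 0.0 <= branch["mtAutomatedMetricsScore"] <= 1.0:
            raise ValueError("mtAutomatedMetricsScore must lie in [0, 1]")
    else:  # mtHumanMetrics
        if branch["mtHumanMetricsValue"] not in branch["mtHumanMetricsScale"]:
            raise ValueError("mtHumanMetricsValue must be one of the scale values")
    return True


# Example record for one segment; "en-de.general" is a hypothetical
# engine identifier, used only to illustrate the mtEngine slot.
record = {
    "mtConfidence": {
        "mtProducer": "Bing",
        "mtEngine": "en-de.general",
        "mtConfidenceScore": 0.87,
    }
}
assert validate_mt_quality(record)
```

A record carrying both mtConfidence and mtHumanMetrics would be rejected by the XOR check, matching the "must be specified as ... XOR ..." wording above.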
Received on Thursday, 2 August 2012 12:12:20 UTC