[MLW-LT] Queries on Test Suite for MT System Implementation

---------- Forwarded message ----------
From: Felix Sasaki <fsasaki@w3.org>
Date: 2012/8/20
Subject: Re: [MLW-LT] Queries on Test Suite for MT System Implementation
To: "Ankit K. Srivastava" <asrivastava@computing.dcu.ie>
Cc: Dominic Jones <Dominic.Jones@scss.tcd.ie>


Hi Ankit,

thanks for your mail. Would it be OK for you to discuss this on the public
list? If not that's fine too, just wanted to ask because other implementors
new to ITS might have similar questions.

On Monday, 20 August 2012, Ankit K. Srivastava wrote:


> Hi,
>
> I am working on adapting the DCU MaTrEx machine translation system
> to implement the metadata and pass the ITS 2.0 test suite specifications (
> http://phaedrus.scss.tcd.ie/its2.0/its-testsuite.html)
>
> I have a couple of queries / clarifications:
>
> 1) The "Input Test File" specifies the data format our MT system will
> receive to translate and the "Output Expected Results" is the format which
> is expected to be produced by our MT system, correct?
>


The "input test file" specifies the input, correct. The output, however, is
an artificial format that no language technology or other software consumes
to act on the ITS metadata. We developed this output format for the
implementation of ITS 1.0 and have added it here - for the moment. For many
implementations, like MaTrEx, producing such an output is not useful.

In other words, we haven't decided about the output format yet and would
like to hear input from implementors like you what is useful and what is
the least effort - both at the same time hopefully. We are very likely to
keep test cases with the output format, to demonstrate some basic ITS
functionality. But for systems like MaTrEx, that is not useful.


>
> The reason I ask this is because currently our MT system implementation
> can parse the XML/ITS files, translate, and output in the same format as
> input (i.e. without splitting everything into <node>)
>

Sure, that makes sense.


>
> So to pass the testsuite, we will need to provide the output in the <node>
> format, right?
>


No, see above. So far we only have input files, no output is decided upon.


>
>
> 2) This question is specific to our implementation for the September
> meeting.
> a) Has a common language pair / training dataset been decided upon for
> demonstration purposes? (Currently we plan to artificially pattern some toy
> data in the ITS format as specified on the testsuite webpage for
> English-Spanish)
>

No, there is no specific data set. For September, having a toy data set
makes sense.


>
> b) The Implementation specification has two columns "global" and "local"
> What exactly does this mean for MT?
>

ITS 2.0 metadata can be formulated with two approaches: local and global.
Below is an example for "Translate":

LOCAL: <doc ...><p>We need a new <code
its:translate="no">motherboard</code></p></doc>
GLOBAL: <its:rules version="2.0" ...><its:translateRule selector="//code"
translate="no"/></its:rules>
Both approaches, applied to the document with the "doc" element, express
the same metadata: the content of the "code" element should not be translated.
The difference between global and local is that global can be formulated
without changing the XML (or HTML5) document that is being processed, and
metadata can be applied to several nodes (e.g. all "code" elements).
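
As a rough illustration (a sketch only, not part of the test suite), the
Python below reads both forms and lists the content marked as not
translatable. The helper names are made up for this example, and the
selector "//code" is rewritten to ".//code" because the standard-library
ElementTree only supports a subset of XPath and rejects absolute paths on
an element.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
TRANSLATE_ATTR = f"{{{ITS_NS}}}translate"

# Local approach: the metadata sits inside the document itself.
LOCAL_DOC = (
    f'<doc xmlns:its="{ITS_NS}"><p>We need a new '
    '<code its:translate="no">motherboard</code></p></doc>'
)

# Global approach: the document is untouched, the rule lives elsewhere.
GLOBAL_DOC = "<doc><p>We need a new <code>motherboard</code></p></doc>"
GLOBAL_RULES = (
    f'<its:rules xmlns:its="{ITS_NS}" version="2.0">'
    '<its:translateRule selector="//code" translate="no"/></its:rules>'
)


def local_no_translate(doc_xml):
    """Text of elements carrying its:translate="no" (hypothetical helper)."""
    root = ET.fromstring(doc_xml)
    return [el.text for el in root.iter() if el.get(TRANSLATE_ATTR) == "no"]


def global_no_translate(doc_xml, rules_xml):
    """Text of elements selected by translate="no" rules (hypothetical helper)."""
    root = ET.fromstring(doc_xml)
    rules = ET.fromstring(rules_xml)
    found = []
    for rule in rules.findall(f"{{{ITS_NS}}}translateRule"):
        if rule.get("translate") == "no":
            # Assumption: selectors start with "//"; prefix "." for ElementTree.
            found += [el.text for el in root.findall("." + rule.get("selector"))]
    return found


print(local_no_translate(LOCAL_DOC))                  # ['motherboard']
print(global_no_translate(GLOBAL_DOC, GLOBAL_RULES))  # ['motherboard']
```

Both calls report the same element content, which is the point: the two
approaches differ in where the metadata is written, not in what it says.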

What does that mean for MT? An MT system that implements the "Translate"
metadata or "data category" both globally and locally should take both
approaches into account. In the example, the MT system must not translate
"motherboard" during translation. A more complex example involving both
local and global and some nesting of metadata is this:

<doc ...><p>We need a new <code
its:translate="no">motherboard</code></p><p>Some example code: <code>Some
code <span its:translate="yes">This should be translated</span></code></p></doc>

Here, via the global "translateRule", "code" is set to be not translatable.
But this setting is overridden via the local "its:translate" attribute at
the "span" element. So the MT system should translate "We need a new",
"Some example code:", and "This should be translated", but not "Some code".
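
The precedence just described (local markup first, then global rules, then
the value inherited from the parent, then the default "yes" for element
content) can be sketched in Python roughly as follows. This is an
illustration under assumptions, not a conformant ITS processor: the
function name is invented, the document is written out as well-formed XML
with the ITS namespace declared, and "//code" is again rewritten to
".//code" for the standard-library ElementTree.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
TRANSLATE_ATTR = f"{{{ITS_NS}}}translate"

DOC = (
    f'<doc xmlns:its="{ITS_NS}"><p>We need a new '
    '<code its:translate="no">motherboard</code></p>'
    '<p>Some example code: <code>Some code '
    '<span its:translate="yes">This should be translated</span></code></p></doc>'
)
RULES = (
    f'<its:rules xmlns:its="{ITS_NS}" version="2.0">'
    '<its:translateRule selector="//code" translate="no"/></its:rules>'
)


def translatable_segments(doc_xml, rules_xml):
    """Collect text an MT system should translate (hypothetical helper)."""
    root = ET.fromstring(doc_xml)
    rules = ET.fromstring(rules_xml)

    # Map each element selected by a global rule to that rule's value.
    by_rule = {}
    for rule in rules.findall(f"{{{ITS_NS}}}translateRule"):
        for el in root.findall("." + rule.get("selector")):
            by_rule[el] = rule.get("translate")

    segments = []

    def walk(el, inherited):
        # Local attribute wins, then a global rule, then the inherited value.
        value = el.get(TRANSLATE_ATTR) or by_rule.get(el, inherited)
        if value == "yes" and el.text and el.text.strip():
            segments.append(el.text.strip())
        for child in el:
            walk(child, value)
            # Text after a child belongs to the current element.
            if value == "yes" and child.tail and child.tail.strip():
                segments.append(child.tail.strip())

    walk(root, "yes")  # element content is translatable by default
    return segments


print(translatable_segments(DOC, RULES))
# ['We need a new', 'Some example code:', 'This should be translated']
```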

Let me know if this is clear. It might help to compare this to CSS: global
is like a CSS stylesheet, local like a "style" attribute at a given
element. And there are defaults - e.g. for "Translate", attribute content
is not translated by default (you can override this with a global rule
that selects the attribute in question).

It would be good to have test cases that demonstrate the effect. E.g. for
XHTML we have some basic translate rules that "do the right thing", see
www.w3.org/TR/2008/NOTE-xml-i18n-bp-20080213/
(the rules for XHTML)

Let me know if that helps and if you need more information.

Best,

Felix








>
> If I should have emailed someone else regarding this, please let me know.
>
> Thanks,
> Ankit.
>
>
> On Mon, Aug 20, 2012 at 12:14 PM, Felix Sasaki <fsasaki@w3.org> wrote:
>
>> Hi Dom,
>>
>> I checked the spec, and actually most of the sections didn't have an
>> ID. I added them and hopefully didn't forget any. See below. The base-uri
>> needs to be changed later but should be OK for now.
>>
>>
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#translate-global
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#translate-local
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#localizationnote-global
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#localizationnote-local
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#terminology-global
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#terminology-local
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#directionality-global
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#directionality-local
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#ruby-global
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#ruby-local
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#languageinformation-global
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#withintext-global
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#withintext-local
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#domain-global
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#disambiguation-global
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#disambiguation-local
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#localefilter-global
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#localefilter-local
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#externalresource-global
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#targetpointer-global
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#idvalue-global
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#preservespace-global
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#preservespace-local
>>
>> Best,
>>
>>
>> Felix
>>
>> 2012/8/20 Dominic Jones <Dominic.Jones@scss.tcd.ie>
>>
>>> Hey,
>>>
>>> Of course, good idea.
>>>
>>> Dom
>>>
>>>
>>> --
>>> Dominic Jones | Research Assistant
>>> KDEG, Trinity College Dublin, Ireland.
>>> http://www.scss.tcd.ie/dominic.jones
>>>
>>>
>>>
>>> On 20 Aug 2012, at 10:44, Felix Sasaki <fsasaki@w3.org> wrote:
>>>
>>> > Hi Dom, all,
>>> >
>>> > one additional suggestion: would it be OK to add a column "link to
>>> spec" with the related section? For the time being you could use
>>> >
>>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html
>>> > as the base URI. Most of the sections should have ID attributes, but
>>> if there are some missing, I'll add them - just let me know.
>>> >
>>> > Best,
>>> >
>>> > Felix
>>> >
>>> > 2012/8/20 Dominic Jones <Dominic.Jones@scss.tcd.ie>
>>> > Great, thanks for this feedback. Will work on re-naming files, we're
>>> then in a position to simply add new examples to the test-suite as they are
>>> formalised in the specification.
>>> >
>>> > Dom.
>>> >
>>> >
>>> > --
>>> > Dominic Jones | Research Assistant
>>> > KDEG, Trinity College Dublin, Ireland.
>>> > http://www.scss.tcd.ie/dominic.jones
>>> >
>>> >
>>> >
>>> > On 17 Aug 2012, at 16:23, Yves Savourel <yves.savourel@gmail.com>
>>> wrote:
>>> >
>>> > > Re-sending using a different email (can't use enlaso.com from here
>>> for some reason)
>>> > >
>>> > >
>>> > > -----Original Message-----
>>> > > From: Yves Savourel [mailto:ysavourel@enlaso.com]
>>> > > Sent: Friday, August 17, 2012 9:22 AM
>>> > > To: 'Dominic Jones'; 'Felix Sasaki'
>>> > > Cc: 'Multilingual Web LT-TESTS Public'
>>> > > Subject: RE: Test suite file naming conventions.
>>> > >
>>> > > Hi Dom,
>>> > >
>>> > > I think using a convention is good.
>>> > > I have no preference on a pattern: what you propose sounds good.
>>> > >
>>> > > Cheers,
>>> > > -yves
>>> > >
>>> > >
>>> > > -----Original Message-----
>>> > > From: Dominic Jones [mailto:Dominic.Jones@scss.tcd.ie]
>>> > > Sent: Friday, August 17, 2012 7:58 AM
>>> > > To: Yves Savourel; Felix Sasaki
>>> > > Cc: Multilingual Web LT-TESTS Public
>>> > > Subject: Test suite file naming conventions.
>>> > >
>>> > > Hi Yves, Felix, All…
>>> > >
>>> > > I have got through the below. Took some time to re-work the file
>>> structure behind the website so that future changes are more easily
>>> manageable. You can now find links to zip files for each data category at
>>> http://phaedrus.scss.tcd.ie/its2.0/its-testsuite.html
>>> > >
>>> > > However as you'll notice the naming of each input and output file is
>>> rather sporadic in convention. I want to work on renaming these files and
>>> wondered if you had a preference… There are around 160 test files and this
>>> is likely to grow so some planning now will help us in the future! For
>>> example in translate "Translate1.xml" could become
>>> "TranslateGlobalEmbedRules.xml" although this would lead to some rather
>>> long file names. I propose camel casing, removing white space and stop
>>> words for input and output file types.
>>> > >
>>> > > This is about a day's worth of work so wanted to run it by you guys
>>> before I started!
>>> > >
>>> > > Have a good weekend,
>>> > >
>>> > > Dom
>>> > >
>>> > >
>>> > > On 9 Aug 2012, at 16:10, Dominic Jones <Dominic.Jones@scss.tcd.ie>
>>> wrote:
>>> > >
>>> > >> (Sorry our CS mail was down and I missed this mail)
>>> > >>
>>> > >>
>>> > >> On 9 August 2012 13:24, Yves Savourel <ysavourel@enlaso.com> wrote:
>>> > >>> Hi Dom,
>>> > >>>
>>> > >>> - Nice presentation.
>>> > >>
>>> > >> Thanks.
>>> > >>
>>> > >>>
>>> > >>> - A few very minor details: spacing around '+' is not consistent.
>>> Casing for some attributes like 'domainpointer' is incorrect.
>>> > >>
>>> > >> Will fix this tomorrow.
>>> > >>
>>> > >>>
>>> > >>> - there is no test for the local use of the Element Within Text
>>> data category, which was in the early draft (I think, but I may be wrong).
>>> > >>>
>>> > >>
>>> > >> Will come back to this next week.
>>> > >>
>>> > >>
>>> > >>> - One suggestion: It would be great to have a way for developers
>>> to get all the source, rules and expected results files in a zip file. That
>>> way we could easily get the latest list/correction.
>>> > >>>
>>> > >>
>>> > >>
>>> > >> Will fix this tomorrow.
>>> > >>
>>> > >>
>>> > >>> Cheers,
>>> > >>> -ys
>>> > >>>
>>> > >>
>>> > >>
>>> > >> Thanks for your feedback!
>>> > >>
>>> > >> Dom
>>> > >>
>>> > >>
>>> > >>>
>>> > >>> -----Original Message-----
>>> > >>> From: Dominic Jones [mailto:Dominic.Jones@scss.tcd.ie]
>>> > >>> Sent: Thursday, August 09, 2012 3:44 AM
>>> > >>> To: Multilingual Web LT-TESTS Public
>>> > >>> Subject: Beginnings of the ITS 2.0 test suite
>>> > >>>
>>> > >>> Dear All,
>>> > >>>
>>> > >>> We have updated the 1.0 data categories with HTML5 examples and
>>> added in the new 2.0 categories based on the spec draft as of 31st of July
>>> (Domain, Locale Filter and External Resources). You can find a draft
>>> version of the test suite here: (
>>> http://phaedrus.scss.tcd.ie/its2.0/its-testsuite.html). This is a very
>>> first draft and we're looking for your feedback on the "Test Files." We'd
>>> like to get to the point where all the input test files are agreed upon as
>>> being correct and valid representations of the data categories implemented
>>> in XML and HTML where applicable allowing as the spec develops for us to
>>> add in the new categories and examples.
>>> > >>>
>>> > >>> Plan to come back to Yves's discussion around tabular output from
>>> implementations around the middle of next week.
>>> > >>>
>>> > >>> Both Leroy and I will be on the call this afternoon.
>>> > >>>
>>> > >>> Dom.
>>> > >>>
>>> > >>> --
>>> > >>> Dominic Jones | Research Assistant
>>> > >>> KDEG, Trinity College Dublin, Ireland.
>>> > >>> Work: + 353 (0) 1896 8426
>>> > >>> Mobile: + 353 (0) 879259719
>>> > >>> http://www.scss.tcd.ie/dominic.jones/
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >> --
>>> > >> Dominic Jones | Research Assistant
>>> > >> KDEG, Trinity College Dublin, Ireland.
>>> > >> Work: + 353 (0) 1896 8426
>>> > >> Mobile: + 353 (0) 879259719
>>> > >> http://www.scss.tcd.ie/dominic.jones/
>>> > >>
>>> > >>
>>> > >>
>>> > >>
>>> > >>
>>> > >
>>> > >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > Felix Sasaki
>>> > DFKI / W3C Fellow
>>> >
>>>
>>>
>>>
>>
>>
>> --
>> Felix Sasaki
>> DFKI / W3C Fellow
>>
>>
>

-- 
Felix Sasaki
DFKI / W3C Fellow





Received on Monday, 20 August 2012 15:35:56 UTC