Re: [All] domain data category section proposal, please review from Felix Sasaki on 2012-07-11 (public-multilingualweb-lt@w3.org from July 2012)

From: Felix Sasaki <fsasaki@w3.org>
Date: Wed, 11 Jul 2012 07:29:12 +0200
To: Yves Savourel <ysavourel@enlaso.com>
Cc: Dave Lewis <dave.lewis@cs.tcd.ie>, public-multilingualweb-lt@w3.org
Message-ID: <CAL58czqN-+rPx3nP3+Zrn=1VqwFkNUTfTy36+UiRDT08HLLMuQ@mail.gmail.com>
2012/7/10 Yves Savourel <ysavourel@enlaso.com>

> Hi Dave,
>
> Thanks for the explanation.
> It sounds fine to me.
>
> I'll make sure my implementation allows for multiple domain values on the
> same node.
>
> I wonder--even if it's not a recommended behavior--if this possible
> multiplicity should not be noted somewhere in the specification: supporting
> it does change how to present the results in the implementations.
>
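
On allowing multiple domain values on the same node: below is a minimal
Python sketch, assuming the DocBook example quoted further down in this
thread and treating each subjectterm as one domain value. The function name
collect_domains is hypothetical and not taken from any ITS implementation.

```python
import xml.etree.ElementTree as ET

# Namespace prefix "db" is chosen here for illustration only.
DOCBOOK_NS = {"db": "http://docbook.org/ns/docbook"}

SAMPLE = """<article xmlns='http://docbook.org/ns/docbook'>
  <info>
   <title>Example of subjectset</title>
   <subjectset scheme="libraryofcongress">
    <subject>
     <subjectterm>Electronic Publishing</subjectterm>
    </subject>
    <subject>
     <subjectterm>SGML (Computer program language)</subjectterm>
    </subject>
   </subjectset>
  </info>
  <para>Text of the document</para>
</article>"""


def collect_domains(xml_text):
    """Return every subjectterm value, in document order, as a list."""
    root = ET.fromstring(xml_text)
    terms = root.findall(
        ".//db:subjectset/db:subject/db:subjectterm", DOCBOOK_NS
    )
    # Keep all values instead of picking the first or the last one.
    return [t.text.strip() for t in terms if t.text and t.text.strip()]


print(collect_domains(SAMPLE))
```

Returning a list rather than a single string leaves the decision about what
to do with several domains to the consumer tool.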

I used the material from Dave (thanks a lot for that!) to create an
explanatory note, see

http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#domain-implementation
and below:

[

It is possible to have more than one domain associated with a piece of
content. For example, if the consumer tool is a statistical machine
translation engine, it could include corpora from all domains available in
the source content in training the machine translation engine.

The consumer machine translation engine might choose to ignore the domain
and take a one-size-fits-all approach, or it may be selective about which
domains to use, based on how much content is marked with each domain. For
example, if the content has hundreds of sentences marked with the domains
'automotive' and 'medical', but only a couple of sentences marked with the
additional domains 'criminal law' and 'property law', the consumer tool may
opt to include its domains 'auto' and 'medicine', but not 'law', since the
improvement in the output does not justify the extra training resources.
]
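
The selective behaviour described in the note could be sketched as follows.
This is only an illustration under stated assumptions: the mapping table
DOMAIN_MAP, the function select_engine_domains and the min_share threshold
are all hypothetical, and a real consumer tool would make this decision on
its own business and resourcing grounds.

```python
from collections import Counter

# Hypothetical mapping from domains found in the content to the
# consumer tool's own domain names.
DOMAIN_MAP = {
    "automotive": "auto",
    "medical": "medicine",
    "criminal law": "law",
    "property law": "law",
}


def select_engine_domains(sentence_domains, min_share=0.05):
    """Return engine domains marked on at least min_share of the sentences.

    sentence_domains holds one list of domain values per sentence.
    """
    total = len(sentence_domains)
    if total == 0:
        return []
    counts = Counter()
    for domains in sentence_domains:
        # Map and deduplicate per sentence, so a sentence marked with both
        # 'criminal law' and 'property law' counts once for 'law'.
        for mapped in {DOMAIN_MAP[d] for d in domains if d in DOMAIN_MAP}:
            counts[mapped] += 1
    return sorted(d for d, n in counts.items() if n / total >= min_share)


# Hundreds of sentences with two domains, a couple with two additional ones.
content = [["automotive", "medical"]] * 300 + [
    ["automotive", "medical", "criminal law", "property law"]
] * 2

print(select_engine_domains(content))  # ['auto', 'medicine']
```

Setting min_share to 0 gives the "include corpora from all domains"
behaviour, and ignoring the result altogether corresponds to the
one-size-fits-all approach.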

Best,

Felix


>
> Cheers,
> -yves
>
>
>
> -----Original Message-----
> From: Dave Lewis [mailto:dave.lewis@cs.tcd.ie]
> Sent: Tuesday, July 10, 2012 10:56 PM
> To: Yves Savourel
> Cc: 'Felix Sasaki'; public-multilingualweb-lt@w3.org
> Subject: Re: [All] domain data category section proposal, please review
>
> Yves,
> Yes, I think it should be possible to have more than one domain associated
> with a piece of content.
>
> If this maps to multiple domains in the consumer tool, for statistical
> machine translation at least, this _can_ be handled by including training
> corpora from all the domains in training the MT engine you will use for
> this, so this is useful information.
>
> Given this case, we should not however, I believe, specify such behaviour
> in the recommendation, since this is really a business/resourcing decision
> on the consumer tool side.
>
> For instance, the consumer MT could choose to just ignore the domain and
> take a one-size-fits-all approach. Alternatively it may be selective in
> which domains to use, based on the range of content marked with domain, e.g.
> with reference to Felix's example: if the content has hundreds of sentences
> marked with domain 'automotive' AND 'medical', but only a couple of
> sentences marked with additional domains 'criminal law' and 'property law',
> the consumer tool may opt to include its domains 'auto'
> and 'medicine', but not 'law', since the improvement in the output doesn't
> justify the extra training resources.
>
> cheers,
> Dave
>
>
>
>
> On 10/07/2012 14:05, Yves Savourel wrote:
> > Hi Felix, Dave, all,
> >
> > Sorry, one more question related to the implementation of Domain:
> >
> > I was looking for examples and ran into this DocBook one:
> >
> > <article xmlns='http://docbook.org/ns/docbook'>
> >   <info>
> >    <title>Example of subjectset</title>
> >    <subjectset scheme="libraryofcongress">
> >     <subject>
> >      <subjectterm>Electronic Publishing</subjectterm>
> >     </subject>
> >     <subject>
> >      <subjectterm>SGML (Computer program language)</subjectterm>
> >     </subject>
> >    </subjectset>
> >   </info>
> >   <para>Text of the document</para>
> > </article>
> >
> > Where they explain that the //subjectset/subject/subjectterm element
> > indicates the DC subject (so it falls into our domain data category). See
> > http://www.docbook.org/tdg5/publishers/5.1b3/en/html/ch02.html#ch-gsxml.3.8
> >
> > As you can see, there are actually two entries in the example, so two
> > domains.
> > The question is: Can we have more than one domain associated with a
> > piece of content?
> >
> > Just wondering what the implications are for the tools downstream like
> > MT.
> >
> > If the answer is 'no', then how do we know which one to use? Do we just
> > leave that decision to the author (i.e. s/he is responsible for providing
> > only a mapping to a single entry per document)?
> >
> > Or do we provide some kind of default behavior, like: the first or last
> > one wins?
> >
> > Thanks,
> > -yves
> >
> >
>
>
>
>


-- 
Felix Sasaki
DFKI / W3C Fellow
Received on Wednesday, 11 July 2012 05:29:37 UTC