Re: [All] domain data category section proposal, please review

From: Dave Lewis <dave.lewis@cs.tcd.ie>
Date: Tue, 10 Jul 2012 21:56:16 +0100
Message-ID: <4FFC96F0.1080207@cs.tcd.ie>
To: Yves Savourel <ysavourel@enlaso.com>
CC: 'Felix Sasaki' <fsasaki@w3.org>, public-multilingualweb-lt@w3.org
Yves,
Yes, I think it should be possible to have more than one domain 
associated with a piece of content.

If this maps to multiple domains in the consumer tool then, for
statistical machine translation at least, this _can_ be handled by
including training corpora from all of those domains when training the
MT engine you will use, so this is useful information.
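
As a rough sketch of that pooling idea, assuming the consumer tool keeps
a mapping from content domain labels to its own training corpora. The
names DOMAIN_TO_CORPORA and pooled_corpora, and the corpus file names,
are hypothetical and not part of any proposal:

```python
# Hypothetical mapping from content domain labels to the training
# corpora a consumer MT tool holds; all names here are illustrative.
DOMAIN_TO_CORPORA = {
    "automotive": ["auto-manuals.tmx"],
    "medical": ["medicine-general.tmx", "medicine-pharma.tmx"],
}


def pooled_corpora(domains, mapping=DOMAIN_TO_CORPORA):
    """Return the corpora for every known domain, in order, without duplicates."""
    pooled = []
    for domain in domains:
        # Domains the tool has no corpora for contribute nothing.
        for corpus in mapping.get(domain, []):
            if corpus not in pooled:
                pooled.append(corpus)
    return pooled


# Content marked with two domains pulls in the corpora of both.
print(pooled_corpora(["automotive", "medical"]))
```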

Even so, I believe we should not specify such behaviour in the
recommendation, since this is really a business/resourcing decision on
the consumer tool side.

For instance, the consumer MT could choose to just ignore the domain and
take a one-size-fits-all approach. Alternatively, it may be selective
about which domains it uses, based on how much of the content is marked
with each domain. For example, with reference to Felix's example, if the
content has hundreds of sentences marked with domains 'automotive' AND
'medical', but only a couple of sentences marked with the additional
domains 'criminal law' and 'property law', the consumer tool may opt to
include its domains 'auto' and 'medicine', but not 'law', since the
improvement in the output doesn't justify the extra training resources.
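
One way a consumer tool might make that selection is sketched below. It
is only an illustration: the function select_domains, the 5% threshold
min_share, and the mapping from content labels to the tool's own domain
names are all assumptions, and a real tool would pick its own policy.

```python
from collections import Counter


def select_domains(sentence_domains, local_names, min_share=0.05):
    """Pick the tool's own domains that cover enough of the content.

    sentence_domains: one list of domain labels per sentence.
    local_names: hypothetical mapping from content labels to the
        consumer tool's own domain names.
    min_share: assumed cut-off, the minimum fraction of sentences a
        domain must cover to be worth the extra training resources.
    """
    total = len(sentence_domains)
    if total == 0:
        return []
    counts = Counter()
    for labels in sentence_domains:
        # Map labels to local names, counting each local name at most
        # once per sentence ('criminal law' and 'property law' both
        # map to 'law' below).
        seen = dict.fromkeys(
            local_names[label] for label in labels if label in local_names
        )
        for name in seen:
            counts[name] += 1
    return [name for name, n in counts.items() if n / total >= min_share]


names = {
    "automotive": "auto",
    "medical": "medicine",
    "criminal law": "law",
    "property law": "law",
}
content = [["automotive", "medical"]] * 200
content += [["automotive", "medical", "criminal law", "property law"]] * 2
print(select_domains(content, names))
```

With these numbers 'law' covers 2 of 202 sentences, well under the
assumed threshold, so only 'auto' and 'medicine' are selected.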

cheers,
Dave




On 10/07/2012 14:05, Yves Savourel wrote:
> Hi Felix, Dave, all,
>
> Sorry, one more question related to the implementation of Domain:
>
> I was looking for an example and ran into this DocBook one:
>
> <article xmlns='http://docbook.org/ns/docbook'>
>   <info>
>    <title>Example of subjectset</title>
>    <subjectset scheme="libraryofcongress">
>     <subject>
>      <subjectterm>Electronic Publishing</subjectterm>
>     </subject>
>     <subject>
>      <subjectterm>SGML (Computer program language)</subjectterm>
>     </subject>
>    </subjectset>
>   </info>
>   <para>Text of the document</para>
> </article>
>
> Where they explain that the //subjectset/subjectterm element indicates the DC subject (so it falls into our domain data category). See http://www.docbook.org/tdg5/publishers/5.1b3/en/html/ch02.html#ch-gsxml.3.8
>
> As you can see, there are actually two entries in the example, so two domains.
> The question is: Can we have more than one domain associated with a piece of content?
>
> Just wondering what the implications are for the tools downstream like MT.
>
> If the answer is 'no', then how do we know which one to use? Do we just leave that decision to the author (i.e. s/he is responsible for providing only a mapping to a single entry per document)?
>
> Or do we provide some kind of default behavior, like: the first or last one wins?
>
> Thanks,
> -yves
>
>
Received on Tuesday, 10 July 2012 20:56:42 UTC