RE: [All] domain data category section proposal, please review

Hi Dave,

Thanks for the explanation.
It sounds fine to me.

I'll make sure my implementation allows for multiple domain values on the same node.

I wonder, even if it's not a recommended behavior, whether this possible multiplicity should be noted somewhere in the specification: supporting it does change how implementations present the results.
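
To make the multiplicity concrete, here is a minimal sketch of collecting every domain value for a document rather than a single one. It assumes the DocBook sample quoted further down in this thread and uses Python's standard xml.etree.ElementTree; the helper name collect_domains is hypothetical and is not taken from any ITS implementation.

```python
import xml.etree.ElementTree as ET

# Namespace of the DocBook sample quoted in this thread.
DOCBOOK_NS = {"db": "http://docbook.org/ns/docbook"}

SAMPLE = """<article xmlns='http://docbook.org/ns/docbook'>
  <info>
   <title>Example of subjectset</title>
   <subjectset scheme="libraryofcongress">
    <subject>
     <subjectterm>Electronic Publishing</subjectterm>
    </subject>
    <subject>
     <subjectterm>SGML (Computer program language)</subjectterm>
    </subject>
   </subjectset>
  </info>
  <para>Text of the document</para>
</article>"""


def collect_domains(xml_text):
    """Return all subjectterm values in document order (hypothetical helper).

    A list is returned, not a single string, so that two or more
    domains on the same content are preserved for downstream tools.
    """
    root = ET.fromstring(xml_text)
    terms = root.findall(
        ".//db:subjectset/db:subject/db:subjectterm", DOCBOOK_NS
    )
    return [t.text.strip() for t in terms if t.text and t.text.strip()]


domains = collect_domains(SAMPLE)
print(domains)
```

Returning a list leaves it to the consumer to decide how to present or combine the values, which is the presentation question raised above.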

Cheers,
-yves



-----Original Message-----
From: Dave Lewis [mailto:dave.lewis@cs.tcd.ie] 
Sent: Tuesday, July 10, 2012 10:56 PM
To: Yves Savourel
Cc: 'Felix Sasaki'; public-multilingualweb-lt@w3.org
Subject: Re: [All] domain data category section proposal, please review

Yves,
Yes, I think it should be possible to have more than one domain associated with a piece of content.

If this maps to multiple domains in the consumer tool, then for statistical machine translation at least this _can_ be handled by including training corpora from all of those domains when training the MT engine you will use, so this is useful information.

Given this case, however, I believe we should not specify such behaviour in the recommendation, since this is really a business/resourcing decision on the consumer tool side.

For instance, the consumer MT could choose to just ignore the domain and take a one-size-fits-all approach. Alternatively, it may be selective about which domains it uses, based on the range of content marked with each domain. E.g., with reference to Felix's example, if the content has hundreds of sentences marked with domain 'automotive' AND 'medical', but only a couple of sentences marked with additional domains 'criminal law' and 'property law', the consumer tool may opt to include its domains 'auto' and 'medicine', but not 'law', since the improvement in the output doesn't justify the extra training resources.
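
One way a consumer tool might make that choice is sketched below. This is only an illustration under stated assumptions: the function name select_domains, the per-sentence list-of-labels input, and the 5% threshold are all hypothetical, and nothing here is proposed for the recommendation.

```python
from collections import Counter


def select_domains(sentence_domains, min_share=0.05):
    """Keep domains that mark at least min_share of the sentences.

    sentence_domains: one list of domain labels per sentence.
    min_share: hypothetical cut-off; a real tool would set this
    from its own business/resourcing constraints.
    """
    total = len(sentence_domains)
    if total == 0:
        return []
    counts = Counter()
    for labels in sentence_domains:
        # set() so a label repeated on one sentence counts once.
        counts.update(set(labels))
    return sorted(d for d, n in counts.items() if n / total >= min_share)


# Hundreds of sentences in two domains, a couple with two extra ones.
content = (
    [["automotive", "medical"]] * 200
    + [["automotive", "medical", "criminal law", "property law"]] * 2
)
selected = select_domains(content)
print(selected)
```

With this input the two law domains each cover 2 of 202 sentences, which falls below the assumed threshold, so only 'automotive' and 'medical' are kept.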

cheers,
Dave




On 10/07/2012 14:05, Yves Savourel wrote:
> Hi Felix, Dave, all,
>
> Sorry, one more question related to the implementation of Domain:
>
> I was looking for examples and ran into this DocBook one:
>
> <article xmlns='http://docbook.org/ns/docbook'>
>   <info>
>    <title>Example of subjectset</title>
>    <subjectset scheme="libraryofcongress">
>     <subject>
>      <subjectterm>Electronic Publishing</subjectterm>
>     </subject>
>     <subject>
>      <subjectterm>SGML (Computer program language)</subjectterm>
>     </subject>
>    </subjectset>
>   </info>
>   <para>Text of the document</para>
> </article>
>
> Where they explain that the //subjectset/subject/subjectterm element indicates
> the DC subject (so it falls into our domain data category). See
> http://www.docbook.org/tdg5/publishers/5.1b3/en/html/ch02.html#ch-gsxml.3.8
>
> As you can see, there are actually two entries in the example, so two domains.
> The question is: Can we have more than one domain associated with a piece of content?
>
> Just wondering what the implications are for the tools downstream like MT.
>
> If the answer is 'no', then how do we know which one to use? Do we just leave that decision to the author (i.e. s/he is responsible for providing only a mapping to a single entry per document)?
>
> Or do we provide some kind of default behavior, like: the first or last one wins?
>
> Thanks,
> -yves
>
>

Received on Tuesday, 10 July 2012 21:47:56 UTC