Re: [All] domain data category section proposal, please review

Hi all,


2012/7/26 Declan Groves <dgroves@computing.dcu.ie>

> Hi Dave,
>
> Thanks for the clarification re. precedence!
>
>
>> However, my understanding from Declan is that he's referring to how
>> domain annotations usually have some precedence in importance, i.e. the text
>> is usually assumed to be largely in one domain (hence parallel text and
>> language model training data is drawn mostly from that domain), and other
>> domains are added to indicate further sources from which training data
>> might be drawn to get better coverage of this particular input. Have I got
>> that right, Declan?
>>
>
>
> Yes - that's what I'm referring to.
>
>
>>
>> So I don't think the ITS rule precedence helps with Declan's
>> domain precedence issue, and the data category as currently defined doesn't
>> support communication of domain precedence to the consumer tool.
>>
>> Now, we already indicate in the note for the domain data category that
>> the consumer tool may choose to ignore, or make its own decision about, the
>> importance of the domains specified, for instance based on the relative
>> volume of content annotated with a domain tag. So we could just rely on
>> that to handle relative domain significance, though that doesn't help in
>> cases where the meta-data tags pointed to are all in the header - but I'm
>> not sure how common a use case that is?
>>
>>
> I would have assumed (although I must note my relative lack of
> experience with processing this type of metadata!) that a particular
> document to be translated would more than likely be of a single domain -
> which is why I had assumed that the meta-data tags would be primarily in the
> header. If smaller pieces of content were annotated with domain
> information, then yes, allowing the tool to make the decision using, as you
> suggested, frequency/volume statistics, would be the best way to go. Domain
> precedence could also be based upon ordering within an existing ontology.
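
As a rough illustration of the frequency/volume idea discussed here, a
consumer tool could rank domains by how much content carries each domain
annotation. This is only a sketch: the function name, the (text, domains)
input shape and the use of character counts as the measure of "volume" are
all assumptions for illustration, not anything defined by ITS.

```python
from collections import Counter


def rank_domains_by_volume(annotated_segments):
    """Rank domains by the number of characters annotated with each one.

    annotated_segments: iterable of (text, domains) pairs, where domains is
    the list of domain labels attached to that piece of content.
    (Hypothetical input shape, chosen for this sketch.)
    """
    volume = Counter()
    for text, domains in annotated_segments:
        for domain in domains:
            volume[domain] += len(text)
    # Highest volume first; ties broken alphabetically for stable output.
    return sorted(volume, key=lambda d: (-volume[d], d))


segments = [
    ("The defendant was charged under section 12.", ["law"]),
    ("The patient presented with a fractured wrist.", ["medicine"]),
    ("Counsel argued the charge was unfounded and sought dismissal.", ["law"]),
]
ranked = rank_domains_by_volume(segments)
print(ranked)
```

Note that this gives nothing useful when every domain value comes from a
single header element, since all of them then cover the same content.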
>
>
>
>> However, if we _do_ want to have a means of communicating the relative
>> significance of domains, a couple of options to do this might be:
>> A)
>> Add a new optional attribute 'domainPrecedence' which contains one or more
>> of the local tags that match the selector and are considered the 'primary'
>> domains of the document, but without the order of those provided being
>> significant (essentially providing a two-tier domain precedence annotation).
>>
>> <its:domainRule selector="/html/body" domainPointer="/html/head/meta[@name='DC.subject']/@content"
>>    domainPrecedence="criminal law, medical"
>>    domainMapping="automotive auto, medical medicine, 'criminal law' law, 'property law' law"/>
>> </its:rules>
>>
>>
>> B)
>> Overload the domainMapping so that the order also represents the
>> significance of the domain. This is a bit messy, however, since we would
>> need to accommodate local annotations that may be more significant but don't
>> require a mapping (e.g. by omitting the RHS of the pair or repeating the
>> LHS).
>>
>> I'm just laying out some options. I think we need a steer from the MT
>> guys on whether the need to convey (rather than calculate based on volume)
>> the relative significance of domains is a common enough use case to be
>> accommodated by such a change and, of course, implemented?
>>
>>
> Option A) would be preferable, I think, and is a great suggestion, as
> it would allow local tools, where necessary, to augment the
> domainPrecedence fairly easily.
>

One thing you should be aware of, just to reiterate: for each of these
(and all other features) we need to provide test cases and have at least two
tools implementing it. That would be at least MaTrex and Lucy. So before
adding this feature, we should be sure about this aspect too.
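
To make the implementation effort for option A) a bit more concrete, here is
a minimal sketch of how a consumer tool might interpret the proposed
attribute. It is not taken from any existing implementation: the function
names are made up, and it assumes that domainPrecedence is a comma-separated
list of local values, that local values are compared case-insensitively, and
that no quoted name in domainMapping contains a comma.

```python
import shlex


def parse_domain_mapping(mapping):
    """Parse "local consumer, 'local name' consumer, ..." into a dict."""
    result = {}
    for pair in mapping.split(","):
        parts = shlex.split(pair)  # honours the single-quoted local names
        if len(parts) != 2:
            raise ValueError(f"malformed mapping pair: {pair!r}")
        local, target = parts
        result[local.lower()] = target
    return result


def parse_domain_precedence(value):
    """Parse the proposed (hypothetical) domainPrecedence attribute."""
    items = (item.strip().strip("'\"") for item in value.split(","))
    return [item.lower() for item in items if item]


def tier_domains(local_values, mapping, precedence):
    """Split local domain values into (primary, secondary) consumer domains."""
    primary, secondary = [], []
    primary_keys = set(precedence)
    for value in local_values:
        key = value.strip().lower()
        # Unmapped local values are passed through unchanged.
        target = mapping.get(key, value.strip())
        tier = primary if key in primary_keys else secondary
        if target not in tier:
            tier.append(target)
    # A consumer domain that is already primary is not repeated as secondary.
    secondary = [d for d in secondary if d not in primary]
    return primary, secondary


MAPPING = "automotive auto, medical medicine, 'criminal law' law, 'property law' law"
PRECEDENCE = "criminal law, medical"
# Values as they might be read from the DC.subject meta element.
local_values = ["automotive", "criminal law", "medical", "property law"]

primary, secondary = tier_domains(
    local_values,
    parse_domain_mapping(MAPPING),
    parse_domain_precedence(PRECEDENCE),
)
print(primary, secondary)
```

With the values from the example rule, this puts "law" and "medicine" in the
primary tier and leaves "auto" as secondary. A test case would need to pin
down exactly these kinds of decisions (case handling, unmapped values,
a consumer domain reached from both tiers).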

Felix


>
> I'd be interested to hear if any of the other MT users/providers have a
> view on this.
>
> Declan
>
>
>
> --
> Dr. Declan Groves
> Research Integration Officer
> Centre for Next Generation Localisation (CNGL)
> Dublin City University
>
> email: dgroves@computing.dcu.ie
> phone: +353 (0)1 700 6906
>



-- 
Felix Sasaki
DFKI / W3C Fellow

Received on Thursday, 26 July 2012 11:55:40 UTC