[ACTION-80] consider consolidation of mtDisambiguationData, namedEntity, terminology and textAnalyticsAnnotation

From: David Lewis <dave.lewis@cs.tcd.ie>
Date: Sat, 28 Apr 2012 01:04:31 +0100
Message-ID: <4F9B340F.1070700@cs.tcd.ie>
To: Tadej Stajner <tadej.stajner@ijs.si>, "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>

Hi Tadej, guys,
I've moved this branch of the thread under a new subject line, matching 
the new action to consider this consolidation (which I assigned to you, 
Tadej).

Could I ask the MT guys (Declan, Pedro, Daniel) to give some insight 
into what is needed for mtDisambiguation? What form of disambiguation 
information would this point to? Would it have different (lexical) 
properties from a regular term base, e.g. contextual conditions? I guess 
this would differ depending on whether the engine is RBMT or SMT, right?

Similarly, is the namedEntityRecognition any different to the general 
sense of terminology?

Tadej, I'd be cautious about treating textAnalyticsAnnotation as the 
superclass here, since the annotation will still often result from 
human source text review/QA rather than from an automated NLP 
component doing text analytics.

The common characteristic would seem to be the need to associate a term 
or phrase with some external information for use in later processing, as 
is broadly supported by the current terminology data category. The 
definition in http://www.w3.org/TR/2007/REC-its-20070403/#terminology is 
actually fairly loose in this regard - it doesn't specify that the data 
category be used for terminology management specifically, despite what 
the name would indicate.

Could the same approach of simply associating with external information 
be taken regardless of whether this is a link to a term base (including 
term translations and definitions), a link to a conceptual node in a 
semantic web ontology or lexical store, or a link to some special MT 
disambiguation store? Or does the differing nature of these external 
resources require a hint in the data category name that they should be 
accessed in different ways?

Perhaps you guys could list out some of the use cases in a bit more 
detail. That should make it clearer what the commonalities really are, 
so we can judge whether the categories can really be consolidated or 
whether they represent distinct use cases that should be kept 
separate. The latter is fine: we should not consolidate for the sake of 
it, as forcing unrelated functionality together in one data category 
could harm its uptake by implementers.

Also, consider whether the need for additional attributes requires a 
separation. For instance, terminology associations arising from NLP may 
benefit from a confidence score, as we discussed previously. But perhaps 
that only needs an additional optional attribute to accompany the 
terminology attribute?
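
To make that last option concrete, here is a rough sketch (Python, 
purely illustrative) of a single annotation that links a term or phrase 
to an external resource and carries an optional confidence score. The 
names TermAnnotation, ref and confidence and the example.org URIs are 
assumptions made up for this sketch; none of them comes from the ITS 
spec or from any of the proposed data categories.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TermAnnotation:
    """Hypothetical record: a text span linked to an external resource."""

    text: str                           # the annotated term or phrase
    ref: str                            # URI of a term base entry, ontology node, etc.
    confidence: Optional[float] = None  # set only when a tool produced the link

    def __post_init__(self) -> None:
        # Assumption for this sketch: scores are normalised to the range 0..1.
        if self.confidence is not None and not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


# Annotation from human review: no confidence score attached.
manual = TermAnnotation("data category", "http://example.org/termbase#dc")

# Annotation from automated text analytics: same structure plus a score.
automatic = TermAnnotation("Dublin", "http://example.org/onto#Dublin", confidence=0.87)
```

If one record like this covers both the human and the automated case, 
that would argue for an optional attribute rather than a separate data 
category, at least as far as confidence is concerned.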

Note, if we are interested in recording properties of the process, e.g. 
which text analytics engine or terminology expert was involved in 
entering the attribute into a document, this may be better captured 
using the provenance data category.

Sorry for the long post, but please try to advance the discussion 
before the call next Friday.

cheers,
Dave


On 26/04/2012 14:29, Tadej Stajner wrote:
> On 4/26/2012 2:23 PM, David Lewis wrote:
>> Dear all,
>> I have four further suggestions for consolidating requirements that 
>> I'd like to discuss briefly on the call with the relevant people:
>>
>> Pedro, Daniel, Declan, Tadej: I think there may be an opportunity to 
>> consolidate mtDisambiguationData, namedEntity, terminology and 
>> textAnalyticsAnnotation. For instance, is MT disambiguation really 
>> terminology support for MT?
>>
>
> Yes, they all have a lot in common. The way I see it, 
> textAnalyticsAnnotation is the common superclass of the other three: 
> mtDisambiguation seems to focus on difficult content, namedEntity on 
> named entities and term on terms. They all allow referring to an 
> ontology URI behind the fragment they are annotating. This property 
> is equivalent across all three, but there are specifics in each category.
>
> - what would qualify as difficult content under mtDisambiguation?
> - is there anything MT-specific in mtDisambiguation? Or can we call it 
> simply "disambiguation"?
> - mtDisambiguation-domainSelector is very similar in functionality to 
> term-terminologyResource, could we consolidate those?
> - namedEntity-type can be seen as a special case of a 
> mtDisambiguation-semanticSelector;
>
> My recommendation would be to gather some common properties and pull 
> them into the textAnalyticsAnnotation superclass.
>
>> comments welcome,
>> Dave
>>
>>
>
Received on Saturday, 28 April 2012 00:04:58 UTC