RE: [ACTION-80] consider consolidation of mtDisambiguationData, namedEntity, terminology and textAnalyticsAnnotation from Thomas Ruedesheim on 2012-05-10 (public-multilingualweb-lt@w3.org from May 2012)

From: Thomas Ruedesheim <thomas.ruedesheim@lucysoftware.com>
Date: Thu, 10 May 2012 15:37:55 +0200
To: "Tadej Stajner" <tadej.stajner@ijs.si>
Cc: <public-multilingualweb-lt@w3.org>
Message-ID: <D0689FBE85FD1246A4EE317903897919F44CA9@team.lucysoftware.com>

Hi Tadej,

I generally agree with your points. Which range of values would you
suggest for the 'concept' property? From the perspective of an MT tool
provider, a closed set would be preferred.

Thomas

-----Original Message-----
From: Tadej Stajner [mailto:tadej.stajner@ijs.si] 
Sent: Thursday, 10 May 2012 14:07
To: Thomas Ruedesheim
Cc: public-multilingualweb-lt@w3.org
Subject: Re: [ACTION-80] consider consolidation of mtDisambiguationData,
namedEntity, terminology and textAnalyticsAnnotation

Hi,

I didn't mention some details about textAnalyticsAnnotation that became
clearer at the last call (the results of which are not yet reflected in
the Requirements page): although one could interpret it as a superclass
(as I did until then), the other part of the interpretation
is to express *how* individual annotations were generated, recording:

- tool that was used for annotation (tool name, URI)
- confidence in the tool output (0.0 - 1.0)

The reason for separating this out is that people might just as well
manually annotate entities or terms in their content, in which case
"textAnalyticsAnnotation" makes no sense, since it doesn't involve any
text analytics tools. This makes 'textAnalyticsAnnotation' ambiguous,
so I suggest some changes that would avoid using that expression.

Following this logic, we are left with the 'tool' and 'confidence'
properties. Looking at the requirements, we already have 'author' under
the Provenance section and 'mtConfidence' under Translation. Could we
expand the scope of 'author' to allow annotating individual fragments and
generalize 'mtConfidence' into 'confidence' that would be applicable to
any automatic annotation?

What I propose is:

- Provenance.author extended to represent automatic annotators, allowed
to annotate fragments (if it doesn't already);
- Translation.mtConfidence generalized to 'confidence' so it can also
cover the auto annotation case;
- Terminology.conceptMention introduced as an abstract class that is the
umbrella term (equivalent to what used to be textAnalyticsAnnotation, but
without the connotation that it was automatically generated);
- Terminology.mtDisambiguation generalized to
Terminology.disambiguation, being a subclass of conceptMention and
additionally having a set of 'labels' in alternative languages; it would
be used to disambiguate arbitrary fragments of text, like specific
phrases, individual words, etc.;
- Terminology.namedEntity becomes a subclass of disambiguation, with the
added 'type';
- Terminology.term becomes a subclass of disambiguation, with the added
'terminology lexicon'
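
To make the proposed hierarchy concrete, here is a rough sketch using
Python dataclasses, purely for illustration. The field names (fragment,
concept, author, confidence, labels, entity_type, lexicon) and the
example.org URIs are my assumptions, not agreed names from the
Requirements page:

```python
from abc import ABC
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ConceptMention(ABC):
    """A text fragment that mentions a concept (umbrella class)."""
    fragment: str                       # the annotated text
    concept: str                        # concept reference, e.g. a URI
    author: Optional[str] = None        # person or tool that annotated it
    confidence: Optional[float] = None  # 0.0 - 1.0; None if not applicable

    def __post_init__(self):
        # ABC alone does not block instantiation without abstract methods,
        # so the abstract nature is enforced explicitly here.
        if type(self) is ConceptMention:
            raise TypeError("ConceptMention is abstract")
        if self.confidence is not None and not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within 0.0 - 1.0")


@dataclass
class Disambiguation(ConceptMention):
    """Adds labels in alternative languages, keyed by language code."""
    labels: Dict[str, str] = field(default_factory=dict)


@dataclass
class NamedEntity(Disambiguation):
    """Adds the type of the entity, e.g. a URI describing that type."""
    entity_type: Optional[str] = None


@dataclass
class Term(Disambiguation):
    """Adds a reference to the terminology lexicon."""
    lexicon: Optional[str] = None


# Hypothetical example: an automatically annotated named entity.
entity = NamedEntity(
    fragment="Ljubljana",
    concept="http://example.org/concept/Ljubljana",
    author="example-ner-tool",
    confidence=0.9,
    labels={"de": "Laibach"},
    entity_type="http://example.org/type/City",
)
```

A manual annotation would simply leave 'confidence' unset and put the
person in 'author', which is the reason for moving both properties out
of the class that describes the mention itself.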

The remaining open question is how the 'semantic selector' property
differs from the 'concept reference'. Does it need to be its own
property, or is it fine if we just allow the 'concept' property to
accept various formats of selectors, not just URIs?

-- Tadej

On 5/10/2012 1:38 PM, Thomas Ruedesheim wrote:
>
> Hi Tadej, hi all,
>
> You are apparently right, these data categories are strongly
> interrelated. In our opinion, 'textAnalyticsAnnotation' is the umbrella
> for the remaining categories in the Terminology section. We would
> suggest dropping it in favour of the others.
>
> I would rename 'mtDisambiguation' as 'disambiguation', because its
> usage might not be MT specific. As Pedro already said, this tag may 
> add some info to the more general 'domain' category without proposing 
> concrete target terms. Its only attribute could be:
>    'semantic selector': a URI pointing into a common ontology.
>
> Both 'namedEntity' and 'terminology' categories seem to be clear (see 
> below).
>
> Best,
> Thomas
>
> -----Original Message-----
> From: Tadej Stajner [mailto:tadej.stajner@ijs.si]
> Sent: Wednesday, 9 May 2012 19:50
> To: public-multilingualweb-lt@w3.org
> Subject: [ACTION-80] consider consolidation of mtDisambiguationData, 
> namedEntity, terminology and textAnalyticsAnnotation
>
> Hi, all,
>
> this question is mostly directed to people working in MT with regard 
> to disambiguation.
>
> Since we came to a conclusion that there is strong overlap between the
> following data categories, we're consolidating them:
> mtDisambiguationData
> namedEntity
> terminology
> textAnalyticsAnnotation
>
> First of all, there is an obvious common part to the first three. 
> Let's call it the 'concept mention' recipe. It's meant to represent 
> that some fragment of text is lexicalizing (mentioning) some concept
> with a URI.
>
> namedEntity has the following specifics:
> - type of entity (pointing to a URI describing that type)
> - alternative labels (names in different languages)
>
> terminology has the following specifics:
> - terminology lexicon
> - alternative labels
>
> mtDisambiguation also has the concept URI, but additionally defines
> - 'disambiguation data'
> - 'semantic selector'
>
> The open question is: do these two additional attributes bring
> any additional information if we already have the fragment
> disambiguated with the URI?
>
>    If not, is there anything else in mtDisambiguation that could not 
> be covered by the namedEntity and terminology categories?
>
> thanks for the input,
> -- Tadej
Received on Thursday, 10 May 2012 17:40:34 UTC