Re: [ACTION-80] consider consolidation of mtDisambiguationData, namedEntity, terminology and textAnalyticsAnnotation from Tadej Stajner on 2012-05-10 (public-multilingualweb-lt@w3.org from May 2012)

From: Tadej Stajner <tadej.stajner@ijs.si>
Date: Thu, 10 May 2012 14:06:59 +0200
To: Thomas Ruedesheim <thomas.ruedesheim@lucysoftware.com>
CC: public-multilingualweb-lt@w3.org
Message-ID: <4FABAF63.10202@ijs.si>

Hi,

I didn't mention some details about textAnalysisAnnotation that became
clearer at the last call (the results of which are not yet reflected in
the Requirements page): although one could interpret it as a superclass
(as I did myself until then), the other part of the interpretation
is that it expresses *how* individual annotations were generated, by
recording:

- tool that was used for annotation (tool name, URI)
- confidence in the tool output (0.0 - 1.0)

The reason for separating this out is that people might just as well
manually annotate entities or terms in their content, in which case
"textAnalyticsAnnotation" makes no sense, since no text analytics tools
are involved. This makes 'textAnalyticsAnnotation' ambiguous,
so I suggest some changes that would avoid using that expression.

Following this logic, we are left with the 'tool' and 'confidence'
properties. Looking at the requirements, we already have 'author' under
the Provenance section and 'mtConfidence' under Translation. Could we
expand the scope of 'author' to allow annotating individual fragments and
generalize 'mtConfidence' into 'confidence' that would be applicable to
any automatic annotation?

What I propose is:

- Provenance.author extended to represent automatic annotators, allowed 
to annotate fragments (if it doesn't already);
- Translation.mtConfidence generalized to 'confidence' so it can also 
cover the auto annotation case;
- Terminology.conceptMention introduced as an abstract class that is the
umbrella term (equivalent to what used to be textAnalysisAnnotation, but
without the connotation that it was automatically generated);
- Terminology.mtDisambiguation generalized to
Terminology.disambiguation, a subclass of conceptMention that
additionally has a set of 'labels' in alternative languages. It would
be used to disambiguate arbitrary fragments of text, such as specific
phrases, individual words, etc.;
- Terminology.namedEntity becomes a subclass of disambiguation, with the 
added 'type';
- Terminology.term becomes a subclass of disambiguation, with the added 
'terminology lexicon'
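
As a rough illustration only (it is not part of the Requirements page),
the hierarchy proposed above could be sketched in Python as follows. All
class and field names, the example tool name and the example.org URIs
are assumptions made for this sketch; only the subclass relations, the
added properties and the 0.0 - 1.0 confidence range come from the
proposal itself.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Provenance:
    """Who or what produced an annotation (hypothetical field names)."""
    author: str                         # person or tool name
    author_uri: Optional[str] = None    # tool URI, if there is one
    confidence: Optional[float] = None  # None for manual annotation

    def __post_init__(self) -> None:
        # The proposal gives the confidence range as 0.0 - 1.0.
        if self.confidence is not None and not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")


@dataclass
class ConceptMention:
    """Umbrella class: a text fragment that mentions some concept."""
    fragment: str
    concept: str                        # concept reference, e.g. a URI
    provenance: Optional[Provenance] = None


@dataclass
class Disambiguation(ConceptMention):
    """A concept mention plus labels in alternative languages."""
    labels: Dict[str, str] = field(default_factory=dict)  # language -> label


@dataclass
class NamedEntity(Disambiguation):
    """A disambiguation with the added 'type'."""
    entity_type: Optional[str] = None   # URI describing the entity type


@dataclass
class Term(Disambiguation):
    """A disambiguation with the added 'terminology lexicon'."""
    lexicon: Optional[str] = None


# An automatically produced named-entity annotation (made-up values).
mention = NamedEntity(
    fragment="Ljubljana",
    concept="http://example.org/concept/Ljubljana",
    provenance=Provenance(author="example-ner-tool", confidence=0.9),
    labels={"de": "Laibach"},
    entity_type="http://example.org/type/City",
)
```

A manual annotation would simply leave 'confidence' unset, which is the
case that made "textAnalyticsAnnotation" a misleading name.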

The remaining open question is how the 'semantic selector' property
differs from the 'concept reference'. Does it need to be its own
property, or is it fine if we just allow the 'concept' property to
accept various formats of selectors, not just URIs?

-- Tadej

On 5/10/2012 1:38 PM, Thomas Ruedesheim wrote:
>
> Hi Tadej, hi all,
>
> You are apparently right, these data categories are strongly
> interrelated. In our opinion, 'textAnalysisAnnotation' is the umbrella
> for the remaining categories in the Terminology section. We would
> suggest dropping it in favour of the others.
>
> I would rename 'mtDisambiguation' as 'disambiguation', because its usage
> might not be MT specific. As Pedro already said, this tag may add some
> info to the more general 'domain' category without proposing concrete
> target terms. Its only attribute could be:
>    'semantic selector': a URI pointing into a common ontology.
>
> Both 'namedEntity' and 'terminology' categories seem to be clear (see
> below).
>
> Best,
> Thomas
>
> -----Original Message-----
> From: Tadej Stajner [mailto:tadej.stajner@ijs.si]
> Sent: Mittwoch, 9. Mai 2012 19:50
> To: public-multilingualweb-lt@w3.org
> Subject: [ACTION-80] consider consolidation of mtDisambiguationData,
> namedEntity, terminology and textAnalyticsAnnotation
>
> Hi, all,
>
> this question is mostly directed to people working in MT with regard to
> disambiguation.
>
> Since we came to a conclusion that there is strong overlap between the
> following data categories, we're consolidating them:
> mtDisambiguationData
> namedEntity
> terminology
> textAnalyticsAnnotation
>
> First of all, there is an obvious common part to the first three. Let's
> call it the 'concept mention' recipe. It's meant to represent that some
> fragment of text is lexicalizing (mentioning) some concept with a URI.
>
> namedEntity has the following specifics:
> - type of entity (pointing to a URI describing that type)
> - alternative labels (names in different languages)
>
> terminology has the following specifics:
> - terminology lexicon
> - alternative labels
>
> mtDisambiguation also has the concept URI, but additionally defines
> - 'disambiguation data'
> - 'semantic selector'
>
> The open question is: do these two additional attributes bring any
> additional information if we already have the fragment disambiguated with
> the URI?
>
>    If not, is there anything else in mtDisambiguation that could not be
> covered by the namedEntity and terminology categories?
>
> thanks for the input,
> -- Tadej
Received on Thursday, 10 May 2012 12:07:57 UTC