Re: [ACTION-80] consider consolidation of mtDisambiguationData, namedEntity, terminology and textAnalyticsAnnotation

Hi, Thomas,
It's hard to promise a strict closed set for this use case, since 
describing concepts that are mentioned in text is as open domain as it 
gets. What we can reasonably require is the following:

- the concept should be dereferenceable, so that additional information 
about the concept is available, either via a URI or via an XPath expression 
(or via an XPath expression to the URI). Here, we can at least have some 
idea of what is well-formed.
- in the case of terms, users should point to the terminology 
lexicon that defines the list of terms. Here, we can actually validate 
the values.
- in the case of named entities, there may be only one type.
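
To make the three checks above concrete, here is a rough sketch in Python.
It is only an illustration, not a proposed format: the class and field
names (ConceptMention, concept_kind, category, lexicon, entity_types) are
made up for this example, the DBpedia URIs in the usage note are just
sample values, and the URI test only looks at well-formedness (a scheme
followed by something), not at whether the URI actually resolves.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Scheme syntax as in RFC 3986, followed by a non-empty remainder.
# This checks well-formedness only; nothing is dereferenced.
_URI_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:\S+$")


@dataclass
class ConceptMention:
    """Hypothetical record for one annotated fragment of text."""
    concept: str                           # URI or XPath expression
    concept_kind: str = "uri"              # "uri" or "xpath"
    category: str = "disambiguation"       # or "term", "namedEntity"
    lexicon: Optional[str] = None          # terminology lexicon (terms only)
    entity_types: List[str] = field(default_factory=list)  # named entities only


def check(m: ConceptMention) -> List[str]:
    """Return a list of problems; an empty list means the mention passes."""
    errors = []
    if m.concept_kind == "uri":
        if not _URI_RE.match(m.concept):
            errors.append("concept is not a well-formed absolute URI")
    elif m.concept_kind == "xpath":
        # Only a non-empty check here; real XPath validation needs a parser.
        if not m.concept.strip():
            errors.append("concept XPath expression is empty")
    else:
        errors.append("unknown concept_kind")
    if m.category == "term" and not m.lexicon:
        errors.append("term does not point to a terminology lexicon")
    if m.category == "namedEntity" and len(m.entity_types) > 1:
        errors.append("named entity has more than one type")
    return errors
```

For instance, check(ConceptMention("http://dbpedia.org/resource/Ljubljana",
category="namedEntity",
entity_types=["http://dbpedia.org/ontology/City"])) returns an empty list,
while a term without a lexicon, or a named entity with two types, gets
flagged. Values of the 'concept' property itself stay open; only their
form is checked.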

-- Tadej

On 5/10/2012 3:37 PM, Thomas Ruedesheim wrote:
> Hi Tadej,
>
> I would generally agree with your points. Which range of values would you
> suggest for the 'concept' property? From the perspective of an MT tool
> provider, a closed set would be preferred.
>
> Thomas
>
> -----Original Message-----
> From: Tadej Stajner [mailto:tadej.stajner@ijs.si]
> Sent: Thursday, 10 May 2012 14:07
> To: Thomas Ruedesheim
> Cc: public-multilingualweb-lt@w3.org
> Subject: Re: [ACTION-80] consider consolidation of mtDisambiguationData,
> namedEntity, terminology and textAnalyticsAnnotation
>
> Hi,
>
> I didn't mention some details about textAnalysisAnnotation that became
> clearer at the last call (the results of which are not yet reflected in
> the Requirements page): although one could interpret it as a superclass
> (as I did until then), the other part of the interpretation
> is that it expresses *how* individual annotations were generated, having:
>
> - tool that was used for annotation (tool name, URI)
> - confidence in the tool output (0.0 - 1.0)
>
> The reason for separating this out is that people might just as well
> annotate entities or terms in their content manually, in which case
> "textAnalyticsAnnotation" makes no sense, since it doesn't involve any
> text analytics tools. This makes 'textAnalyticsAnnotation' ambiguous,
> so I suggest some changes that would avoid using that expression.
>
> Following this logic, we are left with the 'tool' and 'confidence'
> properties. Looking at the requirements, we already have 'author' under
> the Provenance section and 'mtConfidence' under Translation. Could we
> expand the scope of 'author' to allow annotating individual fragments and
> generalize 'mtConfidence' into 'confidence' that would be applicable to
> any automatic annotation?
>
> What I propose is:
>
> - Provenance.author extended to represent automatic annotators, allowed
> to annotate fragments (if it doesn't already);
> - Translation.mtConfidence generalized to 'confidence' so it can also
> cover the auto annotation case;
> - Terminology.conceptMention introduced as an abstract class that is the
> umbrella term (equivalent to what used to be textAnalysisAnnotation, but
> without the connotation that it was automatically generated);
> - Terminology.mtDisambiguation generalized to
> Terminology.disambiguation, being a subclass of conceptMention and
> additionally having a set of 'labels' in alternative languages. It would
> be used to disambiguate arbitrary fragments of text, like specific
> phrases, individual words, etc.;
> - Terminology.namedEntity becomes a subclass of disambiguation, with the
> added 'type';
> - Terminology.term becomes a subclass of disambiguation, with the added
> 'terminology lexicon'.
>
> The remaining open question is how the 'semantic selector' property
> differs from the 'concept reference'. Does it need to be its own
> property, or is it fine if we just allow the 'concept' property to
> accept various formats of selectors, not just URIs?
>
> -- Tadej
>
> On 5/10/2012 1:38 PM, Thomas Ruedesheim wrote:
>> Hi Tadej, hi all,
>>
>> You are apparently right, these data categories are strongly
>> interrelated. In our opinion, 'textAnalysisAnnotation' is the umbrella
>> for the remaining categories in the Terminology section. We would
>> suggest dropping it in favour of the others.
>>
>> I would rename 'mtDisambiguation' as 'disambiguation', because its
>> usage might not be MT specific. As Pedro already said, this tag may
>> add some info to the more general 'domain' category without proposing
>> concrete target terms. Its only attribute could be:
>>     'semantic selector': a URI pointing into a common ontology.
>>
>> Both 'namedEntity' and 'terminology' categories seem to be clear (see
>> below).
>>
>> Best,
>> Thomas
>>
>> -----Original Message-----
>> From: Tadej Stajner [mailto:tadej.stajner@ijs.si]
>> Sent: Wednesday, 9 May 2012 19:50
>> To: public-multilingualweb-lt@w3.org
>> Subject: [ACTION-80] consider consolidation of mtDisambiguationData,
>> namedEntity, terminology and textAnalyticsAnnotation
>>
>> Hi, all,
>>
>> this question is mostly directed to people working in MT with regard
>> to disambiguation.
>>
>> Since we came to a conclusion that there is strong overlap between the
>> following data categories, we're consolidating them:
>> mtDisambiguationData
>> namedEntity
>> terminology
>> textAnalyticsAnnotation
>>
>> First of all, there is an obvious common part to the first three.
>> Let's call it the 'concept mention' recipe. It's meant to represent
>> that some fragment of text is lexicalizing (mentioning) some concept
>> with a URI.
>> namedEntity has the following specifics:
>> - type of entity (pointing to a URI, describing that type)
>> - alternative labels (names in different languages)
>>
>> terminology has the following specifics:
>> - terminology lexicon
>> - alternative labels
>>
>> mtDisambiguation also has the concept URI, but additionally defines
>> - 'disambiguation data'
>> - 'semantic selector'
>>
>> The open question is: do these two additional attributes bring
>> any additional information if we already have the fragment
>> disambiguated with the URI?
>>
>>     If not, is there anything else in mtDisambiguation that could not
>> be covered by the namedEntity and terminology categories?
>>
>> thanks for the input,
>> -- Tadej
>>

Received on Thursday, 10 May 2012 13:48:25 UTC