Re: [ACTION-80] consider consolidation of mtDisambiguationData, namedEntity, terminology and textAnalyticsAnnotation from Sebastian Hellmann on 2012-05-11 (public-multilingualweb-lt@w3.org from May 2012)

From: Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
Date: Fri, 11 May 2012 09:18:41 +0200
To: Tadej Stajner <tadej.stajner@ijs.si>
CC: Thomas Ruedesheim <thomas.ruedesheim@lucysoftware.com>, public-multilingualweb-lt@w3.org
Message-ID: <4FACBD51.7000105@informatik.uni-leipzig.de>
Hi Thomas and Tadej,
Regarding the set of concepts, please have a look at the NERD ontology: 
http://nerd.eurecom.fr/ontology/
Giuseppe Rizzo and Raphael Troncy from NERD made a mapping of virtually 
all major concept repositories used in NER.
Best practice would be either to reuse concepts where a mapping to NERD 
already exists or to create a NERD mapping for your concepts.
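
As a rough illustration of the second option, a NERD mapping can be as simple as a lookup table from your own concept URIs to NERD types. This is only a sketch: the example.org URIs are hypothetical, and the NERD namespace and class names used here are assumptions that should be checked against the ontology itself.

```python
from typing import List, Optional

# Assumed namespace and class names; verify against the NERD ontology.
NERD = "http://nerd.eurecom.fr/ontology#"

# Hypothetical local concept URIs mapped to assumed NERD types.
LOCAL_TO_NERD = {
    "http://example.org/concepts/Company": NERD + "Organization",
    "http://example.org/concepts/City": NERD + "Location",
    "http://example.org/concepts/Politician": NERD + "Person",
}


def to_nerd(concept_uri: str) -> Optional[str]:
    """Return the mapped NERD type, or None if no mapping exists yet."""
    return LOCAL_TO_NERD.get(concept_uri)


def unmapped(concept_uris: List[str]) -> List[str]:
    """List the concepts that still need a NERD mapping."""
    return [uri for uri in concept_uris if to_nerd(uri) is None]
```

Calling unmapped() on the concept set of a tool gives the list of concepts for which a mapping would still have to be created.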

You could consider using the DBpedia Ontology with its 300 concepts.
+ they are widely used and a mapping to NERD exists
+ you can edit and add concepts, if they are missing (it is a community 
ontology: 
http://mappings.dbpedia.org/index.php/How_to_edit_the_DBpedia_Ontology)
- editable implies not stable (although the change rate is not really 
high; I think we have three new concepts a year)
- the concepts need to reflect the data in DBpedia (which is not bad if 
you are also linking your entities to Wikipedia/DBpedia URIs)

If I recall correctly, the core of NERD will be reduced to 40 core types 
with an additional extension for more granularity. But you would have to 
ask them directly.
NIF and NERD are also compatible: 
http://nerd.eurecom.fr/ui/paper/Rizzo_Troncy_Hellmann_Bruemmer-ldow2012.pdf

All the best,
Sebastian



On 05/10/2012 03:47 PM, Tadej Stajner wrote:
> Hi, Thomas,
> It's hard to promise a strict closed set for this use case, since 
> describing concepts that are mentioned in text is as open domain as it 
> gets. What we can reasonably require is the following:
>
> - the concept should be dereferenceable so that additional information 
> about the concept is available, either via a URI or via an XPath 
> expression (or via an XPath expression to the URI); here, we can at 
> least have some idea of what is well-formed.
> - in the case of terms, the users should point to the terminology 
> lexicon that defines the list of terms; here, we can actually validate 
> the values.
> - in the case of named entities, there may be only one type.
>
> -- Tadej
>
> On 5/10/2012 3:37 PM, Thomas Ruedesheim wrote:
>> Hi Tadej,
>>
>> I would generally agree with your points. Which range of values would you
>> suggest for the 'concept' property? From the perspective of an MT tool
>> provider, a closed set would be preferred.
>>
>> Thomas
>>
>> -----Original Message-----
>> From: Tadej Stajner [mailto:tadej.stajner@ijs.si]
>> Sent: Donnerstag, 10. Mai 2012 14:07
>> To: Thomas Ruedesheim
>> Cc: public-multilingualweb-lt@w3.org
>> Subject: Re: [ACTION-80] consider consolidation of mtDisambiguationData,
>> namedEntity, terminology and textAnalyticsAnnotation
>>
>> Hi,
>>
>> I didn't mention some details about textAnalysisAnnotation that became
>> clearer at the last call (the results of which are not reflected yet in
>> the Requirements page): although one could interpret it as a superclass
>> (as I did until then), the other part of the interpretation
>> is to express *how* individual annotations were generated, having:
>>
>> - tool that was used for annotation (tool name, URI)
>> - confidence in the tool output (0.0 - 1.0)
>>
>> The reason for separating this out is that people might as well manually
>> annotate entities or terms in their content, in which case
>> "textAnalyticsAnnotation" makes no sense, since it doesn't involve any
>> text analytics tools. This makes 'textAnalyticsAnnotation' ambiguous,
>> so I suggest some changes that would avoid using that expression.
>>
>> Following this logic, we are left with the 'tool' and 'confidence'
>> properties. Looking at the requirements, we already have 'author' under
>> the Provenance section and 'mtConfidence' under Translation. Could we
>> expand the scope of author to allow annotating individual fragments and
>> generalize 'mtConfidence' into 'confidence' that would be applicable to
>> any auto annotation?
>>
>> What I propose is:
>>
>> - Provenance.author extended to represent automatic annotators, allowed
>> to annotate fragments (if it doesn't already);
>> - Translation.mtConfidence generalized to 'confidence' so it can also
>> cover the auto annotation case;
>> - Terminology.conceptMention introduced as an abstract class that is the
>> umbrella term (equivalent to what used to be textAnalysisAnnotation, but
>> without the connotation that it was automatically generated);
>> - Terminology.mtDisambiguation generalized to
>> Terminology.disambiguation, being a subclass of conceptMention,
>> additionally having a set of 'labels' in alternative languages; it would
>> be used for disambiguating arbitrary fragments of text, like specific
>> phrases, individual words, etc.
>> - Terminology.namedEntity becomes a subclass of disambiguation, with the
>> added 'type';
>> - Terminology.term becomes a subclass of disambiguation, with the added
>> 'terminology lexicon'
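
The hierarchy proposed in the list above could be sketched roughly as follows. This is a minimal illustration, not a normative model: field names such as fragment, entity_type and lexicon are placeholders that do not come from the Requirements page, the tool name is hypothetical, and the DBpedia URIs in the usage example are only there for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ConceptMention:
    """Umbrella class: a text fragment that mentions a concept."""
    fragment: str
    concept: str                         # URI or another kind of selector
    author: Optional[str] = None         # person or automatic annotator
    confidence: Optional[float] = None   # set only for automatic annotation

    def __post_init__(self) -> None:
        # The thread gives the confidence range as 0.0 - 1.0.
        if self.confidence is not None and not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0.0, 1.0]")


@dataclass
class Disambiguation(ConceptMention):
    """Adds labels in alternative languages (language code -> label)."""
    labels: Dict[str, str] = field(default_factory=dict)


@dataclass
class NamedEntity(Disambiguation):
    """Adds the entity type, given as a URI describing that type."""
    entity_type: Optional[str] = None


@dataclass
class Term(Disambiguation):
    """Adds a pointer to the terminology lexicon defining the term."""
    lexicon: Optional[str] = None


# Usage: an automatically annotated named entity (illustrative values).
mention = NamedEntity(
    fragment="Leipzig",
    concept="http://dbpedia.org/resource/Leipzig",
    author="example-ner-tool",  # hypothetical tool name
    confidence=0.9,
    labels={"de": "Leipzig"},
    entity_type="http://dbpedia.org/ontology/City",
)
```

A manual annotation would simply leave confidence unset, which is the case the generalized 'confidence' property is meant to allow.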
>>
>> The remaining open question is how the 'semantic selector' property
>> differs from the 'concept reference'. Does it need to be its own
>> property, or is it fine if we just allow the 'concept' property to
>> accept various formats of selectors, not just URIs?
>>
>> -- Tadej
>>
>> On 5/10/2012 1:38 PM, Thomas Ruedesheim wrote:
>>> Hi Tadej, hi all,
>>>
>>> You are apparently right, these data categories are strongly
>>> interrelated. In our opinion, 'textAnalysisAnnotation' is the umbrella
>>> for the remaining categories in the Terminology section. We would
>>> suggest dropping it in favour of the others.
>>>
>>> I would rename 'mtDisambiguation' as 'disambiguation', because its
>>> usage might not be MT specific. As Pedro already said, this tag may
>>> add some info to the more general 'domain' category without proposing
>>> concrete target terms. Its only attribute could be:
>>>     'semantic selector': a URI pointing into a common ontology.
>>>
>>> Both 'namedEntity' and 'terminology' categories seem to be clear (see
>>> below).
>>>
>>> Best,
>>> Thomas
>>>
>>> -----Original Message-----
>>> From: Tadej Stajner [mailto:tadej.stajner@ijs.si]
>>> Sent: Mittwoch, 9. Mai 2012 19:50
>>> To: public-multilingualweb-lt@w3.org
>>> Subject: [ACTION-80] consider consolidation of mtDisambiguationData,
>>> namedEntity, terminology and textAnalyticsAnnotation
>>>
>>> Hi, all,
>>>
>>> this question is mostly directed to people working in MT with regard
>>> to disambiguation.
>>>
>>> Since we came to a conclusion that there is strong overlap between the
>>> following data categories, we're consolidating them:
>>> mtDisambiguationData
>>> namedEntity
>>> terminology
>>> textAnalyticsAnnotation
>>>
>>> First of all, there is an obvious common part to the first three.
>>> Let's call it the 'concept mention' recipe. It's meant to represent
>>> that some fragment of text is lexicalizing (mentioning) some concept
>>> with a URI.
>>> namedEntity has the following specifics:
>>> - type of entity (pointing to a URI, describing that type)
>>> - alternative labels (names in different languages)
>>>
>>> terminology has the following specifics:
>>> - terminology lexicon
>>> - alternative labels
>>>
>>> mtDisambiguation also has the concept URI, but additionally defines
>>> - 'disambiguation data'
>>> - 'semantic selector'
>>>
>>> The open question is: do these two additional attributes bring
>>> any additional information if we already have the fragment
>>> disambiguated with the URI?
>>>
>>>     If not, is there anything else in mtDisambiguation that could not
>>> be covered by the namedEntity and terminology categories?
>>>
>>> thanks for the input,
>>> -- Tadej
>>>
>>>
>>>
>>>
>>>
>>>
>
>


-- 
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Projects: http://nlp2rdf.org , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org
Received on Friday, 11 May 2012 07:19:22 UTC