
Re: Atb.: Disambiguation and terminology producers (Re: issue-68 (Re: Comment on ITS 2.0 WD-its20-20121206 - Disambiguation (and term)))

From: Tadej Stajner <tadej.stajner@ijs.si>
Date: Thu, 24 Jan 2013 11:02:01 +0100
Message-ID: <51010699.8040202@ijs.si>
To: public-multilingualweb-lt-comments@w3.org
Hi, all,
here are some slides with examples of the merge.
-- Tadej

On 1/23/2013 1:55 AM, Tadej Stajner wrote:
> Hi, thanks for the excellent summary, Marcis,
> with regard to the granularity level issue, I see that you identified 
> two new constraints that the current formulation does not satisfy:
> - having simultaneous annotations of the same fragment with a 
> different level;
> - too few/too restrictive levels;
>
> Depending on how critical these issues are, I suggest some solutions 
> we could discuss tomorrow:
> 1) define a separate attribute for every level, allowing for much 
> nicer annotations, and the possibility to have multiple annotations on 
> the same fragment. The downside is that we need to fix what levels are 
> supported, since we're creating new attributes for every level, thus 
> failing the second constraint.
> 2) keep the formulation as is, but allow any value to be valid as a 
> granularity level - it will still look relatively verbose, but it will 
> allow for an arbitrary number of levels. This one still fails the 
> first constraint.
> 3) use a formulation similar to the one we used for specifying multiple 
> annotators in annotatorsRef - combining the granularity and the URI:
> "Welcome to <span 
> its:disambigIdentRef="entity|http://dbpedia.org/resource/Prague 
> keyword|http://examples.org/kw/prague">Prague</span>. "
> The upside to this is that we can drop the disambigGranularity 
> attribute, but we encourage bad style by having non-atomic attribute 
> values.
>
> -- Tadej
>
> On 1/23/2013 1:13 AM, Mārcis Pinnis wrote:
>> Hi Felix, all,
>> This is currently the summary that I have aggregated for 
>> Disambiguation and Terminology:
>>
>>
>>   Summary on January 23, 2013
>>
>> I kept the descriptions to a minimum (just brief arguments and 
>> opinions). I also added inline comments of my own where I saw the 
>> need to clarify or ask a question.
>>
>>
>>     The initial idea of Christian – summarized (please correct me if
>>     I got it wrong)
>>
>> *Opinion:* ITS 2.0 in comparison to ITS 1.0 moves closer to Natural 
>> Language Processing (NLP) – a general statement (but an important one 
>> for text analysis that follows further). *Disambiguation* could be a 
>> powerful tool *for all kinds of text analysis purposes* if 
>> implemented in an easy to use way.
>>
>> *Concerns*: users may find it difficult to understand when to use 
>> “term” and when – “disambiguation”; the usage scenarios of both 
>> categories may overlap; Disambiguation is not clearly defined in the 
>> ITS 2.0 specification.
>>
>> *Suggestions*: (1) integrating *Terminology as part of 
>> Disambiguation* - “/(automated) text analysis/” (a substitute data 
>> category for the two others) could subsume what is produced by 
>> Disambiguation, Terminology and other annotation-like metadata 
>> processing scenarios [/Mārcis: I generalized the idea/], (2) 
>> *deprecating Terminology*, (3) *revising ITS 2.0* so that the 
>> difference is clearly defined [/Mārcis: is that what was meant by 
>> revising the spec?/]
>>
>> Then follows an example of what the revised data category could look 
>> like, which in the proposal is a renamed Disambiguation data category 
>> with “granularity” changed to “type”, which could have either 
>> hardcoded values or URIs (/preferred by Christian/).
>>
>>
>>     The LT-Web working group’s initial comments – summarized (please
>>     correct me if I got it wrong)
>>
>>
>>       Main ideas from David Filip (Jan 11, 2013, at 12:22 (CET)):
>>
>> *Arguments to keep as is*: (1) ITS 2.0 should not break ITS 1.0 
>> (_cannot deprecate Terminology_) [/Mārcis: as explained by Felix, ITS 
>> 2.0 does not necessarily have to be backwards compatible/], (2) 
>> Terminology is simpler to produce/consume and _consumers of 
>> Terminology should not be forced to move to a more complex 
>> annotation_, (3) although Terminology and Disambiguation are 
>> informally semantically related, for granularity and independent 
>> implementability these should not be combined.
>>
>> *Opinion*: The relationship between Terminology and Disambiguation is 
>> loose and should therefore be handled in a _best practices document_, 
>> not in normative material.
>>
>> [/Mārcis: In example (A) in his e-mail, David described annotating 
>> term candidates with the Disambiguation data category and, after 
>> approval by a terminologist, converting them to the Terminology data 
>> category ... if I understood it correctly. This is the complete 
>> opposite of how we would create the process chain – we would use the 
>> Terminology data category throughout: terms would initially be marked 
>> with the Terminology data category using term confidence, and after 
>> approval the terminologist could link them to a term-bank entry, 
>> remaining within the Terminology data category/]
>>
>>
>>       Main ideas from Jörg Schütz (11.01.13 14:07):
>>
>> Agrees with David to keep separate Terminology and Disambiguation 
>> data categories.
>>
>> *Concerns*: _ISOCat elements (or URIs)_ for “granularity” would force 
>> applications to adopt NLP standards that _might not be appropriate 
>> for a given application scenario_ [/Mārcis: Just a comment – we have 
>> to understand what can be agreed upon by content providers/users 
>> themselves and what needs to be prescribed in the specification, that 
>> is, there is a question of why we should restrict users and prescribe 
>> what can be annotated/disambiguated/].
>>
>> *Suggests*: do not bring ITS closer to NLP because it should remain 
>> open and deployable for different language processing strategies 
>> [/Mārcis: I do not understand what is meant by this recommendation; 
>> I do not see which field/area of NLP causes an issue?!/]
>>
>>
>>       Main ideas from Yves Savourel (Fri, 11 Jan 2013 10:36:41 -0700):
>>
>> Agrees with David and Jörg to keep separate Terminology and 
>> Disambiguation data categories.
>>
>> *Concern/Suggestion*: The two data categories answer to different use 
>> cases, so it would not be good to have a single solution for 
>> different problems.
>>
>> *Arguments to keep both separated*: (1) Disambiguation is more 
>> complex; we should not put extra burden on Terminology implementers, 
>> (2) breaking large problems into smaller parts makes things easier 
>> [/Mārcis: overlaps with 1 ... sort of/]
>>
>> [/Felix: Yves responded as a Terminology consumer/]
>>
>>
>>       Main ideas from Felix Sasaki (Mon, 14 Jan 2013 19:34:44 +0100,
>>       Tue, 15 Jan 2013 10:34:17 +0100, Tue, 15 Jan 2013 13:20:06
>>       +0100, Tue, 15 Jan 2013 17:39:08 +0100)
>>
>> *Asks*: What is _the difference_ in terms _of producing the metadata_ 
>> for Terminology and Disambiguation [/Mārcis's opinion: Terminology is 
>> simple, Disambiguation is painful, but in general – both do 
>> annotation/]?
>>
>> *Opinion*: the Disambiguation output gives _background information 
>> on_ what _resources_ have been used [/Mārcis: Terminology does not … 
>> at least not directly; also – I believe that the main task of 
>> disambiguation is to define the meaning/semantics of the tagged 
>> units, rather than listing what resources have been used in the 
>> process of disambiguation/].
>>
>> *Analyses*: the mapping between Terminology data category data and 
>> the Disambiguation data category data.
>>
>> *Suggests*: (1) _create guidance for producers of the metadata_, 
>> related to different consumption scenarios [/Mārcis's comment: 
>> shouldn’t it be the other way around – guidance for consumers?/], (2) 
>> following analysis, _proposes mapping_ from Terminology data category 
>> entries to the Disambiguation data category entries.
>>
>>
>>       Main ideas from Mārcis Pinnis (Tue, 15 Jan 2013 09:55:59 +0200,
>>       Tue, 15 Jan 2013 15:22:58 +0200)
>>
>> *Concerns*: (1) The _Disambiguation data category is very ambiguous_, 
>> because (a) it lacks clear definitions for the separate granularity 
>> levels; (b) it is unclear why and on what basis only the 3 given 
>> granularity levels have been chosen and not more (for instance, 
>> keyword annotation, syntactic annotation, etc.); (c) terminology is 
>> not used consistently throughout the description, so the 
>> specification is difficult to follow. (2) _A phrase can be 
>> simultaneously_ a term, a named entity, an entry in an ontology, and 
>> _many other things_ for different application purposes (a keyword, a 
>> noun phrase, a proper noun phrase, a client’s invented phrase, etc.), 
>> but the _Disambiguation category does not allow a friendly way of 
>> annotating multiple categories on one phrase_ (not even considering 
>> hierarchical annotation, which is very common for named entities) – 
>> the usefulness of the Disambiguation data category will be limited 
>> because its metadata is difficult both to produce and to consume. (3) 
>> There are _many different levels of disambiguation_ (most of them 
>> driven by applications where the information is used); even simple 
>> annotation of words and punctuation is disambiguation (of some sort). 
>> _Where do we start counting the disambiguation and with what level_? 
>> Should we even limit users to prescribed levels?
>>
>> *Suggests*: Keep the data categories separated, maybe even for all 
>> three current “granularity” levels if they are required for 
>> localisation as the applications can differ.
>>
>> *Opinion*: the difference in the use cases has not been explained 
>> clearly enough – if it were clear, the issues would be limited to 
>> Disambiguation only...
>>
>> Best regards,
>>
>> Mārcis ;o)
>>
>> ------------------------------------------------------------------------
>> *From:* Felix Sasaki [fsasaki@w3.org]
>> *Sent:* Tuesday, 15 January 2013, 18:39
>> *To:* Mārcis Pinnis
>> *Cc:* public-multilingualweb-lt-comments@w3.org
>> *Subject:* Re: Disambiguation and terminology producers (Re: issue-68 
>> (Re: Comment on ITS 2.0 WD-its20-20121206 - Disambiguation (and term)))
>>
>> Hi Marcis,
>>
>> On 15.01.13 14:39, Mārcis Pinnis wrote:
>>> Computer software, or just software, is a collection of computer 
>>> programs and related data that provides the instructions for telling 
>>> a computer what to do and how to do it.
>> Great example, thanks a lot.
>>
>> I have run your example through the NERD API. The output is below. 
>> Tadej, what would it look like with Enrycher?
>>
>> [{"idEntity":170179,"label":"Computer software","startChar":0,"endChar":17,"nerdType":"http://nerd.eurecom.fr/ontology#Thing","uri":"http://en.wikipedia.com/wiki/Software","confidence":0.927371,"relevance":0.5,"extractor":"yahoo","startNPT":0.0,"endNPT":0.0},
>> {"idEntity":170180,"label":"computer programs","startChar":56,"endChar":73,"nerdType":"http://nerd.eurecom.fr/ontology#Thing","uri":"http://en.wikipedia.com/wiki/Computer_program","confidence":0.886778,"relevance":0.5,"extractor":"yahoo","startNPT":0.0,"endNPT":0.0},
>> {"idEntity":170181,"label":"collection","startChar":42,"endChar":52,"nerdType":"http://nerd.eurecom.fr/ontology#Thing","confidence":0.586448,"relevance":0.5,"extractor":"yahoo","startNPT":0.0,"endNPT":0.0}]
>>
>>
>> Below is the mapping NERD - ITS2 again:
>>
>> [
>> The mappings NERD - ITS2 "disambiguation" are:
>> - "nerdType" maps to "its-disambig-class-ref"
>> - "confidence" maps to "its-disambig-confidence"
>> - "uri" maps to "its-disambig-ident-ref"
>> ]
>>
>> I think your terminology annotations can easily be integrated in this 
>> mapping:
>>
>> [
>> 1) "nerdType" maps to "its-disambig-class-ref"; there is no counterpart in the terminology annotation
>> 2) "confidence" maps to "its-disambig-confidence" and to termConfidence
>> 3) "uri" maps to "its-disambig-ident-ref" and to termInfoRef
>> 4) "itsDisambigGranularity" is not available in NERD or your terminology annotation system
>> ]
>>
>> So from the point of view of producers (= automatic annotation 
>> tools), I think 1-3 could easily be integrated in one type of 
>> annotation output.
>>
>> Best,
>>
>> Felix
>
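
To make option 3 from the quoted 23 January message concrete, here is a minimal Python sketch of how a consumer might split such a combined attribute value into (level, URI) pairs. The function name parse_disambig_ident_ref is hypothetical, and the "level|URI" syntax is only the proposal under discussion, not something ITS 2.0 defines. The sketch assumes that pairs are separated by whitespace and that URIs contain no unescaped spaces.

```python
from typing import List, Tuple


def parse_disambig_ident_ref(value: str) -> List[Tuple[str, str]]:
    """Split a 'level|URI level|URI ...' string into (level, URI) pairs.

    Hypothetical helper: the combined syntax is the one proposed in
    option 3 of the thread, not a defined ITS 2.0 format.
    """
    pairs = []
    for token in value.split():
        # Split on the first '|' only, so a '|' inside the URI is kept.
        level, sep, uri = token.partition("|")
        if not sep or not level or not uri:
            raise ValueError(f"expected 'level|URI', got {token!r}")
        pairs.append((level, uri))
    return pairs


# The attribute value from the example in the quoted message.
example = (
    "entity|http://dbpedia.org/resource/Prague "
    "keyword|http://examples.org/kw/prague"
)
pairs = parse_disambig_ident_ref(example)
for level, uri in pairs:
    print(level, uri)
```

The extra parsing step every consumer would need is the practical cost of the non-atomic attribute values mentioned as the downside of this option.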

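The NERD - ITS2 mapping listed in Felix's quoted message can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name nerd_to_its is hypothetical, the input is trimmed to two of the three entities from the NERD output quoted above, and every value is simply serialized as a string.

```python
import json

# Mapping as listed in the quoted message: NERD key -> ITS 2.0 attribute.
NERD_TO_ITS = {
    "nerdType": "its-disambig-class-ref",
    "confidence": "its-disambig-confidence",
    "uri": "its-disambig-ident-ref",
}


def nerd_to_its(entity: dict) -> dict:
    """Return ITS attributes for one NERD entity (hypothetical helper).

    Keys missing from the entity are skipped rather than emitted empty.
    """
    return {
        its_name: str(entity[nerd_key])
        for nerd_key, its_name in NERD_TO_ITS.items()
        if nerd_key in entity
    }


# Two entities from the quoted NERD output, reduced to the relevant keys.
nerd_output = json.loads("""
[
  {"label": "Computer software", "startChar": 0, "endChar": 17,
   "nerdType": "http://nerd.eurecom.fr/ontology#Thing",
   "uri": "http://en.wikipedia.com/wiki/Software",
   "confidence": 0.927371},
  {"label": "collection", "startChar": 42, "endChar": 52,
   "nerdType": "http://nerd.eurecom.fr/ontology#Thing",
   "confidence": 0.586448}
]
""")

for entity in nerd_output:
    print(entity["label"], nerd_to_its(entity))
```

The "collection" entity has no "uri" in the NERD output, so it gets no its-disambig-ident-ref attribute. As the message notes, granularity has no NERD counterpart, so a producer would have to supply it separately.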



Received on Thursday, 24 January 2013 10:02:45 UTC
