Re: Atb.: Disambiguation and terminology producers (Re: issue-68 (Re: Comment on ITS 2.0 WD-its20-20121206 - Disambiguation (and term))) from Tadej Stajner on 2013-01-23 (public-multilingualweb-lt-comments@w3.org from January 2013)

From: Tadej Stajner <tadej.stajner@ijs.si>
Date: Wed, 23 Jan 2013 01:55:44 +0100
To: Mārcis Pinnis <marcis.pinnis@Tilde.lv>
CC: Felix Sasaki <fsasaki@w3.org>, "public-multilingualweb-lt-comments@w3.org" <public-multilingualweb-lt-comments@w3.org>
Message-ID: <50FF3510.10405@ijs.si>
Hi Marcis, thanks for the excellent summary.
With regard to the granularity level issue, I see that you identified 
two new constraints that the current formulation does not satisfy:
- having simultaneous annotations of the same fragment at different 
levels;
- too few/too restrictive levels;

Depending on how critical these issues are, I suggest some solutions we 
could discuss tomorrow:
1) define a separate attribute for every level, allowing for much nicer 
annotations, and the possibility to have multiple annotations on the 
same fragment. The downside is that we need to fix what levels are 
supported, since we're creating new attributes for every level, thus 
failing the second constraint.
2) keep the formulation as is, but allow any value to be valid as a 
granularity level - it will still look relatively verbose, but it will 
allow for an arbitrary number of levels. This one still fails the 
first constraint.
3) use a formulation similar to the one we used for specifying multiple 
annotators in annotatorsRef - combining the granularity and the URI:
"Welcome to <span 
its:disambigIdentRef="entity|http://dbpedia.org/resource/Prague 
keyword|http://examples.org/kw/prague">Prague</span>. "
The upside to this is that we can drop the disambigGranularity 
attribute, but we encourage bad style by having non-atomic attribute 
values.
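
A minimal Python sketch of how a consumer might split the combined value 
from option 3 into (level, URI) pairs. This assumes the "level|URI" 
format exactly as proposed above, which is not part of ITS 2.0; the 
function name parse_disambig_ident_ref is hypothetical.

```python
from typing import List, Tuple


def parse_disambig_ident_ref(value: str) -> List[Tuple[str, str]]:
    """Split a combined 'level|URI' attribute value into (level, uri) pairs.

    Hypothetical helper: the 'level|URI' syntax is only the proposal from
    this thread, not a defined ITS 2.0 format.
    """
    pairs = []
    # Entries are separated by whitespace, as in annotatorsRef.
    for token in value.split():
        # Split on the first '|' only; the rest of the token is the URI.
        level, sep, uri = token.partition("|")
        if not sep or not level or not uri:
            raise ValueError(f"expected 'level|URI', got {token!r}")
        pairs.append((level, uri))
    return pairs


example = (
    "entity|http://dbpedia.org/resource/Prague "
    "keyword|http://examples.org/kw/prague"
)
for level, uri in parse_disambig_ident_ref(example):
    print(level, uri)
```

The extra parsing step is the cost of the non-atomic attribute value: 
every consumer has to repeat this split before it can use the URIs.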

-- Tadej

On 1/23/2013 1:13 AM, Mārcis Pinnis wrote:
> Hi Felix, all,
> This is currently the summary that I have aggregated for 
> Disambiguation and Terminology:
>
>
>   Summary on January 23, 2013
>
> I kept the descriptions to a minimum (just laconic arguments 
> and brief opinions). I also added inline comments of my own 
> where I saw the need to clarify or ask a question.
>
>
>     The initial idea of Christian – summarized (please correct me if I
>     got it wrong)
>
> *Opinion:* ITS 2.0 in comparison to ITS 1.0 moves closer to Natural 
> Language Processing (NLP) – a general statement (but an important one 
> for text analysis that follows further). *Disambiguation* could be a 
> powerful tool *for all kinds of text analysis purposes* if implemented 
> in an easy to use way.
>
> *Concerns*: users may find it difficult to understand when to use 
> “term” and when – “disambiguation”; the usage scenarios of both 
> categories may overlap; Disambiguation is not clearly defined in the 
> ITS 2.0 specification.
>
> *Suggestions*: (1) integrating *Terminology as part of Disambiguation 
> - *“/(automated) text analysis/” (a substitute data category for the 
> two others) could subsume what is produced by Disambiguation, 
> Terminology and other annotation-like metadata processing scenarios 
> [/Mārcis: I generalized the idea/], (2) *deprecating Terminology*, (3) 
> *revising ITS 2.0* so that the difference is clearly defined [/Mārcis: 
> is that what was meant by revising the spec?/]
>
> Then follows an example of what the revised data category could look 
> like, which in the proposal is a renamed Disambiguation data category 
> with the “granularity” changed to “type” that could have either 
> hardcoded values or URIs (/preferred by Christian/).
>
>
>     The LT-Web working group’s initial comments – summarized (please
>     correct me if I got it wrong)
>
>
>       Main ideas from David Filip (Jan 11, 2013, at 12:22 (CET)):
>
> *Arguments to keep as is*: (1) ITS 2.0 should not break ITS 1.0 
> (_cannot deprecate Terminology_) [/Mārcis: as explained by Felix, ITS 
> 2.0 does not necessarily have to be backwards compatible/], (2) 
> Terminology is simpler to produce/consume and _consumers of 
> Terminology should not be forced to move to a more complex 
> annotation_, (3) although Terminology and Disambiguation are 
> informally semantically related, for granularity and independent 
> implementability these should not be combined.
>
> *Opinion*: The relationship between Terminology and Disambiguation is 
> loose and, therefore, should be handled in a _best practices document_, 
> not in normative material.
>
> [/Mārcis: In example (A) of his e-mail, David described annotating 
> term candidates with the Disambiguation data category and, after 
> approval by a terminologist, converting them to the Terminology data 
> category ... if I understood it correctly. This is the complete 
> opposite of how we would build the process chain – we would use the 
> Terminology data category throughout: terms would initially be marked 
> with the Terminology data category using term confidence and, after 
> approval, the terminologist could link them to a term-bank entry while 
> remaining within the Terminology data category/]
>
>
>       Main ideas from Jörg Schütz (11.01.13 14:07):
>
> Agrees with David to keep separate Terminology and Disambiguation data 
> categories.
>
> *Concerns*: _ISOCat elements (or URIs)_ for “granularity” would force 
> applications to adopt NLP standards that _might not be appropriate for 
> a given application scenario_ [/Mārcis: Just a comment – we have to 
> understand what can be agreed upon by content providers/users 
> themselves and what needs to be prescribed in the specification, that 
> is, why should we restrict users and prescribe what can 
> be annotated/disambiguated?/].
>
> *Suggests*: do not bring ITS closer to NLP because it should remain 
> open and deployable for different language processing strategies 
> [/Mārcis: I do not understand what is meant by this 
> recommendation; which field/area of NLP causes an issue?!/]
>
>
>       Main ideas from Yves Savourel (Fri, 11 Jan 2013 10:36:41 -0700):
>
> Agrees with David and Jörg to keep separate Terminology and 
> Disambiguation data categories.
>
> *Concern/Suggestion*: The two data categories address different use 
> cases, so it would not be good to have a single solution for different 
> problems.
>
> *Arguments to keep both separated*: (1) Disambiguation is more 
> complex; we should not put extra burden on Terminology implementers, 
> (2) breaking large problems into smaller parts makes things easier 
> [/Mārcis: overlaps with 1 ... sort of/]
>
> [/Felix: Yves responded as a Terminology consumer/]
>
>
>       Main ideas from Felix Sasaki (Mon, 14 Jan 2013 19:34:44 +0100,
>       Tue, 15 Jan 2013 10:34:17 +0100, Tue, 15 Jan 2013 13:20:06
>       +0100, Tue, 15 Jan 2013 17:39:08 +0100)
>
> *Asks*: What is _the difference_ in terms _of producing the metadata_ 
> for Terminology and Disambiguation [/Mārcis opinion: Terminology is 
> simple, Disambiguation is painful, but in general – both do annotation/]?
>
> *Opinion*: the Disambiguation output gives _background information on_ 
> what _resources_ have been used [/Mārcis: The Terminology does not … 
> at least not directly; also – I believe that the main task of 
> disambiguation is to define the meaning/semantics of the tagged units, 
> rather than counting up what resources have been used in the process 
> of disambiguation/].
>
> *Analyses*: the mapping between Terminology data category data and the 
> Disambiguation data category data.
>
> *Suggests*: (1) _create guidance for producers of the metadata_, 
> related to different consumption scenarios [/Mārcis comment: shouldn’t 
> it be the other way around – guidance for consumers?/], (2) following 
> analysis, _proposes mapping_ from terminology data category entries to 
> the Disambiguation data category entries.
>
>
>       Main ideas from Mārcis Pinnis (Tue, 15 Jan 2013 09:55:59 +0200,
>       Tue, 15 Jan 2013 15:22:58 +0200)
>
> *Concerns*: (1) The _Disambiguation data category is very ambiguous_, 
> because (a) it lacks clear definitions for the separate granularity 
> levels; (b) it is unclear why and on what basis only the 3 given 
> granularity levels were chosen and not more (for instance, keyword 
> annotation, syntactic annotation, etc.); (c) terminology is not used 
> consistently throughout the description, so the specification is 
> difficult to follow. (2) _A phrase can be simultaneously_ a term, a 
> named entity, an entry in an ontology, and _many other things_ for 
> different application purposes (a keyword, a noun phrase, a proper 
> noun phrase, a client’s invented phrase, etc.), but the 
> _Disambiguation category does not offer a friendly way of annotating 
> multiple categories on one phrase_ (not even considering hierarchical 
> annotation, which is very common for named entities) – the usefulness 
> of the Disambiguation data category will be limited because its 
> metadata is difficult both to produce and to consume. (3) There 
> are _many different levels of disambiguation_ (most of them driven by 
> applications where the information is used), even simple annotation of 
> words and punctuation is disambiguation (of some sort). _Where do we 
> start counting the disambiguation and with what level_? Should we even 
> limit users to prescribed levels?
>
> *Suggests*: Keep the data categories separated, maybe even for all 
> three current “granularity” levels if they are required for 
> localisation as the applications can differ.
>
> *Opinion*: the difference in the use cases has not been explained 
> clearly enough – if it were clear, the issues would be limited to 
> Disambiguation only...
>
> Best regards,
>
> Mārcis ;o)
>
> ------------------------------------------------------------------------
> *From:* Felix Sasaki [fsasaki@w3.org]
> *Sent:* Tuesday, January 15, 2013 18:39
> *To:* Mārcis Pinnis
> *Cc:* public-multilingualweb-lt-comments@w3.org
> *Subject:* Re: Disambiguation and terminology producers (Re: issue-68 
> (Re: Comment on ITS 2.0 WD-its20-20121206 - Disambiguation (and term)))
>
> Hi Marcis,
>
> On 15.01.13 14:39, Mārcis Pinnis wrote:
>> Computer software, or just software, is a collection of computer 
>> programs and related data that provides the instructions for telling 
>> a computer what to do and how to do it.
> Great example, thanks a lot.
>
> I have run your example through the NERD API. The output is below. 
> Tadej, what would it look like with Enrycher?
>
> [{"idEntity":170179,"label":"Computer 
> software","startChar":0,"endChar":17,"nerdType":"http://nerd.eurecom.fr/ontology#Thing","uri":"http://en.wikipedia.com/wiki/Software","confidence":0.927371,"relevance":0.5,"extractor":"yahoo","startNPT":0.0,"endNPT":0.0},{"idEntity":170180,"label":"computer 
> programs","startChar":56,"endChar":73,"nerdType":"http://nerd.eurecom.fr/ontology#Thing","uri":"http://en.wikipedia.com/wiki/Computer_program","confidence":0.886778,"relevance":0.5,"extractor":"yahoo","startNPT":0.0,"endNPT":0.0},{"idEntity":170181,"label":"collection","startChar":42,"endChar":52,"nerdType":"http://nerd.eurecom.fr/ontology#Thing","confidence":0.586448,"relevance":0.5,"extractor":"yahoo","startNPT":0.0,"endNPT":0.0}]
>
>
> Below is the mapping NERD - ITS2 again:
>
> [
> The mappings NERD - ITS2 "disambiguation" are:
> - "nerdType" maps to "its-disambig-class-ref"
> - "confidence" maps to "its-disambig-confidence"
> - "uri" maps to "its-disambig-ident-ref"
> ]
>
> I think your terminology annotations can easily be integrated into this 
> mapping:
>
> [
> 1) "nerdType" maps to "its-disambig-class-ref"; there is no counterpart in the terminology annotation
> 2) "confidence" maps to "its-disambig-confidence" and to termConfidence
> 3) "uri" maps to "its-disambig-ident-ref" and to termInfoRef
> 4) "itsDisambigGranularity" is not available in NERD or your terminology annotation system
> ]
>
> So from the point of view of producers (= automatic annotation tools), 
> I think 1-3 could easily be integrated in one type of annotation output.
>
> Best,
>
> Felix
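
The NERD-to-ITS 2.0 mapping in Felix's quoted message can be sketched in 
Python as follows. This is only an illustration under stated assumptions: 
the key and attribute names are taken from the mapping lists in that 
message, while the function names and the choice to skip missing keys 
(such as the absent "uri" on the "collection" entity) are hypothetical.

```python
# Attribute names as listed in the quoted mapping; nothing here is
# checked against the ITS 2.0 specification itself.
NERD_TO_DISAMBIG = {
    "nerdType": "its-disambig-class-ref",
    "confidence": "its-disambig-confidence",
    "uri": "its-disambig-ident-ref",
}
# Terminology has no counterpart for "nerdType" (point 1 of the mapping).
NERD_TO_TERM = {
    "confidence": "termConfidence",
    "uri": "termInfoRef",
}


def map_entity(entity: dict, mapping: dict) -> dict:
    """Rename the NERD keys present in `entity`; absent keys are skipped."""
    return {
        target: str(entity[source])
        for source, target in mapping.items()
        if source in entity
    }


def nerd_to_disambig(entity: dict) -> dict:
    return map_entity(entity, NERD_TO_DISAMBIG)


def nerd_to_term(entity: dict) -> dict:
    return map_entity(entity, NERD_TO_TERM)


# Two entities abridged from the NERD output quoted in the message.
software = {
    "label": "Computer software",
    "nerdType": "http://nerd.eurecom.fr/ontology#Thing",
    "uri": "http://en.wikipedia.com/wiki/Software",
    "confidence": 0.927371,
}
collection = {
    "label": "collection",
    "nerdType": "http://nerd.eurecom.fr/ontology#Thing",
    "confidence": 0.586448,
}

print(nerd_to_disambig(software))
print(nerd_to_term(collection))
```

Both outputs are derived from the same NERD record, which is the sense 
in which points 1-3 could come from one type of annotation output.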
Received on Wednesday, 23 January 2013 00:56:19 UTC