Re: Disambiguation and terminology producers (Re: issue-68 (Re: Comment on ITS 2.0 WD-its20-20121206 - Disambiguation (and term)))

Hi Felix, all,

This is currently the summary that I have aggregated for Disambiguation and Terminology:

Summary on January 23, 2013
I kept the descriptions to a minimum (just brief arguments and opinions). I also added inline comments of my own where I saw a need to clarify something or ask a question.
The initial idea of Christian – summarized (please correct me if I got it wrong)
Opinion: ITS 2.0 in comparison to ITS 1.0 moves closer to Natural Language Processing (NLP) – a general statement (but an important one for text analysis that follows further). Disambiguation could be a powerful tool for all kinds of text analysis purposes if implemented in an easy to use way.
Concerns: users may find it difficult to understand when to use “term” and when – “disambiguation”; the usage scenarios of both categories may overlap; Disambiguation is not clearly defined in the ITS 2.0 specification.
Suggestions: (1) integrating Terminology as part of Disambiguation - “(automated) text analysis” (a substitute data category for the two others) could subsume what is produced by Disambiguation, Terminology and other annotation-like metadata processing scenarios [Mārcis: I generalized the idea], (2) deprecating Terminology, (3) revising ITS 2.0 so that the difference is clearly defined [Mārcis: is that what was meant by revising the spec?]
Then follows an example of what the revised data category could look like: in the proposal it is a renamed Disambiguation data category with “granularity” changed to “type”, which could take either hardcoded values or URIs (preferred by Christian).
The LT-Web working group’s initial comments – summarized (please correct me if I got it wrong)
Main ideas from David Filip (Jan 11, 2013, at 12:22 (CET)):
Arguments to keep as is: (1) ITS 2.0 should not break ITS 1.0 (cannot deprecate Terminology) [Mārcis: as explained by Felix, ITS 2.0 does not necessarily have to be backwards compatible], (2) Terminology is simpler to produce/consume and consumers of Terminology should not be forced to move to a more complex annotation, (3) although Terminology and Disambiguation are informally semantically related, for granularity and independent implementability these should not be combined.
Opinion: The relationship between Terminology and Disambiguation is loose and should therefore be handled in a best-practices document rather than in normative material.
[Mārcis: In example (A) of his e-mail, David described annotating term candidates with the Disambiguation data category and, after approval by a terminologist, converting them to the Terminology data category ... if I understood it correctly. This is the complete opposite of how we would build the process chain: we would use the Terminology data category throughout, requiring terms to be marked initially with the Terminology data category using term confidence; after approval, the terminologist could link them to a term-bank entry while remaining within the Terminology data category.]
Main ideas from Jörg Schütz (11.01.13 14:07):
Agrees with David to keep separate Terminology and Disambiguation data categories.
Concerns: ISOCat elements (or URIs) for “granularity” would force applications to adopt NLP standards that might not be appropriate for a given application scenario [Mārcis: Just a comment – we have to understand what can be agreed upon by content providers/users themselves and what needs to be prescribed in the specification; that is, why should we restrict users and prescribe what can be annotated/disambiguated?].
Suggests: do not bring ITS closer to NLP because it should remain open and deployable for different language processing strategies [Mārcis: I do not understand what is meant by this recommendation; I do not see which field/area of NLP causes an issue.]
Main ideas from Yves Savourel (Fri, 11 Jan 2013 10:36:41 -0700):
Agrees with David and Jörg to keep separate Terminology and Disambiguation data categories.
Concern/Suggestion: The two data categories answer to different use cases, so it would not be good to have a single solution for different problems.
Arguments to keep both separated: (1) Disambiguation is more complex; we should not put an extra burden on Terminology implementers, (2) breaking large problems into smaller parts makes things easier [Mārcis: overlaps with 1 ... sort of]
[Felix: Yves responded as a Terminology consumer]
Main ideas from Felix Sasaki (Mon, 14 Jan 2013 19:34:44 +0100, Tue, 15 Jan 2013 10:34:17 +0100, Tue, 15 Jan 2013 13:20:06 +0100, Tue, 15 Jan 2013 17:39:08 +0100)
Asks: What is the difference in terms of producing the metadata for Terminology and Disambiguation [Mārcis' opinion: Terminology is simple, Disambiguation is painful, but in general both do annotation]?
Opinion: the Disambiguation output gives background information on what resources have been used [Mārcis: Terminology does not, at least not directly; also, I believe that the main task of disambiguation is to define the meaning/semantics of the tagged units, rather than to enumerate the resources used in the process of disambiguation].
Analyses: the mapping between Terminology data category data and the Disambiguation data category data.
Suggests: (1) create guidance for producers of the metadata, related to different consumption scenarios [Mārcis' comment: shouldn’t it be the other way around – guidance for consumers?], (2) following the analysis, a mapping from Terminology data category entries to Disambiguation data category entries.
Main ideas from Mārcis Pinnis (Tue, 15 Jan 2013 09:55:59 +0200, Tue, 15 Jan 2013 15:22:58 +0200)
Concerns: (1) The Disambiguation data category is very ambiguous, because (a) it lacks clear definitions for the separate granularity levels; (b) it is unclear why and on what basis only the three given granularity levels were chosen and not more (for instance, keyword annotation, syntactic annotation, etc.); (c) terminology is not used consistently throughout the description, so the specification is difficult to follow. (2) A phrase can simultaneously be a term, a named entity, an entry in an ontology, and many other things for different application purposes (a keyword, a noun phrase, a proper noun phrase, a client’s invented phrase, etc.), but the Disambiguation category does not offer a friendly way of annotating multiple categories on one phrase (not even considering hierarchical annotation, which is very common for named entities) – the usefulness of the Disambiguation data category will be limited because its metadata is difficult both to produce and to consume. (3) There are many different levels of disambiguation (most of them driven by the applications where the information is used); even simple annotation of words and punctuation is disambiguation (of some sort). Where do we start counting the disambiguation, and at what level? Should we even limit users to prescribed levels?
Suggests: Keep the data categories separated, maybe even for all three current “granularity” levels if they are required for localisation as the applications can differ.
Opinion: the difference in the use cases has not been explained clearly enough – if it were clear, the issues would be limited to Disambiguation only...

Best regards,
Mārcis ;o)

________________________________
From: Felix Sasaki [fsasaki@w3.org]
Sent: Tuesday, 15 January 2013 18:39
To: Mārcis Pinnis
Cc: public-multilingualweb-lt-comments@w3.org
Subject: Re: Disambiguation and terminology producers (Re: issue-68 (Re: Comment on ITS 2.0 WD-its20-20121206 - Disambiguation (and term)))


Hi Marcis,

On 15.01.13 14:39, Mārcis Pinnis wrote:
Computer software, or just software, is a collection of computer programs and related data that provides the instructions for telling a computer what to do and how to do it.
Great example, thanks a lot.

I have run your example through the NERD API. The output is below. Tadej, what would it look like with Enrycher?

[{"idEntity":170179,"label":"Computer software","startChar":0,"endChar":17,"nerdType":"http://nerd.eurecom.fr/ontology#Thing","uri":"http://en.wikipedia.com/wiki/Software","confidence":0.927371,"relevance":0.5,"extractor":"yahoo","startNPT":0.0,"endNPT":0.0},
{"idEntity":170180,"label":"computer programs","startChar":56,"endChar":73,"nerdType":"http://nerd.eurecom.fr/ontology#Thing","uri":"http://en.wikipedia.com/wiki/Computer_program","confidence":0.886778,"relevance":0.5,"extractor":"yahoo","startNPT":0.0,"endNPT":0.0},
{"idEntity":170181,"label":"collection","startChar":42,"endChar":52,"nerdType":"http://nerd.eurecom.fr/ontology#Thing","confidence":0.586448,"relevance":0.5,"extractor":"yahoo","startNPT":0.0,"endNPT":0.0}]
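As a quick sanity check on the output above, the character offsets can be compared against the example sentence. The following is only a minimal sketch: the helper name check_offsets is hypothetical (it is not part of the NERD API), and it assumes that "startChar"/"endChar" are zero-based offsets with an exclusive end, which is consistent with the three entities shown.

```python
# Example sentence that was sent to the annotation service.
TEXT = (
    "Computer software, or just software, is a collection of computer "
    "programs and related data that provides the instructions for telling "
    "a computer what to do and how to do it."
)

# Only the fields needed for the offset check, copied from the output above.
ENTITIES = [
    {"label": "Computer software", "startChar": 0, "endChar": 17},
    {"label": "computer programs", "startChar": 56, "endChar": 73},
    {"label": "collection", "startChar": 42, "endChar": 52},
]


def check_offsets(text, entities):
    """Return the labels whose offsets do not select the label text.

    Assumes zero-based offsets with an exclusive end (hypothetical helper).
    """
    mismatches = []
    for entity in entities:
        span = text[entity["startChar"]:entity["endChar"]]
        if span != entity["label"]:
            mismatches.append(entity["label"])
    return mismatches


if __name__ == "__main__":
    # An empty list means every offset pair selects exactly its label.
    print(check_offsets(TEXT, ENTITIES))
```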


Below is the mapping NERD - ITS2 again:

[

The mappings NERD - ITS2 "disambiguation" are:
- "nerdType" maps to "its-disambig-class-ref"
- "confidence" maps to "its-disambig-confidence"
- "uri" maps to "its-disambig-ident-ref"

]

I think your terminology annotations can easily be integrated into this mapping:

[

1) "nerdType" maps to "its-disambig-class-ref"; there is no counterpart in the terminology annotation
2) "confidence" maps to "its-disambig-confidence" and to termConfidence
3) "uri" maps to "its-disambig-ident-ref" and to termInfoRef
4) "itsDisambigGranularity" is not available in NERD or your terminology annotation system


]

So from the point of view of producers (= automatic annotation tools), I think 1-3 could easily be integrated in one type of annotation output.
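To illustrate points 1-3, here is a minimal sketch of a producer that emits both kinds of annotation from one NERD-style entity. The function name nerd_to_its and the shape of the returned dictionary are assumptions made for illustration only; the attribute names are the ones used in the mapping above, and point 4 (granularity) is left out because NERD provides no counterpart.

```python
def nerd_to_its(entity):
    """Map one NERD-style entity dict to ITS 2.0 style annotations.

    Hypothetical helper: returns Disambiguation and Terminology
    attributes side by side, following mapping points 1-3.
    """
    disambiguation = {}
    terminology = {}

    # 1) nerdType -> its-disambig-class-ref (no Terminology counterpart)
    if "nerdType" in entity:
        disambiguation["its-disambig-class-ref"] = entity["nerdType"]

    # 2) confidence -> its-disambig-confidence and termConfidence
    if "confidence" in entity:
        disambiguation["its-disambig-confidence"] = entity["confidence"]
        terminology["termConfidence"] = entity["confidence"]

    # 3) uri -> its-disambig-ident-ref and termInfoRef (uri may be absent)
    if "uri" in entity:
        disambiguation["its-disambig-ident-ref"] = entity["uri"]
        terminology["termInfoRef"] = entity["uri"]

    return {"disambiguation": disambiguation, "terminology": terminology}


if __name__ == "__main__":
    # First entity from the NERD output, reduced to the mapped fields.
    sample = {
        "label": "Computer software",
        "nerdType": "http://nerd.eurecom.fr/ontology#Thing",
        "uri": "http://en.wikipedia.com/wiki/Software",
        "confidence": 0.927371,
    }
    print(nerd_to_its(sample))
```

An entity without a "uri", such as "collection" in the output, simply yields no its-disambig-ident-ref and no termInfoRef.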

Best,

Felix

Received on Wednesday, 23 January 2013 00:14:24 UTC