- From: Tadej Stajner <tadej.stajner@ijs.si>
- Date: Thu, 24 Jan 2013 11:02:01 +0100
- To: public-multilingualweb-lt-comments@w3.org
- Message-ID: <51010699.8040202@ijs.si>
Hi, all,

here are some slides with examples of the merge.

-- Tadej

On 1/23/2013 1:55 AM, Tadej Stajner wrote:
> Hi, thanks for the excellent summary, Marcis,
> with regards to the granularity level issue, I see that you identified
> two new constraints which the current formulation doesn't work for:
> - having simultaneous annotations of the same fragment with a
>   different level;
> - too few/too restrictive levels;
>
> Depending on how critical these issues are, I suggest some solutions
> we could discuss tomorrow:
> 1) define a separate attribute for every level, allowing for much
> nicer annotations, and the possibility to have multiple annotations on
> the same fragment. The downside is that we need to fix what levels are
> supported, since we're creating new attributes for every level, thus
> failing the second constraint.
> 2) keep the formulation as is, but allow any value to be valid as a
> granularity level - it will still look relatively verbose, but it will
> allow for an arbitrary number of levels. This one still fails the
> first constraint.
> 3) use a formulation similar to what we did for specifying multiple
> annotators in annotatorsRef - combining the granularity and the URI:
> "Welcome to <span
> its:disambigIdentRef="entity|http://dbpedia.org/resource/Prague
> keyword|http://examples.org/kw/prague">Prague</span>. "
> The upside to this is that we can drop the disambigGranularity
> attribute, but we encourage bad style by having non-atomic attribute
> values.
>
> -- Tadej
>
> On 1/23/2013 1:13 AM, Mārcis Pinnis wrote:
>> Hi Felix, all,
>> This is currently the summary that I have aggregated for
>> Disambiguation and Terminology:
>>
>>
>> Summary on January 23, 2013
>>
>> I kept descriptive statements to a minimum (just brief arguments
>> and opinions). I also added inline comments from myself
>> where I saw the need to clarify or ask a question.
>>
>>
>> The initial idea of Christian – summarized (please correct me if
>> I got it wrong)
>>
>> *Opinion:* ITS 2.0 in comparison to ITS 1.0 moves closer to Natural
>> Language Processing (NLP) – a general statement (but an important one
>> for the text analysis discussion that follows). *Disambiguation* could
>> be a powerful tool *for all kinds of text analysis purposes* if
>> implemented in an easy-to-use way.
>>
>> *Concerns*: users may find it difficult to understand when to use
>> “term” and when to use “disambiguation”; the usage scenarios of both
>> categories may overlap; Disambiguation is not clearly defined in the
>> ITS 2.0 specification.
>>
>> *Suggestions*: (1) integrating *Terminology as part of Disambiguation*
>> - “/(automated) text analysis/” (a substitute data category for the
>> two others) could subsume what is produced by Disambiguation,
>> Terminology and other annotation-like metadata processing scenarios
>> [/Mārcis: I generalized the idea/], (2) *deprecating Terminology*,
>> (3) *revising ITS 2.0* so that the difference is clearly defined
>> [/Mārcis: is that what was meant by revising the spec?/]
>>
>> Then follows an example of how the revised data category could look.
>> In the proposal it is a renamed Disambiguation data category
>> with “granularity” changed to “type”, which could take either
>> hardcoded values or URIs (/preferred by Christian/).
>>
>>
>> The LT-Web working group’s initial comments – summarized (please
>> correct me if I got it wrong)
>>
>>
>> Main ideas from David Filip (Jan 11, 2013, at 12:22 (CET)):
>>
>> *Arguments to keep as is*: (1) ITS 2.0 should not break ITS 1.0
>> (_cannot deprecate Terminology_) [/Mārcis: as explained by Felix, ITS
>> 2.0 does not necessarily have to be backwards compatible/], (2)
>> Terminology is simpler to produce/consume and _consumers of
>> Terminology should not be forced to move to a more complex
>> annotation_, (3) although Terminology and Disambiguation are
>> informally semantically related, for granularity and independent
>> implementability these should not be combined.
>>
>> *Opinion*: The relationship between Terminology and Disambiguation is
>> loose and should therefore be handled in a _best practices document_,
>> not in normative material.
>>
>> [/Mārcis: In example (A) of his e-mail, David described annotating
>> term candidates with the Disambiguation data category and, after
>> approval by a terminologist, converting them to the Terminology data
>> category ... if I understood it correctly. This is the complete
>> opposite of how we would create the process chain – we would use the
>> Terminology data category throughout: the terms would initially be
>> marked with the Terminology data category using term confidence, and
>> after approval the terminologist could link them to a term-bank
>> entry while remaining within the Terminology data category/]
>>
>>
>> Main ideas from Jörg Schütz (11.01.13 14:07):
>>
>> Agrees with David to keep separate Terminology and Disambiguation
>> data categories.
>>
>> *Concerns*: _ISOCat elements (or URIs)_ for “granularity” would force
>> applications to adopt NLP standards that _might not be appropriate
>> for a given application scenario_ [/Mārcis: Just a comment – we have
>> to understand what can be agreed upon by content providers/users
>> themselves and what needs to be prescribed in the specification, that
>> is, there is a question of why we should restrict users and prescribe
>> what can be annotated/disambiguated/].
>>
>> *Suggests*: do not bring ITS closer to NLP because it should remain
>> open and deployable for different language processing strategies
>> [/Mārcis: I do not understand what is meant by this recommendation;
>> I do not see which field/area of NLP causes an issue/]
>>
>>
>> Main ideas from Yves Savourel (Fri, 11 Jan 2013 10:36:41 -0700):
>>
>> Agrees with David and Jörg to keep separate Terminology and
>> Disambiguation data categories.
>>
>> *Concern/Suggestion*: The two data categories answer to different use
>> cases, so it would not be good to have a single solution for
>> different problems.
>>
>> *Arguments to keep both separated*: (1) Disambiguation is more
>> complex; we should not put extra burden on Terminology implementers,
>> (2) breaking large problems into smaller parts makes things easier
>> [/Mārcis: overlaps with 1 ... sort of/]
>>
>> [/Felix: Yves responded as a Terminology consumer/]
>>
>>
>> Main ideas from Felix Sasaki (Mon, 14 Jan 2013 19:34:44 +0100,
>> Tue, 15 Jan 2013 10:34:17 +0100, Tue, 15 Jan 2013 13:20:06
>> +0100, Tue, 15 Jan 2013 17:39:08 +0100)
>>
>> *Asks*: What is _the difference_ in terms _of producing the metadata_
>> for Terminology and Disambiguation [/Mārcis' opinion: Terminology is
>> simple, Disambiguation is painful, but in general – both do
>> annotation/]?
>>
>> *Opinion*: the Disambiguation output gives _background information
>> on_ what _resources_ have been used [/Mārcis: The Terminology does
>> not … at least not directly; also – I believe that the main task of
>> disambiguation is to define the meaning/semantics of the tagged
>> units, rather than listing what resources have been used in the
>> process of disambiguation/].
>>
>> *Analyses*: the mapping between Terminology data category data and
>> the Disambiguation data category data.
>>
>> *Suggests*: (1) _create guidance for producers of the metadata_,
>> related to different consumption scenarios [/Mārcis' comment:
>> shouldn’t it be the other way around – guidance for consumers?/], (2)
>> following analysis, _proposes mapping_ from Terminology data category
>> entries to the Disambiguation data category entries.
>>
>>
>> Main ideas from Mārcis Pinnis (Tue, 15 Jan 2013 09:55:59 +0200,
>> Tue, 15 Jan 2013 15:22:58 +0200)
>>
>> *Concerns*: (1) The _Disambiguation data category is very ambiguous_,
>> because (a) it lacks clear definitions for the separate granularity
>> levels; (b) it is unclear why and on what basis only the 3 given
>> granularity levels have been chosen and not more (for instance,
>> keyword annotation, syntactic annotation, etc.); (c) terminology is
>> not used consistently throughout the description, so it is difficult
>> to follow the specification. (2) _a phrase can be simultaneously_ a
>> term, a named entity, an entry in an ontology, and _many other
>> things_ for different application purposes (a keyword, a noun phrase,
>> a proper noun phrase, a client’s invented phrase, etc.), but the
>> _Disambiguation category does not allow a friendly way of annotating
>> multiple categories on one phrase_ (not even considering hierarchical
>> annotation, which is very common for named entities) – the usefulness
>> of the Disambiguation data category will be limited because its
>> metadata is difficult both to produce and to consume.
>> (3) There are _many different levels of disambiguation_ (most of them
>> driven by applications where the information is used), even simple
>> annotation of words and punctuation is disambiguation (of some sort).
>> _Where do we start counting the disambiguation and with what level_?
>> Should we even limit users to prescribed levels?
>>
>> *Suggests*: Keep the data categories separated, maybe even for all
>> three current “granularity” levels if they are required for
>> localisation, as the applications can differ.
>>
>> *Opinion*: the difference in the use cases has not been explained
>> clearly enough – if it were clear, the issues would be limited to
>> Disambiguation only...
>>
>> Best regards,
>>
>> Mārcis ;o)
>>
>> ------------------------------------------------------------------------
>> *From:* Felix Sasaki [fsasaki@w3.org]
>> *Sent:* Tuesday, 15 January 2013 18:39
>> *To:* Mārcis Pinnis
>> *Cc:* public-multilingualweb-lt-comments@w3.org
>> *Subject:* Re: Disambiguation and terminology producers (Re: issue-68
>> (Re: Comment on ITS 2.0 WD-its20-20121206 - Disambiguation (and term)))
>>
>> Hi Marcis,
>>
>> On 15.01.13 14:39, Mārcis Pinnis wrote:
>>> Computer software, or just software, is a collection of computer
>>> programs and related data that provides the instructions for telling
>>> a computer what to do and how to do it.
>> Great example, thanks a lot.
>>
>> I have run your example through the NERD API. The output is below.
>> Tadej, how would it look with Enrycher?
>>
>> [{"idEntity":170179,"label":"Computer software","startChar":0,"endChar":17,"nerdType":"http://nerd.eurecom.fr/ontology#Thing","uri":"http://en.wikipedia.com/wiki/Software","confidence":0.927371,"relevance":0.5,"extractor":"yahoo","startNPT":0.0,"endNPT":0.0},
>> {"idEntity":170180,"label":"computer programs","startChar":56,"endChar":73,"nerdType":"http://nerd.eurecom.fr/ontology#Thing","uri":"http://en.wikipedia.com/wiki/Computer_program","confidence":0.886778,"relevance":0.5,"extractor":"yahoo","startNPT":0.0,"endNPT":0.0},
>> {"idEntity":170181,"label":"collection","startChar":42,"endChar":52,"nerdType":"http://nerd.eurecom.fr/ontology#Thing","confidence":0.586448,"relevance":0.5,"extractor":"yahoo","startNPT":0.0,"endNPT":0.0}]
>>
>>
>> Below is the mapping NERD - ITS2 again:
>>
>> [
>> The mappings NERD - ITS2 "disambiguation" are:
>> - "nerdType" maps to "its-disambig-class-ref"
>> - "confidence" maps to "its-disambig-confidence"
>> - "uri" maps to "its-disambig-ident-ref"
>> ]
>>
>> I think your terminology annotations can easily be integrated in this
>> mapping:
>>
>> [
>> 1) "nerdType" maps to "its-disambig-class-ref"; there is no counterpart in the terminology annotation
>> 2) "confidence" maps to "its-disambig-confidence" and to termConfidence
>> 3) "uri" maps to "its-disambig-ident-ref" and to termInfoRef
>> 4) "itsDisambigGranularity" is not available in NERD or your terminology annotation system
>> ]
>>
>> So from the point of view of producers (= automatic annotation
>> tools), I think 1-3 could easily be integrated in one type of
>> annotation output.
>>
>> Best,
>>
>> Felix
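
For illustration only, here is a minimal Python sketch of the NERD - ITS2 mapping listed in the quoted message. The function names (nerd_to_its_attrs, to_span) are hypothetical, the sample data is a trimmed copy of the NERD output above, and the attribute names are taken from the mapping as quoted; none of this comes from the NERD API or any existing ITS tool.

```python
import json
from html import escape

# Trimmed copy of the NERD output quoted in the thread (only the fields used here).
NERD_OUTPUT = """[
  {"label": "Computer software", "startChar": 0, "endChar": 17,
   "nerdType": "http://nerd.eurecom.fr/ontology#Thing",
   "uri": "http://en.wikipedia.com/wiki/Software", "confidence": 0.927371},
  {"label": "collection", "startChar": 42, "endChar": 52,
   "nerdType": "http://nerd.eurecom.fr/ontology#Thing", "confidence": 0.586448}
]"""

# NERD field -> ITS 2.0 attribute, exactly as listed in the quoted mapping.
NERD_TO_ITS = {
    "nerdType": "its-disambig-class-ref",
    "confidence": "its-disambig-confidence",
    "uri": "its-disambig-ident-ref",
}


def nerd_to_its_attrs(entity):
    """Return the ITS attributes for one NERD entity, skipping absent fields."""
    return {
        its_name: str(entity[nerd_name])
        for nerd_name, its_name in NERD_TO_ITS.items()
        if nerd_name in entity
    }


def to_span(entity):
    """Render the entity label as an HTML span carrying the mapped attributes."""
    attrs = " ".join(
        f'{name}="{escape(value, quote=True)}"'
        for name, value in nerd_to_its_attrs(entity).items()
    )
    return f"<span {attrs}>{escape(entity['label'])}</span>"


entities = json.loads(NERD_OUTPUT)
for entity in entities:
    print(to_span(entity))
```

Entities without a "uri" (such as "collection" above) simply get no its-disambig-ident-ref attribute. Mapping the same fields to termConfidence and termInfoRef, as in points 2) and 3) of the quoted message, would only need a second lookup table.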
Attachments
- application/vnd.openxmlformats-officedocument.presentationml.presentation attachment: Merging_Terminology_and_Disambiguation.pptx
Received on Thursday, 24 January 2013 10:02:45 UTC