Re: Disambiguation and terminology producers (Re: issue-68 (Re: Comment on ITS 2.0 WD-its20-20121206 - Disambiguation (and term)))

Hi all, Mārcis again,

to move this forward, I have worked through an example. For the sentence
"Welcome to Dublin in Ireland!"
Enrycher gives you an annotation like the one in example
http://www.w3.org/TR/2012/WD-its20-20121206/#EX-disambiguation-html5-local-1
From the NERD API, using the same sentence, you get this JSON output:

[{"idEntity":169970,"label":"Dublin","startChar":0,"endChar":6,"extractorType":"CITY","nerdType":"http://nerd.eurecom.fr/ontology#Location","uri":"http://dbpedia.org/resource/Dublin","confidence":1.0,"relevance":0.5,"extractor":"extractiv","startNPT":0.0,"endNPT":0.0},{"idEntity":169971,"label":"Ireland","startChar":25,"endChar":32,"extractorType":"COUNTRY","nerdType":"http://nerd.eurecom.fr/ontology#Location","uri":"http://dbpedia.org/resource/Ireland","confidence":1.0,"relevance":0.5,"extractor":"extractiv","startNPT":0.0,"endNPT":0.0}]

The mappings NERD - ITS2 "disambiguation" are:
- "nerdType" maps to "its-disambig-class-ref"
- "confidence" maps to "its-disambig-confidence"
- "uri" maps to "its-disambig-ident-ref"
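
To make the mapping concrete, here is a minimal sketch in Python. It is 
only an illustration under assumptions: the helper name 
nerd_entity_to_span is made up, the sample is the JSON above trimmed to 
the mapped fields, and the sketch ignores startChar/endChar and simply 
wraps the label. It is not how NERD or Enrycher actually serialize 
their output.

```python
import json
from html import escape

# Mapping as stated in this mail: NERD JSON field -> ITS 2.0 HTML attribute.
NERD_TO_ITS = {
    "nerdType": "its-disambig-class-ref",
    "confidence": "its-disambig-confidence",
    "uri": "its-disambig-ident-ref",
}


def nerd_entity_to_span(entity):
    """Render one NERD entity as an HTML span (hypothetical helper)."""
    attrs = " ".join(
        f'{its_name}="{escape(str(entity[nerd_key]), quote=True)}"'
        for nerd_key, its_name in NERD_TO_ITS.items()
        if nerd_key in entity  # skip fields a tool did not return
    )
    return f"<span {attrs}>{escape(entity['label'])}</span>"


# The NERD output above, trimmed to the fields that take part in the mapping.
nerd_output = json.loads("""
[{"label": "Dublin",
  "nerdType": "http://nerd.eurecom.fr/ontology#Location",
  "uri": "http://dbpedia.org/resource/Dublin",
  "confidence": 1.0},
 {"label": "Ireland",
  "nerdType": "http://nerd.eurecom.fr/ontology#Location",
  "uri": "http://dbpedia.org/resource/Ireland",
  "confidence": 1.0}]
""")

spans = [nerd_entity_to_span(entity) for entity in nerd_output]
for span in spans:
    print(span)
```

The point of the sketch is that the producer side is a plain field 
rename: no extra linguistic information is needed beyond what the tool 
already returns.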

So we have some interoperability across 11 tools (NERD is a broker for 10 
annotation tools, plus Enrycher): they all produce output that is easy to map.

So the question - again focusing on production, not consumption: what do 
you, Mārcis, expect "your" automatic term annotation tool to produce for 
the example sentence "Welcome to Dublin in Ireland!" ?

Best,

Felix

On 15.01.13 10:34, Felix Sasaki wrote:
> Hi Mārcis,
>
> thanks a lot for your detailed mail. I must however say that I don't 
> see an answer to my question: "what is the difference in terms of 
> producing the metadata?". The question was really focused on your 
> implementation approach. I understand your consumption scenario and 
> the terminology use case. But I assume that in your automatic term 
> annotation implementation you apply the same linguistic processing 
> pipeline as Tadej does, using basic analysis (tokenization, stemming, 
> morphology etc.), then some resources (e.g. a lexicon) to define the 
> type of "unit" (I'm saying unit to avoid "term" or "entity"). As I 
> understand it, the disambiguation output gives background information 
> about which resources have been used: an ontology like DBpedia, a 
> lexicon like WordNet.
>
> See e.g. the NERD API
> http://nerd.eurecom.fr/documentation#nerdapi
> that gives you back a nerdType
> [
>   {
>     idEntity: 120,
>     label: "BBC",
>     startChar: 138,
>     endChar: 141,
>     extractorType: "Company",
>     nerdType: "http://nerd.eurecom.fr/ontology#Organization",
>     uri: "http://dbpedia.org/resource/BBC",
>     confidence: 0.0582796,
>     relevance: 0.5,
>     extractor: "dbspotlight",
>     startNPT: 0,
>     endNPT: 0
>   },
>   ...
> ]
>
> I'm mentioning this API since in a sense it is the API counterpart to 
> what we are standardizing with markup: it provides a JSON format as 
> the output of annotation.
> So again you have a confidence field and type - and I'd like to 
> understand not what the difference is in your use case, but in the 
> implementation approach (see above)? If the answer is "none", that is 
> fine too, and it would give us a path to explain to users (both 
> producers and consumers) how to deal with both use cases.
>
> Best,
>
> Felix
>
> On 15.01.13 08:55, Mārcis Pinnis wrote:
>>
>> Hi Felix,
>>
>> Terminology from a practical standpoint identifies concepts (/often 
>> also common term phrases - concepts in multi-word phrases/) commonly 
>> found in a specific domain (subject field) and infrequently found or 
>> not found at all in a general language. That is the purpose of the 
>> Terminology data category. It should identify domain-specific terms 
>> (/or possible term-candidates with a confidence score when automated 
>> annotation is performed/) and, if possible, link the identified terms 
>> with entries in a term-base.
>>
>> I am not that familiar with the Disambiguation data category and its 
>> history, but the question is, what is the main goal of the 
>> Disambiguation data category (/whom is it meant for and who will 
>> provide data for it?/)? Should it identify or try sorting out 
>> ambiguities in any type of phrases (/no matter - terms, named 
>> entities, general language, etc./)? Then - have we identified all 
>> types if we have only three "granularities"? Then also - a phrase can 
>> actually simultaneously belong to all "granularities" (/I think the 
>> naming does not reflect the meaning correctly/) depending on a 
>> client, which I guess makes it difficult for content providers to 
>> create reasonable mark-up (/that is one reason why I would prefer not 
>> using Disambiguation/).
>>
>> In my opinion, such different kinds of content mark-up should not be 
>> mixed together:
>> - terms as concepts (what is meant by lexical-concept is also not 
>>   explained, nor how it overlaps with what a term is);
>> - named entities as concept instances. This is the main difference 
>>   between a term and a named entity, but in many cases a named entity 
>>   can also be a term. For instance, for the term "weapon", suitable 
>>   named entities may very well be "knife", "gun", "axe". But is 
>>   "knife" a named entity or is it a term? It can be both! Take a look 
>>   at the biggest classification table of named entities, 
>>   http://nlp.cs.nyu.edu/ene/version7_1_0Beng.html : under the category 
>>   Product you will probably find many things that you would not have 
>>   thought to be named entities;
>> - ontology-concept (which may very well overlap with the previous two).
>>
>> It is hard to understand the reasoning behind the different 
>> "granularity" levels, also because definitions are lacking, and it is 
>> not clear why such different data types should be mixed together in 
>> one category at all.
>>
>> From that aspect, I would prefer separate data categories for all 
>> three "granularities" (if necessary; although I do not particularly 
>> like the naming here) as the applications for all these may be quite 
>> different.
>>
>> From an implementer's and content provider's viewpoint I prefer the 
>> Terminology data category as its purpose is clear and it is also 
>> clear what is meant with the annotation - it is clearly identifiable 
>> and mark-up can be easily applied (which is not the case with the 
>> Disambiguation data category).
>>
>> I hope I did not make things more confusing?! I wanted to raise the 
>> point that the disambiguation data category itself is quite ambiguous 
>> (at least to me).
>>
>> Best regards,
>>
>> Mārcis ;o)
>>
>> -----Original Message-----
>> From: Felix Sasaki [mailto:fsasaki@w3.org]
>> Sent: Monday, January 14, 2013 8:35 PM
>> To: public-multilingualweb-lt-comments@w3.org
>> Cc: Mārcis Pinnis
>> Subject: Disambiguation and terminology producers (Re: issue-68 (Re: 
>> Comment on ITS 2.0 WD-its20-20121206 - Disambiguation (and term)))
>>
>> Hi all, esp. Tadej and Mārcis (FYI, it might be helpful for you to 
>> subscribe to the public-multilingualweb-lt-comments@w3.org list),
>>
>> Yves has responded from the point of view of a consumer. Now it would 
>> be interesting to understand: what is the difference in terms of 
>> producing the metadata?
>>
>> Is in essence the process for creating
>>
>> <span its:term="yes" its:termConfidence="0.98">screwdriver</span>
>>
>> the same as creating
>>
>> <span its:disambigSource="mywordnet" its:disambigIdent="474646"
>>       its:disambigGranularity="lexical-concept"
>>       its:disambigConfidence="0.98">screwdriver</span>
>>
>> with the only difference that in the case of terminology, information 
>> is left out (Source, Ident, Granularity) and there is different 
>> naming for attributes (termConfidence vs. disambigConfidence)?
>>
>> This would mean that we could create some guidance for producers of 
>> the metadata, related to different consumption scenarios.
>>
>> Best,
>>
>> Felix
>>
>> On 14.01.13 18:54, Lieske, Christian wrote:
>>
>> > Hi David, Jörg, Felix, all,
>> >
>> > It's great to see timely replies to this comment.
>> >
>> > It would indeed be valuable - as indicated by Felix - to get
>> > comments from additional angles.
>> >
>> > Cheers,
>> > Christian
>> >
>> > -----Original Message-----
>> > From: Felix Sasaki [mailto:fsasaki@w3.org]
>> > Sent: Friday, 11 January 2013 18:17
>> > To: public-multilingualweb-lt-comments@w3.org
>> > Subject: issue-68 (Re: Comment on ITS 2.0 WD-its20-20121206 -
>> > Disambiguation (and term))
>> >
>> > All (co-chair hat on),
>> >
>> > thank you for this discussion. General remark: as explained at
>> > http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Jan/0045.html
>> > please add the issue number to the mail subject. Otherwise
>> > it will be very hard to track discussions.
>> >
>> > It would now be interesting to hear from the implementers: according to
>> > http://tinyurl.com/its20-testsuite-dashboard
>> > Enlaso, Tilde and UL will implement terminology. As I understand it,
>> > UL will make a wrapper around the Enlaso / Okapi engine, correct?
>> > Now, for Disambiguation we have Enlaso, JSI, Moravia and UL. Here I
>> > *think* that Moravia and UL will basically have an Okapi wrapper.
>> > Please correct me if I'm wrong.
>> >
>> > This leaves us with the following situation:
>> > - two implementations for terminology (Enlaso and Tilde)
>> > - two for disambiguation (Enlaso and JSI)
>> >
>> > So Mārcis, Tadej, Yves - what do you think about this proposal?
>> >
>> > I'm asking this also since I have to remind people about the W3C
>> > process:
>> >
>> > (W3C process hat on) We cannot just say "we don't like a comment".
>> > There need to be good reasons to reject it. The argumentation below
>> > can support the rejection, but the rejection is rather weak if
>> > implementers don't have an opinion or would even say "I would do the
>> > change". So please express your thoughts in this thread.
>> >
>> > Best,
>> >
>> > Felix
>> >
>>
>> > On 11.01.13 14:07, Jörg Schütz wrote:
>> >> +1
>> >>
>> >> Hi Christian, David, and all,
>> >>
>> >> I would have similar arguments for keeping term and disambiguation
>> >> separate although they are related. There are several use cases out
>> >> there in the wild that need this kind of separation, e.g.
>> >> terminology-based workflows in a particular supply chain vs. data
>> >> stream analyses which prepare the data for further treatment such as
>> >> a machine translation application (vocabulary support and
>> >> training/tuning life cycles).
>> >>
>> >> One other topic is the discussion of the ISOCat elements which to
>> >> some extent would force applications to adopt an NLP standard that
>> >> might not be appropriate for a given application scenario, e.g. those
>> >> that do not use NLP technologies at all. Therefore, I would also
>> >> recommend that we do not talk about bringing ITS closer to NLP
>> >> because ITS should remain open and deployable for different language
>> >> processing strategies.
>> >>
>> >> Nevertheless, thanks a lot for raising these concerns.
>> >>
>> >> All the best -- Jörg
>> >>
>>
>> >> On Jan 11, 2013, at 12:22 (CET), Dr. David Filip wrote:
>> >>> Dear Christian, thanks for this insightful comment.
>> >>> I agree that the disambiguation category is one of the most
>> >>> important additions, one that can expand the usage of the standard
>> >>> and make it more useful across technologies and industries.
>> >>>
>> >>> The group has discussed this, and it is clear that disambiguation
>> >>> and term are somehow related categories. We have however not
>> >>> considered deprecation of the ITS 1.0 term, at least not explicitly.
>> >>>
>> >>> I believe that this is given by the chartered principles of the
>> >>> group [paraphrasing]
>> >>> 1) Do not break 1.0
>> >>> 2) Keep the 1.0 principle of independent categories that can also be
>> >>> independently implemented.
>> >>>
>> >>> I believe that your proposal to fuse term and disambiguation is
>> >>> in line with 2) in the sense of making two seemingly interdependent
>> >>> categories into one fully self-contained and independent category,
>> >>> but would violate 1).
>> >>>
>> >>> But even if we did not care for 1), I believe that the relationship
>> >>> between term and disambiguation is a reasonably loose one, i.e. not
>> >>> a hard formal interdependency that would warrant or even mandate
>> >>> normative handling. It thus can and should be handled in
>> >>> non-normative material such as a best practice document, while we
>> >>> keep both categories, because they have discernible use cases
>> >>> and still can be implemented independently.
>> >>>
>> >>> A)
>> >>> A user that uses both a terminology management system and a text
>> >>> analytics system for disambiguation can reasonably combine them, and
>> >>> their combination can be driven by organization-specific,
>> >>> process-driven considerations. They can for instance harvest spans
>> >>> marked as disambiguation as term candidates for their Terminology
>> >>> database, and these can be encoded as terms next time if e.g. a
>> >>> terminologist approves them as terms.
>> >>>
>> >>> B)
>> >>> People using text analytics input only do not need to care about
>> >>> term.
>> >>>
>> >>> C)
>> >>> People using terminology management as the only source do not need
>> >>> to bother with complexities of the disambiguation category.
>> >>>
>> >>> To summarize:
>> >>> While many ITS categories, and prominently term and disambiguation,
>> >>> are informally semantically related, it seems important to keep a
>> >>> reasonable and manageable granularity of the independently
>> >>> implementable categories.
>> >>>
>> >>> I hope this helps to understand the group's motivation for keeping
>> >>> the categories apart.
>> >>> Please let me know
>> >>> Rgds
>> >>> dF
>> >>>
>> >>> Dr. David Filip
>> >>> =======================
>> >>> LRC | CNGL | LT-Web | CSIS
>> >>> University of Limerick, Ireland
>> >>> telephone: +353-6120-2781
>> >>> cellphone: +353-86-0222-158
>> >>> facsimile: +353-6120-2734
>> >>> mailto: david.filip@ul.ie
>> >>>
>>
>> >>> On Thu, Jan 10, 2013 at 9:14 AM, Lieske, Christian
>> >>> <christian.lieske@sap.com> wrote:
>> >>>
>> >>>      Hi,
>> >>>
>> >>>      Please find below comments/observations/questions/ideas
>> >>>      concerning the ITS 2.0 working draft dated December 6, 2012
>> >>>      (http://www.w3.org/TR/2012/WD-its20-20121206/). Please feel
>> >>>      free to contact me for clarifications if anything is unclear.
>> >>>
>> >>>      The section related to the “disambiguation” data category to
>> >>>      me is one of the most important ones of the draft. ITS 2.0
>> >>>      from my point-of-view moves ITS 1.0 closer to Natural Language
>> >>>      Processing (NLP), and “disambiguation” to me is related to NLP
>> >>>      in various ways. Thus, making “disambiguation” powerful and
>> >>>      easy to use (e.g. via a clear distinction to other data
>> >>>      categories, as well as conceptualizations and wording that are
>> >>>      not just known within linguistics) seems important to me.
>> >>>
>> >>>      While looking at “disambiguation” from this angle, I started
>> >>>      to wonder if it could benefit from additions/modifications. I
>> >>>      apologize in advance if a reply to this comment may require
>> >>>      that discussions which presumably already took place may have
>> >>>      to be summarized.
>> >>>
>> >>>      Here are my observations/questions/ideas:
>> >>>
>> >>>      a. I sense that ITS users will have difficulties to decide
>> >>>      when to use “term” and when to use “disambiguation” (the note
>> >>>      in the Working Draft indicates this).
>> >>>
>> >>>      b. Annotation of known terms, generation of so-called “term
>> >>>      candidates”, (named) entity recognition, and other automation
>> >>>      can be subsumed under the heading “(automated) text analysis”.
>> >>>
>> >>>      I am thus wondering if the following would be worth
>> >>>      considering:
>> >>>
>> >>>      1. Enhance the current “disambiguation” so that also the
>> >>>      current “term” can be covered
>> >>>
>> >>>      2. Deprecate “term”
>> >>>
>> >>>      3. Revising some of the terminology used in the spec (e.g.
>> >>>      “disambiguation”, “disambigGranularity”)
>> >>>
>> >>>      An example use of a revised “disambiguation” (and deprecated
>> >>>      “term”) – partially inspired by ISOCat (see
>> >>>      http://www.isocat.org/ ) – is the following:
>> >>>
>> >>>      Data category name: (automated) text analysis annotation
>> >>>      (atan/tan); using “text analysis annotation” would have the
>> >>>      advantage that even manual work (e.g. “promoting a term
>> >>>      candidate to a term”) could be covered
>> >>>
>> >>>      Data category “qualifier” (currently “disambigGranularity”):
>> >>>      atan-type or tan-type
>> >>>
>> >>>      Values for “qualifier”: lexical, term, termCandidate,
>> >>>      ontological-class, ontological-entity; possibly even URIs such
>> >>>      as http://www.isocat.org/datcat/DC-2275 - would allow rather
>> >>>      fine-grained and under certain provisions standard-conformant
>> >>>      (ISO 12620; see http://www.ttt.org/clsframe/datcats.html)
>> >>>      annotation
>> >>>
>> >>>      Example:
>> >>>
>> >>>      <span
>> >>>        its-tan-confidence="0.7"
>> >>>        its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>> >>>        its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>> >>>        its-tan-type="http://www.isocat.org/datcat/DC-2275">Dublin</span>
>> >>>
>> >>>      Cheers,
>> >>>
>> >>>      Christian
>> >>>
>> >
>>
>

Received on Tuesday, 15 January 2013 12:20:35 UTC