- From: Felix Sasaki <fsasaki@w3.org>
- Date: Tue, 15 Jan 2013 13:20:06 +0100
- To: Mārcis Pinnis <marcis.pinnis@Tilde.lv>
- CC: "public-multilingualweb-lt-comments@w3.org" <public-multilingualweb-lt-comments@w3.org>
- Message-ID: <50F54976.2060608@w3.org>
Hi all, Mārcis,

again, to move this forward, I have worked through an example, using the sentence

"Welcome to Dublin in Ireland!"

From Enrycher you will get an annotation like the one in this example:
http://www.w3.org/TR/2012/WD-its20-20121206/#EX-disambiguation-html5-local-1

From the NERD API, using the same sentence, you will get this JSON output:

[{"idEntity":169970,"label":"Dublin","startChar":0,"endChar":6,"extractorType":"CITY","nerdType":"http://nerd.eurecom.fr/ontology#Location","uri":"http://dbpedia.org/resource/Dublin","confidence":1.0,"relevance":0.5,"extractor":"extractiv","startNPT":0.0,"endNPT":0.0},
{"idEntity":169971,"label":"Ireland","startChar":25,"endChar":32,"extractorType":"COUNTRY","nerdType":"http://nerd.eurecom.fr/ontology#Location","uri":"http://dbpedia.org/resource/Ireland","confidence":1.0,"relevance":0.5,"extractor":"extractiv","startNPT":0.0,"endNPT":0.0}]

The mappings from NERD to ITS 2.0 "disambiguation" are:

- "nerdType" maps to "its-disambig-class-ref"
- "confidence" maps to "its-disambig-confidence"
- "uri" maps to "its-disambig-ident-ref"

So we have some interoperability across 11 tools (NERD is a broker for 10 annotation tools, plus Enrycher): their output is easy to map.

So the question, again focusing on production, not consumption: what do you, Mārcis, expect "your" automatic term annotation tool to produce for the example sentence "Welcome to Dublin in Ireland!"?

Best,

Felix

On 15.01.13 10:34, Felix Sasaki wrote:
> Hi Mārcis,
>
> thanks a lot for your detailed mail. I must however say that I don't see an answer to my question: "what is the difference in terms of producing the metadata?". The question was really focused on your implementation approach. I understand your consumption scenario and the terminology use case.
> But I assume that in your automatic term annotation implementation you apply the same linguistic processing pipeline as Tadej does, using basic analysis (tokenization, stemming, morphology etc.), then some resources (e.g. a lexicon) to define the type of "unit" (I'm saying unit to avoid "term" or "entity"). As I understand it, the disambiguation output gives background information on which resources have been used: an ontology like DBpedia, a lexicon like WordNet.
>
> See e.g. the NERD API
> http://nerd.eurecom.fr/documentation#nerdapi
> which gives you back a nerdType:
>
> [
>   {
>     idEntity: 120,
>     label: "BBC",
>     startChar: 138,
>     endChar: 141,
>     extractorType: "Company",
>     nerdType: "http://nerd.eurecom.fr/ontology#Organization",
>     uri: "http://dbpedia.org/resource/BBC",
>     confidence: 0.0582796,
>     relevance: 0.5,
>     extractor: "dbspotlight",
>     startNPT: 0,
>     endNPT: 0
>   },
>   ...
> ]
>
> I'm mentioning this API since in a sense it is the API counterpart to what we are standardizing with markup: it provides a JSON format as the output of annotation.
> So again you have a confidence field and a type - and I'd like to understand not what the difference is in your use case, but what it is in the implementation approach (see above). If the answer is "none", that is fine too, and it would give us a path to explain to users (both producers and consumers) how to deal with both use cases.
>
> Best,
>
> Felix
>
> On 15.01.13 08:55, Mārcis Pinnis wrote:
>>
>> Hi Felix,
>>
>> Terminology from a practical standpoint identifies concepts (often also common term phrases - concepts in multi-word phrases) commonly found in a specific domain (subject field) and infrequently found or not found at all in general language. That is the purpose of the Terminology data category.
>> It should identify domain-specific terms (or possible term-candidates with a confidence score when automated annotation is performed) and, if possible, link the identified terms with entries in a term-base.
>>
>> I am not that familiar with the Disambiguation data category and its history, but the question is, what is the main goal of the Disambiguation data category (whom is it meant for and who will provide data for it?)? Should it identify or try sorting out ambiguities in any type of phrases (no matter - terms, named entities, general language, etc.)? Then - have we identified all types if we have only three "granularities"? Then also - a phrase can actually simultaneously belong to all "granularities" (I think the naming does not reflect the meaning correctly) depending on a client, which I guess makes it difficult for content providers to create reasonable mark-up (that is one reason why I would prefer not using Disambiguation).
>>
>> In my opinion, such different content mark-ups - (terms as concepts (also - what is meant with lexical-concept is not explained (and how does that overlap with what is a term?)!!!), named entities as concept instances (the main difference between understanding what is a term and what is a named entity - however a named entity in many cases can be also a term - for instance for the term "weapon" suitable named entities may very well be: "knife", "gun", "axe", however, is "knife" a named entity or is it a term? It can be both! Take a look at the biggest classification table of named entities: http://nlp.cs.nyu.edu/ene/version7_1_0Beng.html - under the category Product you may probably find many things that you may have not thought to be named entities?!) and ontology-concept (which may very well overlap with the previous two...)) - should not be mixed together!
>>
>> It is hard to understand the reasoning behind the different "granularity" levels also because of lacking definitions and it is not clear why such different data types should at all be mixed together in one category.
>>
>> From that aspect, I would prefer separate data categories for all three "granularities" (if necessary; although I do not particularly like the naming here) as the applications for all these may be quite different.
>>
>> From an implementer's and content provider's viewpoint I prefer the Terminology data category as its purpose is clear and it is also clear what is meant with the annotation - it is clearly identifiable and mark-up can be easily applied (which is not the case with the Disambiguation data category).
>>
>> I hope I did not make things more confusing?! I wanted to raise the point that the disambiguation data category itself is quite ambiguous (at least to me).
>>
>> Best regards,
>>
>> Mārcis ;o)
>>
>> -----Original Message-----
>> From: Felix Sasaki [mailto:fsasaki@w3.org]
>> Sent: Monday, January 14, 2013 8:35 PM
>> To: public-multilingualweb-lt-comments@w3.org
>> Cc: Mārcis Pinnis
>> Subject: Disambiguation and terminology producers (Re: issue-68 (Re: Comment on ITS 2.0 WD-its20-20121206 - Disambiguation (and term)))
>>
>> Hi all, esp. Tadej and Mārcis (FYI, it might be helpful for you to subscribe to the public-multilingualweb-lt-comments@w3.org list),
>>
>> Yves has responded from the point of view of a consumer. Now it would be interesting to understand: what is the difference in terms of producing the metadata?
>>
>> Is in essence the process for creating
>>
>> <span its:term="yes" its:termConfidence="0.98">screwdriver</span>
>>
>> the same as creating
>>
>> <span its:disambigSource="mywordnet" its:disambigIdent="474646"
>> its:disambigGranularity="lexical-concept"
>> its:disambigConfidence="0.98">screwdriver</span>
>>
>> with the only difference that in the case of terminology, information is left out (Source, Ident, Granularity) and there is different naming for attributes (termConfidence vs. disambigConfidence)?
>>
>> This would mean that we could create some guidance for producers of the metadata, related to different consumption scenarios.
>>
>> Best,
>>
>> Felix
>>
>> On 14.01.13 18:54, Lieske, Christian wrote:
>> > Hi David, Jörg, Felix, all,
>> >
>> > It's great to see timely replies to this comment.
>> >
>> > It would indeed be valuable - as indicated by Felix - to get comments from additional angles.
>> >
>> > Cheers,
>> > Christian
>> >
>> > -----Original Message-----
>> > From: Felix Sasaki [mailto:fsasaki@w3.org]
>> > Sent: Friday, 11 January 2013 18:17
>> > To: public-multilingualweb-lt-comments@w3.org
>> > Subject: issue-68 (Re: Comment on ITS 2.0 WD-its20-20121206 - Disambiguation (and term))
>> >
>> > All (co-chair hat on),
>> >
>> > thank you for this discussion. General remark: as explained at
>> > http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Jan/0045.html
>> > please add the issue number to the mail subject. Otherwise it will be very hard to track discussions.
>> >
>> > It would now be interesting to hear from the implementers: according to
>> > http://tinyurl.com/its20-testsuite-dashboard
>> > Enlaso, Tilde and UL will implement terminology. As I understand it, UL will make a wrapper around the Enlaso / Okapi engine, correct?
>> > Now, for Disambiguation we have Enlaso, JSI, Moravia and UL. Here I *think* that Moravia and UL will basically have an Okapi wrapper. Please correct me if I'm wrong.
>> >
>> > This leaves us with the following situation:
>> > - two implementations for terminology (Enlaso and Tilde)
>> > - two for disambiguation (Enlaso and JSI)
>> >
>> > So Mārcis, Tadej, Yves - what do you think about this proposal?
>> >
>> > I'm asking this also since I have to remind people about the W3C process:
>> >
>> > (W3C process hat on) We cannot just say "we don't like a comment". There need to be good reasons to reject it. The argumentation below can support the rejection, but the rejection is rather weak if implementers don't have an opinion or would even say "I would do the change". So please express your thoughts in this thread.
>> >
>> > Best,
>> >
>> > Felix
>> >
>> > On 11.01.13 14:07, Jörg Schütz wrote:
>> >> +1
>> >>
>> >> Hi Christian, David, and all,
>> >>
>> >> I would have similar arguments for keeping term and disambiguation separate although they are related. There are several use cases out there in the wild that need this kind of separation, e.g. terminology-based workflows in a particular supply chain vs. data stream analyses which prepare the data for further treatment such as a machine translation application (vocabulary support and training/tuning life cycles).
>> >>
>> >> One other topic is the discussion of the ISOCat elements, which to some extent would force applications to adopt an NLP standard that might not be appropriate for a given application scenario, e.g. those that do not use NLP technologies at all.
>> >> Therefore, I would also recommend that we do not talk about bringing ITS closer to NLP because ITS should remain open and deployable for different language processing strategies.
>> >>
>> >> Nevertheless, thanks a lot for raising these concerns.
>> >>
>> >> All the best -- Jörg
>> >>
>> >> On Jan 11, 2013, at 12:22 (CET), Dr. David Filip wrote:
>> >>> Dear Christian, thanks for this insightful comment.
>> >>> I agree that the disambiguation category is one of the most important additions that can expand the usage of the standard and become more useful across technologies and industries.
>> >>>
>> >>> The group has discussed this, and it is clear that disambiguation and term are somehow related categories. We have however not considered deprecation of the ITS 1.0 term, at least not explicitly.
>> >>>
>> >>> I believe that this is given by the chartered principles of the group [paraphrasing]
>> >>> 1) Do not break 1.0
>> >>> 2) Keep the 1.0 principle of independent categories that can also be independently implemented.
>> >>>
>> >>> I believe that your proposal to fuse term and disambiguation is in line with 2) in the sense of making two seemingly interdependent categories into one fully self-contained and independent category, but would violate 1).
>> >>>
>> >>> But even if we did not care for 1), I believe that the relationship between term and disambiguation is a reasonably loose one, i.e. not a hard formal interdependency that would warrant or even mandate normative handling, and thus can and should be handled in non-normative material such as a best practice document, while we are keeping both categories, because they have discernible use cases and still can be implemented independently.
>> >>>
>> >>> A)
>> >>> A user that uses both a terminology management system and a text analytics system for disambiguation can reasonably combine them and their combination can be driven by organization specific process driven considerations. They can for instance harvest spans marked as disambiguation as term candidates for their Terminology database and these can be encoded as terms next time if e.g. a terminologist approves them as terms.
>> >>>
>> >>> B)
>> >>> People using text analytics input only do not need to care about term.
>> >>>
>> >>> C)
>> >>> People using terminology management as the only source do not need to bother with complexities of the disambiguation category.
>> >>>
>> >>> To summarize:
>> >>> While many ITS categories, and prominently term and disambiguation, are informally semantically related, it seems important to keep a reasonable and manageable granularity of the independently implementable categories.
>> >>>
>> >>> I hope this helps to understand the group's motivation for keeping the categories apart.
>> >>> Please let me know
>> >>> Rgds
>> >>> dF
>> >>>
>> >>> Dr.
>> >>> David Filip
>> >>> =======================
>> >>> LRC | CNGL | LT-Web | CSIS
>> >>> University of Limerick, Ireland
>> >>> telephone: +353-6120-2781
>> >>> *cellphone: +353-86-0222-158*
>> >>> facsimile: +353-6120-2734
>> >>> mailto: david.filip@ul.ie
>> >>>
>> >>>
>> >>> On Thu, Jan 10, 2013 at 9:14 AM, Lieske, Christian <christian.lieske@sap.com> wrote:
>> >>>
>> >>>     Hi,
>> >>>
>> >>>     Please find below comments/observations/questions/ideas concerning the ITS 2.0 working draft dated December 6, 2012 (http://www.w3.org/TR/2012/WD-its20-20121206/). Please feel free to contact me for clarifications if anything is unclear.
>> >>>
>> >>>     The section related to the “disambiguation” data category to me is one of the most important ones of the draft. ITS 2.0 from my point-of-view moves ITS 1.0 closer to Natural Language Processing (NLP), and “disambiguation” to me is related to NLP in various ways. Thus, making “disambiguation” powerful and easy to use (e.g. via a clear distinction to other data categories, as well as conceptualizations and wording that are not just known within linguistics) seems important to me.
>> >>>
>> >>>     While looking at “disambiguation” from this angle, I started to wonder if it could benefit from additions/modifications.
>> >>>     I apologize in advance if replying to this comment requires summarizing discussions that presumably already took place.
>> >>>
>> >>>     Here are my observations/questions/ideas:
>> >>>
>> >>>     a. I sense that ITS users will have difficulty deciding when to use “term” and when to use “disambiguation” (the note in the Working Draft indicates this).
>> >>>
>> >>>     b. Annotation of known terms, generation of so-called “term candidates”, (named) entity recognition, and other automation can be subsumed under the heading “(automated) text analysis”.
>> >>>
>> >>>     I am thus wondering if the following would be worth considering:
>> >>>
>> >>>     1. Enhance the current “disambiguation” so that also the current “term” can be covered
>> >>>
>> >>>     2. Deprecate “term”
>> >>>
>> >>>     3. Revise some of the terminology used in the spec (e.g. “disambiguation”, “disambigGranularity”)
>> >>>
>> >>>     An example use of a revised “disambiguation” (and deprecated “term”), partially inspired by ISOCat (see http://www.isocat.org/), is the following:
>> >>>
>> >>>     Data category name: (automated) text analysis annotation (atan/tan); using “text analysis annotation” would have the advantage that even manual work (e.g.
>> >>>     “promoting a term candidate to a term”) could be covered
>> >>>
>> >>>     Data category “qualifier” (currently “disambigGranularity”): atan-type or tan-type
>> >>>
>> >>>     Values for “qualifier”: lexical, term, termCandidate, ontological-class, ontological-entity; possibly even URIs such as http://www.isocat.org/datcat/DC-2275 - would allow rather fine-grained and under certain provisions standard-conformant (ISO 12620; see http://www.ttt.org/clsframe/datcats.html) annotation
>> >>>
>> >>>     Example:
>> >>>
>> >>>     <span
>> >>>       its-tan-confidence="0.7"
>> >>>       its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>> >>>       its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>> >>>       its-tan-type="http://www.isocat.org/datcat/DC-2275">Dublin</span>
>> >>>
>> >>>     Cheers,
>> >>>
>> >>>     Christian
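
Editorial sketch of the NERD-to-ITS 2.0 mapping given in the top message of this thread ("nerdType", "confidence" and "uri" mapped to the "its-disambig-*" attributes). It is only an illustration under stated assumptions, not the output of Enrycher, NERD or any tool named in the thread. The function names are hypothetical, the JSON is trimmed to the four fields the mapping needs, and entities are assumed to arrive in document order. Because the startChar/endChar values in the quoted NERD output do not line up with the sentence as quoted, the sketch locates each entity by searching for its label rather than trusting the offsets.

```python
import json
from html import escape

# Mapping stated in the thread: NERD JSON field -> ITS 2.0 HTML5 attribute.
NERD_TO_ITS = {
    "nerdType": "its-disambig-class-ref",
    "confidence": "its-disambig-confidence",
    "uri": "its-disambig-ident-ref",
}

SENTENCE = "Welcome to Dublin in Ireland!"

# NERD output from the thread, trimmed to the fields used by the mapping.
NERD_OUTPUT = """
[{"label": "Dublin", "nerdType": "http://nerd.eurecom.fr/ontology#Location",
  "uri": "http://dbpedia.org/resource/Dublin", "confidence": 1.0},
 {"label": "Ireland", "nerdType": "http://nerd.eurecom.fr/ontology#Location",
  "uri": "http://dbpedia.org/resource/Ireland", "confidence": 1.0}]
"""


def its_attributes(entity):
    """Render the mapped fields of one NERD entity as HTML attributes."""
    parts = []
    for nerd_key, its_name in NERD_TO_ITS.items():
        if nerd_key in entity:
            value = escape(str(entity[nerd_key]), quote=True)
            parts.append(f'{its_name}="{value}"')
    return " ".join(parts)


def annotate(text, entities):
    """Wrap each entity label found in `text` in a <span> with ITS attributes.

    Assumption: entities are listed in document order; each label is located
    by string search starting after the previous match.
    """
    out = []
    pos = 0
    for entity in entities:
        start = text.find(entity["label"], pos)
        if start < 0:
            continue  # label not found in the text: skip this entity
        end = start + len(entity["label"])
        out.append(escape(text[pos:start], quote=False))
        label = escape(text[start:end], quote=False)
        out.append(f"<span {its_attributes(entity)}>{label}</span>")
        pos = end
    out.append(escape(text[pos:], quote=False))
    return "".join(out)


html_out = annotate(SENTENCE, json.loads(NERD_OUTPUT))
print(html_out)
```

A term annotation tool could reuse the same loop with a different attribute table, which is one way to read the question in the thread of whether producing "term" and "disambiguation" markup differs only in which fields are emitted.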
Received on Tuesday, 15 January 2013 12:20:35 UTC