- From: Christian Chiarcos <christian.chiarcos@gmail.com>
- Date: Wed, 14 Jul 2021 13:06:18 +0200
- To: public-cogai <public-cogai@w3.org>
- Message-ID: <CAC1YGdj1uicjzJ8cPjjx1e-DLGMSq4_aC1GcFa0hLU=15weeiw@mail.gmail.com>
Hi Dave, dear all,

Apologies for not following up too closely. I've been having some administrative trouble for a few months, and until that is resolved I'll mostly switch to lurking mode. (Well, I already have.)

Mainstream NLP researchers consider Word Sense Disambiguation a hard but largely artificial problem: there is too little agreement on sense definitions across resources, and too few sense-annotated resources are available to apply machine learning in a meaningful way.

The classical Lesk algorithm seems reminiscent of your ideas, and it works nicely -- as long as the examples and definitions provided in the sense inventory are sufficiently representative (which they are not). Anyway, you might want to replicate Lesk as a proof of principle. It's still considered a seminal work: https://dl.acm.org/doi/10.1145/318723.318728. It uses word overlap and therefore suffers from data sparsity.

A more modern approach in Lesk's spirit would probably be to induce embeddings for word senses (cf. https://aclanthology.org/P15-1173/, where word senses are called "lexemes") and then compare them with the (aggregate) context embeddings. This operates on word embeddings; I'm not sure how to scale it to contextualized embeddings such as those produced by BERT etc. -- BERT would be great for deriving "real" sense embeddings if we had a significant corpus annotated for word senses. Well, we don't really have that. (OntoNotes [ https://catalog.ldc.upenn.edu/LDC2013T19 ] is the closest thing, but they had to simplify WordNet sense distinctions in order to annotate them reliably.)

As for cognitive plausibility, Lesk isn't incremental, so its way of processing differs from what humans do. But the underlying mechanism follows an intuition similar to yours, and it could be made incremental simply by looking only at the preceding context.
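To make the idea concrete, here is a minimal sketch of the simplified Lesk heuristic: pick the sense whose gloss shares the most words with the surrounding context. The tiny "bank" inventory below is invented for illustration; a real run would use glosses from WordNet or another sense inventory.

```python
# Simplified Lesk: choose the sense whose gloss has maximal
# word overlap with the context. Toy inventory, for illustration only.

def simplified_lesk(word, context, sense_inventory):
    """Return the sense id whose gloss overlaps most with the context."""
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense_id, gloss in sense_inventory[word].items():
        overlap = len(context_words & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense_id, overlap
    return best_sense

inventory = {
    "bank": {
        "bank.n.01": "a financial institution that accepts deposits and lends money",
        "bank.n.02": "sloping land beside a body of water such as a river",
    }
}

sense = simplified_lesk(
    "bank", "he sat on the bank of the river and watched the water", inventory
)
# "river" and "water" overlap with the second gloss, so bank.n.02 wins.
```

The data-sparsity problem is visible even here: with a less lucky context, both glosses may have zero overlap and the choice degenerates to chance.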
However, it doesn't have a backtracking mechanism, and one would be needed unless we're happy with all text-initial (context-free!) words being misclassified.

As for machine-readable dictionaries, there is very limited data available with proper sense definitions. WordNets are one option ( http://compling.hss.ntu.edu.sg/omw/ ). Maybe the Apertium data would work for you ( https://github.com/acoli-repo/acoli-dicts/tree/master/stable/apertium/apertium-rdf-2020-03-18 ). It doesn't have sense definitions; it simply assumes one sense per translation pair.

Best,
Christian

On Mon., 12 July 2021 at 15:50, Dave Raggett <dsr@w3.org> wrote:

> If anyone has time today I would like to chat about ideas for working on cognitive natural language understanding (NLU).
>
> There has been a lot of coverage around BERT and GPT-3 for NLP, with their impressive ability to generate text as a continuation of a passage provided by the user. Unfortunately the hype is overblown, as the lack of real semantics soon becomes apparent when you ask for the sum of two large numbers, or who the US President was in 1650 (before the United States was founded). GPT-3 doesn't know the limitations of its knowledge and fails to say when it doesn't know the answer to a question.
>
> I am interested in ways to bootstrap NLU using statistical analysis of text corpora in conjunction with machine-readable natural language dictionaries, WordNet's thesaurus, and manually provided taxonomic knowledge.
>
> The starting point is to be able to tag words with their part of speech, e.g. adjective, noun, verb. This enables loose parsing to identify phrase structures, which in turn can be used for co-occurrence statistics. By matching the statistics for a given text passage to dictionary definitions, we can use this to predict word senses in context.
> This can be considerably improved by introducing knowledge about the relationships between words with related meanings from thesauri and taxonomies, e.g. knowing that dogs are animals helps with a dictionary definition for "collar" expressed in terms of animals, as it explains the use of "dog collar" etc.
>
> My hunch is that combining multiple kinds of information in this way can support semantic understanding, provided that it is expressed in terms of word senses and human-like reasoning. It may leave ambiguities where the agent is unsure, e.g. how do you know that dog is a subclass of animal rather than a related peer concept? However, this still speeds learning through the role of prior knowledge.
>
> Researchers have found that we learn associations between concepts whose labels directly co-occur, and subsequently between taxonomically related concepts whose labels share patterns of co-occurrence. Children are good at the former but poor at the latter, whilst adults are good at both.
>
> The challenge is to turn these high-level ideas into concrete experiments with running code. A related challenge is to obtain machine-interpretable natural language dictionaries.
>
> Updated call details are given at:
>
> https://lists.w3.org/Archives/Member/internal-cogai/2021Jun/0000.html
>
> Looking forward to talking with you!
>
> Dave Raggett <dsr@w3.org> http://www.w3.org/People/Raggett
> W3C Data Activity Lead & W3C champion for the Web of things
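[For readers of the archive: the sense-embedding approach mentioned above, in the spirit of the ACL 2015 paper linked earlier, can be sketched as follows: represent each sense by a vector and pick the sense closest to the averaged embedding of the context words. The 4-dimensional toy vectors below are invented for illustration; in practice they would come from pretrained word and sense embeddings.]

```python
# Sense-embedding disambiguation sketch: cosine similarity between the
# aggregate context embedding and each candidate sense embedding.
# All vectors below are toy values, not real pretrained embeddings.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def embed_context(words, word_vectors):
    """Average the embeddings of the context words we have vectors for."""
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    return np.mean(vecs, axis=0)

def disambiguate(context_words, sense_vectors, word_vectors):
    ctx = embed_context(context_words, word_vectors)
    return max(sense_vectors, key=lambda s: cosine(ctx, sense_vectors[s]))

word_vectors = {
    "river":   np.array([0.9, 0.1, 0.0, 0.0]),
    "water":   np.array([0.8, 0.2, 0.0, 0.1]),
    "deposit": np.array([0.0, 0.1, 0.9, 0.3]),
}
sense_vectors = {
    "bank.n.01": np.array([0.1, 0.1, 0.9, 0.2]),  # financial sense
    "bank.n.02": np.array([0.9, 0.2, 0.1, 0.0]),  # riverside sense
}

best = disambiguate(["river", "water"], sense_vectors, word_vectors)
# The river/water context is far closer to the riverside sense vector.
```

[Unlike gloss overlap, this degrades gracefully: even context words that never appear in any gloss still pull the aggregate vector toward the right sense.]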
Received on Wednesday, 14 July 2021 11:07:07 UTC