Fwd: Bootstrapping cognitive NLP

Hi Dave, dear all,

apologies for not following up more closely. I've been having some
administrative trouble for a few months now, and until that is resolved,
I will mostly switch to lurking mode. (Well, I've done so already.)

Mainstream NLP researchers consider Word Sense Disambiguation a hard but
largely artificial problem: there is too little agreement on sense
definitions across resources, and too few sense-annotated resources are
available to apply machine learning in a meaningful way. The classical
Lesk algorithm seems reminiscent of your ideas, and it works nicely -- as
long as the examples and definitions provided in the sense inventory are
sufficiently representative (which they are not). Anyway, you might want
to replicate Lesk as a proof of principle. It is still considered a
seminal work: https://dl.acm.org/doi/10.1145/318723.318728. It uses word
overlap and suffers from data sparsity. A more modern approach in Lesk's
spirit would probably be to induce embeddings for word senses (cf.
https://aclanthology.org/P15-1173/; they call word senses "lexemes") and
then compare them with the (aggregate) context embeddings. This operates
on word embeddings; I am not sure how to scale it to contextualized
embeddings such as those produced by BERT etc. BERT would be great for
deriving "real" sense embeddings if we had a sizable corpus annotated
with word senses -- but we don't really have one. (OntoNotes
[https://catalog.ldc.upenn.edu/LDC2013T19] is the closest thing, but they
had to simplify WordNet's sense distinctions in order to annotate them
reliably.)
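
To make the Lesk idea concrete, here is a minimal sketch of the
word-overlap variant: pick the sense whose gloss shares the most content
words with the surrounding context. The sense inventory and glosses below
are toy examples I made up for illustration, not entries from a real
dictionary.

```python
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "or", "that"}

def tokenize(text):
    """Lowercase, split on whitespace, drop stopwords; return a word set."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def lesk(word, context, inventory):
    """Return the sense id whose gloss has maximal word overlap with the context."""
    context_words = tokenize(context)
    best_sense, best_overlap = None, -1
    for sense_id, gloss in inventory.get(word, {}).items():
        overlap = len(tokenize(gloss) & context_words)
        if overlap > best_overlap:
            best_sense, best_overlap = sense_id, overlap
    return best_sense

# Toy sense inventory with hypothetical glosses
inventory = {
    "bank": {
        "bank.n.1": "financial institution that accepts deposits and lends money",
        "bank.n.2": "sloping land beside a river or lake",
    }
}

print(lesk("bank", "she deposited money at the bank", inventory))       # bank.n.1
print(lesk("bank", "fishing from the grassy bank of the river", inventory))  # bank.n.2
```

The data-sparsity problem is already visible here: "deposited" in the
context does not match "deposits" in the gloss, so the overlap hinges on
a single shared word.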

As for cognitive plausibility, Lesk isn't incremental, so its way of
processing differs from what humans do. But the underlying mechanism
follows an intuition similar to yours, and it could be made incremental
by looking only at the preceding context. However, it has no backtracking
mechanism, and that would be needed unless we're happy for all
text-initial (context-free!) words to be misclassified.
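
A sketch of that incremental variant, again with a made-up toy
inventory: each token is disambiguated using only the words already seen,
which directly exposes the text-initial failure mode.

```python
STOPWORDS = {"the", "a", "an", "of", "at", "and", "that"}

def content_words(tokens):
    """Lowercase the tokens and drop stopwords; return a word set."""
    return {t.lower() for t in tokens if t.lower() not in STOPWORDS}

def incremental_lesk(tokens, inventory):
    """Assign each token a sense using only the preceding context."""
    assignments = []
    for i, tok in enumerate(tokens):
        senses = inventory.get(tok.lower())
        if not senses:
            assignments.append(None)
            continue
        preceding = content_words(tokens[:i])
        best_sense, best_overlap = None, -1
        for sense_id, gloss in senses.items():
            overlap = len(content_words(gloss.split()) & preceding)
            if overlap > best_overlap:
                best_sense, best_overlap = sense_id, overlap
        assignments.append(best_sense)
    return assignments

# Toy sense inventory with hypothetical glosses
inventory = {
    "bank": {
        "bank.n.1": "financial institution that accepts deposits",
        "bank.n.2": "sloping land beside a river",
    }
}

# Enough preceding context: "river" selects the correct sense.
print(incremental_lesk("the river flowed past the bank".split(), inventory)[-1])  # bank.n.2
# Text-initial word: no preceding context, so the first listed sense wins
# regardless of what follows -- the failure mode backtracking would fix.
print(incremental_lesk("bank of the river".split(), inventory)[0])  # bank.n.1
```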

As for machine-readable dictionaries, there is very little data available
with proper sense definitions. The WordNets are one option
(http://compling.hss.ntu.edu.sg/omw/). Maybe the Apertium data would work
for you (
https://github.com/acoli-repo/acoli-dicts/tree/master/stable/apertium/apertium-rdf-2020-03-18).
It has no sense definitions, but simply assumes one sense per translation
pair.
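
The one-sense-per-translation-pair assumption is easy to operationalize.
A minimal sketch, using hypothetical English-Spanish pairs rather than
actual Apertium entries:

```python
from collections import defaultdict

# Hypothetical English-Spanish translation pairs (not real Apertium data)
pairs = [
    ("bank", "banco"),   # financial institution
    ("bank", "orilla"),  # side of a river
    ("dog", "perro"),
]

def senses_from_pairs(pairs):
    """Treat each distinct translation of a source word as one sense."""
    inventory = defaultdict(list)
    for source, target in pairs:
        if target not in inventory[source]:
            inventory[source].append(target)
    return dict(inventory)

inv = senses_from_pairs(pairs)
print(inv["bank"])      # ['banco', 'orilla'] -> two senses
print(len(inv["dog"]))  # 1 -> unambiguous under this assumption
```

Of course, this conflates translation ambiguity with sense ambiguity:
synonymous translations create spurious senses, and senses that happen to
share a translation are collapsed.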

Best,
Christian


On Mon, 12 July 2021 at 15:50, Dave Raggett <dsr@w3.org> wrote:

> If anyone has time today, I would like to chat about ideas for working
> on cognitive natural language understanding (NLU).
>
> There has been a lot of coverage of BERT and GPT-3 for NLP, with their
> impressive ability to generate text as a continuation of a text passage
> provided by the user. Unfortunately, the hype is overblown, as the lack of
> real semantics soon becomes apparent when you ask for the sum of two large
> numbers, or who the US President was in 1650 (before the United States was
> founded). GPT-3 doesn't know the limitations of its knowledge and fails to
> say when it doesn't know the answer to a question.
>
> I am interested in ways to bootstrap NLU using statistical analysis of
> text corpora in conjunction with machine readable natural language
> dictionaries, WordNet’s thesaurus, and manually provided taxonomic
> knowledge.
>
> The starting point is to be able to tag words with their part of speech,
> e.g. adjective, noun, verb. This enables loose parsing to identify phrase
> structures, which in turn can be used for co-occurrence statistics. By
> matching the statistics for a given text passage against dictionary
> definitions, we can predict word senses in context.
>
> This can be considerably improved by introducing knowledge about the
> relationship between words with related meanings from thesauri and
> taxonomies, e.g. knowing that dogs are animals helps with a dictionary
> definition for “collar” expressed in terms of animals, as it explains the
> use of “dog collar” etc.
>
> My hunch is that combining multiple kinds of information in this way can
> support semantic understanding, provided that it is expressed in terms of
> word senses and human-like reasoning. It may leave ambiguities where the
> agent is unsure, e.g. how do you know that dog is a subclass of animal
> rather than a related peer concept? However, prior knowledge still speeds
> up learning.
>
> Researchers have found that we learn associations between concepts whose
> labels directly co-occur, and subsequently between taxonomically related
> concepts whose labels share patterns of co-occurrence. Children are good at
> the former, but poor at the latter, whilst adults are good at both.
>
> The challenge is to turn these high level ideas into concrete experiments
> with running code. A related challenge is to obtain machine interpretable
> natural language dictionaries.
>
> Updated call details are given at:
>
> https://lists.w3.org/Archives/Member/internal-cogai/2021Jun/0000.html
>
> Looking forward to talking with you!
>
> Dave Raggett <dsr@w3.org> http://www.w3.org/People/Raggett
> W3C Data Activity Lead & W3C champion for the Web of things
>

Received on Wednesday, 14 July 2021 11:07:07 UTC