Bootstrapping cognitive NLP

If anyone has time today, I would like to chat about ideas for working on cognitive natural language understanding (NLU).

There has been a lot of coverage of BERT and GPT-3 for NLP, and of their impressive ability to generate text as a continuation of a passage provided by the user. Unfortunately the hype is overblown, as the lack of real semantics soon becomes apparent when you ask for the sum of two large numbers, or who was the US President in 1650 (before the United States was founded). GPT-3 doesn't know the limits of its knowledge and fails to say when it doesn't know the answer to a question.

I am interested in ways to bootstrap NLU using statistical analysis of text corpora in conjunction with machine-readable natural language dictionaries, WordNet’s thesaurus, and manually provided taxonomic knowledge.

The starting point is to be able to tag words with their part of speech, e.g. adjective, noun, verb. This enables loose parsing to identify phrase structures, which in turn can be used for co-occurrence statistics. By matching the statistics for a given text passage to dictionary definitions, we can predict word senses in context.
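As a rough illustration of the dictionary-matching step, here is a minimal Python sketch in the spirit of a simplified Lesk-style overlap measure. It skips part-of-speech tagging and phrase parsing altogether and just treats the passage as a bag of words. The mini dictionary, the stop-word list and the function names are all invented for illustration, not taken from any real resource.

```python
# Toy sketch: pick the word sense whose dictionary definition shares
# the most content words with the passage. All data here is made up.

STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "for", "or",
              "and", "is", "that", "by", "at", "with"}

# Hypothetical machine-readable dictionary: word -> {sense label: definition}
DICTIONARY = {
    "bank": {
        "bank/finance": "an institution that accepts deposits of money and makes loans",
        "bank/river": "the sloping land beside a river or lake",
    },
}


def content_words(text):
    """Lower-case the text, strip punctuation, and drop stop words."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return {w for w in words if w and w not in STOP_WORDS}


def predict_sense(word, passage):
    """Return (best sense, overlap score per sense) for word in passage."""
    context = content_words(passage) - {word}
    scores = {
        sense: len(context & content_words(definition))
        for sense, definition in DICTIONARY[word].items()
    }
    # Ties are resolved in favour of the first sense listed.
    best = max(scores, key=scores.get)
    return best, scores


print(predict_sense("bank", "She paid the money into her bank to repay the loans."))
print(predict_sense("bank", "They fished from the bank of the river."))
```

On this toy data the first passage should favour the finance sense and the second the river sense. A more realistic version would presumably count co-occurrences over tagged phrases rather than raw word overlap.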

This can be considerably improved by introducing knowledge about the relationships between words with related meanings, drawn from thesauri and taxonomies. For example, knowing that dogs are animals helps when the dictionary definition of “collar” is expressed in terms of animals, as it explains phrases such as “dog collar”.
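To make the “dog collar” example concrete, the following Python sketch expands the context words with their taxonomic ancestors before comparing them with the definitions. The taxonomy, the two definitions of “collar” and the function names are hypothetical, chosen only to show the idea.

```python
# Toy sketch: use a small is-a taxonomy to bridge the gap between a
# context word ("dog") and a definition written in terms of "animal".

# Hypothetical taxonomy: child -> parent (assumed to contain no cycles)
IS_A = {
    "dog": "animal",
    "cat": "animal",
    "animal": "organism",
    "shirt": "garment",
    "garment": "artifact",
}

# Hypothetical dictionary senses for "collar"
COLLAR_SENSES = {
    "collar/animal": "a band put around the neck of an animal",
    "collar/garment": "the part of a garment that goes around the neck",
}


def ancestors(word):
    """Return the chain of more general concepts above word."""
    chain = []
    while word in IS_A:
        word = IS_A[word]
        chain.append(word)
    return chain


def expand(words):
    """Add every taxonomic ancestor of each word to the set."""
    expanded = set(words)
    for w in words:
        expanded.update(ancestors(w))
    return expanded


def score_senses(context_words, senses):
    """Count definition words matched by the expanded context."""
    context = expand(context_words)
    return {
        sense: len(context & set(definition.split()))
        for sense, definition in senses.items()
    }


print(score_senses({"dog"}, COLLAR_SENSES))    # context for "dog collar"
print(score_senses({"shirt"}, COLLAR_SENSES))  # context for "shirt collar"
```

Without the expansion step neither context word appears in either definition, so both senses would score zero. With it, “dog” reaches “animal” and “shirt” reaches “garment”.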

My hunch is that combining multiple kinds of information in this way can support semantic understanding, provided that it is expressed in terms of word senses and human-like reasoning. It may leave ambiguities where the agent is unsure, e.g. how do you know that dog is a subclass of animal rather than a related peer concept? However, prior knowledge of this kind should still speed up learning.

Researchers have found that we learn associations between concepts whose labels directly co-occur, and subsequently between taxonomically related concepts whose labels share patterns of co-occurrence. Children are good at the former, but poor at the latter, whilst adults are good at both.
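The difference between the two kinds of association can be sketched in Python as follows. Direct association is taken to be a raw co-occurrence count within a sentence, and shared patterns of co-occurrence are approximated by the cosine similarity of two words' co-occurrence vectors. The four-sentence corpus and that choice of measure are assumptions made for illustration, not something the research above prescribes.

```python
from collections import Counter, defaultdict
from math import sqrt

# Made-up corpus in which "dog" and "wolf" never appear together
SENTENCES = [
    "the dog chased the ball",
    "the dog ate the bone",
    "the wolf chased the deer",
    "the wolf ate the bone",
]
STOP_WORDS = {"the"}


def cooccurrence(sentences):
    """Count, for each word, the other words sharing a sentence with it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = [w for w in sentence.split() if w not in STOP_WORDS]
        for i, word in enumerate(words):
            for j, other in enumerate(words):
                if i != j:
                    counts[word][other] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


counts = cooccurrence(SENTENCES)

# Direct (first-order) association: do the labels co-occur?
print("dog/bone direct count:", counts["dog"]["bone"])
print("dog/wolf direct count:", counts["dog"]["wolf"])

# Shared patterns of co-occurrence (second-order association)
print("dog/wolf similarity:", cosine(counts["dog"], counts["wolf"]))
```

Here “dog” and “wolf” have no direct association, yet their co-occurrence vectors overlap heavily, which is the kind of signal that might hint at a taxonomic relationship.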

The challenge is to turn these high-level ideas into concrete experiments with running code. A related challenge is to obtain machine-interpretable natural language dictionaries.

Updated call details are given at:

 https://lists.w3.org/Archives/Member/internal-cogai/2021Jun/0000.html

Looking forward to talking with you!

Dave Raggett <dsr@w3.org> http://www.w3.org/People/Raggett
W3C Data Activity Lead & W3C champion for the Web of things 

Received on Monday, 12 July 2021 13:50:30 UTC