Progress report on Cognitive NLP

I have had a little time to work on Cognitive NLP recently and want to keep you all informed as to my progress.

My first step was to look for freely available English corpora for some statistical studies. I was able to download the free subsets of the Corpus of Contemporary American English (COCA) and the British National Corpus (BNC).  COCA uses a simple text format, whilst BNC uses XML. Both provide the words, word stems and part of speech tags.  BNC also provides sentence boundaries.

It was easy enough to write a NodeJS script to read all of the files in the two corpora, and I started by using word counts to find the 200 most frequent words in each corpus. The two lists agree for the most part, which is encouraging.  The next step was to collect the complete set of part of speech tags and compare them with the documentation I had found. The two corpora use different tag sets. I found 142 tags for COCA after trimming white space, splitting compounds like “nn1_rr”, and discarding trailing digits.
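The counting and tag clean-up steps described above can be sketched roughly as follows. This is a minimal illustration rather than the actual script: the function names (topWords, normalizeTag, collectTags) are hypothetical, the input is assumed to be already split into tokens and raw tag fields, and "discarding trailing digits" is read literally as removing every digit at the end of a tag, which may differ from what the real script does.

```javascript
// Return the n most frequent words (case-folded), most frequent first.
// Ties are broken alphabetically so the result is deterministic.
function topWords(tokens, n) {
  const counts = new Map();
  for (const token of tokens) {
    const word = token.toLowerCase();
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, n)
    .map(([word]) => word);
}

// Clean up one raw tag field: trim white space, split compounds such as
// "nn1_rr" on the underscore, and strip trailing digits (an assumption).
function normalizeTag(raw) {
  return raw
    .trim()
    .split('_')
    .map(tag => tag.replace(/\d+$/, ''))
    .filter(tag => tag.length > 0);
}

// Collect the distinct normalised tags seen across all raw tag fields.
function collectTags(rawTags) {
  const tags = new Set();
  for (const raw of rawTags) {
    for (const tag of normalizeTag(raw)) tags.add(tag);
  }
  return [...tags].sort();
}
```

Running collectTags over every tag field in a corpus yields the tag inventory that can then be compared against the published documentation.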

The fine-grained part of speech tag sets used by both corpora encode detailed syntactic distinctions, e.g.

DA after-determiner or post-determiner capable of pronominal function (e.g. such, former, same)
DA1 singular after-determiner (e.g. little, much)
DA2 plural after-determiner (e.g. few, several, many)
DAR comparative after-determiner (e.g. more, less, fewer)
DAT superlative after-determiner (e.g. most, least, fewest)
DB before determiner or pre-determiner capable of pronominal function (all, half)
DB2 plural before-determiner (both)
DD determiner (capable of pronominal function) (e.g. any, some)
DD1 singular determiner (e.g. this, that, another)
DD2 plural determiner (these, those)
DDQ wh-determiner (which, what)
DDQGE wh-determiner, genitive (whose)
DDQV wh-ever determiner (whichever, whatever)

A different approach is to use a much smaller set of tags, and to handle the additional information in the lexicon and in the processing that maps words to meaning.  Coarse-grained tags simplify loose parsing, something I have previously demonstrated with an implementation of a shift-reduce parser for English.
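As a rough illustration of collapsing fine-grained tags into a smaller set, the determiner tags listed earlier could all map to one coarse tag. The coarse tag names ("det", "other") and the prefix table below are assumptions made for this sketch, not a proposed tag set.

```javascript
// Hypothetical table mapping fine-grained tag prefixes to coarse tags.
// Only the determiner families quoted above are covered here.
const COARSE_BY_PREFIX = [
  ['DA', 'det'], // after-determiners: DA, DA1, DA2, DAR, DAT
  ['DB', 'det'], // before-determiners: DB, DB2
  ['DD', 'det'], // determiners: DD, DD1, DD2, DDQ, DDQGE, DDQV
];

// Map a fine-grained tag to its coarse tag, or 'other' if unknown.
function coarseTag(fineTag) {
  const tag = fineTag.trim().toUpperCase();
  for (const [prefix, coarse] of COARSE_BY_PREFIX) {
    if (tag.startsWith(prefix)) return coarse;
  }
  return 'other';
}
```

The distinctions lost in this mapping (singular versus plural, comparative versus superlative, and so on) would then need to be recorded per word in the lexicon.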

The corpora show the need to handle awkward cases. These include interjections, reported speech, different kinds of numbers, measures such as “75kG”, foreign words, and the use of symbols that don’t represent natural language.
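One possible way to flag some of these awkward tokens before tagging is a small pattern-based classifier. The sketch below is only an assumption about how this might be done: the category names and regular expressions are hypothetical, and they cover just numbers, ordinals, measures and non-linguistic symbols.

```javascript
// Classify a token into a rough category using regular expressions.
// Order matters: more specific patterns are tested first.
function classifyToken(token) {
  if (/^[+-]?\d+(?:,\d{3})*(?:\.\d+)?$/.test(token)) return 'number';
  if (/^\d+(?:st|nd|rd|th)$/i.test(token)) return 'ordinal';
  // A number immediately followed by a short unit, e.g. "75kG".
  if (/^[+-]?\d+(?:\.\d+)?[A-Za-z]{1,4}$/.test(token)) return 'measure';
  // No letters or digits at all, e.g. "###" or "->".
  if (/^[^\p{L}\p{N}]+$/u.test(token)) return 'symbol';
  return 'word';
}
```

Interjections, reported speech and foreign words cannot be recognised by surface patterns like these and would need lexicon or context information.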

The next step will be to evolve a part of speech tagger using a combination of statistics and rules for exceptions. I will then work on developing a lexicon for American and British English using information in COCA and BNC together with information from WordNet and the OALD. I will use that to refine the loose parser.
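To make the idea of combining statistics with rules concrete, here is a minimal sketch of one possible starting point: a most-frequent-tag baseline whose output can be overridden by exception rules. This is not the planned tagger; the function names, the rule signature and the fallback tag are all assumptions chosen for illustration.

```javascript
// Build a model mapping each word to its most frequent tag.
// taggedWords is an array of [word, tag] pairs from a tagged corpus.
function trainUnigram(taggedWords) {
  const counts = new Map(); // word -> Map(tag -> count)
  for (const [word, tag] of taggedWords) {
    const key = word.toLowerCase();
    if (!counts.has(key)) counts.set(key, new Map());
    const tagCounts = counts.get(key);
    tagCounts.set(tag, (tagCounts.get(tag) || 0) + 1);
  }
  const best = new Map();
  for (const [word, tagCounts] of counts) {
    let bestTag = null;
    let bestCount = 0;
    for (const [tag, count] of tagCounts) {
      if (count > bestCount) {
        bestTag = tag;
        bestCount = count;
      }
    }
    best.set(word, bestTag);
  }
  return best;
}

// Tag a sentence. Each rule is a function (word, index, words) that
// returns a tag to override the statistics, or null to pass.
function tagSentence(words, model, rules, fallback = 'noun') {
  return words.map((word, i) => {
    for (const rule of rules) {
      const tag = rule(word, i, words);
      if (tag) return tag;
    }
    return model.get(word.toLowerCase()) || fallback;
  });
}
```

For example, a rule such as `(w) => /^\d+$/.test(w) ? 'num' : null` would tag bare digit strings as numbers regardless of the corpus statistics, while unseen words fall back to the default tag.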

Further out, I plan to work on mapping words to meaning and handling previously unseen words and word senses. This is part of a longer roadmap towards using conversational natural language to teach cognitive agents new skills.  I will post status reports as progress is made, and publish the software as open source on GitHub. If you want to run the experiments, you will need to download the corpora yourself, as the licenses preclude me from publishing copies.

If anyone would like to actively participate in this work, please get in touch with me.  For now, I am sticking with JavaScript and NodeJS given my experience with Web applications.

P.S. Whilst deep learning and transformer-based neural networks have attracted a lot of attention for NLP, so far they have yet to show how to support cognitive reasoning (System 2).  RDF and Chunks use explicit discrete symbols, and I am interested in fuzzy symbols that are better suited to human language, where a word may have a fuzzy blend of meanings in any given context. Can this be handled through weighted combinations of discrete symbols?  This is analogous to quantum mechanics, where a system is in a combination of states that resolves to a single state when observed.

What do you think?

Dave Raggett <dsr@w3.org> http://www.w3.org/People/Raggett
W3C Data Activity Lead & W3C champion for the Web of things 

Received on Sunday, 10 October 2021 16:45:13 UTC