- From: Paola Di Maio <paola.dimaio@gmail.com>
- Date: Mon, 11 Oct 2021 14:41:11 +0800
- To: Dave Raggett <dsr@w3.org>
- Cc: public-cogai <public-cogai@w3.org>
- Message-ID: <CAMXe=Srsm-_+Zw5yP=NViL2rDfbt9+1YGB23r2q2e1VUC6T=xQ@mail.gmail.com>
Thanks for the update. I would also find links to the resources useful; in addition to the corpora, we'd need to know what software/libraries to load to run the experiments (I assume this info is on GitHub). I may be interested at some point in trying to intersect some of this work with some of the KR work I'll be reporting on soon.

cheers
pdm

On Mon, Oct 11, 2021 at 12:45 AM Dave Raggett <dsr@w3.org> wrote:

> I have had a little time to work on Cognitive NLP recently and want to keep you all informed as to my progress.
>
> My first step was to look for freely available English corpora for some statistical studies. I was able to download the free subsets of the Corpus of Contemporary American English (COCA) and the British National Corpus (BNC). COCA uses a simple text format, whilst BNC uses XML. Both provide the words, word stems and part-of-speech tags. BNC also provides sentence boundaries.
>
> It was easy enough to write a NodeJS script to read all of the files in the two corpora, and I started by using word counts to find the most frequent 200 words in each case. These agree for the most part, which is encouraging. The next step was to collect the complete set of part-of-speech tags and compare them with the documentation I had found. The two corpora use different tag sets. I found 142 tags for COCA after trimming white space, splitting compounds like “nn1_rr”, and discarding trailing digits.
>
> The fine-grained part-of-speech tag sets used by both corpora distinguish syntactic information, e.g.
>
> DA    after-determiner or post-determiner capable of pronominal function (e.g. such, former, same)
> DA1   singular after-determiner (e.g. little, much)
> DA2   plural after-determiner (e.g. few, several, many)
> DAR   comparative after-determiner (e.g. more, less, fewer)
> DAT   superlative after-determiner (e.g. most, least, fewest)
> DB    before-determiner or pre-determiner capable of pronominal function (all, half)
> DB2   plural before-determiner (both)
> DD    determiner, capable of pronominal function (e.g. any, some)
> DD1   singular determiner (e.g. this, that, another)
> DD2   plural determiner (these, those)
> DDQ   wh-determiner (which, what)
> DDQGE wh-determiner, genitive (whose)
> DDQV  wh-ever determiner (whichever, whatever)
>
> A different approach is to use a much smaller set of tags and to deal with additional information in the lexicon and in the processing for mapping words into meaning. Coarse-grained tags simplify loose parsing, something I have previously demonstrated with an implementation of a shift-reduce parser for English.
>
> The corpora show the need to handle awkward cases. These include interjections, reported speech, different kinds of numbers, measures such as “75kG”, foreign words, and the use of symbols that don’t represent natural language.
>
> The next step will be to evolve a part-of-speech tagger using a combination of statistics and rules for exceptions. I will then work on developing a lexicon for American and British English using information in COCA and BNC together with information from WordNet and the OALD. I will use that to refine the loose parser.
>
> Further out, I plan to work on mapping words to meaning and handling previously unseen words and word senses. This is part of a longer roadmap towards using conversational natural language to teach cognitive agents new skills. I will post status reports as progress is made, and publish the software as open source on GitHub.
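For anyone wanting to try the word-count pass before the code is on GitHub, a minimal NodeJS sketch might look like the following. It assumes COCA-style files with one token per line and tab-separated word/lemma/tag columns; the actual layout may differ, so adjust the parsing accordingly.

    // wordcount.js -- sketch of the word-frequency pass over a corpus
    // directory; assumes one token per line, tab-separated columns.
    const fs = require('fs');
    const path = require('path');

    const dir = process.argv[2] || './coca';
    const counts = new Map();

    for (const file of fs.readdirSync(dir)) {
      const text = fs.readFileSync(path.join(dir, file), 'utf8');
      for (const line of text.split('\n')) {
        const word = line.split('\t')[0].trim().toLowerCase();
        if (word) counts.set(word, (counts.get(word) || 0) + 1);
      }
    }

    // report the 200 most frequent words
    const top = [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, 200);
    for (const [word, n] of top) console.log(word, n);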
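The tag clean-up described above (trim whitespace, split compounds like “nn1_rr”, drop trailing digits) could be sketched along these lines; the sample tags are stand-ins for whatever the corpus actually contains.

    // tagset.js -- sketch of the tag normalisation step
    function normaliseTag(raw) {
      return raw
        .trim()
        .split('_')                       // "nn1_rr" -> ["nn1", "rr"]
        .map(t => t.replace(/\d+$/, ''))  // "nn1" -> "nn"
        .filter(t => t.length > 0);
    }

    // collect the distinct tags seen in a corpus
    const tags = new Set();
    for (const raw of [' nn1_rr ', 'vvz', 'jj'])  // stand-in for corpus tags
      for (const t of normaliseTag(raw)) tags.add(t);
    console.log(tags.size, [...tags]);  // 4 [ 'nn', 'rr', 'vvz', 'jj' ]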
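For readers unfamiliar with shift-reduce parsing over coarse-grained tags, here is a toy illustration; the grammar rules and tag names are invented for the example and are not the parser mentioned above.

    // parser.js -- toy shift-reduce parse over coarse-grained tags
    const rules = [
      { lhs: 'NP', rhs: ['det', 'adj', 'noun'] },
      { lhs: 'NP', rhs: ['det', 'noun'] },
      { lhs: 'VP', rhs: ['verb', 'NP'] },
      { lhs: 'S',  rhs: ['NP', 'VP'] },
    ];

    // try to rewrite the top of the stack with one grammar rule
    function reduceOnce(stack) {
      for (const { lhs, rhs } of rules) {
        const at = stack.length - rhs.length;
        if (at >= 0 && rhs.every((t, i) => stack[at + i] === t)) {
          stack.splice(at, rhs.length, lhs);
          return true;
        }
      }
      return false;
    }

    function parse(tags) {
      const stack = [];
      for (const tag of tags) {
        stack.push(tag);               // shift
        while (reduceOnce(stack)) {}   // reduce as far as possible
      }
      return stack;                    // ['S'] for a complete parse
    }

    console.log(parse(['det', 'adj', 'noun', 'verb', 'det', 'noun'])); // [ 'S' ]

Note that the greedy reduce loop sidesteps shift/reduce conflicts; a real parser needs a policy for deciding when to shift rather than reduce.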
> If you want to run the experiments, you will need to download the corpora yourself, as the licenses preclude me from publishing copies.
>
> If anyone would like to actively participate in this work, please get in touch with me. For now, I am sticking with JavaScript and NodeJS given my experience with Web applications.
>
> p.s. whilst there has been a lot of attention to deep learning and transformers for neural networks for NLP, so far these have yet to show how to support cognitive reasoning (System 2). RDF and Chunks use explicit discrete symbols, and I am interested in fuzzy symbols that are better suited to human language, where a word may have a fuzzy blend of meanings in any given context. Can this be handled through weighted combinations of discrete symbols? This is analogous to quantum mechanics, where a system is a combination of states that resolves to a single state when observed.
>
> What do you think?
>
> Dave Raggett <dsr@w3.org> http://www.w3.org/People/Raggett
> W3C Data Activity Lead & W3C champion for the Web of things
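One toy rendering of the weighted-combination idea in the postscript: a word's meaning in context as a weighted blend of discrete sense symbols that resolves to a single symbol when observed. The senses and weights below are invented for illustration.

    // fuzzy.js -- a "weighted combination of discrete symbols"
    const bank = new Map([
      ['financial-institution', 0.7],
      ['river-edge',            0.2],
      ['tilt-aircraft',         0.1],
    ]);

    // "observing" the blend resolves it to one discrete symbol,
    // sampled in proportion to its weight
    function observe(blend) {
      let r = Math.random();
      let last;
      for (const [symbol, weight] of blend) {
        last = symbol;
        if ((r -= weight) <= 0) return symbol;
      }
      return last; // guard against floating-point rounding
    }

    console.log(observe(bank)); // usually 'financial-institution'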
Received on Monday, 11 October 2021 06:42:05 UTC