- From: Paola Di Maio <paola.dimaio@gmail.com>
- Date: Mon, 11 Oct 2021 14:41:11 +0800
- To: Dave Raggett <dsr@w3.org>
- Cc: public-cogai <public-cogai@w3.org>
- Message-ID: <CAMXe=Srsm-_+Zw5yP=NViL2rDfbt9+1YGB23r2q2e1VUC6T=xQ@mail.gmail.com>
Thanks for the update. I would also find links to the resources useful; in addition to the corpora, we'd need to know what software/libraries to load to run the experiments (I assume this info is on GitHub). I may be interested at some point in trying to intersect some of this work with some of the KR work I'll be reporting on soon.

cheers
pdm

On Mon, Oct 11, 2021 at 12:45 AM Dave Raggett <dsr@w3.org> wrote:

> I have had a little time to work on Cognitive NLP recently and want to keep you all informed as to my progress.
>
> My first step was to look for freely available English corpora for some statistical studies. I was able to download the free subsets of the Corpus of Contemporary American English (COCA) and the British National Corpus (BNC). COCA uses a simple text format, whilst BNC uses XML. Both provide the words, word stems and part-of-speech tags. BNC also provides sentence boundaries.
>
> It was easy enough to write a NodeJS script to read all of the files in the two corpora, and I started by using word counts to find the most frequent 200 words in each case. These agree for the most part, which is encouraging. The next step was to collect the complete set of part-of-speech tags and compare them with the documentation I had found. The two corpora use different tag sets. I found 142 tags for COCA after trimming white space, splitting compounds like “nn1_rr”, and discarding trailing digits.
>
> The fine-grained part-of-speech tag sets used by both corpora distinguish syntactic information, e.g.
>
> DA    after-determiner or post-determiner capable of pronominal function (e.g. such, former, same)
> DA1   singular after-determiner (e.g. little, much)
> DA2   plural after-determiner (e.g. few, several, many)
> DAR   comparative after-determiner (e.g. more, less, fewer)
> DAT   superlative after-determiner (e.g. most, least, fewest)
> DB    before-determiner or pre-determiner capable of pronominal function (all, half)
> DB2   plural before-determiner (both)
> DD    determiner, capable of pronominal function (e.g. any, some)
> DD1   singular determiner (e.g. this, that, another)
> DD2   plural determiner (these, those)
> DDQ   wh-determiner (which, what)
> DDQGE wh-determiner, genitive (whose)
> DDQV  wh-ever determiner (whichever, whatever)
>
> A different approach is to use a much smaller set of tags and to deal with additional information in the lexicon and in the processing for mapping words into meaning. Coarse-grained tags simplify loose parsing, something I have previously demonstrated with an implementation of a shift-reduce parser for English.
>
> The corpora show the need to handle awkward cases. These include interjections, reported speech, different kinds of numbers, measures such as “75kG”, foreign words, and the use of symbols that don’t represent natural language.
>
> The next step will be to evolve a part-of-speech tagger using a combination of statistics and rules for exceptions. I will then work on developing a lexicon for American and British English using information in COCA and BNC together with information from WordNet and the OALD. I will use that to refine the loose parser.
>
> Further out, I plan to work on mapping words to meaning and handling previously unseen words and word senses. This is part of a longer roadmap towards using conversational natural language to teach cognitive agents new skills. I will post status reports as progress is made, and publish the software as open source on GitHub.
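For anyone wanting to try the word-count pass before the code is on GitHub, a minimal NodeJS sketch might look like the following. It assumes COCA-style files with one token per line and tab-separated word/lemma/tag columns; the actual layout may differ, so adjust the parsing accordingly.

    // wordcount.js -- sketch of the word-frequency pass over a corpus
    // directory; assumes one token per line, tab-separated columns.
    const fs = require('fs');
    const path = require('path');

    const dir = process.argv[2] || './coca';
    const counts = new Map();

    for (const file of fs.readdirSync(dir)) {
      const text = fs.readFileSync(path.join(dir, file), 'utf8');
      for (const line of text.split('\n')) {
        const word = line.split('\t')[0].trim().toLowerCase();
        if (word) counts.set(word, (counts.get(word) || 0) + 1);
      }
    }

    // report the 200 most frequent words
    const top = [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, 200);
    for (const [word, n] of top) console.log(word, n);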
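The tag clean-up described above (trim whitespace, split compounds like “nn1_rr”, drop trailing digits) could be sketched along these lines; the sample tags are stand-ins for whatever the corpus actually contains.

    // tagset.js -- sketch of the tag normalisation step
    function normaliseTag(raw) {
      return raw
        .trim()
        .split('_')                       // "nn1_rr" -> ["nn1", "rr"]
        .map(t => t.replace(/\d+$/, ''))  // "nn1" -> "nn"
        .filter(t => t.length > 0);
    }

    // collect the distinct tags seen in a corpus
    const tags = new Set();
    for (const raw of [' nn1_rr ', 'vvz', 'jj'])  // stand-in for corpus tags
      for (const t of normaliseTag(raw)) tags.add(t);
    console.log(tags.size, [...tags]);  // 4 [ 'nn', 'rr', 'vvz', 'jj' ]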
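For readers unfamiliar with shift-reduce parsing over coarse-grained tags, here is a toy illustration; the grammar rules and tag names are invented for the example and are not the parser mentioned above.

    // parser.js -- toy shift-reduce parse over coarse-grained tags
    const rules = [
      { lhs: 'NP', rhs: ['det', 'adj', 'noun'] },
      { lhs: 'NP', rhs: ['det', 'noun'] },
      { lhs: 'VP', rhs: ['verb', 'NP'] },
      { lhs: 'S',  rhs: ['NP', 'VP'] },
    ];

    // try to rewrite the top of the stack with one grammar rule
    function reduceOnce(stack) {
      for (const { lhs, rhs } of rules) {
        const at = stack.length - rhs.length;
        if (at >= 0 && rhs.every((t, i) => stack[at + i] === t)) {
          stack.splice(at, rhs.length, lhs);
          return true;
        }
      }
      return false;
    }

    function parse(tags) {
      const stack = [];
      for (const tag of tags) {
        stack.push(tag);               // shift
        while (reduceOnce(stack)) {}   // reduce as far as possible
      }
      return stack;                    // ['S'] for a complete parse
    }

    console.log(parse(['det', 'adj', 'noun', 'verb', 'det', 'noun'])); // [ 'S' ]

Note that the greedy reduce loop sidesteps shift/reduce conflicts; a real parser needs a policy for deciding when to shift rather than reduce.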
> If you want to run the experiments, you will need to download the corpora yourself, as the licenses preclude me from publishing copies.
>
> If anyone would like to actively participate in this work, please get in touch with me. For now, I am sticking with JavaScript and NodeJS given my experience with Web applications.
>
> p.s. whilst there has been a lot of attention to deep learning and transformers for neural networks for NLP, so far these have yet to show how to support cognitive reasoning (System 2). RDF and Chunks use explicit discrete symbols, and I am interested in fuzzy symbols that are better suited to human language, where a word may have a fuzzy blend of meanings in any given context. Can this be handled through weighted combinations of discrete symbols? This is analogous to quantum mechanics, where a system is a combination of states that resolves to a single state when observed.
>
> What do you think?
>
> Dave Raggett <dsr@w3.org> http://www.w3.org/People/Raggett
> W3C Data Activity Lead & W3C champion for the Web of things
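One toy rendering of the weighted-combination idea in the postscript: a word's meaning in context as a weighted blend of discrete sense symbols that resolves to a single symbol when observed. The senses and weights below are invented for illustration.

    // fuzzy.js -- a "weighted combination of discrete symbols"
    const bank = new Map([
      ['financial-institution', 0.7],
      ['river-edge',            0.2],
      ['tilt-aircraft',         0.1],
    ]);

    // "observing" the blend resolves it to one discrete symbol,
    // sampled in proportion to its weight
    function observe(blend) {
      let r = Math.random();
      let last;
      for (const [symbol, weight] of blend) {
        last = symbol;
        if ((r -= weight) <= 0) return symbol;
      }
      return last; // guard against floating-point rounding
    }

    console.log(observe(bank)); // usually 'financial-institution'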
Received on Monday, 11 October 2021 06:42:05 UTC