Beyond Transformers ...

This is to provide some context for work on building blocks for sentient AI.

Work on natural language used to focus on grammatical rules that describe the regularities of language, e.g. noun phrases built from determiners, adjectives and nouns. The focus gradually shifted to statistical models, particularly for speech recognition and machine translation.

N-grams are based upon counting occurrences of word patterns within a corpus. They include unigrams for the probability of a given word, e.g. “apples”; bigrams for the probability of a word directly following another, e.g. “apple” following “red”; and trigrams for the probability of a word given the two preceding words, e.g. “apples” following “shiny red”.
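
To make this concrete, here is a toy bigram model sketched in Python; the corpus, raw counts and lack of smoothing are purely illustrative:

    from collections import Counter

    # Toy corpus; any tokenised text would do.
    corpus = "the red apple fell near the red barn".split()

    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def bigram_prob(prev, word):
        """P(word | prev) estimated from raw counts (no smoothing)."""
        if unigrams[prev] == 0:
            return 0.0
        return bigrams[(prev, word)] / unigrams[prev]

    print(bigram_prob("red", "apple"))   # 0.5 for this toy corpus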

N-grams proved difficult to scale up and were superseded by work on artificial neural networks, e.g. recurrent neural networks (RNNs), which process text word by word to predict the next word based upon the preceding words. RNNs use hidden vectors to model the context provided by the preceding words. Like N-grams, the network is trained on a text corpus that is split into training and evaluation sets.
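
As a rough sketch of the recurrence, with made-up dimensions and untrained random parameters; a real model adds a softmax over the output scores and is trained end to end:

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, embed_dim, hidden_dim = 1000, 32, 64   # illustrative sizes

    # Randomly initialised parameters (training is omitted here).
    E   = rng.normal(scale=0.1, size=(vocab_size, embed_dim))   # word embeddings
    W_x = rng.normal(scale=0.1, size=(hidden_dim, embed_dim))
    W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
    W_o = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))

    def rnn_step(h, word_id):
        """Fold one word into the hidden context vector."""
        return np.tanh(W_x @ E[word_id] + W_h @ h)

    def next_word_logits(word_ids):
        """Process a sentence word by word, then score the next word."""
        h = np.zeros(hidden_dim)
        for w in word_ids:
            h = rnn_step(h, w)
        return W_o @ h   # softmax over these gives next-word probabilities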

RNNs are weak when the next word depends on a word that occurred many words before: as each word is processed, the context held in the hidden vectors gradually loses information about earlier words. This weakness was addressed by Transformer-based large language models (LLMs). These use an explicit buffer for the context, enabling each word to pay attention to any of the preceding words in the context. With a deep stack of layers, and training against vast corpora, the networks can capture semantic dependencies, enabling effective responses to text prompts. Context lengths are now tens of thousands of words and are rapidly increasing with newer models.
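
The key step is scaled dot-product attention with a causal mask, sketched below in toy form; real models add learned projections, multiple heads and a deep stack of layers:

    import numpy as np

    def causal_attention(Q, K, V):
        """Each word attends to itself and the preceding words only."""
        n, d = Q.shape
        scores = Q @ K.T / np.sqrt(d)                     # pairwise similarities
        mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # future positions
        scores[mask] = -np.inf
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)    # softmax per word
        return weights @ V                                # context-mixed vectors

    # Toy example: 5 words in the context, 8-dimensional vectors.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(5, 8))
    out = causal_attention(X, X, X)   # self-attention over the context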

Transformers need a costly initial training phase, followed by fine tuning on the target applications, before being deployed. For sentient AI, we want to enable continual learning, and for this it makes sense to look at progress in the cognitive sciences. The large context buffers used by Transformers are biologically implausible, as is back propagation with gradient descent, so how does the brain manage language understanding and generation?

The insight from RNNs is that hidden vectors are insufficient for modelling the context, so the solution is likely to be found in updating the synaptic weights to remember the context. Jonides et al. provide an informative review of studies on short term memory, covering the processes of encoding, maintenance and retrieval [1]. They conclude that short term memory consists of items in the focus of attention along with recently attended representations in long term memory.

This could be modelled by splitting synaptic weights into short and long term components. The short term component is boosted by encoding and retrieval, and otherwise gradually decays, while the long term component uses a slower learning rate. The network layers are similar to RNNs, but use local prediction to obtain the learning signal for updating the weights, in place of back propagation. Attention uses cue-based retrieval. I still have to work out the details, and expect to evaluate the approach using a pre-training phase followed by an evaluation phase on test data. If all goes well, this will pave the way towards implementing sequential cognition (type 2 processing).
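
As a very rough sketch of how this might look, where the update rules, decay rate and local error signal are placeholder assumptions rather than settled design choices:

    import numpy as np

    rng = np.random.default_rng(2)
    n_in, n_out = 64, 64                                # illustrative layer sizes

    W_slow = rng.normal(scale=0.1, size=(n_out, n_in))  # long term component
    W_fast = np.zeros((n_out, n_in))                    # short term component

    DECAY     = 0.95    # per-step decay of the short term weights
    FAST_RATE = 0.5     # boost applied on encoding and retrieval
    SLOW_RATE = 0.01    # much slower learning for the long term weights

    def step(x, target):
        """One layer update driven by a local prediction error,
        in place of back propagation through the whole network."""
        global W_fast, W_slow
        y = np.tanh((W_slow + W_fast) @ x)   # prediction from combined weights
        error = target - y                   # local learning signal
        W_fast = DECAY * W_fast + FAST_RATE * np.outer(error, x)
        W_slow = W_slow + SLOW_RATE * np.outer(error, x)
        return y

    def cue_based_retrieval(cue, keys, values):
        """Attention as cue-based retrieval: blend stored items
        by their similarity to the cue."""
        sims = keys @ cue
        weights = np.exp(sims - sims.max())
        weights /= weights.sum()
        return weights @ values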

[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3971378/


Dave Raggett <dsr@w3.org>
