- From: Dave Raggett <dsr@w3.org>
- Date: Thu, 16 Jan 2025 11:18:50 +0000
- To: Ronald Reck <rreck@rrecktek.com>
- Cc: public-cogai <public-cogai@w3.org>
- Message-Id: <5A1D342C-77B2-485D-A1FD-1C77D0C01A17@w3.org>
>> A further refinement would be to repurpose the input layers when you want to use the model to generate output. Conventional large language models rely on feed forward connections. Imagine folding the top half of the transformer stack back on itself. This necessitates feed backward connections. Continual learning then enables the model to mimic the input statistics when running in generation mode, akin to children quickly picking up the informal language patterns of their peers.
>
> Can you help me understand what you are saying here better about the folding back upon itself? I can imagine feed backward approaches like multi-shot prompting.

A conventional LLM starts by adding the positional encoding to the token embedding for the prompt. This feeds into a stack of transformer layers, with the output used to predict the tokens in the response. This is a purely feedforward architecture from prompt to response.

An alternative is to design the LLM to feed information up the layer stack for language understanding, and down the layer stack for language generation. The model parameters for each layer will then ensure that the same statistics are used for understanding and generation. My hunch is that this will work well for local learning rules like those we believe the brain uses. The architecture facilitates running the model in different modes: input mode, quiet thinking mode and output mode (sketched below).

>> Recent work on large language models has shown the potential for quiet thinking, i.e. thinking a while on a problem rather than responding immediately. In principle, this should produce better results and reduce the likelihood of hallucinations. For humans it amounts to the difference between working something out step by step versus making a wild guess under pressure when asked a question.
>
> Personally I think this is based on expectations. As LLM excitement transitions into agentic aspirations, people will both expect more and be more tolerant of iterations to get answers with higher fidelity. I am mulling over what a hallucination means.

Hallucinations are inherent to generative models because of how they generate responses. LLMs are statistical models, and hallucinations occur when the statistics generate guesses that are highly plausible in the given context. By breaking cognition into smaller steps, the guesses the LLM makes are much more likely to be correct. Quiet thinking can be directed to explore a large space of possibilities, enabling agents to carry out tasks which would otherwise be far too demanding for single-step prompt-to-response processing.

>> Quiet thinking corresponds to applying a sequence of transformations to the latent semantics across multiple layers in a neural network. As such it is similar to the processing needed for both sequence understanding and generation. Can we design a neural network to support quiet thinking in addition to sequence learning, understanding and generation?
>
> Can you help me understand what you mean by latent semantics here? I think about it meaning context, like how I know whether you mean jaguar the car manufacturer or the animal based on context, or even the capital "J" for the manufacturer.

Latent semantics is a term used for the information held in a model at each layer after processing the prompt. It will include syntactic and semantic information using a distributed representation (vector space) as learned from the training corpora.
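As a rough illustration of the folded architecture and the three modes described above, here is a toy Python (PyTorch) sketch in which a single stack of blocks is run bottom-up for understanding, iterated in place for quiet thinking, and run top-down with the same parameters for generation. The block design, class names and mode methods are assumptions made for illustration only, not a specification from the message.

```python
# Toy sketch: one stack of shared blocks used for an upward "understanding"
# pass, an in-place "quiet thinking" loop, and a downward "generation" pass.
# All names and the block design are illustrative assumptions.
import torch
import torch.nn as nn

class SharedBlock(nn.Module):
    """A residual MLP block whose parameters are reused in both directions."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ff(self.norm(x))

class FoldedStack(nn.Module):
    """Understanding runs the blocks bottom-up; generation reuses the same
    blocks top-down, so both directions share one set of statistics."""
    def __init__(self, dim: int = 64, depth: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList([SharedBlock(dim) for _ in range(depth)])

    def understand(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:              # input mode: feed forward, upward
            x = block(x)
        return x                               # latent semantics at the top

    def think(self, latent: torch.Tensor, steps: int = 3) -> torch.Tensor:
        for _ in range(steps):                 # quiet thinking: keep transforming
            latent = self.blocks[-1](latent)   # the latent state in place
        return latent

    def generate(self, latent: torch.Tensor) -> torch.Tensor:
        for block in reversed(self.blocks):    # output mode: same parameters,
            latent = block(latent)             # applied top-down
        return latent

model = FoldedStack()
tokens = torch.randn(1, 10, 64)                # stand-in for an embedded prompt
latent = model.think(model.understand(tokens))
output = model.generate(latent)                # stand-in for response embeddings
```

The only point of the sketch is that understanding and generation share one set of weights, so continual learning on inputs also shapes the statistics used when the model runs in generation mode.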
>> The main difference is the need to support reinforcement learning over multiple steps. This is where we need episodic memory. However, the details are far from clear.
>
> When I read this, I immediately jump to think of named graphs for episodic memory and techniques I can use to include or not include them in a query.

Except the brain is likely to be using a latent representation rather than a symbolic one. The functional requirements are similar, e.g. being able to recall events in their original order.

>> Can we use the same memory for reinforcement learning and sequence learning? How is the task reward propagated backward through time given the transformer-inspired model of cognition?
>
> I am not sure I follow completely. What comes to mind is that if I regard something as successful, the "training" examples I create are put in named graphs that I add a +1 credibility/veracity score to.

To apply the task reward/penalty to the sequence of transformations used to accomplish the task, we need to work backward through the sequence, equivalent to recalling events in reverse order. For a neural network model of episodic memory, we could use a temporal encoding similar to that used in regular LLMs. This would enable events to reference other events via their temporal encoding along with other latent semantics. I suspect that the requirements for reinforcement learning and sequence learning are similar enough to allow the use of a common memory network.

The challenge is to develop a simple enough approach to enable experimental evaluation. For this purpose, artificial sequences that can be algorithmically generated would dramatically lower the computing resources needed compared to working with human language. What is a good choice? One possibility is elementary mathematics, including numerical operations such as long addition, multiplication and division.

Best regards,

Dave Raggett <dsr@w3.org>
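A minimal sketch of the episodic memory idea in the message above, assuming a sinusoidal temporal encoding like the positional encoding used in regular LLMs, and a simple discounted credit assignment when the episode is replayed in reverse. The class, method names and discounting scheme are hypothetical choices for illustration.

```python
# Sketch: store per-step latent vectors tagged with a sinusoidal temporal
# encoding, then replay the episode in reverse so a final task reward can be
# propagated backward over the steps. Names and discounting are assumptions.
import numpy as np

def temporal_encoding(t: int, dim: int) -> np.ndarray:
    """Sinusoidal code for time step t, as in transformer positional encodings."""
    i = np.arange(dim // 2)
    angles = t / (10000 ** (2 * i / dim))
    return np.concatenate([np.sin(angles), np.cos(angles)])

class EpisodicMemory:
    def __init__(self, dim: int):
        self.dim = dim
        self.events: list[dict] = []

    def store(self, latent: np.ndarray) -> None:
        t = len(self.events)
        self.events.append({
            "time": temporal_encoding(t, self.dim),  # lets events reference
            "latent": latent,                        # each other by time
            "credit": 0.0,
        })

    def replay_reverse(self, reward: float, discount: float = 0.9) -> None:
        """Recall events in reverse order, assigning discounted credit."""
        credit = reward
        for event in reversed(self.events):
            event["credit"] += credit
            credit *= discount

memory = EpisodicMemory(dim=16)
for step in range(5):                      # one thinking episode of five steps
    memory.store(np.random.randn(16))      # stand-in for per-step latents
memory.replay_reverse(reward=1.0)          # task succeeded: propagate reward
print([round(e["credit"], 2) for e in memory.events])
```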
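For the algorithmically generated arithmetic sequences suggested at the end of the message, one possible form is long addition spelled out digit by digit, so each example is a short episode of intermediate steps rather than a single prompt-to-answer pair. The textual step format below is an arbitrary choice for illustration.

```python
# Generate a long-addition episode as a sequence of digit-by-digit steps.
# The step wording is illustrative, not a proposed standard.
import random

def long_addition_episode(digits: int = 4) -> list[str]:
    a, b = random.randrange(10 ** digits), random.randrange(10 ** digits)
    steps, total, carry = [f"{a} + {b}"], [], 0
    for da, db in zip(reversed(f"{a:0{digits}d}"), reversed(f"{b:0{digits}d}")):
        s = int(da) + int(db) + carry
        carry, digit = divmod(s, 10)
        total.append(str(digit))
        steps.append(f"{da} + {db} + carry -> write {digit}, carry {carry}")
    if carry:
        total.append(str(carry))
    steps.append(f"result {''.join(reversed(total))}")
    return steps

for line in long_addition_episode():
    print(line)
```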
Received on Thursday, 16 January 2025 11:19:02 UTC