Latent reasoning and image generation

If you use Gemini in its thinking or pro modes, you will be familiar with the “Show thinking” action that reveals the trace of verbal chain-of-thought reasoning underpinning Gemini’s response to your query. Other researchers have been working on ways to model thinking without the need to verbalise each step. This is often referred to as latent reasoning, and sometimes as implicit chain-of-thought.

In principle, latent reasoning can reduce computational effort and handle logic that is difficult to verbalise. The downside is that latent reasoning may collapse on multi-step math problems, and can be hard to debug without auxiliary decoders that map intermediate states to text. I believe these weaknesses can be overcome by requiring AI agents to create a record of their work, just as humans do when working on problems.

Existing text-to-image generators map text directly to pixels and have difficulties with cardinality, which tends to get lost in the noise during the diffusion process. Latent reasoning allows the model to plan layouts, e.g. when asked to show 12 apples on a tablecloth. There is still a risk of semantic drift, e.g. generating 12 pears instead of apples. It may also be difficult to tweak a specific part of the reasoning, e.g. when the user requests "keep 12 apples, but make them green."

Traditional models map the prompt to the response as a stand-alone computation. To give the appearance of remembering across multiple conversational rounds with the user, the query/response pairs are copied into the prompt for the next round, so the internal state has to be recreated on every round. An approach based upon latent reasoning would preserve the internal state: instead of forcing thought into text, the agent would retain a memory of the dialogue, saving valuable computational resources.
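The contrast can be sketched in a few lines of Python. This is a toy illustration only: the model interface (`generate`, `initial_state`, `step`) is hypothetical and stands in for whatever the underlying system provides; the point is purely the difference in control flow.

```python
class StatelessChat:
    """Re-encodes the full transcript on every round (today's approach)."""
    def __init__(self, model):
        self.model = model
        self.transcript = []          # query/response pairs copied forward

    def ask(self, query):
        # The whole history is pasted back into the prompt each time,
        # so the model rebuilds its internal state from scratch.
        prompt = "\n".join(self.transcript + [query])
        response = self.model.generate(prompt)
        self.transcript += [query, response]
        return response


class StatefulChat:
    """Carries latent state forward instead of replaying text."""
    def __init__(self, model):
        self.model = model
        self.state = model.initial_state()   # latent memory of the dialogue

    def ask(self, query):
        # Only the new query is processed; prior rounds live in the state.
        response, self.state = self.model.step(query, self.state)
        return response
```

With prompt replay, the work per round grows with the length of the transcript; with preserved state, it stays roughly constant per round.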

Future text-to-image generators can be expected to support users in iteratively improving images, using latent reasoning to operate over the latent image semantics.

Some recent work includes Huang et al. [1], who describe a pipeline that first generates a plan, then an image, then a critique, and finally a fix. Wang et al. [2] describe a framework they call “planning with latent thoughts.” Their model predicts the next latent vector in a high-dimensional manifold, allowing it to maintain a superposition of logical possibilities (i.e. reasoning diversity) rather than collapsing prematurely to a single text token.
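The core idea of iterating in latent space before verbalising can be illustrated with a minimal numpy sketch. The weights here are random placeholders and the decode step is a bare argmax; this shows the control flow of latent chain-of-thought in general, not the actual architecture of Wang et al.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 16, 5
W = rng.normal(0, 0.3, (d, d))       # latent transition: one "thought" step
V = rng.normal(0, 0.3, (vocab, d))   # decoder from latent state to tokens

def latent_step(h):
    # The next thought stays a continuous vector; nothing is verbalised,
    # so the state can blend several possibilities at once.
    return np.tanh(W @ h)

def decode(h):
    # Projecting to the vocabulary collapses the state to a single token;
    # latent reasoning defers this until an answer is actually needed.
    logits = V @ h
    return int(np.argmax(logits))

h = rng.normal(0, 1, d)              # encoding of the prompt
for _ in range(4):                   # four reasoning steps in latent space
    h = latent_step(h)
token = decode(h)                    # a single decode at the end
```

By contrast, an explicit chain of thought would call `decode` after every step, committing to one token each time and discarding the alternatives.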

This is encouraging with respect to improved capabilities for AI agents, both for reasoning tasks and for generative tasks, e.g. generating images, music, video and even PowerPoint presentations. We can look forward to human creativity and judgement combined with AI tools for greater productivity.

[1] Huang et al., Interleaving Reasoning for Better Text-to-Image Generation. https://huggingface.co/papers/2509.06945
[2] Wang et al., Latent Chain-of-Thought as Planning: Decoupling Reasoning from Verbalization. https://www.researchgate.net/publication/400237120_Latent_Chain-of-Thought_as_Planning_Decoupling_Reasoning_from_Verbalization

Best regards,

Dave Raggett <dsr@w3.org>

Received on Sunday, 8 February 2026 11:02:05 UTC