Text to Image and human cognition

Current text-to-image generators are mind-blowing in their ability to mimic a vast range of photos and artwork, seemingly by magic. The release of Stable Diffusion has made it practical to run text-to-image generators on the latest laptops, but you can also experiment with it online at:

 https://huggingface.co/spaces/stabilityai/stable-diffusion

It was trained on a large dataset of image+text pairs scraped from the Internet, using the HTML IMG element and its ALT attribute for the text descriptions.  In essence, text prompts are first mapped to a language embedding by a transformer-based text encoder (CLIP). The generator then starts from a random latent in a compressed image space and denoises it in a series of steps, guided by the text embedding and the prior knowledge from the dataset, gradually filling out the details. Finally, a decoder turns the resulting latent into the pixels of the output image.
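
To make that pipeline concrete, here is a minimal sketch of text-to-image generation using the Hugging Face diffusers library; the model identifier, step count and guidance scale are just illustrative defaults rather than anything specific to the description above:

import torch
from diffusers import StableDiffusionPipeline

# Load the pretrained components: text encoder, latent U-Net and VAE decoder.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")  # use "cpu" (and float32) on a laptop without a suitable GPU

prompt = "a watercolour painting of a lighthouse at dusk"

# The prompt is encoded to a text embedding, a random latent is denoised
# over a series of steps guided by that embedding, and the decoder turns
# the final latent into pixels.
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("lighthouse.png")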

My understanding of the training process is that the diffusion model learns to predict the noise that was added to images, while the text encoder was trained contrastively to predict whether a text caption matches a given image, and the image autoencoder includes an adversarial loss that tries to tell reconstructed images from originals. The resulting model is about 4GB in size, which is surprisingly small given the huge breadth of images covered.
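
To make the denoising objective more concrete, here is a rough sketch of the core training loss, following the shape of the diffusers training examples; the function and variable names are my own and the details are simplified:

import torch
import torch.nn.functional as F

def denoising_loss(unet, latents, text_embeddings, noise_scheduler):
    # Add random noise to the (latent) images at randomly chosen timesteps.
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device
    )
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # The U-Net, conditioned on the text embedding, predicts that noise;
    # training minimises the error between predicted and actual noise.
    noise_pred = unet(
        noisy_latents, timesteps, encoder_hidden_states=text_embeddings
    ).sample
    return F.mse_loss(noise_pred, noise)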

Stable Diffusion is great when it comes to backgrounds and close-ups of faces, but it has a tendency to make bizarre errors with hands, fingers and arms, and it fails to provide sufficient detail for the faces of figures that are not the main focus of the composition.  Animals also often come out weirdly, so generating aesthetically pleasing images is a matter of good luck and a good prompt, see:

 https://www.unite.ai/three-challenges-ahead-for-stable-diffusion/

This is perhaps unsurprising given the neural network architecture.  There is ongoing work on extracting 3D models from small sets of 2D images. In principle, this should extend to inferring likely 3D models from single images of human faces and bodies; however, it will also require the generator to pay extra attention to the things that people are especially attuned to. Existing image-to-image software can already remove noise, increase image resolution, colourise monochrome images, render faces in changed orientations, and make people look younger or older than in the original image.

I anticipate that future text-to-image generators will include a rich grasp of everyday knowledge and will support collaborative dialogues for creating and refining artworks as an iterative process. Commercial artists will become experts at this, combining the computer's imagination with the human creative spark and an intuitive understanding of emotions, etc.
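
Today's tools already allow a crude approximation of that iterative loop by feeding each output back in as the input to an image-to-image pass with a revised prompt; the sketch below uses the diffusers image-to-image pipeline, with illustrative file names and settings:

import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Successive refinements of the same composition.
prompts = [
    "a portrait of an elderly fisherman, oil painting",
    "the same portrait, warmer lighting, more detail in the eyes",
    "the same portrait, with a stormy sky in the background",
]

image = load_image("initial_sketch.png")  # any starting image
for prompt in prompts:
    # strength controls how far each pass may move away from its input.
    image = pipe(prompt=prompt, image=image, strength=0.6,
                 guidance_scale=7.5).images[0]

image.save("refined.png")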

I am now wondering how to combine artificial neural networks with human-like reasoning and learning. This involves combining everyday knowledge with working memory, and supporting sequential cognition as sequences of inference steps rather than simple associations. Humans learn a lot from thinking about explanations for what they observe, so in principle we should be able to mimic that and enable computers to learn effectively from understanding texts and videos.

This raises questions about how to design artificial neural networks that support plausible reasoning, e.g. how to represent variables, queues and sets, and how to mimic multiple inference strategies along with the metacognition needed to control them. Current neural networks are designed for single purposes rather than general-purpose cognition, so some fresh ideas are likely to be needed.

P.S. An open question is whether an extended form of copyright is needed, given that text-to-image generators are very good at mimicking the style of popular artists rather than copying their artworks. In principle, I can see a rationale for requiring artists, photographers and other creators to give their explicit permission before AI-based systems are trained on their creative works.

Dave Raggett <dsr@w3.org>

Received on Wednesday, 19 October 2022 16:56:32 UTC