Re: Text to Image and human cognition from Adeel on 2022-10-20 (public-cogai@w3.org from October 2022)

From: Adeel <aahmad1811@gmail.com>
Date: Thu, 20 Oct 2022 18:44:33 +0100
To: Dave Raggett <dsr@w3.org>
Cc: public-cogai <public-cogai@w3.org>
Message-ID: <CALpEXW2d7ZbcK7kMnvdH5OsKDUDgshcRihQ3FjdMPy6z-Od5XQ@mail.gmail.com>
Hello,
The so-called Stable Diffusion is not that stable. It is at most near-stable,
as it uses continuous random sampling in the latent space of the variational
autoencoder.
As you mention, these models are not perfect: they can produce blurry and
unrealistic outputs when sampling from the probability distribution learned
by minimising the loss function.
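
To illustrate the random sampling point, here is a minimal sketch of drawing a latent as z = mean + std * eps, with eps taken from a standard normal. It is not Stable Diffusion's code: the function name `sample_latent` and the numbers are made up for illustration. It only shows that the output depends on the random draw, and is repeatable only when the seed is fixed.

```python
import random

def sample_latent(mean, std, seed):
    """Draw z = mean + std * eps with eps ~ N(0, 1), one value per dimension."""
    rng = random.Random(seed)  # private generator, so the seed fully fixes the draw
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]

# Hypothetical 4-dimensional latent distribution (illustrative numbers only).
mean = [0.0, 1.0, -1.0, 0.5]
std = [1.0, 0.5, 0.5, 2.0]

z_a = sample_latent(mean, std, seed=1)
z_b = sample_latent(mean, std, seed=1)  # same seed as z_a
z_c = sample_latent(mean, std, seed=2)  # different seed

print("same seed gives same latent:", z_a == z_b)
print("different seed gives same latent:", z_a == z_c)
```

In this toy version the only source of variation is the seed, which is why image generators usually expose it as a parameter.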
Thanks,
Adeel

On Wed, 19 Oct 2022 at 17:56, Dave Raggett <dsr@w3.org> wrote:

> Current text to image generators are mind-blowing in their ability to
> mimic a vast range of photos and artwork, seemingly by magic. The release
> of Stable Diffusion has made it practical to run text to image generators
> on the latest laptops, but you can also experiment with it online at:
>
> https://huggingface.co/spaces/stabilityai/stable-diffusion
>
> It was trained on a large dataset of image+text pairs scraped from the
> Internet using the HTML IMG element and its ALT attribute for the text
> descriptions.  In essence, text prompts are first mapped to a language
> embedding using the CLIP text encoder. This embedding conditions a latent
> model for images, which starts out as noise. The latent is then denoised
> in a series of steps that fill out the details, using the prior knowledge
> from the dataset, and finally decoded to create the pixels in the resulting
> image.
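
The steps described above (embed the prompt, start from noise, denoise iteratively, decode to pixels) can be sketched as a toy loop. This is only an illustration of the control flow under heavy assumptions: `embed_prompt`, `denoise_step`, `decode` and `generate` are hypothetical stand-ins, not the real text encoder, denoising network or decoder, and the "denoiser" here simply nudges the latent towards the prompt embedding.

```python
import random

def embed_prompt(prompt, dim=4):
    """Stand-in text encoder: fold character codes into a small fixed-size vector."""
    vec = [0.0] * dim
    for i, ch in enumerate(prompt):
        vec[i % dim] += ord(ch) / 1000.0
    return vec

def denoise_step(latent, cond, strength=0.3):
    """Stand-in denoiser: move each latent value part of the way towards the conditioning."""
    return [z + strength * (c - z) for z, c in zip(latent, cond)]

def decode(latent):
    """Stand-in decoder: clamp each value to [0, 1] and scale to a 0..255 pixel."""
    return [int(round(255 * min(1.0, max(0.0, z)))) for z in latent]

def generate(prompt, steps=20, seed=0):
    rng = random.Random(seed)
    cond = embed_prompt(prompt)                   # text -> embedding
    latent = [rng.gauss(0.0, 1.0) for _ in cond]  # start from pure noise
    for _ in range(steps):                        # iterative denoising
        latent = denoise_step(latent, cond)
    return decode(latent)                         # latent -> pixels

pixels = generate("a cat")
print(pixels)
```

A real system replaces each stand-in with a trained neural network, but the overall shape of the loop is the same: noise in, a fixed number of conditioned refinement steps, then decoding.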
>
> My vague understanding of the training process is that noise is added to
> training images and the network learns to predict and remove that noise,
> guided by the text embedding, rather than using a generative/adversarial
> approach. The text encoder is trained separately to tell whether a text
> prompt matches or doesn’t match a given image. The resulting model is
> about 4GB in size, which is
> surprisingly small given the huge breadth of images covered.
>
> Stable Diffusion is great when it comes to backgrounds, and for close-ups
> of faces, but has a tendency to make bizarre errors with hands, fingers and
> arms, as well as failing to provide sufficient detail in the faces of
> figures that are not the main focus of the composition.  Animals also often
> come out weirdly, so generating aesthetically pleasing images is a matter
> of good luck and a good prompt, see:
>
> https://www.unite.ai/three-challenges-ahead-for-stable-diffusion/
>
> This is perhaps unsurprising given the neural network architecture.  There
> is ongoing work on extracting 3D models from small sets of 2D images. In
> principle, this should extend to inferring likely 3D models from single
> images of human faces and bodies; however, this will also require the
> generator to pay extra attention to things that people are especially
> attuned to. Existing image to image software can already remove noise,
> increase image resolution, colourise monochrome images, render faces in
> changed orientations, and make people look younger or older than in the
> original image.
>
> I anticipate that future text to image generators will include a rich
> grasp of everyday knowledge and support collaborative dialogues for
> creating and refining artworks as an iterative process. Commercial artists
> will become experts at doing this, combining the computer's imagination
> with the human creative spark and intuitive understanding of emotions, etc.
>
> I am now wondering about how to combine artificial neural networks with
> human-like reasoning and learning. This involves combining everyday
> knowledge with working memory, and providing a means to support sequential
> cognition in terms of sequences of inference steps, rather than simple
> associations. Humans learn a lot from thinking about explanations for what
> they observe, so in principle, we should be able to mimic that, and enable
> computers to learn effectively from understanding texts and videos.
>
> This raises questions about how to design artificial neural networks to
> replicate plausible reasoning, e.g. how to support variables, queues, and
> sets, as well as how to mimic multiple inference strategies and
> metacognition for controlling them. Current neural networks are designed
> for single purposes, rather than general purpose cognition, so some fresh
> ideas are likely to be needed.
>
> p.s. an open question is whether an extended form of copyright is needed,
> given that text to image generators are very good at mimicking the style of
> popular artists rather than copying their artworks. In principle, I can see
> a rationale for artists and photographers, etc. having to give their
> explicit permission for AI-based systems to be trained using their creative
> works.
>
> Dave Raggett <dsr@w3.org>
>
Received on Thursday, 20 October 2022 17:44:57 UTC