Re: MathML + Suno AI from Deyan Ginev on 2024-04-09 (www-math@w3.org from April 2024)

From: Deyan Ginev <deyan.ginev@gmail.com>
Date: Tue, 9 Apr 2024 12:51:00 -0400
To: Patrick Ion <pion@umich.edu>
Cc: Stephen Watt <smwatt@gmail.com>, www-math@w3.org
Message-ID: <CANjPgh8tf62BQVCZj9cFzm3vNosBz1kasF5MEGOYkwKj39_FeA@mail.gmail.com>

Hi Patrick, all,

Transformers are ubiquitous nowadays, so I expect Suno AI is exclusively
using that architecture.
The key is how they've organized their training data and how they have
separated the aspects between models/inputs.
For example, when using the app it is clear that the musical genre is a
separate input from the lyrics, and that you can auto-generate lyrics from
a short English description using a different model from the
audio-generation model.

I see that Suno has an open-source model, called Bark, accessible here:
https://github.com/suno-ai/bark

Quoting its readme: "It follows a GPT style architecture similar to AudioLM
and Vall-E and a quantized Audio representation from EnCodec. It is not a
conventional TTS model, but instead a fully generative text-to-audio model
capable of deviating in unexpected ways from any given script."

So it appears that they use a single unified model (I assume billions of
parameters, based on the observed audio quality and coherence).

Some key steps for that model (which is likely a weaker version than their
v3 production model):
- They use an LLM transformer to embed the input lyrics into a latent space,
hence getting all kinds of useful context (sentiment, long-term verse
structure, etc.)
- Then they use an approach similar to AudioLM for "speech continuations"
by mapping "the input audio to a sequence of discrete tokens" and again
leveraging a transformer to learn in-context relationships.
- Here is a guess from me: They likely have a huge collection of pure raw
audio tracks for each genre in their training data, as a starting point
before applying the lyrics. But they also likely use that as training data,
so that the audio-transformer can interpolate adjacent variations to any
given style example.
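
As a rough illustration of the "sequence of discrete tokens" idea in the
second step, here is a minimal Python sketch. It uses plain uniform
quantization as a stand-in: AudioLM and Bark rely on learned neural codecs
(SoundStream, EnCodec), which this does not reproduce. The function names,
the 256-level codebook and the 8 kHz sample rate are assumptions made only
for the example.

```python
import math

NUM_LEVELS = 256    # assumed codebook size, chosen only for illustration
SAMPLE_RATE = 8000  # assumed sample rate in Hz, also illustrative


def audio_to_tokens(samples, num_levels=NUM_LEVELS):
    """Map float samples in [-1.0, 1.0] to integer ids in [0, num_levels - 1]."""
    tokens = []
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clip out-of-range samples
        tokens.append(round((s + 1.0) / 2.0 * (num_levels - 1)))
    return tokens


def tokens_to_audio(tokens, num_levels=NUM_LEVELS):
    """Invert the mapping, up to quantization error."""
    return [t / (num_levels - 1) * 2.0 - 1.0 for t in tokens]


# 10 ms of a 440 Hz sine wave as toy input.
wave = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(80)]
tokens = audio_to_tokens(wave)
restored = tokens_to_audio(tokens)

# The round trip loses at most half a quantization step per sample.
max_error = max(abs(a - b) for a, b in zip(wave, restored))
```

Once audio is a list of integers like `tokens`, it can be fed to a sequence
model in the same way text tokens are, which is the point of the step above.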

You can find the AudioLM paper at:
https://arxiv.org/abs/2209.03143

A key sentence from the abstract is also about data scale:
"By training on large corpora of raw audio waveforms, AudioLM learns to
generate natural and coherent continuations given short prompts."
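
To make "continuations given short prompts" concrete, here is a toy Python
sketch of autoregressive continuation over discrete tokens. A bigram count
table stands in for the transformer, so this shows only the shape of the
generation loop, not AudioLM's actual model. All names and the tiny corpus
are made up for the example.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count which token follows which across all training sequences."""
    counts = defaultdict(Counter)
    for seq in corpus:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def continue_prompt(counts, prompt, num_new):
    """Greedily extend a non-empty prompt one token at a time."""
    if not prompt:
        raise ValueError("prompt must contain at least one token")
    out = list(prompt)
    for _ in range(num_new):
        followers = counts.get(out[-1])
        if not followers:
            break  # context never seen in training: nothing to predict
        # Most frequent follower; ties go to the smallest token id.
        out.append(min(followers, key=lambda t: (-followers[t], t)))
    return out


# Toy "audio" corpus: integer token ids with a repeating 0-1-2-3 pattern.
corpus = [
    [0, 1, 2, 3, 0, 1, 2, 3],
    [0, 1, 2, 3, 0, 1, 2, 3],
    [3, 0, 1, 2, 2, 1],
]
model = train_bigram(corpus)
continuation = continue_prompt(model, prompt=[0, 1], num_new=6)
```

A real system would sample from a learned distribution conditioned on the
whole context rather than take the most frequent follower of the last
token, but the prompt-then-extend loop is the same idea.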

I have no practical experience with the audio modality, so take this as an
"educated guess" from a math NLP practitioner.

---
P.S.

@Stephen: Eurovision is likely out of the question, but Suno AI has its own
internal ranking which has all kinds of strange curiosities. I can't link
to that, however; it is the "Explore" tab in the app.

Greetings,
Deyan

On Tue, Apr 9, 2024 at 12:31 PM Patrick Ion <pion@umich.edu> wrote:

> Amazing!  All sorts of things are being eclipsed.
>
> A question, Deyan, is whether you have any good idea how this
> is possible?  In particular, can you recommend any discussions
> of the process going from input chat to starting up generation
> from a pre-trained LLM?
>
> I just found 3Blue1Brown's recent course on AI, in particular
> transformer technology, very interesting.
>
> Patrick
>
>
> On Tue, Apr 9, 2024 at 12:08 PM Stephen Watt <smwatt@gmail.com> wrote:
>
>> Wow!  Fantastic!    Is it too late to enter Eurovision?
>>
>> On Tue, Apr 9, 2024 at 10:41 AM Deyan Ginev <deyan.ginev@gmail.com>
>> wrote:
>>
>>> Hi everyone,
>>>
>>> I wanted to share a musical curiosity with readers here, purely for
>>> entertainment.
>>>
>>> There is a new startup called "Suno AI" and based in Cambridge, MA,
>>> which is innovating on the text-to-music generation front. That is now
>>> encompassing all production aspects (lyrics, voice, instrumental).
>>>
>>> Impressively, they can work on any text as input, even spec text, and
>>> have most music styles available. So it's a fun toy...
>>>
>>> Without further delay, here is an AI-generated song, using the start of
>>> the MathML spec text as the input. I only rearranged the lyrics a little.
>>> To showcase the tool better, here is the same input in 3 different styles
>>> (they're about 1-2 minutes long, take 30 seconds to generate).
>>>
>>> style 1:
>>> https://app.suno.ai/song/e473ab5d-6656-4efa-8aa3-8a3be1981d3c/
>>>
>>> style 2:
>>> https://app.suno.ai/song/7da4ffc3-aa2b-4505-9990-a30b844594e9
>>>
>>> style 3:
>>> https://app.suno.ai/song/4a68178f-eed9-43a5-a849-7d35c55e2669/
>>>
>>> Enjoy,
>>> Deyan
>>>
>>
Received on Tuesday, 9 April 2024 16:51:32 UTC