The Sora of Music! With Suno, Anyone Is a Musician
Generate two minutes of music in just a few seconds, simply by entering a genre and a theme.
Suno, an artificial intelligence startup, recently released the V3 version of its model to the public and is offering a free trial on its official website. V3 adds more music styles and genres, follows prompts more faithfully, and reduces hallucination, producing noticeably better results. That is why the AI-driven song generator Suno is spreading rapidly through the community and sparking a wave of creativity.
Suno can generate complete songs, including lyrics, vocals, and instrumentation, from a simple text description entered by the user. This makes music creation accessible to everyone, not just professionals, and lets even people without any musical background write their own songs. Several AI music generators have been launched, but Suno stands out as the 'ChatGPT of music'.
The evolution of text-to-speech (TTS) architectures can be summarized as formant synthesis → concatenative synthesis → neural synthesis. Today, state-of-the-art TTS is available through a single API call to ElevenLabs' or OpenAI's TTS models, or through products such as Descript. The output is low-latency, smooth, and natural, and can even mimic a variety of accents. One day, everyone will have their own AI voice companion. And what comes after having a voice companion? Making it sing, of course!
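As a rough illustration of what "a single API call" means in practice, here is a minimal sketch using OpenAI's Python SDK. The model name, voice preset, and response-handling details are assumptions based on the SDK's documented interface and may differ in current releases.

```python
# Minimal sketch: text-to-speech in one API call via the OpenAI Python SDK.
# Model and voice names are illustrative assumptions and may change.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

speech = client.audio.speech.create(
    model="tts-1",    # assumed TTS model name
    voice="alloy",    # assumed voice preset
    input="Hello! This sentence was spoken by a model, not a person.",
)

# Save the returned audio bytes to disk.
with open("hello.mp3", "wb") as f:
    f.write(speech.read())
```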
Suno was founded less than two years ago by Mikey Shulman, Keenan Freyberg, Georg Kucsko, and Martin Camacho. All four are machine learning experts and had previously worked together at the AI firm Kensho, where they built speech recognition tools for financial scenarios such as earnings calls. As musicians and audiophiles, they began experimenting with combining text-to-speech, AI, and audio generation, and eventually left Kensho to pursue it full-time.
Their first product to scale was Bark, the first open-source Transformer-based text-to-audio model. It went from zero to 19,000 stars on GitHub within a month.
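For readers who want to try it, the public Bark repository (suno-ai/bark) exposes a small Python API. The sketch below follows the repository's README; exact function names may change between versions.

```python
# Minimal sketch of generating audio with the open-source Bark model,
# following the suno-ai/bark README.
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()  # download and cache the Bark model weights

text_prompt = "Hello, my name is Suno. And, uh, I like to sing."
audio_array = generate_audio(text_prompt)  # numpy array of raw audio samples

write_wav("bark_out.wav", SAMPLE_RATE, audio_array)
```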
As Bark grew in popularity, more and more users began using it to generate music. The team realized their model architecture could produce music people genuinely enjoyed, and that they were on a path other research groups had largely ignored.
Much of the attention today goes to large language models and their powerful information processing and intelligence. But it is worth not overlooking the other side of the coin: music creation. The market is relatively small, yet the emotion and enjoyment it brings people are very real.
Suno has since launched V3 Alpha, which brings further improvements. Audio generation scenarios fall into three categories: music, speech, and sound effects (SFX). Suno is part of a wave of work that combines music and speech generation. Other related efforts include Audiobox for mixing speech and sound, Stable Audio for generating music and sound effects, and Seamless for combined translation and speech generation. No model on the market today handles all of these use cases, but I believe one will, and Transformers will most likely remain at the center of it.
Essentially, we use Transformers to process audio the same way we use them to process text: predict the next segment of audio, then repeat until the required output is produced.
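To make the analogy concrete, here is a minimal, illustrative sketch rather than Suno's actual architecture. It assumes audio has already been encoded into discrete tokens by a neural codec, and a small decoder-style Transformer predicts the next token autoregressively, exactly as a language model predicts the next word. All names, sizes, and the codebook vocabulary are made up for illustration.

```python
# Illustrative sketch: next-token prediction over discrete audio tokens.
import torch
import torch.nn as nn

VOCAB_SIZE = 1024   # assumed size of the codec's discrete token codebook
CONTEXT = 256       # assumed context window, in tokens

class TinyAudioLM(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, d_model)
        self.pos_emb = nn.Embedding(CONTEXT, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens):
        # tokens: (batch, seq) ids of audio tokens from a neural codec
        seq = tokens.shape[1]
        pos = torch.arange(seq, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # causal mask: each position may only attend to earlier audio
        mask = nn.Transformer.generate_square_subsequent_mask(seq).to(tokens.device)
        x = self.blocks(x, mask=mask)
        return self.head(x)  # logits over the next audio token

@torch.no_grad()
def generate(model, prompt_tokens, n_new):
    """Autoregressively extend a sequence of audio tokens."""
    tokens = prompt_tokens
    for _ in range(n_new):
        logits = model(tokens[:, -CONTEXT:])                 # predict from recent context
        next_tok = logits[:, -1].argmax(-1, keepdim=True)    # greedy choice of next token
        tokens = torch.cat([tokens, next_tok], dim=1)        # append and repeat
    return tokens  # a codec decoder would turn these back into a waveform

if __name__ == "__main__":
    model = TinyAudioLM()
    prompt = torch.randint(0, VOCAB_SIZE, (1, 16))  # stand-in for encoded audio
    out = generate(model, prompt, n_new=32)
    print(out.shape)  # (1, 48) token ids
```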
Suno's early research was difficult and the initial results were unsatisfying. The upside was that the idea stayed clear throughout: inject as little explicit knowledge as possible, because that kind of human intervention can disrupt the model's learning. With music and audio, the key is to avoid artificially imposing rules on the model and instead let it learn and explore on its own.