OpenAI Unveils Groundbreaking AI Speech Engine
On March 29, OpenAI, a world-renowned AI research organization, unveiled Voice Engine, an AI speech engine that generates natural-sounding speech closely resembling the original speaker's voice from only a single 15-second audio sample and text input. The achievement marks a major step for AI in the field of speech synthesis. However, OpenAI decided not to release the tool widely, citing the risk that it could be used to spread damaging disinformation during a global election year.
The 'replicated' voice can not only read text in the original speaker's native language but also reproduce the speaker's voice in other languages, including Spanish, French, and Chinese.
Some industry experts have noted that OpenAI's Voice Engine model represents a significant advance in speech synthesis and a successful pairing of AI technology with practical applications. As the technology improves and its applications expand, Voice Engine is expected to push the voice synthesis industry in new directions. It is understood that OpenAI developed the model as early as 2022 and that an initial version powers the text-to-speech function built into ChatGPT. The voice-cloning capability itself, however, has never been made public, as OpenAI has chosen a more cautious and measured approach to any broader release.
Like image and video generators, speech generators can be used to spread false information on social media, and criminals can use them as a tool for online or phone scams. To prevent misuse, the model is currently in a small-scale preview and is being tested only with selected partners, to ensure responsible adoption and steady advancement of the technology. OpenAI is concerned that the technology could be used to defeat voice authentication for online bank accounts and other personal applications. Jeff Harris, a product manager at OpenAI, stated that it is crucial to handle this sensitive matter with care. OpenAI is currently investigating ways to watermark synthesized voices or add further controls on their use.
The aim is to initiate discussions about the responsible deployment of synthesized voices and how society can adapt to these new capabilities. Based on these conversations and the results of the small-scale tests, OpenAI will make a more informed decision about whether and how to deploy the technology at scale. In an unsigned blog post, OpenAI acknowledged that its speech engine is not the only research effort in the AI speech field. In early 2023, Microsoft announced a text-to-speech AI model called VALL-E, which can generate near-realistic human voices from speech samples of just three seconds.
Microsoft refers to VALL-E as a 'neural codec language model' that generates audio based on text input and short samples of the target speaker. In their published report, Microsoft researchers stated that VALL-E exhibits in-context learning capabilities and can synthesize high-quality, personalized speech using only a three-second recording as an acoustic prompt. According to the report, experimental results show that VALL-E outperformed other AI speech systems in speech naturalness and speaker similarity, making it the most advanced zero-shot text-to-speech system available at the time.
Microsoft's text-to-speech AI raises similar security concerns. In addition, VALL-E could compete with human voice actors.
Despite these concerns, technology companies have continued to develop increasingly realistic AI voice systems in recent years. Papercup, a UK-based company, provides natural-sounding AI voiceovers in multiple languages for major media brands such as Sky News, Discovery, and Cinedigm. Sonantic produces highly realistic audio by including non-speech sounds such as subtle scoffs, faint breathing, and giggles.
AI-synthesized speech has numerous potential benefits and can save time and money in appropriate situations. Combining increasingly sophisticated AI voice technology with conversational AI will make virtual conversations feel more authentic. Character.AI, a new chatbot service, lets users converse with AI personas of anyone, from historical figures like Marx and Elizabeth to deceased loved ones. What kind of metaverse will unfold when VALL-E is combined with Character.AI?