OpenAI’s new ‘Voice Engine’ needs just 15 seconds to replicate a voice.

admin, March 30, 2024

OpenAI, the AI company behind ChatGPT, the dominant generative AI tool, has unveiled a new voice cloning technology called “Voice Engine.” This audio model can replicate human voice, intonation, and other distinct human speech patterns based on relatively small samples of original audio.

“It is noteworthy that a small model with a single 15-second sample can produce emotive and realistic speech,” the company said in a blog post Friday.

For comparison, AI voice platform ElevenLabs offers an instant voice cloning tool that requires at least one minute of sample audio. Its professional-tier service calls for approximately 10 minutes of continuous audio for best results.

The company showed numerous examples of what this technology can do. In one example, the voice of a young patient who had lost much of her ability to speak due to a vascular brain tumor was replicated using old recordings she had made for a school project.

OpenAI collaborated with Lifespan, a nonprofit affiliated with Brown University School of Medicine, and with the company behind Livox, an “alternative communication app” designed for people with disabilities. The team was able to work from recordings the patient had made for school presentations.

Voice Engine was then able to provide instant text-to-speech capabilities, allowing patients to speak effectively in their own voices.

OpenAI also demonstrated how HeyGen uses its technology to naturally translate speech uploaded in one language into another.

According to the company, Voice Engine was first developed in late 2022 and already powers the preset voices available in OpenAI’s text-to-speech API and ChatGPT’s voice and read-aloud features. Advanced as the technology is, the company said it is exercising caution before a wider rollout.

Acknowledging the widely condemned practice of deepfakes, OpenAI wrote, “We hope to start a conversation about the responsible deployment of synthetic voices and how society can adapt to these new capabilities.” The voices of celebrities, government officials, and a growing number of civilians are being mimicked for nefarious purposes, including political campaigns, fake advertising, and outright criminal activity. U.S. President Joe Biden has called for more safeguards against the malicious use of AI voice mimicry.

In fact, Meta revealed last summer that it was shelving its AI voice tool specifically because of “the risk of possible misuse.”

“In line with our approach to AI safety and our voluntary commitments, we have decided to make this technology available in advance, but not widely available at this time,” OpenAI explained.

Even before any public release, OpenAI has placed restrictions on Voice Engine, including a list of prominent people it cannot emulate.

“We believe that widespread deployment of synthetic voice technology should be accompanied by a voice authentication experience to ensure that original speakers are knowingly adding their voices to the service, and a banned voice list to detect and prevent the production of voices that excessively resemble prominent figures,” OpenAI wrote.

Partners testing Voice Engine today have agreed to OpenAI’s Usage Policy, which prohibits impersonating any person or organization without consent. The company also requires explicit, prior consent from the original speaker and does not allow developers to build ways for individual users to clone their own voices.

“Based on these conversations and the results of our small-scale testing, we will make more informed decisions about whether and how to deploy this technology at scale,” the blog post says.

In addition to Voice Engine, OpenAI is working on several projects simultaneously. CEO Sam Altman said the company is working to launch GPT-5 this year. The company also introduced Sora, a generative video tool. The company claims Sora will be the most advanced video generator on the market, surpassing models like Pika, Stable Video Diffusion, and Runway ML.

Sora is currently available only to “red teams” registered with OpenAI, a restriction meant to prevent abuse.

Voice Engine may be able to outperform other voice cloning tools, including offerings from Meta, ElevenLabs, and WellSaid Labs, as well as the open-source model RVC.

OpenAI is also working on a secret project called Q*, of which little more than the name has leaked. Sam Altman declined to give details, but said the research team is focused on finding techniques and approaches that make AI inference better.

Edited by Ryan Ozawa.