Using snippets of voices, Baidu's ‘Deep Voice’ can generate new speech, accents, and tones.
With just 3.7 seconds of audio, a new AI algorithm developed by Chinese tech giant Baidu can clone a pretty believable fake voice. Much like the rapid development of machine learning software that democratized the creation of fake videos, this research shows why it's getting harder to believe any piece of media on the internet.
Researchers at the tech giant unveiled their latest advancement in Deep Voice, a system developed for cloning voices. A year ago, the technology needed around 30 minutes of audio to create a new, fake audio clip. Now, it can create even better results with just a few seconds of training material.
Of course, the more training samples it gets, the better the output: results generated from a single sample still sound a bit garbled, but not much worse than a low-quality audio file might.
The system can change a female voice to male, and a British accent to an American one—demonstrating that AI can learn to mimic different styles of speaking, personalizing text-to-speech to a new level. “Voice cloning is expected to have significant applications in the direction of personalization in human-machine interfaces,” the researchers write in a Baidu blog article on the study.
This iteration of Deep Voice marks yet another development in AI-generated voice mimicry in recent years. Adobe demonstrated its VoCo software in 2016, which could generate speech from text after 20 minutes of listening to a voice. Montreal-based AI startup Lyrebird claims it can do text-to-speech using just one minute of audio.
These technologies represent the kind of leaps in AI that researchers and theorists raised concerns about when deepfakes democratized machine learning-generated videos. If all that's needed is a few seconds of someone's voice and a dataset of their face, it becomes relatively simple to fabricate an entire interview, press conference, or news segment.