NVIDIA’s Latest Tech Makes AI Voices More Expressive and Realistic

Although the voices of Amazon’s Alexa, Google Assistant, and other AI assistants are far better than those of older GPS devices, they still lack the rhythm, intonation, and other qualities that make speech sound human. At the Interspeech 2021 conference, NVIDIA presented new research and tools that can reproduce such natural speech features by letting you train the AI system with your own voice. To improve its AI voice synthesis, NVIDIA’s text-to-speech research team developed RAD-TTS, a model that won a competition at the NAB broadcast convention to create the most lifelike avatar.

With the system, an individual can train a text-to-speech model on their own voice, capturing its tempo, intonation, timbre, and other characteristics. Another RAD-TTS feature is voice conversion, which delivers one speaker’s words in another speaker’s voice. The RAD-TTS interface also lets you fine-tune the pitch, duration, and energy of a synthesized voice at the frame level. NVIDIA’s researchers used this technology to create more conversational-sounding narration for the company’s own I Am AI video series, rather than using human voices. The goal was to have the narration match the tone and style of the videos, which has not always been the case with AI-narrated videos.
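To make the idea of frame-level control concrete, here is a minimal Python sketch. It is not RAD-TTS code and does not use any NVIDIA API; the `Frame` class, the `emphasize` function, and the numeric values are all hypothetical, chosen only to show what adjusting pitch, duration, and energy over a short run of frames could look like.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Frame:
    """One short slice of synthesized speech (hypothetical representation)."""
    pitch_hz: float     # fundamental frequency of the frame
    energy: float       # relative loudness, arbitrary units
    duration_ms: float  # how long the frame lasts


def emphasize(frames: List[Frame], start: int, end: int,
              pitch_scale: float = 1.0,
              energy_scale: float = 1.0,
              duration_scale: float = 1.0) -> List[Frame]:
    """Return a copy of frames with frames[start:end] scaled.

    Frames outside the range are left unchanged, which is what makes
    the control "frame level" rather than applied to the whole clip.
    """
    if not 0 <= start <= end <= len(frames):
        raise ValueError("frame range out of bounds")
    result = []
    for index, frame in enumerate(frames):
        if start <= index < end:
            frame = replace(
                frame,
                pitch_hz=frame.pitch_hz * pitch_scale,
                energy=frame.energy * energy_scale,
                duration_ms=frame.duration_ms * duration_scale,
            )
        result.append(frame)
    return result


def total_duration_ms(frames: List[Frame]) -> float:
    """Sum the durations of all frames."""
    return sum(frame.duration_ms for frame in frames)


# Six identical frames, then stress frames 2 and 3 as if they held a keyword:
# raise the pitch, make them louder, and stretch them out.
track = [Frame(pitch_hz=200.0, energy=1.0, duration_ms=10.0) for _ in range(6)]
stressed = emphasize(track, 2, 4,
                     pitch_scale=1.25, energy_scale=1.5, duration_scale=2.0)
```

In this toy version, stressing a keyword is just a matter of picking the frames that cover it and scaling their values, which mirrors the way a producer might direct the synthesized voice to slow down or lean on a word.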

Although the results still sound a little artificial, they are far better than any AI narration I’ve heard.

“Our video producer could use this interface to record himself reading the video script, and then use the AI model to translate his speech into the female narrator’s voice. The producer may then instruct the AI like a voice actor, altering the synthesized speech to stress keywords and changing the narration’s cadence to better portray the video’s tone,” according to NVIDIA.

NVIDIA is making some of this research open-source through the NVIDIA NeMo Python toolkit for GPU-accelerated conversational AI, which is available on the company’s NGC hub of containers and other software. It’s optimized to run efficiently on NVIDIA GPUs, of course.

“Several of the models have been trained on NVIDIA DGX systems with tens of thousands of hours of audio data. Using mixed-precision computing on NVIDIA Tensor Core GPUs, developers can fine-tune any model for specific use cases, speeding up training,” the company noted.