From the voice-over artist in movie trailers to the announcer on the subway, our lives are full of faceless voices, and most of us are content to form a mental image of each disembodied speaker. A team of MIT researchers has gone one step further by creating an artificial intelligence system that can reconstruct a person's face just by hearing their voice.
The application, known as Speech2Face, is a deep neural network that was trained on millions of YouTube videos of people speaking, learning the interrelationships between voice and facial features. In doing so, it learned to associate specific craniofacial features, such as the shape of the head and the width of the nose, along with the speaker's age, gender, and ethnicity, with different aspects of the audio waveform. When researchers fed the system audio recordings of people's voices, it was able to create an image of each speaker's face with reasonable accuracy.
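The core idea, mapping an audio representation to a vector of facial attributes, can be sketched in a few lines. The snippet below is purely illustrative and is not the actual Speech2Face architecture: the spectrogram helper, the tiny MLP, and its random weights are all hypothetical stand-ins for the learned convolutional voice encoder described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_spectrogram(waveform, frame=256, hop=128):
    """Toy log-magnitude spectrogram: the kind of 2-D
    time-frequency input a voice encoder consumes."""
    frames = [waveform[i:i + frame]
              for i in range(0, len(waveform) - frame + 1, hop)]
    spec = np.abs(np.fft.rfft(np.stack(frames), axis=1))
    return np.log1p(spec)

# Hypothetical voice encoder: a small MLP standing in for the
# real network. Its weights here are random; in the actual system
# they are learned from millions of YouTube clips.
D_IN, D_HID, D_FACE = 129, 64, 16   # 129 = rfft bins for frame=256
W1 = rng.normal(0, 0.1, (D_IN, D_HID))
W2 = rng.normal(0, 0.1, (D_HID, D_FACE))

def voice_to_face_embedding(waveform):
    spec = log_spectrogram(waveform)     # (time, freq)
    h = np.maximum(spec @ W1, 0)         # ReLU hidden layer
    pooled = h.mean(axis=0)              # average over time
    return pooled @ W2                   # face-feature vector

# One second of fake 16 kHz "speech"
audio = rng.normal(size=16000)
emb = voice_to_face_embedding(audio)
print(emb.shape)  # (16,)
```

In the real system this face-feature vector is then decoded into a front-facing image of a face; here it is just a 16-dimensional placeholder.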
Clearly, features such as hairstyles, facial hair, and certain other elements of physical appearance are impossible to predict from a person's voice alone. The developers therefore stressed that their goal was "not to predict a recognized image of the correct face, but to capture the dominant facial features of the person related to the input speech."
In a paper published on IEEE Xplore, the researchers say the technology could one day find many useful applications, such as creating faces for video calls without the need for a camera. The system does sometimes misfire, assigning the wrong gender to about 6 percent of the faces and occasionally misjudging a speaker's ethnicity. Some improvements are clearly still needed: although the images created by Speech2Face are generally a good match for the broad type of face, they often bear only a general resemblance to the actual speakers. Yet faceless voices are one step closer to becoming a thing of the past, which could have a big impact, at least for prank callers.