Consider the following scenario. The phone rings. An office worker picks it up and hears his supervisor, in a panic, tell him that she forgot to transfer money to the new contractor before leaving for the day and that he needs to do it for her. She gives him the wire transfer information, the money goes out, and the crisis is averted.
The employee leans back in his chair, takes a deep breath, and then watches his boss walk in the door. It wasn’t his supervisor on the other end of the line. It wasn’t even a human, really. The voice he heard was a machine-generated audio sample – an audio deepfake – created to sound exactly like his boss.
Attacks like this one using recorded audio have already occurred, and conversational audio deepfakes may not be far off.
Deepfakes, both audio and video, have become possible only with the recent development of sophisticated machine learning algorithms, and they have added a new layer of uncertainty around digital media. To detect deepfakes, many researchers have turned to analyzing visual artifacts – minute glitches and inconsistencies – found in video deepfakes.
Audio deepfakes may pose an even greater threat, because people often communicate by voice without video – over phone calls, radio, and voice recordings, for example. These voice-only communications greatly expand the opportunities for attackers to use deepfakes.
We and our research colleagues at the University of Florida have developed a technique that measures the acoustic and fluid-dynamic differences between voice samples created organically by human speakers and those generated synthetically by computers.
Natural versus synthetic voices: Humans speak by forcing air across the structures of the vocal tract, including the vocal folds, tongue, and lips. By rearranging these structures, you change the acoustical properties of the vocal tract, which lets you produce more than 200 distinct sounds, or phonemes. However, human anatomy fundamentally limits the acoustic behavior of these phonemes, resulting in a relatively small range of correct sounds for each.
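To make that idea concrete, here is a rough sketch – not code from our study – of how the resonant frequencies, or formants, produced by a particular vocal tract configuration can be estimated from a short vowel recording using linear predictive coding. The file name, sample rate, and model order are illustrative assumptions.

```python
# A rough sketch, not code from our study: estimating the resonant
# frequencies (formants) of a vocal tract configuration from a short
# vowel recording, using linear predictive coding.
import numpy as np
import librosa

y, sr = librosa.load("vowel.wav", sr=16000)                # hypothetical recording of a sustained vowel
frame = y[:int(0.03 * sr)] * np.hamming(int(0.03 * sr))    # analyze one 30-millisecond slice

a = librosa.lpc(frame, order=12)                           # all-pole (source-filter) model of the tract
roots = [r for r in np.roots(a) if np.imag(r) > 0]         # each complex root pair is one resonance

formants = sorted(np.angle(roots) * sr / (2 * np.pi))      # convert root angles to frequencies in Hz
print("Estimated formant frequencies (Hz):", [round(f) for f in formants])
```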
Audio deepfakes, on the other hand, are produced by first letting a computer listen to audio recordings of a targeted victim speaker. Depending on the exact techniques used, the computer may need as little as 10 to 20 seconds of audio, which it uses to extract key information about the unique characteristics of the victim’s voice.
The attacker then chooses a phrase for the deepfake to speak and, using a modified text-to-speech algorithm, generates an audio sample that sounds like the victim saying that phrase. A single deepfaked audio sample can be produced in a matter of seconds, potentially giving attackers enough flexibility to use the deepfaked voice during a live conversation.
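To give a sense of how accessible this kind of synthesis has become, the sketch below conditions an open-source text-to-speech model on a short reference recording from a consenting speaker. It uses the Coqui TTS library; the specific model name and call shown are assumptions about that library’s interface and may need adjusting for the installed version.

```python
# A hedged illustration of zero-shot voice cloning with the open-source
# Coqui TTS library; the model name and arguments are assumptions about its API.
from TTS.api import TTS

# A multilingual model that can condition on a short reference clip.
tts = TTS("tts_models/multilingual/multi-dataset/your_tts")

# Roughly 10-20 seconds of reference speech is often enough for a rough clone.
tts.tts_to_file(
    text="This is a synthesized sentence in the reference speaker's voice.",
    speaker_wav="reference_speaker.wav",   # hypothetical recording of a consenting speaker
    language="en",
    file_path="synthesized_output.wav",
)
```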
How to spot audio deepfakes: The first step in telling human speech from deepfake speech is understanding how to model the vocal tract acoustically. Fortunately, scientists have techniques for estimating what someone – or something – would sound like based on anatomical measurements of its vocal tract.
We did the reverse. By inverting many of these same techniques, we were able to extract an approximation of a speaker’s vocal tract during a segment of speech. This effectively let us peer into the anatomy of the speaker who created the audio sample.
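The sketch below conveys the general idea of this kind of inversion, though it is not our actual pipeline. Linear prediction models a frame of speech as the output of an all-pole filter, and the resulting reflection coefficients map onto a lossless "stack of tubes" approximation of the vocal tract, giving the relative cross-sectional areas along it. The file name, frame position, and model order are illustrative assumptions.

```python
# A minimal sketch of the general idea, not our actual pipeline: estimate
# relative cross-sectional areas of a concatenated-tube vocal tract model
# from one voiced speech frame via linear prediction.
import numpy as np
import librosa

def reflection_coefficients(frame, order=12):
    """Levinson-Durbin recursion on the frame's autocorrelation;
    returns the reflection (PARCOR) coefficients k_1..k_order."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]  # lags 0..order
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    k = np.zeros(order)
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k[i - 1] = -acc / err
        a[1:i + 1] = a[1:i + 1] + k[i - 1] * a[i - 1::-1][:i]
        err *= 1.0 - k[i - 1] ** 2
    return k

def tube_areas(k, end_area=1.0):
    """Map reflection coefficients to relative areas of a lossless tube model.
    Sign conventions differ between texts; one common form is used here."""
    areas = [end_area]
    for ki in k:
        areas.append(areas[-1] * (1.0 - ki) / (1.0 + ki))
    return np.array(areas)

y, sr = librosa.load("speech_sample.wav", sr=16000)   # hypothetical recording
start, length = sr // 2, int(0.03 * sr)               # one 30-millisecond frame, assumed voiced
frame = y[start:start + length] * np.hamming(length)
areas = tube_areas(reflection_coefficients(frame, order=12))
print("Relative tube areas along the modeled tract:", np.round(areas, 3))
```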
From there, we hypothesized that deepfake audio samples would not be constrained by the same anatomical limits that apply to humans. In other words, analyzing deepfaked audio samples should reveal simulated vocal tract shapes that do not exist in people.
The results of our tests not only confirmed our hypothesis but also revealed something intriguing. The vocal tract estimates we extracted from deepfake audio were frequently comically wrong. Deepfake audio often yielded vocal tracts with roughly the same relative diameter and consistency as a drinking straw, whereas real human vocal tracts are much wider and far more variable in shape.
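In principle, that observation can be turned into a simple screening check, sketched below: if the estimated tube areas for a frame barely vary – the drinking-straw signature – the frame gets flagged as likely synthetic. The threshold is an arbitrary placeholder rather than a value from our experiments, and the helper functions are the ones sketched earlier.

```python
# An illustrative heuristic only, not our published detector. It reuses the
# reflection_coefficients() and tube_areas() sketches from above; the
# threshold below is an arbitrary placeholder, not an experimental value.
import numpy as np

def looks_synthetic(areas, min_relative_spread=0.5):
    """Flag a frame whose estimated tube areas barely vary, unlike the
    wide, changing cross-sections of a real human vocal tract."""
    spread = (areas.max() - areas.min()) / abs(areas.mean())
    return spread < min_relative_spread

# Example: aggregate per-frame verdicts across an utterance.
# frames = [...]  # voiced 30-millisecond frames, extracted as in the earlier sketch
# votes = [looks_synthetic(tube_areas(reflection_coefficients(f))) for f in frames]
# print("Fraction of frames flagged as synthetic:", np.mean(votes))
```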
This discovery shows that deepfake audio is not completely indistinguishable from human-generated speech, even when it is convincing to human listeners. It is feasible to tell if the audio was produced by a person or a machine by estimating the anatomy that produced the observed speech.
Why it’s important: Today’s world is defined by the digital exchange of media and information. Everything from news to entertainment to conversations with loved ones typically happens through digital exchanges. Even in their infancy, deepfake video and audio undermine people’s confidence in these exchanges, limiting their usefulness.
Effective and safe methods for identifying the source of an audio sample are essential if the digital world is to continue serving as a vital source of information in people’s lives.