Researchers conducted a meta-synthesis to better understand how we perceive and interact with various machines’ voices (and bodies). Their findings have provided insights into human preferences, which engineers and designers can use to develop future vocal technologies.
Our interactions with voice-based devices and services are becoming increasingly common. In this light, researchers at the Tokyo Institute of Technology and RIKEN in Japan conducted a meta-synthesis of how we perceive and interact with various machines’ voices (and bodies).
Humans communicate primarily through speech, conveying not only linguistic information but also the complexities of our emotional states and personalities. Tone, rhythm, and pitch are important aspects of the voice that shape how we are perceived. In other words, how we say things matters.
As technology advances and social robots, conversational agents, and voice assistants enter our lives, we are expanding our interactions to include computer agents, interfaces, and environments. Depending on the technology under study, this research falls within human-agent interaction (HAI), human-robot interaction (HRI), human-computer interaction (HCI), or human-machine communication (HMC). Many studies have investigated the impact of computer voice on user perception and interaction, but they are dispersed across different technologies and user groups and focus on different aspects of voice.
In this regard, a group of researchers from Tokyo Institute of Technology (Tokyo Tech), Japan, RIKEN Center for Advanced Intelligence Project (AIP), Japan, and gDial Inc., Canada, have now compiled findings from several studies in these fields in order to provide a framework that can guide future computer voice design and research.
According to the study’s lead researcher, Associate Professor Katie Seaborn of Tokyo Tech (Visiting Researcher and former Postdoctoral Researcher at RIKEN AIP), “Voice assistants, smart speakers, conversational vehicles, and social robots are already a reality. We need to know how to design these technologies so that they can interact with us, live with us, and meet our needs and desires. We also need to understand how they influence our attitudes and behaviors, particularly in subtle and unseen ways.”
The team’s survey covered peer-reviewed journal articles and conference papers that focused on user perception of an agent’s voice. The source materials spanned a wide range of agent, interface, and environment types and technologies, the majority being “bodyless” computer voices, computer agents, and social robots. Most of the documented user responses came from university students and adults. From these papers, the researchers were able to observe and map patterns and draw conclusions about how an agent’s voice is perceived in a variety of interaction contexts.
According to the findings, users anthropomorphized the agents they interacted with and preferred agents that matched their own personality and speaking style. Human voices were preferred over synthetic ones. Vocal fillers, such as pauses and terms like “I mean…” and “um,” improved the interaction. In general, the survey found that people preferred human-like, happy, empathetic voices with higher pitches. These preferences were not static, however; for example, user preference for voice gender shifted over time from masculine to more feminine voices.
Based on these findings, the researchers developed a high-level framework for categorizing the types of interactions that occur across computer-based technologies.
The researchers also considered the effect of the agent’s body, or morphology and form factor, which could be a virtual or physical character, display or interface, or even an object or environment. They discovered that when agents were embodied and the voice “matched” the body of the agent, users perceived them more favorably.
Human-computer interaction, particularly voice-based interaction, is a burgeoning field that is evolving almost daily. As a result, the team’s survey serves as an important starting point for the investigation and development of new and existing technologies in voice-based human-agent interaction (vHAI). “The research agenda that emerged from this work is expected to guide how voice-based agents, interfaces, systems, spaces, and experiences are developed and studied in the coming years,” Prof. Seaborn concludes, summarizing the significance of their findings.