Spatial Audio Improves UX in AI Live Speech Translation, Research Finds

In their November 12, 2025 paper, Margarita Geleta from UC Berkeley (during a Microsoft internship), together with Hong Sodoma and Hannes Gamper from Microsoft, show that placing the translated voice on the left or right, matching the speaker’s position on screen, can improve user experience and make listeners more than twice as likely to answer comprehension questions correctly, compared to standard, non-spatial audio.

Inside the Study

To separate audio presentation from translation performance, the team used a controlled setup: bilingual speakers recorded short workplace dialogues in Greek, Kannada, Mandarin Chinese, and Ukrainian, along with human English translations timed to sound like live output.

Forty-seven English-speaking participants listened to eight one-minute dialogues under four conditions:

Diotic (non-spatial): translated audio in both ears
Monaural (non-spatial): translated audio in one ear only
Spatial: translated audio matched the speaker tile
Spatial + reverberation: translated audio matched the speaker tile, but the original voice was rendered at a distance
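
The difference between the first three conditions comes down to how the translated voice is split between the two ears. The article does not describe how the researchers rendered the audio, so the following is only a rough sketch that assumes simple constant-power left/right panning. The function names, the `pan_amount` value, and the choice of ear for the monaural case are all hypothetical, and the reverberation condition is left out.

```python
import math


def constant_power_pan(pan):
    """Map pan in [-1, 1] (-1 = hard left, +1 = hard right) to (left, right) gains."""
    pan = max(-1.0, min(1.0, pan))
    angle = (pan + 1.0) * math.pi / 4.0  # 0 .. pi/2
    return math.cos(angle), math.sin(angle)


def translated_voice_gains(condition, speaker_side="left",
                           pan_amount=0.7, mono_ear="left"):
    """Return (left_gain, right_gain) for the translated voice.

    pan_amount and mono_ear are illustrative assumptions, not study values.
    """
    if condition == "diotic":
        # Same signal in both ears: no directional cue.
        return constant_power_pan(0.0)
    if condition == "monaural":
        # One ear only, regardless of where the speaker tile is.
        return (1.0, 0.0) if mono_ear == "left" else (0.0, 1.0)
    if condition == "spatial":
        # Pan toward the side of the screen where the speaker tile sits.
        sign = -1.0 if speaker_side == "left" else 1.0
        return constant_power_pan(sign * pan_amount)
    raise ValueError(f"unknown condition: {condition}")


def render(mono_samples, gains):
    """Turn a mono sample list into a list of (left, right) stereo frames."""
    left_gain, right_gain = gains
    return [(s * left_gain, s * right_gain) for s in mono_samples]
```

With these assumptions, a speaker shown on the right of the screen gets a louder right channel in the spatial condition, while the diotic condition gives both ears the same level.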

Participants then answered comprehension questions and rated cognitive load.

Spatial Audio Outperforms

According to the researchers, spatial audio delivered the strongest results across all languages. Listeners were more than twice as likely to answer comprehension questions correctly when the translated voice came from the correct left/right direction. Spatial audio also increased listeners’ confidence in their understanding.

The biggest gains appeared in speaker attribution and role identification (“who said what”), especially during rapid turn-taking. Participants described spatial mode as “easier to understand” and “clearer in distinguishing speakers.” One-ear translation, by contrast, was often perceived as “confusing” or “fatiguing” and produced the lowest comprehension scores.

The researchers also found that voice timbre differences help listeners track speakers. Mixed-gender pairs were easier to follow, while same-gender pairs produced more errors, even with spatial cues — a finding that supports the growing focus on voice preservation.

Lowering — but not muting — the original audio also helped reduce distraction while keeping speaker identity intact, producing the best experience.

Talking to Slator, UC Berkeley researcher Geleta highlighted that the gain from spatial audio is clear.

“While it is nearly impossible to completely remove lag/latency in live translation, there are other factors that can ease comprehension, turn-taking, and increase attention — and spatial audio is one of these.” — Margarita Geleta, Researcher, UC Berkeley

Recommendations for Meeting Platforms

The research points to a set of practical recommendations for meeting platforms:

Align translated audio with the speaker’s on-screen position
Avoid one-ear translation for multi-speaker dialogue
Provide user control over original audio level — for example a simple “original ↔ translated” balance slider so users can adapt to context and hearing preferences
Keep speaker voices distinct and consistent throughout the meeting
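
As one illustration of the balance-slider recommendation, the sketch below maps a slider position to gains for the original and translated voices. It keeps the original voice audible at a small minimum level, in line with the finding that lowering but not muting the original worked best. The function name, the linear mapping, and the `original_floor` value are assumptions made for illustration, not details from the study.

```python
def balance_to_gains(balance, original_floor=0.1):
    """Map a slider position to (original_gain, translated_gain).

    balance: 0.0 = original voice only, 1.0 = translated voice favoured.
    original_floor: hypothetical minimum level so the original is never muted
                    once any translation is mixed in.
    """
    balance = max(0.0, min(1.0, balance))
    translated_gain = balance
    original_gain = 1.0 - balance
    if translated_gain > 0.0:
        # Lower the original voice, but do not let it drop to silence.
        original_gain = max(original_floor, original_gain)
    return original_gain, translated_gain
```

A real product would probably use a perceptual (for example, decibel-based) curve rather than a linear one; the linear mapping is used here only to keep the idea readable.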

Most modern devices already support spatial audio, making these changes technically feasible. “Systems that treat space, time, and identity as co-equal design surfaces will better support inclusive, cross-language collaboration,” the researchers said.

Limitations and What Comes Next

The researchers note that latency and overlapping speech remain the biggest unresolved challenges in AI live speech translation. Their setup used human translations delivered with near-ideal timing, whereas real systems introduce automatic speech recognition or AI translation errors.

Dialogues were also pre-recorded, with no interruptions, laughter, or turn-grabbing — all of which complicate overlap in real meetings.

Additionally, the study used only two-speaker dialogues, heard over headphones. Real-world meetings with more participants, speakerphones, background noise, or interruptions may produce different results.

Future work includes testing spatial audio in live, multi-speaker meetings, exploring adaptive mixing based on who is speaking, and studying how accents and synthetic voices affect trust and perceived speaker identity.

“I would be happy to see if our user study can invite future research avenues,” Geleta concluded.
