In recent months, Microsoft introduced Live Interpreter, Google revealed the technical details behind the real-time speech-to-speech system now running in Google Meet, Zoom announced it will bring built-in speech-to-speech translation to Zoom Workplace via AI Companion 3.0, and Apple introduced live translation on AirPods.
A new Microsoft study now looks at an essential part of the technical challenge involved: how translated audio is delivered to the listener.
In their November 12, 2025 paper, Margarita Geleta from UC Berkeley (during a Microsoft internship), together with Hong Sodoma and Hannes Gamper from Microsoft, show that placing the translated voice on the left or right — matching the speaker’s position on screen — can double comprehension compared to standard, non-spatial audio and improve user experience.
Inside the Study
To separate audio presentation from translation performance, the team used a controlled setup: bilingual speakers recorded short workplace dialogues in Greek, Kannada, Mandarin Chinese, and Ukrainian, along with human English translations timed to sound like live output.
Forty-seven English-speaking participants listened to eight one-minute dialogues under four conditions:
Diotic (non-spatial): translated audio in both ears
Monaural (non-spatial): translated audio in one ear only
Spatial: translated audio matched the speaker tile
Spatial + reverberation: translated audio matched the speaker tile, but the original voice was rendered at a distance
Participants then answered comprehension questions and rated cognitive load.
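To make the three non-reverberant conditions concrete, here is a minimal Python sketch of how a mono translated signal could be turned into a left/right pair. It is an illustration, not the researchers' rendering pipeline: the function names, the `position` and `ear` parameters, and the use of simple constant-power panning as a stand-in for whatever spatializer the study used are all assumptions. The reverberation condition is left out.

```python
import math

def pan_gains(position):
    """Constant-power (left, right) gains for position in [-1.0, 1.0].

    -1.0 is fully left, 0.0 is centered, 1.0 is fully right.
    """
    position = max(-1.0, min(1.0, position))
    angle = (position + 1.0) * math.pi / 4.0  # maps [-1, 1] onto [0, pi/2]
    return math.cos(angle), math.sin(angle)

def render(mono, condition, position=0.0, ear="left"):
    """Return a list of (left, right) samples for one listening condition."""
    if condition == "diotic":
        left_gain, right_gain = 1.0, 1.0  # same signal in both ears
    elif condition == "monaural":
        # One ear only; which ear is a hypothetical parameter here.
        left_gain, right_gain = (1.0, 0.0) if ear == "left" else (0.0, 1.0)
    elif condition == "spatial":
        left_gain, right_gain = pan_gains(position)  # follow the speaker tile
    else:
        raise ValueError(f"unknown condition: {condition}")
    return [(s * left_gain, s * right_gain) for s in mono]
```

For example, `render(samples, "spatial", position=-1.0)` would place the translated voice entirely on the left, matching a speaker whose tile sits at the left edge of the screen.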
Spatial Audio Outperforms
According to the researchers, spatial audio delivered the strongest results across all languages. Listeners were more than twice as likely to answer comprehension questions correctly when the translated voice came from the correct left/right direction. Spatial audio also increased listeners’ confidence in their understanding.
The biggest gains appeared in speaker attribution and role identification (“who said what”), especially during rapid turn-taking. Participants described spatial mode as “easier to understand” and “clearer in distinguishing speakers.” One-ear translation, by contrast, was often perceived as “confusing” or “fatiguing” and produced the lowest comprehension scores.
The researchers also found that voice timbre differences help listeners track speakers. Mixed-gender pairs were easier to follow, while same-gender pairs produced more errors, even with spatial cues — a finding that supports the growing focus on voice preservation.
Lowering — but not muting — the original audio also helped reduce distraction while keeping speaker identity intact, producing the best experience.
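Lowering the original without muting it amounts to mixing it under the translation at a reduced gain. The sketch below shows one way to do that in Python. The `-12.0` dB default and the function names are hypothetical, since the article does not state the attenuation the study used.

```python
def db_to_gain(db):
    """Convert a level change in decibels to a linear amplitude gain."""
    return 10.0 ** (db / 20.0)

def duck_original(original, translated, duck_db=-12.0):
    """Mix two equal-length mono signals, attenuating the original by duck_db.

    The original stays audible (lowered, not muted) so the listener can
    still hear who is speaking underneath the translated voice.
    """
    gain = db_to_gain(duck_db)
    return [o * gain + t for o, t in zip(original, translated)]
```

A production mixer would also need to guard against clipping when the two signals sum; that is omitted here for brevity.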
Talking to Slator, UC Berkeley researcher Geleta highlighted that the gain from spatial audio is clear.
“While it is nearly impossible to completely remove lag/latency in live translation, there are other factors that can ease comprehension, turn-taking, and increase attention — and spatial audio is one of these.” — Margarita Geleta, Researcher, UC Berkeley
Recommendations for Meeting Platforms
The research points to a set of practical recommendations for meeting platforms:
Align translated audio with the speaker’s on-screen position
Avoid one-ear translation for multi-speaker dialogue
Provide user control over original audio level — for example a simple “original ↔ translated” balance slider so users can adapt to context and hearing preferences
Keep speaker voices distinct and consistent throughout the meeting
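Two of these recommendations translate into very small pieces of logic. The Python sketch below maps a speaker tile's horizontal position to a left/right pan value and turns an "original ↔ translated" slider into a pair of gains. All names, the pixel-based layout, and the linear crossfade are assumptions made for illustration; they are not taken from the paper or from any meeting platform's API.

```python
def tile_to_pan(tile_center_x, layout_width):
    """Map a tile's horizontal center (in pixels) to a pan position.

    Returns a value in [-1.0, 1.0]: -1.0 is the left edge of the layout,
    0.0 the middle, 1.0 the right edge.
    """
    if layout_width <= 0:
        raise ValueError("layout_width must be positive")
    position = 2.0 * tile_center_x / layout_width - 1.0
    return max(-1.0, min(1.0, position))

def balance_gains(slider):
    """Linear crossfade for an original/translated balance slider.

    slider = 0.0 plays only the original, 1.0 only the translation.
    Returns (original_gain, translated_gain).
    """
    slider = max(0.0, min(1.0, slider))
    return 1.0 - slider, slider
```

With a 1000-pixel-wide layout, a tile centered at x = 250 gives `tile_to_pan(250, 1000) == -0.5`, so the translated voice would be rendered halfway to the left.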
Most modern devices already support spatial audio, making these changes technically feasible. “Systems that treat space, time, and identity as co-equal design surfaces will better support inclusive, cross-language collaboration,” the researchers said.
The researchers note that latency and overlapping speech remain the biggest unresolved challenges in AI live speech translation. Their setup delivered near-ideal timing with human translations, whereas real systems introduce automatic speech recognition or AI translation errors.
Dialogues were also pre-recorded, with no interruptions, laughter, or turn-grabbing — all of which complicate overlap in real meetings.
Additionally, the study involved only two speakers using headphones. Real-world meetings with more participants, speakerphones, background noise, or interruptions may produce different results.
Future work includes testing spatial audio in live, multi-speaker meetings, exploring adaptive mixing based on who is speaking, and studying how accents and synthetic voices affect trust and perceived speaker identity.
“I would be happy to see if our user study can invite future research avenues,” Geleta concluded.