What If AI Interpreters Could Not Only Listen but Also See

In a September 28, 2025 paper, interpreting industry expert Claudio Fantinuoli introduced an approach that gives AI live speech translation systems access to visual context, with the goal of improving translation accuracy, particularly in ambiguous situations.

Today’s AI live speech translation systems “perform remarkably well, yet they share a critical blind spot,” Fantinuoli wrote on LinkedIn. “They hear the words but cannot see the world those words refer to.”

He explained that these systems process language as a single modality — sound — a limitation that prevents them from achieving human-like interpretation, which depends heavily on visual and situational cues.

Human interpreters, he noted, rely on non-verbal cues such as gestures, gaze, and visual surroundings to disambiguate meaning. Without this context, even the most sophisticated model risks mistranslation when a word’s meaning depends on what’s happening in the room.

Vision-Grounded Interpreting

The proposed approach — called vision-grounded interpreting (VGI) — extends the classical speech-to-speech translation pipeline by adding visual components. 

A webcam captures the scene and a vision-language model produces live scene descriptions — essentially a “caption” of what’s happening. This description, together with the transcribed speech, is then fed into a large language model (LLM) that processes both before producing output speech.

Alternatively, a multimodal model can directly process visual inputs alongside the audio, without relying on intermediate captions.

The added visual input helps the model interpret the scene and choose words that better match the context, improving translation accuracy in ambiguous situations.
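The caption-based variant described above can be sketched roughly as follows. This is a minimal illustration, not the prototype's actual code: the `Segment` type, the `build_prompt` and `interpret` functions, the prompt wording, and the example utterance are all hypothetical, and the speech recognizer, vision-language model, and speech synthesizer are left out. The `llm` argument stands in for any model call that maps a prompt to text.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Segment:
    transcript: str                # text produced by the speech recognizer
    scene_caption: Optional[str]   # caption from the vision-language model, if any


def build_prompt(segment: Segment, target_lang: str) -> str:
    """Combine transcript and scene caption into one prompt (hypothetical wording)."""
    lines = [f"Translate the utterance into {target_lang}."]
    if segment.scene_caption:
        # The caption is supplied as context only, not as text to translate.
        lines.append(f"Scene description: {segment.scene_caption}")
        lines.append("Use the scene only to resolve ambiguity in the utterance.")
    lines.append(f"Utterance: {segment.transcript}")
    return "\n".join(lines)


def interpret(segment: Segment, target_lang: str, llm: Callable[[str], str]) -> str:
    """Run the text stage of the pipeline with any prompt-to-text callable."""
    return llm(build_prompt(segment, target_lang))


def echo_llm(prompt: str) -> str:
    # Stand-in for a real model call: returns the prompt so it can be inspected.
    return prompt


if __name__ == "__main__":
    seg = Segment("Can you pass me the mouse?", "A person at a desk with a laptop.")
    print(interpret(seg, "German", echo_llm))
```

Passing `scene_caption=None` yields an audio-only prompt, so the same function can serve both a baseline and a vision-grounded run.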

Testing the Concept: The AI Interpreter App Prototype

Fantinuoli built a prototype AI Interpreter App, putting the concept to the test in real-world scenarios. 

The prototype was powered by GPT-4o, chosen because it can handle both caption generation and direct multimodal integration within a single framework — allowing the two strategies to be compared under controlled, consistent conditions. 

However, Fantinuoli emphasized during his presentation at AMTA that the prototype is model-agnostic.

To evaluate the system, he also designed a “small diagnostic corpus” of 120 short utterances covering three types of ambiguity: lexical ambiguity, gender resolution, and syntactic ambiguity. 

He found that visual grounding improved lexical disambiguation. Systems using visual input achieved about 85% accuracy, compared to roughly 52% in the audio-only baseline. Smaller but still noticeable gains appeared in gender resolution, while syntactic ambiguity (e.g., “Paul bought green shirts and shoes”) showed no improvement.
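A comparison of this kind comes down to computing accuracy separately for each ambiguity type and each input condition. The sketch below shows one way to do that, assuming each test item has already been judged correct or incorrect. The `Result` type, the condition labels, and the toy data are hypothetical and are not taken from the paper's corpus or results.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Result(NamedTuple):
    category: str    # "lexical", "gender", or "syntactic"
    condition: str   # hypothetical labels, e.g. "audio_only" or "vision"
    correct: bool    # whether the ambiguous item was translated correctly


def accuracy_by_group(results: Iterable[Result]) -> Dict[Tuple[str, str], float]:
    """Return the fraction of correct items for each (category, condition) pair."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in results:
        key = (r.category, r.condition)
        totals[key] += 1
        hits[key] += int(r.correct)
    return {key: hits[key] / totals[key] for key in totals}


# Made-up judgments for illustration only.
toy_results = [
    Result("lexical", "audio_only", True),
    Result("lexical", "audio_only", False),
    Result("lexical", "vision", True),
    Result("lexical", "vision", True),
]

if __name__ == "__main__":
    for key, acc in sorted(accuracy_by_group(toy_results).items()):
        print(key, f"{acc:.0%}")
```

Keeping the condition as a free-form label means a third condition, such as deliberately misleading visual input, can be scored with the same function.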

Interestingly, when the system was presented with misleading visual input, accuracy simply dropped to baseline — suggesting that the model tended to ignore irrelevant visuals rather than being misled by them.

Clear Direction: Embracing Multimodality

The results seem promising, but Fantinuoli noted in his blog post that “there’s still plenty to solve” — from scene complexity and vision recognition errors to prompt sensitivity. 

He emphasized that “the direction is clear”: AI interpreting can’t remain audio-only, and embracing multimodality is a necessary step toward improving the quality of AI live speech translation.

“Tomorrow’s AI interpreters won’t just be fluent in languages; they’ll need to be situated agents, interpreting speech in context,” he said. This “will mark a new stage in machine-mediated human communication with clear benefits for the users,” he concluded.