“By […] introducing proper silences into target speech, we can train Hibiki to adapt its flow in real-time, without the need for complex inference policies,” the researchers noted.
Additionally, Hibiki uses a multistream architecture to generate spoken audio and written text at the same time. Operating at a fixed rate of 12.5 Hz (one step every 80 milliseconds), it produces smooth, continuous speech that stays in sync with timestamped text.
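To make the timing concrete, the short sketch below works through the arithmetic behind the 12.5 Hz figure: how long one step lasts, how many steps cover a given stretch of speech, and where each step starts. The function names are hypothetical and chosen for illustration only; this is not Kyutai's code, just the rate-to-period conversion implied by the numbers above.

```python
# Illustrative arithmetic only; names here are hypothetical, not Kyutai's API.
FRAME_RATE_HZ = 12.5  # fixed rate quoted in the article


def frame_period_ms(rate_hz: float = FRAME_RATE_HZ) -> float:
    """Duration of one step in milliseconds (period = 1 / rate)."""
    return 1000.0 / rate_hz


def frames_for(duration_s: float, rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of whole steps produced over duration_s seconds."""
    return int(duration_s * rate_hz)


def frame_start_ms(index: int, rate_hz: float = FRAME_RATE_HZ) -> float:
    """Start time in milliseconds of the step with the given zero-based index."""
    return index * frame_period_ms(rate_hz)


if __name__ == "__main__":
    print(frame_period_ms())   # length of one step in ms
    print(frames_for(10.0))    # steps needed for 10 seconds of speech
    print(frame_start_ms(3))   # start time of the fourth step in ms
```

At this rate, each step spans 80 ms, so ten seconds of speech correspond to 125 steps, and a text token aligned to a given step can be timestamped by multiplying the step index by the period.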
“As the user speaks, Hibiki generates natural speech in the target language, with voice transfer, along with a text translation,” they said.
Kyutai claims that “Hibiki is the first model to provide an experience of interpretation close to human professionals,” while outperforming existing models in translation quality, speaker fidelity, and naturalness.
According to Kyutai, human evaluations confirmed Hibiki’s superior performance, with the company stating on X: “Based on objective and human evaluations, Hibiki outperforms previous systems for quality, naturalness and speaker similarity and approaches human interpreters.”
Large-Scale Deployment
Hibiki’s main backbone consists of 2 billion parameters and can process multiple translation tasks at once, making it highly efficient for “large-scale deployment.”
For on-device applications, Kyutai has also introduced Hibiki-M, a lighter 1-billion-parameter version capable of running real-time translations on smartphones.
Kyutai’s co-founder and CTO, Laurent Mazaré, noted in a post on X that Hibiki is “robust to extreme background conditions” and can even function without full network access.
Hibiki currently supports only French-to-English translation, but Kyutai wants to extend it to many more languages, with the aim “to deliver a definitive solution for live speech translation.”
As part of its open-science initiative, Kyutai has released the Hibiki models, inference code and weights, and a 900-hour synthetic dataset. The company also invites users to explore sample outputs showcasing Hibiki’s potential applications.
Authors: Tom Labiausse, Laurent Mazaré, Edouard Grave, Patrick Pérez, Alexandre Défossez, and Neil Zeghidour