Translation Canvas: An Explainable MT Analysis Interface

Researchers and developers need tools that offer insights beyond simple quality scores, underscoring the importance of explainability in machine translation (MT) evaluation metrics.

Explainable metrics help researchers understand not only how well an MT model performs, but also why it performs a certain way, allowing for targeted refinements and bridging the gap between model outputs and human interpretation.

Recent advancements highlight the industry’s shift toward explainability. For instance, InstructScore, developed by the University of California, Google, and Carnegie Mellon University, and Unbabel’s xTOWER showcase the value of explainable metrics. InstructScore leverages large language models (LLMs) to provide quality scores and detailed error explanations, while xTOWER, also an LLM, produces high-quality error explanations and uses them to suggest corrections.

Despite these advancements, MT researchers still struggle to find tools that are both user-friendly and able to interpret and evaluate MT model performance at a granular level. In an October 20, 2024 paper, researchers from the University of California, Santa Barbara, and Carnegie Mellon University underscored the “necessity for an integrated solution that combines comprehensive model evaluation with user-friendly interfaces and advanced analytical capabilities.”

Translation Canvas

To address this need, they developed Translation Canvas, an evaluation toolkit focused on explainability, accessibility, and flexibility.

Translation Canvas offers an intuitive interface and supports fine-grained evaluations, pinpointing specific error spans and providing natural language explanations. The toolkit currently incorporates three evaluation metrics — BLEU, COMET, and InstructScore — giving researchers multiple perspectives on model performance.

The Translation Canvas dashboard provides a comprehensive view of MT model performance, detailing the distribution of errors and enabling comparative analysis between MT systems. This helps researchers quickly identify areas where a model underperforms relative to others. Additionally, the tool includes a robust search function that enables researchers to filter results by error type, severity, or content, making targeted analysis more efficient.
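To make the dashboard's two core operations concrete, here is a minimal Python sketch of an error-distribution summary and a search filter over annotated error spans. It is an illustration only, not Translation Canvas's actual code or API: the `ErrorSpan` record, the `error_distribution` and `search` functions, the system names, and the sample annotations are all hypothetical and were made up for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorSpan:
    """Hypothetical record for one annotated error (not the tool's real schema)."""
    system: str       # name of the MT system that produced the output
    segment_id: int   # index of the translated segment
    text: str         # the erroneous span in the MT output
    error_type: str   # e.g. "mistranslation", "omission"
    severity: str     # e.g. "major" or "minor"
    explanation: str  # natural language explanation of the error


def error_distribution(spans):
    """Count errors per system, broken down by error type."""
    dist = {}
    for span in spans:
        dist.setdefault(span.system, Counter())[span.error_type] += 1
    return dist


def search(spans, error_type=None, severity=None, contains=None):
    """Return spans matching every filter that was supplied.

    `contains` is matched case-insensitively against both the error
    span text and its explanation.
    """
    needle = contains.lower() if contains is not None else None
    results = []
    for span in spans:
        if error_type is not None and span.error_type != error_type:
            continue
        if severity is not None and span.severity != severity:
            continue
        if needle is not None:
            haystack = f"{span.text} {span.explanation}".lower()
            if needle not in haystack:
                continue
        results.append(span)
    return results


# Made-up annotations for two hypothetical systems, for illustration only.
SPANS = [
    ErrorSpan("system_a", 0, "bank", "mistranslation", "major",
              "'bank' should refer to a riverside, not a financial institution."),
    ErrorSpan("system_a", 1, "the the", "grammar", "minor",
              "The article is repeated."),
    ErrorSpan("system_b", 0, "river bank", "mistranslation", "minor",
              "The source names a specific shore, which the output generalizes."),
    ErrorSpan("system_b", 2, "", "omission", "major",
              "The phrase 'next week' is missing from the output."),
]

if __name__ == "__main__":
    # Side-by-side error counts for comparing systems.
    for system, counts in sorted(error_distribution(SPANS).items()):
        print(system, dict(counts))
    # Targeted search: major errors that mention "bank".
    for hit in search(SPANS, severity="major", contains="bank"):
        print(hit.system, hit.segment_id, hit.explanation)
```

Keeping each error as a flat record with its type, severity, and explanation is one simple way to let the same data drive both views: the per-system counts support comparison between systems, and the filters narrow thousands of spans down to the few worth reading.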

The researchers noted that Translation Canvas is designed specifically for the translation research community, “where understanding the nuances of model errors and performance is vital for further improvements.”

While previous tools, like Ghent University’s MATEO project, offered a web-based platform for diverse metrics, Translation Canvas builds on that approach by integrating natural language error explanations, powered by InstructScore, and advanced instance-level analysis.

Useful, Enjoyable, and Easy To Use

To assess the effectiveness of Translation Canvas, the researchers conducted a user evaluation study with participants experienced in MT and existing MT evaluation metrics.

Users rated it highly for both enjoyability and usability, particularly appreciating the highlighting of error types and the quick analysis process. The graph presentations and error sorting saved significant time in fine-grained analysis, and the support for multi-system analysis was highlighted as a key usability feature.

“Our evaluation shows that users find the system to be useful, enjoyable and as easy to use as command-line evaluation tools,” the researchers said.

Even those new to MT evaluation could quickly get started — first-time users reported taking only ten minutes to begin working with a custom dataset. This ease of use reflects the system’s effective balance between functionality and user-friendliness, meeting the need for tools that support both rapid onboarding and sophisticated analysis.

The researchers acknowledge that human evaluation remains essential for capturing the subtleties of translation quality, and they plan to further improve Translation Canvas based on user feedback. With user permission, they will collect feedback on source texts, references, model outputs, and rankings to continuously refine the tool. Users can revoke permission at any time, ensuring control over their data and feedback.

Authors: Chinmay Dandekar, Wenda Xu, Xi Xu, Siqi Ouyang, and Lei Li