- Standard string-based metrics (e.g., BLEU, TER, ChrF)
- Current state-of-the-art pretrained metrics (e.g., BERTScore, BLEURT, COMET, Prism)
The latter leverage existing language models or sequence-to-sequence models to determine whether a hypothesis (i.e., raw MT system output) conveys the same meaning as a reference translation (i.e., the high-quality translation produced by a professional translator).
However, none of these are suitable for document-level MT evaluation as they focus on sentence-level translation quality and ignore discourse-level aspects. More precisely, they can neither distinguish document-level from sentence-level improvements in translation quality nor identify the discourse phenomena — such as anaphoric references, lexical coherence, cohesion, deixis, and ellipsis — which lead to context-agnostic translations, according to a 2022 study by Jiang, Liu, et al.
The same study introduced a novel automatic metric called “BlonDe” (Bilingual Evaluation of Document Translation) to expand the scope of automatic MT evaluation from sentence to document level. (The study did not compare BlonDe to pretrained metrics.)
Amazon Study
In September 2022, Amazon researchers presented a simple method for extending pretrained metrics to incorporate document-level context and applied it to BERTScore, Prism, and COMET, as well as COMET-QE, the reference-free (quality estimation as a metric) version of COMET.
The authors compared a single hypothesis sentence to a single human reference translation sentence to get a score, just like in standard sentence-level metrics. They also included additional context (i.e., two previous sentences) from the reference translation when computing the contextual embeddings for both the hypothesis and reference sentence.
Once the hypothesis and reference sentence had been embedded, they discarded the extra context sentences before computing metric scores following the same process as in the corresponding sentence-level metric.
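The embed-with-context, then-discard-context procedure described above can be illustrated with a minimal sketch. This is not the Amazon team's implementation: `toy_contextual_embed` is a hypothetical stand-in for a pretrained encoder (it simply blends each token vector with the mean of the whole input), and the scoring function is a simplified greedy-matching F1 loosely in the style of BERTScore. All function names and parameters are assumptions made for illustration.

```python
import hashlib
import math

DIM = 16  # size of the toy vectors (arbitrary choice for this sketch)


def token_vector(token):
    """Deterministic static vector per token; a toy stand-in for real embeddings."""
    digest = hashlib.sha256(token.lower().encode("utf-8")).digest()
    return [b / 255.0 - 0.5 for b in digest[:DIM]]


def toy_contextual_embed(tokens, mix=0.3):
    """Hypothetical 'contextual' encoder: blend each token with the input mean,
    so that a token's vector depends on everything else in the input."""
    static = [token_vector(t) for t in tokens]
    mean = [sum(col) / len(static) for col in zip(*static)]
    return [[(1 - mix) * x + mix * m for x, m in zip(vec, mean)] for vec in static]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def embed_with_context(context_sents, sentence):
    """Embed context + sentence together, then keep only the sentence's vectors."""
    context_tokens = [tok for sent in context_sents for tok in sent.split()]
    sent_tokens = sentence.split()
    if not sent_tokens:
        return []
    embedded = toy_contextual_embed(context_tokens + sent_tokens)
    return embedded[len(context_tokens):]  # discard the context positions


def greedy_f1(hyp_vecs, ref_vecs):
    """Simplified greedy-matching F1 over token vectors."""
    if not hyp_vecs or not ref_vecs:
        return 0.0
    precision = sum(max(cosine(h, r) for r in ref_vecs) for h in hyp_vecs) / len(hyp_vecs)
    recall = sum(max(cosine(h, r) for h in hyp_vecs) for r in ref_vecs) / len(ref_vecs)
    if precision + recall <= 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def doc_level_score(hypothesis, reference, previous_refs, n_context=2):
    """Score one hypothesis against one reference, using the preceding
    reference sentences as context for BOTH sides, as the article describes."""
    context = previous_refs[-n_context:] if n_context > 0 else []
    hyp_vecs = embed_with_context(context, hypothesis)
    ref_vecs = embed_with_context(context, reference)
    return greedy_f1(hyp_vecs, ref_vecs)


previous_refs = ["The cat sat on the mat .", "A dog walked in ."]
score = doc_level_score("It barked loudly .", "It barked at the cat .", previous_refs)
print(f"document-level toy score: {score:.3f}")
```

Note that the context sentences only influence how the hypothesis and reference tokens are embedded; they never take part in the matching step itself, so the final score is still a sentence-to-sentence comparison.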
The Amazon researchers also measured system-level correlation with human judgments to test the effectiveness of the proposed document-level metrics.
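As a rough illustration of what a system-level comparison involves, the sketch below averages segment-level metric scores per MT system and computes a Pearson correlation against human scores. This is one common way to measure agreement and is not necessarily the exact statistic used in the study. The system names and all numbers are made up for illustration only.

```python
import math


def system_level_scores(segment_scores):
    """Average each system's segment-level metric scores into one system score."""
    return {system: sum(scores) / len(scores) for system, scores in segment_scores.items()}


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up illustrative values, NOT results from the study.
metric_segments = {
    "sysA": [0.62, 0.70, 0.66],
    "sysB": [0.55, 0.58, 0.60],
    "sysC": [0.80, 0.77, 0.83],
}
human_scores = {"sysA": 71.0, "sysB": 64.5, "sysC": 82.0}

metric_by_system = system_level_scores(metric_segments)
systems = sorted(human_scores)
r = pearson(
    [metric_by_system[s] for s in systems],
    [human_scores[s] for s in systems],
)
print(f"system-level Pearson r: {r:.3f}")
```

A metric whose system scores rank and space the MT systems similarly to human annotators yields a correlation close to 1.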
Novel Ways for Document-Level Evaluation
The findings demonstrate improved correlation with human judgments when document-level context is added to pretrained models. Such improvements are probably due to better context exploitation.
In addition, the document-level metrics outperformed their sentence-level counterparts in around 85% of the tested scenarios (when excluding results on low-quality human references). As regards the document-level extension of COMET-QE specifically, the proposed method significantly improved its accuracy on discourse phenomena tasks, outperforming a dedicated baseline by up to 6.1%.
“A simple extension of the metrics permits them to take advantage of context,” wrote the team.
In particular, the authors observed improvements in the evaluation of pronoun translation, not only when the relevant information appears in a previous sentence but also when it appears in the same sentence, indicating that additional context can help in such cases as well. Besides pronoun translation, the approach also improves over both the sentence-level metric and a document-level MT system at word-sense disambiguation.
Thus, the Amazon researchers concluded that, “to the best of our knowledge, our work is the first example of pretrained document-level MT metrics […] We believe that it could easily be extended to other pretrained sentence-level metrics.”
The MT community should adopt such metrics which take document-level context into account, according to the team. They also suggest that “any future research in metrics should explore novel ways to incorporate context.”