The resulting PAR3 dataset is different from samples used in previous literary MT studies in that it is 20 times larger; segmentation was done using paragraphs instead of sentences; and the aligned source text is included. The experiment thus focused on the output quality of paragraph-level literary translation evaluated by human experts.
MT Prefers MT, Literally
To measure output quality, researchers used the automatic MT evaluation metrics BLEU, BLEURT, and BLONDE, as well as human reviewers, who included professional translators and monolingual English raters.
Reviewers had to perform A/B tests on PAR3 to indicate their preference between a Google Translate output paragraph and a reference human translation. Reviewers also provided open comments for each example to explain their choice.
The automatic MT evaluation metrics showed a preference for Google Translate outputs over human translations in the dataset. By contrast, human reviewers chose the human translation over MT 85% of the time.
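As a rough illustration of how an A/B preference rate like the one above can be tallied, here is a minimal Python sketch. The function names and the toy judgments are hypothetical and are not taken from the paper or the PAR3 release; the chance calculation is a plain one-sided binomial tail, added only to show how far such a split sits from a coin flip.

```python
from collections import Counter
from math import comb


def preference_rate(judgments, target="human"):
    """Return the fraction of A/B judgments that picked `target`."""
    counts = Counter(judgments)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments supplied")
    return counts[target] / total


def binomial_tail(successes, trials, p=0.5):
    """Probability of at least `successes` wins in `trials` if each
    judgment were an independent coin flip with win probability `p`."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical toy data (not the study's data): 17 of 20 reviewers'
# choices favour the human translation over the MT output.
judgments = ["human"] * 17 + ["mt"] * 3
rate = preference_rate(judgments)
p_value = binomial_tail(judgments.count("human"), len(judgments))
print(f"human preferred: {rate:.0%}; probability under chance: {p_value:.4f}")
```

A tally like this only summarizes the choices; the open comments reviewers provided for each example are what explain why one translation was preferred.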
Quality issues that human experts identified in the MT outputs went beyond accuracy errors and stylistic inconsistencies to include readability, fluency, and “overly literal translations and discourse-level errors (e.g., coreference, pronoun consistency).”
To correct these MT issues, the researchers fine-tuned a model for an automatic post-editing task. Human reviewers preferred the post-edited translations at a rate of 69% and noted a lower incidence of errors.
Retrain and Repeat
Researchers acknowledged in the paper that “the task of conveying an author’s ideas highlights yet another difference between literary and traditional MT: document-level context is especially critical for the literary domain due to the presence of complex discourse structure, rendering the typical sentence-level MT pipeline insufficient for this task.”
The researchers included an extensive list of other publications on the same topic of literary MT to support the premise that “state-of-the-art MT systems and MT evaluation metrics fail in the literary domain.”
Using paragraph-level segmentation instead of typical sentence-level segmentation did not seem to make a significant difference in the MT outputs. However, by releasing the PAR3 dataset to the general public, the researchers aim to encourage further exploration into the use of MT in literary translation using pretrained language models.