Salesforce Just Open-Sourced a Large, XML-Tagged Machine Translation Dataset

“Our work is unique in that we focus on how to translate text with XML tags, which is practically important in localization,” lead researcher Kazuma Hashimoto told Slator.

A new dataset was necessary for the team’s research, as widely used datasets of plain text do not reflect the fact that “text data on the Web is often wrapped with markup languages to incorporate document structure and metadata such as formatting information,” the researchers explained.

“We decided to publish our new dataset so that people can use it if interested, and we can also gain significant benefit if they report interesting solutions to our task,” Hashimoto said, pointing out that the source data, online help for Salesforce customers, was already publicly available.

Looking ahead, the team wrote, “As our dataset represents a single, well-defined domain, it can also serve as a corpus for domain adaptation research (either as a source or target domain).”

Including XML Tags Improves Quality

According to the paper, this online help text has been localized and maintained for 15 years by the same localization service provider and in-house localization program managers.

“At every release, we run our system to translate the content in English to other target languages, and then human experts verify the quality and perform post-editing to meet the quality demand,” Hashimoto said.

Drawing on this multilingual content, the researchers created datasets for seven English-based language pairs (English to Dutch, Finnish, French, German, Japanese, Russian, and Simplified Chinese) and one non-English pair, Finnish to Japanese.

The group performed baseline experiments on NMT output with XML tags removed (i.e., plain text) and compared them to experiments on NMT output with XML tags included.

The team trained three models for each language pair: one trained only with text, without XML; one trained with XML; and one trained with XML and with copy mechanisms, which copy XML elements from the original source text.
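The difference between the plain-text and XML-tagged versions of a segment can be sketched in a few lines of Python. This is only an illustration, not the researchers' preprocessing code: the example sentence, the tag names (`uicontrol`, `ph`), and the helper functions `strip_tags` and `list_tags` are all assumptions made for the sketch.

```python
import re

# Matches opening, closing, and self-closing tags such as <b>, </b>, <br/>.
# Assumption: segments contain simple inline tags, no comments or CDATA.
TAG_PATTERN = re.compile(r"</?[A-Za-z][^<>]*?/?>")


def strip_tags(segment: str) -> str:
    """Return the plain-text version of a segment with XML tags removed."""
    text = TAG_PATTERN.sub("", segment)
    # Collapse any whitespace left behind by removed tags.
    return re.sub(r"\s+", " ", text).strip()


def list_tags(segment: str) -> list[str]:
    """Return the tags in order of appearance.

    These are the tokens a copy mechanism could carry over from the
    source segment instead of generating them from scratch.
    """
    return TAG_PATTERN.findall(segment)


# Hypothetical help-text segment; not taken from the dataset.
source = "Click <uicontrol>Save</uicontrol> to keep your <ph>changes</ph>."

plain = strip_tags(source)   # input for a text-only model
tags = list_tags(source)     # candidates for copying into the output
```

A text-only model would see only `plain`, while the XML-aware models would see `source` as is, with the tags in `tags` available for copying.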

For the plain text NMT, “including segment-internal XML tags tends to improve the BLEU scores,” the authors wrote, which “is not surprising because the XML tags provide information about explicit or implicit alignments of phrases.” This was not the case, however, for English to Finnish, “which indicates that for some languages it is not easy to handle tags within the text.”
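To see why tags can act as alignment hints, consider a minimal Python sketch that pairs up the phrases wrapped in the same tag on the source and target sides. The sentences, the tag names, and the helper functions `tagged_phrases` and `align_by_tags` are hypothetical and chosen only for illustration; the sketch also assumes tags are not nested, which real help content may not guarantee.

```python
import re

# Captures <name ...>content</name> for non-nested elements
# (a simplifying assumption for this sketch).
ELEMENT_PATTERN = re.compile(r"<([A-Za-z][\w.-]*)[^<>]*>(.*?)</\1>")


def tagged_phrases(segment: str) -> list[tuple[str, str]]:
    """Return (tag name, enclosed phrase) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in ELEMENT_PATTERN.finditer(segment)]


def align_by_tags(source: str, target: str) -> list[tuple[str, str, str]]:
    """Pair each tagged source phrase with the first unused target phrase
    that carries the same tag name."""
    remaining = tagged_phrases(target)
    pairs = []
    for name, src_phrase in tagged_phrases(source):
        for i, (tgt_name, tgt_phrase) in enumerate(remaining):
            if tgt_name == name:
                pairs.append((name, src_phrase, tgt_phrase))
                del remaining[i]  # each target phrase is used at most once
                break
    return pairs


# Hypothetical English-French pair; not taken from the dataset.
source = "Click <b>Save</b> and open the <i>Reports</i> tab."
target = "Cliquez sur <b>Enregistrer</b> et ouvrez l'onglet <i>Rapports</i>."

alignments = align_by_tags(source, target)
```

Here the tags tell us, without any learned alignment, that "Save" corresponds to "Enregistrer" and "Reports" to "Rapports", even though the surrounding word order differs.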

The model trained with both XML and copy mechanisms achieved the best BLEU scores across all language pairs, for both plain text and text with XML tags, with one exception: English to French plain text.

“We expected that tagged text would be helpful in improving translation accuracy,” Hashimoto said, “especially when the training dataset size is limited, as in our specific use case, compared with very general machine translation work in existing research papers.”

The researchers also encountered a typical NMT error, undertranslation: in certain translation results the phrase “for example” was missing, even though BLEU scores on this dataset were higher than those on other, standard public datasets. For this reason, and because online help translations must be accurate, the authors concluded that NMT should be used “for the purpose of helping the human translators” perfect final translations.

Although human evaluators identified more than 50% of the translation results as “complete” or “useful in post-editing,” translators still spent a significant amount of time verifying MT and correcting MT errors.

Ideally, future translation models that take into account Web-structured text “may help human translators accelerate the localization process,” according to the paper’s authors, whose future work will explore “the effectiveness of using the NMT models in the real-world localization process where a translation memory is available.”
