Llama 2, like Llama 1 before it, takes its name from “Large Language Model Meta AI.” According to Meta, Llama 2 is trained on 40% more data than Llama 1. Its pre-trained models are trained on no less than two trillion tokens, while its fine-tuned models have been trained on more than one million human annotations.
One might therefore presume, with all that training data, that Llama 2 could well have an edge (or at least a use) in machine translation and other multilingual applications. Apparently not.
As Meta explained in the research paper, “Most data is in English, meaning that Llama 2 will perform best for English-language use cases.” It also warned, “A training corpus with a majority in English means that the model may not be suitable for use in other languages.”
According to the paper, the model’s pretraining data is nearly 90% English. Other languages, including German, French, Chinese, Spanish, Dutch, Italian, Japanese, Polish, and Portuguese, collectively make up less than 2% of Llama 2’s training data, while the language is “unknown” for more than 8% of the data. (The “unknown” category is partly made up of programming code.)
Llama 2’s lack of language diversity is somewhat surprising given that, in recent years, Meta has focused heavily on improving coverage for low-resource languages and has poured significant R&D effort into the area.
Or perhaps, after its self-proclaimed “breakthrough” in machine translation for low-resource languages in July 2022, Meta’s attention is beginning to shift to new and shinier areas of language research.