They explained that using foundation models that can be accessed and tuned for specific use cases, such as the LLaMA-2 models, “poses a risk in terms of licensing, data safety and future proofing among other things.” Additionally, they noted that “these models are very generic and mostly trained on English-centric data.”
They developed LLMs entirely in-house and from scratch, using a dataset of 3 trillion tokens comprising both general and e-commerce-specific texts in multiple languages. They used the ParaCrawl corpus along with a smaller in-house corpus from the e-commerce domain. This approach is intended to make the models robust in handling diverse languages and domain-specific tasks.
Additionally, eBay developed their own tokenizer and model vocabulary, customized for e-commerce. “This gives us several advantages, namely (i) full control over the vocabulary including special tokens (ii) better support for multilinguality (iii) better adaptation to e-commerce specific use-cases,” they said.
Eliminating Dependencies
According to the authors, their models perform on par with, or better than, the popular LLaMA-2 models, particularly excelling in non-English machine translation, as well as natural language understanding (NLU) tasks and e-commerce-specific applications.
The authors attributed this performance boost to the inclusion of significant amounts of non-English and e-commerce-specific data during pretraining, which enhances the models’ understanding and performance on tasks in languages other than English. Moreover, the customized vocabulary for e-commerce tasks resulted in a text generation speed-up of up to 34% over LLaMA-2.
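One plausible reason a domain-specific vocabulary speeds up generation is that autoregressive models emit one token per decoding step, so text that splits into fewer tokens needs fewer steps. The toy sketch below illustrates only that general idea. The vocabularies, the greedy longest-match tokenizer, and the example phrase are all hypothetical and are not taken from eBay's tokenizer or paper.

```python
def greedy_tokenize(text, vocab):
    """Segment text by longest match against vocab, falling back to single characters."""
    tokens = []
    max_len = max(len(entry) for entry in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink until something matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


# Hypothetical vocabularies, for illustration only.
generic_vocab = {"wire", "less", "blue", "tooth", "head", "phones", " "}
ecommerce_vocab = generic_vocab | {"wireless", "bluetooth", "headphones"}

text = "wireless bluetooth headphones"
generic_tokens = greedy_tokenize(text, generic_vocab)
domain_tokens = greedy_tokenize(text, ecommerce_vocab)

# Fewer tokens for the same text means fewer decoding steps at generation time.
reduction = 1 - len(domain_tokens) / len(generic_tokens)
print(len(generic_tokens), len(domain_tokens), f"{reduction:.1%}")
```

In this toy case the domain vocabulary covers the same phrase with 5 tokens instead of 8. Real speed-ups depend on the actual tokenizer, the text distribution, and the model, so the numbers here say nothing about the 34% figure reported by the authors.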
The authors expect these models “to be used as a foundation for fine-tuning and instruction-tuning, eliminating dependencies to external models.”
Future efforts will focus on enhancing the data pipeline, incorporating more eBay-specific data, training larger models, and exploring the Mixture-of-Experts architecture for improved efficiency.
Authors: Christian Herold, Michael Kozielski, Leonid Ekimov, Pavel Petrushkov, Pierre-Yves Vandenbussche, and Shahram Khadivi