In the paper, the researchers explained that, with this model, they also sought to address known issues in other high-resource and multilingual LLMs, including English bias and underperformance in low-resource languages.
Grades 1-12 Textbooks as Data Sources
The datasets used in training and fine-tuning the Komodo-7B-Instruct LLM were created from open-source data and manually collected data. Sources included Indonesian textbooks on various subjects, colloquial data from movie subtitles, news, and informal conversations, according to the paper.
Explaining that “a judicious selection of high-quality data has proven effective, even yielding State-of-the-Art performance under certain circumstances,” the researchers set out to create a model specialized in understanding. The resulting datasets addressed specific language traits, including language proficiency, cross-lingual understanding, common sense reasoning, sentiment analysis, and intent classification.
The vocabulary used was expanded to include common Indonesian and regional words. The researchers identified and incorporated approximately 2,000 frequently used words in Indonesian and 1,000 words for regional languages not included in the Llama-2 model.
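To illustrate the idea of vocabulary expansion, here is a minimal sketch, not the researchers' actual procedure. It assumes a toy word-level vocabulary stored as a dictionary with contiguous IDs and a small in-memory corpus. The function name `expand_vocabulary` and its parameters are hypothetical. The sketch counts word frequencies and appends the most frequent words that the base vocabulary lacks.

```python
from collections import Counter


def expand_vocabulary(base_vocab, corpus, top_k):
    """Add up to top_k of the most frequent corpus words missing from base_vocab.

    base_vocab: dict mapping token -> id (assumed contiguous ids 0..n-1).
    corpus: iterable of text lines (hypothetical stand-in for real data).
    Returns the expanded vocabulary and the list of newly added words.
    """
    # Count lowercase, whitespace-separated words across the corpus.
    counts = Counter(word for line in corpus for word in line.lower().split())

    new_vocab = dict(base_vocab)  # copy so the base vocabulary is left untouched
    added = []
    for word, _ in counts.most_common():
        if len(added) >= top_k:
            break
        if word not in new_vocab:
            new_vocab[word] = len(new_vocab)  # next free id
            added.append(word)
    return new_vocab, added


base = {"<s>": 0, "the": 1, "saya": 2}
corpus = ["saya makan nasi", "saya makan ikan", "kowe makan nasi"]
vocab, added = expand_vocabulary(base, corpus, top_k=2)
```

In a real setting the new entries would be subword tokens added to the model's tokenizer, and the model's embedding table would need to grow to match. The toy dictionary above only shows the selection step.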
During the pre-training phase, Komodo-7B-Instruct refined its ability to position words, grouping similar words closer together in its internal representation. Other dataset preparation steps included repetition removal (stripping excessively repeated words or phrases), quality filtering (discarding low-quality or irrelevant data), and deduplication (removing duplicate entries).
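The three preparation steps named above can be sketched roughly as follows. This is an illustrative example under simple assumptions, not the pipeline described in the paper: the function names, the thresholds (`max_repeats`, `min_words`, `min_alpha_ratio`), and the exact-match hashing are all hypothetical choices.

```python
import hashlib


def collapse_repetitions(text, max_repeats=2):
    """Repetition removal: keep at most max_repeats consecutive copies of a word."""
    out = []
    run = 0
    for word in text.split():
        if out and word.lower() == out[-1].lower():
            run += 1
        else:
            run = 1
        if run <= max_repeats:
            out.append(word)
    return " ".join(out)


def passes_quality_filter(text, min_words=3, min_alpha_ratio=0.6):
    """Quality filtering: drop very short or mostly non-alphabetic lines."""
    if len(text.split()) < min_words:
        return False
    non_space = sum(not ch.isspace() for ch in text)
    alpha = sum(ch.isalpha() for ch in text)
    return non_space > 0 and alpha / non_space >= min_alpha_ratio


def deduplicate(texts):
    """Deduplication: keep the first copy of each case/whitespace-normalized line."""
    seen = set()
    unique = []
    for text in texts:
        normalized = " ".join(text.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


def clean_corpus(texts):
    """Apply the three steps in sequence."""
    cleaned = [collapse_repetitions(t) for t in texts]
    kept = [t for t in cleaned if passes_quality_filter(t)]
    return deduplicate(kept)


raw = [
    "Saya suka suka suka suka membaca buku",
    "saya suka suka membaca buku",
    "12345 !!! ###",
    "ok",
]
result = clean_corpus(raw)
```

Production pipelines typically use fuzzier near-duplicate detection and learned quality classifiers. Exact hashing and fixed thresholds are used here only to keep the sketch short.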
Part of the model’s training also involved English datasets and alternate parallel data with all combinations of English, Indonesian, and the 11 regional languages. The researchers’ intention in doing so was to enhance the model’s understanding of code-mixed (multiple-language) sentences. They also used a bilingual next-token prediction strategy instead of a monolingual next-token prediction with translated Indonesian text.
According to the researchers, their Komodo LLM surpasses various multilingual models, including Cohere’s Aya-101, MBZUAI’s Bactrian-X-llama-7B, Qwen-1.5, Mistral’s Mixtral-8x7B-Instruct-v0.1, and AI Singapore’s Indonesian SEA-LION LLM, on multiple tasks against existing benchmarks, including perplexity. It also covers more languages than Google Translate, which supports only Indonesian, Javanese, and Sundanese.
The model, say the researchers, excelled in intent classification, colloquial language detection, sentiment analysis across languages, and cross-language understanding (e.g., Indonesian-English). Komodo-7B-Base also maintained the performance of Llama-2-7B Base across all tasks except GSM8K, a math task.
The Komodo LLM was designed and fine-tuned for “linguistic variations specific to the Indonesian context and its regional languages, enabling it to outperform in tasks related to Indonesian and regional languages,” added the researchers.
Beyond commercial applications, one important use case for the model is its potential role in supporting a diverse set of Indonesia’s regional languages for educational purposes, according to the researchers. Their idea is that with the Komodo LLM “resources and information can be more widely disseminated, contributing to a more inclusive and equitable educational landscape throughout the country.”