New Dataset for Low-Resource Kenyan Languages Tackles AI Exclusion

A team of researchers from USIU-Africa, Kabarak University, and Maseno University has created language resources for three low-resource Kenyan languages: Kidaw’ida, Kalenjin, and Dholuo. The goal of the project is to improve natural language processing (NLP) applications, support linguistic research, and promote linguistic diversity in AI development.

In their January 19, 2025, paper, the researchers highlighted a key issue: while advancements in AI and NLP are transforming industries globally, many African languages remain underrepresented in these technologies due to limited digital resources.

This gap has significant consequences, as speakers of these languages are often excluded from access to vital information and technological progress.

The researchers are working to bridge this gap. “Building high-quality linguistic corpora for underrepresented languages is crucial to enable language technologies such as machine translation, speech recognition, and speech synthesis for African languages,” they noted.

Their work, funded by the Lacuna Fund, involved crowd-sourcing data from native speakers to build linguistic corpora. (The Lacuna Fund supports data scientists, researchers, and social entrepreneurs in low- and middle-income contexts to produce datasets to address problems in their communities.)

The methodology combined community engagement with selective crowd-sourcing to ensure high-quality data collection. This involved collecting both text and speech data, with the text translated into Kiswahili to generate parallel corpora. 

Specifically, contributors wrote sentences in their native languages, often inspired by publicly available materials, or recorded conversations, which were then transcribed and translated. Voice recordings were collected through Mozilla Common Voice, allowing speakers from diverse demographics to participate.
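The pairing step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the researchers' actual pipeline: the `SentencePair` class, the `build_parallel_corpus` function, and the placeholder strings are all hypothetical, and the real data on Zenodo may be organized differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    """One aligned entry in a parallel corpus (hypothetical structure)."""
    language: str   # source language, e.g. "Dholuo"
    source: str     # sentence written or transcribed by a contributor
    kiswahili: str  # its Kiswahili translation


def build_parallel_corpus(language, sources, translations):
    """Align source sentences with their translations, skipping empty entries."""
    if len(sources) != len(translations):
        raise ValueError("each source sentence needs exactly one translation")
    pairs = []
    for src, tgt in zip(sources, translations):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:  # drop pairs where either side is blank
            pairs.append(SentencePair(language, src, tgt))
    return pairs


# Placeholder strings stand in for real sentences (assumption, not real data).
corpus = build_parallel_corpus(
    "Dholuo",
    ["<source sentence 1>", "<source sentence 2>"],
    ["<Kiswahili translation 1>", "<Kiswahili translation 2>"],
)
print(len(corpus), "aligned sentence pairs")
```

Keeping each sentence and its translation in a single record is what makes the corpus "parallel": a machine translation model can then be trained on the aligned pairs.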

Open Access

The effort produced 30,000 translated sentences per language, along with 56 hours of recorded audio for Kidaw’ida, 92 hours for Kalenjin, and 120 hours for Dholuo. These resources are freely available on Zenodo and Mozilla Common Voice for developers and researchers.
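For a sense of overall scale, the per-language figures reported above can be added up. The short Python snippet below uses only the numbers quoted in this article; the variable names are illustrative.

```python
# Figures as reported in the article.
SENTENCES_PER_LANGUAGE = 30_000
AUDIO_HOURS = {"Kidaw'ida": 56, "Kalenjin": 92, "Dholuo": 120}

# Three languages, each with the same number of translated sentences.
total_sentences = SENTENCES_PER_LANGUAGE * len(AUDIO_HOURS)
total_hours = sum(AUDIO_HOURS.values())

print(total_sentences)  # 90000
print(total_hours)      # 268
```

Across the three languages, that comes to 90,000 translated sentences and 268 hours of recorded audio.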

“Developers are encouraged to take advantage of this unrestricted access to the data to train models and create applications for these three languages,” the researchers said. They also encourage language communities to continue contributing to the repositories to improve model accuracy.

Looking ahead, the researchers plan to expand the dataset size and collaborate further with local communities and developers to build NLP applications tailored to the needs of these linguistic groups. The applications could span health, agriculture, education, and commerce, empowering local populations and promoting linguistic diversity. Future efforts will also focus on securing additional funding to recruit more contributors.

Authors: Audrey Mbogho, Quin Awuor, Andrew Kipkebut, Lilian Wanzare, and Vivian Oloo