The researchers are working to bridge this gap. “Building high-quality linguistic corpora for underrepresented languages is crucial to enable language technologies such as machine translation, speech recognition, and speech synthesis for African languages,” they noted.
Their work, funded by the Lacuna Fund, involved crowd-sourcing data from native speakers to build linguistic corpora. (The Lacuna Fund supports data scientists, researchers, and social entrepreneurs in low- and middle-income contexts in producing datasets that address problems in their communities.)
The methodology combined community engagement with selective crowd-sourcing to ensure high-quality data collection. This involved collecting both text and speech data, with the text translated into Kiswahili to generate parallel corpora.
Specifically, contributors wrote sentences in their native languages, often inspired by publicly available materials, or recorded conversations, which were then transcribed and translated. Voice recordings were collected through Mozilla Common Voice, allowing speakers from diverse demographics to participate.
Open Access
The results of this effort include 30,000 translated sentences per language, along with 56 hours of recorded audio for Kidaw’ida, 92 hours for Kalenjin, and 120 hours for Dholuo. These resources are freely accessible on Zenodo and Mozilla Common Voice, ensuring open access for developers and researchers.
“Developers are encouraged to take advantage of this unrestricted access to the data to train models and create applications for these three languages,” the researchers said. They also encouraged language communities to continue contributing to the repositories to improve model accuracy.
Looking ahead, the researchers plan to expand the dataset size and collaborate further with local communities and developers to build NLP applications tailored to the needs of these linguistic groups. The applications could span health, agriculture, education, and commerce, empowering local populations and promoting linguistic diversity. Future efforts will also focus on securing additional funding to recruit more contributors.
Authors: Audrey Mbogho, Quin Awuor, Andrew Kipkebut, Lilian Wanzare, and Vivian Oloo