According to the team’s analysis, many language subsets include short and repetitive recordings, poor audio quality, and limited speaker diversity. In some cases, recordings are just two seconds long or come from a single contributor. They also identified inconsistent writing systems and unclear dialect choices, such as mixing Bokmål and Nynorsk in Norwegian, or labeling Modern Standard Arabic as Egyptian Arabic.
Other issues include low speech-to-silence ratios, where large portions of audio contain little or no speech, and imbalanced content, with overly formal or templated prompts that don’t reflect natural language use.
The researchers noted that these quality issues have significant implications for downstream research and applications. For example, shorter utterances tend to produce higher word error rates, while excessive silence negatively impacts emotion recognition. A lack of speaker and topic diversity can introduce biases related to gender, age, and regional accents.
Linguistic Expertise
Micro-level issues, such as audio duration and speaker diversity, can often be fixed programmatically. Macro-level issues, which arise from ignoring a language’s sociolinguistic context, are harder to address: inconsistent orthography, dialect confusion, and mismatches between spoken and written forms require manual inspection and linguistic expertise. These macro-level problems are especially common in less-institutionalized languages.
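As an illustration of what a programmatic micro-level check might look like, the following Python sketch summarizes clip duration and speaker diversity for one language subset. It is not the researchers' tooling: the `Clip` record, the `summarize_subset` function, and the 3-second cutoff are all hypothetical choices made for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Clip:
    """Hypothetical per-recording record; field names are assumptions."""
    speaker_id: str
    duration_s: float


def summarize_subset(clips, min_duration_s=3.0):
    """Return simple micro-level statistics for one language subset.

    min_duration_s is an illustrative cutoff, not a value from the paper.
    """
    if not clips:
        raise ValueError("subset is empty")
    per_speaker = Counter(c.speaker_id for c in clips)
    short = sum(1 for c in clips if c.duration_s < min_duration_s)
    return {
        "num_clips": len(clips),
        "num_speakers": len(per_speaker),
        # Share of clips shorter than the cutoff.
        "short_clip_fraction": short / len(clips),
        # Share of clips contributed by the most prolific speaker.
        "top_speaker_share": max(per_speaker.values()) / len(clips),
    }


clips = [Clip("a", 2.0), Clip("a", 5.0), Clip("b", 4.0), Clip("a", 1.5)]
stats = summarize_subset(clips)
```

A subset with a high `short_clip_fraction` or a `top_speaker_share` close to 1.0 would be a candidate for closer review. Nothing comparable exists for macro-level issues such as dialect confusion, which is why those still call for a human reviewer.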
To address these challenges, the researchers propose treating dataset creation not just as data collection, but as a form of language planning — particularly for languages without standardized writing systems. They recommend:
- Assessing the sociolinguistic context of each language first, such as literacy levels, dialects, and scripts.
- Setting clear design goals: Is the dataset meant to represent a formal register or everyday speech? In which script? For which dialect?
- Providing specific guidelines for contributors, especially when no standard exists.
- Checking data quality both automatically (e.g., speech-to-silence ratio) and manually (e.g., dialect consistency).
- Publishing metadata that explains language choices, so downstream users can interpret the data correctly.
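The automatic check mentioned in the list above, the speech-to-silence ratio, can be approximated in a few lines. The sketch below is a crude energy-based estimate, not a real voice activity detector and not the method used in the paper. The `speech_ratio` function, the 25 ms frame size, and the RMS threshold are assumptions made for illustration, and samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math


def speech_ratio(samples, sample_rate, frame_ms=25, rms_threshold=0.02):
    """Estimate the fraction of frames whose RMS energy exceeds a threshold.

    Energy is only a rough proxy for speech: loud noise counts as "speech"
    and very quiet speech counts as "silence".
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    if not frames:
        return 0.0
    active = 0
    for frame in frames:
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms >= rms_threshold:
            active += 1
    return active / len(frames)


# Synthetic example: one second of silence followed by one second of a tone.
sample_rate = 16000
silence = [0.0] * sample_rate
tone = [0.5 * math.sin(2 * math.pi * 220 * n / sample_rate) for n in range(sample_rate)]
ratio = speech_ratio(silence + tone, sample_rate)
```

In practice the threshold would need tuning per recording condition, and a trained voice activity detector would likely be more reliable. The resulting ratio could also be published alongside the dataset's metadata so downstream users can filter clips themselves.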
Future work should focus on building tools to support this framework, the Google researchers concluded.
Authors: Mingfei Lau, Qian Chen, Yeming Fang, Tingting Xu, Tongzhou Chen, and Pavel Golik