Google Flags Serious Data Quality Issues in Public Multilingual Speech Datasets

A Google study dated June 27, 2025, has uncovered serious quality issues in three of the most widely used public multilingual speech datasets: Mozilla Common Voice 17.0, FLEURS, and VoxPopuli.

The Google researchers explained that these datasets play an important role in advancing speech technologies. For example, they are essential for training and evaluating multilingual automatic speech recognition (ASR) models like Whisper and SeamlessM4T, and also support crosslingual speech representation learning and downstream applications such as multilingual speech generation and understanding.

The Google team found that these datasets contain flaws that can lead to misleading results and give an “illusion of success,” especially in low-resource languages.

According to the team’s analysis, many language subsets include short and repetitive recordings, poor audio quality, and limited speaker diversity. In some cases, recordings are just two seconds long or come from a single contributor. They also identified inconsistent writing systems and unclear dialect choices, such as mixing Bokmål and Nynorsk in Norwegian, or labeling Modern Standard Arabic as Egyptian Arabic. 

Other issues include low speech-to-silence ratios, where large portions of audio contain little or no speech, and imbalanced content, with overly formal or templated prompts that don’t reflect natural language use.

The researchers noted that these quality issues have significant implications for downstream research and applications. For example, shorter utterances tend to produce higher word error rates, while excessive silence negatively impacts emotion recognition. A lack of speaker and topic diversity can introduce biases related to gender, age, and regional accents. 

Linguistic Expertise

While micro-level issues, such as audio duration and speaker diversity, can often be fixed programmatically, macro-level issues require manual inspection and linguistic expertise. These macro-level issues arise from ignoring a language’s sociolinguistic context and include inconsistent orthography, dialect confusion, and mismatches between spoken and written forms. They are especially common in less-institutionalized languages.
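As a rough illustration of what a programmatic check for micro-level issues might look like, the sketch below summarizes clip duration and speaker concentration for one language subset. It is not the researchers' method: the `Clip` record, the `audit_subset` function, and both thresholds (a 2-second cutoff for short clips, a 50% cap on any one speaker's share of audio) are hypothetical choices made for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Clip:
    """Hypothetical per-recording metadata for one language subset."""
    speaker_id: str
    duration_s: float


def audit_subset(clips, min_duration_s=2.0, max_speaker_share=0.5):
    """Return simple micro-level quality indicators for one subset.

    Both thresholds are illustrative assumptions, not published values.
    """
    if not clips:
        raise ValueError("subset is empty")

    total_s = sum(c.duration_s for c in clips)
    # Count clips at or below the "very short" cutoff.
    short = sum(1 for c in clips if c.duration_s <= min_duration_s)

    # Accumulate audio time per speaker to measure concentration.
    per_speaker = Counter()
    for c in clips:
        per_speaker[c.speaker_id] += c.duration_s
    top_share = max(per_speaker.values()) / total_s if total_s > 0 else 1.0

    return {
        "num_speakers": len(per_speaker),
        "short_clip_fraction": short / len(clips),
        "top_speaker_share": top_share,
        # Flag subsets with a single contributor or one dominant voice.
        "flagged": len(per_speaker) < 2 or top_share > max_speaker_share,
    }
```

A report like this can only surface candidates for review. The macro-level issues described above, such as dialect confusion, would not show up in these numbers.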

To address these challenges, the researchers propose treating dataset creation not just as data collection, but as a form of language planning — particularly for languages without standardized writing systems. They recommend:

  • Assessing each language’s sociolinguistic context first, including literacy levels, dialects, and scripts.
  • Setting clear design goals: Is the dataset meant to represent a formal register or everyday speech? In which script? For which dialect?
  • Providing specific guidelines for contributors, especially when no standard exists.
  • Checking data quality both automatically (e.g., speech-to-silence ratio) and manually (e.g., dialect consistency).
  • Publishing metadata that explains language choices, so downstream users can interpret the data correctly.
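The automatic check mentioned in the list, the speech-to-silence ratio, could be approximated along the following lines. This is a minimal sketch under stated assumptions: it uses a simple frame-energy threshold as a stand-in for a real voice activity detector, it reports the fraction of frames that contain signal rather than any ratio defined by the researchers, and the function name `speech_fraction`, the 20 ms frame length, and the RMS threshold are all hypothetical.

```python
import math


def speech_fraction(samples, sample_rate, frame_ms=20, rms_threshold=0.01):
    """Estimate the share of frames whose energy exceeds a threshold.

    `samples` is a sequence of floats in [-1.0, 1.0]. The energy threshold
    is a crude proxy for speech activity (an assumption of this sketch);
    loud noise would also count as "active".
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    # Split the signal into consecutive, non-overlapping frames.
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    if not frames:
        return 0.0

    def rms(frame):
        return math.sqrt(sum(x * x for x in frame) / len(frame))

    active = sum(1 for f in frames if rms(f) >= rms_threshold)
    return active / len(frames)
```

A recording that scores far below its peers on a measure like this is a candidate for closer listening. Manual checks such as dialect consistency still need a human reviewer.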

Future work should focus on building tools to support this framework, the Google researchers concluded.

Authors: Mingfei Lau, Qian Chen, Yeming Fang, Tingting Xu, Tongzhou Chen, and Pavel Golik