For the past couple of years, as each new @mozilla #CommonVoice dataset of #voice #data is released, I've been using @observablehq to visualise the #metadata coverage across the 100+ languages in the dataset.
Version 17 was released yesterday (big ups to the team - EM Lewis-Jong, @jessie, Gina Moape, Dmitrij Feller) and there's some super interesting insights from the visualisation:
➡ Catalan (ca) now has more data in Common Voice than English (en) (!)
➡ The language with the highest average audio utterance duration at nearly 7 seconds is Icelandic (is). Perhaps Icelandic words are longer? I suspect so!
➡ Spanish (es), Bangla (Bengali) (bn), Mandarin Chinese (zh-CN) and Japanese (ja) all have a lot of recorded utterances that have not yet been validated. Albanian (sq) has the highest percentage of validated utterances, followed closely by Erzya / Arisa (myv).
➡ Votic (vot) has the highest percentage of invalidated utterances, but with 76% of utterances invalidated, I wonder if this language has been the target of deliberate invalidation activity (invalidating valid sentences, or recording sentences to be deliberately invalid) given the geopolitical instability in Russia currently.
See the visualisation here and let me know your thoughts below!
➡ https://observablehq.com/@kathyreid/mozilla-common-voice-v17-dataset-metadata-coverage
#linguistics #languages #data #VoiceAI #VoiceData #SpeechAI #SpeechData #DataViz