
Each quarter, when the new @mozilla #CommonVoice #dataset is released, I do a #dataviz using @observablehq of its #metadata coverage, across all 100+ languages, based on the JSON summary that is part of the release.

Some of my observations from the v18 release are:

💡 #Catalan (ca) now has a larger dataset than English, based on the number of audio recordings (including validated and yet-to-be-validated recordings). It’s also an interesting dataset because the number of recordings per unique contributor is relatively low (around 80). This means it’s likely to have a high diversity of speakers in the dataset, which is useful for building #ASR models that generalise well to many speakers.

Catalan also appears to have the highest percentage of audio recordings by older speakers - e.g. speakers in their forties, fifties and older. Again, this highlights the diversity of speakers in the Catalan dataset.

💡 Although it’s very early to see any trends from the decision by Common Voice to expand the range of options for gender identity, we are starting to see some data being tagged with the new options that are available. For example, in #Uyghur (ug), we now have data tagged as “do not wish to say”. Without more evidence, I don’t want to draw connections between the geopolitical situation in that area and the desire of data contributors not to provide demographic data which may in some way identify them, but I think it’s telling that the first use of these expanded metadata categories appears in a language that is spoken in a contested geography.

💡 Similarly, it’s very early to identify trends in sentence domain classification, as most of the sentences that do have a domain tag are labelled “general”, although “health_care” sentences are occurring frequently in languages such as #Albanian (sq).

💡 #Bangla (Bengali) (bn) continues to have a very large number of yet-to-be-validated audio recordings. Because of this, the train split for Bangla is quite small.

💡 #Dholuo (luo), a language spoken in Kenya and Tanzania, is an outlier in terms of the number of distinct data contributors to the dataset - this language has a very high average number of contributions per contributor. This is often seen in languages that are new to Common Voice, before they have been able to recruit more contributors. Dholuo has nearly 5 million speakers.

💡 The language with the highest average utterance duration is by far #Icelandic (is) at over 7 seconds. This may be because Icelandic has many words with several syllables, which take longer to pronounce. Consider "the cat sat on the mat" in English, cf "kötturinn sat á mottunni" in Icelandic.
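
The recordings-per-contributor comparisons above (Catalan at around 80, Dholuo much higher) come down to dividing a language's clip count by its number of distinct contributors. Here is a minimal sketch of that calculation in Python. The field names `locales`, `clips` and `users` are my assumptions about the shape of the release JSON, the locale codes `xx` and `yy` are placeholders, and the numbers are made up for illustration - they are not v18 figures.

```python
# Illustrative data only: NOT real Common Voice figures.
# The keys "locales", "clips" and "users" are assumed field names.
sample_summary = {
    "locales": {
        "xx": {"clips": 8000, "users": 100},  # many contributors, few clips each
        "yy": {"clips": 9000, "users": 10},   # few contributors, many clips each
    }
}


def clips_per_contributor(summary):
    """Return {locale: average number of clips per distinct contributor}."""
    result = {}
    for locale, stats in summary["locales"].items():
        users = stats.get("users", 0)
        if users:  # skip locales with no contributors to avoid dividing by zero
            result[locale] = stats["clips"] / users
    return result


ratios = clips_per_contributor(sample_summary)

# List locales from the highest to the lowest average, so outliers come first.
for locale, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{locale}: {ratio:.1f} clips per contributor")
```

A low ratio suggests many different voices in the data, while a very high ratio suggests a small group of contributors has recorded most of the clips.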

Big thanks to all data contributors in this release for your donated utterances, and to Dmitrij Feller, @jessie, Gina Moape, EM Lewis-Jong and the team for all your efforts.

What are your thoughts? What conclusions do you draw?

https://observablehq.com/@kathyreid/mozilla-common-voice-v18-dataset-metadata-coverage

Data visualisation of v18 of the Common Voice dataset with the heading "Sentence domain stacked to 100% to better show low-resource languages".
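
"Stacked to 100%" here means converting each language's raw counts into shares of that language's own total, so a language with a few hundred sentences is as readable on the chart as one with millions. A rough sketch of that normalisation in Python follows. The category names and counts are made up for illustration, and `to_percentages` is a hypothetical helper, not something taken from the notebook.

```python
def to_percentages(counts):
    """Convert raw category counts into percentages that sum to 100."""
    total = sum(counts.values())
    if total == 0:
        # Nothing recorded for this language: report every category as 0%.
        return {category: 0.0 for category in counts}
    return {category: 100.0 * n / total for category, n in counts.items()}


# Illustrative counts only: NOT real Common Voice figures.
sample_domains = {"general": 30, "health_care": 10, "untagged": 60}

shares = to_percentages(sample_domains)
for category, share in shares.items():
    print(f"{category}: {share:.1f}%")
```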

https://archive.org/details/nyangi-gi-otis

Nyangi gi Otis by Asenath Bole Odaga

Topics
#Dholuo, #kitabu, #sigendini, #sigendiniLuo

Kisumu : Lake Publishers & Enterprises Ltd.

https://archive.org/details/kisera

Kisera by Asenath Bole Odaga

Topics
#Dholuo, #kitabu, #kitepe, #buk, #buge

"Kisera en kitabu ma wuoyo kuom nyako midendoni Limbe gi ngimane chakre ka ne en nyathi koda kuom ji mamoko mathoth. Omiyo kitabuni chalo piny, opong' gi ji. Oting'o Achienge gi Selina min koda Kala wuon mare gi nyithindgi te. Kendo oting'o joma osomo man gi barupe mabeyo, to onge tich. Kanyo bende ema iyude Achwaka gi wuode Olalna e lum, jamoko tho gotieno. Yawa, kuom adiera, kitabuni mit kendo kichako some to ok idwar kete piny nyaka itieke. Ondikre achana ndi."

https://archive.org/details/luo-sayings

Luo Proverbs and Sayings by Asenath Bole Odaga

Topics
#Dholuo, #Ngeche, #NgecheLuo, #sayings, #proverbs, #tonguetwisters, #riddles, #wechemawachotek

"The Luo use Ngeche and other sayings to demonstrate their knowledge and skill in expressing themselves in their language. Ngeche and sayings are adaptable and have many functions. For instance, they may be used to chide someone, to answer a question, to illustrate, clarify or to drive home a point made on an issue. They are particularly valuable in the education of the youth by adults on the use of language, that is how to apply them during a conversation. Ngeche in sayings, proverbs, tongue twisters and even some narratives.

Ngeche Luo has been authored by Asenath Bole Odaga. Bole who writes in her mother tongue-Luo, as well as in English, has written over fifty books for adults, children, and general readership. Her latest publications are Dholuo-English Dictionary and Nyangi gi Otis."

https://archive.org/details/lebd2

Luo-English Biological Dictionary, Second Edition by John O. Kokwaro; Timothy Johns

Topics
#Dholuo, #biologicaldictionary, #biologicaldictionaries, #biology, #ecology, #zoology, #botany, #lakevictoria, #NamLolwe, #ethnobotany, #EastAfrica, #AfricanGreatLakes, #ecology, #Luoland, #Kavirondo, #Joluo, #Uganda, #Kenya, #Tanzania, #piny, #ngima, #ngeyo

"This Second Edition of the Luo-English Biological Dictionary contains an extensive coverage of the flora and fauna of the Lake Victoria region of East Africa. The region is mainly occupied by the Luo community. It comprises the Luo ethnosystematics and ethnobiological account including indigenous foods, traditional medicines, ritual and other cultural uses of plants. The dictionary is a result of over 20 years of research carried out by the authors."

#Dholuo
