Kathy Reid<p>The Mozilla <a href="https://aus.social/tags/CommonVoice" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>CommonVoice</span></a> <a href="https://aus.social/tags/dataset" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>dataset</span></a> v20 was released yesterday - the largest open <a href="https://aus.social/tags/speech" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>speech</span></a> dataset in the world. My <a href="https://aus.social/tags/dataviz" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>dataviz</span></a>, linked below, shows a continuation of patterns seen for some years now: </p><p>➡️ More data has been collected for <a href="https://aus.social/tags/Catalan" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Catalan</span></a> (ca) than for <a href="https://aus.social/tags/English" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>English</span></a> (en) - testament to the independence and language reclamation efforts in Catalunya. Language and cultural transmission are deeply intertwined.<br> <br>➡️ Some of the <a href="https://aus.social/tags/languages" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>languages</span></a> more recently added to Common Voice, like <a href="https://aus.social/tags/Ligurian" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Ligurian</span></a> / <a href="https://aus.social/tags/Genoese" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Genoese</span></a> (lij), have contributions from mostly older speakers, which is unusual in comparison to the rest of the dataset. This may reflect the population that currently speaks those languages - as many regional languages in Italy are in rapid decline. 
</p><p>➡️ Some languages such as Eastern Mari / Meadow Mari (mhr) - a <a href="https://aus.social/tags/Uralic" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Uralic</span></a> language spoken in the Mari-El Republic within Russia - have samples from predominantly female-identifying speakers, again in contrast to the rest of the dataset. Other languages showing this pattern include <a href="https://aus.social/tags/Cantonese" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Cantonese</span></a> (yue), <a href="https://aus.social/tags/Georgian" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Georgian</span></a> (ka), and <a href="https://aus.social/tags/Kalenjin" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Kalenjin</span></a> (kln). </p><p>➡️ A key part of preparing the Common Voice dataset is the validation of utterances to ensure they match their written transcription - which requires at least two validations by separate speakers. Some languages more recently added to Common Voice, such as Erzya (myv) and Moksha (mdf), both Uralic languages, have nearly 100% validation. </p><p>What are your interpretations of the dataset?</p><p><a href="https://observablehq.com/@kathyreid/mozilla-common-voice-v20-dataset-metadata-coverage" rel="nofollow noopener noreferrer" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">observablehq.com/@kathyreid/mo</span><span class="invisible">zilla-common-voice-v20-dataset-metadata-coverage</span></a></p>