lingo.lol is one of the many independent Mastodon servers you can use to participate in the fediverse.
A place for linguists, philologists, and other lovers of languages.

#commonvoice

BlaCHp<p>❤️‍🔥A very special session!!❤️‍🔥 On Saturday, 15 March 2025 at 13:30, meet our wonderful speaker <span class="h-card" translate="no"><a href="https://f-social.techtransthai.org/@latenightdef" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@<span>latenightdef</span></a></span>, who will be talking about 🗺️📍Indoor mapping for OpenStreetMap using OsmInEdit and IndoorEqual📍🗺️ <a href="https://www.eventyay.com/e/4c0e0c27/session/9520" rel="nofollow noopener noreferrer" translate="no" target="_blank"><span class="invisible">https://www.</span><span class="ellipsis">eventyay.com/e/4c0e0c27/sessio</span><span class="invisible">n/9520</span></a></p><p>Both the 🎙️Mozilla Common Voice 🎧 team and the speakers are delighted to share new experiences with everyone💞 Don't forget to come and meet us at ✨ FOSSASIA Summit 2025 ✨</p><p>Come and experience something special with us! ❤️‍🔥💞</p><p><a href="https://mstdn.in.th/tags/Fossasia" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Fossasia</span></a> <a href="https://mstdn.in.th/tags/FOSSASIASummit2025" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>FOSSASIASummit2025</span></a> <a href="https://mstdn.in.th/tags/MozillaCommonVoice" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>MozillaCommonVoice</span></a> <a href="https://mstdn.in.th/tags/commonvoice" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>commonvoice</span></a> <a href="https://mstdn.in.th/tags/opensource" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>opensource</span></a></p>
BlaCHp<p>🎙️Announcement 🎧 <br>Everyone is invited to join ✨ FOSSASIA Summit 2025 ✨ from Thursday 13 to Saturday 15 March 2025 at True Digital Park <a href="https://eventyay.com/e/4c0e0c27" rel="nofollow noopener noreferrer" translate="no" target="_blank"><span class="invisible">https://</span><span class="">eventyay.com/e/4c0e0c27</span><span class="invisible"></span></a></p><p>The event includes talks with speakers as well as booths about the 🌟 open source 🌟 community and much more. </p><p>One of them is the ✨ Mozilla Common Voice ✨🎧🎙️ booth, a technology for building publicly accessible voice datasets. The team is proud to present it and looks forward to joining in the activities with everyone😊💕</p><p><a href="https://mstdn.in.th/tags/Fossasia" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Fossasia</span></a> <a href="https://mstdn.in.th/tags/FOSSASIASummit2025" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>FOSSASIASummit2025</span></a> <a href="https://mstdn.in.th/tags/MozillaCommonVoice" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>MozillaCommonVoice</span></a> <a href="https://mstdn.in.th/tags/commonvoice" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>commonvoice</span></a> <a href="https://mstdn.in.th/tags/opensource" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>opensource</span></a></p>
Marko Dimjašević<p>I learned a lot about voice in <a href="https://mamot.fr/tags/HomeAssistant" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>HomeAssistant</span></a> from a Mozilla Developer Community Spotlight: <a href="https://www.youtube.com/watch?v=YP9Dwpk5fF8" rel="nofollow noopener noreferrer" translate="no" target="_blank"><span class="invisible">https://www.</span><span class="ellipsis">youtube.com/watch?v=YP9Dwpk5fF</span><span class="invisible">8</span></a></p><p><span class="h-card" translate="no"><a href="https://infosec.exchange/@morachimo" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@<span>morachimo</span></a></span> , the host, talks to Mike Hansen, senior voice engineer at Nabu Casa, about how Home Assistant uses Mozilla Common Voice in its speech-to-text processing, among others.</p><p><a href="https://mamot.fr/tags/CommonVoice" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>CommonVoice</span></a></p>
Open Source JobHub<p>Featured Job from <span class="h-card" translate="no"><a href="https://fosstodon.org/@fosdem" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@<span>fosdem</span></a></span>: <a href="https://fosstodon.org/tags/Mozilla" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Mozilla</span></a> continues to fight for a healthy Internet. Join the team as a Lead Engineer for the Common Voice project, which aims to shape the future of voice AI. Apply now on <a href="https://fosstodon.org/tags/OSJH" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>OSJH</span></a><br><a href="https://opensourcejobhub.com/job/21769/lead-engineer-common-voice/" rel="nofollow noopener noreferrer" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">opensourcejobhub.com/job/21769</span><span class="invisible">/lead-engineer-common-voice/</span></a><br><a href="https://fosstodon.org/tags/jobs" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>jobs</span></a> <a href="https://fosstodon.org/tags/career" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>career</span></a> <a href="https://fosstodon.org/tags/FOSDEM" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>FOSDEM</span></a> <a href="https://fosstodon.org/tags/OpenSource" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>OpenSource</span></a> <a href="https://fosstodon.org/tags/engineer" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>engineer</span></a> <a href="https://fosstodon.org/tags/FOSS" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>FOSS</span></a> <a href="https://fosstodon.org/tags/RemoteWork" class="mention hashtag" rel="nofollow noopener noreferrer" 
target="_blank">#<span>RemoteWork</span></a> <a href="https://fosstodon.org/tags/CommonVoice" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>CommonVoice</span></a> <a href="https://fosstodon.org/tags/AI" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>AI</span></a></p>
Kathy Reid<p>It's been another big year as I work towards completing my <a href="https://aus.social/tags/dissertation" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>dissertation</span></a> on voice dataset documentation and how it influences how well <a href="https://aus.social/tags/speech" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>speech</span></a> technologies work for all voices at the <a href="https://aus.social/tags/ANU" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>ANU</span></a> School of Cybernetics - with big thanks to my supervisors, Elizabeth Williams, Alexandra Zafiroglu, Jofish Kaye and Paul Wong 黃仲熙. </p><p>I've wrapped up a partnership with Mozilla's <a href="https://aus.social/tags/CommonVoice" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>CommonVoice</span></a> team, which let me explore the #dataset in a lot more detail - big thanks to EM Lewis-Jong, <span class="h-card" translate="no"><a href="https://mastodon.social/@jessie" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@<span>jessie</span></a></span> and Dmitrij Feller in particular. </p><p>It was an incredible honor to keynote <a href="https://aus.social/tags/FF24" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>FF24</span></a> at the National Film and Sound Archive of Australia alongside Peter-Lucas Jones of Te Hiku Media, expertly facilitated by Keir Winesmith - thanks <span class="h-card" translate="no"><a href="https://ausglam.space/@ingridbmason" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@<span>ingridbmason</span></a></span> and team for the opportunity - and stay tuned for a little project we are working on - we know you're all eager for the video of this keynote, but we're adding a little more magic. 
</p><p>I helped out with <span class="h-card" translate="no"><a href="https://fosstodon.org/@everythingopen" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@<span>everythingopen</span></a></span> Media and Comms this year, and am looking forward to speaking in January in Adelaide. </p><p>A huge thanks to my fellow <a href="https://aus.social/tags/PhD" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>PhD</span></a> buddies - Lorenn Ruster, <span class="h-card" translate="no"><a href="https://hci.social/@nedcpr" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@<span>nedcpr</span></a></span>, Glen Berman, Tom Chan, Danny Bettay, Charlotte Bradley, <span class="h-card" translate="no"><a href="https://aus.social/@Amirasadi" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@<span>Amirasadi</span></a></span>, Memunat Ajoke Ibrahim and the later cohorts for all your support, shut up and write sessions and intellectual growth.</p>
Radio Azureus<p>When was the last time that you actually contributed to an open source project?<br> <br>I'm certain that you've heard of Common Voice at Mozilla.</p><p>In case you haven't: the languages that need more data are all of them. So even contributing 15 samples a day does a lot on the whole. </p><p>I had slacked off on my Common Voice contributions, but I'm now picking it up again.</p><p><span class="h-card" translate="no"><a href="https://fosstodon.org/@RL_Dane" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@<span>RL_Dane</span></a></span> </p><p>🖋️ <a href="https://mastodon.social/tags/Mozilla" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Mozilla</span></a> <a href="https://mastodon.social/tags/commonVoice" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>commonVoice</span></a> <a href="https://mastodon.social/tags/OpenSource" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>OpenSource</span></a> <a href="https://mastodon.social/tags/contribute" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>contribute</span></a> <a href="https://mastodon.social/tags/samples" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>samples</span></a> <a href="https://mastodon.social/tags/programming" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>programming</span></a> </p><p><a href="https://commonvoice.mozilla.org" rel="nofollow noopener noreferrer" translate="no" target="_blank"><span class="invisible">https://</span><span class="">commonvoice.mozilla.org</span><span class="invisible"></span></a></p>
Kathy Reid<p>The Mozilla <a href="https://aus.social/tags/CommonVoice" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>CommonVoice</span></a> <a href="https://aus.social/tags/dataset" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>dataset</span></a> v20 was released yesterday - the largest open <a href="https://aus.social/tags/speech" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>speech</span></a> dataset in the world. My <a href="https://aus.social/tags/dataviz" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>dataviz</span></a>, linked below, shows a continuation of patterns seen for some years now: </p><p>➡️ There's more data collected for <a href="https://aus.social/tags/Catalan" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Catalan</span></a> (ca) than for <a href="https://aus.social/tags/English" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>English</span></a> (en) - testament to the independence and language reclamation efforts in Catalunya. Language and cultural transmission are deeply intertwined.<br> <br>➡️ Some of the newer <a href="https://aus.social/tags/languages" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>languages</span></a> to Common Voice, like <a href="https://aus.social/tags/Ligurian" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Ligurian</span></a> / <a href="https://aus.social/tags/Genoese" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Genoese</span></a> (lij) have contributions from mostly older speakers, which is unusual in comparison to the rest of the dataset. This may reflect the population that currently speak those languages - as many regional languages in Italy are in rapid decline. 
</p><p>➡️ Some languages such as Eastern Mari / Meadow Mari (mhr) - a <a href="https://aus.social/tags/Uralic" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Uralic</span></a> language spoken in the Mari-El Republic within Russia - have samples from predominantly female-identifying speakers, again contrasting to the rest of the dataset. Other languages here include <a href="https://aus.social/tags/Cantonese" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Cantonese</span></a> (yue), <a href="https://aus.social/tags/Georgian" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Georgian</span></a> (ka), and <a href="https://aus.social/tags/Kalenjin" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Kalenjin</span></a> (kln). </p><p>➡️ A key part in the preparation of the Common Voice dataset is the validation of utterances to assure they match their written transcription - which requires at least two validations by separate speakers. Some newer languages to Common Voice, such as Erzya (myv) and Moksha (mdf), both Uralic languages, have nearly 100% validation. </p><p>What are your interpretations of the dataset?</p><p><a href="https://observablehq.com/@kathyreid/mozilla-common-voice-v20-dataset-metadata-coverage" rel="nofollow noopener noreferrer" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">observablehq.com/@kathyreid/mo</span><span class="invisible">zilla-common-voice-v20-dataset-metadata-coverage</span></a></p>
Mozilla francophone<p>📢 Researchers have compiled 950,000 hours of open source speech data for the EU's 24 official languages through the MOSEL project. A major initiative for advancing AI language models in Europe, and one that includes data from <a href="https://mamot.fr/tags/CommonVoice" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>CommonVoice</span></a>. <a href="https://the-decoder.com/researchers-collect-950000-hours-of-open-source-speech-data-for-eu-languages/" rel="nofollow noopener noreferrer" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">the-decoder.com/researchers-co</span><span class="invisible">llect-950000-hours-of-open-source-speech-data-for-eu-languages/</span></a></p>
Kathy Reid<p>If you're a <a href="https://aus.social/tags/language" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>language</span></a> nerd like I am, then you won't have missed the <span class="h-card" translate="no"><a href="https://mozilla.social/@mozilla" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@<span>mozilla</span></a></span> <a href="https://aus.social/tags/CommonVoice" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>CommonVoice</span></a> v19 <a href="https://aus.social/tags/speech" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>speech</span></a> <a href="https://aus.social/tags/dataset" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>dataset</span></a> release - which now features 131 languages! Here's my <a href="https://aus.social/tags/dataviz" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>dataviz</span></a>, done in <span class="h-card" translate="no"><a href="https://vis.social/@observablehq" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@<span>observablehq</span></a></span>, of the v19 <a href="https://aus.social/tags/metadata" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>metadata</span></a> coverage. </p><p>I've updated the visualisation this time around with human-readable language names instead of their ISO-639 or BCP-47 language codes to make it easier to read. </p><p>There are some interesting observations: </p><p>▶ Catalan (ca) continues to be the leader in terms of data - speaking volumes about the efforts to revitalise culture and language in Catalunya. 
It's also one of the few languages that has data for all age groups, particularly older speakers - this sort of data is missing for most other languages.</p><p>▶ Kiswahili (sw) is one of the languages where there is more data for female-identifying speakers than for male-identifying speakers ♀ - although Japanese (ja), Western Mari (mrj) and Luganda (lg) do pretty well here, too! </p><p>▶ Sentence domains can now be categorised, and although most new sentences are "general", Albanian (sq) has a lot of sentences related to law and government. </p><p>▶ Tsonga (ts), a Bantu language spoken in Southern Africa, has dethroned Icelandic (is) as the language with the highest average utterance duration. I don't know enough about Tsonga to speculate why - it's a somewhat agglutinative language, but many Tsonga works are generally short.</p><p>▶ Bengali / Bangla (bn) has a significant amount of data that is not yet validated, and therefore does not appear in training / dev / test splits. There is a similar case for many languages new to Common Voice - it takes time to validate.</p><p>▶ The language with the highest number of average contributions per speaker is Taita (dav), a Bantu language from Kenya. </p><p>What do you make of the data visualisation? Are there any other insights you can see?</p><p>Big thanks to the CV team for all their efforts - EM, Jessica Rose, Dmitrij Feller and Justin Grant. </p><p><a href="https://aus.social/tags/linguistics" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>linguistics</span></a> </p><p><a href="https://observablehq.com/@kathyreid/mozilla-common-voice-v19-dataset-metadata-coverage" rel="nofollow noopener noreferrer" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">observablehq.com/@kathyreid/mo</span><span class="invisible">zilla-common-voice-v19-dataset-metadata-coverage</span></a></p>
Mozilla francophone<p>The latest ambition of <a href="https://mamot.fr/tags/CommonVoice" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>CommonVoice</span></a> from <span class="h-card" translate="no"><a href="https://mozilla.social/@mozilla" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@<span>mozilla@mozilla.social</span></a></span>: voice tools that understand natural conversation and everyday language <a href="https://foundation.mozilla.org/en/blog/common-voice-spontaneous-speech/" rel="nofollow noopener noreferrer" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">foundation.mozilla.org/en/blog</span><span class="invisible">/common-voice-spontaneous-speech/</span></a></p>
Kathy Reid<p>Each quarter, when the new <span class="h-card" translate="no"><a href="https://mozilla.social/@mozilla" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@<span>mozilla</span></a></span> <a href="https://aus.social/tags/CommonVoice" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>CommonVoice</span></a> <a href="https://aus.social/tags/dataset" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>dataset</span></a> is released, I do a <a href="https://aus.social/tags/dataviz" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>dataviz</span></a> using <span class="h-card" translate="no"><a href="https://vis.social/@observablehq" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@<span>observablehq</span></a></span> of its <a href="https://aus.social/tags/metadata" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>metadata</span></a> coverage, across all 100+ languages, based on the JSON summary that is part of the release. </p><p>Some of my observations from the v18 release are: </p><p>💡 <a href="https://aus.social/tags/Catalan" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Catalan</span></a> (ca) now has a larger dataset than English, based on the number of audio recordings (including validated and yet-to-be-validated recordings). It’s also an interesting dataset because the number of recordings per unique contributor is relatively low (around 80). This means it’s likely to have a high diversity of speakers in the dataset, which is useful for building <a href="https://aus.social/tags/ASR" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>ASR</span></a> models that generalise well to many speakers. </p><p>Catalan also appears to have the highest percentage of audio recordings by older speakers - e.g. 
speakers in their forties, fifties and older. Again, this highlights the diversity of speakers in the Catalan dataset.</p><p>💡 Although it’s very early to see any trends from the decision by Common Voice to expand the range of options for gender identity, we are starting to see some data being tagged with the new options that are available. For example, in <a href="https://aus.social/tags/Uyghur" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Uyghur</span></a> (ug), we now have data tagged as “do not wish to say”. I don’t want to draw connections between the geopolitical situation in that area and the desire of data contributors not to provide demographic data which may in some way identify them without more evidence, but I think it’s telling that the first use of these expanded metadata categories appears in a language that is spoken in a contested geography.</p><p>💡Similarly, it’s very early to identify trends in sentence domain classification - as most of the sentences that do have a domain tag are labelled “general”, although “health_care” sentences are occurring frequently in languages such as <a href="https://aus.social/tags/Albanian" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Albanian</span></a> (sq).</p><p>💡<a href="https://aus.social/tags/Bangla" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Bangla</span></a> (Bengali) (bn) continues to have a very large number of yet-to-be-validated audio recordings. Due to this, the train split for Bangla is quite small.</p><p>💡<a href="https://aus.social/tags/Dholuo" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Dholuo</span></a> (luo), a language spoken in Kenya and Tanzania, is an outlier in terms of the number of distinct data contributors to the dataset - this language has a very high average number of contributions per contributor. 
This is often seen in languages that are new to Common Voice, before they have been able to recruit more contributors. Dholuo has nearly 5 million speakers.</p><p>💡 The language with the highest average utterance duration is by far <a href="https://aus.social/tags/Icelandic" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Icelandic</span></a> (is) at over 7 seconds. This may be because Icelandic has many words with several syllables, which take longer to pronounce. Consider "the cat sat on the mat" in English, cf "kötturinn sat á mottunni" in Icelandic.</p><p>Big thanks to all data contributors in this release for your donated utterances, and to Dmitrij Feller, <span class="h-card" translate="no"><a href="https://mastodon.social/@jessie" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@<span>jessie</span></a></span>, Gina Moape, EM Lewis-Jong and the team for all your efforts. </p><p>What are your thoughts? What conclusions do you draw? </p><p><a href="https://observablehq.com/@kathyreid/mozilla-common-voice-v18-dataset-metadata-coverage" rel="nofollow noopener noreferrer" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">observablehq.com/@kathyreid/mo</span><span class="invisible">zilla-common-voice-v18-dataset-metadata-coverage</span></a></p>
Kathy Reid<p>Delighted to be able to publicise a paper that was presented at the @ALTAnlp 2023 Workshop at the end of last year, co-authored with my <a href="https://aus.social/tags/PhD" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>PhD</span></a> supervisor, Associate Professor @eltwilliams, and written as part of my research at <a href="https://aus.social/tags/ANU" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>ANU</span></a> School of Cybernetics. </p><p>Titled "Right the docs: Characterising voice dataset documentation practices used in machine learning", it combines both exploratory interviews and documentation analysis to characterise how large voice datasets - e.g. <a href="https://aus.social/tags/LibriSpeech" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>LibriSpeech</span></a>, <span class="h-card" translate="no"><a href="https://mozilla.social/@mozilla" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@<span>mozilla</span></a></span>'s <a href="https://aus.social/tags/CommonVoice" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>CommonVoice</span></a>, and several others, document their <a href="https://aus.social/tags/metadata" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>metadata</span></a>. 
</p><p>Unsurprisingly, it finds that the <a href="https://aus.social/tags/dataset" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>dataset</span></a> <a href="https://aus.social/tags/documentation" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>documentation</span></a> practices seen currently do not meet the needs of the <a href="https://aus.social/tags/ML" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>ML</span></a> practitioners who use these datasets.</p><p>We show, once again, in the words of Nithya Sambasivan - "everyone wants to do the model work, but nobody wants to do the data work" ...</p><p><a href="https://aclanthology.org/2023.alta-1.6/" rel="nofollow noopener noreferrer" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">aclanthology.org/2023.alta-1.6</span><span class="invisible">/</span></a></p><p><a href="https://aus.social/tags/RightTheDocs" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>RightTheDocs</span></a> <a href="https://aus.social/tags/WriteTheDocs" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>WriteTheDocs</span></a></p><p>Citation: </p><p>Reid, K., Williams, E.T., 2023. Right the docs: Characterising voice dataset documentation practices used in machine learning, in: Muresan, S., Chen, V., Casey, K., David, V., Nina, D., Koji, I., Erik, E., Stefan, U. (Eds.), Proceedings of the 21st Annual Workshop of the Australasian Language Technology Association. Association for Computational Linguistics, Melbourne, Australia, pp. 51–66.</p>
Kathy Reid<p>For the past couple of years, as each new <span class="h-card" translate="no"><a href="https://mozilla.social/@mozilla" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@<span>mozilla</span></a></span> <a href="https://aus.social/tags/CommonVoice" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>CommonVoice</span></a> dataset of <a href="https://aus.social/tags/voice" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>voice</span></a> <a href="https://aus.social/tags/data" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>data</span></a> is released, I've been using <span class="h-card" translate="no"><a href="https://vis.social/@observablehq" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@<span>observablehq</span></a></span> to visualise the <a href="https://aus.social/tags/metadata" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>metadata</span></a> coverage across the 100+ languages in the dataset. </p><p>Version 17 was released yesterday (big ups to the team - EM Lewis-Jong, <span class="h-card" translate="no"><a href="https://mastodon.social/@jessie" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@<span>jessie</span></a></span>, Gina Moape, Dmitrij Feller) and there's some super interesting insights from the visualisation: </p><p>➡ Catalan (ca) now has more data in Common Voice than English (en) (!)</p><p>➡ The language with the highest average audio utterance duration at nearly 7 seconds is Icelandic (is). Perhaps Icelandic words are longer? I suspect so!</p><p>➡ Spanish (es), Bangla (Bengali) (bn), Mandarin Chinese (zh-CN) and Japanese (ja) all have a lot of recorded utterances that have not yet been validated. 
Albanian (sq) has the highest percentage of validated utterances, followed closely by Erzya / Arisa (myv).</p><p>➡ Votic (vot) has the highest percentage of invalidated utterances, but with 76% of utterances invalidated, I wonder if this language has been the target of deliberate invalidation activity (invalidating valid sentences, or recording sentences to be deliberately invalid) given the geopolitical instability in Russia currently. </p><p>See the visualisation here and let me know your thoughts below!</p><p>➡ <a href="https://observablehq.com/@kathyreid/mozilla-common-voice-v17-dataset-metadata-coverage" rel="nofollow noopener noreferrer" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">observablehq.com/@kathyreid/mo</span><span class="invisible">zilla-common-voice-v17-dataset-metadata-coverage</span></a></p><p><a href="https://aus.social/tags/linguistics" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>linguistics</span></a> <a href="https://aus.social/tags/languages" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>languages</span></a> <a href="https://aus.social/tags/data" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>data</span></a> <a href="https://aus.social/tags/VoiceAI" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>VoiceAI</span></a> <a href="https://aus.social/tags/VoiceData" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>VoiceData</span></a> <a href="https://aus.social/tags/SpeechAI" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>SpeechAI</span></a> <a href="https://aus.social/tags/SpeechData" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>SpeechData</span></a> <a href="https://aus.social/tags/DataViz" class="mention hashtag" rel="nofollow noopener noreferrer" 
target="_blank">#<span>DataViz</span></a></p>
JRelland :linux: :flower:<p>I wonder whether there is a recommended chatbot</p><p>based on <a href="https://framapiaf.org/tags/commonvoice" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>commonvoice</span></a> </p><p><span class="h-card" translate="no"><a href="https://mamot.fr/@Mozilla" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@<span>Mozilla</span></a></span></p>
Farooq Karimi Zadeh<p><span class="h-card" translate="no"><a href="https://toot.bldrweb.org/@Myerman" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@<span>Myerman</span></a></span> </p><p>I was reading a book about <a href="https://blackrock.city/tags/AI" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>AI</span></a> (I'm a programmer and researcher). In the introduction it was talking to professors. There is nothing wrong with AI itself. What is wrong is rich companies, and above them rich people, controlling everything, including good <a href="https://blackrock.city/tags/ML" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>ML</span></a> models. That's why projects like <a href="https://blackrock.city/tags/Mozilla" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Mozilla</span></a> <a href="https://blackrock.city/tags/CommonVoice" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>CommonVoice</span></a> are important.</p>
ralf tauscher :FreiburgSocial:<p><span class="h-card" translate="no"><a href="https://mastodon.social/@jessie" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@<span>jessie</span></a></span> i really loved your amazing talk at <a href="https://freiburg.social/tags/fosdem" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>fosdem</span></a> <br>perfect humor balancing between a serious task like <a href="https://freiburg.social/tags/mozilla" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>mozilla</span></a> <a href="https://freiburg.social/tags/commonvoice" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>commonvoice</span></a> and the hype around <a href="https://freiburg.social/tags/AI" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>AI</span></a></p>
wolf of the wisp<p>donate your voice to science? ‘help make voice recognition open and accessible’ <a href="https://commonvoice.mozilla.org/" rel="nofollow noopener noreferrer" target="_blank"><span class="invisible">https://</span><span class="">commonvoice.mozilla.org/</span><span class="invisible"></span></a> <a href="https://thefolklore.cafe/tags/CommonVoice" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>CommonVoice</span></a> <a href="https://thefolklore.cafe/tags/OpenSource" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>OpenSource</span></a> <a href="https://thefolklore.cafe/tags/VoiceRecognition" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>VoiceRecognition</span></a> <a href="https://thefolklore.cafe/tags/Mozilla" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Mozilla</span></a></p>
Kathy Reid<p>Last week, as part of my <a href="https://aus.social/tags/PhD" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>PhD</span></a> program at the <a href="https://aus.social/tags/ANU" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>ANU</span></a> School of <a href="https://aus.social/tags/cybernetics" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>cybernetics</span></a>, I gave my final presentation, which is a summary of my methods and <a href="https://aus.social/tags/research" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>research</span></a> findings. I covered my interview work, the <a href="https://aus.social/tags/dataset" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>dataset</span></a> documentation analysis work I've been doing and my analysis work around <a href="https://aus.social/tags/accents" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>accents</span></a> in <span class="h-card" translate="no"><a href="https://mozilla.social/@mozilla" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@<span>mozilla</span></a></span>'s <a href="https://aus.social/tags/CommonVoice" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>CommonVoice</span></a> platform. </p><p>There were some insightful and thought-provoking questions from my panel and audience members, and of course - so many ideas for future research inquiry! 
</p><p>A huge thanks to my panel, chaired so well by Professor Alexandra Zafiroglu, to Dr Elizabeth Williams, my meticulous, methodical and always-encouraging Primary Supervisor, and to my co-supervisors Dr Jofish Kaye and Dr Paul Wong 黃仲熙 for their deep expertise in <a href="https://aus.social/tags/HCI" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>HCI</span></a> and <a href="https://aus.social/tags/data" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>data</span></a> respectively. </p><p>Similarly, a huge thank you to my <a href="https://aus.social/tags/PhD" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>PhD</span></a> cohort - Charlotte Bradley, Tom Chan, Danny Bettay and Sam Backwell - as well as the other cohorts in the School - for your encouragement and intellectual journeying. </p><p><a href="https://aus.social/tags/PhD" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>PhD</span></a> <a href="https://aus.social/tags/PhDlife" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>PhDlife</span></a> <a href="https://aus.social/tags/cybernetics" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>cybernetics</span></a> <a href="https://aus.social/tags/milestone" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>milestone</span></a> <a href="https://aus.social/tags/ANU" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>ANU</span></a> <a href="https://aus.social/tags/voiceAI" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>voiceAI</span></a> <a href="https://aus.social/tags/speechAI" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>speechAI</span></a> <a href="https://aus.social/tags/ASR" class="mention hashtag" rel="nofollow noopener 
noreferrer" target="_blank">#<span>ASR</span></a> <a href="https://aus.social/tags/SpeechRecognition" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>SpeechRecognition</span></a></p>
Aldatsa :toka:<p>The <a href="https://mastodon.eus/tags/Gaitu" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Gaitu</span></a> campaign has given a big boost to the Basque <a href="https://mastodon.eus/tags/CommonVoice" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>CommonVoice</span></a> project.</p><p>The table in the images lists the languages with the most hours recorded so far.</p><p>Since I gave another similar talk at <a href="https://mastodon.eus/tags/Euskarabildua" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Euskarabildua</span></a> two weeks ago, on 26 October:<br> <br>⬆️ 68 more hours have been recorded in Basque, for a total of 294 now. That is 26 more since last week</p><p>⬆️ There are 1,700 more voices in Basque, for a total of 5,020. That is 500 more since last week<br> <br>⬆️ Basque has climbed six places in the ranking by recorded hours</p>
Aldatsa :toka:<p>This Friday I'll be in <a href="https://mastodon.eus/tags/Gasteiz" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Gasteiz</span></a> talking about <a href="https://mastodon.eus/tags/CommonVoice" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>CommonVoice</span></a> / <a href="https://mastodon.eus/tags/Gaitu" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Gaitu</span></a>:</p><p>📆 Friday, 17 November<br>🕠 17:30<br>📌 Landatxo multipurpose room, Gasteiz</p><p><a href="https://www.vitoria-gasteiz.org/wb021/was/contenidoAction.do?lang=eu&amp;locale=eu&amp;idioma=eu&amp;uid=f4a8e33_18b6983a3ef__179c" rel="nofollow noopener noreferrer" translate="no" target="_blank"><span class="invisible">https://www.</span><span class="ellipsis">vitoria-gasteiz.org/wb021/was/</span><span class="invisible">contenidoAction.do?lang=eu&amp;locale=eu&amp;idioma=eu&amp;uid=f4a8e33_18b6983a3ef__179c</span></a></p>