lingo.lol is one of the many independent Mastodon servers you can use to participate in the fediverse.
A place for linguists, philologists, and other lovers of languages.

#corpus

→ Is a libre #artificial #intelligence possible?
linuxfr.org/news/une-intellige

"Let us pause for a moment and ask ourselves: what is the #source #code of a #neural network? [...] The #GPL provides a definition: the source code is the preferred form of the work for making #modifications. In that sense, the source code of a neural network would be the training #algorithm, the initial neural network, and the #corpus on which the network was trained."


Later today at #CHR2024, we are going to present our work on #Multilingual #Stylometry!

We isolated the influence of #language on #authorship #attribution #accuracy by translating multiple #corpora into each other's languages while keeping #corpus composition stable.
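For readers unfamiliar with the attribution side of this experiment, here is a minimal sketch of Burrows' Delta-style attribution in Python. It is an illustration of the general technique only, not the pipeline from the paper; the tokenised inputs and feature count are hypothetical stand-ins.

```python
# Minimal Burrows' Delta sketch (illustrative; not the paper's exact pipeline).
# Requires at least two candidate authors, each given as a list of tokens.
from collections import Counter
import statistics

def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in one token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: counts[w] / total for w in vocab}

def burrows_delta(candidates, test_tokens, n_mfw=100):
    """candidates: {author: token list}. Lower Delta = more similar style."""
    # 1. The most frequent words across all candidate samples are the features.
    pooled = Counter(t for toks in candidates.values() for t in toks)
    vocab = [w for w, _ in pooled.most_common(n_mfw)]
    # 2. Per-author relative frequencies, then per-word mean and stdev.
    profiles = {a: relative_freqs(toks, vocab) for a, toks in candidates.items()}
    means = {w: statistics.mean(p[w] for p in profiles.values()) for w in vocab}
    sds = {w: statistics.stdev(p[w] for p in profiles.values()) or 1e-9
           for w in vocab}
    # 3. Delta = mean absolute z-score difference between test text and author.
    test = relative_freqs(test_tokens, vocab)
    return {
        a: statistics.mean(
            abs((test[w] - means[w]) / sds[w] - (p[w] - means[w]) / sds[w])
            for w in vocab
        )
        for a, p in profiles.items()
    }
```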

Interactive showcase: showcases.clsinfra.io/stylomet

Full paper: ceur-ws.org/Vol-3834/paper9.pd

This work was developed within the @CLSinfra project in #Trier, #Krakow and #Prague with Artjoms Šeļa, Evgeniia Fileva and Julia Dudar.

Nice outline of the steps used to collect and refine data in "Building a Large Japanese Web Corpus for Large Language Models" (Okazaki et al. 2024)
arxiv.org/html/2404.17733

Trafilatura actually goes beyond text extraction with these additional steps (a minimal usage sketch follows the list):
- Web crawling and downloads from live web pages (i.e. not Common Crawl)
- Language detection with py3langid
- Quality filters with custom configuration
- Deduplication on paragraph (LRU cache) or document level (SimHash)
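Here is a minimal sketch of how those options surface in the Python API, as I understand it. The URL is a placeholder, and the parameter names (target_language, deduplicate) follow the library's documented extract() options; verify them against your installed version.

```python
# Hedged sketch: download a live page with Trafilatura, then extract text
# with language filtering and paragraph-level deduplication enabled.
import trafilatura

url = "https://example.com/article"      # hypothetical page
downloaded = trafilatura.fetch_url(url)  # live download, not Common Crawl
if downloaded is not None:
    text = trafilatura.extract(
        downloaded,
        url=url,
        target_language="ja",  # skip pages not detected as Japanese
        deduplicate=True,      # paragraph-level deduplication
        include_comments=False,
    )
    print(text)
```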

#NLP #NLProc #LLM

Happy to announce that the #corpus of 200 #French #Eighteenth-Century #novels that my colleague Julia Röttgermann has edited in the framework of the #MiMoText project is now also described as a #DataPaper in #JOHD:

Röttgermann, J. (2024) ‘The Collection of Eighteenth-Century French Novels 1751–1800’, Journal of Open Humanities Data, 10(1), p. 31.

Available at: doi.org/10.5334/johd.201

@tcdh @moulin @ClaudiaBamberg @jojoweis

Highlight from our #digital collection:

➡️ "Digital Humanities, Corpus and Language Technology: A look from diverse case studies"

explores the intersection of #technology and the #humanities. The authors provide an overview of how these technologies can enhance research across various disciplines, from #literature to #history to #anthropology.

🔗 doi.org/10.21827/646242d096beb

🚨 New preprint 🚨
"Does corpus size influence normalised frequencies?"

doi.org/10.31219/osf.io/tr8de

It may sound like a silly question, but many #corpus linguistic measures are influenced by corpus size. So we asked ourselves: Does this also hold for normalised #frequencies, a measure that is meant to correct raw frequencies for the size of the underlying corpus?

We approached this by checking the association between lists of normalised frequencies for samples of different sizes.
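For intuition, here is an illustrative sketch of that kind of check (my reconstruction, not the preprint's code): draw samples of different sizes from one corpus, normalise counts per million tokens, and correlate the resulting frequency lists. The corpus file and sample sizes are hypothetical.

```python
# Illustrative sketch (not the preprint's code): compare per-million
# normalised frequencies across two sample-size conditions.
import random
from collections import Counter
from scipy.stats import spearmanr

def normalised_freqs(tokens, vocab, per=1_000_000):
    """Frequency per million tokens for each vocabulary word."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) * per for w in vocab]

corpus = open("corpus.txt", encoding="utf-8").read().split()  # hypothetical file
vocab = [w for w, _ in Counter(corpus).most_common(500)]

small = random.sample(corpus, 10_000)    # small sample
large = random.sample(corpus, 100_000)   # ten times larger

rho, _ = spearmanr(
    normalised_freqs(small, vocab),
    normalised_freqs(large, vocab),
)
print(f"Spearman correlation between size conditions: {rho:.3f}")
```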

[SOLVED] Please help, dear corpus and computational linguistics friends! Is there a #multilingual #model for #TreeTagger, even with a very basic tagset?

I would like to annotate lemma + POS in a #corpus of short #texts in 3-4 European #languages (mainly #German, #English, #French) within #TXM, a process that requires using TreeTagger.

I know I could do that with #spaCy, selecting the right model for each text. But then I need to get those #annotations into shape for import into TXM.
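One possible route for that reshaping step, sketched below: run each text through the matching spaCy model and write word/POS/lemma triples in the vertical, tab-separated layout that TreeTagger produces. The model names and the exact import format TXM expects are assumptions to adapt.

```python
# Hedged sketch: tag with spaCy, emit TreeTagger-style vertical output
# (word<TAB>POS<TAB>lemma) for import into TXM. Model names and the exact
# format TXM expects are assumptions; adjust to your setup.
import spacy

# Hypothetical mapping from language code to an installed spaCy model.
MODELS = {
    "de": "de_core_news_sm",
    "en": "en_core_web_sm",
    "fr": "fr_core_news_sm",
}
loaded = {lang: spacy.load(name) for lang, name in MODELS.items()}

def to_vertical(text, lang):
    """Return TreeTagger-like vertical lines for one short text."""
    doc = loaded[lang](text)
    return "\n".join(
        f"{tok.text}\t{tok.pos_}\t{tok.lemma_}"
        for tok in doc
        if not tok.is_space
    )

print(to_vertical("Les corpus multilingues sont utiles.", "fr"))
```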

#EasyWayOut?