lingo.lol is one of the many independent Mastodon servers you can use to participate in the fediverse.
A place for linguists, philologists, and other lovers of languages.

#corpus

→ Is a libre #artificial #intelligence possible?
linuxfr.org/news/une-intellige

"Let us pause for a moment and ask ourselves: what is the #source #code of a #neural network? [...] The #GPL provides a definition: the source code is the preferred form of the work for making #modifications. In that sense, the source code of a neural network would be the training #algorithm, the initial neural network, and the #corpus on which the network was trained."


Later today at #CHR2024, we are going to present our work on #Multilingual #Stylometry!

We isolated the influence of #language on #authorship #attribution #accuracy by translating multiple #corpora into each other's languages while keeping #corpus composition stable.
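For readers unfamiliar with the attribution side of this experiment, here is a minimal sketch of Burrows' Delta-style attribution in Python. It is an illustration of the general technique only, not the pipeline from the paper; the tokenised inputs and feature count are hypothetical stand-ins.

```python
# Minimal Burrows' Delta sketch (illustrative; not the paper's exact pipeline).
# Requires at least two candidate authors, each given as a list of tokens.
from collections import Counter
import statistics

def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in one token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: counts[w] / total for w in vocab}

def burrows_delta(candidates, test_tokens, n_mfw=100):
    """candidates: {author: token list}. Lower Delta = more similar style."""
    # 1. The most frequent words across all candidate samples are the features.
    pooled = Counter(t for toks in candidates.values() for t in toks)
    vocab = [w for w, _ in pooled.most_common(n_mfw)]
    # 2. Per-author relative frequencies, then per-word mean and stdev.
    profiles = {a: relative_freqs(toks, vocab) for a, toks in candidates.items()}
    means = {w: statistics.mean(p[w] for p in profiles.values()) for w in vocab}
    sds = {w: statistics.stdev(p[w] for p in profiles.values()) or 1e-9
           for w in vocab}
    # 3. Delta = mean absolute z-score difference between test text and author.
    test = relative_freqs(test_tokens, vocab)
    return {
        a: statistics.mean(
            abs((test[w] - means[w]) / sds[w] - (p[w] - means[w]) / sds[w])
            for w in vocab
        )
        for a, p in profiles.items()
    }
```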

Interactive showcase: showcases.clsinfra.io/stylomet

Full paper: ceur-ws.org/Vol-3834/paper9.pd

This work was developed within the @CLSinfra project in #Trier, #Krakow and #Prague with Artjoms Šeļa, Evgeniia Fileva and Julia Dudar.

Nice outline of the steps used to collect and refine data in "Building a Large Japanese Web Corpus for Large Language Models" (Okazaki et al. 2024)
arxiv.org/html/2404.17733

Trafilatura actually goes beyond text extraction with these additional steps (a minimal usage sketch follows the list):
- Web crawling and downloads from live web pages (i.e. not Common Crawl)
- Language detection with py3langid
- Quality filters with custom configuration
- Deduplication on paragraph (LRU cache) or document level (SimHash)
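Here is a minimal sketch of how those options surface in the Python API, as I understand it. The URL is a placeholder, and the parameter names (target_language, deduplicate) follow the library's documented extract() options; verify them against your installed version.

```python
# Hedged sketch: download a live page with Trafilatura, then extract text
# with language filtering and paragraph-level deduplication enabled.
import trafilatura

url = "https://example.com/article"      # hypothetical page
downloaded = trafilatura.fetch_url(url)  # live download, not Common Crawl
if downloaded is not None:
    text = trafilatura.extract(
        downloaded,
        url=url,
        target_language="ja",  # skip pages not detected as Japanese
        deduplicate=True,      # paragraph-level deduplication
        include_comments=False,
    )
    print(text)
```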

#NLP #NLProc #LLM

Happy to announce that the #corpus of 200 #French #Eighteenth-Century #novels that my colleague Julia Röttgermann has edited in the framework of the #MiMoText project is now also described as a #DataPaper in #JOHD:

Röttgermann, J. (2024) ‘The Collection of Eighteenth-Century French Novels 1751–1800’, Journal of Open Humanities Data, 10(1), p. 31.

Available at: doi.org/10.5334/johd.201

@tcdh @moulin @ClaudiaBamberg @jojoweis

Highlight from our #digital collection:

➡️ "Digital Humanities, Corpus and Language Technology: A look from diverse case studies"

explores the intersection of #technology and the #humanities. The authors provide an overview of how these technologies can enhance research across various disciplines, from #literature to #history to #anthropology.

🔗 doi.org/10.21827/646242d096beb

🚨 New preprint 🚨
"Does corpus size influence normalised frequencies?"

doi.org/10.31219/osf.io/tr8de

It may sound like a silly question, but many #corpus linguistic measures are influenced by corpus size. So we asked ourselves: Does this also hold for normalised #frequencies, a measure that is meant to correct raw frequencies for the size of the underlying corpus?

We approached this by checking the association between lists of normalised frequencies for samples of different sizes.
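For intuition, here is an illustrative sketch of that kind of check (my reconstruction, not the preprint's code): draw samples of different sizes from one corpus, normalise counts per million tokens, and correlate the resulting frequency lists. The corpus file and sample sizes are hypothetical.

```python
# Illustrative sketch (not the preprint's code): compare per-million
# normalised frequencies across two sample-size conditions.
import random
from collections import Counter
from scipy.stats import spearmanr

def normalised_freqs(tokens, vocab, per=1_000_000):
    """Frequency per million tokens for each vocabulary word."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) * per for w in vocab]

corpus = open("corpus.txt", encoding="utf-8").read().split()  # hypothetical file
vocab = [w for w, _ in Counter(corpus).most_common(500)]

small = random.sample(corpus, 10_000)    # small sample
large = random.sample(corpus, 100_000)   # ten times larger

rho, _ = spearmanr(
    normalised_freqs(small, vocab),
    normalised_freqs(large, vocab),
)
print(f"Spearman correlation between size conditions: {rho:.3f}")
```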

[SOLVED] Please help, dear corpus and computational linguistics friends! Is there a #multilingual #model for #TreeTagger, even with a very basic tagset?

I would like to annotate lemma + POS in a #corpus of short #texts in 3-4 European #languages (mainly #German, #English, #French) within #TXM, a process that requires using TreeTagger.

I know I could do that with #spaCy, selecting the right model for each text. But then I need to get those #annotations into shape for import into TXM.
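One possible route for that reshaping step, sketched below: run each text through the matching spaCy model and write word/POS/lemma triples in the vertical, tab-separated layout that TreeTagger produces. The model names and the exact import format TXM expects are assumptions to adapt.

```python
# Hedged sketch: tag with spaCy, emit TreeTagger-style vertical output
# (word<TAB>POS<TAB>lemma) for import into TXM. Model names and the exact
# format TXM expects are assumptions; adjust to your setup.
import spacy

# Hypothetical mapping from language code to an installed spaCy model.
MODELS = {
    "de": "de_core_news_sm",
    "en": "en_core_web_sm",
    "fr": "fr_core_news_sm",
}
loaded = {lang: spacy.load(name) for lang, name in MODELS.items()}

def to_vertical(text, lang):
    """Return TreeTagger-like vertical lines for one short text."""
    doc = loaded[lang](text)
    return "\n".join(
        f"{tok.text}\t{tok.pos_}\t{tok.lemma_}"
        for tok in doc
        if not tok.is_space
    )

print(to_vertical("Les corpus multilingues sont utiles.", "fr"))
```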

#EasyWayOut?