lingo.lol is one of the many independent Mastodon servers you can use to participate in the fediverse.
A place for linguists, philologists, and other lovers of languages.

#aitraining


!!!!! F*ck off Meta !!!!! Meta announced today that it will shortly begin training its AI models on content from adult European users of its social media platforms Facebook and Instagram. The content used for AI training includes adult users' posts and comments, as well as questions and queries from interactions with the Meta AI assistant.

Azernews: Azerbaijan advancing AI projects, including Azerbaijani language database. “Fariz Jafarov, Executive Director of the Center for Analysis and Coordination of the Fourth Industrial Revolution, highlighted a major initiative to create a large Azerbaijani language database for artificial intelligence development, Azernews reports.”

https://rbfirehose.com/2025/03/17/azernews-azerbaijan-advancing-ai-projects-including-azerbaijani-language-database/

ResearchBuzz: Firehose · Azernews: Azerbaijan advancing AI projects, including Azerbaijani language database

"Anyone at an AI company who stops to think for half a second should be able to recognize they have a vampiric relationship with the commons. While they rely on these repositories for their sustenance, their adversarial and disrespectful relationships with creators reduce the incentives for anyone to make their work publicly available going forward (freely licensed or otherwise). They drain resources from maintainers of those common repositories often without any compensation. They reduce the visibility of the original sources, leaving people unaware that they can or should contribute towards maintaining such valuable projects. AI companies should want a thriving open access ecosystem, ensuring that the models they trained on Wikipedia in 2020 can be continually expanded and updated. Even if AI companies don’t care about the benefit to the common good, it shouldn’t be hard for them to understand that by bleeding these projects dry, they are destroying their own food supply.

And yet many AI companies seem to give very little thought to this, seemingly looking only at the months in front of them rather than operating on years-long timescales. (Though perhaps anyone who has observed AI companies’ activities more generally will be unsurprised to see that they do not act as though they believe their businesses will be sustainable on the order of years.)

It would be very wise for these companies to immediately begin prioritizing the ongoing health of the commons, so that they do not wind up strangling their golden goose. It would also be very wise for the rest of us to not rely on AI companies to suddenly, miraculously come to their senses or develop a conscience en masse.

Instead, we must ensure that mechanisms are in place to force AI companies to engage with these repositories on their creators' terms."

citationneeded.news/free-and-o

Citation Needed (Molly White) · “Wait, not like that”: Free and open access in the age of generative AI

#OpenAI declares #AI race “over” if #training on #copyrighted works isn’t fair use

OpenAI is hoping that Donald Trump's AI Action Plan, due out this July, will settle #copyright debates by declaring #AItraining fair use—paving the way for AI companies' unfettered access to training data that OpenAI claims is critical to defeat #China in the AI race.
#fairuse #Trump

arstechnica.com/tech-policy/20

Ars Technica · OpenAI declares AI race “over” if training on copyrighted works isn’t fair use. By Ashley Belanger

🔍 New proposal: A vocabulary for opting out from AI training & text/data mining.

Based on interaction with a broad range of stakeholders, this proposal aims to give creators and other rightholders more control over how their works are used for AI training through practical, machine-readable standards.

📄 Full paper & vocabulary: openfuture.eu/publication/a-vo
#ParadoxOfOpen #AITraining


2/ How does it work?
First, a large, high-performing model is trained. This teacher model produces predictions that contain not only the final classification but also the probability distribution over all classes.
#AITraining
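The teacher's probability distribution over all classes is the key ingredient of knowledge distillation. Here is a minimal sketch of that idea in plain Python (the function names are illustrative, not from any particular framework): the teacher's logits are passed through a temperature-scaled softmax to produce soft targets, and a student would be trained to minimize the KL divergence to them.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax: a higher temperature gives a softer,
    # more informative distribution over classes
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL divergence between the softened teacher and student
    # distributions; zero when the student matches the teacher exactly
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# The teacher is confident in class 0, but its soft targets still encode
# that class 1 is more plausible than class 2, information a hard
# one-hot label would throw away.
teacher = [5.0, 2.0, -1.0]
soft_targets = softmax(teacher, temperature=2.0)
```

The temperature is the design knob: at T=1 this is an ordinary softmax, while higher values flatten the distribution so the student can learn the relative ordering of the wrong classes too.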

"Powerful actors, governments, and corporations are actively shaping narratives about artificial intelligence (AI) to advance competing visions of society and governance. These narratives help establish what publics believe and what should be considered normal or inevitable about AI deployment in their daily lives — from surveillance to automated decision-making. While public messaging frames AI systems as tools for progress and efficiency, these technologies are increasingly deployed to monitor populations and disempower citizens’ political participation in myriad ways. This AI narrative challenge is made more complex by the many different cultural values, agendas, and concepts that influence how AI is discussed globally. Considering these differences is critical in contexts in which data exacerbates inequities, injustice, or nondemocratic governance. As these systems continue to be adopted by governments with histories of repression, it becomes crucial for civil society organizations to understand and counter AI narratives that legitimize undemocratic applications of these tools.

We built on the groundwork laid by the Unfreedom Monitor to conduct our Data Narratives research into data discourse in five countries that face different threats to democracy: Sudan, El Salvador, India, Brazil, and Turkey. To better understand these countries’ relationships to AI, incentives, and public interest protection strategies, it is helpful to contextualize AI as a data narrative. AI governance inherently involves data governance and vice versa. AI systems rely on vast quantities of data for training and operation, while AI systems gain legibility and value as they are widely integrated into everyday functions that then generate the vast quantities of data they require to work."

#AI #AINarratives #AIHype #AITraining #AIGovernance #DataGovernance

globalvoices.org/2024/12/23/ar

Global Voices · Artificial Intelligence Narratives: A Global Voices Report

"On Saturday, Triplegangers CEO Oleksandr Tomchuk was alerted that his company’s e-commerce site was down. It looked to be some kind of distributed denial-of-service attack.

He soon discovered the culprit was a bot from OpenAI that was relentlessly attempting to scrape his entire, enormous site.

“We have over 65,000 products, each product has a page,” Tomchuk told TechCrunch. “Each page has at least three photos.”

OpenAI was sending “tens of thousands” of server requests trying to download all of it, hundreds of thousands of photos, along with their detailed descriptions.

“OpenAI used 600 IPs to scrape data, and we are still analyzing logs from last week, perhaps it’s way more,” he said of the IP addresses the bot used to attempt to consume his site.

“Their crawlers were crushing our site,” he said. “It was basically a DDoS attack.”

Triplegangers’ website is its business. The seven-employee company has spent over a decade assembling what it calls the largest database of “human digital doubles” on the web, meaning 3D image files scanned from actual human models.

It sells the 3D object files, as well as photos — everything from hands to hair, skin, and full bodies — to 3D artists, video game makers, anyone who needs to digitally recreate authentic human characteristics."

techcrunch.com/2025/01/10/how-

TechCrunch · How OpenAI's bot crushed this seven-person company's web site ‘like a DDoS attack’. OpenAI was sending “tens of thousands” of server requests trying to download Triplegangers' entire site, which hosts hundreds of thousands of photos.

"In newly unredacted documents filed with the U.S. District Court for the Northern District of California late Wednesday, plaintiffs in Kadrey v. Meta, who include bestselling authors Sarah Silverman and Ta-Nehisi Coates, recount Meta’s testimony from late last year, during which it was revealed that Zuckerberg approved Meta’s use of a dataset called LibGen for Llama-related training.

LibGen, which describes itself as a “links aggregator,” provides access to copyrighted works from publishers including Cengage Learning, Macmillan Learning, McGraw Hill, and Pearson Education. LibGen has been sued a number of times, ordered to shut down, and fined tens of millions of dollars for copyright infringement.

According to Meta’s testimony, as relayed by plaintiffs’ counsel, Zuckerberg cleared the use of LibGen to train at least one of Meta’s Llama models despite concerns within Meta’s AI exec team and others at the company. The filing quotes Meta employees as referring to LibGen as a “data set we know to be pirated,” and flagging that its use “may undermine [Meta’s] negotiating position with regulators.”"

techcrunch.com/2025/01/09/mark

TechCrunch · Mark Zuckerberg gave Meta's Llama team the OK to train on copyrighted works, filing claims. Meta CEO Mark Zuckerberg gave Meta's Llama team approval to train on copyrighted documents, according to a new court filing.

"AI is all about data. Reams and reams of data are needed to train algorithms to do what we want, and what goes into the AI models determines what comes out. But here’s the problem: AI developers and researchers don’t really know much about the sources of the data they are using. AI’s data collection practices are immature compared with the sophistication of AI model development. Massive data sets often lack clear information about what is in them and where it came from.

The Data Provenance Initiative, a group of over 50 researchers from both academia and industry, wanted to fix that. They wanted to know, very simply: Where does the data to build AI come from? They audited nearly 4,000 public data sets spanning over 600 languages, 67 countries, and three decades. The data came from 800 unique sources and nearly 700 organizations.

Their findings, shared exclusively with MIT Technology Review, show a worrying trend: AI's data practices risk concentrating power overwhelmingly in the hands of a few dominant technology companies."

technologyreview.com/2024/12/1

MIT Technology Review · This is where the data to build AI comes from. By Melissa Heikkilä

"Over the past two years, dozens of other copyright lawsuits against AI companies have been filed at a rapid clip. The plaintiffs include individual authors like Sarah Silverman and Ta-Nehisi Coates, visual artists, media companies like The New York Times, and music-industry giants like Universal Music Group. This wide variety of rights holders are alleging that AI companies have used their work to train what are often highly lucrative and powerful AI models in a manner that is tantamount to theft. AI companies are frequently defending themselves by relying on what’s known as the “fair use” doctrine, arguing that building AI tools should be considered a situation where it’s legal to use copyrighted materials without getting consent or paying compensation to rights holders. (Widely accepted examples of fair use include parody, news reporting, and academic research.) Nearly every major generative AI company has been pulled into this legal fight, including OpenAI, Meta, Microsoft, Google, Anthropic, and Nvidia.

WIRED is keeping close tabs on how each of these lawsuits unfold. We’ve created visualizations to help you track and contextualize which companies and rights holders are involved, where the cases have been filed, what they’re alleging, and everything else you need to know."

wired.com/story/ai-copyright-c

WIRED · Every AI Copyright Lawsuit in the US, Visualized. By Kate Knibbs

"The success of generative AI relies heavily on training on data scraped through extensive crawling of the Internet, a practice that has raised significant copyright, privacy, and ethical concerns. While few measures are designed to resist a resource-rich adversary determined to scrape a site, crawlers can be impacted by a range of existing tools such as robots.txt, NoAI meta tags, and active crawler blocking by reverse proxies.

In this work, we seek to understand the ability and efficacy of today’s networking tools to protect content creators against AI-related crawling. For targeted populations like human artists, do they have the technical knowledge and agency to utilize crawler-blocking tools such as robots.txt, and can such tools be effective? Using large-scale measurements and a targeted user study of 182 professional artists, we find strong demand for tools like robots.txt, but find their use constrained by significant hurdles: limited technical awareness, limited agency in deploying them, and limited efficacy against unresponsive crawlers. We further test and evaluate network-level crawler blocking by reverse proxies, and find that despite very limited deployment today, their reliable and comprehensive blocking of AI crawlers makes them the strongest protection for artists moving forward."
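As a concrete example of the opt-out tools the paper measures, a site owner can ask AI-training crawlers to stay away via robots.txt. The user-agent tokens below are the publicly documented ones for OpenAI (GPTBot), Common Crawl (CCBot), and Google's AI-training opt-out (Google-Extended); as the study notes, compliance is voluntary, so this is advisory only and does nothing against unresponsive crawlers.

```text
# robots.txt: advisory opt-out for known AI-training crawlers

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

This file must be served at the site root (e.g. /robots.txt); reverse-proxy blocking, by contrast, rejects the crawler's requests outright rather than asking politely.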

arxiv.org/html/2411.15091v1#S3

arxiv.org · Somesite I Used To Crawl: Awareness, Agency and Efficacy in Protecting Content Creators From AI Crawlers

"Now that the seal is broken on scraping Bluesky posts into datasets for machine learning, people are trolling users and one-upping each other by making increasingly massive datasets of non-anonymized, full-text Bluesky posts taken directly from the social media platform’s public firehose—including one that contains almost 300 million posts.

Last week, Daniel van Strien, a machine learning librarian at open-source machine learning library platform Hugging Face, released a dataset composed of one million Bluesky posts, including when they were posted and who posted them. Within hours of his first post—shortly after our story about this being the first known, public, non-anonymous dataset of Bluesky posts, and following hundreds of replies from people outraged that their posts were scraped without their permission—van Strien took it down and apologized."

404media.co/bluesky-posts-mach

404 Media · Your Bluesky Posts Are Probably In A Bunch of Datasets Now. After a machine learning librarian released and then deleted a dataset of one million Bluesky posts, several other, even bigger datasets have appeared in its place, including one of almost 300 million non-anonymized posts.

"This paper examines ‘open’ artificial intelligence (AI). Claims about ‘open’ AI often lack precision, frequently eliding scrutiny of substantial industry concentration in large-scale AI development and deployment, and often incorrectly applying understandings of ‘open’ imported from free and open-source software to AI systems. At present, powerful actors are seeking to shape policy using claims that ‘open’ AI is either beneficial to innovation and democracy, on the one hand, or detrimental to safety, on the other. When policy is being shaped, definitions matter. To add clarity to this debate, we examine the basis for claims of openness in AI, and offer a material analysis of what AI is and what ‘openness’ in AI can and cannot provide: examining models, data, labour, frameworks, and computational power. We highlight three main affordances of ‘open’ AI, namely transparency, reusability, and extensibility, and we observe that maximally ‘open’ AI allows some forms of oversight and experimentation on top of existing models. However, we find that openness alone does not perturb the concentration of power in AI. Just as many traditional open-source software projects were co-opted in various ways by large technology companies, we show how rhetoric around ‘open’ AI is frequently wielded in ways that exacerbate rather than reduce concentration of power in the AI sector."

nature.com/articles/s41586-024

Nature · Why ‘open’ AI systems are actually closed, and why this matters. A review of the literature on artificial intelligence systems reveals that ‘open’ AI systems are actually closed, as they are highly dependent on the resources of a few large corporate actors.