lingo.lol is one of the many independent Mastodon servers you can use to participate in the fediverse.
A place for linguists, philologists, and other lovers of languages.


#robotstxt

Continued thread

Here's #Cloudflare's #robots-txt file:

# Cloudflare Managed Robots.txt to block AI related bots.

User-agent: AI2Bot
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: amazon-kendra
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Applebot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: AwarioRssBot
Disallow: /

User-agent: AwarioSmartBot
Disallow: /

User-agent: bigsur.ai
Disallow: /

User-agent: Brightbot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: DigitalOceanGenAICrawler
Disallow: /

User-agent: DuckAssistBot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: FriendlyCrawler
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: iaskspider/2.0
Disallow: /

User-agent: ICC-Crawler
Disallow: /

User-agent: img2dataset
Disallow: /

User-agent: Kangaroo Bot
Disallow: /

User-agent: LinerBot
Disallow: /

User-agent: MachineLearningForPeaceBot
Disallow: /

User-agent: Meltwater
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: meta-externalfetcher
Disallow: /

User-agent: Nicecrawler
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: omgili
Disallow: /

User-agent: omgilibot
Disallow: /

User-agent: PanguBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: PetalBot
Disallow: /

User-agent: PiplBot
Disallow: /

User-agent: QualifiedBot
Disallow: /

User-agent: Scoop.it
Disallow: /

User-agent: Seekr
Disallow: /

User-agent: SemrushBot-OCOB
Disallow: /

User-agent: Sidetrade indexer bot
Disallow: /

User-agent: Timpibot
Disallow: /

User-agent: VelenPublicWebCrawler
Disallow: /

User-agent: Webzio-Extended
Disallow: /

User-agent: YouBot
Disallow: /
Replied in thread

@ErikUden

What is worrying is their self-centered, megalomaniac ego trip, not realizing that they are the remaining world power, armed to the teeth with weapons of every kind, and holding all the private data of the world's population.

That said, bearing in mind that you are apparently in charge of several #mastodon instances in the #fediVerse, it is kind of embarrassing that you cannot fix their #robotsTxt while spending your time talking about other countries' internal affairs.
sry

Continued thread

Extending the meta tags would be fairly straightforward. In addition to the existing "INDEX", "NOINDEX", "FOLLOW", "NOFOLLOW", we could introduce "MODELTRAINING" and "NOMODELTRAINING".
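As a rough sketch, assuming the proposed NOMODELTRAINING value (which no crawler recognizes today), a page that welcomes indexing but opts out of model training might declare:

<meta name="robots" content="index, follow, nomodeltraining">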

Of course, just because there is an RfC does not mean that anyone will follow it. But it would be a start, and something to push for. I would love to hear your opinion.

3/3

Continued thread

This is not an acceptable situation and therefore I propose to extend the robots.txt standard and the corresponding HTML meta tags.

For robots.txt, I see two ways to approach this:

The first option would be to introduce a meta-user-agent that can be used to define rules for all AI bots, e.g. "User-agent: §MODELTRAINING§".

The second option would be a directive like "Crawl-delay" that indicates how to use the data. For example, "Model-training: disallow".
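Purely as an illustration of the two options (neither the §MODELTRAINING§ meta-user-agent nor a Model-training directive exists in any current standard), the first would look like:

User-agent: §MODELTRAINING§
Disallow: /

and the second would sit inside an ordinary group, analogous to Crawl-delay:

User-agent: *
Model-training: disallow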

2/3

I was just confronted with the question of how to prevent a website from being used to train AI models. Using robots.txt, I only see two options, both of which are bad in different ways:

I can either disallow all known AI bots while still being guaranteed to miss some bots.

Or I can disallow all bots and explicitly allow known search engines. This way I will overblock and favour the big search engines over yet unknown challengers.
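For illustration, a minimal sketch of the second option (the crawler names are just examples; an empty Disallow means the named crawler may fetch everything, while every other bot falls through to the catch-all group and is blocked):

User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: *
Disallow: /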

1/3

A(I)ll nuts

Over on the blog of the Uberspace operators there is a very interesting article about the damage that the (by now apparently completely unhinged) bots of the AI companies are causing, with no regard for losses:

(…) In summary, based on our observations, around 30-50 % of all requests to small sites are now generated by bots. For large sites this figure even fluctuates between 20 % and 75 %. In our view, and given that robots.txt is being ignored, we have now reached a point where this bot behaviour is no longer acceptable and harms our operations.

blog.uberspace.de

On my irregular excursions into the server logs of my own sites, and also of my clients' sites, it is just the same: bot accesses have grown disproportionately, and at times it is really brutal how frequently, and from how many rotating IPs, these things hammer the site. >:-(

#Bots #DigitaleSelbstVerteidigung #robotsTxt #Scraper #WildWest

https://webrocker.de/?p=29216

blog.uberspace.de · Bad Robots

Since robots.txt has become outright maladaptive, I'm thinking of putting an actual ToS on my website that visitors must agree to. Would like feedback.

- When anyone but a select few user agents visits any page, they'd first be served this ToS page. Once they agree, they'll not see it again (unless they clear cookies).

- I'd also keep a list of all user agents that've agreed.

- Terms would mostly amount to "may not use any content for training of AI/LLMs"

#WebDev #AI #LLM

@ianb @Meyerweb

I was about to ask, myself, where that bloke was who says "Hah! No." to such questions. (-:

That said, if the argument was (and is) that 35 years was an egregious and unjust suggestion by prosecutors, it is *still* surely egregious and unjust in *other* cases if one contends that they are alike.

*That* said, I wonder how that "Hah! No." bloke answers the question: Is ignoring robots.txt illegal?

(-:

law.stackexchange.com/q/77755/

Law Stack Exchange · Does the robots exclusion standard have any legal weight?
There is a standard through which websites communicate to the web crawlers upon which search engines are based which pages should be indexed and included in the search results. This is called the ...

New scraper just dropped (well, an old scraper was renamed):

Facebook/Meta updated its robots.txt entry for opting out of GenAI data scraping. If you blocked FacebookBot before, you should block meta-externalagent now:

User-Agent: meta-externalagent
Disallow: /

Official references:

Meta for Developers · About FacebookBot - Partage - Documentation

Just throwing out a thought before I do some research on this, but I think robots.txt needs an update.

Ideally I'd like to define an "allow list" that tells web scrapers how my content can be used. E.g.:

- monetizable: false
- fediverse: true
- nonfediverse: false
- ai: false

Etc. And I'd like to apply this to my social media profile and any other web presence, not just my personal website.

With the rise of #AI, #webcrawlers are suddenly controversial
For decades, #robotstxt governed the behavior of web crawlers. But as unscrupulous AI companies seek out more and more data, the basic social contract of the web is falling apart. Called robots.txt and is usually located at yourwebsite.com/robots.txt. That file allows anyone who runs a website — big or small, cooking blog or multinational corporation — to tell the web who’s allowed in and who isn’t. theverge.com/24067997/robots-t

The Verge · With the rise of AI, web crawlers are suddenly controversial

If you have established websites and wish to hide them from search engines such as Google and from other crawlers, place a text file named robots.txt with the following content in the root directory of your website:

User-agent: *
Disallow: /

The first line targets all crawlers, and the second line denies them access to every file under the root directory.

It is convenient because only two lines are needed to reject all crawlers, but leaving it in place blocks every crawler, which may harm search indexing and other aspects of the site. If there are adverse effects, you can narrow down the crawlers to reject and define them individually, as sketched below.
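A minimal sketch of that narrower approach, rejecting a single crawler (GPTBot is used here purely as an example) while leaving everything else open:

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: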

WordPress robots.txt can be edited safely using the following plug-ins without placing the text file directly in the root directory:

By the way, there is something I have been thinking about for a while that I would like to realize, so I will write it down here.

I am talking about Cloudflare Tunnel, a revolutionary mechanism that allows you to securely expose your servers from behind your router rather than from the DMZ, and Cloudflare One/Cloudflare WARP, which combines a gateway for securely accessing that content on the go with a WireGuard client with MDM capabilities to connect to it.

For a long time, I have wanted to use Cloudflare Tunnel and Cloudflare WARP on the same server simultaneously, but that has been challenging to achieve. Sometimes, it degrades the Tunnel, or the server is unusable even when the Tunnel port is open.

I did a Google search and found the following exchange on a thread in the Cloudflare Community:

I have ‘degraded status’ when using WARP with Zero Trust

“Just open the specified port,” the Cloudflare staff responded, but that didn’t work for me.

Now, I asked ChatGPT about it.

Using Cloudflare Tunnel and Cloudflare WARP on the same server can indeed be a bit challenging due to potential conflicts in network configurations and routing. However, it’s not impossible to set them up together. Here are some tips and considerations to help you achieve this:

  1. Port Conflicts:
    • Make sure that the ports used by Cloudflare Tunnel and Cloudflare WARP do not conflict. Check the default ports used by each service and configure them accordingly. You might need to adjust the settings for one of the services to use a different set of ports.
  2. Routing Issues:
    • Ensure that there are no conflicting routing rules between the two services. This can be particularly tricky because both services might try to manipulate the server’s network stack. Carefully review the routing tables to avoid conflicts.
  3. Firewall Settings:
    • Check your server’s firewall settings to ensure that it allows traffic for both Cloudflare Tunnel and Cloudflare WARP. If you are using iptables or another firewall tool, make sure that the necessary ports are open and traffic is allowed.
  4. Concurrency and Resources:
    • Consider the resource requirements of both services. Running both services simultaneously may consume a significant amount of system resources (CPU, memory, etc.). Ensure that your server has enough resources to handle the load. You can monitor resource usage using tools like top or htop.
  5. Logging and Debugging:
    • Enable logging for both Cloudflare Tunnel and Cloudflare WARP. Examine the logs for any error messages or warnings that might indicate conflicts or issues. This can help you pinpoint the source of the problem.
  6. Configuration Order:
    • Try starting the services in a specific order. For example, start one service first and ensure it’s working correctly before starting the second one. The order in which services are initialized can sometimes affect their behavior.
  7. Documentation and Support:
    • Refer to the official documentation of both Cloudflare Tunnel and Cloudflare WARP. Check if there are any specific guidelines or recommendations for running them together. Additionally, check the support forums or communities for both services to see if others have encountered similar issues.
  8. Separate Instances:
    • As a last resort, you might consider running Cloudflare Tunnel and Cloudflare WARP on separate instances or virtual machines. This ensures complete isolation and avoids potential conflicts.

Remember to make backups or snapshots of your server before making significant changes, and proceed with caution, especially if the server is in a production environment. If the issues persist, reaching out to Cloudflare support may provide additional assistance tailored to your specific setup.

If you ask a professional engineer, you will get this answer. But it’s the answer for someone who has never touched Cloudflare.

Does anyone know of a countermeasure for this “degraded” status?

https://kotaromiyasaka.com/search-engine-rejection-by-robots-txt-and-cloudflare-tunnel-failure/

So the old robots.txt tool from Google is truly gone; it now redirects to the new report. Quick reminder that if you're missing the ability to test rules & URLs, @jammer_volts & I built this tool tamethebots.com/tools/robotstx that uses the official parser and lets you test one or many URLs and export the results.

A tool that feels more familiar to users of the deprecated one is the excellent realrobotstxt.com/ from Will Crichlow.

Merkle's technicalseo.com/tools/robots- allows you to test a URL & its resources. #seo #robotstxt

Tame The Bots · robots.txt Testing Tool
Test a list of URLs against a live or custom robots.txt file.