I asked ChatGPT about the recent copyright news. It rehashed my latest column and misconstrued the facts. But why was it on my site at all?
https://www.plagiarismtoday.com/2025/07/23/chatgpt-ignores-robots-txt-rehashes-my-column/

Here's #Cloudflare's #robots-txt file:
# Cloudflare Managed Robots.txt to block AI related bots.
User-agent: AI2Bot
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: amazon-kendra
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Applebot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: AwarioRssBot
Disallow: /
User-agent: AwarioSmartBot
Disallow: /
User-agent: bigsur.ai
Disallow: /
User-agent: Brightbot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: DigitalOceanGenAICrawler
Disallow: /
User-agent: DuckAssistBot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: FriendlyCrawler
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: iaskspider/2.0
Disallow: /
User-agent: ICC-Crawler
Disallow: /
User-agent: img2dataset
Disallow: /
User-agent: Kangaroo Bot
Disallow: /
User-agent: LinerBot
Disallow: /
User-agent: MachineLearningForPeaceBot
Disallow: /
User-agent: Meltwater
Disallow: /
User-agent: meta-externalagent
Disallow: /
User-agent: meta-externalfetcher
Disallow: /
User-agent: Nicecrawler
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: omgili
Disallow: /
User-agent: omgilibot
Disallow: /
User-agent: PanguBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Perplexity-User
Disallow: /
User-agent: PetalBot
Disallow: /
User-agent: PiplBot
Disallow: /
User-agent: QualifiedBot
Disallow: /
User-agent: Scoop.it
Disallow: /
User-agent: Seekr
Disallow: /
User-agent: SemrushBot-OCOB
Disallow: /
User-agent: Sidetrade indexer bot
Disallow: /
User-agent: Timpibot
Disallow: /
User-agent: VelenPublicWebCrawler
Disallow: /
User-agent: Webzio-Extended
Disallow: /
User-agent: YouBot
Disallow: /
Hi, got a question.
Is there a standard anti-AI/anti-SEO etc. robots.txt file? Or a trustworthy site that explains how to build one, if a prefab isn't available? Is there anything else I should consider?
Thanks.
What is worrying is their self-centered, megalomaniacal ego trip, not realizing that they are the remaining world power, armed to the teeth with weapons of all kinds and holding all the private data of the world's population.
That said, bearing in mind that you are apparently in charge of several #mastodon instances in the #fediVerse, it is kind of embarrassing that you are not able to fix their #robotsTxt while wasting time talking about other countries' internal affairs.
sry
Extending the meta tags would be fairly straightforward. In addition to the existing "INDEX", "NOINDEX", "FOLLOW", "NOFOLLOW", we could introduce "MODELTRAINING" and "NOMODELTRAINING".
Of course, just because there is an RfC does not mean that anyone will follow it. But it would be a start, and something to push for. I would love to hear your opinion.
3/3
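To make that concrete, a minimal sketch of such a tag, assuming the hypothetical NOMODELTRAINING value from the proposal above were standardized:
<meta name="robots" content="index, follow, nomodeltraining">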
This is not an acceptable situation and therefore I propose to extend the robots.txt standard and the corresponding HTML meta tags.
For robots.txt, I see two ways to approach this (both sketched after this post):
The first option would be to introduce a meta-user-agent that can be used to define rules for all AI bots, e.g. "User-agent: §MODELTRAINING§".
The second option would be a directive like "Crawl-delay" that indicates how the data may be used. For example, "Model-training: disallow".
2/3
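A rough sketch of what the two proposals above could look like in practice; both the §MODELTRAINING§ meta-user-agent and the Model-training directive are hypothetical and not part of any current standard:
# Option 1: hypothetical meta-user-agent matching all AI bots
User-agent: §MODELTRAINING§
Disallow: /
# Option 2: hypothetical usage directive, analogous to Crawl-delay
User-agent: *
Model-training: disallow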
I was just confronted with the question of how to prevent a website from being used to train AI models. Using robots.txt, I only see two options, both of which are bad in different ways:
I can either disallow all known AI bots while still being guaranteed to miss some bots.
Or I can disallow all bots and explicitly allow known search engines (see the sketch after this post). This way I will overblock and favour the big search engines over yet-unknown challengers.
1/3
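For illustration, a minimal robots.txt for that second option, allowing a couple of known search engines while disallowing everyone else (under the Robots Exclusion Protocol, a crawler obeys the group with the most specific User-agent match):
User-agent: Googlebot
Allow: /
User-agent: bingbot
Allow: /
User-agent: *
Disallow: /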
Just went on a robots.txt updating spree and found this API to get a daily updated robots.txt to block genAI scrapers and agents!
https://darkvisitors.com/docs/robots-txt
I might take this as an opportunity to learn how to create a GitHub action usable by others.
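In the meantime, a minimal Python sketch of the daily update; the endpoint, header, and JSON fields follow my reading of the Dark Visitors docs linked above and should be verified there before relying on them:
import os

import requests

# Ask the Dark Visitors API for a freshly generated robots.txt.
# Endpoint and fields per https://darkvisitors.com/docs/robots-txt
# (a sketch based on my reading of the docs, not official sample code).
response = requests.post(
    "https://api.darkvisitors.com/robots-txts",
    headers={"Authorization": f"Bearer {os.environ['DARK_VISITORS_TOKEN']}"},
    json={
        "agent_types": ["AI Data Scraper", "AI Assistant", "AI Search Crawler"],
        "disallow": "/",
    },
    timeout=30,
)
response.raise_for_status()

# Overwrite the served robots.txt with the updated block list.
with open("robots.txt", "w") as f:
    f.write(response.text)
A GitHub Action would essentially just run this on a schedule and commit the result.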
A(I)ll crazy
Over at the blog of the Uberspace operators there is a very interesting article about what the (by now apparently completely haywire) bots of the AI companies are causing, with no regard for collateral damage:
(…) In summary, our observation is that around 30%-50% of all requests to small sites are now generated by bots. For large sites this number even fluctuates between 20% and 75%. In our view, and given that robots.txt is being ignored, a point has now been reached at which this bot behavior is no longer acceptable and harms our operations.
blog.uberspace.de
On my irregular excursions into the server logs of my own sites, and also those of my clients, it is just the same: bot accesses have increased disproportionately, and it is sometimes really brutal how frequently and from how many changing IPs these things hammer a site. >:-(
#Bots #DigitaleSelbstVerteidigung #robotsTxt #Scraper #WildWest
Since robots.txt has become outright maladaptive, I'm thinking of putting an actual ToS on my website that visitors must agree to (rough sketch after the list below). Would like feedback.
- When anyone but a select few user agents visits any page, they'd first be served this ToS page. Once they agree, they'll not see it again (unless they clear cookies).
- I'd also keep a list of all user agents that've agreed.
- Terms would mostly amount to "may not use any content for training of AI/LLMs"
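A rough sketch of how the gate could work, here in Flask with invented names and a placeholder exempt list, purely to illustrate the idea rather than to be a hardened implementation:
from flask import Flask, make_response, request

app = Flask(__name__)

TOS = "By continuing, you agree not to use any content from this site for training AI/LLMs."
EXEMPT_AGENTS = ("lynx",)  # placeholder for the "select few" user agents
agreed_agents = set()  # in practice, persist this somewhere durable


@app.before_request
def require_tos():
    ua = request.headers.get("User-Agent", "").lower()
    if any(agent in ua for agent in EXEMPT_AGENTS):
        return None  # exempt agents skip the gate entirely
    if request.cookies.get("tos_agreed") == "yes" or request.path == "/agree":
        return None  # already agreed, or in the middle of agreeing
    return make_response(f"<p>{TOS}</p><a href='/agree'>I agree</a>")


@app.route("/agree")
def agree():
    agreed_agents.add(request.headers.get("User-Agent", "unknown"))
    resp = make_response('Thanks. <a href="/">Continue</a>')
    resp.set_cookie("tos_agreed", "yes", max_age=60 * 60 * 24 * 365)
    return resp


@app.route("/")
def index():
    return "Regular page content goes here."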
I was about to ask, myself, where that bloke was who says "Hah! No." to such questions. (-:
That said, if the argument was (and is) that 35 years was an egregious and unjust suggestion by prosecutors, it is *still* surely egregious and unjust in *other* cases if one contends that they are alike.
*That* said, I wonder how that "Hah! No." bloke answers the question: Is ignoring robots.txt illegal?
(-:
New scraper just dropped (well, an old scraper was renamed):
Facebook/Meta updated its robots.txt entry for opting out of GenAI data scraping. If you blocked FacebookBot before, you should block meta-externalagent now:
User-Agent: meta-externalagent
Disallow: /
Official references: the FacebookBot page no longer mentions GenAI; the Meta-ExternalAgent page mentions “AI”.
Interesting results, thanks everyone for voting!
I wrote more on this topic here: https://stefanbohacek.com/blog/which-top-sites-block-ai-crawlers/
Interestingly, a few of the top websites actively invite AI crawlers to crawl them.
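For the curious, "inviting" here just means a robots.txt that explicitly allows AI crawlers instead of blocking them, along the lines of:
User-agent: GPTBot
Allow: /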
Just throwing out a thought before I do some research on this, but I think robots.txt needs an update.
Ideally I'd like to define an "allow list" that tells web scrapers how my content can be used. E.g.:
- monetizable: false
- fediverse: true
- nonfediverse: false
- ai: false
Etc. (Rough sketch below.) And I'd like to apply this to my social media profile and any other web presence, not just my personal website.
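As a purely hypothetical sketch (none of these keys exist in any standard today), such a policy file could look like:
# Hypothetical usage policy -- not a recognized standard
monetizable: false
fediverse: true
nonfediverse: false
ai: false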
got robots.txt ? new data scrapers added (008, dcrawl, etc), agents reclassified… #RobotsTXT #DarkVisitors #AipocalypseNow https://darkvisitors.com/agents/
With the rise of #AI, #webcrawlers are suddenly controversial
For decades, #robotstxt governed the behavior of web crawlers. But as unscrupulous AI companies seek out more and more data, the basic social contract of the web is falling apart. The file is called robots.txt and is usually located at yourwebsite.com/robots.txt. It allows anyone who runs a website — big or small, cooking blog or multinational corporation — to tell the web who’s allowed in and who isn’t. https://www.theverge.com/24067997/robots-txt-ai-text-file-web-crawlers-spiders
@paco love the #robotstxt warrior icon
The #textfile that runs the #internet. For decades, #robotstxt governed the behavior of #webcrawlers. But as unscrupulous #AI companies seek out more and more #data, the basic social contract of the web is falling apart. https://www.theverge.com/24067997/robots-txt-ai-text-file-web-crawlers-spiders #enshittification
If you have established Websites and wish to isolate them from search engines such as Google and other crawlers, place a text file named robots.txt with the following wording in the root directory of your Web site:
User-agent: *
Disallow: /
The top line targets all crawlers, and the bottom line disallows all files under the root directory.
It is convenient because only two lines are needed to reject all crawlers, but putting this in place shuts out every crawler, which may harm search indexing and other aspects of the site. If there are adverse effects, you can narrow down the list of crawlers to reject and define them individually.
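For example, to turn away only specific crawlers while leaving everyone else unaffected:
User-agent: GPTBot
Disallow: /

User-agent: Bytespider
Disallow: /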
In WordPress, robots.txt can also be edited safely using plug-ins, without placing the text file directly in the root directory.
By the way, there is something I have been thinking about for a while that I would like to get working, so I will write it here.
I am speaking of Cloudflare Tunnel, a revolutionary mechanism that lets you securely expose your servers from inside your router rather than from the DMZ, and Cloudflare One/Cloudflare WARP, which combines a gateway for securely accessing that content on the go with an MDM-capable WireGuard client to connect to it.
For a long time, I have wanted to use Cloudflare Tunnel and Cloudflare WARP on the same server simultaneously, but that has been challenging to achieve. Sometimes the Tunnel degrades, or the server becomes unusable even when the Tunnel port is open.
I did a Google search and found the following exchange on a thread in the Cloudflare Community:
I have ‘degraded status’ when using WARP with Zero Trust
“Just open the specified port,” the Cloudflare staff responded, but that didn’t work for me.
Now, I asked ChatGPT about it. Its answer: “Using Cloudflare Tunnel and Cloudflare WARP on the same server can indeed be a bit challenging due to potential conflicts in network configurations and routing. However, it’s not impossible to set them up together. Here are some tips and considerations to help you achieve this: (…) top or htop (…) Remember to make backups or snapshots of your server before making significant changes, and proceed with caution, especially if the server is in a production environment. If the issues persist, reaching out to Cloudflare support may provide additional assistance tailored to your specific setup.”
If you ask a professional engineer, you will get this kind of answer. But it is an answer suited to someone who has never touched Cloudflare.
Does anyone know of a countermeasure against this “degraded” status?
https://kotaromiyasaka.com/search-engine-rejection-by-robots-txt-and-cloudflare-tunnel-failure/