
#scraper

@francks<p>Our small team vs millions of bots</p><p><a href="https://www.fsf.org/blogs/sysadmin/our-small-team-vs-millions-of-bots" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://www.</span><span class="ellipsis">fsf.org/blogs/sysadmin/our-sma</span><span class="invisible">ll-team-vs-millions-of-bots</span></a></p><p><a href="https://mstdn.fr/tags/fsf" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>fsf</span></a> <a href="https://mstdn.fr/tags/freesoftware" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>freesoftware</span></a> <a href="https://mstdn.fr/tags/ddos" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>ddos</span></a> <a href="https://mstdn.fr/tags/javascrip" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>javascrip</span></a> <a href="https://mstdn.fr/tags/anubis" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>anubis</span></a> <a href="https://mstdn.fr/tags/botnet" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>botnet</span></a> <a href="https://mstdn.fr/tags/llm" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>llm</span></a> <a href="https://mstdn.fr/tags/scraper" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>scraper</span></a> <a href="https://mstdn.fr/tags/crawler" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>crawler</span></a> <a href="https://mstdn.fr/tags/proprietarysoftware" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>proprietarysoftware</span></a> <a href="https://mstdn.fr/tags/malware" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>malware</span></a></p>
Tom :damnified:<p>Anubis: self hostable scraper defense software | Anubis</p><p><a href="https://anubis.techaro.lol/" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="">anubis.techaro.lol/</span><span class="invisible"></span></a></p><p>Discovered via <span class="h-card" translate="no"><a href="https://poly.cybre.city/users/Polychrome" class="u-url mention" rel="nofollow noopener" target="_blank">@<span>Polychrome</span></a></span> </p><p><a href="https://metalhead.club/tags/ai" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>ai</span></a> <a href="https://metalhead.club/tags/bots" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>bots</span></a> <a href="https://metalhead.club/tags/scraper" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>scraper</span></a></p>
Kevin Karhan :verified:<p><span class="h-card" translate="no"><a href="https://climatejustice.social/@stefanmuelller" class="u-url mention" rel="nofollow noopener" target="_blank">@<span>stefanmuelller</span></a></span> the <a href="https://www.robotstxt.org/" rel="nofollow noopener" target="_blank"><code>robots.txt</code></a> is a <em>request</em>, NOT a <em>legally binding opt-out</em>!</p><ul><li>If you know of any <a href="https://infosec.space/tags/Scraper" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Scraper</span></a> or other crap, let me know and I'll add it to the <a href="https://github.com/greyhat-academy/lists.d/blob/main/scrapers.ipv4.block.list.tsv" rel="nofollow noopener" target="_blank">public blocklist</a> that I maintain...</li></ul>
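The point that robots.txt is only a request can be illustrated with a minimal sketch using Python's standard `urllib.robotparser`. The crawler name `ExampleBot`, the URL, and the helper `is_allowed` are hypothetical and chosen only for illustration. The check runs entirely on the crawler's side, so a crawler that never performs it is not stopped by anything in the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks one crawler to stay away entirely.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    This is a voluntary, client-side check: the server cannot enforce it.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A polite crawler asks first and respects the answer.
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.org/page"))
    # A crawler not named in the file is not restricted by it.
    print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.org/page"))
```

A crawler that ignores the file can only be stopped by server-side measures such as user-agent or IP filtering, which is where a blocklist comes in.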
🌈 vanta rainbow black 🌈<p>found another scraper indexer thingy</p><p><a href="https://mastogizmos.com" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="">mastogizmos.com</span><span class="invisible"></span></a></p><p><a href="https://cyberpunk.lol/tags/scraper" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>scraper</span></a> <a href="https://cyberpunk.lol/tags/indexer" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>indexer</span></a> <a href="https://cyberpunk.lol/tags/fediblock" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>fediblock</span></a> <a href="https://cyberpunk.lol/tags/FediAdmin" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>FediAdmin</span></a></p>
🌈 vanta rainbow black 🌈<p>think i might've maybe found another scraper indexer thingy but i'm not sure</p><p>if someone else who knows their shit better could take a look and lmk that'd be great</p><p><a href="https://mastogizmos.com" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="">mastogizmos.com</span><span class="invisible"></span></a></p><p><a href="https://cyberpunk.lol/tags/scraper" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>scraper</span></a> <a href="https://cyberpunk.lol/tags/indexer" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>indexer</span></a></p>
Webrocker<p><strong>A(I)ll nuts</strong></p><p>Over on the blog of the Uberspace operators there is a very interesting article about what the bots of the AI companies (which by now have apparently gone completely off the rails) are causing, with no regard for the damage:</p><blockquote><p>(…) In summary, according to our observations, roughly 30%-50% of all requests for small sites are now generated by bots. For large sites this figure even fluctuates between 20% and 75%. In our view, and given that robots.txt is being ignored, a point has now been reached at which this behavior by bots is no longer acceptable and harms our operations.</p> <a class="u-bookmark-of" href="https://blog.uberspace.de/2024/08/bad-robots/" rel="nofollow noopener" target="_blank">blog.uberspace.de</a> </blockquote><p>On my irregular excursions into the server logs of my own sites, and of my clients' sites too, I see exactly the same thing: bot traffic has increased disproportionately, and it is sometimes really brutal how frequently, and from how many changing IPs, these things hammer the site. &gt;:-(</p> <p><a rel="nofollow noopener" class="hashtag u-tag u-category" href="https://webrocker.de/tag/bots/" target="_blank">#Bots</a> <a rel="nofollow noopener" class="hashtag u-tag u-category" href="https://webrocker.de/tag/digitaleselbstverteidigung/" target="_blank">#DigitaleSelbstVerteidigung</a> <a rel="nofollow noopener" class="hashtag u-tag u-category" href="https://webrocker.de/tag/robotstxt/" target="_blank">#robotsTxt</a> <a rel="nofollow noopener" class="hashtag u-tag u-category" href="https://webrocker.de/tag/scraper/" target="_blank">#Scraper</a> <a rel="nofollow noopener" class="hashtag u-tag u-category" href="https://webrocker.de/tag/wildwest/" target="_blank">#WildWest</a></p><p><a href="https://webrocker.de/?p=29216" class="" rel="nofollow noopener" target="_blank">https://webrocker.de/?p=29216</a></p>
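A rough way to estimate a bot share like the one quoted above from your own logs is to count how many requests carry a bot-like User-Agent. The sketch below is only a heuristic: the function name `bot_share` and the marker substrings are assumptions, and crawlers that send a browser-like User-Agent are not counted.

```python
from typing import Iterable, Sequence


def bot_share(
    user_agents: Iterable[str],
    markers: Sequence[str] = ("bot", "crawler", "spider"),  # assumed heuristic
) -> float:
    """Return the fraction of requests whose User-Agent contains a marker."""
    total = 0
    bots = 0
    for user_agent in user_agents:
        total += 1
        lowered = user_agent.lower()
        if any(marker in lowered for marker in markers):
            bots += 1
    # Avoid dividing by zero for an empty log.
    return bots / total if total else 0.0


if __name__ == "__main__":
    # Made-up sample User-Agent strings, one per request.
    sample = [
        "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0",
        "Mozilla/5.0 (compatible) ExampleBot/1.0",
        "Mozilla/5.0 (Macintosh) Safari/605.1.15",
        "example-crawler/2.3",
    ]
    print(f"{bot_share(sample):.0%} of requests look like bots")
```

Extracting the User-Agent field from real log lines depends on the server's log format and is left out here.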
Seirdy<p>Another new LLM scraper just dropped: AI2 Bot.</p><p><a href="https://www.allenai.org/crawler" rel="nofollow noopener" target="_blank">First-party documentation</a> does not list any way to opt-out except filtering the user-agent on your server/firewall. The docs list the following User-Agent to filter:</p><pre><code>Mozilla/5.0 (compatible) AI2Bot (+<a href="https://www.allenai.org/crawler" rel="nofollow noopener" target="_blank">https://www.allenai.org/crawler</a>) </code></pre><p>My server logs contained the following string:</p><pre><code>Mozilla/5.0 (compatible) Ai2Bot-Dolma (+<a href="https://www.allenai.org/crawler" rel="nofollow noopener" target="_blank">https://www.allenai.org/crawler</a>) </code></pre><p>That appears to be for <a href="https://allenai.org/dolma" rel="nofollow noopener" target="_blank">Ai2’s Dolma product</a>.</p><p>159 hits came from <code>174.174.51.252</code>, a Comcast-owned IP in Oregon.</p><p>I recommend adding <code>ai2bot</code> to your server’s user-agent matching rules if you don’t want to be in the Dolma dataset; unlike Common Crawl, this seems tailored specifically for training LLMs with few other users.</p><p><a class="hashtag" href="https://pleroma.envs.net/tag/scraper" rel="nofollow noopener" target="_blank">#Scraper</a></p>
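The user-agent matching recommended above can be sketched as a case-insensitive substring check. This is a minimal illustration and not any particular server's configuration syntax: the function name `is_blocked_user_agent` and the `BLOCKED_SUBSTRINGS` tuple are assumptions. Matching on the lowercase substring `ai2bot` covers both the documented `AI2Bot` string and the `Ai2Bot-Dolma` variant seen in the logs.

```python
from typing import Sequence

# Lowercase substrings to reject; "ai2bot" is the one suggested in the post.
BLOCKED_SUBSTRINGS: Sequence[str] = ("ai2bot",)


def is_blocked_user_agent(
    user_agent: str, blocked: Sequence[str] = BLOCKED_SUBSTRINGS
) -> bool:
    """Return True if the User-Agent contains any blocked substring.

    The comparison is case-insensitive so that differently capitalized
    variants of the same crawler name are all caught.
    """
    lowered = user_agent.lower()
    return any(substring in lowered for substring in blocked)


if __name__ == "__main__":
    documented = "Mozilla/5.0 (compatible) AI2Bot (+https://www.allenai.org/crawler)"
    observed = "Mozilla/5.0 (compatible) Ai2Bot-Dolma (+https://www.allenai.org/crawler)"
    for user_agent in (documented, observed):
        print(is_blocked_user_agent(user_agent), user_agent)
```

In practice the same rule would usually live in the web server or firewall configuration, returning an error response for matching requests.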
🌈 vanta rainbow black 🌈<p>holy shit WHAT</p><p>that is certainly A Way to go about apologizing lol</p><p><a href="https://cyberpunk.lol/tags/scraper" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>scraper</span></a> <a href="https://cyberpunk.lol/tags/indexer" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>indexer</span></a></p>
Austin Huang ❤️<p>With regard to the utoots.com <a href="https://mstdn.party/tags/scraper" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>scraper</span></a>:<br>1. It currently depends on a Mastodon instance flashist[.]video; it is recommended to block the instance. flashist.(me|health) and previously flashist.(org|vip|live) are also operated by the same person. Ban evasion is to be expected.<br>2. I wrote a GitHub issue about it, archived at <a href="https://archive.ph/8ynKh" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="">archive.ph/8ynKh</span><span class="invisible"></span></a>. However he has chosen to cover up his GitHub profile instead.</p><p>Update: <a href="https://cyberpunk.lol/@vantablack/112849043193285926" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">cyberpunk.lol/@vantablack/1128</span><span class="invisible">49043193285926</span></a> (tldr: it's gone)</p><p><a href="https://mstdn.party/tags/FediBlock" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>FediBlock</span></a> <a href="https://mstdn.party/tags/MastoAdmin" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>MastoAdmin</span></a> <a href="https://mstdn.party/tags/FediAdmin" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>FediAdmin</span></a></p>
🌈 vanta rainbow black 🌈<p>okay yeah <a href="https://utoots.com" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="">utoots.com</span><span class="invisible"></span></a> is DEFINITELY a scraper</p><p>i've updated the original post, making a reply too since edits don't always federate cleanly</p><p><a href="https://cyberpunk.lol/tags/scraper" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>scraper</span></a> <a href="https://cyberpunk.lol/tags/indexer" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>indexer</span></a> <a href="https://cyberpunk.lol/tags/fediblock" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>fediblock</span></a> <a href="https://cyberpunk.lol/tags/FediAdmin" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>FediAdmin</span></a></p>
🌈 vanta rainbow black 🌈<p>just found another scraper indexer thingy</p><p><a href="https://utoots.com" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="">utoots.com</span><span class="invisible"></span></a></p><p><a href="https://cyberpunk.lol/tags/scraper" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>scraper</span></a> <a href="https://cyberpunk.lol/tags/indexer" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>indexer</span></a> <a href="https://cyberpunk.lol/tags/fediblock" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>fediblock</span></a> <a href="https://cyberpunk.lol/tags/FediAdmin" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>FediAdmin</span></a></p>
:mima_rule: Mima-sama<p>It's not a <a href="https://makai.chaotic.ninja/tags/scraper" rel="nofollow noopener" target="_blank">#scraper</a>. It is its own <a href="https://makai.chaotic.ninja/tags/ActivityPub" rel="nofollow noopener" target="_blank">#ActivityPub</a> implementation like <a href="https://makai.chaotic.ninja/tags/ContentNation" rel="nofollow noopener" target="_blank">#ContentNation</a> was. Can we please not have another Content Nation incident please. ​:koishtare:​<span><br><br>cc </span><a href="https://social.wedistribute.org/users/damon" class="u-url mention" rel="nofollow noopener" target="_blank">@damon@social.wedistribute.org</a> <a href="https://mastodon.social/@akurilov" class="u-url mention" rel="nofollow noopener" target="_blank">@akurilov@mastodon.social</a> <a href="https://mastodon.social/@awakari_search" class="u-url mention" rel="nofollow noopener" target="_blank">@awakari_search@mastodon.social</a><span><br><br></span><a href="https://makai.chaotic.ninja/tags/Awakari" rel="nofollow noopener" target="_blank">#Awakari</a> <a href="https://makai.chaotic.ninja/tags/fediblockmeta" rel="nofollow noopener" target="_blank">#fediblockmeta</a><span><br><br>RE: </span><a href="https://mastodon.scot/users/gunchleoc/statuses/112311149744744675" rel="nofollow noopener" target="_blank">https://mastodon.scot/users/gunchleoc/statuses/112311149744744675</a></p>
🌈 vanta rainbow black 🌈<p><strong>⚠️ FEDI SCRAPER AND INDEXER ⚠️</strong></p><p>okay according to multiple peeps in the replies of the original post, this is indeed in fact a fedi scraper and indexer i found</p><p><a href="https://fediscanner.info" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="">fediscanner.info</span><span class="invisible"></span></a></p><p><a href="https://cyberpunk.lol/tags/fediblock" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>fediblock</span></a> <a href="https://cyberpunk.lol/tags/scraper" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>scraper</span></a> <a href="https://cyberpunk.lol/tags/indexer" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>indexer</span></a></p>
🌈 vanta rainbow black 🌈<p><a href="https://cyberpunk.lol/tags/scraper" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>scraper</span></a> <a href="https://cyberpunk.lol/tags/indexer" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>indexer</span></a> <a href="https://cyberpunk.lol/tags/maybe" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>maybe</span></a></p>
Ben Keith<p>If I end up putting this <a href="https://newsie.social/tags/scraper" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>scraper</span></a> project online, then it's probably best to use mysql, since that's guaranteed to be available on website hosts, but is there a better recommendation? </p><p><span class="h-card" translate="no"><a href="https://fedi.simonwillison.net/@simon" class="u-url mention" rel="nofollow noopener" target="_blank">@<span>simon</span></a></span> <span class="h-card" translate="no"><a href="https://mastodon.palewi.re/@palewire" class="u-url mention" rel="nofollow noopener" target="_blank">@<span>palewire</span></a></span></p>
Kiwix<p>MWoffliner, the <a href="https://mastodon.social/tags/Mediawiki" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Mediawiki</span></a> <a href="https://mastodon.social/tags/scraper" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>scraper</span></a>, has been released in version 1.12.0. Check it out! <a href="https://github.com/openzim/mwoffliner/releases/tag/v1.12.0" rel="nofollow noopener" target="_blank"><span class="invisible">https://</span><span class="ellipsis">github.com/openzim/mwoffliner/</span><span class="invisible">releases/tag/v1.12.0</span></a></p>
Doug Berch<p>It was time to resharpen my favorite little scraper. <br>After truing the edges with a file, I stone the faces on a Soft Arkansas Stone. Then I stone the edges while holding the faces of the scraper against a small wooden block to keep the edges square to the stone. A light touch with the burnisher turns the edge, and I'm back to work.<br>.<br>.<br><a href="https://mindly.social/tags/scraper" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>scraper</span></a> <a href="https://mindly.social/tags/whetstone" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>whetstone</span></a> <a href="https://mindly.social/tags/burnisher" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>burnisher</span></a> <a href="https://mindly.social/tags/sharpening" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>sharpening</span></a> <a href="https://mindly.social/tags/sharpeningstone" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>sharpeningstone</span></a></p>
Doug Berch<p>Preparing stock for pegheads. This piece of mahogany is tricky to plane flat, but an old Stanley no. 12 scraper came to the rescue.<br>.<br>.<br>.<br><a href="https://mindly.social/tags/dulcimermaker" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>dulcimermaker</span></a> <a href="https://mindly.social/tags/dulcimer" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>dulcimer</span></a> <a href="https://mindly.social/tags/luthier" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>luthier</span></a> <a href="https://mindly.social/tags/scraper" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>scraper</span></a> <a href="https://mindly.social/tags/handtools" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>handtools</span></a></p>
kravse 🍂<p><a href="https://infosec.exchange/tags/twitter" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>twitter</span></a> data export isn't working for me (and many others) so I made a SUPER HACKY follower scraper so you can at least save the handles of folks you follow / follow you. </p><p>If you know JS feel free to contribute. I don't have much of a following so please share this with people if you think they'd find it useful. (I don't need credit). </p><p><a href="https://infosec.exchange/tags/scraper" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>scraper</span></a> <a href="https://infosec.exchange/tags/javascript" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>javascript</span></a> <a href="https://infosec.exchange/tags/scripting" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>scripting</span></a> </p><p><a href="https://github.com/kravse/twitter-follower-scraper" rel="nofollow noopener" target="_blank"><span class="invisible">https://</span><span class="ellipsis">github.com/kravse/twitter-foll</span><span class="invisible">ower-scraper</span></a></p>