#webcrawling

Max Resing<p>It looks like LLM-producing companies that are massively <a href="https://infosec.exchange/tags/crawling" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>crawling</span></a> the <a href="https://infosec.exchange/tags/web" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>web</span></a> require the owners of a website to take action to opt out. Although I am not intrinsically against <a href="https://infosec.exchange/tags/generativeai" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>generativeai</span></a> and the acquisition of <a href="https://infosec.exchange/tags/opendata" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>opendata</span></a>, reading about <a href="https://infosec.exchange/tags/cloud" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>cloud</span></a> costs for hobby projects rising by hundreds of dollars is quite concerning. How is it acceptable that hypergiants drive up the costs of tightly budgeted projects through massive spikes in egress traffic and increased processing requirements? Projects that run on a shoestring budget and are operated by volunteers who have dedicated hundreds of hours without any reward other than believing in their mission?</p><p>What concerns me most is that opting out is the default. Are the owners of those projects required to take action? Seriously? 
As an <a href="https://infosec.exchange/tags/operator" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>operator</span></a>, is it really my responsibility to methodically work my way through the crawling documentation of the hundreds of <a href="https://infosec.exchange/tags/LLM" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>LLM</span></a> <a href="https://infosec.exchange/tags/web" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>web</span></a> <a href="https://infosec.exchange/tags/crawlers" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>crawlers</span></a>? Am I the one responsible for configuring a unique crawling specification in my robots.txt because hypergiants make it inherently hard to have generic <a href="https://infosec.exchange/tags/opt" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>opt</span></a>-out configurations that tackle LLM projects specifically?</p><p>I refuse to accept that this is our new norm: a norm in which hypergiants not only methodically exploit the work of thousands of individuals for their own benefit without returning a penny, but in which the resource owner is also required to prevent these crawlers from driving up their own operational costs.</p><p>We need a new <a href="https://infosec.exchange/tags/opt" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>opt</span></a>-in. Often, public and open projects are keen to share their data. They just don't like the idea of carrying the unpredictable and multiplying financial burden of sharing that data with crawlers that give no notice. Even <a href="https://infosec.exchange/tags/CommonCrawl" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>CommonCrawl</span></a> has fail-safe mechanisms to reduce the burden on website owners. 
Why are LLM crawlers above the guidelines of good <a href="https://infosec.exchange/tags/Internet" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Internet</span></a> citizenship?</p><p>To preempt the most common argument: Yes, you can deny-by-default in your robots.txt, but that excludes any non-mainstream browser, too.</p><p>Some concerning <a href="https://infosec.exchange/tags/news" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>news</span></a> articles on the topic:</p><ul><li><a href="https://archive.is/nQ6Gk" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="">archive.is/nQ6Gk</span><span class="invisible"></span></a></li><li><a href="https://archive.is/CRwVs" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="">archive.is/CRwVs</span><span class="invisible"></span></a></li></ul><p><a href="https://infosec.exchange/tags/webcrawling" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>webcrawling</span></a> <a href="https://infosec.exchange/tags/crawler" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>crawler</span></a> <a href="https://infosec.exchange/tags/web" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>web</span></a> <a href="https://infosec.exchange/tags/opensource" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>opensource</span></a></p>
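<p>The trade-off described in the post above (per-crawler opt-out versus deny-by-default) can be sketched with Python's standard <code>urllib.robotparser</code>. This is only an illustration under stated assumptions: <code>GPTBot</code> and <code>CCBot</code> stand in for the many crawler tokens an operator would have to look up, <code>HobbySearchBot</code> and the <code>example.org</code> URL are hypothetical, and a well-behaved crawler is assumed, since robots.txt is advisory.</p>

```python
from urllib.robotparser import RobotFileParser

# Per-crawler opt-out: every crawler to be excluded needs its own entry.
# GPTBot and CCBot are example tokens; each crawler documents its own.
PER_CRAWLER_OPT_OUT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

# Deny-by-default: short, but it also shuts out every unlisted client.
DENY_BY_DEFAULT = """\
User-agent: *
Disallow: /
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(robots_txt: str, user_agent: str,
               url: str = "https://example.org/archive/page.html") -> bool:
    """Return True if a rule-abiding crawler with this token may fetch url."""
    return load_rules(robots_txt).can_fetch(user_agent, url)


if __name__ == "__main__":
    # HobbySearchBot is a hypothetical small, non-LLM crawler.
    for agent in ("GPTBot", "CCBot", "HobbySearchBot"):
        print(agent,
              is_allowed(PER_CRAWLER_OPT_OUT, agent),
              is_allowed(DENY_BY_DEFAULT, agent))
```

<p>The per-crawler file keeps the hypothetical small crawler allowed but grows with every new LLM crawler, while the deny-by-default file stays short and blocks everyone who honours it.</p>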
Miguel Afonso Caetano<p>"The success of generative AI relies heavily on training on data scraped through extensive crawling of the Internet, a practice that has raised significant copyright, privacy, and ethical concerns. While few measures are designed to resist a resource-rich adversary determined to scrape a site, crawlers can be impacted by a range of existing tools such as robots.txt, NoAI meta tags, and active crawler blocking by reverse proxies.</p><p>In this work, we seek to understand the ability and efficacy of today’s networking tools to protect content creators against AI-related crawling. For targeted populations like human artists, do they have the technical knowledge and agency to utilize crawler-blocking tools such as robots.txt, and can such tools be effective? Using large scale measurements and a targeted user study of 182 professional artists, we find strong demand for tools like robots.txt, but significantly constrained by significant hurdles in technical awareness, agency in deploying them, and limited efficacy against unresponsive crawlers. 
We further test and evaluate network level crawler blockers by reverse-proxies, and find that despite very limited deployment today, their reliable and comprehensive blocking of AI-crawlers make them the strongest protection for artists moving forward."</p><p><a href="https://arxiv.org/html/2411.15091v1#S3" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="">arxiv.org/html/2411.15091v1#S3</span><span class="invisible"></span></a></p><p><a href="https://tldr.nettime.org/tags/AI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>AI</span></a> <a href="https://tldr.nettime.org/tags/GenerativeAI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>GenerativeAI</span></a> <a href="https://tldr.nettime.org/tags/AITraining" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>AITraining</span></a> <a href="https://tldr.nettime.org/tags/WebCrawling" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>WebCrawling</span></a> <a href="https://tldr.nettime.org/tags/WebScraping" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>WebScraping</span></a></p>
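<p>As a rough illustration of what "active crawler blocking" means in the abstract above, the sketch below rejects requests by User-Agent before they reach the site. It is not the paper's method and not a real reverse proxy: it is a small WSGI middleware, the token list (<code>gptbot</code>, <code>ccbot</code>) is an example, and <code>demo_app</code> is a hypothetical stand-in for the protected site. A crawler that spoofs its User-Agent would pass straight through, so real network-level blockers rely on more signals than this.</p>

```python
# Example tokens only; a real deployment would maintain a longer list.
BLOCKED_AGENT_TOKENS = ("gptbot", "ccbot")


def block_listed_crawlers(app, blocked_tokens=BLOCKED_AGENT_TOKENS):
    """Wrap a WSGI app and answer 403 to listed crawler User-Agents."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in agent for token in blocked_tokens):
            start_response("403 Forbidden",
                           [("Content-Type", "text/plain")])
            return [b"Crawling not permitted.\n"]
        # Everyone else is passed through to the wrapped application.
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Hypothetical site standing in for an artist's portfolio."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]


guarded_app = block_listed_crawlers(demo_app)
```

<p>Unlike robots.txt, this check is enforced by the server rather than left to the crawler's goodwill, which is the distinction the abstract draws.</p>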
Digital History Berlin<p>This week in the <a href="https://fedihum.org/tags/DigitalHistoryOFK" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>DigitalHistoryOFK</span></a>, together with Annabel Walz (Friedrich-Ebert-Stiftung), we turn to the complex topic of web archiving. From the perspective of memory institutions, she will discuss the characteristics of <a href="https://fedihum.org/tags/borndigital" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>borndigital</span></a> &amp; <a href="https://fedihum.org/tags/reborndigital" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>reborndigital</span></a> sources, as well as best practices for archiving them, which rely on <a href="https://fedihum.org/tags/WebCrawling" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>WebCrawling</span></a> as a practice &amp; <a href="https://fedihum.org/tags/WARC" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>WARC</span></a> as a storage format. </p><p>🔜 Wed, 29 Nov, 4-6 pm - via Zoom</p><p>ℹ️ Info: <a href="https://dhistory.hypotheses.org/6411" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="">dhistory.hypotheses.org/6411</span><span class="invisible"></span></a></p><p>___<br><a href="https://fedihum.org/tags/DigitalHistory" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>DigitalHistory</span></a> <a href="https://fedihum.org/tags/WebArchive" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>WebArchive</span></a> <span class="h-card" translate="no"><a href="https://a.gup.pe/u/histodons" class="u-url mention" rel="nofollow noopener" target="_blank">@<span>histodons</span></a></span></p>
khaleesi (Elina Eickstädt)<p><a href="https://eupolicy.social/tags/Chatkontrolle" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Chatkontrolle</span></a> <br><a href="https://eupolicy.social/tags/Webcrawling" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Webcrawling</span></a> The EU Centre is to be given powers that could escalate into web crawling. In addition, Europol's data access is very broad. 8/x</p>
Doc Edward Morbius ⭕​<p><strong>Hacker News front-page analytics</strong></p><p>A question about what states were most-frequently represented on the HN homepage had me do some quick querying via Hacker News's Algolia search ... which is <strong>NOT</strong> limited to the front page. Those results were ... surprising (Maine and Iowa outstrip the more probable results of California and, say, New York). Results are further confounded by other factors.</p><p>Thread: <a href="https://news.ycombinator.com/item?id=36076870" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">news.ycombinator.com/item?id=3</span><span class="invisible">6076870</span></a></p><p>HN provides an interface to historical front-page stories (<a href="https://news.ycombinator.com/front" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="">news.ycombinator.com/front</span><span class="invisible"></span></a>), and <em>that</em> can be crawled by providing a list of corresponding date specifications, e.g.:</p><pre><code>https://news.ycombinator.com/front?day=2023-05-25<br></code></pre><p>Easy enough.</p><p>So I'm crawling that and compiling a local archive. Rate-limiting and other factors mean that's only about halfway complete, and a full pull will take another day or so.</p><p>But I'll be able to look at story titles, sites, submitters, time-based patterns (day of week, day of month, month of year, yearly variations), and other patterns. There's also looking at mean points and comments by various dimensions.</p><p>Among surprises are that as of January 2015, among the highest consistently-voted sites is The Guardian. 
I'd thought HN leaned consistently less liberal.</p><p>The full archive will probably be &lt; 1 GB (raw HTML), currently 123 MB on disk.</p><p>Contents are the 30 top-voted stories for each day since 20 February 2007.</p><p>If anyone has suggestions for other questions to ask of this, fire away.</p><p>And, as of early 2015, top state mentions are:</p><pre><code> 1. new york: 150<br> 2. california: 101<br> 3. texas: 39<br> 4. washington: 38<br> 5. colorado: 15<br> 6. florida: 10<br> 7. georgia: 10<br> 8. kansas: 10<br> 9. north carolina: 9<br>10. oregon: 9<br></code></pre><p>NY is highly overrepresented (NY Times, NY Post, NY City), likewise Washington (Post, Times, DC). Adding in "Silicon Valley" and a few other toponyms boosts California's score markedly. I've also got some city-based analytics.</p><p><a href="https://toot.cat/tags/hn" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>hn</span></a> <a href="https://toot.cat/tags/hackernews" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>hackernews</span></a> <a href="https://toot.cat/tags/data" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>data</span></a> <a href="https://toot.cat/tags/DataAnalysis" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>DataAnalysis</span></a> <a href="https://toot.cat/tags/WebCrawling" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>WebCrawling</span></a></p>
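<p>The date-based crawl described in the post above could look roughly like the following. This is a sketch, not the author's actual script: only the <code>front?day=YYYY-MM-DD</code> URL pattern and the 20 February 2007 start date come from the post, while the function names, the five-second delay, the User-Agent string and the one-file-per-day layout are assumptions. The fetcher and the sleep function are parameters so the loop can be tried without touching the network.</p>

```python
import time
import urllib.request
from datetime import date, timedelta
from pathlib import Path

BASE_URL = "https://news.ycombinator.com/front"
FIRST_DAY = date(2007, 2, 20)  # earliest front page mentioned in the post


def front_page_url(day: date) -> str:
    """Build the historical front-page URL for one day."""
    return f"{BASE_URL}?day={day.isoformat()}"


def days_between(start: date, end: date):
    """Yield every date from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def fetch_html(url: str) -> bytes:
    """Download one page; the User-Agent string is a placeholder."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "front-page-archive-sketch"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def archive(start, end, out_dir, fetch=fetch_html,
            delay_seconds=5.0, sleep=time.sleep):
    """Save one raw HTML file per day, pausing between requests.

    Days already on disk are skipped, so an interrupted pull can resume.
    The delay is an assumed value, not a documented rate limit.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for day in days_between(start, end):
        target = out / f"{day.isoformat()}.html"
        if target.exists():
            continue
        target.write_bytes(fetch(front_page_url(day)))
        saved.append(target)
        sleep(delay_seconds)
    return saved
```

<p>A full pull would be something like <code>archive(FIRST_DAY, date.today(), "hn-front")</code>. Skipping days that already exist on disk is what makes a multi-day, rate-limited crawl practical to stop and restart.</p>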
Dawn A<p>Doing an introduction here:</p><p>I'm Dawn from <a href="https://seocommunity.social/tags/Manchester" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Manchester</span></a>. <br>Work as <a href="https://seocommunity.social/tags/SEO" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>SEO</span></a> consultant. <br>Love <a href="https://seocommunity.social/tags/TechSEO" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>TechSEO</span></a> and <a href="https://seocommunity.social/tags/Contentstrategy" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Contentstrategy</span></a>, <a href="https://seocommunity.social/tags/webstrategy" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>webstrategy</span></a> , <a href="https://seocommunity.social/tags/ecommerce" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>ecommerce</span></a>, <a href="https://seocommunity.social/tags/webcrawling" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>webcrawling</span></a>, <a href="https://seocommunity.social/tags/digitalmarketing" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>digitalmarketing</span></a>, <a href="https://seocommunity.social/tags/tech" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>tech</span></a>. 
<br>Learning <a href="https://seocommunity.social/tags/computerscience" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>computerscience</span></a>, <a href="https://seocommunity.social/tags/coding" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>coding</span></a> including <a href="https://seocommunity.social/tags/python" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>python</span></a>, <a href="https://seocommunity.social/tags/javascript" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>javascript</span></a>, <a href="https://seocommunity.social/tags/kotlin" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>kotlin</span></a>. <br>Keen interest in following <a href="https://seocommunity.social/tags/datascience" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>datascience</span></a>, <a href="https://seocommunity.social/tags/informationretrieval" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>informationretrieval</span></a>, <a href="https://seocommunity.social/tags/tech" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>tech</span></a> topics. 
<br>Love <a href="https://seocommunity.social/tags/running" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>running</span></a>, <a href="https://seocommunity.social/tags/trailrunning" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>trailrunning</span></a>, <a href="https://seocommunity.social/tags/baking" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>baking</span></a>.<br>Love <a href="https://seocommunity.social/tags/pomeranians" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>pomeranians</span></a> <a href="https://seocommunity.social/tags/animals" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>animals</span></a><br>99% <a href="https://seocommunity.social/tags/vegan" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>vegan</span></a>, 100% <a href="https://seocommunity.social/tags/vegetarian" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>vegetarian</span></a>. Learning is a big part of my every day</p>