@jonny a torrent in time saves nine terabytes of data
https://sciop.net/docs/quickstart/#seed-anything
Incredibly easy! Zero cost if you have internet and don't have a data cap! Resilience!
man i just had a series of extremely good ideas* that are very simple and very implementable for #sciop. i think they will cause an absolutely disgusting amount of public data scraping to happen (the good kind: intrinsically deduplicating, and actually decreasing server load by creating a supporting swarm of peers) and basically lower the barrier to scouting endangered datasets to zero
*if you received the message flood of me having them you are not allowed to tell people if they are actually bad
@ai6yr open call for radio people to team up with @SafeguardingResearch and #sciop to snatch the data from the satellites, decode, and re-upload as torrents to replace critical infrastructure with something even better
Hackathon: Data Under Threat / Data Rescuing (Aug 7) in #München
The LMU Open Science Center (@lmu_osc) is running a hackathon to support the #SciOp #SafeguardingResearch initiative: rescuing research data that is being deleted by the Trump administration.
Bonus: @lavaeolus will give an ignition talk!
Thursday, 2025-08-07, 16:00–19:00 (in-person only)
Details and signup: https://github.com/lmu-osc/safeguar.de-hackathon
Become a data rescuer by turning your own laptop into a Research Data Rescue Node, scraping at-risk data sets, and breathing new life into your old HDD as part of a global, decentralised network.
#LMUMünchen #OpenScience #OpenData #DataRescue
CC @SafeguardingResearch @bitsUndBaeumeAuxMuc
I revived an old HDD with a #RaspberryPi Zero W 2 for #DataRescue-ing:
It runs ...
(a) a Bittorrent client that seeds at-risk data sets from the #SciOp database
(b) the `sciop-scrape` script to get new datasets into the swarm
Setup instructions for the Pi Zero: https://codeberg.org/nicebread/HiveSeed/src/branch/main/L1-RDRN_RPi.md
Setup instructions for `sciop-scrape` (on macOS & RPi): https://codeberg.org/nicebread/HiveSeed/src/branch/main/L1-sciop-scrape.md
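If you just want the gist before opening the guides: on the Pi the whole thing boils down to roughly the following. This is a sketch only; I use qbittorrent-nox here as one possible headless client, so check the guides above for the exact client, storage paths, and autostart setup.
sudo apt update && sudo apt install -y qbittorrent-nox   # headless torrent client; Web UI at http://localhost:8080 by default
qbittorrent-nox                                          # start it once in the foreground (set it up as a service later)
python -m pip install sciop-scraping                     # provides the `sciop-cli` and `sciop-scrape` commands
sciop-cli login                                          # authenticate with your sciop.net account
sciop-cli client login                                   # connect sciop-cli to the local torrent client's Web UI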
Let me know if the instructions work for you; happy to collaborate on the manual.
Added a 10 TB seeding node to the #SciOp #SafeguardingResearch swarm, focusing on large (> 1 TB) data sets with 0 or 1 seeders.
You may have heard that globalchange.gov and all the national reports on climate change have gone down.
We got em all on #sciop, a webrip and all the PDFs extracted: https://sciop.net/datasets/globalchange-gov-webrip
Edit: context - https://apnews.com/article/climate-change-national-assessment-nasa-white-house-057cec699caef90832d8b10f21a6ffe8
About US research data under threat and how everyone can contribute to saving it - @lavaeolus and I were interviewed in the TU Delft paper Delta: https://delta.tudelft.nl/en/article/saving-academic-data-is-easier-than-you-think-and-you-can-do-it-too - mostly regarding our roles in the Safeguarding Research and Culture initiative at https://safeguar.de
#TUDelft #SafeguardingResearch #SciOp
@jonny With the updated commands I got it to run now (with minor modifications) on macOS. On RPi I will try again tomorrow (currently no access to the machine).
I am currently scraping "rp_enchanter_ver02" with 24 GB downloaded and counting. Three questions:
(1) Is there a way to know in advance how large the download will be?
(2) Can I stop the scraping, or will the download then be corrupted?
(3) I assume that it automatically starts seeding once the download is finished?
Should we keep this conversation on (a) Mastodon, (b) the safeguar.de forum, or (c) Codeberg issues? Where can most people benefit from it?
@jonny this entire thread is amazing, top-notch tool development for a noble cause.
@ #academia: if you feel desperate about the wholesale breakdown of science under the current US administration, consider helping out with #SciOp: Decentralized backups of datasets under threat, in a torrent swarm.
Have a disused laptop or Raspi? Make it part of the swarm and take the data outside the US (or any) administration's grasp!
@jonny Very cool. A couple months back, I resurrected an 8T NAS I'd slated for donation when I came across #sciop
So far I've been creating WARCs using zimit, with #deluge for the torrents because its client/server setup is convenient for a headless unit.
Anyway, I'm giving this a try and it's grabbed 10G very quickly, which seems much faster than zimit. I'm not exactly sure how to turn this around into a torrent and get it up to SciOp, but I'll keep an eye on it and am happy to provide feedback.
check this out if you want to help preserve the archive of "most local newspapers through most of US history" that had its funding pulled. even if you only have a couple dozen gigabytes to spare, you can
a) make an account on https://sciop.net/ ,
b) run a qbittorrent instance, go to preferences>web ui and click enable,
and just do this
python -m pip install sciop-scraping
sciop-cli login
sciop-cli client login
sciop-scrape chronicling-america --next
and that's all.
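if sciop-cli client login can't reach your client, here's a quick sanity check that the web ui from step b) is actually listening (assuming qbittorrent's default port 8080 - adjust if you changed it):
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/
# 200 means the web ui is up; anything else, go back to preferences > web ui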
if you have spare storage, you can sort by seeders, ascending, and start from there. or subscribe to the rss feed and auto-download it.
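for the rss route: qbittorrent has a built-in rss reader with auto-download rules you can point at the feed, or you can cron something crude like this. sketch only - the feed url is a placeholder (grab the real one from the sciop page), and i'm assuming the torrent link sits in an enclosure tag, which you should verify:
FEED="https://sciop.net/path/to/feed.rss"    # placeholder - use the actual feed url
# log in to the qbittorrent web api (use your own web ui credentials), then hand every torrent url in the feed to the client
curl -s -c /tmp/qb.cookies --data "username=admin&password=changeme" http://localhost:8080/api/v2/auth/login
curl -s "$FEED" \
  | grep -oP '(?<=<enclosure url=")[^"]+' \
  | while read -r url; do
      curl -s -b /tmp/qb.cookies -F "urls=$url" http://localhost:8080/api/v2/torrents/add
    done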
this is an archive funded by the library of congress (threatened) and the national endowment for the humanities (actively being eliminated). the alternative is that an enormous amount of US history that doesn't percolate into history books is owned and operated by lexisnexis and other for-profit data brokers.
this is the first run of some tooling to lower the bar for participatory scraping - at the moment, the archive is still online, and the scraper will automatically embed a webseed URL in the created torrent. so even if you don't have space to seed, you can scrape the data, upload the torrent, and make it possible for waiting peers to become mirrors
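if you're curious what that looks like under the hood: the webseed is just a url-list entry in the torrent metadata (BEP 19). a crude, dependency-free way to see it - use whatever filename the scraper gave your torrent:
grep -ao 'url-list.\{0,120\}' path/to/your.torrent
# the url following "url-list" is the webseed; clients that speak BEP 19 can pull pieces straight from the live archive over https while the swarm is still empty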
Sciop is as easy to run as a bittorrent client. The idea is to have it serve as a companion to a client: we're going to implement a minor mutation of the FEP for mobile identity so you can mirror an identity from your personal client companion to any other instance that chooses to mirror yours. So this isn't "come help our website", it's "get the fun parts of this website ready for when it's time to talk to other websites".