|
As I am sure you are all aware, there is some concern that these fora are not long for this world. In light of this, some have taken to archiving their favourite threads. Let’s collaborate in this thread so we don’t prematurely hug the site to death. I propose posting in the thread you’re archiving, if it’s still open, as well as replying here. I don’t know of any good free file hosts to start sharing them immediately, but if you do, go ahead. We can work out hosting soon, but for now, let’s get threads on drives while not killing the server. e: I’m keeping this tool here for posterity, but there are a couple of command-line utilities down the page that may work better. Fumblemouse posted:A while ago, some goon wrote a thread scraper for SA. |
# ? Jun 25, 2020 03:00 |
|
|
|
Reserved
|
# ? Jun 25, 2020 03:01 |
|
Reserved. Of course it’s somewhat circular to create an index of archives on the very thing we are archiving, so we’ll need to preserve that, too, but we can at least coordinate the downloading till that is itself impossible.
|
# ? Jun 25, 2020 03:03 |
|
Do we know if there's a limit to what the extension/browser/servers will handle? Just tried it with the Watches Megathread, but it keeps timing out or crashing the extension. At 1495 pages, the thread is... chunky. (Gigabit on my end, so hopefully that isn't the limiting factor.)
|
# ? Jun 25, 2020 03:14 |
|
No idea, sorry; it's possible there's a hard limit, or something to do with memory. Java goons, feel free to inspect it and fix it if you can.
|
# ? Jun 25, 2020 03:20 |
|
I imagine it’s running out of memory somewhere and does so faster in content‐dense threads. A quick fix would be to code in a page start/end so it only tries to load a reasonable number at once.
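A page start/end option like that is straightforward to sketch. The snippet below is an illustration only: the showthread.php query parameters are taken from the file:// link quoted later in this thread, but `BASE`, the function names, and the output naming are my assumptions, not the extension's actual code (it uses the stdlib urllib so it runs without extra installs):

```python
import urllib.request

BASE = "https://forums.somethingawful.com/showthread.php"

def page_urls(threadid, start, end, perpage=40):
    """Build the URL for each page in a start..end slice of a thread."""
    return [
        f"{BASE}?threadid={threadid}&perpage={perpage}&pagenumber={p}"
        for p in range(start, end + 1)
    ]

def fetch_range(threadid, start, end, out_fmt="page{:04d}.html"):
    """Fetch one bounded slice of pages, writing each page to disk as it
    arrives so memory use stays flat regardless of thread size."""
    for page, url in zip(range(start, end + 1), page_urls(threadid, start, end)):
        with urllib.request.urlopen(url) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        with open(out_fmt.format(page), "w", encoding="utf-8") as f:
            f.write(html)
```

Writing each page to disk as soon as it arrives is what keeps memory flat; an extension that holds the whole thread in memory at once will hit a wall on 1495-page monsters.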
|
# ? Jun 25, 2020 03:27 |
|
I really hope dipshit lowtax agrees to leave, and that some of his ill-gotten patreon money goes to a security audit by a third party before it gets turned over so he can never show up again. I've been making fun of this site for years but now that we're at the point of archiving it I'm extremely sad.
|
# ? Jun 25, 2020 03:31 |
|
If you prefer the command line, I have a scraper written in Python 3.6+ here. (Yes, you can see all my other crappy code and half-finished projects if you want to.) e: Also, yes, I ran into a rate limit with an early version of this, which is why mine is limited to 10 requests per second. Takes longer, but doesn't hammer the server as hard. |
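For anyone writing their own client, a self-imposed cap like that 10 requests per second is easy to bolt on. A minimal sketch (my own illustration, not the scraper's actual code):

```python
import time

class RateLimiter:
    """Allow at most `per_second` calls per second by sleeping out the gap
    between consecutive calls."""
    def __init__(self, per_second=10):
        self.min_interval = 1.0 / per_second
        self._last = float("-inf")  # first call is never delayed

    def wait(self):
        now = time.monotonic()
        gap = now - self._last
        if gap < self.min_interval:
            time.sleep(self.min_interval - gap)
        self._last = time.monotonic()
```

Call `limiter.wait()` before each page fetch and the loop can never exceed the cap, no matter how quickly pages come back.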
# ? Jun 25, 2020 03:33 |
|
^ Thanks -- it trucked along for a while but failed on page 36 with: code:
|
# ? Jun 25, 2020 04:06 |
|
Trabant posted:^ Thanks -- it trucked along for a while but failed on page 36 with: Whoops, sorry about that - forgot to set encoding on the output file. Please re-download main.py - it should work now (I just tested it!).
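For anyone hitting the same thing in their own scripts: without an explicit `encoding=`, Python's `open()` falls back to the platform default (often cp1252 on Windows), which can't represent many characters that show up in forum posts, so the write raises UnicodeEncodeError mid-scrape. A minimal sketch of the fix:

```python
def save_page(path, html):
    """Write scraped HTML with an explicit encoding. Pinning utf-8 avoids
    UnicodeEncodeError on characters (curly quotes, dashes, emoji) that
    the platform-default codec cannot represent."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
```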
|
# ? Jun 25, 2020 04:15 |
|
goonspeed
|
# ? Jun 25, 2020 04:20 |
|
SneezeOfTheDecade posted:Whoops, sorry about that - forgot to set encoding on the output file. Please re-download main.py - it should work now (I just tested it!). You're a wonderful person e: well, now I have 1495 separate html files... anyone know how I can join them in a sane fashion? Clicking on "next page" within the saved file doesn't work for browsing -- the buttons retain the code which points to the next page: file:///C:/Users/myname/Desktop/SAScraper-master/archive/3520271/showthread.php?threadid=3520271&userid=0&perpage=40&pagenumber=2 |
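One workable approach to the broken "next page" buttons is to rewrite the pagination hrefs in each saved file so they point at the neighbouring local file. This is a sketch under two assumptions: the saved files are named page0001.html, page0002.html, and so on (the `name_fmt` parameter is mine, not the scraper's), and the hrefs carry a `pagenumber=N` query parameter, as in the file:// link quoted above:

```python
import re

def localize_links(html, name_fmt="page{:04d}.html"):
    """Rewrite showthread.php pagination hrefs to local file names, using
    the pagenumber=N query parameter to pick the target file. Everything
    else in the HTML is left untouched."""
    def repl(match):
        return name_fmt.format(int(match.group(1)))
    return re.sub(
        r"showthread\.php\?[^\"']*?pagenumber=(\d+)[^\"']*", repl, html
    )
```

Run it over each saved file and write the result back out; pagination then stays inside the local archive instead of dead-ending on showthread.php.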
# ? Jun 25, 2020 06:49 |
|
I have a lot of free time, disk space, and bandwidth. Willing to help out the efforts any way I can. |
# ? Jun 25, 2020 09:07 |
|
can someone figure out how to archive the SAclopedia?
|
# ? Jun 25, 2020 10:15 |
|
Crankit posted:can someone figure out how to archive the SAclopedia? If SaberCat isn't completely collapsing (it, er, might be), you can find an archive here.
|
# ? Jun 25, 2020 11:37 |
|
Platystemon posted:I don’t know of any good free files hosts to start sharing them immediately, but if you do, go ahead. We can work out hosting soon, but for now, let’s get threads on drives while not killing the server. Unless something's changed recently, a free MEGA account has something like 50GB of upload space. That should help a lot for people who don't have other upload options on the table.
|
# ? Jun 25, 2020 12:36 |
|
SubNat posted:Unless something's changed recently, a free MEGA account has something like 50GB of upload space. That should help a lot for people who don't have other upload options on the table. Literally signed up for one today to share the smilies I downloaded and it was 50GB for free.
|
# ? Jun 25, 2020 12:46 |
|
SneezeOfTheDecade posted:If you prefer the command line, I have a scraper written in Python 3.6+ here. (Yes, you can see all my other crappy code and half-finished projects if you want to.) I'm completely new to Python, though I have 3.8.3 installed, etc. I'm getting this error when I try to run main.py. I am bad at being a script kiddy posted:Fetching from thread 3708603. Edit: Got past the first error message, now hitting the one posted. |
# ? Jun 25, 2020 13:23 |
|
I had that too. I typed pip3 install requests on the command line and it installed the required library.
|
# ? Jun 25, 2020 13:57 |
|
Crankit posted:i had that too, I typed: pip3 install requests Thanks. I updated my post with my next error. If others would prefer I take this to a PM that's fine.
|
# ? Jun 25, 2020 14:05 |
|
I tried saving a couple of threads near and dear to my heart as HTML, but I've been told that's not the way to go. I am not familiar with GitHub. What am I supposed to do to archive them?
|
# ? Jun 25, 2020 14:35 |
|
Ynglaur posted:Thanks. I updated my post with my next error. If others would prefer I take this to a PM that's fine. Did you put your username and password in the config file?
|
# ? Jun 25, 2020 14:38 |
|
I'm currently backing up the old Comic Strip Megathreads from BSS, I'll do the current one last. (Seems like it takes 1-2 hours a pop, due to the threads being giant and very image heavy. CSMT 17 was ~6.5GB) https://www.dropbox.com/sh/mreh54f2hxg2jtd/AABSvuE3LX0UvaIwhVJ07ahCa?dl=0 I'm doing one at a time to prevent unnecessary load, but I really hope they won't be required.
|
# ? Jun 25, 2020 14:56 |
|
Crankit posted:Did you put your username and password in the config file? Yes. Should there be a space after the equals sign? My password uses a fair number of special characters, if that matters. (Yay for password managers.)
|
# ? Jun 25, 2020 15:00 |
|
nota just posted this in the Video Game Hoaxes thread, is this useful? nota posted:Thanks a lot for the rundown!
|
# ? Jun 25, 2020 15:23 |
|
I figured it out. My password had a percent sign (%) in it, which the parser of config.ini didn't like. I changed my password to not include that symbol and it's churning along fine now.
|
# ? Jun 25, 2020 15:51 |
|
Ynglaur posted:I figured it out. My password had a percent sign (%) in it, which the the parser of config.ini didn't like. I changed my password to not include that symbol and it's churning along fine now. Crud, sorry about that - I didn't realize configparser treated percent signs as special characters. I should figure out how to handle that. And thank you for reminding me that "requests" isn't a part of the core libraries!
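For the record, this is configparser's value interpolation: the default parser reserves % for %(name)s substitutions, so a bare percent sign in a value raises InterpolationSyntaxError when the value is read. Passing interpolation=None (or escaping the sign as %%) avoids the password change entirely. A small sketch, with a made-up password:

```python
import configparser

ini = "[account]\nusername = goon\npassword = hunter%2sword\n"

# The default parser reserves "%" for interpolation, so a bare percent
# sign in a password blows up when the value is read:
strict = configparser.ConfigParser()
strict.read_string(ini)
try:
    strict.get("account", "password")
except configparser.InterpolationSyntaxError:
    pass  # "%2" is not valid interpolation syntax

# Fix: disable interpolation so values are read back literally.
relaxed = configparser.ConfigParser(interpolation=None)
relaxed.read_string(ini)
assert relaxed.get("account", "password") == "hunter%2sword"
```

Note the error only fires on `get()`, not on parse, which is why the script appears to start fine and then dies when it first touches the password.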
|
# ? Jun 25, 2020 18:10 |
|
Can I make this tool work on Opera, or do I need to get Chrome back out?
|
# ? Jun 25, 2020 20:06 |
|
The Chrome extension is throwing errors for me now after successfully archiving a few small LP threads.
|
# ? Jun 25, 2020 20:14 |
|
I don't know who needs it but I made a backup of Yooper's Strike Command LP up until today
|
# ? Jun 25, 2020 20:43 |
|
Alright, I'm gonna get started with the Tails Gets Trolled thread.
|
# ? Jun 25, 2020 20:43 |
|
SubNat posted:I'm currently backing up the old Comic Strip Megathreads from BSS, I'll do the current one last. (Seems like it takes 1-2 hours a pop, due to the threads being giant and very image heavy. CSMT 17 was ~6.5GB) I don't know about Dropbox. Do I need to download these now, or can I just bookmark the page and it will be good for years?
|
# ? Jun 25, 2020 20:51 |
|
SubNat posted:I'm currently backing up the old Comic Strip Megathreads from BSS, I'll do the current one last. (Seems like it takes 1-2 hours a pop, due to the threads being giant and very image heavy. CSMT 17 was ~6.5GB) What's your tool/process if you don't mind? There are a couple of large threads I'd like to do the same for, but it's either breaking my system (the extension refuses to print/save as a PDF due to thread size) or resulting in thousands of individual html files.
|
# ? Jun 25, 2020 20:56 |
|
I hate to ask other people to do a job but I can't right now. Could someone archive this thread? https://forums.somethingawful.com/showthread.php?threadid=2703083 "I received a ciphertext on AIM today" It was the thread that got me into the forums, I was about 16 and unregged and read it day in, day out. e: I was wrong, this was the second one I got into, what was the Cthulhu one?? |
# ? Jun 25, 2020 22:17 |
|
Beachcomber posted:I don't know about Dropbox. Do I need to download these now, or can I just bookmark the page and it will be good for years?

They'll remain for as long as I keep them in my Dropbox folder, since that's just a link to one of my folders. (I have like 1600GB free, so it's not like this is chewing up much space.) I won't remove them until the current lowtax business has been resolved, one way or another: either the forums go free and there's no actual need to keep them around, or I keep them until they get safely deposited into some kind of long-term archive, hosted on a site, etc.

Trabant posted:What's your tool/process if you don't mind? There are a couple of large threads I'd like to do the same for, but it's either breaking my system (the extension refuses to print/save as a PDF due to thread size) or resulting in thousands of individual html files.

I just used the Chrome extension on GitHub as listed in the opening post of this thread. It loads up the page, and then I can just use Chrome's right click -> Save as (complete) HTML, which saves the entire page, and thus the thread, as a single .html file plus a folder containing all the images. (As for memory/load otherwise, my desktop is a workhorse with 64GB of RAM, so it'll load up even the largest of threads with no issue, and the actual saving process doesn't seem too intensive.) I can see wanting to do PDFs, but try doing it as plain HTML and see if that works.

eating only apples posted:I hate to ask other people to do a job but I can't right now. Could someone archive this thread?

Yeah sure, grabbing it now. There: https://www.dropbox.com/sh/mreh54f2hxg2jtd/AABSvuE3LX0UvaIwhVJ07ahCa?dl=0 It's on the same link I posted before. |
# ? Jun 25, 2020 22:22 |
|
Do the Goldmine if possible
|
# ? Jun 25, 2020 22:25 |
|
The goldmine seems like a thing where we'd either need an automated scraper that can handle entire subforums, or a giant spreadsheet where people note which threads they're scraping, where to find them, etc., in order to handle it in a sensible manner.
|
# ? Jun 25, 2020 22:29 |
|
Debating whether to drop for archives so that I can just do a for loop and grab threads on my second computer all night long. I grabbed my bookmarks, at least: Don't judge me dad posted:Directory: C:\Applications\SA Scraper\archive
|
# ? Jun 25, 2020 22:31 |
|
SubNat posted:The goldmine seems like a thing where we'd either need an automated scraper that can handle entire subforums, or a giant spreadsheet where people note which threads they're scraping, and where to find them etc, to handle in a sensible manner. Could we just set up for loops and have different starts and ends? I believe the thread IDs are just sequential. We'll grab crap threads as much as good ones, but... that's kind of just SA anyways.
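If the IDs really are sequential, divvying a range up between volunteers is just arithmetic. A sketch of the bookkeeping side only; the actual per-thread scraping call is whatever tool each person is already using:

```python
def partition(start_id, end_id, volunteers):
    """Split the inclusive thread-ID range start_id..end_id into one
    disjoint (lo, hi) slice per volunteer, covering every ID exactly once."""
    total = end_id - start_id + 1
    size = -(-total // volunteers)  # ceiling division
    slices = []
    for i in range(volunteers):
        lo = start_id + i * size
        hi = min(lo + size - 1, end_id)
        if lo > hi:
            break  # more volunteers than IDs left
        slices.append((lo, hi))
    return slices
```

For example, `partition(3000000, 3100000, 10)` hands each of ten goons a disjoint slice of roughly ten thousand IDs, so nobody scrapes the same thread twice.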
|
# ? Jun 25, 2020 22:32 |
|
|
fletcher, who lives up to his status as a poster in the digital packrats thread, has a script that can be used for archiving. It generates HTML files which support pagination properly, and on top of that it also grabs the images and includes them, so you don't have to rely on webhosts staying up. The only "problem" with it is that it sometimes generates zero-byte files, but you can work around that by simply deleting the zero-byte files and running the script again. In case it's not obvious, it needs the Python modules requests, beautifulsoup4, and html5lib, which are meant to be installed with pip.
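That zero-byte workaround can be automated so re-runs are hands-off. A sketch; the directory layout is whatever the script produced, and nothing here is specific to fletcher's code:

```python
import os

def purge_zero_byte(root):
    """Recursively delete zero-byte files under root so a re-run of the
    scraper regenerates them. Returns the paths that were removed."""
    removed = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) == 0:
                os.remove(path)
                removed.append(path)
    return removed
```

Delete, re-run the script, and repeat until `purge_zero_byte` returns an empty list.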
|
|
# ? Jun 25, 2020 22:54 |