Platystemon
Feb 13, 2012

as a person who never leaves my house i've done pretty well for myself.
As you are no doubt aware, there is some concern that these fora are not long for this world.

In light of this, some have taken to archiving their favourite threads.

Let’s collaborate in this thread so we don’t prematurely hug the site to death. I propose posting in the thread you’re archiving, if it’s still open, as well as replying here.

I don’t know of any good free file hosts to start sharing them immediately, but if you do, go ahead. We can work out hosting soon, but for now, let’s get threads on drives without killing the server.

e: I’m keeping this tool here for posterity, but there are a couple of command-line utilities down the page that may work better.

Fumblemouse posted:

A while ago, some goon wrote a thread scraper for SA.

I forked the code and fixed the error messages, then did it again when Chrome started hating on http connections or something. It's now the single thing I have available on GitHub, because it would be rude not to let the original author know that somebody still loves his dead gay code:

https://github.com/Fumblemouse/SA-Archiving-Tool

To use:

Hit the big green Clone button and download it as a zip file.

Extract the zip file somewhere on your local hard drive.

In Chrome, go to Settings -> More Tools -> Extensions and flick the Developer Mode switch at the top right.

There should now be an option to Load Unpacked on the left. Ignoring the untold comedy potential for that phrase, click the option and find the folder you extracted the zip to - by default, SA-Archiving-Tool-master. Select that folder and click the Open button.

All going well, you should now have a hard-to-see SA icon on the Chrome extension bar. If you visit an SA thread with a thread ID, e.g. https://forums.somethingawful.com/s...hreadid=3903748, the icon should now be clickable with a single option... Archive.

Click that link, and your local PC will think for a while, depending on your local tubespeed, then display a single page with the entire thread on it. Ctrl+S or right-click -> Save As and save it wherever you want. "Webpage, Complete" will save images too.

Because nothing in life is easy, the saved page's CSS tags refer to something set up by the Chrome extension, so it will look a bit crappy in another browser or if you remove the extension. You can fix this by changing the CSS reference in the saved HTML file to point at the files in the /archive folder from the extracted zipfile - or you could copy the /archive CSS files into the HTML file's folder and make the tags direct links to the files (no folder references).
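
That second fix can be sketched in a few lines of Python. The chrome-extension:// URL shape and the "archive.css" filename below are assumptions for illustration, not taken from the extension's actual output:

```python
import re

# Hypothetical one-off fix: rewrite the saved page's stylesheet link so it
# points at a CSS file sitting next to the HTML file, instead of a URL the
# extension set up. Adjust the pattern to whatever your saved file contains.
def relink_css(html, local_css="archive.css"):
    return re.sub(r'href="chrome-extension://[^"]*"',
                  'href="%s"' % local_css, html)
```

Run it over the saved .html file once and the page should style itself from the local copy.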

Go crazy!

Platystemon has a new favorite as of 08:28 on Jun 27, 2020

Platystemon
Feb 13, 2012

as a person who never leaves my house i've done pretty well for myself.
Reserved

Platystemon
Feb 13, 2012

as a person who never leaves my house i've done pretty well for myself.
Reserved

Of course it’s somewhat circular to create an index of archives on the very thing we are archiving, so we’ll need to preserve that, too, but we can at least coordinate the downloading till that is itself impossible.

Trabant
Nov 26, 2011

All systems nominal.
Do we know if there's a limit to what the extension/browser/servers will handle? Just tried it with the Watches Megathread, but it keeps timing out or crashing the extension. At 1495 pages, the thread is... chunky.

(gigabit on my end, so hopefully that isn't the limiting factor)

sebmojo
Oct 23, 2010


Legit Cyberpunk
No idea, sorry - it's possible there is a hard limit or something to do with memory. JavaScript goons, feel free to inspect it and fix it if you can.

Platystemon
Feb 13, 2012

as a person who never leaves my house i've done pretty well for myself.
I imagine it’s running out of memory somewhere and does so faster in content-dense threads. A quick fix would be to code in a page start/end so it only tries to load a reasonable number of pages at once.

gamer roomie is 41
May 3, 2020

:)
I really hope dipshit lowtax agrees to leave, and that some of his ill-gotten patreon money goes to a security audit by a third party before it gets turned over so he can never show up again. I've been making fun of this site for years but now that we're at the point of archiving it I'm extremely sad.

SneezeOfTheDecade
Feb 6, 2011

gettin' covid all
over your posts
If you prefer the command line, I have a scraper written in Python 3.6+ here. (Yes, you can see all my other crappy code and half-finished projects if you want to.)

e: Also, yes, I ran into a rate limit with an early version of this, which is why mine is limited to 10 requests per second. Takes longer, but doesn't hammer the server as hard. :)

SneezeOfTheDecade has a new favorite as of 03:40 on Jun 25, 2020

Trabant
Nov 26, 2011

All systems nominal.
^ Thanks -- it trucked along for a while but failed on page 36 with:

code:
Traceback (most recent call last):
  File "C:\Users\myname\Desktop\SAScraper-master\main.py", line 55, in <module>
    main(threadid)
  File "C:\Users\myname\Desktop\SAScraper-master\main.py", line 44, in main
    file.write(r.text)
  File "C:\Users\myname\AppData\Local\Programs\Python\Python38-32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x80' in position 111465: character maps to <undefined>
And from what I can tell, x80 corresponds to the Euro symbol €? Page 36 definitely contains a couple of those.

SneezeOfTheDecade
Feb 6, 2011

gettin' covid all
over your posts

Trabant posted:

^ Thanks -- it trucked along for a while but failed on page 36 with:

---

Whoops, sorry about that - forgot to set encoding on the output file. Please re-download main.py - it should work now (I just tested it!).

Head Bee Guy
Jun 12, 2011

Retarded for Busting
Grimey Drawer
goonspeed

Trabant
Nov 26, 2011

All systems nominal.

SneezeOfTheDecade posted:

Whoops, sorry about that - forgot to set encoding on the output file. Please re-download main.py - it should work now (I just tested it!).

You're a wonderful person :)

e: well, now I have 1495 separate html files... anyone know how I can join them in a sane fashion? Clicking on "next page" within the saved file doesn't work for browsing -- the buttons retain the code which points to the next page:

file:///C:/Users/myname/Desktop/SAScraper-master/archive/3520271/showthread.php?threadid=3520271&userid=0&perpage=40&pagenumber=2
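
One hedged approach to stitching the pages together for browsing: batch-rewrite the pagination links in each saved file so they point at local filenames. The page-<n>.html naming below is hypothetical - substitute whatever names your pages were actually saved under:

```python
import re

def rewrite_next_links(html, threadid):
    """Rewrite showthread.php pagination links to local per-page filenames.

    Assumes pages were renamed page-<n>.html (a hypothetical scheme); the
    question mark in the original filename is what breaks file:// browsing.
    """
    pattern = re.compile(
        r'showthread\.php\?threadid=%d[^"\']*?pagenumber=(\d+)' % threadid)
    return pattern.sub(lambda m: 'page-%s.html' % m.group(1), html)
```

Run it over each of the saved files and the "next page" buttons should stay inside the archive.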

Trabant has a new favorite as of 07:22 on Jun 25, 2020

Rack
Aug 5, 2003

I've misunderstood what a lion is.
Grimey Drawer
I have a lot of free time, disc space, and bandwidth. Willing to help out the efforts any way I can.

Rack has a new favorite as of 09:42 on Jun 25, 2020

Crankit
Feb 7, 2011

HE WATCHES
can someone figure out how to archive the SAclopedia?

SneezeOfTheDecade
Feb 6, 2011

gettin' covid all
over your posts

Crankit posted:

can someone figure out how to archive the SAclopedia?

If SaberCat isn't completely collapsing (it, er, might be :shobon:), you can find an archive here.

SubNat
Nov 27, 2008

Platystemon posted:

I don’t know of any good free files hosts to start sharing them immediately, but if you do, go ahead. We can work out hosting soon, but for now, let’s get threads on drives while not killing the server.

Unless something's changed recently, a free MEGA account has something like 50GB of upload space. That should help a lot for people who don't have other upload options on the table.

Inceltown
Aug 6, 2019

SubNat posted:

Unless something's changed recently, a free MEGA account has something like 50GB of upload space. That should help a lot for people who don't have other upload options on the table.

Literally signed up for one today to share the smilies I downloaded and it was 50GB for free.

Ynglaur
Oct 9, 2013

The Malta Conference, anyone?

SneezeOfTheDecade posted:

If you prefer the command line, I have a scraper written in Python 3.6+ here. (Yes, you can see all my other crappy code and half-finished projects if you want to.)

e: Also, yes, I ran into a rate limit with an early version of this, which is why mine is limited to 10 requests per second. Takes longer, but doesn't hammer the server as hard. :)

I'm completely new to Python, though I have 3.8.3 installed, etc. I'm getting this error when I try to run main.py.

I am bad at being a script kiddy posted:

Fetching from thread 3708603.
Traceback (most recent call last):
  File "main.py", line 55, in <module>
    main(threadid)
  File "main.py", line 17, in main
    if "username" not in config["DEFAULT"] or "password" not in config["DEFAULT"] or config["DEFAULT"]["username"] == "" or config["DEFAULT"]["password"] == "":
  File "C:\Python38\lib\configparser.py", line 1255, in __getitem__
    return self._parser.get(self._name, key)
  File "C:\Python38\lib\configparser.py", line 799, in get
    return self._interpolation.before_get(self, section, option, value,
  File "C:\Python38\lib\configparser.py", line 395, in before_get
    self._interpolate_some(parser, option, L, value, section, defaults, 1)
  File "C:\Python38\lib\configparser.py", line 442, in _interpolate_some
    raise InterpolationSyntaxError(
configparser.InterpolationSyntaxError: '%' must be followed by '%' or '(', found: '%f&'

Edit: Got past the first error message, now hitting the one posted.

Ynglaur has a new favorite as of 14:05 on Jun 25, 2020

Crankit
Feb 7, 2011

HE WATCHES
i had that too, I typed: pip3 install requests
on the commandline and it installed the required library

Ynglaur
Oct 9, 2013

The Malta Conference, anyone?

Crankit posted:

i had that too, I typed: pip3 install requests
on the commandline and it installed the required library

Thanks. I updated my post with my next error. If others would prefer I take this to a PM that's fine.

paradoxGentleman
Dec 10, 2013

wheres the jester, I could do with some pointless nonsense right about now

I tried saving a couple of threads near and dear to my heart as HTML, but I've been told that's not the way to go.
I am not familiar with GitHub. What am I supposed to do to archive them?

Crankit
Feb 7, 2011

HE WATCHES

Ynglaur posted:

Thanks. I updated my post with my next error. If others would prefer I take this to a PM that's fine.

Did you put your username and password in the config file?

SubNat
Nov 27, 2008

I'm currently backing up the old Comic Strip Megathreads from BSS, I'll do the current one last. (Seems like it takes 1-2 hours a pop, due to the threads being giant and very image heavy. CSMT 17 was ~6.5GB)
https://www.dropbox.com/sh/mreh54f2hxg2jtd/AABSvuE3LX0UvaIwhVJ07ahCa?dl=0

I'm doing one at a time to prevent unnecessary load, but I really hope they won't be required. :smith:

Ynglaur
Oct 9, 2013

The Malta Conference, anyone?

Crankit posted:

Did you put your username and password in the config file?

Yes. Should there be a space after the equals sign? My password uses a fair number of special characters, if that matters. (Yay for password managers.)

The Saddest Rhino
Apr 29, 2009

Put it all together.
Solve the world.
One conversation at a time.

nota just posted this in the Video Game Hoaxes thread, is this useful?

nota posted:

Thanks a lot for the run down !
It makes me pretty sad that the actions of one lovely person may result in the end of the forums, I hope something can be done to avoid this.

For windows user looking to archive their favorite threads, I made a wget script.
First you'll need to get wget and run it from the folder you want your thread saved to (each page will be a file so one folder per thread).
Here's a tutorial I found: https://builtvisible.com/download-your-website-with-wget/
Here's the command line script :

FOR /L %G IN (A,1,B) DO wget --html-extension -np https://forums.somethingawful.com/showthread.php?threadid=1111111^&userid=0^&perpage=40^&pagenumber=%G

In (A,1,B), A is the first page you want to save and B the last (1 is the increment).
Just edit the threadid with the one you want to save, you can also edit the number of posts per page.
Sadly I couldn't figure out how to make the page links work since they contain a question mark (if you can batch-edit the html to replace it with an @ the links should work).
This won't log you in but there are ways to add cookie data to wget.

EDIT : To add cookies just put the site's cookies.txt in the folder you're using for wget (there are extensions to extract cookies from websites) and use the following command (don't forget to edit the relevant fields) :

FOR /L %G IN (A,1,B) DO wget --load-cookies=cookies.txt --html-extension -np https://forums.somethingawful.com/showthread.php?threadid=1111111^&userid=0^&perpage=40^&pagenumber=%G

Ynglaur
Oct 9, 2013

The Malta Conference, anyone?
I figured it out. My password had a percent sign (%) in it, which the parser of config.ini didn't like. I changed my password to not include that symbol and it's churning along fine now.

SneezeOfTheDecade
Feb 6, 2011

gettin' covid all
over your posts

Ynglaur posted:

I figured it out. My password had a percent sign (%) in it, which the the parser of config.ini didn't like. I changed my password to not include that symbol and it's churning along fine now.

Crud, sorry about that - I didn't realize configparser treated percent signs as special characters. I should figure out how to handle that. And thank you for reminding me that "requests" isn't a part of the core libraries!
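
For anyone hitting the same thing: the percent sign trips configparser's interpolation feature. One way to handle it (a sketch, not necessarily how the scraper will end up fixing it) is to disable interpolation so '%' is read literally:

```python
import configparser

# With the default BasicInterpolation, a raw '%' in a value raises
# InterpolationSyntaxError. interpolation=None makes '%' an ordinary
# character, so passwords like "hunter%2" parse fine.
config = configparser.ConfigParser(interpolation=None)
config.read_string("[DEFAULT]\nusername = goon\npassword = hunter%2\n")
```

(The other option is asking users to escape every % as %%, which is exactly the kind of thing nobody will remember to do.)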

Solitair
Feb 18, 2014

TODAY'S GONNA BE A GOOD MOTHERFUCKIN' DAY!!!
Can I make this tool work on Opera, or do I need to get Chrome back out?

Hirayuki
Mar 28, 2010


The Chrome extension is throwing errors for me now after successfully archiving a few small LP threads. :(

mrpwase
Apr 21, 2010

I HAVE GREAT AVATAR IDEAS
For the Many, Not the Few


I don't know who needs it but I made a backup of Yooper's Strike Command LP up until today :shobon:

Solitair
Feb 18, 2014

TODAY'S GONNA BE A GOOD MOTHERFUCKIN' DAY!!!
Alright, I'm gonna get started with the Tails Gets Trolled thread.

Beachcomber
May 21, 2007

Another day in paradise.


Slippery Tilde

SubNat posted:

I'm currently backing up the old Comic Strip Megathreads from BSS, I'll do the current one last. (Seems like it takes 1-2 hours a pop, due to the threads being giant and very image heavy. CSMT 17 was ~6.5GB)
https://www.dropbox.com/sh/mreh54f2hxg2jtd/AABSvuE3LX0UvaIwhVJ07ahCa?dl=0

I'm doing one at a time to prevent unecessary load, but I really hope they won't be required. :smith:

I don't know about Dropbox. Do I need to download these now, or can I just bookmark the page and it will be good for years?

Trabant
Nov 26, 2011

All systems nominal.

SubNat posted:

I'm currently backing up the old Comic Strip Megathreads from BSS, I'll do the current one last. (Seems like it takes 1-2 hours a pop, due to the threads being giant and very image heavy. CSMT 17 was ~6.5GB)
https://www.dropbox.com/sh/mreh54f2hxg2jtd/AABSvuE3LX0UvaIwhVJ07ahCa?dl=0

I'm doing one at a time to prevent unecessary load, but I really hope they won't be required. :smith:

What's your tool/process if you don't mind? There are a couple of large threads I'd like to do the same for, but it's either breaking my system (the extension refuses to print/save as a PDF due to thread size) or resulting in thousands of individual html files.

eating only apples
Dec 12, 2009

Shall we dance?
I hate to ask other people to do a job but I can't right now. Could someone archive this thread?

https://forums.somethingawful.com/showthread.php?threadid=2703083

"I received a ciphertext on AIM today"

It was the thread that got me into the forums, I was about 16 and unregged and read it day in, day out. e: I was wrong, this was the second one I got into, what was the Cthulhu one??

eating only apples has a new favorite as of 22:41 on Jun 25, 2020

SubNat
Nov 27, 2008

Beachcomber posted:

I don't know about Dropbox. Do I need to download these now, or can I just bookmark the page and it will be good for years?

They'll remain for as long as I keep them in my dropbox folder, since that's just a link to one of my folders. ( I have like 1600GB free, so it's not like this is chewing up much space.)

I won't remove them until the current lowtax business has been resolved, one way or another. Either the forums go free and there's no real need to keep them around, or I keep them until they get safely deposited into some kind of long-term archive / hosted on a site, etc.

Trabant posted:

What's your tool/process if you don't mind? There are a couple of large threads I'd like to do the same for, but it's either breaking my system (the extension refuses to print/save as a PDF due to thread size) or resulting in thousands of individual html files.

I just used the chrome extension on github as listed in the opening post of this thread. It loads up the page, and then I can just use chrome's 'right click -> save as(complete) html' and it saves the entire page, and thus thread, as a single .html file, plus a folder containing all the images.
(As for memory/load otherwise, my desktop is a workhorse with 64GB ram, so it'll load up even the largest of threads with no issue. And the actual saving process doesn't seem too intensive.)
I can see wanting to do pdfs, but try doing it as plain html and see if that works.


eating only apples posted:

I hate to ask other people to do a job but I can't right now. Could someone archive this thread?
---

Yeah sure, grabbing it now.
There: https://www.dropbox.com/sh/mreh54f2hxg2jtd/AABSvuE3LX0UvaIwhVJ07ahCa?dl=0 It's on the same link I posted before.

SubNat has a new favorite as of 22:28 on Jun 25, 2020

Spinz
Jan 7, 2020

I ordered luscious new gemstones from India and made new earrings for my SA mart thread

Remember my earrings and art are much better than my posting

New stuff starts towards end of page 3 of the thread
Do the Goldmine if possible

SubNat
Nov 27, 2008

The Goldmine seems like the kind of thing where we'd either need an automated scraper that can handle entire subforums, or a giant spreadsheet where people note which threads they're scraping, where to find them, etc., to handle it in a sensible manner.

Ynglaur
Oct 9, 2013

The Malta Conference, anyone?
Debating whether to drop :10bux: for archives so that I can just do a for loop and grab threads on my second computer all night long. I grabbed my bookmarks, at least:

Don't judge me dad posted:

Directory: C:\Applications\SA Scraper\archive

Mode LastWriteTime Length Name
---- ------------- ------ ----
d----- 6/25/2020 2:31 PM 3137721
d----- 6/25/2020 2:34 PM 3222963
d----- 6/25/2020 1:00 PM 3474164
d----- 6/25/2020 2:45 PM 3486446
d----- 6/25/2020 1:21 PM 3498140
d----- 6/25/2020 2:47 PM 3532243
d----- 6/25/2020 2:55 PM 3543909
d----- 6/25/2020 1:03 PM 3708603
d----- 6/25/2020 2:57 PM 3750534
d----- 6/25/2020 3:00 PM 3787350
d----- 6/25/2020 1:01 PM 3808055
d----- 6/25/2020 3:01 PM 3842007
d----- 6/25/2020 3:01 PM 3845393
d----- 6/25/2020 3:06 PM 3866278
d----- 6/25/2020 1:23 PM 3885426
d----- 6/25/2020 1:43 PM 3891653
d----- 6/25/2020 3:06 PM 3897946
d----- 6/25/2020 2:10 PM 3899855
d----- 6/25/2020 3:06 PM 3905098
d----- 6/25/2020 3:13 PM 3915397
d----- 6/25/2020 3:13 PM 3923802
d----- 6/25/2020 3:13 PM 3924247
d----- 6/25/2020 3:16 PM 3928980
d----- 6/25/2020 3:16 PM 3929489
-a---- 6/25/2020 5:30 PM 0 archive.txt

Ynglaur
Oct 9, 2013

The Malta Conference, anyone?

SubNat posted:

The goldmine seems like a thing where we'd either need an automated scraper that can handle entire subforums, or a giant spreadsheet where people note which threads they're scraping, and where to find them etc, to handle in a sensible manner.

Could we just set up for loops with different starts and ends? I believe the thread IDs are just sequential. We'll grab crap threads as much as good ones, but... that's kind of just SA anyways.
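
A sketch of how that split might look, assuming the IDs really are sequential (id_ranges is a hypothetical helper, not an existing tool):

```python
def id_ranges(start, end, workers):
    """Split the half-open ID range [start, end) into `workers` roughly
    equal (lo, hi) slices, so each goon can take one slice."""
    step, extra = divmod(end - start, workers)
    ranges, lo = [], start
    for i in range(workers):
        # The first `extra` slices absorb the remainder, one ID each.
        hi = lo + step + (1 if i < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges

# Each volunteer then loops over their slice, feeding every ID to
# whichever scraper they're using (and skipping IDs that 404).
```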

BlankSystemDaemon
Mar 13, 2009

fletcher, who lives up to his status as a poster in the digital packrats thread, has a script that can be used for archiving.
It generates html files that support pagination properly, and on top of that it also grabs the images and includes them, so you don't have to rely on webhosts staying up.
The only "problem" with it is that it sometimes generates zero-byte files, but you can work around that by simply deleting the zero-byte files and running the script again.
In case it's not obvious, it needs the Python modules requests, beautifulsoup4, and html5lib, which are meant to be installed with pip.
