SneezeOfTheDecade
Feb 6, 2011

gettin' covid all
over your posts
If you prefer the command line, I have a scraper written in Python 3.6+ here. (Yes, you can see all my other crappy code and half-finished projects if you want to.)

e: Also, yes, I ran into a rate limit with an early version of this, which is why mine is limited to 10 requests per second. Takes longer, but doesn't hammer the server as hard. :)
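For anyone writing their own scraper, here is a minimal sketch of one way to cap a loop at roughly 10 requests per second. It is not the code from the scraper itself; the `RateLimiter` class and its parameters are made-up names for illustration, and it simply spaces calls out by a minimum interval.

```python
import time


class RateLimiter:
    """Space out calls so that at most `max_per_second` happen each second.

    Hypothetical helper for illustration; not taken from the scraper.
    """

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        # No call has happened yet, so the first wait() returns immediately.
        self._next_allowed = float("-inf")

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._clock()
        self._next_allowed = max(now, self._next_allowed) + self.min_interval


# Usage sketch: call limiter.wait() right before each page request.
# limiter = RateLimiter(10)
# for page in range(1, last_page + 1):
#     limiter.wait()
#     ...fetch the page here...
```

The clock and sleep functions are injectable only so the behaviour can be checked without actually waiting.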

SneezeOfTheDecade has a new favorite as of 03:40 on Jun 25, 2020

SneezeOfTheDecade
Feb 6, 2011

gettin' covid all
over your posts

Trabant posted:

^ Thanks -- it trucked along for a while but failed on page 36 with:

code:
Traceback (most recent call last):
  File "C:\Users\myname\Desktop\SAScraper-master\main.py", line 55, in <module>
    main(threadid)
  File "C:\Users\myname\Desktop\SAScraper-master\main.py", line 44, in main
    file.write(r.text)
  File "C:\Users\myname\AppData\Local\Programs\Python\Python38-32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x80' in position 111465: character maps to <undefined>
And from what I can tell, x80 corresponds to the Euro symbol €? Page 36 definitely contains a couple of those.

Whoops, sorry about that - forgot to set encoding on the output file. Please re-download main.py - it should work now (I just tested it!).
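For reference, a minimal sketch of that kind of fix, assuming the page text is written with plain `open()`. The `save_page` name is made up for illustration and is not necessarily what main.py does. Without an explicit encoding, Python falls back to the platform default, which is cp1252 in the traceback above. That codec can encode "€" (U+20AC) but has no mapping for the control character U+0080, which is what the error message is complaining about.

```python
def save_page(path, html):
    """Write page text as UTF-8 regardless of the platform's default encoding."""
    # Hypothetical helper; the explicit encoding is the important part.
    with open(path, "w", encoding="utf-8") as file:
        file.write(html)


def fails_in_cp1252(text):
    """Return True if `text` cannot be encoded with the cp1252 codec."""
    try:
        text.encode("cp1252")
    except UnicodeEncodeError:
        return True
    return False
```

UTF-8 can represent every Unicode character, so the write no longer depends on which characters happen to appear on a page.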

SneezeOfTheDecade
Feb 6, 2011

gettin' covid all
over your posts

Crankit posted:

can someone figure out how to archive the SAclopedia?

If SaberCat isn't completely collapsing (it, er, might be :shobon:), you can find an archive here.

SneezeOfTheDecade
Feb 6, 2011

gettin' covid all
over your posts

Ynglaur posted:

I figured it out. My password had a percent sign (%) in it, which the parser of config.ini didn't like. I changed my password to not include that symbol and it's churning along fine now.

Crud, sorry about that - I didn't realize configparser treated percent signs as special characters. I should figure out how to handle that. And thank you for reminding me that "requests" isn't a part of the core libraries!
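One possible way to handle it, sketched under assumptions: by default `configparser.ConfigParser` applies `%`-style interpolation when a value is read, so a lone `%` raises an error. Passing `interpolation=None` turns that off. The `[login]` section and `password` key below are made-up names for illustration, not necessarily what the real config.ini uses.

```python
import configparser

# Hypothetical config contents; the section and key names are assumptions.
SAMPLE = "[login]\nusername = someone\npassword = hunter%2\n"


def read_password(ini_text):
    """Read the password with interpolation disabled, so '%' is kept literally."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return parser.get("login", "password")


def default_parser_chokes(ini_text):
    """Return True if the default (interpolating) parser rejects the password."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    try:
        parser.get("login", "password")
    except configparser.InterpolationError:
        return True
    return False
```

The other workaround, which keeps the default parser, is to write the percent sign as `%%` in config.ini.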

SneezeOfTheDecade
Feb 6, 2011

gettin' covid all
over your posts
I should post the archives I've uploaded here, too, rather than just in their threads:

SAclopedia

Roman/Ancient History (just in case :))

Games - GM Advice Thread
Games - D&D 5e Megathread
Games - Making Games Megathread
Games - Video Game Hoaxes and Urban Legends
Games - Murphy's Rules

GWS - Help! I'm poor and want to make good food!

BSS - Kill Six Billion Demons

D&D/LF Political Cartoons threads: 2007-8, 2009, 2010, 2011, 2012 part 1, 2012 part 2, 2013 part 1, 2013 part 2, 2014 part 1, 2014 part 2, 2015, 2016, 2017, 2018, 2019, 2020
D&D Politoons Gaybies threads

SneezeOfTheDecade
Feb 6, 2011

gettin' covid all
over your posts

D. Ebdrup posted:

Please understand, I don't want SneezeOfTheDecade of the decade to feel like their effort has been wasted - I really appreciate it, as it got me started archiving the stuff I care the most about, and it's made in such a way that it's easy to include in a command-line one-liner, which, honestly, is the biggest part.

Absolutely no problem at all - I want people to use the best tool they have available, whether or not it's My Tool. :)

That said, my tool (gonna link it again) now collects the CSS and scripting in the header, converts page links to be relative (so you can click 'em), and optionally downloads images with the --images flag. :toot:

SneezeOfTheDecade has a new favorite as of 01:00 on Jun 27, 2020

SneezeOfTheDecade
Feb 6, 2011

gettin' covid all
over your posts

Trabant posted:

Sneeze, I had luck with your previous tool (thanks!) and wanted to try this version but I get the following:

code:
Traceback (most recent call last):
  File "C:\Users\myname\AppData\Local\Programs\Python\Python38-32\main.py", line 3, in <module>
    from PIL import Image
ModuleNotFoundError: No module named 'PIL'
Any ideas what could be happening?

Iiii forgot to tell you that you have to install PIL and BeautifulSoup >_<

code:
pip3 install requests pillow bs4
Also, if you've previously pulled a thread, you can just go into config.ini and delete the "lastpage{threadid} = {page}" line; it'll re-download from the start without you having to do anything else.
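If you'd rather not edit the file by hand, the same thing can be sketched with `configparser`. This is only an illustration: the section name is an assumption you have to supply (check your own config.ini), and `forget_last_page` is a made-up helper, not part of the scraper.

```python
import configparser


def forget_last_page(config_path, thread_id, section):
    """Remove the 'lastpage{threadid}' entry so the thread re-downloads from page 1.

    Returns True if the entry existed. `section` is whatever section your
    config.ini keeps that key in (an assumption here, not confirmed).
    Raises configparser.NoSectionError if the section does not exist.
    """
    # interpolation=None so values containing '%' are read and written as-is.
    config = configparser.ConfigParser(interpolation=None)
    config.read(config_path, encoding="utf-8")
    removed = config.remove_option(section, f"lastpage{thread_id}")
    if removed:
        with open(config_path, "w", encoding="utf-8") as file:
            config.write(file)
    return removed
```

Note that `config.write()` rewrites the whole file and drops any comments in it, so deleting the line by hand is still the least surprising route.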

SneezeOfTheDecade has a new favorite as of 14:54 on Jun 27, 2020

SneezeOfTheDecade
Feb 6, 2011

gettin' covid all
over your posts

Crankit posted:

The python thing is pretty good, but it seems to choke on images that are attached to the forum.

Hm, it shouldn't. I'll poke at it and see what's going on.
