Fintilgin
Sep 29, 2004

Fintilgin sweeps!
Can someone save the Roman/Ancient history thread? https://forums.somethingawful.com/showthread.php?threadid=3486446 Tried to do it with the extension in the first post but it's 900 pages and crashed.


SubNat
Nov 27, 2008

Fintilgin posted:

Can someone save the Roman/Ancient history thread?

Done, ( https://www.dropbox.com/sh/mreh54f2hxg2jtd/AABSvuE3LX0UvaIwhVJ07ahCa?dl=0 )

Ynglaur posted:

Could we just set up for loops and have different starts and ends?

Possibly with one of these python scripts mentioned in the thread?

D. Ebdrup posted:

fletcher, who lives up to his status as poster in the digital packrats thread, has a script that can be used for archiving.

How does this stack up vs SneezeOfTheDecade's script? If one of them does a good job + preserves pages then it might be better for me to just use that, if I can wrangle up Python for it.
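
For the for-loop idea quoted above, a minimal Python sketch of splitting a long thread into page ranges could look like this. The `page_ranges` helper and the 250-page chunk size are made up for illustration, and neither script mentioned in the thread is confirmed to accept start and end pages.

```python
def page_ranges(last_page, chunk_size):
    """Split pages 1..last_page into inclusive (start, end) ranges."""
    return [
        (start, min(start + chunk_size - 1, last_page))
        for start in range(1, last_page + 1, chunk_size)
    ]


if __name__ == "__main__":
    # 900 pages is the thread size mentioned earlier; 250 is an arbitrary chunk.
    for start, end in page_ranges(900, 250):
        print(f"pages {start}-{end}")
```

Each range could then be handed to a separate run, so a crash only costs one chunk instead of the whole thread.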

SubNat has a new favorite as of 23:24 on Jun 25, 2020

Hirayuki
Mar 28, 2010


I don't really know what I'm doing, but I archived the Blue Story saga.

https://www.dropbox.com/s/azvfsq9xqhpb61h/The%20Blue%20Story%20Saga.7z?dl=0

I wish I had this scraper before! I remember thinking it'd be nice to download entire threads I'd bookmarked to read later and finally get through them on long plane rides. I even posted about it in QCS. Now the thought of long plane rides is absurd and I'm grabbing this stuff to read and reminisce when I'm old(er), I guess. And share with my fellow goons.

Fintilgin
Sep 29, 2004

Fintilgin sweeps!

Awesome thanks.

hazardousmouse
Dec 17, 2010
Archive of the Cosmic Horror and weird tales book recommendation thread:
https://www.dropbox.com/s/qppwgnyrqwaxh11/Cosmic%20horror%20recc%20archive.html?dl=0
this thread
https://forums.somethingawful.com/showthread.php?threadid=3461819

Solitair
Feb 18, 2014

TODAY'S GONNA BE A GOOD MOTHERFUCKIN' DAY!!!
Tails Gets Trolled: https://www.dropbox.com/s/ptnvhiakyrl3k7l/Tails%20Gets%20Trolled.html?dl=0
Pokemon Uranium LP: https://www.dropbox.com/s/xy9zu1iq8c0b58c/Pokemon%20Uranium.html?dl=0
Pokemon Reborn LP: https://www.dropbox.com/s/f1xqjn8pwd5xiwo/Pokemon%20Reborn.html?dl=0

I'm mostly just gonna do a lot of LP threads for now.

Solitair has a new favorite as of 00:46 on Jun 26, 2020

SneezeOfTheDecade
Feb 6, 2011

gettin' covid all
over your posts
I should post the archives I've uploaded here, too, rather than just in their threads:

SAclopedia

Roman/Ancient History (just in case :))

Games - GM Advice Thread
Games - D&D 5e Megathread
Games - Making Games Megathread
Games - Video Game Hoaxes and Urban Legends
Games - Murphy's Rules

GWS - Help! I'm poor and want to make good food!

BSS - Kill Six Billion Demons

D&D/LF Political Cartoons threads: 2007-8, 2009, 2010, 2011, 2012 part 1, 2012 part 2, 2013 part 1, 2013 part 2, 2014 part 1, 2014 part 2, 2015, 2016, 2017, 2018, 2019, 2020
D&D Politoons Gaybies threads

BlankSystemDaemon
Mar 13, 2009



SubNat posted:

How does this stack up vs SneezeOfTheDecade's script? If one of them does a good job + preserves pages then it might be better for me to just use that, if I can wrangle up python for it.
If I understand Sneeze's code and execution right (it's what I started with), the script seems to just grab the HTML files directly without archiving the images, so you either have to hope that imgur, or whatever image hosts are used, stay live, or live with the fact that images aren't really reliable.
fletcher's script is made to grab every single image it can (the ones that don't 404, time out, or otherwise fail, at any rate) and include them in the archive.

Another difference is that with fletcher's script you just use the dropdown menu or page links to navigate, whereas with the other one you have to open each page in succession. I realize this is a minor issue, but it's still there.

Also, fletcher's script preserves the look of the forums by including stylesheets.

fletcher's script does take up a lot of disk space; a ~450 page thread took almost 900MB, and I can't imagine how much the cat pictures thread would take up, at over 6000 pages.

Please understand, I don't want SneezeOfTheDecade to feel like their effort has been wasted. I really appreciate it, as it got me started archiving the stuff I care the most about, and it's made in such a way that it's easy to include in a command-line one-liner, which, honestly, is the biggest part.

BlankSystemDaemon has a new favorite as of 06:37 on Jun 26, 2020

Piss Meridian
Mar 25, 2020

by Pragmatica
Does anyone remember the CYOA with the guy who fell down a hole? Called "I might need some help" or something? The graphics were all modified photos and I think there was a labyrinth + Minotaur?

BlankSystemDaemon
Mar 13, 2009



fletcher fixed the error handling so that images that fail to download no longer prevent the page from being downloaded, so if you've installed it you should do pip install --upgrade <url> to grab the latest update.

SneezeOfTheDecade
Feb 6, 2011

gettin' covid all
over your posts

D. Ebdrup posted:

Please understand, I don't want SneezeOfTheDecade to feel like their effort has been wasted. I really appreciate it, as it got me started archiving the stuff I care the most about, and it's made in such a way that it's easy to include in a command-line one-liner, which, honestly, is the biggest part.

Absolutely no problem at all - I want people to use the best tool they have available, whether or not it's My Tool. :)

That said, my tool (gonna link it again) now collects the CSS and scripting in the header, converts page links to be relative (so you can click 'em), and optionally downloads images with the --images flag. :toot:

SneezeOfTheDecade has a new favorite as of 01:00 on Jun 27, 2020

Trabant
Nov 26, 2011

All systems nominal.
Sneeze, I had luck with your previous tool (thanks!) and wanted to try this version but I get the following:

code:
Traceback (most recent call last):
  File "C:\Users\myname\AppData\Local\Programs\Python\Python38-32\main.py", line 3, in <module>
    from PIL import Image
ModuleNotFoundError: No module named 'PIL'
Any ideas what could be happening?

SneezeOfTheDecade
Feb 6, 2011

gettin' covid all
over your posts

Trabant posted:

Sneeze, I had luck with your previous tool (thanks!) and wanted to try this version but I get the following:

code:
Traceback (most recent call last):
  File "C:\Users\myname\AppData\Local\Programs\Python\Python38-32\main.py", line 3, in <module>
    from PIL import Image
ModuleNotFoundError: No module named 'PIL'
Any ideas what could be happening?

Iiii forgot to tell you that you have to install PIL and BeautifulSoup >_<

code:
pip3 install requests pillow bs4
Also, if you've previously pulled a thread, you can just go into config.ini and delete the "lastpage{threadid} = {page}" line; it'll re-download from the start without you having to do anything else.
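
If you'd rather not edit the file by hand, here's a rough sketch of the same deletion using Python's standard configparser module. It assumes config.ini has normal [section] headers and that the key is literally lastpage followed by the thread ID; `forget_last_page` is a hypothetical helper and not part of the tool.

```python
import configparser


def forget_last_page(config_path, thread_id):
    """Remove the lastpage<thread_id> entry from every section.

    Returns how many entries were removed. Assumes a standard INI layout
    with section headers, which is a guess about the tool's config.ini.
    """
    parser = configparser.ConfigParser()
    parser.read(config_path)
    key = f"lastpage{thread_id}"
    removed = 0
    for section in parser.sections():
        # remove_option returns True only if the key existed in that section.
        if parser.remove_option(section, key):
            removed += 1
    if removed:
        # Rewriting the file drops any comments it contained.
        with open(config_path, "w") as handle:
            parser.write(handle)
    return removed
```

Calling `forget_last_page("config.ini", 3486446)` would clear the saved position for that thread, so the next run starts from page 1.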

SneezeOfTheDecade has a new favorite as of 14:54 on Jun 27, 2020

BlankSystemDaemon
Mar 13, 2009



SneezeOfTheDecade posted:

Absolutely no problem at all - I want people to use the best tool they have available, whether or not it's My Tool. :)

That said, my tool (gonna link it again) now collects the CSS and scripting in the header, converts page links to be relative (so you can click 'em), and optionally downloads images with the --images flag. :toot:
More archiving for the archiving gods! :black101:

Memento
Aug 25, 2009


Bleak Gremlin
Hey Platy, even though it looks like this has turned out to not be necessary, you're a Good Goon for doing it. Cheers!

Beachcomber
May 21, 2007

Another day in paradise.


Slippery Tilde

Memento posted:

Hey Platy, even though it looks like this has turned out to not be necessary, you're a Good Goon for doing it. Cheers!

This is awesome and also useful if you're going on a plane or something.

BlankSystemDaemon
Mar 13, 2009



As a card-carrying member of the digital packrats thread, I can confirm that these archiving tools ABSOLUTELY have a use.

Crankit
Feb 7, 2011

HE WATCHES
The python thing is pretty good, but it seems to choke on images that are attached to the forum.

SneezeOfTheDecade
Feb 6, 2011

gettin' covid all
over your posts

Crankit posted:

The python thing is pretty good, but it seems to choke on images that are attached to the forum.

Hm, it shouldn't. I'll poke at it and see what's going on.


Crankit
Feb 7, 2011

HE WATCHES

SneezeOfTheDecade posted:

Hm, it shouldn't. I'll poke at it and see what's going on.

if it helps this is the error i get:
code:
Traceback (most recent call last):
  File "C:\Users\Dan\Desktop\SAScraper\main.py", line 146, in <module>
    main(args)
  File "C:\Users\Dan\Desktop\SAScraper\main.py", line 121, in main
    img = s.get(src, stream=True)
  File "C:\Users\Dan\AppData\Local\Programs\Python\Python38\lib\site-packages\requests\sessions.py", line 543, in get
    return self.request('GET', url, **kwargs)
  File "C:\Users\Dan\AppData\Local\Programs\Python\Python38\lib\site-packages\requests\sessions.py", line 516, in request
    prep = self.prepare_request(req)
  File "C:\Users\Dan\AppData\Local\Programs\Python\Python38\lib\site-packages\requests\sessions.py", line 449, in prepare_request
    p.prepare(
  File "C:\Users\Dan\AppData\Local\Programs\Python\Python38\lib\site-packages\requests\models.py", line 314, in prepare
    self.prepare_url(url, params)
  File "C:\Users\Dan\AppData\Local\Programs\Python\Python38\lib\site-packages\requests\models.py", line 391, in prepare_url
    raise InvalidURL("Invalid URL %r: No host supplied" % url)
requests.exceptions.InvalidURL: Invalid URL 'https:attachment.php?postid=395991827': No host supplied
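
The URL in that last line has a scheme but no host, which is what a forum-relative attachment path would look like if it was never resolved against the forum's address. A minimal sketch of resolving image sources with the standard library's urljoin, assuming attachments live under https://forums.somethingawful.com/ (`FORUM_BASE` and `absolute_image_url` are illustrative names, not taken from the script):

```python
from urllib.parse import urljoin

# Assumed base address; thread links in this thread use the same host.
FORUM_BASE = "https://forums.somethingawful.com/"


def absolute_image_url(src, base=FORUM_BASE):
    """Resolve an <img> src against the forum base.

    Relative paths gain the forum host, protocol-relative URLs gain the
    base scheme, and URLs that are already absolute pass through unchanged.
    """
    return urljoin(base, src)


if __name__ == "__main__":
    print(absolute_image_url("attachment.php?postid=395991827"))
```

The resolved URL could then be passed to the session's get call in place of the raw src. Attachments may also need a logged-in session to download, which this sketch doesn't cover.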
